Lecture 2
850
Material/wise_24_25/lernmaterial/regex/Regular Expressions.ipynb
Normal file
@@ -0,0 +1,850 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c850ea25-9bde-4feb-a1d0-056c5870d59e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Regular Expressions (Regex)\n",
|
||||
"\n",
|
"In the early 1950s, the mathematician __Stephen Cole Kleene__ introduced the concept of _regular languages_, a concept from theoretical computer science for describing syntactic expressions. Building on it, specific expressions, the _regular expressions_, can be used to perform various forms of _pattern matching_. One of the most important applications of _regular expressions_ is compiling source code into machine code: keywords such as _while_, _for_, _if_, etc. are described formally, which makes them easier to translate (compile). \n",
|
||||
"\n",
|
"Another use of _regular expressions_ is translating source code at run time, which Python, as an interpreted language, also relies on. The source code is usually not translated directly for the machine; it is first converted into an intermediate form called _bytecode_, which the interpreter then executes. Without this, Jupyter notebooks would not be as easy to use.\n",
|
||||
"\n",
|
||||
"\n",
|
"A few facts about _regular expressions_:\n",
"\n",
"- _Regex_ exists in many dialects (cf. [Regular Expression Engine Comparison](https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816)).\n",
|
"- The regex engine of the programming language _Perl_ was derived from a _regex_ library written by Henry Spencer.\n",
|
"- A free website (note: it shows ads) for testing and checking regular expressions in various dialects is [Regex101](https://regex101.com/).\n",
"- Virtually every Unix(-like) system (Linux, macOS, BSD, etc.) ships with the program _grep (**G**lobal/**R**egular **E**xpression/**P**rint)_ for analyzing data streams and text files.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<p><a href=\"https://commons.wikimedia.org/wiki/File:Kleene.jpg#/media/File:Kleene.jpg\"><img src=\"https://upload.wikimedia.org/wikipedia/commons/1/1c/Kleene.jpg\" alt=\"Kleene.jpg\" width=\"10%\"></a><br>By Konrad Jacobs, Erlangen, Copyright is MFO - Mathematisches Forschungsinstitut Oberwolfach,<a rel=\"nofollow\" class=\"external free\" href=\"https://opc.mfo.de/detail?photo_id=2122\">https://opc.mfo.de/detail?photo_id=2122</a>, <a href=\"https://creativecommons.org/licenses/by-sa/2.0/de/deed.en\" title=\"Creative Commons Attribution-Share Alike 2.0 de\">CC BY-SA 2.0 de</a>, <a href=\"https://commons.wikimedia.org/w/index.php?curid=12342617\">Link</a></p>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b689ee80",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-27269d9f8e03f3e9",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Introduction\n",
|
||||
"\n",
|
||||
"You can find _a lot_ of material on regular expressions (regex) online.\n",
|
||||
"Therefore, we will not repeat the background but focus on some practical exercises in this notebook. Some tutorials/useful links can be found below.\n",
|
||||
"\n",
|
||||
"The way that we need and use regular expressions is to describe patterns of characters to match in a given string.\n",
|
||||
"\n",
|
||||
"You can think of them as a string of characters, which describe a certain pattern, e.g., \"four numbers followed by a word of at least 5 characters\". \n",
|
||||
"This can then be used to test given strings/texts and match the pattern specified in the regex.\n",
|
||||
"This is done using the [Python Standard Library `re`](https://docs.python.org/3/library/re.html).\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**Material on Regular Expressions:**\n",
|
||||
"\n",
|
||||
"- [RegEx Howto in Python](https://docs.python.org/3/howto/regex.html)\n",
|
||||
"- [RegEx Tutorial](https://www.regular-expressions.info/tutorial.html)\n",
|
||||
"- [Interactive RegEx Tutorial](https://regexone.com/)\n",
|
||||
"- [WikiBook on RegEx](https://en.wikibooks.org/wiki/Regular_Expressions)\n",
|
||||
"- [RegExr: Testing & Visualizing RegEx](https://regexr.com/)\n",
|
||||
"- [Debuggex: Visualization of individual regex as finite state machine](https://www.debuggex.com/)\n",
|
||||
"\n",
|
||||
"**Testing with Regular Expressions:**\n",
|
||||
"- [Regex101](https://regex101.com/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "8a5d3654",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-168430a9112ab605",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import re"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b6ccac77",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-4c79f2d5a1e62a04",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Example 1\n",
|
"The regular expression `Hello [A-Z][a-z]+` specifies a pattern that begins with the literal string `Hello ` and is followed by a capital letter (specified by `[A-Z]`) and at least one lowercase letter (`[a-z]` describes the lowercase letters, and `+` requires at least one of them)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "7e25056b",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-98f2d91954c191a3",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Testing the string: 'Hello World'\n",
|
||||
"Found pattern at characters: 0 to 11\n",
|
||||
"---------------------------------------------\n",
|
||||
"Testing the string: 'Hello You!'\n",
|
||||
"Found pattern at characters: 0 to 9\n",
|
||||
"---------------------------------------------\n",
|
||||
"Testing the string: 'This does not match the pattern...'\n",
|
||||
"Pattern not found in string.\n",
|
||||
"---------------------------------------------\n",
|
||||
"Testing the string: 'We can also have the Hello World pattern somewhere within the string.'\n",
|
||||
"Found pattern at characters: 21 to 32\n",
|
||||
"---------------------------------------------\n",
|
||||
"Testing the string: 'Hello world does not match'\n",
|
||||
"Pattern not found in string.\n",
|
||||
"---------------------------------------------\n",
|
||||
"Testing the string: 'Hello W does not match either'\n",
|
||||
"Pattern not found in string.\n",
|
||||
"---------------------------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"example_re = r'Hello [A-Z][a-z]+'\n",
|
||||
"test_strings = ['Hello World',\n",
|
||||
" 'Hello You!',\n",
|
||||
" 'This does not match the pattern...',\n",
|
||||
" 'We can also have the Hello World pattern somewhere within the string.',\n",
|
||||
" 'Hello world does not match',\n",
|
||||
" 'Hello W does not match either']\n",
|
||||
"\n",
|
||||
"\n",
|
"for test_word in test_strings:\n",
"    print(f\"Testing the string: '{test_word}'\")\n",
"    match_object = re.search(example_re, test_word)\n",
"    if match_object:\n",
"        print(f\"Found pattern at characters: {match_object.span()[0]:d} to {match_object.span()[1]:d}\")\n",
"    else:\n",
"        print(\"Pattern not found in string.\")\n",
"    print(\"-\"*45)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5ec979b2",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-aca8488169bc0df9",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"_Note:_ Since regex often use special characters like backslash `\\`, it is helpful to define them in Python as raw strings, i.e., using a preceding `r` (see `example_re` above)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "820c31ae",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-4d3281e8922cd534",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Task 1\n",
|
||||
"\n",
|
||||
"Write a regular expression `r1` which matches the following words:\n",
|
||||
"- hello\n",
|
||||
"- yellow\n",
|
||||
"- jello"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "e7e426b0",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-c48986402655ab08",
|
||||
"locked": false,
|
||||
"schema_version": 3,
|
||||
"solution": true,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"### BEGIN SOLUTION ###\n",
|
||||
"r1 = r'.*ello.*'\n",
|
||||
"### END SOLUTION"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "223fa54c",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": true,
|
||||
"grade_id": "cell-0a761cfdabd44f1b",
|
||||
"locked": true,
|
||||
"points": 1,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"<re.Match object; span=(0, 5), match='hello'>\n",
|
||||
"<re.Match object; span=(0, 6), match='yellow'>\n",
|
||||
"<re.Match object; span=(0, 5), match='jello'>\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Test Cell\n",
|
||||
"\n",
|
||||
"test_words = ['hello', 'yellow', 'jello']\n",
|
||||
"for _word in test_words:\n",
|
||||
" match = re.match(r1, _word)\n",
|
||||
" print(match)\n",
|
||||
" if match is None: assert False\n",
|
||||
" assert match[0] == _word"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c3086449",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-bea454dd22c7499a",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Example 2\n",
|
||||
"\n",
|
"In the first example, we used the `[A-Z]` and `[a-z]` patterns to specify capital and lowercase letters, respectively.\n",
"There are many more such predefined patterns, e.g., `[0-9]` or `\\d` for matching a single digit.\n",
|
||||
"\n",
|
||||
"A list of these special characters can be found in the [`re` documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax).\n",
|
||||
"\n",
|
||||
"\n",
|
"The following regex can be used to match a word with at least 3 letters (both capital and lowercase are accepted), followed by a two-digit number, a comma, and a four-digit number whose first digit is either a one or a two."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "5a02b00a",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-1a01734fc48cc488",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Testing the string: 'November 21, 2022'\n",
|
||||
"Found pattern at characters: 0 to 17\n",
|
||||
"---------------------------------------------\n",
|
||||
"Testing the string: 'Jan 01, 1970'\n",
|
||||
"Found pattern at characters: 0 to 12\n",
|
||||
"---------------------------------------------\n",
|
||||
"Testing the string: 'JuNE 45, 4521'\n",
|
||||
"Pattern not found in string.\n",
|
||||
"---------------------------------------------\n",
|
||||
"Testing the string: 'Abc 1, 2020'\n",
|
||||
"Pattern not found in string.\n",
|
||||
"---------------------------------------------\n",
|
||||
"Testing the string: 'July 02, 90'\n",
|
||||
"Pattern not found in string.\n",
|
||||
"---------------------------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"example_re2 = r'[A-Za-z]{3,} \\d{2}, [12]\\d{3}'\n",
|
||||
"\n",
|
||||
"test_strings = ['November 21, 2022',\n",
|
||||
" 'Jan 01, 1970',\n",
|
||||
" 'JuNE 45, 4521',\n",
|
||||
" 'Abc 1, 2020',\n",
|
||||
" 'July 02, 90']\n",
|
||||
"\n",
|
||||
"\n",
|
"for test_word in test_strings:\n",
"    print(f\"Testing the string: '{test_word}'\")\n",
"    match_object = re.search(example_re2, test_word)\n",
"    if match_object:\n",
"        print(f\"Found pattern at characters: {match_object.span()[0]:d} to {match_object.span()[1]:d}\")\n",
"    else:\n",
"        print(\"Pattern not found in string.\")\n",
"    print(\"-\"*45)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b565244d",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-0abe35e63e18f0d9",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Task 2\n",
|
||||
"\n",
|
||||
"Write a regular expression `r2` that only matches dates in the ISO format `YYYY-MM-DD`.\n",
|
"It should match _only_ if the whole string is a date. If the date is only part of the string, it should *not* match.\n",
|
||||
"\n",
|
||||
"_Hint:_ You can use `(a[0-9]|b[01])` to specify the pattern that matches either an `a` followed by a single digit **or** a `b` followed by either `0` or `1`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "1e2bb2bd",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-c264d2e9cac73db0",
|
||||
"locked": false,
|
||||
"schema_version": 3,
|
||||
"solution": true,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"### BEGIN SOLUTION\n",
|
||||
"r2 = r'^(\\d{4})-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])$'\n",
|
||||
"### END SOLUTION"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "5bbd62f5",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": true,
|
||||
"grade_id": "cell-c80282e7adcccb6a",
|
||||
"locked": true,
|
||||
"points": 1,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"<re.Match object; span=(0, 10), match='1970-01-01'>\n",
|
||||
"<re.Match object; span=(0, 10), match='1999-12-31'>\n",
|
||||
"<re.Match object; span=(0, 10), match='2000-02-28'>\n",
|
||||
"<re.Match object; span=(0, 10), match='2022-12-09'>\n",
|
||||
"<re.Match object; span=(0, 10), match='4250-09-10'>\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Test Cell\n",
|
||||
"\n",
|
||||
"# The following strings should be matched\n",
|
||||
"dates = [\"1970-01-01\", \"1999-12-31\", \"2000-02-28\", \"2022-12-09\", \"4250-09-10\"]\n",
|
||||
"for _date in dates:\n",
|
||||
" match = re.match(r2, _date)\n",
|
||||
" print(match)\n",
|
||||
" if match is None: assert False\n",
|
||||
" assert match[0] == _date"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "0d8e4b98",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": true,
|
||||
"grade_id": "cell-e46e8f78178eb2b7",
|
||||
"locked": true,
|
||||
"points": 1,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"None\n",
|
||||
"None\n",
|
||||
"None\n",
|
||||
"None\n",
|
||||
"None\n",
|
||||
"None\n",
|
||||
"None\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Test Cell\n",
|
||||
"\n",
|
||||
"# The following strings should not be matched\n",
|
||||
"no_dates = [\"1970-01-32\", \"abcd-12-31\", \"2000/02/28\", \"2022-14-20\", \"2002.12.02\", \"1234-2-1\", \"77-09-02\"]\n",
|
||||
"for _date in no_dates:\n",
|
||||
" match = re.match(r2, _date)\n",
|
||||
" print(match)\n",
|
||||
" if match is not None: assert False"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "b72e49ac",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": true,
|
||||
"grade_id": "cell-48f63facb72e517a",
|
||||
"locked": true,
|
||||
"points": 1,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"None\n",
|
||||
"None\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Test Cell\n",
|
||||
"\n",
|
||||
"# The following strings should not be matched\n",
|
||||
"no_match = [\"This text contains the date 1999-12-31 but it should not be matched.\",\n",
|
||||
" \"2020-02-20 is a date in the beginning of the string\"]\n",
|
||||
"for _text in no_match:\n",
|
||||
" match = re.match(r2, _text)\n",
|
||||
" print(match)\n",
|
||||
" if match is not None: assert False"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ce239065",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-31d99fd79761847d",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Example 3\n",
|
||||
"\n",
|
||||
"You can save parts of the found pattern in a group to have access to it later.\n",
|
||||
"\n",
|
||||
"In the following example, we modify the regex from [Example 2](#Example-2) to capture the individual parts into groups."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "89ba4f51",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-7d320972e47ae922",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"('November', '21', '2022')\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"example_re3 = r'([A-Za-z]{3,}) (\\d{2}), ([12]\\d{3})'\n",
|
||||
"\n",
|
||||
"test_string = 'November 21, 2022'\n",
|
||||
"match = re.search(example_re3, test_string)\n",
|
||||
"print(match.groups())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "393ff9c6",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-68cbff25c972809f",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Task 3\n",
|
||||
"\n",
|
||||
"Write a regular expression `r3` which matches text between `<li>...</li>` tags and adds the found text to a group. This should be the only capturing group!\n",
|
||||
"\n",
|
||||
"_Hint:_ You might want to check how to define non-capturing groups and non-greedy matching."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "c93ee04d",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-420f01248c7eddeb",
|
||||
"locked": false,
|
||||
"schema_version": 3,
|
||||
"solution": true,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"### BEGIN SOLUTION\n",
|
||||
"r3 = r'<li>((?:.|\\n)*?)</li>'\n",
|
||||
"### END SOLUTION"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "37681e3d",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": true,
|
||||
"grade_id": "cell-488cd60d5bed2019",
|
||||
"locked": true,
|
||||
"points": 2,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['Item 1', '\\nItem 2', '\\n Item 3\\n ']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Test Cell\n",
|
||||
"\n",
|
||||
"test_html = \"\"\"\n",
|
||||
"<html>\n",
|
||||
" <head>\n",
|
||||
" <title>Test HTML</title>\n",
|
||||
" </head>\n",
|
||||
" <body>\n",
|
||||
" <h1>Heading 1</h1>\n",
|
||||
" <ol>\n",
|
||||
" <li>Item 1</li>\n",
|
||||
" <li>\n",
|
||||
"Item 2</li>\n",
|
||||
" <li>\n",
|
||||
" Item 3\n",
|
||||
" </li>\n",
|
||||
" </ol>\n",
|
||||
" </body>\n",
|
||||
"</html>\n",
|
||||
"\"\"\"\n",
|
||||
"\n",
|
||||
"matches = re.findall(r3, test_html)\n",
|
||||
"print(matches)\n",
|
||||
"assert len(matches) == 3\n",
|
||||
"assert matches == ['Item 1', '\\nItem 2', '\\n Item 3\\n ']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4370f245",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-53152b78922af0b1",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Task 4\n",
|
||||
"\n",
|
"Write a regular expression `r4` to find all words in a string that are acronyms, i.e., written in all capital letters, and all words that contain a capital letter that is not at the first position.\n",
|
||||
"\n",
|
||||
"Next, write a function `shield_acronyms` that uses this regular expression and adds curly brackets `{...}` around the found words and returns a new string.\n",
|
||||
"\n",
|
||||
"_Hint:_ You can use the [`re.sub` function](https://docs.python.org/3/library/re.html#re.sub) for this task."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"id": "ed6b99f1",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-545bc5786ee8e947",
|
||||
"locked": false,
|
||||
"schema_version": 3,
|
||||
"solution": true,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Define r4 here\n",
|
||||
"### BEGIN SOLUTION\n",
|
"r4 = r'([0-9A-Z]+\\b|[a-zA-Z]+[A-Z0-9]+[a-zA-Z]*)'\n",
|
||||
"### END SOLUTION"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"id": "504cd6d3",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": true,
|
||||
"grade_id": "cell-900922b2243d5a55",
|
||||
"locked": true,
|
||||
"points": 1,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['MIMO']\n",
|
||||
"['M2M']\n",
|
||||
"['IN', 'mmWave']\n",
|
||||
"['5G', 'SHIELded']\n",
|
||||
"[]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Test Cell\n",
|
||||
"\n",
|
||||
"test_words = [(\"MIMO\", [\"MIMO\"]),\n",
|
||||
" (\"M2M\", [\"M2M\"]),\n",
|
||||
" (r\"Acro IN mmWave Title\", [\"IN\", \"mmWave\"]),\n",
|
||||
" (r\"5G should be SHIELded\", [\"5G\", \"SHIELded\"]),\n",
|
||||
" (r\"Regular title with Names\", []),\n",
|
||||
" ]\n",
|
||||
"for text, matches in test_words:\n",
|
||||
" result = re.findall(r4, text)\n",
|
||||
" print(result)\n",
|
||||
" assert result == matches"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"id": "f955d228",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-2c36d0ef19bac550",
|
||||
"locked": false,
|
||||
"schema_version": 3,
|
||||
"solution": true,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
"def shield_acronyms(text: str) -> str:\n",
"    ### BEGIN SOLUTION\n",
"    r4 = r'([0-9A-Z]+\\b|[a-zA-Z]+[A-Z0-9]+[a-zA-Z]*)'\n",
"    new_text = re.sub(r4, r'{\\g<0>}', text)\n",
"    return new_text\n",
"    ### END SOLUTION"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"id": "3b71b683",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": true,
|
||||
"grade_id": "cell-550110e95fccc717",
|
||||
"locked": true,
|
||||
"points": 2,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
},
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"{MIMO}\n",
|
||||
"{M2M}\n",
|
||||
"Acro {IN} {mmWave} Title\n",
|
||||
"{5G} should be {SHIELded}\n",
|
||||
"Regular title with Names\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Test Cell\n",
|
||||
"\n",
|
||||
"test_words = [(\"MIMO\", r\"{MIMO}\"),\n",
|
||||
" (\"M2M\", r\"{M2M}\"),\n",
|
||||
" (r\"Acro IN mmWave Title\", r\"Acro {IN} {mmWave} Title\"),\n",
|
||||
" (r\"5G should be SHIELded\", r\"{5G} should be {SHIELded}\"),\n",
|
||||
" (r\"Regular title with Names\", r'Regular title with Names'),\n",
|
||||
" ]\n",
|
||||
"for text, expected in test_words:\n",
|
||||
" result = shield_acronyms(text)\n",
|
||||
" print(result)\n",
|
||||
" assert result == expected"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "41222440-923d-44a4-8dc7-d7a6309d4e0a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Create Assignment",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.16"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
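The test cells for Task 2 call `re.match`, which anchors only at the start of the string; this is why the solution needs `^...$` to reject text after the date. As a minimal sketch, the difference between `re.search`, `re.match`, and `re.fullmatch` can be shown as follows. The simplified pattern `date_re`, the helper `which_match`, and the sample strings are assumptions made for illustration and are not part of the notebook.

```python
import re

# Hypothetical, simplified date pattern without anchors (illustration only).
date_re = r'\d{4}-\d{2}-\d{2}'

text_inside = 'released on 1999-12-31 at night'
text_prefix = '2020-02-20 is at the beginning'
text_exact = '1970-01-01'


def which_match(pattern, text):
    """Report which of search/match/fullmatch succeed for the given text."""
    return {
        'search': re.search(pattern, text) is not None,        # anywhere in the string
        'match': re.match(pattern, text) is not None,          # only at the start
        'fullmatch': re.fullmatch(pattern, text) is not None,  # the whole string
    }


for sample in (text_inside, text_prefix, text_exact):
    print(repr(sample), which_match(date_re, sample))
```

With `re.fullmatch`, the explicit `^` and `$` anchors would not be needed; the notebook's tests use `re.match`, so the anchors stay part of the pattern there.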
|
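The hint in Task 3 mentions non-greedy matching. The following sketch compares a greedy and a non-greedy quantifier on a small made-up snippet and shows the effect of the `re.DOTALL` flag, which lets `.` match newlines as well. The string `snippet` and the variable names are assumptions for illustration; the notebook's own solution uses `(?:.|\n)*?` instead of the flag.

```python
import re

# Small made-up HTML snippet (assumption for illustration).
snippet = "<li>A</li><li>B\nC</li>"

# Greedy: `.*` grabs as much as possible, so it runs to the last </li>.
greedy = re.findall(r'<li>(.*)</li>', snippet, flags=re.DOTALL)

# Non-greedy: `.*?` stops at the first </li> that allows a match.
lazy = re.findall(r'<li>(.*?)</li>', snippet, flags=re.DOTALL)

# Without DOTALL, `.` does not match the newline inside the second item.
no_dotall = re.findall(r'<li>(.*?)</li>', snippet)

print(greedy)
print(lazy)
print(no_dotall)
```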
171
Material/wise_24_25/lernmaterial/regex/Web Parsing.ipynb
Normal file
@@ -0,0 +1,171 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8f7ee9ed",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-fd19a00f47ad1a34",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"- [Beautiful Soup Documentation](https://beautiful-soup-4.readthedocs.io/en/latest/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "ebaad76f",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-9138585fc343d8a7",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from bs4 import BeautifulSoup"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1336423a",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-235041934d89cb33",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Example of Parsing a Website"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "8bf54e3b",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-c6761d82e17018f0",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open(\"example.html\") as html_file:\n",
|
"    soup = BeautifulSoup(html_file, \"html.parser\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "14566e25",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-93b2d5726c5469a8",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"<title>Test HTML</title>\n",
|
||||
"Test HTML\n",
|
||||
"------------------------------\n",
|
||||
"Print all list elements on the website:\n",
|
||||
"Item 1\n",
|
||||
"\n",
|
||||
"Item 2\n",
|
||||
"\n",
|
||||
" Item 3\n",
|
||||
" \n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(soup.title)\n",
|
||||
"print(soup.title.get_text())\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"print(\"-\"*30)\n",
|
||||
"print(\"Print all list elements on the website:\")\n",
|
||||
"\n",
|
||||
"li = soup.find_all(\"li\")\n",
|
||||
"for element in li:\n",
|
"    print(element.get_text())  # you can use .strip() to get rid of leading and trailing whitespace"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "d64b13b5",
|
||||
"metadata": {
|
||||
"nbgrader": {
|
||||
"grade": false,
|
||||
"grade_id": "cell-3a99db5db1577717",
|
||||
"locked": true,
|
||||
"schema_version": 3,
|
||||
"solution": false,
|
||||
"task": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import requests"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "4bdf24a4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Create Assignment",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
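The notebook above collects the `<li>` elements with Beautiful Soup's `find_all("li")`. For comparison, here is a dependency-free sketch of the same idea using the standard library's `html.parser.HTMLParser`. This is a swapped-in technique, not what the notebook uses; the class `ListItemCollector` and the inline `html_text` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text content of every <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._inside_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._inside_li = True
            self.items.append("")  # start a new, empty item

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside_li = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current item.
        if self._inside_li:
            self.items[-1] += data


# Inline stand-in for example.html (assumption for illustration).
html_text = "<ol><li>Item 1</li><li>\nItem 2</li><li>\n  Item 3\n</li></ol>"

collector = ListItemCollector()
collector.feed(html_text)
collector.close()

# .strip() removes leading and trailing whitespace of each item.
stripped_items = [item.strip() for item in collector.items]
print(stripped_items)
```

Beautiful Soup remains the more convenient choice for real pages, since it tolerates malformed markup and offers navigation and search on the parsed tree.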
|
16
Material/wise_24_25/lernmaterial/regex/example.html
Normal file
@@ -0,0 +1,16 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>Test HTML</title>
|
||||
</head>
|
||||
<body>
|
||||
<h1>Heading 1</h1>
|
||||
<ol class="mylist">
|
||||
<li>Item 1</li>
|
||||
<li>
|
||||
Item 2</li>
|
||||
<li>
|
||||
Item 3
|
||||
</li>
|
||||
</ol>
|
||||
</body>
|
||||
</html>
|