Lecture 2

2024-10-25 13:28:49 +02:00
parent 9ea256c27e
commit 71b9d91eeb
168 changed files with 1172650 additions and 33 deletions

@@ -0,0 +1,850 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c850ea25-9bde-4feb-a1d0-056c5870d59e",
"metadata": {},
"source": [
"# Regular Expressions (Regex)\n",
"\n",
"Wir schreiben das Jahr 1950 der Mathematiker __Stephen Cole Kleene__ erfand das Konzept der _Regulären Sprache_. Ein Konzept der theoretischen Informatik zum Beschreiben von syntaktischen Ausdrücken. Damit einhergehend lassen sich durch spezifische ausdrücke, den _Regular Expressions_, verschiedene Formen des _pattern matching_ durchführen. Eine der mit abstand wichtigensten Anwendungsfälle für _regual expressions_ ist das Kompilieren von Quellcode in Maschinensprache. Dabei werden ausdrücke wie _while_, _for_, _if_ etc. formalisiert und können einfacher in Übersetzt (Kompiliert) werden. \n",
"\n",
"Ein weiterer Nutzen von _regual expressions_ ist das _just-in-time compiling_ von dem auch Python als interpretierte Sprache gebrauch macht. Dabei wird der Quellcode zur Laufzeit für die Maschine übersetzt (meist nicht direkt der Quellcode, sondern eine zwischenstufe die als _Bytecode_ bezeichnet wird). Es wäre sonst nicht möglich so einfach Jupyter Notebooks zu verwenden.\n",
"\n",
"\n",
"Ein paar Fakten zu _regular expressions_:\n",
"\n",
"- _Regex_ findet sich in vielen Dialekten wieder. (vgl. [Regular Expression Engine Comparison](https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816))\n",
"- Die Programmiersprache _Perl_ entstand aus einer Bibliothek von Henry Spencer zum nutzen von _Regex_ \n",
"- Eine frei Nutzbare Seite (Achtung mit Werbung) zum testen und prüfen von Regulären Ausdrücken in verschiedenen Dialekten ist [Regex101](https://regex101.com/)\n",
"- Jedes Unix(-ähnliche) System (Linux, MacOS, BSD, etc.) hat das Programm _grep (**G**lobal/**R**egular **E**xpression/**P**rint)_ zum analysieren von Datenströmen/Textdateien vorinstalliert.\n",
"\n",
"\n",
"<p><a href=\"https://commons.wikimedia.org/wiki/File:Kleene.jpg#/media/File:Kleene.jpg\"><img src=\"https://upload.wikimedia.org/wikipedia/commons/1/1c/Kleene.jpg\" alt=\"Kleene.jpg\" width=\"10%\"></a><br>By Konrad Jacobs, Erlangen, Copyright is MFO - Mathematisches Forschungsinstitut Oberwolfach,&lt;a rel=\"nofollow\" class=\"external free\" href=\"https://opc.mfo.de/detail?photo_id=2122\"&gt;https://opc.mfo.de/detail?photo_id=2122&lt;/a&gt;, <a href=\"https://creativecommons.org/licenses/by-sa/2.0/de/deed.en\" title=\"Creative Commons Attribution-Share Alike 2.0 de\">CC BY-SA 2.0 de</a>, <a href=\"https://commons.wikimedia.org/w/index.php?curid=12342617\">Link</a></p>"
]
},
{
"cell_type": "markdown",
"id": "b689ee80",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-27269d9f8e03f3e9",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"source": [
"## Introduction\n",
"\n",
"You can find _a lot_ of material on regular expressions (regex) online.\n",
"Therefore, we will not repeat the background but focus on some practical exercises in this notebook. Some tutorials/useful links can be found below.\n",
"\n",
"The way that we need and use regular expressions is to describe patterns of characters to match in a given string.\n",
"\n",
"You can think of them as a string of characters, which describe a certain pattern, e.g., \"four numbers followed by a word of at least 5 characters\". \n",
"This can then be used to test given strings/texts and match the pattern specified in the regex.\n",
"This is done using the [Python Standard Library `re`](https://docs.python.org/3/library/re.html).\n",
"\n",
"\n",
"**Material on Regular Expressions:**\n",
"\n",
"- [RegEx Howto in Python](https://docs.python.org/3/howto/regex.html)\n",
"- [RegEx Tutorial](https://www.regular-expressions.info/tutorial.html)\n",
"- [Interactive RegEx Tutorial](https://regexone.com/)\n",
"- [WikiBook on RegEx](https://en.wikibooks.org/wiki/Regular_Expressions)\n",
"- [RegExr: Testing & Visualizing RegEx](https://regexr.com/)\n",
"- [Debuggex: Visualization of individual regex as finite state machine](https://www.debuggex.com/)\n",
"\n",
"**Testing with Regular Expressions:**\n",
"- [Regex101](https://regex101.com/)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8a5d3654",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-168430a9112ab605",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
},
"tags": []
},
"outputs": [],
"source": [
"import re"
]
},
{
"cell_type": "markdown",
"id": "b6ccac77",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-4c79f2d5a1e62a04",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"source": [
"## Example 1\n",
"The regular expression `Hello [A-Z][a-z]+` specifies a pattern that begins with the literal string `Hello ` and is followed by a capital letter (specified by `[A-Z]`) and at least one small letter. (`[a-z]` describes the lowercase letters and `+` specifies that there is at least one of them)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7e25056b",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-98f2d91954c191a3",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Testing the string: 'Hello World'\n",
"Found pattern at characters: 0 to 11\n",
"---------------------------------------------\n",
"Testing the string: 'Hello You!'\n",
"Found pattern at characters: 0 to 9\n",
"---------------------------------------------\n",
"Testing the string: 'This does not match the pattern...'\n",
"Pattern not found in string.\n",
"---------------------------------------------\n",
"Testing the string: 'We can also have the Hello World pattern somewhere within the string.'\n",
"Found pattern at characters: 21 to 32\n",
"---------------------------------------------\n",
"Testing the string: 'Hello world does not match'\n",
"Pattern not found in string.\n",
"---------------------------------------------\n",
"Testing the string: 'Hello W does not match either'\n",
"Pattern not found in string.\n",
"---------------------------------------------\n"
]
}
],
"source": [
"example_re = r'Hello [A-Z][a-z]+'\n",
"test_strings = ['Hello World',\n",
" 'Hello You!',\n",
" 'This does not match the pattern...',\n",
" 'We can also have the Hello World pattern somewhere within the string.',\n",
" 'Hello world does not match',\n",
" 'Hello W does not match either']\n",
"\n",
"\n",
"for test_word in test_strings:\n",
" print(f\"Testing the string: '{test_word}'\")\n",
" match_object = re.search(example_re, test_word)\n",
" if match_object:\n",
" print(f\"Found pattern at characters: {match_object.span()[0]:d} to {match_object.span()[1]:d}\")\n",
" else:\n",
" print(\"Pattern not found in string.\")\n",
" print(\"-\"*45)"
]
},
{
"cell_type": "markdown",
"id": "5ec979b2",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-aca8488169bc0df9",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"source": [
"_Note:_ Since regex often use special characters like backslash `\\`, it is helpful to define them in Python as raw strings, i.e., using a preceding `r` (see `example_re` above)."
]
},
{
"cell_type": "markdown",
"id": "820c31ae",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-4d3281e8922cd534",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"source": [
"## Task 1\n",
"\n",
"Write a regular expression `r1` which matches the following words:\n",
"- hello\n",
"- yellow\n",
"- jello"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e7e426b0",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-c48986402655ab08",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
},
"tags": []
},
"outputs": [],
"source": [
"### BEGIN SOLUTION ###\n",
"r1 = r'.*ello.*'\n",
"### END SOLUTION"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "223fa54c",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "cell-0a761cfdabd44f1b",
"locked": true,
"points": 1,
"schema_version": 3,
"solution": false,
"task": false
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<re.Match object; span=(0, 5), match='hello'>\n",
"<re.Match object; span=(0, 6), match='yellow'>\n",
"<re.Match object; span=(0, 5), match='jello'>\n"
]
}
],
"source": [
"# Test Cell\n",
"\n",
"test_words = ['hello', 'yellow', 'jello']\n",
"for _word in test_words:\n",
" match = re.match(r1, _word)\n",
" print(match)\n",
" if match is None: assert False\n",
" assert match[0] == _word"
]
},
{
"cell_type": "markdown",
"id": "c3086449",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-bea454dd22c7499a",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"source": [
"## Example 2\n",
"\n",
"In the first example, we have use the `[A-Z]` and `[a-z]` patterns to specify capital and lowercase letters, respectively.\n",
"There are a lot more of such predefined patterns, e.g., `[0-9]` or `\\d` for matching a (single-digit) number.\n",
"\n",
"A list of these special characters can be found in the [`re` documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax).\n",
"\n",
"\n",
"The following regex can be used to match a word with at least 3 letters (both capital and lowercase are accepted), followed by a two-digit number, a comma, and a four-digit number where the first number is either a one or a two."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5a02b00a",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-1a01734fc48cc488",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Testing the string: 'November 21, 2022'\n",
"Found pattern at characters: 0 to 17\n",
"---------------------------------------------\n",
"Testing the string: 'Jan 01, 1970'\n",
"Found pattern at characters: 0 to 12\n",
"---------------------------------------------\n",
"Testing the string: 'JuNE 45, 4521'\n",
"Pattern not found in string.\n",
"---------------------------------------------\n",
"Testing the string: 'Abc 1, 2020'\n",
"Pattern not found in string.\n",
"---------------------------------------------\n",
"Testing the string: 'July 02, 90'\n",
"Pattern not found in string.\n",
"---------------------------------------------\n"
]
}
],
"source": [
"example_re2 = r'[A-Za-z]{3,} \\d{2}, [12]\\d{3}'\n",
"\n",
"test_strings = ['November 21, 2022',\n",
" 'Jan 01, 1970',\n",
" 'JuNE 45, 4521',\n",
" 'Abc 1, 2020',\n",
" 'July 02, 90']\n",
"\n",
"\n",
"for test_word in test_strings:\n",
" print(f\"Testing the string: '{test_word}'\")\n",
" match_object = re.search(example_re2, test_word)\n",
" if match_object:\n",
" print(f\"Found pattern at characters: {match_object.span()[0]:d} to {match_object.span()[1]:d}\")\n",
" else:\n",
" print(\"Pattern not found in string.\")\n",
" print(\"-\"*45)"
]
},
{
"cell_type": "markdown",
"id": "b565244d",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-0abe35e63e18f0d9",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"source": [
"## Task 2\n",
"\n",
"Write a regular expression `r2` that only matches dates in the ISO format `YYYY-MM-DD`.\n",
"It should _only_ match a string, if the whole string is a date. If the date is only part of the string, it should *not* match it.\n",
"\n",
"_Hint:_ You can use `(a[0-9]|b[01])` to specify the pattern that matches either an `a` followed by a single digit **or** a `b` followed by either `0` or `1`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "1e2bb2bd",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-c264d2e9cac73db0",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
},
"tags": []
},
"outputs": [],
"source": [
"### BEGIN SOLUTION\n",
"r2 = r'^(\\d{4})-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])$'\n",
"### END SOLUTION"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5bbd62f5",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "cell-c80282e7adcccb6a",
"locked": true,
"points": 1,
"schema_version": 3,
"solution": false,
"task": false
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<re.Match object; span=(0, 10), match='1970-01-01'>\n",
"<re.Match object; span=(0, 10), match='1999-12-31'>\n",
"<re.Match object; span=(0, 10), match='2000-02-28'>\n",
"<re.Match object; span=(0, 10), match='2022-12-09'>\n",
"<re.Match object; span=(0, 10), match='4250-09-10'>\n"
]
}
],
"source": [
"# Test Cell\n",
"\n",
"# The following strings should be matched\n",
"dates = [\"1970-01-01\", \"1999-12-31\", \"2000-02-28\", \"2022-12-09\", \"4250-09-10\"]\n",
"for _date in dates:\n",
" match = re.match(r2, _date)\n",
" print(match)\n",
" if match is None: assert False\n",
" assert match[0] == _date"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "0d8e4b98",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "cell-e46e8f78178eb2b7",
"locked": true,
"points": 1,
"schema_version": 3,
"solution": false,
"task": false
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"None\n",
"None\n",
"None\n",
"None\n",
"None\n",
"None\n",
"None\n"
]
}
],
"source": [
"# Test Cell\n",
"\n",
"# The following strings should not be matched\n",
"no_dates = [\"1970-01-32\", \"abcd-12-31\", \"2000/02/28\", \"2022-14-20\", \"2002.12.02\", \"1234-2-1\", \"77-09-02\"]\n",
"for _date in no_dates:\n",
" match = re.match(r2, _date)\n",
" print(match)\n",
" if match is not None: assert False"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "b72e49ac",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "cell-48f63facb72e517a",
"locked": true,
"points": 1,
"schema_version": 3,
"solution": false,
"task": false
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"None\n",
"None\n"
]
}
],
"source": [
"# Test Cell\n",
"\n",
"# The following strings should not be matched\n",
"no_match = [\"This text contains the date 1999-12-31 but it should not be matched.\",\n",
" \"2020-02-20 is a date in the beginning of the string\"]\n",
"for _text in no_match:\n",
" match = re.match(r2, _text)\n",
" print(match)\n",
" if match is not None: assert False"
]
},
{
"cell_type": "markdown",
"id": "ce239065",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-31d99fd79761847d",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"source": [
"## Example 3\n",
"\n",
"You can save parts of the found pattern in a group to have access to it later.\n",
"\n",
"In the following example, we modify the regex from [Example 2](#Example-2) to capture the individual parts into groups."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "89ba4f51",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-7d320972e47ae922",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('November', '21', '2022')\n"
]
}
],
"source": [
"example_re3 = r'([A-Za-z]{3,}) (\\d{2}), ([12]\\d{3})'\n",
"\n",
"test_string = 'November 21, 2022'\n",
"match = re.search(example_re3, test_string)\n",
"print(match.groups())"
]
},
{
"cell_type": "markdown",
"id": "393ff9c6",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-68cbff25c972809f",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"source": [
"## Task 3\n",
"\n",
"Write a regular expression `r3` which matches text between `<li>...</li>` tags and adds the found text to a group. This should be the only capturing group!\n",
"\n",
"_Hint:_ You might want to check how to define non-capturing groups and non-greedy matching."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "c93ee04d",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-420f01248c7eddeb",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
},
"tags": []
},
"outputs": [],
"source": [
"### BEGIN SOLUTION\n",
"r3 = r'<li>((?:.|\\n)*?)</li>'\n",
"### END SOLUTION"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "37681e3d",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "cell-488cd60d5bed2019",
"locked": true,
"points": 2,
"schema_version": 3,
"solution": false,
"task": false
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Item 1', '\\nItem 2', '\\n Item 3\\n ']\n"
]
}
],
"source": [
"# Test Cell\n",
"\n",
"test_html = \"\"\"\n",
"<html>\n",
" <head>\n",
" <title>Test HTML</title>\n",
" </head>\n",
" <body>\n",
" <h1>Heading 1</h1>\n",
" <ol>\n",
" <li>Item 1</li>\n",
" <li>\n",
"Item 2</li>\n",
" <li>\n",
" Item 3\n",
" </li>\n",
" </ol>\n",
" </body>\n",
"</html>\n",
"\"\"\"\n",
"\n",
"matches = re.findall(r3, test_html)\n",
"print(matches)\n",
"assert len(matches) == 3\n",
"assert matches == ['Item 1', '\\nItem 2', '\\n Item 3\\n ']"
]
},
{
"cell_type": "markdown",
"id": "4370f245",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-53152b78922af0b1",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"source": [
"## Task 4\n",
"\n",
"Write a regular expression `r4` to find all words in a string that are acronmyms, i.e., written in all capital letters, and all words that have a capital letter in them which is not at the first position.\n",
"\n",
"Next, write a function `shield_acronyms` that uses this regular expression and adds curly brackets `{...}` around the found words and returns a new string.\n",
"\n",
"_Hint:_ You can use the [`re.sub` function](https://docs.python.org/3/library/re.html#re.sub) for this task."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "ed6b99f1",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-545bc5786ee8e947",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
},
"tags": []
},
"outputs": [],
"source": [
"# Define r4 here\n",
"### BEGIN SOLUTION\n",
"r4 = r'([0-9A-Z]+\\b|[a-zA-Z]+[A-Z0-9]+[a-zA-Z\\b]*)'\n",
"### END SOLUTION"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "504cd6d3",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "cell-900922b2243d5a55",
"locked": true,
"points": 1,
"schema_version": 3,
"solution": false,
"task": false
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['MIMO']\n",
"['M2M']\n",
"['IN', 'mmWave']\n",
"['5G', 'SHIELded']\n",
"[]\n"
]
}
],
"source": [
"# Test Cell\n",
"\n",
"test_words = [(\"MIMO\", [\"MIMO\"]),\n",
" (\"M2M\", [\"M2M\"]),\n",
" (r\"Acro IN mmWave Title\", [\"IN\", \"mmWave\"]),\n",
" (r\"5G should be SHIELded\", [\"5G\", \"SHIELded\"]),\n",
" (r\"Regular title with Names\", []),\n",
" ]\n",
"for text, matches in test_words:\n",
" result = re.findall(r4, text)\n",
" print(result)\n",
" assert result == matches"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "f955d228",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-2c36d0ef19bac550",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
},
"tags": []
},
"outputs": [],
"source": [
"def shield_acronyms(text: str) -> str:\n",
" ### BEGIN SOLUTION\n",
" r4 = r4 = r'([0-9A-Z]+\\b|[a-zA-Z]+[A-Z0-9]+[a-zA-Z\\b]*)'\n",
" new_text = re.sub(r4, r'{\\g<0>}', text)\n",
" return new_text\n",
" ### END SOLUTION"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "3b71b683",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "cell-550110e95fccc717",
"locked": true,
"points": 2,
"schema_version": 3,
"solution": false,
"task": false
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{MIMO}\n",
"{M2M}\n",
"Acro {IN} {mmWave} Title\n",
"{5G} should be {SHIELded}\n",
"Regular title with Names\n"
]
}
],
"source": [
"# Test Cell\n",
"\n",
"test_words = [(\"MIMO\", r\"{MIMO}\"),\n",
" (\"M2M\", r\"{M2M}\"),\n",
" (r\"Acro IN mmWave Title\", r\"Acro {IN} {mmWave} Title\"),\n",
" (r\"5G should be SHIELded\", r\"{5G} should be {SHIELded}\"),\n",
" (r\"Regular title with Names\", r'Regular title with Names'),\n",
" ]\n",
"for text, expected in test_words:\n",
" result = shield_acronyms(text)\n",
" print(result)\n",
" assert result == expected"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41222440-923d-44a4-8dc7-d7a6309d4e0a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Create Assignment",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
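
Task 2 in the notebook above requires that a pattern match only when the whole string is a date. As a minimal sketch of how the matching functions of Python's `re` module differ in this respect, the following compares `re.search`, `re.match`, and `re.fullmatch`. The pattern `date_re` and the test strings are simplified assumptions for illustration; they are not the solution `r2` from the notebook.

```python
import re

# Simplified date-like pattern without anchors (illustrative only, it does
# not validate month or day ranges).
date_re = r'\d{4}-\d{2}-\d{2}'

text_with_date = "Released on 1999-12-31, see notes"
exact_date = "1999-12-31"

# re.search scans the whole string for the first occurrence of the pattern.
found_anywhere = re.search(date_re, text_with_date)

# re.match only tries to match at the beginning of the string.
found_at_start = re.match(date_re, text_with_date)

# re.fullmatch requires the pattern to cover the entire string.
covers_partial = re.fullmatch(date_re, exact_date + " is a date")
covers_exact = re.fullmatch(date_re, exact_date)

print(found_anywhere.group(0))
print(found_at_start)
print(covers_partial)
print(covers_exact.group(0))
```

With `re.fullmatch`, explicit `^` and `$` anchors are not needed. The notebook's solution instead combines `re.match` with anchors in the pattern, which has a similar effect.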

@@ -0,0 +1,171 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "8f7ee9ed",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-fd19a00f47ad1a34",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"source": [
"- [Beautiful Soup Documentation](https://beautiful-soup-4.readthedocs.io/en/latest/)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ebaad76f",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-9138585fc343d8a7",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup"
]
},
{
"cell_type": "markdown",
"id": "1336423a",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-235041934d89cb33",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"source": [
"## Example of Parsing a Website"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8bf54e3b",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-c6761d82e17018f0",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"with open(\"example.html\") as html_file:\n",
" soup = BeautifulSoup(html_file)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "14566e25",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-93b2d5726c5469a8",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<title>Test HTML</title>\n",
"Test HTML\n",
"------------------------------\n",
"Print all list elements on the website:\n",
"Item 1\n",
"\n",
"Item 2\n",
"\n",
" Item 3\n",
" \n"
]
}
],
"source": [
"print(soup.title)\n",
"print(soup.title.get_text())\n",
"\n",
"\n",
"print(\"-\"*30)\n",
"print(\"Print all list elements on the website:\")\n",
"\n",
"li = soup.find_all(\"li\")\n",
"for element in li:\n",
" print(element.get_text()) # you can use .strip() to get rid of trailing whitespace"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "d64b13b5",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "cell-3a99db5db1577717",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"import requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4bdf24a4",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Create Assignment",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -0,0 +1,16 @@
<html>
<head>
<title>Test HTML</title>
</head>
<body>
<h1>Heading 1</h1>
<ol class="mylist">
<li>Item 1</li>
<li>
Item 2</li>
<li>
Item 3
</li>
</ol>
</body>
</html>
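
The list items in the HTML file above can also be extracted with a regular expression, as Task 3 in the regex notebook asks. The following sketch contrasts greedy and non-greedy matching. The shortened `html` string and the variable names are assumptions made for illustration, and `re.DOTALL` is used here as an alternative to a `(?:.|\n)` group so that `.` also matches newlines.

```python
import re

# Shortened stand-in for the list in example.html (assumption for illustration).
html = "<ol><li>Item 1</li><li>\nItem 2</li></ol>"

# Greedy: `.*` consumes as much as possible, so the single match runs from
# the first <li> to the last </li>.
greedy = re.findall(r'<li>(.*)</li>', html, flags=re.DOTALL)

# Non-greedy: `.*?` stops at the first </li> that allows a match, which
# yields one match per list item.
lazy = re.findall(r'<li>(.*?)</li>', html, flags=re.DOTALL)

# str.strip() removes leading and trailing whitespace, including newlines.
items = [item.strip() for item in lazy]

print(greedy)
print(lazy)
print(items)
```

For real-world HTML, a parser such as Beautiful Soup is usually the more robust choice; the regex version is mainly useful for understanding greedy versus non-greedy quantifiers.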