Challenges#
import csv
import json
from pprint import pprint
Parsing challenge - Spam killer#
Roughly half of all emails sent in the world are spam.
Enraged by the number of pointless messages arriving each day, you decide to develop the definitive spam filter.
Spam killer 1. mail reader#
A mail is a text file formatted as specified by the RFC 822 standard (you don’t need to read the spec, but keep in mind RFCs are typically specs!)
A mail contains a certain number of fields, an empty line, and then the mail body:
Received: from forwarder@mailforeverybody.net
Message-Id: <v121c0404ad6a23934739@>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thursday, 4 Jun 2020 09:43:14 -0800
To: noreply@softpython.org
From: Harvey The Salesman <harvey@thegreatvacuum.com>
Subject: DISCOUNTED Vacuum Cleaners
Precedence: bulk
Hi!
Find the best offers on our website: thegreatvacuum.com !!!
Cheers,
Harvey
Each field name is separated from the value by a colon :
For example, in:
From: Harvey The Salesman <harvey@thegreatvacuum.com>
From
is the field name, and Harvey The Salesman <harvey@thegreatvacuum.com>
is the field value.
Implement a function read_mail(filename) which parses a mailn.txt file and RETURNs a dictionary holding all the fields.

- the body has no field name in the file: in the dictionary you can use Body as field name
- REMEMBER to remove newlines from field values
- DO NOT remove newlines from the body
- HINT: getting the body text right might be tricky, so first try just parsing the fields
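Following the hint, the field-parsing step alone might be sketched like this, on a couple of made-up header lines (the real function would read them from the file):

```python
# Sketch of the field-parsing step only, on two invented header lines;
# the actual function would get these lines from the mail file.
lines = ['From: Harvey The Salesman <harvey@thegreatvacuum.com>\n',
         'Subject: DISCOUNTED Vacuum Cleaners\n']

fields = {}
for line in lines:
    # split only on the FIRST colon: field values may contain more colons
    name, value = line.split(':', 1)
    fields[name] = value.strip()   # strip() removes spaces and the newline

print(fields)
```

Splitting with `split(':', 1)` rather than a plain `split(':')` matters: a value such as `Content-Type: text/plain; charset="us-ascii"` would otherwise be cut in the wrong place if it contained further colons.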
Example:
>>> pprint(read_mail('mail1.txt'))
{'Received': 'from forwarder@mailforeverybody.net',
'Message-Id': '<v121c0404ad6a23934739@>',
'Mime-Version': '1.0',
'Content-Type': 'text/plain; charset="us-ascii"',
'Date': 'Thursday, 4 Jun 2020 09:43:14 -0800',
'To': 'noreply@softpython.org',
'From': 'Harvey The Salesman <harvey@thegreatvacuum.com>',
'Subject': 'DISCOUNTED Vacuum Cleaners',
'Precedence': 'bulk',
'Body': 'Hi!\nFind the best offers on our website: thegreatvacuum.com !!!\nCheers, \nHarvey'}
def read_mail(filename):
"""RETURN a NEW dictionary"""
raise Exception("TODO IMPLEMENT ME !")
pprint(read_mail("mail1.txt"))
assert read_mail("mail1.txt") == {
"Body": "Hi!\n"
"Find the best offers on our website: thegreatvacuum.com !!!\n"
"Cheers, \n"
"Harvey",
"Content-Type": 'text/plain; charset="us-ascii"',
"Date": "Thursday, 4 Jun 2020 09:43:14 -0800",
"From": "Harvey The Salesman <harvey@thegreatvacuum.com>",
"Message-Id": "<v121c0404ad6a23934739@>",
"Mime-Version": "1.0",
"Precedence": "bulk",
"Received": "from forwarder@mailforeverybody.net",
"Subject": "DISCOUNTED Vacuum Cleaners",
"To": "noreply@softpython.org",
}
assert read_mail("mail2.txt") == {
"Received": "from mailman@networked-solutions.net",
"Message-Id": "<v47gc04e7ad6a249f4539@>",
"Mime-Version": "1.0",
"Content-Type": 'text/plain; charset="us-ascii"',
"Date": "Tuesday, 7 Jul 2020 16:25:14 -0800",
"To": "info@softpython.org",
"From": "Mr Boss <head@overpaid-data-scientists.com>",
"Subject": "20K/month Job offer",
"Precedence": "bulk",
"Body": "Congratulations! You've been crunching so many matrices \nduring the job interview you deserve 20.000€ salary/month + benefits.\nWe will install in your office three pinball machines \nand a dispenser of M&Ms - which colors do you prefer?\nBest,\nYour Next Boss\n",
}
Spam killer 2. running filters#
You defined various filters you want to run on the mails. Each filter is defined as a tuple containing a field name and a string to search for. If the field value contains the string, the mail is marked as spam.
Write a function run_filters which takes the filters as a list of tuples and a list of mail files, and RETURNs a report as a list of lists. It must have:

- a header
- one row per mail
- columns Subject and From
- a column SPAM? holding a boolean: True if any of the filters detected the mail as spam, False otherwise
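For a single mail already parsed into a dictionary (a made-up one here), the spam check itself can be sketched with any:

```python
# Sketch of the spam check for ONE mail, assumed already parsed into a
# dictionary like those produced by read_mail (sample invented here).
mail = {'From': 'That lady <lady@secret-encounters-at-night.com>',
        'Body': 'Hi there ...'}
filters = [('From', 'secret-encounters-at-night.com'), ('Body', 'offer')]

# a mail is spam if ANY filter string occurs in the corresponding field
is_spam = any(text in mail[field] for field, text in filters)
print(is_spam)
```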
Example:
>>> report = run_filters([('From', 'secret-encounters-at-night.com'),
('Body','offer') ],
['mail1.txt', 'mail2.txt', 'mail3.txt', 'mail4.txt'])
>>> pprint(report, width=90)
[['Subject', 'From', 'SPAM?'],
['DISCOUNTED Vacuum Cleaners', 'Harvey The Salesman <harvey@thegreatvacuum.com>', True],
['20K/month Job offer', 'Mr Boss <head@overpaid-data-scientists.com>', False],
['I noticed you ...', 'That lady <lady@secret-encounters-at-night.com>', True],
['Some help with your thesis', 'John <john@yourfriends.net>', False]]
def run_filters(filters, filenames):
"""RETURN a NEW list of lists"""
raise Exception("TODO IMPLEMENT ME !")
report1 = run_filters(
[("From", "secret-encounters-at-night.com"), ("Body", "offer")],
["mail1.txt", "mail2.txt", "mail3.txt", "mail4.txt"],
)
assert report1 == [
["Subject", "From", "SPAM?"],
[
"DISCOUNTED Vacuum Cleaners",
"Harvey The Salesman <harvey@thegreatvacuum.com>",
True,
],
["20K/month Job offer", "Mr Boss <head@overpaid-data-scientists.com>", False],
["I noticed you ...", "That lady <lady@secret-encounters-at-night.com>", True],
["Some help with your thesis", "John <john@yourfriends.net>", False],
]
report2 = run_filters(
[("From", "vacuum"), ("From", "Guru")], ["mail4.txt", "mail1.txt", "mail5.txt"]
)
assert report2 == [
["Subject", "From", "SPAM?"],
["Some help with your thesis", "John <john@yourfriends.net>", False],
[
"DISCOUNTED Vacuum Cleaners",
"Harvey The Salesman <harvey@thegreatvacuum.com>",
True,
],
[
"Is somebody stealing your domain?",
"Internet Guru <service@cndomaintrouble.org>",
True,
],
]
Parsing challenge - Markdown#
Markdown is a language for writing documents, which allows writing plain text with additional syntax to express the way it should be formatted. Many editors support Markdown (Jupyter and Github included). For example, some Markdown text like this:
# My Heading
some paragraph, so much interesting
another paragraph, with a some bla bla
# Another big heading
There is **something notable** and then regular words.
would be displayed in Jupyter with the headings, paragraphs and bold text rendered accordingly.
Try writing some Python code which reads a text file with a subset of markdown syntax and translates it into suitable Python data structures. See Markdown basic syntax
DO NOT use special purpose libraries!
IMPORTANT: markdown supports arbitrary depth of subparagraphs: to keep things simple start supporting one level, then two. Doing more would require some kind of level tracking, which could be cumbersome to implement.
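As a starting point for one level of nesting, classifying lines into headers and paragraphs might be sketched like this (the dictionary shapes are only indicative):

```python
# Sketch of one-level markdown parsing: headers start with '# ', every
# other non-blank line is treated as a paragraph of the latest header.
text = """# My Heading
some paragraph, so much interesting
"""

elements = []
for line in text.splitlines():
    if line.startswith('# '):
        elements.append({'type': 'header',
                         'level': 1,
                         'text': line[2:],
                         'subelements': []})
    elif line.strip() and elements:
        # attach the paragraph to the most recent header
        elements[-1]['subelements'].append({'type': 'paragraph',
                                            'level': 2,
                                            'text': line})
print(elements)
```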
Example - a possible model for the above text could be this one:
parsed = [
{
"type": "header",
"level": 1,
"text": "My Heading",
"subelements": [
{
"type": "paragraph",
"level": 2,
"text": "some paragraph, so much interesting",
},
{
"type": "paragraph",
"level": 2,
"rich_text": [("normal", "another paragraph, with a some bla bla")],
},
],
},
{
"type": "header",
"level": 1,
"text": "Another big heading",
"subelements": [
{
"type": "paragraph",
"level": 2,
"rich_text": [
("normal", "There is"),
("bold", "something notable"),
("normal", "and then regular words."),
],
}
],
},
]
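The rich_text entries above could be produced by splitting a line on the ** markers, assuming they are always balanced: chunks at odd positions fell between a pair of markers, hence are bold.

```python
# Sketch of splitting a line into ('normal'/'bold', text) pairs,
# assuming the ** markers are always balanced.
line = "There is **something notable** and then regular words."

rich_text = []
for i, chunk in enumerate(line.split('**')):
    if chunk.strip():
        # chunks at odd positions were enclosed in **, hence bold
        style = 'bold' if i % 2 == 1 else 'normal'
        rich_text.append((style, chunk.strip()))
print(rich_text)
```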
Parsing challenge - Other languages#
Try developing simple parsers for other languages, like:
JSON: syntax (it’s very similar to Python)
HTML web pages: Basic syntax
YAML: Wikipedia
See other lightweight markup languages
DO NOT use special purpose libraries!
IMPORTANT: Many of these languages support arbitrary depth of subparagraphs: to keep things simple start supporting one level, then two. Doing more would require some kind of level tracking, which could be cumbersome to implement.
CSV Challenge - Over the top#
With your friends, you’re opening a start-up for tourists who like mountain hiking.
You decide to focus on the north-east region of Italy and develop an app: one of the first tasks is to collect in a table all the mountain peaks, with their Italian and German names, latitude, longitude and elevation.
You take some data from OpenStreetMap ( openstreetmap.org ), the free world map made by volunteers (OSM for short). As data format, you choose a CSV export generated by the SLIPO Project.
Over the top 1. reading OpenStreetMap data#
Have a look at the osm.csv file; try also to open it with LibreOffice or Microsoft Office.
Then implement a function read_osm which reads a given CSV file with a csv.DictReader and PRINTS ONLY the peaks (with pprint).
At this stage you can just PRINT the whole retrieved dictionary, we will extract stuff later.
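The filtering might be sketched like this, reading from an in-memory string with two invented rows and only a few columns instead of the real osm.csv:

```python
import csv
import io
from pprint import pprint

# Sketch of row filtering with csv.DictReader, on an in-memory sample
# (two invented rows, only a few of the real columns).
sample = """ID,NAME,CATEGORY,SUBCATEGORY
node/26862480,Alpe di Succiso,TOURISM,PEAK
node/123,Some Hotel,TOURISM,HOTEL
"""

peaks = []
for row in csv.DictReader(io.StringIO(sample)):
    if row['SUBCATEGORY'] == 'PEAK':   # keep only the peaks
        peaks.append(row)
        pprint(row)
```

The real function would open the given filename instead of the io.StringIO buffer.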
You should see something like this:
NOTE 1: here we show only some of the printed rows
NOTE 2: depending on your Python version, you might see regular dictionaries instead of OrderedDict
OrderedDict([('ID', 'node/26862480'),
('NAME', 'Alpe di Succiso'),
('CATEGORY', 'TOURISM'),
('SUBCATEGORY', 'PEAK'),
('LON', '10.1955113'),
('LAT', '44.3327854'),
('SRID', '4326'),
('WKT', 'POINT (10.195511300000001 44.332785400000006)'),
('INTERNATIONAL_NAME', ''),
('STREET', ''),
('WIKIPEDIA', 'it:Alpe di Succiso'),
('PHONE', ''),
('CITY', ''),
('EMAIL', ''),
('ALTERNATIVE_NAME', ''),
('OPENING_HOURS', ''),
('DESCRIPTION', ''),
('WEBSITE', ''),
('LAST_UPDATE', ''),
('OPERATOR', ''),
('POSTCODE', ''),
('COUNTRY', ''),
('FAX', ''),
('IMAGE', ''),
('HOUSENUMBER', ''),
('OTHER_TAGS',
'{"PDOP":"1.87","natural":"peak","importance":"regional","name":"Alpe '
'di '
'Succiso","source":"survey","wikidata":"Q1810954","ele":"2016"}')])
OrderedDict([('ID', 'node/26862538'),
('NAME', 'Becco di Filadonna'),
('CATEGORY', 'TOURISM'),
('SUBCATEGORY', 'PEAK'),
('LON', '11.1934654'),
('LAT', '45.9636324'),
('SRID', '4326'),
.
.
.
def read_osm(in_filename):
raise Exception('TODO IMPLEMENT ME !')
read_osm('data/osm.csv')
Over the top 2. extract peak#
Implement a function extract_peak which, given a peak as a raw dictionary, RETURNs the list of relevant values in this order: Italian name, German name, latitude, longitude, elevation.

Note the elevation and the Italian and German names are inside the field other_tags as name:it, name:de, ele

- WARNING 1: name:it is not always present! In such cases use the NAME field from the main dictionary
- WARNING 2: name:de is not always present! In such cases put an empty string
- HINT: the field other_tags looks very much like embedded JSON. To parse it quickly, use the function json.loads, which takes a string as input and outputs a Python object (in this case you will obtain a dictionary). NOTE THE s at the end of json.loads!
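Decoding the embedded JSON and handling a missing name:de might look like this (the other_tags string below is a shortened, made-up sample):

```python
import json

# Sketch of decoding the embedded JSON in other_tags
# (shortened, invented sample string)
other_tags = '{"name:de":"Elferkofel","name:it":"Cima Undici","ele":"3090"}'
tags = json.loads(other_tags)       # note the final s: json.loads !

name_de = tags.get('name:de', '')   # empty string when the key is missing
elevation = float(tags['ele'])      # numbers should be numbers, not strings
print(name_de, elevation)
```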
Example - given:
d = OrderedDict([('ID', 'node/26862713'),
('NAME', 'Cima Undici'),
('CATEGORY', 'TOURISM'),
('SUBCATEGORY', 'PEAK'),
('LON', '12.3783333'),
('LAT', '46.6363889'),
('SRID', '4326'),
('WKT', 'POINT (12.378333300000001 46.6363889)'),
('INTERNATIONAL_NAME', ''),
('STREET', ''),
('WIKIPEDIA', 'it:Cima Undici'),
('PHONE', ''),
('CITY', ''),
('EMAIL', ''),
('ALTERNATIVE_NAME', ''),
('OPENING_HOURS', ''),
('DESCRIPTION', ''),
('WEBSITE', ''),
('LAST_UPDATE', ''),
('OPERATOR', ''),
('POSTCODE', ''),
('COUNTRY', ''),
('FAX', ''),
('IMAGE', ''),
('HOUSENUMBER', ''),
('OTHER_TAGS',
'{"name:de":"Elferkofel","natural":"peak","name":"Cima '
'Undici","name:it":"Cima '
'Undici","wikidata":"Q628931","ele":"3090"}')])
You should obtain:
>>> extract_peak(d)
['Cima Undici', 'Elferkofel', 46.6363889, 12.3783333, 3090.0]
NOTE: numbers should be numbers, not strings!
def extract_peak(rawd):
"""Takes a dictionary and RETURN a list
"""
raise Exception('TODO IMPLEMENT ME !')
from collections import OrderedDict
d = OrderedDict([('ID', 'node/26862713'),
('NAME', 'Cima Undici'),
('CATEGORY', 'TOURISM'),
('SUBCATEGORY', 'PEAK'),
('LON', '12.3783333'),
('LAT', '46.6363889'),
('SRID', '4326'),
('WKT', 'POINT (12.378333300000001 46.6363889)'),
('INTERNATIONAL_NAME', ''),
('STREET', ''),
('WIKIPEDIA', 'it:Cima Undici'),
('PHONE', ''),
('CITY', ''),
('EMAIL', ''),
('ALTERNATIVE_NAME', ''),
('OPENING_HOURS', ''),
('DESCRIPTION', ''),
('WEBSITE', ''),
('LAST_UPDATE', ''),
('OPERATOR', ''),
('POSTCODE', ''),
('COUNTRY', ''),
('FAX', ''),
('IMAGE', ''),
('HOUSENUMBER', ''),
('OTHER_TAGS',
'{"name:de":"Elferkofel","natural":"peak","name":"Cima '
'Undici","name:it":"Cima '
'Undici","wikidata":"Q628931","ele":"3090"}')])
extract_peak(d)
Over the top 3. write file#
Implement a function write_peaks so it calls extract_peak and writes the obtained lists into peaks.csv with a csv.writer (so this time we write lists, not dictionaries!)

- REMEMBER to also write the header
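The writing step might be sketched like this, targeting an in-memory buffer instead of peaks.csv (the sample row is taken from the expected output below):

```python
import csv
import io

# Sketch of writing header + rows with csv.writer, into an in-memory
# buffer instead of peaks.csv (one sample row only).
header = ['name_it', 'name_de', 'latitude', 'longitude', 'elevation']
rows = [['Cima Undici', 'Elferkofel', 46.6363889, 12.3783333, 3090.0]]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)    # REMEMBER: the header goes first
writer.writerows(rows)
print(buf.getvalue())
```

The real function would pass an open file object to csv.writer instead of the buffer.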
The first lines should look like this (for the complete expected file see expected-peaks.csv):
name_it,name_de,latitude,longitude,elevation
Alpe di Succiso,,44.3327854,10.1955113,2016.0
Becco di Filadonna,,45.9636324,11.1934654,2150.0
Bechei di Sopra,,46.6077439,12.0444775,2794.0
Catinaccio d'Antermoia,Kesselkogel,46.4740893,11.6438283,3004.0
Cima Ambrizzola,,46.4791667,12.0980556,2715.0
Cima Bastioni,,46.4851159,12.2678531,2926.0
Cima Brenta,,46.1797021,10.900036,3151.0
Cima Cadin di San Lucano,,46.5776149,12.2882724,2839.0
Cima d'Asta,,46.1766183,11.6052937,2847.0
Cima dei Preti,,46.3423245,12.4210592,2707.0
Cima della Vezzana,,46.2899137,11.8297409,3192.0
Cima Dodici,,45.9976856,11.4680336,2337.0
Cima Mora,,46.240557,12.3431523,1940.0
Cima Palon,,45.7922301,11.1765372,2232.0
Cima Pape,,46.3343734,11.9283766,2503.0
Punta di mezzodì,,45.731185,11.1380772,1858.0
Cima Presanella,Cima Presanella,46.2199321,10.6641189,3556.0
Cima Rolle,Rollspitze,46.9463889,11.5077778,2800.0
Cima Tosa,,46.1565222,10.8711276,3136.0
Cima Undici,Elferkofel,46.6363889,12.3783333,3090.0
.
.
def write_peaks(in_filename):
raise Exception('TODO IMPLEMENT ME !')
write_peaks('data/osm.csv')