The Land of Poets Challenge#
For a digital humanities project you need to display Italian poets by filtering a CSV table according to various criteria. This challenge is only about querying with pandas, which you might find convenient during exams for quickly understanding the content of a dataset (using pandas will always be optional; you will never be asked to perform complex modifications with it).
You are given a dataset taken from Wikidata, a project by the Wikimedia Foundation which aims to store only machine-readable data, such as numbers and strings, interlinked with many references. Each entity in Wikidata has an identifier: for example, Dante Alighieri is the entity Q1067 and Florence is Q2044.
Wikidata can be queried using the SPARQL language: the data was obtained with this query and downloaded in CSV format (one of several formats which can be chosen). Even if it is not necessary for the purposes of the exercise, you are invited to play a bit with the interface, for example by trying different visualizations (e.g. try selecting Map in the middle-left corner), or to look at other examples.
Load the dataset#
First load the dataset italian-poets.csv into a pandas dataframe df
USE UTF-8 as encoding
# write here
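As a hedged illustration of passing an explicit encoding to `pd.read_csv`, here is a minimal sketch on a toy in-memory CSV. The column names `name` and `city` are made up for the example and are not taken from italian-poets.csv; with a real file you would pass its path instead of the buffer.

```python
import io
import pandas as pd

# Toy stand-in for a CSV file on disk (hypothetical columns, not the real dataset).
raw = "name,city\nNiccolò Tommaseo,Šibenik\nDante Alighieri,Florence\n".encode("utf-8")

# encoding tells pandas how to decode the bytes; accented letters survive intact.
toy = pd.read_csv(io.BytesIO(raw), encoding="UTF-8")
print(toy)
```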
Tell me more#
Show some info about the dataset
# write here
Getting in shape#
Show the row and column counts:
# write here
10 rows#
Display the first 10 rows
# write here
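The three inspection steps above (general info, shape, first rows) can be sketched on a toy dataframe as follows. The columns `name` and `year` are hypothetical placeholders, so adapt them to whatever the real dataset contains.

```python
import pandas as pd

# Toy dataframe with hypothetical columns; None becomes NaN in the float column.
toy = pd.DataFrame({
    "name": ["Anna", "Bruno", "Carla"],
    "year": [1500.0, None, 1820.0],
})

toy.info()                    # prints column names, non-null counts and dtypes
n_rows, n_cols = toy.shape    # shape is an attribute (a tuple), not a method
print(n_rows, n_cols)
first_two = toy.head(2)       # head(n) returns the first n rows
print(first_two)
```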
Born in Verona#
Display all people born in Verona
# write here
How many people in Verona#
Display how many people were born in Verona
# write here
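Filtering by a value and counting the matches usually comes down to a boolean mask. The sketch below uses a toy dataframe whose `name` and `city` columns are assumptions made for illustration only.

```python
import pandas as pd

# Toy data (hypothetical columns); one row has no city at all.
toy = pd.DataFrame({
    "name": ["Anna", "Bruno", "Carla", "Dario"],
    "city": ["Verona", "Catania", "Verona", None],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows.
born_in_verona = toy[toy["city"] == "Verona"]
count = born_in_verona.shape[0]   # len(born_in_verona) would work as well
print(born_in_verona)
print(count)
```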
Python is everywhere#
Show poets born in Catania in the year -500
Mind the minus!
I swear we did not alter the dataset in any way :-)
# write here
Verona after 1500#
Display all people born in Verona after the year 1500
# write here
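When two criteria must hold together, as in the last two exercises, masks can be combined with `&`. This is a minimal sketch on toy data, assuming hypothetical `city` and `year` columns with years stored as numbers.

```python
import pandas as pd

# Toy data (hypothetical columns); the missing year becomes NaN.
toy = pd.DataFrame({
    "name": ["Anna", "Bruno", "Carla", "Dario"],
    "city": ["Verona", "Verona", "Catania", "Verona"],
    "year": [1480.0, 1620.0, -500.0, None],
})

# Each comparison needs its own parentheses because & binds tighter than == and >.
late = toy[(toy["city"] == "Verona") & (toy["year"] > 1500)]
ancient = toy[(toy["city"] == "Catania") & (toy["year"] == -500)]
print(late)
print(ancient)
```

Comparisons against NaN evaluate to False, so rows with a missing year drop out of both results.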
First Antonio#
Display all people with Antonio as first name
# write here
Some Antonio#
Display all people with Antonio as one of the names (so also include 'Paolo Antonio Rolli')
# write here
Cesares during 1800#
Display all people named Cesare who were born in the 1800s
# write here
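The name-based filters above can be approached with the `.str` accessor. The sketch below works on toy data with hypothetical `name` and `year` columns, and it assumes that "the 1800s" means years from 1800 to 1899.

```python
import pandas as pd

# Toy data (hypothetical columns); one name is missing on purpose.
toy = pd.DataFrame({
    "name": ["Antonio Rossi", "Paolo Antonio Rolli", "Cesare Antonioni", None],
    "year": [1700.0, 1687.0, 1850.0, 1810.0],
})
names = toy["name"]

# na=False treats missing names as non-matching instead of propagating NaN.
first_antonio = toy[names.str.startswith("Antonio ", na=False)]

# \b marks word boundaries, so 'Antonioni' does not count as 'Antonio'.
some_antonio = toy[names.str.contains(r"\bAntonio\b", na=False)]

# Assumption: the 1800s span 1800..1899 inclusive.
in_1800s = (toy["year"] >= 1800) & (toy["year"] < 1900)
cesares = toy[names.str.contains(r"\bCesare\b", na=False) & in_1800s]
print(first_antonio, some_antonio, cesares, sep="\n\n")
```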
The old ones#
Show poets sorted by year of birth
DO NOT include NaN values in the result
HINT: see pd.notnull
# write here
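To see how `pd.notnull` and sorting fit together, here is a small sketch on toy data. The `year` column name is an assumption made for the example.

```python
import pandas as pd

# Toy data (hypothetical columns) with one unknown year.
toy = pd.DataFrame({
    "name": ["Anna", "Bruno", "Carla"],
    "year": [1820.0, None, -500.0],
})

# pd.notnull returns a boolean Series that is False where the value is NaN.
known = toy[pd.notnull(toy["year"])]
by_year = known.sort_values("year")   # ascending by default: oldest first
print(by_year)
```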
Cities of poets#
Find the 5 cities with the most poets, sorted from most to least.
Use the groupby and sort_values methods
# write here
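One possible way to combine `groupby` and `sort_values` is sketched below on toy data. It takes the top 2 instead of the top 5 to keep the example small, and the column names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical columns): 3 poets from Verona, 2 from Rome, 1 from Naples.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "city": ["Verona", "Rome", "Verona", "Rome", "Verona", "Naples"],
})

# Grouping by city and counting names gives a Series indexed by city.
counts = toy.groupby("city")["name"].count()
top = counts.sort_values(ascending=False).head(2)
print(top)
```

Note that `count()` ignores missing names within each group, while `size()` would count every row; which one fits better depends on the data.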
Most duplicated poets#
Find the first 8 duplicated poets
# write here
All duplicated poets#
Print the number of all duplicated poets
NOTE: a Series object has only one column, even if it looks like it has two (the apparent other one is the index) - so if you have a Series object you don't need to specify a column
# write here
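The note about Series having a single column can be seen in the sketch below, which counts repeated names in a toy dataframe. Using `value_counts` is just one option among several, and the `name` column is a made-up placeholder.

```python
import pandas as pd

# Toy data (hypothetical column): Anna appears 3 times, Bruno twice, Carla once.
toy = pd.DataFrame({"name": ["Anna", "Bruno", "Anna", "Carla", "Bruno", "Anna"]})

# value_counts returns a Series: the names are the index, the counts are the values.
occurrences = toy["name"].value_counts()

# Filtering a Series needs no column name, since there is only one column of values.
duplicated = occurrences[occurrences > 1]
print(duplicated.head(2))   # the most repeated names come first
print(len(duplicated))      # how many distinct names are repeated
```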
Northern poets#
Find all the poets born north of a given town. In other words, look for the town latitude (the second coordinate in coords), print it, and then filter the table.
DO NOT put constants like 46.5 in your code!
DO NOT add new columns for longitude and latitude
NOTE: coord column holds just simple strings!
HINT: to get an element at a given numerical index i of a filtered Series (regardless of the original dataframe row index), you need to use the .iloc[i] property - note the square brackets!
town = 'Bolzano'
#town = 'Trento'
# write here
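The sketch below shows one way to read a latitude out of a coordinate string and use it as a threshold without hard-coding it. It assumes the strings look like `Point(longitude latitude)` and that no value is missing; the column names, the helper `latitude` and the rough coordinate values are all illustrative and may differ from the real dataset.

```python
import pandas as pd

# Toy data: column names and the 'Point(lon lat)' format are assumptions.
toy = pd.DataFrame({
    "name": ["Anna", "Bruno", "Carla"],
    "city": ["Bolzano", "Verona", "Innsbruck"],
    "coord": ["Point(11.35 46.5)", "Point(10.99 45.44)", "Point(11.39 47.27)"],
})
town = "Bolzano"

def latitude(text):
    """Hypothetical helper: 'Point(lon lat)' -> lat as a float."""
    inside = text[text.index("(") + 1 : text.index(")")]
    return float(inside.split()[1])

# .iloc[0] picks the first matching row whatever its original index label is.
town_lat = latitude(toy[toy["city"] == town]["coord"].iloc[0])
print(town_lat)

# A temporary Series of latitudes: nothing is added to the dataframe.
lats = toy["coord"].apply(latitude)
north = toy[lats > town_lat]
print(north)
```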
Papers please#
Extract subject id (e.g. Q8797) and place id (e.g. Q2028) and MODIFY df by putting them into two new columns subj_id and place_id
# write here
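If the identifiers are stored as the last segment of a Wikidata entity URL, they can be pulled out with string splitting. The sketch below assumes exactly that, with hypothetical source columns `subj` and `place`; check how the real columns are named and formatted before reusing it.

```python
import pandas as pd

# Toy data: source column names and URL-shaped values are assumptions.
toy = pd.DataFrame({
    "subj": ["http://www.wikidata.org/entity/Q1067"],
    "place": ["http://www.wikidata.org/entity/Q2044"],
})

# Split on '/' and keep the last piece; .str[-1] indexes into each resulting list.
toy["subj_id"] = toy["subj"].str.split("/").str[-1]
toy["place_id"] = toy["place"].str.split("/").str[-1]
print(toy[["subj_id", "place_id"]])
```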
Unknown poets#
Find all the ids of nameless poets and put them in a python list.
DO NOT use loops
NOTE: from Python's point of view, a Series object is just a sequence
# write here
Better unknown poets#
Find all the ids, the birthplace and birthdate of nameless poets born after year 0, and put them in a python list of tuples.
birthplaces must be integers - if not specified, put -1
also print how many results were found
DO NOT use loops nor list comprehensions
# write here
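Turning columns into plain Python lists and tuples without loops can be done with `tolist` and `zip`. The sketch below only illustrates the mechanics on toy data: the column names are hypothetical, and it does not decide what counts as a nameless poet in the real dataset.

```python
import pandas as pd

# Toy data (hypothetical columns) with one missing numeric value.
toy = pd.DataFrame({
    "subj_id": ["Q1", "Q2", "Q3"],
    "year": [1500.0, None, 1820.0],
})

# A Series converts directly to a list, no loop needed.
ids = toy["subj_id"].tolist()

# Replace missing values with -1 first, then the column can become integer.
years = toy["year"].fillna(-1).astype(int).tolist()

# zip pairs the elements position by position; list() materializes the tuples.
pairs = list(zip(ids, years))
print(pairs)
print(len(pairs))
```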