The Land of Poets Challenge

For a digital humanities project you need to display Italian poets by filtering a CSV table according to various criteria. This challenge is only about querying with pandas, which you may find convenient during exams for quickly understanding the content of a dataset (using pandas will always be optional: you will never be asked to perform complex modifications with it).

You are given a dataset taken from Wikidata, a project by the Wikimedia Foundation which aims to store only machine-readable data, such as numbers and strings, interlinked with many references. Each entity in Wikidata has an identifier: for example, Dante Alighieri is the entity Q1067 and Florence is Q2044.

Wikidata can be queried using the SPARQL language: the data was obtained with this query and downloaded in CSV format (one of the many formats available). Although not necessary for the exercise, you are invited to play a bit with the interface, for example by trying different visualizations (e.g. select map in the middle-left corner) or by looking at other examples.

Load the dataset

First, load the dataset italian-poets.csv into a pandas DataFrame named df

  • USE UTF-8 as encoding
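As a rough sketch of the loading step, the snippet below writes a tiny invented CSV to a temporary file and reads it back with an explicit encoding. The file name toy-poets.csv, the column names and the rows are all hypothetical; the real italian-poets.csv may look different.

```python
import os
import tempfile

import pandas as pd

# Invented two-row table standing in for italian-poets.csv;
# the column names are hypothetical.
csv_text = "name,place,year\nNiccolò Esempio,Verona,1802\nAntonio Prova,Catania,-500\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-poets.csv")
    # Write the file as UTF-8 so the accented character is stored correctly
    with open(path, "w", encoding="UTF-8") as f:
        f.write(csv_text)
    # Read it back, stating the encoding explicitly
    toy = pd.read_csv(path, encoding="UTF-8")

print(toy)
```

For the exercise itself the same read_csv call would point at italian-poets.csv and assign the result to df.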

# write here

Tell me more

Show some info about the dataset
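One possible way to get an overview, shown here on an invented toy table (hypothetical columns name and year), is to combine info() and describe():

```python
import pandas as pd

# Invented rows; None marks a missing value
toy = pd.DataFrame({
    "name": ["Antonio Prova", None, "Cesare Esempio"],
    "year": [1450.0, 1802.0, None],
})

toy.info()             # column names, non-null counts and dtypes
summary = toy.describe()  # summary statistics for the numeric columns
print(summary)
```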

# write here

Getting in shape

Show the row and column counts:
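A minimal sketch on an invented table: the shape attribute is a tuple holding the number of rows followed by the number of columns.

```python
import pandas as pd

# Invented table with 3 rows and 2 hypothetical columns
toy = pd.DataFrame({"name": ["a", "b", "c"], "year": [1, 2, 3]})

rows, cols = toy.shape  # (number of rows, number of columns)
print("rows:", rows, "columns:", cols)
```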

# write here

10 rows

Display the first 10 rows

# write here

Born in Verona

Display all people born in Verona
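Filtering of this kind is usually done with a boolean mask. The sketch below uses an invented table whose column names (name, place) are assumptions, not the real ones from the dataset.

```python
import pandas as pd

# Invented rows with hypothetical column names
toy = pd.DataFrame({
    "name": ["Antonio Prova", "Cesare Esempio", "Paolo Antonio Saggio"],
    "place": ["Verona", "Catania", "Verona"],
})

mask = toy["place"] == "Verona"   # boolean Series, one value per row
born_in_verona = toy[mask]        # keeps only the rows where mask is True
print(born_in_verona)
```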

# write here

How many people in Verona

Display how many people were born in Verona
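To count matching rows, one option is to sum the boolean mask (True counts as 1), which gives the same number as the length of the filtered table. The column name place is hypothetical.

```python
import pandas as pd

# Invented rows with a hypothetical column name
toy = pd.DataFrame({"place": ["Verona", "Catania", "Verona"]})

mask = toy["place"] == "Verona"
count = int(mask.sum())          # True values count as 1
same_count = len(toy[mask])      # equivalent: length of the filtered table
print(count)
```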

# write here

Python is everywhere

Show poets born in Catania in the year -500

  • mind the minus

  • I swear we did not alter the dataset in any way :-)
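Two conditions can be combined with the & operator, wrapping each one in parentheses. This is only a sketch on invented rows, with hypothetical columns name, place and year; years before year 0 are assumed to be stored as negative numbers.

```python
import pandas as pd

# Invented rows; negative years stand for dates before year 0
toy = pd.DataFrame({
    "name": ["Antonio Prova", "Cesare Esempio", "Paolo Saggio"],
    "place": ["Catania", "Catania", "Verona"],
    "year": [-500, 1500, -500],
})

# Each condition needs its own parentheses because & binds tighter than ==
result = toy[(toy["place"] == "Catania") & (toy["year"] == -500)]
print(result)
```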

# write here

Verona after 1500

Display all people born in Verona after the year 1500

# write here

First Antonio

Display all people with Antonio as first name
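One way to look only at the first name, sketched on invented rows with a hypothetical name column, is to split each string on whitespace and compare the first piece. Missing names become NaN and simply do not match.

```python
import pandas as pd

# Invented rows; None marks a missing name
toy = pd.DataFrame({
    "name": ["Antonio Prova", "Paolo Antonio Saggio", "Antonietta Esempio", None],
})

first_name = toy["name"].str.split().str[0]  # first word of each name
antonios = toy[first_name == "Antonio"]
print(antonios)
```

Comparing the whole first word avoids matching longer names that merely start with the same letters.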

# write here

Some Antonio

Display all people with Antonio as one of the names (so also include 'Paolo Antonio Rolli')
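To match a name anywhere in the string, str.contains with a word-boundary regular expression is one possibility. The rows and the name column below are invented; na=False treats missing names as non-matching.

```python
import pandas as pd

# Invented rows; None marks a missing name
toy = pd.DataFrame({
    "name": ["Antonio Prova", "Paolo Antonio Saggio", "Marco Antonioni", None],
})

# \b marks a word boundary, so "Antonioni" is not matched
mask = toy["name"].str.contains(r"\bAntonio\b", na=False)
some_antonio = toy[mask]
print(some_antonio)
```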

# write here

Cesares during the 1800s

Display all people named Cesare who were born in the 1800s

# write here

The old ones

Show poets sorted by year of birth

  • DO NOT include NaN values in the result

HINT: see pd.notnull
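A sketch of the hint, on invented rows with hypothetical columns name and year: pd.notnull builds a mask that drops the missing years, and sort_values then orders what is left.

```python
import pandas as pd

# Invented rows; None becomes NaN in the numeric column
toy = pd.DataFrame({
    "name": ["Antonio Prova", "Cesare Esempio", "Paolo Saggio", "Marco Modello"],
    "year": [1802.0, None, -500.0, 1450.0],
})

known = toy[pd.notnull(toy["year"])]   # rows whose year is not NaN
ordered = known.sort_values(by="year")  # ascending: oldest first
print(ordered)
```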

# write here

Cities of poets

Find the 5 cities with most poets, sorted from most to least.

  • use groupby and sort_values methods
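The groupby and sort_values combination could look roughly like this on an invented table (hypothetical columns name and place). The sketch keeps the top 2 groups only because the toy table is small.

```python
import pandas as pd

# Invented rows with hypothetical column names
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "place": ["Verona", "Catania", "Verona", "Firenze", "Verona", "Catania"],
})

counts = toy.groupby("place").size()                # rows per place, as a Series
top = counts.sort_values(ascending=False).head(2)   # largest groups first
print(top)
```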

# write here

Most duplicated poets

Find the first 8 duplicated poets

# write here

All duplicated poets

Print the number of all duplicated poets

NOTE: a Series object has only one column, even if it looks like it has two (the apparent second one is the index), so if you have a Series object you don't need to specify a column
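The note can be illustrated with a small sketch. It assumes that "duplicated" means a name appearing more than once, which the text does not spell out, and it uses invented rows with a hypothetical name column. value_counts returns a Series, so the filter is written directly on it, without naming a column.

```python
import pandas as pd

# Invented rows; some names are repeated on purpose
toy = pd.DataFrame({
    "name": ["Antonio Prova", "Cesare Esempio", "Antonio Prova",
             "Paolo Saggio", "Cesare Esempio", "Antonio Prova"],
})

counts = toy["name"].value_counts()  # Series: index = name, value = occurrences
repeated = counts[counts > 1]        # filter the Series itself, no column needed
print(repeated.head(8))              # most repeated names first
print(len(repeated))                 # how many names occur more than once
```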

# write here

Northern poets

Find all the poets born north of a given town. In other words, look up the town's latitude (the second coordinate in coords), print it, and then filter the table.

  • DO NOT put constants like 46.5 in your code!

  • DO NOT add new columns for longitude and latitude

  • NOTE: coord column holds just simple strings!

  • HINT: to get the element at a given numerical position i of a filtered Series (regardless of the original dataframe row index), use the .iloc[i] property (note the square brackets!)
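The sketch below shows one way the hints might fit together. It assumes the coordinate strings look like Point(longitude latitude), the form commonly returned by the Wikidata query service; this is not confirmed for the dataset. The column names place and coords and the hand-typed approximate coordinates are likewise only for illustration.

```python
import pandas as pd

# Invented rows; coordinate strings assumed to be "Point(longitude latitude)"
toy = pd.DataFrame({
    "place": ["Bolzano", "Verona", "Catania", "Trento"],
    "coords": ["Point(11.35 46.5)", "Point(10.99 45.44)",
               "Point(15.09 37.5)", "Point(11.12 46.07)"],
})
town = "Trento"

# Latitudes as a separate float Series: no new column is added to the table
lat = toy["coords"].str.extract(r"Point\([-\d.]+ ([-\d.]+)\)", expand=False).astype(float)

# .iloc[0] takes the first element of the filtered Series, whatever its index label
town_lat = lat[toy["place"] == town].iloc[0]
print(town_lat)

north = toy[lat > town_lat]
print(north)
```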

town = 'Bolzano'
#town = 'Trento'

# write here

Papers please

Extract the subject id (e.g. Q8797) and the place id (e.g. Q2028), and MODIFY df by putting them into two new columns, subj_id and place_id
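A possible approach, assuming the original columns hold Wikidata entity URLs ending in the identifier (an assumption: the source column names subject and place and the cell format are not given in the text), is str.extract with a pattern anchored at the end of the string.

```python
import pandas as pd

# Invented rows; the URL form and the source column names are assumptions
toy = pd.DataFrame({
    "subject": ["http://www.wikidata.org/entity/Q8797",
                "http://www.wikidata.org/entity/Q1067"],
    "place": ["http://www.wikidata.org/entity/Q2028",
              "http://www.wikidata.org/entity/Q2044"],
})

# Capture the trailing identifier (Q followed by digits) into new columns
toy["subj_id"] = toy["subject"].str.extract(r"(Q\d+)$", expand=False)
toy["place_id"] = toy["place"].str.extract(r"(Q\d+)$", expand=False)
print(toy[["subj_id", "place_id"]])
```

Cells that do not match the pattern would end up as NaN in the new columns.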

# write here

Unknown poets

Find all the ids of nameless poets and put them in a Python list.

  • DO NOT use loops

  • NOTE: a Series object, from the point of view of Python, is just a sequence
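Since a Series behaves like a sequence, list() can convert it directly. The sketch assumes "nameless" means a missing value in a hypothetical name column; if the dataset marks nameless poets differently, the mask would change accordingly. The ids are placeholders.

```python
import pandas as pd

# Invented rows; ids are placeholders, None marks a missing name
toy = pd.DataFrame({
    "subj_id": ["Q1", "Q2", "Q3", "Q4"],
    "name": ["Antonio Prova", None, "Cesare Esempio", None],
})

nameless = toy[toy["name"].isnull()]  # rows without a name
ids = list(nameless["subj_id"])       # a Series is a sequence, so list() works
print(ids)
```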

# write here

Better unknown poets

Find all the ids, the birthplace and birthdate of nameless poets born after year 0, and put them in a Python list of tuples.

  • birthplaces must be integers - if not specified, put -1

  • also print how many results were found

  • DO NOT use loops or list comprehensions
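A list of tuples can be built without loops or comprehensions by passing several Series to zip. The sketch reads the integer requirement as "the numeric part of the place id, or -1 when missing", which is only one possible interpretation; the columns subj_id, name, place_id and year and all the rows are invented.

```python
import pandas as pd

# Invented rows; ids are placeholders, None marks missing values
toy = pd.DataFrame({
    "subj_id": ["Q1", "Q2", "Q3", "Q4"],
    "name": ["Antonio Prova", None, None, None],
    "place_id": ["Q2028", "Q2028", None, "Q2044"],
    "year": [1450.0, 1500.0, 1802.0, -500.0],
})

# Nameless poets born after year 0 (a missing year never satisfies > 0)
sel = toy[toy["name"].isnull() & (toy["year"] > 0)]

# Assumed interpretation: drop the leading "Q", missing places become -1
place_int = pd.to_numeric(sel["place_id"].str[1:]).fillna(-1).astype(int)
years = sel["year"].astype(int)

# zip pairs the three sequences element by element
result = list(zip(sel["subj_id"].tolist(), place_int.tolist(), years.tolist()))
print(result)
print(len(result))
```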

# write here