The basics#

Introduction#

Python gives powerful tools for data analysis - among the main ones we find Polars, which gives fast and flexible data structures, especially for interactive data analysis.

Let’s start off by importing it:

import polars as pl

A dataframe?#

A dataframe is basically a table, that is a collection of columns, also called Series, which all contain the same number of cells – meaning they’re all of the same height. Generally, each row can be thought of as corresponding to one unit of study (one measurement, one individual, one geographical area, you name it), while each column corresponds to a specific attribute, or variable (name, size, age, description, creation date…).

In Polars, each column has a specific type (integer, string, date, but also list or dictionary!), similarly to NumPy arrays. And, similarly to NumPy arrays, they can take “null” values in certain rows, to represent an absence of data.

But how to create one in practice? Let’s start with a simple example. The concept of a DataFrame is materialised in Polars in the DataFrame class, which we can instantiate (create) simply as follows:

pl.DataFrame()
shape: (0, 0)

Not much to be seen here, because we didn’t pass it any data!

Question

Before running the cells below, try to guess what their output will look like. What are the column names? The values in each row?

Let’s add some data in the form of a dictionary of lists:

df = pl.DataFrame(
    {
        "name": ["Alice", "Bob"],
        "age": [78, None],
    },
    orient="col",
)
df
Hide code cell output
shape: (2, 2)
nameage
stri64
"Alice"78
"Bob"null

Or, equivalently, with a list of dictionaries:

df = pl.DataFrame(
    [
        {"name": "Alice", "age": 78},
        {"name": "Bob"},
    ],
    orient="row",
)
df
Hide code cell output
shape: (2, 2)
nameage
stri64
"Alice"78
"Bob"null

Question

In the second case, adding the null entry to the output enables Polars to comply with an important property of dataframes. Which one?

Ingesting data#

In most cases, you actually want to deal with an existing dataset, that you’d like to analyse. For this course, let’s try analyzing data recording protests from around the globe, collected as part of the Mass Mobilization project. No need to download anything, the data are already in the repository in the form of a CSV file. So let’s read our data, using pl.read_csv:

protests = pl.read_csv("data/protests.csv")
protests
shape: (17_145, 30)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sources
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstr
201990001"Canada"201990"North America"11151199015119900"national"null"1000s""unspecified""political behavior, process""labor wage dispute"nullnull"ignore"nullnullnullnullnullnull"1. great canadian train journe…
201990002"Canada"201990"North America"12256199025619900"Montreal, Quebec"null"1000""unspecified""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. autonomy s cry revived in q…
201990003"Canada"201990"North America"131719901719900"Montreal, Quebec"null"500""separatist parti quebecois""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. quebec protest after queen …
201990004"Canada"201990"North America"1412719906919901"Montreal, Quebec"null"100s""mohawk indians""land farm issue"nullnullnull"accomodation"nullnullnullnullnullnull"1. indians gather as siege int…
201990005"Canada"201990"North America"15148199015819901"Montreal, Quebec"null"950""local residents""political behavior, process"nullnullnull"crowd dispersal""arrests""accomodation"nullnullnullnull"1. dozens hurt in mohawk block…
9102017002"Papua New Guinea"9102017"Oceania"12157201715720171"Mount Hagen""50-99""50+""protesters opposed to counting…"political behavior, process"nullnullnull"crowd dispersal"nullnullnullnullnullnull"violence, chaos and fraud: fra…
9102017003"Papua New Guinea"9102017"Oceania"1331102017311020170"Lorengau""100-999""About 100""locals""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"refugees dig in as camp closes…
9102018000"Papua New Guinea"9102018"Oceania"00nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
9102019000"Papua New Guinea"9102019"Oceania"00nullnullnullnullnullnullnullnullnullnullnullnullnullnull"."nullnullnullnull"."nullnullnull
9102020000"Papua New Guinea"9102020"Oceania"00nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull

Here it is!

Note that here, we dealt with a simple, well-formatted file. In the wild (aka, “the Internet”), you will not always meet such well-behaved data. In that case you’ll probably need to specify one of the (many) arguments of read_csv. Inspect the function to take a quick look!

Not all data are in CSV format (and they shouldn’t, it’s a very slow-to-read and heavy file format!), so Polars provides a number of other functions to ingest data. They are all in the form pl.read_<file-format>(path). Symmetrically, to write from a DataFrame to a file, the equivalent .write_<file-format>(path) are available.

Tip

In case of doubt, choose the .parquet format to save your dataframes. It strikes a very good balance between file size and reading / writing time.

Taking a glimpse#

The one DataFrame method you will probably use the most is .head(). It allows you to see the first 5 rows of the DataFrame, which is very useful to have a quick look at its current state:

protests.head()
shape: (5, 30)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sources
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstr
201990001"Canada"201990"North America"11151199015119900"national"null"1000s""unspecified""political behavior, process""labor wage dispute"nullnull"ignore"nullnullnullnullnullnull"1. great canadian train journe…
201990002"Canada"201990"North America"12256199025619900"Montreal, Quebec"null"1000""unspecified""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. autonomy s cry revived in q…
201990003"Canada"201990"North America"131719901719900"Montreal, Quebec"null"500""separatist parti quebecois""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. quebec protest after queen …
201990004"Canada"201990"North America"1412719906919901"Montreal, Quebec"null"100s""mohawk indians""land farm issue"nullnullnull"accomodation"nullnullnullnullnullnull"1. indians gather as siege int…
201990005"Canada"201990"North America"15148199015819901"Montreal, Quebec"null"950""local residents""political behavior, process"nullnullnull"crowd dispersal""arrests""accomodation"nullnullnullnull"1. dozens hurt in mohawk block…

Note

In the rest of the course, whenever we wish to show the result of some computation, we’ll append .head() to these computations to limit the size of the output. Remember to remove it if you wish to get the full result!

Also, you can get the list of column names using the .columns attribute:

protests.columns
['id',
 'country',
 'ccode',
 'year',
 'region',
 'protest',
 'protestnumber',
 'startday',
 'startmonth',
 'startyear',
 'endday',
 'endmonth',
 'endyear',
 'protesterviolence',
 'location',
 'participants_category',
 'participants',
 'protesteridentity',
 'protesterdemand1',
 'protesterdemand2',
 'protesterdemand3',
 'protesterdemand4',
 'stateresponse1',
 'stateresponse2',
 'stateresponse3',
 'stateresponse4',
 'stateresponse5',
 'stateresponse6',
 'stateresponse7',
 'sources']

And the shape of the frame (number of rows and columns) using .shape:

protests.shape
(17145, 30)

Finally, you can take a quick glimpse at it, featuring the two pieces of info we got above, plus the data types and a few values, with .glimpse():

protests.glimpse()
Rows: 17145
Columns: 30
$ id                    <i64> 201990001, 201990002, 201990003, 201990004, 201990005, 201990006, 201991001, 201991002, 201992001, 201993001
$ country               <str> 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada'
$ ccode                 <i64> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20
$ year                  <i64> 1990, 1990, 1990, 1990, 1990, 1990, 1991, 1991, 1992, 1993
$ region                <str> 'North America', 'North America', 'North America', 'North America', 'North America', 'North America', 'North America', 'North America', 'North America', 'North America'
$ protest               <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ protestnumber         <i64> 1, 2, 3, 4, 5, 6, 1, 2, 1, 1
$ startday              <i64> 15, 25, 1, 12, 14, 19, 10, 28, 4, 16
$ startmonth            <i64> 1, 6, 7, 7, 8, 9, 9, 9, 5, 5
$ startyear             <i64> 1990, 1990, 1990, 1990, 1990, 1990, 1991, 1991, 1992, 1993
$ endday                <i64> 15, 25, 1, 6, 15, 19, 17, 2, 5, 16
$ endmonth              <i64> 1, 6, 7, 9, 8, 9, 9, 10, 5, 5
$ endyear               <i64> 1990, 1990, 1990, 1990, 1990, 1990, 1991, 1991, 1992, 1993
$ protesterviolence     <i64> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0
$ location              <str> 'national', 'Montreal, Quebec', 'Montreal, Quebec', 'Montreal, Quebec', 'Montreal, Quebec', 'Kahnawake Reservation near Montreal, Quebec', 'national', 'national', 'Toronto', 'Ottawa'
$ participants_category <str> None, None, None, None, None, None, None, None, None, None
$ participants          <str> '1000s', '1000', '500', '100s', '950', '200', '110000', '110000', '1000', '10000s'
$ protesteridentity     <str> 'unspecified', 'unspecified', 'separatist parti quebecois', 'mohawk indians', 'local residents', 'mohawk indians', 'public service alliance of canada', 'public service alliance of canada', 'gangs of black and white youths', 'unspecified'
$ protesterdemand1      <str> 'political behavior, process', 'political behavior, process', 'political behavior, process', 'land farm issue', 'political behavior, process', 'police brutality', 'labor wage dispute', 'labor wage dispute', 'police brutality', 'political behavior, process'
$ protesterdemand2      <str> 'labor wage dispute', None, None, None, None, None, None, None, None, None
$ protesterdemand3      <str> None, None, None, None, None, None, None, None, None, None
$ protesterdemand4      <str> None, None, None, None, None, None, None, None, None, None
$ stateresponse1        <str> 'ignore', 'ignore', 'ignore', 'accomodation', 'crowd dispersal', 'crowd dispersal', 'ignore', 'ignore', 'arrests', 'ignore'
$ stateresponse2        <str> None, None, None, None, 'arrests', 'shootings', None, None, None, None
$ stateresponse3        <str> None, None, None, None, 'accomodation', None, None, None, None, None
$ stateresponse4        <str> None, None, None, None, None, None, None, None, None, None
$ stateresponse5        <str> None, None, None, None, None, None, None, None, None, None
$ stateresponse6        <str> None, None, None, None, None, None, None, None, None, None
$ stateresponse7        <str> None, None, None, None, None, None, None, None, None, None
$ sources               <str> '1. great canadian train journeys into history; passenger rail system cut the times london , january 16 1990, tuesday, issue 63604., 427 words, from john best, ottawa;', '1. autonomy s cry revived in quebec the new york times, june 25, 1990, monday, late edition final, section a; page 3, column 4; foreign desk, 1014 words, by alessandra stanley, special to the new york times;', '1. quebec protest after queen calls for unity the times, july 2, 1990, monday, overseas news, 703 words, from john best in ottawa 2. in canada, royal rejection ; quebecois protesters turn out for queen elizabeth s visit the washington post, july 2, 1990, monday, final edition, style; page b1, 633 words, william claiborne, washington post foreign service;', '1. indians gather as siege intensifies; armed confrontation in canada reflects growing militancy the washington post, july 19, 1990, thursday, final edition, first section; page a25, 1330 words, william claiborne, washington post foreign service 2. canada to acquire site of mohawk land dispute the washington post, july 20, 1990, friday, final edition, first section; page a13, 824 words, william claiborne, washington post foreign service 3. mohawks, army still at standoff; bridge to montreal reopens to traffic the washington post, september 7, 1990, friday, final edition, first section; page a17, 1199 words, william claiborne, washington post foreign service;', '1. dozens hurt in mohawk blockade protest the times, august 14, 1990, tuesday, overseas news, 254 words, from john best in ottawa 2. canadians clash with police near a mohawk reservation the new york times, august 15, 1990, wednesday, late edition final, section a; page 16, column 6; foreign desk, 165 words, reuters;', '1. mohawks tear gassed in clash the washington post, september 19, 1990, wednesday, final edition, first section; page a17, 172 words, washington post foreign service;', '1. mulroney acts to crush strike by federal workers the new york times, september 17, 1991, tuesday, late edition final, section a; page 6; column 4; foreign desk, 761 words, by clyde h. farnsworth, special to the new york times 2. canadian strike to end but talks to go on the new york times, september 18, 1991, wednesday, late edition final, section a; page 9; column 2; foreign desk, 188 words, ap;', '1. the washington post, canada seeks to compel end of nationwide federal strike, william claiborne, washington post foreign service, october 1, 1991, tuesday, final edition, first section; page a15, 342 words;', '1. window smashing and looting spree erupts in toronto the new york times, may 5, 1992, tuesday, late edition final, section a; page 3; column 1; foreign desk 2. toronto rioters pelt police the times, may 6, 1992, wednesday, overseas news, 243 words, from john best in ottawa 3. the washington post, toronto debates causes of 2 days of violence; normally peaceful city shaken by gangs, william claiborne, washington post foreign service, may 7, 1992, thursday, final edition, first section; page a27, 685 word;', '1. canadians protest trade pact the new york times, may 16, 1993, sunday, late edition final, section 1; page 7; column 1; foreign desk, 50 words, reuters;'

Operations on columns#

Let’s see now how to select, add and transform columns.

Selecting columns#

If we want a subset of columns, we pass the name of each column to .select():

protests.select('id', 'country', 'year').head()  
shape: (5, 3)
idcountryyear
i64stri64
201990001"Canada"1990
201990002"Canada"1990
201990003"Canada"1990
201990004"Canada"1990
201990005"Canada"1990

Note

This does not change the original DataFrame, to save your selection, you need to assign to a new variable!

Selecting and modifying#

Selecting columns is nice, but once they are selected, we will certainly want to do something with them, right? Like, how can we modify the values in them?

But before going into that, let’s see a slightly different way of selecting columns. The previous code is actually equivalent to:

protests.select(pl.col("id", "country", "year")).head()
shape: (5, 3)
idcountryyear
i64stri64
201990001"Canada"1990
201990002"Canada"1990
201990003"Canada"1990
201990004"Canada"1990
201990005"Canada"1990

But why bother calling pl.col() here, and what does it even do? To answer these questions, for now let’s go back to our data and the glimpse we had above. See the protesterviolence column, that, as far as we can see, is only filled with zeros and ones? Would be nice to check if it’s actually the case, because if it is, each value in this column can probably be treated as a boolean telling us whether part of the protesters used violence.

Well this can simply be done by selecting the column with pl.col(), and then calling the .unique() method on it, as follows:

protests.select(pl.col("protesterviolence").unique())
shape: (3, 1)
protesterviolence
i64
null
0
1

Calling pl.col() therefore enabled us to perform an operation on the selected column! And about the latter, it indeed turns out we only have zeros and ones in there, save some nulls that denote an absence of data.

Note

Putting zeros and ones to encode a boolean is actually a pretty common practice, you might come across this again!

Note

It is usually better to start reading any documentation you can find about the dataset before making tests like these. In some situations though, there is no documentation, so you have to make this kind of tests to validate your assumptions.

We can subsequently safely transform this integer column into a boolean one, using the .cast() method, and (optionally) renaming it using .alias():

protests.select(pl.col("protesterviolence").cast(bool).alias("There was protester violence")).head()
shape: (5, 1)
There was protester violence
bool
false
false
false
true
true

Question

Why does the following code have absolutely no chance of producing the desired behaviour?

protests.select("protesterviolence".cast(bool))

So, to wrap this up, the pl.col() function of Polars is absolutely central: it creates an Expression, allowing you to call one of its many methods to perform computations on the selected column.

Adding columns#

To keep existing columns and add new ones, the .with_columns() method will come in handy.

Let’s first see how to add a new column using data from outside the dataframe:

protests.with_columns(
    from_scalar=pl.lit("a"),
    from_series=pl.Series(list(range(protests.shape[0]))),
).head()
shape: (5, 32)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesfrom_scalarfrom_series
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstri64
201990001"Canada"201990"North America"11151199015119900"national"null"1000s""unspecified""political behavior, process""labor wage dispute"nullnull"ignore"nullnullnullnullnullnull"1. great canadian train journe…"a"0
201990002"Canada"201990"North America"12256199025619900"Montreal, Quebec"null"1000""unspecified""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. autonomy s cry revived in q…"a"1
201990003"Canada"201990"North America"131719901719900"Montreal, Quebec"null"500""separatist parti quebecois""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. quebec protest after queen …"a"2
201990004"Canada"201990"North America"1412719906919901"Montreal, Quebec"null"100s""mohawk indians""land farm issue"nullnullnull"accomodation"nullnullnullnullnullnull"1. indians gather as siege int…"a"3
201990005"Canada"201990"North America"15148199015819901"Montreal, Quebec"null"950""local residents""political behavior, process"nullnullnull"crowd dispersal""arrests""accomodation"nullnullnullnull"1. dozens hurt in mohawk block…"a"4

We’ve covered two cases here:

  • when you’d like to have a column filled with the same value, you will need to specify you’re providing the literal value you want in each row with pl.lit(),

  • when you want to provide directly all the distinct values to be put in rows, you will need pl.Series().

Now, let’s see how to add columns using the values from existing ones. We’ll try to get the duration in days of all the protests in our dataset. To that end, let’s first find out what columns could help us here:

protests.head()
shape: (5, 30)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sources
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstr
201990001"Canada"201990"North America"11151199015119900"national"null"1000s""unspecified""political behavior, process""labor wage dispute"nullnull"ignore"nullnullnullnullnullnull"1. great canadian train journe…
201990002"Canada"201990"North America"12256199025619900"Montreal, Quebec"null"1000""unspecified""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. autonomy s cry revived in q…
201990003"Canada"201990"North America"131719901719900"Montreal, Quebec"null"500""separatist parti quebecois""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. quebec protest after queen …
201990004"Canada"201990"North America"1412719906919901"Montreal, Quebec"null"100s""mohawk indians""land farm issue"nullnullnull"accomodation"nullnullnullnullnullnull"1. indians gather as siege int…
201990005"Canada"201990"North America"15148199015819901"Montreal, Quebec"null"950""local residents""political behavior, process"nullnullnull"crowd dispersal""arrests""accomodation"nullnullnullnull"1. dozens hurt in mohawk block…

It does seem that the start and end day of each protest are each encoded in 3 columns of integers, respectively: ["startyear", "startmonth", "startday"], and ["endyear", "endmonth", "endday"]. This makes it very annoying to compute the duration of protests, because, for instance, if a protest spans more than a month, we can’t just subtract "endday" with "startday", which are day numbers within their month. So let’s first compute two new columns for the start and end date, with an actual date data type, which will save us a headache. Here we’ll use the pl.date() function:

protests = protests.with_columns(
    pl.date("startyear", "startmonth", "startday").alias("start_date"),
    pl.date("endyear", "endmonth", "endday").alias("end_date"),
)
protests.head()
shape: (5, 32)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_date
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedate
201990001"Canada"201990"North America"11151199015119900"national"null"1000s""unspecified""political behavior, process""labor wage dispute"nullnull"ignore"nullnullnullnullnullnull"1. great canadian train journe…1990-01-151990-01-15
201990002"Canada"201990"North America"12256199025619900"Montreal, Quebec"null"1000""unspecified""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. autonomy s cry revived in q…1990-06-251990-06-25
201990003"Canada"201990"North America"131719901719900"Montreal, Quebec"null"500""separatist parti quebecois""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. quebec protest after queen …1990-07-011990-07-01
201990004"Canada"201990"North America"1412719906919901"Montreal, Quebec"null"100s""mohawk indians""land farm issue"nullnullnull"accomodation"nullnullnullnullnullnull"1. indians gather as siege int…1990-07-121990-09-06
201990005"Canada"201990"North America"15148199015819901"Montreal, Quebec"null"950""local residents""political behavior, process"nullnullnull"crowd dispersal""arrests""accomodation"nullnullnullnull"1. dozens hurt in mohawk block…1990-08-141990-08-15

And now we just have to subtract "start_date" from "end_date" to get our duration! There is one small problem though. As you can see from the result above, when a protest occurred over one day, the start_date and end_date are identical. There is a simple solution to that: we can add a duration of one day to every entry by adding pl.duration(days=1). That works exactly like adding a scalar to a NumPy array: the value is added to each element.

protests = protests.with_columns(
    duration=pl.col("end_date") - pl.col("start_date") + pl.duration(days=1)
)
protests.head()
shape: (5, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
201990001"Canada"201990"North America"11151199015119900"national"null"1000s""unspecified""political behavior, process""labor wage dispute"nullnull"ignore"nullnullnullnullnullnull"1. great canadian train journe…1990-01-151990-01-151d
201990002"Canada"201990"North America"12256199025619900"Montreal, Quebec"null"1000""unspecified""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. autonomy s cry revived in q…1990-06-251990-06-251d
201990003"Canada"201990"North America"131719901719900"Montreal, Quebec"null"500""separatist parti quebecois""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. quebec protest after queen …1990-07-011990-07-011d
201990004"Canada"201990"North America"1412719906919901"Montreal, Quebec"null"100s""mohawk indians""land farm issue"nullnullnull"accomodation"nullnullnullnullnullnull"1. indians gather as siege int…1990-07-121990-09-0657d
201990005"Canada"201990"North America"15148199015819901"Montreal, Quebec"null"950""local residents""political behavior, process"nullnullnull"crowd dispersal""arrests""accomodation"nullnullnullnull"1. dozens hurt in mohawk block…1990-08-141990-08-152d

Note

Here we gave our column a name through the name of the keyword argument passed to with_columns. This is just another way to .alias().

Exercise: years of protest#

In the protests dataframe, create the two following integer columns:

  • nr_years_calendar indicating how many different calendar years each protest spanned,

  • nr_years indicating how many (actual) years the protest lasted

protests.select(pl.col('duration'))
shape: (17_145, 1)
duration
duration[ms]
1d
1d
1d
57d
2d
1d
1d
null
null
null
Hide code cell source
protests.with_columns(
    nr_years_calendar=pl.col('endyear') - pl.col('startyear') + 1,
    nr_years=pl.col('duration').dt.total_days() // 365,
).head()
shape: (5, 35)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_datedurationnr_years_calendarnr_years
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]i64i64
201990001"Canada"201990"North America"11151199015119900"national"null"1000s""unspecified""political behavior, process""labor wage dispute"nullnull"ignore"nullnullnullnullnullnull"1. great canadian train journe…1990-01-151990-01-151d10
201990002"Canada"201990"North America"12256199025619900"Montreal, Quebec"null"1000""unspecified""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. autonomy s cry revived in q…1990-06-251990-06-251d10
201990003"Canada"201990"North America"131719901719900"Montreal, Quebec"null"500""separatist parti quebecois""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. quebec protest after queen …1990-07-011990-07-011d10
201990004"Canada"201990"North America"1412719906919901"Montreal, Quebec"null"100s""mohawk indians""land farm issue"nullnullnull"accomodation"nullnullnullnullnullnull"1. indians gather as siege int…1990-07-121990-09-0657d10
201990005"Canada"201990"North America"15148199015819901"Montreal, Quebec"null"950""local residents""political behavior, process"nullnullnull"crowd dispersal""arrests""accomodation"nullnullnullnull"1. dozens hurt in mohawk block…1990-08-141990-08-152d10

Hint

The second one is tricky: take into account that the column “duration” is a Duration data type. You’ll need to head over to the documentation on temporal expressions, and search in particular for those that apply to a Duration to find a method that can help you out.

Operations on rows#

If we consider the rows of a dataset, typically we may wish to select only some of them, change their order or add some new rows.

Selecting by row numbers#

The simplest way to select only certain rows is through their corresponding number, i.e., their vertical position in the frame.

For instance, you can select the first two or the three last rows in the corresponding following ways:

protests.head(2)
shape: (2, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
201990001"Canada"201990"North America"11151199015119900"national"null"1000s""unspecified""political behavior, process""labor wage dispute"nullnull"ignore"nullnullnullnullnullnull"1. great canadian train journe…1990-01-151990-01-151d
201990002"Canada"201990"North America"12256199025619900"Montreal, Quebec"null"1000""unspecified""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. autonomy s cry revived in q…1990-06-251990-06-251d
protests.tail(3)
shape: (3, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
9102018000"Papua New Guinea"9102018"Oceania"00nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
9102019000"Papua New Guinea"9102019"Oceania"00nullnullnullnullnullnullnullnullnullnullnullnullnullnull"."nullnullnullnull"."nullnullnullnullnullnull
9102020000"Papua New Guinea"9102020"Oceania"00nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull

Note

As shown earlier, if you call .head() or .tail() without any argument, they will show you 5 rows.

It’s also possible to select contiguous rows by using slicing, as we already did for strings, lists and arrays.

For example, here we select two rows starting from the fifth (included):

protests.slice(5, 2)
shape: (2, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
201990006"Canada"201990"North America"16199199019919900"Kahnawake Reservation near Mon…null"200""mohawk indians""police brutality"nullnullnull"crowd dispersal""shootings"nullnullnullnullnull"1. mohawks tear gassed in clas…1990-09-191990-09-191d
201991001"Canada"201991"North America"11109199117919910"national"null"110000""public service alliance of can…"labor wage dispute"nullnullnull"ignore"nullnullnullnullnullnull"1. mulroney acts to crush stri…1991-09-101991-09-178d

Selecting rows fulfilling some conditions#

Selecting by row numbers is simple, but most of the time you do not know in which position the relevant rows are, as their relevance depends on one or more specific conditions. So let’s see now how to filter data according to a condition they should satisfy.

This can be done using the .filter() method of DataFrames. As argument, it takes an expression evaluated for each row. All rows for which the expression evaluates to True are then kept.

So, for instance, to only keep rows corresponding to protests that happened in Italy:

protests.filter(pl.col("country") == "Italy").head()
shape: (5, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
3251990001"Italy"3251990"Europe"11137199013719901"Sicilian town of Ribera""100-999""> 50""farmers""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"sicilians riot over water shor…1990-07-131990-07-131d
3251991001"Italy"3251991"Europe"116619916619910"the Italian port of Brindisi""100-999""hundreds""albanians""political behavior, process"nullnullnull"accomodation"nullnullnullnullnullnull"refugees riot. the times. june…1991-06-061991-06-061d
3251991002"Italy"3251991"Europe"1222101991221019910"major cities""5000-10000""Thousands ""people""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. italians strike to protest …1991-10-221991-10-221d
3251992001"Italy"3251992"Europe"11286199228619920"Palermo"">10000""100000""people""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. anti mafia protest. the was…1992-06-281992-06-281d
3251992002"Italy"3251992"Europe"12217199221719920"Milan, Rome,Palermo, and in ot…">10000""> 15000""people and police officers""political behavior, process"nullnullnull"ignore""beatings"nullnullnullnullnull"1. sicilians jeer italian lead…1992-07-211992-07-211d

Notice here the usage of pl.col() that we’ve seen earlier! Again, this allows us to access the values contained in a column, here to make an equality test and return a boolean.

And again, as seen previously when we computed the protests’ durations, values from different columns can be compared. To illustrate that, here below we’ll select all protests that spanned more than one calendar year:

protests.filter(pl.col("startyear") != pl.col("endyear")).head()
shape: (5, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
701991004"Mexico"701991"North America"142311199115119920"Tabasco to Mexico City"null"100s""democratic revolutionary party…"political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. the washington post, mexica…1991-11-231992-01-1554d
1001994005"Colombia"1001994"South America"152212199411119950"southern Colombia"null"5000""putumayo peasant movement""political behavior, process"nullnullnull"accomodation""killings""crowd dispersal"nullnullnullnull"xinhua news agency, january 3,…1994-12-221995-01-1121d
1011996013"Venezuela"1011996"South America"1132712199615119970"national level"null"50""health workers""labor wage dispute"nullnullnull"ignore"nullnullnullnullnullnull"the associated press, january …1996-12-271997-01-1520d
1012002010"Venezuela"1012002"South America"11031220024220030"national level"null"10000""workers""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"the washington post, december …2002-12-032003-02-0464d
1012004016"Venezuela"1012004"South America"1161612200422120050"national level 4 different pr…null"2500""inmates""political behavior, process"nullnullnull"accomodation"nullnullnullnullnullnull"associated press worldstream, …2004-12-162005-01-2238d

It’s also possible to combine conditions like we already did in Numpy filtering: for example by using the special operator conjunction & :

protests.filter(
    (pl.col("country") == "Italy")
    & (pl.col("year") == 1992)
)
shape: (6, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
3251992001"Italy"3251992"Europe"11286199228619920"Palermo"">10000""100000""people""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. anti mafia protest. the was…1992-06-281992-06-281d
3251992002"Italy"3251992"Europe"12217199221719920"Milan, Rome,Palermo, and in ot…">10000""> 15000""people and police officers""political behavior, process"nullnullnull"ignore""beatings"nullnullnullnullnull"1. sicilians jeer italian lead…1992-07-211992-07-211d
3251992003"Italy"3251992"Europe"13310199231019921"Rome""100-999""hundreds""left wing italian demonstrator…"political behavior, process"nullnullnull"beatings""crowd dispersal"nullnullnullnullnull"police attack rome rioters. th…1992-10-031992-10-031d
3251992004"Italy"3251992"Europe"1413101992131019920"Across Italy"">10000""millions""workers""political behavior, process"nullnullnull"ignore""accomodation"nullnullnullnullnull"1. millions in italy march in …1992-10-131992-10-131d
3251992005"Italy"3251992"Europe"1526101992261019920"Rome"">10000""10000""shopkeepers""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"italy shopowners protest tax p…1992-10-261992-10-261d
3251992006"Italy"3251992"Europe"1616121992161219920"Rome"">10000""20000""doctors""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"rome doctors demonstrate to pr…1992-12-161992-12-161d

Sorting#

To obtain a new dataframe sorted according to one or more columns, we can use the .sort() method. For instance, to see the longest protests, we’d do:

protests.sort(by='duration', descending=True).head()
shape: (5, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
201998000"Canada"201998"North America"00nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
201999000"Canada"201999"North America"00nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
202001000"Canada"202001"North America"00nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
202002000"Canada"202002"North America"00nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
202004000"Canada"202004"North America"00nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull

Oops, that gives us only rows with a null duration, let’s put them last!

protests.sort(by='duration', descending=True, nulls_last=True).head()
shape: (5, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
6002011005"Morocco"6002011"MENA"1517201124120140"Imider""50-99"">50""protesters, students, locals, …"political behavior, process""price increases, tax policy"nullnull"killings"nullnullnullnullnullnull"on moroccan hill, villagers ma…2011-07-012014-01-24939d
7502014011"India"7502014"Asia"1113620143620161"Jawahar Bagh, a public garden …"2000-4999""About 2,000 people""a group that was calling for h…"political behavior, process""price increases, tax policy"nullnull"crowd dispersal""killings""shootings"nullnullnullnull"24 killed as armed protesters …2014-06-032016-06-03732d
3502013013"Greece"3502013"Europe"113227201311520150"Athens""100-999""100s""state paid cleaning ladies""labor wage dispute"nullnullnull"accomodation"nullnullnullnullnull".""germany urges greece to vote o…2013-07-222015-05-11659d
6522011003"Syria"6522011"MENA"13183201131220121"National"null"10000s "null"political behavior, process""removal of politician""police brutality"null"killings""shootings""beatings""crowd dispersal""arrests""accomodation"null"security forces restrained as …2011-03-182012-12-03627d
1552016001"Chile"1552016"South America"11155201617520170"Chiloe Island""50-99"">50""fishermen""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. toxic red tide in chile pro…2016-05-152017-05-17368d

Exercise - Longest Italian protests#

✪ Analyze data from protests to find the 5 longest protests in Italy:

Hide code cell source
# The two following ways are equivalent, the first might save you a few keystrokes.
protests.filter(pl.col("country") == "Italy").sort(by="duration").tail().reverse()

protests.filter(pl.col("country") == "Italy").sort(
    by="duration", descending=True, nulls_last=True
).head()
shape: (5, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
3252002008"Italy"3252002"Europe"18205200220720020"Sicily""100-999""100s""peasants, farmers""land farm issue"nullnullnull"crowd dispersal""arrests""accomodation"nullnullnullnull"mafia sells water in a land hi…2002-05-202002-07-2062d
3252010003"Italy"3252010"Europe"13292010231020101"Terzigno and Boscotrecase""100-999""100s""residents""land farm issue""political behavior, process"nullnull"crowd dispersal""arrests""accomodation"nullnullnullnull"1 why italy s trash wars are s…2010-09-022010-10-2352d
3252017001"Italy"3252017"Europe"1124320176420170"Rome""100-999""100""residents""land farm issue"nullnullnull"accomodation"nullnullnullnullnullnull"new bulbs turn romans into zom…2017-03-242017-04-0614d
3252018001"Italy"3252018"Europe"111120189120180"national""100-999""100""protesters""social restrictions"nullnullnull"ignore"nullnullnullnullnullnull"biodegradable bags cause outra…2018-01-012018-01-099d
3252017006"Italy"3252017"Europe"16157201722720170"Sicily""50-99""50""five star movement""land farm issue"nullnullnull"ignore"nullnullnullnullnullnull"italian town loses patience wi…2017-07-152017-07-228d

Operations on strings#

To perform operations on string columns, we need to use the .str accessor after selecting the right column. This gives you access to a number of methods, some similar to known Python method to manipulate strings, some not. For more details, head over to the documentation.

Filter by textual values#

When we want to filter by text values, we can use .str.contains, here for example we select all the protests in which the source article mentioned a “riot”:

protests.filter(pl.col("sources").str.contains("riot")).head()
shape: (5, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
201992001"Canada"201992"North America"114519925519921"Toronto"null"1000""gangs of black and white youth…"police brutality"nullnullnull"arrests"nullnullnullnullnullnull"1. window smashing and looting…1992-05-041992-05-052d
202008001"Canada"202008"North America"11108200810820081"Montreal, Quebec"null"100s""unspecified""police brutality"nullnullnull"crowd dispersal"nullnullnullnullnullnull"1. canada: rioting in montreal…2008-08-102008-08-101d
202012001"Canada"202012"North America"1110220126620121"Quebec"null"1000s""university students""price increases, tax policy"nullnullnull"crowd dispersal""arrests"nullnullnullnullnull"1. quebec plans to suspend cla…2012-02-102012-06-06118d
401994001"Cuba"401994"North America"115819945819941"Havana"null"1000""unspecified""political behavior, process"nullnullnull"arrests"nullnullnullnullnullnull"1. protesters battle police in…1994-08-051994-08-051d
411995002"Haiti"411995"North America"1223111995231119951"Cite Soleil"null"50""residents of cite soleil""police brutality"nullnullnull"ignore"nullnullnullnullnullnull"1. fatal clashes in haiti slum…1995-11-231995-11-231d

Extracting strings#

We can also use it, for instance, to extract a rough estimate of the number of participants as an integer, as follows:

protests.with_columns(pl.col('participants').str.extract("([0-9]+)").cast(int)).head()
shape: (5, 33)
idcountryccodeyearregionprotestprotestnumberstartdaystartmonthstartyearenddayendmonthendyearprotesterviolencelocationparticipants_categoryparticipantsprotesteridentityprotesterdemand1protesterdemand2protesterdemand3protesterdemand4stateresponse1stateresponse2stateresponse3stateresponse4stateresponse5stateresponse6stateresponse7sourcesstart_dateend_dateduration
i64stri64i64stri64i64i64i64i64i64i64i64i64strstri64strstrstrstrstrstrstrstrstrstrstrstrstrdatedateduration[ms]
201990001"Canada"201990"North America"11151199015119900"national"null1000"unspecified""political behavior, process""labor wage dispute"nullnull"ignore"nullnullnullnullnullnull"1. great canadian train journe…1990-01-151990-01-151d
201990002"Canada"201990"North America"12256199025619900"Montreal, Quebec"null1000"unspecified""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. autonomy s cry revived in q…1990-06-251990-06-251d
201990003"Canada"201990"North America"131719901719900"Montreal, Quebec"null500"separatist parti quebecois""political behavior, process"nullnullnull"ignore"nullnullnullnullnullnull"1. quebec protest after queen …1990-07-011990-07-011d
201990004"Canada"201990"North America"1412719906919901"Montreal, Quebec"null100"mohawk indians""land farm issue"nullnullnull"accomodation"nullnullnullnullnullnull"1. indians gather as siege int…1990-07-121990-09-0657d
201990005"Canada"201990"North America"15148199015819901"Montreal, Quebec"null950"local residents""political behavior, process"nullnullnull"crowd dispersal""arrests""accomodation"nullnullnullnull"1. dozens hurt in mohawk block…1990-08-141990-08-152d

Note

"([0-9]+)" is a regular expression. A whole course could be dedicated to them, so for now, you should just know that this particular one extracts any sequence of integers from a string. They are a very practical tool, so I wanted to make you aware of their existence, but you don’t need to know more than that for this course.

Note

There is a bunch of other data types with a special accessor, but we won’t cover them here.

Exercise - Game of Thrones#

Open with Polars the file game-of-thrones.csv which holds episodes from various years.

  • IMPORTANT: place the dataframe into the variable game, so not to confuse it with previous dataframes

Data source: Kaggle - License: CC0: Public Domain

Hide code cell source
game = pl.read_csv('data/game-of-thrones.csv')

Favorite episodes#

You are given a dictionary favorite with the most liked episodes of a group of people, who unfortunately don’t remember exactly the various titles which are often incomplete: Select the favorite episodes of Paolo and Chiara.

  • assume the capitalization in favorite is the correct one

  • NOTE: the dataset contains insidious double quotes around the titles, but if you write the code in the right way it shouldn’t be a problem

favorite = {
    "Paolo": "Winter Is",
    "Chiara": "Wolf and the Lion",
    "Anselmo": "Fire and",
    "Letizia": "Garden of",
}
Hide code cell source
fav_episodes = game.filter(
    pl.col("Title").str.contains(favorite["Paolo"])
    | pl.col("Title").str.contains(favorite["Chiara"])
)
fav_episodes
shape: (2, 10)
No. overallNo. in seasonSeasonTitleDirected byWritten byNovel(s) adaptedOriginal air dateU.S. viewers(millions)Imdb rating
i64i64i64strstrstrstrstrf64f64
111""Winter Is Coming"""Tim Van Patten""David Benioff & D. B. Weiss""A Game of Thrones""17-Apr-11"2.229.1
551""The Wolf and the Lion"""Brian Kirk""David Benioff & D. B. Weiss""A Game of Thrones""15-May-11"2.589.1

First airing#

Write a function first_aired that returns a dataframe containing all the episodes which have been aired the first time in a given year (Original air date column)

  • NOTE: year is given as an int

Hide code cell source
def first_aired(df, year):
    return df.filter(pl.col("Original air date").str.slice(-2) == str(year)[2:])
first_aired(game, 2017)
shape: (6, 10)
No. overallNo. in seasonSeasonTitleDirected byWritten byNovel(s) adaptedOriginal air dateU.S. viewers(millions)Imdb rating
i64i64i64strstrstrstrstrf64f64
6227""Stormborn"""Mark Mylod""Bryan Cogman""Outline from A Dream of Spring…"23-Jul-17"9.278.9
6337""The Queen's Justice"""Mark Mylod""David Benioff & D. B. Weiss""Outline from A Dream of Spring…"30-Jul-17"9.259.2
6447""The Spoils of War"""Matt Shakman""David Benioff & D. B. Weiss""Outline from A Dream of Spring…"6-Aug-17"10.179.8
6557""Eastwatch"""Matt Shakman""Dave Hill""Outline from A Dream of Spring…"13-Aug-17"10.728.8
6667""Beyond the Wall"""Alan Taylor""David Benioff & D. B. Weiss""Outline from A Dream of Spring…"20-Aug-17"10.249.0
6777""The Dragon and the Wolf"""Jeremy Podeswa""David Benioff & D. B. Weiss""Outline from A Dream of Spring…"27-Aug-17"12.079.4

A parallel with spreadsheets#

You’re probably familiar with some spreadsheets software, such as Excel or Google Sheets. To make an analogy, you can picture a dataframe as a single sheet, which in Python is stored as a variable. Then, if you do not reassign this variable, the dataframe stays the same – so in the analogy, the sheet is untouched. So while in a spreadsheet, you’d compute for instance the sum of a column in a cell on the same sheet, in Python, you’d assign the result of the sum to a new variable, that is, to a new sheet. Same goes for filtering or reordering the rows: if you do not explicitly override the original variable, the original data stays intact.