Advanced stuff#

Let’s see how to do more advanced operations with Polars, like plotting, grouping with group_by, and joining tables with join.

import polars as pl
import matplotlib.pyplot as plt

Let’s first reload the dataframe from the previous course. As a warmup exercise, try to write code to get the following table in a single assignment (meaning, only one protests = ...). There’s just one new thing you should do compared to the previous course: start by filtering out rows for which the values in the column "protest" are different from 1.

Solution:
protests = (
    pl.read_csv("data/protests.csv")
    .filter(pl.col("protest") == 1)
    .with_columns(
        pl.col("protesterviolence").cast(bool).alias("protester_violence"),
        pl.col("participants").str.extract("([0-9]+)").cast(int),
        pl.date("startyear", "startmonth", "startday").alias("start_date"),
        pl.date("endyear", "endmonth", "endday").alias("end_date"),
    )
    .with_columns(
        duration=pl.col("end_date") - pl.col("start_date") + pl.duration(days=1)
    )
    .select(
        "id",
        "country",
        "protester_violence",
        "participants",
        "start_date",
        "end_date",
        "duration",
    )
)
protests
shape: (15_239, 7)

id          country             protester_violence  participants  start_date  end_date    duration
i64         str                 bool                i64           date        date        duration[ms]
201990001   "Canada"            false               1000          1990-01-15  1990-01-15  1d
201990002   "Canada"            false               1000          1990-06-25  1990-06-25  1d
201990003   "Canada"            false               500           1990-07-01  1990-07-01  1d
201990004   "Canada"            true                100           1990-07-12  1990-09-06  57d
201990005   "Canada"            true                950           1990-08-14  1990-08-15  2d
…           …                   …                   …             …           …           …
9102014001  "Papua New Guinea"  true                100           2014-02-16  2014-02-18  3d
9102016001  "Papua New Guinea"  true                1000          2016-05-15  2016-06-09  26d
9102017001  "Papua New Guinea"  false               50            2017-06-15  2017-06-15  1d
9102017002  "Papua New Guinea"  true                50            2017-07-15  2017-07-15  1d
9102017003  "Papua New Guinea"  false               100           2017-10-31  2017-10-31  1d

Plotting data#

While exploring data, looking at tables is often not the best way to gain insights about your dataset. Fortunately, it’s very easy to make quick plots from a dataframe.

With Matplotlib#

The first option is to use Matplotlib and, in particular, the implicit interface that we didn’t cover in the Matplotlib chapter, which is handy for quick plots. Instead of first creating an Axes and then calling a plotting method on it, you can directly call the equivalent function from plt, which creates the Figure and Axes for you:

f = pl.col("participants") < 1e4
plt.hist('participants', data=protests.filter(f));
[Figure: histogram of the number of participants]

Note the use of the data argument, in which we pass a dataframe, exactly as we did with a dictionary in the Matplotlib chapter. Here we filtered out protests with 10,000 or more participants for visualization purposes.

With hvplot#

If you have hvplot installed, you can also make interactive plots from a dataframe, using the .plot accessor:

protests.filter(f).plot.scatter(x='start_date', y='participants', by='protester_violence')

Exercise - time distribution#

Plot the distribution of start_date in the dataset, first with plt, and then with hvplot, using 20 bins in both cases.

Solution (with plt):
plt.hist('start_date', data=protests, bins=20);
[Figure: histogram of protest start dates]

Solution (with hvplot):
protests.plot.hist('start_date', bins=20)

Note

Method and argument names are not always the same between plt and hvplot, but most of the time, they are!

Aggregate#

Here we’ll cover the very basics of DataFrame aggregation. To go further, you can check out the documentation.

Globally#

There are a number of statistics we can get from our dataset. For instance, we can compute the average duration of protests as follows:

avg_duration = protests.select(pl.col("duration").mean().alias("average_duration"))
avg_duration
shape: (1, 1)

average_duration
duration[ms]
2d 14h 35m 1s 82ms

If you want to extract this single value into a usual Python type, you can use the .item() method:

avg_duration.item()
datetime.timedelta(days=2, seconds=52501, microseconds=82000)
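
Since .item() here returns a plain datetime.timedelta, all the standard attributes and methods of that type are available. A small sketch:

avg = avg_duration.item()
print(avg.days)                     # the whole-day part of the duration
print(avg.total_seconds() / 86400)  # the full duration expressed in days, as a float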

See also

For an extensive list of operations that give aggregate results, see the documentation.

Let’s group by something!#

Now what if we want to get this statistic by country? That’s what the group_by method is made for. To get an idea of what a group is, let us group by country, and look at the very first group as follows:

for (group, df) in protests.group_by('country'):
    print(group)
    display(df)
    break
('Slovenia',)
shape: (21, 7)

id          country     protester_violence  participants  start_date  end_date    duration
i64         str         bool                i64           date        date        duration[ms]
3491997001  "Slovenia"  false               1000          1997-02-06  1997-02-06  1d
3491997002  "Slovenia"  false               200           1997-05-26  1997-05-26  1d
3491997003  "Slovenia"  false               3000          1997-12-09  1997-12-09  1d
3492003001  "Slovenia"  false               600           2003-01-31  2003-01-31  1d
3492005001  "Slovenia"  false               200           2005-07-02  2005-07-02  1d
…           …           …                   …             …           …           …
3492013004  "Slovenia"  false               20000         2013-02-08  2013-02-08  1d
3492013005  "Slovenia"  false               5000          2013-03-09  2013-03-09  1d
3492013006  "Slovenia"  false               1000          2013-04-27  2013-04-27  1d
3492015001  "Slovenia"  false               50            2015-10-21  2015-10-21  1d
3492018001  "Slovenia"  false               50            2018-03-14  2018-03-14  1d

As you can see, when iterating over a group_by result, each iteration yields a tuple. Its first element is the value of the group_by key (here, the country name, wrapped in a tuple), and its second element is the DataFrame corresponding to this value.
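
The key is always a tuple, which pays off when grouping by several columns at once. A quick sketch:

# Grouping by two keys: each iteration yields ((country, violence), sub-dataframe).
for (country, violence), df in protests.group_by("country", "protester_violence"):
    print(country, violence, df.height)
    break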

Let’s aggregate#

Now let’s say we want the average duration of the protests that happened in each country. Naively, you could take the code above and compute the mean at each iteration of the for loop. The good news is that there is a simpler, faster way: chaining the .agg() method:

protests.group_by("country").agg(pl.col("duration").mean().alias("average_duration")).head()
shape: (5, 2)

country     average_duration
str         duration[ms]
"Zambia"    1d 6h 18m 56s 842ms
"Cameroon"  1d 3h 39m 39s 661ms
"Ethiopia"  1d 7h 48m
"Romania"   1d 12h 16m 26s 301ms
"Nigeria"   1d 6h 6m 32s 727ms

And, exactly as with .with_columns() or .select(), you may perform as many aggregations as you want.

Try now to also add a new column to the dataframe above, featuring the number of protests by country, and sort the result so that countries with the most protests appear first (refer to the documentation linked above!):

Solution:
country_stats = (
    protests.group_by("country")
    .agg(
        pl.col("duration").mean().alias("average_duration"),
        pl.col("id").count().alias("number_of_protests"),
    )
    .sort(by="number_of_protests", descending=True)
)
country_stats
shape: (166, 3)

country           average_duration      number_of_protests
str               duration[ms]          u32
"United Kingdom"  1d 23h 55m 1s 38ms    578
"France"          2d 7h 24m 53s 967ms   547
"Ireland"         1d 1h 10m 9s 744ms    431
"Germany"         1d 23h 36m 15s 824ms  364
"Kenya"           1d 6h 55m 32s 571ms   350
…                 …                     …
"Laos"            1d                    2
"Cape Verde"      1d                    2
"Bhutan"          4d                    2
"Qatar"           6d                    1
"South Sudan"     2d                    1

Exercise - Top protesters#

Make a bar plot showing the number of protests in the 10 countries which had the most protests in their history.

Tip

You can use Axes.barh to make a horizontal bar plot.

Solution:
fig, ax = plt.subplots()
ax.barh('country', 'number_of_protests', data=country_stats.head(10).reverse())
ax.set_xlabel('Number of protests');
[Figure: horizontal bar plot of the number of protests for the top 10 countries]

Tip

When you have long categorical labels, such as country names here, it’s often a good idea to use a horizontal bar plot.

Modifying a dataframe by plugging in the result of a grouping#

Sometimes, we may want to compute an aggregate statistic and then use it to derive a metric in a new column. For instance, it is very common to compute a proportion or a ratio, which involves dividing a count by a sum over the elements of each group. Let’s compute, for each protest, the share of its country’s total protest participants that it represented.

If we proceed as above, we’d start by performing the following group_by aggregation:

protests.group_by("country").agg(pl.col("participants").sum().alias("participants_sum")).head()
shape: (5, 2)

country        participants_sum
str            i64
"Timor Leste"  5000
"Luxembourg"   15350
"Philippines"  1809782
"Indonesia"    2494557
"Gabon"        17050

In fact, in that case we don’t need to call group_by() at all; we can use a window function instead:

protests = protests.with_columns(
    pl.col("participants").sum().over("country").alias("participants_sum")
).with_columns(participant_ratio=pl.col("participants") / pl.col("participants_sum"))
protests.head()
shape: (5, 9)

id         country   protester_violence  participants  start_date  end_date    duration      participants_sum  participant_ratio
i64        str       bool                i64           date        date        duration[ms]  i64               f64
201990001  "Canada"  false               1000          1990-01-15  1990-01-15  1d            617570            0.001619
201990002  "Canada"  false               1000          1990-06-25  1990-06-25  1d            617570            0.001619
201990003  "Canada"  false               500           1990-07-01  1990-07-01  1d            617570            0.00081
201990004  "Canada"  true                100           1990-07-12  1990-09-06  57d           617570            0.000162
201990005  "Canada"  true                950           1990-08-14  1990-08-15  2d            617570            0.001538

In the end, all you want to do is add a column, so you start as always by calling .with_columns() and selecting the column whose data you’ll use with pl.col("participants"). The new element is the .sum().over() operation, which performs the aggregation (the sum) over the window defined by the argument of over() (here, rows sharing the same "country" value).
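
The same pattern works with any aggregation. For instance, here is a sketch attaching each country’s average protest duration to every row (the column name avg_country_duration is just for illustration):

protests.with_columns(
    # Average the duration within each country and broadcast it back to its rows
    pl.col("duration").mean().over("country").alias("avg_country_duration")
).head()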

Joining tables#

Suppose we want to add a column to protests with the population of each country, to compute the proportion of the population that participated in each protest, for instance. To do so, we would need to join our dataset with another one containing such information.

Getting another table#

To get the population of each country, we can take for example the dataset countries_pop.csv, from the World Bank:

countries_pop = pl.read_csv('data/countries_pop.csv')
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[18], line 1
----> 1 countries_pop = pl.read_csv('data/countries_pop.csv')

    (... internal Polars frames elided ...)

ComputeError: found more fields than defined in 'Schema'

Consider setting 'truncate_ragged_lines=True'.

Ah, looks like we’re off to a great start! And that error message is not the clearest. So let’s have a quick look at our data file:

!head data/countries_pop.csv
"Data Source","World Development Indicators",



"Last Updated Date","2024-05-30",



"Country Name","Country Code","Indicator Name","Indicator Code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020","2021","2022","2023",

"Aruba","ABW","Population, total","SP.POP.TOTL","54608","55811","56682","57475","58178","58782","59291","59522","59471","59330","59106","58816","58855","59365","60028","60715","61193","61465","61738","62006","62267","62614","63116","63683","64174","64478","64553","64450","64332","64596","65712","67864","70192","72360","74710","77050","79417","81858","84355","86867","89101","90691","91781","92701","93540","94483","95606","96787","97996","99212","100341","101288","102112","102880","103594","104257","104874","105439","105962","106442","106585","106537","106445","",

"Africa Eastern and Southern","AFE","Population, total","SP.POP.TOTL","130692579","134169237","137835590","141630546","145605995","149742351","153955516","158313235","162875171","167596160","172475766","177503186","182599092","187901657","193512956","199284304","205202669","211120911","217481420","224315978","230967858","237937461","245386717","252779730","260209149","267938123","276035920","284490394","292795186","301124880","309890664","318544083","326933522","335625136","344418362","353466601","362985802","372352230","381715600","391486231","401600588","412001885","422741118","433807484","445281555","457153837","469508516","482406426","495748900","509410477","523459657","537792950","552530654","567892149","583651101","600008424","616377605","632746570","649757148","667242986","685112979","702977106","720859132","",

"Afghanistan","AFG","Population, total","SP.POP.TOTL","8622466","8790140","8969047","9157465","9355514","9565147","9783147","10010030","10247780","10494489","10752971","11015857","11286753","11575305","11869879","12157386","12425267","12687301","12938862","12986369","12486631","11155195","10088289","9951449","10243686","10512221","10448442","10322758","10383460","10673168","10694796","10745167","12057433","14003760","15455555","16418912","17106595","17788819","18493132","19262847","19542982","19688632","21000256","22645130","23553551","24411191","25442944","25903301","26427199","27385307","28189672","29249157","30466479","31541209","32716210","33753499","34636207","35643418","36686784","37769499","38972230","40099462","41128771","",

"Africa Western and Central","AFW","Population, total","SP.POP.TOTL","97256290","99314028","101445032","103667517","105959979","108336203","110798486","113319950","115921723","118615741","121424797","124336039","127364044","130563107","133953892","137548613","141258400","145122851","149206663","153459665","157825609","162323313","167023385","171566640","176054495","180817312","185720244","190759952","195969722","201392200","206739024","212172888","217966101","223788766","229675775","235861484","242200260","248713095","255482918","262397030","269611898","277160097","284952322","292977949","301265247","309824829","318601484","327612838","336893835","346475221","356337762","366489204","376797999","387204553","397855507","408690375","419778384","431138704","442646825","454306063","466189102","478185907","490330870","",

"Angola","AGO","Population, total","SP.POP.TOTL","5357195","5441333","5521400","5599827","5673199","5736582","5787044","5827503","5868203","5928386","6029700","6177049","6364731","6578230","6802494","7032713","7266780","7511895","7771590","8043218","8330047","8631457","8947152","9276707","9617702","9970621","10332574","10694057","11060261","11439498","11828638","12228691","12632507","13038270","13462031","13912253","14383350","14871146","15366864","15870753","16394062","16941587","17516139","18124342","18771125","19450959","20162340","20909684","21691522","22507674","23364185","24259111","25188292","26147002","27128337","28127721","29154746","30208628","31273533","32353588","33428486","34503774","35588987","",

Note

Here, we use the exclamation mark ! to run a shell command called head which, exactly as for dataframes (or is it the other way around?), shows you the first lines of a file. You’re not expected to learn these kinds of commands in this course; this was simply to show that you can run them from Jupyter, and that it’s very practical.

Question

Having seen that, can you guess what the problem was? How is this file different from the ones we’ve seen previously?

countries_pop = pl.read_csv('data/countries_pop.csv', skip_rows=4)
countries_pop.head()
shape: (5, 69)

Country Name                   Country Code  Indicator Name       Indicator Code  1960       1961       …  2022       2023
str                            str           str                  str             i64        i64           i64        str
"Aruba"                        "ABW"         "Population, total"  "SP.POP.TOTL"   54608      55811      …  106445     ""
"Africa Eastern and Southern"  "AFE"         "Population, total"  "SP.POP.TOTL"   130692579  134169237  …  720859132  ""
"Afghanistan"                  "AFG"         "Population, total"  "SP.POP.TOTL"   8622466    8790140    …  41128771   ""
"Africa Western and Central"   "AFW"         "Population, total"  "SP.POP.TOTL"   97256290   99314028   …  490330870  ""
"Angola"                       "AGO"         "Population, total"  "SP.POP.TOTL"   5357195    5441333    …  35588987   ""

As you can see, there is one row per country, and the population counts are given in separate columns, one per year, for years ranging from 1960 to 2022. So in principle, when adding the population data to protests, we should add the population corresponding to each protest’s year. For simplicity, though, we’ll take populations from a single year here. But let’s at least choose it reasonably. We’ve already seen the distribution of start_date graphically, but let’s get some summary statistics.

As a small exercise, try to get the following table (hint: use the DataFrame.describe() method):

Solution:
protests.select(pl.col('start_date')).describe()
shape: (9, 2)

statistic     start_date
str           str
"count"       "15239"
"null_count"  "0"
"mean"        "2006-10-19 22:55:04.941000"
"std"         null
"min"         "1990-01-01"
"25%"         "1999-02-08"
"50%"         "2007-10-25"
"75%"         "2014-11-15"
"max"         "2020-03-31"

We see that the average sits toward the end of 2006, while the median sits toward the end of 2007. Let us then take 2007 as our year for populations. Go ahead and assign a new variable countries_single_pop with the dataframe containing each country’s population for that year:

Solution:
countries_single_pop = countries_pop.select(
    pl.col("Country Name").alias("country"),
    pl.col("2007").alias("pop_2007"),
)
countries_single_pop.head()
shape: (5, 2)

country                        pop_2007
str                            i64
"Aruba"                        96787
"Africa Eastern and Southern"  482406426
"Afghanistan"                  25903301
"Africa Western and Central"   327612838
"Angola"                       20909684

Performing the join#

Now, adding in these populations to protests is as simple as:

protests_with_pop = protests.join(countries_single_pop, on="country")
protests_with_pop.head()
shape: (5, 10)

id         country   protester_violence  participants  start_date  end_date    duration      participants_sum  participant_ratio  pop_2007
i64        str       bool                i64           date        date        duration[ms]  i64               f64                i64
201990001  "Canada"  false               1000          1990-01-15  1990-01-15  1d            617570            0.001619           32889025
201990002  "Canada"  false               1000          1990-06-25  1990-06-25  1d            617570            0.001619           32889025
201990003  "Canada"  false               500           1990-07-01  1990-07-01  1d            617570            0.00081            32889025
201990004  "Canada"  true                100           1990-07-12  1990-09-06  57d           617570            0.000162           32889025
201990005  "Canada"  true                950           1990-08-14  1990-08-15  2d            617570            0.001538           32889025

The on argument sets the column the join uses to match rows. Here, the column "country" exists in both DataFrames, so we can simply use it. However, if countries_single_pop still had the "Country Name" column, instead of on="country" we’d have had to specify left_on="country", right_on="Country Name", as sketched below.
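
For instance, the call would have looked like this (a sketch, assuming the right-hand dataframe had kept its original column name):

protests.join(
    countries_single_pop,
    left_on="country",        # column name in the left frame (protests)
    right_on="Country Name",  # column name in the right frame
)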

join has another very important argument: how. To get an idea of what it does, let us compare the shape of protests_with_pop with the original protests:

protests_with_pop.shape, protests.shape
((12743, 10), (15239, 9))

As you can see, we lost some rows! Why? Because by default, the argument how is set to "inner". This means that the dataframe output by join only keeps rows whose "country" value exists in both input dataframes. To keep all the rows of the original dataframe, you can set how="left" instead:

full_protests_with_pop = protests.join(countries_single_pop, on="country", how="left")
full_protests_with_pop.shape
(15239, 10)

And this time we kept all our rows.

Question

But then, what happened to rows in which we could not find a matching country in countries_single_pop?
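
One way to check, as a sketch: rows without a match get null in the joined columns, so we can filter on that:

# Countries present in protests but absent from countries_single_pop:
full_protests_with_pop.filter(pl.col("pop_2007").is_null())["country"].unique()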

Exercise - proportion of protesters#

Compute the proportion of the total population that protested in each protest, and show the countries in which the maximum proportion is highest.

Solution:
top_protesters = (
    protests_with_pop.with_columns(
        prop_of_pop=pl.col("participants") / pl.col("pop_2007")
    )
    .group_by("country")
    .agg(pl.col("prop_of_pop").max().alias("max_prop_of_pop"))
    .sort(by="max_prop_of_pop", descending=True, nulls_last=True)
)
top_protesters.head()
shape: (5, 2)

country    max_prop_of_pop
str        f64
"Lebanon"  0.207917
"Moldova"  0.139164
"Armenia"  0.133138
"Guinea"   0.104744
"Greece"   0.09051

Now, slightly harder: try to get the full row corresponding to the protest in which the maximum "prop_of_pop" was reached.

Hint

You’ll find the Expression.sort_by() method useful for that.

Solution:
top_protesters = (
    protests_with_pop.with_columns(
        prop_of_pop=pl.col("participants") / pl.col("pop_2007")
    )
    .group_by("country")
    .agg(pl.all().sort_by("prop_of_pop").last())
    .sort(by="prop_of_pop", descending=True, nulls_last=True)
)
top_protesters.head()
shape: (5, 11)

country    id          protester_violence  participants  start_date  end_date    duration      participants_sum  participant_ratio  pop_2007  prop_of_pop
str        i64         bool                i64           date        date        duration[ms]  i64               f64                i64       f64
"Lebanon"  6602006001  false               1000000       2006-02-15  2006-02-15  1d            4800000           0.208333           4809608   0.207917
"Moldova"  3591991003  false               400000        1991-08-21  1991-08-21  1d            1112256           0.359629           2874299   0.139164
"Armenia"  3711992001  false               400000        1992-11-02  1992-11-02  1d            1616170           0.247499           3004393   0.133138
"Guinea"   4382019010  false               1000000       2019-12-10  2019-12-10  1d            2647050           0.377779           9547082   0.104744
"Greece"   3502012001  true                1000000       2012-02-07  2012-02-07  1d            3302570           0.302794           11048473  0.09051

Exercise - doing things right#

Beforehand, we just took the population of a single year, while the protests happened in different years. In reality, we know each protest’s year from the "start_date" column. Reproduce the above to get protests_with_pop, this time taking the correct population into account.

Tip

  • To start off, you’ll need to call the DataFrame method .unpivot() on countries_pop (see the toy sketch below).

  • To select all the year columns, don’t write them out by hand! There are (at least) two ways to select all of them lazily.
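
If you have never used .unpivot(), here is a toy sketch of what it does (the data and names are made up for illustration):

wide = pl.DataFrame({"country": ["A", "B"], "2006": [1, 2], "2007": [3, 4]})
# One (country, year, pop) row per cell of the year columns:
wide.unpivot(index="country", variable_name="year", value_name="pop")
# shape: (4, 3), with the years "2006"/"2007" as strings in the "year" column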

Solution:
import polars.selectors as cs

countries_actual_pop = (
    countries_pop.select(pl.col("Country Name").alias("country"), cs.by_dtype(pl.Int64))
    .unpivot(index="country", variable_name="year", value_name="pop")
    .with_columns(pl.col("year").cast(pl.Int32))
)
protests_with_pop = protests.with_columns(
    pl.col("start_date").dt.year().alias("year")
).join(countries_actual_pop, on=["country", "year"])
protests_with_pop.head()
shape: (5, 11)

id          country    protester_violence  participants  start_date  end_date    duration      participants_sum  participant_ratio  year  pop
i64         str        bool                i64           date        date        duration[ms]  i64               f64                i32   i64
3391990001  "Albania"  false               null          1990-07-06  1990-07-06  1d            381150            null               1990  3286542
3391990002  "Albania"  true                null          1990-12-09  1990-12-09  1d            381150            null               1990  3286542
3391990003  "Albania"  false               9000          1990-12-12  1990-12-12  1d            381150            0.023613           1990  3286542
3391990004  "Albania"  true                50            1990-12-13  1990-12-13  1d            381150            0.000131           1990  3286542
3391990005  "Albania"  true                null          1990-12-15  1990-12-16  2d            381150            null               1990  3286542

Now, check that it worked by checking out Italy, for instance.

Solution:
yearly_pop_italy = (
    protests_with_pop.filter(pl.col('country') == 'Italy')
    .select('year', 'pop')
    .group_by('year')
    .agg(pl.all().first())
    .sort('year')
)
yearly_pop_italy
shape: (29, 2)

year  pop
i32   i64
1990  56719240
1991  56758521
1992  56797087
1993  56831821
1994  56843400
…     …
2014  60789140
2016  60627498
2017  60536709
2018  60421760
2019  59729081

Be lazy#

A great strength of Polars is its lazy mode. In this mode, instead of working with DataFrames, you work with LazyFrames. The big difference between the two is that, while a DataFrame actually holds data, a LazyFrame just holds a set of operations to execute.

Working with a LazyFrame#

The first way to obtain a LazyFrame is to call the .lazy() method on an existing DataFrame, like so:

type(protests.lazy())
polars.lazyframe.frame.LazyFrame

But to get the most out of lazy mode, you should work with LazyFrames before you even read any data. In practice, this means that instead of calling pl.read_<file-format>() to read from a data source, you call the equivalent pl.scan_<file-format>(). For instance:

lazy_protests = pl.scan_csv('data/protests.csv')
lazy_protests

NAIVE QUERY PLAN
(run LazyFrame.show_graph() to see the optimized version)

Csv SCAN [data/protests.csv]
π */30

As you can see, no data is shown, simply because no data was read! With this object, you can now call all the methods seen until now, as if you were dealing with a regular DataFrame. To actually get the result of a computation, you simply call the .collect() method at the very end.

Let us show by example the power of using LazyFrames. Let’s compute the median number of participants by year in Italy, in a particularly stupid way:

%%time
dumb_agg = (
    pl.read_csv("data/protests.csv")
    .with_columns(
        pl.col("participants").str.extract("([0-9]+)").cast(int),
    )
    .group_by("country", "year")
    .agg(pl.col("participants").median())
    .filter(pl.col("country") == "Italy")
)
dumb_agg.head()
CPU times: user 43.8 ms, sys: 13.4 ms, total: 57.2 ms
Wall time: 18.3 ms
shape: (5, 3)

country  year  participants
str      i64   f64
"Italy"  2019  26500.0
"Italy"  1995  50.0
"Italy"  2009  110.0
"Italy"  1991  null
"Italy"  2008  1000.0

Question

Why is this not very smart? How could we make this operation more efficient?
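
One possible answer, as a sketch: we read and aggregate the entire dataset even though we only care about Italy, so filtering first, and reading only the columns we need, would avoid most of the work:

smarter_agg = (
    pl.read_csv("data/protests.csv", columns=["country", "year", "participants"])
    .filter(pl.col("country") == "Italy")  # filter before grouping
    .with_columns(pl.col("participants").str.extract("([0-9]+)").cast(int))
    .group_by("country", "year")
    .agg(pl.col("participants").median())
)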

Let’s time the same thing, except that we do everything in lazy mode:

%%time
lazy_agg = (
    pl.scan_csv("data/protests.csv")
    .with_columns(
        pl.col("participants").str.extract("([0-9]+)").cast(int),
    )
    .group_by("country", "year")
    .agg(pl.col("participants").median())
    .filter(pl.col("country") == "Italy")
)
lazy_agg.collect().head()
CPU times: user 10.8 ms, sys: 3.61 ms, total: 14.4 ms
Wall time: 4.78 ms
shape: (5, 3)

country  year  participants
str      i64   f64
"Italy"  2019  26500.0
"Italy"  1995  50.0
"Italy"  2003  298.0
"Italy"  2012  2000.0
"Italy"  2001  525.0

That’s a big difference in timing, and for relatively simple operations performed on a relatively small dataset!

But why is there this difference? Because before executing anything, .collect() runs what is called query optimization: Polars automatically reorders and modifies the operations to make them as efficient as it can. To see this, you can call .show_graph() to find out what the optimization did:

lazy_agg.show_graph()
[Figure: optimized query plan graph]

This is to be read from the bottom up. As you can see, the filter has been pushed down to the very moment the data is read! Which makes a lot of sense, doesn’t it?
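
If you prefer text to a graph, you should also be able to print the optimized plan with .explain():

print(lazy_agg.explain())  # the optimized query plan, as a string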

To sum up, there are two main advantages of using Lazyframes:

  • You get automatic query optimization. So even if the code you wrote is far from optimal, it will run as fast as it can. In other words, being lazy allows you to be dumb and not suffer the consequences.

  • As filters can be pushed down to the very moment data is read, you don’t need to load all the data in memory. This means that you can analyse data files which are too big to fit in your computer’s memory (see the sketch below)!
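
As a minimal sketch of that last point (big_file.csv is hypothetical, and the exact way to enable the streaming engine may depend on your Polars version):

result = (
    pl.scan_csv("big_file.csv")  # hypothetical file larger than memory
    .filter(pl.col("some_column") == "some_value")
    .collect(streaming=True)  # process the file in chunks instead of all at once
)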

Exercise - Healthcare facilities#

✪✪ Let’s examine the dataset SANSTRUT001.csv, which contains the healthcare facilities of the Trentino region and, for each one, the type of assistance it offers (clinical activity, diagnostics, etc.) and the code and name of the municipality where it is located.

Write a function strutsan which takes as input a town code and a text string, opens the file lazily, and:

  1. Prints the number of rows found

  2. Returns a dataframe with only the rows having that town code and containing the string in the column ASSISTENZA. The returned dataframe must have only the columns STRUTTURA, ASSISTENZA, COD_COMUNE, COMUNE.

Solution:
def strutsan(cod_comune, assistenza):
    # Build the whole query lazily, then collect once at the end.
    res = (
        pl.scan_csv("data/SANSTRUT001.csv")
        .filter(
            (pl.col("COD_COMUNE") == cod_comune)
            & pl.col("ASSISTENZA").str.contains(assistenza)
        )
        .select("STRUTTURA", "ASSISTENZA", "COD_COMUNE", "COMUNE")
        .collect()
    )
    print("Found", res.shape[0], "facilities")
    return res
strutsan(22050, '')  # no ASSISTENZA filter
Found 6 facilities
shape: (6, 4)

STRUTTURA                         ASSISTENZA                        COD_COMUNE  COMUNE
str                               str                               i64         str
"PRESIDIO OSPEDALIERO DI CAVALE…  "ATTIVITA` CLINICA"               22050       "CAVALESE"
"PRESIDIO OSPEDALIERO DI CAVALE…  "DIAGNOSTICA STRUMENTALE E PER …  22050       "CAVALESE"
"PRESIDIO OSPEDALIERO DI CAVALE…  "ATTIVITA` DI LABORATORIO"        22050       "CAVALESE"
"CENTRO SALUTE MENTALE CAVALESE"  "ASSISTENZA PSICHIATRICA"         22050       "CAVALESE"
"CENTRO DIALISI CAVALESE"         "ATTIVITA` CLINICA"               22050       "CAVALESE"
"CONSULTORIO CAVALESE"            "ATTIVITA` DI CONSULTORIO MATER…  22050       "CAVALESE"
strutsan(22205, 'CLINICA')
Found 16 facilities
shape: (16, 4)

STRUTTURA                         ASSISTENZA           COD_COMUNE  COMUNE
str                               str                  i64         str
"PRESIDIO OSPEDALIERO S.CHIARA"   "ATTIVITA` CLINICA"  22205       "TRENTO"
"CENTRO DIALISI TRENTO"           "ATTIVITA` CLINICA"  22205       "TRENTO"
"POLIAMBULATORI S.CHIARA"         "ATTIVITA` CLINICA"  22205       "TRENTO"
"PRESIDIO OSPEDALIERO VILLA IGE…  "ATTIVITA` CLINICA"  22205       "TRENTO"
"OSPEDALE CLASSIFICATO S.CAMIL…   "ATTIVITA` CLINICA"  22205       "TRENTO"
…                                 …                    …           …
"COOPERATIVA SOCIALE IRIFOR DEL…  "ATTIVITA` CLINICA"  22205       "TRENTO"
"AGSAT ASSOCIAZIONE GENITORI SO…  "ATTIVITA` CLINICA"  22205       "TRENTO"
"AZIENDA PUBBLICA SERVIZI ALLA …  "ATTIVITA` CLINICA"  22205       "TRENTO"
"CST TRENTO"                      "ATTIVITA` CLINICA"  22205       "TRENTO"
"A.P.S.P. 'BEATO DE TSCHIDERER'…  "ATTIVITA` CLINICA"  22205       "TRENTO"
strutsan(22205, 'LABORATORIO')
Found 5 facilities
shape: (5, 4)

STRUTTURA                        ASSISTENZA                  COD_COMUNE  COMUNE
str                              str                         i64         str
"PRESIDIO OSPEDALIERO S.CHIARA"  "ATTIVITA` DI LABORATORIO"  22205       "TRENTO"
"LABORATORI ADIGE SRL"           "ATTIVITA` DI LABORATORIO"  22205       "TRENTO"
"LABORATORIO DRUSO SRL"          "ATTIVITA` DI LABORATORIO"  22205       "TRENTO"
"CASA DI CURA VILLA BIANCA SPA"  "ATTIVITA` DI LABORATORIO"  22205       "TRENTO"
"CENTRO SERVIZI SANITARI"        "ATTIVITA` DI LABORATORIO"  22205       "TRENTO"

Some more exercises#

Exercise - Air pollutants#

Let’s try to analyse the hourly data from the air quality monitoring stations of the Autonomous Province of Trento.

Source: dati.trentino.it

Load the file#

✪ Load the file aria.csv in Polars

IMPORTANT 1: put the dataframe into the variable aria, so as not to confuse it with the previous datasets.

IMPORTANT 2: use encoding 'latin-1' (otherwise you might get weird loading errors depending on your operating system)

Solution:
aria = pl.read_csv(
    "data/aria.csv", encoding="latin-1", schema_overrides={"Valore": pl.Float64}
)
aria.head()
shape: (5, 6)

Stazione           Inquinante  Data          Ora  Valore  Unità di misura
str                str         str           i64  f64     str
"Parco S. Chiara"  "PM10"      "2019-05-04"  1    17.0    "µg/mc"
"Parco S. Chiara"  "PM10"      "2019-05-04"  2    19.0    "µg/mc"
"Parco S. Chiara"  "PM10"      "2019-05-04"  3    17.0    "µg/mc"
"Parco S. Chiara"  "PM10"      "2019-05-04"  4    15.0    "µg/mc"
"Parco S. Chiara"  "PM10"      "2019-05-04"  5    13.0    "µg/mc"

Tip

If you get an error, read it carefully: what is the issue here? Which column is causing it? Then, use pl.read_csv? or head over to the documentation and read the Notes; that will get you closer to the solution.

Pollutants average#

✪ Find the average of the PM10 pollutant at Parco S. Chiara (averaged over all days).

Solution:
aria.filter(
    (pl.col("Stazione") == "Parco S. Chiara") & (pl.col("Inquinante") == "PM10")
).select(pl.col("Valore").mean())
shape: (1, 1)

Valore
f64
11.385753

PM10 chart#

✪ Use Axes.plot to show in a chart the values of PM10 in Parco S. Chiara on May 7th, 2019.

Solution:
import matplotlib.pyplot as plt

filtered = aria.filter(
    (pl.col("Stazione") == "Parco S. Chiara")
    & (pl.col("Inquinante") == "PM10")
    & (pl.col("Data") == "2019-05-07")
)

fig, ax = plt.subplots()
ax.plot("Ora", "Valore", data=filtered)
ax.set_title("SOLUTION PM10 May 7th, 2019")
ax.set_xlabel("Hour");
[Figure: PM10 values at Parco S. Chiara on May 7th, 2019]

Exercise - meteo pressure intervals#

✪✪✪ The dataset meteo.csv contains the weather data of Trento for November 2017 (source: www.meteotrentino.it). We would like to subdivide the pressure readings into three intervals, A (low), B (medium), and C (high), and count how many readings fall into each interval.

IMPORTANT: assign the dataframe to a variable called meteo, so as to avoid confusion with the other dataframes

Where are the intervals?#

First, let’s find the pressure values bounding these 3 intervals and plot them as segments, so as to end up with a chart like the one in the solution below.

Before doing the plot, we will need to know at which height we should plot the segments.

Load the dataset with Polars, calculate the following variables, and PRINT them:

  • round the values with the round function

  • the excursion is the difference between the maximum and the minimum

  • note that intervalC coincides with the maximum

DO NOT use min and max as variable names (they would shadow the built-in functions!)

Solution:
meteo = pl.read_csv("data/meteo.csv")
minimum = meteo.select("Pressure").min().item()
maximum = meteo.select("Pressure").max().item()
excursion = maximum - minimum
intervalA, intervalB, intervalC = (
    minimum + excursion / 3.0,
    minimum + excursion * 2.0 / 3.0,
    minimum + excursion,
)
intervalA, intervalB, intervalC = (
    round(intervalA, 2),
    round(intervalB, 2),
    round(intervalC, 2),
)

print("minimum:", minimum)
print("maximum:", maximum)
print("excursion:", excursion)
print("intervalA:", intervalA)
print("intervalB:", intervalB)
print("intervalC:", intervalC)
minimum: 966.3
maximum: 998.3
excursion: 32.0
intervalA: 976.97
intervalB: 987.63
intervalC: 998.3

Segments plot#

Try now to plot the chart of the pressure and the 4 horizontal segments.

  • to overlay the segments with different colors, just make repeated calls to ax.plot

  • a segment is defined by two points, so just find the coordinates of those two points

REMEMBER the title and labels

Solution:
fig, ax = plt.subplots()
ax.plot("Pressure", data=meteo)
ax.plot([0, meteo.shape[0]], [minimum, minimum], color="yellow")
ax.plot([0, meteo.shape[0]], [intervalA, intervalA], color="orange")
ax.plot([0, meteo.shape[0]], [intervalB, intervalB], color="red")
ax.plot([0, meteo.shape[0]], [intervalC, intervalC], color="purple")
ax.set_title("Meteo Pressure")
ax.set_xlabel("reading number")
ax.set_ylabel("pressure");
[Figure: pressure readings with the four horizontal interval segments]

Assigning the intervals#

We literally made a picture of where the intervals are located; let’s now ask ourselves how many readings were made in each interval.

First, try creating a column which assigns to each reading the interval it belongs to.

Tip

You will need the function pl.when(), start by inspecting it!
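
If you want a feel for the pattern first, here is a generic sketch on toy data (not the solution):

df = pl.DataFrame({"x": [1, 5, 9]})
df.with_columns(
    # Chain .when()/.then() for extra branches; .otherwise() is the fallback.
    size=pl.when(pl.col("x") < 3)
    .then(pl.lit("small"))
    .when(pl.col("x") < 7)
    .then(pl.lit("medium"))
    .otherwise(pl.lit("large"))
)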

Solution:
meteo = meteo.with_columns(
    PressureInterval=pl.when(pl.col("Pressure") < intervalA)
    .then(pl.lit("A (low)"))
    .when(pl.col("Pressure") < intervalB)
    .then(pl.lit("B (medium)"))
    .otherwise(pl.lit("C (high)"))
)
meteo.head()
shape: (5, 5)

Date                Pressure  Rain  Temp  PressureInterval
str                 f64       f64   f64   str
"01/11/2017 00:00"  995.4     0.0   5.4   "C (high)"
"01/11/2017 00:15"  995.5     0.0   6.0   "C (high)"
"01/11/2017 00:30"  995.5     0.0   5.9   "C (high)"
"01/11/2017 00:45"  995.7     0.0   5.4   "C (high)"
"01/11/2017 01:00"  995.7     0.0   5.3   "C (high)"

Grouping by intervals#

a. First, create a grouping to count occurrences:

Solution:
press_groups = meteo.group_by("PressureInterval").agg(pl.col("Pressure").count())
press_groups
shape: (3, 2)

PressureInterval  Pressure
str               u32
"C (high)"        1380
"B (medium)"      1243
"A (low)"         255

b. Now plot it

  • REMEMBER title and axis labels

Solution:
fig, ax = plt.subplots(figsize=(5, 3))
ax.bar("PressureInterval", "Pressure", data=press_groups, color="darkcyan")
ax.set_title("Pressure intervals frequency")
ax.set_xlabel("Intervals")
ax.set_ylabel("Counts")
Text(0, 0.5, 'Counts')
[Figure: bar chart of reading counts per pressure interval]

Exercise - meteo average temperature#

✪✪✪ Calculate the average temperature for each day, and show it in the plot.

HINT: add a 'Day' column by extracting only the day (the date without the time) from the Date column. To do it, use the function .str.slice.

Solution:
meteo = (
    pl.read_csv("data/meteo.csv")
    .with_columns(pl.col("Date").str.slice(0, 10).alias("Day"))
    .with_columns(pl.col("Temp").mean().over("Day").alias("avg_day_temp"))
)
meteo.head()
shape: (5, 6)

Date                Pressure  Rain  Temp  Day           avg_day_temp
str                 f64       f64   f64   str           f64
"01/11/2017 00:00"  995.4     0.0   5.4   "01/11/2017"  7.983333
"01/11/2017 00:15"  995.5     0.0   6.0   "01/11/2017"  7.983333
"01/11/2017 00:30"  995.5     0.0   5.9   "01/11/2017"  7.983333
"01/11/2017 00:45"  995.7     0.0   5.4   "01/11/2017"  7.983333
"01/11/2017 01:00"  995.7     0.0   5.3   "01/11/2017"  7.983333
Solution:
fig, ax = plt.subplots()
ax.plot("Temp", data=meteo, label="Temperature")
ax.plot("avg_day_temp", data=meteo, label="Average Temperature")
ax.legend();
[Figure: temperature readings and daily average temperature]