JSON files


JSON is a more elaborate format, widely used in the world of web applications.

As with CSVs, Python has a dedicated built-in library for reading JSON files. Let’s start by importing it:

import json
from pprint import pprint

A JSON file is simply a text file, structured as a tree.

Let’s see an example, extracted from the dataset “Bike sharing stations of Lavis municipality”, as found on dati.trentino:

Data source: dati.trentino.it - Transport Service of the Autonomous Province of Trento
License: CC-BY 4.0

with open('data/bike-sharing-lavis-single.json') as f:
    json_content = f.read()
    print(type(json_content))
    print(json_content)

<class 'str'>
{"name": ["Grazioli", "Pressano", "Stazione RFI"], "address": ["Piazza Grazioli - Lavis", "Piazza della Croce - Pressano", "Via Stazione - Lavis"], "id": ["Grazioli - Lavis", "Pressano - Lavis", "Stazione RFI - Lavis"], "bikes": [3, 2, 4], "slots": [7, 5, 6], "totalSlots": [10, 7, 10], "position": [[46.139732902099794, 11.111516155225331], [46.15368174037716, 11.106601229430453], [46.148180371138814, 11.096753997622727]]}

As you can see, the JSON format is very similar to data structures we already have in Python, such as strings, integers, floats, lists and dictionaries. So the conversion to Python is almost always easy and painless, with the help of the json library.
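To make this correspondence concrete, here is a minimal sketch that parses a small JSON string with json.loads and prints the Python type of each value. The text is a shortened variant of the station data above; the "open" and "note" fields are invented here only to show how JSON booleans and null are converted.

```python
import json

# A small JSON text; "open" and "note" are made-up fields for illustration
json_text = '{"name": "Grazioli", "bikes": 3, "position": [46.1397, 11.1115], "open": true, "note": null}'

station = json.loads(json_text)

# JSON object -> dict, string -> str, number -> int or float,
# array -> list, true/false -> bool, null -> None
for key, value in station.items():
    print(key, type(value).__name__)
```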

This format is commonly used to save raw data collected from online sources. For instance, the rawest form of a post from Twitter, Wikipedia, Reddit, Telegram, or other sources is JSON. Such objects may change over time (let’s say some billionaire buys one of these platforms and decides to add the number of views to each post), and this format lets a service send out data representing some type of object without worrying about fields that may be added or removed later. That is the power of this format: its flexibility.
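As a rough illustration of that flexibility, the sketch below parses two hypothetical posts, only one of which carries a "views" field. The field names and values are assumptions made up for this example, not the schema of any real platform. Reading the optional field with dict.get lets the same code handle both versions.

```python
import json

# Two hypothetical posts: the second has a "views" field the first one lacks
old_post = json.loads('{"id": 1, "text": "hello"}')
new_post = json.loads('{"id": 2, "text": "hi again", "views": 120}')

for post in (old_post, new_post):
    # dict.get returns None when the key is absent, instead of raising KeyError
    views = post.get("views")
    print(post["id"], views)
```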

JSON as a single dictionary

The simplest way to read a JSON file is to call the function json.load, which interprets the JSON text file and converts it to a Python data structure:

with open('data/bike-sharing-lavis-single.json') as f:
    python_content = json.load(f)
    
print(type(python_content))
pprint(python_content)

<class 'dict'>
{'address': ['Piazza Grazioli - Lavis',
             'Piazza della Croce - Pressano',
             'Via Stazione - Lavis'],
 'bikes': [3, 2, 4],
 'id': ['Grazioli - Lavis', 'Pressano - Lavis', 'Stazione RFI - Lavis'],
 'name': ['Grazioli', 'Pressano', 'Stazione RFI'],
 'position': [[46.139732902099794, 11.111516155225331],
              [46.15368174037716, 11.106601229430453],
              [46.148180371138814, 11.096753997622727]],
 'slots': [7, 5, 6],
 'totalSlots': [10, 7, 10]}

Notice that what we’ve just read with the function json.load is not simple text anymore, but Python objects. For this JSON, the outermost object is a dictionary (note the curly brackets at the beginning and end of the file), as confirmed by the output of type(python_content) above.
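In this layout each key maps to a list, and element i of every list describes the same station. A minimal sketch of how one might walk through such a structure follows; the dictionary is typed in by hand (three of the keys shown above) so that the example runs without the data file.

```python
# Same structure as the file above, shortened to three of its keys
python_content = {
    "name": ["Grazioli", "Pressano", "Stazione RFI"],
    "bikes": [3, 2, 4],
    "slots": [7, 5, 6],
}

# zip pairs up the i-th element of each list, i.e. one station at a time
for name, bikes, slots in zip(python_content["name"],
                              python_content["bikes"],
                              python_content["slots"]):
    print(f"{name}: {bikes} bikes, {slots} slots")

# Aggregating a whole column is a one-liner in this layout
total_bikes = sum(python_content["bikes"])
print(total_bikes)
```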

JSON as list of dictionaries

A JSON file can also take the form of a list of dictionaries, like:

with open('data/bike-sharing-lavis.json') as f:
    python_content = json.load(f)

print(type(python_content))
pprint(python_content)

<class 'list'>
[{'address': 'Piazza Grazioli - Lavis',
  'bikes': 3,
  'id': 'Grazioli - Lavis',
  'name': 'Grazioli',
  'position': [46.139732902099794, 11.111516155225331],
  'slots': 7,
  'totalSlots': 10},
 {'address': 'Piazza della Croce - Pressano',
  'bikes': 2,
  'id': 'Pressano - Lavis',
  'name': 'Pressano',
  'position': [46.15368174037716, 11.106601229430453],
  'slots': 5,
  'totalSlots': 7},
 {'address': 'Via Stazione - Lavis',
  'bikes': 4,
  'id': 'Stazione RFI - Lavis',
  'name': 'Stazione RFI',
  'position': [46.148180371138814, 11.096753997622727],
  'slots': 6,
  'totalSlots': 10}]

By looking at the JSON closely, you will see it is a list of dictionaries. Thus, to access the first dictionary (that is, the one at index 0), we can write:

python_content[0]

{'name': 'Grazioli',
 'address': 'Piazza Grazioli - Lavis',
 'id': 'Grazioli - Lavis',
 'bikes': 3,
 'slots': 7,
 'totalSlots': 10,
 'position': [46.139732902099794, 11.111516155225331]}

We see it’s the station in Piazza Grazioli. To get the exact address, we will access the 'address' key in the first dictionary:

python_content[0]['address']

'Piazza Grazioli - Lavis'

To access the position, we will use the corresponding key:

python_content[0]['position']

[46.139732902099794, 11.111516155225331]

Note how the position is itself a list. In JSON we can have arbitrarily branched trees, without necessarily a regular structure (although when we’re generating a JSON file, it certainly helps to maintain a regular data schema).
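The accesses above can be combined in a loop over all the stations. The sketch below uses a hand-typed copy of part of the data so that it runs on its own. Reading the two numbers in position as latitude and longitude is an assumption, since the file itself does not label them.

```python
stations = [
    {"name": "Grazioli", "bikes": 3,
     "position": [46.139732902099794, 11.111516155225331]},
    {"name": "Pressano", "bikes": 2,
     "position": [46.15368174037716, 11.106601229430453]},
    {"name": "Stazione RFI", "bikes": 4,
     "position": [46.148180371138814, 11.096753997622727]},
]

for station in stations:
    # Assumption: the nested list holds latitude first, then longitude
    lat, lon = station["position"]
    print(station["name"], lat, lon)

# Name of the station that currently has the most bikes
busiest = max(stations, key=lambda s: s["bikes"])["name"]
print(busiest)
```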

Newline-delimited JSON

There is a particular JSON file type called JSONL (note the L for “lines” at the end), or NDJSON (ND for “Newline-Delimited”): a text file containing a sequence of lines, each of which is a valid JSON object.

Let’s have a look at the file employees.jsonl:

{"name": "Mario", "surname":"Rossi"}
{"name": "Paolo", "surname":"Bianchi"}
{"name": "Luca", "surname":"Verdi"}

To read it, we can open the file, iterate over the text lines and then interpret each of them as a single JSON object:

with open('data/employees.jsonl') as f:
    for i, line in enumerate(f):            # enumerate already provides the counter i
        python_content = json.loads(line)   # converts json text to a python object
        print('Object ', i)
        print(python_content)

Object  0
{'name': 'Mario', 'surname': 'Rossi'}
Object  1
{'name': 'Paolo', 'surname': 'Bianchi'}
Object  2
{'name': 'Luca', 'surname': 'Verdi'}

Question

Here, we could also have first read all the lines with f.readlines(), and then iterated over them. Could you guess which option is better? Using f.readlines() or iterating over f? Why?

Answer

Iterating over f is the recommended way to read such files, to limit your memory consumption. This way, no matter the file size, you’ll be able to read it. It’s actually kind of the whole point of putting data in this format!
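One possible way to package this line-by-line reading is a small generator, sketched below. The name iter_jsonl is made up for this example, and io.StringIO stands in for an open file so that the code runs without data/employees.jsonl on disk. Only one line is parsed at a time, so memory use does not grow with the file size unless the caller chooses to collect the results.

```python
import io
import json

def iter_jsonl(lines):
    """Yield one Python object per non-empty line of JSON text."""
    for line in lines:
        line = line.strip()
        if line:                      # skip blank lines, e.g. a trailing one
            yield json.loads(line)

# io.StringIO behaves like an open text file; the content mirrors employees.jsonl
fake_file = io.StringIO(
    '{"name": "Mario", "surname":"Rossi"}\n'
    '{"name": "Paolo", "surname":"Bianchi"}\n'
    '{"name": "Luca", "surname":"Verdi"}\n'
)

surnames = [employee["surname"] for employee in iter_jsonl(fake_file)]
print(surnames)
```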

In a newline-delimited file, each line can be read completely independently of the others. Conversely, a new line can be added to the file with the same level of independence. For instance, let’s say we wanted to add a new employee to the previous file, but this time we have information about their age. We can simply add them as follows:

# First, copy the original file so that it is left intact:
import shutil
shutil.copyfile('data/employees.jsonl', 'data/employees_extended.jsonl')

with open('data/employees_extended.jsonl', 'a') as f:
    json_text = json.dumps({'name': 'Bob', 'surname': 'Ross', 'age': 42})
    f.write(f"{json_text}\n")

Here we use the mode "a" to signify that we want to append content to the file. This way, there is no need to read the whole file again in order to add content to it! We used json.dumps to transform our Python object into a valid JSON string (hence the name: you’re dumping it into a string). We then write our JSON string into the file, adding the newline character \n at the end so that any future record starts on its own line.
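The append step can be wrapped in a small helper and checked with a round trip, as in the sketch below. The helper name append_record and the temporary file are assumptions made for this example, with the temporary file standing in for data/employees_extended.jsonl. Since the records do not all share the same fields, the optional "age" is read back with dict.get.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Append one record to a JSONL file as a single line (hypothetical helper)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# A file in a fresh temporary directory, so no real data is touched
path = os.path.join(tempfile.mkdtemp(), "employees_demo.jsonl")
append_record(path, {"name": "Mario", "surname": "Rossi"})
append_record(path, {"name": "Bob", "surname": "Ross", "age": 42})

# Read everything back, one line at a time
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

# "age" is missing from the first record, so get() returns None for it
ages = [r.get("age") for r in records]
print(ages)
```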

Question

Which format do you think is better between a newline-delimited JSON and a list-of-dictionaries JSON? Why?

Answer

The newline-delimited version: it makes it very easy to read or append a single line, independently of the rest of the file.
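If the data already exists as a list-of-dictionaries JSON, converting it to the newline-delimited form takes only a few lines. The sketch below works on in-memory strings (a hand-typed two-employee list) rather than on files. Note that the list version has to be parsed in one go, while the resulting lines can be parsed one by one.

```python
import json

# A list-of-dictionaries JSON text, typed by hand for this example
list_json = '[{"name": "Mario", "surname": "Rossi"}, {"name": "Paolo", "surname": "Bianchi"}]'

records = json.loads(list_json)      # the whole list is parsed at once

# With default settings json.dumps emits no raw newlines,
# so each record ends up on exactly one line
jsonl_text = "".join(json.dumps(r) + "\n" for r in records)
print(jsonl_text, end="")

# Each line can now be parsed independently of the others
round_trip = [json.loads(line) for line in jsonl_text.splitlines()]
```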