Generic text files#
Introduction#
In these tutorials we will see how to load and write text files containing data in different formats:
line-delimited data
tabular data such as CSV
tree-like data such as JSON files.
Line files#
Line files are typically text files which contain information grouped by lines. An example using historical characters might be like the following:
Leonardo
da Vinci
Sandro
Botticelli
Niccolò
Macchiavelli
We can immediately see a regularity: the first two lines contain the data of Leonardo da Vinci, first the name and then the surname. The next two lines hold the data of Sandro Botticelli, again first the name and then the surname, and so on.
We might want to write a program that reads the lines and prints names and surnames on the terminal, like this:
Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli
To get a first approximation of the final result, we can open the file, read only the first line and print it:
with open("data/people-simple.txt", encoding="utf-8") as file:
    line = file.readline()
    print(line)
Leonardo
What happened? Let’s examine the first lines of the code:
The open function#
The command
open("data/people-simple.txt", encoding="utf-8")
allows us to open the text file by telling Python the file path "data/people-simple.txt" and the encoding in which it was written (encoding="utf-8").
The encoding#
Files on a computer are just sequences of zeros and ones. Text files are no different: what allows a program to turn those zeros and ones into text is the file encoding. Python, like any other program, needs to know which encoding a text file was written with in order to interpret it correctly.
Most of the time, though, you don’t need to worry much about it.
The most common encoding is utf-8, so it is a sensible first choice.
If you don’t specify any encoding, Python falls back to a default which, depending on the Python version and the platform, is not always utf-8 (on Windows in particular it is often something else). Passing encoding="utf-8" explicitly, as we did above, avoids surprises.
If utf-8 doesn’t work, then you can try others.
For example, for files written in southern Europe on Windows you might try encoding="latin-1".
If you open a file written elsewhere, you might need other encodings.
For more in-depth information, you can read Dive into Python - Chapter 4 - Strings, and Dive into Python - Chapter 11 - File.
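As a minimal sketch of why the encoding matters, the snippet below turns a name into bytes with utf-8 and then decodes the same bytes in two different ways. The variable names are chosen here purely for illustration.

```python
# Encode a string to bytes, then decode those bytes with two encodings.
text = "Niccolò"
raw = text.encode("utf-8")            # the bytes as they would be stored on disk

decoded_ok = raw.decode("utf-8")      # same encoding used for reading: text is intact
decoded_bad = raw.decode("latin-1")   # wrong encoding: no error, but garbled text

print(raw)
print(decoded_ok)
print(decoded_bad)
```

Note that decoding with the wrong encoding does not necessarily raise an error: you may silently get strange characters instead, which is a good sign that you should try another encoding.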
The with block#
The with statement defines a block with instructions inside:
with open("data/people-simple.txt", encoding="utf-8") as file:
    line = file.readline()
    print(line)
with is used to create a context in which to execute the indented block of code that follows it.
When we go back to the normal indentation level, we leave this context block.
The context here is simply that a file is open, and that you can do stuff with it through the variable called file.
Question
When we are out of this context, what happens to the file and to the variable file?
Answer
When you’re out of the indented block, this means that the context “the file is open” ends. Therefore the file is… closed.
Question
Let’s now see this in practice.
Before running the following cell, try to guess what happens when we try reading another line with file.
file.readline()
Properly closing a file avoids wasting memory resources and creating hard-to-find, seemingly paranormal errors. If you want to avoid hunting for never-closed zombie files, always remember to open files in with blocks! Furthermore, in the part as file: at the end of the line, we assigned the file to a variable called file, but we could have used any other name we liked.
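The behaviour described above can be checked with a small sketch. It assumes nothing about the data folder: it first creates a throwaway file in a temporary directory (the file name example.txt is made up for this illustration) and then looks at the closed attribute of the file object.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on data/ existing.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as file:
    file.write("Leonardo\nda Vinci\n")

with open(path, encoding="utf-8") as file:
    first = file.readline()
    print(file.closed)   # we are still inside the block

print(file.closed)       # we have left the block

try:
    file.readline()      # reading from a closed file is an error
except ValueError as error:
    message = str(error)
    print(message)
```

The variable file still exists after the block, but the file it refers to is closed, so any attempt to read from it raises a ValueError.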
Warning
To indent the code, always use sequences of four white spaces. Sequences of only 2 spaces, even if allowed, are not recommended.
Warning
Depending on the editor you use, pressing <Tab> might produce a sequence of white spaces, as happens in Jupyter (4 spaces, which is the recommended length), or a special tabulation character (to be avoided)! As annoying as this distinction might appear, remember it, because it can generate very hard-to-find errors.
Warning
In the statements that create blocks, such as with, always remember to put the colon character : at the end of the line!
Reading in the file#
The command
line = file.readline()
puts the entire first line into the variable line, as a string. Warning: the string will contain the special newline character at the end!
You might wonder where that .readline() comes from. Like everything in Python, our variable file, which represents the file we just opened, is an object, and like any object, depending on its type, it has particular methods we can use on it. In this case the method is .readline().
The following command prints the string content:
print(line)
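To actually see the newline character that readline() leaves at the end of the string, one option is to print the repr() of the line. The sketch below uses io.StringIO, an in-memory object that behaves like an opened text file, only as a stand-in so that the example runs without any file on disk.

```python
import io

# io.StringIO behaves like an opened text file, but lives in memory.
file = io.StringIO("Leonardo\nda Vinci\n")

line = file.readline()
print(repr(line))    # repr() makes the trailing newline visible as \n
print(type(line))    # the line is an ordinary string
```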
✪ 1.1 EXERCISE: Try to rewrite here the block we’ve just seen, but this time printing the first two lines. Type the code by hand, do not copy-paste! Pay attention to correct indentation with spaces in the block.
✪ 1.2 EXERCISE: you might be wondering what exactly that file is, and what exactly the method readline does. When you find yourself in these situations, you can help yourself with the functions type and help. This time, copy and paste the same code here, but insert these commands inside the with block:
print(type(file))
help(file)
help(file.readline)
# Attention: remember the file. before the readline !!
Every time you add something, try to execute with Control+Enter and see what happens.
Note
You can obtain an output similar to help(object) with the shortcut object?, try it! This works for any function or class, so it’s extremely useful! Use it whenever you forget the arguments of a function. This shortcut is an IPython feature, like the so-called magic commands, which are recognisable by their leading %. You may use them in an IPython session or Jupyter cell, but not in the basic Python interpreter (the one you get when you call python)!
First we put the content of the first line into the variable line; now we might put it in a variable with a more meaningful name, like name. Also, we can directly read the next line into the variable surname and then print the concatenation of both:
with open("data/people-simple.txt") as file:
    name = file.readline()
    surname = file.readline()
    print(f"{name} {surname}")
Leonardo
da Vinci
PROBLEM! The output contains an unwanted line break. Why is that? As we said earlier, readline returns the line content as a string that still ends with the special newline character. To remove it, you can use the method rstrip():
with open("data/people-simple.txt") as file:
    name = file.readline().rstrip()
    surname = file.readline().rstrip()
    print(f"{name} {surname}")
Leonardo da Vinci
✪ 1.3 EXERCISE: Again, rewrite the block above in the cell below, and execute the cell with Control+Enter.
Question: what happens if you use strip() instead of rstrip()? What about lstrip()? Can you deduce the meaning of r and l? If you can’t manage it, try the Python command help by calling help(str.rstrip)
with open("data/people-simple.txt") as file:
    name = file.readline().rstrip()
    surname = file.readline().rstrip()
    print(f"{name} {surname}")
Leonardo da Vinci
Very good, we have the first line of the output! Now we can read all the lines in sequence. To this end, we can use a while loop:
with open("data/people-simple.txt") as file:
    line = file.readline()
    while line != "":
        name = line.rstrip()
        surname = file.readline().rstrip()
        print(f"{name} {surname}")
        line = file.readline()
Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli
What did we do? First, we added a while loop in a new block.
Warning
In the new block, since it is already inside the outer with, the instructions are indented by 8 spaces and not 4! If you use the wrong number of spaces, bad things happen!
We first read a line, and two cases are possible:
a. we are at the end of the file (or the file is empty): in this case the readline() call returns an empty string
b. we are not at the end of the file: the first line is put as a string inside the variable line. Since Python internally uses a pointer to keep track of the position we have reached in the file, after the read this pointer is moved to the beginning of the next line. This way the next call to readline() will read a line from the new position.
In the while block we tell Python to continue looping as long as line is not empty. While this is the case, inside the while block we take the name from the line and put it in the variable name (removing the trailing newline character with rstrip() as we did before), then we read the next line and store the result in the surname variable. Finally, we read one more line into the line variable, so it is ready for the next round of name extraction. If line is empty the loop terminates:
while line != "":                           # enter the loop if line contains characters
    name = line.rstrip()                    # parses the name
    surname = file.readline().rstrip()      # reads next line and parses surname
    print(f"{name} {surname}")
    line = file.readline()                  # read next line
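The loop condition relies on one detail: readline() returns an empty string only at the end of the file, while a blank line inside the file is returned as a string containing just the newline character. A minimal sketch of this, again using io.StringIO as an in-memory stand-in for a real file:

```python
import io

# In-memory stand-in for a file with a blank line in the middle.
file = io.StringIO("Leonardo\n\nda Vinci\n")

lines = []
line = file.readline()
while line != "":             # "" is returned only at the end of the file
    lines.append(line)
    line = file.readline()

print(lines)                  # the blank line shows up as "\n", not as ""
print(repr(file.readline()))  # past the end, readline() keeps returning ""
```

This is why the loop does not stop early on a blank line: "\n" is different from "".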
Note
In Python there are shorter ways to read a text file line by line; we used this approach to make every step explicit.
✪✪ 1.4 EXERCISE: We presented the approach above to make things explicit. However, there are two more idiomatic ways to iterate over lines, both involving a for loop. Try to rewrite the code above, this time using a for loop.
Hint
Start by inspecting all methods of file using dir(file), or the IPython shortcut ?file.*. Can you find the two methods that could help you? Also, experiment with random ideas of how to use the for loop, one of them might just work!
Leonardo da Vinci
da Vinci Sandro
Sandro Botticelli
Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli
Writing in the file#
If you want instead to write into a file, you must say so when you open it, passing "w" as the second argument, like so:
with open("data/people-simple-new.txt", "w") as file:
    file.write("")
We passed the content of this new file to file.write(), which is here an empty string.
Now let’s read this file to check everything went correctly:
with open("data/people-simple-new.txt", "r") as file:
    print(file.read())
That’s an empty string indeed!
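When writing something less trivial than an empty string, keep in mind that write() does not add any newline character by itself. The following sketch writes to a scratch file in a temporary directory (the path is made up here, so nothing under data/ is touched) and reads the content back.

```python
import os
import tempfile

# Hypothetical scratch file, so that nothing under data/ is touched.
path = os.path.join(tempfile.mkdtemp(), "scratch.txt")

with open(path, "w", encoding="utf-8") as file:
    file.write("Leonardo da Vinci\n")   # the newline must be written explicitly
    file.write("Sandro ")               # no newline here...
    file.write("Botticelli\n")          # ...so this continues on the same line

with open(path, encoding="utf-8") as file:
    content = file.read()

print(content)
```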
Warning
Always check that the path you’re writing to does not point to an existing file that you’d like to keep! Otherwise you’ll completely overwrite it, thus losing data!
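One possible safeguard, not used elsewhere in this tutorial, is the "x" mode of open: it creates the file only if it does not exist yet, and raises FileExistsError otherwise. A minimal sketch, with a made-up file name in a temporary directory:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "precious.txt")
refused = False

# First call: the file does not exist yet, so it is created.
with open(path, "x", encoding="utf-8") as file:
    file.write("data I want to keep\n")

# Second call: the file exists, so open() refuses instead of overwriting.
try:
    with open(path, "x", encoding="utf-8") as file:
        file.write("this would have replaced the data\n")
except FileExistsError:
    refused = True
    print("File already exists, nothing was overwritten")

with open(path, encoding="utf-8") as file:
    kept = file.read()

print(kept)
```

With "w" the second call would have silently replaced the content; with "x" the original data is still there.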
✪✪ 1.5 EXERCISE: As above, get the full name of each historical character, but instead of printing it, store it in a list.
Then use this list to write a new file data/people-simple-new.txt, containing the full name of one historical character on each line.
Check that things went as planned by reading the file back after writing it.
Content of people-simple-new:
Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli
Exercises - people-complex line file#
Look at the file people-complex.txt:
name: Leonardo
surname: da Vinci
birthdate: 1452-04-15
name: Sandro
surname: Botticelli
birthdate: 1445-03-01
name: Niccolò
surname: Macchiavelli
birthdate: 1469-05-03
Suppose you want to read the file and print the following output: how would you do it?
Leonardo da Vinci, 1452-04-15
Sandro Botticelli, 1445-03-01
Niccolò Macchiavelli, 1469-05-03
Hint
Each line is clearly split into two parts. Find the method you can use on a line to separate the line into these two parts. Then it’ll be easy to extract the part you’re interested in.
✪ 1.6 EXERCISE: Write here the solution of the exercise ‘People complex’. Use either a while or a for loop, whichever you feel more comfortable with.
Leonardo da Vinci, 1452-04-15
Sandro Botticelli, 1445-03-01
Niccolò Macchiavelli, 1469-05-03
Exercise - immersione-in-python-toc line file#
✪✪✪ This exercise is more challenging: if you are a beginner you might skip it and go on to CSVs.
The book Dive into Python is nice, and for the Italian version there is a PDF. It has a problem though: if you try to print it, you will discover that the index is missing. Without despairing, we found a program that extracts the titles into a file as shown below, but the result is not exactly nice to look at. Since we are Python ninjas, we decided to transform the raw titles into a real table of contents. There are certainly smarter ways to do this, like loading the PDF in Python with an appropriate module, but this still makes for an interesting exercise.
You are given the file immersione-in-python-toc.txt:
BookmarkBegin
BookmarkTitle: Il vostro primo programma Python
BookmarkLevel: 1
BookmarkPageNumber: 38
BookmarkBegin
BookmarkTitle: Immersione!
BookmarkLevel: 2
BookmarkPageNumber: 38
BookmarkBegin
BookmarkTitle: Dichiarare funzioni
BookmarkLevel: 2
BookmarkPageNumber: 41
BookmarkBegin
BookmarkTitle: Argomenti opzionali e con nome
BookmarkLevel: 3
BookmarkPageNumber: 42
BookmarkBegin
BookmarkTitle: Scrivere codice leggibile
BookmarkLevel: 2
BookmarkPageNumber: 44
BookmarkBegin
BookmarkTitle: Stringhe di documentazione
BookmarkLevel: 3
BookmarkPageNumber: 44
BookmarkBegin
BookmarkTitle: Il percorso di ricerca di import
BookmarkLevel: 2
BookmarkPageNumber: 46
BookmarkBegin
BookmarkTitle: Ogni cosa &#232; un oggetto
BookmarkLevel: 2
BookmarkPageNumber: 47
Write a Python program to print the following output:
Il vostro primo programma Python 38
Immersione! 38
Dichiarare funzioni 41
Argomenti opzionali e con nome 42
Scrivere codice leggibile 44
Stringhe di documentazione 44
Il percorso di ricerca di import 46
Ogni cosa è un oggetto 47
For this exercise, you will need to insert artificial spaces in the output, in a quantity determined by the BookmarkLevel rows.
QUESTION: what’s that weird value &#232; at the end of the original file? Should we report it in the output?
HINT 1: To convert a string into an integer number, use the function int:
x = '5'
x
'5'
int(x)
5
Warning
int(x) returns a value, and never modifies the argument x!
HINT 2: To substitute a substring in a string, you can use the method .replace:
x = "abcde"
x.replace("cd", "HELLO")
'abHELLOe'
HINT 3: as long as there is only one sequence to substitute, replace is fine, but if we had a million horrible sequences like &gt;, &#62;, &#x3e;, what should we do? As good data cleaners, we recognize these are HTML escape sequences, so we could use a function specific to such sequences, like html.unescape. Try it instead of replace and check if it works!
NOTE: Before using html.unescape, import the module html with the command:
import html
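As a small illustration of what html.unescape does, the sketch below applies it to two made-up strings (they are not taken from the exercise file) containing named, decimal and hexadecimal escape sequences.

```python
import html

# A made-up title containing a decimal and a named escape sequence.
raw_title = "Citt&#224; &amp; dintorni"
clean_title = html.unescape(raw_title)
print(clean_title)

# Named, decimal and hexadecimal escapes are all handled by the same call.
symbols = html.unescape("&gt; &#62; &#x3e;")
print(symbols)
```

A single call takes care of every escape sequence in the string, whereas replace would need one call per sequence.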
HINT 4: To write n copies of a character, use * like this:
"b" * 4
'bbbb'
IMPLEMENTATION: Write here the solution for the line file immersione-in-python-toc.txt.
Il vostro primo programma Python 38
Immersione! 38
Dichiarare funzioni 41
Argomenti opzionali e con nome 42
Scrivere codice leggibile 44
Stringhe di documentazione 44
Il percorso di ricerca di import 46
Ogni cosa è un oggetto 47