Generic text files#

Introduction#

In these tutorials we will see how to load and write text files containing data in different formats:

  • line-delimited data

  • tabular data such as CSV

  • tree-like data such as JSON files.

Line files#

Line files are typically text files which contain information grouped by lines. An example using historical characters might be like the following:

Leonardo
da Vinci
Sandro
Botticelli
Niccolò 
Macchiavelli

We can immediately see a regularity: the first two lines contain the data of Leonardo da Vinci, first the name and then the surname. The next two lines hold the data of Sandro Botticelli, again first the name and then the surname, and so on.

We might want to write a program that reads the lines and prints names and surnames on the terminal like the following:

Leonardo da Vinci 
Sandro Botticelli
Niccolò Macchiavelli

To get a first approximation of the final result, we can open the file, read only the first line and print it:

with open("data/people-simple.txt", encoding="utf-8") as f:
    line = f.readline()
    print(line)
Leonardo

What happened? Let’s examine the code line by line:

The open function#

The command

open('data/people-simple.txt', encoding='utf-8')

allows us to open the text file by telling Python the file path 'data/people-simple.txt' and the encoding in which it was written (encoding='utf-8').

The encoding#

The encoding depends on the operating system and on the editor used to write the file. When we open a file, Python cannot figure out the encoding by itself, and if we do not specify anything Python might open the file assuming an encoding different from the original. As a result, if we omit the encoding (or pass a wrong one) we might end up seeing weird characters (like little squares instead of accented letters).

In general, when you open a file, try encoding='utf-8' first, since it is the most common encoding. If you do not specify any encoding, Python falls back on the platform default, which is UTF-8 on most Linux and macOS systems but often a different encoding on Windows. If utf-8 doesn’t work, you should try others. For example, for files written in southern Europe with Windows you might try encoding='latin-1'. If you open a file written elsewhere, you might need other encodings. For more in-depth information, you can read Dive into Python - Chapter 4 - Strings, and Dive into Python - Chapter 11 - File.
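To see concretely why the encoding matters, here is a minimal sketch that stores a name from our example as UTF-8 bytes and then decodes those same bytes with two different encodings. The variable names are ours, chosen only for illustration.

```python
# The same bytes, decoded with two different encodings.
raw = "Niccolò".encode("utf-8")     # the bytes as they would be stored on disk

as_utf8 = raw.decode("utf-8")       # right encoding: the text comes back intact
as_latin1 = raw.decode("latin-1")   # wrong encoding: the accented letter is garbled

print(as_utf8)     # Niccolò
print(as_latin1)   # NiccolÃ²
```

Note that the wrong encoding does not necessarily raise an error: it may silently produce wrong characters, which is why it is worth checking the accented letters after loading a file.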

The with block#

The with defines a block with instructions inside:

with open('data/people-simple.txt', encoding='utf-8') as f:
    line = f.readline()
    print(line)

with is used to create a context in which to execute the indented block of code that follows it. The context here is simply that a file is open, and can be operated on through the f variable. Importantly, when you’re out of the indented block – which means that the context “the file is open” ends – the file… gets closed.

Question

Then, before running the following cell, try to guess what happens when we try reading another line with f.

f.readline()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 1
----> 1 f.readline()

ValueError: I/O operation on closed file.

Properly closing a file avoids wasting memory resources and creating hard-to-find, seemingly paranormal errors. If you want to avoid hunting for never-closed zombie files, always remember to open all files in with blocks! Furthermore, at the end of the line, in the part as f:, we assigned the file to a variable here called f, but we could have used any other name we liked.
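As a small illustration of the points above, the following sketch checks the closed attribute of a file object inside and after a with block, even when an error interrupts the block. The temporary file and the names path, my_file, inside and after are assumptions made only for this example.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on data/.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Leonardo\n")

# The file gets closed even if an error happens inside the with block.
try:
    with open(path, encoding="utf-8") as my_file:   # any variable name works
        inside = my_file.closed                     # False: still open here
        raise RuntimeError("something went wrong")
except RuntimeError:
    pass

after = my_file.closed                              # True: with closed it for us
print(inside, after)   # False True
```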

Warning

To indent the code, always use sequences of four white spaces. Sequences of only 2 spaces, even if allowed, are not recommended.

Warning

Depending on the editor you use, by pressing <Tab> you might get a sequence of white spaces, as happens in Jupyter (4 spaces, which is the recommended length), or a special tabulation character (to avoid)! As annoying as this distinction might appear, remember it, because it can generate errors that are very hard to find.

Warning

In commands that create blocks, such as with, always remember to put a colon : at the end of the line!

Reading in the file#

The command

    line = f.readline()

puts the entire first line into the variable line, as a string. Warning: the string will contain the special newline character at the end!
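The newline character is invisible when printed, but repr makes it show up. Here is a minimal sketch that uses io.StringIO as an in-memory stand-in for an open text file, so it runs without any file on disk (that substitution is our assumption, not part of the tutorial's data):

```python
import io

# io.StringIO behaves like an open text file, but lives in memory.
f = io.StringIO("Leonardo\nda Vinci\n")

line = f.readline()
print(repr(line))   # 'Leonardo\n'  <- the trailing \n is part of the string
print(len(line))    # 9: eight letters plus the newline character
```

This also explains the empty line that appeared after Leonardo in the output earlier: print adds its own line break after the one already contained in the string.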

You might wonder where that .readline() comes from. Like everything in Python, our variable f which represents the file we just opened is an object, and like any object, depending on its type, it has particular methods we can use on it. In this case the method is .readline().

The following command prints the string content:

    print(line) 

✪ 1.1 EXERCISE: Try to rewrite here the block we’ve just seen, but this time printing the first two lines. Type the code yourself rather than copy-pasting it! Pay attention to the correct indentation with spaces in the block.

with open("data/people-simple.txt") as f:
    first_line = f.readline()
    print(first_line)
    second_line = f.readline()
    print(second_line)
Leonardo

da Vinci

✪ 1.2 EXERCISE: you might be wondering what exactly f is, and what exactly the method readline does. When you find yourself in these situations, you can help yourself with the functions type and help. This time, directly copy-paste the same code here, but insert the following commands inside the with block:

  • print(type(f))

  • help(f)

  • help(f.readline) # Attention: remember the f. before the readline !!

Every time you add something, try to execute with Control+Enter and see what happens

with open("data/people-simple.txt") as f:
    line = f.readline()
    print(line)
    print(type(f))
    help(f.readline)
    help(f)
Leonardo

<class '_io.TextIOWrapper'>
Help on built-in function readline:

readline(size=-1, /) method of _io.TextIOWrapper instance
    Read until newline or EOF.
    
    Returns an empty string if EOF is hit immediately.

Help on TextIOWrapper object:

class TextIOWrapper(_TextIOBase)
 |  TextIOWrapper(buffer, encoding=None, errors=None, newline=None, line_buffering=False, write_through=False)
 |  
 |  Character and line based layer over a BufferedIOBase object, buffer.
 |  
 |  encoding gives the name of the encoding that the stream will be
 |  decoded or encoded with. It defaults to locale.getencoding().
 |  
 |  errors determines the strictness of encoding and decoding (see
 |  help(codecs.Codec) or the documentation for codecs.register) and
 |  defaults to "strict".
 |  
 |  newline controls how line endings are handled. It can be None, '',
 |  '\n', '\r', and '\r\n'.  It works as follows:
 |  
 |  * On input, if newline is None, universal newlines mode is
 |    enabled. Lines in the input can end in '\n', '\r', or '\r\n', and
 |    these are translated into '\n' before being returned to the
 |    caller. If it is '', universal newline mode is enabled, but line
 |    endings are returned to the caller untranslated. If it has any of
 |    the other legal values, input lines are only terminated by the given
 |    string, and the line ending is returned to the caller untranslated.
 |  
 |  * On output, if newline is None, any '\n' characters written are
 |    translated to the system default line separator, os.linesep. If
 |    newline is '' or '\n', no translation takes place. If newline is any
 |    of the other legal values, any '\n' characters written are translated
 |    to the given string.
 |  
 |  If line_buffering is True, a call to flush is implied when a call to
 |  write contains a newline character.
 |  
 |  Method resolution order:
 |      TextIOWrapper
 |      _TextIOBase
 |      _IOBase
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __next__(self, /)
 |      Implement next(self).
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  close(self, /)
 |      Flush and close the IO object.
 |      
 |      This method has no effect if the file is already closed.
 |  
 |  detach(self, /)
 |      Separate the underlying buffer from the TextIOBase and return it.
 |      
 |      After the underlying buffer has been detached, the TextIO is in an
 |      unusable state.
 |  
 |  fileno(self, /)
 |      Returns underlying file descriptor if one exists.
 |      
 |      OSError is raised if the IO object does not use a file descriptor.
 |  
 |  flush(self, /)
 |      Flush write buffers, if applicable.
 |      
 |      This is not implemented for read-only and non-blocking streams.
 |  
 |  isatty(self, /)
 |      Return whether this is an 'interactive' stream.
 |      
 |      Return False if it can't be determined.
 |  
 |  read(self, size=-1, /)
 |      Read at most n characters from stream.
 |      
 |      Read from underlying buffer until we have n characters or we hit EOF.
 |      If n is negative or omitted, read until EOF.
 |  
 |  readable(self, /)
 |      Return whether object was opened for reading.
 |      
 |      If False, read() will raise OSError.
 |  
 |  readline(self, size=-1, /)
 |      Read until newline or EOF.
 |      
 |      Returns an empty string if EOF is hit immediately.
 |  
 |  reconfigure(self, /, *, encoding=None, errors=None, newline=None, line_buffering=None, write_through=None)
 |      Reconfigure the text stream with new parameters.
 |      
 |      This also does an implicit stream flush.
 |  
 |  seek(self, cookie, whence=0, /)
 |      Set the stream position, and return the new stream position.
 |      
 |        cookie
 |          Zero or an opaque number returned by tell().
 |        whence
 |          The relative position to seek from.
 |      
 |      Four operations are supported, given by the following argument
 |      combinations:
 |      
 |      - seek(0, SEEK_SET): Rewind to the start of the stream.
 |      - seek(cookie, SEEK_SET): Restore a previous position;
 |        'cookie' must be a number returned by tell().
 |      - seek(0, SEEK_END): Fast-forward to the end of the stream.
 |      - seek(0, SEEK_CUR): Leave the current stream position unchanged.
 |      
 |      Any other argument combinations are invalid,
 |      and may raise exceptions.
 |  
 |  seekable(self, /)
 |      Return whether object supports random access.
 |      
 |      If False, seek(), tell() and truncate() will raise OSError.
 |      This method may need to do a test seek().
 |  
 |  tell(self, /)
 |      Return the stream position as an opaque number.
 |      
 |      The return value of tell() can be given as input to seek(), to restore a
 |      previous stream position.
 |  
 |  truncate(self, pos=None, /)
 |      Truncate file to size bytes.
 |      
 |      File pointer is left unchanged.  Size defaults to the current IO
 |      position as reported by tell().  Returns the new size.
 |  
 |  writable(self, /)
 |      Return whether object was opened for writing.
 |      
 |      If False, write() will raise OSError.
 |  
 |  write(self, text, /)
 |      Write string to stream.
 |      Returns the number of characters written (which is always equal to
 |      the length of the string).
 |  
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |  
 |  __new__(*args, **kwargs)
 |      Create and return a new object.  See help(type) for accurate signature.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  buffer
 |  
 |  closed
 |  
 |  encoding
 |      Encoding of the text stream.
 |      
 |      Subclasses should override.
 |  
 |  errors
 |      The error setting of the decoder or encoder.
 |      
 |      Subclasses should override.
 |  
 |  line_buffering
 |  
 |  name
 |  
 |  newlines
 |      Line endings translated so far.
 |      
 |      Only line endings translated during reading are considered.
 |      
 |      Subclasses should override.
 |  
 |  write_through
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from _IOBase:
 |  
 |  __del__(...)
 |  
 |  __enter__(...)
 |  
 |  __exit__(...)
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  readlines(self, hint=-1, /)
 |      Return a list of lines from the stream.
 |      
 |      hint can be specified to control the number of lines read: no more
 |      lines will be read if the total size (in bytes/characters) of all
 |      lines so far exceeds hint.
 |  
 |  writelines(self, lines, /)
 |      Write a list of lines to stream.
 |      
 |      Line separators are not added, so it is usual for each of the
 |      lines provided to have a line separator at the end.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from _IOBase:
 |  
 |  __dict__

Note

You can obtain an output similar to help(object) with the shortcut object?, try it! This works for any function or class, so it’s extremely useful! Use it whenever you forget the arguments of a function. The ? shortcut is an IPython feature, like the magic commands recognisable by their leading %. You can use these in an IPython session or a Jupyter cell, but not in the basic Python interpreter (the one you start by calling python)!

Earlier we put the content of the first line into the variable line; now we can put it in a variable with a more meaningful name, like name. We can also directly read the next line into the variable surname and then print the concatenation of both:

with open("data/people-simple.txt") as f:
    name = f.readline()
    surname = f.readline()
    print(f"{name} {surname}")
Leonardo
 da Vinci

PROBLEM! The output contains an unwanted line break. Why is that? As we said earlier, readline reads the line content into a string that also includes the special newline character at the end. To remove it, you can use the string method rstrip():

with open("data/people-simple.txt") as f:
    name = f.readline().rstrip()
    surname = f.readline().rstrip()
    print(f"{name} {surname}")
Leonardo da Vinci

✪ 1.3 EXERCISE: Again, rewrite the block above in the cell below, and execute the cell with Control+Enter.

Question: what happens if you use strip() instead of rstrip()? What about lstrip()? Can you deduce the meaning of r and l? If you can’t manage it, try the Python help command by calling help(str.rstrip)

with open("data/people-simple.txt") as f:
    name = f.readline().rstrip()
    surname = f.readline().rstrip()
    print(f"{name} {surname}")
Leonardo da Vinci

Very good, we have the first full name! Now we can read all the lines in sequence. To this end, we can use a while loop:

with open("data/people-simple.txt") as f:
    line = f.readline()
    while line != "":
        name = line.rstrip()
        surname = f.readline().rstrip()
        print(f"{name} {surname}")
        line = f.readline()
Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli

What did we do? First, we added a while loop in a new block.

Warning

In the new block, since it is already within the outer with, the instructions are indented by 8 spaces and not 4! If you use the wrong number of spaces, bad things happen!

We first read a line, and two cases are possible:

a. we are at the end of the file (or the file is empty): in this case the readline() call returns an empty string

b. we are not at the end of the file: the first line is put as a string inside the variable line. Since Python internally uses a pointer to keep track of our position in the file, after the read that pointer is moved to the beginning of the next line. This way the next call to readline() will read a line from the new position.
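The two cases above can be checked on a small in-memory example. This is only a sketch: io.StringIO stands in for a real file, and the content (with a blank line in the middle) is made up to show that a blank line is not the same thing as the end of the file.

```python
import io

# An in-memory stand-in for a file with a blank line in the middle.
f = io.StringIO("Leonardo\n\nda Vinci\n")

first = f.readline()    # 'Leonardo\n'
blank = f.readline()    # '\n'  -> a blank line still contains the newline
third = f.readline()    # 'da Vinci\n'
at_end = f.readline()   # ''    -> only the end of the file gives an empty string
again = f.readline()    # ''    -> and further calls keep returning it

print(repr(blank), repr(at_end))   # '\n' ''
```

Because a blank line is returned as a newline character and not as an empty string, comparing against the empty string is a safe way to detect the end of the file.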

In the while block we tell Python to continue the loop as long as line is not empty. While that is the case, inside the while block we parse the name from the line and put it in the variable name (removing the extra newline character with rstrip() as we did before), then we read the next line and store the parsed result in the surname variable. Finally, we read one more line into the line variable so it is ready for the next round of name extraction. When line is empty, the loop terminates:

while line != "":                     # enter the loop if line contains characters
    name = line.rstrip()              # parse the name
    surname = f.readline().rstrip()   # read the next line and parse the surname
    print(f"{name} {surname}")
    line = f.readline()               # read the next line

Note

In Python there are shorter ways to read a text file line by line; we used this approach to make all the steps explicit.

✪✪ 1.4 EXERCISE: We just presented the approach above to make things explicit. However, there are two more idiomatic ways to iterate over lines which involve a for loop. Try to rewrite the code above, this time using a for loop.

Hint

Start by inspecting all the methods of f using dir(f), or the IPython wildcard search ?f.*. Can you find the two methods that could help you? Also, experiment with random ideas of how to use the for loop, one of them might just work!

with open("data/people-simple.txt") as f:
    lines = f.readlines()

# Step through the list two lines at a time: name at i, surname at i + 1
for i in range(0, len(lines), 2):
    first_name = lines[i].rstrip()
    last_name = lines[i + 1].rstrip()
    print(f"{first_name} {last_name}")
Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli

Alternatively, we can iterate directly over the file object:

with open("data/people-simple.txt") as f:
    person = {}
    for line in f:
        line = line.rstrip()
        if "first_name" in person:
            person["last_name"] = line
            print(f"{person['first_name']} {person['last_name']}")
            # Equivalently:
            # print(" ".join([v for v in person.values()]))
            person = {}
        else:
            person["first_name"] = line
Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli
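
A further variant, not part of the solutions above, pairs consecutive lines with the built-in zip. It relies on the fact that a file object is its own iterator, so zip(f, f) takes two lines per step from the same stream. The sketch below uses io.StringIO with the same content as people-simple.txt so that it is self-contained; treat it as an illustration rather than the recommended approach.

```python
import io

# In-memory stand-in for data/people-simple.txt (same content as shown above).
f = io.StringIO("Leonardo\nda Vinci\nSandro\nBotticelli\nNiccolò \nMacchiavelli\n")

full_names = []
# zip(f, f) pulls two consecutive lines from the same iterator at each step.
for first_line, second_line in zip(f, f):
    full_names.append(f"{first_line.rstrip()} {second_line.rstrip()}")

print(full_names)
# ['Leonardo da Vinci', 'Sandro Botticelli', 'Niccolò Macchiavelli']
```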

Writing in the file#

If you want instead to write into a file, you should specify it from the moment you open it, using "w" as a second argument, like so:

with open("data/people-simple-new.txt", "w") as f:
    f.write("")

We passed the content of this new file to f.write(), which is here an empty string. Now let’s read this file to check everything went correctly:

with open("data/people-simple-new.txt", "r") as f:
    print(f.read())

That’s an empty string indeed!

Warning

Always check that the path you’re writing to does not point to an existing file that you’d like to keep! Otherwise you’ll completely overwrite it, thus losing data!
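One way to guard against accidental overwriting is the "x" mode of open, which creates the file only if it does not exist yet and raises FileExistsError otherwise. The following is a minimal sketch in a temporary directory; the file content and the variable names are made up for the example.

```python
import os
import tempfile

# Work in a throwaway directory so nothing in data/ is touched.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "people-simple-new.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("precious data\n")

# Mode "x" refuses to open a file that already exists.
try:
    with open(path, "x", encoding="utf-8") as f:
        f.write("this would have replaced the data\n")
    refused = False
except FileExistsError:
    refused = True

with open(path, encoding="utf-8") as f:
    content = f.read()

print(refused)         # True
print(repr(content))   # 'precious data\n'
```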

✪✪ 1.5 EXERCISE: As above, get the full name of each historical character, but instead of printing it, store it into a list. Then use this list to write a new file data/people-simple-new.txt, containing on each line the full name of a historical character. Check things went as planned by reading in the file after writing to it.

output_lines = []
with open("data/people-simple.txt", "r") as f:
    person = {}
    for line in f:
        line = line.rstrip()
        if "first_name" in person:
            person["last_name"] = line
            output_str = f"{person['first_name']} {person['last_name']}"
            output_lines.append(output_str)
            person = {}
        else:
            person["first_name"] = line

with open("data/people-simple-new.txt", "w") as f:
    f.write("\n".join(output_lines))

with open("data/people-simple-new.txt", "r") as f:
    print('Content of people-simple-new:')
    print(f.read())
Content of people-simple-new:
Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli

Exercises - people-complex line file#

Look at the file people-complex.txt:

name: Leonardo
surname: da Vinci
birthdate: 1452-04-15
name: Sandro
surname: Botticelli
birthdate: 1445-03-01
name: Niccolò 
surname: Macchiavelli
birthdate: 1469-05-03

Suppose you want to read the file and print the following output. How would you do it?

Leonardo da Vinci, 1452-04-15
Sandro Botticelli, 1445-03-01
Niccolò Macchiavelli, 1469-05-03

Hint

Each line is clearly split into two parts. Find the method you can use on a line to separate the line into these two parts. Then it’ll be easy to extract the part you’re interested in.

✪ 1.6 EXERCISE: Write here the solution of the exercise ‘People complex’. Use either a while or a for loop, whichever you feel more comfortable with.

with open("data/people-complex.txt") as f:
    line = f.readline()
    while line != "":
        name = line.split(":")[1].strip()
        surname = f.readline().split(":")[1].strip()
        born = f.readline().split(":")[1].strip()
        print(name + " " + surname + ", " + born)
        line = f.readline()
Leonardo da Vinci, 1452-04-15
Sandro Botticelli, 1445-03-01
Niccolò Macchiavelli, 1469-05-03

An alternative solution using a for loop and a dictionary:

with open("data/people-complex.txt") as f:
    person = {}
    for line in f:
        key, value = line.split(":")
        value = value.strip()
        if key in person:
            print(f"{person['name']} {person['surname']}, {person['birthdate']}")
            person = {key: value}
        else:
            person[key] = value
print(f"{person['name']} {person['surname']}, {person['birthdate']}")
Leonardo da Vinci, 1452-04-15
Sandro Botticelli, 1445-03-01
Niccolò Macchiavelli, 1469-05-03
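
Both solutions assume that each line contains exactly one colon. If a value could itself contain a colon, a plain split would produce more than two parts. A small sketch of the difference, using a hypothetical line that is not in people-complex.txt:

```python
# Hypothetical line whose value contains a colon (not in the real file).
line = "note: born in Vinci: a town near Florence\n"

# A plain split breaks at every colon, so unpacking into two names would fail.
parts = line.split(":")

# With maxsplit=1 the line is split only at the first colon.
key, value = line.split(":", 1)
value = value.strip()

print(len(parts))   # 3
print(key)          # note
print(value)        # born in Vinci: a town near Florence
```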

Exercise - immersione-in-python-toc line file#

✪✪✪ This exercise is more challenging; if you are a beginner you might skip it and go on to CSVs.

The book Dive into Python is nice, and for the Italian version there is a PDF, which has a problem though: if you try to print it, you will discover that the index is missing. Without despairing, we found a program that extracts the titles into a file as follows, but the result is not exactly nice to look at. Since we are Python ninjas, we decided to transform the raw titles into a real table of contents. Surely there are smarter ways to do this, like loading the PDF in Python with an appropriate module, but this still makes for an interesting exercise.

You are given the file immersione-in-python-toc.txt:

BookmarkBegin
BookmarkTitle: Il vostro primo programma Python
BookmarkLevel: 1
BookmarkPageNumber: 38
BookmarkBegin
BookmarkTitle: Immersione!
BookmarkLevel: 2
BookmarkPageNumber: 38
BookmarkBegin
BookmarkTitle: Dichiarare funzioni
BookmarkLevel: 2
BookmarkPageNumber: 41
BookmarkBegin
BookmarkTitle: Argomenti opzionali e con nome
BookmarkLevel: 3
BookmarkPageNumber: 42
BookmarkBegin
BookmarkTitle: Scrivere codice leggibile
BookmarkLevel: 2
BookmarkPageNumber: 44
BookmarkBegin
BookmarkTitle: Stringhe di documentazione
BookmarkLevel: 3
BookmarkPageNumber: 44
BookmarkBegin
BookmarkTitle: Il percorso di ricerca di import
BookmarkLevel: 2
BookmarkPageNumber: 46
BookmarkBegin
BookmarkTitle: Ogni cosa &#232; un oggetto
BookmarkLevel: 2
BookmarkPageNumber: 47

Write a python program to print the following output:

    Il vostro primo programma Python  38
        Immersione!  38
        Dichiarare funzioni  41
            Argomenti opzionali e con nome  42
        Scrivere codice leggibile  44
            Stringhe di documentazione  44
        Il percorso di ricerca di import  46
        Ogni cosa è un oggetto  47

For this exercise, you will need to insert artificial spaces in the output, in a quantity determined by the BookmarkLevel rows.

QUESTION: what’s that weird value &#232; at the end of the original file? Should we report it in the output?

HINT 1: To convert a string into an integer number, use the function int:

x = '5'
x
'5'
int(x)
5

Warning

int(x) returns a value, and never modifies the argument x!
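A quick check of this warning, as a minimal sketch: after calling int(x), the variable x still holds the original string, and only the returned value is a number.

```python
x = "5"
y = int(x)      # int() returns a new value; x itself is left untouched

print(repr(x))  # '5'  -> still a string
print(repr(y))  # 5    -> the integer we got back
print(x * 2)    # 55   -> string repetition
print(y * 2)    # 10   -> arithmetic
```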

HINT 2: To substitute a substring in a string, you can use the method .replace:

x = "abcde"
x.replace("cd", "HELLO")
'abHELLOe'

HINT 3: as long as there is only one sequence to substitute, replace is fine, but what if we had a million horrible sequences like &gt;, &#62;, &#x3e;? As good data cleaners, we recognize these are HTML escape sequences, so we could use a function specific to such sequences, like html.unescape. Try it instead of replace and check if it works!

NOTE: Before using html.unescape, import the module html with the command:

import html
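
As a small sketch of what this function does, here it is applied to the title from the file and to three different spellings of the same character. The variable names are ours; html.escape, shown last, goes in the opposite direction.

```python
import html

title = "Ogni cosa &#232; un oggetto"
clean = html.unescape(title)                 # numeric reference -> real character

three = html.unescape("&gt; &#62; &#x3e;")   # named, decimal and hexadecimal forms
back = html.escape(">")                      # the opposite direction

print(clean)   # Ogni cosa è un oggetto
print(three)   # > > >
print(back)    # &gt;
```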

HINT 4: To write n copies of a character, use * like this:

"b" * 4
'bbbb'

IMPLEMENTATION: Write here the solution for the line file immersione-in-python-toc.txt.

import html

with open("data/immersione-in-python-toc.txt") as f:
    line = f.readline()
    while line != "":
        line = f.readline()
        title = html.unescape(line.split(":")[1].strip())
        line = f.readline()
        level = int(line.split(":")[1].strip())
        line = f.readline()
        page = line.split(":")[1].strip()
        print(("    " * level) + title + "  " + page)
        line = f.readline()
    Il vostro primo programma Python  38
        Immersione!  38
        Dichiarare funzioni  41
            Argomenti opzionali e con nome  42
        Scrivere codice leggibile  44
            Stringhe di documentazione  44
        Il percorso di ricerca di import  46
        Ogni cosa è un oggetto  47

An alternative solution using a for loop and a dictionary:

import html


def print_section(section):
    # Print one table-of-contents entry; do nothing for an empty section
    if section:
        level = int(section["BookmarkLevel"])
        title = html.unescape(section["BookmarkTitle"])
        print(f"{'    ' * level}{title}  {section['BookmarkPageNumber']}")


with open("data/immersione-in-python-toc.txt") as f:
    section = {}
    for line in f:
        if line.startswith("BookmarkBegin"):
            # A new entry starts: print the previous one and start afresh
            print_section(section)
            section = {}
        else:
            # Split only at the first colon, in case a title contains one
            key, value = line.split(":", 1)
            section[key] = value.strip()
print_section(section)
    Il vostro primo programma Python  38
        Immersione!  38
        Dichiarare funzioni  41
            Argomenti opzionali e con nome  42
        Scrivere codice leggibile  44
            Stringhe di documentazione  44
        Il percorso di ricerca di import  46
        Ogni cosa è un oggetto  47