8  Files and I/O

Note: Core idea

Programs talk to the world. open() plus with is the foundation; pathlib.Path is the modern way to represent a file path; the csv and json modules cover the two most common data formats. Always use with, always specify encoding="utf-8".

In this chapter you will learn to:

  1. Read and write text files with open() and with.
  2. Use pathlib.Path for file-system paths.
  3. Read and write CSV and JSON.
  4. Print to stderr, read from stdin, and parse command-line arguments with argparse.
  5. Replace ad-hoc print debugging with logging.

We’ll demonstrate everything against a writable temporary directory so the cells render without leaving stray files.

import tempfile
from pathlib import Path
work = Path(tempfile.mkdtemp())
work
PosixPath('/tmp/tmpmen1z_o4')

8.1 Reading and writing text files

The simplest “real” I/O task: write a string to a file, then read it back. Two rules turn this from fragile to robust — always use with, always pass encoding="utf-8" explicitly.

target = work / "hello.txt"
with open(target, "w", encoding="utf-8") as f:
    f.write("Hello\nWorld\n")

with open(target, encoding="utf-8") as f:
    content = f.read()
content
'Hello\nWorld\n'
  • open(target, "w", encoding="utf-8") opens the file for writing. The "w" mode truncates if the file exists. encoding="utf-8" makes the bytes-on-disk interpretation portable across platforms.
  • with ... as f: binds the open file to f and guarantees it’s closed when the block exits — even if f.write raises mid-write.
  • f.write("Hello\nWorld\n") writes a string with two embedded newlines. No trailing newline is added automatically.
  • The second block reopens the file (default mode is "r", read) and reads the whole contents in one go with f.read().

The general rule: every file operation goes through with open(...) as f: with an explicit encoding=. Never rely on the platform default — it varies across OSes and locales.

For line-by-line reading on a large file, iterate the file object directly. It reads one line at a time, never loading the whole thing into memory:

with open(target, encoding="utf-8") as f:
    for line in f:
        print(line.rstrip())
Hello
World
  • for line in f: calls the file’s iterator protocol — yields one line at a time, including its trailing \n.
  • line.rstrip() strips that trailing newline (and any other whitespace) so print doesn’t double-space.
  • Memory stays constant regardless of file size — the file holds maybe one line at a time, not the whole text.

The general rule: a file object iterates as lines. Modes: "r" (read, default), "w" (write — overwrites), "a" (append), "x" (create — fails if file exists), "rb"/"wb" (binary).
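
A quick sketch of the append and exclusive-create modes, reusing the temporary directory from above (notes.txt is just an illustrative name):

log = work / "notes.txt"

with open(log, "w", encoding="utf-8") as f:      # "w": create or truncate
    f.write("first\n")

with open(log, "a", encoding="utf-8") as f:      # "a": append to the end
    f.write("second\n")

try:
    with open(log, "x", encoding="utf-8") as f:  # "x": refuse to touch an existing file
        f.write("never reached\n")
except FileExistsError:
    print("notes.txt already exists")

with open(log, encoding="utf-8") as f:
    f.read()                                     # 'first\nsecond\n'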

8.2 pathlib.Path — the modern path API

The traditional way to manipulate paths in Python is string concatenation plus the os.path module — os.path.join, os.path.dirname, os.path.splitext, etc. It works but it’s verbose and error-prone (Windows separators, trailing slashes). pathlib.Path replaces most of os.path (and parts of os and glob) with one object-oriented, cross-platform API.

p = work / "subdir" / "data.txt"
p.parent.mkdir(parents=True, exist_ok=True)
p.write_text("first line\nsecond line\n", encoding="utf-8")

[p.name, p.stem, p.suffix, p.parent.name]
['data.txt', 'data', '.txt', 'subdir']
  • work / "subdir" / "data.txt" uses Path’s overloaded / operator — the cross-platform replacement for os.path.join. The result is a new Path with the right separator for the OS.
  • p.parent is the path with the last segment removed; .mkdir(parents=True, exist_ok=True) creates the directory, including any missing parents, and doesn’t error if it already exists.
  • p.write_text(...) is a one-liner — opens, writes, closes, all in one call.
  • p.name is "data.txt" (filename); p.stem is "data" (filename without extension); p.suffix is ".txt" (the extension including the dot); p.parent.name is "subdir".

p.read_text() and p.write_text() are one-liners for whole-file I/O — handy when you don’t need streaming:

p.read_text(encoding="utf-8")
'first line\nsecond line\n'
  • One call: open in read mode with the given encoding, read the whole file, close it, return the contents as a string.

Walking a directory’s immediate children:

for child in sorted(work.iterdir()):
    print(child.name, "(dir)" if child.is_dir() else "(file)")
hello.txt (file)
subdir (dir)
  • work.iterdir() yields each child as a Path — files and directories alike, no recursion.
  • sorted(...) gives a stable, alphabetical order (since iterdir order is not guaranteed).
  • child.is_dir() tags each entry.

For recursive search by pattern, glob:

list(work.glob("**/*.txt"))
[PosixPath('/tmp/tmpmen1z_o4/hello.txt'),
 PosixPath('/tmp/tmpmen1z_o4/subdir/data.txt')]
  • **/*.txt is a glob pattern: **/ means “any depth of directories”, *.txt is “any file ending in .txt”.
  • glob returns a generator; wrapping in list materialises the matches.

The general rule: build paths with /, ask for properties as attributes (.name, .suffix), and use iterdir/glob/walk for directory traversal. Path.exists(), .is_file(), .is_dir(), .unlink(), .rename(), .mkdir() round out the common operations.
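
A small sketch of those query-and-modify operations with a throwaway file (draft.txt and final.txt are illustrative names):

draft = work / "draft.txt"
draft.write_text("temporary\n", encoding="utf-8")

draft.exists(), draft.is_file(), draft.is_dir()   # (True, True, False)

final = draft.rename(work / "final.txt")          # rename returns the new Path
final.read_text(encoding="utf-8")                 # 'temporary\n'

final.unlink()                                    # delete the file
final.exists()                                    # False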

For a recursive walk that gives you each directory’s children separately — like os.walk but with Path objects — use Path.walk() (Python 3.12+):

(work / "logs").mkdir(exist_ok=True)
(work / "logs" / "app.log").write_text("ok\n", encoding="utf-8")

for root, dirs, files in work.walk():
    print(root.name or root, sorted(dirs), sorted(files))
tmpmen1z_o4 ['logs', 'subdir'] ['hello.txt']
subdir [] ['data.txt']
logs [] ['app.log']

root is a Path; dirs and files are lists of plain names, and removing entries from dirs in place prunes the walk. Reach for Path.walk() when you need to act per-directory; reach for Path.glob("**/*") when you just want the flat list of matches.
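
A minimal sketch of pruning, reusing the logs directory created above: deleting a name from dirs stops the walk from ever entering that directory.

for root, dirs, files in work.walk():
    if "logs" in dirs:
        dirs.remove("logs")          # prune: walk will not descend into logs/
    print(root.name or root, sorted(files))

With the prune in place, app.log never appears because the walk never enters logs/.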

8.3 Reading TOML config: tomllib

Python 3.11 added tomllib — a TOML parser in the standard library. TOML is the format for pyproject.toml and most modern Python tooling configs. The module is read-only (there is no writer) and exposes two functions:

  • tomllib.loads(s) — parse a str. Returns a dict.
  • tomllib.load(fp) — parse a binary file object. Open the file with "rb".
import tomllib

snippet = """
[project]
name = "myapp"
version = "0.1.0"
dependencies = ["httpx", "pydantic"]
"""

config = tomllib.loads(snippet)
config["project"]["name"], config["project"]["dependencies"]
('myapp', ['httpx', 'pydantic'])

Reading from disk uses load with a binary-mode file:

with open("pyproject.toml", "rb") as f:
    config = tomllib.load(f)

Why binary? TOML 1.0 mandates UTF-8, and tomllib does the decoding itself — opening in text mode would let the OS decode first and could disagree on the encoding. Open "rb", hand the bytes to load, done.

A few details worth knowing:

  • TOML types map cleanly to Python: tables → dict, arrays → list, strings → str, integers → int, floats → float, booleans → bool, datetimes → datetime.datetime / date / time.
  • Invalid TOML raises tomllib.TOMLDecodeError (a subclass of ValueError) with lineno and colno attributes for diagnostics.
  • For exact decimals (e.g., money, scientific data) pass parse_float=decimal.Decimal to keep precision instead of rounding to binary float.
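
A short sketch of the last two points: catching a parse error and keeping floats exact (both snippets are illustrative).

from decimal import Decimal

try:
    tomllib.loads('name = "unterminated')            # invalid: unclosed string
except tomllib.TOMLDecodeError as exc:
    print("bad TOML:", exc)

tomllib.loads("price = 19.99", parse_float=Decimal)["price"]   # Decimal('19.99')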

For writing TOML, you need a third-party package — tomli-w for round-tripping, tomlkit for style-preserving edits.

8.4 CSV

CSV looks deceptively simple — fields separated by commas, rows separated by newlines — but real CSV has quoted fields, embedded commas, embedded newlines, and platform-specific line endings. Always use the csv module — never parse it by hand with .split(",").

import csv

target = work / "people.csv"
people = [
    {"name": "Alice", "score": 95},
    {"name": "Bob", "score": 87},
    {"name": "Carol", "score": 92},
]

with open(target, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "score"])
    writer.writeheader()
    writer.writerows(people)

target.read_text(encoding="utf-8")
'name,score\nAlice,95\nBob,87\nCarol,92\n'
  • csv.DictWriter(f, fieldnames=[...]) builds a writer that takes a list of dicts and writes them as rows. fieldnames defines the column order and the header.
  • writer.writeheader() writes the column-name row first.
  • writer.writerows(people) writes one row per dict, taking values in the order of fieldnames.
  • newline="" tells open not to translate line endings — csv handles them internally.
with open(target, encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))
rows
[{'name': 'Alice', 'score': '95'},
 {'name': 'Bob', 'score': '87'},
 {'name': 'Carol', 'score': '92'}]
  • csv.DictReader(f) reads the first line as a header, then yields each subsequent row as a dict keyed by those column names.
  • list(...) materialises the iterator into a list of dicts — fine for small files.
  • All values are strings — csv doesn’t infer types. If you want int, you have to convert.

The general rule: use DictReader/DictWriter when you’re working with named columns, plain reader/writer for raw rows. The newline="" argument is required on Windows to handle line endings correctly. Always pass it.
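
For raw rows, a brief sketch of the plain variants (scores.csv is an illustrative name): csv.writer takes lists, csv.reader yields lists, and the same newline="" rule applies.

raw = work / "scores.csv"
with open(raw, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "score"])                 # header as a plain list
    writer.writerows([["Alice", 95], ["Bob", 87]])

with open(raw, encoding="utf-8", newline="") as f:
    list(csv.reader(f))
# [['name', 'score'], ['Alice', '95'], ['Bob', '87']]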

8.5 JSON

When two programs need to exchange structured data — a Python script writing a config a Rust service reads, or an API returning a payload to a JavaScript front-end — JSON is the universal format. The json module converts between JSON text and Python objects.

import json

target = work / "config.json"
config = {"name": "Alice", "scores": [95, 87, 92], "active": True}

with open(target, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

target.read_text(encoding="utf-8")
'{\n  "name": "Alice",\n  "scores": [\n    95,\n    87,\n    92\n  ],\n  "active": true\n}'
  • json.dump(obj, f) serialises a Python value to JSON and writes it to an open file. Note: dump writes to a file, dumps returns a string.
  • indent=2 pretty-prints with 2-space indentation — readable for humans, slightly bigger than the compact form.
  • The Python dict, list, and string values map to the matching JSON types (object, array, string).
with open(target, encoding="utf-8") as f:
    loaded = json.load(f)
loaded
{'name': 'Alice', 'scores': [95, 87, 92], 'active': True}
  • json.load(f) reads JSON from an open file and returns the equivalent Python value.
  • loaded is a fresh dict, structurally identical to config, but a separate object in memory.

The general rule: dump/load operate on file objects, dumps/loads operate on strings — the trailing s is for “string”. Type mapping: None ↔︎ null, True/False ↔︎ true/false, int/float ↔︎ number, str ↔︎ string, list ↔︎ array, dict ↔︎ object.

json.dumps(obj, indent=2) and json.loads(text) are the string variants — useful for HTTP bodies and inline test data.
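
A minimal sketch of the string round-trip:

text = json.dumps({"name": "Alice", "active": True})
text                           # '{"name": "Alice", "active": true}'
json.loads(text)["active"]     # True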

8.6 Standard streams and command-line arguments

A command-line program has three default streams: stdin (input from the user or a pipe), stdout (normal output), and stderr (errors and diagnostics). Keeping errors on stderr matters because shell users redirect them separately — myscript > out.txt captures stdout to a file while stderr still shows on the terminal.

print() writes to stdout by default; pass file=sys.stderr to write to standard error:

import sys
print("normal output")            # → stdout
print("warning!", file=sys.stderr)  # → stderr
normal output
warning!

8.6.1 Parsing command-line arguments with argparse

When a user runs python myscript.py data.json -v, what does Python actually hand to your program? Just a list of strings — sys.argv:

# If the user had run: python myscript.py data.json -v
# sys.argv would look like this:
fake_argv = ["myscript.py", "data.json", "-v"]
fake_argv
['myscript.py', 'data.json', '-v']

That’s the raw input — no types, no validation, no help text, no notion of which token is a flag and which is positional. You could parse it yourself: walk the list, recognise leading -, convert "data.json" to a Path, decide what to do if the user typed --input instead of -i, write a --help message. For a script with a single argument that’s tolerable; for anything bigger it’s a mess of conditionals.

Step back and ask: what does any argument parser need to know about each argument?

  1. Its name (so we can refer to it later as args.input).
  2. Whether it’s positional (required, no leading -) or optional (a flag, with a default).
  3. Its type (a Path? an int? a string?).
  4. A default if the user omits it.
  5. A help string so --help is useful.

argparse is the standard library’s answer: a tiny declarative API where you describe each argument with those five pieces, and the module does parsing, type conversion, validation, and --help for you. Build a parser in two steps.

Step 1: declare one positional argument.

import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Process input data.")
parser.add_argument("input", type=Path, help="input JSON file")

args = parser.parse_args(["data.json"])
args.input, type(args.input).__name__
(PosixPath('data.json'), 'PosixPath')
  • ArgumentParser(description=...) creates the parser. The description shows up in the auto-generated --help text.
  • add_argument("input", ...) declares an argument named input. The name has no leading -, so it’s positional — required, consumed from the first non-flag word.
  • type=Path tells argparse to convert the raw string with Path(...). You can pass any one-argument callable: int, float, your own validation function. The argument arrives already-typed.
  • parse_args(["data.json"]) returns a Namespace where each declared argument is an attribute. In a real script you’d call parse_args() with no list and argparse would read sys.argv itself; passing a list is the testing/demo idiom.

Step 2: add optional flags with defaults. Argument names starting with - or -- are optional. The user may skip them; the parser then uses the default you supply:

parser.add_argument("-o", "--output", type=Path, default=Path("results.json"))
parser.add_argument("-v", "--verbose", action="store_true")

args = parser.parse_args(["data.json", "-v"])
[args.input, args.output, args.verbose]
[PosixPath('data.json'), PosixPath('results.json'), True]
  • -o and --output are two spellings of the same flag — short for typing, long for readability. The attribute name on args comes from the long form (with hyphens turned to underscores).
  • default=Path("results.json") is what args.output becomes when the user doesn’t pass --output. Without default, the value is None.
  • action="store_true" declares a pure on/off switch with no value following it. Present → args.verbose = True; absent → False.
  • Our call passed -v but not -o, so verbose is True and output falls back to the default.

The pay-off: declare each argument once, get parsing + typing + defaults + a generated --help (try python script.py --help). Always use argparse for any script with more than one argument — the cost is small and you ship a usage doc you’d otherwise write by hand.
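
Validation also comes for free: an unknown flag or a missing positional makes argparse print a usage message to stderr and exit. A small sketch, catching SystemExit only so the demo keeps running:

try:
    parser.parse_args(["data.json", "--no-such-flag"])
except SystemExit as exc:
    exc.code        # 2, the conventional "bad usage" exit status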

8.7 logging — when print isn’t enough

Start from what print("ERROR: connection failed") actually does in a real program:

  • It writes one line to stdout. There’s no machine-readable signal that this line is an error and the next one is a routine trace — they’re all just text.
  • Every print call ships in production. There’s no off-switch for the noisy ones short of deleting them.
  • It writes to stdout, full stop. You can’t redirect only the errors to a file, or send them to a remote alert sink.
  • The line carries no timestamp, no module name, no severity — no metadata an operator can grep, sort, or filter by.

Each of those is a small annoyance for a 50-line script and a serious problem at 5,000. The standard library’s logging module is the systematic answer. It introduces three concepts:

  • A level is the severity tag attached to each log line — DEBUG, INFO, WARNING, ERROR, CRITICAL, in increasing order.
  • A logger is the object you call to emit a line. It carries a name and a level threshold; lines below the threshold are silently dropped.
  • A handler is the destination — stderr, a rotating file, a network sink. A logger can have multiple handlers; each can have its own format.

Build it up in three steps.

Step 1: levels — a silence knob and a severity tag. Set a threshold; lines below it disappear without code edits:

import logging
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s", force=True)

logging.debug("about to start")          # DROPPED — below INFO threshold
logging.info("starting up")
logging.warning("disk almost full")
logging.error("could not connect to database")
INFO starting up
WARNING disk almost full
ERROR could not connect to database
  • basicConfig(level=logging.INFO) configures the root logger in one call. Threshold = INFO, so DEBUG is dropped and INFO/WARNING/ERROR are kept.
  • The silence knob: change level=INFO to level=WARNING and the info line stops appearing. No calls deleted; you can turn them back on by flipping the threshold.
  • format is a template. %(levelname)s becomes INFO/WARNING/ERROR; %(message)s becomes the text. Real production formats add %(asctime)s (timestamp) and %(name)s (logger name).
  • force=True lets basicConfig replace any earlier handler — useful when re-running notebook cells. In a real script you call basicConfig once at startup and don’t need force=True.

Step 2: named loggers — per-component control. Step 1 used logging.info(...) directly, which is the root logger. In real code, every module asks for its own logger so logs are tagged by component and a noisy component can be quieted without touching the rest:

logger = logging.getLogger("payments")
logger.warning("retrying after timeout")
WARNING retrying after timeout
  • logging.getLogger("payments") looks up (or creates) a logger named payments. Convention: write logger = logging.getLogger(__name__) at the top of every module — __name__ is the module’s import path, so the logger name auto-tracks the source.
  • Logger names are dot-namespaced: myapp is the parent of myapp.payments and myapp.email. logging.getLogger("myapp").setLevel(logging.WARNING) quiets every child at once unless a child overrides.
  • Children inherit the root config from basicConfig — same handler, same format. Override per-logger only when you genuinely need different behaviour.
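
A small sketch of that per-component knob (the myapp names are illustrative):

logging.getLogger("myapp").setLevel(logging.WARNING)

child = logging.getLogger("myapp.payments")
child.info("processed order")        # dropped: effective level WARNING, inherited from myapp
child.warning("retrying charge")     # shown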

Step 3: log an exception with its full traceback. Inside an except block, logger.exception records the message and the current traceback. That’s the difference between “something broke” and “exact line, exact stack”:

try:
    1 / 0
except ZeroDivisionError:
    logger.exception("could not compute ratio")
ERROR could not compute ratio
Traceback (most recent call last):
  File "/tmp/ipykernel_2718/3288506788.py", line 2, in <module>
    1 / 0
    ~~^~~
ZeroDivisionError: division by zero
  • Same severity as logger.error, but exception adds the traceback automatically. Use it whenever you catch.
  • Never swallow an exception silently — at minimum, logger.exception(...) before you re-raise or return a fallback.

Distilled: at the top of every module, logger = logging.getLogger(__name__). Once at startup, configure handlers and format with basicConfig (or a real config object). Then replace print with .debug / .info / .warning / .error / .exception. Levels make silencing free; named loggers make per-component tuning free; exception makes diagnostics free. None of this is more code than print — it’s the same number of lines with the metadata your future self (or your operator) will need.
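
The three steps name handlers but never show one explicitly. A minimal sketch, assuming you also want errors copied to a file alongside the stderr output (errors.log is an illustrative name):

file_handler = logging.FileHandler(work / "errors.log", encoding="utf-8")
file_handler.setLevel(logging.ERROR)                     # this destination only sees ERROR and up
file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logging.getLogger().addHandler(file_handler)             # attach to the root logger

logger.error("could not connect to database")            # now goes to stderr and errors.log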

Tip: Why this matters

File I/O is the boundary where most real-world bugs live: missing files, wrong encodings, half-written outputs. with makes leaks impossible; pathlib makes path manipulation portable; csv and json give you reliable parsers. Replace print debugging with logging the moment a script grows past 100 lines, and you’ll never lose another error message.

8.8 Going deeper

The with statement and the protocol behind it are in Chapter 30. Iterators (which is how a file object iterates line by line) are in Chapter 29. The pathlib.Path API is built on the descriptor and protocol patterns covered in Chapter 23 and Chapter 25.

8.9 Build: a CSV-to-JSON converter CLI

The most common one-shot script in data work: read a CSV, convert it to JSON. We’ll grow it from a function into a real command-line tool, exercising pathlib, csv, json, argparse, and logging in three steps.

Step 1: a function that returns records. Open the CSV with DictReader, return a list of dicts. CSV reads everything as strings; we’ll convert anything that looks numeric to int or float:

import csv

src = work / "people.csv"
src.write_text("name,age,score\nAlice,30,95.5\nBob,25,87.0\n", encoding="utf-8")

def coerce(value):
    """Try int, then float, else leave as a string."""
    for typ in (int, float):
        try:
            return typ(value)
        except ValueError:
            continue
    return value

def csv_to_records(path):
    with open(path, encoding="utf-8", newline="") as f:
        return [
            {k: coerce(v) for k, v in row.items()}
            for row in csv.DictReader(f)
        ]

csv_to_records(src)
[{'name': 'Alice', 'age': 30, 'score': 95.5},
 {'name': 'Bob', 'age': 25, 'score': 87.0}]

coerce is the typical “soft” type-cast — try int, fall back to float, fall back to leaving it alone. The dict comprehension {k: coerce(v) for k, v in row.items()} rebuilds each row with values converted. csv.DictReader already gives us dicts keyed by header names; we just remap the values.

Step 2: write the records to JSON. pathlib.Path.write_text plus json.dumps is the one-liner version:

import json

def records_to_json(records, path):
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")

dst = work / "people.json"
records_to_json(csv_to_records(src), dst)
print(dst.read_text(encoding="utf-8"))
[
  {
    "name": "Alice",
    "age": 30,
    "score": 95.5
  },
  {
    "name": "Bob",
    "age": 25,
    "score": 87.0
  }
]

Using dumps + write_text (rather than json.dump with an open file) is fine for files small enough to fit in memory; for very large outputs you’d stream with dump(obj, f) and a with open block. Both forms appeared earlier in this chapter.

Step 3: a CLI wrapper with argparse and logging. A real script needs: a positional input path, an optional output path, a --verbose flag, and useful error messages. Argparse converts the strings to Path for us; logging gives us the silence knob:

import argparse
import logging

logger = logging.getLogger("csv2json")

def build_parser():
    p = argparse.ArgumentParser(description="Convert a CSV file to JSON.")
    p.add_argument("input", type=Path, help="input CSV file")
    p.add_argument("-o", "--output", type=Path,
                   help="output JSON file (default: <input>.json)")
    p.add_argument("-v", "--verbose", action="store_true",
                   help="emit DEBUG-level logs")
    return p

def main(argv):
    args = build_parser().parse_args(argv)
    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(levelname)s %(name)s: %(message)s",
        force=True,
    )
    output = args.output or args.input.with_suffix(".json")
    logger.debug("reading %s", args.input)
    try:
        records = csv_to_records(args.input)
    except FileNotFoundError:
        logger.error("input file not found: %s", args.input)
        return 1
    records_to_json(records, output)
    logger.info("wrote %d records to %s", len(records), output)
    return 0

# Demo: run as if invoked from the shell
main([str(src), "-v"])
DEBUG csv2json: reading /tmp/tmpmen1z_o4/people.csv
INFO csv2json: wrote 2 records to /tmp/tmpmen1z_o4/people.json
0

Three new pieces beyond what we’ve already built. args.input.with_suffix(".json") is a Path method that swaps the extension — the natural default output name. logger.debug("reading %s", args.input) uses %-style placeholders rather than an f-string: logging defers the formatting until after the level threshold is checked, so DEBUG lines don’t pay any string-formatting cost when the level is INFO. And main returns an exit code (0/1) so a real entry point would sys.exit(main(sys.argv[1:])).
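
For reference, the conventional closing lines of the script version, i.e. the entry-point guard that the demo call above stands in for (in a real file the import sys would sit at the top with the others):

import sys

if __name__ == "__main__":              # runs only when executed as a script, not on import
    sys.exit(main(sys.argv[1:]))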

The build is the entire chapter in motion: pathlib for paths and extension swapping, csv.DictReader for parsing, json.dumps + write_text for output, argparse for CLI, logging for diagnostics, with (inside csv_to_records) for resource lifecycle, and a try/except FileNotFoundError for graceful failure. Replace print with logger.info, ship it.

8.10 Exercises

  1. CSV → JSON. Write a script that reads a CSV file and writes the same data as JSON. Use csv.DictReader and json.dump(..., indent=2).

  2. Word count. Read a text file and print the 10 most common words using Counter. Strip punctuation; lowercase everything.

  3. pathlib glob. List every .py file in your project recursively using Path(".").glob("**/*.py"). Print each path’s parent and stem.

  4. with two files. Open an input and output file in a single with statement. Write the input to the output, uppercased.

  5. Forgotten newline="". On a Mac/Linux machine, write a CSV without newline="" and read it back. On Windows, the same code writes blank lines between rows. Why?

8.11 Summary

with open(...) as f: covers most file I/O. pathlib.Path covers paths. csv, json, argparse, and logging cover the most common surrounding concerns. The next chapter, Chapter 9, returns to language design — defining your own types — using everything we’ve built so far.