11  Standard Library Tour

Note: Core idea

Python’s “batteries included” philosophy: the standard library solves most common problems without installing anything extra. Before reaching for a third-party package, check what’s already there. It’s faster, more reliable, and more portable.

In this chapter you will learn to:

  1. Use collections for Counter, defaultdict, deque, and namedtuple.
  2. Use itertools for chain, islice, groupby, and accumulate.
  3. Use functools for lru_cache, cached_property, partial, reduce, and wraps.
  4. Use datetime for dates, times, and durations.
  5. Use re for regular expressions — and recognize when not to.

This is a tour, not an encyclopedia — pointers to the modules you’ll reach for repeatedly. Beazley’s Python Distilled §9–10 is the canonical reference for the broader stdlib (asyncio, threading, sockets, struct, tempfile, urllib, xml, …).

11.1 collections

from collections import Counter, defaultdict, deque, namedtuple

Counter counts occurrences — the canonical “tally things” data structure, replacing a hand-written dict-with-default loop:

words = "the cat sat on the mat the".split()
c = Counter(words)
[c["the"], c["missing"], c.most_common(2)]
[3, 0, [('the', 3), ('cat', 1)]]
  • Counter(words) walks the iterable once, returning a dict subclass that maps each unique element to its count.
  • c["the"] is 3 (it appears three times). c["missing"] returns 0, not KeyError — that’s Counter’s key superpower over a plain dict.
  • c.most_common(2) returns the two highest-frequency (item, count) tuples, sorted by count.

Counter arithmetic merges two counters — handy for combining tallies from different sources:

a = Counter("aabb")
b = Counter("abc")
[a + b, a - b, a["zzz"]]
[Counter({'a': 3, 'b': 3, 'c': 1}), Counter({'a': 1, 'b': 1}), 0]
  • A Counter built from a string counts each character: a = {"a": 2, "b": 2}, b = {"a": 1, "b": 1, "c": 1}.
  • a + b adds counts key by key: {"a": 3, "b": 3, "c": 1}.
  • a - b subtracts and drops zero or negative counts: {"a": 1, "b": 1}.
  • a["zzz"] is 0 — missing keys return 0, never raise.

The general rule: Counter for frequency tables, with +/- for merging and most_common(n) for ranked extraction.

defaultdict auto-creates a default value for missing keys — eliminates the “if key not in d: d[key] = []” boilerplate that comes up constantly when grouping or accumulating:

groups = defaultdict(list)
for name, team in [("Alice", "eng"), ("Bob", "design"), ("Carol", "eng")]:
    groups[team].append(name)
dict(groups)
{'eng': ['Alice', 'Carol'], 'design': ['Bob']}
  • defaultdict(list) is a dict subclass; the argument is the factory — a zero-argument callable that produces the default.
  • The first time groups["eng"] is accessed, the factory list() runs to produce [], which is stored under "eng".
  • .append("Alice") then mutates that list in place. The next groups["eng"] lookup finds the same list and appends to it.
  • dict(groups) is purely cosmetic — converts to a plain dict for the display.

The general rule: defaultdict(int) for counters; defaultdict(list) for grouping; defaultdict(set) for collecting unique items per key. Pick the factory that matches the per-key value type.
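
A minimal sketch of the set variant (the tags name and the data are invented): duplicate (page, tag) pairs collapse automatically because sets ignore repeated adds.

tags = defaultdict(set)
for page, tag in [("home", "nav"), ("home", "hero"), ("about", "nav"), ("home", "nav")]:
    tags[page].add(tag)
{page: sorted(t) for page, t in tags.items()}
{'home': ['hero', 'nav'], 'about': ['nav']}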

deque — fast append/pop from both ends. A list is fine for back-end operations but pays O(n) for any front-end work (pop(0) shifts every remaining element); a deque is O(1) at both ends.

from collections import deque
queue = deque(["a", "b", "c"])
queue.appendleft("z")
queue.popleft()
queue.append("d")
list(queue)
['a', 'b', 'c', 'd']
  • deque(["a", "b", "c"]) builds a deque from any iterable.
  • appendleft("z") adds to the front: ["z", "a", "b", "c"].
  • popleft() removes and returns the front element "z": leaves ["a", "b", "c"].
  • append("d") adds to the back: ["a", "b", "c", "d"].
  • list(queue) materialises the contents for display.

The general rule: when you need fast front-end operations (queues, sliding windows, undo stacks), use deque, not list.

deque(maxlen=N) is a bounded deque — appends past the limit silently drop from the other end. That’s exactly the data structure for a fixed-size sliding window:

window = deque(maxlen=3)
for x in [1, 2, 3, 4, 5]:
    window.append(x)
    print(list(window))
[1]
[1, 2]
[1, 2, 3]
[2, 3, 4]
[3, 4, 5]

Three-line sliding window, no manual bookkeeping.
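
One step further: the same bounded deque gives a moving average with no index bookkeeping. A minimal sketch (the averages list is just for display):

window = deque(maxlen=3)
averages = []
for x in [1, 2, 3, 4, 5]:
    window.append(x)
    averages.append(sum(window) / len(window))   # mean of at most the last three values
averages
[1.0, 1.5, 2.0, 3.0, 4.0]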

namedtuple — a factory for small, immutable record types. A plain tuple (3, 4) is anonymous (which one is x?). A dict gives names but is mutable, and a typo on assignment silently creates a new key instead of failing. namedtuple sits in between: each instance is still a tuple — immutable, hashable, indexable, unpacks the same way — but the fields also have names:

Point = namedtuple("Point", ["x", "y"])
p = Point(3, 4)
[p.x, p.y, p[0]]
[3, 4, 3]

Notice both forms of access work: p.x (by name) and p[0] (by position). Because instances are tuples, they work everywhere a tuple does — destructuring (x, y = p), match patterns, dict keys.

You’ll meet namedtuple constantly when reading other people’s code: many stdlib functions return them (os.stat, time.localtime, urllib.parse.urlparse). For new code that wants type hints or extra methods, prefer typing.NamedTuple (a typed namedtuple) or @dataclass(frozen=True). The full comparison is in Chapter 17.
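
As a taste of the typed form (the class and method names here are invented, not stdlib), typing.NamedTuple keeps all the tuple behaviour while adding annotations and methods:

from typing import NamedTuple

class TypedPoint(NamedTuple):
    x: float
    y: float

    def norm(self) -> float:
        return (self.x ** 2 + self.y ** 2) ** 0.5

TypedPoint(3, 4).norm()
5.0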

11.2 itertools

import itertools

chain — flatten multiple iterables:

list(itertools.chain([1, 2], [3, 4], [5, 6]))
[1, 2, 3, 4, 5, 6]

chain(a, b, c, ...) walks each input in turn and yields its elements one by one — never materialising any intermediate list. For three lists it produces [1, 2, 3, 4, 5, 6]. The lazy form pays off when the inputs are large or themselves lazy: chain(open("a.log"), open("b.log")) streams two files as one stream of lines, no concatenation cost. There’s also chain.from_iterable(iter_of_iters) for “flatten one level” when you have a list of lists already.
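
A quick sketch of the from_iterable form on a list of lists:

rows = [[1, 2], [3, 4], [5, 6]]
list(itertools.chain.from_iterable(rows))
[1, 2, 3, 4, 5, 6]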

islice — take a slice from any iterable (lazily):

list(itertools.islice(range(100), 5))
[0, 1, 2, 3, 4]

islice(iterable, stop) is iterable[:stop] for things that can’t be subscripted — generators, iterators, file objects, infinite sequences. The cell yields [0, 1, 2, 3, 4], the same as list(range(100))[:5] would, but without building a 100-element list first. Three-argument form: islice(it, start, stop); four-argument: islice(it, start, stop, step).
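
A sketch of the four-argument form, taking every third element from index 10 up to (but not including) 20:

list(itertools.islice(range(100), 10, 20, 3))
[10, 13, 16, 19]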

cycle — repeat forever (combine with islice to stop):

list(itertools.islice(itertools.cycle(["red", "green", "blue"]), 7))
['red', 'green', 'blue', 'red', 'green', 'blue', 'red']

cycle(iterable) yields the elements over and over — red, green, blue, red, green, blue, red, ... forever. By itself it would never terminate; combined with islice it gives you “the next N values, wrapping around.” Useful for round-robin assignment, alternating colours in a chart, repeating a fallback list. The cell takes the first 7, producing ['red', 'green', 'blue', 'red', 'green', 'blue', 'red'] — two full cycles plus the first colour again.
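
A sketch of the round-robin case (the task and worker names are invented). zip stops when the finite tasks list runs out, so pairing it with the infinite cycle is safe:

tasks = ["t1", "t2", "t3", "t4", "t5"]
workers = itertools.cycle(["alice", "bob"])
list(zip(tasks, workers))
[('t1', 'alice'), ('t2', 'bob'), ('t3', 'alice'), ('t4', 'bob'), ('t5', 'alice')]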

groupby — group consecutive equal items. The trap: it does not pre-sort the input, so unsorted data gives you one group per run, not one per distinct value. Two cells make the rule visible.

Step 1: the trap. Run groupby on unsorted data and you get a fresh group every time the key changes — including duplicates:

unsorted = ["alice", "bob", "adam", "carol", "charlie"]
[(letter, list(g)) for letter, g in itertools.groupby(unsorted, key=lambda x: x[0])]
[('a', ['alice']),
 ('b', ['bob']),
 ('a', ['adam']),
 ('c', ['carol', 'charlie'])]
  • bob interrupts the a run, so a shows up as two separate groups instead of one.

Step 2: sort by the same key first. With the input sorted, groupby collapses each distinct key into a single group:

data = sorted(unsorted, key=lambda x: x[0])
[(letter, list(g)) for letter, g in itertools.groupby(data, key=lambda x: x[0])]
[('a', ['alice', 'adam']), ('b', ['bob']), ('c', ['carol', 'charlie'])]
  • sorted(..., key=lambda x: x[0]) sorts by first letter — required preprocessing.
  • itertools.groupby(data, key=...) walks the sorted list and yields (key, group_iterator) pairs whenever the key changes.
  • list(g) materialises each group iterator before moving on — necessary because groupby reuses internal state, and the previous group iterator becomes invalid the moment you advance.

The general rule: groupby is the SQL-style group-by, but you sort first, and you consume each group before moving on.

accumulate — running totals:

list(itertools.accumulate([1, 2, 3, 4, 5]))
[1, 3, 6, 10, 15]

The default operator is +, so the output is [1, 3, 6, 10, 15] — each element is the sum of the inputs up to and including that position. Element 0 is just 1 (the first input); element 1 is 1 + 2 = 3; element 2 is 3 + 3 = 6; and so on. This is what spreadsheets call a “running total.” Pass a different operator to fold differently — itertools.accumulate(xs, operator.mul) gives a running product, itertools.accumulate(xs, max) gives the running maximum.
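
A sketch of those two variants, running product and running maximum:

import operator
[list(itertools.accumulate([1, 2, 3, 4, 5], operator.mul)),
 list(itertools.accumulate([3, 1, 4, 1, 5], max))]
[[1, 2, 6, 24, 120], [3, 3, 4, 4, 5]]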

combinations, permutations, product — combinatorial generators:

[
    list(itertools.combinations("ABCD", 2)),
    list(itertools.permutations("ABC", 2))[:3],
    list(itertools.product("AB", range(2))),
]
[[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')],
 [('A', 'B'), ('A', 'C'), ('B', 'A')],
 [('A', 0), ('A', 1), ('B', 0), ('B', 1)]]

Three different combinatorial questions:

  • combinations("ABCD", 2) yields unordered pairs without repetition: ('A','B'), ('A','C'), ('A','D'), ('B','C'), ('B','D'), ('C','D') — six pairs. “Unordered” means ('A','B') and ('B','A') count as the same pair, so only one is yielded. C(4, 2) = 6.
  • permutations("ABC", 2) yields ordered pairs without repetition: ('A','B'), ('A','C'), ('B','A'), ('B','C'), ('C','A'), ('C','B') — six pairs (we slice the first three). Order matters, so ('A','B') and ('B','A') are both included. P(3, 2) = 6.
  • product("AB", range(2)) is the cartesian product: every combination from each iterable, with order and repetition. ('A',0), ('A',1), ('B',0), ('B',1) — four tuples. The same as a nested for-loop.

Use combinations when “choose 2 of these” matters (poker hands, lottery tickets); permutations when “in what order” matters (rankings, password attempts); product when iterating over a grid of independent choices (color × size, country × year).
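
A sketch of the grid case, generating every colour and size variant (the values are invented):

[f"{color}-{size}" for color, size in itertools.product(["red", "blue"], ["S", "M", "L"])]
['red-S', 'red-M', 'red-L', 'blue-S', 'blue-M', 'blue-L']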

11.3 functools

import functools

lru_cache / cache — memoize a pure function’s results. The textbook recursive Fibonacci is exponentially slow because it recomputes the same fib(n-2) constantly; one decorator turns it linear.

@functools.lru_cache(maxsize=None)
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

[fib(50), fib.cache_info()]
[12586269025, CacheInfo(hits=48, misses=51, maxsize=None, currsize=51)]
  • @functools.lru_cache(maxsize=None) wraps fib in a memoizing layer: each unique call’s result is stashed in a dict keyed by arguments.
  • maxsize=None makes the cache unbounded; a number would cap it at LRU (least-recently-used) eviction.
  • The first fib(50) populates the cache; further calls would be free.
  • fib.cache_info() reports hits, misses, maxsize, currsize — useful for verifying the cache is doing its job.

The general rule: @lru_cache (or @functools.cache in 3.9+, which is the unbounded form) is a one-line speed-up for any pure function — same inputs always give the same output, no side effects.

cached_property — compute once per instance, then read like an attribute:

import statistics

class DataSet:
    def __init__(self, data):
        self._data = data

    @functools.cached_property
    def stats(self):
        return {
            "mean": statistics.mean(self._data),
            "median": statistics.median(self._data),
            "stdev": statistics.stdev(self._data),
        }

ds = DataSet([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
ds.stats   # computed once
{'mean': 5.5, 'median': 5.5, 'stdev': 3.0276503540974917}
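
To see the “once per instance” part, check identity: the second access returns the very same dict object rather than recomputing.

ds.stats is ds.stats   # second access reads the cached value
True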

partial — pre-fill some of a function’s arguments, returning a new callable that takes the rest. Useful when you have a general function and want a specialised version for repeated use.

def power(base, exp):
    return base ** exp

square = functools.partial(power, exp=2)
cube = functools.partial(power, exp=3)
[square(4), cube(3)]
[16, 27]
  • functools.partial(power, exp=2) returns a new callable equivalent to lambda base: power(base, exp=2), that is, power with exp already bound to 2.
  • square(4) calls the partial with base=4, getting power(4, exp=2) = 16.
  • cube(3) is the same shape with exp=3, returning 27.

The general rule: partial(fn, *args, **kwargs) freezes those arguments and returns a function that takes the remaining ones. Cleaner than wrapping in a lambda for the same purpose.
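
partial works with any callable, not just functions you wrote. A sketch pre-filling int’s base argument (parse_hex is an invented name):

parse_hex = functools.partial(int, base=16)
[parse_hex("ff"), parse_hex("1a")]
[255, 26]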

reduce — fold a sequence into a single value:

import operator
[functools.reduce(operator.add, [1, 2, 3, 4, 5]),
 functools.reduce(operator.mul, [1, 2, 3, 4, 5])]
[15, 120]

The operator module provides function versions of every Python operator — operator.add(a, b) is a + b, operator.mul(a, b) is a * b, operator.itemgetter(0) returns a callable equivalent to lambda x: x[0]. They’re useful exactly here: where a higher-order function (reduce, map, sorted(..., key=...)) needs a function value rather than an inline operator.
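
A sketch of itemgetter as a sort key, ordering pairs by their second element:

pairs = [("pear", 5), ("apple", 2), ("plum", 9)]
sorted(pairs, key=operator.itemgetter(1))
[('apple', 2), ('pear', 5), ('plum', 9)]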

wraps — preserve a function’s name and docstring inside a decorator. We saw this in Chapter 6; the deep treatment is Chapter 21.

The statistics module (used by cached_property above) provides the small set of descriptive-stats functions you’d otherwise compute by hand: mean, median, mode, stdev, variance, plus quantiles for percentile-style cut points. For plain numerical means, statistics.fmean (Python 3.8+) is faster than statistics.mean because it converts to float up front instead of preserving exact Fraction arithmetic. Reach for mean only when you need exactness across Decimal or Fraction data:

import statistics
[statistics.fmean([1, 2, 3, 4, 5]), statistics.fmean(range(1_000_000))]
[3.0, 499999.5]

11.4 datetime

Time is fiddly: time zones, daylight saving time, leap years, the difference between “an instant” and “a date on a calendar”. The datetime module gives you four core types — datetime, date, timedelta, timezone — that together cover almost every time-handling task.

from datetime import datetime, date, timedelta, timezone

now = datetime.now(tz=timezone.utc)
today = date.today()
[now.year, today.isoformat()]
[2026, '2026-05-11']
  • datetime is a full timestamp (date + time + optional timezone). date is a calendar date with no time component.
  • datetime.now(tz=timezone.utc) returns the current instant as a timezone-aware datetime in UTC. Always pass tz= — naive datetimes are a long-running source of bugs.
  • date.today() returns the current date.
  • now.year reads a component (int); today.isoformat() is the standard "YYYY-MM-DD" text form.

Arithmetic uses timedelta — a timedelta is a duration, not a point in time:

yesterday = now - timedelta(days=1)
next_week = now + timedelta(weeks=1)
[yesterday.date(), next_week.date()]
[datetime.date(2026, 5, 10), datetime.date(2026, 5, 18)]
  • timedelta(days=1) is “one day”. Subtracting it from a datetime shifts the timestamp back 24 hours.
  • timedelta(weeks=1) is “seven days”; adding shifts forward.
  • .date() extracts just the date portion of a datetime for compact display.

The general rule: datetime + timedelta = datetime; datetime - datetime = timedelta. Stick to timezone-aware datetimes and let timedelta do all the arithmetic.
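
A sketch of the subtraction direction, with invented timestamps: the difference of two aware datetimes is a timedelta, convertible to any unit you like:

start = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 15, 17, 30, tzinfo=timezone.utc)
elapsed = end - start
[elapsed, elapsed.total_seconds() / 3600]   # 8.5 hours
[datetime.timedelta(seconds=30600), 8.5]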

Formatting and parsing use strftime / strptime:

[
    now.strftime("%Y-%m-%d %H:%M:%S"),
    datetime.strptime("2024-01-15", "%Y-%m-%d"),
    datetime.fromisoformat("2024-01-15T12:30:00"),
]
['2026-05-11 23:52:23',
 datetime.datetime(2024, 1, 15, 0, 0),
 datetime.datetime(2024, 1, 15, 12, 30)]

Three different conversions, two directions:

  • now.strftime(fmt) (“string-format-time”): turn a datetime into a string. The format codes are POSIX-style: %Y is the four-digit year, %m the two-digit month, %d the two-digit day, %H:%M:%S hour:minute:second on a 24-hour clock. Other useful codes: %A weekday name, %B month name, %j day of year. The cell renders '2026-05-11 23:52:23'.
  • datetime.strptime(text, fmt) (“string-parse-time”): the inverse direction. Read a string against a format and produce a datetime. Useful for legacy or non-ISO formats. The result is naive (no timezone) unless the format contains %z.
  • datetime.fromisoformat(text) — the modern parser for the ISO-8601 format YYYY-MM-DDTHH:MM:SS. Faster and more permissive than strptime("%Y-%m-%dT%H:%M:%S"), and round-trips with .isoformat(). Reach for this whenever the input is ISO-8601 (which it should be, for any new code or API).
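
A sketch of the aware case: an ISO string that carries a UTC offset parses straight to a timezone-aware datetime.

aware = datetime.fromisoformat("2024-01-15T12:30:00+00:00")
[aware.tzinfo, aware.hour]
[datetime.timezone.utc, 12]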

For anything time-zone-sensitive, always include tzinfo. Use timezone.utc for UTC; for named zones, use zoneinfo.ZoneInfo (Python 3.9+) — backed by the system’s IANA database, so it handles DST correctly:

from zoneinfo import ZoneInfo

london = datetime(2024, 6, 1, 12, 0, tzinfo=ZoneInfo("Europe/London"))
new_york = london.astimezone(ZoneInfo("America/New_York"))
[london.isoformat(), new_york.isoformat()]
['2024-06-01T12:00:00+01:00', '2024-06-01T07:00:00-04:00']

astimezone converts; the underlying instant is unchanged. Always store and pass datetimes as timezone-aware — naive datetimes are the source of nearly every “off by an hour twice a year” bug.

11.5 re — regular expressions

Regex is powerful but easy to overuse. Reach for it when patterns are truly irregular — phone numbers, dates of varying formats, log-line shapes. For fixed strings, .startswith() and .split() are simpler and faster.

import re

m = re.search(r"(\d+)-(\d+)", "phone: 123-456")
[m.group(), m.group(1), m.group(2)]
['123-456', '123', '456']
  • r"(\d+)-(\d+)" is the pattern — \d+ matches one or more digits, the parentheses create capture groups, and - matches a literal hyphen.
  • re.search(pattern, text) scans for the first place the pattern matches anywhere in text. Returns a match object (or None).
  • m.group() is the entire match: "123-456".
  • m.group(1) is the first captured group (the first \d+): "123".
  • m.group(2) is the second captured group: "456".

The general rule: parentheses capture, \d/\w/\s are character classes, +/*/? are repetition. Always use raw strings (r"...") for patterns — \d and \w would otherwise need double backslashes.

The core entry points:

  • re.search(pattern, text) — first match anywhere.
  • re.match(pattern, text) — match at the start.
  • re.fullmatch(pattern, text) — match the entire string.
  • re.findall(pattern, text) — list of every match.
  • re.sub(pattern, repl, text) — replace every match.
  • re.split(pattern, text) — split on the pattern.
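
A sketch of the last two on a string with digit-run separators (the sample text is invented):

text = "one1two22three"
[re.split(r"\d+", text), re.sub(r"\d+", "-", text)]
[['one', 'two', 'three'], 'one-two-three']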

Compile patterns you’ll reuse:

phone = re.compile(r"\b(\d{3})-(\d{3,4})\b")
phone.findall("call 555-1234 or 555-5678 today")
[('555', '1234'), ('555', '5678')]

Named groups make the resulting matches self-documenting:

m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2024-01-15")
m.groupdict()
{'year': '2024', 'month': '01', 'day': '15'}
Tip: Why this matters

The standard library is the largest, best-tested Python codebase in existence. Knowing what’s already there is a force multiplier: Counter replaces a dozen lines of if/else; lru_cache replaces a manual cache; pathlib replaces hand-rolled string manipulation. The first question for any new task should be “is there already a stdlib module for this?”

Note: Further reading

Beazley, Python Distilled §9–10 surveys the rest of the standard library — socket, select, asyncio, threading, tempfile, struct, urllib, xml, inspect, and more — with example-driven notes that complement this chapter’s tour.

11.6 Build: a tiny log-line analyzer

A canonical “I just need to grok these logs” task: parse some lines, count status codes, and break it down by date. Five stdlib tools from four modules earn a slot — re for parsing, collections.Counter and defaultdict for tallies, itertools.groupby for the per-day breakdown, and datetime for date validation.

Step 1: parse each line with a named-group regex. A real-world log line has fields the program needs as separate values; re with named groups gives them to you as a dict:

import re

LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) "
    r"(?P<status>\d+) "
    r"(?P<path>\S+)"
)

raw = """\
2024-01-15 10:30:01 ERROR 500 /api/users
2024-01-15 10:30:05 INFO 200 /api/products
2024-01-15 10:31:02 ERROR 404 /api/users
2024-01-16 09:15:00 ERROR 500 /api/orders
2024-01-16 09:15:30 ERROR 500 /api/users
"""

events = [LINE.match(line).groupdict() for line in raw.strip().splitlines()]
events[0]
{'date': '2024-01-15',
 'time': '10:30:01',
 'level': 'ERROR',
 'status': '500',
 'path': '/api/users'}

(?P<name>...) defines a named capture group; after a successful match, m.groupdict() returns a dict keyed by those names — much friendlier than m.group(1), m.group(2), etc. Compiling the pattern once at module level (re.compile(...)) avoids re-compiling it for every line.

Step 2: tally status codes (Counter) and group request paths by status (defaultdict). Two of the chapter’s collections types, applied to the parsed events:

from collections import Counter, defaultdict

status_counts = Counter(int(e["status"]) for e in events)

paths_by_status = defaultdict(list)
for e in events:
    paths_by_status[int(e["status"])].append(e["path"])

[status_counts.most_common(),
 dict(paths_by_status)]
[[(500, 3), (200, 1), (404, 1)],
 {500: ['/api/users', '/api/orders', '/api/users'],
  200: ['/api/products'],
  404: ['/api/users']}]

Counter(int(e["status"]) for e in events) walks the events with a generator expression — no intermediate list — and tallies. paths_by_status is a defaultdict(list) so each status code’s list of paths is auto-created on first access.

Step 3: per-day breakdown using itertools.groupby and datetime for validation. groupby needs the input sorted by the same key — that was the trap from earlier in the chapter. We sort by date, then group, and along the way validate that each date is a real calendar date:

import itertools
from datetime import date

def by_date(event):
    return event["date"]

events_sorted = sorted(events, key=by_date)

per_day = {}
for day, group in itertools.groupby(events_sorted, key=by_date):
    parsed = date.fromisoformat(day)         # raises ValueError on a bad date
    rows = list(group)
    per_day[parsed.isoformat()] = {
        "n": len(rows),
        "errors": sum(1 for r in rows if r["level"] == "ERROR"),
    }
per_day
{'2024-01-15': {'n': 3, 'errors': 2}, '2024-01-16': {'n': 2, 'errors': 2}}

itertools.groupby(events_sorted, key=by_date) yields (day, group_iterator) pairs whenever the key changes. list(group) materialises the iterator before we move on — the same rule we hit in the groupby section. date.fromisoformat validates that "2024-01-15" is a real date (a typo like "2024-13-15" would raise ValueError); we don’t actually use parsed further, but it lets the loop fail loudly on bad input rather than silently letting a typo through.

The build threads five tools through one pipeline: re for the parse, Counter for the rank-by-frequency tally, defaultdict for the grouped lists, itertools.groupby for the SQL-style per-day breakdown, and datetime for shape-validation. None of it is more than ten lines of code per step — and all five read like the prose description of what we’re doing.

11.7 Exercises

  1. Counter for word frequency. Read a text file and print the 10 most common words. Strip punctuation and lowercase everything before counting.

  2. groupby after sorting. Given a list of (date, event) tuples, group events by date. Why must you sort first?

  3. lru_cache speedup. Time fib(35) with and without @lru_cache. Predict the speedup.

  4. partial for sort keys. Use functools.partial to construct a sort key that sorts strings by their n-th character, where n is configurable.

  5. Regex pitfall. Predict and verify: re.search(r"a+", "") vs. re.search(r"a*", ""). Why does the second one match?

11.8 Summary

collections, itertools, functools, datetime, re — five modules cover most everyday tasks. The next chapter, Chapter 12, brings everything together: how to structure a script, how to add tests, and the Pythonic idioms that separate working code from clean working code.