11  Standard Library Tour

Note: Core idea

Python’s “batteries included” philosophy: the standard library solves most common problems without installing anything extra. Before reaching for a third-party package, check what’s already there. It’s faster, more reliable, and more portable.

In this chapter you will learn to:

  1. Use collections for Counter, defaultdict, deque, and namedtuple.
  2. Use itertools for chain, islice, groupby, and accumulate.
  3. Use functools for lru_cache, cached_property, partial, reduce, and wraps.
  4. Use datetime for dates, times, and durations.
  5. Use re for regular expressions — and recognize when not to.

This is a tour, not an encyclopedia — pointers to the modules you’ll reach for repeatedly. Beazley’s Python Distilled §9–10 is the canonical reference for the broader stdlib (asyncio, threading, sockets, struct, tempfile, urllib, xml, …).

11.1 collections

from collections import Counter, defaultdict, deque, namedtuple

Counter counts occurrences — the canonical “tally things” data structure, replacing a hand-written dict-with-default loop:

words = "the cat sat on the mat the".split()
c = Counter(words)
[c["the"], c["missing"], c.most_common(2)]
[3, 0, [('the', 3), ('cat', 1)]]
  • Counter(words) walks the iterable once, returning a dict subclass that maps each unique element to its count.
  • c["the"] is 3 (it appears three times). c["missing"] returns 0, not KeyError — that’s Counter’s key superpower over a plain dict.
  • c.most_common(2) returns the two highest-frequency (item, count) tuples, sorted by count.

Counter arithmetic merges two counters — handy for combining tallies from different sources:

a = Counter("aabb")
b = Counter("abc")
[a + b, a - b, a["zzz"]]
[Counter({'a': 3, 'b': 3, 'c': 1}), Counter({'a': 1, 'b': 1}), 0]
  • A Counter built from a string counts each character: a = {"a": 2, "b": 2}, b = {"a": 1, "b": 1, "c": 1}.
  • a + b adds counts key by key: {"a": 3, "b": 3, "c": 1}.
  • a - b subtracts and drops zero or negative counts: {"a": 1, "b": 1}.
  • a["zzz"] is 0 — missing keys return 0, never raise.

The general rule: Counter for frequency tables, with +/- for merging and most_common(n) for ranked extraction.

defaultdict auto-creates a default value for missing keys — eliminates the “if key not in d: d[key] = []” boilerplate that comes up constantly when grouping or accumulating:

groups = defaultdict(list)
for name, team in [("Alice", "eng"), ("Bob", "design"), ("Carol", "eng")]:
    groups[team].append(name)
dict(groups)
{'eng': ['Alice', 'Carol'], 'design': ['Bob']}
  • defaultdict(list) is a dict subclass; the argument is the factory — a zero-argument callable that produces the default.
  • The first time groups["eng"] is accessed, the factory list() runs to produce [], which is stored under "eng".
  • .append("Alice") then mutates that list in place. The next groups["eng"] lookup finds the same list and appends to it.
  • dict(groups) is purely cosmetic — converts to a plain dict for the display.

The general rule: defaultdict(int) for counters; defaultdict(list) for grouping; defaultdict(set) for collecting unique items per key. Pick the factory that matches the per-key value type.
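
A minimal sketch of the set variant (the tags name and the data are invented): duplicate (page, tag) pairs collapse automatically because sets ignore repeated adds.

tags = defaultdict(set)
for page, tag in [("home", "nav"), ("home", "hero"), ("about", "nav"), ("home", "nav")]:
    tags[page].add(tag)
{page: sorted(t) for page, t in tags.items()}
{'home': ['hero', 'nav'], 'about': ['nav']}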

deque — fast append/pop from both ends. A list is fine for back-end operations but pays O(n) for any front-end work (pop(0) shifts every remaining element); a deque is O(1) at both ends.

from collections import deque
queue = deque(["a", "b", "c"])
queue.appendleft("z")
queue.popleft()
queue.append("d")
list(queue)
['a', 'b', 'c', 'd']
  • deque(["a", "b", "c"]) builds a deque from any iterable.
  • appendleft("z") adds to the front: ["z", "a", "b", "c"].
  • popleft() removes and returns the front element "z": leaves ["a", "b", "c"].
  • append("d") adds to the back: ["a", "b", "c", "d"].
  • list(queue) materialises the contents for display.

The general rule: when you need fast front-end operations (queues, sliding windows, undo stacks), use deque, not list.

deque(maxlen=N) is a bounded deque — appends past the limit silently drop from the other end. That’s exactly the data structure for a fixed-size sliding window:

window = deque(maxlen=3)
for x in [1, 2, 3, 4, 5]:
    window.append(x)
    print(list(window))
[1]
[1, 2]
[1, 2, 3]
[2, 3, 4]
[3, 4, 5]

Three-line sliding window, no manual bookkeeping.
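
One step further: the same bounded deque gives a moving average with no index bookkeeping. A minimal sketch (the averages list is just for display):

window = deque(maxlen=3)
averages = []
for x in [1, 2, 3, 4, 5]:
    window.append(x)
    averages.append(sum(window) / len(window))   # mean of at most the last three values
averages
[1.0, 1.5, 2.0, 3.0, 4.0]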

namedtuple — a factory for small, immutable record types. A plain tuple (3, 4) is anonymous (which one is x?). A dict gives names but is mutable, and a typo on assignment silently creates a new key instead of failing. namedtuple sits in between: each instance is still a tuple — immutable, hashable, indexable, unpacks the same way — but the fields also have names:

Point = namedtuple("Point", ["x", "y"])
p = Point(3, 4)
[p.x, p.y, p[0]]
[3, 4, 3]

Notice both forms of access work: p.x (by name) and p[0] (by position). Because instances are tuples, they work everywhere a tuple does — destructuring (x, y = p), match patterns, dict keys.

You’ll meet namedtuple constantly when reading other people’s code: many stdlib functions return them (os.stat, time.localtime, urllib.parse.urlparse). For new code that wants type hints or extra methods, prefer typing.NamedTuple (a typed namedtuple) or @dataclass(frozen=True). The full comparison is in Chapter 17.
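
As a taste of the typed form (the class and method names here are invented, not stdlib), typing.NamedTuple keeps all the tuple behaviour while adding annotations and methods:

from typing import NamedTuple

class TypedPoint(NamedTuple):
    x: float
    y: float

    def norm(self) -> float:
        return (self.x ** 2 + self.y ** 2) ** 0.5

TypedPoint(3, 4).norm()
5.0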

11.2 itertools

import itertools

chain — flatten multiple iterables:

list(itertools.chain([1, 2], [3, 4], [5, 6]))
[1, 2, 3, 4, 5, 6]

chain(a, b, c, ...) walks each input in turn and yields its elements one by one — never materialising any intermediate list. For three lists it produces [1, 2, 3, 4, 5, 6]. The lazy form pays off when the inputs are large or themselves lazy: chain(open("a.log"), open("b.log")) streams two files as one stream of lines, no concatenation cost. There’s also chain.from_iterable(iter_of_iters) for “flatten one level” when you have a list of lists already.
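
A quick sketch of the from_iterable form on a list of lists:

rows = [[1, 2], [3, 4], [5, 6]]
list(itertools.chain.from_iterable(rows))
[1, 2, 3, 4, 5, 6]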

islice — take a slice from any iterable (lazily):

list(itertools.islice(range(100), 5))
[0, 1, 2, 3, 4]

islice(iterable, stop) is iterable[:stop] for things that can’t be subscripted — generators, iterators, file objects, infinite sequences. The cell yields [0, 1, 2, 3, 4], the same as list(range(100))[:5] would, but without building a 100-element list first. Three-argument form: islice(it, start, stop); four-argument: islice(it, start, stop, step).
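
A sketch of the four-argument form, taking every third element from index 10 up to (but not including) 20:

list(itertools.islice(range(100), 10, 20, 3))
[10, 13, 16, 19]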

cycle — repeat forever (combine with islice to stop):

list(itertools.islice(itertools.cycle(["red", "green", "blue"]), 7))
['red', 'green', 'blue', 'red', 'green', 'blue', 'red']

cycle(iterable) yields the elements over and over — red, green, blue, red, green, blue, red, ... forever. By itself it would never terminate; combined with islice it gives you “the next N values, wrapping around.” Useful for round-robin assignment, alternating colours in a chart, repeating a fallback list. The cell takes the first 7, producing ['red', 'green', 'blue', 'red', 'green', 'blue', 'red'] — two full cycles plus the first colour again.
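
A sketch of the round-robin case (the task and worker names are invented). zip stops when the finite tasks list runs out, so pairing it with the infinite cycle is safe:

tasks = ["t1", "t2", "t3", "t4", "t5"]
workers = itertools.cycle(["alice", "bob"])
list(zip(tasks, workers))
[('t1', 'alice'), ('t2', 'bob'), ('t3', 'alice'), ('t4', 'bob'), ('t5', 'alice')]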

groupby — group consecutive equal items. The trap: it does not pre-sort the input, so unsorted data gives you one group per run, not one per distinct value. Two cells make the rule visible.

Step 1: the trap. Run groupby on unsorted data and you get a fresh group every time the key changes — including duplicates:

unsorted = ["alice", "bob", "adam", "carol", "charlie"]
[(letter, list(g)) for letter, g in itertools.groupby(unsorted, key=lambda x: x[0])]
[('a', ['alice']),
 ('b', ['bob']),
 ('a', ['adam']),
 ('c', ['carol', 'charlie'])]
  • bob interrupts the a run, so a shows up as two separate groups instead of one.

Step 2: sort by the same key first. With the input sorted, groupby collapses each distinct key into a single group:

data = sorted(unsorted, key=lambda x: x[0])
[(letter, list(g)) for letter, g in itertools.groupby(data, key=lambda x: x[0])]
[('a', ['alice', 'adam']), ('b', ['bob']), ('c', ['carol', 'charlie'])]
  • sorted(..., key=lambda x: x[0]) sorts by first letter — required preprocessing.
  • itertools.groupby(data, key=...) walks the sorted list and yields (key, group_iterator) pairs whenever the key changes.
  • list(g) materialises each group iterator before moving on — necessary because groupby reuses internal state, and the previous group iterator becomes invalid the moment you advance.

The general rule: groupby is the SQL-style group-by, but you sort first, and you consume each group before moving on.

accumulate — running totals:

list(itertools.accumulate([1, 2, 3, 4, 5]))
[1, 3, 6, 10, 15]

The default operator is +, so the output is [1, 3, 6, 10, 15] — each element is the sum of the inputs up to and including that position. Element 0 is just 1 (the first input); element 1 is 1 + 2 = 3; element 2 is 3 + 3 = 6; and so on. This is what spreadsheets call a “running total.” Pass a different operator to fold differently — itertools.accumulate(xs, operator.mul) gives a running product, itertools.accumulate(xs, max) gives the running maximum.
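
A sketch of those two variants, running product and running maximum:

import operator
[list(itertools.accumulate([1, 2, 3, 4, 5], operator.mul)),
 list(itertools.accumulate([3, 1, 4, 1, 5], max))]
[[1, 2, 6, 24, 120], [3, 3, 4, 4, 5]]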

combinations, permutations, product — combinatorial generators:

[
    list(itertools.combinations("ABCD", 2)),
    list(itertools.permutations("ABC", 2))[:3],
    list(itertools.product("AB", range(2))),
]
[[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')],
 [('A', 'B'), ('A', 'C'), ('B', 'A')],
 [('A', 0), ('A', 1), ('B', 0), ('B', 1)]]

Three different combinatorial questions:

  • combinations("ABCD", 2) yields unordered pairs without repetition: ('A','B'), ('A','C'), ('A','D'), ('B','C'), ('B','D'), ('C','D') — six pairs. “Unordered” means ('A','B') and ('B','A') count as the same pair, so only one is yielded. C(4, 2) = 6.
  • permutations("ABC", 2) yields ordered pairs without repetition: ('A','B'), ('A','C'), ('B','A'), ('B','C'), ('C','A'), ('C','B') — six pairs (we slice the first three). Order matters, so ('A','B') and ('B','A') are both included. P(3, 2) = 6.
  • product("AB", range(2)) is the cartesian product: every combination from each iterable, with order and repetition. ('A',0), ('A',1), ('B',0), ('B',1) — four tuples. The same as a nested for-loop.

Use combinations when “choose 2 of these” matters (poker hands, lottery tickets); permutations when “in what order” matters (rankings, password attempts); product when iterating over a grid of independent choices (color × size, country × year).
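
A sketch of the grid case, generating every colour and size variant (the values are invented):

[f"{color}-{size}" for color, size in itertools.product(["red", "blue"], ["S", "M", "L"])]
['red-S', 'red-M', 'red-L', 'blue-S', 'blue-M', 'blue-L']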

11.3 functools

import functools

lru_cache / cache — memoize a pure function’s results. The textbook recursive Fibonacci is exponentially slow because it recomputes the same fib(n-2) constantly; one decorator turns it linear.

@functools.lru_cache(maxsize=None)
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

[fib(50), fib.cache_info()]
[12586269025, CacheInfo(hits=48, misses=51, maxsize=None, currsize=51)]
  • @functools.lru_cache(maxsize=None) wraps fib in a memoizing layer: each unique call’s result is stashed in a dict keyed by arguments.
  • maxsize=None makes the cache unbounded; a number would cap it at LRU (least-recently-used) eviction.
  • The first fib(50) populates the cache; further calls would be free.
  • fib.cache_info() reports hits, misses, maxsize, currsize — useful for verifying the cache is doing its job.

The general rule: @lru_cache (or @functools.cache in 3.9+, which is the unbounded form) is a one-line speed-up for any pure function — same inputs always give the same output, no side effects.

cached_property — compute once per instance, then read like an attribute:

import statistics

class DataSet:
    def __init__(self, data):
        self._data = data

    @functools.cached_property
    def stats(self):
        return {
            "mean": statistics.mean(self._data),
            "median": statistics.median(self._data),
            "stdev": statistics.stdev(self._data),
        }

ds = DataSet([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
ds.stats   # computed once
{'mean': 5.5, 'median': 5.5, 'stdev': 3.0276503540974917}
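
To see the “once per instance” part, check identity: the second access returns the very same dict object rather than recomputing.

ds.stats is ds.stats   # second access reads the cached value
True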

partial — pre-fill some of a function’s arguments, returning a new callable that takes the rest. Useful when you have a general function and want a specialised version for repeated use.

def power(base, exp):
    return base ** exp

square = functools.partial(power, exp=2)
cube = functools.partial(power, exp=3)
[square(4), cube(3)]
[16, 27]
  • functools.partial(power, exp=2) returns a new callable equivalent to lambda base: power(base, exp=2), that is, power with exp already bound to 2.
  • square(4) calls the partial with base=4, getting power(4, exp=2) = 16.
  • cube(3) is the same shape with exp=3, returning 27.

The general rule: partial(fn, *args, **kwargs) freezes those arguments and returns a function that takes the remaining ones. Cleaner than wrapping in a lambda for the same purpose.
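
partial works with any callable, not just functions you wrote. A sketch pre-filling int’s base argument (parse_hex is an invented name):

parse_hex = functools.partial(int, base=16)
[parse_hex("ff"), parse_hex("1a")]
[255, 26]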

reduce — fold a sequence into a single value:

import operator
[functools.reduce(operator.add, [1, 2, 3, 4, 5]),
 functools.reduce(operator.mul, [1, 2, 3, 4, 5])]
[15, 120]

The operator module provides function versions of every Python operator — operator.add(a, b) is a + b, operator.mul(a, b) is a * b, operator.itemgetter(0) returns a callable equivalent to lambda x: x[0]. They’re useful exactly here: where a higher-order function (reduce, map, sorted(..., key=...)) needs a function value rather than an inline operator.
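
A sketch of itemgetter as a sort key, ordering pairs by their second element:

pairs = [("pear", 5), ("apple", 2), ("plum", 9)]
sorted(pairs, key=operator.itemgetter(1))
[('apple', 2), ('pear', 5), ('plum', 9)]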

wraps — preserve a function’s name and docstring inside a decorator. We saw this in Chapter 6; the deep treatment is Chapter 21.

The statistics module (used by cached_property above) provides the small set of descriptive-stats functions you’d otherwise compute by hand: mean, median, mode, stdev, variance, plus quantiles for percentile-style cut points. For plain numerical means, statistics.fmean (Python 3.8+) is faster than statistics.mean because it converts to float up front instead of preserving exact Fraction arithmetic. Reach for mean only when you need exactness across Decimal or Fraction data:

import statistics
[statistics.fmean([1, 2, 3, 4, 5]), statistics.fmean(range(1_000_000))]
[3.0, 499999.5]

11.4 datetime

Time is fiddly: time zones, daylight saving time, leap years, the difference between “an instant” and “a date on a calendar”. The datetime module gives you four core types — datetime, date, timedelta, timezone — that together cover almost every time-handling task.

from datetime import datetime, date, timedelta, timezone

now = datetime.now(tz=timezone.utc)
today = date.today()
[now.year, today.isoformat()]
[2026, '2026-05-11']
  • datetime is a full timestamp (date + time + optional timezone). date is a calendar date with no time component.
  • datetime.now(tz=timezone.utc) returns the current instant as a timezone-aware datetime in UTC. Always pass tz= — naive datetimes are a long-running source of bugs.
  • date.today() returns the current date.
  • now.year reads a component (int); today.isoformat() is the standard "YYYY-MM-DD" text form.

Arithmetic uses timedelta — a timedelta is a duration, not a point in time:

yesterday = now - timedelta(days=1)
next_week = now + timedelta(weeks=1)
[yesterday.date(), next_week.date()]
[datetime.date(2026, 5, 10), datetime.date(2026, 5, 18)]
  • timedelta(days=1) is “one day”. Subtracting it from a datetime shifts the timestamp back 24 hours.
  • timedelta(weeks=1) is “seven days”; adding shifts forward.
  • .date() extracts just the date portion of a datetime for compact display.

The general rule: datetime + timedelta = datetime; datetime - datetime = timedelta. Stick to timezone-aware datetimes and let timedelta do all the arithmetic.
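
A sketch of the subtraction direction, with invented timestamps: the difference of two aware datetimes is a timedelta, convertible to any unit you like:

start = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 15, 17, 30, tzinfo=timezone.utc)
elapsed = end - start
[elapsed, elapsed.total_seconds() / 3600]   # 8.5 hours
[datetime.timedelta(seconds=30600), 8.5]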

Formatting and parsing use strftime / strptime:

[
    now.strftime("%Y-%m-%d %H:%M:%S"),
    datetime.strptime("2024-01-15", "%Y-%m-%d"),
    datetime.fromisoformat("2024-01-15T12:30:00"),
]
['2026-05-11 23:52:23',
 datetime.datetime(2024, 1, 15, 0, 0),
 datetime.datetime(2024, 1, 15, 12, 30)]

Three different conversions, two directions:

  • now.strftime(fmt) (“string-format-time”): turn a datetime into a string. The format codes are POSIX-style: %Y is the four-digit year, %m the two-digit month, %d the two-digit day, %H:%M:%S hour:minute:second on a 24-hour clock. Other useful codes: %A weekday name, %B month name, %j day of year. The cell renders '2026-05-11 23:52:23'.
  • datetime.strptime(text, fmt) (“string-parse-time”): the inverse direction. Read a string against a format and produce a datetime. Useful for legacy or non-ISO formats. The result is naive (no timezone) unless the format contains %z.
  • datetime.fromisoformat(text) — the modern parser for the ISO-8601 format YYYY-MM-DDTHH:MM:SS. Faster and more permissive than strptime("%Y-%m-%dT%H:%M:%S"), and round-trips with .isoformat(). Reach for this whenever the input is ISO-8601 (which it should be, for any new code or API).
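
A sketch of the aware case: an ISO string that carries a UTC offset parses straight to a timezone-aware datetime.

aware = datetime.fromisoformat("2024-01-15T12:30:00+00:00")
[aware.tzinfo, aware.hour]
[datetime.timezone.utc, 12]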

For anything time-zone-sensitive, always include tzinfo. Use timezone.utc for UTC; for named zones, use zoneinfo.ZoneInfo (Python 3.9+) — backed by the system’s IANA database, so it handles DST correctly:

from zoneinfo import ZoneInfo

london = datetime(2024, 6, 1, 12, 0, tzinfo=ZoneInfo("Europe/London"))
new_york = london.astimezone(ZoneInfo("America/New_York"))
[london.isoformat(), new_york.isoformat()]
['2024-06-01T12:00:00+01:00', '2024-06-01T07:00:00-04:00']

astimezone converts; the underlying instant is unchanged. Always store and pass datetimes as timezone-aware — naive datetimes are the source of nearly every “off by an hour twice a year” bug.

11.5 re — regular expressions

Regex is powerful but easy to overuse. Reach for it when patterns are truly irregular — phone numbers, dates of varying formats, log-line shapes. For fixed strings, .startswith() and .split() are simpler and faster.

import re

m = re.search(r"(\d+)-(\d+)", "phone: 123-456")
[m.group(), m.group(1), m.group(2)]
['123-456', '123', '456']
  • r"(\d+)-(\d+)" is the pattern — \d+ matches one or more digits, the parentheses create capture groups, and - matches a literal hyphen.
  • re.search(pattern, text) scans for the first place the pattern matches anywhere in text. Returns a match object (or None).
  • m.group() is the entire match: "123-456".
  • m.group(1) is the first captured group (the first \d+): "123".
  • m.group(2) is the second captured group: "456".

The general rule: parentheses capture, \d/\w/\s are character classes, +/*/? are repetition. Always use raw strings (r"...") for patterns — \d and \w would otherwise need double backslashes.

The core entry points:

  • re.search(pattern, text) — first match anywhere.
  • re.match(pattern, text) — match at the start.
  • re.fullmatch(pattern, text) — match the entire string.
  • re.findall(pattern, text) — list of every match.
  • re.sub(pattern, repl, text) — replace every match.
  • re.split(pattern, text) — split on the pattern.
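
A sketch of the last two on a string with digit-run separators (the sample text is invented):

text = "one1two22three"
[re.split(r"\d+", text), re.sub(r"\d+", "-", text)]
[['one', 'two', 'three'], 'one-two-three']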

Compile patterns you’ll reuse:

phone = re.compile(r"\b(\d{3})-(\d{3,4})\b")
phone.findall("call 555-1234 or 555-5678 today")
[('555', '1234'), ('555', '5678')]

Named groups make the resulting matches self-documenting:

m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2024-01-15")
m.groupdict()
{'year': '2024', 'month': '01', 'day': '15'}
Tip: Why this matters

The standard library is the largest, best-tested Python codebase in existence. Knowing what’s already there is a force multiplier: Counter replaces a dozen lines of if/else; lru_cache replaces a manual cache; pathlib replaces hand-rolled string manipulation. The first question for any new task should be “is there already a stdlib module for this?”

Note: Further reading

Beazley, Python Distilled §9–10 surveys the rest of the standard library — socket, select, asyncio, threading, tempfile, struct, urllib, xml, inspect, and more — with example-driven notes that complement this chapter’s tour.

11.6 Build: a tiny log-line analyzer

A canonical “I just need to grok these logs” task: parse some lines, count status codes, and break it down by date. Five stdlib tools from four modules earn a slot — re for parsing, collections.Counter and defaultdict for tallies, itertools.groupby for the per-day breakdown, and datetime for date validation.

Step 1: parse each line with a named-group regex. A real-world log line has fields the program needs as separate values; re with named groups gives them to you as a dict:

import re

LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) "
    r"(?P<status>\d+) "
    r"(?P<path>\S+)"
)

raw = """\
2024-01-15 10:30:01 ERROR 500 /api/users
2024-01-15 10:30:05 INFO 200 /api/products
2024-01-15 10:31:02 ERROR 404 /api/users
2024-01-16 09:15:00 ERROR 500 /api/orders
2024-01-16 09:15:30 ERROR 500 /api/users
"""

events = [LINE.match(line).groupdict() for line in raw.strip().splitlines()]
events[0]
{'date': '2024-01-15',
 'time': '10:30:01',
 'level': 'ERROR',
 'status': '500',
 'path': '/api/users'}

(?P<name>...) defines a named capture group; after a successful match, m.groupdict() returns a dict keyed by those names — much friendlier than m.group(1), m.group(2), etc. Compiling the pattern once at module level (re.compile(...)) avoids re-compiling it for every line.

Step 2: tally status codes (Counter) and group request paths by status (defaultdict). Two of the chapter’s collections types, applied to the parsed events:

from collections import Counter, defaultdict

status_counts = Counter(int(e["status"]) for e in events)

paths_by_status = defaultdict(list)
for e in events:
    paths_by_status[int(e["status"])].append(e["path"])

[status_counts.most_common(),
 dict(paths_by_status)]
[[(500, 3), (200, 1), (404, 1)],
 {500: ['/api/users', '/api/orders', '/api/users'],
  200: ['/api/products'],
  404: ['/api/users']}]

Counter(int(e["status"]) for e in events) walks the events with a generator expression — no intermediate list — and tallies. paths_by_status is a defaultdict(list) so each status code’s list of paths is auto-created on first access.

Step 3: per-day breakdown using itertools.groupby and datetime for validation. groupby needs the input sorted by the same key — that was the trap from earlier in the chapter. We sort by date, then group, and along the way validate that each date is a real calendar date:

import itertools
from datetime import date

def by_date(event):
    return event["date"]

events_sorted = sorted(events, key=by_date)

per_day = {}
for day, group in itertools.groupby(events_sorted, key=by_date):
    parsed = date.fromisoformat(day)         # raises ValueError on a bad date
    rows = list(group)
    per_day[parsed.isoformat()] = {
        "n": len(rows),
        "errors": sum(1 for r in rows if r["level"] == "ERROR"),
    }
per_day
{'2024-01-15': {'n': 3, 'errors': 2}, '2024-01-16': {'n': 2, 'errors': 2}}

itertools.groupby(events_sorted, key=by_date) yields (day, group_iterator) pairs whenever the key changes. list(group) materialises the iterator before we move on — the same rule we hit in the groupby section. date.fromisoformat validates that "2024-01-15" is a real date (a typo like "2024-13-15" would raise ValueError); we don’t actually use parsed further, but it lets the loop fail loudly on bad input rather than silently letting a typo through.

The build threads five tools through one pipeline: re for the parse, Counter for the rank-by-frequency tally, defaultdict for the grouped lists, itertools.groupby for the SQL-style per-day breakdown, and datetime for shape-validation. None of it is more than ten lines of code per step — and all five read like the prose description of what we’re doing.

11.7 Exercises

  1. Counter for word frequency. Read a text file and print the 10 most common words. Strip punctuation and lowercase everything before counting.

  2. groupby after sorting. Given a list of (date, event) tuples, group events by date. Why must you sort first?

  3. lru_cache speedup. Time fib(35) with and without @lru_cache. Predict the speedup.

  4. partial for sort keys. Use functools.partial to construct a sort key that sorts strings by their n-th character, where n is configurable.

  5. Regex pitfall. Predict and verify: re.search(r"a+", "") vs. re.search(r"a*", ""). Why does the second one match?

11.8 Summary

collections, itertools, functools, datetime, re — five modules cover most everyday tasks. The next chapter, Chapter 12, brings everything together: how to structure a script, how to add tests, and the Pythonic idioms that separate working code from clean working code.