17  Data Class Builders

Note: Core idea

Python offers three ways to build data classes, each with a different trade-off. @dataclass is the modern default; typing.NamedTuple is the immutable, tuple-based alternative; collections.namedtuple is its older, untyped predecessor. A plain hand-written class is for when you need full control.

In this chapter you will learn to:

  1. Compare the three builders — collections.namedtuple, typing.NamedTuple, @dataclasses.dataclass — and pick one for a given problem.
  2. Use field() to declare default factories, control repr, and control comparison.
  3. Run setup logic after __init__ with __post_init__.
  4. Mark dataclass attributes as class-level (not instance-level) with ClassVar.
  5. Use frozen=True and order=True to opt into immutability and ordering.
  6. Pattern-match against dataclass instances with case ClassName(...).

17.1 The three builders at a glance

Feature          namedtuple                              typing.NamedTuple   @dataclass
Mutable fields   no                                      no                  yes
Class statement  no                                      yes                 yes
Default values   right-aligned only (defaults=, 3.7+)    yes (3.6.1+)        yes
__repr__         yes                                     yes                 yes
__eq__           yes                                     yes                 yes
Ordering         yes (tuple-style)                       yes (tuple-style)   opt-in (order=True)
Hashable         yes                                     yes                 conditional
__slots__        yes                                     yes                 opt-in (slots=True)

The choice in one line: use @dataclass unless you need a tuple, in which case use typing.NamedTuple. The classic collections.namedtuple is fine but offers strictly less than typing.NamedTuple once you want type hints.
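To make the comparison concrete, here is a minimal sketch of the same two-field record built all three ways. The names PointA, PointB, and PointC are hypothetical, chosen only for this side-by-side.

```python
from collections import namedtuple
from dataclasses import dataclass
from typing import NamedTuple

# 1. Classic factory call: no class statement, no type hints.
PointA = namedtuple("PointA", ["x", "y"])

# 2. Typed named tuple: class statement, still an immutable tuple.
class PointB(NamedTuple):
    x: float
    y: float

# 3. Dataclass: class statement, mutable by default, not a tuple.
@dataclass
class PointC:
    x: float
    y: float

a, b, c = PointA(1.0, 2.0), PointB(1.0, 2.0), PointC(1.0, 2.0)

print(isinstance(a, tuple), isinstance(b, tuple), isinstance(c, tuple))
# True True False

c.x = 10.0          # fine: dataclass fields are mutable by default
try:
    b.x = 10.0      # named tuples reject attribute assignment
except AttributeError as exc:
    print("NamedTuple is immutable:", exc)
```

The two tuple-based builders differ mostly in syntax; the dataclass differs in kind, because it is an ordinary class with ordinary attributes.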

17.2 Classic named tuples

namedtuple is a factory that returns a class:

from collections import namedtuple

Coordinate = namedtuple("Coordinate", ["lat", "lon"])
moscow = Coordinate(55.756, 37.617)
moscow.lat, moscow == Coordinate(55.756, 37.617)
(55.756, True)

namedtuple("Coordinate", ["lat", "lon"]) is a factory call: it builds and returns a new class object whose name is "Coordinate" and whose fields are lat and lon. We assign it to Coordinate so we can use it. moscow.lat reads the lat field by name (55.756). The == comparison succeeds because a named tuple inherits __eq__ from tuple, which compares element by element, so two Coordinate instances with the same field values are equal. The cell’s output is (55.756, True).
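One consequence of tuple-style equality is worth seeing once: the comparison looks only at the values, not at the class or the field names. A small sketch, where Point is a hypothetical second type introduced just for the comparison:

```python
from collections import namedtuple

Coordinate = namedtuple("Coordinate", ["lat", "lon"])
Point = namedtuple("Point", ["x", "y"])   # hypothetical second type

moscow = Coordinate(55.756, 37.617)

# Equality is inherited from tuple, so only the values matter.
same_as_plain_tuple = moscow == (55.756, 37.617)
same_as_other_type = moscow == Point(55.756, 37.617)
print(same_as_plain_tuple, same_as_other_type)   # True True
```

If two record types must never compare equal, that is a reason to reach for @dataclass, whose generated __eq__ also checks the class.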

Defaults are right-aligned, in the same way default arguments are:

Coordinate = namedtuple("Coordinate", ["lat", "lon", "reference"], defaults=["WGS84"])
Coordinate(55.756, 37.617)
Coordinate(lat=55.756, lon=37.617, reference='WGS84')

defaults=["WGS84"] supplies one default — and Python attaches it to the rightmost field, reference. Constructing with two arguments fills lat and lon; reference falls back to "WGS84". The output is Coordinate(lat=55.756, lon=37.617, reference='WGS84'). To default two fields, you’d pass defaults=[default_for_second_to_last, default_for_last] — Python lines them up from the right.
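A quick sketch of the two-defaults case described above. The elevation field is a hypothetical addition for illustration; _field_defaults is the class attribute where namedtuple records which field received which default.

```python
from collections import namedtuple

Coordinate = namedtuple(
    "Coordinate",
    ["lat", "lon", "elevation", "reference"],
    defaults=[0.0, "WGS84"],   # lined up with the last two fields
)

print(Coordinate(55.756, 37.617))
# Coordinate(lat=55.756, lon=37.617, elevation=0.0, reference='WGS84')
print(Coordinate._field_defaults)
# {'elevation': 0.0, 'reference': 'WGS84'}
```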

Because a named tuple is a tuple, you can unpack it positionally:

lat, lon, ref = Coordinate(55.756, 37.617)
lat, ref
(55.756, 'WGS84')

Three names on the left, three slots in the tuple on the right — Python unpacks element-by-element. lat gets 55.756, lon gets 37.617, ref gets "WGS84" (from the default). The cell prints (55.756, 'WGS84'). This is exactly the unpacking from the tuples chapter — namedtuples didn’t introduce a new mechanism; they just added attribute access on top of the same tuple shape.

Two helpers worth remembering — both bridge between named-tuples and “regular” Python data shapes:

_asdict() turns a namedtuple instance into a plain dict keyed by field names. Useful when you need to hand the data to something that expects a mapping: json.dumps, a templating engine, an HTTP request body. Note the leading underscore: the namedtuple machinery prefixes its own helpers with _ so they can’t collide with user-defined field names, which are not allowed to start with an underscore. (Imagine a namedtuple with a field literally called asdict or make: without the prefix, the field and the helper would compete for the same attribute name.)

moscow = Coordinate(55.756, 37.617)
moscow._asdict()
{'lat': 55.756, 'lon': 37.617, 'reference': 'WGS84'}

The result is an ordinary dict. The default reference="WGS84" is included because every field of the tuple — defaulted or not — has a concrete value at this point.
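The json.dumps case shows why the conversion matters. In this sketch, serialising the named tuple directly loses the field names, while going through _asdict() keeps them:

```python
import json
from collections import namedtuple

Coordinate = namedtuple("Coordinate", ["lat", "lon", "reference"], defaults=["WGS84"])
moscow = Coordinate(55.756, 37.617)

as_array = json.dumps(moscow)             # a tuple serialises as a JSON array
as_object = json.dumps(moscow._asdict())  # a dict serialises as a JSON object
print(as_array)   # [55.756, 37.617, "WGS84"]
print(as_object)  # {"lat": 55.756, "lon": 37.617, "reference": "WGS84"}
```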

_make(iterable) is the inverse: it builds a namedtuple instance from any iterable, without writing out the field names. Compare Coordinate(*row) (positional unpacking) with Coordinate._make(row): they produce the same result when the row supplies a value for every field, but _make reads as “construct from this row”. One difference to remember: _make does not apply defaults, so a row with too few items raises TypeError, whereas Coordinate(*row) would fall back to the default for reference.

Coordinate._make([55.756, 37.617, "WGS84"])
Coordinate(lat=55.756, lon=37.617, reference='WGS84')

The classic use case: parsing a CSV into namedtuples — for row in csv.reader(f): Coordinate._make(row). Each row is already a list of strings; _make slots them into the right fields by position, not by name.
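Here is that CSV loop as a runnable sketch. The io.StringIO object stands in for an open file, and the two rows are made-up sample data. Note that _make does no type conversion, so a second pass with _replace is one way to turn the numeric columns into floats.

```python
import csv
import io
from collections import namedtuple

Coordinate = namedtuple("Coordinate", ["lat", "lon", "reference"])

# Stand-in for an open file; the data is made up for illustration.
f = io.StringIO("55.756,37.617,WGS84\n-33.868,151.209,WGS84\n")

rows = [Coordinate._make(row) for row in csv.reader(f)]
print(rows[0])
# Coordinate(lat='55.756', lon='37.617', reference='WGS84')

# _make does no conversion, so numeric columns are still strings.
typed = [r._replace(lat=float(r.lat), lon=float(r.lon)) for r in rows]
print(typed[1].lat)   # -33.868
```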

17.3 Typed named tuples

typing.NamedTuple lets you write a tuple as a class — type hints, methods, and all:

from typing import NamedTuple

class Coordinate(NamedTuple):
    lat: float
    lon: float
    reference: str = "WGS84"

    def __str__(self):
        ns = "N" if self.lat >= 0 else "S"
        we = "E" if self.lon >= 0 else "W"
        return f"{abs(self.lat):.1f}°{ns}, {abs(self.lon):.1f}°{we}"

print(Coordinate(55.756, 37.617))
55.8°N, 37.6°E

The instance is still a tuple — the methods are added on top.

17.4 @dataclass

@dataclass is the closest Python has to a Kotlin data class or a Scala case class. It generates __init__, __repr__, and __eq__ from the class’s annotated attributes:

from dataclasses import dataclass, field
from typing import ClassVar

@dataclass
class ClubMember:
    name: str
    guests: list[str] = field(default_factory=list)
    athlete: bool = field(default=False, repr=False)

ClubMember("Anna")
ClubMember(name='Anna', guests=[])

Walking through what each annotated line declares:

  • name: str — a required field. The decorator generates __init__(self, name, ...) that assigns self.name = name.
  • guests: list[str] = field(default_factory=list) — an optional field with a fresh empty list per instance. Writing = [] directly is the mutable-default trap: one list would be shared across every ClubMember. @dataclass refuses it with a ValueError when the class is created; default_factory=list calls list() for each new instance.
  • athlete: bool = field(default=False, repr=False) — defaults to False. The repr=False flag tells the generated __repr__ to omit this field, which is why the printed output shows only name and guests.

field() is the customization hook. Each option is a small but useful escape hatch:

field() option    Purpose
default           static default value
default_factory   callable producing the default (use this for any mutable type)
repr              include in __repr__?
compare           include in __eq__ and ordering?
hash              include in __hash__?
init              accept as parameter to __init__?
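Two of these options in action, as a small sketch. The User class and its fields are hypothetical, picked because they show why you might want a field hidden from the repr or ignored in comparisons.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    password: str = field(repr=False)                   # hidden from __repr__
    login_count: int = field(default=0, compare=False)  # ignored by __eq__

a = User("anna", "s3cret", login_count=1)
b = User("anna", "s3cret", login_count=99)

print(a)        # User(name='anna', login_count=1)
print(a == b)   # True: login_count does not take part in the comparison
```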

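Rule: never use a mutable default (= [], = {}), because every instance would share the same object. @dataclass rejects list, dict, and set defaults with a ValueError, but it cannot detect every mutable type, so reach for field(default_factory=...) whenever the default is mutable.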
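For the most common cases, @dataclass enforces the rule itself. The sketch below wraps a class definition in try/except to show the rejection, then defines a safe version; Bag and SafeBag are hypothetical names.

```python
from dataclasses import dataclass, field

try:
    @dataclass
    class Bag:
        contents: list = []        # rejected while the class is being created
except ValueError as exc:
    print("rejected:", exc)

@dataclass
class SafeBag:
    contents: list = field(default_factory=list)   # fresh list per instance

a, b = SafeBag(), SafeBag()
a.contents.append("apple")
print(a.contents, b.contents)   # ['apple'] []
```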

17.5 __post_init__

The generated __init__ only assigns the fields you declared. What if you need to derive one — say, default handle to the first word of name? You can’t do it in a field(default=...) because the default doesn’t see the other fields. The hook is __post_init__: a method the generated __init__ calls right after the assignments are done.

@dataclass
class HackerClubMember:
    name: str
    guests: list = field(default_factory=list)
    handle: str = field(default="", init=True)

    def __post_init__(self):
        if self.handle == "":
            self.handle = self.name.split()[0]

HackerClubMember("Anna Ravenscroft", handle="AnnaRaven").handle, \
HackerClubMember("Leo Rochael").handle
('AnnaRaven', 'Leo')

Walking through what runs at construction time:

  • The @dataclass decorator generates an __init__ that assigns self.name, self.guests, and self.handle from the constructor arguments.
  • After those assignments, the generated __init__ automatically calls self.__post_init__() if the method exists.
  • Inside __post_init__, every field is already set, so we can read self.name and self.handle. If the caller passed an empty handle, we replace it with the first word of name.
  • For "Anna Ravenscroft" the caller passed handle="AnnaRaven", so the if is False — handle stays as given. For "Leo Rochael" the handle defaulted to "", so we derive "Leo".

The general rule: __post_init__ is the right place to validate, normalize, or compute any value that depends on the other fields after the auto-generated __init__ finishes its assignments.
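A second sketch of the same hook, this time combining validation with a computed field. Rectangle is a hypothetical class; field(init=False) keeps area out of the generated __init__ so that only __post_init__ can set it.

```python
from dataclasses import dataclass, field

@dataclass
class Rectangle:
    width: float
    height: float
    area: float = field(init=False)   # computed, not accepted by __init__

    def __post_init__(self):
        # Validate first, then derive the dependent value.
        if self.width <= 0 or self.height <= 0:
            raise ValueError("width and height must be positive")
        self.area = self.width * self.height

print(Rectangle(3.0, 4.0))
# Rectangle(width=3.0, height=4.0, area=12.0)
```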

17.6 ClassVar for class-level attributes

Annotated attributes become __init__ parameters by default. To opt out — to declare a class-level attribute that’s shared across instances — wrap the type in ClassVar:

@dataclass
class HackerClub:
    name: str
    guests: list = field(default_factory=list)
    all_handles: ClassVar[set[str]] = set()

HackerClub.all_handles.add("anna")
HackerClub.all_handles
{'anna'}

Now all_handles is not a parameter to __init__ — it’s a single set shared by every instance.
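You can check both claims with dataclasses.fields, which lists only the real instance fields. A minimal sketch reusing the HackerClub class from above:

```python
from dataclasses import dataclass, field, fields
from typing import ClassVar

@dataclass
class HackerClub:
    name: str
    guests: list = field(default_factory=list)
    all_handles: ClassVar[set[str]] = set()

a = HackerClub("Anna")
b = HackerClub("Leo")
a.all_handles.add("anna")   # mutates the one set stored on the class

print([f.name for f in fields(HackerClub)])      # ['name', 'guests']
print(b.all_handles)                             # {'anna'}
print(a.all_handles is HackerClub.all_handles)   # True
```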

17.7 frozen and order

Two flags handle the most common configurations. frozen=True makes instances immutable:

@dataclass(frozen=True)
class FrozenCoordinate:
    lat: float
    lon: float

c = FrozenCoordinate(55.756, 37.617)
c.lat = 0
---------------------------------------------------------------------------
FrozenInstanceError                       Traceback (most recent call last)
Cell In[10], line 7
      3     lat: float
      4     lon: float
      5 
      6 c = FrozenCoordinate(55.756, 37.617)
----> 7 c.lat = 0

File <string>:16, in __create_fn__.<locals>.__setattr__(self, name, value)
     14 'Could not get source, probably due dynamically evaluated source code.'

FrozenInstanceError: cannot assign to field 'lat'

order=True generates __lt__, __le__, __gt__, __ge__ based on the field order — the comparison is field-by-field, top to bottom:

@dataclass(order=True)
class Card:
    rank: int
    suit: str

Card(2, "hearts") < Card(3, "spades")
True
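Because the comparison follows field order, sorted() works out of the box and the first field acts as the primary sort key. A short sketch with a made-up hand of cards:

```python
from dataclasses import dataclass

@dataclass(order=True)
class Card:
    rank: int   # compared first
    suit: str   # tie-breaker when ranks are equal

hand = [Card(3, "spades"), Card(2, "hearts"), Card(3, "clubs")]
print([(c.rank, c.suit) for c in sorted(hand)])
# [(2, 'hearts'), (3, 'clubs'), (3, 'spades')]
```

If suit should decide the order instead, declare it before rank; the declaration order is the sort order.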

17.8 Modern @dataclass features

Two flags from Python 3.10+ and one helper that has been in the dataclasses module since Python 3.7 cover the configurations worth knowing beyond frozen and order.

slots=True generates __slots__ for the class — instances skip the per-object __dict__, save memory, and reject undeclared attributes:

@dataclass(slots=True)
class Point:
    x: float
    y: float

p = Point(1.0, 2.0)
p.__slots__, p.x
(('x', 'y'), 1.0)
p.z = 3.0
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[13], line 1
----> 1 p.z = 3.0

AttributeError: 'Point' object has no attribute 'z' and no __dict__ for setting new attributes

kw_only=True forces every field to be passed by keyword. This pays off when the field order in __init__ is incidental — keyword-only calls survive field reorderings without breaking callers:

@dataclass(kw_only=True)
class Window:
    width: int
    height: int

Window(width=800, height=600)
Window(width=800, height=600)
Window(800, 600)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 Window(800, 600)

TypeError: Window.__init__() takes 1 positional argument but 3 were given

dataclasses.replace is the canonical way to “modify” a frozen=True instance — it returns a new instance with the requested fields changed:

from dataclasses import replace

@dataclass(frozen=True)
class Config:
    host: str
    port: int
    tls: bool = False

prod = Config("example.com", 443, tls=True)
staging = replace(prod, host="staging.example.com")
prod, staging
(Config(host='example.com', port=443, tls=True),
 Config(host='staging.example.com', port=443, tls=True))

Walking through the call:

  • replace(prod, host="staging.example.com") reads every field of prod, overrides the ones you name, and constructs a new Config from the merged values.
  • Internally it’s roughly Config(host="staging.example.com", port=prod.port, tls=prod.tls) — but you don’t have to spell out the unchanged fields.
  • The original prod is unchanged; staging is a new immutable instance.

The general rule: for any frozen dataclass, replace(obj, field=new_value) is the substitute for assignment.
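Because replace goes through the constructor, a misspelled field name fails loudly instead of silently adding an attribute. A short sketch, with hostname as a deliberate typo:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Config:
    host: str
    port: int
    tls: bool = False

prod = Config("example.com", 443, tls=True)

try:
    replace(prod, hostname="staging.example.com")   # typo: no such field
except TypeError as exc:
    print("rejected:", exc)

alt = replace(prod, port=8443)
print(alt is prod, prod.port, alt.port)   # False 443 8443
```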

17.9 Pattern matching dataclass instances

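Class patterns in match/case work cleanly with dataclasses: keyword sub-patterns such as name=city look up attributes by field name, and @dataclass also generates __match_args__ so that positional sub-patterns work too.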

@dataclass
class City:
    continent: str
    name: str
    country: str

def describe(record):
    match record:
        case City(continent="Asia", name=city):
            return f"Asian city: {city}"
        case City(continent="Europe", name=city):
            return f"European city: {city}"
        case City(name=city):
            return f"City: {city}"

describe(City("Asia", "Tokyo", "JP")), describe(City("Africa", "Lagos", "NG"))
('Asian city: Tokyo', 'City: Lagos')

Walking through the cases:

  • case City(continent="Asia", name=city): matches when the value is a City instance and record.continent == "Asia". The name=city part captures record.name into the local name city.
  • case City(continent="Europe", name=city): is the same shape for Europe.
  • case City(name=city): is the catch-all for any City regardless of continent — it only constrains the type, not the field values, and still captures name.
  • Inside the match, case clauses are tried top-to-bottom; the first one that fits wins.

The general rule: case ClassName(field=pattern, ...) matches an instance of that class whose named fields satisfy each sub-pattern. The same syntax works on typing.NamedTuple instances too.

Tip: Why this matters

A data class with zero methods is a code smell — it’s a dumb data container that forces callers to know its internals. Either add the methods that belong with the data, or use a tuple/dict and be explicit about its structure. A class earns its existence by encapsulating both data and behavior.

17.10 Build: a validated, frozen Config with environment dispatch

Configuration objects show up in every program — and they hit every dataclass feature we’ve covered: validation, immutability for safety, replace for variants, and pattern matching for environment-specific behaviour.

Step 1: a frozen dataclass with __post_init__ validation. Lock down the fields, validate them at construction, slot the instance for memory efficiency:

from dataclasses import dataclass, replace

@dataclass(frozen=True, slots=True)
class Config:
    env: str
    host: str
    port: int = 5432
    tls: bool = False

    def __post_init__(self):
        if self.env not in {"dev", "staging", "prod"}:
            raise ValueError(f"unknown env: {self.env!r}")
        if not 1 <= self.port <= 65535:
            raise ValueError(f"port out of range: {self.port}")

dev = Config(env="dev", host="localhost")
dev
Config(env='dev', host='localhost', port=5432, tls=False)

frozen=True rejects attribute assignment after construction, so __post_init__ is the only place the values need to be checked: once a Config exists, it’s known-good. slots=True skips the per-instance __dict__. The __post_init__ runs after the generated __init__ finishes assigning fields, so we can read self.env and self.port to validate them; raising ValueError aborts construction.

Config(env="prod", host="db.example.com", port=70000)   # invalid port
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 Config(env="prod", host="db.example.com", port=70000)   # invalid port

File <string>:7, in __create_fn__.<locals>.__init__(self, env, host, port, tls)

Cell In[18], line 14, in Config.__post_init__(self)
     10     def __post_init__(self):
     11         if self.env not in {"dev", "staging", "prod"}:
     12             raise ValueError(f"unknown env: {self.env!r}")
     13         if not 1 <= self.port <= 65535:
---> 14             raise ValueError(f"port out of range: {self.port}")

ValueError: port out of range: 70000

Step 2: derive variants with dataclasses.replace. A frozen instance can’t be mutated — the canonical “make a tweaked copy” function is replace:

prod = replace(dev, env="prod", host="db.prod.example.com", tls=True)
staging = replace(prod, env="staging", host="db.staging.example.com")

[dev, prod, staging]
[Config(env='dev', host='localhost', port=5432, tls=False),
 Config(env='prod', host='db.prod.example.com', port=5432, tls=True),
 Config(env='staging', host='db.staging.example.com', port=5432, tls=True)]

replace(dev, env="prod", ...) reads every field of dev, overrides the named ones, and runs the constructor again — so __post_init__ re-validates the new instance. staging derives from prod rather than from scratch, picking up tls=True for free; that’s the value-object pattern in motion.

Step 3: dispatch by environment with match/case. Class patterns let you branch on field values without touching the dataclass code:

def connect_url(cfg):
    match cfg:
        case Config(env="prod", host=h, port=p, tls=True):
            return f"https://{h}:{p}/?strict=true"
        case Config(env="staging", host=h, port=p):
            return f"https://{h}:{p}/"
        case Config(env="dev", host=h, port=p):
            return f"http://{h}:{p}/"
        case _:
            raise ValueError(f"no URL handler for {cfg!r}")

[connect_url(dev), connect_url(staging), connect_url(prod)]
['http://localhost:5432/',
 'https://db.staging.example.com:5432/',
 'https://db.prod.example.com:5432/?strict=true']

case Config(env="prod", host=h, port=p, tls=True): matches when the value is-a Config and the named fields equal the literals (env="prod", tls=True). Bare names (h, p) capture; literal values ("prod", True) constrain. Top-to-bottom evaluation gives prod-specific handling first, with staging and dev fallbacks. Adding a fourth environment is one more case clause, no if/elif chain.

The build is the chapter in motion: frozen=True + slots=True for an immutable, lightweight value object, __post_init__ for invariant validation, replace for derived variants, and match/case class patterns for dispatch — all on the same fifteen-line dataclass.

17.11 Exercises

  1. Defaults trap. Try to define a Bag dataclass with a contents: list = [] default. Predict and explain what happens when the class is created. Then fix it with field(default_factory=list), create two Bag instances, add to one, and check that the other is unaffected.

  2. __post_init__ validation. Write a Temperature dataclass with value: float and unit: str. Reject any unit that isn’t "C", "F", or "K" by raising ValueError in __post_init__.

  3. Hashable but mutable? Create a dataclass with the default frozen=False. Try inserting an instance into a set. What happens? Why does frozen=True fix it?

  4. Match against NamedTuple. Rewrite the Coordinate NamedTuple example to dispatch on hemisphere — north of the equator vs. south — using match/case.

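  5. __slots__ opt-in. Read the docs for @dataclass(slots=True). Create a class with and without slots, and compare the instance sizes with sys.getsizeof. (sys.getsizeof does not count the instance’s __dict__, so add sys.getsizeof(obj.__dict__) for the class without slots.)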

17.12 Summary

Python’s three data-class builders cover three different needs: namedtuple for tuple-shaped immutable records, typing.NamedTuple for the same with type hints and methods, and @dataclass for everything else. They all take care of the boring boilerplate (construction, __repr__, __eq__); they all integrate with pattern matching; and they all reward the discipline of giving your data classes behavior alongside their data.

Next, Chapter 18 fixes the three ideas every Python programmer eventually trips over: variables are labels, not boxes; == and is ask different questions; and == does not survive a copy by default.