15 Dictionaries and Sets

Core idea

Dicts and sets are hash tables. Understanding this explains their performance (O(1) for get/set/in), their constraints (keys must be hashable), and their behavior (ordering in dicts since Python 3.7).

In this chapter you will learn to:

Build dicts with comprehensions, unpacking, and the merge operator (|).
Match against dict patterns with match/case.
Handle missing keys with setdefault, defaultdict, and __missing__.
Use the dict variations from collections: OrderedDict, ChainMap, Counter.
Wrap a dict in a read-only proxy with MappingProxyType.
Apply set algebra: union, intersection, difference, symmetric difference.
Reason about why dicts and sets need hashable keys.

15.1 Modern dict syntax

Dict comprehensions mirror list comprehensions, but with key: value pairs:

dial_codes = [
    (880, "Bangladesh"),
    (55,  "Brazil"),
    (86,  "China"),
    (91,  "India"),
    (62,  "Indonesia"),
]
country_dial = {country: code for code, country in dial_codes}
country_dial

{'Bangladesh': 880, 'Brazil': 55, 'China': 86, 'India': 91, 'Indonesia': 62}

The trick in the comprehension is the variable order: each tuple in dial_codes is (code, country), but we unpack it as for code, country in dial_codes and then write {country: code} — flipping the pair so the country becomes the key. The result is a dict mapping country name to dial code, which is the lookup direction you actually want for human-typed input. Same shape works to filter or transform: {c: code for c, code in dial_codes.items() if code > 60} keeps only some keys.

** unpacks a dict into another dict (or into keyword arguments):

def dump(**kwargs):
    return kwargs

dump(**{"x": 1}, y=2, **{"z": 3})

{'x': 1, 'y': 2, 'z': 3}

Three pieces: **{"x": 1} spreads the dict’s pairs as keyword arguments — equivalent to writing x=1. y=2 is a literal keyword. **{"z": 3} spreads another dict — equivalent to z=3. They all pile into kwargs, which the function returns: {'x': 1, 'y': 2, 'z': 3}. The same ** syntax works in dict literals — {**d1, **d2, "extra": 1} builds a fresh dict by spreading both, with later spreads winning on key collision.

Since Python 3.9, | merges dicts and |= updates in place — and they handle conflicts the same way dict.update does (the right-hand side wins):

d1 = {"a": 1, "b": 3}
d2 = {"a": 2, "b": 4, "c": 6}
d1 | d2

{'a': 2, 'b': 4, 'c': 6}

d1 | d2 returns a new dict {'a': 2, 'b': 4, 'c': 6} — keys present in both come from d2 (right wins), keys only in d1 survive, keys only in d2 come along. d1 is unchanged after this expression.

d1 |= d2
d1

{'a': 2, 'b': 4, 'c': 6}

d1 |= d2 is the augmented form: same merge, but it mutates d1 in place. Equivalent to d1.update(d2) but reads as a one-liner. Use | when you want to keep both originals; use |= when overwriting d1 is what you mean.

Dicts are now ordered

Since Python 3.7, regular dicts preserve insertion order. This was an implementation detail of CPython 3.6 that became a language guarantee. You no longer need collections.OrderedDict for ordering alone — you only need it for the methods we’ll see below (move_to_end, popitem(last=...)).

15.2 Pattern matching with mappings

match/case works on dicts the way it works on lists, but the patterns describe which keys must be present:

def get_creators(record):
    match record:
        case {"type": "book", "api": 2, "authors": [*names]}:
            return names
        case {"type": "book", "api": 1, "author": name}:
            return [name]
        case {"type": "book"}:
            raise ValueError(f"Invalid 'book' record: {record!r}")
        case {"type": "movie", "director": name}:
            return [name]
        case _:
            raise ValueError(f"Invalid record: {record!r}")

b1 = dict(api=1, author="Douglas Hofstadter",
          type="book", title="Gödel, Escher, Bach")
get_creators(b1)

['Douglas Hofstadter']

Walking through the cases:

case {"type": "book", "api": 2, "authors": [*names]}: matches when the dict has type == "book", api == 2, and an authors key whose value is a sequence. The inner [*names] is itself a sequence pattern that captures every element of the list.
case {"type": "book", "api": 1, "author": name}: matches a v1 book — one author (singular). The bare name name captures the value.
case {"type": "book"}: is the fallback for books: any book that didn’t match the two above is malformed. Order matters — this comes after the specific cases.
case {"type": "movie", "director": name}: handles movies with the same shape rule.
case _: is the catch-all wildcard for anything that isn’t a book or movie.

The crucial rule for mapping patterns: a pattern matches if the given keys are present and their values match. Extra keys in the input are ignored — b1 has a title key that no pattern mentions, and that’s fine. That’s the opposite of how == on dicts works (which requires identical keys) — and it’s exactly what you want for parsing semi-structured records.

15.3 Handling missing keys

There are four idioms for “give me the value for this key, or a default.”

dict.get with a default is the simplest:

d = {"a": 1}
d.get("missing", [])

[]

dict.setdefault does get-or-insert in a single hash lookup. Use it when you want to mutate the default:

occurrences = {}
word = "hello"
occurrences.setdefault(word, []).append(3)
occurrences

{'hello': [3]}

The naive version of that pattern — if word not in d: d[word] = []; d[word].append(...) — costs three hash lookups per word. setdefault costs one.

collections.defaultdict moves the default into the dict itself:

from collections import defaultdict
index = defaultdict(list)
for position, word in enumerate("the cat sat on the mat".split()):
    index[word].append(position)
dict(index)

{'the': [0, 4], 'cat': [1], 'sat': [2], 'on': [3], 'mat': [5]}

The factory (list) is called every time a missing key is accessed. The price: the default is created on any access, including a read for a key that doesn’t exist — so index["never_seen"] would silently insert an empty list.

__missing__ is the lowest-level hook. The flow it slots into is worth understanding before any code:

When you write d[key], Python calls d.__getitem__(key). If the key is in the dict, its value is returned. If it isn’t, Python checks whether the class defines a __missing__ method — if it does, it calls __missing__(key) and returns that result; if it doesn’t, you get a KeyError.

So __missing__ is your hook to compute a value from the key on a miss. Unlike defaultdict, the result doesn’t have to be a fixed factory call — it can be anything.

Step 1: the simplest possible example. A dict where every missing key returns 0:

class ZeroDict(dict):
    def __missing__(self, key):
        return 0

z = ZeroDict({"a": 1})
[z["a"], z["b"]]

[1, 0]

z["a"] finds 1 directly. z["b"] doesn’t find anything, so Python calls z.__missing__("b"), which returns 0. That’s the entire mechanism.

Step 2: a real use case — normalising key types. Suppose you load config from JSON, where every key comes back as a string ("2"), but somewhere else in your code you index with an int (d[2]). Rather than make every caller remember to stringify, encode the rule in the dict itself: if d[2] misses, retry with d["2"].

class StrKeyDict(dict):
    def __missing__(self, key):
        if isinstance(key, str):
            raise KeyError(key)             # already a string and still missing — give up
        return self[str(key)]               # otherwise try the stringified key

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default

    def __contains__(self, key):
        return key in self.keys() or str(key) in self.keys()

d = StrKeyDict([("2", "two"), ("4", "four")])
d[2]

'two'

What each addition does:

The isinstance(key, str) guard prevents infinite recursion. If __missing__ is called with a string, that string has already been searched for and not found — retrying via self[str(key)] would call __missing__ again with the same string, forever. Raising KeyError ends the lookup the normal way.
return self[str(key)] is the retry. A non-string key like 2 gets stringified to "2" and looked up again. This second lookup goes through __getitem__ again — but now with a string, so the guard above will stop the recursion if "2" is also missing.
get and __contains__ are overridden because the inherited versions bypass __missing__. dict.get and key in d have their own fast paths in C — they check the underlying hash table directly, never calling __getitem__ and so never reaching __missing__. If you want consistent behavior across d[k], d.get(k), and k in d, you have to override all three.

The general rule: __missing__ is the right hook for “I want d[key] to do something custom on a miss” — but it only catches d[key], not d.get or key in d, so override those too when you need consistent behaviour across all three.

15.4 Variations of `dict`

collections exports three close cousins of dict. Each one solves a single problem.

from collections import OrderedDict, ChainMap, Counter

OrderedDict keeps insertion order (so does plain dict since 3.7) but adds move_to_end and popitem(last=...):

od = OrderedDict(a=1, b=2, c=3)
od.move_to_end("b")
list(od.keys())

['a', 'c', 'b']

OrderedDict(a=1, b=2, c=3) builds a dict whose keys are in insertion order: a, b, c. od.move_to_end("b") mutates od so that b is now at the end — the keys reorder to a, c, b. There’s a last=False flag for “move to the front” instead. The use case is LRU-cache-style ordering: every time a key is touched, move_to_end it; the popitem(last=False) form then evicts the oldest. Plain dict doesn’t have this — the only ordering guarantee is “first insert wins.”

ChainMap layers several mappings into one logical view. Lookups traverse the chain in order:

import builtins
pylookup = ChainMap(locals(), globals(), vars(builtins))
pylookup["print"] is print

True

Three new pieces in that snippet. builtins is the module that holds the always-available names like print, len, range, int — usually you never need to import it because those names are visible everywhere, but the module still exists and you can introspect it. vars(obj) returns the __dict__ of any object that has one, so vars(builtins) is the dict of every built-in name. locals() and globals() are built-in functions returning the dicts for the current local and module-level scopes. The ChainMap then layers them in lookup order — locals, then globals, then builtins — which is exactly the LEGB rule from chapter 6 written as data.

Counter is a multiset. It maps each element to its count:

ct = Counter("abracadabra")
ct.most_common(3)

[('a', 5), ('b', 2), ('r', 2)]

Counter("abracadabra") walks the iterable (a string is iterable as characters) and tallies occurrences: {'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}. .most_common(3) returns the top three as (char, count) tuples sorted by count descending — [('a', 5), ('b', 2), ('r', 2)]. Tie-breaking between 'b' and 'r' (both count 2) is by insertion order. With no argument, .most_common() returns all pairs sorted; with a number, the top n.

ct.update("aaaaazzz")
ct["z"]

ct.update("aaaaazzz") adds to the existing counts (it doesn’t replace, the way dict.update would). After the call, 'a' has gone from 5 to 10, and a new 'z' key was created with count 3. ct["z"] returns 3 — and crucially, even before the update, asking for ct["unseen_key"] returns 0 rather than raising KeyError. That zero-default is Counter’s key superpower over a plain dict.

15.5 Immutable mappings with `MappingProxyType`

When you want to expose a dict for reading without letting the caller mutate it, wrap it in types.MappingProxyType:

from types import MappingProxyType
d = {1: "A"}
d_proxy = MappingProxyType(d)
d_proxy[1]

'A'

d_proxy[2] = "X"

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 d_proxy[2] = "X"

TypeError: 'mappingproxy' object does not support item assignment

The proxy is a view, not a copy. Updates to the original are visible through it:

d[2] = "B"
d_proxy[2]

'B'

15.6 Set theory

Sets are like dicts without values: hash tables of unique elements. The operators read like the math.

s = {1, 2, 3}
empty = set()  # NOT {} — that's an empty dict
type(s), type(empty)

(set, set)

{1, 2, 3} is a set literal — curly braces with comma-separated values (no : between them). For the empty set you must write set(), because {} parses as empty dict (curly braces with key-value pairs). The cell confirms this: type(s) is set, type(empty) is set. If you swapped set() for {}, type(empty) would be dict — the canonical first-day-with-Python gotcha.

The four set operations:

a = {1, 2, 3, 4}
b = {2, 4, 6}

a | b, a & b, a - b, a ^ b

({1, 2, 3, 4, 6}, {2, 4}, {1, 3}, {1, 3, 6})

The output is ({1, 2, 3, 4, 6}, {2, 4}, {1, 3}, {1, 3, 6}) — read from the table below for the meaning of each operator. Each operation returns a new set; a and b are unchanged. There are method forms too — a.union(b), a.intersection(b), a.difference(b), a.symmetric_difference(b) — which accept any iterable, not just another set, so a.union([5, 6, 7]) works while a | [5, 6, 7] would raise. Reach for the operators when both sides are sets and you want the math notation; reach for the methods when one side is an arbitrary iterable.

Operator	Meaning	Example output
`a \\| b`	union	`{1, 2, 3, 4, 6}`
`a & b`	intersection	`{2, 4}`
`a - b`	difference	`{1, 3}`
`a ^ b`	symmetric difference	`{1, 3, 6}`

The classic use of intersection: counting matches between two collections in O(min(len(a), len(b))) instead of O(len(a) * len(b)).

needles = {1, 2, 3, 99, 100}
haystack = set(range(50))
len(needles & haystack)

That same logic with list membership would be O(n*m). With sets, it is roughly linear in the smaller collection.

Why this matters

Dicts are hash tables — O(1) for lookup, insertion, deletion. This is why Python can iterate dict.keys() and dict.items() as set-like views: a dict internally is a hash table, and sets are their own hash tables. The price you pay: keys must be hashable.

15.7 What “hashable” means

An object is hashable if it has a __hash__ method whose return value never changes during the object’s lifetime, and an __eq__ method that’s consistent with that hash. The standard immutable types — int, float, str, bytes, tuple (of hashables), frozenset — are all hashable. list, dict, set are not.

hash("hello")

-6846546988974161659

hash([1, 2, 3])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[23], line 1
----> 1 hash([1, 2, 3])

TypeError: unhashable type: 'list'

This is why dict keys and set elements must be hashable: the hash table needs a stable bucket. We’ll come back to __hash__ in Chapter 23 when we make our own class hashable.

The lookup itself is a short loop:

flowchart TD
  K["key"] --> H["hash(key)"]
  H --> B["bucket index"]
  B --> Q{"slot occupied<br/>and __eq__ matches?"}
  Q -- "yes" --> V["return value"]
  Q -- "no, slot empty" --> KE["KeyError"]
  Q -- "collision" --> P["probe next slot"]
  P --> Q

Hash collisions resolve via open addressing — the implementation probes nearby slots until it finds the key or an empty slot. The amortized cost stays O(1) because the load factor is kept low by resizing.

15.8 Build: a tiny inverted index with boolean search

An inverted index maps each word to the set of documents that contain it. It’s the heart of every text-search engine. Building one is the canonical workout for defaultdict(set) and the set-algebra operators.

Step 1: build the index. Tokenise each document and accumulate a defaultdict(set) mapping word → set of doc ids:

from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "lazy dog sleeps",
    3: "quick brown dog",
    4: "fox watches dog",
}

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)

index = build_index(docs)
{w: index[w] for w in ("quick", "dog", "fox")}

{'quick': {1, 3}, 'dog': {2, 3, 4}, 'fox': {1, 4}}

defaultdict(set) does the “create-an-empty-set-on-missing-key” dance from earlier in the chapter — without it, every index[word] access on a new word would need an if guard. .add(doc_id) mutates the set in place; duplicates collapse for free. We convert to a plain dict at the end purely so the display is clean.

Step 2: a single-word query. Look up the word, return the set of document ids — or an empty set if the word never appeared:

def find(index, word):
    return index.get(word.lower(), set())

[find(index, "DOG"), find(index, "absent")]

[{2, 3, 4}, set()]

dict.get(word, set()) is the safe-access idiom from the chapter. The default has to be a fresh empty set on each miss, not a shared module-level one — set() re-evaluates on each call, so callers can’t accidentally mutate the default. (This is the mutable-default trap from chapter 6, applied to dict.get.)

Step 3: boolean queries with set algebra. AND-queries are intersection; OR-queries are union; NOT-queries (relative to a known universe) are difference. The set operators map directly onto search semantics:

def search(index, query):
    """query: ('AND'|'OR', word1, word2)"""
    op, *words = query
    sets = [find(index, w) for w in words]
    match op:
        case "AND":
            result = sets[0]
            for s in sets[1:]:
                result = result & s
        case "OR":
            result = set()
            for s in sets:
                result = result | s
        case _:
            raise ValueError(f"unknown op: {op!r}")
    return result

[
    search(index, ("AND", "quick", "dog")),    # both words
    search(index, ("OR",  "fox",   "lazy")),   # either word
    search(index, ("AND", "quick", "lazy")),   # neither doc has both
]

[{3}, {1, 2, 4}, set()]

a & b is intersection — the docs in both a and b (the AND case). a | b is union — the docs in either (the OR case). The match/case block uses a literal pattern on op to dispatch; op, *words = query is the *rest unpacking from chapter 14. The whole search engine is fifteen lines.

The build threads the chapter’s tools through one self-contained problem: defaultdict(set) for accumulation, dict.get with a mutable-safe default for safe lookups, match/case on a literal for dispatch, and the set-algebra operators for the actual search semantics.

15.9 Exercises

setdefault vs defaultdict. Rewrite the index-building example using setdefault instead of defaultdict. Which is more readable to you?
Pattern shape. Modify get_creators to also handle a record with type: "movie" and a list of directors ("directors": [...]).
MappingProxyType semantics. Given d = {"a": 1} and p = MappingProxyType(d), can you mutate the value at d["a"] through p? Through d? Try it.
Set difference for diffs. Given two dicts representing config files, write a one-line expression that returns the keys present in the new dict but missing from the old. Use set difference on .keys().
The classic {} trap. What does type({}) return? What about type({1})? What does type({1: 2}) return? Why is set() the recommended way to spell an empty set?

15.10 Summary

dict and set are the two hash-table types in Python. Their performance, their ordering rules, and their constraint on hashable keys all flow from that one fact. The standard library adds three useful flavors — OrderedDict, ChainMap, Counter — and MappingProxyType for read-only views.

The next chapter, Chapter 16, leaves the data structures and dives into the I/O boundary: when does a str become bytes, when does the reverse happen, and how do you keep your code from breaking on non-ASCII text?

15.1 Modern dict syntax

15.2 Pattern matching with mappings

15.3 Handling missing keys

15.4 Variations of dict

15.5 Immutable mappings with MappingProxyType

15.6 Set theory

15.7 What “hashable” means

15.8 Build: a tiny inverted index with boolean search

15.9 Exercises

15.10 Summary

15.4 Variations of `dict`

15.5 Immutable mappings with `MappingProxyType`