5  Collections

NoteCore idea

Python has four built-in collection types. Each solves a different problem: list for ordered mutable sequences, tuple for fixed records, dict for key→value lookup, set for unique unordered values. Knowing which to reach for is the mark of a Python programmer.

In this chapter you will learn to:

  1. Construct, index, slice, and modify lists.
  2. Use tuples for fixed records and unpacking.
  3. Look up, insert, and iterate dictionaries.
  4. Use sets for membership tests, deduplication, and set algebra.
  5. Write list, dict, set, and generator comprehensions.

5.1 Lists

A list is the default “ordered collection of things” in Python — when you have a sequence whose contents you’ll modify (add to, remove from, reorder), reach for a list. Lists are ordered, mutable, and accept any types.

numbers = [1, 2, 3, 4, 5]
mixed = [1, "two", 3.0, None]
[numbers[0], numbers[-1], numbers[1:3], numbers[::-1]]
[1, 5, [2, 3], [5, 4, 3, 2, 1]]
  • [1, 2, 3, 4, 5] is a list literal — square brackets, comma-separated.
  • A list can hold mixed types — Python doesn’t insist on uniform element types.
  • numbers[0] is the first element; numbers[-1] is the last (negative indices count from the end).
  • numbers[1:3] is a slice — elements at positions 1 and 2 (stop is exclusive). Returns a new list.
  • numbers[::-1] reverses with a step of -1, returning a new list.

The mutation methods modify in place and return None:

xs = [1, 2, 3]
xs.append(4)
xs.insert(0, 0)
xs.extend([5, 6])
xs
[0, 1, 2, 3, 4, 5, 6]
  • xs.append(4) adds 4 to the end. Mutates xs, returns None.
  • xs.insert(0, 0) inserts 0 at position 0 — pushes existing items right.
  • xs.extend([5, 6]) adds each element of the argument to the end — different from append, which would add the whole list as a single element.

The general rule: list mutation methods return None. Never write xs = xs.append(4) — you’ll lose your list and bind xs to None.

Removal has three primitives, each with a different semantic:

xs.pop(), xs.pop(0)
(6, 0)
  • xs.pop() (no argument) removes and returns the last element — the right tool for stack-style usage.
  • xs.pop(0) removes and returns the first element — useful but slow for large lists, because every other element has to shift down. For frequent front-end pops, use collections.deque (chapter 11) which makes that O(1).
  • The cell shows both calls in one tuple expression, so you see both popped values: the last one first (whatever was on the right end), then the first one (0).
xs.remove(3)
xs
[1, 2, 4, 5]
  • xs.remove(value) removes the first occurrence of value by equality (==), not by index. Raises ValueError if value is absent.
  • Useful when you have the value, not the position. If you have the position, del xs[i] is faster (no scan).

Sorting comes in two flavors — in-place (modifies the original) vs. returning a new list (leaves the original alone). Pick based on whether you still need the unsorted version:

xs = [3, 1, 4, 1, 5, 9, 2, 6]
sorted_new = sorted(xs)        # new list; xs unchanged
xs.sort(reverse=True)          # in-place; xs is now sorted
sorted_new, xs
([1, 1, 2, 3, 4, 5, 6, 9], [9, 6, 5, 4, 3, 2, 1, 1])
  • sorted(xs) is the built-in function — works on any iterable, returns a fresh sorted list, leaves xs untouched.
  • xs.sort(reverse=True) is the list method — mutates xs in place, returns None. reverse=True sorts descending.

Sort with a key function — useful when the natural ordering isn’t what you want:

words = ["banana", "Apple", "cherry"]
sorted(words, key=str.lower)
['Apple', 'banana', 'cherry']
  • The default sort would put "Apple" first because uppercase 'A' has a lower code point than lowercase 'b'.
  • key=str.lower tells sorted to call str.lower(word) on each element and sort by that — a case-insensitive sort. The original strings are still in the result; only the comparison key is lowered.

The general rule: sorted(...) returns, .sort() mutates; pass key= to sort by a derived value.

Lists are mutable — sharing is dangerous. b = a does not copy. Both labels point to the same list. Use a.copy() or list(a) for a shallow copy, or copy.deepcopy(a) for a deep one. The full treatment is in Chapter 18.

a = [1, 2, 3]
b = a
b.append(4)
a
[1, 2, 3, 4]

The output [1, 2, 3, 4] — for both a and b — is the aliasing demo: b = a doesn’t copy the list, it makes b a second name for the same object. b.append(4) mutates that one list; reading a shows the mutation because there’s only one list to read. The fix when you want a separate copy is b = a.copy() (or the equivalent b = list(a)); for nested data, use copy.deepcopy.

5.2 Tuples

When the items you’re grouping form a fixed record — an (x, y) coordinate, an RGB triple, a (name, age) pair — a list is the wrong choice: you don’t want anyone modifying the slots, and you’d like the structure to be hashable so it can serve as a dict key. Tuples are immutable lists for exactly this case.

point = (3, 4)
single = (42,)         # trailing comma! (42) is just 42
empty = ()
point[0], single, empty
(3, (42,), ())
  • (3, 4) is a 2-tuple. Indexing works exactly like a list (point[0] is 3).
  • (42,) — the trailing comma is required. Without it, (42) is just the integer 42 in parentheses.
  • () is the empty tuple.
  • Tuples have no .append, .pop, or .sort — they’re frozen at construction.

The most powerful feature of tuples is unpacking — taking a tuple apart into named variables in one statement:

x, y = point
first, *rest = (1, 2, 3, 4, 5)
[x, y, first, rest]
[3, 4, 1, [2, 3, 4, 5]]
  • x, y = point matches the 2-tuple’s two slots to two names. After this, x is 3, y is 4.
  • first, *rest = (1, 2, 3, 4, 5) uses the starred targetfirst binds to the first element, *rest collects everything else into a list [2, 3, 4, 5].
  • The starred target can also appear in the middle: head, *middle, tail = ....

The general rule: any tuple (or list) can be unpacked into matching names; *name greedily collects the leftovers as a list. The full unpacking treatment is in Chapter 14.

Tuples can be dict keys (lists cannot, because they’re mutable and unhashable). A hashable object is one whose value can be reduced to a fixed integer summary (the hash) — that integer is what lets a dict or set find the slot for the key in O(1). Mutable objects like lists are deliberately unhashable: if their hash could change after insertion, lookups would silently fail. Tuples (with hashable contents) are hashable, hence valid keys:

locations = {(40.7, -74.0): "New York", (51.5, -0.1): "London"}
locations[(40.7, -74.0)]
'New York'

The dict has two entries; each key is a 2-tuple of latitude and longitude. Looking up (40.7, -74.0) returns "New York" — and the lookup works because Python first hashes the tuple (which it can, because tuples-of-floats are hashable), uses that hash to find the bucket, and then does a == check against the candidate keys. Trying [40.7, -74.0] (a list) as a key would raise TypeError: unhashable type: 'list' — that’s the rule the previous paragraph just stated, made concrete.

5.3 Dictionaries

When you need to look something up by name — a user record by username, a count by word, a config value by key — a list is wrong (linear scan) and a class is overkill. A dict maps keys to values with O(1) lookup, insertion, and deletion on average.

person = {"name": "Alice", "age": 30}
person["name"], person.get("missing"), person.get("missing", "N/A")
('Alice', None, 'N/A')
  • {"name": "Alice", "age": 30} is a dict literal — curly braces, key: value pairs.
  • person["name"] looks up "name" and returns "Alice" — the bracket syntax raises KeyError if the key is absent.
  • person.get("missing") returns None instead of raising — useful when absence is expected.
  • person.get("missing", "N/A") returns "N/A" as the fallback — a default of your choosing.

The general rule: d[key] is loud (raises on miss); d.get(key, default) is quiet (returns the default). Modify with assignment (sets/updates), the del statement (removes a key — del d[k] is built-in syntax, not a method call), or .pop(key) (removes and returns the value):

person["city"] = "London"
del person["age"]
person.pop("city")
person
{'name': 'Alice'}

Iterate keys, values, or pairs:

person = {"name": "Alice", "age": 30, "city": "London"}
list(person.keys()), list(person.values()), list(person.items())
(['name', 'age', 'city'],
 ['Alice', 30, 'London'],
 [('name', 'Alice'), ('age', 30), ('city', 'London')])
for key, value in person.items():
    print(f"{key}: {value}")
name: Alice
age: 30
city: London

Two helpers worth memorizing — both spare you the “check if the key exists, create the value if it doesn’t, then update it” boilerplate that comes up constantly:

from collections import defaultdict, Counter

groups = defaultdict(list)
for name, team in [("Alice", "eng"), ("Bob", "design"), ("Carol", "eng")]:
    groups[team].append(name)
dict(groups)
{'eng': ['Alice', 'Carol'], 'design': ['Bob']}
  • defaultdict(list) is a dict that, on a missing-key lookup, auto-creates the value by calling list() (i.e. []).
  • The first time groups["eng"] is accessed, defaultdict creates an empty list and stores it; .append("Alice") then adds to that list.
  • Without defaultdict you’d write if team not in groups: groups[team] = []; groups[team].append(name) every iteration.
  • dict(groups) converts it back to a plain dict for display — purely cosmetic.
text = "the cat sat on the mat".split()
Counter(text).most_common(2)
[('the', 2), ('cat', 1)]
  • Counter(text) walks the iterable and counts how many times each element appears — {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}.
  • .most_common(2) returns the top 2 most-frequent items as (item, count) tuples, sorted by count.

The general rule: defaultdict(factory) for “missing keys auto-fill via factory()”; Counter for frequency tables. Both live in collections.

Two dicts can be merged with the union operator | (Python 3.9+). Right-hand keys win on conflict — the idiom for “defaults plus overrides”:

defaults = {"host": "localhost", "port": 5432, "ssl": False}
overrides = {"port": 6543, "ssl": True}
defaults | overrides
{'host': 'localhost', 'port': 6543, 'ssl': True}

The deep dive — including hashing, dict ordering, and dict views — is in Chapter 15.

5.4 Sets

When you need to ask “have I seen this?” repeatedly, or take the intersection/union/difference of two collections, a list is the wrong tool — x in big_list is O(n). A set holds unique values with O(1) membership testing. Order is not preserved.

fruits = {"apple", "banana", "cherry"}
empty = set()              # NOT {} — that's a dict
"apple" in fruits, "grape" in fruits
(True, False)
  • {"apple", "banana", "cherry"} is a set literal — curly braces, no key:value pairs.
  • set() (not {}) is the empty set — {} is the empty dict, which is the one common gotcha.
  • "apple" in fruits is the membership test — True, and constant-time regardless of set size.

The set-algebra operators are pleasant — and exactly what you’d write on paper:

a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
[a | b, a & b, a - b, a ^ b]   # union, intersection, difference, symmetric difference
[{1, 2, 3, 4, 5, 6}, {3, 4}, {1, 2}, {1, 2, 5, 6}]
  • a | bunion: everything in either set.
  • a & bintersection: only the items in both.
  • a - bdifference: in a but not b.
  • a ^ bsymmetric difference: in exactly one of the two, not both.

The general rule: reach for a set when you need fast membership, deduplication, or the algebra above.

A common one-liner: deduplicate a list while preserving uniqueness (order is not preserved):

duplicates = [1, 2, 2, 3, 3, 3, 4]
list(set(duplicates))
[1, 2, 3, 4]

If you need order-preserving deduplication, use dict.fromkeys(duplicates) (since Python 3.7, dict iteration order matches insertion order).

5.5 Comprehensions

The “build an empty collection, loop, append” pattern is so common that Python has a one-line syntax for it. A comprehension is a compact, readable way to build a collection from another collection. There are four kinds: list, dict, set, and generator.

squares = [x ** 2 for x in range(10)]
evens = [x for x in range(20) if x % 2 == 0]
[squares, evens]
[[0, 1, 4, 9, 16, 25, 36, 49, 64, 81], [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]]
  • [x ** 2 for x in range(10)] reads “the list of x ** 2 for each x in range(10)” — equivalent to the four-line loop:

    squares = []
    for x in range(10):
        squares.append(x ** 2)
  • The general shape is [expression for variable in iterable] — the expression in front, the source loop behind.

  • [x for x in range(20) if x % 2 == 0] adds an if filter — only items satisfying the condition reach the output.

The same pattern with curly braces builds dicts and sets:

square_map = {x: x ** 2 for x in range(5)}
unique_lengths = {len(w) for w in "the cat sat on the mat".split()}
[square_map, unique_lengths]
[{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}, {2, 3}]
  • {x: x ** 2 for x in range(5)} is a dict comprehensionkey: value instead of just an expression.
  • {len(w) for w in ...} is a set comprehension — single expression, but in braces with no colon. Duplicates collapse: "the" appears twice in the source but 3 appears once in the result.

The general rule: [...] builds a list, {k: v for ...} a dict, {...} a set; add if cond for filtering, change the expression for mapping.

A list comprehension materialises the whole list at once — fine for thousands of items, wasteful for millions. A generator expression is a comprehension in (...) instead of [...]. It does not build a list — it computes values on demand:

total = sum(x ** 2 for x in range(1_000_000))
total
333332833333500000
  • The parentheses turn the comprehension into a generator — a lazy producer that yields one value at a time.
  • sum(...) walks the generator, adding each squared value to a running total.
  • At no point is a million-item list materialised in memory — peak memory stays tiny.

The general rule: when the result is fed straight into a consumer like sum, max, any, or "".join, drop the brackets to a generator and save the intermediate list. Generator expressions are how you scale comprehensions to large or infinite data; the deep dive is in Chapter 29.

The walrus operator := (Python 3.8+) lets a comprehension capture a per-item value once and reuse it — useful when a filter and a map both need the same expensive computation:

import re

lines = ["score: 95", "noise", "score: 87", "noise", "score: 72"]
nums = [int(m.group(1)) for line in lines if (m := re.match(r"score: (\d+)", line))]
nums
[95, 87, 72]

m is bound only when the regex matches; the body then uses m.group(1) without re-running the match. Without the walrus you’d either match twice or fall back to a regular for loop.

5.6 Choosing the right collection

Need Use
Ordered, modifiable items list
Ordered, fixed record tuple
Lookup by key dict
Fast in test, dedupe, set algebra set
Counting occurrences collections.Counter
Auto-create missing keys collections.defaultdict
Fast append/pop both ends collections.deque
TipWhy this matters

The choice of collection drives the complexity of your code. x in some_list is O(n); x in some_set is O(1). Looking up by key in a dict is O(1); searching a list of pairs is O(n). The right collection often turns “make this faster” into “rewrite the data structure” — and the rewrite is one line.

5.7 Going deeper

The full treatment of sequences (lists, tuples, slicing, comprehensions, the sequence protocol) is Chapter 14. The full treatment of dicts and sets — hashing, ordering, performance — is Chapter 15. Generator expressions and the iterator protocol they ride on are in Chapter 29.

5.8 Build: a tiny text analyzer

A common starter problem in data work: given some text, report how many words there are, how many of those are unique, what the most common are, and which appeared exactly once. Every collection in this chapter — list, dict, set, Counter — earns a slot.

Step 1: tokenize and count. Lower-case the text, strip simple punctuation, split on whitespace, and feed the resulting list into Counter:

from collections import Counter

text = """The quick brown fox jumps over the lazy dog.
The dog sleeps. The fox watches."""

words = text.lower().replace(".", "").split()
freq = Counter(words)
freq.most_common(3)
[('the', 4), ('fox', 2), ('dog', 2)]

text.lower() returns a new string with every letter lower-cased (so “The” and “the” merge); .replace(".", "") strips periods so “dog.” and “dog” merge too; .split() (no argument) splits on any whitespace including the embedded newline. Counter(words) walks the list once and returns a dict-subclass mapping each word to its count. .most_common(3) returns the three highest counts as (word, count) tuples — sorted descending.

Step 2: pull out the once-only words. A hapax legomenon (a word that appears exactly once) is interesting in linguistics and search-quality work. We can build it as a set comprehension over the Counter, since Counter is also a dict so .items() yields pairs:

hapax = {word for word, count in freq.items() if count == 1}
unique_words = set(words)
[len(unique_words), len(hapax)]
[10, 7]

set(words) is the simplest deduplication — turning the word list into a set drops duplicates in one step. The set comprehension {word for word, count in freq.items() if count == 1} filters Counter’s pairs and keeps only the words whose count is 1.

Step 3: assemble a report. Wrap everything in a function that returns a single dict — the kind of structure you’d render as JSON or pass to a template:

def analyze(text):
    words = text.lower().replace(".", "").split()
    freq = Counter(words)
    return {
        "total_words": len(words),
        "unique_words": len(set(words)),
        "top3": freq.most_common(3),
        "hapax": sorted({w for w, c in freq.items() if c == 1}),
    }

analyze(text)
{'total_words': 15,
 'unique_words': 10,
 'top3': [('the', 4), ('fox', 2), ('dog', 2)],
 'hapax': ['brown', 'jumps', 'lazy', 'over', 'quick', 'sleeps', 'watches']}

sorted(set) returns a list — sets aren’t ordered, but the report’s reader probably expects deterministic output. Wrapping the set in sorted is the idiomatic conversion.

The build exercises every collection from the chapter: list (tokens), dict (the report), set (deduplication and one-shot words), tuple (Counter’s (word, count) pairs), and a comprehension in each variety we covered.

5.9 Exercises

  1. Dedupe preserving order. Write a one-liner that removes duplicates from a list while preserving the original order. (Hint: dict.fromkeys.)

  2. Word frequency. Given a string of text, return the three most common words using Counter.

  3. Group by first letter. Given a list of names, return a defaultdict(list) mapping each first letter to the names starting with it.

  4. Comprehension trade-offs. Time sum(x*x for x in range(1_000_000)) against sum([x*x for x in range(1_000_000)]). Which is faster? Which uses less memory?

  5. The {} trap. Predict type({}) and type(set()). Verify with type().

5.10 Summary

Lists, tuples, dicts, sets — and the comprehensions that build them — cover almost every collection task. The next chapter, Chapter 6, makes these collections flow through code by introducing functions: parameters, return values, scope, and a first taste of decorators and generators.