s = "café"
len(s)4
Python 3 has a hard separation: str is text (Unicode code points), bytes is binary data. Encode to go from str → bytes. Decode to go bytes → str. Handle this at the I/O boundary; work with str inside.
In this chapter you will learn to:

- Distinguish text (str) and bytes, and convert between them with explicit encodings.
- Diagnose UnicodeEncodeError and UnicodeDecodeError and choose an appropriate errors= strategy.
- Apply the Unicode sandwich: decode bytes on input, process as str, encode on output.
- Normalize str to NFC/NFD/NFKC/NFKD for reliable comparisons.
- Sort text in language-aware order with locale or pyuca.
- Look up character metadata with unicodedata.

## The str / bytes boundary

A str is a sequence of code points — abstract numbers like U+0041 for A. (U+XXXX is the standard Unicode notation: a four- to six-digit hexadecimal index into the Unicode database. So U+0041 is decimal 65, the letter A; U+00E9 is é; U+1F600 is 😀.) A bytes is a sequence of 8-bit integers. The two never mix.
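Python exposes code points directly through ord and chr; a quick check of the examples just mentioned (results shown as comments):

```python
ord("A"), hex(ord("é")), chr(0x1F600)
# (65, '0xe9', '😀')
```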
s = "café"
len(s)4
len(s) returns 4 — the count of code points, not bytes. Each visible character (c, a, f, é) is one code point regardless of how many bytes it would take to encode.
But the byte representation is encoding-dependent. UTF-8 takes one byte for ASCII characters and two bytes for é:
```python
b = s.encode("utf-8")
b, len(b)  # (b'caf\xc3\xa9', 5)
```
s.encode("utf-8") returns a bytes object — the actual on-disk representation — b'caf\xc3\xa9'. The first three bytes (c, a, f) are ASCII and take one byte each; é becomes two bytes \xc3 \xa9 in UTF-8. So len(b) is 5 even though len(s) is 4. Number of bytes is not number of characters — that’s the most important sentence in this chapter.
The reverse trip is decode:
```python
b.decode("utf-8")  # 'café'
```
b.decode("utf-8") reads the bytes back, recognises the \xc3 \xa9 pair as é, and rebuilds the four-character string 'café'. Encode-then-decode with the same encoding round-trips perfectly; mixing encodings is where data corruption begins.
bytes is immutable; bytearray is its mutable sibling. Slicing bytes returns bytes, but indexing returns an int:
```python
cafe = bytes("café", encoding="utf-8")
cafe[0], cafe[:1]  # (99, b'c')
```
Walking through the result:
- cafe[0] is 99 — a single byte, returned as a Python int (the ASCII code for c).
- cafe[:1] is b'c' — a one-element slice, returned as a bytes object of length 1.

Indexing bytes gives you a number; slicing gives you bytes. This asymmetry trips up everyone the first time.

## Encoding and decoding errors

Encoding can fail when the target charset can't represent a character. Decoding can fail when the bytes are invalid for the chosen encoding.
city = "São Paulo"
city.encode("cp437")--------------------------------------------------------------------------- UnicodeEncodeError Traceback (most recent call last) Cell In[5], line 2 1 city = "São Paulo" ----> 2 city.encode("cp437") File /opt/hostedtoolcache/Python/3.13.13/x64/lib/python3.13/encodings/cp437.py:12, in Codec.encode(self, input, errors) 11 def encode(self,input,errors='strict'): ---> 12 return codecs.charmap_encode(input,errors,encoding_map) UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined> encoding with 'cp437' codec failed
cp437 is an old DOS code page that doesn’t include ã. The errors= parameter chooses what happens on failure:
city.encode("cp437", errors="ignore")b'So Paulo'
city.encode("cp437", errors="replace")b'S?o Paulo'
city.encode("cp437", errors="xmlcharrefreplace")b'São Paulo'
Decoding is the harder direction because any sequence of bytes can be misinterpreted under the wrong encoding.
octets = b"Montr\xe9al"
octets.decode("utf-8")--------------------------------------------------------------------------- UnicodeDecodeError Traceback (most recent call last) Cell In[9], line 2 1 octets = b"Montr\xe9al" ----> 2 octets.decode("utf-8") UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
The b"..." prefix is the bytes literal — Python builds a bytes object from the literal text. Inside, \xHH is an escape sequence for a single byte with that two-hex-digit value: \xe9 is the byte 0xE9 (decimal 233). So b"Montr\xe9al" is eight bytes: M, o, n, t, r, \xe9, a, l — the same way "hello\n" is six characters with the \n an escape for newline.
octets.decode("iso8859-1")'Montréal'
octets.decode("utf-8", errors="replace")'Montr�al'
The replace form succeeded — but it produced wrong data. Montréal and Montr�al are not the same. Silent data loss is the danger of permissive decoding.
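With errors="replace", the replacement character U+FFFD appears exactly where a byte could not be decoded, so scanning for it is a cheap tripwire for loss:

```python
decoded = octets.decode("utf-8", errors="replace")
"\ufffd" in decoded  # True: at least one byte was lost in decoding
```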
To detect an unknown encoding, the chardet package guesses based on byte statistics:
```python
import chardet  # pip install chardet

rawdata = open("mystery.txt", "rb").read()
chardet.detect(rawdata)
# {'encoding': 'ISO-8859-1', 'confidence': 0.73}
```

Confidence below ~0.9 means you should treat the result as a hint, not an answer.
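In practice the hint becomes a guarded decode. A minimal sketch, where the 0.9 cutoff and the UTF-8 fallback are policy choices of this example, not part of chardet's API:

```python
guess = chardet.detect(rawdata)
# Trust the guess only when chardet is confident; otherwise assume UTF-8.
encoding = guess["encoding"] if guess["confidence"] >= 0.9 else "utf-8"
text = rawdata.decode(encoding, errors="replace")
```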
## The Unicode sandwich

The right design for any text-handling program: decode bytes on input, work with str throughout, encode to bytes on output. This is the Unicode sandwich.
```python
import tempfile, os

with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as fp:
    fp.write("café")
    path = fp.name

with open(path, encoding="utf-8") as fp:
    print(fp.read())  # café

os.unlink(path)
```
Walking through the moving parts:
- tempfile.NamedTemporaryFile("w", encoding="utf-8", ...) opens a temp file in text mode — Python encodes the str "café" to UTF-8 bytes on the way out. We never touch bytes ourselves.
- delete=False keeps the file on disk after the with exits, so we can re-open it. We clean up by hand at the end.
- open(path, encoding="utf-8") reads in text mode — Python decodes the UTF-8 bytes back into a str. The print displays the str.
- os.unlink(path) deletes the temp file.

The general rule: pass encoding="utf-8" every time you open a file. Inside your program, work in str. Bytes only exist at the entry and exit boundaries.
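To see the bytes layer of the sandwich directly, open a file in binary mode. A small self-contained sketch (out.txt is a throwaway name for illustration):

```python
with open("out.txt", "w", encoding="utf-8") as fp:
    fp.write("café")               # str goes in; Python encodes to UTF-8

with open("out.txt", "rb") as fp:  # "rb": binary mode, no decoding
    print(fp.read())               # b'caf\xc3\xa9', the raw bytes on disk
```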
## Pass encoding= explicitly
open() defaults to a platform-dependent encoding (often UTF-8 on macOS/Linux but cp1252 on Windows). Code that omits encoding= is a Heisenbug waiting for the day you ship to a different OS. Always pass it.
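You can ask Python what the platform default actually is; this is the value open() falls back to when encoding= is omitted:

```python
import locale
locale.getpreferredencoding(False)  # e.g. 'UTF-8' on Linux/macOS, 'cp1252' on many Windows setups
```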
## Normalization

The same visible text can be represented two ways. é can be a single code point (U+00E9) or e followed by a combining accent (U+0301):
s1 = "café" # composed
s2 = "cafe\u0301" # decomposed
s1 == s2, len(s1), len(s2)(False, 4, 5)
Walking through what each line shows:
- s1 uses the precomposed é — one code point — so len(s1) is 4.
- s2 writes e followed by \u0301, the combining acute accent. Two code points, so len(s2) is 5. The terminal renders them on top of each other, but the underlying code-point sequence is different.
- s1 == s2 is False because == on str compares code-point sequences, not visual appearance.

unicodedata.normalize collapses both to a canonical form:
```python
from unicodedata import normalize

normalize("NFC", s1) == normalize("NFC", s2)  # True
```
The general rule: any time user-typed text is compared, hashed, or used as a key, run it through normalize("NFC", ...) first — otherwise visually identical strings will sometimes be unequal.
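A small helper makes the rule concrete (nfc_equal is a name invented here for illustration):

```python
def nfc_equal(a, b):
    """Compare two strings as NFC-normalized code-point sequences."""
    return normalize("NFC", a) == normalize("NFC", b)

nfc_equal("café", "cafe\u0301")  # True, even though plain == says False
```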
| Form | Meaning | Use for |
|---|---|---|
| NFC | composed (shortest) | default; storage and display |
| NFD | decomposed | char-by-char ASCII operations |
| NFKC, NFKD | compatibility (more aggressive) | search, deduplication |
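The compatibility forms rewrite presentation characters into plain equivalents, which is why they suit search and deduplication but not storage:

```python
normalize("NFKC", "½"), normalize("NFKC", "①")  # ('1⁄2', '1')
```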
For case-insensitive comparison, prefer casefold over lower:
"ß".casefold(), "Σ".casefold()('ss', 'σ')
ß lowercases to ß but case-folds to ss, which is what a German speaker would consider equal.
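The practical difference shows up when both sides of a comparison are folded:

```python
"Straße".casefold() == "STRASSE".casefold()  # True
"Straße".lower() == "STRASSE".lower()        # False ('straße' vs 'strasse')
```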
## Sorting Unicode text

The default sorted orders by code point, which gives the wrong answer for any language with diacritics:
```python
fruits = ["caju", "atemoia", "cajá", "açaí", "acerola"]
sorted(fruits)  # ['acerola', 'atemoia', 'açaí', 'caju', 'cajá']
```
acerola < açaí is wrong in Portuguese — ç should sort with c. The fix is locale.strxfrm, but it requires that the locale be installed on your OS:
```python
import locale

locale.setlocale(locale.LC_ALL, "pt_BR.UTF-8")  # raises locale.Error if pt_BR.UTF-8 is absent
sorted(fruits, key=locale.strxfrm)
```

A portable alternative is pyuca, a pure-Python implementation of the Unicode Collation Algorithm:
```python
import pyuca  # pip install pyuca

coll = pyuca.Collator()
sorted(fruits, key=coll.sort_key)
```

## Character metadata with unicodedata

unicodedata exposes the metadata that ships with Python:
```python
import unicodedata

unicodedata.name("A"), unicodedata.name("€")
# ('LATIN CAPITAL LETTER A', 'EURO SIGN')

unicodedata.lookup("SNOWMAN"), unicodedata.lookup("EURO SIGN")
# ('☃', '€')

unicodedata.numeric("⅔"), unicodedata.digit("3")
# (0.6666666666666666, 3)
```
The Unicode sandwich: decode bytes on input → work with str throughout → encode to bytes on output. Never mix str and bytes. Never rely on default encodings. Always pass encoding="utf-8" explicitly.
## Worked example: slugify

A slug is the URL-safe version of a title — "São Paulo Year-In-Review" becomes "sao-paulo-year-in-review". Building one is the canonical workout for the chapter’s tools: NFKD normalisation to peel accents off as combining marks, code-point filtering to drop the marks, casefold for case-insensitive collapse, and a regex to handle whitespace.
Step 1: peel diacritics off via NFKD + filter combining marks. The trick: NFKD decomposes é into e + \u0301 (combining acute), and unicodedata.combining(c) is non-zero exactly for those marks. Drop them and you’re left with the base letters:
```python
import unicodedata

def strip_accents(text):
    nfd = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfd if not unicodedata.combining(c))

[strip_accents("café"), strip_accents("São Paulo"), strip_accents("naïve")]
# ['cafe', 'Sao Paulo', 'naive']
```
unicodedata.combining(c) returns the combining class of a character — an integer that’s 0 for ordinary letters and non-zero for combining marks like the acute, diaeresis, tilde. The generator expression keeps only the non-marks, then "".join(...) reassembles a string. That’s the heart of every “ASCII-fold” implementation you’ll meet.
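You can watch the combining classes directly:

```python
[unicodedata.combining(c) for c in unicodedata.normalize("NFD", "é")]
# [0, 230]: 0 for the base letter e, 230 for the combining acute
```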
Step 2: case-fold and collapse whitespace. Casefold (rather than .lower()) so German ß becomes ss, then a regex collapses runs of whitespace and any non-ASCII residue:
```python
import re

def slugify(text):
    text = strip_accents(text).casefold()
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")

[slugify("São Paulo"), slugify("Naïve café résumé"), slugify("Hello, World!")]
# ['sao-paulo', 'naive-cafe-resume', 'hello-world']
```
re.sub(r"[^a-z0-9]+", "-", text) replaces every run of non-alphanumerics with a single hyphen — spaces, punctuation, and any leftover non-ASCII characters that survived strip_accents (some scripts have no plain-ASCII fold). .strip("-") removes any leading or trailing hyphens.
Step 3: a round trip through the Unicode sandwich. Real slugifiers run at the I/O boundary — a title comes in as bytes from a database, the slug goes back out as bytes to a URL. Decode → process in str → encode is the chapter’s main rule:
```python
def slugify_bytes(raw, encoding="utf-8"):
    text = raw.decode(encoding)  # bytes -> str at the boundary
    return slugify(text).encode("ascii")

slugify_bytes("São Paulo".encode("utf-8"))  # b'sao-paulo'
```
.decode(encoding) opens the sandwich; slugify(...) works entirely in str; .encode("ascii") closes it. Because the slug only contains a–z, 0–9, and -, encoding to ascii is safe — the ASCII alphabet is a strict subset of every common encoding, so the resulting bytes are interchangeable.
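Because the decode side takes an explicit encoding, the same function handles input from other charsets unchanged; a quick check with the functions defined above:

```python
slugify_bytes("São Paulo".encode("latin-1"), encoding="latin-1")  # b'sao-paulo'
```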
The build exercises everything: NFKD normalisation (unicodedata.normalize), combining-mark filtering (unicodedata.combining), casefold for case-insensitive collapse, regex over a str, and decode/encode framing the whole thing as a Unicode sandwich.
## Exercises

1. Round-trip. Take the string "naïve", encode it as UTF-8 and as Latin-1. How many bytes does each produce? Decode each back to str — do you recover the original?
2. Spot the silent failure. Decode "naïve".encode("utf-8").replace(b"\xc3\xaf", b"\xee") (which simulates a corrupted byte) with errors="replace". What do you see? With errors="strict"?
3. Composed vs decomposed. Build the string "café" two ways: as "caf" + "\u00e9" and as "cafe" + "\u0301". They print identically. Why does set() count them as different elements? Fix it.
4. Casefold corner. Find a character whose .lower() and .casefold() differ. The notes mention one — German ß — find another.
5. Sort by accent. Without using locale or pyuca, write a sort key that strips accents using NFD plus unicodedata.combining. Sort the fruits list and compare with the broken default order.
## Wrapping up

Python 3 makes a hard separation between str (text) and bytes (binary). The Unicode sandwich is the discipline that keeps that separation clean: decode at the boundary, work with str inside, encode on the way out. Combined with explicit encodings and Unicode normalization, you have a working model that handles every language.
Next, Chapter 17 introduces Python’s three (or four) ways to build data-record classes: namedtuple, typing.NamedTuple, @dataclass, and the older path of writing classes by hand.