s = "café"
len(s)4
Python 3 has a hard separation: str is text (Unicode code points), bytes is binary data. Encode to go from str → bytes. Decode to go bytes → str. Handle this at the I/O boundary; work with str inside.
In this chapter you will learn to:

- Distinguish text (str) and bytes, and convert between them with explicit encodings.
- Diagnose UnicodeEncodeError and UnicodeDecodeError and choose an appropriate errors= strategy.
- Apply the Unicode sandwich: decode bytes on input, process as str, encode on output.
- Normalize str to NFC/NFD/NFKC/NFKD for reliable comparisons.
- Sort text in language-aware order with locale or pyuca.
- Look up character metadata with unicodedata.

## The str / bytes boundary

A str is a sequence of code points — abstract numbers like U+0041 for A. (U+XXXX is the standard Unicode notation: a four- to six-digit hexadecimal index into the Unicode database. So U+0041 is decimal 65, the letter A; U+00E9 is é; U+1F600 is 😀.) A bytes is a sequence of 8-bit integers. The two never mix.
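Python exposes code points directly through ord and chr; a quick check of the examples just mentioned (results shown as comments):

```python
ord("A"), hex(ord("é")), chr(0x1F600)
# (65, '0xe9', '😀')
```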
s = "café"
len(s)4
len(s) returns 4 — the count of code points, not bytes. Each visible character (c, a, f, é) is one code point regardless of how many bytes it would take to encode.
But the byte representation is encoding-dependent. UTF-8 takes one byte for ASCII characters and two bytes for é:
```python
b = s.encode("utf-8")
b, len(b)  # (b'caf\xc3\xa9', 5)
```
s.encode("utf-8") returns a bytes object — the actual on-disk representation — b'caf\xc3\xa9'. The first three bytes (c, a, f) are ASCII and take one byte each; é becomes two bytes \xc3 \xa9 in UTF-8. So len(b) is 5 even though len(s) is 4. Number of bytes is not number of characters — that’s the most important sentence in this chapter.
The reverse trip is decode:
```python
b.decode("utf-8")  # 'café'
```
b.decode("utf-8") reads the bytes back, recognises the \xc3 \xa9 pair as é, and rebuilds the four-character string 'café'. Encode-then-decode with the same encoding round-trips perfectly; mixing encodings is where data corruption begins.
bytes is immutable; bytearray is its mutable sibling. Slicing bytes returns bytes, but indexing returns an int:
```python
cafe = bytes("café", encoding="utf-8")
cafe[0], cafe[:1]  # (99, b'c')
```
Walking through the result:
- cafe[0] is 99 — a single byte, returned as a Python int (the ASCII code for c).
- cafe[:1] is b'c' — a one-element slice, returned as a bytes object of length 1.

Indexing bytes gives you a number; slicing gives you bytes. This asymmetry trips up everyone the first time.

## Encoding and decoding errors

Encoding can fail when the target charset can't represent a character. Decoding can fail when the bytes are invalid for the chosen encoding.
city = "São Paulo"
city.encode("cp437")--------------------------------------------------------------------------- UnicodeEncodeError Traceback (most recent call last) Cell In[5], line 2 1 city = "São Paulo" ----> 2 city.encode("cp437") File /opt/hostedtoolcache/Python/3.13.13/x64/lib/python3.13/encodings/cp437.py:12, in Codec.encode(self, input, errors) 11 def encode(self,input,errors='strict'): ---> 12 return codecs.charmap_encode(input,errors,encoding_map) UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined> encoding with 'cp437' codec failed
cp437 is an old DOS code page that doesn’t include ã. The errors= parameter chooses what happens on failure:
city.encode("cp437", errors="ignore")b'So Paulo'
city.encode("cp437", errors="replace")b'S?o Paulo'
city.encode("cp437", errors="xmlcharrefreplace")b'São Paulo'
Decoding is the harder direction because any sequence of bytes can be misinterpreted under the wrong encoding.
octets = b"Montr\xe9al"
octets.decode("utf-8")--------------------------------------------------------------------------- UnicodeDecodeError Traceback (most recent call last) Cell In[9], line 2 1 octets = b"Montr\xe9al" ----> 2 octets.decode("utf-8") UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
The b"..." prefix is the bytes literal — Python builds a bytes object from the literal text. Inside, \xHH is an escape sequence for a single byte with that two-hex-digit value: \xe9 is the byte 0xE9 (decimal 233). So b"Montr\xe9al" is eight bytes: M, o, n, t, r, \xe9, a, l — the same way "hello\n" is six characters with the \n an escape for newline.
octets.decode("iso8859-1")'Montréal'
octets.decode("utf-8", errors="replace")'Montr�al'
The replace form succeeded — but it produced wrong data. Montréal and Montr�al are not the same. Silent data loss is the danger of permissive decoding.
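With errors="replace", the replacement character U+FFFD appears exactly where a byte could not be decoded, so scanning for it is a cheap tripwire for loss:

```python
decoded = octets.decode("utf-8", errors="replace")
"\ufffd" in decoded  # True: at least one byte was lost in decoding
```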
To detect an unknown encoding, the chardet package guesses based on byte statistics:
```python
import chardet  # pip install chardet

rawdata = open("mystery.txt", "rb").read()
chardet.detect(rawdata)
# {'encoding': 'ISO-8859-1', 'confidence': 0.73}
```

Confidence below ~0.9 means you should treat the result as a hint, not an answer.
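In practice the hint becomes a guarded decode. A minimal sketch, where the 0.9 cutoff and the UTF-8 fallback are policy choices of this example, not part of chardet's API:

```python
guess = chardet.detect(rawdata)
# Trust the guess only when chardet is confident; otherwise assume UTF-8.
encoding = guess["encoding"] if guess["confidence"] >= 0.9 else "utf-8"
text = rawdata.decode(encoding, errors="replace")
```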
## The Unicode sandwich

The right design for any text-handling program: decode bytes on input, work with str throughout, encode to bytes on output. This is the Unicode sandwich.
```python
import tempfile, os

with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as fp:
    fp.write("café")
    path = fp.name

with open(path, encoding="utf-8") as fp:
    print(fp.read())  # café

os.unlink(path)
```
Walking through the moving parts:
- tempfile.NamedTemporaryFile("w", encoding="utf-8", ...) opens a temp file in text mode — Python encodes the str "café" to UTF-8 bytes on the way out. We never touch bytes ourselves.
- delete=False keeps the file on disk after the with exits, so we can re-open it. We clean up by hand at the end.
- open(path, encoding="utf-8") reads in text mode — Python decodes the UTF-8 bytes back into a str. The print displays the str.
- os.unlink(path) deletes the temp file.

The general rule: pass encoding="utf-8" every time you open a file. Inside your program, work in str. Bytes only exist at the entry and exit boundaries.
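To see the bytes layer of the sandwich directly, open a file in binary mode. A small self-contained sketch (out.txt is a throwaway name for illustration):

```python
with open("out.txt", "w", encoding="utf-8") as fp:
    fp.write("café")               # str goes in; Python encodes to UTF-8

with open("out.txt", "rb") as fp:  # "rb": binary mode, no decoding
    print(fp.read())               # b'caf\xc3\xa9', the raw bytes on disk
```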
## Pass encoding= explicitly
open() defaults to a platform-dependent encoding (often UTF-8 on macOS/Linux but cp1252 on Windows). Code that omits encoding= is a Heisenbug waiting for the day you ship to a different OS. Always pass it.
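You can ask Python what the platform default actually is; this is the value open() falls back to when encoding= is omitted:

```python
import locale
locale.getpreferredencoding(False)  # e.g. 'UTF-8' on Linux/macOS, 'cp1252' on many Windows setups
```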
## Normalization

The same visible text can be represented two ways. é can be a single code point (U+00E9) or e followed by a combining accent (U+0301):
s1 = "café" # composed
s2 = "cafe\u0301" # decomposed
s1 == s2, len(s1), len(s2)(False, 4, 5)
Walking through what each line shows:
- s1 uses the precomposed é — one code point — so len(s1) is 4.
- s2 writes e followed by \u0301, the combining acute accent. Two code points, so len(s2) is 5. The terminal renders them on top of each other, but the underlying code-point sequence is different.
- s1 == s2 is False because == on str compares code-point sequences, not visual appearance.

unicodedata.normalize collapses both to a canonical form:
```python
from unicodedata import normalize

normalize("NFC", s1) == normalize("NFC", s2)  # True
```
The general rule: any time user-typed text is compared, hashed, or used as a key, run it through normalize("NFC", ...) first — otherwise visually identical strings will sometimes be unequal.
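A small helper makes the rule concrete (nfc_equal is a name invented here for illustration):

```python
def nfc_equal(a, b):
    """Compare two strings as NFC-normalized code-point sequences."""
    return normalize("NFC", a) == normalize("NFC", b)

nfc_equal("café", "cafe\u0301")  # True, even though plain == says False
```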
| Form | Meaning | Use for |
|---|---|---|
| NFC | composed (shortest) | default; storage and display |
| NFD | decomposed | char-by-char ASCII operations |
| NFKC, NFKD | compatibility (more aggressive) | search, deduplication |
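The compatibility forms rewrite presentation characters into plain equivalents, which is why they suit search and deduplication but not storage:

```python
normalize("NFKC", "½"), normalize("NFKC", "①")  # ('1⁄2', '1')
```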
For case-insensitive comparison, prefer casefold over lower:
"ß".casefold(), "Σ".casefold()('ss', 'σ')
ß lowercases to ß but case-folds to ss, which is what a German speaker would consider equal.
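The practical difference shows up when both sides of a comparison are folded:

```python
"Straße".casefold() == "STRASSE".casefold()  # True
"Straße".lower() == "STRASSE".lower()        # False ('straße' vs 'strasse')
```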
## Sorting Unicode text

The default sorted orders by code point, which gives the wrong answer for any language with diacritics:
```python
fruits = ["caju", "atemoia", "cajá", "açaí", "acerola"]
sorted(fruits)  # ['acerola', 'atemoia', 'açaí', 'caju', 'cajá']
```
acerola < açaí is wrong in Portuguese — ç should sort with c. The fix is locale.strxfrm, but it requires that the locale be installed on your OS:
```python
import locale

locale.setlocale(locale.LC_ALL, "pt_BR.UTF-8")  # raises locale.Error if pt_BR.UTF-8 is absent
sorted(fruits, key=locale.strxfrm)
```

A portable alternative is pyuca, a pure-Python implementation of the Unicode Collation Algorithm:
```python
import pyuca  # pip install pyuca

coll = pyuca.Collator()
sorted(fruits, key=coll.sort_key)
```

## Character metadata with unicodedata

unicodedata exposes the metadata that ships with Python:
```python
import unicodedata

unicodedata.name("A"), unicodedata.name("€")
# ('LATIN CAPITAL LETTER A', 'EURO SIGN')

unicodedata.lookup("SNOWMAN"), unicodedata.lookup("EURO SIGN")
# ('☃', '€')

unicodedata.numeric("⅔"), unicodedata.digit("3")
# (0.6666666666666666, 3)
```
The Unicode sandwich: decode bytes on input → work with str throughout → encode to bytes on output. Never mix str and bytes. Never rely on default encodings. Always pass encoding="utf-8" explicitly.
## Worked example: slugify

A slug is the URL-safe version of a title — "São Paulo Year-In-Review" becomes "sao-paulo-year-in-review". Building one is the canonical workout for the chapter’s tools: NFKD normalisation to peel accents off as combining marks, code-point filtering to drop the marks, casefold for case-insensitive collapse, and a regex to handle whitespace.
Step 1: peel diacritics off via NFKD + filter combining marks. The trick: NFKD decomposes é into e + \u0301 (combining acute), and unicodedata.combining(c) is non-zero exactly for those marks. Drop them and you’re left with the base letters:
```python
import unicodedata

def strip_accents(text):
    nfd = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfd if not unicodedata.combining(c))

[strip_accents("café"), strip_accents("São Paulo"), strip_accents("naïve")]
# ['cafe', 'Sao Paulo', 'naive']
```
unicodedata.combining(c) returns the combining class of a character — an integer that’s 0 for ordinary letters and non-zero for combining marks like the acute, diaeresis, tilde. The generator expression keeps only the non-marks, then "".join(...) reassembles a string. That’s the heart of every “ASCII-fold” implementation you’ll meet.
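You can watch the combining classes directly:

```python
[unicodedata.combining(c) for c in unicodedata.normalize("NFD", "é")]
# [0, 230]: 0 for the base letter e, 230 for the combining acute
```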
Step 2: case-fold and collapse whitespace. Casefold (rather than .lower()) so German ß becomes ss, then a regex collapses runs of whitespace and any non-ASCII residue:
```python
import re

def slugify(text):
    text = strip_accents(text).casefold()
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")

[slugify("São Paulo"), slugify("Naïve café résumé"), slugify("Hello, World!")]
# ['sao-paulo', 'naive-cafe-resume', 'hello-world']
```
re.sub(r"[^a-z0-9]+", "-", text) replaces every run of non-alphanumerics with a single hyphen — spaces, punctuation, and any leftover non-ASCII characters that survived strip_accents (some scripts have no plain-ASCII fold). .strip("-") removes any leading or trailing hyphens.
Step 3: a round trip through the Unicode sandwich. Real slugifiers run at the I/O boundary — a title comes in as bytes from a database, the slug goes back out as bytes to a URL. Decode → process in str → encode is the chapter’s main rule:
```python
def slugify_bytes(raw, encoding="utf-8"):
    text = raw.decode(encoding)  # bytes -> str at the boundary
    return slugify(text).encode("ascii")

slugify_bytes("São Paulo".encode("utf-8"))  # b'sao-paulo'
```
.decode(encoding) opens the sandwich; slugify(...) works entirely in str; .encode("ascii") closes it. Because the slug only contains a–z, 0–9, and -, encoding to ascii is safe — the ASCII alphabet is a strict subset of every common encoding, so the resulting bytes are interchangeable.
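Because the decode side takes an explicit encoding, the same function handles input from other charsets unchanged; a quick check with the functions defined above:

```python
slugify_bytes("São Paulo".encode("latin-1"), encoding="latin-1")  # b'sao-paulo'
```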
The build exercises everything: NFKD normalisation (unicodedata.normalize), combining-mark filtering (unicodedata.combining), casefold for case-insensitive collapse, regex over a str, and decode/encode framing the whole thing as a Unicode sandwich.
## Exercises

1. Round-trip. Take the string "naïve", encode it as UTF-8 and as Latin-1. How many bytes does each produce? Decode each back to str — do you recover the original?
2. Spot the silent failure. Decode "naïve".encode("utf-8").replace(b"\xc3\xaf", b"\xee") (which simulates a corrupted byte) with errors="replace". What do you see? With errors="strict"?
3. Composed vs decomposed. Build the string "café" two ways: as "caf" + "\u00e9" and as "cafe" + "\u0301". They print identically. Why does set() count them as different elements? Fix it.
4. Casefold corner. Find a character whose .lower() and .casefold() differ. The notes mention one — German ß — find another.
5. Sort by accent. Without using locale or pyuca, write a sort key that strips accents using NFD plus unicodedata.combining. Sort the fruits list and compare with the broken default order.
## Wrapping up

Python 3 makes a hard separation between str (text) and bytes (binary). The Unicode sandwich is the discipline that keeps that separation clean: decode at the boundary, work with str inside, encode on the way out. Combined with explicit encodings and Unicode normalization, you have a working model that handles every language.
Next, Chapter 17 introduces Python’s three (or four) ways to build data-record classes: namedtuple, typing.NamedTuple, @dataclass, and the older path of writing classes by hand.