Python Codecs

codecs is a standard library module that provides an interface to the codec registry. It is mainly used to work with text encodings, which encode text (str) to bytes and decode bytes to text.
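
As a minimal sketch of what the registry interface looks like, the following looks up a codec by name and uses the module-level encode and decode helpers. The sample string is an assumption chosen for illustration.

```python
import codecs

# Look up a codec in the registry; the alias 'utf8' resolves to the
# canonical codec name.
info = codecs.lookup('utf8')
print(info.name)

# The module-level helpers find the codec through the same registry.
encoded = codecs.encode('café', 'utf-8')   # str -> bytes
decoded = codecs.decode(encoded, 'utf-8')  # bytes -> str
print(encoded)
```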


Example

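To replace invalid byte sequences while writing chunks of encoded text, without first joining the chunks, try:

from codecs import iterdecode, iterencode

chunks = [b'spam \xff', b' eggs']  # sample input: byte chunks, one invalid byte
with open('out.txt', 'wb') as f:   # sample path; binary mode, as iterencode yields bytes
    decoded = iterdecode(chunks, 'utf-8', 'replace')
    for chunk in iterencode(decoded, 'utf-8'):
        f.write(chunk)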


Usage

Decode

Convert bytes to str using a text encoding.

Error handlers include:

Value             Meaning
strict            raise UnicodeError (or a subclass) on errors
ignore            skip the malformed bytes and continue without notice
replace           replace errors with the replacement character (U+FFFD)
backslashreplace  replace errors with hex escape sequences (\xhh)
surrogateescape   replace each undecodable byte with a surrogate code (within U+DC80 to U+DCFF)
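
To make the decode error handlers concrete, here is a small sketch that decodes the same malformed input with each one. The sample bytes are an assumption; b'\xff' is used because it never appears in valid UTF-8.

```python
import codecs

# Hypothetical sample: b'\xff' can never appear in valid UTF-8.
data = b'caf\xff'

results = {
    handler: codecs.decode(data, 'utf-8', handler)
    for handler in ('ignore', 'replace', 'backslashreplace', 'surrogateescape')
}
for handler, text in results.items():
    # ascii() keeps the output printable on any terminal encoding.
    print(f'{handler:17} {ascii(text)}')

# 'strict' (the default) raises instead of returning a string.
strict_raised = False
try:
    codecs.decode(data, 'utf-8', 'strict')
except UnicodeDecodeError:
    strict_raised = True
print('strict raised:', strict_raised)
```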

IterDecode

Decode an iterable of bytes incrementally, yielding str as each chunk is consumed. See the example above.
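
One reason to decode iteratively rather than chunk by chunk is that a multi-byte character may be split across two chunks. The sketch below assumes a sample input where the two bytes of 'é' land in different chunks.

```python
import codecs

# Hypothetical sample: the UTF-8 encoding of 'é' (b'\xc3\xa9') is split
# across the chunk boundary.
chunks = [b'caf\xc3', b'\xa9 au lait']

# The incremental decoder buffers the dangling b'\xc3' until the next chunk.
pieces = list(codecs.iterdecode(chunks, 'utf-8'))
text = ''.join(pieces)

# Decoding the first chunk on its own fails on the incomplete sequence.
try:
    chunks[0].decode('utf-8')
    independent_ok = True
except UnicodeDecodeError:
    independent_ok = False

print(ascii(pieces), independent_ok)
```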

Encode

Convert str to bytes using a text encoding.

Error handlers include:

Value              Meaning
strict             raise UnicodeError (or a subclass) on errors
ignore             skip the unencodable characters and continue without notice
replace            replace errors with ?
backslashreplace   replace errors with escape sequences (\xhh, \uxxxx, or \Uxxxxxxxx)
surrogateescape    replace surrogate codes (within U+DC80 to U+DCFF) with the original bytes
xmlcharrefreplace  replace errors with XML numeric character references (&#num;)
namereplace        replace errors with Unicode Character Database name escape sequences (\N{name})
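
The encode error handlers can be compared the same way. This sketch encodes a sample string (an assumption) to ASCII with each handler, then shows surrogateescape restoring an undecodable byte on the way back out.

```python
import codecs

text = 'café'  # hypothetical sample; 'é' (U+00E9) has no ASCII encoding

handlers = ('ignore', 'replace', 'backslashreplace',
            'xmlcharrefreplace', 'namereplace')
results = {h: codecs.encode(text, 'ascii', h) for h in handlers}
for handler, encoded in results.items():
    print(f'{handler:18} {encoded!r}')

# surrogateescape is meant for round trips: the byte that could not be
# decoded comes back unchanged when the text is encoded again.
raw = b'caf\xff'
as_text = codecs.decode(raw, 'utf-8', 'surrogateescape')
round_trip = codecs.encode(as_text, 'utf-8', 'surrogateescape')
print(round_trip == raw)
```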

IterEncode

Encode an iterable of str incrementally, yielding bytes as each chunk is consumed. See the example above.
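
Iterative encoding matters for stateful encodings. As a sketch, assuming UTF-16 as the target encoding, the following compares encoding each piece separately with encoding through iterencode, which shares one incremental encoder and so writes the byte order mark only once.

```python
import codecs

pieces = ['sp', 'am']  # hypothetical sample input

# Encoding each piece separately repeats the UTF-16 byte order mark (BOM).
separately = b''.join(piece.encode('utf-16') for piece in pieces)

# iterencode shares one incremental encoder, so the BOM is written once.
together = b''.join(codecs.iterencode(pieces, 'utf-16'))

print(len(separately), len(together))
print(together.decode('utf-16'))
```

The two extra bytes in the first result are the second BOM, which would end up embedded in the middle of the output.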


Troubleshooting

can't concat int to bytes

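The iterencode and iterdecode APIs operate on iterables: iterencode converts an iterable of strings into an iterator of bytes, and iterdecode converts an iterable of bytes into an iterator of strings. This is trickier than expected. Iterating over a string yields one-character strings, so passing a single string to iterencode works, but iterating over a bytestring yields integers, not bytes, so passing a single bytestring to iterdecode fails. The APIs are not symmetrical. See #38482 for more details.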

>>> import codecs
>>> list(codecs.iterencode(['spam'], 'utf-8'))
[b'spam']
>>> list(codecs.iterencode('spam', 'utf-8'))
[b's', b'p', b'a', b'm']
>>> list(codecs.iterdecode([b'spam'], 'utf-8'))
['spam']
>>> list(codecs.iterdecode(b'spam', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 1048, in iterdecode
    output = decoder.decode(input)
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 321, in decode
    data = self.buffer + input
TypeError: can't concat int to bytes

Instead, try:

>>> list(codecs.iterdecode([bytes([b]) for b in b'spam'], 'utf-8'))
['s', 'p', 'a', 'm']
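
Splitting into single bytes is not the only option. Slicing a bytestring yields bytes rather than integers, so a small generator can feed iterdecode larger chunks. The helper name byte_chunks and its default size are hypothetical, not part of the codecs module.

```python
import codecs

def byte_chunks(data, size=4096):
    """Yield successive bytes slices of data (hypothetical helper)."""
    for start in range(0, len(data), size):
        # Slicing bytes yields bytes, unlike iterating, which yields ints.
        yield data[start:start + size]

decoded = list(codecs.iterdecode(byte_chunks(b'spam', size=2), 'utf-8'))
print(decoded)
```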


See also

Python codecs module documentation

Python Module of the Day article for codecs


Python/Codecs (last edited 2023-10-11 14:41:13 by DominicRicottone)