Python Codecs
codecs is a module that provides an interface to the codec registry. This is mainly useful for interfacing with text encodings, which encode text to bytes and decode bytes to text.
Example
To replace invalid byte sequences while writing chunks of encoded text, without first aggregating the chunks into a single bytes object, try:
from codecs import iterdecode, iterencode

# chunks is an iterable of bytes objects; f is a file opened in binary mode
decoded = iterdecode(chunks, 'utf-8', 'replace')
for chunk in iterencode(decoded, 'utf-8'):
    f.write(chunk)
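A self-contained variant of the example above may make the behavior easier to check. This is only a sketch: the sample `chunks` list and the in-memory `io.BytesIO` standing in for `f` are assumptions made for illustration, not part of the original example.

```python
import io
from codecs import iterdecode, iterencode

# Hypothetical input: the second chunk contains 0xFF, which is never valid UTF-8.
chunks = [b"valid ", b"bad\xff", b" end"]

# An in-memory binary file stands in for a real file opened with mode "wb".
f = io.BytesIO()

# Decode lazily, replacing invalid bytes with U+FFFD, then re-encode chunk by chunk.
decoded = iterdecode(chunks, "utf-8", "replace")
for chunk in iterencode(decoded, "utf-8"):
    f.write(chunk)

cleaned = f.getvalue()
print(cleaned)
```

The invalid byte comes out as the UTF-8 encoding of U+FFFD (`b'\xef\xbf\xbd'`), and the rest of the data passes through unchanged.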
Usage
Decode
Convert bytes to str using a text encoding.
Error handlers include:
Value | Meaning |
strict | raise UnicodeError (specifically UnicodeDecodeError) on errors |
ignore | drop the malformed bytes and continue |
replace | replace errors with � (U+FFFD) |
backslashreplace | replace errors with hex escape sequences (\xhh) |
surrogateescape | replace errors with surrogate codes (within U+DC80 to U+DCFF) |
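The decoding handlers listed above can be compared side by side. The following is a minimal sketch; the sample input `data` is an assumption chosen so that the last byte is invalid as UTF-8.

```python
import codecs

# Latin-1 encoded "café": the trailing 0xE9 byte is not valid UTF-8 on its own.
data = b"caf\xe9"

handlers = ["ignore", "replace", "backslashreplace", "surrogateescape"]
results = {name: codecs.decode(data, "utf-8", name) for name in handlers}
for name, text in results.items():
    # ascii() makes replacement characters and surrogates visible.
    print(f"{name:18} {ascii(text)}")

# The default handler, strict, raises instead of returning a string.
strict_error = None
try:
    codecs.decode(data, "utf-8", "strict")
except UnicodeDecodeError as exc:
    strict_error = exc
    print("strict raised:", exc.reason)
```

Only `surrogateescape` keeps enough information to recover the original byte later; the other non-strict handlers are lossy or change the text.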
IterDecode
Decode an iterable of bytes chunks incrementally, yielding str chunks as they become available. See the example above.
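One reason to decode iteratively rather than chunk by chunk is that a multi-byte character may be split across two chunks. The sketch below illustrates this; the sample `chunks` are an assumption made for the example.

```python
import codecs

# "é" is two bytes in UTF-8 (0xC3 0xA9); here it is split across two chunks.
chunks = [b"caf\xc3", b"\xa9 au lait"]

# iterdecode buffers the incomplete sequence until the next chunk arrives.
pieces = list(codecs.iterdecode(chunks, "utf-8"))
text = "".join(pieces)
print(pieces)
print(text)

# Decoding each chunk independently fails on the truncated sequence.
naive_failed = False
try:
    [chunk.decode("utf-8") for chunk in chunks]
except UnicodeDecodeError:
    naive_failed = True
print("naive per-chunk decode failed:", naive_failed)
```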
Encode
Convert str to bytes using a text encoding.
Error handlers include:
Value | Meaning |
strict | raise UnicodeError (specifically UnicodeEncodeError) on errors |
ignore | drop the unencodable characters and continue |
replace | replace errors with ? |
backslashreplace | replace errors with hex escape sequences (\xhh, \uxxxx, or \Uxxxxxxxx) |
surrogateescape | convert surrogate codes (within U+DC80 to U+DCFF) back to the original bytes |
xmlcharrefreplace | replace errors with XML numeric character references (&#num;) |
namereplace | replace errors with Unicode Character Database name escape sequences (\N{name}) |
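The encoding handlers can be compared the same way. This is a minimal sketch; the sample `text` and the choice of ASCII as the target encoding are assumptions made so that some characters cannot be encoded.

```python
import codecs

text = "café €"  # "é" (U+00E9) and "€" (U+20AC) cannot be represented in ASCII

handlers = ["ignore", "replace", "backslashreplace",
            "xmlcharrefreplace", "namereplace"]
encoded = {name: codecs.encode(text, "ascii", name) for name in handlers}
for name, data in encoded.items():
    print(f"{name:18} {data!r}")

# The default handler, strict, raises instead of returning bytes.
strict_error = None
try:
    codecs.encode(text, "ascii", "strict")
except UnicodeEncodeError as exc:
    strict_error = exc
    print("strict raised:", exc.reason)

# surrogateescape round trip: undecodable bytes survive decode followed by encode.
raw = b"caf\xe9"
round_trip = raw.decode("utf-8", "surrogateescape").encode("utf-8", "surrogateescape")
print(round_trip == raw)
```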
IterEncode
Encode an iterable of str chunks incrementally, yielding bytes chunks as they become available. See the example above.
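Iterative encoding matters most for encodings that carry state between chunks. As an illustrative sketch (the choice of UTF-16 and the sample `parts` are assumptions, not taken from the text above), iterencode writes the byte order mark once, whereas encoding each chunk separately repeats it:

```python
import codecs

parts = ["ab", "cd"]

# One incremental encoder is shared across chunks, so the BOM is emitted once.
joined = b"".join(codecs.iterencode(parts, "utf-16"))

# Encoding each chunk on its own emits a BOM at the start of every chunk.
naive = b"".join(part.encode("utf-16") for part in parts)

print(joined == "abcd".encode("utf-16"))
print(len(naive) - len(joined))  # size of the extra BOM, in bytes
```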
Troubleshooting
can't concat int to bytes
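The iterencode API converts an iterable of strings into an iterator of bytes, and the iterdecode API converts an iterable of bytes into an iterator of strings. This is trickier than expected: iterating over a str yields one-character strings, but iterating over a bytes object yields integers, not bytes. As a result, the APIs are not symmetrical. Passing a bare str to iterencode works (one character at a time), while passing a bare bytes object to iterdecode fails. See #38482 for more details.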
>>> import codecs
>>> list(codecs.iterencode(['spam'], 'utf-8'))
[b'spam']
>>> list(codecs.iterencode('spam', 'utf-8'))
[b's', b'p', b'a', b'm']
>>> list(codecs.iterdecode([b'spam'], 'utf-8'))
['spam']
>>> list(codecs.iterdecode(b'spam', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 1048, in iterdecode
    output = decoder.decode(input)
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 321, in decode
    data = self.buffer + input
TypeError: can't concat int to bytes
Instead, try:
>>> list(codecs.iterdecode([bytes([b]) for b in b'spam'], 'utf-8'))
['s', 'p', 'a', 'm']
See also
Python codecs module documentation
Python Module of the Day article for codecs