= Python Codecs =

'''`codecs`''' is a module that provides an interface to the codec registry. This is mainly useful for interfacing with '''text encodings''', which encode text to bytes and decode bytes to text.

<<TableOfContents>>

----



== Example ==

To replace invalid UTF-8 byte sequences while writing chunks of bytes, without first joining the chunks together, try:

{{{
from codecs import iterdecode, iterencode

# chunks: an iterable of bytes objects; f: a file opened in binary mode ('wb')
decoded = iterdecode(chunks, 'utf-8', 'replace')  # invalid bytes become U+FFFD
for chunk in iterencode(decoded, 'utf-8'):
    f.write(chunk)
}}}

----



== Usage ==


=== Decode ===

Convert `bytes` to `str` using a text encoding.


Error handlers include:

||'''Value'''       ||'''Meaning'''                                                                  ||
||`strict`          ||raise `UnicodeError` (in practice the subclass `UnicodeDecodeError`) on errors ||
||`ignore`          ||skip the malformed data and continue                                           ||
||`replace`         ||replace malformed data with `�` (U+FFFD)                                       ||
||`backslashreplace`||replace malformed bytes with hex escape sequences (`\xhh`)                     ||
||`surrogateescape` ||replace each undecodable byte with a surrogate code (U+DC80 to U+DCFF)         ||
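
The handlers in the table can be compared side by side. The following is a minimal sketch, assuming a Latin-1 byte string that is deliberately decoded with the wrong codec; the names `data`, `text`, `strict_error`, and `results` are illustrative only.

{{{
import codecs

data = b'caf\xe9'  # 'café' encoded as Latin-1; not valid UTF-8

# With the matching codec, decoding succeeds.
text = codecs.decode(data, 'latin-1')

# With the wrong codec, the default 'strict' handler raises.
try:
    codecs.decode(data, 'utf-8')
except UnicodeDecodeError as exc:
    strict_error = exc

# Each of the other handlers recovers in a different way.
handlers = ['ignore', 'replace', 'backslashreplace', 'surrogateescape']
results = {h: codecs.decode(data, 'utf-8', h) for h in handlers}

for handler, value in results.items():
    # ascii() keeps lone surrogates printable on any terminal.
    print(handler, ascii(value))
}}}

Only `surrogateescape` keeps enough information to recover the original bytes later, by encoding with the same handler.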



=== IterDecode ===

Decode an iterable of `bytes` chunks incrementally, yielding `str` chunks. See the example above.
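
Incremental decoding matters when a multi-byte character is split across chunks. The following is a minimal sketch, assuming hand-made chunks that split the two-byte UTF-8 sequence for `é`; the names `chunks`, `pieces`, and `naive_failed` are illustrative only.

{{{
import codecs

# 'café!' in UTF-8, with the two bytes of 'é' split across chunks.
chunks = [b'caf', b'\xc3', b'\xa9!']

# iterdecode buffers the incomplete sequence until the rest arrives.
pieces = list(codecs.iterdecode(chunks, 'utf-8'))
print(pieces)

# Decoding each chunk on its own fails on the dangling lead byte.
try:
    [chunk.decode('utf-8') for chunk in chunks]
    naive_failed = False
except UnicodeDecodeError:
    naive_failed = True
print(naive_failed)
}}}

Note that the number of output chunks can be smaller than the number of input chunks, because a chunk that completes no character produces no output.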




=== Encode ===

Convert `str` to `bytes` using a text encoding.

Error handlers include:

||'''Value'''        ||'''Meaning'''                                                                                   ||
||`strict`           ||raise `UnicodeError` (in practice the subclass `UnicodeEncodeError`) on errors                  ||
||`ignore`           ||skip unencodable characters and continue                                                        ||
||`replace`          ||replace unencodable characters with `?`                                                         ||
||`backslashreplace` ||replace unencodable characters with hex escape sequences (`\xhh`, `\uxxxx`, or `\Uxxxxxxxx`)     ||
||`surrogateescape`  ||convert surrogate codes (U+DC80 to U+DCFF) back to the original bytes                           ||
||`xmlcharrefreplace`||replace unencodable characters with XML numeric character references (`&#num;`)                 ||
||`namereplace`      ||replace unencodable characters with Unicode Character Database name escape sequences (`\N{name}`)||
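
The handlers in the table can be compared side by side. The following is a minimal sketch, assuming a string with one character that ASCII cannot represent; the names `text`, `results`, `escaped`, and `restored` are illustrative only.

{{{
import codecs

text = 'caf\u00e9'  # 'café'; the last character is not ASCII

handlers = ['ignore', 'replace', 'backslashreplace',
            'xmlcharrefreplace', 'namereplace']
results = {h: codecs.encode(text, 'ascii', h) for h in handlers}

for handler, value in results.items():
    print(handler, value)

# surrogateescape reverses what the same handler did while decoding:
# the surrogate U+DCE9 becomes the original byte 0xE9 again.
escaped = codecs.decode(b'caf\xe9', 'utf-8', 'surrogateescape')
restored = codecs.encode(escaped, 'utf-8', 'surrogateescape')
print(restored)
}}}

The default `strict` handler is left out of the loop because it raises `UnicodeEncodeError` for this input.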




=== IterEncode ===

Encode an iterable of `str` chunks incrementally, yielding `bytes` chunks. See the example above.
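
Incremental encoding matters for codecs that carry state between chunks. The following is a minimal sketch, assuming UTF-16, which writes a byte order mark (BOM) at the start of the output; the names `chunks`, `incremental`, and `naive` are illustrative only.

{{{
import codecs

chunks = ['sp', 'am']

# iterencode shares one encoder across chunks, so the BOM is written once.
incremental = b''.join(codecs.iterencode(chunks, 'utf-16'))

# Encoding each chunk on its own repeats the BOM before every chunk.
naive = b''.join(chunk.encode('utf-16') for chunk in chunks)

print(incremental == 'spam'.encode('utf-16'))
print(naive == 'spam'.encode('utf-16'))
}}}

The exact bytes depend on the platform's byte order, so the sketch compares against a one-shot encoding of the whole string rather than against literal bytes.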

----



== Troubleshooting ==



=== can't concat int to bytes ===

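The `iterencode` and `iterdecode` APIs expect an iterable of chunks: `iterencode` converts an iterable of `str` into an iterator of `bytes`, and `iterdecode` converts an iterable of `bytes` into an iterator of `str`. Passing a single `str` or `bytes` object instead of a list of chunks is an easy mistake, and the two APIs are not symmetrical in how they react. Iterating over a `str` yields one-character strings, so `iterencode` silently works one character at a time. Iterating over a `bytes` object yields integers, not bytes, so `iterdecode` fails. See [[https://github.com/python/cpython/issues/82663#issuecomment-1093844388|#38482]] for more details.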

{{{
>>> import codecs
>>> list(codecs.iterencode(['spam'], 'utf-8'))
[b'spam']
>>> list(codecs.iterencode('spam', 'utf-8'))
[b's', b'p', b'a', b'm']
>>> list(codecs.iterdecode([b'spam'], 'utf-8'))
['spam']
>>> list(codecs.iterdecode(b'spam', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 1048, in iterdecode
    output = decoder.decode(input)
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 321, in decode
    data = self.buffer + input
TypeError: can't concat int to bytes
}}}

Instead, try:

{{{
>>> list(codecs.iterdecode([bytes([b]) for b in b'spam'], 'utf-8'))
['s', 'p', 'a', 'm']
}}}

----



== See also ==

[[https://docs.python.org/3/library/codecs.html|Python codecs module documentation]]

[[https://pymotw.com/3/codecs/|Python Module of the Week article for codecs]]



----
CategoryRicottone