= Python Codecs =

'''`codecs`''' is a module that provides an interface to the codec registry. This is mainly useful for working with '''text encodings''', which encode text to bytes and decode bytes to text.

----



== Example ==

To replace invalid byte sequences while re-encoding a stream of byte chunks, without first joining the chunks, try:

{{{
from codecs import iterdecode, iterencode

# chunks: an iterable of bytes objects; f: a file opened in binary write mode
decoded = iterdecode(chunks, 'utf-8', 'replace')
for chunk in iterencode(decoded, 'utf-8'):
    f.write(chunk)
}}}

----



== Usage ==



=== Decode ===

Convert `bytes` to `str` using a text encoding.

Error handlers include:

||'''Value''' ||'''Meaning''' ||
||`strict` ||raise `UnicodeError` on errors ||
||`ignore` ||skip malformed data and continue ||
||`replace` ||replace errors with `�` (U+FFFD) ||
||`backslashreplace` ||replace errors with hex escape sequences (`\xhh`) ||
||`surrogateescape` ||replace undecodable bytes with surrogate codes (within U+DC80 to U+DCFF) ||



=== IterDecode ===

Iteratively decode. See the example above.



=== Encode ===

Convert `str` to `bytes` using a text encoding.

Error handlers include:

||'''Value''' ||'''Meaning''' ||
||`strict` ||raise `UnicodeError` on errors ||
||`ignore` ||skip unencodable characters and continue ||
||`replace` ||replace errors with `?` ||
||`backslashreplace` ||replace errors with hex escape sequences (`\xhh`, `\uxxxx`, or `\Uxxxxxxxx`) ||
||`surrogateescape` ||replace surrogate codes with the original bytes ||
||`xmlcharrefreplace` ||replace errors with XML numeric character references (`&#num;`) ||
||`namereplace` ||replace errors with Unicode Character Database name escape sequences (`\N{name}`) ||



=== IterEncode ===

Iteratively encode. See the example above.

----



== Troubleshooting ==



=== can't concat int to bytes ===

`iterencode` converts an iterable of strings into an iterator of bytes, and `iterdecode` converts an iterable of bytes into an iterator of strings. Passing a single string or bytestring instead of an iterable of them is trickier than expected: iterating over a string yields one-character strings, but iterating over a bytestring yields integers, not bytes. As a result, the two APIs are not symmetrical.
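
The difference in how the two types iterate can be checked directly. The following is a minimal sketch using only the standard library; `iter_single_bytes` is a hypothetical helper written for illustration, not part of `codecs`. It relies on the fact that slicing a bytestring yields bytes even though indexing or iterating yields integers.

```python
import codecs


def iter_single_bytes(data):
    """Yield each byte of `data` as a length-1 bytes object (hypothetical helper)."""
    for i in range(len(data)):
        # Slicing keeps the bytes type; data[i] alone would be an int.
        yield data[i:i + 1]


ints = list(b'spam')    # iterating over bytes yields integers
chars = list('spam')    # iterating over str yields one-character strings
pieces = list(iter_single_bytes(b'spam'))
decoded = list(codecs.iterdecode(iter_single_bytes(b'spam'), 'utf-8'))

print(ints)
print(chars)
print(pieces)
print(decoded)
```

Because the helper yields bytes objects, its output can be passed to `iterdecode` where a bare bytestring cannot.
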
See [[https://github.com/python/cpython/issues/82663#issuecomment-1093844388|#38482]] for more details.

{{{
>>> import codecs
>>> list(codecs.iterencode(['spam'], 'utf-8'))
[b'spam']
>>> list(codecs.iterencode('spam', 'utf-8'))
[b's', b'p', b'a', b'm']
>>> list(codecs.iterdecode([b'spam'], 'utf-8'))
['spam']
>>> list(codecs.iterdecode(b'spam', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 1048, in iterdecode
    output = decoder.decode(input)
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 321, in decode
    data = self.buffer + input
TypeError: can't concat int to bytes
}}}

Instead, try:

{{{
>>> list(codecs.iterdecode([bytes([b]) for b in b'spam'], 'utf-8'))
['s', 'p', 'a', 'm']
}}}

----



== See also ==

[[https://docs.python.org/3/library/codecs.html|Python codecs module documentation]]

[[https://pymotw.com/3/codecs/|Python Module of the Day article for codecs]]

----
CategoryRicottone