Python 3 porting - pickles
==========================

Pickles are a built-in object and data serialisation mechanism in Python. Whether using them is a good idea is a topic beyond this article; however, it will hopefully bring some new points to the table.

Pickles can be created in different protocols, and the default protocol depends on the version of Python. In Python 2 the default protocol is 0, while in Python 3 it was 3 up to Python 3.7 and became 4 in Python 3.8. The maximum supported protocol number is 2 in Python 2 and, depending on the minor version, up to 5 in Python 3. See "Data stream format" in the ``pickle`` module documentation.

Protocol version 0 is the "text-friendly" one, which is, as we'll soon realise, disputable. Python documentation calls it "human-readable" (quotes theirs), which is even more disputable. Later protocol versions are binary. Neither the textual nor the binary formats are very human-readable: both consist of stack machine instructions that construct the deserialised objects.

The great difference between Pythons 2 and 3 is the way text is stored. Python 2's ``str`` is just a sequence of bytes, while for a half-proper unicode type, ``unicode`` strings have to be used. Python 3's ``str`` is a proper text format that, since PEP-393, is either latin1, UCS-2, or UCS-4 underneath. The encoding is transparently converted when constructing new strings, depending on the characters needed. This is not much of a problem, since due to immutability, creating a new string always requires copying (apart from certain hacky optimisations that exploit the special case of the reference count being 1). Python 3 ``bytes`` are like Python 2's ``str``, except without misleading methods (for example, you cannot encode bytes, only decode), and if iterated on, they yield integers.

The debatability of text-friendliness
-------------------------------------

Text-friendliness of pickles is debatable, since in both Python 2 and Python 3 it is possible to produce a pickle that contains null bytes.
Different systems have different opinions: Postgres, for example, doesn't allow null bytes in its TEXT type, at least when encoded as unicode. It is therefore wrong to store pickles as text, even if you use protocol 0. Replacing TEXT with BYTEA works well.

Producing a protocol 0 pickle with null bytes in Python 2:

.. code:: python

    py2> [ord(c) for c in pickle.dumps(u'\0')]
    [86, 0, 10, 112, 48, 10, 46]

Producing a protocol 0 pickle with null bytes in Python 3:

.. code:: python

    py3> [c for c in pickle.dumps('\0', protocol=0)]
    [86, 0, 10, 112, 48, 10, 46]

Unfortunately, storing pickles in a TEXT field may work by accident, because many Python 2 pickles happen to contain no null bytes:

.. code:: python

    py2> [ord(c) for c in pickle.dumps('abcd')]
    [83, 39, 97, 98, 99, 100, 39, 10, 112, 48, 10, 46]
    py2> [ord(c) for c in pickle.dumps(u'abcd')]
    [86, 97, 98, 99, 100, 10, 112, 48, 10, 46]

In Python 3.6, however, with ``pickle.DEFAULT_PROTOCOL`` of 3, the analogous pickles will contain zeros:

.. code:: python

    py3> [c for c in pickle.dumps(b'abcd')]
    [128, 3, 67, 4, 97, 98, 99, 100, 113, 0, 46]
    py3> [c for c in pickle.dumps('abcd')]
    [128, 3, 88, 4, 0, 0, 0, 97, 98, 99, 100, 113, 0, 46]

Moreover, these pickles will fail to load in Python 2, where the highest supported protocol is 2.

Unicode-bytes confusion bites for the last time
-----------------------------------------------

How can we get decoding errors in ``pickle.loads``?

.. code::

    $ python2 -c 'import pickle
    f = open("d.pickle", "w")
    f.truncate()
    pickle.dump("abcd", f)
    f.close()'
    $ python3 -c '
    import pickle
    print(repr(pickle.load(open("d.pickle", "rb"))))'
    'abcd'

The above seemingly works well: we put in a ``str`` and we get a ``str`` back. Notice, however, that the type changed. What makes this crime a perfect one is that the changed type's name is identical. We had a Python 2 ``str``, now we have a Python 3 ``str``. In other words, we silently deserialised bytes into unicode-aware text by interpreting them as ASCII. Here's how it breaks:

.. code::

    $ python2 -c 'import pickle
    f = open("d.pickle", "w")
    f.truncate()
    pickle.dump("µð", f)
    f.close()'
    $ python3 -c 'import pickle
    print(repr(pickle.load(open("d.pickle", "rb"))))'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

This behaviour does not depend on the pickle protocol version used in Python 2. There are solutions; however, they require prior knowledge of the implicit encoding used in the Python 2 strings.

.. code::

    $ python3 -c 'import pickle
    print(repr(pickle.load(open("d.pickle", "rb"), errors="replace")))'
    '����'
    $ python3 -c 'import pickle
    print(repr(pickle.load(open("d.pickle", "rb"), encoding="utf-8")))'
    'µð'
    $ python3 -c 'import pickle
    print(repr(pickle.load(open("d.pickle", "rb"), encoding="bytes")))'
    b'\xc2\xb5\xc3\xb0'

Another issue arises when the pickle contains a mix of Python 2 ``str`` objects holding encoded text and Python 2 ``str`` objects holding binary data. The ``encoding`` argument applies to the whole pickle, so it's one or the other. You may need to pass ``encoding="bytes"``, and later doctor the data with code that knows which of the byte strings need decoding to Python 3 ``str``.

A word of consolation
---------------------

If you're facing tables and tables full of pickles that won't load in Python 3, check if they still load in Python 2 at all. If they contain types defined in your project or any of its dependencies, and these types have changed or moved since, the existing pickles may already be broken. You may then be able to remove or ignore most of the records.

Conclusion
----------

Several points to remember:

- Storing pickles as text may work most of the time, but don't do that. It will break in Python 3.
- Loading Python 2 pickles in Python 3 may result in the bytes/unicode confusion biting for the last time.
- Care must be taken to save pickles in a Python 2-compatible protocol version if interop must be maintained.
- Having a suite of integration tests helps a lot.
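
As a minimal sketch of the ``encoding="bytes"`` approach described above, the snippet below loads a Python 2 style pickle in Python 3 and then decodes only the values known to be text. The ``TEXT_KEYS`` set, the ``doctor`` helper, the record layout and the hand-written ``PY2_PICKLE`` payload are all assumptions made for illustration; real data needs its own rules about which byte strings were text, and in which encoding.

.. code:: python

    import pickle

    # Hypothetical: dict keys whose values were text in the Python 2 code base.
    TEXT_KEYS = {"name"}


    def doctor(record, text_keys=TEXT_KEYS, encoding="utf-8"):
        """Decode keys and selected values of a dict loaded with encoding="bytes"."""
        fixed = {}
        for raw_key, value in record.items():
            # With encoding="bytes", Python 2 str keys arrive as bytes too.
            key = raw_key.decode("ascii") if isinstance(raw_key, bytes) else raw_key
            if key in text_keys and isinstance(value, bytes):
                value = value.decode(encoding)
            fixed[key] = value
        return fixed


    # Hand-written protocol 0 payload in the shape Python 2 emits for
    # {"name": "µð", "blob": "\xff\x00"}, where both values are Python 2 str.
    PY2_PICKLE = (
        b"(dp0\n"
        b"S'name'\np1\nS'\\xc2\\xb5\\xc3\\xb0'\np2\ns"
        b"S'blob'\np3\nS'\\xff\\x00'\np4\ns."
    )

    # Load everything as bytes first, then decode only what is known to be text.
    record = doctor(pickle.loads(PY2_PICKLE, encoding="bytes"))
    print(record)

Here ``name`` comes back as a Python 3 ``str``, while ``blob`` stays ``bytes``; with ``encoding="utf-8"`` passed to ``pickle.loads`` instead, the ``blob`` value would fail to decode.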