Python 3 porting - pickles

Pickles are Python’s built-in mechanism for serialising objects and data. Whether using them is a good idea is beyond the scope of this article, but it will hopefully bring some new points to the table.

Pickles can be created using different protocols, and the default protocol depends on the version of Python. In Python 2 the default protocol is 0, while in Python 3 it is 3 or higher, depending on the minor version. The maximum supported protocol is 2 in Python 2 and, depending on the minor version, up to 5 in Python 3. See “Data stream format” in the pickle documentation.
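
As a quick check of the numbers above, the pickle module exposes both values as constants. The sketch below prints them for whichever interpreter runs it and then pins protocol 2 explicitly, the highest one Python 2 can read. The sample payload is made up for illustration.

```python
import pickle

# Both constants depend on the interpreter running this snippet.
print("default protocol:", pickle.DEFAULT_PROTOCOL)
print("highest protocol:", pickle.HIGHEST_PROTOCOL)

# Illustrative payload; any picklable object would do.
payload = {"answer": 42, "items": [1, 2, 3]}

# Pin the protocol instead of relying on the interpreter's default.
data = pickle.dumps(payload, protocol=2)

# Protocol 2 and later start with the PROTO opcode (0x80) followed by the version.
print(data[:2])

restored = pickle.loads(data)
```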

Protocol version 0 is the “text-friendly” one, which is, as we’ll soon see, disputable. The Python documentation calls it “human-readable” (quotes theirs), which is even more disputable. Later protocol versions are binary. Neither the textual nor the binary format is very human-readable: both consist of instructions for a stack machine that constructs the deserialised objects.
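
To see those stack machine instructions, the standard library’s pickletools module can disassemble a pickle. A minimal sketch, run under Python 3; the exact listing may differ slightly between versions, so no output is shown here.

```python
import pickle
import pickletools

data = pickle.dumps("abcd", protocol=0)

# Print a listing of the instructions, one opcode per line.
pickletools.dis(data)

# The same information in a form that is easy to process:
# genops yields (opcode, argument, position) tuples.
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]
print(opcode_names)
```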

The great difference between Python 2 and Python 3 is the way text is stored. Python 2’s str is just a sequence of bytes; for text, the separate and only half-proper unicode type has to be used.

Python 3’s str is a proper text type that, since PEP 393, is stored as either latin1, UCS-2, or UCS-4 underneath. The representation is converted transparently when constructing new strings, depending on the characters needed. This is not much of a problem, since due to immutability, creating a new string always requires copying (apart from certain hacky optimisations that exploit the special case of the reference count being 1).

Python 3 bytes are like Python 2’s str, except without the misleading methods (for example, you cannot encode bytes, only decode them), and when iterated over, they yield integers.
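
A short Python 3 illustration of these properties; the sample character is arbitrary.

```python
data = "\u00b5".encode("utf-8")  # str -> bytes (the micro sign, two bytes in UTF-8)

print(list(data))            # iterating bytes yields integers
print(data[0])               # indexing yields an integer too
print(data[0:1])             # slicing yields bytes
print(data.decode("utf-8"))  # bytes -> str

# bytes has decode() but no encode(); str is the other way round.
bytes_has_encode = hasattr(data, "encode")
str_has_decode = hasattr("\u00b5", "decode")
```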

The debatability of text-friendliness

The text-friendliness of pickles is debatable, since in both Python 2 and Python 3 it is possible to produce a protocol 0 pickle that contains null bytes. Different systems have different opinions on null bytes in text. Postgres, for example, doesn’t allow them in its TEXT type, at least when encoded as Unicode. It is therefore wrong to store pickles as text, even if you use protocol 0. Replacing TEXT with BYTEA works well.

Producing a protocol 0 pickle with null bytes in Python 2:

py2> [ord(c) for c in pickle.dumps(u'\0')]
[86, 0, 10, 112, 48, 10, 46]

Producing a protocol 0 pickle with null bytes in Python 3:

py3> [c for c in pickle.dumps('\0', protocol=0)]
[86, 0, 10, 112, 48, 10, 46]

Unfortunately, storing pickles in TEXT fields may work by accident, because many Python 2 protocol 0 pickles happen to contain no null bytes:

py2> [ord(c) for c in pickle.dumps('abcd')]
[83, 39, 97, 98, 99, 100, 39, 10, 112, 48, 10, 46]
py2> [ord(c) for c in pickle.dumps(u'abcd')]
[86, 97, 98, 99, 100, 10, 112, 48, 10, 46]

In Python 3.6, where pickle.DEFAULT_PROTOCOL is 3, the analogous pickles do contain zeros:

py3> [c for c in pickle.dumps(b'abcd')]
[128, 3, 67, 4, 97, 98, 99, 100, 113, 0, 46]
py3> [c for c in pickle.dumps('abcd')]
[128, 3, 88, 4, 0, 0, 0, 97, 98, 99, 100, 113, 0, 46]

Moreover, these pickles will fail to load in Python 2, where the highest supported protocol is 2.
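
The advice to use a binary column can be sketched without a database server. The example below uses the standard library’s sqlite3 with a BLOB column as a stand-in for Postgres and BYTEA; the table and column names are hypothetical, and a real Postgres setup would go through a database driver instead.

```python
import pickle
import sqlite3

payload = {"title": "\0 is a legal character", "count": 3}
data = pickle.dumps(payload, protocol=2)

# The pickle contains null bytes, so it is not safe to treat as text.
contains_null = b"\x00" in data

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a binary column holds the pickle untouched.
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload BLOB)")
conn.execute("INSERT INTO records (payload) VALUES (?)", (data,))

# Read the bytes back and unpickle them.
(stored,) = conn.execute("SELECT payload FROM records").fetchone()
restored = pickle.loads(stored)
conn.close()
```

Pinning protocol 2 here also keeps the stored pickle loadable from Python 2.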

Unicode-bytes confusion bites for the last time

How can we get decoding errors in pickle.loads?

$ python2 -c 'import pickle
with open("d.pickle", "wb") as f:
    pickle.dump("abcd", f)'

$ python3 -c '
import pickle
print(repr(pickle.load(open("d.pickle", "rb"))))'
'abcd'

The above seemingly works well: we put in a str and we get a str. Notice, however, that the type changed. What makes this a perfect crime is that the new type’s name is identical to the old one: we had a Python 2 str, and now we have a Python 3 str. In other words, we silently deserialised bytes into unicode-aware text by interpreting them as ASCII.

Here’s how it breaks:

$ python2 -c 'import pickle
with open("d.pickle", "wb") as f:
    pickle.dump("µð", f)'

$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"))))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

This behaviour does not depend on the pickle protocol version used in Python 2. There are solutions, but they require prior knowledge of the implicit encoding used in the Python 2 strings.

$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"), errors="replace")))'
'����'

$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"), encoding="utf-8")))'
'µð'

$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"), encoding="bytes")))'
b'\xc2\xb5\xc3\xb0'
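
Before choosing between these options, it can help to know whether a pickle holds any Python 2 str values at all. A minimal sketch, assuming the only affected opcodes are STRING, BINSTRING and SHORT_BINSTRING (the ones Python 2 uses for str). The helper name is made up, and the sample pickle is written by hand to resemble what Python 2 would produce at protocol 2.

```python
import pickle
import pickletools

# Opcodes Python 2 uses for its 8-bit str type.
LEGACY_STR_OPCODES = {"STRING", "BINSTRING", "SHORT_BINSTRING"}

def has_legacy_str(data):
    """Report whether a pickle contains Python 2 str values, without loading it."""
    return any(
        opcode.name in LEGACY_STR_OPCODES
        for opcode, arg, pos in pickletools.genops(data)
    )

# Hand-written protocol 2 pickle of a 4-byte Python 2 str (UTF-8 for two characters).
py2_pickle = b"\x80\x02U\x04\xc2\xb5\xc3\xb0q\x00."

# A Python 3 str pickled at the same protocol uses a unicode opcode instead.
py3_pickle = pickle.dumps("\u00b5\u00f0", protocol=2)

print(has_legacy_str(py2_pickle))
print(has_legacy_str(py3_pickle))
```

Scanning opcodes this way does not execute the pickle, so it can be run over data before deciding how to load it.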

Another issue arises when the pickle contains a mix of Python 2 str values holding text and Python 2 str values holding binary data. The encoding argument applies to the whole pickle, so it’s one or the other. You may need to pass encoding="bytes" and later doctor the data with code that knows which of the byte strings need decoding to Python 3 str.
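
What such doctoring could look like, as a sketch under stated assumptions: the record is a flat dict, the set of text fields is known in advance, and text was stored as UTF-8. The field names, the helper, and the hand-written sample pickle are all hypothetical.

```python
import pickle

# Hand-written protocol 2 pickle resembling what Python 2 would produce for a
# dict with a "name" field (UTF-8 text) and a "blob" field (raw binary),
# both stored as Python 2 str.
py2_pickle = (
    b"\x80\x02}q\x00("
    b"U\x04nameq\x01U\x04\xc2\xb5\xc3\xb0q\x02"
    b"U\x04blobq\x03U\x02\xff\x00q\x04"
    b"u."
)

# Assumed knowledge about the data: which fields hold text.
TEXT_FIELDS = {"name"}

def fix_record(raw, encoding="utf-8"):
    """Decode keys and known text fields, leaving binary fields as bytes."""
    fixed = {}
    for key, value in raw.items():
        if isinstance(key, bytes):
            key = key.decode("ascii")
        if key in TEXT_FIELDS and isinstance(value, bytes):
            value = value.decode(encoding)
        fixed[key] = value
    return fixed

raw = pickle.loads(py2_pickle, encoding="bytes")
record = fix_record(raw)
print(raw)
print(record)
```

Note that encoding="bytes" also turns the dictionary keys into bytes, which is why the sketch decodes them as well.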

A word of consolation

If you’re facing tables and tables full of pickles that won’t load in Python 3, check whether they still load in Python 2 at all. If they contain types defined in your project or its dependencies, and those types have since changed or moved, the existing pickles may already be broken. You may then be able to remove or ignore most of the records.
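
One way to run such a check is to try loading every record and tally the outcomes. A rough sketch: the function name and sample data are made up, and the loop is written so that it should run under either interpreter. It is only meant for trusted data, since unpickling can execute arbitrary code.

```python
import pickle
from collections import Counter

def triage(blobs):
    """Try to load each pickle and count outcomes by exception type."""
    outcomes = Counter()
    for blob in blobs:
        try:
            pickle.loads(blob)
        except Exception as exc:  # unpickling can raise almost anything
            outcomes[type(exc).__name__] += 1
        else:
            outcomes["ok"] += 1
    return outcomes

# Illustrative samples; in practice these would come from the database.
samples = [
    pickle.dumps([1, 2, 3], protocol=2),  # loads fine
    b"cno_such_module_xyz\nGone\n.",      # refers to a module that does not exist
    b"\x80\x02U\x02\xc2\xb5q\x00.",       # Python 2 str with non-ASCII bytes
]
result = triage(samples)
print(result)
```

Records that fail under both interpreters are candidates for removal; those that fail only under Python 3 are the ones worth fixing.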

Conclusion

Several points to remember:

  • Storing pickles as text may work most of the time, but don’t do that. It will break in Python 3.

  • Loading Python 2 pickles in Python 3 may result in the bytes/unicode confusion biting for the last time.

  • Care must be taken to save pickles in a Python 2-compatible protocol version if interop must be maintained.

  • Having a suite of integration tests helps a lot.