Python 3 porting - pickles

Pickle is Python's built-in mechanism for serialising objects and data. Whether using it is a good idea is beyond the scope of this article, though the article will hopefully bring some new points to the table.

Pickle formats depend on “pickle protocols”. In Python 2 the default protocol is 0. In Python 3 the default was 3 up to Python 3.7 and became 4 in Python 3.8. The maximum supported protocol number in Python 2 is 2, while in Python 3 it is, depending on the minor version, up to 5. See “Data stream format” in the pickle documentation.

Protocol version 0 is the “text-friendly” one, which is - as we’ll soon realise - disputable. Python documentation calls it “human-readable” (quotes theirs), which is even more disputable. Later protocol versions are binary. Both textual and binary formats are in fact not very human-readable and consist of stack machine instructions that construct deserialised objects.
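The stack machine instructions can be inspected with the standard pickletools module. The following is a minimal sketch for Python 3; the helper name disassemble is made up for this illustration.

```python
import io
import pickle
import pickletools


def disassemble(data):
    """Return the pickletools listing of a pickle as a string (hypothetical helper)."""
    out = io.StringIO()
    pickletools.dis(data, out=out)
    return out.getvalue()


# The same value serialised with the textual protocol and with a binary one.
text_listing = disassemble(pickle.dumps("abcd", protocol=0))
binary_listing = disassemble(pickle.dumps("abcd", protocol=2))

print(text_listing)
print(binary_listing)
```

Both listings end with a STOP instruction. The binary one additionally starts with a PROTO instruction announcing the protocol version.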

Background - text representation

The great difference between Python 2 and 3 is the way text is stored. Python 2’s str is just a sequence of bytes, while unicode is merely a half-proper unicode type.

Python 3’s str is a proper text type, which since PEP 393 is stored as either latin1, UCS-2, or UCS-4 underneath. The storage width is chosen transparently when a new string is constructed, depending on its contents. This is not much of a problem - str immutability forces creating copies anyway when string operations are invoked [1].
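The effect of the content-dependent storage can be observed with sys.getsizeof. This is only a rough sketch: it assumes CPython 3.3 or later, and the sizes are an implementation detail, so only their ordering is meaningful.

```python
import sys

# Three strings of equal length whose widest character needs 1, 2 and 4 bytes.
ascii_text = "a" * 1000
bmp_text = "\u0100" * 1000        # outside latin1, fits in UCS-2
astral_text = "\U00010000" * 1000  # outside the BMP, needs UCS-4

sizes = [sys.getsizeof(s) for s in (ascii_text, bmp_text, astral_text)]
print(sizes)
```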

Python 3 bytes are like Python 2’s str except without misleading methods (for example, you cannot encode bytes, only decode), and if iterated on, they yield integers.
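A short Python 3 illustration of these differences; the variable names are arbitrary.

```python
data = b"abc"
text = "abc"

# Iterating over bytes yields integers, iterating over str yields characters.
byte_items = list(data)
char_items = list(text)

# bytes can only be decoded, str can only be encoded.
bytes_can_encode = hasattr(data, "encode")
str_can_decode = hasattr(text, "decode")

print(byte_items, char_items, bytes_can_encode, str_can_decode)
```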

The debatability of text-friendliness

The text-friendliness of pickles is debatable, since in both Python 2 and Python 3 it is possible to produce a protocol 0 pickle that contains null bytes. Different systems have different opinions on null bytes. Postgres, for example, doesn’t allow null bytes in its TEXT type, at least when the text is encoded as unicode. It is therefore wrong to store pickles as TEXT, even if you use “protocol 0”. As far as Postgres is concerned, replacing TEXT with BYTEA fixes this particular problem.

It is easy to produce pickles with null bytes:

Producing a protocol 0 pickle with null bytes in Python 2:

py2> [ord(c) for c in pickle.dumps(u'\0')]
[86, 0, 10, 112, 48, 10, 46]

Producing a protocol 0 pickle with null bytes in Python 3:

py3> [c for c in pickle.dumps('\0', protocol=0)]
[86, 0, 10, 112, 48, 10, 46]

Unfortunately, storing pickles in formats that disallow null bytes may work by accident:

py2> [ord(c) for c in pickle.dumps('abcd')]
[83, 39, 97, 98, 99, 100, 39, 10, 112, 48, 10, 46]
py2> [ord(c) for c in pickle.dumps(u'abcd')]
[86, 97, 98, 99, 100, 10, 112, 48, 10, 46]

However, in Python 3.6, with a pickle.DEFAULT_PROTOCOL of 3, the analogous pickles contain zeros:

py3> [c for c in pickle.dumps(b'abcd')]
[128, 3, 67, 4, 97, 98, 99, 100, 113, 0, 46]
py3> [c for c in pickle.dumps('abcd')]
[128, 3, 88, 4, 0, 0, 0, 97, 98, 99, 100, 113, 0, 46]

Moreover, these pickles will fail to load in Python 2, where the highest supported protocol is 2.
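If interoperability with Python 2 matters, the protocol can be pinned explicitly when dumping from Python 3. This is a minimal sketch; PY2_MAX_PROTOCOL and dumps_for_py2 are names invented for this example.

```python
import pickle

PY2_MAX_PROTOCOL = 2  # highest protocol Python 2 can read (hypothetical constant name)


def dumps_for_py2(obj):
    """Serialise obj with a protocol that Python 2 understands (hypothetical helper)."""
    return pickle.dumps(obj, protocol=PY2_MAX_PROTOCOL)


payload = dumps_for_py2("abcd")
print(list(payload))
print(b"\x00" in payload)
```

Protocol 2 is a binary protocol as well, so the result still contains null bytes and still must not be stored as text.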

Unicode-bytes confusion bites for the last time

How can we get decoding errors in pickle.loads?

$ python2 -c 'import pickle
with open("d.pickle", "wb") as f:
    pickle.dump("abcd", f)'

$ python3 -c '
import pickle
print(repr(pickle.load(open("d.pickle", "rb"))))'
'abcd'

The above seemingly works well - we put a str in and we get a str out. But those are different Pythons decoding the pickle into different types. What makes this a perfect crime are the identical type names. We had a Python 2 str, now we have a Python 3 str. In other words, we silently deserialised bytes into unicode-aware text by decoding them as ASCII.

Here’s how it breaks:

$ python2 -c 'import pickle
with open("d.pickle", "wb") as f:
    pickle.dump("µð", f)'

$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"))))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

This behaviour doesn’t depend on the pickle protocol version used in Python 2. There are solutions, though they require prior knowledge of the implicit encoding used in the Python 2 strings.

$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"), errors="replace")))'
'����'

$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"), encoding="utf-8")))'
'µð'

$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"), encoding="bytes")))'
b'\xc2\xb5\xc3\xb0'

Another issue arises when a pickle contains a mix of Python 2 str values holding encoded text and Python 2 str values holding binary data. The only way to resolve this confusion is to deserialise with encoding="bytes" and fix the data afterwards - either with custom code aware of field semantics or with some sort of schema library, provided with a schema defining conversions of string fields from bytes.
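A sketch of the "custom code aware of field semantics" approach follows. The pickle is handwritten in the protocol 0 shape that Python 2 would produce for a dict with one text field and one binary field. The field names, TEXT_FIELDS and fix_record are all assumptions made for this illustration, not part of any library.

```python
import pickle

# Handwritten protocol 0 pickle, in the shape Python 2 would produce for
# {'name': '\xc2\xb5\xc3\xb0', 'blob': '\xff\x00'} (both values are Python 2 str).
PY2_PICKLE = (
    b"(dp0\n"
    b"S'name'\np1\nS'\\xc2\\xb5\\xc3\\xb0'\np2\ns"
    b"S'blob'\np3\nS'\\xff\\x00'\np4\ns."
)

# Hypothetical schema: which fields hold text, and in which encoding.
TEXT_FIELDS = {"name": "utf-8"}


def fix_record(raw):
    """Decode keys and known text fields, leave binary fields as bytes."""
    fixed = {}
    for key, value in raw.items():
        name = key.decode("ascii")  # assumes field names are plain ASCII
        encoding = TEXT_FIELDS.get(name)
        if encoding is not None and isinstance(value, bytes):
            value = value.decode(encoding)
        fixed[name] = value
    return fixed


raw_record = pickle.loads(PY2_PICKLE, encoding="bytes")
record = fix_record(raw_record)
print(record)
```

Note that with encoding="bytes" the dictionary keys come back as bytes too, which is why the sketch decodes them before looking them up in the schema.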

A word of consolation

If you’re facing enormous datasets full of pickles that won’t load in Python 3, check whether they still load in Python 2. Pickles that contain types defined in your project or its dependencies break when those types change or are moved, so some of the records may have been unreadable for a long time already. If so, you may be able to remove or ignore most of them.
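Such a check could look roughly like the sketch below, which sorts records into loadable and broken ones. It is written to avoid version-specific syntax so that it can be run under the interpreter in question; split_loadable and the sample records are invented for this example, and real records would come from your own storage.

```python
import pickle


def split_loadable(blobs):
    """Return (indices that load, [(index, error name)] for those that fail)."""
    loadable, broken = [], []
    for index, blob in enumerate(blobs):
        try:
            pickle.loads(blob)
        except Exception as exc:  # deliberately broad: any failure counts as broken
            broken.append((index, type(exc).__name__))
        else:
            loadable.append(index)
    return loadable, broken


samples = [
    pickle.dumps([1, 2, 3], protocol=2),       # a healthy record
    pickle.dumps([1, 2, 3], protocol=2)[:-1],  # a truncated record
    b"cno_such_module_for_demo\nGone\n.",      # refers to a type that no longer exists
]
loadable, broken = split_loadable(samples)
print(loadable, broken)
```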

Conclusion

Several points to remember:

  • Storing pickles as text may work most of the time, but don’t do that. It will break in Python 3.

  • Loading Python 2 pickles in Python 3 may result in bytes/unicode confusion, hopefully the last one you’ll ever have to deal with.

  • Care must be taken to save pickles in a Python 2-compatible protocol version if interop must be maintained.

  • Having a suite of integration tests helps a lot.