Python 3 porting - pickles¶
Pickles are Python's built-in mechanism for serialising objects and data. Whether using them is a good idea is beyond the scope of this article, but it will hopefully bring some new points to the table.
Pickles can be created using different protocols, and the default depends on the version of Python. In Python 2 the default protocol is 0, while in Python 3 it is 3 (4 since Python 3.8). The highest supported protocol is 2 in Python 2 and, depending on the minor version, up to 5 in Python 3. See “Data stream format”.
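The values for a given interpreter can be checked directly. The following is a minimal sketch, assuming Python 3; `payload` is a made-up sample value used only for illustration.

```python
import pickle

# Protocol used when none is passed, and the highest one this interpreter supports.
print("default:", pickle.DEFAULT_PROTOCOL)
print("highest:", pickle.HIGHEST_PROTOCOL)

payload = {"answer": 42}  # hypothetical sample data

# Pickles written with protocol 2 or later start with the PROTO opcode
# (byte 0x80) followed by the protocol number.
headers = {}
for protocol in range(2, pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(payload, protocol=protocol)
    headers[protocol] = (data[0], data[1])

# Protocols 0 and 1 carry no such header.
legacy = pickle.dumps(payload, protocol=0)
print(headers, legacy[:1])
```

Passing `protocol=` explicitly, rather than relying on the default, keeps the output stable when the interpreter is upgraded.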
Protocol version 0 is the “text-friendly” one, which is, as we’ll soon see, disputable. The Python documentation calls it “human-readable” (quotes theirs), which is even more disputable. Later protocol versions are binary. Neither the textual nor the binary format is very human-readable: both consist of instructions for a stack machine that constructs the deserialised objects.
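The stack machine instructions can be inspected with the standard library's `pickletools` module. This is a small sketch, assuming Python 3; it only disassembles the pickle and does not execute it.

```python
import io
import pickle
import pickletools

data = pickle.dumps("abcd", protocol=0)

# pickletools.dis writes a symbolic disassembly of the pickle.
listing = io.StringIO()
pickletools.dis(data, out=listing)
print(listing.getvalue())

# genops yields (opcode, argument, position) triples without running anything.
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]
print(opcode_names)
```

Even for a four-character string the "text" format is a program: push a value, store it in the memo, stop.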
The great difference between Python 2 and Python 3 is the way text is stored. Python 2’s str is just a sequence of bytes, while text needs the separate, half-proper unicode type.
Python 3’s str is a proper text type that, since PEP 393, is stored as Latin-1, UCS-2, or UCS-4 underneath. The representation is chosen transparently when a new string is constructed, depending on the characters it needs to hold. This is not much of a problem, since strings are immutable and creating a new one always requires copying anyway (apart from certain hacky optimisations that exploit the special case of the reference count being 1).
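The variable-width storage can be observed indirectly. The sketch below assumes CPython 3.3 or later and relies on `sys.getsizeof`, whose results are an implementation detail; `bytes_per_char` is a helper invented for this illustration.

```python
import sys

def bytes_per_char(ch):
    """Estimate the per-character storage of a string made only of `ch`."""
    # Growing the string by one character grows the object by one storage unit.
    return sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)

# ASCII, Latin-1, a character needing UCS-2, and one needing UCS-4.
samples = ["a", "\u00e9", "\u0100", "\U00010000"]
widths = [bytes_per_char(ch) for ch in samples]
print(widths)  # on CPython: [1, 1, 2, 4]
```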
Python 3 bytes are like Python 2’s str, except without the misleading methods (for example, you cannot encode bytes, only decode them), and iterating over them yields integers.
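A short illustration of those differences, assuming Python 3:

```python
data = "µð".encode("utf-8")

# Iterating over bytes yields integers, not one-character strings.
values = list(data)
print(values)

# bytes can only be decoded; there is no encode method to misuse.
has_encode = hasattr(data, "encode")
text = data.decode("utf-8")

# Indexing gives an int, while slicing gives bytes.
first, head = data[0], data[:1]
print(has_encode, text, first, head)
```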
The debatability of text-friendliness¶
The text-friendliness of pickles is debatable, since in both Python 2 and Python 3 it is possible to produce a pickle that contains null bytes. Different systems have different opinions about those. Postgres, for example, doesn’t allow null bytes in its TEXT type, at least when encoded as unicode, so it is wrong to store pickles as text, even if you use protocol 0. Replacing TEXT with BYTEA works well.
Producing a protocol 0 pickle with null bytes in Python 2:
py2> [ord(c) for c in pickle.dumps(u'\0')]
[86, 0, 10, 112, 48, 10, 46]
Producing a protocol 0 pickle with null bytes in Python 3:
py3> [c for c in pickle.dumps('\0', protocol=0)]
[86, 0, 10, 112, 48, 10, 46]
Unfortunately, storing pickles in a TEXT field may work by accident. In Python 2, pickles of plain ASCII strings contain no null bytes:
py2> [ord(c) for c in pickle.dumps('abcd')]
[83, 39, 97, 98, 99, 100, 39, 10, 112, 48, 10, 46]
py2> [ord(c) for c in pickle.dumps(u'abcd')]
[86, 97, 98, 99, 100, 10, 112, 48, 10, 46]
In Python 3.6, however, where pickle.DEFAULT_PROTOCOL is 3, the analogous pickles do contain zeros:
py3> [c for c in pickle.dumps(b'abcd')]
[128, 3, 67, 4, 97, 98, 99, 100, 113, 0, 46]
py3> [c for c in pickle.dumps('abcd')]
[128, 3, 88, 4, 0, 0, 0, 97, 98, 99, 100, 113, 0, 46]
Moreover, these pickles will fail to load in Python 2, where the highest supported protocol is 2.
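If Python 2 still has to read the data, the protocol can be pinned explicitly. The sketch below assumes Python 3 on the writing side; `record` and `PY2_COMPATIBLE_PROTOCOL` are names made up for this example. It also shows why the result belongs in a binary column rather than a text one.

```python
import pickle

record = {"name": "abcd", "count": 3}  # hypothetical sample data

# Protocol 2 is the newest protocol Python 2 can read.
PY2_COMPATIBLE_PROTOCOL = 2
data = pickle.dumps(record, protocol=PY2_COMPATIBLE_PROTOCOL)

# The result is bytes, and it is not valid UTF-8 text:
# a binary pickle begins with byte 0x80.
try:
    data.decode("utf-8")
    decodes_as_text = True
except UnicodeDecodeError:
    decodes_as_text = False

print(type(data).__name__, decodes_as_text)
```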
Unicode-bytes confusion bites for the last time¶
How can we get decoding errors in pickle.loads?
$ python2 -c 'import pickle
f = open("d.pickle", "w")
f.truncate()
pickle.dump("abcd", f)
f.close()'
$ python3 -c '
import pickle
print(repr(pickle.load(open("d.pickle", "rb"))))'
'abcd'
The above seemingly works well: we put a str in and we get a str out. Notice, however, that the type changed. What makes this crime a perfect one is that the changed type’s name is identical. We had a Python 2 str, and now we have a Python 3 str. In other words, we silently deserialised bytes into unicode-aware text by decoding them as ASCII.
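The same effect can be reproduced without a Python 2 interpreter by replaying the bytes of the Python 2 pickle of 'abcd' listed earlier. This is a sketch, assuming Python 3:

```python
import pickle

# Bytes of the Python 2 protocol 0 pickle of the str 'abcd'.
py2_pickle = bytes([83, 39, 97, 98, 99, 100, 39, 10, 112, 48, 10, 46])

# The default is encoding="ASCII", which turns the Python 2 str into text.
as_text = pickle.loads(py2_pickle)
# encoding="bytes" keeps the original byte string instead.
as_bytes = pickle.loads(py2_pickle, encoding="bytes")

print(type(as_text).__name__, repr(as_text))
print(type(as_bytes).__name__, repr(as_bytes))
```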
Here’s how it breaks:
$ python2 -c 'import pickle
f = open("d.pickle", "w")
f.truncate()
pickle.dump("µð", f)
f.close()'
$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"))))'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
This happens regardless of the pickle protocol version used in Python 2. There are solutions, but they require prior knowledge of the encoding implicitly used in the Python 2 strings.
$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"), errors="replace")))'
'����'
$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"), encoding="utf-8")))'
'µð'
$ python3 -c 'import pickle
print(repr(pickle.load(open("d.pickle", "rb"), encoding="bytes")))'
b'\xc2\xb5\xc3\xb0'
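When the encoding is only suspected rather than known, one possible approach is to try the likely candidates and fall back to raw bytes. The helper below, `load_py2_pickle`, is a hypothetical sketch and not part of the pickle module; the two sample pickles are written out by hand in the Python 2 protocol 0 format. Note that a permissive encoding such as latin1 never fails, so it cannot be used to detect anything.

```python
import pickle

def load_py2_pickle(data, encodings=("utf-8",)):
    """Try each candidate encoding; fall back to raw bytes if none fits."""
    for encoding in encodings:
        try:
            return pickle.loads(data, encoding=encoding)
        except UnicodeDecodeError:
            continue
    return pickle.loads(data, encoding="bytes")

# Hand-written Python 2 protocol 0 pickles of two str values.
utf8_text = b"S'\\xc2\\xb5\\xc3\\xb0'\np0\n."  # the UTF-8 bytes of 'µð'
binary_blob = b"S'\\xff\\xfe'\np0\n."           # not valid UTF-8

print(repr(load_py2_pickle(utf8_text)))
print(repr(load_py2_pickle(binary_blob)))
```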
Another issue arises when the pickle contains a mix of Python 2 str values holding encoded text and Python 2 str values holding binary data. The encoding argument applies to the whole pickle, so it’s one or the other. You may need to pass encoding="bytes" and later doctor the data with code that knows which of the byte strings need decoding to Python 3 str.
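Such doctoring might look like the following sketch. It assumes a flat dict whose layout is known in advance; `TEXT_KEYS` and `doctor` are hypothetical names, and the pickle is hand-written in the Python 2 protocol 0 format. Keep in mind that with encoding="bytes" the dict keys come back as bytes too.

```python
import pickle

# Hand-written Python 2 pickle of a dict mixing encoded text and binary data.
PY2_PICKLE = (
    b"(dp0\n"
    b"S'name'\np1\nS'\\xc2\\xb5\\xc3\\xb0'\np2\ns"
    b"S'blob'\np3\nS'\\xff\\x00'\np4\ns"
    b"."
)

TEXT_KEYS = {"name"}  # hypothetical: the fields known to hold UTF-8 text

def doctor(record, text_keys=TEXT_KEYS, encoding="utf-8"):
    """Decode the keys and the known text fields, leaving binary values alone."""
    fixed = {}
    for raw_key, value in record.items():
        key = raw_key.decode("ascii")
        if key in text_keys and isinstance(value, bytes):
            value = value.decode(encoding)
        fixed[key] = value
    return fixed

raw = pickle.loads(PY2_PICKLE, encoding="bytes")
fixed = doctor(raw)
print(raw)
print(fixed)
```

Real data is usually nested, so the same idea would have to be applied recursively and driven by whatever schema knowledge the project has.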
A word of consolation¶
If you’re facing tables and tables full of pickles that won’t load in Python 3, check if they still load in Python 2 at all. If they reference types defined in your project or its dependencies, and those types have since changed or moved, the existing pickles may already be broken. You may then be able to remove or ignore most of the records.
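A triage pass over the stored records could be sketched as follows. `triage` is a hypothetical helper, written so that it should run under either interpreter, and the `stale` sample references a deliberately non-existent module. As always, only unpickle data you trust.

```python
import pickle

def triage(blobs):
    """Split raw pickles into (loaded_objects, failures)."""
    loaded, failures = [], []
    for index, blob in enumerate(blobs):
        try:
            loaded.append(pickle.loads(blob))
        except Exception as exc:  # broad on purpose: unpickling fails in many ways
            failures.append((index, type(exc).__name__))
    return loaded, failures

good = pickle.dumps([1, 2, 3], protocol=2)
# A pickle referencing a class in a module that no longer exists (made-up name).
stale = b"cno_such_module_xyz\nThing\n."

loaded, failures = triage([good, stale])
print(loaded, failures)
```

Grouping the failures by exception type gives a quick picture of how many records are lost to moved classes as opposed to encoding problems.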
Conclusion¶
Several points to remember:
Storing pickles as text may work most of the time, but don’t do that. It will break in Python 3.
Loading Python 2 pickles in Python 3 may result in the bytes/unicode confusion biting for the last time.
Care must be taken to save pickles in a Python 2-compatible protocol version if interop must be maintained.
Having a suite of integration tests helps a lot.