Current py3k stage and next steps

Fri Jun 25 16:28:24 CDT 2010

On Fri, 2010-06-25 at 07:44 +0000, Antoine Pitrou wrote:
> Hi,
> 
> > To do that, I'd have to define a compatibility layer for str/bytes...  Martin
> > Geisler commented on IRC that I could use Uche Mennel's ustr[1] to separate
> > strings and unicode objects. Another approach would be to use Martin v. Löwis'
> > py3 module[2]. Maybe integrating both approaches would be a nice way of doing
> > it, defining u to be ustr in 2.x and str in py3k...
> 
> I'm not sure what you need ustr for. If Mercurial already enforces proper bytes
> / unicode separation (which I assume it does), you shouldn't need an additional
> type to enforce it for you. Actually, porting to py3k is the way to verify that
> there is no issue there.

There are basically no Unicode objects "in the wild" in Mercurial. Their
usage is more or less restricted to a couple of transcoding functions in
encoding.py where they can't hurt anybody.

The tricky part is this:

ui.write() and the like are used to handle three kinds of data:

- utf-8 encoded metadata that's been transcoded to the local encoding
- internal ASCII messages that may or may not go through gettext()
before being presented to the user in the local encoding
- raw byte data that is presented to the user byte-for-byte as-is

In the last case, it's unacceptable to do any form of transcoding even
if we knew what encoding the data was in (which we don't and which is
not possible in the general case). Also note that these strings may be
hundreds of megabytes - even an extra copy (let alone blowing it up
2-4x) may not be acceptable. 

So we'll have:

a) ui.write(repo[rev].user())  # username is transcoded to local encoding
b) ui.write(_("abort: can't do that"))  # translated and possibly transcoded
c) ui.write("debug message") # debug messages aren't translated
d) ui.write(repo[rev][file].data()) # raw file data

We also have many instances of:

e) ui.write("debug message: %s\n" % somerawdata) # cases c and d
f) ui.write(_("some message: %s\n") % somerawdata) # cases b and d

This generally all works smoothly because data is either 1) received in
the local encoding 2) uniformly converted to the local encoding as soon
as possible or 3) left completely unmolested.

Enter py3k. Cases (a) and (d) are pretty straightforward. And we can
even manage (b) by teaching ui.write to handle Unicode objects
containing ASCII without complaint.
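As a rough sketch of what "handling Unicode objects containing ASCII" could look like on py3k: the write() helper below is hypothetical and is not Mercurial's actual ui.write. It assumes the output stream is binary and that any text passed in is pure ASCII.

```python
import io


def write(stream, data):
    """Write data to a binary stream, accepting bytes or ASCII-only text.

    Hypothetical helper for illustration only, not Mercurial's ui.write.
    """
    if isinstance(data, str):
        # Pure-ASCII text encodes to the same bytes in any ASCII-compatible
        # local encoding; non-ASCII text raises UnicodeEncodeError here.
        data = data.encode("ascii")
    # Bytes are passed through byte-for-byte, with no transcoding.
    stream.write(data)


out = io.BytesIO()
write(out, "abort: can't do that\n")  # ASCII message as a text object
write(out, b"\xff raw bytes\n")       # raw data, left untouched
```

Failing loudly on non-ASCII text is a deliberate choice in this sketch: it keeps the helper from guessing an encoding.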

But (e) gets us in trouble before Mercurial's even involved, right at
the % operator. This operation is only correct on bytestrings, so to be
safe we'd need to add a b"" prefix to every string literal we manipulate,
which is a maintenance nightmare of epic proportions.
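To make the trouble with (e) concrete, here is a small sketch of what the % operator does on py3k when a text format string meets raw bytes. The variable names are illustrative only, and the last statement assumes Python 3.5 or later, where % formatting is available on bytes objects.

```python
somerawdata = b"\xff\x00binary"

# Text format string with a bytes argument: no error is raised, but the
# repr() of the bytes object is embedded instead of the raw bytes.
as_text = "debug message: %s\n" % somerawdata

# Bytes format string (note the b"" prefix): the raw bytes pass through
# unchanged. This assumes Python 3.5+, where bytes supports %.
as_bytes = b"debug message: %s\n" % somerawdata
```

The first form fails silently, which is why every such literal would need auditing rather than just the ones that happen to raise.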

2to3 can be taught to do this, but we can't do it to the main codebase
(even if we wanted to inflict such a horror on ourselves!) as we'll be
supporting 2.4 and 2.5 for a few years yet.

Relatedly, I expect many functions in the standard library are going to
begin handing back Unicode results, so we'll have to wrap everything in
a thick layer of duct tape.

It's instructive to note a core asymmetry here:

- Any Unicode string can be converted to a bytestring and back
losslessly via UTF-8

- An arbitrary bytestring -cannot- be converted losslessly to Unicode
and back
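The asymmetry can be checked with a short sketch. The sample values are made up for illustration, and the bytestring is deliberately not valid UTF-8.

```python
# Unicode -> bytes -> Unicode via UTF-8 is lossless.
text = "L\u00f6wis \u2603"
assert text.encode("utf-8").decode("utf-8") == text

# Arbitrary bytes -> Unicode fails under strict decoding...
raw = b"\xff\xfe raw \x80"
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...and a lenient decode substitutes U+FFFD for each undecodable byte,
# so encoding the result again does not give back the original bytes.
roundtrip = raw.decode("utf-8", errors="replace").encode("utf-8")
lossy = roundtrip != raw
```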

If Unicode had, say, a codeplane to represent "unknown byte 0x??" such
that arbitrary byte strings could round-trip losslessly to Unicode, none
of this would be a problem (except for overhead). But since that's not
possible, Unicode strings are a bad fit for much of what Mercurial
does. 

-- 
Mathematics is the supreme nostalgia of our time.