Questions regarding WindowsUTF8 plan

Chinmay Joshi c at chinmayjoshi.com
Wed Jun 11 15:30:03 CDT 2014


On Wed, Jun 11, 2014 at 2:15 AM, Matt Mackall <mpm at selenic.com> wrote:
>
> On Mon, 2014-06-09 at 23:40 +0000, Chinmay Joshi wrote:
> > Hello FUJIWARA Katsunori,
> >
> > I am currently working on the WindowsUTF8 plan under GSoC. I learnt that
> > you contributed to the WindowsUTF8 plan, and I really thank you for your
> > feedback on my patches so far. I understand you have a better overall
> > picture of this plan, and I have a few queries on which I would
> > appreciate help from you (or anyone else).
> >
> > The first question is about the u16vfs class for Windows. Some discussion
> > has taken place on the #mercurial IRC channel. This class is supposed to
> > be derived from vfs, use "wide APIs internally", and give UTF-8 results
> > in the case of a UTF-8 changeset. My understanding is that this means
> > passing unicode objects to Python's APIs, which then use the Windows wide
> > APIs, and converting the results to UTF-8. Another solution raised was to
> > call the Windows-specific win32 APIs directly. This would need a lot of
> > work to match Python's current implementation of the filesystem functions
> > used in the vfs class.
>
> Use Python's APIs whenever possible. Please don't call it u16vfs
> (because it should never be passed a UTF-16 string). I probably wrote
> that on the wiki, but a better name would be utf8vfs. Methods in utf8vfs
> should generally follow a model like this:
>

Agreed, I will use the name "utf8vfs".

> def listdir(self, path):
>     # take a utf-8 encoded byte string, convert it to a unicode() object
>     upath = path.decode("utf-8")
>
>     # pass the unicode object to a Python API, which will check the
>     # class of the argument and internally use Windows 'wide string'
>     # methods to do filesystem operations, and return unicode() objects
>     # in its result
>     uresult = os.listdir(upath)
>
>     # this function gives back a list of unicode() filenames
>     # convert the results back to bytestrings in UTF-8
>     result = [u.encode('utf-8') for u in uresult]
>     return result
>

Thanks, this is self-explanatory.
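
For reference, a self-contained sketch of that model might look like the
following. It is only an illustration under stated assumptions: the `base`
attribute and the `join` helper are hypothetical and are not taken from
Mercurial's actual vfs class, and the byte-string methods are written so
that the snippet also runs on Python 3.

```python
import os


class utf8vfs(object):
    """Sketch of a vfs whose public methods take and return UTF-8 bytes."""

    def __init__(self, base):
        # base: UTF-8 encoded byte string naming the root directory
        # (hypothetical attribute, assumed for this sketch only)
        self.base = base

    def join(self, path):
        # Both arguments are byte strings, so the result is one as well.
        return os.path.join(self.base, path) if path else self.base

    def listdir(self, path=b""):
        # Decode at the boundary: UTF-8 bytes -> unicode string.
        upath = self.join(path).decode("utf-8")
        # Given a unicode path, os.listdir returns unicode names
        # (on Windows, via the wide-character filesystem API).
        uresult = os.listdir(upath)
        # Encode on the way out, so callers only ever see UTF-8 bytes.
        return [u.encode("utf-8") for u in uresult]
```

With this shape, a caller would construct `utf8vfs(root)` from a UTF-8 byte
string and get UTF-8 byte strings back from `listdir()`; the unicode objects
never leave the class.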

> Crucially, Mercurial code outside this vfs class should _never_ see a
> UTF-16 encoded bytestring OR a unicode() object, nor should it be doing
> any of its own encode/decode.
>

Agreed.

> > One more issue raised in today's meetup is about not passing Unicode
> > objects to any Mercurial APIs (
> > http://mercurial.selenic.com/wiki/EncodingStrategy#Unicode_strings).
> >
> > As per discussion with mpm on IRC, a concern is that people will want to
> > convert their existing non-ASCII repositories to UTF-8. This will not
> > work if previous commits remain unchanged.
>
> You're not reading WindowsUTF8Plan carefully enough, see section 5.1.
> Conversion will not be converting all of history, it will be converting
> the branch head(s) by renaming the non-UTF-8 files and making a new
> commit. When you check out the new commit, Mercurial will say "ah, all
> these files are all UTF-8, switch to using utf8vfs". But if you switch
> to an old commit, it'll keep working the way it currently does.
>

Ah, OK. The proposed isutf8 method should detect these files when committing
and switch accordingly.
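
A minimal sketch of what such a check could look like, assuming file names
arrive as byte strings. Only the name `isutf8` comes from the plan; the body
and the `allutf8` helper are hypothetical. Note that pure ASCII names count
as valid UTF-8, and that a name in a legacy encoding may occasionally happen
to decode as UTF-8 too, so this test alone is a heuristic.

```python
def isutf8(name):
    """Return True if the byte string `name` decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def allutf8(names):
    # Hypothetical helper: True only if every file name is valid UTF-8,
    # which is the condition described above for switching to utf8vfs.
    return all(isutf8(n) for n in names)
```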

> --
> Mathematics is the supreme nostalgia of our time.
>
>

--
Chinmay Joshi