RFC: safe pattern matching for problematic encoding

Wed May 23 07:38:52 CDT 2012

Hi, devels.

I'm working on safe pattern matching/parsing for problematic
encodings (e.g. cp932), in which the byte for '\' (0x5c) may appear
as part of a multi-byte character.
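
To make the problem concrete, here is a minimal sketch. It is written in
Python 3 with bytes patterns for illustration only (Mercurial itself
runs on Python 2, where local strings are plain byte strings), and the
helper name compiles() is hypothetical:

```python
import re

# U+8868 and U+30BD are ordinary Japanese characters whose cp932
# encoding ends with byte 0x5c, the same byte as ASCII '\'.
hyou = "\u8868".encode("cp932")
so = "\u30bd".encode("cp932")
assert hyou == b"\x95\x5c"
assert so == b"\x83\x5c"


def compiles(pattern):
    """Return True if 'pattern' is accepted by re.compile()."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The trail byte is taken as a dangling escape character, so the
# pattern is rejected outright.
assert not compiles(hyou)

# Worse, with something after it the pattern compiles but means
# something else: the trail byte and 'd' are read as the class '\d'.
pattern = hyou + b"d"
assert re.search(pattern, hyou + b"d") is None
assert re.search(pattern, b"\x955") is not None
```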

I have finished a draft patch series for this, but it requires the
following changes:

  (1) add hooks to replace the '\' byte inside multi-byte characters
      (mbcs) with the escape sequence '\x5c' before:

      - re.compile() invocation from:
        - grep() in commands.py
        - grep() in fileset.py
        - grep() in revset.py

      - re.escape() invocation from:
        - _globre(), _regex() in match.py
        - remap() in subrepo.py

      - re.sub() invocation from remap() in subrepo.py

    re.compile()/escape()/sub() are invoked from many other places,
    so they cannot be wrapped directly.
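
A minimal sketch of the kind of replacement that (1) describes, applied
before re.compile(). This is not the code from the patch series: the
function names are hypothetical, the lead-byte ranges are the usual
cp932 ones (0x81-0x9f and 0xe0-0xfc), and it is written in Python 3 with
bytes patterns. It handles only the '\' trail byte.

```python
import re


def is_lead_byte(b):
    """True if byte value 'b' starts a two-byte cp932 character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def escape_trail_backslash(pattern):
    """Rewrite 0x5c trail bytes in 'pattern' as the text '\\x5c'.

    Backslashes that are real single-byte characters are left alone,
    so ordinary regex escapes such as '\\d' keep working.
    """
    out = bytearray()
    i = 0
    while i < len(pattern):
        b = pattern[i]
        if is_lead_byte(b) and i + 1 < len(pattern):
            trail = pattern[i + 1]
            out.append(b)
            if trail == 0x5C:
                out += b"\\x5c"  # four characters: \ x 5 c
            else:
                out.append(trail)
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)


word = "\u8868".encode("cp932")  # b'\x95\x5c'
safe = escape_trail_backslash(word + rb"\d")
matcher = re.compile(safe)
assert matcher.search(b"xx" + word + b"7") is not None
```

A hook at each call site would then pass the pattern through such a
function just before handing it to re.compile().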

  (2) wrap tokenizer for parser._iter in parser.py to:

      - convert mbcs to unicode, to avoid unexpected escaping by '\'
        in mbcs
      - convert each token from unicode back to local encoding, and
      - convert the parsed length from unicode characters to bytes in
        local encoding (a mismatch in parsed length causes an exception)
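
The three steps of (2) can be sketched roughly as below. This is an
illustration under assumptions, not the patch itself: it is Python 3,
it assumes a tokenizer that yields (type, value, position) tuples, and
wrap_tokenizer() and toy_tokenizer() are hypothetical names.

```python
def wrap_tokenizer(tokenizer, encoding="cp932"):
    """Run 'tokenizer' on unicode text, report results in bytes."""

    def wrapped(program):
        text = program.decode(encoding)  # mbcs -> unicode
        for kind, value, pos in tokenizer(text):
            if isinstance(value, str):
                value = value.encode(encoding)  # token -> local encoding
            # Character offset -> byte offset in the original program.
            bytepos = len(text[:pos].encode(encoding))
            yield kind, value, bytepos

    return wrapped


def toy_tokenizer(text):
    """Stand-in tokenizer: space-separated symbols, then an end token."""
    pos = 0
    for word in text.split(" "):
        if word:
            yield ("symbol", word, pos)
        pos += len(word) + 1
    yield ("end", None, len(text))


program = "\u8868 a".encode("cp932")  # b'\x95\x5c a', 4 bytes, 3 chars
tokens = list(wrap_tokenizer(toy_tokenizer)(program))
assert tokens == [
    ("symbol", b"\x95\x5c", 0),
    ("symbol", b"a", 3),
    ("end", None, 4),
]
```

Without the offset adjustment the end position would be reported as 3
while the byte string has length 4, which is the kind of mismatch
mentioned above.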

  (3) add hooks to execute 'string-escape' safely:

      - for "r'xxx'" style by tokenize() in fileset.py, revset.py,
      - for "r'xxx'" style by tokenizer() in templater.py

    'string-escape' can't be applied to unicode objects, but it is
    sometimes still needed, because the wrapping/hooking above converts
    the target object to a unicode object.
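
One possible shape for such a hook, as a sketch only. Python 3 has no
'string-escape' codec, so this swaps in the 'unicode_escape' codec as a
stand-in; the two differ in details (for example, 'unicode_escape' also
interprets \u and \N escapes). The function name is hypothetical.

```python
def unescape_text(text):
    """Interpret backslash escapes in a unicode string.

    Characters outside Latin-1 are first turned into \\uXXXX style
    escapes, so no multi-byte trail byte can be mistaken for '\\';
    'unicode_escape' then decodes those together with the escapes
    that were already present in the text.
    """
    raw = text.encode("latin-1", "backslashreplace")
    return raw.decode("unicode_escape")


# U+8868 followed by the two characters '\' and 'n'.
assert unescape_text("\u8868\\n") == "\u8868\n"
```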

As you may have noticed, the wrapping/hooking points are widely
scattered, so I don't think this implementation is very good. But I
don't have any other ideas.

Are there any other ideas for solving this problem? Or should I post
these patches as a first step, even though they are not so good?

BTW, how is the "using Unicode API on Windows" plan progressing?

  http://www.selenic.com/pipermail/mercurial-devel/2011-December/036385.html

If this is implemented, people who use a problematic encoding can
avoid this problem by using utf-8 instead of their local encoding.

# of course, conversion between utf-8 and the local encoding should
# be done at the UI boundary for interaction with other programs
# (e.g. the DOS prompt window), even though mercurial itself can use
# utf-8 internally.

So, I will do whatever is in my power to move these changes forward!

# checking sources, testing in problematic encoding env, and so on

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy at lares.dti.ne.jp