Repo corrupted again, no idea why

Fri Oct 1 15:37:52 CDT 2010

On 01.10.2010 20:14, Luis Navarro wrote:
> On Fri, Oct 1, 2010 at 10:44 AM, Mads Kiilerich <mads at kiilerich.com>
> wrote:
> 
>     On 10/01/2010 07:20 PM, Luis Navarro wrote:
> 
>         The most recent problem arose when the user was pushing changes
>         from our
>         staging repo to our production repo.
> 
> 
>     Using a repository and a shared file system?
> 
> 
> Partially.
> 
> Most developers are on their own workstations and run TortoiseHg 1.1.1
> locally.
> 
> Individual dev repos and the staging repo are on "Server A".  All of
> these repos are available over shared file systems *and*
> Mercurial/Apache/mod_wsgi.  Server A has Mercurial 1.6 and Tortoise
> 1.1.1 installed.
> 
> Production repo is on "Server B".  I can't install THG or Mercurial on
> Server B so this repo has to be accessed over a shared file system.
> 
>         The production repo is set up with
>         changegroup hooks to (a) automatically update the production
>         files and
>         (b) send a notification e-mail (using the "notify" extension).
> 
>         The push generated the following error:
> 
>         pushing to y:\production\
>         searching for changes
>         adding changesets
>         adding manifests
>         adding file changes
>         added 2 changesets with 0 changes to 0 files
>         warning: changegroup hook exited with status -1
>         [command completed successfully Tue Sep 28 16:02:29 2010]
>         error: changegroup.notify hook raised an exception:
>         data/web/shared/pubs/workshops/index.cfm.i at 919619b0c5ef: no
>         match found
> 
> 
>     What Mercurial version?
> 
> 
> THG has 1.6.1023 in it.  Server A has Mercurial 1.6.  So neither have
> the fix you listed below.  I'll try to update as soon as possible. 
> Assuming the bug listed below is the culprit, is there any sort of error
> message that it would have caused so I can be sure?

The error scenario for the hardlink problem mentioned is described at
http://mercurial.selenic.com/bts/issue761

In that scenario, as I understand matters, there are no error messages
at the moment the problem starts happening. The corruption is silent.

The problem is this: if an 'hg clone' was done (without the --pull
option), then the destination and the source repo share files inside
.hg/store by using hardlinks [1], provided the filesystem supports
hardlinking (NTFS does).
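
To make the sharing concrete, here is a minimal Python sketch. It is not
Mercurial code, and the file names are made up for illustration. It
creates a file, hardlinks it under a second name, and shows that a write
through one name is visible through the other:

```python
import os
import tempfile


def demo_shared_hardlink():
    """Hardlink a file and show that both names share the same content."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-ins for a store file in the source repo
        # and the same store file in a hardlinked clone.
        source = os.path.join(tmp, "source.i")
        clone = os.path.join(tmp, "clone.i")

        with open(source, "w") as f:
            f.write("rev0\n")

        # Both names now point at the same data on disk.
        os.link(source, clone)
        nlink = os.stat(source).st_nlink

        # Appending through the clone's name without breaking the link...
        with open(clone, "a") as f:
            f.write("rev1\n")

        # ...changes what the source name sees as well.
        with open(source) as f:
            content = f.read()

    return nlink, content


if __name__ == "__main__":
    print(demo_shared_hardlink())
```

On a local filesystem the reported link count is 2 here. The bug
discussed in this thread is that the count comes back wrong when the
files are on a network share.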

Mercurial is designed to break such hardlinks inside .hg if a commit or
push is done to one of the clones. This only works if the Windows API
that Mercurial uses gives a correct answer when Mercurial asks "how many
hardlinks are on this file?".

We found out that this answer is almost always wrong (always reporting
1, even if it is in fact >1) iff the hg process is running on one
Windows computer and the repository files are on a network share on a
different Windows computer.

Due to that Windows API implementation silliness (even with the latest
and greatest Windows 7), Mercurial has failed to break hardlinks for
repositories on network shares for quite a long time (years, that is).
This means older Mercurial versions that don't have the fix may corrupt
repositories pretty seriously when committing or pushing to a repo on a
network share.

The fix now detects if the repo is write-accessed *over a network share*
and then unconditionally assumes all files are hardlinked. So it
unconditionally executes the "breaking hardlink" method for all files it
is writing to on a network share (making it slower but safe now).

Breaking a hardlink in this situation means creating a normal full blown
copy of the file before writing to it, completely separating both files
from each other (writes to one file no longer affect the other).
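
As a rough illustration of what "breaking a hardlink" amounts to, here
is a Python sketch. It is not Mercurial's actual implementation, and
break_hardlink is a hypothetical helper name. It copies the file to a
temporary name in the same directory and then moves the copy over the
original name:

```python
import os
import shutil
import tempfile


def break_hardlink(path):
    """Replace 'path' with a private full copy of its current content.

    Any other name that was hardlinked to the same data keeps the old
    data; later writes to 'path' no longer affect it.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # Create the temporary file next to the original so that the final
    # rename stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        shutil.copy2(path, tmp_path)  # full copy of data and metadata
        os.replace(tmp_path, path)    # the name now refers to the copy
    except BaseException:
        # Do not leave the temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling break_hardlink() on a file before appending to it is the
"slower but safe" behaviour: the copy costs time, but the write can no
longer leak into another clone.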

Not copying such a file if it is hardlinked means the file modification
appears in all clones (even though it should appear only in the clone
that is being committed or pushed to).

Clones that were done using 'hg clone --pull' or using Windows Explorer
should not be affected.

> 
>     Y: looks like a network drive. Do you use Mercurial 1.6.2 or later
>     which contains the fix http://selenic.com/repo/hg//rev/50523b4407f6 ?

I think the fix was first released with 1.6.3, according to
http://mercurial.selenic.com/wiki/WhatsNew (Mads confirmed this on IRC
in the meantime).

>     If that doesn't explain and fix it: Can you check if files in the
>     two repositories are hardlinked to each other?
> 
> 
> Sorry if this is a silly question but how would I do that?

The question is: did you do a 'hg clone' without --pull? In your
situation, to be safe, I would assume this happened and act accordingly.
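
If you do want to check, one possible approach is sketched below in
Python. It is an assumption-laden sketch, not a Mercurial tool:
find_shared_files is a hypothetical helper, and it relies on the
operating system reporting file identity correctly. Given the link
count problem described above, I would run it on the machine that
physically hosts both repositories rather than over the share:

```python
import os


def find_shared_files(store_a, store_b):
    """Return relative paths that are the same physical file in both trees.

    'store_a' and 'store_b' are meant to be the .hg/store directories of
    the two repositories being compared.
    """
    shared = []
    for root, _dirs, files in os.walk(store_a):
        for name in files:
            path_a = os.path.join(root, name)
            rel = os.path.relpath(path_a, store_a)
            path_b = os.path.join(store_b, rel)
            # samefile() compares file identity, not file content, so a
            # plain copy with equal content does not count as shared.
            if os.path.isfile(path_b) and os.path.samefile(path_a, path_b):
                shared.append(rel)
    return sorted(shared)
```

An empty result suggests the two stores share no hardlinked files. Any
path that is listed is a file where a write to one repository would
also show up in the other unless the link is broken first.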

Stop write access to the share, restore the corrupted repositories from
backup and upgrade all Mercurial installs on all workstations and
servers to a version that has the fix.

Then enable write access on the share again.

>         Is there a way to get Mercurial/THG to provide more verbose
>         information,
>         preferably logging to a file so I don't have to count on my users
>         diligently sending me error info?  If there's some sort of
>         problem at
>         the file system or disk level, I would expect Mercurial/THG to
>         log an
>         error somewhere that says something like "couldn't write file
>         xyz - err
>         # 123" so I can debug these sorts of things after the fact.
> 
> 
>     As long as it is the Mercurial running at the users local PC
>     accessing a repository on a shared file system you wouldn't be able
>     to rely on it anyway. So Mercurial has something even better: The
>     ability to run as a client/server system where it only is your
>     Mercurial at the server that accesses the repository files.
> 
>     See http://mercurial.selenic.com/wiki/PublishingRepositories .
>     IIS/hgweb.cgi seems like the most obvious choice for you.
> 
>     /Mads
> 
> 
> For at least the staging to production path, I'm stuck with shared file
> systems.  What do you mean "you won't be able to rely on it?"  I know
> HTTP is preferable to SMB but if Mercurial isn't robust over SMB then
> I've got problems.
> 
> Thanks for your quick reply and thoughtful answers!

I think 1.6.3 or newer *should* be safe to use for committing and
pushing to Windows network shares.

That said, I do not consider pushing and committing to a share
particularly robust.

[1] http://en.wikipedia.org/wiki/Hard_link