bdiff/bsdiff: comprehensive performance measurements for large files [long, but nice graphics attached]

Ralf Leibold Ralf.Leibold at nuance.com
Thu Sep 6 03:06:00 CDT 2007


These are interesting results. As stated in my patch mail, I
originally wrote this patch because I could not commit at all due to
an internal stack overflow in Mercurial. Another interesting
experiment would therefore be to use large files of the same size but
with an increasing number of changed lines. I would expect the
internal bdiff algorithm to reach its limits, so that at some point
the bsdiff algorithm becomes faster.
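
A minimal sketch of how such test files could be generated (purely
illustrative; the line format, file size, and change counts are
assumptions for the example, not part of any measurement):

    import random

    def make_version(n_lines, n_changed, seed=0):
        # Build n_lines lines of text in which n_changed randomly chosen
        # lines differ from the baseline; the total size stays constant
        # because the replacement payload has the same length.
        rng = random.Random(seed)
        lines = ["line %06d: %s\n" % (i, "x" * 40) for i in range(n_lines)]
        for i in rng.sample(range(n_lines), n_changed):
            lines[i] = "line %06d: %s\n" % (i, "y" * 40)
        return "".join(lines)

    # A baseline plus versions of identical size with a growing number
    # of changed lines (about 52MB each for 1,000,000 lines).
    for n_changed in (0, 10, 100, 1000, 10000, 100000):
        with open("testfile-%d.txt" % n_changed, "w") as f:
            f.write(make_version(1000000, n_changed, seed=n_changed))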

Did you also measure the resulting repository sizes? If so, did you
get smaller or bigger repositories with bsdiff?

Did the patch prove to be stable for you?

Thanks
Ralf

-----Original Message-----
From: Christoph.Spiel at partner.bmw.de
[mailto:Christoph.Spiel at partner.bmw.de] 
Sent: Thursday, September 06, 2007 8:27 AM
To: mercurial at selenic.com
Cc: bos at serpentine.com; Ralf Leibold
Subject: bdiff/bsdiff: comprehensive performance measurements for large
files [long, but nice graphics attached]

Hi all!

A month ago Matt wondered...

> It'd be interesting to get a graph of bdiff performance
> against file size.

... and here it comes with even more ;-)


Setup
=====

- Dual 550MHz Coppermine box, 1GB RAM
- 160GB IDE disk, ReiserFS-3.6
- Linux-2.4.34
- gcc-4.2.1 (for the compilation of bdiff and bsdiff)
- Mercurial-0.9.4 + bsdiff

Bsdiff was written by Ralf Leibold.  His posting can be found here:
 
http://www.selenic.com/pipermail/mercurial-devel/2007-July/002503.html


Test
====

Preparation
-----------

To model "binary data" I concatenated so called "raw files" from my
DSLR until the file size was larger than 70MB.  These files have an
enormously high information entropy.  Some other raw files were
converted to 16-bit TIFF and the TIFF files concatenated so as to
also exceed about 70MB.  A set of eleven files each were prepared
this way.
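
A minimal sketch of this preparation step (the input patterns and
output names are made up for illustration; the actual camera files are
of course not included):

    import glob, shutil

    TARGET_SIZE = 70 * 1024 * 1024   # stop once the result exceeds ~70MB

    def concatenate_until(pattern, out_path, target_size=TARGET_SIZE):
        # Concatenate the matching input files (repeatedly, if needed)
        # into one large test file until its size exceeds target_size.
        sources = sorted(glob.glob(pattern))
        if not sources:
            raise ValueError("no input files match %r" % pattern)
        with open(out_path, "wb") as out:
            while out.tell() <= target_size:
                for path in sources:
                    with open(path, "rb") as src:
                        shutil.copyfileobj(src, out)
                    if out.tell() > target_size:
                        break

    # Hypothetical input directories and output names:
    concatenate_until("raw/*.cr2", "testdata-cr2.bin")
    concatenate_until("tiff/*.tif", "testdata-tiff.bin")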

Timing Loop
-----------

- The timing loop was performed for different file sizes, starting at
  1000kB with four samples per octave up to 64000kB.  Exceptions are
  noted in "Results".  (A sketch of the loop follows after this list.)
- I used dd(1) to cut sample files of the desired length out of the
  eleven large files described in "Preparation".
- In the performance test I started with an empty repository, checked
  in an initial version and then committed ten versions of exactly the
  same size but different contents.  The duration of each commit was
  timed.
- If the total CPU time of a commit exceeded one hour, the process
  was killed by the OS (ulimit -t 3600) and a duration of one hour was
  recorded.
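
A condensed sketch of this timing loop (the source file names and the
truncated size list are placeholders, and bash's ulimit stands in for
the CPU-time limit; this illustrates the procedure above, it is not
the script that was actually run):

    import os, subprocess, time

    SIZES_KB = [1000, 1189, 1414, 1682, 2000]           # four samples per octave (truncated)
    SOURCES  = ["big-%02d.bin" % i for i in range(11)]  # the eleven large files (placeholder names)

    def timed_commit(repo, message):
        # Run "hg commit" under a one-hour CPU-time limit (ulimit -t 3600)
        # and return the wall-clock duration of the commit.
        start = time.time()
        subprocess.call(["bash", "-c",
                         "ulimit -t 3600; hg commit -m '%s'" % message],
                        cwd=repo)
        return time.time() - start

    for size_kb in SIZES_KB:
        repo = "repo-%dkB" % size_kb
        subprocess.check_call(["hg", "init", repo])
        for rev, source in enumerate(SOURCES):
            # Cut a sample of the desired length out of one of the large
            # files with dd(1); revision 0 is the initial checkin, revisions
            # 1-10 replace the file with new contents of exactly the same size.
            subprocess.check_call(["dd", "if=" + source,
                                   "of=" + os.path.join(repo, "data.bin"),
                                   "bs=1k", "count=%d" % size_kb])
            if rev == 0:
                subprocess.check_call(["hg", "add", "data.bin"], cwd=repo)
            print("%dkB rev %d: %.1fs"
                  % (size_kb, rev, timed_commit(repo, "rev %d" % rev)))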

Variations
----------

- The test was repeated with two different sets of TIFF data (green
  diamonds and red squares).
- The test was conducted with Mercurial's built-in difference algorithm
  "bdiff" and a supplement posted here about six weeks ago, called
  "bsdiff".  The bsdiff threshold was lowered from 10000kB to 100kB so
  as to see the runtime behavior of the algorithm in the same range of
  file sizes as in the "bdiff" tests.


Results
=======

For details please have a look at the attached figure.  Note its
log-log scale.

The duration of the initial commit grows approximately linearly with
file size and is, of course, independent of the diff algorithm in use.

bdiff
-----

- For small files (< 4000kB) subsequent commits take 4-5 times longer
  than the first one, so the durations fall into a band of values.
- Depending on the kind of data, bdiff behaves differently for files
  larger than 4000kB.
  * The cr2 data set (yellow squares) follows the linear size
    dependency up to the largest sizes.
  * The tiff data sets (green diamonds and red squares) show a very
    pronounced deviation towards longer durations.  No comprehensive
    tests were performed for these above 50000kB.

bsdiff
------

Subsequent commits take 10-25 times longer than the initial checkin,
but this relation holds up to the largest file sizes (64000kB).  No
influence of the kind of data on the commit duration was found.


Conclusions
===========

For large files the built-in "bdiff" algorithm dug itself in for some
data; otherwise it was pleasantly fast.  "bsdiff", on the other hand,
was much slower in my tests, but exhibited a predictable runtime
behavior for all file sizes and all kinds of data.  The algorithm
presented itself as slow but robust.


HTH!

Cheers,
        Chris

--
Dr. Christoph L. Spiel
BMW Forschungs- und Innovationszentrum, EA-410
Lauchstaedterstrasse 5, 80995 Muenchen


