<HTML><BODY style="word-wrap: break-word; -khtml-nbsp-mode: space; -khtml-line-break: after-white-space; "><BR><DIV><DIV>On Aug 2, 2007, at 9:39 AM, Matt Mackall wrote:</DIV><BR class="Apple-interchange-newline"><BLOCKQUOTE type="cite"><P style="margin: 0.0px 0.0px 0.0px 0.0px"><FONT face="Helvetica" size="3" style="font: 12.0px Helvetica">Mercurial's bdiff algorithm treats all files as strings of bytes and</FONT></P> <P style="margin: 0.0px 0.0px 0.0px 0.0px"><FONT face="Helvetica" size="3" style="font: 12.0px Helvetica">breaks them on newline characters. For low-entropy "pure binary" files</FONT></P> <P style="margin: 0.0px 0.0px 0.0px 0.0px"><FONT face="Helvetica" size="3" style="font: 12.0px Helvetica">like JPEGs, those should occur roughly every 256 characters so the</FONT></P> <P style="margin: 0.0px 0.0px 0.0px 0.0px"><FONT face="Helvetica" size="3" style="font: 12.0px Helvetica">average "line length" for a binary file is a bit longer than for text,</FONT></P> <P style="margin: 0.0px 0.0px 0.0px 0.0px"><FONT face="Helvetica" size="3" style="font: 12.0px Helvetica">but not outrageously so.</FONT></P> </BLOCKQUOTE></DIV><BR><DIV>Really? I thought, from reading the [excellent] paper on the innards of Mercurial, that it used a binary-delta algorithm (the old first version of xdelta, IIRC) for binary files. </DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>I would imagine that a line-oriented text diff algorithm would achieve pretty poor compression on a binary file, much less than one designed for binary data.</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>The current xdelta, version 3 &lt;<A href="http://xdelta.org">http://xdelta.org</A>/&gt;, appears to be the state of the art in delta compression, and emits standard VCDIFF [RFC 3284] format. (Although for my own work I've been using zdelta, largely because the license is more flexible.)</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>--Jens</DIV></BODY></HTML>