The above files were generated by exploiting two facts: the block structure of the MD5 function, and the fact that Wang and Yu's technique works for an arbitrary initialization vector. To understand what this means, it is useful to have a general idea of how the MD5 function processes its input. This is done by an iteration method known as the Merkle-Damgård method. A given input file is first padded so that its length is a multiple of 64 bytes. It is then divided into individual 64-byte blocks M_0, M_1, ..., M_{n-1}. The MD5 hash is computed by computing a sequence of 16-byte states s_0, ..., s_n, according to the rule s_{i+1} = f(s_i, M_i), where f is a certain fixed (and complicated) function. Here, the initial state s_0 is fixed, and is called the initialization vector. The final state s_n is the computed MD5 hash.
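The iteration described above can be sketched in Python. This is an illustrative, unoptimized implementation that follows the public description of MD5 in RFC 1321. The names `pad`, `f`, `md5_states`, `IV`, `SHIFTS` and `K` are chosen here for illustration and do not come from the text above. States are represented as 16-byte strings.

```python
import math
import struct

# Per-step rotation amounts and sine-derived constants, as given in RFC 1321.
SHIFTS = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
          [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)
K = [int(abs(math.sin(i + 1)) * 2**32) & 0xFFFFFFFF for i in range(64)]

# The standard initialization vector s_0 (16 bytes).
IV = struct.pack("<4I", 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)


def pad(message: bytes) -> bytes:
    """Pad to a multiple of 64 bytes: 0x80, zero bytes, 64-bit bit length."""
    zeros = (55 - len(message)) % 64
    bit_length = (8 * len(message)) % 2**64
    return message + b"\x80" + b"\x00" * zeros + struct.pack("<Q", bit_length)


def f(state: bytes, block: bytes) -> bytes:
    """Compression function: 16-byte state plus 64-byte block gives a new state."""
    a0, b0, c0, d0 = struct.unpack("<4I", state)
    words = struct.unpack("<16I", block)
    a, b, c, d = a0, b0, c0, d0
    for i in range(64):
        if i < 16:
            mix, g = (b & c) | (~b & d), i
        elif i < 32:
            mix, g = (d & b) | (~d & c), (5 * i + 1) % 16
        elif i < 48:
            mix, g = b ^ c ^ d, (3 * i + 5) % 16
        else:
            mix, g = c ^ (b | ~d), (7 * i) % 16
        mix = (mix + a + K[i] + words[g]) & 0xFFFFFFFF
        rotated = ((mix << SHIFTS[i]) | (mix >> (32 - SHIFTS[i]))) & 0xFFFFFFFF
        a, d, c, b = d, c, b, (b + rotated) & 0xFFFFFFFF
    # Add the input state back in, word by word, modulo 2^32.
    sums = [(x + y) & 0xFFFFFFFF for x, y in zip((a0, b0, c0, d0), (a, b, c, d))]
    return struct.pack("<4I", *sums)


def md5_states(message: bytes, iv: bytes = IV) -> list:
    """Return the whole sequence of states s_0, ..., s_n for a message."""
    padded = pad(message)
    states = [iv]
    for offset in range(0, len(padded), 64):
        states.append(f(states[-1], padded[offset:offset + 64]))
    return states


# The last state is the MD5 hash of the message.
digest = md5_states(b"abc")[-1].hex()
```

Passing a different `iv` to `md5_states` starts the same iteration from another state, which is the property the rest of the argument relies on.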
The method of Wang and Yu makes it possible, for a given initialization vector s, to find two pairs of blocks M, M' and N, N', such that f(f(s, M), M') = f(f(s, N), N'). It is important that this works for any initialization vector s, and not just for the standard initialization vector s_0.
Combining these observations, it is possible to find pairs of files of arbitrary length, which are identical except for 128 bytes somewhere in the middle of the file, and which have the same MD5 hash. Indeed, let us write the two files as sequences of 64-byte blocks:
M_0, M_1, ..., M_{i-1}, M_i, M_{i+1}, M_{i+2}, ..., M_n,
M_0, M_1, ..., M_{i-1}, N_i, N_{i+1}, M_{i+2}, ..., M_n.
The blocks at the beginning of the files, M_0, ..., M_{i-1}, can be chosen arbitrarily. Suppose that the internal state of the MD5 hash function after processing these blocks is s_i. Now we can apply Wang and Yu's method to the initialization vector s_i, to find two pairs of blocks M_i, M_{i+1} and N_i, N_{i+1}, such that
s_{i+2} = f(f(s_i, M_i), M_{i+1}) = f(f(s_i, N_i), N_{i+1}).
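This guarantees that the internal state s_{i+2}, reached after processing the first i+2 blocks, will be the same for the two files. Finally, the remaining blocks M_{i+2}, ..., M_n can again be chosen arbitrarily. Because both files continue from the same state with the same blocks, and have the same length and hence the same padding, their final MD5 hashes agree.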
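The prefix and suffix argument can be tried out on a toy hash. In the sketch below, `toy_f` is a made-up compression function with a 2-byte state (it is not MD5's f), and `find_colliding_pairs` is a brute-force stand-in for Wang and Yu's method, which is far more sophisticated. All names here are hypothetical. The only point is to show that a collision found for the intermediate state s_i survives an arbitrary common prefix and suffix.

```python
import hashlib

BLOCK = 64  # block size in bytes, as in MD5


def toy_f(state: bytes, block: bytes) -> bytes:
    """Toy compression function with a 2-byte state (NOT MD5's real f)."""
    return hashlib.sha256(state + block).digest()[:2]


def toy_hash(data: bytes, iv: bytes = b"\x00\x00") -> bytes:
    """Iterate toy_f over 64-byte blocks; padding is omitted for brevity."""
    assert len(data) % BLOCK == 0
    state = iv
    for offset in range(0, len(data), BLOCK):
        state = toy_f(state, data[offset:offset + BLOCK])
    return state


def find_colliding_pairs(state: bytes):
    """Brute-force search for two distinct 128-byte block pairs that lead
    from `state` to the same state. Feasible only because the toy state
    has just 16 bits; this is an assumption of the sketch."""
    seen = {}
    counter = 0
    while True:
        pair = counter.to_bytes(4, "little") * 32  # 128 bytes = two blocks
        result = toy_hash(pair, iv=state)
        if result in seen:
            return seen[result], pair
        seen[result] = pair
        counter += 1


prefix = b"A" * (3 * BLOCK)  # arbitrary blocks M_0, ..., M_{i-1}
suffix = b"Z" * (2 * BLOCK)  # arbitrary blocks M_{i+2}, ..., M_n
s_i = toy_hash(prefix)       # state after the common prefix
m_pair, n_pair = find_colliding_pairs(s_i)

file1 = prefix + m_pair + suffix
file2 = prefix + n_pair + suffix
same_hash = toy_hash(file1) == toy_hash(file2)
```

The search has to start from `s_i` rather than from the standard initialization vector, which mirrors why it matters that Wang and Yu's method works for any initialization vector.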
So how can we use this technique to produce a pair of programs (or PostScript files) that have identical MD5 hashes, yet behave in arbitrarily different ways? This is simple. All we have to do is write the two programs like this:
Program 1: if (data1 == data1) then { good_program } else { evil_program }
Program 2: if (data2 == data1) then { good_program } else { evil_program }
and arrange things so that "data1" = M_i, M_{i+1} and "data2" = N_i, N_{i+1} in the above scheme. This can even be done in a compiled program, by first compiling it with dummy values for data1 and data2, and later replacing them with the properly computed values.
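The layout and the patching step can be sketched as follows. The file layout (prefix, then a 128-byte slot, then a reference copy of data1), the helper names `build_image`, `patch` and `run`, and the byte values are all assumptions made for illustration. In particular, `data1` and `data2` below are placeholders, not real colliding MD5 blocks.

```python
SLOT = 128  # size of the differing region: two 64-byte blocks


def build_image(prefix: bytes, slot: bytes, reference: bytes) -> bytes:
    """Lay out a file as: prefix | slot | reference copy of data1."""
    assert len(slot) == SLOT and len(reference) == SLOT
    return prefix + slot + reference


def patch(image: bytes, offset: int, new_bytes: bytes) -> bytes:
    """Overwrite SLOT bytes at `offset` with the computed values."""
    assert len(new_bytes) == SLOT
    return image[:offset] + new_bytes + image[offset + SLOT:]


def run(image: bytes, offset: int) -> str:
    """Mimic `if (slot == data1) then good_program else evil_program`."""
    slot = image[offset:offset + SLOT]
    reference = image[offset + SLOT:offset + 2 * SLOT]
    return "good_program" if slot == reference else "evil_program"


# Placeholder values only: these are NOT real colliding MD5 blocks.
data1 = bytes(range(128))
data2 = bytes([data1[0] ^ 0x80]) + data1[1:]

# A 64-byte prefix keeps the slot aligned to a block boundary.
prefix = b"\x7fHDR" + b"\x00" * 60
offset = len(prefix)

# First "compile" with dummy values, then replace them afterwards.
dummy = build_image(prefix, b"\x00" * SLOT, b"\x00" * SLOT)
template = patch(dummy, offset + SLOT, data1)  # shared reference copy
program1 = patch(template, offset, data1)
program2 = patch(template, offset, data2)

behaviour1 = run(program1, offset)
behaviour2 = run(program2, offset)
```

With real colliding blocks in place of the placeholders, the two images would differ only inside the slot, so everything after it, including the reference copy and both program bodies, would be hashed from the same internal state.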