Original subject: [patch] Merkle Hash Tree Against Corruption
Posted: Mar 14 2003
In case anyone cares: I do have experience in programming and software design,
as that's what I do for living. But I'm not a cryptography expert - I just use
cryptography within software, and know just enough to implement existing algorithms,
as I've sometimes done, from the algorithm description. I'm not doing any code
on eMule because I don't have VS 7 which would be required for that, and I'm
not willing to shell out a thousand euros (or however much it costs) just to
start developing eMule. Neither am I willing to use VS 7 without proper license,
so don't even think about telling me how easy it is to download without first
sending me proper license papers.
First, an algorithm that would produce result vectors verifiable against the
current ed2k-hash or the md4 parthashes within is quite impossible. The only
thing you could do would be to send md4 intermittent result vectors that would
again be used as starting vectors for the md4 transformation in the next block,
and verifying that hash intermittent result vector list is not possible without
having the correct data, at which point it's not useful to any degree any more.
Thus, a new method that is incompatible with the old ed2k-hash/md4 parthashes
is needed, and when it's already going to be incompatible with existing method,
dumping md4 altogether and going for a cryptographically secure hash algorithm
is the only reasonable course of action.
This doesn't mean that the old ed2k/md4 parthashes would be dumped.
Of course they would be used as the first method, and the new method would primarly
be a way to pinpoint corruption to blocks that are way smaller than 9500kB.
If the new hash becomes prelevant at some point, and support for it is added
to other clients and even servers, we could then, and only then, drop current
ed2k/md4 parthash method and start using the new method as the primary method.
Probably would have to support old ed2k/md4 parthash method still for a long
time to maintain backwards compatibility with old clients and ed2k link websites,
forums and databases.
Regarding the resources needed to create invalid data that collides with the
correct hash is considered trivial. I could probably produce one bad 64 byte
block per day with my current main workstation, so if any organization with
any moneytary support at all wants to use that method to attack ed2k network,
it'd not be hard at all to do so. Remember that the attacker can choose any
part of the file himself, knowing each and every intermittent result vector
in the file (generating such a list is trivial - perhaps a minute on average
PC for an 800MB file), so he's got all the cards to help him in corrupting the
file data.
And no, I don't have any idea whether that attack method is used or not. There
are easier ways to attack ed2k network still. Which is indeed one of the reasons
why we're talking about new hashtree extension in the first place. Nice additional
benefit about a hashtree with cryptographically secure hash algorithm is of
course that hash collision attack becomes extremely hard to do, also.
I don't really care one way or another about having one more database of files
with hashes. There are already numerous large collections of ed2k links by known
good releasers, so one more would be just Yet Another Link Collection, nothing
more.
However, if that just happens by choosing one algorithm from many that are equally
good, why not? Still, "equally good" may be different for different
people, and my points are from the point of view of a fast uploader that shares
a lot (>>200GB of files online that may be shared and are listed in known.met
due to having been shared at one point, ~180GB of files shared most of the time,
800 kilobytes/sec or higher upload limit all the time), and therefore CPU, disk
I/O, and diskspace requirements all are considered important.
With that out of way...
For the rest of this post I assume Tiger-192 as the hash algorithm. For reference
once more:
SHA1 is 160 bit hash (20 bytes per node)
Tiger is 192 bit hash (24 bytes per node)
SHA256 is 256 bit hash (32 bytes per node)
Thus, you whenever I mention amount of hash data, if you want to know how much
it would be using SHA1 instead of Tiger, divide by 24 and multiply by 20 (1/6th
less), and using SHA256 divide by 24 and multiply by 32 (1/3rd more).
Why a treehash?
Because in a tree, siblings of any parent can be verified against that parent,
and a parent can be created by knowing the siblings. Therefore, knowing any
level in full allows for construction of the whole tree from that level upwards,
and verifying against a single node - the top of the tree - which also acts
as the filehash.
This also leads to the fact that the tree is completely verifiable down to any
set of nodes for which all siblings are known recursively once one has the top
node. Eg. if you have all siblings from one branch of level 13, the siblings
of the parent (at L14) of that branch, the siblings of the parent of that branch,
and so on, up to the top, you can verify the validity of all the nodes within
that branch.
Of course this also means that when you have higher levels of the tree completely
verified, any subtree below can be verified against the closest parent. Any
subtree can be treated as if it were a tree of it's own.
There is absolutely no need to try to protect the tree by hashing the hashdata
with a separate hash. For as long as the top is known by the client receiveing
tree's hashdata.
Why binary tree and not higher branching factor?
As all siblings are needed to verify against parent or calculate the parent,
with binary tree it means exactly one sibling for each node, while with higher
branching factor higher number of siblings are needed. While binary tree is
deeper than a tree with higher branching factor for any equal number of leaves,
the binary tree has maximum of (2*leaves-1) nodes, and a tree with branching
factor equal to the number of leaves (which is more specifically a sequential
block hash - a special subset of treehash) the number of nodes is (leaves+1).
The maximum difference is roughly twofold with a high number of leaves. The
size of a binary tree of just one level lower (or exactly twice the blocksize)
is always less than the size of the sequential block hash. Therefore, when the
size of the hashdata total is very small compared to the data to be verified
with the hashdata, the difference in the amount of hashdata that must be transmitted
is ignisificant. Also, a full tree can be reconsructed from just the leaves
and the top, which would be exactly the same number of nodes as a sequential
block hash for the same data.
Therefore, there is no such case where a binary tree would require transmitting
more hashdata than a sequential block hash (and therefore any treehash of higher
branching factor than 2) would.
Why such a small atomic block size as 1 kilobyte?
Because atomic block size defines the absolute smallest blocksize that can be
verified. However, when the leaf nodes (hashes which have been calculated from
a chunk of original data, not from a chunk of hashdata) are not available, any
block down to which a node is available can be verified.
This means that if the leaf is not available, but a L7 node (128kB blocksize)
is available, the 128kB block can be verified against the corresponding L7 node.
That is, the blocksize may be variable depending on availability of nodes and
the need to verify small blocks. The need is defined by the client wanting to
verify a block (usually trying to get lower level nodes when it detects corruption
in a large block), and the availability is defined by the clients sharing the
blocks (dependant on the resources available for sending hashdata). It's extremely
unlikely that leaf nodes with atomic block size of 1kB would ever be available,
as that'd take huge amounts of critical resources (diskspace and/or CPU power,
and bandwidth). Note that even while bandwidth is listed in critical resources,
technically one 1kB block from an 800MB file can be verified by a hashpath of
20 nodes deep (only siblings are needed for each node on the path, as with the
sibling the parent can be calculated, leading up to the top hash). I may be
off by one or two in that hashpath depth, but considering the total number of
nodes needed, the amount of hashdata is completely insignificant compared to
the filesize.
OK, onto the critical resource issues and the protocol requirements to optimize
critical resource usage..
Critical resources are:
-Bandwidth
-Disk space
-Disk I/O
-CPU
-Memory
Memory usage shouldn't be high, as code isn't a big issue, and hashdata is insignificant
compared to filedata.
CPU usage is high on hashing, but not more so than current ed2k hash creation.
Also, if they are done in the same disk I/O cycle (read block, hash with md4,
hash with Tiger, continue) then disk I/O isn't an issue. Hashing is I/O bound,
so hashing times shouldn't go up more than a small fraction. In treehash about
twice the number of hash operations will be executed as in the sequential block
hash, and all in all this means that hashing will take approximately three times
the CPU power it takes with just ed2k hash, assuming both hash algorithms take
the same time to execute for the same amount of data. If hashing cycle is planned
well, there is also a potential of keeping both hash algorithms in the CPU's
cache, and hashing the data brought into CPU's cache on one go, thus removing
memory I/O from the critical dual-hash path.
Disk space requirements define the level down to which hashes can be stored.
Assuming storing of down to 8MB blocksize (L13) only would mean just
128 leaves per gigabyte of data, or ~6kB of hashdata per gigabyte. Which is
completely insignificant - it's only about 3.5 times the ed2k parthashes
for the same amount of data.
Storing down to 8kB blocksize (L3) would mean 128k leaves per gigabyte
of data, or ~6MB of hashdata per gigabyte. Which is no longer insignificant,
even if it's only 0.6% of the filedata. It'd push me to well over a gigabyte
of hashdata, which I don't consider reasonable.
Storing down to 128kB blocksize (L7) would mean 8k leaves per gigabyte
of data, or ~384kB of hashdata per gigabyte. At 0.037% of filedata, it'd
push me to perhaps about a 100MB of hashdata, which is still reasonable.
At L7 storage (and storing full tree down to L7), verifying blocks at 128kB
would be possible. That's still acceptable, and compared to verifying at 9500kB
blocksize is excellent.
Halving storage requirement by storing just the L7 (or other chosen level) hashes
and the top hash would mean having to dynamically create missing hash levels
on the fly, and also that if corruption of hashdata container is noticed, all
hashes for the file whose hashtree has corrupted have to be invalidated and
the file completely rehashed. Storing full tree would mean that corruption could
be detected and pinpointed to specific branch.
Also, the mule could at startup verify all hashdata in the container, as hashing
tens or even hundreds of megabytes can be done in a matter of seconds, or at
most minutes. Few people close and restart mule so often that it'd be even an
annoyance - what's one more minute in startup time when you run it continously
for hours or days at a time?
Note that I vehemently oppose dynamic hash creation, as it'd put additional
disk I/O and CPU strain on the uploading client, and for highspeed uploaders
the mule already takes a hefty toll on CPU, possibly thus pushing mule from
bandwidth bound to CPU bound task, which is definitely not what we want to do.
If you want to know what kind of CPU usage mule has on a highspeed uploader,
try uploading a megabyte a second, divided to over a hundred upload slots, with
a few thousand clients (that will actually be served in reasonable time) in
queue. If you don't feel close to CPU bound at that point, I want your code
(if it's optimized so well) or your computer (if it's due to hefty CPU).
Now to bandwidth, which is for most people the most critical resource. Especially
upstream bandwidth..
First, remember that to verify anything, you need the full hashpath up to the
top from the block being verified. And that the path is just 20 or so nodes
for an 800MB file, or just about half a kilobyte. And, as the hashpath down
to L0 wouldn't be available, from top to L7 it'd be less than 15 nodes, so down
to <360 bytes.
However, as within the 9500kB that has been noticed as corrupted no single block
can, with ed2k parthashes, be singled out for verification, the whole chunk
must be verified. To do that, hashes for each block in the chunk must be obtained.
While that could be done by requesting each hashpath down to the L7 (or whatever
level is available) block. However, as the neighbouring hashpaths unite, at
worst halfway through the file, so the hashpaths separate at the top two hashpaths
down to about 38 neighbouring L7 nodes each must be obtained. 38 nodes completely
unite at 6 levels above, or at L13 (8MB), producing thus two trees of <63
nodes each. Also, from L13 to top at 800MB filesize the paths are seven nodes
long, thus requiring 13 more nodes (paths unite at top). Total count thus would
be <139 nodes in the worst case to obtain L7 hashes for any single chunk
in an 800MB file. Total hashdata needed would thus be <3.4kB. Which definitely
is worth it to grab when corruption can be pinpointed to a single (or multiple,
which of course would be worse) 128kB block instead of a 9500kB chunk.
Of course once a client has obtained a set of nodes, it will store them as well,
and will never again need more hashes to re-verify that chunk, or any new blocks
received from any client, trying to find a good block to replace the corrupted
one with.
What does this mean?
That if we don't care about anything else but being able to verify a corrupted
chunk at a finer level:
- store only the finest level chosen by the user
- store list of leaves that correspond to each chunk
- add to the protocol a request that addresses the nodes by chunk number
- add to the protocol a response to the above request that trasmits the top
hash and the nodes corresponding to the chunk
However!
Doing a more flexible implementation would
not be much more of a burden on bandwidth, and would allow the mule to truly
take a step ahead.
- How about sharing blocks even when a full chunk has not been completed? Would
probably help file spreading a lot, especially in cases where the release distro
is doing a bad job and even pulls out before all chunks have been well spread.
- Noticing a bad client when only a small block has been received? Why download
a full corrupted chunk when you can get rid of the bad client after just a small
block instead.
More reasons needed? I don't know what all benefits could be gained by this,
but I'm sure you can come up new benefits above the two I listed. And even those
two are big ones.
So, what's the cost?
First, if a client downloaded a suitable treetop when it gets the normal ed2k
parthash list, it'd be a long way towards hashing the whole file with the new
treehash. And as noted already, a treetop down to 8MB blocksize (L13) would
be just ~3.5 times as much as the parthash list, whatever the file size.
And it is downloaded only once, just like that parthash list is.
Or, if only leaves and top were transmitted, it'd be just 1.7 times the size
of the normal parthash list, although like the parthash list, unless it is complete
and correct, knowing which node is corrupt is impossible, and the whole list
would have to be invalidated and reasked.
When the first time asking for a chunk, the subtree from L13 down to L7 would
also be requested. At 6 levels, it's size is 64 leaves or 122 nodes (the topnode,
L13, is already known), or a little less than 3kB (assuming sharing at 8MB blocksize)
or twice that to cover any single chunk. This list is downloaded once per
chunk only, and when a branch from L13 to L7 hits two chunks' areas, it's
downloaded only once - first time either of those chunks is requested.
Overall, the hashtree down to L7 for 800MB file would be 300kB in size, or just
0.037% of the filedata. In all overhead traffic that's not much, and it'd allow
for corruption detection, bad client detection, and sharing smaller blocks than
9500kB chunks.
Minimal implementation in protocol would be:
- request for treetop
- response to treetop request
- request for L13-L7 branch
- response to L13-L7 branch
And limits in this case would be trivially easy:
- a client may request treetop just as it may request for parthash list
- a client may request for subtree when it requests a chunk. Such request is
fullfilled only when the client gets an upload slot, and is fullfilled only
for the branches it has requested and the uploading client wishes to serve -
eg. send a branch only when starting upload of filedata in that branch.
This would be very similar to Champ's proposal, except that the low level would
L7, not L3 (which is insanely small due to bandwidth, diskspace and/or CPU and
disk I/O critical resources usage), and that branches would be transferred not
just when corruption is noticed and is to be pinpointed within the chunk, but
already when a chunk upload is started. This would allow for checking for corruption
already when receiving, not just trying to pinpoint exact location when it has
been noticed - which may be after hundreds or thousands of bad kilobytes too
late.
Note that a client that wishes only to locate corrupted block within a chunk
could still, with these control messages, achieve very closely the same result
as with the "absolute minimum corruption detection only" proposal.
It could refrain from requesting new hash branches until it detects corruption
with the old ed2k/md4 parthashes. After that, it would request the treetop,
and the branches covering that chunk. The treetop it would get as easily as
it would get the hashpart list, while for the branches, it would have to wait
in upload queue until slot.
Still, this would not be a satisfactory solution. Why? Because it's rigid.
What if we notice that L13 blocksize is not the optimal treetop bottom? Or that
it'd be worth the resources to use even finer grained block to find corruption?
Or find some other reason to try other levels, branches, or paths, and find
more optimal solutions?
There are at least two reasonable ways to map the tree to a list. With one,
any branch can be addressed as a list of consecutive nodes, and with the other,
any level can be addressed as a list of consecutive nodes.
Because any branch is a fully verifiable self-contained unit, and also chunks
are easiest to map via branches, I think fetching branches easily is the best
way to go.
Why not levels? Because there is no case where you would need a full level where
a branch wouldn't do at least as well.
I don't know if a mapping where any full path can be addressed as a consecutive
list of nodes exists. Neither do I consider such mapping very useful compared
to branch mapping.
So, the simplest way of providing access to any branch, from any level to any
other level, is to store the tree linearly mapped in such way that branches
can be addressed as lists. Of course it would be possible to rather add a mapping
type parameter to the request opcode.
So, to move from rigid two-step approach to more extensible approach while maintaining
the ability to request exactly the treetop and branches as per the two-step
approach, the request and response could be:
- request hashtree for <ed2k hash> of <hash algo> <hash subtype>
mapped as <mapping type> from <first node> to <last node>
inclusive
- respond with the obvious error messages
- file not found
- hash algo not supported
- hash subtype not supported
- mapping type not supported
- branch not available
- respond repeating request parameters except with first node being the first
node within the range that is available, and last node being the last node within
the range that is available, and the nodelist
So, after all this, how do we know that the hashtree received is for the
file with that ed2k hash?
When a client has received a complete chunk and has verified it against the
ed2k/md4 parthash as usual, coming to the conclusion that it is correct and
valid chunk, chunk data can be verified using hashtree as has been received.
If the data hashes correctly and matches the hashes in the available hashtree,
the hashtree is correct and valid.
That is the only way I can think of, as with old ed2k hash, the smallest verifiable
unit is a chunk.
So, that's it. I've now explained my case pretty well, so if someone still thinks
another way would be better, I'd very much like to read a good post describing
how to achieve as good a result, including how it is better on critical resource
usage.
If you truly read this post
completely, and gave thought to it, I thank you from the bottom of my heart.
It took me about five hours to write this.