Original subject: [patch] Merkle Hash Tree Against Corruption

Posted: Mar 14 2003

In case anyone cares: I do have experience in programming and software design, as that's what I do for living. But I'm not a cryptography expert - I just use cryptography within software, and know just enough to implement existing algorithms, as I've sometimes done, from the algorithm description. I'm not doing any code on eMule because I don't have VS 7 which would be required for that, and I'm not willing to shell out a thousand euros (or however much it costs) just to start developing eMule. Neither am I willing to use VS 7 without proper license, so don't even think about telling me how easy it is to download without first sending me proper license papers.


First, an algorithm that would produce result vectors verifiable against the current ed2k-hash or the md4 parthashes within is quite impossible. The only thing you could do would be to send md4 intermittent result vectors that would again be used as starting vectors for the md4 transformation in the next block, and verifying that hash intermittent result vector list is not possible without having the correct data, at which point it's not useful to any degree any more. Thus, a new method that is incompatible with the old ed2k-hash/md4 parthashes is needed, and when it's already going to be incompatible with existing method, dumping md4 altogether and going for a cryptographically secure hash algorithm is the only reasonable course of action.

This doesn't mean that the old ed2k/md4 parthashes would be dumped. Of course they would be used as the first method, and the new method would primarly be a way to pinpoint corruption to blocks that are way smaller than 9500kB. If the new hash becomes prelevant at some point, and support for it is added to other clients and even servers, we could then, and only then, drop current ed2k/md4 parthash method and start using the new method as the primary method. Probably would have to support old ed2k/md4 parthash method still for a long time to maintain backwards compatibility with old clients and ed2k link websites, forums and databases.

Regarding the resources needed to create invalid data that collides with the correct hash is considered trivial. I could probably produce one bad 64 byte block per day with my current main workstation, so if any organization with any moneytary support at all wants to use that method to attack ed2k network, it'd not be hard at all to do so. Remember that the attacker can choose any part of the file himself, knowing each and every intermittent result vector in the file (generating such a list is trivial - perhaps a minute on average PC for an 800MB file), so he's got all the cards to help him in corrupting the file data.
And no, I don't have any idea whether that attack method is used or not. There are easier ways to attack ed2k network still. Which is indeed one of the reasons why we're talking about new hashtree extension in the first place. Nice additional benefit about a hashtree with cryptographically secure hash algorithm is of course that hash collision attack becomes extremely hard to do, also.

I don't really care one way or another about having one more database of files with hashes. There are already numerous large collections of ed2k links by known good releasers, so one more would be just Yet Another Link Collection, nothing more.
However, if that just happens by choosing one algorithm from many that are equally good, why not? Still, "equally good" may be different for different people, and my points are from the point of view of a fast uploader that shares a lot (>>200GB of files online that may be shared and are listed in known.met due to having been shared at one point, ~180GB of files shared most of the time, 800 kilobytes/sec or higher upload limit all the time), and therefore CPU, disk I/O, and diskspace requirements all are considered important.

With that out of way...


For the rest of this post I assume Tiger-192 as the hash algorithm. For reference once more:
SHA1 is 160 bit hash (20 bytes per node)
Tiger is 192 bit hash (24 bytes per node)
SHA256 is 256 bit hash (32 bytes per node)
Thus, you whenever I mention amount of hash data, if you want to know how much it would be using SHA1 instead of Tiger, divide by 24 and multiply by 20 (1/6th less), and using SHA256 divide by 24 and multiply by 32 (1/3rd more).


Why a treehash?
Because in a tree, siblings of any parent can be verified against that parent, and a parent can be created by knowing the siblings. Therefore, knowing any level in full allows for construction of the whole tree from that level upwards, and verifying against a single node - the top of the tree - which also acts as the filehash.
This also leads to the fact that the tree is completely verifiable down to any set of nodes for which all siblings are known recursively once one has the top node. Eg. if you have all siblings from one branch of level 13, the siblings of the parent (at L14) of that branch, the siblings of the parent of that branch, and so on, up to the top, you can verify the validity of all the nodes within that branch.
Of course this also means that when you have higher levels of the tree completely verified, any subtree below can be verified against the closest parent. Any subtree can be treated as if it were a tree of it's own.
There is absolutely no need to try to protect the tree by hashing the hashdata with a separate hash. For as long as the top is known by the client receiveing tree's hashdata.

Why binary tree and not higher branching factor?
As all siblings are needed to verify against parent or calculate the parent, with binary tree it means exactly one sibling for each node, while with higher branching factor higher number of siblings are needed. While binary tree is deeper than a tree with higher branching factor for any equal number of leaves, the binary tree has maximum of (2*leaves-1) nodes, and a tree with branching factor equal to the number of leaves (which is more specifically a sequential block hash - a special subset of treehash) the number of nodes is (leaves+1). The maximum difference is roughly twofold with a high number of leaves. The size of a binary tree of just one level lower (or exactly twice the blocksize) is always less than the size of the sequential block hash. Therefore, when the size of the hashdata total is very small compared to the data to be verified with the hashdata, the difference in the amount of hashdata that must be transmitted is ignisificant. Also, a full tree can be reconsructed from just the leaves and the top, which would be exactly the same number of nodes as a sequential block hash for the same data.
Therefore, there is no such case where a binary tree would require transmitting more hashdata than a sequential block hash (and therefore any treehash of higher branching factor than 2) would.

Why such a small atomic block size as 1 kilobyte?
Because atomic block size defines the absolute smallest blocksize that can be verified. However, when the leaf nodes (hashes which have been calculated from a chunk of original data, not from a chunk of hashdata) are not available, any block down to which a node is available can be verified.
This means that if the leaf is not available, but a L7 node (128kB blocksize) is available, the 128kB block can be verified against the corresponding L7 node. That is, the blocksize may be variable depending on availability of nodes and the need to verify small blocks. The need is defined by the client wanting to verify a block (usually trying to get lower level nodes when it detects corruption in a large block), and the availability is defined by the clients sharing the blocks (dependant on the resources available for sending hashdata). It's extremely unlikely that leaf nodes with atomic block size of 1kB would ever be available, as that'd take huge amounts of critical resources (diskspace and/or CPU power, and bandwidth). Note that even while bandwidth is listed in critical resources, technically one 1kB block from an 800MB file can be verified by a hashpath of 20 nodes deep (only siblings are needed for each node on the path, as with the sibling the parent can be calculated, leading up to the top hash). I may be off by one or two in that hashpath depth, but considering the total number of nodes needed, the amount of hashdata is completely insignificant compared to the filesize.


OK, onto the critical resource issues and the protocol requirements to optimize critical resource usage..

Critical resources are:
-Bandwidth
-Disk space
-Disk I/O
-CPU
-Memory

Memory usage shouldn't be high, as code isn't a big issue, and hashdata is insignificant compared to filedata.

CPU usage is high on hashing, but not more so than current ed2k hash creation. Also, if they are done in the same disk I/O cycle (read block, hash with md4, hash with Tiger, continue) then disk I/O isn't an issue. Hashing is I/O bound, so hashing times shouldn't go up more than a small fraction. In treehash about twice the number of hash operations will be executed as in the sequential block hash, and all in all this means that hashing will take approximately three times the CPU power it takes with just ed2k hash, assuming both hash algorithms take the same time to execute for the same amount of data. If hashing cycle is planned well, there is also a potential of keeping both hash algorithms in the CPU's cache, and hashing the data brought into CPU's cache on one go, thus removing memory I/O from the critical dual-hash path.

Disk space requirements define the level down to which hashes can be stored. Assuming storing of down to 8MB blocksize (L13) only would mean just 128 leaves per gigabyte of data, or ~6kB of hashdata per gigabyte. Which is completely insignificant - it's only about 3.5 times the ed2k parthashes for the same amount of data.
Storing down to 8kB blocksize (L3) would mean 128k leaves per gigabyte of data, or ~6MB of hashdata per gigabyte. Which is no longer insignificant, even if it's only 0.6% of the filedata. It'd push me to well over a gigabyte of hashdata, which I don't consider reasonable.
Storing down to 128kB blocksize (L7) would mean 8k leaves per gigabyte of data, or ~384kB of hashdata per gigabyte. At 0.037% of filedata, it'd push me to perhaps about a 100MB of hashdata, which is still reasonable.

At L7 storage (and storing full tree down to L7), verifying blocks at 128kB would be possible. That's still acceptable, and compared to verifying at 9500kB blocksize is excellent.

Halving storage requirement by storing just the L7 (or other chosen level) hashes and the top hash would mean having to dynamically create missing hash levels on the fly, and also that if corruption of hashdata container is noticed, all hashes for the file whose hashtree has corrupted have to be invalidated and the file completely rehashed. Storing full tree would mean that corruption could be detected and pinpointed to specific branch.

Also, the mule could at startup verify all hashdata in the container, as hashing tens or even hundreds of megabytes can be done in a matter of seconds, or at most minutes. Few people close and restart mule so often that it'd be even an annoyance - what's one more minute in startup time when you run it continously for hours or days at a time?

Note that I vehemently oppose dynamic hash creation, as it'd put additional disk I/O and CPU strain on the uploading client, and for highspeed uploaders the mule already takes a hefty toll on CPU, possibly thus pushing mule from bandwidth bound to CPU bound task, which is definitely not what we want to do. If you want to know what kind of CPU usage mule has on a highspeed uploader, try uploading a megabyte a second, divided to over a hundred upload slots, with a few thousand clients (that will actually be served in reasonable time) in queue. If you don't feel close to CPU bound at that point, I want your code (if it's optimized so well) or your computer (if it's due to hefty CPU).


Now to bandwidth, which is for most people the most critical resource. Especially upstream bandwidth..

First, remember that to verify anything, you need the full hashpath up to the top from the block being verified. And that the path is just 20 or so nodes for an 800MB file, or just about half a kilobyte. And, as the hashpath down to L0 wouldn't be available, from top to L7 it'd be less than 15 nodes, so down to <360 bytes.
However, as within the 9500kB that has been noticed as corrupted no single block can, with ed2k parthashes, be singled out for verification, the whole chunk must be verified. To do that, hashes for each block in the chunk must be obtained. While that could be done by requesting each hashpath down to the L7 (or whatever level is available) block. However, as the neighbouring hashpaths unite, at worst halfway through the file, so the hashpaths separate at the top two hashpaths down to about 38 neighbouring L7 nodes each must be obtained. 38 nodes completely unite at 6 levels above, or at L13 (8MB), producing thus two trees of <63 nodes each. Also, from L13 to top at 800MB filesize the paths are seven nodes long, thus requiring 13 more nodes (paths unite at top). Total count thus would be <139 nodes in the worst case to obtain L7 hashes for any single chunk in an 800MB file. Total hashdata needed would thus be <3.4kB. Which definitely is worth it to grab when corruption can be pinpointed to a single (or multiple, which of course would be worse) 128kB block instead of a 9500kB chunk.
Of course once a client has obtained a set of nodes, it will store them as well, and will never again need more hashes to re-verify that chunk, or any new blocks received from any client, trying to find a good block to replace the corrupted one with.

What does this mean?
That if we don't care about anything else but being able to verify a corrupted chunk at a finer level:
- store only the finest level chosen by the user
- store list of leaves that correspond to each chunk
- add to the protocol a request that addresses the nodes by chunk number
- add to the protocol a response to the above request that trasmits the top hash and the nodes corresponding to the chunk

However!
Doing a more flexible implementation would not be much more of a burden on bandwidth, and would allow the mule to truly take a step ahead.
- How about sharing blocks even when a full chunk has not been completed? Would probably help file spreading a lot, especially in cases where the release distro is doing a bad job and even pulls out before all chunks have been well spread.
- Noticing a bad client when only a small block has been received? Why download a full corrupted chunk when you can get rid of the bad client after just a small block instead.
More reasons needed? I don't know what all benefits could be gained by this, but I'm sure you can come up new benefits above the two I listed. And even those two are big ones.


So, what's the cost?

First, if a client downloaded a suitable treetop when it gets the normal ed2k parthash list, it'd be a long way towards hashing the whole file with the new treehash. And as noted already, a treetop down to 8MB blocksize (L13) would be just ~3.5 times as much as the parthash list, whatever the file size. And it is downloaded only once, just like that parthash list is. Or, if only leaves and top were transmitted, it'd be just 1.7 times the size of the normal parthash list, although like the parthash list, unless it is complete and correct, knowing which node is corrupt is impossible, and the whole list would have to be invalidated and reasked.

When the first time asking for a chunk, the subtree from L13 down to L7 would also be requested. At 6 levels, it's size is 64 leaves or 122 nodes (the topnode, L13, is already known), or a little less than 3kB (assuming sharing at 8MB blocksize) or twice that to cover any single chunk. This list is downloaded once per chunk only, and when a branch from L13 to L7 hits two chunks' areas, it's downloaded only once - first time either of those chunks is requested.

Overall, the hashtree down to L7 for 800MB file would be 300kB in size, or just 0.037% of the filedata. In all overhead traffic that's not much, and it'd allow for corruption detection, bad client detection, and sharing smaller blocks than 9500kB chunks.

Minimal implementation in protocol would be:
- request for treetop
- response to treetop request
- request for L13-L7 branch
- response to L13-L7 branch

And limits in this case would be trivially easy:
- a client may request treetop just as it may request for parthash list
- a client may request for subtree when it requests a chunk. Such request is fullfilled only when the client gets an upload slot, and is fullfilled only for the branches it has requested and the uploading client wishes to serve - eg. send a branch only when starting upload of filedata in that branch.

This would be very similar to Champ's proposal, except that the low level would L7, not L3 (which is insanely small due to bandwidth, diskspace and/or CPU and disk I/O critical resources usage), and that branches would be transferred not just when corruption is noticed and is to be pinpointed within the chunk, but already when a chunk upload is started. This would allow for checking for corruption already when receiving, not just trying to pinpoint exact location when it has been noticed - which may be after hundreds or thousands of bad kilobytes too late.

Note that a client that wishes only to locate corrupted block within a chunk could still, with these control messages, achieve very closely the same result as with the "absolute minimum corruption detection only" proposal. It could refrain from requesting new hash branches until it detects corruption with the old ed2k/md4 parthashes. After that, it would request the treetop, and the branches covering that chunk. The treetop it would get as easily as it would get the hashpart list, while for the branches, it would have to wait in upload queue until slot.


Still, this would not be a satisfactory solution. Why? Because it's rigid.

What if we notice that L13 blocksize is not the optimal treetop bottom? Or that it'd be worth the resources to use even finer grained block to find corruption? Or find some other reason to try other levels, branches, or paths, and find more optimal solutions?

There are at least two reasonable ways to map the tree to a list. With one, any branch can be addressed as a list of consecutive nodes, and with the other, any level can be addressed as a list of consecutive nodes.
Because any branch is a fully verifiable self-contained unit, and also chunks are easiest to map via branches, I think fetching branches easily is the best way to go.
Why not levels? Because there is no case where you would need a full level where a branch wouldn't do at least as well.
I don't know if a mapping where any full path can be addressed as a consecutive list of nodes exists. Neither do I consider such mapping very useful compared to branch mapping.

So, the simplest way of providing access to any branch, from any level to any other level, is to store the tree linearly mapped in such way that branches can be addressed as lists. Of course it would be possible to rather add a mapping type parameter to the request opcode.

So, to move from rigid two-step approach to more extensible approach while maintaining the ability to request exactly the treetop and branches as per the two-step approach, the request and response could be:
- request hashtree for <ed2k hash> of <hash algo> <hash subtype> mapped as <mapping type> from <first node> to <last node> inclusive
- respond with the obvious error messages
- file not found
- hash algo not supported
- hash subtype not supported
- mapping type not supported
- branch not available
- respond repeating request parameters except with first node being the first node within the range that is available, and last node being the last node within the range that is available, and the nodelist


So, after all this, how do we know that the hashtree received is for the file with that ed2k hash?

When a client has received a complete chunk and has verified it against the ed2k/md4 parthash as usual, coming to the conclusion that it is correct and valid chunk, chunk data can be verified using hashtree as has been received. If the data hashes correctly and matches the hashes in the available hashtree, the hashtree is correct and valid.
That is the only way I can think of, as with old ed2k hash, the smallest verifiable unit is a chunk.


So, that's it. I've now explained my case pretty well, so if someone still thinks another way would be better, I'd very much like to read a good post describing how to achieve as good a result, including how it is better on critical resource usage.

If you truly read this post completely, and gave thought to it, I thank you from the bottom of my heart. It took me about five hours to write this.