Support SHA-2 blobrefs (migrate away from sha1-*) #537
Comments
For base64 blobrefs, it may be useful to look at RFC 6920 "Naming Things with Hashes", which defines a base64-encoded hash format for content-addressable names. For example: ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk It uses "base64url", which uses '_' and '-' instead of '/' and '+' so that the base64 can be easily put in a URL, for example if using it to fetch an image for a blog. http://tools.ietf.org/html/rfc6920 |
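To make the base64url idea concrete, here is a minimal Go sketch that hashes some bytes with SHA-256 and encodes the digest with the unpadded URL-safe alphabet, as RFC 6920 does. The function name `base64urlSHA256` is made up for illustration and is not part of Camlistore. The digest in the RFC 6920 example is the SHA-256 of the ASCII text "Hello World!", so the sketch should reproduce it.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// base64urlSHA256 is a hypothetical helper: it returns the SHA-256 digest
// of data encoded as unpadded base64url ('-' and '_' instead of '+' and '/').
func base64urlSHA256(data []byte) string {
	sum := sha256.Sum256(data)
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	digest := base64urlSHA256([]byte("Hello World!"))
	// A 32-byte digest always encodes to 43 base64url characters.
	fmt.Println("ni:///sha-256;" + digest)
	fmt.Println(len(digest))
}
```

Unpadded output is used here because a trailing '=' would need escaping in some URL contexts.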
Yeah, base64url was what I was imagining we'd do. I don't plan to use the "ni:" scheme, though. We'll stick with Camlistore's blobref format. Moving away from that would be a lot more work, for little gain. We can also support "ni:///" in other places where necessary. We need to decide which blobref prefix to use ("sha3-"?) and which SHA-3 output size to use. And whether we put the output size into the blobref prefix or not. We could infer it from the length, but then we can't tell the difference between truncated blobref strings, if they get truncated at an unfortunate spot. So we could go with e.g. "sha3_256-" which is kinda gross. Or "sha3-" for now (implying whatever we pick, e.g. 256), and add "sha3_512-" later if we need a different one. Thoughts? |
If we find a cute way to be implicit all along ("sha3s-" - s for short - for 256, "sha3l-" - l for long - for 512?) I think it'd be ok, but starting implicit and switching later to specifying the size seems inconsistent to me. In any case, if I wanted to bikeshed, I think we should stick with the notation on Wikipedia when being explicit about the size: "sha3-256-" and "sha3-512-". |
Are you planning to use base64 for files and folders in the Camlistore/blobs/sha3 folder? I ask because Windows and OS X filenames are case insensitive. There would be a chance of collision, although very small. Or would you keep using the hex of the sha3 hash for the blob store even if the blobref uses base64? |
Good point. We'd probably need to use hex of the hash for the "localdisk" blob store (the one that stores one blob per file). |
I estimate the probability of collision for sha3_256 as about one in 2^180 (math below *). When you are about to store a blob and see the same file name, you could re-scan the first few bytes of the file to see if it is really different. Do you already do this when you are about to overwrite a blob file? (*) Sha3_256 has 43 characters in the base64 file name. For simplicity, assume that the base64 character set has only the 52 letters, ignoring the other 12 symbols. A collision would happen if any of the 43 characters flips between upper and lower case. So there are 2^43 other file names that collide. This is out of 2^256 possible file names. So the chance of one other file colliding is 2^43 / 2^256 or one in 2^213. If over the lifetime of the blob store you generate 2^33 files, then the chance of collision is one in 2^180. |
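The arithmetic in the estimate above can be written out as follows, keeping the comment's simplifying assumptions (43 letters that can each flip case, $n = 2^{33}$ stored files). The first two terms restate the comment's figures. The third is an added observation: if the question is whether any two of the $n$ files collide, rather than one given file against the rest, a birthday-style union bound over all pairs gives a larger but still negligible figure.

```latex
p_{\text{pair}} = \frac{2^{43}}{2^{256}} = 2^{-213}, \qquad
p_{\text{one file}} \le n \, p_{\text{pair}} = 2^{33} \cdot 2^{-213} = 2^{-180}, \qquad
p_{\text{any pair}} \le \binom{n}{2} \, p_{\text{pair}} \approx \frac{n^{2}}{2} \, p_{\text{pair}} = 2^{65} \cdot 2^{-213} = 2^{-148}
```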
For case-insensitive file systems (Windows and OS X) you can put a special character after each upper-case letter. For example "sha3-a2B4C6d" becomes "sha3-a2B!4C!6d". It's easy to recover the original base64 by removing all "!". And the file name is still shorter 90% of the time than using only hex characters. |
How will this affect detection of identical blobs? Will we compute both sha1 and sha3 to be sure (and possibly more in the future), or do we just give up and store the same blob multiple times? |
I'm leaning towards SHA-224, and keeping it in hex format, not base64. It's not much longer: https://play.golang.org/p/zIg749SJou6
|
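As a rough illustration of the "not much longer" point, this Go sketch builds hex blobref-style strings for SHA-1 and SHA-224 and compares their lengths. The helpers `sha1Ref` and `sha224Ref` are made up for this example and only mimic the "prefix-hex" shape discussed in the thread. They are not Camlistore's actual constructors.

```go
package main

import (
	"crypto/sha1"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sha1Ref (hypothetical) formats data as "sha1-" plus 40 hex digits.
func sha1Ref(data []byte) string {
	sum := sha1.Sum(data)
	return "sha1-" + hex.EncodeToString(sum[:])
}

// sha224Ref (hypothetical) formats data as "sha224-" plus 56 hex digits.
// SHA-224 lives in the crypto/sha256 package.
func sha224Ref(data []byte) string {
	sum := sha256.Sum224(data)
	return "sha224-" + hex.EncodeToString(sum[:])
}

func main() {
	data := []byte("hello, blob")
	old, updated := sha1Ref(data), sha224Ref(data)
	fmt.Println(old, len(old))         // 45 characters in total
	fmt.Println(updated, len(updated)) // 63 characters in total
}
```

For comparison, a full SHA-256 digest in hex would be 64 digits, versus 43 characters in unpadded base64url.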
Any thoughts on using multiformats? https://github.com/multiformats
This would give the advantage of having self-describing hashes.
…On Mon, Jan 8, 2018 at 11:05 AM mpl ***@***.***> wrote:
https://camlistore-review.googlesource.com/c/camlistore/+/13066
|
@lindner, we're already effectively using multiformats, in an even more verbose & data-archaeological-proof way. And our format predated that proposal, otherwise we might've considered it. I don't see a compelling advantage to hex encoding our "sha224-" prefix to save a few bytes. Actually they don't even have sha2-224 registered with a unique number. |
TODO list for this bug:
|
Updates #537 Change-Id: I3966697cbdb05ca4b380974be604deebdaa258c2
… sha-1 Remove the blob.SHA{1,224}From{Bytes,String} constructors too. No longer used. This adds blob.RefFromBytes which was missing. We had blob.RefFromString. Now everything uses blob.RefFrom* instead of specifying a hash function. Some tests set a flag to force use of SHA-1 because there was too much golden data to update. We can remove those one-by-one over time as we fix up tests. Updates #537 Change-Id: Ibe6428089a6221594c2b751f53f98b03b5a28dc2
Punting the remaining stuff (config, mostly) to 1.0. |
SHA-224 is a truncated version of SHA-256. Why not use the full SHA-256? |