Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support SHA-2 blobrefs (migrate away from sha1-*) #537

Open
bradfitz opened this issue Oct 21, 2014 · 21 comments
Open

Support SHA-2 blobrefs (migrate away from sha1-*) #537

bradfitz opened this issue Oct 21, 2014 · 21 comments
Assignees
Milestone

Comments

@bradfitz
Copy link
Contributor

Camlistore 0.9 should support something besides just SHA-1, like SHA-3. Should probably
then also make those use base64 blobrefs, instead of hex.
@camlistorebot
Copy link

Comment 1 by jefft0:

For base64 blobrefs, it may be useful to look at RFC 6920 "Naming Things with Hashes",
which defines a base64 hash for content-addressable.  For example:
ni:///sha-3;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk
It uses "base64url" which uses '_' and '-' instead of '/' and '+' so that the base64 can
be easily put in a URL, for example if using it to fetch an image for a blog.
http://tools.ietf.org/html/rfc6920

@bradfitz
Copy link
Contributor Author

bradfitz commented Dec 3, 2014

Comment 2:

Yeah, base64url was what I was imagining we'd do.
I don't plan to use the "ni:" scheme, though. We'll stick with Camlistore's blobref
format, though. Moving away from that would be a lot more work, for little gain. We can
also support "ni:///" in other places where necessary.
We need to decide which blobref prefix to use ("sha3-"?) and which SHA-3 output size to
use.
And whether we put the output size into the blobref prefix or not. We could infer it
from the length, but then we can't tell the difference between truncated blobref
strings, if they get truncated at an unfortunate spot.  So we could go with e.g.
"sha3_256-" which is kinda gross. Or "sha3-" for now (implying whatever we pick, e.g.
256), and add "sha3_512-" later if we need a different one.
Thoughts?

@mpl
Copy link
Contributor

mpl commented Dec 3, 2014

Comment 3:

If we find a cute way to be implicit all along ("sha3s-" - s for short - for 256,
"sha3l-" - l for long - for 512?) I think it'd be ok, but starting implicit and
switching later to precising the size seems inconsistent to me.
In any case, if I wanted to bikeshedd, I think we should stick with the notation on
wikipedia when being explicit about the size: "sha3-256-" and "sha3-512-".

@bradfitz
Copy link
Contributor Author

bradfitz commented Dec 3, 2014

Comment 4:

My only concern with "sha3-256-" is that it adds a hyphen in the middle of the blobref
type, breaking our usual convention of /^(\w+)-(.+)$/.  I'd be okay with "sha3_256-",
but is a bit uglier and not the normal spelling.

@camlistorebot
Copy link

Comment 5 by jefft0:

There are hits if you Google for "sha3256".

@bradfitz bradfitz self-assigned this Dec 5, 2014
@jefft0
Copy link

jefft0 commented Dec 29, 2014

Are you planning to use base64 for files and folders in the Camlistore/blobs/sha3 folder? I ask because Windows and OS X filenames are case insensitive. There would be a chance of collision, although very small. Or would you keep using the hex of the sha3 hash for the blob store even if the blobref uses base64?

@bradfitz
Copy link
Contributor Author

Good point. We'd probably need to use hex of the hash for the "localdisk" blob store (the one that stores one blob per file).

@jefft0
Copy link

jefft0 commented Dec 30, 2014

I estimate the probability of collision for sha3_256 as about one in 2^180 (math below *). When you are about to store a blob and see the same file name, you could re-scan the first few bytes of the file to see if it is really different. Do you already do this when you are about to overwrite a blob file?

(*) Sha3_256 has 43 characters in the base64 file name. For simplicity, assume that the base64 character set has only the 52 letters, ignoring the other 12 symbols. A collision would happen if any of the 43 characters flips between upper and lower case. So there are 2^43 other file names that collide. This is out of 2^256 possible file names. So the chance of one other file colliding is 2^43 / 2^256 or one in 2^213. If over the lifetime of the blob store you generate 2^33 files, then the chance of collision is one in 2^180.

@bradfitz bradfitz modified the milestone: 0.9 Jan 9, 2015
@jefft0
Copy link

jefft0 commented Mar 25, 2015

For case-insensitive file systems (Windows and OS X) you can put a special character after each upper-case letter. For example "sha3-a2B4C6d" becomes "sha3-a2B!4C!6d". It's easy to recover the original base64 by removing all "!". And the file name is still shorter 90% of the time than using only hex characters.

@hullerob
Copy link

How will this affect detection of identical blobs? Will we compute both sha1 and sha3 to be sure (and possibly more in the future) or do we just give up and have same blob multiple times?

@bradfitz bradfitz modified the milestones: 0.10, 0.9 Mar 31, 2015
@bradfitz bradfitz removed the accepted label Apr 25, 2016
@mpl
Copy link
Contributor

mpl commented May 31, 2017

@bradfitz bradfitz changed the title Support SHA-3, base64 blobrefs Support SHA-256 blobrefs (migrate away from sha1-*) Jan 7, 2018
@bradfitz
Copy link
Contributor Author

bradfitz commented Jan 7, 2018

I'm leaning towards SHA-224, and keeping it in hex format, not base64. It's not much longer:

https://play.golang.org/p/zIg749SJou6

sha1	= da39a3ee5e6b4b0d3255bfef95601890afd80709
sha224	= d14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac5b3e42f

@mpl
Copy link
Contributor

mpl commented Jan 8, 2018

@lindner
Copy link
Member

lindner commented Jan 8, 2018 via email

@bradfitz
Copy link
Contributor Author

bradfitz commented Jan 9, 2018

@lindner, we're already effectively using multiformats, in an even more verbose & data-archaeological-proof way. And our format predated that proposal, otherwise we might've considered it. I don't see a compelling advantage to hex encoding our "sha224-" prefix to save a few bytes. Actually they don't even have sha2-224 registered with a unique number.

@bradfitz
Copy link
Contributor Author

bradfitz commented Jan 9, 2018

TODO list for this bug:

  • support sha224 by default, https://camlistore-review.googlesource.com/c/camlistore/+/13066
  • audit/remove all callers of "crypto/sha1" or blob.SHA1FromString or blob.SHA1FromBytes in the code, and use the generic blob.RefFromString etc.
  • add server config to let people control whether sha1 is accepted by servers (off by default, but we'll want to let people turn it on for a bit while clients are updated)
  • update docs & examples to sha224
  • support wholeref queries of multiple hashes, and implement for client

camlistorebot pushed a commit that referenced this issue Jan 9, 2018
Updates #537

Change-Id: I3966697cbdb05ca4b380974be604deebdaa258c2
camlistorebot pushed a commit that referenced this issue Jan 10, 2018
… sha-1

Remove the blob.SHA{1,224}From{Bytes,String} constructors too. No
longer used. This adds blob.RefFromBytes which was missing. We had
blob.RefFromString. Now everything uses blob.RefFrom* instead of
specifying a hash function.

Some tests set a flag to force use of SHA-1 because there was too much
golden data to update. We can remove those one-by-one over time as we
fix up tests.

Updates #537

Change-Id: Ibe6428089a6221594c2b751f53f98b03b5a28dc2
@mpl
Copy link
Contributor

mpl commented Jan 18, 2018

@mpl
Copy link
Contributor

mpl commented Jan 23, 2018

camlistorebot pushed a commit that referenced this issue Jan 23, 2018
Follow-up of ec66bcc

Updates #537

Change-Id: Ib4dc62ae6be91c061dc6de4bfdd617785abdaab6
@mpl
Copy link
Contributor

mpl commented Jan 24, 2018

camlistorebot pushed a commit that referenced this issue Jan 25, 2018
Follow-up of ec66bcc

This change should be the last of its kind, in the corpus at least, and
should make most searches and the web UI usable again, with both kinds
of hashes (sha1 and sha224).

Updates #537

Change-Id: Icfe9e8aaab031313612c555b7601895aeba16a7c
@bradfitz bradfitz changed the title Support SHA-256 blobrefs (migrate away from sha1-*) Support SHA-2 blobrefs (migrate away from sha1-*) Jan 31, 2018
@bradfitz bradfitz modified the milestones: 0.10, 1.0 May 2, 2018
@bradfitz
Copy link
Contributor Author

bradfitz commented May 2, 2018

Punting the remaining stuff (config, mostly) to 1.0.

@milahu
Copy link

milahu commented Nov 4, 2021

I'm leaning towards SHA-224

SHA224 is a truncated version of SHA256

why not use the full SHA256?
why increase the risk of collisions, just to save a few bytes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants