new "diskpacked" local disk storage format #197

Closed
bradfitz opened this issue Aug 10, 2013 · 28 comments

Comments

@bradfitz
Contributor

The current "localdisk" storage type stores each blob in its own file.  This
has pros and cons: it's easy to reason about and debug, it works well with concurrent
readers & writers (if you were to run multiple servers over the same data), and it
works well on network filesystems.  But on the downside, it has locality problems if the
underlying disk isn't an SSD.  Because a file is cut into many chunks, the chunks aren't
likely to be contiguous on disk.

This bug is about adding a new filesystem-based storage target, with different
properties.  (We won't remove the current one.)

The new "diskpacked" storage type would just append all blobs one after
another in a big file, or directly to a raw block device.  An index (using Camlistore's
index.Storage abstraction, so any index/db could be used) would then store the metadata
of which blobs exist and their size and offset in the big on-disk file(s)/blockdevice(s).

In practice, when using the filesystem instead of a block device, the big files would
probably start a new packed chunk every few hundred MB.  This could be configurable.

Also, there will be a tiny header/footer around each blob in the packs:

[BEGIN sha1-f1d2d2f924e986ac86fdf7b36c94bcdf32beec15 4]
foo
[END]
[BEGIN ...

... so the index can be recovered from the packs if needed.  And it's still
self-describing (a goal everywhere in Camlistore: things should be reconstructable many
years in the future).
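For concreteness, a minimal sketch (not the eventual implementation, and assuming the
footer shown above is kept) of appending one blob to a pack file; the index would record
the returned data offset together with the blob's size:

package diskpackedsketch

import (
        "fmt"
        "os"
)

// appendBlob appends one blob to an open pack file in the framed format
// described above and returns the offset where the blob's data begins.
func appendBlob(pack *os.File, ref string, data []byte) (dataOff int64, err error) {
        end, err := pack.Seek(0, os.SEEK_END)
        if err != nil {
                return 0, err
        }
        header := fmt.Sprintf("[BEGIN %s %d]\n", ref, len(data))
        if _, err := pack.WriteString(header); err != nil {
                return 0, err
        }
        if _, err := pack.Write(data); err != nil {
                return 0, err
        }
        if _, err := pack.WriteString("\n[END]\n"); err != nil {
                return 0, err
        }
        return end + int64(len(header)), nil
}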

(From discussion with adg@ in Sydney)
@nf
Contributor

nf commented Aug 11, 2013

Comment 1:

Is the footer necessary? I think this should suffice:
[sha1-f1d2d2f924e986ac86fdf7b36c94bcdf32beec15 4]
foo
[sha1-...
Do you intend to do this soon (you made yourself "owner")? If not, I might tackle it.

@bradfitz
Contributor Author

Comment 2:

The footer isn't necessary.
If you want to do it soon, it's all yours.

@nf
Contributor

nf commented Aug 12, 2013

Comment 3:

Will mark as "Started" and take ownership if/when I do.

@wathiede
Contributor

Comment 4:

As part of https://camlistore-review.googlesource.com/#/c/536/ I did some profiling. 
The post-536 profile looks like
https://googledrive.com/host/0B7S33WfKVjF0Z2dVSWVtelpmbFU/local-check-last.svg
If I'm reading that correctly, the localdisk could benefit from an index.Storage too. 
It seems like enumerating blobs generates a lot of garbage, just from getting the
dirnames (or does disk seek time cause that to show up in the profile more than it
should?)  If it is just slow access times causing that to show up in the profile, then
having an indexer that could more efficiently enumerate blobs could go a long way to
making the localdisk storage backend better.
I'm in favor of the work this issue sets out to perform; I just wanted to present a
middle ground that may be worth exploring.

@bradfitz
Contributor Author

Comment 5:

Interesting!  That's surprisingly expensive.
Looks like os.File.Readdirnames with a non-negative n makes it allocate the whole buffer
(512 KB, for 32k of strings) each call.
We should use Readdirnames(-1) instead.
Could you try that in enumerate.go first?  (Adding an index to localdisk makes sense
too, but I want to keep localdisk dead simple, with no moving parts, nothing to corrupt,
and easy to debug.  And we don't have a good pure-Go index yet.)
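A minimal sketch of what that change amounts to (the function and the sorting step are
illustrative, not the real enumerate.go code):

package localdisksketch

import (
        "os"
        "sort"
)

// readDirNames reads every name in dir with a single Readdirnames(-1) call
// (n <= 0 means "read everything"), instead of looping with a positive n and
// paying for a fresh buffer on each call.
func readDirNames(dir string) ([]string, error) {
        f, err := os.Open(dir)
        if err != nil {
                return nil, err
        }
        defer f.Close()
        names, err := f.Readdirnames(-1)
        if err != nil {
                return nil, err
        }
        sort.Strings(names) // enumeration wants blob names in sorted order
        return names, nil
}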

@wathiede
Contributor

@tgulacsi
Contributor

Comment 7:

If you don't want to reinvent the wheel, we could strip the inner parts out of weed-fs,
which does this (it stores files in a volume, reparsably, but uses a persistent
in-memory index for them, plus a volume manager with topologies etc. that we don't need
here), implementing the Facebook Haystack paper.
Not a big thing, but well thought out: 8-padded blocks, a nice needle structure, and a
rebuildable, space-efficient in-memory volume index.
My only concern is the question: which volume should I start searching in? It is
solvable by several methods (search every volume, or split volumes by the first n bytes
of the hash, or keep a separate index for this), each with its own downsides.
I started a new blobstore with weed-fs precisely because one-file-per-chunk doesn't
scale. So a more direct solution (local volumes, not a remote server with its own
replication) would be better for me!

@bradfitz
Contributor Author

Comment 8:

diskpacked is so easy to do natively, it's just not worth a dependency on a new moving
part.

@tgulacsi
Contributor

Comment 9:

OK, another idea you can shoot down: the tar format.
I have an old production system for file storage where the volumes are just tars. The
main benefit is that tar is a very simple, portable format, and the tools to manage it
already exist.
The drawback is its relative overhead: data is padded to 512-byte blocks, and each entry
adds a 512-byte header block (plus end-of-archive blocks). But this also means that
reindexing is easy, and the format is robust.
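To make that overhead concrete, a hedged sketch (not part of any patch in this thread)
of appending one blob as a tar entry named after its ref, using only the standard
library:

package tarvolumesketch

import (
        "archive/tar"
        "os"
)

// appendToVolume writes one blob as a tar entry.  Each entry costs a 512-byte
// header block plus the data rounded up to a multiple of 512 bytes, which is
// the size overhead mentioned above.
func appendToVolume(volume *os.File, ref string, data []byte) error {
        tw := tar.NewWriter(volume)
        hdr := &tar.Header{
                Name: ref,
                Mode: 0644,
                Size: int64(len(data)),
        }
        if err := tw.WriteHeader(hdr); err != nil {
                return err
        }
        if _, err := tw.Write(data); err != nil {
                return err
        }
        // Flush pads the entry to the 512-byte boundary; Close would also write
        // the end-of-archive blocks, which an append-only volume can defer.
        return tw.Flush()
}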

@tgulacsi
Contributor

Comment 10:

Plan:
Have a few tars/diskpacks open for appending. Store the index (ref -> filename:pos) in
a key-value db (cznic/kv, for example) right in that dir. Let's call this Level 0.
When any of the tars reaches a size threshold, create a permanent index of that tar's
content in cdb format, move the tar and the cdb to the "1" subdir, and remove
their keys from the Level 0 db. Let's call the "1" subdir and its content Level 1.
When there are too many files in Level n, stack the tars into a new tar file at
Level n+1 and create the corresponding cdb, too; then delete the files from Level n.
Let's call this event a "level jump".
Search goes the other way: it starts at Level n, searches each cdb (in parallel, maybe
using parallel workers), and then descends to Level n-1.  (See the sketch after this
comment.)
If the "too many files" threshold is m, then at most n*m+1 searches are needed, where n
is the number of levels.
This can be minimized if we have only one cdb per level, but that means we have to
merge the new tar's index into the level's cdb whenever a level jump happens.
Another consideration is db file size. I don't know cznic/kv's limits, but I know CDB's
limit: the file must be less than 4 GB. A ref is 20 bytes, a block offset is at least
4 bytes (in 512-byte tar blocks, that allows chunks of up to 2 TB), and with some
indirection (a separate file name -> file id table) a 4-byte file id allows 2^32 tars;
that works out to a maximum of about 82 million items per cdb.
My requirements might approach that: I have about 10 million files, but I don't know the
multiplication factor of the ref-blob chunking (the files are all less than 16 MB).
A second, modified version is to separate the tars and the indices: create tars as big
as is comfortable, and have a cdb for each prefix, until every cdb is under the limit.
(Start with one cdb for everything, then one cdb for each half-byte prefix ([0-9a-f]),
then one per byte ([0-9a-f][0-9a-f]), and so on.)
This gives the minimum number of cdbs possible within CDB's limits, and only one cdb to
search, using the beginning of the ref as the prefix.
The downside is that we have to rebuild (recreate) every cdb whenever a new (full,
finished) tar arrives.
I don't know enough key-value stores to say whether there are better alternatives to
CDB, but maybe the upsides (simple, easy, fast, low overhead, one file) compensate for
the downsides (having to copy a few gigabytes on each cdb change).
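A rough sketch of the lookup order in this plan; the types and the level0/cdbLookup
helpers are hypothetical stand-ins for cznic/kv and a cdb reader, not any real API:

package levelsketch

import "errors"

var errNotFound = errors.New("blob not found")

// packPos says which pack file a blob lives in and where.
type packPos struct {
        file   string
        offset int64
}

// lookup searches the permanent per-level cdb indexes from the highest level
// down, then falls back to the Level 0 key-value store that tracks the tars
// still open for appending.
func lookup(ref string, levelDirs []string, level0 map[string]packPos,
        cdbLookup func(dir, ref string) (packPos, bool)) (packPos, error) {

        for i := len(levelDirs) - 1; i >= 0; i-- { // Level n, n-1, ..., 1
                if pos, ok := cdbLookup(levelDirs[i], ref); ok {
                        return pos, nil
                }
        }
        if pos, ok := level0[ref]; ok {
                return pos, nil
        }
        return packPos{}, errNotFound
}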

@nf
Contributor

nf commented Aug 29, 2013

Comment 11:

I've started work on this. I have receiving and fetching blobs working. I need to do
the queue/partition stuff so I can actually use it as primary blob storage, which I
don't understand very well at the moment.
The indexing is all in-memory at this point, but I have a simple design for a blob index
in mind:
As blobs are added, track them in a "recent" map[blobref string](blobref, packedfile,
offset).
Periodically merge sort those records into a file named "index" of the form
sha1-xxx packedfile offset
sha1-yyy packedfile offset
sha1-zzz packedfile offset
where lines are sorted by blobref.
On blob lookup, check the "recent" map first, then do a binary search of the index file.
Enumeration is just reading the index file top-to-bottom and merge sorting the "recent"
map on the fly.
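A minimal sketch of that lookup path; the entry type is made up, and the sorted
in-memory slice stands in for a binary search over the sorted lines of the on-disk
"index" file:

package packindexsketch

import "sort"

type entry struct {
        ref      string
        packFile string
        offset   int64
}

// lookup checks the "recent" map first, then binary-searches the sorted
// index entries for the blobref.
func lookup(ref string, recent map[string]entry, index []entry) (entry, bool) {
        if e, ok := recent[ref]; ok {
                return e, true
        }
        i := sort.Search(len(index), func(i int) bool { return index[i].ref >= ref })
        if i < len(index) && index[i].ref == ref {
                return index[i], true
        }
        return entry{}, false
}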
I guess the index file will grow pretty big over time. But it can be partitioned pretty
easily.
We could use one of the other indexers but this approach seems simple enough that it's
not worth it.
Comments welcome.

@bradfitz
Contributor Author

Comment 12:

Andrew, it's much easier than you think.
Instead of inventing your own index format, just use the sorted key/value abstraction we
already have in Camlistore, which has multiple implementations (index.Storage).  There's
even "kvfile", an on-disk, pure Go implementation of it.
See how blobserver/pkg/encrypt uses it:
type storage struct {
        // index is the meta index.
        // it's keyed by plaintext blobref.
        // the value is the meta key (encodeMetaValue)
        index index.Storage
...
But instead of doing what encrypt does,
        sto := &storage{
                index: index.NewMemoryStorage(), // TODO: temporary for development; let be configurable (mysql, etc)
        }
Just use kvfile.NewStorage(), which returns an index.Storage.  And put the index file in
the same directory.
Further, don't worry about QueueCreator.  I'm going to implement that at a higher level
so people don't need to worry about it for blobservers.  (which will also be implemented
atop existing blob/keyvalue abstractions, although probably configurable)

@nf
Contributor

nf commented Aug 29, 2013

Comment 13:

Oh, that rules!

@tgulacsi
Contributor

Comment 14:

I'm eager to test it!

@nf
Contributor

nf commented Aug 29, 2013

Comment 15:

The code review system isn't working for me, I guess because of the migration to
googlesource.com? I'll figure that out later. In the meantime (so I don't lose it if my
machine explodes) here is the patch.

Attachments:

  1. diskpacked.patch (8737 bytes)

@nf
Contributor

nf commented Aug 29, 2013

Comment 16:

The code review system isn't working for me, I guess because of the migration to
googlesource.com? I'll figure that out later. In the meantime (so I don't lose it if my
machine explodes) here is the patch.

Attachments:

  1. diskpacked.patch (8878 bytes)

@tgulacsi
Contributor

Comment 17:

Seems nice and well thought out.
I'm attaching a patch for your patch (:)). Nothing serious: don't use os.Stat when we
already have the file open, use "data-%020d.dat" as the filename, and use
binary.LittleEndian.PutUint64 for blobMeta serialization (a constant 24 bytes) to keep
the index size as low as possible.
I tried hard to find logical errors and missed optimizations, but found none.
How can I test this?
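For illustration, a hypothetical fixed-width encoding along the lines of the PutUint64
suggestion above; the actual fields and layout in the attached patch may differ:

package blobmetasketch

import "encoding/binary"

// blobMeta here packs three uint64 fields little-endian into a constant
// 24 bytes.  The field choice (pack file number, offset, size) is a guess.
type blobMeta struct {
        file   uint64 // N in "data-%020d.dat"
        offset uint64 // byte offset of the blob data within the pack
        size   uint64 // blob size in bytes
}

func (m blobMeta) marshal() []byte {
        b := make([]byte, 24)
        binary.LittleEndian.PutUint64(b[0:8], m.file)
        binary.LittleEndian.PutUint64(b[8:16], m.offset)
        binary.LittleEndian.PutUint64(b[16:24], m.size)
        return b
}

func parseBlobMeta(b []byte) (blobMeta, bool) {
        if len(b) != 24 {
                return blobMeta{}, false
        }
        return blobMeta{
                file:   binary.LittleEndian.Uint64(b[0:8]),
                offset: binary.LittleEndian.Uint64(b[8:16]),
                size:   binary.LittleEndian.Uint64(b[16:24]),
        }, true
}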

Attachments:

  1. diskpacked-gt.patch (4441 bytes)

@hjfreyer
Contributor

Comment 18:

One suggestion:
Add 4 bytes to the header for the blob's status. The two statuses you'd want in the
short term are "OKAY" and "DELE", so that you can mark a given blob for GC in some
future compaction (unless RemoveBlobs has the specific semantics that the data must be
destroyed immediately). Even if you do zero it out, you might want to tag its status as
"ZERO".
You could also imagine other statuses like "CRPT". Seems like it might come in handy.
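A hedged sketch of what that could look like; the status codes and the revised header
layout below are hypothetical, not something in any patch here:

package statussketch

// Four-byte blob status codes for the pack header, as suggested above; a
// future compaction pass could skip or reclaim anything not marked okay.
const (
        statusOkay    = "OKAY" // live blob
        statusDeleted = "DELE" // marked for GC at the next compaction
        statusZeroed  = "ZERO" // data already overwritten with zeros
        statusCorrupt = "CRPT" // data failed a digest check
)

// A header with the extra status field might then look like:
//   [BEGIN OKAY sha1-f1d2d2f924e986ac86fdf7b36c94bcdf32beec15 4]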

@bradfitz
Contributor Author

Comment 19:

Andrew, 
You'll need to update your .git/config and your ~/.netrc.
Your .git/config should be:
[remote "origin"]
        url = https://camlistore.googlesource.com/camlistore
        fetch = +refs/heads/*:refs/remotes/origin/*
And then update your ~/.netrc per http://camlistore.org/docs/contributing

@tgulacsi
Contributor

Comment 20:

With CAMLI_DEBUG_CONFIG=1 ./bin/camlistored I could create a low-level config file,
where I could replace "filesystem" with "diskpacked", but then it complains about
(not implemented) efficient queueing.
I don't understand what these queues/partitions are; I simply copied the code from under
localdisk, but to no avail, and my replication seems to be corrupt.
Upon uploading a file I receive:
2013/08/29 22:25:42 No recent permanodes because keyId for owner
sha1-08be8cbcf21ff7d76f7a3b5f73ea9d3532314979 not found
2013/08/29 22:25:50 receiving blob sha1-36d25e0387c7839dcf6a7c0551885436153de193
2013/08/29 22:25:50 Mirrored blob [sha1-36d25e0387c7839dcf6a7c0551885436153de193; 551
bytes] to partition "/home/gthomas/var/camlistore/packs/partition/queue-index-sqlite"
2013/08/29 22:25:50 Received blob [sha1-36d25e0387c7839dcf6a7c0551885436153de193; 551
bytes]
2013/08/29 22:25:50 receiving blob sha1-c60cc966e7d79179fe5b415ce2bd6e2ee5dacd79
2013/08/29 22:25:50 Mirrored blob [sha1-c60cc966e7d79179fe5b415ce2bd6e2ee5dacd79; 739
bytes] to partition "/home/gthomas/var/camlistore/packs/partition/queue-index-sqlite"
2013/08/29 22:25:50 replica: receiving blob, 1 successes, 1 failures; last error =
jsonsign: error fetching public key blob: file does not exist
2013/08/29 22:25:50 Client error: Error receiving blob
sha1-c60cc966e7d79179fe5b415ce2bd6e2ee5dacd79: jsonsign: error fetching public key blob:
file does not exist
2013/08/29 22:25:52 replication error for queue "index-sqlite", blob
sha1-36d25e0387c7839dcf6a7c0551885436153de193: dest write: corrupt blob; digest doesn't
match
I'm attaching the diskpacked.patch + diskpacked-gt2.patch

Attachments:

  1. diskpacked-gt2.patch (3077 bytes)

@cznic

cznic commented Sep 1, 2013

Comment 21:

@10: kv DB size limit is 2^60 bytes.
Updated the docs:
cznic/kv@89c3914

@nf
Contributor

nf commented Sep 1, 2013

Comment 22:

@tgulacs... thanks for the suggestions. Feel free to comment on 
https://camlistore-review.googlesource.com/#/c/708/
Your patch has some issues, though (notably parseBlobMeta doesn't do anything).
Until Brad refactors the partition stuff out of localdisk, you can use the diskpacked
storage as a secondary blobstore and sync to it. You can't use it as primary storage yet.

@nf
Contributor

nf commented Sep 2, 2013

Comment 23:

Labels changed: added type-enhancement, component-persistence.

Owner changed to @nf.

Status changed to Started.

@bradfitz
Contributor Author

bradfitz commented Sep 2, 2013

Comment 24:

Now in (rev 4ca346f12). Still needs delete and a couple other TODOs.

@tgulacsi
Contributor

Comment 25:

I've implemented delete and CreateQueue in
https://camlistore-review.googlesource.com/#/c/876/

@tgulacsi
Contributor

Comment 26:

And now without CreateQueue, one level up (a more general solution):
https://camlistore-review.googlesource.com/#/c/936/

@tgulacsi
Contributor

Comment 27:

Now I get "No recent permanodes because keyId for owner
sha1-df5bea9fbfabb126cb6d4a317363c87c2bb3367d not found" and 
"Stat error: Error, off: 0xb0, read /home/tgulacsi/var/camlistore/pack/index.kv: bad
file descriptor"
See the attached log; I started with an empty pack-000000.blobs and it seems that
camlistore stored the public key in there.

Attachments:

  1. camlistore-jsonsign-error.log (3683 bytes)

@bradfitz
Contributor Author

Comment 28:

Diskpacked is done.  Future problems and enhancements should be their own bugs.

Status changed to Fixed.

This issue was closed.