new "diskpacked" local disk storage format #197
Comments
As part of https://camlistore-review.googlesource.com/#/c/536/ I did some profiling. The post-536 profile looks like this: https://googledrive.com/host/0B7S33WfKVjF0Z2dVSWVtelpmbFU/local-check-last.svg

If I'm reading that correctly, the localdisk storage could benefit from an index.Storage too. It seems like enumerating blobs generates a lot of garbage, just from getting the dirnames (or does disk seek time cause that to show up in the profile more than it should?). If it is just slow access times causing that, then having an indexer that could enumerate blobs more efficiently could go a long way to making the localdisk storage backend better.

I'm in favor of the work this issue sets out to perform; I just wanted to present a middle ground that may be worth exploring.
Interesting! That's surprisingly expensive.

It looks like os.File.Readdirnames with a non-negative n allocates the whole buffer (512 KB, for 32k of strings) on each call. We should use Readdirnames(-1) instead. Could you try that in enumerate.go first?

(Adding an index to localdisk makes sense too, but I want to keep localdisk dead simple, with no moving parts, nothing to corrupt, and easy to debug. And we don't have a good pure-Go index yet.)
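For illustration, a minimal sketch of the kind of change being suggested (simplified, not the actual enumerate.go code):

```go
package localdisk

import (
	"os"
	"sort"
)

// readDirNames illustrates the Readdirnames(-1) approach: one call
// returns every name in the directory, instead of allocating a fresh
// batch buffer for each call with a positive n.
func readDirNames(dir string) ([]string, error) {
	f, err := os.Open(dir)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	names, err := f.Readdirnames(-1)
	if err != nil {
		return nil, err
	}
	sort.Strings(names)
	return names, nil
}
```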
Here's a profile doing the same basic thing as before, post-https://camlistore.googlesource.com/camlistore/+/6dac08f5390019a2cd368cf51e9bc98acb379132: https://googledrive.com/host/0B7S33WfKVjF0Z2dVSWVtelpmbFU/readdirnames-1.svg
If you don't want to reinvent the wheel, we could strip the inner parts from weed-fs, which does this (stores files in a volume, reparsable, but uses an in-memory index for them, with persistence, plus a volume manager with topologies etc., which we don't need here), implementing the Facebook Haystack paper. Not a big thing, but well thought out: 8-padded blocks, a nice needle structure, and a rebuildable, space-efficient in-memory volume index.

My only concern is the question: which volume should I start searching in? It is solvable by several methods (search in every one of them, split volumes by the first n bytes of the hash, or keep a separate index for this), each with its own downsides.

I started a new blobstore with weed-fs precisely because one-file-per-chunk does not scale. So a more direct solution (local volumes, not a remote server with its own replication) would be better for me!
OK, another idea you can shoot down: the tar format. I have an old production system for file storage where the volumes are just tars. The main benefit is that tar is a very simple, portable format, and the tools to manage it already exist. The drawback is its relative size: 512-byte blocks, plus a separator block and a header block. But this also means that reindexing is easy, and the format is robust.
Plan: have a few tars/diskpacks open for appending. Store the index (ref -> filename:pos) in a key-value DB (cznic/kv, for example) right in that directory. Let's call this Level 0.

When any of the tars reaches a size threshold, create a permanent index of the tar's contents in cdb format, move the tar and the cdb to the "1" subdirectory, and remove their keys from the Level 0 DB. Let's call the "1" subdirectory and its contents Level 1. When there are too many files in Level n, stack each tar into a new tar file at Level n+1 and create the corresponding cdb, too; then delete the files from Level n. Let's call this event a "level jump".

Search goes the other way: it starts at Level n, searches each cdb (in parallel, maybe using parallel workers), and then descends to Level n-1. If the "too many files" threshold is m, then at most n*m+1 searches are needed, where n is the number of levels. This can be minimized if we have only one cdb per level, but that means we have to merge the new tar's index into the level's cdb when a level jump happens.

Another consideration is DB file size. I don't know cznic/kv's limits, but I know CDB's: the file must be less than 4 GB. As a ref is 20 bytes, a block offset is at least 4 bytes (with 512-byte blocks that means tars of at most 2 TB), and with some indirection (a separate file name -> file ID table) a 4-byte file ID (2^32 tars max), that means at most about 82 million items in a cdb. Maybe my requirements reach that limit: I have about 10 million files, but I don't know the multiplication factor of the file-to-blob chunking (the files are less than 16 MB).

A second, modified version is to separate the tars and the indices: create tars as big as is comfortable, and have one cdb per prefix, until all cdbs are under the limit. (Start with one cdb for everything, then one cdb per half-byte (0-9a-f), then per byte ([0-9a-f][0-9a-f]), and so on.) This gives the minimum number of cdbs within CDB's limits, and only one cdb to search, using the beginning of the ref as the prefix. The downside is that we have to rebuild (recreate) every cdb at the arrival of each new (full, finished) tar.

I don't know enough key-value stores to say whether there are better alternatives to CDB, but maybe the upsides (simple, easy, fast, low overhead, one file) compensate for the downsides (having to copy a few gigabytes on each cdb change).
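To make the search order concrete, here is a rough Go sketch of the lookup described in the plan above. The type and function names are hypothetical stand-ins, not anything in Camlistore or weed-fs:

```go
package tarpack

// location says which tar a blob lives in and at what offset.
type location struct {
	TarFile string
	Offset  int64
}

// levelIndex stands in for one level's index: Level 0 would be the
// mutable key-value DB next to the open tars, higher levels one (or a
// few) read-only cdb files.
type levelIndex interface {
	Find(ref string) (location, bool)
}

// lookup follows the search order from the plan: start at the highest
// level and descend toward Level 0, stopping at the first index that
// knows the ref. (A blob lives in exactly one level, since a level jump
// removes its keys from the level below, so the order only affects
// performance, not correctness.)
func lookup(levels []levelIndex, ref string) (location, bool) {
	for n := len(levels) - 1; n >= 0; n-- {
		if loc, ok := levels[n].Find(ref); ok {
			return loc, true
		}
	}
	return location{}, false
}
```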
I've started work on this. I have receiving and fetching blobs working. I still need to do the queue/partition stuff so I can actually use it as primary blob storage, which I don't understand very well at the moment.

The indexing is all in-memory at this point, but I have a simple design for a blob index in mind. As blobs are added, track them in a "recent" map[blobref string](blobref, packedfile, offset). Periodically merge-sort those records into a file named "index" of the form:

```
sha1-xxx packedfile offset
sha1-yyy packedfile offset
sha1-zzz packedfile offset
```

where lines are sorted by blobref. On blob lookup, check the "recent" map first, then do a binary search of the index file. Enumeration is just reading the index file top to bottom and merge-sorting the "recent" map in on the fly.

I guess the index file will grow pretty big over time, but it can be partitioned pretty easily. We could use one of the other indexers, but this approach seems simple enough that it's not worth it. Comments welcome.
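A small sketch of that lookup path, with hypothetical names; it keeps the sorted records in memory for brevity, where the design above would binary-search the on-disk index file:

```go
package diskpacked

import "sort"

// entry records where a blob lives: which pack file and at what offset.
type entry struct {
	ref        string // e.g. "sha1-xxx"
	packedFile string
	offset     int64
}

// blobIndex models the design above: a "recent" map for blobs received
// since the last merge, plus the merged records sorted by blobref.
type blobIndex struct {
	recent map[string]entry
	sorted []entry // sorted by ref; stands in for the "index" file
}

// lookup checks the recent map first, then binary-searches the sorted records.
func (ix *blobIndex) lookup(ref string) (entry, bool) {
	if e, ok := ix.recent[ref]; ok {
		return e, true
	}
	i := sort.Search(len(ix.sorted), func(i int) bool {
		return ix.sorted[i].ref >= ref
	})
	if i < len(ix.sorted) && ix.sorted[i].ref == ref {
		return ix.sorted[i], true
	}
	return entry{}, false
}
```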
Andrew, it's much easier than you think. Instead of inventing your own index format, just use the sorted key/value abstraction we already have in Camlistore and already have multiple implementations of (index.Storage). There's even the "kvfile" on-disk, pure-Go implementation of it. See how pkg/blobserver/encrypt uses it:

```go
type storage struct {
	// index is the meta index.
	// it's keyed by plaintext blobref.
	// the value is the meta key (encodeMetaValue)
	index index.Storage
	...
}
```

But instead of doing what encrypt does,

```go
sto := &storage{
	index: index.NewMemoryStorage(), // TODO: temporary for development; make it configurable (mysql, etc)
}
```

just use kvfile.NewStorage(), which returns an index.Storage, and put the index file in the same directory.

Further, don't worry about QueueCreator. I'm going to implement that at a higher level so people don't need to worry about it for blobservers. (That will also be implemented atop the existing blob/keyvalue abstractions, though probably configurable.)
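A rough sketch of that wiring, purely for orientation; the kvfile import path and the NewStorage signature (file path in, index.Storage and error out) are assumptions based on the description above, not verified against the tree:

```go
package diskpacked

import (
	"path/filepath"

	"camlistore.org/pkg/index"
	"camlistore.org/pkg/index/kvfile" // assumed location of the kvfile implementation
)

type storage struct {
	root string // directory holding the pack files

	// index maps blobref -> (pack file, offset), like encrypt's meta index.
	index index.Storage
}

func newStorage(root string) (*storage, error) {
	// Assumption: kvfile.NewStorage takes the index file's path and
	// returns an index.Storage plus an error.
	kv, err := kvfile.NewStorage(filepath.Join(root, "index.kv"))
	if err != nil {
		return nil, err
	}
	return &storage{root: root, index: kv}, nil
}
```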
The code review system isn't working for me, I guess because of the migration to googlesource.com? I'll figure that out later. In the meantime (so I don't lose it if my machine explodes) here is the patch. Attachments:
Seems nice and well thought out. Attaching a patch for your patch :) Nothing serious, just: don't use os.Stat when we already have the file open, use "data-%020d.dat" as the filename, and use binary.LittleEndian.PutUint64 for blobMeta serialization (a constant 24 bytes) to keep the index size as low as possible. I really tried hard to find logical errors and missing optimizations, but found none.

How can I test this? Attachments:
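For illustration, a guess at what a fixed 24-byte blobMeta encoding could look like; the actual field layout in the patch may differ:

```go
package diskpacked

import "encoding/binary"

// blobMeta is a hypothetical fixed-width index record: three uint64s,
// so every entry is exactly 24 bytes.
type blobMeta struct {
	file   uint64 // pack file number, as in "data-%020d.dat"
	offset uint64 // byte offset of the blob within that pack file
	size   uint64 // blob size in bytes
}

func (m blobMeta) encode() []byte {
	b := make([]byte, 24)
	binary.LittleEndian.PutUint64(b[0:8], m.file)
	binary.LittleEndian.PutUint64(b[8:16], m.offset)
	binary.LittleEndian.PutUint64(b[16:24], m.size)
	return b
}

func parseBlobMeta(b []byte) (m blobMeta, ok bool) {
	if len(b) != 24 {
		return blobMeta{}, false
	}
	m.file = binary.LittleEndian.Uint64(b[0:8])
	m.offset = binary.LittleEndian.Uint64(b[8:16])
	m.size = binary.LittleEndian.Uint64(b[16:24])
	return m, true
}
```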
One suggestion: add 4 bytes to the header for the blob's status. The two statuses you'd want in the short term are "OKAY" and "DELE", so that you can mark a given blob for GC in some future compaction (unless RemoveBlobs has the specific semantics that the data should be destroyed immediately). Even if you do zero it out, you might want to tag its status as "ZERO". You could also imagine other statuses like "CRPT". Seems like it might come in handy.
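A tiny sketch of how such a status tag might look (names hypothetical, not from the patch):

```go
package diskpacked

// Hypothetical 4-byte status values for each blob header. A compactor
// (or an fsck-style checker) could rewrite just these 4 bytes in place,
// without touching the rest of the pack file.
const (
	statusOK      = "OKAY" // live blob
	statusDeleted = "DELE" // to be dropped at the next compaction
	statusZeroed  = "ZERO" // data bytes already overwritten with zeros
	statusCorrupt = "CRPT" // failed a digest check
)
```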
Andrew, you'll need to update your .git/config and your ~/.netrc. Your .git/config should be:

```
[remote "origin"]
	url = https://camlistore.googlesource.com/camlistore
	fetch = +refs/heads/*:refs/remotes/origin/*
```

And then update your ~/.netrc per http://camlistore.org/docs/contributing
With CAMLI_DEBUG_CONFIG=1 ./bin/camlistored I could create a low-level config file, where I could replace "filesystem" with "diskpacked", but then it complains about (not implemented) efficient queueing. I don't understand what these queues/partitions are; I simply copied the code from under localdisk, but to no avail. My replication seems to be corrupt? Upon uploading a file I receive:

```
2013/08/29 22:25:42 No recent permanodes because keyId for owner sha1-08be8cbcf21ff7d76f7a3b5f73ea9d3532314979 not found
2013/08/29 22:25:50 receiving blob sha1-36d25e0387c7839dcf6a7c0551885436153de193
2013/08/29 22:25:50 Mirrored blob [sha1-36d25e0387c7839dcf6a7c0551885436153de193; 551 bytes] to partition "/home/gthomas/var/camlistore/packs/partition/queue-index-sqlite"
2013/08/29 22:25:50 Received blob [sha1-36d25e0387c7839dcf6a7c0551885436153de193; 551 bytes]
2013/08/29 22:25:50 receiving blob sha1-c60cc966e7d79179fe5b415ce2bd6e2ee5dacd79
2013/08/29 22:25:50 Mirrored blob [sha1-c60cc966e7d79179fe5b415ce2bd6e2ee5dacd79; 739 bytes] to partition "/home/gthomas/var/camlistore/packs/partition/queue-index-sqlite"
2013/08/29 22:25:50 replica: receiving blob, 1 successes, 1 failures; last error = jsonsign: error fetching public key blob: file does not exist
2013/08/29 22:25:50 Client error: Error receiving blob sha1-c60cc966e7d79179fe5b415ce2bd6e2ee5dacd79: jsonsign: error fetching public key blob: file does not exist
2013/08/29 22:25:52 replication error for queue "index-sqlite", blob sha1-36d25e0387c7839dcf6a7c0551885436153de193: dest write: corrupt blob; digest doesn't match
```

I'm attaching diskpacked.patch + diskpacked-gt2.patch. Attachments:
@10: kv DB size limit is 2^60 bytes. Updated the docs: cznic/kv@89c3914
@tgulacs... thanks for the suggestions. Feel free to comment on https://camlistore-review.googlesource.com/#/c/708/ Your patch has some issues, though (notably, parseBlobMeta doesn't do anything). Until Brad refactors the partition stuff out of localdisk, you can use the diskpacked storage as a secondary blobstore and sync to it. You can't use it as primary storage yet.
Labels changed: added type-enhancement, component-persistence. Owner changed to @nf. Status changed to Started.
I've implemented delete and CreateQueue in https://camlistore-review.googlesource.com/#/c/876/
And now without CreateQueue, one level up (a more general solution): https://camlistore-review.googlesource.com/#/c/936/
Now I get "No recent permanodes because keyId for owner sha1-df5bea9fbfabb126cb6d4a317363c87c2bb3367d not found" and "Stat error: Error, off: 0xb0, read /home/tgulacsi/var/camlistore/pack/index.kv: bad file descriptor". See the attached log; I started with an empty pack-000000.blobs, and it seems that Camlistore stored the public key in there. Attachments:
This issue was closed.