How much space in Git does a git-annex file use?
- When I store a file in git annex, how much space does it use in the Git repository?
- TL;DR about 400 packfile bytes/file at scale, plus another 100 per copy or drop until `git annex forget` (a rough worked example follows this list).
- What is a sensible value for `annex.largefiles`?
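As a rough worked example from just those TL;DR figures (the workload of 100,000 files recorded in two repositories is an assumption, not something measured below):

```
# 100000 annexed files, each with location log entries for two repositories,
# at ~400 packfile bytes per file plus ~100 per recorded copy:
echo $(( 100000 * (400 + 2 * 100) ))   # 60000000 bytes, i.e. roughly 60 MB of packfile
```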
The experiment
- Make an empty git annex repository
- Gather disk usage and other repository size information after each change (a sketch of the measurement loop follows this list)
- Add small files in batches of 1000
- Copy to a special remote
- Draw graphs
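The loop looked roughly like this. It is a minimal sketch rather than the actual scripts (those are in the bundle at the end); the directory special remote, paths and batch counts are stand-ins.

```
set -e
git init sizetest && cd sizetest
git annex init "size test"
git annex initremote usb type=directory directory=/tmp/usb-remote encryption=none

n=0
for batch in $(seq 1 50); do
    # 1000 tiny files per batch, each containing "$n\n"
    for i in $(seq 1 1000); do
        n=$((n + 1))
        printf '%s\n' "$n" > "$n"
    done
    git annex add .
    git commit -q -m "batch $batch"
    # normalise to a single packfile before measuring
    git gc --aggressive --prune=now --quiet
    du -cb .git/objects/pack/*.pack | tail -n 1 >> ../packsize.log
done

# then add and remove copies, re-measuring after each step
git annex copy --to=usb .
git annex drop --from=usb .
```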
What does it look like?
- While adding the tiny files in batches of 1000, it looks so linear it’s dull
- Dividing the size of the packfile by the number of files gives a cost per file. As the number of files increases, the packfile cost per file decreases asymptotically (see the snippet after this list)
- With 50k files present, adding or removing a copy of all files also looks fairly linear
The return to three copies would not look out of place if it were mirrored over to five copies.
- Files also take fewer bytes per copy when there are more copies
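Assuming the pack size log from the sketch above, the per-file figure at the 50k point is just a division, for example:

```
pack_bytes=$(du -cb .git/objects/pack/*.pack | tail -n 1 | cut -f 1)
echo "$(( pack_bytes / 50000 )) packfile bytes per file with 50000 files present"
```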
There is probably room for more experiments here, but it is enough for me.
Configuration to store files efficiently?
- In the case of changing sensibly sized text files, plain Git does a great job.
- For write-once files which gzip to over 400 bytes:
  - it might be worth putting them in the annex, so you can choose to not have them available
  - you still have to pay for the symlink checkout (probably rounds up to 4 KiB)
  - file checkouts probably also round up to 4 KiB
  - files present in the annex also have three levels of directory above them (probably 4 KiB each), two of which will be shared when many files are present (the du comparison after this list is one way to check the block overhead)
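One crude way to see that per-file filesystem overhead, assuming GNU du and a filesystem with 4 KiB blocks (this looks at the working tree and annexed objects, not the git history):

```
du -sh --apparent-size .   # logical size of the checkout and annex
du -sh .                   # blocks actually used, including per-file rounding
```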
I haven’t made or tested settings for `annex.largefiles` yet, or considered what sort of experiment to run.
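For what it’s worth, the sort of setting I would expect to test first looks like this; the 100 KiB threshold is an arbitrary assumption, not a recommendation backed by the numbers above.

```
# Untested starting point: annex anything over 100 KiB, keep smaller files in plain git.
git config annex.largefiles 'largerthan=100kb'
# The same expression can instead be versioned in .gitattributes:
#   * annex.largefiles=largerthan=100kb
```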
Are the simplifications realistic?
- Using very small files like `"$n\n"` on `backend=SHA256E`
  - These keys contain the length, which here is a single digit (at most 6); longer files would mean more digits and more variability (see the illustration after this list)
- With no chunking or URLs?
  - `git annex` can store other things with the log, which are not tested here
- Running the add/copy/drop operations close together in time
  - This allows the integer part of the timestamps to compress better; you may pay some extra bytes for operations spread across years.
  - The nanoseconds are probably still just noise in any case.
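For illustration, here is roughly what gets stored per tiny file; the hash and uuid are placeholders, not values from the experiment.

```
git annex lookupkey 7
# SHA256E-s2--<64 hex digits>   (backend, size in bytes, then the hash)

# Each add/copy/drop of the key appends a line like this to its location log
# on the git-annex branch: fractional-second timestamp, present flag, repository uuid.
# 1700000000.123456789s 1 00000000-0000-0000-0000-000000000000
```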
Other caveats:
- Sizes are for one aggressively repacked packfile. When there are loose objects or multiple packfiles, total size will be larger.
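`git count-objects -v` reports loose and packed storage separately, which is the easy way to check whether this caveat applies to a repository before comparing numbers:

```
git count-objects -v
# size:      loose objects, in KiB
# size-pack: packfiles, in KiB
# packs:     how many packfiles there are
git gc --aggressive --prune=now   # collapse everything into one packfile first
```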
How did you do it?
Or “Can I repeat the experiment?”
I did it with some grubby shell scripts and Perl, and filled in the gaps with paste-from-the-documentation one-liners.
Here is the git bundle of it.
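Cloning the bundle should be enough to poke around in it; the filename here is made up, so substitute whatever the bundle is actually called.

```
git clone annex-size-experiment.bundle annex-size-experiment
cd annex-size-experiment
ls
```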