Large Files, Version Control, and Hashsplitting

I’m currently working on Attaca, a fast, distributed, and resilient version control system for extremely large files and repositories, designed for scientists working on projects containing terabytes or petabytes of data. (Warning: not currently in any sort of working condition as of 8/29/17!) There are a few key components to this design; the first is the git data structure, with one special modification. The second is a distributed hash table (DHT) which is used to store the objects of the graph data structure, and the third is a technique called “hashsplitting” or “hash chunking”. Resilience is handled by the distributed object storage system, and version control itself by the git-like data structure.

Read More