Want a Reproducible Data Science Framework? Try Git LFS!

by Eric Brown

The Git version control system is useful for reproducible data science frameworks when it is augmented with the Git LFS extension for handling large assets.

Place input data, code, and output data under version control, then push to a centralized server.
The entire lineage of code and data is preserved.
Special care must be taken to ensure that virtually all files are handled by the LFS extension.
Git LFS has a 2 GB file size limit that must be worked around.
GitLab on bare metal is the recommended server configuration.
Only the large assets requested for a specific checkout are transmitted from server to client, which is efficient and compliant.
The infrastructure is distributed over SSH and/or HTTPS.
Network latency is the bottleneck.

There is perhaps no more common question asked when visiting new customers than “How do we manage digital assets during data science?”

I believe the problem is solvable by a version control system. But bad experiences have shown me that the version control system (VCS) must be capable of handling large binary assets.

The most fundamental description of Data Science is the application of functions (f) to data (x), thereby generating a result (y):

y = f(x)

There is a startling simplicity to this: the “state” of the system is completely determined by three quantities: the data x, the functions f, and the result y. If the “state” can be serialized to the file system, then it can be captured and managed by a file-based VCS such as Git.

Fortunately, serialization in data science is conceptually easy:

input data are usually big blobs: CSV files, serialized objects, etc.
source code is textual
output data can be of many types, from text to binary objects

If the “state” is captured and can be accessed from anywhere, anytime, then this is, by definition, a reproducible framework and worthy of consideration.

Git and Git LFS

The most popular version control system is Git. Today it has all the tooling support and mind-share, and it is actually very flexible for composing solutions. Git is optimized for source code: its storage relies on delta compression, and changes to text files are usually quite small compared to the code base.

On the other hand, binary assets often change completely. For example, a JPEG file composed of pixels will be completely different from the same JPEG file after a sepia filter has been applied. So why compute the delta between these files, which are large compared to source code files? Just skip that computation and replace the file wholesale with a new version.

This is precisely the role that the Git LFS (Large File Storage) extension plays for Git, and it makes the goal of reproducible data science achievable. I have concluded that Git LFS’s role is to help Git behave a lot more like SVN. (But no one is talking about SVN today, are they?)

Git LFS Gotchas and Workarounds

Right about this time, you might be eager to know what the proposed solution is. Hopefully this repository makes it easy to follow:

git lfs clone ssh://git@96.84.85.81/ebrown/reproducible-data-science.git

https://96.84.85.81:8929/ebrown/reproducible-data-science

(or whatever you may use in your client, such as SourceTree.)

This repository has a suggested file structure, as well as Git-specific configurations stored in various .gitattributes and .gitignore files.

Here are some of the (mis)features of this system:

2 Gigabytes is the maximum size file that Git LFS can handle

You may need to use the split command:

split --bytes=1024M --numeric-suffixes --suffix-length=6 myfile.json myfile.json.

and the concatenation command to reconstruct:

cat myfile.json.* > ../tmp/myfile.json
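As a quick sanity check of the split-and-concatenate workaround, here is a minimal sketch that round-trips a small stand-in file. It assumes GNU coreutils (the long options used here are not available in BSD split). The file names, the 3 MiB file size, and the 1M chunk size are placeholders chosen so the example runs quickly, not values from the original project.

```shell
set -eu

# Work in a throwaway directory laid out like the repository (input/ and tmp/).
workdir="$(mktemp -d)"
cd "$workdir"
mkdir -p input tmp

# Hypothetical stand-in for a large asset: 3 MiB of random bytes.
head -c 3145728 /dev/urandom > tmp/original.bin

# Split into 1 MiB chunks named input/myfile.bin.000000, .000001, ...
split --bytes=1M --numeric-suffixes --suffix-length=6 tmp/original.bin input/myfile.bin.

# Reassemble into tmp/, which is never versioned. The fixed-width numeric
# suffixes make the shell glob expand in the right order.
cat input/myfile.bin.* > tmp/rebuilt.bin

# cmp exits non-zero (and set -e aborts the script) if the files differ.
cmp tmp/original.bin tmp/rebuilt.bin
chunk_count="$(ls input | wc -l)"
echo "chunks: $chunk_count"
```

Only the chunks in input/ would be added to Git LFS. The rebuilt file stays in tmp/ and is recreated on each machine after a pull.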

Accidental file creation is a big issue

Git LFS works by matching a pattern of files that it should handle instead of regular Git. I’ve discovered the hard way that accidentally created files, e.g. error logs and debugger output, creep into projects.

My solution is to just assume that everything should go into Git LFS unless otherwise specified.
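One way this "everything goes to LFS unless otherwise specified" rule might be written is sketched below. The patterns are assumptions for illustration (source code and Git's own config files stay in regular Git), and the .gitattributes files in the actual repository may well differ.

```shell
set -eu

# Throwaway directory standing in for the repository root.
repo="$(mktemp -d)"
cd "$repo"

# When several lines match a path, later lines override earlier ones,
# so the catch-all rule comes first and the exceptions follow.
cat > .gitattributes <<'EOF'
# Default: every file is handled by Git LFS.
* filter=lfs diff=lfs merge=lfs -text
# Exceptions (assumed layout): Git config files and code stay in regular Git.
.gitattributes !filter !diff !merge !text
.gitignore !filter !diff !merge !text
src/** !filter !diff !merge !text
bin/** !filter !diff !merge !text
EOF

cat .gitattributes
```

Inside a real repository, git check-attr filter -- followed by a path reports which filter applies to that path, which is a handy way to confirm that a stray file would be caught by LFS.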

Need for Temporary Space not Versioned

During the course of a run, it may be necessary to have a place to put temporary files, which should never be admitted into the VCS. This is accomplished here by the tmp/ directory with a highly excluding .gitignore file. (See the cat command above for an example. The reconstructed large file cannot re-enter the Git LFS process due to its size!)

The file system is thus:

bin/ : the run scripts (e.g. call python)

input/ : the x’s

output/ : the y’s

src/ : the f’s

tmp/ : temporary, not synced to git!
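A minimal sketch of scaffolding this layout is shown below. The project path is a hypothetical placeholder, and the two-line tmp/.gitignore is one common way to write a "highly excluding" ignore file, not necessarily the one used in the repository.

```shell
set -eu

# Hypothetical project location; substitute a real path in practice.
project="$(mktemp -d)/reproducible-project"

# Create the five top-level directories described above.
for d in bin input output src tmp; do
    mkdir -p "$project/$d"
done

# Ignore everything in tmp/ except the .gitignore itself, so the directory
# exists after a clone but nothing placed in it is ever versioned.
printf '%s\n' '*' '!.gitignore' > "$project/tmp/.gitignore"

ls -A "$project"
cat "$project/tmp/.gitignore"
```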

In Practice

Example Use Cases

Data Science projects are often different in their shape and scope. Here, I illustrate briefly two common use cases:

Regression

Input data are obtained, serialized, and added to Git (LFS)
Fit a regression model
Store the binary blob of the regression model to disk, and add it to Git (LFS)
Push to the central repository
Delete the work space

Yelp Competition Dataset

8G of JSON files
200,000 JPEG photos
Split the JSON files into 1G chunks, delete the originals
Add the data to Git (LFS): 2 hours!
Push to the central server
Pull on desktop, laptop, and servers
Branch, and start doing data analysis on each machine, pushing results back to the central server
Consolidate branches on the desktop, selecting the best line of work to deliver on master

Naming and Git Comments

I find that the repository name, Git checksum and timestamp, and a brief comment are sufficient to provide context for what a particular commit is, but it is possible to construct commit messages with code should that be more appropriate.
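If a commit message built by code is more appropriate, it could look something like the sketch below. The key=value message format, the run name, and the placeholder output file are all assumptions made for illustration, and sha256sum assumes GNU coreutils.

```shell
set -eu

# Hypothetical values; in a real run these would come from the run script.
run_name="regression"
output_file="$(mktemp)"
printf 'model coefficients placeholder\n' > "$output_file"

# Short fingerprint of the output file and a UTC timestamp.
checksum="$(sha256sum "$output_file" | cut -c1-12)"
timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Assemble the message from its parts.
commit_msg="run=${run_name} output_sha256=${checksum} at=${timestamp}"
echo "$commit_msg"

# Inside a repository, this would be followed by something like:
#   git commit -m "$commit_msg"
```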

Tested Systems

VMware Fusion Instance : 4 cores, 16G, 2T, vmxnet3 network interface
RCG Instance : 8 core Xeon, 64G of RAM, 20T RAID10 with daily backup, two bridged ethernet NICs
Vertical Scaling Enterprise Instance : 64 core AMD Threadripper, 256G of RAM, 200T ZFS for checkpointing, 8 bridged ethernet NICs

Github.com, Gitlab.com, Bitbucket.com

Performance

When working with large data, it becomes obvious that there’s a big difference between moving data around and doing stuff with the data. It is conceivable that large data structures would take many hours to add to VCS and then push to a centralized server.

But my experience has been that I would pay any price for a distributed, reproducible data science workflow, so long as it works with a minimum of gotchas.

Conclusion

Having gotten the Yelp dataset into Gitlab, I’m excited to set about performing analysis on all my machines! It will be nice to push their results back to a central server, so that I can pull to my laptop and review on my upcoming flight. With Git LFS, I’ll just take the assets that I need.

Although I have presented this workflow in the context of Data Science, it is, in fact, applicable to any situation that requires large files to be managed by Git. Hopefully these Git LFS workarounds are useful to you!

Please check back and find out what I’ve come up with using a decent asset management system. Until then!