Acorns

Marcel's blog

Version Control Comparison for Large Repositories

At Care2, our main repository had 120,000 files and a 2.4 GB CVS checkout. CVS was mostly working with some hacks to run faster on our huge repository. But I wanted more out of version control.

The biggest issue was that merging didn't work well. Sometimes adding or removing files on a branch would have an unexpected result after merging to trunk. And it was difficult to merge to and from trunk multiple times. I know, I know, you can tag branches at just the right place to track what's been merged already... but I'd rather not.

We also relied on file lists to speed up CVS operations. File lists helped by restricting CVS commands to a carefully maintained list of the files that were actually touched on a feature branch. But we ran into intermittent problems when the file list database got moved or access was accidentally revoked. The file lists were good at speeding up CVS but greatly increased the complexity and fragility of our development process, so I was eager to leave them behind.

Our two main reasons for ditching CVS were, in a nutshell:

  • Cumbersome merging
  • Slow performance on our large repository

With those reasons in mind, I set out to find a replacement.

Contenders

So which version control system (aka source code management system) to switch to?

Subversion seemed like the most natural upgrade, claiming to fix what was wrong with CVS.

Then there was this Git thing that I'd seen, but it sounded weird and seemed kind of new. How many version control systems are there room for in this world anyway?

I'd tried Darcs a while back. While I liked the simple user interface, the distributed model, and the patch theory concept, it could barely handle a 100MB subdirectory of our code base when I tried it a year earlier.

I read bad review after bad review about arch. Hard to use, not great performance. It seemed a little old, and given that, why hadn't I heard more of it?

My last company switched to BitKeeper. MySQL uses it. The kernel uses it. Must be good, right? Well, we're pretty open-source friendly, so I didn't think a proprietary tool would go over well. Then I learned that the kernel used to use BitKeeper, and is now using Git. That got my attention. After all, the kernel source, at 20k files, has some of the same types of performance problems as Care2's code base. (MySQL has since switched from BitKeeper to Bazaar, which still seems to have performance issues with large projects.)

And then I learned that Mercurial (hg) was born from the same flames as Git, so I had to try it, too. Mercurial obviously lost the kernel source to Git, but I didn't know that story, so I decided to give it a chance. It was in Python, which might be easier to extend, and it generally seemed slightly more user friendly at first.

There were others, but really I was looking for a small number of version control systems to try, and I ended up with:

  • CVS (for a baseline)
  • Subversion
  • Git
  • Mercurial

Importing History

To see how they would perform on our code base, I wanted to import our code into each version control system.  And, in case an operation depended on the amount of history in the repository, I wanted to include the three years of CVS history we had.

For Subversion, cvs2svn worked well.

For Git, I tried the built-in CVS importer, but it calls cvsps under the covers, and that choked. It just segfaulted. And cvsps looks like a mostly dead project. Well, it turns out you can use cvs2svn for Git as well.

For Mercurial, I used fromcvs. It choked on some UTF-8 commit messages about 80% of the way through the import, even when I set various encoding flags. Since I wasn't concerned about details like character encoding yet, I manually edited a copy of the CVS repository to get around the problem.

Performance

The Subversion killer was switching to a branch. What?! You thought Subversion could create O(1) branches? Well, it can, in the repository. But the first thing you'll do after creating a branch is switch your working copy to it. And with that, Subversion fails on our code base. That's it. The most natural upgrade from CVS failed. I'm sure the large Subversion projects out there have ways of coping. Maybe they use mixed checkouts? We could have made it work with file lists. But I didn't want to. I wanted a version control system that worked on its own.

And Git did. Git was fast. Git is a little memory hungry on a big repository. And when your OS buffer cache is cold, statting the 120k files can get slow. On NFS, the cache is always cold, so you have to use local disk. If you do that, it's mostly, blazingly, magically fast. No inotify, no daemon. Diffing your working copy, for example, requires just 120,000 stats in 0.8 seconds. "git status" does just a little more work, reading all the directories, but takes only a bit longer.  And that solved a problem I didn't know I was going to solve: seeing what I still need to check in.
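For a rough feel of what a stat-only pass over a working copy costs, here is a minimal Python sketch. It is not how Git is implemented (Git is written in C and compares stat data against its own index file); it only mimics the idea of detecting changes from stat data without reading file contents. The function names `stat_tree`, `changed_paths`, and `timed_stat` are made up for this illustration.

```python
import os
import time

# Metadata directories to skip, as a status check would (illustrative list).
VCS_DIRS = {".git", "CVS", ".svn", ".hg"}


def stat_tree(root):
    """Walk `root` and lstat every file, returning {path: (mtime_ns, size)}."""
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune version-control metadata in place so os.walk skips it.
        dirnames[:] = [d for d in dirnames if d not in VCS_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            snapshot[path] = (st.st_mtime_ns, st.st_size)
    return snapshot


def changed_paths(old, new):
    """Paths that were added, removed, or whose stat data differs."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


def timed_stat(root):
    """Return (snapshot, seconds spent statting the tree)."""
    start = time.perf_counter()
    snapshot = stat_tree(root)
    return snapshot, time.perf_counter() - start
```

Calling `timed_stat` twice on a large tree, once right after clearing the OS buffer cache and once again immediately afterwards, should show the same cold-versus-warm gap described above, though the absolute numbers will depend on the disk and file system.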

Mercurial turned out not to be as fast as Git. I chose Git over it because Mercurial took 12 seconds for the same diff task that took Git 0.8 seconds. As a developer, in 12 seconds, I can lose my train of thought or get distracted with something I could be doing while I'm waiting. It's an interruption. It's not long enough to get a cup of coffee, but I don't like to wait. And that performance gap extended to other common operations: looking at recent history and committing. Git performance trumped Mercurial's nicely organized and named command set.

Here's a table of the performance measurements I considered, with the ones most important to me highlighted:

                                                       cvs     svn     git     hg
Working copy changes
  Determine working copy changes on workstation        12m     4m      1-25s   12-33s
  Determine working copy changes on server with NFS    17m     -       20-55s  -
Committing
  Commit with files not listed                         -       -       1.6s    17s
  Commit with file(s) listed                           -       -       2.4s    8-14s
    • I would expect simple commits to be fairly fast. In this case, I explicitly named the file to be changed to avoid having it crawl the working copy. However, I think Git crawls it anyway.
  Push any changes to central repository               -       -       0.5s    7s
History
  Show recent entries in repository commit log         -       -       0.01s   2m
  Show recent entries in commit log for one file       -       -       0.1s    84s
  Diff between last two revisions                      -       -       0.1s    2.2s
  Git grep of whole repo, usually takes minutes        -       -       8-32s   -
Branching
  Create branch (no file lists)                        27m     0.07s   0.004s  0.04s
  Switch to branch (for svn, in addition to create)    22m     >7m     7.4s    20s-1m
  Switch to branch memory footprint (MB)               23      >450    130     -
  Switch to branch with file lists (9 files)           1.5s    10s     -       -
  Trivial merge                                        -       -       <1s     20s
  Merge with 680 files changed, ~3200 edits            -       -       4.5s    -
Checkout
  Gigabytes for checkout                               2.5     5.6     4.1     7.6
    • Git's checkout includes full history with branches
    • A "bare" Git checkout (no working copy) is 1.4G
  Checkout time on workstation (min)                   14      53      8       17
    • Git's and hg's include full history w/ branches
  Checkout time on server with NFS (min)               67      -       45      -
  Memory footprint for checkout (MB)                   5-172   100     260     -
  Memory footprint for checkout on 64-bit server (MB)  61      -       1300    -
  Update to recent revision (no file lists)            4m      6m      4s      -

A dash (-) means no measurement.

 

Notes:

  • File System Buffer Cache: Performance of Git and Mercurial relies on the OS's file system buffer cache. "git diff" caches about 60MB in it, so ideally we'd have 60MB x 8 developers = 0.5GB of RAM available for the file system buffer cache. When I clear the cache, performance of "git diff" returns to the 25 sec range.
  • All tests performed on recent distributions of Linux
  • Tested git 1.6.0, hg 0.9.4, svn 1.5.1, and cvs 1.12.13.  However, I expect these performance characteristics not to change much over time unless the storage formats are drastically changed
  • I've highlighted in green and red the numbers that I found most important. In my working style, operations from most to least common are: diffing working copy, committing, showing recent history, pushing code (for systems where a separate step is required), branching, merging, checking out a new working copy

Features

Ok, so Git is the clear choice in terms of performance on my large repository without file list hacks, but does it really deliver good merging?

Linus really got that part right: Merging is the hard part; merging should be better, not just branching. So of course Git deals properly with files added or removed on a branch when I merge. And since merging is pretty much symmetrical, ditto when files are added or removed on trunk.

But when files are moved is when Git really gets going. Git handles edit/move conflicts well: if you rename a file on a branch, and then edit it on trunk, when you merge, you'll have the edit under the new path, as it should be. Subversion fails there. Mercurial insists that you tell it when moving a file, while Git just infers the rename by comparing hashes of the new and old file contents. But wait! Git doesn't stop there: it can also auto-detect the rename when the hash is different — when you've edited the file in the same commit that you move it!! How this is done efficiently eludes me, especially given the claim that Git doesn't store any evidence of the rename. It rediscovers that there was a rename each time it needs to know. I've also heard that this kind of edit/move merging can happen on a finer grained level, with function moves instead of file moves, for example.
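To make the two-stage idea concrete (exact match by content hash first, then a similarity comparison for files that were edited as well as moved), here is a small Python sketch. It is an illustration under loud assumptions, not Git's implementation: trees are modeled as plain `{path: content}` dictionaries, similarity comes from Python's `difflib` rather than Git's own scoring, and `detect_renames` is a hypothetical name. The 0.5 default threshold is chosen to echo Git's documented default similarity of 50% for rename detection.

```python
import hashlib
from difflib import SequenceMatcher


def _digest(content):
    """Content hash used to spot byte-identical files."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


def detect_renames(old_tree, new_tree, threshold=0.5):
    """Pair deleted paths with added paths.

    Trees are {path: content string}. Returns {old_path: (new_path, score)}.
    """
    deleted = {p: c for p, c in old_tree.items() if p not in new_tree}
    added = {p: c for p, c in new_tree.items() if p not in old_tree}
    renames = {}

    # Pass 1: an identical content hash means an exact rename.
    by_hash = {}
    for path, content in added.items():
        by_hash.setdefault(_digest(content), []).append(path)
    for path, content in list(deleted.items()):
        candidates = by_hash.get(_digest(content))
        if candidates:
            target = candidates.pop(0)
            renames[path] = (target, 1.0)
            del deleted[path]
            del added[target]

    # Pass 2: for the leftovers, take the most similar added file
    # whose similarity reaches the threshold.
    for path, content in deleted.items():
        best, best_score = None, threshold
        for target, new_content in added.items():
            score = SequenceMatcher(None, content, new_content).ratio()
            if score >= best_score:
                best, best_score = target, score
        if best is not None:
            renames[path] = (best, best_score)
            del added[best]
    return renames
```

Because nothing about the rename is stored, a function like this would have to be rerun whenever the answer is needed, which matches the "rediscovers the rename each time" behavior described above. The pairwise comparison in the second pass is quadratic in the number of leftover files, so a real tool needs to bound or shortcut it.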

Git can quickly list what changed between two revisions. It can quickly update your working copy by altering just the files that differ between your current and target revisions. Everything that used to be even a little slow in CVS is much faster in Git.
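The "alter just the files that differ" idea can be sketched in a few lines of Python. This assumes each revision is represented as a flat `{path: content_hash}` manifest, which is a simplification: Git stores nested tree objects and its real tree walking is more involved. `plan_update` is a hypothetical helper name.

```python
def plan_update(current, target):
    """Compare two {path: content_hash} manifests.

    Returns (paths to write, paths to delete) needed to turn a working
    copy at `current` into one at `target`. Unchanged paths are not touched.
    """
    to_write = sorted(p for p, h in target.items() if current.get(p) != h)
    to_delete = sorted(p for p in current if p not in target)
    return to_write, to_delete
```

With 120,000 files and a handful of differences between two revisions, only that handful ends up in the two lists, so the cost of an update tracks the size of the change rather than the size of the repository.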

Git has some nice bonus features, too:

  • The distributed nature means you can checkpoint your work more often, even before you know you want to share it: Until you've published your commits, you can revise, reorder, or discard them
  • Git's speed with branching and merging means you can do it much more often, which enables a cleaner workflow
  • You can grep any revision of your code base much faster with "git grep" because it can open just a small number of history storage "pack" files instead of every file in your working copy
  • Git can do a smart binary search to find which revision introduced a bug.  You provide the test for bugginess, which can be either manual or automated
  • Git has a variety of options that affect output formatting
  • It can even graph the ancestry of your commits with ASCII or a GUI
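The binary search for a bug-introducing revision mentioned in the list above can be sketched as follows. This is a simplified illustration, assuming a strictly linear history in which the first revision is known good and the last is known bad; real histories with branches and merges form a graph, which Git's bisect handles and this sketch does not. `first_bad_revision` and `is_bad` are hypothetical names, with `is_bad` standing in for whatever manual or automated test you supply.

```python
def first_bad_revision(revisions, is_bad):
    """Find the first bad revision by binary search.

    `revisions` is ordered oldest to newest; revisions[0] is assumed good
    and revisions[-1] is assumed bad. Returns (revision, number of tests run).
    """
    good, bad = 0, len(revisions) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(revisions[mid]):
            bad = mid   # bug already present here: look earlier
        else:
            good = mid  # still fine here: look later
    return revisions[bad], tests
```

The payoff is in the test count: for 1,000 candidate revisions, the loop needs at most 10 runs of the test rather than up to 1,000.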

Some shortcomings of Git, however:

  • It doesn't work as well on Windows, though it does work
  • It does have a steep learning curve for CVS users, partly because it's distributed, and partly because revisions are repository-wide, not per-file
  • For large projects, you really need the local disk; NFS will not do
  • You can't check out just one subdirectory of your repository.  If you're accustomed to that, you'll need to break a monolithic repository into cohesive packages, possibly using the submodule system to reference one from another
  • Git doesn't track base revisions for each checked out file, so you can't really maintain a mixed checkout

All of these issues have turned out to be well worth overcoming for me.  For example, I've come to realize that mixed checkouts were really just a hack to cope with poor performance.

I was also concerned about extra storage for one copy of project history per developer.  While it turns out you can set up your repository to share another's history in Git, we don't actually bother.  Disk is cheap.  At least we don't need to maintain multiple checkouts per developer per active feature branch, now that we have Git's fast branch switching.

Conclusion

Git improves tremendously over CVS in the two areas we most needed improvement: merging and performance.  In addition, it resolves another issue I had thought was impossible to solve: quickly seeing which files I still need to commit.

The numbers speak for themselves. If you have a large project and you want to spend more time coding and less time waiting for your version control system, you want Git. If you have a small project, performance may matter less, but you may still find Git's feature set to be superior to Subversion and Mercurial. For my sample code base, Git was really the only workable option.
