Faster Feature Branching in Large CVS Repositories
At work we have a large CVS repository. By large, I mean 120k files, 2.5GB checkout. Most things work fine, and we've evolved some techniques to deal with operations that would otherwise be slow.
Things that work well:
- Committing a small list of files
- Updating your whole working copy, since we only expect to do so once daily
- Updating a small list of files to get someone's recent changes
Things that didn't work well:
- Scanning your working copy for things you forgot to check in
- Branching, because if you do the naive thing, you have to wait for CVS to branch the whole repository
- Tagging the naive way, for example to mark a release and deploy it, again because CVS has to walk the whole tree
While we never quite addressed the first problem, we do pretty well at making sure CVS never has to walk the whole tree.
We keep CVS from walking the whole tree almost all the time by making sure we can tell it exactly which files to work on. We keep a list of affected files for each bug or feature. At first we did it manually, but that proved a little error-prone, so we built a commit hook that indexes the commits by bug number, provided the commit message is formatted like this:
 Lay the groundwork for the Widget model
where 1234 is the bug number for the commit.
With that file list, we can do pretty much whatever we want without walking the working tree, like this:
list_files 1234 | xargs cvs ...
The CVS command could be a diff to see what's changed on my branch, an update to switch a branch on or off, or a commit to check in the files affected by my feature... assuming, of course, that I haven't touched any new files since the last check-in!
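One thing to watch with the pipe above: if the list comes back empty, some xargs implementations still run the command once with no file arguments, and CVS then recurses through the whole directory. Here is a minimal sketch of a wrapper that refuses to do that. The name cvs_on_list and the CVS_CMD override are hypothetical, not part of our tooling; list_files is the in-house command shown above.

```shell
# cvs_on_list SUBCOMMAND [OPTIONS...]
# Reads one file path per line on stdin and runs a single CVS command on
# exactly those files. CVS_CMD is a hypothetical override (defaults to "cvs")
# that allows dry runs, e.g. CVS_CMD=echo.
cvs_on_list() (
    count=0
    # "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
    while IFS= read -r file || [ -n "$file" ]; do
        if [ -n "$file" ]; then
            set -- "$@" "$file"      # append the path after the options
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 0 ]; then
        # With no file arguments CVS would recurse through the whole tree,
        # which is exactly what the file list is meant to avoid.
        echo "cvs_on_list: empty file list, not running cvs" >&2
        exit 1
    fi
    "${CVS_CMD:-cvs}" "$@"
)

# Usage, assuming list_files prints one path per line:
#   list_files 1234 | cvs_on_list diff -u
#   list_files 1234 | cvs_on_list update
```

Unlike xargs, this passes every path in one command line, which is fine for the handful of files a feature touches but would need batching for very long lists.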
Branching with File Lists
Typically you'd branch an entire CVS project at once, applying the branch up front to every file in the project. But that requires walking the whole tree. With file lists, we can branch one file at a time, so only the files affected by a given feature carry the branch. When a feature touches an additional file, we branch that file at that time and not sooner.
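As a rough sketch of what branching a file "at that time and not sooner" can look like with stock CVS commands: tag the branch point, create the branch on just the named files, then switch the working copies over. The function name, the CVS_CMD override, and the "<branch>_start" naming are assumptions for illustration; the caller is expected to pass only files that are not on the branch yet.

```shell
# branch_files BRANCH   (file paths on stdin, one per line)
# Creates BRANCH on only the listed files. Hypothetical helper; CVS_CMD
# defaults to "cvs" and can be set to "echo" for a dry run.
branch_files() (
    branch=$1
    if [ -z "$branch" ]; then
        echo "usage: branch_files BRANCH < file-list" >&2
        exit 2
    fi
    set --                                   # collect the paths as "$@"
    while IFS= read -r file || [ -n "$file" ]; do
        if [ -n "$file" ]; then
            set -- "$@" "$file"
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "branch_files: empty file list" >&2
        exit 1
    fi
    cvs=${CVS_CMD:-cvs}
    # "cvs tag" applies to the revisions currently checked out.
    "$cvs" tag "${branch}_start" "$@" &&     # mark the branch point
    "$cvs" tag -b "$branch" "$@" &&          # create the branch on these files only
    "$cvs" update -r "$branch" "$@"          # make the working copies sticky to it
)

# Usage (bug1234 is a made-up branch name):
#   echo src/widget.c | branch_files bug1234
```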
When deploying a feature for testing, we also avoid walking the whole tree to check out the branch. We use a mixed tree. We start with trunk (well, actually production, which is a subtly different melange) and switch the affected files to the branch. After a certain amount of QA, we merge the feature to trunk, again using file lists to restrict the scope of CVS commands, and release.
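Switching a feature on or off in a mixed tree comes down to two standard CVS invocations, "update -r BRANCH" and "update -A", applied only to the listed files. The sketch below assumes plain trunk as the base rather than our production mix, and the function name and CVS_CMD override are hypothetical.

```shell
# switch_files TARGET   (file paths on stdin, one per line)
# TARGET is a branch name, or the word "trunk" to clear sticky tags.
# Hypothetical helper; CVS_CMD defaults to "cvs" ("echo" gives a dry run).
switch_files() (
    target=$1
    if [ -z "$target" ]; then
        echo "usage: switch_files BRANCH|trunk < file-list" >&2
        exit 2
    fi
    set --                                   # collect the paths as "$@"
    while IFS= read -r file || [ -n "$file" ]; do
        if [ -n "$file" ]; then
            set -- "$@" "$file"
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "switch_files: empty file list" >&2
        exit 1
    fi
    if [ "$target" = trunk ]; then
        "${CVS_CMD:-cvs}" update -A "$@"             # drop sticky tags: back to trunk
    else
        "${CVS_CMD:-cvs}" update -r "$target" "$@"   # put just these files on the branch
    fi
)

# Usage (bug1234 is a made-up branch name):
#   list_files 1234 | switch_files bug1234   # feature on
#   list_files 1234 | switch_files trunk     # feature off
```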
Since CVS doesn't let you easily refer to the start of a branch that includes multiple files, we adhere to the standard practice of marking the start of each branch in a given file with a tag, which facilitates the merge to trunk.
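To show why the start tag matters, here is a hedged sketch of the merge back to trunk. It assumes each branched file carries a start tag named "<branch>_start" (a naming convention invented for this example) and uses the two-tag form of "cvs update -j", which merges the changes between the two named points into the working copy. The function name and CVS_CMD override are likewise hypothetical.

```shell
# merge_feature BRANCH   (file paths on stdin, one per line)
# Puts the listed files back on trunk, then merges in everything between
# BRANCH_start and the tip of BRANCH. Review and commit afterwards.
merge_feature() (
    branch=$1
    if [ -z "$branch" ]; then
        echo "usage: merge_feature BRANCH < file-list" >&2
        exit 2
    fi
    set --                                   # collect the paths as "$@"
    while IFS= read -r file || [ -n "$file" ]; do
        if [ -n "$file" ]; then
            set -- "$@" "$file"
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "merge_feature: empty file list" >&2
        exit 1
    fi
    cvs=${CVS_CMD:-cvs}
    "$cvs" update -A "$@" &&                                 # working copies back on trunk
    "$cvs" update -j "${branch}_start" -j "$branch" "$@"     # apply the branch's changes
)

# Usage (bug1234 is a made-up branch name):
#   list_files 1234 | merge_feature bug1234
#   ...resolve any conflicts, then commit the same file list...
```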
Voilà, no more waiting 20 minutes for CVS to branch, merge, update, or diff.
(Of course, CVS doesn't always handle the addition or removal of new files and directories the way I wish it did, and if you change a file to a directory... well I feel sorry for you. But those issues seem to exist with or without file lists.)
Maintaining the File Lists
We maintain the lists with two mechanisms: a loginfo hook to reindex each file of a commit, and a nightly batch job to walk the repository and pick up any files that someone missed.
The lists are stored as one big index in a MySQL database that essentially maps the Bugzilla bug ID of a bug to the list of files with commits for that bug, whether they are on trunk or on a branch. (We also store information about which revision numbers of a file belong to which bug, because that lets us do extra things like see the most recent commits or find unreleased bugs that are committed on top of one we'd like to release.)
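To make the shape of that index concrete, here is a small sketch that answers the same two questions from a flat file instead of MySQL. The tab-separated layout (bug ID, file path, revision), the function names, and the file name in the usage lines are all assumptions for illustration; our real schema and queries are not shown here.

```shell
# Hypothetical flat-file stand-in for the index: one line per indexed commit,
# tab-separated as  bug_id <TAB> file_path <TAB> revision

# files_for_bug BUG INDEX_FILE
# Prints each file touched by the bug once, sorted.
files_for_bug() (
    awk -F '\t' -v bug="$1" '$1 == bug { print $2 }' "$2" | sort -u
)

# revisions_for_bug BUG INDEX_FILE
# Prints "file<TAB>revision" for every revision recorded against the bug.
revisions_for_bug() (
    awk -F '\t' -v bug="$1" '$1 == bug { print $2 "\t" $3 }' "$2"
)

# Usage (bug-index.tsv is a made-up file name):
#   files_for_bug 1234 bug-index.tsv
#   revisions_for_bug 1234 bug-index.tsv
```

A flat export like this could also double as a fallback for building file lists when the database is unreachable.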
This lets us very easily construct a list of files for a feature, as long as the MySQL database is available. We did find, unfortunately, that when this new component to our version control system was unavailable, our dependence on it abruptly halted our development progress. For example, when we changed data centers and again when we upgraded MySQL servers, the development team struggled to reconstruct file lists by hand because the database hosting the file list index took a back seat to our primary production databases.
As mentioned, before we wrote the in-house scripts to update this index, we kept a list manually in Bugzilla. We found the bug comments quickly got cluttered, especially on larger features. But I'd still recommend it as a place to start if you think this process might be for you.
Although we used this mechanism successfully for three years, we are in the process of switching to Git for the following reasons:
- Automatic tracking of which branch commits need to be merged (the ones since the last merge)
- Preservation of fine-grained branch history after merging back to trunk
- Reduced dependence on a central database for the file list index: operations are fast out of the box, and most work without a network connection
- Ability to quickly scan the whole tree for uncommitted changes