Update 2013-10-05: See my other post on git for my latest thoughts on git. I’m now an enthusiastic fan of git. :)
At work, we’ve been investigating revision control systems to replace our current system, CVS.
Primarily, we’re investigating git (wikipedia article) as it has basically become the de facto distributed version control system. And we definitely see the advantage of being distributed. While we’re focusing on git, I have been reading up on lots and lots of other systems: dcvs, svn, aegis, bazaar, mercurial, and many, many more. I’ve been compiling an internal wiki document with many resources for each so we can better learn and compare them.
The odd thing with me is that I was late to the rcs game, starting to use cvs only around 2004 or so. I started on cvs and have continued to use it until now. I know its ins and outs. I’m comfortable with its plain text repository files and their contents. I’ve written scripts to parse them and present structured data back to the user: for example, which files and revisions have we marked with a specific bug number in the comments? That’s a great way to get a changeset for one bug, if you’re disciplined enough to “tag” comments with the right bug number. I’m also comfortable with tagging, branching, and all the discipline, structure, and branch modeling needed for the proper branching and merging that will necessarily occur in significant projects.
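As a sketch of the kind of log-scraping script I mean (the “Bug: NNNN” comment convention and the trimmed `cvs log` excerpt below are my own hypothetical examples, not a CVS feature):

```python
import re

# A trimmed, hypothetical excerpt of `cvs log` output.
SAMPLE_LOG = """\
RCS file: /repo/src/foo.c,v
Working file: src/foo.c
----------------------------
revision 1.4
date: 2009/06/01 12:00:00;  author: alice;
Bug: 1234 - fix null pointer dereference
----------------------------
revision 1.3
date: 2009/05/20 09:30:00;  author: bob;
Bug: 1111 - tidy whitespace
"""

def revisions_for_bug(log_text, bug):
    """Return (working_file, revision) pairs whose comment carries the bug tag."""
    results = []
    current_file = None
    current_rev = None
    for line in log_text.splitlines():
        if line.startswith("Working file: "):
            current_file = line[len("Working file: "):]
        elif line.startswith("revision "):
            current_rev = line.split()[1]
        elif re.search(r"\bBug:\s*%s\b" % re.escape(bug), line):
            results.append((current_file, current_rev))
    return results

print(revisions_for_bug(SAMPLE_LOG, "1234"))  # → [('src/foo.c', '1.4')]
```

Run it over the full log of a module and you have a poor man’s changeset for one bug, assembled from nothing but plain text and the comment discipline mentioned above.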
So, I see the dotted version numbers cvs uses for revision specifiers and they make sense. Everybody versions their code releases with dotted version numbers (1.0, 1.1, 2.0, 3.0, etc.), whether they use revision control or not, so it makes sense to version files that way. But it seems modern systems don’t like that; they’ll use hashes to represent file versions.
I also see the plain way in which my checkout files and directories are recorded as the same name files and directories in the repository. It makes sense. There are some limitations like symbolic links but they either don’t come up that often or they can be dealt with in other ways (we use project initialization shell scripts that setup the environment complete with symlinks for perl package namespaces, apache config symlinks, etc.). But, again, modern systems will use custom structures which allow them to model plain files and plain directories in their own format and export them as needed.
I can totally understand where the mindset for cvs came from back in the 1980s. Its design and architecture were probably significantly, if not directly, inspired by the Unix Philosophy. The Unix Philosophy, by the way, is a brilliant thing of utter common sense and simplicity. It is the only philosophy in program design that has produced real interoperability: the command-line.
Then I look at all of the modern, distributed revision control systems and they completely eschew the idea of plain files and plain directories. They also mostly do away with file version numbers. I understand there are advantages, namely performance and the ability to model more than simple files and directories. The possibility of more easily versioned directories is there as well. There are definitely advantages, but I’m still on the fence as to whether they’re worth it.
As I said, I grew up on cvs and, as necessity usually forces one to do, I invented means to get around cvs’s shortcomings and now have an ecosystem that I am comfortable with, have a good knowledge of, and have the confidence that I can extend it to do what I need when I need it without a lot of hassle.
Modern systems abstract out concrete file system concepts such as files and directories and simply model them in their own objects, which can then be materialized as concrete files and directories at will. I’ve actually always been one for abstracting and modeling things in a generalized way that gives power to the application rather than to external forces, but it does have its disadvantages: things become more opaque and so more difficult to work with.
We have a number of server-side “locks” in place to prevent things that might be “bad” when users commit. For example, a file, directory or branch may be locked, or the commit message may be invalid, or the file may contain DOS line endings, which we don’t want, or we’d like to enforce certain policies regarding the content of the file. This is all very easy because, during the commit, these hooks have access to the plain files. I tried to figure out how to do the file access hooks in git and didn’t come away with any real answers – yet.
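The checks themselves are simple predicates; the hard part is where the hook gets its data from. In cvs the trigger scripts read plain files, while in git a server-side hook would have to go through plumbing commands instead. A sketch of the checks as pure functions (the function names and the “Bug: NNNN” message convention are my own hypothetical examples):

```python
import re

def has_dos_line_endings(content: bytes) -> bool:
    """True when the file content carries CRLF line endings."""
    return b"\r\n" in content

def valid_commit_message(message: str) -> bool:
    """Require a bug tag such as 'Bug: 1234' somewhere in the message."""
    return re.search(r"\bBug:\s*\d+\b", message) is not None

def check_commit(files: dict, message: str) -> list:
    """Return a list of policy violations for one commit.

    `files` maps each path touched by the commit to its new content;
    a server-side hook would fill this in from the incoming commit.
    """
    problems = []
    if not valid_commit_message(message):
        problems.append("commit message lacks a Bug: tag")
    for path, content in files.items():
        if has_dos_line_endings(content):
            problems.append("%s contains DOS line endings" % path)
    return problems

print(check_commit({"a.c": b"int x;\r\n"}, "tidy up"))
# → ['commit message lacks a Bug: tag', 'a.c contains DOS line endings']
```

The same predicates would work either way; the difference is only whether the hook hands them plain file contents or contents fished out of git’s object database.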
I guess my concern is that perhaps modern revision control systems have abandoned some good principles of the Unix Philosophy and, as those with experience know, those principles came with a lot of pain. Discard those principles and you’ll probably learn them all over again with a lot of pain. In a common unix adage, paraphrased, those who forget their unix history are doomed to repeat it, badly.
Or perhaps we understand version control well enough now that it’s no longer exciting and we don’t want to implement plainly again so we’re taking the next step? I find this at work in my own code. At times I’m excited by a new technique or implementation and I hone it and hone it until at some point I get bored of it and no longer want to know it inside and out. So I might use a perl cpan module and now it’s a black box and I don’t have to worry about reinventing the wheel.
Whatever the case may be, if hiding file and directory details behind custom, binary data structures, and using meaningless hashes for version identifiers, is the way of the future, perhaps I’ll just need to try a little harder to wrap my head around these new systems and the new ideas they’re bringing to the table.
I’ll mention just one other thing that I think is important. RCS can do everything that modern systems can do; it just needs glue scripts. CVS can do everything that modern systems can do; it just needs glue scripts. Git, being a modern system, does everything it does without the need for glue scripts.
What I’m saying is, I get the advantages of cvs’s plain files and plain directories and dotted version numbers, and I can get the advantages of changesets and a distributed nature if I put my own solutions for these things in place; they’re not there by default. Some if not all of it can be relatively automated. The point is I almost feel I can get that well-known 80/20 win, where the biggest win comes from the first, relatively small amount of effort. Perhaps it would be a nightmare of details to make cvs distributed, but here’s the important thing to remember:
Give me something that follows the Unix Philosophy and I’ll likely be able to extend it to do pretty much whatever I need.
Get a little too complex with your data formats and abstract models and you’ll start suffocating innovation because it’s difficult to work with.
Git also follows the UNIX philosophy as closely as possible, but it starts not, like CVS, with versioning single files and trying to tie them together, but with snapshots of the whole project. It was built from the bottom up; even if you can’t access files directly, there are low-level tools (commands) such as git-cat-file and git-hash-object which allow direct manipulation of the underlying repository structure.
I’d recommend reading “The Git Parable” blog post by Tom Preston-Werner.
Thanks for the heads up on The Git Parable link. It was a great, simple introduction to the ideas in git.
I just finished watching Randal Schwartz’s Google TechTalk and am getting a better picture of the whole “track contents not files” paradigm.
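As I understand it now, the “contents” git tracks are named by hashing them with a small header, which is the computation git-hash-object performs. A minimal sketch of the idea:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute a blob's object name the way git does: the SHA-1 of
    the header "blob <size>\\0" followed by the raw content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

So two identical files anywhere in the project are literally the same object, and a file’s “version” is nothing more than the hash of what it contains.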
I’m just not sure I see the win. I like the advantages. I would like offline mode in cvs. I would like the performance of git in cvs. Great advances have been made in VCSs that I wish were in cvs.
I think if distributed development is a requirement, git is a necessity, but otherwise I don’t think I see a big enough win with git.
I’m still pressing ahead with research to see if something will click for me or perhaps I’ll get my head wrapped around git enough to see a killer feature.
Right now I just can’t see that thing that I absolutely need in a new VCS.