Shrimp Git.

One of the occupational hazards of being a programmer is the constant need to learn new things or else risk falling hopelessly behind, or so it would seem. One of the more annoying aspects of this process is that the new “new thing” you’ve got to learn always seems to have been around for quite a while, and everybody else is already an expert. It’s the feeling that you’ve stumbled across a really great party, but you’re arriving a bit late, you have to park way down the street, and when you finally do get in the door you discover that all the shrimp dip is gone. I’ve decided that this is normal. One has only so many hours in the day to learn stuff, and we can’t possibly keep up with absolutely everything; the infinite supply of shrimp dip will always elude us.

But we can try.

The current focus of my self-education efforts is Git, and I think I’m making some progress. Or, at least, I’m at the point of understanding enough that I can reflect back on my ignorance and look at the path I took to my current state. Of course, I’m absolutely certain someone will read this posting and confirm that I’m deluding myself. Anyway, I’ve always been fascinated by the “Ah. Ha!” moment, the point where something just clicks and I suddenly “get it,” whatever “it” is. I always wonder why someone didn’t just tell me the “Ah. Ha!” stuff first. I certainly got that feeling learning Git. I’ll try to explain my version of the “Ah. Ha!” stuff below; maybe it will work for you.

Most explanations of Git that I’ve encountered begin with some true, but not very enlightening statements.

  • “It’s distributed.”
  • “Everyone has their own copy of the repository.”
  • “It’s a graph.”
  • “It is easy to branch, people do it all the time.”

These initial descriptions of Git are well-meaning and do accurately summarize many of its main characteristics, but they seem to lead away from giving the student the “Ah. Ha!” moment they need, or at least they did for me.

To someone only familiar with CVS and Subversion, which is probably the case for most people trying to learn Git, these initial descriptions seem like rather undesirable characteristics for a revision control system; they may even sound like a recipe for complete chaos. They certainly raised a bunch of immediate questions in my mind. For instance, wouldn’t it be hard to coordinate and manage the different distributed instances? How could everyone have their own copy, and why would that be a good idea? How would you synchronize all the different copies? And what about back-up? If you lose one of the distributed copies of the repository, wouldn’t you lose part of your code base? And, finally, what’s with the branching thing anyway? That just sounds like an extra complication.

To figure this out I read “Pro Git,” by Scott Chacon, which I thought was pretty good, and other material I could find online.  The moment things became clear, though, was when I realized the significance of the role played by the SHA-1 cryptographic check sum that Git computes for each set of changes committed to a copy of the repository.

So, here’s where I had my “Ah. Ha!” moment. Centralized repository systems like CVS and Subversion produce change set identifiers (version or revision numbers) that are relative to, and only valid within the context of, a single centralized repository instance. And the relationships between change sets are implicitly expressed by the value of the identifier (i.e., revision 999 comes “after” revision 998). Here it is: in Git, the use of the SHA-1 cryptographic check sum allows these two characteristics to be separated. Identifying a change set no longer depends on a particular repository, and the order of change sets no longer depends on the value of the identifier. The check sum plays the role of an absolute, not relative, identifier. In essence, it is the absolute address of a change set in the virtual address space of change sets, as defined by the SHA-1 computation.

Git it?!

Ah!: The use of the SHA-1 cryptographic check sum allows Git to produce
unique revision numbers without the need to consult an external
(centralized) authority.  This means that there is no need to be
in communication with anything to commit.  The computation of the
check sum can be performed independently, offline, but because all
implementations of Git perform exactly the same computation, this
is not a problem.  Independently committed change sets can come
together at any time in the future with no clashes.
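
To make the “Ah!” point concrete, here is a minimal Python sketch, not Git’s actual code. The function name `object_id` and the sample contents are my own illustrative choices. The header imitates the type-and-size prefix Git puts in front of an object’s content before hashing, but the sketch only covers the simplest case. The point is that the identifier depends on nothing but the bytes being hashed, so no central authority is needed to hand it out.

```python
import hashlib


def object_id(content: bytes, kind: str = "blob") -> str:
    """Return a 40-character hex identifier derived only from the content."""
    # Header imitates Git's "<type> <size>\0" prefix for stored objects.
    header = f"{kind} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# Two "machines" that never talk to each other compute the same identifier
# for the same bytes, and a different identifier for different bytes.
id_on_laptop = object_id(b"hello, shrimp dip\n")
id_on_server = object_id(b"hello, shrimp dip\n")
id_of_other = object_id(b"no dip left\n")

print(id_on_laptop == id_on_server)  # True
print(id_on_laptop == id_of_other)   # False
```

Nothing in `object_id` reads a counter, a clock, or a network socket, which is why the computation can happen offline.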

Ha!: Git explicitly links change sets by storing the check sum of the
previous change set (i.e., the “parent” or “parents”) in the current change
set, and using this value in the computation of the current check
sum.  This produces a unique, explicit chain of linked change sets that can
be followed back to the original state of the system.  This means
that arbitrary sequences of change sets, created by many
independent programmers, can be committed to their own instances
of a Git repository and then later, if desired, shared and
integrated with other related Git repositories by finding common
ancestors.
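
The “Ha!” point can be sketched the same way. This is a simplified illustration under my own assumptions, not Git’s real commit format: `change_set_id` is a hypothetical helper that hashes the parent identifiers together with a message and some content. Because the parents are part of the hashed input, the same change applied on top of a different history gets a different identifier.

```python
import hashlib


def change_set_id(parents: list, message: str, content: bytes) -> str:
    """Hash the parent ids together with the change itself (simplified)."""
    h = hashlib.sha1()
    for parent in parents:
        # Each parent id is an input to the child's id.
        h.update(b"parent " + parent.encode() + b"\n")
    h.update(b"\n" + message.encode() + b"\n")
    h.update(content)
    return h.hexdigest()


root = change_set_id([], "initial", b"hello\n")
child = change_set_id([root], "second", b"hello world\n")

# The same message and content on top of a different root: a different id.
other_root = change_set_id([], "other initial", b"hi\n")
child_elsewhere = change_set_id([other_root], "second", b"hello world\n")

print(child == child_elsewhere)  # False
```

Following the stored parent ids from `child` leads back to `root`, which is the chain described above.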

Now we’re talking “Shrimp dip.”

With an understanding of these two points, the idea that Git is “distributed” starts to make some sense. There could be two instances of a Git repository on two different computers, and two different programmers could commit to their respective repository instances without consulting each other. Later, the change sets from the different commits could be exchanged between the two repository instances; they would then be available to both programmers and could be merged as appropriate, or not. And, of course, there’s nothing special about having just two instances; there could be an arbitrary number of instances “distributed” across many different computers. I must say that this isn’t what I thought when I first read “distributed.” Instead, I envisioned a repository that stored different portions of its contents in a mostly non-redundant manner on many different machines. Now I understand that it stores mostly redundant copies of its contents on different machines, with the differences being propagated occasionally, as necessary.
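
The two-programmer scenario above can be sketched as a toy model in Python. Everything here is hypothetical: repositories are plain dictionaries keyed by identifier, and `commit` and `fetch` are illustrative helpers rather than Git’s real commands or storage format. The exchange step is just “copy what the other side has and I lack,” and it needs no coordination because identifiers derived from content cannot clash.

```python
import hashlib


def commit(repo: dict, parents: list, message: str) -> str:
    """Store a change set in `repo` and return its identifier."""
    payload = "".join(f"parent {p}\n" for p in parents) + "\n" + message
    cid = hashlib.sha1(payload.encode()).hexdigest()
    repo[cid] = {"parents": list(parents), "message": message}
    return cid


def fetch(dst: dict, src: dict) -> int:
    """Copy every change set that `dst` lacks from `src`; return how many."""
    missing = {cid: c for cid, c in src.items() if cid not in dst}
    dst.update(missing)
    return len(missing)


alice = {}
root = commit(alice, [], "initial import")
bob = dict(alice)  # bob starts from a copy of alice's repository

# Each programmer commits independently, with no communication.
a1 = commit(alice, [root], "alice: fix parser")
b1 = commit(bob, [root], "bob: add tests")

# Later, the two instances exchange what the other is missing.
fetch(alice, bob)
fetch(bob, alice)
print(sorted(alice) == sorted(bob))  # True
```

After the exchange both instances hold the same three change sets. Whether and how the two lines of work get merged is a separate decision.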

The idea that everyone would have their own copy of the repository no longer seems so strange either. It is completely obvious: of course they would; how else would it work? However, I have to admit that it took me a moment to realize that this flexibility doesn’t imply that all of the Git repository instances are equally “valid” with respect to creating the contents of an eventual system release. There is still the need for a single Git repository instance to serve as a curated, backed-up collection point for contributions from copies (“clones” in Git-speak).

Finally, the ideas that Git repositories form a graph and that branching is easy and happens all the time become obvious. The change sets form the nodes of a graph, linked together using the check sums as pointers. Branching is simply a function of how the different commits point to common ancestors.
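
Here is a small Python sketch of that graph view. The short labels `c1` to `c5` stand in for real check sums, and `nearest_common_ancestor` is a simplified search of my own; Git’s actual algorithm for this handles more cases. Each node maps to the list of its parents, so a branch is nothing more than two nodes pointing at the same parent.

```python
from collections import deque


def ancestors(graph: dict, tip: str) -> set:
    """Return every change set reachable from `tip`, including `tip`."""
    seen = set()
    stack = [tip]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node])
    return seen


def nearest_common_ancestor(graph: dict, a: str, b: str):
    """Walk back from `b` breadth-first until reaching an ancestor of `a`."""
    reachable_from_a = ancestors(graph, a)
    queue = deque([b])
    visited = set()
    while queue:
        node = queue.popleft()
        if node in reachable_from_a:
            return node
        if node in visited:
            continue
        visited.add(node)
        queue.extend(graph[node])
    return None


# c3 and c4 both name c2 as their parent: that is the branch point.
graph = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c2"], "c5": ["c4"]}

print(nearest_common_ancestor(graph, "c3", "c5"))  # c2
```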

Now, it’s time for some more dip.

  1. Ian Bull
    June 9th, 2010 at 17:23 | #1

    Thanks for writing this. I had the exact same feeling about git. It was after the eclipsecon tutorial where I had my Ah-Ha moment, and it was very similar to yours.

  2. June 9th, 2010 at 17:54 | #2

    I wrote up a long post on the key differences between CVCS and a DVCS at:

    http://alblue.bandlem.com/2010/02/git-for-eclipse-users.html

    This has also made it into the EGit documentation inside Eclipse

    http://wiki.eclipse.org/EGit/Git_For_Eclipse_Users

  3. Pat
    June 9th, 2010 at 17:54 | #3

    Well said. Grokking this is essential to truly understanding and making the best use of Git. That’s why I think the best reading on Git, once you’re past the basic intro, is either the “Git Internals” PDF ($9 ebook by “Pro Git” author Scott Chacon — worth every penny) or the free PDF “Git from the bottom up”.

    Git Internals: http://peepcode.com/products/git-internals-pdf

    Git from the bottom up: http://www.newartisans.com/2008/04/git-from-the-bottom-up.html

  4. June 11th, 2010 at 01:51 | #4

    I have outlined other key differences between DVCS and CVCS:

    http://stackoverflow.com/questions/2704996/describe-your-workflow-of-using-version-control-vcs-or-dvcs/2705286#2705286

    But, regarding Git specifically, one blog post that forces you to “re-design Git” is also very illuminating:

    “You could have invented git (and maybe you already have!)”:
    http://reprog.wordpress.com/2010/05/13/you-could-have-invented-git-and-maybe-you-already-have/

  5. June 12th, 2010 at 05:03 | #5

    I love it!
