2010-08-13

Git Rebase Part 1: Why you should use it, Theory.

If you have yet to master Git's rebase feature, now is the time to do so. Rebasing, as we call it, provides you with astonishingly awesome powers for manipulating your repository history. Rebasing is, however, not for the faint of heart or for those still green on how Git really works, as there are a lot of concepts you need a firm grasp of in order to use it. ( I will try to cover these concepts here ).

If you have only ever worked on simple no-branching repositories with few committers, in a non-distributed SCM, you have probably not yet encountered a scenario where rebasing would make much sense to you. The problem appears when histories diverge by more than a few commits.

Normally, people wait till merge time to resolve this divergence, but it can be less than simple, and the longer and more complex your history, the harder merging will be, and there is no technological way to make this simply go away ( at least, not yet ).

Fortunately, most merges involve non-competing commits, and Git does a stellar job of doing the right thing in these scenarios.

The challenge occurs when two people create competing commits on parallel histories, and the problem is exacerbated when there is a stack load of changes on top of those competing commits.

Classical solutions to this result in a big explosion at merge time. All you have to go on is the current state of each branch at that time, so you have to know the code very well and be able to “pick one” of the solutions ( or, if you are especially unlucky, manually work out a 3rd solution in your head which is a product of both ).

Unfortunately, in many large open-source projects, not everyone knows everything about everything. When it comes time to merge, the person doing the merge often knows nothing about the specifics of the other person's changes on a line-by-line basis, so there is an unreasonable demand on the merger to be a magician.

Offloading the merge load to the contributor.

A good solution, in my mind, is to offload the responsibility of resolving issues onto the person making the contribution. They understand their code best, and they know what needs to go away, when, and where. One approach to this involves perpetual merges from upstream to keep your branch “synced”, but this is a nightmare. It also, in my experience, doesn't work like you would expect. I do not want to go into the specifics of the problems I have seen with merges, simply because communicating them simply is difficult. Perpetual merging also makes things even more complicated later down the line at reintegration, as comparing the diffs can be misleading as to what really changed, and it overcomplicates the commit history.

A logical way to consider how rebasing works.

Consider that you are working on a more old-school SCM such as Subversion, where this rebasing feature does not exist. To emulate a rebase, you would first have to find the point where the branch you want to rebase diverges from trunk. Then, you would produce a patch for each and every commit that had been applied to the branch since it was branched from trunk. Doing this is, of course, no simple feat ☺.

You would then create another branch, starting from the current trunk, and switch to it. Then, you would iterate through every patch in order, and apply it to this new branch, possibly stopping between each patch application, to correct any collisions that caused the patch to fail ( i.e.: edit the patch until it applied cleanly ), before committing it.

The product is a completely unambiguous patch series, relative to the current trunk. Where branches are considered “feature branches”, this new branch becomes a perfect logical sequence of commits that can be unambiguously applied to trunk to add the given feature.
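The manual procedure described above can be sketched in a few lines of Python. This is only a toy model, not how Git or Subversion is implemented: the `Commit` class, the `extend` helper and the string "patches" are all made up for illustration, and the step where a patch fails to apply and has to be edited by hand is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    name: str                   # label such as "04" or "x01"
    parent: Optional["Commit"]  # None for the very first commit
    patch: str                  # stand-in for the diff this commit introduces


def history(tip):
    """Return every commit reachable from tip, oldest first."""
    chain = []
    while tip is not None:
        chain.append(tip)
        tip = tip.parent
    return chain[::-1]


def divergence_point(trunk, branch):
    """Walk back from the branch tip until we reach a commit trunk also has."""
    on_trunk = {c.name for c in history(trunk)}
    node = branch
    while node is not None and node.name not in on_trunk:
        node = node.parent
    return node


def emulate_rebase(trunk, branch):
    """Replay the branch-only commits, in order, on top of the trunk tip."""
    base = divergence_point(trunk, branch)
    chain = history(branch)
    first = 0 if base is None else [c.name for c in chain].index(base.name) + 1
    new_tip = trunk
    for old in chain[first:]:
        # Same patch, new parent: a new commit derived from the old one.
        new_tip = Commit(old.name + "'", new_tip, old.patch)
    return new_tip


def extend(tip, names):
    """Demo helper (hypothetical): add one commit per name on top of tip."""
    for name in names:
        tip = Commit(name, tip, "change made by " + name)
    return tip


base = extend(None, ["01", "02", "03", "04"])
trunk = extend(base, ["05", "06", "07", "08"])
x = extend(base, ["x01", "x02", "x03", "x04", "x05"])

rebased = emulate_rebase(trunk, x)
print(" ".join(c.name for c in history(rebased)))
```

Run as-is, this prints the trunk commits 01 to 08 followed by x01' to x05', which is the single straight line of history the patch series is meant to produce. The trailing `'` is just a naming choice here to mark a replayed commit.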

In this new state, assuming no other commits are made to trunk, this new branch should logically be mergeable straight into trunk with no collisions whatsoever. ( I am of course making huge assumptions here with regard to Subversion being smart enough to know how to handle it, and not simply going “Hurr, branch and trunk look different, must be a collision!” )

To Explain this Visually:

http://gist.github.com/raw/517220/9b885f405d2f9cd3bc1c19b69868db341d6eea75/graph.dot.txt This is our initial repository, a nice straightforward commit sequence. “Trunk” is the current state of our directory. Note that although I have used numbers for clarity in explanation, Git internally has no such sequential concept. Hopefully, this structure is apparent to all readers.

http://gist.github.com/raw/517220/81a7d918d13cfec92fcc84c6fd39b1fdb68e28cd/graph2.dot.txt In diagram 2, more commits have been created. At 04 a new topic branch was created called “X”. Since the divergence of these branches, 4 commits have been created on trunk and 5 commits on X. Normally, you would probably want to try merging x05 back into trunk to create a new commit. But this leaves you with multiple paths in your history, which can make things very messy over time.

http://gist.github.com/raw/517220/8e6db008abe76f80575a225734148cfc6f6af05c/graph3.dot.txt
In the above diagram, each and every commit on X has been “replayed” on top of the trunk. Note that each replay creates a new commit, which is a derivation of the original commit. Taken together, when a whole branch is replayed on top of trunk ( or any other branch for that matter ), the effect is that you produce a second, derived branch that simply has a different origin.
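One way to see why a replayed commit is a new commit rather than the old one moved is to remember that a Git commit's identifier is a hash that covers, among other things, its parent. The sketch below is a simplified stand-in for that idea: the `commit_id` and `build` functions are hypothetical, and real Git hashes a full commit object ( tree, parents, author, committer, message ), not just a parent id and a patch string.

```python
import hashlib


def commit_id(parent_id, patch):
    """Toy identifier: a hash over the parent's id and this commit's change."""
    data = (parent_id + "\n" + patch).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def build(start_id, patches):
    """Stack patches on top of start_id and return the resulting ids in order."""
    ids, tip = [], start_id
    for patch in patches:
        tip = commit_id(tip, patch)
        ids.append(tip)
    return ids


# Ids for trunk commits 01..08, starting from an empty "no parent" id.
trunk = build("", ["01", "02", "03", "04", "05", "06", "07", "08"])

x_patches = ["x01", "x02", "x03", "x04", "x05"]
original_x = build(trunk[3], x_patches)  # X as first created, on top of 04
replayed_x = build(trunk[7], x_patches)  # the same patches replayed on 08

for old, new in zip(original_x, replayed_x):
    print(old[:8], "->", new[:8])
```

The patches are identical in both runs, yet every identifier in the replayed series differs from its original, because each one depends on a different parent. That is the sense in which the rebased branch is a derived copy with a different origin.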

In practice, this new branch is much as if you had decided “Hey, branches are too hard, we will not do them, so each new feature must be worked on and completed 100% before starting another”, and had instead merely “waited around” for commit 08 to arrive before developing the same feature. ( Except, of course, in reality you never had to do any of that silly waiting, and you actually were able to use branches! )

After creating the derivative branch, we can then clean up the original branch. It is no longer needed, and having it lying around is just likely to confuse people, not to mention make our history graph very messy. Git will perform this step for you automatically, as soon as the rebase is deemed “Successful” and “Complete”.

Now, if you consider the logical application of what a merge does at this point, it's quite straightforward. In fact, that sentence almost constitutes a bad pun, considering Git calls this type of merge scenario a “Fast Forward”. This is simply because it does no real merging at all. Git sees there is a simple, straight, linear sequence of commits that would update trunk to reflect the integration of the branch, so it simply changes which commit it calls “head”.
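The fast-forward idea can be sketched as a pointer move guarded by an ancestry check. This is a toy model under stated assumptions: every commit has a single parent, branches are entries in a plain `refs` dictionary, and the function names are invented for illustration rather than taken from Git.

```python
def is_ancestor(parents, ancestor, commit):
    """True if `ancestor` is reachable by walking parents back from `commit`."""
    while commit is not None:
        if commit == ancestor:
            return True
        commit = parents[commit]
    return False


def fast_forward(refs, parents, target, source):
    """Move `target` to where `source` points, but only if no merging is needed."""
    if not is_ancestor(parents, refs[target], refs[source]):
        raise ValueError("histories have diverged, a real merge is needed")
    refs[target] = refs[source]


def linear(names, start=None):
    """Demo helper (hypothetical): {commit: parent} for a straight run of commits."""
    parents, prev = {}, start
    for name in names:
        parents[name] = prev
        prev = name
    return parents


# History after the rebase: X's replayed commits sit directly on top of 08.
parents = linear(["01", "02", "03", "04", "05", "06", "07", "08"])
parents.update(linear(["x01'", "x02'", "x03'", "x04'", "x05'"], start="08"))
refs = {"trunk": "08", "X": "x05'"}

fast_forward(refs, parents, "trunk", "X")
print(refs["trunk"])
```

After the call, trunk points at x05' and no new commit has been created. Had X still been based on 04, the ancestry check would fail and the sketch would raise, which corresponds to the case where a real merge is required.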

Now as you can see ...

The result is a much much cleaner history to work with, and merging branches becomes trivially mindless ☺

Footnotes:

All diagrams designed in Graphviz. For the source for these diagrams, see this Gist on GitHub.