Merge vs Rebase: Part 3 - What is a rebase?

In part 1 we discussed what a commit hash was. One important aspect that we learned about commits was that they cannot be altered. The hash itself is generated from the information stored in the commit, so to modify a commit or commit hash you must create an entirely new commit. We also discussed that each commit stores the hash of the commit before it. What we didn't discuss is what effect this has on our Git history.

Because a commit's hash is generated from the information the commit stores, and part of that information is the previous commit's hash, modifying your commit history in place is almost impossible. Each commit is like a link in a chain that was forged around the previous link.

If you bust a link out of a metal chain like the one above, it's impossible to make the previous link and the next link reconnect without breaking them too. However, it's even worse in the context of Git; the analogy breaks down here because in a metal chain you could forge a new link to connect the previous and next links together. You can't do that in Git; the only link you could reforge here would have to be the exact same one, containing the exact same information, including the previous commit hash. Only that exact commit could have the exact commit hash that the next link in the chain holds a reference to.

If you were to delete one commit in the middle of the commit history, the next commit would reference a commit hash that no longer exists. Since you can't change that commit without changing its hash, you can't simply generate a new commit that references the previous commit, because the next commit in the chain references the exact hash of the original commit.

If you change one thing about a commit then the hash generated for it will be different and the next commit in the chain will no longer reference the new commit. You will have to change the next commit to reference the new commit hash, which causes that commit's hash to change as well. So on and so forth all the way to the end of the chain.
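The ripple effect described above can be sketched in a few lines of Python. This is only a toy model: the commit_hash and build_chain helpers are made up for illustration, and real Git hashes far more than a parent and a message (tree, author, timestamps and so on). The point is just that each hash depends on the one before it.

```python
import hashlib

def commit_hash(parent, message):
    """Toy hash: depends on the parent's hash and the commit's own content."""
    payload = f"parent {parent}\n\n{message}".encode()
    return hashlib.sha1(payload).hexdigest()

def build_chain(messages):
    """Return the hashes of a linear history, oldest commit first."""
    hashes = []
    parent = ""  # the first commit has no parent
    for message in messages:
        parent = commit_hash(parent, message)
        hashes.append(parent)
    return hashes

original = build_chain(["Commit 1", "Commit 2", "Commit 3", "Commit 4"])
edited = build_chain(["Commit 1", "Commit 2 (edited)", "Commit 3", "Commit 4"])

# Which positions in the chain kept their hash after editing Commit 2?
unchanged = [old == new for old, new in zip(original, edited)]
print(unchanged)  # [True, False, False, False]
```

Only Commit 1 keeps its hash. Commit 3 and Commit 4 have the same content as before, yet their hashes change because the link they were forged around changed.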

This is where rebase comes into play. If you remember from part 2, after we merged our feature1 branch into master we were left with a fork in our graph showing how all our commits were related to each other.

Merges work just fine, but the repository graph can quickly get out of control with all the forks and criss-crossing of commit relationships. Below is just a tiny snippet of what one of the repositories looks like where I work.

If you use a graphical interface for Git then chances are pretty good that you've seen something similar. Merges are the easiest way to move changes between branches because they avoid breaking the commit history chain and cause fewer headaches. However, once you get a robust understanding of how rebasing works you begin to gain an appreciation for it. For instance, if we were to rebase our feature1 branch in our demo repository onto the master branch then we would get a nice clean history like the following:

Notice now that our history is a simple straight line once again. What magic did Git perform to make that possible? If you remember from earlier, our Commit 3 and Commit 4 commits both shared Commit 2 as a common ancestor. Commit 3 referenced Commit 2 as its previous commit. Now you're probably wondering why it appears that Commit 3 has Commit 4 listed as its previous commit.

Remember how I said if you broke the chain in the middle then you'd have to rebuild all the commits from that point all the way to the end? Well that's exactly what rebase did.

If you look closely you can see that the commit hashes for Commit 3, Commit 5, and Commit 6 have all changed. Those were the three commits made to our feature1 branch. By rebasing our feature1 branch onto master, Git was able to rewind our branch all the way back to where it first diverged from master. It stored the diff for each commit on our branch in a temporary file. It then began to rewrite our branch history, but this time branching from the latest commit on master, Commit 4.

Git created brand new commits for every commit on our branch, complete with brand new commit hashes. As it creates the new commits it changes the first commit on our branch so that it now references the latest commit on master as its previous commit. This process of re-committing your changes as new commits is called replaying your commits on top of master.
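Here is a rough sketch of that replay step, using the same commits as our demo repository. It is a toy model and not how Git is implemented: the store dictionary and the make_commit, commits_since and rebase helpers are all invented for this example, and a "change" is just a label rather than a real diff.

```python
import hashlib

def make_commit(store, parent, change):
    """Store a toy commit and return its hash (derived from parent + change)."""
    digest = hashlib.sha1(f"{parent}|{change}".encode()).hexdigest()[:7]
    store[digest] = {"parent": parent, "change": change}
    return digest

def commits_since(store, tip, base):
    """Walk back from tip to base; return the hashes in between, oldest first."""
    found = []
    while tip != base:
        found.append(tip)
        tip = store[tip]["parent"]
    return list(reversed(found))

def rebase(store, branch_tip, fork_point, new_base):
    """Replay every change made after fork_point on top of new_base."""
    tip = new_base
    for old in commits_since(store, branch_tip, fork_point):
        tip = make_commit(store, tip, store[old]["change"])
    return tip

store = {}
c1 = make_commit(store, None, "Commit 1")
c2 = make_commit(store, c1, "Commit 2")
c4 = make_commit(store, c2, "Commit 4")  # latest commit on master
c3 = make_commit(store, c2, "Commit 3")  # first commit on feature1
c5 = make_commit(store, c3, "Commit 5")
c6 = make_commit(store, c5, "Commit 6")

new_tip = rebase(store, branch_tip=c6, fork_point=c2, new_base=c4)
replayed = commits_since(store, new_tip, c4)

print([store[h]["change"] for h in replayed])  # same changes, in the same order
print(store[replayed[0]]["parent"] == c4)      # first replayed commit now follows Commit 4
print(set(replayed) & {c3, c5, c6})            # no hash survived the replay
```

Notice that nothing in the sketch touches c4. The master side of the graph is left exactly as it was; only the feature1 commits are recreated.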

Note: Don't be confused by the terminology. Rebasing onto master does not modify master. It means that your branch commits will now follow after the latest commits to master.

You'll notice in the above screenshot that the master branch pointer still points at Commit 4, whose commit hash has not changed. If we now checked out master and merged feature1 into it, we wouldn't get a merge commit. It would be a simple fast-forward merge, meaning Git would simply move the master branch pointer up our now straight line of commit references so that it points to the same commit as the feature1 branch pointer.
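The fast-forward rule can be sketched as a simple ancestry check. Again this is a toy model under my own assumptions: the parents and branches dictionaries and the is_ancestor and merge helpers are hypothetical, and short labels like c3' stand in for the new hashes that the rebase produced.

```python
def is_ancestor(parents, ancestor, commit):
    """True if `ancestor` is reachable by following parent links from `commit`."""
    while commit is not None:
        if commit == ancestor:
            return True
        commit = parents[commit]
    return False

def merge(parents, branches, target, source):
    """Fast-forward `target` to `source` when possible; otherwise report a real merge."""
    if is_ancestor(parents, branches[target], branches[source]):
        branches[target] = branches[source]  # just move the branch pointer
        return "fast-forward"
    return "merge commit needed"

# Linear history after the rebase; labels stand in for commit hashes.
parents = {"c1": None, "c2": "c1", "c4": "c2",
           "c3'": "c4", "c5'": "c3'", "c6'": "c5'"}
branches = {"master": "c4", "feature1": "c6'"}

result = merge(parents, branches, "master", "feature1")
print(result, branches["master"])  # fast-forward c6'
```

Because master's commit sits directly underneath feature1's commits, no new commit has to be created. If master had a commit that feature1 did not contain, the check would fail and a merge commit would be needed instead.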

If instead of merging feature1 into master we decided to do more work and make more commits, we would again create a fork in the graph. Our next commit to master would reference Commit 4 as its ancestor and the first commit of our feature1 branch would also reference Commit 4 as its ancestor. To get a straight line again we would need to check out feature1 and rebase onto master once more.

This is basically what happens if you've ever submitted a pull request through GitHub and had it go stale. If the project maintainer doesn't merge your pull request in and instead continues to do more work on the project, your pull request would require another rebase to maintain a clean Git history. Accepting a pull request is just a simple merge. If you've rebased your work onto the original repository's branch before submitting the request and they merge it in before more commits are made, then the merge will be a fast-forward merge: their branch pointer simply moves up to the same commit that your branch points to, and the original repository stays clean.

DANGER, DANGER WILL ROBINSON!

So far I've shown you how to rebase a feature branch onto the master branch, which does not modify any commits already on master. Rebasing your branch onto master only rewrites the commits on your branch. However, if you try to do it in the other direction by rebasing master onto your branch, then you're asking for a world of pain.

Presumably your master branch is shared with everybody that clones your repository. If you push up a bunch of commits on master or any other shared branch and then I pull those commits down to my machine, I can now continue doing my work and committing new changes on top of yours. My first commit will now reference your last commit as its previous commit.

Let's pretend that you made five commits to your local master branch since the feature1 branch diverged from it and three commits to the feature1 branch. We'll also pretend that you pushed those commits on master up to your remote repository.

Now you make the mistake of rebasing master onto feature1 instead of rebasing feature1 onto master. Git rewinds master and replays those five commits as five brand new commits after the last commit on feature1. You now have a clean Git history with your five master commits appearing after the three feature1 commits, thanks to your rebase. However, you'll notice that origin/master appears to be on a strange fork in the graph.

If you now try to push your master branch up, Git will refuse. Git thinks you have five changes on your remote master branch that you need to pull down. It also thinks that you have eight changes that need to be pushed up.
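Those numbers, five to pull and eight to push, fall straight out of the graph. The sketch below rebuilds the scenario as a toy model and counts the commits on each side. The labels (base, m1 to m5, f1 to f3, and m1' to m5' for the replayed commits) and the helper functions are made up for illustration.

```python
def ancestors(parents, tip):
    """Collect every commit reachable from `tip` by following parent links."""
    seen = set()
    while tip is not None:
        seen.add(tip)
        tip = parents[tip]
    return seen

def ahead_behind(parents, local, remote):
    """Count commits only the local branch has, and commits only the remote has."""
    local_set = ancestors(parents, local)
    remote_set = ancestors(parents, remote)
    return len(local_set - remote_set), len(remote_set - local_set)

def add_commits(parents, start, labels):
    """Append a run of commits after `start` and return the new tip."""
    tip = start
    for label in labels:
        parents[label] = tip
        tip = label
    return tip

parents = {"base": None}  # the commit where feature1 diverged from master
origin_master = add_commits(parents, "base", ["m1", "m2", "m3", "m4", "m5"])
feature1 = add_commits(parents, "base", ["f1", "f2", "f3"])
# Rebasing master onto feature1 replays the five commits as brand new ones.
master = add_commits(parents, feature1, ["m1'", "m2'", "m3'", "m4'", "m5'"])

ahead, behind = ahead_behind(parents, master, origin_master)
print(ahead, behind)  # 8 5
```

The eight commits to push are the three feature1 commits plus the five replayed master commits. The five to pull are the original master commits, which your local master no longer contains.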

If you try to pull down in this scenario things are going to get hairy for you pretty quickly. Those five commits you pulled down were the old commits with the old commit hashes before you rebased.

Yuck!

"Aha!" you say, "I'll simply git push --force and all will be well!"

DON'T YOU DARE!

By force pushing you've simply dumped the problem on my lap as someone who has cloned your repository and started doing work based on those old commit hashes. Currently my clone of your repository looks like this.

--force will tell the remote repository to simply dump those old commits and use yours instead. Now the next time I go to pull from your repository I get to deal with the mess you've created.
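To see why this hurts me, here is a small sketch of what the remote does with and without --force. It is a simplified model, not Git's real push protocol: the push and is_ancestor helpers and the labels are hypothetical, with old standing for one of your original master commits, new for its rebased replacement, and mine for the work I committed on top of old.

```python
def is_ancestor(parents, ancestor, commit):
    """True if `ancestor` is reachable by following parent links from `commit`."""
    while commit is not None:
        if commit == ancestor:
            return True
        commit = parents[commit]
    return False

def push(parents, remote, branch, new_tip, force=False):
    """Move the remote branch pointer, refusing non-fast-forward updates unless forced."""
    if not force and not is_ancestor(parents, remote[branch], new_tip):
        return "rejected"
    remote[branch] = new_tip
    return "updated"

parents = {"base": None, "old": "base", "feat": "base",
           "new": "feat", "mine": "old"}
remote = {"master": "old"}

first = push(parents, remote, "master", "new")               # normal push
second = push(parents, remote, "master", "new", force=True)  # forced push
# Does my work still build on whatever the remote master points to now?
my_work_fits = is_ancestor(parents, remote["master"], "mine")

print(first, second, my_work_fits)  # rejected updated False
```

The normal push is refused because the remote's commit is not underneath yours. The forced push goes through, and from that moment my commit hangs off a commit the remote no longer knows about.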

I now have strange merge conflicts and weird duplicated commits. I have no idea where I should continue my work or how my work will ever again be in sync with yours. I am not a happy camper.

This is the reason that people need to be cautious of rebasing, and likewise of using git push --force. Undoing a rebase is not easy and is often impossible, so you really need to pay attention to what you're doing. The benefits of rebasing are great, but not if you don't know what you're doing. If you're not careful you could end up losing a bunch of work because you had to rewind your repository back to a commit that wasn't rebased and were unable to restore the lost commits from somewhere else.

In our imaginary doomsday scenario the fix would be for you to reset your repository to the last commit on your feature1 branch. You would then need to do a pull from your remote repository to restore those five master commits that you pushed up there before the rebase. Then you can check out the feature1 branch and do the rebase properly.

Our hypothetical situation would have been easy to revert, but if you lose your cool and start doing things without knowing what you're doing then you could easily lose those five commits forever. If you git push --force and blow away the original five commits on your remote repository then everything will look dandy to you just like in the previous screenshot. That is, until your contributors start cursing at you in your issue tracker or message list. Those five old commits would be toast and the only way for you to restore them so that others can work again would be to find someone who had pulled those five original commits down and restore them from that person.

Bottom line: just don't force push unless you KNOW that nobody else is working on that remote repository or branch. Where I work each developer has their own fork of the main repository and we submit pull requests back to the main repository with our changes. It is common for me to catch a small bug after I already have a pending pull request.

I will often fix the bug and amend my last commit with git commit --amend, which is equivalent to rebasing the most recent commit and giving it a new commit hash. I have no problem force pushing that change to my fork and blowing away the old commit with the old hash because I know for a fact that I am the only one working on that fork and my pull request has not been merged in yet. If it has been merged then I can no longer submit that amended commit to our main repository because the original commit has already been merged in and others could have pulled it down. If I force pushed at that point and opened a new pull request, I could potentially cause problems for others on my team were it to be merged in.

Rebase with caution. Once you get the hang of it you'll no longer fall into such traps. Just be careful and pay attention to what you're doing. Rebase snags are almost always just minor snags at first; they only turn into black holes of doom when the user tries to "fix" them and doesn't know what they're doing.

Remember, rebasing one branch onto another simply means: Break the chain where my current branch diverged and start forging brand new chain links onto the end of the other branch so that each link is connected to exactly one previous link and no two links connect to the same previous link.