Advanced Git Tips and Tricks
by Enrico Campidoglio
Get the most out of Git by exploring its lesser-known features to take your daily workflow to the next level. This course teaches you how to find a good workflow, track commits, and even debug.
Introduction
Introduction
Hi, my name is Enrico Campidoglio. Welcome to Advanced Git Tips and Tricks. If you've been using Git for a while, you probably know how to use it like any other version control system. You commit files, review the history of changes, maybe even create a merge branch. But you've probably also heard that Git isn't like any other version control system; it's more powerful. That certainly sounds great, but what does it mean in practice? This course will teach you what that power is all about. You will learn how to take advantage of Git's advanced, and therefore lesser known, features to help you improve your daily workflow. Let's get started.
The Tools
Throughout this course, we'll be interacting with Git entirely from the command line. If you feel more comfortable using a graphical user interface, you can certainly run one alongside the command line as you follow along. However, I highly recommend that you make the command line your primary tool when using Git. Why? Because graphical interfaces, regardless of how well they are designed, always impose some kind of limitation on what you can and can't do with Git. We'll talk more about the difference between using Git from the command line and through a graphical user interface in the next module. Just because we'll be using Git from a command prompt doesn't mean that we'll have to sacrifice user-friendliness. There are a few utilities designed specifically to make Git easier to use from the command line. For example, they enrich the command prompt with updated information about the repository. Most of them also provide autocompletion, so we don't have to type as much. We'll look at a couple of these utilities at the beginning of the next module. Of course, we're going to need to have Git installed on our machine. It doesn't have to be a specific version. Anything from 1.8.0 and above will work just fine. If your version is older than 1.8.0, the easiest thing to do is to simply install the latest version for your operating system. Since staring at the command prompt alone can sometimes become quite boring, we'll also be looking at a real-time graphical representation of our Git repositories. Every time we run a command, the graph will show us how it has affected the underlying history. This kind of visualization really helps to understand how the commands work at their fundamental level. With that knowledge, it then becomes easier to recognize in what situations they're useful, and how to use them appropriately. At this point, we should be all set. But before we dive into the advanced stuff, let's take a moment to go through some Git fundamentals.
Git Fundamentals
One of the common complaints people have about Git is that it's too complicated. You may have heard someone say that Git is hard, or in some cases even, Git to me is harder than programming. These complaints are understandable. Git can, in fact, be rather intimidating at first, with its scary-sounding commands and huge set of options and parameters. However, beneath that unfriendly surface, Git's foundation is actually very simple. So simple, in fact, that we can summarize it using three basic concepts. Commits, Snapshots, and References. Let's look at each of them individually, and how they relate to each other. Every time we Commit some changes, Git creates a so-called Commit Object. The Commit contains information about its author, the timestamp when it was created, and, most importantly, a Reference to the Commit that came before it: its Parent. The Snapshot represents the state of the directories and files in the repository at the time the Commit was made. Internally, Git represents each directory with an object called a Tree. Under a Tree, there can be other Trees, or the actual files, which are called Blobs. You can directly point to any Commit, Tree, or Blob by its unique I.D., which consists of the computed SHA-1 hash of the object's contents. While having these I.D.'s is sometimes pretty handy, if you are to use them all the time, it will quickly become bothersome. So Git offers a better way to refer to Commits by using so-called References. A Reference is just the name that points to a certain Commit. You can think of it as a symbolic link, or a bookmark. In Git, there are three kinds of References: Tags, Branches, and HEAD. A Tag is a fixed Reference to a specific Commit. It never changes. This is commonly used to mark the Commit associated with a certain version of the software that's in the repository. A Branch is a Reference to the latest Commit in the line of history. Every time we make a new Commit, the Branch Reference is updated to point to it. 
The history of a Branch can then be recreated by starting from the latest Commit and tracking backwards, following the trail of Parents. Lastly, HEAD is a special Reference managed by Git to keep track of the Branch, or in some cases, the Commit, whose Snapshot corresponds to what's in the working directory. And that is Git's object model in a nutshell. As you see, there is nothing really complicated about it. The complexity everyone talks about lies entirely in the multitude of commands and options built to manipulate this object model in various ways. This kind of complexity has nothing to do with the essence of the problem that Git is trying to solve. Instead, it's accidental: a byproduct of design choices and collaboration that could have been avoided in the first place. Therefore, keeping a mental model of Git's fundamentals really helps to see past this complexity, making even the most complicated command look trivial. And that's exactly what you are going to do in this course.
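To make the object model concrete, here is a minimal sketch that builds a throwaway repository and inspects a commit, its snapshot, and the three kinds of references. The temporary path, file name, author identity, and tag name are all placeholders chosen for illustration, not something taken from the course.

```shell
# Build a throwaway repository (location chosen by mktemp) with two commits.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity

echo "hello" > README.md
git add README.md
git commit -q -m "Add README"
echo "more" >> README.md
git commit -q -a -m "Extend README"

# A commit object: author, timestamp, the tree (snapshot) and the parent.
git cat-file -p HEAD

# The snapshot: a tree that lists blobs (files) and nested trees (directories).
git cat-file -p 'HEAD^{tree}'

# References: a tag stays put, a branch follows new commits,
# and HEAD names the branch that is currently checked out.
git tag v0.1 HEAD~1                      # hypothetical tag name
branch=$(git symbolic-ref --short HEAD)  # the branch HEAD points to
parent=$(git rev-parse HEAD~1)
tagged=$(git rev-parse v0.1)
kind=$(git cat-file -t HEAD)
echo "HEAD -> $branch, v0.1 -> $tagged, HEAD is a $kind"
```

Following the `parent` line printed by `git cat-file -p HEAD` from commit to commit is exactly the backwards walk that reconstructs a branch's history.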
The Command Line
Introduction
In this module we look at how to work efficiently with Git from the command line. We'll start by comparing the experience of using Git to a Graphical User Interface versus using it from the command line and demonstrate why the latter is a better tool for the job. Next, we'll see how we can do more and type less by using aliases. Finally, we'll explore interesting ways to visualize the contents and history of our repositories without ever leaving the command line. Let's get started.
CLI vs. GUI
Generally speaking, people tend to fall into two camps when it comes to the way they prefer to interact with computers. On one side there are the people who, given the choice, will always pick a command line interface, and on the other the people who would rather use a Graphical User Interface. Whichever you feel more comfortable with is largely a matter of taste; however, it definitely is the case that for certain tasks one is more suited than the other. For example, it would be slow and impractical to browse the web using a text-based browser like Lynx. By the same token, when it comes to Git, the command line is a lot more efficient than any Graphical User Interface. Does this statement sound controversial? Well, considering what I said earlier about taste, it certainly might be to some degree, but to prove my point let's take a look at a fairly common Git GUI. Does anything in particular stand out to you? Like many other tools of the same kind, it presents the user with toolbars and menus packed with commands and options. These applications are supposed to make it easier to use Git without having to invest a whole lot of time in learning how it works. However, in reality, if you don't know what to choose, there is very little they can do to help. If you think of Git's commands and options as a domain-specific language, one that allows you to manipulate the history of your source code, it becomes obvious how any attempt to make it fit into some sort of graphical interface works against its text-based nature. It's destined to become bloated to the point of becoming a limitation rather than an empowerment. Consider a programming language: would you rather create software by writing code or by clicking on menus and buttons? That's why it makes more sense to use Git from the command line. Much like when using a text editor, everything is only a few characters away, without forcing any particular workflow.
The trick to becoming efficient is to understand how the different commands work and how they affect history, and to learn when it's appropriate to use them.
Command Line Utilities
Just because we use the command line doesn't mean we have to give up on user-friendliness. A few utilities exist to help us be more efficient when using Git from the command line. The choice of which utility to use depends first and foremost on the operating system you're on. I'm going to present you with two very good options: one for Windows and one for a Unix-based operating system such as Linux or Mac OS X. If you're using Windows, your best option is to use Git from PowerShell with a module called posh-git. You can easily install posh-git by cloning it from its GitHub repository and running the script install.ps1. If you have PsGet installed, it becomes even easier, since you can simply say Install-Module posh-git. From that point on, every time you cd into a directory that contains a Git repository, posh-git will automatically enrich the command prompt with useful information, such as the name of the current branch and the number of added, modified and deleted files, both in the working directory and in the index. On top of that, you will also get autocompletion for all Git commands and for any reference you create in your repository, like branches, tags and remotes. If you're using bash on Linux or Mac OS X and you are only interested in autocompletion, there is a simple script in Git's source code that will do just that. All you have to do is download it and add it to your bash profile. That way, it will automatically become available in all new bash sessions. If you are like me and prefer something a little more visually appealing, then I highly recommend that you replace bash with Z shell and install a set of utilities known as oh-my-zsh. Z shell comes preinstalled in Mac OS X and mostly in _____. If it isn't, you can install it following the instructions available at this URL. Oh-my-zsh can then be installed by downloading and running a shell script from the project's GitHub repository.
This adds the same functionality for Git as posh-git does for PowerShell, however with a bit more style.
Aliases
Sometimes, even with autocompletion on, you just can't seem to type fast enough. For those situations, having short aliases for the most common Git commands can be quite handy. Defining an alias is easy: you simply say git config alias. followed by the name of the alias, and then the command that you want to map the alias to. In this case, we create an alias for the status command called st, so now we can write git st. Aliases are stored in a dedicated section in Git's settings file. Here is what it looks like. As with the other settings, you can define aliases that apply only to the current repository, to the repositories that belong to the logged-on user, or to the entire system by passing the correct switch to git config. git config --global, for example, creates the alias in the Git config file located in the user's home directory. Of course, you can create an alias not just for a simple command but also for a combination of commands and options. Here is a slightly more useful version of our previous alias. This one is less verbose than the standard status and therefore easier to read. Another alias that I use all the time is commit --all --message, which means commit all modified files using the following message. Aliases can also contain an entire sequence of commands and even accept parameters. You can really take them as far as you need them to go. Here is an alias that I use fairly often and that demonstrates this capability, where qm stands for quick-merge. As the name implies, it allows you to quickly merge the current branch into another one by combining two commands and one parameter. Let's break it down piece by piece. The bang indicates that what follows should be interpreted by the shell as a command. git checkout moves HEAD to the branch specified by the first parameter, indicated by $1. git merge merges the branch that HEAD was pointing to just before it moved into the current branch, that is, the previous entry in the reflog.
With this, we could be working in a branch and merge it into master by simply saying git qm master. Pretty useful, don't you think? Aliases can do even more than that. For example, they can be associated with an entire shell function and do all kinds of things, like manipulate strings, invoke other programs and so on. We'll be using aliases extensively throughout the entire course to wrap long combinations of commands and options that would otherwise be too cumbersome to type over and over.
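The aliases described above could be set up roughly as follows. This is a sketch under a few assumptions: the transcript does not show the exact options behind the terser status alias, so `--short --branch` is a guess; the name `cm` is hypothetical; and `qm` is written here as a shell function, a variation on the two-command form described above that keeps the branch argument from being appended a second time. Everything is stored in the local repository config; adding `--global` would store it in the user's config file instead.

```shell
# Throwaway repository with one commit (paths and identity are placeholders).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "v1" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
base=$(git symbolic-ref --short HEAD)   # the initial branch, whatever it is called

# Simple alias for a command plus options (assumed options for a terser status).
git config alias.st "status --short --branch"

# Commit all modified files using the message that follows (hypothetical name).
git config alias.cm "commit --all --message"

# Quick merge: "!" hands the rest to the shell. The function receives the
# target branch as $1, checks it out, then merges the branch we just left,
# which Git exposes as @{-1}.
git config alias.qm '!f() { git checkout "$1" && git merge @{-1}; }; f'

# Try them out on a topic branch.
git checkout -q -b topic
echo "v2" >> notes.txt
git cm "Extend notes"
git st
git qm "$base"    # back on $base, with topic merged in
```

Because the base branch has no commits of its own here, the merge is a fast-forward and never opens an editor.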
Pretty History
When we want to look at the history of the current branch, we say git log. This way of visualizing a sequence of commits is fine, but we can do better. Since history is something we'll be wanting to look at all the time, every tiny bit of improvement that we can make is worth the effort. In fact, the log command supports an option that allows us to format the list of commits in different ways. For example, if we would like to limit the output to only one line per commit, we could say git log --pretty=oneline. This is definitely easier to read, but it's missing some important information. All we get is the SHA-1 hash of the commits along with their messages, and that's pretty much it. What if we could specify the exact pieces of information we want to include and the order in which we want to see them? As it turns out, the same option can do just that. Here is an example. Let's break down this format string: %h represents the abbreviated commit hash, %d shows any references currently pointing to the commit, %s contains the first line of the commit message, also known as the subject, %cr represents the timestamp when the commit was made, relative to now, and finally %an contains the name of the commit author. You can find a complete list of all available placeholders at this URL. Now, this output is short and expressive, but it's also quite boring, to be honest. What it needs is some color. Let's go ahead and add it. That's much better, but we aren't done yet. As a final touch, let's include a graphical representation of the branches by adding the --graph option. Of course, we wouldn't want to type all this every time we want to look at history, so let's make an alias for it, shall we? Now this is what I call a pretty history.
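The exact format string used on screen is not part of the transcript, so the following is only one possible reconstruction from the placeholders listed above. The color choices, the punctuation between fields, and the alias name `lg` are assumptions.

```shell
# Throwaway repository with two commits (paths and identity are placeholders).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "First commit"
echo "two" >> file.txt
git commit -q -a -m "Second commit"

# One line per commit: hash and subject only.
git log --pretty=oneline

# Custom format: abbreviated hash, references, subject, relative date, author.
# %C(...) switches color, %Creset switches it off again.
fmt='%C(yellow)%h%C(red)%d%Creset %s %C(green)(%cr) %C(blue)<%an>%Creset'
git log --graph --pretty=format:"$fmt"

# Wrap it all in an alias (hypothetical name) so it is quick to type.
git config alias.lg "log --graph --pretty=format:'$fmt'"
git lg
```

Whether the colors actually show up depends on the Git version and on whether the output goes to a terminal.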
Diffs
Besides history, the next thing we'll be looking at just as often are the contents of our working tree, the index and specific commits. For that, we use the diff command. What we see here are the changes currently in our working tree, displayed in the unified diff format. Pretty common stuff. However, believe it or not, there is a lot we can do here to improve the command line experience. By default, Git redirects the output of commands that might produce more text than would fit in one screen to a program called less in order to provide paging. If we see text that's cut off because it's too long to fit on the screen, we can switch between cutting and wrapping long lines by typing -S inside less. We can also scroll through the text using the k and j keys to scroll up and down. Once we're done, we go back to the command prompt by pressing q. One thing to notice here is that Git uses paging regardless of whether or not the text actually fits on the screen. Having to press q all the time can quickly become annoying, so for convenience we configure Git not to make the output scrollable unless it's needed, by passing the right set of options to less. The core.pager setting tells Git which program to invoke when it has to paginate the output of a command. Here, the -F and -X options instruct less to exit immediately if the output fits in a single screen, without clearing it. The -R option stands for raw output, which includes the control characters used for colors. If you want, you could also add the -S option we used before to decide up front how long lines are handled (on the less command line, -S cuts long lines rather than wrapping them); however, I prefer to choose myself when to wrap the output, depending on the situation. We also have a few different options with regard to how Git generates the diff itself. According to the unified diff format, a line of text is always either added or removed. Even if only part of a line is modified, it will be displayed as if the old version was removed and a new one added. That's fine in most situations, but certain kinds of changes are better visualized inline.
The diff command is versatile enough to allow us to do just that by passing the --word-diff option. Sometimes when we are reviewing a change, we need to see more lines around the modified ones to gain a better understanding of the context in which the change was made. By default, git diff shows exactly three lines before and after each block of changes, according to the unified diff specification; however, we can increase that threshold with the --unified option. This time we see ten lines on each side of the change instead of just three. These two options, --word-diff and --unified, are especially useful when looking at prose, for example documentation or a blog post. To save us from typing, we can create an alias that combines both of them and call it diff-prose, or dp for short. Sometimes git diff doesn't produce the output we expect. Consider this patch, for example. We can see that a few lines have been modified, but it's hard to find any coherence among them. Since Git's default diff algorithm favors speed over readability, it sometimes misinterprets which portions of the file belong together, making it nearly impossible to understand at a glance what the change means. This can become a problem when we are manually editing patches during a partial commit or an interactive rebase. In order to remedy these situations, Git offers a couple of different diff algorithms that we can use to produce better results. One such algorithm is called patience. The patience algorithm trades speed for accuracy. In other words, it takes its time to correlate the groups of matching lines to find out which belong to the same change. This produces a much more readable diff. You might not think that this was any slower than the previous example, but processing large files using the patience algorithm might take significantly longer. For that reason, there is a faster version of it called histogram.
Now, it would be nice if Git could use this every time it has to generate a diff, and not only when we explicitly ask for it. Fortunately, as of Git 1.8.2, we can make it the default with a configuration setting. Next, we look at how we can take advantage of the diff command to compare entire commits, or even individual files within commits, in our Git repo.
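The pager and diff settings from this clip can be sketched as follows. The sample file and its contents are made up for illustration, and the settings are written to the local repository config so the sketch stays self-contained; with `--global` they would apply to every repository.

```shell
# Throwaway repository with one committed file (placeholders throughout).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
printf 'The quick brown fox\njumps over the lazy dog\n' > story.txt
git add story.txt
git commit -q -m "Add story"

# Change a single word in the working tree.
printf 'The quick red fox\njumps over the lazy dog\n' > story.txt

# Page only when the output does not fit on one screen:
# -F quits if it fits, -X leaves the screen uncleared, -R passes colors through.
git config core.pager "less -FRX"

# Show the change inline, word by word, with ten lines of context.
git diff --word-diff --unified=10

# The same combination as an alias for reviewing prose.
git config alias.dp "diff --word-diff --unified=10"
git dp

# Ask for a specific diff algorithm once...
git diff --patience

# ...or make one the default (needs Git 1.8.2 or later).
git config diff.algorithm histogram
```

In the default word-diff output, removed words appear as `[-word-]` and added words as `{+word+}`.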
Show Commits
There are a couple of different ways to look at the changes introduced by a specific commit. First, there is the show command. Of course, we can specify any valid reference here, like for example HEAD~2, which refers to the grandparent of HEAD, that is, the commit two steps back. Now let's look at the output. The first section contains information about the commit object itself: its SHA-1 hash, its author, the timestamp when it was created and the message associated with it. The second section shows the difference between the snapshot referenced by the commit and the one referenced by its parent. The patch is formatted according to the unified diff format, similarly to what you would get by running git diff. The show command accepts the same --pretty option as git log does. That way, we can provide a custom visualization for single commits as well. Of course, we wouldn't want to type this all the time, so we make an alias for it. Let's call it show-object. If we are just interested in reading the commit metadata, we can do that by suppressing the diff output, or if we'd like to include a summary of the changes, we would specify the --stat option. Let's take a moment to look a little deeper into this. As we talked about in the previous module, each commit points to a snapshot of the files in the working directory. This snapshot contains a tree object for each directory and a blob object with the contents of each file. This aspect of Git is so important that it's worth repeating. As opposed to traditional version control systems like Subversion, Git records the entire contents of the files in the working directory for each commit, not just a difference based on the files in the previous one. To demonstrate it, let's look at the contents of a commit using the lower-level commands collectively called plumbing. cat-file will print the raw contents of the object pointed to by the specified reference. In this case, since HEAD points to a commit, we'll get the contents of the commit object.
You may notice that the information displayed here is the same as that provided by the git show command, only this time there is no patch. Instead, we get the hash of the root tree object referenced by the commit. If we run cat-file on that, we get the contents of the working directory. As you can see, there is a tree object for every subdirectory and a blob object for every file. Now let's look at what's inside that blob. And there you go: the blob doesn't contain a diff. Instead, it contains the entire file. This means that commands like git show calculate the difference between snapshots on the fly. Of course, storing the entire contents of each file for every commit would be a waste of disk space. That's why Git is able to calculate deltas between sequences of commits, also called delta chains, and store only those to save space, but that's just an implementation detail of the storage subsystem. Conceptually, we can still think of a commit as pointing to the entire working directory with all of its files. Another way of looking at the difference between two commits is by using the diff command and specifying a range of commits using the double-dot notation. We can see that the snapshot in HEAD has two new files compared to the one two commits ago. Now let's try to switch places between the two references. This time, the generated diff shows that the files are deleted. That's because those files didn't exist two commits ago. This further demonstrates that diffs are generated on the fly by comparing the specified snapshots. Finally, git diff allows us to compare even individual files within different commits. For example, we could compare the README.md file in the working directory against the same file as it was three commits ago. As you can see, using Git's commands directly is far more versatile and flexible than any GUI could ever hope to be.
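The walk from commit to tree to blob, and the on-the-fly diffs between snapshots, can be retraced with a sketch like the one below. The repository contents are invented for illustration, and since the sample history has only three commits, the single-file comparison uses HEAD~2 rather than going three commits back.

```shell
# Throwaway repository with three commits (placeholders throughout).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "first draft" > README.md
git add README.md
git commit -q -m "Add README"
echo "1" > one.txt
git add one.txt
git commit -q -m "Add one.txt"
echo "2" > two.txt
git add two.txt
git commit -q -m "Add two.txt"

git show HEAD~2        # commit two steps back: metadata plus patch
git show -s HEAD       # metadata only, diff suppressed
git show --stat HEAD   # metadata plus a summary of changed files

# Plumbing: commit -> root tree -> blob.
git cat-file -p HEAD                   # raw commit: tree, parent, author, message
tree=$(git rev-parse 'HEAD^{tree}')
git cat-file -p "$tree"                # one entry per file or subdirectory
blob=$(git rev-parse HEAD:README.md)
git cat-file -p "$blob"                # the whole file, not a diff

# Diffs are computed by comparing snapshots, so swapping the two ends
# turns additions into deletions.
added=$(git diff --name-status HEAD~2..HEAD)
deleted=$(git diff --name-status HEAD..HEAD~2)
echo "$added"
echo "$deleted"

# Compare one file in the working directory with an older snapshot of it.
echo "an edit" >> README.md
git diff HEAD~2 -- README.md
```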
Summary
In this module, we looked at how we can work efficiently with Git from the command line. We started by comparing the experience of using Git through a Graphical User Interface versus using it from the command line, and demonstrated why the latter is a better tool for the job. Then we saw how we can do more with less typing by using aliases. We saw how we can create aliases for basically anything, from simple commands to entire combinations of commands, parameters and options. Finally, we learned how we can use git log, diff and show to enrich the way we look at the history of our repositories, the contents of the working directory and individual commits. In the next module, we'll start creating commits, but not just any commits. We'll be creating commits that are crafted in such a way as to turn the history of our repository into a journal that's self-explanatory and easy to follow.
Crafting Commits
Introduction
In this module, we look at how to take advantage of Git's unique features to craft beautiful commits that respect the history of our source code. We start out by defining what it means to have a good-looking history and why that's important. Next, we'll see how we can use features like the index, the stash, and commit hooks to carefully organize, verify, and properly document the contents of our commits. Finally, we'll see how to line up our local history to create a trail of commits that document our work in a way that's self-explanatory and easy to follow. Let's get started.
The Importance of a Good-looking History
"Study the past if you would define the future," said the ancient Chinese philosopher, Confucius. As with many other things in life, the way you move forward in a code base starts by understanding how things got to where they are in the first place. This precious information is captured by the version control system. But there is a catch. Simply storing the history of our code base in a version control system doesn't necessarily mean that we'll be able to gather any value out of it. If the history is complicated, ambiguous, and poorly-documented, it turns into a black box, keeping the precious information locked inside inaccessible. If instead, it is clear and easy to follow, it becomes the key to understand the decisions made by the programmers that came before us. The designers of Git understood this. That's why they built special features into Git to give us a fair shot at maintaining a good-looking history. So what makes a history good-looking? Well, it all starts with the quality of our commits. A good commit usually has four recognizable traits. It's atomic, consistent, incremental, and documented. Let's look at each of these qualities individually. A commit must be atomic, or in other words, self-contained. This means that we shouldn't split semantically-related changes across multiple commits. For example, if we were to rename a function, we would commit the renamed function as well as all the references where that function was used into one single commit. Related to being atomic, there is the concept of coherence. Just as we should avoid the breaking apart semantically-related changes, we should also make sure that each commit represents one logical change. Renaming a function, along with all its references, represents one commit. Fixing a bug represents another one. A corollary of this principle is that commits should be kept relatively small. Personally, I tend to make commits that include between two and five files. 
If I have a commit with more than, say, ten, I've either renamed a rather important symbol, that is, one with lots of references, or I'm trying to do too much in a single commit. Either way, I am aware of the size of the commits I create, because small patches are easier to review and to reason about than large ones. Each commit should leave the code in a consistent state. At the very least, the code should compile with no errors and no broken tests. The reason why this is important is that it should be possible to apply individual commits to the working directory, and be able to immediately build on top of them without first having to deal with compilation errors or failing tests. A code base evolves through a sequence of self-contained and logically-coherent modifications that build on top of each other incrementally. That's why the order in which the commits appear in the line of history matters. For example, if we were to build a feature, the order of our commits will reflect the evolution the code went through as the feature was implemented. So the order of the commits shouldn't be arbitrary, but rather, it should be explanatory. In other words, it should clearly document the thought process the programmer went through as they worked on the code base. Speaking of documentation, a very important piece of information about the meaning and role of a change is the commit's message. We should use it to communicate not only what the commit means for the system, but also the reasoning behind it if it's not immediately obvious. That's why a good commit message is made up of two parts: a short summary in the form of a single sentence that describes what the change is about, and a longer description containing more details about the change, should it be necessary. Here is an example of a useful commit message from Git's own history. This is one of the very early commits in the Git source code.
As you can see, the author has described what the patch does in one short sentence, the summary. But he didn't stop there. He used the next paragraph, the body, to describe how to use the new options of the update-cache command. This is a great example of how useful it is to provide the background on the thoughts that went into a modification. Just imagine what a huge time saver this information will be in the case of a bug, or in future design discussions. The way this commit message is formatted is also very deliberate. We'll talk more about how to properly format our commit messages, and why it is a good idea to follow a common convention, later in this module. At this point, you might have noticed that the properties of a good commit make up the acronym ACID, just like the one used when talking about database transactions. This isn't a coincidence. A version control system is, in fact, a form of database designed to store information about the changes that happen over time in a collection of files. Following the same metaphor, commits represent the transactions that add information about a new change as a single atomic operation. While these principles can be applied to just about any version control system, in practice, the vast majority of them don't offer the level of granularity and control that's necessary to maintain a good-looking history. Git, on the other hand, was built from the ground up for exactly that goal. During the rest of this module, we'll look at how we can use Git's unique features to make sure our history is as clean as it can possibly be.
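As a small illustration of the two-part message described above, the sketch below creates a commit with a summary and a body and then reads the two parts back separately. The file, the wording of the message, and the identity are invented for the example; this is not the commit from Git's history that the clip shows.

```shell
# Throwaway repository (placeholders throughout).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "int add(int a, int b) { return a + b; }" > calculator.c
git add calculator.c

# Each -m becomes its own paragraph: the first is the summary,
# the second is the body that explains the reasoning.
git commit -q \
  -m "Add the add function to the calculator" \
  -m "Addition comes first because the remaining operations will be expressed in terms of it."

# %s is the summary (subject), %b is the body.
subject=$(git log -1 --pretty=format:%s)
body=$(git log -1 --pretty=format:%b)
echo "summary: $subject"
echo "body:    $body"
```

Keeping the summary on its own line is what lets one-line views of history show it without the body.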
Staging Commits
A single coherent change, that's all a commit should be. It should represent one kind of modification, whether it be a piece of documentation, a refactoring, a bug fix, or a new feature. Following this rule with Git is easier than with any other version control system thanks to the index. The index, also known as the staging area, is one of Git's most distinctive features. According to the programmers who were involved with Git during its early days, it didn't take long before it became clear that having the possibility to pick and choose which changes to include in the next commit was a very useful feature. "It was very obvious from the early days that unlike 'cvs commit' or 'svn commit' it was very useful that you can trust 'git commit' after preparing the index with what is and isn't to be included in the commit, won't pick up debugging cruft you keep around in the working tree." That's exactly what the index is for. Let's say we have two modified files in our dirty working directory. One of them changes a word in a documentation file, while the other modifies a piece of code. Now, we wouldn't want to have both of these changes in the same commit, since they are totally unrelated to each other and the commit wouldn't be atomic. Thanks to the index, it's easy to create two separate commits, one for each change. First, we tell Git about our intention to create a commit with only the documentation file by adding it to the index. If we want to see what changes are going to be part of the next commit, in other words, what's staged, we can say git diff --staged. As you can see, only the README.md file is currently in the index. If instead we say git diff, we see the changes that are only in the working directory. Now, let's try something interesting. Let's say we make another change and add it to the index. Later, we realize that we forgot to make another change to the same file, and we want to include that in the next commit as well.
We open up our editor and make the change. Now let's go ahead and check the contents of the index with git diff --staged. Our latest change isn't there. Did we forget to save? Let's see what's in the working directory. Aha, the README.md file was indeed modified, but it wasn't automatically included in the index. This is by design. Once we add the file to the index, Git creates a cached copy of it, as it was at the time it was added. In fact, we can also check the contents of the index by passing the --cached option, which yields the exact same result. If we want to include any modifications in the index, we have to explicitly say so by adding them. This might seem counterintuitive, but it's actually very useful. It allows us to decide what to include in a commit without having to worry about the state of our working directory. Here is how Linus Torvalds explained this design decision. "It's simply how I've always worked. I tend to have dirty working trees, with some random patch in my tree that I do not want to commit." We could also keep modifying the same files in the working directory without affecting our previous decision. All of this before we make any commit at all. Knowing that a staged file is cached gives us the peace of mind and the ability to experiment without affecting our upcoming commit. But what if we add a file to the index and then regret it? No problem. To remove a file from the index, we use the reset command. The reset command, without further options, implies git reset --mixed, which means the index is updated to match the specified commit, by default the one HEAD points to. The working directory remains untouched, while any staged changes are removed from the index. Since we'll be wanting to do this fairly often, it's handy to have an alias for it. Let's call it unstage. Now, let's consider this. What if we made two different modifications to the same file but wanted to commit them separately?
In that case, we can take advantage of the index to include part of the file in one commit, and the rest in another. The -p option of git add means --patch. It allows us to interactively choose which sections of the patch, also known as hunks, to stage. The help text below lists what we can do. Y means yes, stage this hunk. N means no, skip this one. J and K allow us to jump ahead to the next undecided hunk, or back to the previous one, respectively. S means split the hunk into smaller ones, if we need it to be more granular. E means edit, which opens up the patch in an editor, allowing us to manually edit it. D means skip the rest of this file, and Q means quit. In this case, we just want to commit the add function separately from the comment. Since both changes are part of the same hunk, we first need to split them by pressing S. At this point, we can confirm that we want to stage the first hunk, that is the add function, by pressing Y. Finally, we exclude the comment by pressing N. Since we are doing a partial staging, we can see that the calculator.c file is simultaneously staged and unstaged. Finally, we create a commit from what's staged, then add the rest of the file to the index and create a second commit. If we have a very dirty working directory, we also have the possibility to do a so-called interactive staging, where we get an overview of all the modified files and can decide which files to add entirely, and which to include partially. We can start an interactive staging session by passing the -i option, which means --interactive. For example, we could add the entire README.md file by pressing U, followed by 1, and create a patch for calculator.c by pressing P, followed by 2. At this point, we could do the same thing we did before, that is to split the hunk and only stage the first part. But since the patch is so small, we might as well edit it manually. In order to do that, we press E, for edit. This opens up the patch in an editor, where we can simply remove the lines that we don't want to include in the next commit.
Finally, we can quit the interactive mode by pressing Q. This is something I very rarely use since the interactive patch editor is usually good enough for most situations. However, it's useful to know it's there when you need it.
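The non-interactive part of this workflow can be reproduced in a throwaway repository. The following is a minimal sketch, assuming a temporary directory, made-up file contents, and one possible definition of the unstage alias (the course doesn't show the exact definition); git add -p and git add -i are left out because they need a person at the keyboard.

```shell
#!/bin/sh
# Illustrative only: the repository, file names and contents are made up.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Start from a commit that contains both files.
echo "Calculator" > README.md
echo "int main(void) { return 0; }" > calculator.c
git add README.md calculator.c
git commit -q -m "Initial commit"

# Two unrelated modifications in the working directory.
echo "A simple calculator." >> README.md
echo "/* TODO: add */" >> calculator.c

# Stage only the documentation change.
git add README.md
staged=$(git diff --staged --name-only)     # what goes into the next commit
unstaged=$(git diff --name-only)            # what is only in the working directory

# A later edit to an already staged file is not picked up automatically:
# README.md now shows up in both the staged and the unstaged diff.
echo "Usage: ./calculator" >> README.md
git diff --name-only

# An "unstage" alias; this particular definition is an assumption.
git config alias.unstage "reset -q HEAD --"
git unstage README.md
staged_after_unstage=$(git diff --staged --name-only)

echo "staged: $staged / unstaged: $unstaged / after unstage: [$staged_after_unstage]"
```

After the alias runs, the index matches HEAD again while both edits are still present in the working directory.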
Verifying Commits
Once we are done deciding which files are going to be part of our next commit, it's useful to verify that what's in the index is consistent. It shouldn't contain anything other than what we intended, nor should it introduce any errors in the code base. As programmers, one of the first things we check for is the presence of unintended whitespace. For that, we can simply use the diff command. Git will automatically highlight all the invalid whitespace directly in the patch output so it's easy to spot. We could also check for whitespace errors by passing the --check option to diff. In this case, git diff will output the lines that contain invalid whitespace and exit with a non-zero status code if it finds any, which is particularly useful when used in scripts. What places are considered invalid for whitespace to be in is controlled by the core.whitespace configuration option. By default, Git will look for rogue whitespace in three places indicated by these options. Blank-at-eol checks for spaces at the end of a line. Blank-at-eof checks for blank lines at the end of a file. Space-before-tab checks for a space that appears right before a tab in the indentation. On top of that, we can also tell Git to check the whitespace used for indentation by using these options. Indent-with-non-tab checks for spaces that are used for indentation instead of tabs. Tab-in-indent is the exact opposite, that is it checks for tabs used for indentation instead of spaces. For example, this will teach Git to make sure that indentation is done exclusively with spaces. As you can see, Git is now telling us that line five is indented using tabs instead of spaces. If we were to discover unwanted whitespace in our local commits, that is commits that only exist in our local repository, we can quickly remove it by passing the --whitespace option to git rebase. For example, let's say that the two last commits in the current branch contain whitespace that we wish to remove. In that case, we could say, "git rebase HEAD~2 --whitespace=fix."
This will process the commits starting from two commits before HEAD, and remove any unwanted whitespace according to the core.whitespace setting. Once we are satisfied with the contents of the index, it's time to verify that it actually works. The code we're about to commit shouldn't contain any compilation errors or failing tests. But how can we verify the contents of the index alone, separated from all the other changes that are in the working directory? Well, we can do that by taking advantage of the stash. The stash is a storage area where we can temporarily put unfinished work that we wish to take out of the working directory. A stash will normally include both the modified files in the working directory, as well as the contents of the index. For example, let's say we have a file called calculator.c. We have added two new functions to it, one for addition and one for subtraction. We want to commit these functions separately, so we stage the add function first while we leave the subtract function out of the index. Now, in order to make sure that the contents of our next commit are consistent, we can stash away the changes in the working directory, while at the same time leaving the index untouched, by saying, "git stash save --keep-index." If we had any new files that weren't previously part of the history, that is they are untracked, we would stash them as well by including the --include-untracked option. We also have to provide a message for the stash. Since this is just temporary, we can simply write, "work in progress," or "WIP." The end result is that now we have a working directory that only contains the changes that are about to be committed without losing the rest of our work. At this point, we can run a build script to verify that the patch is going to leave the code in a consistent state. And if so, proceed with the commit. In this case, we have a makefile, so we can simply run make.
Once we are done, we restore the files from the stash by saying, "git stash apply." However, since we have no reason to keep the stash around, we also want to remove it after we have restored its contents. So instead, we say, "git stash pop," and our changes are now back safe and sound in our working directory.
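As a rough illustration of the two checks described in this section, here is a sketch that runs git diff --check on the index and then isolates the staged changes with the stash. The repository, the file contents and the notes.txt file are made up for the example, and the build step is only a placeholder comment. The staged and unstaged edits are deliberately placed in different parts of the file, which is an assumption made so that restoring the stash merges cleanly.

```shell
#!/bin/sh
# Illustrative only: a throwaway repository with made-up contents.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

printf '%s\n' '/* calculator.c */' 'int main(void)' '{' '    return 0;' '}' > calculator.c
git add calculator.c
git commit -q -m "Add calculator skeleton"

# Stage an add function whose line ends in two stray spaces.
printf '%s\n' '/* calculator.c */' 'int add(int a, int b) { return a + b; }  ' \
    'int main(void)' '{' '    return 0;' '}' > calculator.c
git add calculator.c
# --check exits with a non-zero status when it finds whitespace errors.
if git diff --staged --check > /dev/null; then first_check=clean; else first_check=dirty; fi

# Remove the trailing spaces and stage the file again.
printf '%s\n' '/* calculator.c */' 'int add(int a, int b) { return a + b; }' \
    'int main(void)' '{' '    return 0;' '}' > calculator.c
git add calculator.c
if git diff --staged --check > /dev/null; then second_check=clean; else second_check=dirty; fi

# Unfinished work that should not be part of the next commit.
echo 'int subtract(int a, int b) { return a - b; }' >> calculator.c
echo 'scratch' > notes.txt                  # untracked file

# Take everything except the staged changes out of the working directory.
git stash save -q --keep-index --include-untracked "WIP"
if grep -q subtract calculator.c; then during_stash=present; else during_stash=absent; fi

# ... this is where the build would run, for example: make

# Bring the unfinished work back and drop the stash.
git stash pop -q
if grep -q subtract calculator.c && [ -f notes.txt ]; then after_pop=restored; else after_pop=missing; fi
echo "checks: $first_check, $second_check; stash: $during_stash, $after_pop"
```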
Documenting Commits
Making sure that the changes in our commits are atomic and consistent isn't enough. We also need to properly document them. A good commit needs a well-formed commit message that explains what the patch does to the code base. Git has a convention for what a well-formed commit message should look like. It should consist of two parts. One, a short one-sentence summary of at most 50 characters. And two, an optional longer description that adds more detail about the change. For example, it could be the reasoning behind the refactoring, the problem a bug fix is solving, or instructions on how to use a new feature. It's also good practice to wrap the lines at 72 characters to make it readable on a standard 80-character console. Even with all good intentions, we all know that writing commit messages isn't something that comes naturally to many programmers. Someone even created a website dedicated to making the job easier by offering generic pre-generated commit messages. If this is where things are at, we need to find a better way to encourage descriptive and well-formed commit messages. One way to address the problem is by creating a little reminder every time someone is about to make a commit. The reminder could take the form of a shell script that runs every time a commit is created in the local repository. The script will check the commit's message to make sure that it's well-formed. If it isn't, it will inform the user and ask them whether they want to correct it. In order to have Git run the script automatically, we could attach it to the commit-msg client-side hook. Now, writing a script like that probably wouldn't be too hard. However, I happen to have already written a Bash script that fits the description exactly, so we can use that to save time. Let's go ahead and download it into the .git/hooks/ directory in our repository and call it commit-msg, since Git only runs a hook whose file name matches the hook's name exactly.
Since it's generally not a good idea to blindly run scripts you found on the internet, let's take a moment to look at it. As you can see, the script is pretty straightforward. First, we check the length of the first line of the commit message. If it's longer than 50 characters, we print out a message to the user, asking them whether they want to correct it. If they press Y, we open up the entire commit message in the default editor. Finally, we exit with a zero status code, which tells Git that it's okay to continue with the commit. Now, let's grant all users the right to execute the script by using chmod. Finally, we can try it out by attempting to create a commit with a message that's too long. As you can see, the script is informing us that the summary of the commit message should be, at the most, 50 characters long, but is, in fact, 64. If we press Y, our default editor will open up, allowing us to write a well-formed commit message. You can download the script used in this demo at this URL and modify it to your liking. Notice that Git goes as far as configuring the editor to indicate when the first line of a commit is longer than 50 characters by highlighting it. Now, this is called dedication. The one thing we can't do is to automatically check for descriptive commit messages. For that, we count on code reviews to provide appropriate feedback.
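To make the idea of a commit-msg hook concrete, here is a simplified sketch. It is not the script used in the demo: instead of asking the user and opening an editor, this hypothetical version simply rejects the commit when the summary is longer than 50 characters. The repository and the messages are made up; the only things taken for granted are that Git passes the path of the message file as the hook's first argument and that a non-zero exit status aborts the commit.

```shell
#!/bin/sh
# Illustrative only: installs a simplified, hypothetical commit-msg hook
# in a throwaway repository and tries it with two messages.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

mkdir -p .git/hooks
cat > .git/hooks/commit-msg <<'EOF'
#!/bin/sh
# Git calls this hook with the path of the commit message file as $1.
summary=$(head -n 1 "$1")
length=${#summary}
if [ "$length" -gt 50 ]; then
    echo "The summary is $length characters long; the convention is 50 at most." >&2
    exit 1        # a non-zero status aborts the commit
fi
exit 0            # a zero status lets the commit continue
EOF
chmod +x .git/hooks/commit-msg

echo "Calculator" > README.md
git add README.md

long_message="This summary line is deliberately written to be far too long for Git"
if git commit -q -m "$long_message" 2> /dev/null; then long_result=accepted; else long_result=rejected; fi
if git commit -q -m "Add README"; then short_result=accepted; else short_result=rejected; fi
echo "long summary: $long_result, short summary: $short_result"
```

Rejecting outright is the stricter choice; an interactive prompt, as in the demo, is friendlier but needs a terminal to read the answer from.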
Leaving a Trail
Last, but not least, it's important to leave a trail of commits that shows our thought process as we made our way through the code base. For example, if we wanted to add a new feature, we might want to show how we went about it incrementally, step by step. In this case, the first commit could be a refactoring that makes room for the new feature. The second commit would be writing a failing acceptance test and ignoring it for now. The third would be actually implementing the feature. Note that each commit should leave the code base in a consistent state. The code should compile and all tests should pass. That's why we ignore the failing acceptance test. We want to communicate how the feature is supposed to work without breaking the test script. However, keeping a tight sequence of commits while working on something is very hard. Some people are able to pull it off, but for most of us, the creative process is a little fuzzier. It's littered with missteps, experiments, and changes of course. Fortunately, Git, being a distributed version control system, allows us to separate the history that exists only on our local machines from the one that we share publicly with the world. As long as we haven't shared our commits with anyone else, we are free to change their contents, messages and order to our liking. Once we've published them to a remote repository, however, they become final. We are no longer allowed to change them. We'll look at the difference between private and public history, and why we can rewrite one but not the other, later in this module. One common way to clean up our local history before publishing it is by doing a so-called interactive rebase. We start it with the git rebase -i command. HEAD~4 refers to the commit from which we want to start rewriting history. At this point, Git opens up our editor with a list of commits that fall within the specified range, offering us a series of actions we can take on each of them.
For example, we might want to reorder them, or merge together, that is squash, multiple interim commits into a larger one. We can also change a commit's message or completely edit its content. Once we are done planning our actions, we save the file and exit. At this point, Git will walk the sequence of commits, starting from the first one in the range, and stop whenever we said that we wanted to take an action. The first step is the squashing of two commits. Here, we get a chance to edit the commit message to better describe the changes in the resulting commit. Second, we stop to reword a commit message. And last, we stop to completely edit the contents of a commit. In this case, we want to move the multiply function to be above the main function. Once we are done editing the file, we make it part of the same commit we stopped at by amending it. We'll look at the details of amending commits in module six. Finally, we tell Git to continue with the interactive rebase. The resulting history is now much more self-explanatory and easier for other people to follow since it reflects the incremental progress we have made to the code base.
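An interactive rebase normally happens in an editor, but the same plan can be scripted for the sake of illustration. The sketch below assumes a throwaway repository with made-up commits, and it uses the GIT_SEQUENCE_EDITOR environment variable to change the second line of the plan from pick to fixup, a variant of squash that discards the second commit's message so that no editor needs to open.

```shell
#!/bin/sh
# Illustrative only: folds a follow-up fix into the commit it belongs to.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo 'int add(int a, int b) { return a + b; }' > calculator.c
git add calculator.c
git commit -q -m "Add add function"

echo 'int multiply(int a, int b) { return a + b; }' >> calculator.c
git commit -q -a -m "Add multiply function"

# A misstep gets corrected in a separate commit.
printf '%s\n' 'int add(int a, int b) { return a + b; }' \
    'int multiply(int a, int b) { return a * b; }' > calculator.c
git commit -q -a -m "Fix multiply"

before=$(git rev-list --count HEAD)

# Rewrite the last two commits. Instead of editing the plan by hand,
# sed turns the second "pick" into "fixup".
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2s/^pick/fixup/'" GIT_EDITOR=true \
    git rebase -q -i HEAD~2

after=$(git rev-list --count HEAD)
last_subject=$(git log -1 --format=%s)
echo "commits before: $before, after: $after, last: $last_subject"
```

When working by hand we would simply run git rebase -i HEAD~2 and edit the plan in the editor that opens.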
Public vs. Private History
In Git, there is this convention that says that we are never allowed to rewrite public history, only our own private one. The notion of public versus private history applies to any distributed version control system. In this model, everyone works on their own local copy of the repository, building their private history. Once they're done, they agree to share their work with each other through a common instance of the repository, which everyone has access to. Once someone's private history is part of a shared repository, it's no longer only theirs, it becomes public. So, whenever we decide to tidy up our history, we only do it on our own private history, never on a public one. I think Git's original creator, Linus Torvalds, said it best, so I'm going to borrow his words. "People can and probably should rebase their private trees, their own work. That's a 'cleanup.' But never other people's code. That's a 'destroy history.'" But what does he mean by, "Never rebase other people's commits?" It means two things. First, if you didn't create a commit, it's not for you to change. And two, if you have pushed a commit to a shared repository, other people might pull it and build on top of it. So it's no longer yours, and you can't change it. But why do we need this rule at all? Well, consider this. Every time we change any aspect of a commit, we indirectly modify its unique ID. This ID is generated by calculating the SHA-1 hash of the commit's metadata fields combined. These fields are the tree that it references, the parent commit, the author, the committer, and the commit message. Changing any of these fields is going to affect the commit's own ID. Now since the ID of the parent commit is part of it as well, this means that once a commit changes ID, all commits that come after it also change ID, like a domino effect.
Now if the old commits had been fetched by someone else, once they pull the modified ones, Git is going to treat them as completely different commits, simply by virtue of having different IDs. Therefore, Git is going to merge the old sequence of commits with the new ones. Now, imagine if someone had added new commits on top of the old ones in their own local repository. Now things are going to get even more complicated, because at this point, nobody can tell what has actually changed. In other words, chaos is going to ensue. So stick to the golden rule. If a commit only exists on your local machine, change it as you like. If you've shared it, it's final.
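The domino effect can be observed directly with a little plumbing. The sketch below is only an illustration, assuming a throwaway repository with two empty commits and arbitrary fixed dates (so that the timestamps don't influence the result). It uses git commit-tree to recreate a commit from the same fields, which yields the same ID, and then again on top of a reworded parent, which yields a different one.

```shell
#!/bin/sh
# Illustrative only: shows how a commit's ID depends on its fields.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Fixed dates keep the example deterministic; the values are arbitrary.
export GIT_AUTHOR_DATE="2016-02-01 12:00:00 +0000"
export GIT_COMMITTER_DATE="2016-02-01 12:00:00 +0000"

git commit -q --allow-empty -m "First"
git commit -q --allow-empty -m "Second"
old_parent=$(git rev-parse HEAD~1)
old_child=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')

# The fields that go into the hash: tree, parent, author, committer, message.
git cat-file -p HEAD

# Same tree, parent, author, committer and message: same ID.
same_child=$(git commit-tree "$tree" -p "$old_parent" -m "Second")

# Reword the parent, then recreate the child on top of it.
new_parent=$(git commit-tree "$tree" -m "First, reworded")
new_child=$(git commit-tree "$tree" -p "$new_parent" -m "Second")

echo "old child:        $old_child"
echo "identical fields: $same_child"
echo "new parent only:  $new_child"
```

The child's tree, author and message never changed; its ID differs only because its parent's ID did.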
Summary
In this module, we have learned the importance of paying attention to the way we shape the history of our repositories. We have determined that good commits are ACID, atomic, consistent, incremental, and documented. For the rest of the module, we have looked at how to take advantage of some of Git's unique features, namely the index, the stash, and the ability to rewrite history, to prepare, verify, document, and line up our commits into a trail that's self-explanatory and that can act as a journal of our work for generations to come. In the next module, we'll look at how to follow this trail by answering questions such as, which commit introduced a particular change, or how a commit has moved across branches.
Searching History
Introduction
In this module we're going to look at how to query the history of our source code to answer any questions we might have about its past or present. For example, what commits are in this branch but not in that other one? Which commit introduced this line of code? Who modified that file during the past few weeks? These are only a few of the questions we'll be able to answer by using Git's built-in search commands. Let's get started.
Reachability
Before we can start talking about searching the history of our repository, we need to take a step back and remind ourselves of Git's object model. In Git, history is structured as a directed acyclic graph, that is a group of interconnected nodes where the connections between nodes only go in one direction and never loop back on themselves. Visually, you can think of it as multiple series of nodes where each node always points to the one that comes before it. In the context of Git, each node represents a commit. The last commit in each sequence is called a head and it's associated with a name, the branch reference. Now we said that each commit has a reference to its parent. Most commits only have one child, but there is a special kind of commit that can have two or more. These commits are called branching points or fork points. Likewise, there are commits that can have more than one parent. These commits are called merge commits. Now the only way we can traverse this structure is by starting from the latest commit in a branch, the head, and working backwards following the trail of parents. Whenever history diverges into two branches, like in the case of M, we can choose to either follow its first parent, that is stay on the branch where the other branch was merged into, or follow the second parent, that is move to the branch that was merged. Based on this model we can state this principle: a commit A is said to be reachable from another commit M if there exists a contiguous path of commits that leads from M to A, or more simply, if we can start from M and follow the trail of parents until we arrive at A. In other words, commit A is reachable from commit M if A is an ancestor of M. Keeping this principle in mind, common version control questions like which commits are in this branch but not in that other one become which commits are reachable from this branch's head but not from that other one's? The way we can ask that question is by using the so-called dot notation.
For example, given two branches, master and feature, we can say git lg feature..master and that will give us the list of commits that are reachable from master's head but not from feature's. Notice that here, instead of using the default log command, we are using the lg alias that we created in module two to give the output a nicer formatting. During the rest of this module we're going to use lg instead of log. Now if we switch places between the two references and say git lg master..feature, we get the commits that are reachable from feature but not from master. This special syntax is called the two dot notation, or sometimes the dot dot notation. Note that the meaning of the dot dot notation in the context of git log is different than when used with commands like git diff, where instead it's used to define a range of commits between two references. But if there is a two dot notation, would you expect there to be a three dot notation? Well, if you thought yes, you'd be right. The three dot notation, or dot dot dot, also has a special meaning in git log: it results in the commits that are reachable from either of the branches but not from both. Of course this command doesn't only work with branch references. You can use it with any commit reference, whether it be a relative one, a commit ID or a tag. But what if we want to find the range of commits that exist between two references, like in the case of git diff? In order to do that we'll need to pass the --ancestry-path option to git lg. This gives us the commits that are not only the ancestors of master but also the descendants of feature. In this case it corresponds to the merge commit M. Another very convenient way of visualizing the way our branches are structured, also known as the branch topology, is by using the log command with the --graph option to put them in a graphical representation. This visualization is fine, but as soon as we start having more than two branches it can quickly become harder to interpret.
Fortunately, we have an alternative. Meet the show-branch command. When using the show-branch command we specify the branch references we want to include in the visualization. In this case we have three: master, feature and experiment. What we get is a hierarchical view of the ancestry lines that link our branches. Let's take a closer look at this output. The first section shows the commits that are the branches' heads. The branch currently in the working directory is marked with an asterisk while the others are marked with an exclamation point. The second section shows the ancestry of commits for each of the branches. As you can see, they are indented to match the position of the branch name in the first section. Regular commits are marked with a plus sign while merge commits are marked with a minus. The first common ancestor of all three branches is shown at the end of the list, marked with both an asterisk and a plus sign for each branch. I use show-branch extensively in my daily work since I think it offers a more readable overview of the branches' topology than the one offered by the plain log command.
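The two dot and three dot notations are easy to try out on a small history. This sketch assumes a throwaway repository with two diverging branches made of empty commits, and it uses plain git log and git rev-list rather than the lg alias, whose definition isn't shown in this module.

```shell
#!/bin/sh
# Illustrative only: master has commits A, M1, M2; feature has A, F1.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is called master
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

git commit -q --allow-empty -m "A"
git checkout -q -b feature
git commit -q --allow-empty -m "F1"
git checkout -q master
git commit -q --allow-empty -m "M1"
git commit -q --allow-empty -m "M2"

# Reachable from master but not from feature.
only_master=$(git rev-list --count feature..master)

# Reachable from feature but not from master.
only_feature=$(git log --format=%s master..feature)

# Reachable from either branch but not from both.
either=$(git rev-list --count master...feature)

echo "only on master: $only_master, only on feature: $only_feature, either side: $either"

# Two views of the same topology.
git log --oneline --graph master feature
git show-branch master feature
```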
Tracking Commits
When working with branches, one of the things we'll want to check on a regular basis is which branches we have merged and which we haven't. The quickest way to answer that question is by using the branch command together with the --merged option. For example, git branch --merged shows the list of branches that have been merged into the one referenced by HEAD. We could also check another branch by specifying its name. In order to answer the opposite question, that is which branches have not been merged, we run the same command with the --no-merged option. As expected, the results show the experiment branch, which has not been merged into the current one. This approach works fine but it has a problem: it can only tell us which branches contain commits that aren't reachable from HEAD. It doesn't tell us what those commits are. In order to find out the missing commits we could use the git lg command together with the dot dot notation, like we have seen earlier in this module. Another way of finding out which commits have been merged into a particular branch is by running the show-branch command, the same one we have seen earlier in this module, only this time passing the --topic option. The --topic option filters out the commits that have been merged into the first branch reference, thus only showing the ones that haven't. In this case, since master is the first branch in the list of arguments, we're seeing the commits from experiment that haven't been merged. If we change the order of the references by putting the feature branch first, we see all the commits from both master and experiment that are missing in feature. Up until now we have only tracked entire commits. The commands we've been using only look at the commit ID to determine whether a commit is reachable or not. They don't take into consideration the changes that a commit introduces.
This can sometimes be a problem because we don't always merge entire branches. Instead, we pick and choose single commits to apply on top of a branch, for example by using the cherry-pick command, as we'll see in the next module. In these situations we want to look for commits that are patch equivalent, meaning they introduce the same set of changes as another commit regardless of their commit IDs. In Git's notation, patch equivalent commits are marked as prime, so for example F prime is the patch equivalent commit of F but has a different commit ID. To find out which commits are patch equivalent between branches we can once again use the git lg command with the --cherry-mark option. As the name implies, the --cherry-mark option marks the commits reachable from one branch whose patch is present in a commit that's reachable from another one, and vice versa. The commits that are missing on the right side, in this case experiment, are marked with a less than sign, the ones missing on the left side, that is master, are marked with a greater than sign, and finally the ones that are equal are marked with an equals sign. If we only wanted to consider one side of the comparison we would specify the --left-only or --right-only option respectively, depending on which branch we are interested in. Notice how we have used the three dot notation to specify the range of commits to include in the comparison. That's because we want all commits that are either reachable from one branch or the other but not from both. We also want to explicitly exclude merge commits since those aren't relevant to our question. Let's look at another example. This time, instead of marking the commits that are equivalent in both branches, we exclude them entirely using the --cherry-pick option. This is especially useful when we are only interested in knowing which commits are in one branch and not the other.
For example, if we only wanted to know which commits are missing in master we could say git lg --cherry-pick --right-only --no-merges master...experiment. Looking at only one side of the comparison is in fact so useful that there is a shorthand version for it. Instead of saying git lg --cherry-mark --right-only --no-merges we can simply say git lg --cherry. If instead we wanted to know which commits are missing in experiment, we would simply have to switch the order of the references. Now suppose we want to know when the feature branch was merged into master. In other words, we want to get a list of merge commits that have happened between feature and master. To answer the question we could say git lg feature..master --merges, which is equivalent to saying show me all the merge commits that are reachable from master but not from feature. In this case, since there have been two merges on master, we only see two commits. If we are only interested in seeing the latest merge, that is the merge commit that is a descendant of feature in addition to being an ancestor of master, we can add the --ancestry-path option. Finally, if we switch the branch names around we get an empty list because master has never been merged into feature.
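Patch equivalence is easier to see with a concrete history. The sketch below assumes a throwaway repository in which master cherry-picks one of two commits from an experiment branch; the file names and messages are made up, and plain git log with an explicit format is used instead of the lg alias. The --left-right option is added here so that the less than and greater than markers appear in the output.

```shell
#!/bin/sh
# Illustrative only: experiment has E1 and E2; master has M1 and a copy of E1.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is called master
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo base > base.txt; git add base.txt; git commit -q -m "A"

git checkout -q -b experiment
echo one > e1.txt; git add e1.txt; git commit -q -m "E1"
echo two > e2.txt; git add e2.txt; git commit -q -m "E2"

git checkout -q master
echo m > m1.txt; git add m1.txt; git commit -q -m "M1"
git cherry-pick experiment~1 > /dev/null   # same patch as E1, different commit ID

# Branches with commits that are not reachable from HEAD.
not_merged=$(git branch --no-merged | tr -d ' ')

# Mark each commit with the side it belongs to, or "=" when an
# equivalent patch exists on the other side.
git log --format='%m %s' --left-right --cherry-mark --no-merges master...experiment

# Leave out the equivalent commits and keep only the right side:
# what experiment has that master is still missing.
missing_in_master=$(git log --format=%s --cherry-pick --right-only --no-merges master...experiment)
echo "not merged: $not_merged, missing in master: $missing_in_master"
```

By commit ID alone, both E1 and E2 look missing from master; comparing patches shows that only E2 really is.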
Tracking Changes
Up until now we have been tracking commits across different branches. The next step is to track changes across commits. Let's start out with an easy question: we want to know all the commits that contain modifications to a particular file. Once again we can find this information by using the log command, this time passing the --follow option. What we are seeing here are the commits whose snapshots contain a calculator.c file that's different from the one contained in their parents. If we also wanted to know what changes were made in each commit we would add the --patch option. Now let's go a bit more granular and find all the commits that either add or remove the string calculator. In order to do that we use the -S option of git lg followed by the string that we want to look for. We could also specify a regular expression instead of just a simple string by adding the --pickaxe-regex option. One thing to remember is that the -S option only looks for lines that are either added or removed but not both. This means that it doesn't show commits that contain modified lines. If we wanted to include those as well we'd have to use the -G option instead. As you can see, this time we got an extra commit in the results. If we take a look at its patch by adding the --stat option we can confirm that commit C does in fact both add and remove a matching line. Another difference between -S and -G is that the argument of -G is always interpreted as a regular expression, so we don't have to pass any extra option. This time we got even more results since apparently there are more commits that either add, remove or modify matching lines. Now let's do something a little more interesting. Say that we wanted to know in which commits the subtract function has been modified. Of course we could look for a string like function subtract using the git log -G option, but that wouldn't be very precise.
Git has long had the ability to parse the contents of text files in order to produce language-aware diffs, providing for example language-specific information in the hunk headers. As of version 1.8.4, Git also learnt to use that same functionality to do language-aware searches. This functionality is exposed through the -L option. git lg -L:subtract:calculator.c tells Git to look for commits that modify a C function called subtract found in a file named calculator.c. But how does Git know which parser to use for a given file? Can we work with any language? Let's find out by searching for a method with the same name, only this time in a Java file. Interesting. We get no matches even though we know for a fact that such a method exists. As it turns out, Git is able to recognize C and C++ out of the box, but there are other built-in language parsers available. All we have to do is tell Git which parser to use for certain files by adding an entry in the .gitattributes file in our repository. In this file we specify that all files that have the .java extension should be processed using the Java language parser. At this point we can run the same command and get the commits that added, removed or modified the subtract method. You can find the list of all the available built-in language parsers in the documentation page for gitattributes at this URL.
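The difference between -S and -G shows up as soon as a commit modifies a matching line without changing how many times the string occurs. Here is a sketch under made-up assumptions: a throwaway repository, a hypothetical write_calculator helper that regenerates calculator.c, and four commits of which one only changes the signature of subtract. Plain git log is used instead of the lg alias.

```shell
#!/bin/sh
# Illustrative only: four commits that touch calculator.c in different ways.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Hypothetical helper: $1 = type used by subtract, $2 = operator in its body.
write_calculator() {
    printf '%s\n' 'int add(int a, int b)' '{' '    return a + b;' '}' \
        "$1 subtract($1 a, $1 b)" '{' "    return a $2 b;" '}' > calculator.c
}

printf '%s\n' 'int add(int a, int b)' '{' '    return a + b;' '}' > calculator.c
git add calculator.c
git commit -q -m "Add add function"
write_calculator int +;  git commit -q -a -m "Add subtract function"
write_calculator long +; git commit -q -a -m "Widen subtract to long"
write_calculator long -; git commit -q -a -m "Fix subtract operator"

# -S: commits that change the number of occurrences of the string.
pickaxe_count=$(git log --format=%s -S"subtract" -- calculator.c | wc -l | tr -d ' ')

# -G: commits whose patch adds or removes a line matching the regex.
regex_count=$(git log --format=%s -G"subtract" -- calculator.c | wc -l | tr -d ' ')
echo "-S finds $pickaxe_count commit(s), -G finds $regex_count"

# -L: follow the history of the subtract function itself.
git log --format='commit: %s' -L:subtract:calculator.c

# For other languages, a .gitattributes entry selects the built-in parser.
echo '*.java diff=java' > .gitattributes
```

The commit that widens the signature removes one matching line and adds another, so only -G reports it.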
Tracking Authors
Sometimes knowing which commits modify a certain file isn't enough we need to know who made those changes in order to gather more information about their context tracking authors across commits is done through the options of the git log command for example say that you want to know the commits made by a certain author. In that case we will say git lg -- author equals and the name of the author we are looking for. Know that in git there is a difference between who authored a patch and who committed it to a repository. Let me show you let's look at a meta data of the latest commit in the current branch as you can see this particular commit was authored and committed at two different points in time but there might as well have been two different names there. So just as we can filter the commits by author we can also filter them by committer. Now in most projects the author and the committer of any given commit are going to be one and the same. However during the development of large projects such as the Linux canal it's common to have contributors submit a patch which is then reviewed by someone on the maintenance team. They then get to decide whether the patch should be committed or not. By recording who authored a change and who actually committed it to the repository into separate fields makes it possible to give both credit as well as responsibility for the work done. If we also wanted to limit the list of commits to a certain time frame we could say git lg --author --since 1 week this shows the commits made by that author within the last seven days. We could also set the output to a specific range like for example between one week and two days ago. Unfortunately the list of supported days for a month is undocumented. However Linus Torvalds who originally wrote the date parcel called implementation approxidate in the source code. If you want to know more you can find it in the date.c file in git or repository at this URL. 
When tracking authors, we are not limited to entire commits. We can also establish authorship for each individual line of code in a given file, thanks to the unfairly named blame command. The output shows the short version of the SHA-1 hash that belongs to the latest commit that modified each line of code, together with the author as well as the timestamp. If we wanted to, we could also limit the range of commits by using the dot dot notation. In this case, we are only seeing the commits made in the feature branch. Not only that, but we can also filter specific portions of a file. This can be done both by line ranges and by function. Finally, something I like to do from time to time when I'm working in a long-lived codebase is to gather a bit of statistics about the number of commits made by different authors. The command that does that is called shortlog. By adding the -s option, we sum the total number of commits per author, while -n sorts the results by that number. As a fun example, here are the top 10 authors in the official Git repository on GitHub as of February 2016.
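The blame and shortlog variations above can be tried out with a sketch like the following. The authors Alice and Bob and the contents of calculator.c are assumptions made for the sake of the example.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Committer"
git config user.email "committer@example.com"

printf 'int add(int a, int b)\n{\n    return a + b;\n}\n' > calculator.c
git add calculator.c
GIT_AUTHOR_NAME="Alice" git commit -q -m "Add add"

git checkout -q -b feature
printf '\nint subtract(int a, int b)\n{\n    return a - b;\n}\n' >> calculator.c
GIT_AUTHOR_NAME="Bob" git commit -q -am "Add subtract"
GIT_AUTHOR_NAME="Bob" git commit -q --allow-empty -m "Follow-up"

git blame calculator.c                       # every line
git blame master..feature -- calculator.c    # only commits made in feature
git blame -L 1,4 calculator.c                # by line range
git blame -L :subtract calculator.c          # by function

# Who last touched the lines of subtract?
subtract_authors=$(git blame -L :subtract --line-porcelain calculator.c | grep '^author ' | sort -u)

# Commits per author, highest count first. The explicit HEAD matters in
# scripts: without a revision, shortlog reads from standard input when it
# is not attached to a terminal.
git shortlog -s -n HEAD
top=$(git shortlog -s -n HEAD | head -n 1 | awk '{print $2}')
```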
Summary
In this module, we looked at how we can query the history of our repository in order to answer any questions we might have about its branches, commits, changes and authors. We started off by defining the concept of reachability as it applies to commits and branches. From there, we learned how to track commits across branches using the dot dot notation and the show-branch command. Finally, we saw how we can use the search-related options of git log to track changes across commits as well as authors. In the next module, we're going to take a step back and see how fundamentally different Git's approach to branching is compared to traditional version control systems, and which unique opportunities that approach offers to improve our daily workflow.
Branching and Merging
Introduction
In this module, we'll look at how to take advantage of Git's unique branching model to improve our daily workflow, both for our own personal productivity, as well as for the entire team. First, we're going to see how Git's approach to branching sets it apart from traditional version control systems, and what possibilities that approach offers. Then, we'll look at the different kind of branches we can work with, and how to choose the right merge strategy for each of them. Finally, we're going to go through a few unique, and frankly, rather impressive tools that Git puts at our disposal to resolve merge conflicts. Let's get started.
The Branching Model
Branching and merging in Git is radically different from how it works in traditional version control systems. Tools like CVS and Subversion taught us that branching is slow and takes up a lot of disk space. Merging is even worse: it can be both time consuming and flat-out risky. The natural conclusion, then, is that we should avoid branching at all costs until it becomes absolutely necessary. Well, Git turns that rule entirely on its head by making branching a cheap and fast operation, while also being smart about how it handles merges. Let's take a deeper look at Git's branching model. In traditional version control systems, creating a branch means creating an exact copy of the entire working tree in a new directory, down to every single file. In some systems, the name of the directory becomes the branch name, while others record the directory's path and associate it with a name somewhere in the database. In Git, creating a branch literally means writing a value to a text file. That value is the SHA-1 hash of the commit that represents the tip of the branch, also known as the branch head. The name of the file itself becomes the branch name, and that's about it. No files are copied, and no databases are updated. In Git, a branch is nothing more than a commit ID stored in a 41-byte text file. Let me prove that to you. Let's see what branches we have in our repository by saying git branch --all. In this particular repository, we have three branches: master, feature and experiment. Since branches are really references to commits, another way of asking for a list of the local branches is by using the show-ref command. Normally, show-ref shows both branch references and tags, but we want to limit the output to just branches, so we pass the --heads option. What we are seeing here are the paths of the files that represent the branch references, together with the SHA-1 hashes of their latest commits. Let's look at the content of one of those files.
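The inspection just described can be reproduced with a sketch like this one. It assumes a freshly created repository that uses SHA-1 object names and stores its references as loose files under .git/refs/heads; a repository with packed references or a different reference backend would not show the file. The branch name handmade is made up.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"
git config user.email "author@example.com"
git commit -q --allow-empty -m "A"
git branch feature
git branch experiment

git branch --all         # master, feature and experiment
git show-ref --heads     # hash and reference path of each branch

# The branch is a small text file: 40 hex digits plus a newline.
tip=$(git rev-parse master)
stored=$(cat .git/refs/heads/master)
size=$(wc -c < .git/refs/heads/master | tr -d ' ')
echo "$stored ($size bytes)"

# For illustration only: writing such a file by hand creates a branch.
echo "$tip" > .git/refs/heads/handmade
handmade_tip=$(git rev-parse handmade)
```

In day-to-day work, git branch is the right way to create a branch; the manual write is there only to show that nothing else is involved.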
And there you go: nothing but the SHA-1 hash of the master branch's head. At this point, it should come as no surprise that deleting a branch is simply a matter of deleting its corresponding reference file. However, that's usually not a good idea, since the branch reference might have been mentioned in the repository's configuration file, as is the case for remote branches, that is, local branches designed to track the evolution of branches in another repository. Instead, we should use Git's commands to make sure that no references are left broken. One such command is git branch with the -d option. Notice that since the branch contains commits that haven't been merged into master, Git warns us that we might lose those commits if we choose to delete that branch. This is a perfect example of why using the high-level commands is better than manipulating Git's internal file system. In this particular case, we don't care about those unmerged commits, so we tell Git that it's okay to delete the branch by passing in the -D option instead. The -D option forces Git to delete the branch, even if it contains unmerged commits. Note that the only thing we are deleting here is the branch reference. The commit that was once referenced by experiment, that is, G, is still in the repository, but it's left dangling. That is, it's no longer directly reachable through a reference. We can get the list of dangling commits in our repository by using the file system check command, or git fsck, with the --dangling option. As you can see, that's the same commit once referenced by experiment. But since the tip of the branch is dangling, all its ancestors also become unreachable. We can prove that by passing the --unreachable option to fsck. In this case, both the F and G commits are listed. The reason why F didn't show up before is that not all unreachable commits are dangling; some of them actually still retain a reference.
They're referenced by their descendants, so technically, only the last commit in an unreachable sequence is dangling, since it's not referenced by anything. Unreachable commits can be recovered for a certain period of time from the branch's own journal, called the reflog. We're going to talk about the different ways to recover commits in module six. For now, suffice it to say that the easiest way to recover our deleted branch is to simply create a new branch reference that points to the same commit as the old one. Now that we've got our experiment branch back, it's time to merge our feature branch into master. In order to merge a branch into the one currently referenced by HEAD, we use the merge command. At the very least, git merge is going to do two things. First, it will take the changes contained in the commits that are reachable from feature, but not from the current branch, and apply them on top of its latest commit, that is, C. Second, it will create a new commit that combines the lines of history of feature and master by having two parents. This commit is called a merge commit. I said at least two things, because if any of the commits in feature happen to modify the same line in a file as one of the commits in master, the entire merge operation will stop due to a merge conflict. We're going to look at how Git helps us resolve merge conflicts later in this module.
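The delete, inspect and recover sequence for the experiment branch can be sketched as follows. One assumption worth calling out: in a freshly scripted repository, HEAD's reflog still remembers F and G, and git fsck treats reflog entries as reachable, so the sketch adds the --no-reflogs option to make the commits show up.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"
git config user.email "author@example.com"
git commit -q --allow-empty -m "A"
git checkout -q -b experiment
git commit -q --allow-empty -m "F"
git commit -q --allow-empty -m "G"
tip=$(git rev-parse experiment)
git checkout -q master

# -d refuses because F and G have not been merged into master.
if ! git branch -d experiment 2>/dev/null; then
  echo "refused: experiment is not fully merged"
fi
git branch -D experiment    # force the deletion

# Only G is dangling; F is unreachable but still referenced by G.
dangling=$(git fsck --no-reflogs --dangling 2>/dev/null | grep -c '^dangling commit')
unreachable=$(git fsck --no-reflogs --unreachable 2>/dev/null | grep -c '^unreachable commit')
echo "dangling: $dangling, unreachable: $unreachable"

# Recover the branch by pointing a new reference at the old tip.
git branch experiment "$tip"
```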
Different Kinds of Branches
Making branching essentially free is one of Git's greatest achievements. That feature alone opens up a whole range of workflows that are simply not possible, or at least very impractical, in other version control systems. To be able to understand any of these workflows, we need to take a step back and go through some fundamental characteristics of branches. Regardless of which version control system you use, every time you create a branch, it's going to be one of two kinds: a long-running branch, or a topic branch. A long-running branch is a broadly scoped branch that exists for a long period of time, anywhere between a few weeks and the entire lifespan of the project. Long-running branches are usually shared among a group of people or with the entire team. For example, it could be the branch where the next major version of the software is being worked on, or the branch that contains bug fixes for an already released version. Now, not all projects need multiple long-running branches; however, every project must have at least one: the main branch, which all other branches are derived from and merged back into. This is often referred to as the trunk or default. Git's convention calls it master. A topic branch is a short-lived, disposable branch that focuses on a very specific task, hence the name topic, and is typically narrower in scope than a long-running branch. Topic branches are created off long-running branches to accomplish one goal, like for example, implementing a new feature or fixing a bug. If they manage to produce some usable results, they might be merged back into the long-running branch. In any case, once they are no longer needed, they are gone. While long-running branches are typically shared, topic branches can be either shared or individual, in which case, they only exist in the repository where they were created. For example, we might have a long-running branch named vNext that contains work towards the next version of our software.
Then, we might have a topic branch for a feature we're working on, like say, login. Now, we mentioned that a topic branch can be either shared or individual. This property is particularly interesting because, while the distinction between long-running branches and topic branches applies to any version control system, when it comes to distributed version control systems, there is another distinction to make. Branches can be either public or private. Public branches exist in multiple copies of the repository, while private ones exist only in the repository where they were created. Git doesn't impose any rules or limitations on which branches are allowed to be public or private; however, generally speaking, long-running branches tend to be public, while topic branches can be either public or private. Consider this scenario, where we have a public topic branch for the login feature. If we wanted to try out an alternative way of implementing the user authentication, we could create a private topic branch named idea-for-login in our local repository. If we are satisfied with what we have achieved, we might decide to merge that private branch into the public login branch and share it with the rest of the team, before finally getting rid of it. As we discussed in module three, private branches have a major advantage compared to public ones. They let us rewrite our history as much as we want before publishing them, so no one will ever have to know or care about our private branches, since they only exist in our local repository. So, how does all this apply to workflows? Well, cheap, lightweight branches mean we can use them any way we like, adjusting them to fit whatever workflow we decide to use for our own work, as well as within the context of a specific project or team. Git doesn't dictate how we should use our branches; it's completely up to us. However, when evaluating a workflow, it's useful to keep in mind this guiding principle.
Regardless of its purpose, a branch can be either public or private, depending on who has access to it, and long-running or topic, depending on its lifetime and scope.
Different Ways of Merging
A branch is created in the same way regardless of how it's going to be used. It doesn't matter if it's going to be public or private, long-running or topic. What does change, however, is the way it gets merged. Let's look at the different ways we can merge a branch. In Git, there are two kinds of merge operations: a fast-forward merge, and a true merge. Let's talk about fast-forward merges first. Consider this scenario: there are two commits, D and E, that are reachable from the feature branch, but not from master. The important thing to notice here is that the master branch points to a commit that's an ancestor of the commit referenced by feature. In other words, master and feature are on the same line of history. If we were to merge feature into master by saying git merge feature, Git would notice that master is already reachable from feature, and simply move the master reference forward so that it points to the same commit as feature. This is what's called a fast-forward merge, and it doesn't create a merge commit. Now, let's look at a different scenario. In this case, the line of history of master has diverged from the one of feature with the creation of commit C. This time, if we were to merge feature into master, Git wouldn't be able to do a fast-forward merge, because C isn't reachable from feature and would be lost in the process. Instead, Git does a so-called true merge. That is, it applies the changes contained in the snapshots of D and E on top of C, followed by the creation of a merge commit, M, to tie the two lines of history together. But what if we wanted to do a fast-forward merge with a diverged history, like in this example? Well, in that case, we could move the commits in the feature branch so that they become descendants of the latest commit in master. In other words, we would change the parent of D from B to C. That would put them on the same line of history as master without losing commit C. The command to do that is git rebase.
As the name implies, rebase allows us to change the base of our commits to a different commit than the original one. In this case, we want to rebase feature on top of master, so we say git rebase master feature. This means move HEAD to feature, and rebase it on top of master. Git will now identify the first common ancestor of feature and master, also known as the merge base, to determine which commits are reachable from one, but not the other. Then, it will reapply each of those commits, one at a time, on top of the commit referenced by master. At this point, we can merge feature into master with a fast-forward merge. Note that rebasing is merging. The difference is that instead of merging two snapshots at once, with their entire sets of changes, each commit is merged one at a time. Any merge conflicts that would appear when doing a true merge would also be there during a rebase. However, instead of facing them all at once, we can deal with them as each conflicting commit gets applied. Now, let's think about the opposite scenario. What if we wanted to create a true merge, even if the two branches are on the same line of history? In that case, we can specify the --no-ff option to git merge, which stands for no fast-forward. So, how do we know if we should use one kind of merge or the other? Well, consider this: with fast-forward merges, every commit appears to be in the same line of history, even if in reality, it may have been done in any number of branches. True merges, on the other hand, reflect the way history has diverged and reconnected, even if the branch references that were involved no longer exist. So the advantage of using fast-forward merges is that history is linear, and therefore becomes very easy to read. The disadvantage is that if a merge introduces a problem, we can't easily revert it, because we can't tell which line of history the merged commits came from.
With true merges, this becomes trivial, since we can see the different branches of history and revert the faulty merge commit. The downside is that a history littered with merge commits is harder to follow, since there are multiple lines of history that intersect with each other. So, in which situations are fast-forward merges more suitable than true merges, and vice versa? Well, the answer lies in one important side effect of rebasing. Changing a commit's parent causes the ID of the commit itself to change, along with all its descendants, so rebasing is rewriting history, and like we saw in module three, we should never rewrite commits that have been published, since they may have been fetched by someone else. So here is a general rule of thumb: if you are merging a public branch into another public branch, like for example between two long-running branches, then use true merges. This will make it obvious where the merged commits came from. If instead you are merging a private branch into a public one, like for example, merging your own topic branch into a long-running one, then prefer fast-forward merges. Other people likely don't care about the fact that those commits were done in a topic branch that only existed in our local repository, so we should rebase our work on top of the public branch before merging it with a fast-forward merge. But what about traceability, you might ask? Well, if you keep your commits consistent and well documented, as we talked about in module three, it shouldn't be too difficult to identify and revert a faulty commit, regardless of whether the history is linear or has multiple branches.
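The two approaches compared above can be sketched side by side. The commit helper, the one-file-per-commit layout and the extra topic branch are assumptions made to keep the example free of conflicts; --ff-only is added to make Git refuse anything other than a fast-forward.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"
git config user.email "author@example.com"
# commit <name>: one small file per commit, so nothing ever conflicts.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A; commit B
git checkout -q -b feature
commit D; commit E
git checkout -q master
commit C                      # master has now diverged from feature

# Private branch into public branch: rebase, then fast-forward.
git rebase -q master feature  # checks out feature, replays D and E on top of C
git checkout -q master
git merge -q --ff-only feature
linear_merges=$(git rev-list --merges --count HEAD)

# Opposite scenario: force a merge commit although a fast-forward is possible.
git checkout -q -b topic
commit F
git checkout -q master
git merge -q --no-ff --no-edit topic
forced_parents=$(git cat-file -p HEAD | grep -c '^parent ')

git log --oneline --graph
```

After the rebase, history up to E is a single line with no merge commits, while the forced merge of topic leaves a commit with two parents.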
Cherry-picking
Besides fast-forward merges and true merges, there is another way we can bring commits into a branch: by using the cherry-pick command. Cherry-pick allows us to apply the patch from individual commits on top of the commit referenced by HEAD. For example, let's say we are interested in merging just the commit E into master. In this case, we couldn't simply say git merge feature, because that would bring in commit E along with all its ancestors. Instead, we would use the cherry-pick command, and say git cherry-pick feature, where feature refers, of course, to commit E. At this point, Git will apply the patch from E on top of C, thus creating a new commit, E prime. E and E prime are going to have different IDs because they have different parents and different timestamps; however, they are going to be patch equivalent, so by convention, we mark them as prime, like we saw in module four. We can also cherry-pick more than just a single commit; in fact, as of Git 1.7.2, cherry-pick supports ranges of commits using the familiar dot dot notation. The important rule to remember here is that the first commit you specify in the range is going to be excluded, while the second one is going to be included. Here is an interesting application of this feature. Let's say we want to cherry-pick all commits reachable from feature into master, starting from feature's branching point. In this example, the range we are interested in includes commits D and E. In order to cherry-pick both of them with a single command, we could say git cherry-pick $(git merge-base master feature)..feature. Let's break this down. The beginning of the range is marked by the first common ancestor of both master and feature, that is, B, whose commit ID we obtain through the merge-base command. That commit is excluded from the range. The end of the range is marked by the commit referenced by feature, that is, E, which is included in the range.
This command is so useful that we could make it a little bit more generic, and have an alias for it. Let's call it append. Notice that the alias here is a shell command, indicated by the exclamation mark, where the commit reference that marks the end of the range is defined as a parameter. This allows us to use it with any reference. Like for example, git append feature~1, which means cherry-pick the commits reachable from the parent of feature, but not from HEAD, which in this case corresponds to commit D.
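The single cherry-pick, the range and the append alias can be sketched as follows. The transcript doesn't show the exact definition of the alias, so the one below is an assumption about what it could look like; the scratch branches single, range and aliased exist only to keep the three experiments apart.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"
git config user.email "author@example.com"
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A; commit B
git checkout -q -b feature
commit D; commit E
git checkout -q master
commit C

# 1. A single commit: only E is applied on top of C, as E'.
git checkout -q -b single master
git cherry-pick feature > /dev/null

# 2. A range: everything after the merge base (B, excluded) up to feature (E, included).
git checkout -q -b range master
git cherry-pick $(git merge-base range feature)..feature > /dev/null
picked_in_range=$(git rev-list --count master..range)

# 3. A hypothetical definition of the append alias as a shell command.
git config alias.append '!f() { git cherry-pick $(git merge-base HEAD $1)..$1; }; f'
git checkout -q -b aliased master
git append feature~1 > /dev/null
picked_by_alias=$(git rev-list --count master..aliased)
```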
Resolving Conflicts
Every time we merge a commit, like for example, when doing a merge, a rebase, or a cherry-pick, Git is going to combine the changes contained in the snapshot of the commit that's being merged with the one of the commit that it's being merged into. If a line in a given file has been added, deleted or modified on only one side of the merge, then Git will include that line in the resulting file as is, no questions asked. However, if both sides happen to modify the same line, Git can't decide which one should be included, so it declares a merge conflict and stops, waiting for us to settle the dispute. In this situation, some version control systems, in a noble but often vain attempt to be helpful, try to automatically resolve the conflict on our behalf, based on some smart algorithms. Git does none of that. Instead, it helps us resolve the conflict by giving us a set of tools that make the job easier. Let's look at a few of them. First, let's get ourselves a merge conflict by merging the feature branch into master. Git tells us that it can't apply the patch from commit E because of a conflict in the calculator.c file, and stops in the middle of the merge. Checking the status confirms that one file was successfully merged and added to the index, while calculator.c was modified by two commits and is still only in the working directory. Now, if we wanted to, we could get out of this situation right now by simply passing the --abort option to git merge, and HEAD would be back at the commit it was pointing to before we started the merge. Let's go ahead and open up the conflicting file in an editor. You may notice that Git uses the same notation as the merge program included in the Revision Control System, or RCS, suite of tools to highlight the lines involved in the conflict. The section above the equal signs contains the line as it appears in the commit that's being merged into, also called ours and referenced by HEAD.
The section below it contains the line from the commit that's being merged, also called theirs, and referenced by a special ref called MERGE_HEAD. Now, the key to successfully resolving a conflict lies in understanding the context in which the two changes were made. In this case, it's not obvious which line we should choose just by looking at them. We need more information about the context, and that's exactly where Git shines. Let's quit our editor for now. While searching for more clues, let's find out which commits touched the conflicting files by passing the --merge option to git lg. Assuming the commits are well documented, we might at this point have enough information to make a decision based on the contents of the commit messages. If we still can't decide how to resolve the conflict, we might be able to gather some more insights by looking at the file itself as it was before the conflicting commits were made. In Git parlance, this commit is called the merge base, and it's the first common ancestor of two or more commits. We can ask Git to find the merge base between two commits by using the merge-base command. In this case, we want to get the merge base between the commit that's being merged into and the commit that's being merged. If there are multiple common ancestors, Git is going to choose the best one; that is, the one that's closest to the specified commits in the line of history. Like for example, in this case, commits A and B are both common ancestors of C and E, but Git reports B as the merge base because it's the closest one. Once we have the merge base, we can do a so-called three-way merge. As opposed to the more traditional two-way merge, a three-way merge compares the conflicting files not only against each other, but also against their common ancestor. We can ask Git to include all three versions of the conflicting lines in the calculator.c file in our working directory by using the --conflict option of git checkout.
At this point, if we open up our calculator.c file, we see that Git has added the version of the line from the merge base after the pipe marks. If we would like Git to always do that in the case of a conflict, we can tell it to do so by setting the merge.conflictstyle option to diff3. At this point, we can probably tell that in order to resolve the conflict, we need to combine both changes, with the final line including both the word simple and the newline character. So let's go ahead and do that, before finally getting rid of the merge markers. If we now do a git diff after we resolve the conflict, we get another interesting output. Git shows a so-called combined diff, containing both the original versions of the conflicting lines and the merged one, with different indentations. Sometimes, the right way to resolve a conflict is to simply choose one version of the file entirely. If that's the case, we can do so by passing the --ours or --theirs option, respectively, to git checkout. Once we are done, we can add the merged file to the index and create the merge commit. Regardless of how you decide to resolve a conflict, the important thing to remember is to never introduce changes in the merge commit that aren't part of either side of the merge. This would create a so-called evil merge, which, as the name implies, can make it hard to track down the origin of the change, since it only exists in the merge commit.
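The conflict-resolution tools covered in this section can be walked through with a sketch like the one below. The contents of calculator.c are made up to match the description (one side adds the word simple, the other adds a newline character), and plain git log stands in for the lg alias.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"
git config user.email "author@example.com"
# write <line>: replace calculator.c with a single line, kept literally.
write() { printf '%s\n' "$1" > calculator.c; }

write 'puts("A calculator");'
git add calculator.c && git commit -q -m "B: print a banner"
git checkout -q -b feature
write 'puts("A simple calculator");'
git commit -q -am "E: call it simple"
git checkout -q master
write 'puts("A calculator\n");'
git commit -q -am "C: end the banner with a newline"

git merge feature > /dev/null || echo "merge stopped on a conflict"
git status --short             # calculator.c is unmerged
git log --merge --oneline      # commits from both sides that touch it

# The file as it was in the merge base.
base=$(git merge-base HEAD MERGE_HEAD)
git show "$base:calculator.c"

# Rewrite the conflict with the base version after the pipe marks.
git checkout --conflict=diff3 calculator.c
markers=$(grep -c '^|||||||' calculator.c)
cat calculator.c

# Other ways out: git merge --abort, or taking one side wholesale with
# git checkout --ours calculator.c / git checkout --theirs calculator.c.

# Resolve by combining both changes, then conclude the merge.
write 'puts("A simple calculator\n");'
git diff                       # combined diff of ours, theirs and the result
git add calculator.c
git commit -q -m "M: merge feature"
parents=$(git cat-file -p HEAD | grep -c '^parent ')
```

To get the three-way markers by default, one could run git config merge.conflictstyle diff3 before merging.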
Reusing Recorded Resolutions
So far, we have talked about how Git can help us resolve a single merge conflict, but Git can do more than that. It can even help us resolve the same conflict multiple times, thanks to a fairly unknown feature called rerere, or reuse recorded resolution. Interesting for sure, but why would we ever need to do that, you might ask? Well, consider this workflow: let's say that we have a private topic branch named feature that we've been working on for a while. In order to make sure that our changes are still compatible with what's in the long-running branch, master, we do a test merge from feature into master. But first, let's activate git rerere in our configuration file. We then proceed by merging feature, which unsurprisingly leads us right to the same conflict we saw before. This time, however, git rerere is aware of the fact that we are in the middle of a conflict resolution, and is ready to record which lines are involved, as well as the line that will end up in the merged file. At this point, we go ahead and resolve the conflict. As you can see, rerere has recorded the resolution in its own cache. We can now create the merge commit and continue working on our topic branch. At some point, we might feel ready to merge our work back to master before sharing it with the rest of the team. However, we don't want to clutter the public history of the repository with our test merge commit. Instead, we want to remove it and rebase the feature topic branch on top of the latest commit in master. This will allow us to do a fast-forward merge and maintain a nice linear history, as we have discussed earlier in this module. So let's go ahead and do that. Remember that rebasing is the same as merging one commit at a time; this means that the conflict we resolved previously in the merge commit is going to appear again once we reach that same commit during the rebase, and sure enough, that's exactly what happens. However, this time, we don't have to do anything.
Git rerere has recognized the lines on each side of the conflict and reused our previous resolution. Notice that it didn't add the merged file to the index, so we still have a chance to inspect it to make sure that everything looks okay. At this point, we can stage the merged file and continue with the rebase. Git rerere is a real time saver when we have to merge two branches multiple times, since we only have to resolve any given conflict exactly once.
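The rerere workflow above can be sketched end to end. The file contents are the same made-up calculator.c lines used to provoke a conflict, and GIT_EDITOR=true is an assumption added so that continuing the rebase never waits for an editor in a script.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"
git config user.email "author@example.com"
write() { printf '%s\n' "$1" > calculator.c; }

write 'puts("A calculator");'
git add calculator.c && git commit -q -m "B: print a banner"
git checkout -q -b feature
write 'puts("A simple calculator");'
git commit -q -am "E: call it simple"
git checkout -q master
write 'puts("A calculator\n");'
git commit -q -am "C: end the banner with a newline"

# Activate rerere for this repository.
git config rerere.enabled true

# Test merge: resolve the conflict once; the commit records the resolution.
git merge feature > /dev/null 2>&1 || true
write 'puts("A simple calculator\n");'
git add calculator.c
git commit -q -m "Test merge of feature"

# Throw the test merge away and rebase feature on top of master instead.
git reset -q --hard HEAD~1
git rebase master feature > /dev/null 2>&1 || true

# The same conflict came back, but the working file already holds the
# recorded resolution. It is not staged, so it can still be inspected.
replayed=$(cat calculator.c)
git add calculator.c
GIT_EDITOR=true git rebase --continue > /dev/null

git checkout -q master
git merge -q --ff-only feature
merges=$(git rev-list --merges --count HEAD)
```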
Summary
In this module, we looked at how to take advantage of Git's branches to work with multiple lines of history. We started out by looking at Git's branching model and how it differs from the one of traditional version control systems. Then, we identified the different kinds of branches we can create: public or private, long-running or topic; and in what scenarios they're useful. Following up on that, we looked at the different ways we can merge branches; that is, with fast-forward merges, true merges and cherry-picking, and when it makes sense to prefer one way over the other. Finally, we saw some of the great tools that Git puts at our disposal to help us resolve merge conflicts. In the next module, we're going to discover Git's forgiving nature by learning how we can rewrite history to correct a mistake or to undo a previous decision.
Rewriting History
Introduction
In this module, we'll discover Git's forgiving nature by learning how to rewrite the history of our repository in order to correct a mistake or to backtrack on a previous decision. We'll start out by looking at how we can amend commits, both recent and old, in order to modify their contents or metadata. In the process of rewriting history, we sometimes end up in a situation where we'd like to undo what we just did and start over. So our next topic is going to be how to reverse the state of our history by implementing our own undo command. Next, we'll see how we can recover commits that we thought we had lost by taking advantage of the reflog. Finally, we'll see how we can even debug our code base using Git. Let's get started.
Editing Commits
Look, we all make mistakes. Sometimes we wouldn't even call it a mistake. We simply change our minds about a choice we made earlier. This happens all the time, especially when programming. Unfortunately for us though, version control systems have traditionally been rather unforgiving when it comes to changing history. Once something was committed to source control, it became final. Git is different, more human in a sense. In fact, Git is okay with us changing our history as much as we want, as long as we haven't shared it with anyone else. Let me give you a quick demonstration. Imagine that we just made a commit when we suddenly realize that we forgot to include one file in it. If we were using a tool like Subversion, we would have to make a new commit with only the missing file and some apologetic comment. In Git, we can simply modify our previous commit. How? Just like we are used to. First we add the file to the index, and then we use git commit. However, this time we pass the --amend option. This causes the contents of the index to become part of the same commit referenced by HEAD. When we do that, our default editor opens up with the message from that commit, allowing us to further clarify it or just leave it as it is. If we know beforehand that we are going to want to reuse the message from the latest commit, we can speed things up a bit by specifying the -C option, which tells Git to reuse the message from the commit with the specified reference. So -C HEAD reuses the message from the commit referenced by HEAD. Having the ability to rewrite history is incredibly liberating, and arguably one of the greatest advantages of using Git. It allows us to make the tool fit our preferred way of working and not the other way around. For example, we can use commits as temporary snapshots of the state of our working directory while we work on a feature. If we make a misstep, we can quickly go back to a good state by removing the faulty commits.
For example, let's say that the idea we developed in commits D and E turns out to be a dead end. That's not a problem. Since we are working in our private branch, we can simply get rid of them by using the reset command. Reset HEAD~2 moves HEAD to its second ancestor, that is, commit C. Notice that we added the --hard option. This tells Git not only to move the HEAD reference, but also to update the index as well as the working copy to match the snapshot of C. This means that any uncommitted changes we might have had in our working directory would be lost. Let me open a brief parenthesis here. You might have heard that in Git nothing is ever lost. And that certainly is true to an extent. The truth is that anything can be recovered as long as it has been committed. This makes git reset --hard one of the few destructive commands that exist in Git, since it will literally reset the working directory to match the snapshot of a certain commit without warning. So be careful when you use it, or you might end up losing part of your work. Now, back to our discussion. At this point, we might feel satisfied with what we have achieved in our feature branch, and we are ready to share it with the rest of the team. However, before we do, we need to go through our temporary commits to ensure that they are consistent and well documented, as we talked about in module three. For example, we might decide that the changes in commits B and C are strongly related to each other and should be squashed into the same commit. Once again, we can do that quickly by using the reset command, this time with the --soft option. The difference between --hard and --soft is that the latter moves HEAD to commit B, but leaves the index and the working directory untouched. This means that the contents of the index still match the ones that were part of commit C. At this point, we could include them in commit B by simply amending it, like we did before.
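The amend, hard reset and soft reset steps described so far can be sketched in one small session. The file names and the exact order of commits are assumptions chosen to keep the example short.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"
git config user.email "author@example.com"

echo a > a.txt && git add a.txt && git commit -q -m "A"
echo b > b.txt && echo forgotten > forgotten.txt
git add b.txt && git commit -q -m "B"

# Oops, forgotten.txt belongs in B: stage it and amend, reusing B's message.
git add forgotten.txt
git commit -q --amend -C HEAD
count_after_amend=$(git rev-list --count HEAD)

# Two commits that turn out to be a dead end...
echo d > d.txt && git add d.txt && git commit -q -m "D"
echo e > e.txt && git add e.txt && git commit -q -m "E"
# ...are dropped, together with their files in the index and working copy.
git reset -q --hard HEAD~2

# Squash C into B: --soft moves HEAD back to B but keeps C's changes staged.
echo c > c.txt && git add c.txt && git commit -q -m "C"
git reset -q --soft HEAD~1
git commit -q --amend -C HEAD

git log --oneline
```

The final history should contain just A and B, with B now carrying b.txt, forgotten.txt and c.txt.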
Or we could change our mind again and decide that commit B should really be split into two different commits after all. In order to do that, we need to move HEAD to its parent, that is, A, and have the changes contained in commit B exist only in the working directory. As it turns out, the reset command can do that too. We just need to invoke it using a third mode of operation called mixed, which is also the default one. The --mixed option moves HEAD to the specified commit and updates the index to match its snapshot, but doesn't update the working copy, which is left with the changes once contained in commit B. Now we can simply stage and commit the two files separately. So, to summarize, git reset has three modes that control what's going to be changed. Hard moves HEAD and updates both the index and the working directory to match the destination commit. Mixed moves HEAD and updates just the index. And finally, soft simply moves HEAD. Invoking reset without a mode option implies mixed. Now, what if the commit we want to amend isn't the latest one in the branch? Let's say we realize that it's commit A that should have had an extra file. Well, we could use git reset to move HEAD all the way back to A and amend that commit, but then we would have to recreate all its descendants, in this case, B and C. So, instead of doing that, it's way easier to simply create a new commit that adds the missing file and then squash it into A through an interactive rebase like we saw in module three. We can make this even quicker through a relatively unknown feature called auto-squashing. Let me show you how it works. First, let's change the message of our current commit by prefacing it with the string fixup! (fixup, exclamation mark), followed by the message of the commit we want it to be squashed into. Then we start our interactive rebase from the first commit, indicated by --root, with the addition of the --autosquash option.
Autosquash tells Git to look for a commit whose message begins with whatever comes after the string fixup!, and to automatically move the fixup commit right under it, changing its action to fixup. At this point, all we have to do is save the file and exit, letting rebase do its thing. You can imagine what a time saver this is, especially when you have a few of these fixup commits spread throughout the branch. An interactive rebase is all it takes to squash them into the right place. If we'd like Git to always do that for every interactive rebase, we can enable autosquashing in the configuration file through the rebase.autoSquash option, at which point we no longer have to explicitly add the --autosquash option when doing an interactive rebase.
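Here is a minimal sketch of auto-squashing in a scratch repository. The file names and demo identity are invented, and setting GIT_SEQUENCE_EDITOR to true is an assumption made so the script can run unattended: it accepts the generated todo list unchanged, which stands in for saving the file and exiting the editor.

```shell
set -e
cd "$(mktemp -d)" && git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

for c in A B C; do
  echo "$c" > "$c.txt"
  git add "$c.txt"
  git commit -q -m "$c"
done

# Commit the file that A should have contained, marked as a fixup for "A".
echo "extra" > extra.txt
git add extra.txt
git commit -q -m "fixup! A"

# Interactive rebase from the root commit with autosquash enabled.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --root --autosquash

history=$(git log --reverse --format=%s | tr -d '\n')   # oldest first
echo "history: $history"
echo "files in A:"
git ls-tree --name-only HEAD~2
```

The fixup commit disappears from the history and extra.txt ends up inside commit A. Git can also write the fixup! message for us through git commit --fixup followed by a reference to the target commit.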
Undo
Sometimes when we rewrite our commits, we end up with a history that looks nothing like the way we want it. When that happens, we wish there was a way to undo our actions, just like we can do in any editing software. If you've ever found yourself in one of those situations, you'll be surprised to find out that there is indeed a way to undo our actions in Git. Let me show you how. Every time a branch reference moves, that is, it points to a different commit than it did before, Git records its previous position in a sort of journal called the reflog. In every repository, there is a reflog for each branch as well as one for the HEAD reference. We can get the list of entries in the reflog of a given branch by using the reflog command followed by the name of the branch we are interested in. For example, git reflog master shows the reflog entries for the master branch. If we instead wanted to look at HEAD's reflog, we would simply omit the argument and say git reflog. The entries stored in the reflog are in reverse chronological order, with the most recent ones on top. Notice also that each entry has an index. This is very handy because we can use that index to point to the commit referenced by a certain reflog entry using this special syntax, reference@{index}, where reference can either be the name of a branch or HEAD. Index is the entry's position in the journal. So for example, if we wanted to look at, say, the commit HEAD was pointing to two positions ago, we would say git so HEAD@{2}. Note that here we are using the so alias we defined in module two instead of the regular git show command in order to give a prettier and more concise output. If we instead wanted to look at the commit master was referencing just before the latest one, we would say git so master@{1}. Now, think about this. The reflog keeps track of the history of commits referenced by a given branch, just like the history of a web browser keeps track of the URLs we visit.
This means that the commit referenced by @{1} is always the commit that was referenced before the current one. If we were to combine the reflog with the git reset command that we saw earlier in this module, like this, for example, we would suddenly have a way to move HEAD, the index, and the working directory to a previous commit referenced by a branch. This is essentially the same as pressing the back button in our web browsers. At this point, we have all the pieces we need to implement our own Git undo command, which we do in the form of an alias. Here it is. Now, there are three interesting bits to note here. One, we are defining the alias as a shell function named f, which is then invoked immediately. Two, we are using the rev-parse command followed by the --abbrev-ref option to get the name of the current branch, which we then concatenate with the @ syntax to form the reference to a previous position. Three, we have specified the position in the reflog as a parameter with the default value of one. This is the whole reason why we define the alias as a shell function, to provide a default value for the parameter using the standard shell syntax. The beauty of using an optional parameter like this is that it allows us to undo any number of operations. However, if we don't specify anything, it's going to undo just the latest one. Let's try it out. Say that we first remove the last two commits in master, D and E, and then we merge the feature branch. At this point, we have a history that looks like this. Now, let's say that we want to undo our last two operations. We can do that quickly by using our undo alias. And there you go. Our history is now back to the way it looked before we started rewriting it. But what if we wanted to undo the undo? Well, since git undo also counts as an operation, all we need to do is to once again undo our latest operation with git undo, which is the equivalent of saying git undo 1.
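The alias itself is not reproduced in this transcript, so what follows is a reconstruction from the three points above rather than the exact text shown on screen. It is tried out in a scratch repository; the feature branch, its commit X, and the demo identity are assumptions made for the demonstration.

```shell
set -e
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the branch is called master
git config user.name "Demo" && git config user.email "demo@example.com"

# Reconstructed alias: a shell function f, invoked immediately, that resets to
# <current-branch>@{N}, where N is the first argument and defaults to 1.
git config alias.undo '!f() { git reset --hard "$(git rev-parse --abbrev-ref HEAD)@{${1-1}}"; }; f'

for c in A B C D E; do
  echo "$c" > "$c.txt"; git add "$c.txt"; git commit -q -m "$c"
done
original=$(git rev-parse HEAD)

# A feature branch that forks off at C and adds one commit, X.
git checkout -q -b feature HEAD~2
echo "X" > X.txt; git add X.txt; git commit -q -m "X"
git checkout -q master

git reset -q --hard HEAD~2     # operation 1: remove D and E from master
git merge -q feature           # operation 2: merge the feature branch
merged=$(git rev-parse HEAD)

git undo 2                     # go back two positions in master's reflog
after_undo=$(git rev-parse HEAD)
git undo                       # undo the undo (same as: git undo 1)
after_redo=$(git rev-parse HEAD)
```

After git undo 2, master points at E again. The second git undo brings it back to the merged state, because the undo itself was recorded in master's reflog.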
Recovering Commits
In module five, we saw that when a commit is no longer reachable through a symbolic reference, it becomes unreachable, like in the case when we delete the last branch that was pointing to it. We also saw that we can get a list of unreachable commits in our repository by using the git fsck command with the --unreachable option. What we didn't talk about, however, is what actually happens to the commits once they become unreachable. Well, Git, at its core, is designed to work like a filesystem, and as such, it cares a great deal about the integrity of our data. So when we move, add, or delete commits in the process of rewriting history, nothing is actually lost. Even if there seems to be no way to get a hold of a commit, there is always one last reference: the reflog, or HEAD's reflog, to be exact, since it records all commits that have been referenced by HEAD at one point or another. In fact, if we don't count the reflog entries as references, the number of unreachable commits found in our repository is higher. Let me demonstrate it. First, let's count the number of unreachable commits, excluding the ones referenced by the reflog. As you can see, we have 367 unreachable commits right now. Now, let's count them again with the --no-reflogs option, which ignores the reflog entries. That's 409 commits. This means that there are 42 commits that are unreachable through a symbolic reference, like a branch or a tag, but are still reachable through the reflog. Keep in mind that the entries in the reflog won't stay around forever. Git will in fact remove entries older than a certain number of days as part of a garbage collection cycle. The number of days depends on whether the referenced commit is still reachable or not. The default expiration times are 90 days for reachable commits and 30 days for unreachable commits. We can change those values by setting the gc.reflogExpire and gc.reflogExpireUnreachable configuration variables, respectively.
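The two counts can be reproduced on a small scale. This sketch uses a scratch repository in which a branch with one commit is deleted. The branch name, the file names, and the 120 and 60 day values are arbitrary choices for illustration, not recommendations.

```shell
set -e
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo" && git config user.email "demo@example.com"

echo "A" > A.txt; git add A.txt; git commit -q -m "A"

# Create a commit on a topic branch, then delete the only branch pointing to it.
git checkout -q -b topic
echo "T" > T.txt; git add T.txt; git commit -q -m "T"
git checkout -q master
git branch -q -D topic

# Count unreachable commits; by default reflog entries count as references.
# "|| true" keeps the script going when grep finds no match and exits non-zero.
with_reflog=$(git fsck --unreachable 2>/dev/null | grep -c '^unreachable commit' || true)
without_reflog=$(git fsck --unreachable --no-reflogs 2>/dev/null | grep -c '^unreachable commit' || true)
echo "counting reflog entries as references: $with_reflog"
echo "ignoring the reflog:                   $without_reflog"

# Keep reflog entries around longer than the defaults (example values only).
git config gc.reflogExpire "120 days"
git config gc.reflogExpireUnreachable "60 days"
```

Commit T only shows up as unreachable once the reflog is ignored, because HEAD's reflog still remembers it.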
So, what does this mean for our day-to-day work? Well, it means that we can rewrite our history as much as we like without having to worry about losing our commits, since they'll still be around in the repository for between one and three months before they are finally deleted. But how do we recover a commit from the reflog? Well, there are a couple of ways to do it depending on what it is we want to recover. If we want to restore a branch reference so that it points to an older commit, we can use the reset command as we've seen earlier in this module. If we instead are interested in recovering one specific commit, we can use the cherry-pick command that we saw in module four. For example, let's say that our history consisted initially of these commits, but then we accidentally removed commit C while doing an interactive rebase so that history now looks like this. Now, we know that commit C is still in HEAD's reflog since HEAD did reference it at some point. So we could just go ahead and look through the reflog entries until we find the one whose commit message is C. But there is a quicker way. Instead, we can simply search for that particular commit message in the entire reflog by using the git log command with the --grep and --walk-reflogs options. This will give us the list of reflog entries that point to a commit whose message contains the string C. At that point, we can simply cherry-pick the one with the lowest position since that's the most recent one. There is a problem, though. The cherry-pick command is going to apply the commit at the tip of the branch, which isn't exactly what we want. We would like to put C back to where it was originally: between B and D. So how do we do that? Well, as always, there are a few different ways to do it, but the one that involves the fewest steps is this. First, we move HEAD to C's old parent commit, that is, B.
We are using checkout here because we want to move HEAD by itself without modifying master, which is still pointing to commit D. Then we cherry-pick commit C on top of B by fetching it directly from the reflog. One thing to notice here is that since we moved the HEAD reference a few times, the entry for commit C has probably moved down a few positions in the reflog since the last time we checked, so we should look for it again to make sure that we cherry-pick the latest one. We then follow up by cherry-picking commit D, which is still referenced by master. Finally, we move the master branch reference to point to the new D commit using the branch command we saw in module five. The -f option forces Git to set an existing branch to a specific commit. As a final step, we move HEAD to master using checkout. And we are done. As you can see, commit C is now back in its place in our history.
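The recovery steps above can be sketched end to end in a scratch repository. Two things here are assumptions made for the sake of a script that runs unattended: commit C is dropped with git rebase --onto instead of an interactive rebase, and the lost commit is looked up by hash (through --format=%H) rather than by its reflog position, which sidesteps the shifting positions mentioned above.

```shell
set -e
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo" && git config user.email "demo@example.com"

for c in A B C D; do
  echo "$c" > "$c.txt"; git add "$c.txt"; git commit -q -m "$c"
done

# "Accidentally" drop C: replay D directly on top of B.
git rebase -q --onto master~2 master~1
before=$(git log --reverse --format=%s | tr -d '\n')

# Find the most recent reflog entry whose commit message is exactly "C".
lost=$(git log --walk-reflogs --grep='^C$' --format=%H -1)

git checkout -q master~1              # detach HEAD at B, C's old parent
git cherry-pick "$lost" > /dev/null   # put C back on top of B
git cherry-pick master > /dev/null    # re-apply D, still referenced by master
git branch -f master HEAD             # point master at the new D
git checkout -q master

after=$(git log --reverse --format=%s | tr -d '\n')
echo "before recovery: $before"
echo "after recovery:  $after"
```

The history goes from A, B, D back to A, B, C, D, with C and D recreated as new commits on top of B.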
Debugging
As incredible as it may sound, Git can even help us track down a problem in our codebase. How? Let me give you an example. Let's say that we just pulled the latest commits from a shared repository and decided to run the build script to verify that everything is okay. Uh oh, we got a problem. The code doesn't compile. In situations like this, the first thing we want to know is when did this happen? Or to be more precise, which commit broke our code? If we knew exactly where the problem is, we could answer that question pretty much immediately using git blame on the file that contains the error like we saw in module four. And that would give us the ID of the commit that modified the faulty line along with its author. However, if we can't determine where the problem is, our last resort is to go through a fair number of commits, manually inspecting each and every patch. Tedious, to say the least. Fortunately, once again Git has us covered thanks to a command called git bisect. Bisect, as the name sort of implies, helps us dissect our code by doing a binary search through its history of commits. Here is how it works. First, we need to tell Git that we want to start a bisecting session by saying git bisect start. At this point, we'll need to give Git a range of commits to search. We do that by giving it two commits. First, the commit where things are bad, and then the commit where things were last known to be good. In this case, we know that things are pretty bad right now, so we say git bisect bad HEAD, which marks the current commit E as the bad end of the range. Now, we don't really know the last time things were good in this repository, so we're going to say that the latest known good commit is the first one, A. Git now has enough information to start bisecting. It starts by moving HEAD to the commit right in the middle of the range, in this case, C. The next step is for us to check the state of the working directory.
We do that by simply running our build script. It completes successfully, so commit C is good. Let's tell Git about that. This means that the faulty commit must have happened after C. In other words, it must be a descendant of C, so Git moves HEAD to the middle commit, this time in the upper half of the range, which brings us to commit D. Now, let's go ahead and run our build script once again. And here we have the compilation error. So commit D is bad. At this point, since there is only one commit left in the range, we can safely say that D is our faulty commit. And indeed, we can confirm that by looking at its patch. When we are done with our bisecting session, we move HEAD to its original position by saying git bisect reset. Thanks to git bisect, we only had to check two commits instead of five to determine which one of them introduced the error. But I know what you are thinking. Sure, this is certainly much faster than going through each and every commit manually, especially when you have a range containing dozens of commits. Still, having to run our build script manually at each step is time consuming. As it turns out, someone already thought of that. In fact, if you have an automated and repeatable way to verify the state of your working tree, like, for example, a build script, you can tell Git to automatically run it at each step during the bisect operation. If the script exits with a nonzero code, Git is going to assume that the current commit is bad, while a zero exit code means the commit is good. So, since we do have a build script, let's rerun our bisect session, this time letting Git do all the work. git bisect start HEAD HEAD~4 is a shorthand for starting the bisect session and giving the search range in one go: the bad commit first, followed by the good one. Then we simply say git bisect run make, at which point Git runs off doing its thing and gets back to us when it has found the first bad commit, which, as we know, is commit D.
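An automated bisect session along these lines might look like the following sketch. To keep it self-contained, the build is replaced by a one-line check that fails whenever the hypothetical file code.txt contains the word BUG. That check is a stand-in for make, and the file contents and demo identity are invented.

```shell
set -e
cd "$(mktemp -d)" && git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

# History A-B-C-D-E in which commit D introduces the breakage.
for c in A B C D E; do
  if [ "$c" = "D" ]; then echo "BUG" >> code.txt; else echo "$c" >> code.txt; fi
  git add code.txt
  git commit -q -m "$c"
done

# Start bisecting: bad commit first (HEAD, that is E), then the good one (HEAD~4, that is A).
git bisect start HEAD HEAD~4 > /dev/null

# Stand-in for "git bisect run make": exit 0 means good, non-zero means bad.
git bisect run sh -c '! grep -q BUG code.txt' > /dev/null

# While the session is active, refs/bisect/bad points at the first bad commit.
first_bad=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset > /dev/null 2>&1
echo "first bad commit: $first_bad"
```

Git checks out the candidates on its own, runs the check at each step, and the script reports D as the first bad commit.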
Summary
In this module, we learned how to rewrite history to correct a mistake or simply change the shape of our commits to make them easier to read and interpret. We started out by looking at how we can edit existing commits, both by squashing multiple ones together and by breaking one commit into smaller ones. Then, we saw how we can undo our actions by taking advantage of the reflog, in case we end up with a history that doesn't look the way we intended. We also saw how we can use the reflog to recover unreachable commits. Finally, we learned that Git can even help us debug our code faster through bisecting, which means dissecting the history of our repository with a binary search algorithm. This concludes Advanced Git Tips and Tricks. I hope that you have found the information covered in this course to be useful in your day-to-day work with Git, and that you have learned new ways to use this incredibly powerful tool. Thanks for watching.
Course author
Enrico Campidoglio
Enrico is an Italian programmer and mentor with a strong passion for software quality and knowledge sharing.
Course info
Level: Advanced
Duration: 2h 28m
Released: 2 May 2016