How Git Works
by Paolo Perrotta
This course is for developers and system administrators who want to really understand Git. Whether you just started using Git, or you've been using it every day for months - this course will give you the knowledge you need to become a Git master.
Git Is Not What You Think
Hello, I'm Paolo. Welcome to Pluralsight. This is an advanced training about Git, but even if you're just a beginner, I think that you will be able to follow along and pick up some good knowledge. We're going to talk about how Git works internally, under the hood. Why is that important? Well, of course there is some geeky pleasure in understanding how things work, but that's not the most important reason to know this stuff. Give me one minute to tell you the real reason we're talking about the internals of Git. When you think about Git, you probably think about the high-level user commands, the so-called porcelain commands. You're probably familiar with the basic ones such as add and commit, and if you worked with a remote repository, then you probably also used push and pull. And if you worked with branches, then you used branch, checkout, merge, maybe even rebase. The list goes on. Some people get a little bit deeper than these, into the low-level commands, the so-called plumbing commands such as cat-file, hash-object, and a few more. These are the basic building blocks that the porcelain commands are built upon. You might never need to use the plumbing commands unless you're doing some advanced Git scripting or the like. Now, understanding all these commands can be hard, and some of them can be confusing. However, here is a key point. You could argue that the secret to Git is not about knowing the commands, either porcelain or plumbing. Instead, the secret to Git is about knowing the conceptual model behind Git. If you want to use Git safely and unleash all of its power and not get in trouble, then don't look at the commands. Look at the model instead. Once you do, the complexity of the Git commands kind of fades away. Suddenly Git looks simple, even elegant, I promise. You don't get stuck anymore.
So if you really want to become a Git master, then you should understand the model first. Once you do, you will also understand the commands much more deeply. And this is what I want to talk about in this training, the model, so let's get started.
Git Is an Onion
To wrap our heads around Git, let's talk about what Git really is. It's not necessarily what you think. Imagine that Git is layered like an onion. We won't try to understand the whole onion at once. That would be very ambitious, like eating the whole onion in one bite. Instead, we will peel off the layers of the onion until we reach Git's conceptual core. If you look up Git on Wikipedia, you will read that it's a distributed revision control system. That's a mouthful. Not only does Git do what other revision control systems do, it does it in a distributed way, which is harder to understand than, for example, Subversion, which is client-server. That's a lot of stuff to wrap your head around, so let's make it easier by peeling off one layer. Let's remove distribution. In this first part of the training, imagine Git is not distributed at all. Imagine that there is only one computer in the world, and there is a repository on that computer. That's all you want to think about for the moment. So Git becomes just a revision control system, no distribution. However, a revision control system is still a complex beast. It includes things such as history, branches, and merges. These features make things more complicated, so let's make it simple again. Let's peel off one more layer. What happens if you forget about branches, history, and the like? Now we have a smaller onion. You can call it "a stupid content tracker" because that's all it does. It tracks content: files or directories. And if you look at Git's documentation, you will see that this is actually Git's definition of itself, "Git, the Stupid Content Tracker." If you look at it as a content tracker, then Git is easier to understand, but let's take this one step further. Forget even about tracking files. Forget about the notion of a commit or versioning. Let's look at the very core of the onion, the basic idea behind Git, and I would say that at its core Git is just a map, a simple structure that maps keys to values.
And this structure is persistent. It's stored on your disk. Now we got to the core. During this training, we will rebuild the onion from the inside out, and we will understand each layer in depth. In the first module of this training, we will talk about the first two layers of the onion. This information will be pretty technical, and you might even wonder why we are going into so many details, but this is the groundwork. Rest assured that by the time we get to the second module and the upper layers of the onion, you will be surprised by how concretely useful this deep understanding turns out to be. Even things you've been doing every day with Git might look different and simpler.
I just said that at its core Git is a map. That means that it's a table with keys and values. What are the keys, and what are the values? Well, the values are just sequences of bytes, for example the content of a text file or even a binary file. Any sequence of bytes can be a value. You can give a value to Git, and it will calculate a key for it, a hash. Git calculates hashes with the SHA1 algorithm. That's S-H-A-1, or just "SHA1" for friends. Every piece of content has its own SHA1. For example, let's take a piece of content, the string Apple Pie. If you ask Git to generate a SHA1 out of this string, then you will get this hash, exactly this one. There is only one hash for this string. SHA1s are 20 bytes long, so in hexadecimal format they are a sequence of 40 hex digits. This will be Git's key to store this content in the map. We can also calculate the SHA1 on the command line. To do this, we need a command that you might never have heard about because it's a low-level plumbing command, git hash-object. So let's pass our piece of content to hash-object. I wish I could do it like this. It would be easy, but I can't. Hash-object is not very user-friendly (it's a plumbing command). So, if you do what I just did, Git will think that Apple Pie is the name of a file. Instead, I can use the echo command to output this content and then pipe the result into hash-object like this. I also need to tell hash-object to get its content from standard input, which is not very intuitive. If you're using Windows, then you will use different shell commands. But don't worry about doing this yourself anyhow. It's enough that you understand what this does. It prints out the hash for this piece of content. And here is the result. This is the SHA1 for the string Apple Pie. This is the same SHA1 that I showed you on the previous slide. We have the same content, so we get the same SHA1.
If you change anything in the content, even a single character (for example, I will add a newline character at the end, like this), then you get a completely different SHA1. Every object in a Git repository has a SHA1. If you put the string Apple Pie in a file and store this file in Git, then the SHA1 we just generated will identify the file. As we'll see later, directories also have their own SHA1, as do commits and so on. With so many SHA1s around, you might wonder what happens if they collide. After all, the number of possible SHA1s is large, sure, but it's not infinite. What if I have two different pieces of content and just by chance they happen to have the same SHA1? Wouldn't that make a mess of my project and cause me to lose my data? Well, yes it would, but it's unlikely to happen. Let's see just how unlikely it is, just because it's fun. Think of the US Powerball lottery. How many chances do you have of winning the lottery jackpot? Google tells me that the chances that a particular combination of numbers wins the jackpot are about 1 in 175 million. This is a large number, so let's try to visualize it. Imagine printing one ticket for every possible combination of numbers in the lottery. You get 175 million tickets. Now imagine putting all those tickets in a line, 1 every 25 cm. That's about 10 inches. That's a very long line of tickets, long enough to span the entire equator. Now imagine starting somewhere on the equator and taking a walk around the world. It's a long walk and also quite a bit of swimming, so it's going to take a while. And all along your trek you are walking beside this very long line of lottery tickets. Just once during your trip you're allowed to pick up a single ticket. And if you are really lucky, that's the one ticket that wins the jackpot. Congratulations! That's how hard it is to win the jackpot. Now imagine that you enjoyed winning the jackpot, so you want to try again.
You take a second trip around the world, you once again pick up a single ticket along your way, and you win again. And now that's really, really good luck. Winning the jackpot twice in a row is almost miraculous good luck, in fact. Now imagine doing it a third time, and amazingly you win again, and again...and again, six times in a row. Now winning the jackpot six times in a row is extremely unlikely, you will agree. Well, going back to Git, these are about the same chances of getting the same SHA1 for two different pieces of content. It's just not likely to ever happen to you or to anybody by chance. So for all practical purposes, SHA1s are unique. Not just unique in your project. You can think of them as if they were unique in the universe. You could put all of the data you will ever write in your life in the same Git repository, and Git would assign a different SHA1 to each version of each file and each folder. That's a lot of data. You might get some performance problems, but still no collisions. Later in this training when we talk about distribution, this piece of information will come in useful. For now, I'm only mentioning it to say that if you have ever worried that two SHA1s might collide in your Git project, then you can stop worrying now.
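The lottery comparison above can be checked with a few lines of arithmetic. This is a rough sketch, not a security analysis: it assumes the 1 in 175 million odds quoted in the talk, an equator of about 40,075 km, and an idealized SHA1 that spreads contents uniformly over its 2**160 possible values.

```python
from fractions import Fraction

# Odds quoted in the talk for a single Powerball jackpot.
TICKETS = 175_000_000
JACKPOT_ODDS = Fraction(1, TICKETS)

# One ticket every 25 cm: how long is the line, in kilometers?
line_km = TICKETS * 0.25 / 1000
EQUATOR_KM = 40_075  # approximate length of the equator (assumption)

# Chance of winning six independent draws in a row.
six_wins = JACKPOT_ODDS ** 6

# A SHA1 has 160 bits. Under the idealized assumption of uniform hashing,
# two specific different contents collide with probability 1 in 2**160.
pair_collision = Fraction(1, 2 ** 160)

print(f"ticket line: {line_km:.0f} km (equator: {EQUATOR_KM} km)")
print(f"six jackpots in a row: about 1 in {float(1 / six_wins):.1e}")
print(f"two contents colliding: about 1 in {float(1 / pair_collision):.1e}")
```

Under these assumptions the two probabilities land within a couple of orders of magnitude of each other, which is what "about the same chances" means here.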
So we have seen that Git is a map where the keys are SHA1s and the values are pieces of content, but I also said that Git is not just a map, it's a persistent map. Where does persistence come from? Let's go back to the git hash-object command we used a few minutes ago. If I want the Apple Pie content to be persistent, I can add the -w argument to this command. -w stands for write. So now besides generating the hash, Git will also save this piece of content in its repository. However, we don't have a repository yet, so if I try this command straightaway, Git complains: we're not in a Git project, we don't have a repository, I don't know where to save the content. So let's turn this directory into a Git project. There is a command for that, and you probably used it already. It's a high-level porcelain command, git init. There, that's all it takes. Apparently nothing changed, but if you look at the hidden files and directories (on this computer I do that with the -a switch), then we can see a new hidden subdirectory called .git. This is where your Git repository goes. So, now Git has a place to save stuff. And if we run the hash-object command again with -w, we get the hash, and we also save the content. Let's see where exactly. Let's peek inside the .git directory. There are a few files and folders here, but for now just look at this directory here, objects. This is called the object database. It's the place where Git saves all its objects, like the string "Apple Pie" we just saved. Let's peek inside. Ignore these two, the info and pack subdirectories. For now they're not important. Instead, look at this subdirectory here. Its name is 23, and these are the first two hexadecimal digits of the SHA1 of the content we just saved. And if we look inside 23, there is a file in here, and the name of this file is the remaining digits of the SHA1. Git uses this scheme to organize content and spread it over multiple directories.
It's just a trick to avoid piling up all the content into a single huge, cluttered directory. Our original string, "Apple Pie", is inside this file. This is what Git calls a blob of data. A blob is a generic piece of content. However, the original string has been mangled a bit inside the file. Git added a small header and compressed the content to save space. So we can't just open the file and read it, but we can use another low-level plumbing command to look at the content. It's called git cat-file. Once again, don't worry if you don't remember this command. It's rarely used. I'm using it now just because I want to show you how Git saves content. Git cat-file takes an argument and the SHA1 of an object. If we run it with the -t argument, which stands for type, Git tells us what this piece of content is. It's a blob. And if we run it again with -p for pretty printing, then Git unzips the object, removes the header, and prints out the actual content of the blob. And here it is, the string Apple Pie. So far we have seen that Git is able to take any piece of content, generate a key for it, a SHA1, and then persist the content into the repository as a blob: a persistent map. This is the very basis of the Git model. Let's build on this and move on to the next layer of the onion.
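To make the header, the compression, and the two-digit directory scheme concrete, here is a minimal Python sketch of what hash-object and cat-file do for loose blobs. The function names are made up for illustration, and the sketch only builds bytes and paths in memory; it does not touch a real repository.

```python
import hashlib
import zlib


def blob_id(content):
    """Return the SHA1 Git assigns to a blob holding these bytes."""
    # Git hashes a small header ("blob <size>" plus a NUL byte) followed by the content.
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def object_path(sha1):
    """First two hex digits name the directory, the other 38 name the file."""
    return ".git/objects/" + sha1[:2] + "/" + sha1[2:]


def encode_blob(content):
    """Bytes as they would sit on disk: header plus content, zlib-compressed."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return zlib.compress(header + content)


def decode_object(raw):
    """Rough equivalent of cat-file: unzip, split off the header, return (type, content)."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return kind.decode(), body


# echo appends a newline, so this is the content that hash-object receives.
content = b"Apple Pie\n"
sha1 = blob_id(content)
print(sha1)
print(object_path(sha1))
print(decode_object(encode_blob(content)))
```

Because the key depends only on the bytes, anyone who hashes the same content gets the same SHA1, and adding or removing a single character (such as the trailing newline) gives an unrelated one.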
We have seen that Git is a persistent map, but you probably don't see it as a map. You see it as something more than that, something that tracks your files and your directories, a content tracker. Let's see what that means. We need an example project, so I built a very simple one, a cookbook. In the root of the project there is a file named menu.txt. This is supposed to be a menu, a list of all the recipes in the cookbook. Right now it only contains a single recipe, "Apple Pie". Then we have a recipes directory that contains a README that tells you that you are supposed to add one separate file for each recipe here. And indeed we have one file here with the recipe for the apple pie, apple_pie.txt. This file is supposed to contain the entire recipe. For now it's just a placeholder actually, and it contains the string "Apple Pie". I'm using this string a lot here. I like apple pie. We'll fill in the real recipe later. So, we have three files, one in the root, and two in the recipes folder. It's a very simple project, but that's what we want for now. We want to understand how Git stores these files and folders, so it's better if we start simple. Let's make this a Git project with "git init". There, now we have a .git directory here. And because it's a brand new project, the object database in the objects folder here is empty apart from the info and pack subdirectories. We can ignore those as usual. Now that we have a project, let's create our first commit for this project. Let's use the "git status" command to see the files and folders in the project root. You probably used git status already. I configured my Git installation to use color, so we can see that both menu.txt and the recipes directory are red because they are untracked. That is, Git doesn't yet know what to do with them. You know that to commit a file I have to put it in the so-called staging area first. It's like a launch pad. Whatever is in the staging area will get into the next commit.
We can add these files to the staging area with the "git add" command. Let's add menu.txt and then the recipes folder and all of its content. Now the files are green. It means that they have been staged. Let's commit them. I will use the -m argument to git commit so that I can give a commit message right here. There. Now the staging area is clean, and we can use another popular command, "git log", to look at the list of existing commits. There is only one, and its SHA1 starts with these digits. Okay, good to know. So far this was business as usual for any Git user, and now let's go deeper. Let's open the hood and look inside the .git/objects database. This is going to be short but intense, so hold on tight. If you look in the .git directory under objects, you will see that we have a bunch of subdirectories in here now. One of these is named with the first two digits of the commit's SHA1, and here are the remaining digits, so this file must be the commit. A commit is compressed just like a blob, but by now we know how to peek inside compressed files. We can use git cat-file for that. I will "cat-file" the commit's SHA1 with -p so that it prints the content of the commit. And here it is. So, what's a commit? It's a simple and very short piece of text, nothing else. It's truly as simple as this. Git generates this text, and then it stores it pretty much the same way it stores a blob. It generates its SHA1, it adds a small header to the text to say this is a commit, it compresses the text, and it stores the result in a file in the object database. The commit text contains all the metadata about the commit: the name of the author and of the committer (both are myself), the date of the commit, and the message. And then it contains something more, the SHA1 of a tree. What tree? Well, just like a blob is the content of a file stored in Git, a tree is a directory stored in Git. The commit is pointing at the root directory of the project.
That's what this tree is, the root of the project. If you look in the object database, you will see a directory named with the first two digits of the tree's hash, and inside it is the tree, a file named with the remaining digits of the hash, as usual. It's just like a commit: a piece of content that is generated by Git, hashed, and then stored in the object database. So what's inside this tree? What does it look like? Let's cat-file it. Just like a commit, a tree is a tiny piece of text. That's all it is, and it contains a list of the contents of the directory, a list of SHA1s actually. In this case we have a blob and another tree, with their names. The blob is the menu.txt file that's in the root, and the tree is the recipes directory that's also in the root. There is also some additional data for the files and directories, the access permissions, but otherwise that is it. That's all it takes for Git to store a directory. Now if you have a great memory for hexadecimal numbers (I don't really), then you might find the SHA1 of this blob familiar. It's the same SHA1 as the "Apple Pie" string that we've seen earlier. Let me prove it to you. I will use cat-file -p as usual, pass it the SHA1 of the blob, and there it is, the string Apple Pie. That's what's inside menu.txt. So to recap, the commit points to a tree (the root), and this tree points to a blob (menu.txt) and another tree (recipes). And the blob is just a piece of content, the string "Apple Pie". Now let's finish the job. Let's look at this other tree and see what's in there. Let's use cat-file again to peek inside the recipes tree. And there you are, two blobs. One of these blobs is the README file. I will cat-file it. Here, and there it is, the content of the README. The other blob... well, this one looks familiar even to me now, even if I can't remember numbers, because it's the same SHA1 as the menu.txt blob.
That's because these two files have exactly the same contents, so Git will not create two separate objects for them. It will just reuse the existing object that is already in the database. So to be picky, a blob is not really a file. A blob is just the content of a file. The file name and the file permissions are not stored in the blob. They are stored in the tree that points to the blob. You will see later why this is a good thing. In the meantime, let's look at the object database again. The recipes tree is pointing at the blob with the content of the README file, and it's also pointing at the blob with the content of apple_pie.txt, which is the same content as the menu.txt file, so it's actually the same blob. And there you are, the whole object database, all of it. One small note about this. If you try building this exact same project, and you try giving the exact same commands that I gave to Git, then you will see that you get exactly the same SHA1s for all the trees and all the blobs. However, the SHA1 of the commit will be different, because you have different data in your commit: a different author and a different commit date. The important thing to understand here is that there is no magic behind SHA1. If you have the same content I do, then you get the same hashes. A commit is also just a piece of content, and your commit has different content than mine, so you get a different hash. It's as simple as that.
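Here is a small Python sketch of how the cookbook's trees could be assembled by hand, to show that names and permissions live in the tree while the blob is only content. It follows Git's on-disk tree layout (mode, name, a NUL byte, then the 20 raw SHA1 bytes per entry), but it is an illustration, not Git's code. The README text is a stand-in I made up, so the README, recipes, and root IDs will not match the ones in the demo.

```python
import hashlib


def object_id(kind, body):
    """SHA1 of a Git object: header ("<type> <size>" plus NUL) followed by the body."""
    header = (kind + " " + str(len(body))).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Build a tree from (mode, name, sha1) entries.

    Mode "100644" is a regular file and "40000" is a directory.
    Git sorts entries by name, comparing directories as if they ended in "/".
    """
    def sort_key(entry):
        mode, name, _ = entry
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, sha1 in sorted(entries, key=sort_key):
        body += (mode + " " + name).encode() + b"\0" + bytes.fromhex(sha1)
    return body


apple_pie = object_id("blob", b"Apple Pie\n")
# Hypothetical README text: the transcript does not give the exact wording.
readme = object_id("blob", b"Add one file per recipe in this directory.\n")

recipes = object_id("tree", tree_body([
    ("100644", "README", readme),
    ("100644", "apple_pie.txt", apple_pie),
]))
root = object_id("tree", tree_body([
    ("100644", "menu.txt", apple_pie),  # same content as apple_pie.txt, same blob
    ("40000", "recipes", recipes),
]))

print("menu.txt and apple_pie.txt blob:", apple_pie)
print("recipes tree:", recipes)
print("root tree:", root)
```

The single `apple_pie` ID appears under two different names in two different trees, which is the reuse described above.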
Versioning Made Easy
Now brace yourself. We're going to talk about versioning. We're going to see how it works. You might think that versioning is a big deal and complicated, but now that you know about the object model you will see it's actually very simple. First, let's change a file. I will edit the menu.txt file. I will add the name of another recipe to it, "Cheesecake". I'm in a cakes mood. Let's save the file. And now git status tells us that the file has changed, so let's stage it with "git add" and create a new commit. There. Now our working area is clean again, and if we look at the log, we can see both commits. Let's use the now familiar cat-file to peek inside this second commit. There, this commit has something more than the first one. It has a parent. The parent is the first commit of course. Commits are linked. That makes sense. Most commits have a parent. The very first commit is an exception. So, the commits are linked like this. Also, if you look at the hash of the tree that this second commit is pointing at, you will see that this is a brand new tree. It's not the same tree that the first commit was pointing at. It's a different root. You will see why in a minute, but for now let's just draw a new tree here. Let's look at the content of this tree. You know how to do it by now, so I will go fast. We cat-file it. We can also use just the first few digits of the SHA1, and Git will automatically retrieve the whole SHA1 from the database unless there are multiple SHA1s starting with these first few digits. So, now we can see that the tree contains another tree (the "recipes" folder) and a blob ("menu.txt"). Now, menu.txt is a brand new blob itself because this file has changed. So, if we cat-file it, we can see it has the new content of the file, all of it, including both "Apple Pie" and "Cheesecake". However, the tree here that lists the content of the recipes directory, this one is the same object that we already had in the database since the first commit.
That's because the contents of this directory haven't changed, so there is no reason to create a new object. Git can just use the object that was already in the database. So here is the structure of the object database after our second commit. The new commit is pointing to a new tree, which is pointing to a new blob and to the same recipes tree as the first commit. Now it's clear why the root tree must be new. The menu.txt blob has changed, so the content of the root tree must be different because it's pointing to a different blob. As usual, if you change anything in a piece of content, then you get a whole new object with a whole new SHA1. The recipes tree, however, hasn't changed because nothing inside the directory changed, so Git can reuse the same object. That's one of the reasons why Git is so efficient. It doesn't store things more than once. We changed a single file, so Git stored a new blob, and in our case a new tree and a new commit because they are ultimately pointing at that new file, so they changed too. Trees and commits are really small, so that's still extremely efficient. If you count the number of objects in this diagram, it's two commits plus six trees and blobs, eight objects in total. This is the current number of objects in the object database. Let's double-check it. The database itself is getting a bit crowded, so instead of counting the files let's use one of those seldom-used plumbing commands, git count-objects. And there you are, eight objects, and they take a very small amount of disk space. Speaking of efficiency, you might be surprised that Git stores a new blob every time you change a file. What if I have a huge file and I only change a single line? Will Git store an entire new blob in this case and duplicate the rest of the file? Well, not really. Git also has another layer of optimizations to save more space.
For example, as you keep working and adding content to the repository, Git might decide to store only the differences between two files, or even compress multiple objects into the same physical file. By the way, that kind of stuff is the reason for those mysterious info and pack directories in the database. However, those are really implementation details, so you can safely ignore them. To understand the Git model, it's good enough to think of each commit, blob, or tree as just a file, a separate file that is hashed and stored in the database. At some level, this is how Git actually works, and then it has another layer of optimizations that are probably not interesting to you unless you're working on the Git source code. Just know this: when it comes to being efficient, you can assume that Git does the right thing.
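The object count can be reproduced with a toy object database. The sketch below keeps objects in a Python dict keyed by SHA1 and replays the two commits from the walkthrough. The README text, the author identity, the timestamps, and the commit messages are placeholders I chose, so the IDs differ from the demo, but the counting works the same way.

```python
import hashlib

store = {}  # SHA1 -> (type, body): a toy stand-in for .git/objects


def put(kind, body):
    """Hash an object and store it. Identical content maps to the same key."""
    header = (kind + " " + str(len(body))).encode() + b"\0"
    sha1 = hashlib.sha1(header + body).hexdigest()
    store[sha1] = (kind, body)
    return sha1


def put_tree(entries):
    """entries are (mode, name, sha1); "40000" marks a directory."""
    ordered = sorted(entries, key=lambda e: e[1] + ("/" if e[0] == "40000" else ""))
    body = b"".join(
        (mode + " " + name).encode() + b"\0" + bytes.fromhex(sha1)
        for mode, name, sha1 in ordered
    )
    return put("tree", body)


def put_commit(tree, message, parent=None):
    who = "A U Thor <author@example.com> 0 +0000"  # placeholder identity and date
    lines = ["tree " + tree]
    if parent is not None:
        lines.append("parent " + parent)
    lines += ["author " + who, "committer " + who, "", message]
    return put("commit", ("\n".join(lines) + "\n").encode())


def snapshot(menu_text):
    """Store the whole cookbook and return (root tree, recipes tree)."""
    recipes = put_tree([
        ("100644", "README", put("blob", b"One file per recipe.\n")),  # placeholder text
        ("100644", "apple_pie.txt", put("blob", b"Apple Pie\n")),
    ])
    root = put_tree([
        ("100644", "menu.txt", put("blob", menu_text)),
        ("40000", "recipes", recipes),
    ])
    return root, recipes


root1, recipes1 = snapshot(b"Apple Pie\n")
first = put_commit(root1, "First commit")
root2, recipes2 = snapshot(b"Apple Pie\nCheesecake\n")
second = put_commit(root2, "Add cheesecake", parent=first)

print(recipes1 == recipes2)  # True: the unchanged directory is the same tree object
print(len(store))            # 8: two commits, three trees, three blobs
```

Storing the second snapshot re-hashes every file, yet only three new keys appear: the new menu.txt blob, the new root tree, and the new commit.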
One More Thing: Annotated Tags
Before we wrap up this module, just for completeness give me a couple of minutes to talk about one more type of object in Git, tags. A tag is like a label for the current state of the project. There are actually two types of tags in Git, regular tags and annotated tags. I'm going to talk about the second type, annotated tags. I will talk about regular tags later. Annotated tags are the ones that come with a message. To create an annotated tag, you can use the git tag command with the -a argument. You need a name for the tag, and you also need some kind of message. Here, we have an annotated tag. It's similar to creating a commit, and in fact an annotated tag is also an object in Git's object database, like a commit. Let's use "cat-file" to peek inside it. In the case of tags, "cat-file" can take either the tag's hash or the tag's name. I don't know the hash right now, so I will use the name of the tag. And here is the tag. It contains metadata such as the tag's message, the name, the tagger, and the date, and most importantly the object that the tag is pointing to. In this case, it's a commit. So, that's what the tag is. It's just a simple label attached to an object. So let's recap. In the Git object database you have blobs (arbitrary content), trees (the equivalent of directories), commits, and annotated tags. There is nothing else in the database, just these four types of objects. Congratulations! Now you know the entire Git object model.
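For completeness, an annotated tag can be sketched the same way as the other objects: a short piece of text, hashed with a "tag" header. The commit ID, tag name, tagger, and message below are all placeholders, not values from the demo.

```python
import hashlib


def tag_body(target_sha1, target_kind, name, tagger, message):
    """Text of an annotated tag: what it points at, its name, who made it, and a message."""
    return (
        "object " + target_sha1 + "\n"
        "type " + target_kind + "\n"
        "tag " + name + "\n"
        "tagger " + tagger + "\n"
        "\n" + message + "\n"
    ).encode()


def object_id(kind, body):
    header = (kind + " " + str(len(body))).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


placeholder_commit = "0" * 40  # stands in for a real commit SHA1
body = tag_body(
    placeholder_commit,
    "commit",
    "mytag",                                   # hypothetical tag name
    "A U Thor <author@example.com> 0 +0000",  # hypothetical tagger and date
    "A hypothetical tag message",
)
print(body.decode())
print(object_id("tag", body))
```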
What Git Really Is
Take one last look at this object model because there is something interesting to say about it. Look at the whole model from an abstract point of view. What do we have here? Well, we have a structure where some things contain data, the blobs, and then there are other things called trees that contain blobs and other trees, so the entire structure is recursive. And the names of the blobs and trees are not in the objects themselves. Instead, they are stored in the containing tree. So, you can have the same object, say the same blob or the same tree, pointed at by different trees with different names. Does this structure remind you of anything? Well, to me it looks an awful lot like a file system. Just like in a file system, you have content (files, or blobs) and nested containers (directories, or trees in this case), and you can have links. The same file or directory can be reached from different places with different names. It's like links in Linux or shortcuts in Windows. In fact, you might argue that that's what Git is. It's a high-level file system built on top of your native file system. This shouldn't surprise us. After all, Git was written by Linus Torvalds, who wrote Linux. He is an operating system kind of person, so when he built a version control system, he built it like a file system. That's just the way he thinks. It's a versioned file system, of course, because it also has commits, which add versioning. And that's what we mean when we say that Git is a content tracker. So we have seen that Git is a persistent map at its core, and layered on top of that is a stupid content tracker that looks a lot like a versioned file system. In the next module, we will put this theory to work to understand the all-important next layer of the onion. You will see how easy Git branches are and how easy branch-related operations are once you know the basic model of Git. It will be fun. See you in the next module.
Welcome to How Git Works, module two. This is where things really get interesting. In the previous module we laid the groundwork. Now let's make that information concretely useful. In module one we said that Git is a stupid content tracker. We had this metaphor of an onion. Now we can move on to the next layer of the Git onion and look at the features that turn it into a full-fledged revision control system, features like branches and merges. I'm assuming that you already have a basic idea of what a branch is. Maybe you even use Git branches every day. But after this module you might end up looking at them in a different light.
What Branches Really Are
Let's go back to our cookbook project. For now it's still just a handful of files and a couple of commits. We haven't created any branches yet, but you probably know that as soon as you have a Git project you also have a branch. Git creates this branch for us when we do our first commit. Let's look at the list of branches in the project with git branch, like this, without any argument. And there it is, our default branch, the master branch. By now we are used to looking inside the .git directory, so let's do it again, looking for branches this time. What's a branch? The master branch must have some kind of concrete representation in this folder. What does it look like? Well, Git normally puts branches here, in a directory called refs and a subdirectory called heads. Ignore the other subdirectory for now. And there it is, a small 41-byte file called master. This is our master branch. What's inside this file? You could probably guess it, but you don't have to guess it. It's not compressed, so I can just print its content to the screen. And there you are. The file contains a single line, a SHA1. And as you probably expect, it's the SHA1 of the current commit, this commit here. To recap, we have two linked commits in this project, and we also have a master branch, and the branch is nothing else than a simple reference, a pointer to a commit essentially. That's why the directory that contains branches is called refs, for references. Note that the master branch actually has no special status in Git. Yes, Git created it for us, but otherwise it's just a branch like any other, and all there is to it is this small file. I could actually delete or rename the master branch just by deleting or renaming this file. I could even create a new branch just by writing a new file into this folder containing the SHA1 of a commit. That would arguably be hacking, but it would work. Okay, let's not do that. Instead, let's create a new branch the right way, by using git branch with a branch name.
About this branch: imagine that we want to insert recipes in our cookbook, but we also get alternate recipes from a friend, and we want to keep those in a separate branch. Let's call this new branch lisa, the name of our friend. Our idea is that we'd put our own recipes in master and our friend's recipes in the lisa branch. There we are. We have a new branch, we can see it listed amongst the branches, and we can see it alongside master in the refs/heads folder. And if we look at it, we see that it has exactly the same content as master, the same commit. So this is what we have now: two commits and two branches, and the branches are pointing at the same commit.
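The idea that a branch is just a small file holding a SHA1 can be imitated in a throwaway directory. The sketch below only mimics the refs/heads layout described above. It is not a real repository, the commit ID is a placeholder, and editing files under a real .git folder by hand is not something I would recommend.

```python
import tempfile
from pathlib import Path

fake_commit = "a" * 40  # placeholder standing in for a real commit SHA1

with tempfile.TemporaryDirectory() as tmp:
    heads = Path(tmp) / ".git" / "refs" / "heads"
    heads.mkdir(parents=True)

    # The master branch: one line of text, 40 hex digits plus a newline.
    (heads / "master").write_bytes((fake_commit + "\n").encode())

    # Roughly what creating the lisa branch amounts to in this model:
    # a second file that holds the same SHA1.
    (heads / "lisa").write_bytes((heads / "master").read_bytes())

    branches = {p.name: p.read_text().strip() for p in sorted(heads.iterdir())}
    master_size = (heads / "master").stat().st_size

print(branches)
print(master_size)  # 41: forty hex digits plus the newline
```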
The Mechanics of the Current Branch
Now that we have two branches, if we look at the list of branches again, we see that one branch is marked with an asterisk because it's the current branch. What does that mean concretely? I mean, how does Git know that master is our current branch? There must be some kind of information, probably in the .git folder, that says which branch is the current branch, some kind of file maybe that contains that information. And indeed there is such a file, and you probably know the name of this file already. If you look at the .git folder again, you will see a file named HEAD in here. If you use Git, this should ring a bell. And if you look inside HEAD, then you will see that it contains a reference to a file, another file. This is Git's way to reference files, this syntax. It's saying HEAD is currently pointing at refs/heads/master, the file representing the master branch. There is only one HEAD, so there is only one current branch. That's what HEAD is, a reference to a branch, a pointer to a pointer if you wish. Let's add it to the diagram and move on. So now let's change the files in the project. I will add the list of ingredients for the apple pie. Here, let me add this recipe. I will just copy/paste the ingredients here. We don't have a full recipe yet, but at least we know what to buy at the grocery store now. Here, let me add this file to Git and commit it. Okay, let's see what just happened inside Git, step by step. Git created a few new objects in the object database for this commit. In particular, it created the commit itself. It's an object too, remember. And this commit has the previous commit as a parent. Then Git looked inside the HEAD file to find what the current branch is, and it moved that branch to point at the new commit. So the master branch moved, but notice that HEAD itself did not move. It was pointing at master before the commit. It's still pointing at master. Master is moving. HEAD is just coming along for the ride. 
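The "pointer to a pointer" arrangement can be sketched with a toy model. This is a simplification written for illustration, not Git's real implementation: the class ToyRepo, its method names, and the short commit ids such as c1 are all invented here. Only the text format of the HEAD value mirrors what the HEAD file contains.

```python
class ToyRepo:
    """A toy model of branches and HEAD; not how Git is actually implemented."""

    def __init__(self):
        self.parents = {}                      # commit id -> parent id (None for the first)
        self.branches = {}                     # branch name -> commit id
        self.head = "ref: refs/heads/master"   # same text format as the HEAD file
        self._counter = 0

    def current_branch(self):
        prefix = "ref: refs/heads/"
        return self.head[len(prefix):] if self.head.startswith(prefix) else None

    def branch(self, name):
        # A new branch starts at the commit the current branch points at.
        self.branches[name] = self.branches[self.current_branch()]

    def commit(self):
        branch = self.current_branch()
        self._counter += 1
        new_id = f"c{self._counter}"           # made-up id instead of a SHA1
        self.parents[new_id] = self.branches.get(branch)
        self.branches[branch] = new_id         # the branch moves; HEAD is untouched
        return new_id


repo = ToyRepo()
first = repo.commit()
second = repo.commit()
repo.branch("lisa")
head_before = repo.head
third = repo.commit()
print(repo.branches, repo.head)
```

After the third commit, master points at the new commit, lisa stays where it was created, and the HEAD text is unchanged.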
So far we didn't touch the new lisa branch. Lisa is still pointing at the previous commit where it was when we created it. Now let's make lisa the current branch. If you are used to Git, you know how to do it. It's an operation called git checkout. When I git checkout lisa, two things happen. The first thing that happens is that Git changes HEAD to point at lisa. There, now HEAD is pointing at refs/heads/lisa. The second thing that happens is more subtle. Git just replaced the files and folders in our working area, the working directory, with the files and folders in this commit. So after the checkout, our working area changed to the content of the commit pointed at by lisa. If I look at the content of the Apple Pie file here, the ingredients are gone. It is the previous version of the file. I'm sure this doesn't surprise you. I mean this is what you expect when you do a checkout, right? By the way, we will look at this process in more detail in a few minutes. So, that's what checkout means. It means move HEAD and update the working area. Now let's modify the Apple Pie recipe again. I will paste in Lisa's versions of the ingredients. (Working) I'm using almost the same list of ingredients as we had in the master branch, but Lisa also uses cinnamon in her version of the pie, and she uses more apples, 10 apples instead of 8. Let's commit these changes. (Working) By now we know what happens when we do a commit. Git adds the commit to the object database, and it moves the current branch, lisa, to point at the new commit. HEAD didn't change, master didn't change of course, but lisa changed. Now it points at the new commit. Now this looks a bit more like our intuitive notion of branches. But remember that branches are still just references to commits. That's all there is to branches actually. Enough talking about branches. Let's see what happens when we merge.
First, let's move back to the master branch. I will check it out. There. Now, the branches didn't move, remember, but HEAD did move. It's now pointing at master. And if I look into the Apple Pie recipe, I will find my own version of the recipe here, not Lisa's version. Now, let's say that I tried the two apple pie recipes. I actually cooked the apple pies, and I like Lisa's version a bit more, so I want to merge Lisa's changes from her branch, lisa, into the master branch. Let's do the magic. Git merge, and there you are. We have a conflict. We want to have both our changes and Lisa's changes in master, but Git is warning us that at least some of those changes are conflicting. We need to solve the conflict manually. Chances are you have been in this situation already, either while using Git or some other versioning system. If we look inside the Apple Pie file, we will see that this line, this one, was changed in divergent ways in our recipe and in Lisa's recipe, so now we need to take a stand and decide how many apples to use in the pie. Let's go for a middle ground. I can't just admit that Lisa's recipe is better, so, you know, it's a matter of pride. I will just concede that one more apple is okay. Okay, there. Now if we git status, we see that this file is not staged for the next commit. We need to add it explicitly. This is our way to tell Git that the conflict has been fixed. There. And now we can complete the merge. If we hadn't had conflicts, then Git would've done this last step automatically, but because we did have conflicts, we have to say okay, we are done fixing all the conflicts, Git. And we do that with a commit. We don't even need to give it a commit message. Git knows that we are in the middle of a merge, so it will create a suitable message automatically. We could change the message, but I won't. I will just approve it by saving and quitting the editor. 
If you look at the log now, you will see a brand new commit, and if you look inside this commit with cat-file (remember, cat-file is the low-level command that we used to peek inside the objects in the database), there it is. It's just like any other commit we've seen so far. A merge is just a commit with one exception. It has two parents. That's what makes it a merge. A commit in Git usually has one parent, but it can have as many parents as you like, actually. So let's update the diagram. Git created a new commit with two parents to represent the merge and moved master to point at the new commit. That's how merging works.
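To make "a merge is just a commit with two parents" concrete, here is a small Python sketch that assembles the text of a commit object and names it the way Git names objects, with a SHA1 over a short header plus the content. The tree and parent ids, the author line, and the function names are placeholders invented for the example, so the resulting id will not match anything in a real repository.

```python
import hashlib


def git_object_id(kind, body):
    """SHA1 over '<kind> <size>' + NUL byte + body, which is how Git names objects."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parents, message):
    """Build the text of a commit object: one tree line, one line per parent."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    # Placeholder identity and timestamp, made up for this sketch.
    lines.append("author A U Thor <author@example.com> 0 +0000")
    lines.append("committer A U Thor <author@example.com> 0 +0000")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


# Made-up ids standing in for the merged tree and the two branch tips.
tree_id = "1" * 40
master_tip = "2" * 40
lisa_tip = "3" * 40

merge = commit_body(tree_id, [master_tip, lisa_tip], "Merge branch 'lisa'")
merge_id = git_object_id("commit", merge)
parent_lines = [l for l in merge.decode().splitlines() if l.startswith("parent ")]

print(merge.decode())
print(merge_id)
```

The only structural difference from an ordinary commit is the second parent line.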
Time Travel for Developers
Okay, now give me just a few minutes for a short aside. From the first module of this training, you might remember diagrams such as this one. It's about trees and blobs. To make things easier, I avoided talking about trees and blobs in this second module. I mostly talked about commits. I will just mention trees and blobs again quickly to show you in more detail how Git manages your working directory. You know that the objects in the database are commits, trees, and blobs, and also annotated tags, although there are none of them in this example. You also know that all these objects are arranged in a graph. They reference each other. There are references from a commit to its parents, references from a commit to its tree, and references from trees to blobs and other trees. These references all look alike, but they are used in two different ways. References between commits are used to track history. All the other references are used to track content. We've also seen that Git is good at reusing content, so you can have objects that are reachable from more than one commit, like these ones here. The point I want to make is that when you checkout something, Git doesn't care about history. It doesn't look at the way that commits connect to each other. It just cares about trees and blobs. So, if you're checking out this commit here, then Git forgets about the link to the parent of the commit, and it looks at the tree in the commit and all the objects that can be reached from there. That is the entire state of the project at the time of the commit, a complete snapshot of every file, every folder. Git uses this information to replace the content of your working directory. That's how you travel back and forth in time with Git. It is the whole point of versioning. And if you look at this commit here, well, same thing. It comes with a complete representation of the entire project. You might think that merge commits must be more complicated than that, but actually they're not. 
Okay, they have multiple parents, that's the definition of a merge, but Git doesn't care about that when you checkout. It just goes into the commit and retrieves the tree in the commit as usual. A merge commit will in general have its own tree because the objects in the merge might not be present in any of the parents. Think of a file that has lines from both parents, for example. On the other hand, from the merge commit's tree you can probably reach objects that are also reachable from other commits. And once again, Git doesn't care about which blob or tree was introduced by which commit. When it stores the commit, it just reuses objects that are already there, and it creates the objects that are not already there. And when it checks out a commit, it just looks at the tree and rebuilds the state of the project from there. I told you this story because I want to make a couple of points. First point, don't get confused by trees and blobs. Retrieving a past state in Git is a pretty simple affair. It's just a stupid content tracker. You should just focus on history, how commits connect to each other, and then you should trust Git to do the right thing with trees and blobs. The second point I want to make is that Git doesn't really care much about your working area. Remember, when you checkout, Git just replaces the working area with the stuff from the object database. Git mostly cares about the objects in the database, not your working directory. The objects in the database are immutable and persistent, while the files in your working directory are as ephemeral as they get. They can change as quickly as you can do a checkout. Git is not reckless with your working area. It will give you a warning before overwriting your files. For example, if you try to do a checkout, but you have uncommitted changes, Git will tell you that. But other than that, as far as Git is concerned, your working area is the least important part of your project. 
All the good stuff is in the .git directory. And now that I made this aside, you can forget about trees and blobs for the rest of this training. From now on, we will mostly be concerned with commits and history.
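The way a checkout rebuilds the working area from a commit's tree, without ever looking at the commit's parents, can be sketched with a toy object database. Everything below is invented for illustration: the object ids are short labels rather than SHA1s, and the file names and contents are hypothetical stand-ins for the cookbook.

```python
# Toy object database: id -> (type, data). Ids are made-up labels, not SHA1s.
objects = {
    "blob_menu": ("blob", "Apple Pie\n"),
    "blob_pie_v1": ("blob", "Apple Pie\n"),
    "blob_pie_v2": ("blob", "Apple Pie\n8 apples\n"),
    "tree_recipes_v1": ("tree", {"apple_pie.txt": "blob_pie_v1"}),
    "tree_recipes_v2": ("tree", {"apple_pie.txt": "blob_pie_v2"}),
    "tree_root_v1": ("tree", {"menu.txt": "blob_menu", "recipes": "tree_recipes_v1"}),
    # The second root tree reuses blob_menu: unchanged content is shared.
    "tree_root_v2": ("tree", {"menu.txt": "blob_menu", "recipes": "tree_recipes_v2"}),
    "commit1": ("commit", {"tree": "tree_root_v1", "parents": []}),
    "commit2": ("commit", {"tree": "tree_root_v2", "parents": ["commit1"]}),
}


def snapshot(object_id, path=""):
    """Return {file path: content} for a commit, tree, or blob."""
    kind, data = objects[object_id]
    if kind == "commit":
        return snapshot(data["tree"])      # the parents are never consulted
    if kind == "blob":
        return {path: data}
    files = {}
    for name, child in data.items():
        child_path = f"{path}/{name}" if path else name
        files.update(snapshot(child, child_path))
    return files


old_state = snapshot("commit1")
new_state = snapshot("commit2")
print(old_state)
print(new_state)
```

Each commit yields a complete picture of the project on its own, which is why moving between commits only requires walking trees and blobs.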
Merging Without Merging
We have seen how branches and merges work, but there are a couple of interesting corner cases that we didn't consider. They are quite important in practice, so let's look at them. The first corner case is a special case of a merge. Let's checkout the lisa branch. There, HEAD moves to point at lisa. Now we're in Lisa's mind again. Imagine that we managed to convince Lisa that our version of the apple pie, the one in master, is tastier than her version. You know, one less apple can work miracles. So she decided to update her version of the recipe, the one in her branch. Earlier on we merged lisa into master. Now we want to merge master into lisa. Now, how does Git handle this merge? It could do it in the usual way, just like it did when we merged in the other direction. It could create a new commit that has two parents, these two commits here would be the parents, and then move lisa to point at the new commit. This new commit would be guaranteed not to have conflicts because we already solved the conflicts when we merged in the other direction. So it would be easy for Git to create this commit, but it would also be wasteful. Think about what we're trying to achieve here. We want a commit that contains the latest version of all the stuff in master and the latest version of all the stuff in lisa. That's all we want. But we already have such a commit. It's the latest commit of master. It contains all the latest objects in master, of course, and also the latest objects in lisa, because lisa's latest commit is an ancestor of master's latest commit, and all the conflicts have already been solved in master. We learned by now that Git is frugal, it doesn't like waste, so it can spare a commit and just do this instead. It moves lisa to point at the same commit as master. So Git didn't have to create a new commit. This trick happens all the time in practice. It's called a fast-forward. 
Whenever you see this message on the screen, this is Git bragging about being able to spare a few objects in the object database and making your project's history less complicated. Good Git.
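The decision between a fast-forward and a real merge commit can be sketched as an ancestry check. This is a toy model under simplifying assumptions: the commit ids and function names are made up, and conflicts are ignored entirely.

```python
# Toy history: commit id -> list of parent ids (ids are made up).
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],         # master's own change to the recipe
    "c4": ["c2"],         # lisa's own change to the recipe
    "m": ["c3", "c4"],    # the earlier merge of lisa into master
}
branches = {"master": "m", "lisa": "c4"}


def is_ancestor(candidate, commit):
    """True if candidate can be reached from commit by following parents."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == candidate:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def merge(current, other):
    """Merge branch `other` into branch `current`; report what was done."""
    if is_ancestor(branches[other], branches[current]):
        return "already up to date"
    if is_ancestor(branches[current], branches[other]):
        branches[current] = branches[other]    # just move the branch
        return "fast-forward"
    new_id = f"merge-{current}-{other}"
    parents[new_id] = [branches[current], branches[other]]
    branches[current] = new_id
    return "merge commit"


commits_before = len(parents)
result = merge("lisa", "master")
print(result, branches)
```

Because lisa's tip is an ancestor of master's tip, the branch simply moves and no new commit is added to the history.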
Losing Your HEAD
The second and last corner case I want to tell you about is a feature that turns out to be quite useful in practice. I will simplify the diagram for this. I will checkout master and forget about the lisa branch for a while. Actually, let's forget about everything except for the very latest commit. This will make it easier. So far, I always told you that HEAD is a reference to a branch, which in turn is a reference to a commit. When you checkout a branch, that means you are changing HEAD; however, you can also do something different. You can directly checkout a commit instead of a branch. I will checkout this commit. I will just use the commit's SHA1. There. Now if you look inside HEAD, it's not pointing to a branch. It's pointing directly to a commit. And indeed there is no current branch at all. We're not on a branch. This is a situation that is called detached HEAD, and I'm sure that you have seen this warning message from Git in the past. It was pretty scary to me the first time I saw it. How is that useful in practice? Well, let me work a little bit more so I can show it to you. Let's make some experiments in the Apple Pie recipe, something that I'm not sure I want to keep around. The pie is good with 9 apples. It must be even better with 20, right? And I will commit this. (Working) What happens when I commit? Well, in this case Git cannot move the current branch as usual. There is no current branch, so it will track the latest commit by moving HEAD directly. HEAD is working exactly like a branch here. Okay, let me hack in a few more changes. Let's make the pie sugar free. It's healthy. Another commit, another HEAD movement. Okay, now let's say that we've had enough of this. I tried cooking an apple pie with all these extra apples and no sugar. It tastes like cooked apples. I don't like that, so we'll abandon the experiment. I will checkout master again. (Working) Okay, now HEAD is back where it belongs, on the master branch. So are our files. 
Everything is business as usual. There, we rolled back the latest two commits. But there is a nagging question here. What happened to these commits? Well, they are still in the object database somewhere, together with all their trees and blobs, but unless I took note of their SHA1s, these commits and their connected objects are now unreachable. They cannot be reached by starting from a branch or a tag and walking the objects in the database. They are effectively isolated. I can only reach them directly by their SHA1s, and I'm bound to forget those too. If you have any experience with object-oriented languages, then you know what happens to an object when it can't be reached by any reference. It gets garbage collected. At some point the system decides that the object is wasting precious memory, and it will delete the object and recover the memory. Well, this is exactly what happens in Git. Every now and then, in the course of other operations, Git decides that it's worth running a garbage collection. The garbage collector will look for objects in the database that cannot be ultimately reached from a branch or HEAD or a tag, and it will remove them to save disk space. Remember, each object is just a file in the object database, so removing them is as easy as deleting those files. So these commits I created will likely stay in the database for some time and then disappear. If I want to save them, I must act now. How do I do that? One thing that I can do is move back to the last commit. I can still do it because I have its SHA1 here and the garbage collector didn't run yet, so these objects are still in the database. There, that was a last-minute save. And now that I have the commit, I can put a branch on it. Here, let's create a branch called nogood. Now I can checkout master again, and this time around the commits are safe. There is a branch now that acts as the entry point to this section of the object graph, so these objects will never be garbage collected. 
And I can easily get back to them by checking out nogood if I wish. This is a common way to use a detached HEAD. When you want to try something out, play around a bit, run a general experiment with your code, you don't have to leave behind the convenience of using Git. You can detach HEAD, do your experiment, still commit the experiment as much as you wish so that you won't lose data, and then you decide whether to keep the experiment or to do away with it. Just remember to put a branch on the stuff that you care about before you leave it behind.
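The reachability rule behind garbage collection can be sketched in a few lines. This toy model only covers commits and only the rule described here (reachable from a branch, a tag, or HEAD); it deliberately ignores any additional safety nets that real Git applies before deleting anything. The commit ids and function names are made up.

```python
# Toy history: commit id -> list of parent ids (ids are made up).
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],
    "x1": ["c3"],    # first experiment, committed with a detached HEAD
    "x2": ["x1"],    # second experiment
}
# HEAD is back on master, so master is the only starting point.
refs = {"master": "c3"}


def reachable(start_points):
    """All commits that can be reached by walking parents from the start points."""
    seen, stack = set(), list(start_points)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def garbage(current_refs):
    """Commits that no reference leads to."""
    return set(parents) - reachable(current_refs.values())


before = garbage(refs)
refs["nogood"] = "x2"    # put a branch on the latest experimental commit
after = garbage(refs)
print(sorted(before), sorted(after))
```

Putting one branch on the latest experimental commit is enough to protect both, because the earlier one is reachable through its child.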
Objects and References
Now we have a better picture of the nature of Git. Let's recap it. A Git repository is a bunch of objects linked to each other in a graph. As you know, they can be commits, blobs, trees, or tags. Then there are branches, which are references to a commit. And finally, there is HEAD. That's also a reference, but there is only one of it, and it marks our current position in the graph. It's usually pointing to a branch, but it could also be detached and pointing directly to a commit. Then there are a few rules. First rule. The current branch tracks new commits. So if you create a new commit by saying git commit or git merge, for example, then the current branch moves to the new commit. If you are in detached HEAD state, then HEAD itself moves to the new commit. Second rule. Your working directory is updated automatically. When you move to a commit, for example with git checkout, Git replaces the content of your working directory with the content that can be reached from that commit. Rule three. Any commit, blob, or tree that cannot be reached from either a branch, HEAD, or a tag is considered dead and can be garbage collected. And essentially this is the whole thing. Branches, merges, moving back and forth in time, it all boils down to these simple rules. Okay, I know what you're thinking now. What about rebases? We're about to talk about that. See you in module three.
Rebasing Made Simple
Welcome to How Git Works, module 3. Git is an onion, remember? And we're still looking at the versioning layer of the onion, the features that turn Git into a full-fledged version control system. In the previous module, we talked about branching and merging. Now let's look at a couple more features that are also related to versioning, and in particular at a very important one, rebasing. Branching and merging are standard features for any revision control system, but rebasing is way less common. Only a handful of version control systems have it, and Git is by far the most popular of them. In a way, rebasing can be seen as Git's signature feature. Let's see how it works.
What a Rebase Looks Like
Here is our cookbook project again. I worked on it for a bit since the last module, and this is the situation that we have now. There are two branches. The "master" branch got a couple of new commits since the last time we looked at it, and these commits changed the apple pie recipe a bit, just minor changes. The other branch, "spaghetti", is brand new. It has a couple of new commits, and these commits introduce a new recipe for "Spaghetti alla Carbonara". I just had to sneak an Italian dish into this training, at least one thing that I can pronounce properly. So, here is that situation again: We have two branches that diverged. To make the diagram simpler, I also used different colors for the commits in the two branches. The apple pie commits are yellow, and the spaghetti commits are blue. Also, because spaghetti is the current branch, I drew it in green instead of drawing a separate HEAD pointer. Now, we want to put the content of the two branches together. We already know one way to do this. We can merge the two branches. We are already on the "spaghetti" branch, so we could easily merge it with "master". I will not do this, however, but if I did, here is what would happen. We would have a new commit, and the parents of this new commit would be the former tips of the two branches. Also, the current branch would move to this new merge commit. This is the usual merge thing that we already know about. In this case, it should also be an easy merge because we're not expecting any conflicts. However, I will not complete this merge. Instead, I will use another way to put the two branches together. I will rebase the current branch over the other branch. If we rebase "spaghetti" over "master", then here is what happens. Git looks for the first commit in spaghetti that is also a commit in master. It's this commit here. This is the base of the "spaghetti" branch. All the history before this commit is already shared between the two branches, so it's not relevant here. 
Now Git detaches the entire spaghetti branch from this commit and moves it on top of master, so it changes the base of this branch. That's why it's called a rebase. Like in a merge, we might have to solve conflicts to complete the rebase, but in this case there are no conflicts. The two branches change different files, so we're done already. Now the spaghetti branch contains all the commits from the master branch plus the spaghetti stuff, which is what we wanted. What happens if we want it to work the other way as well, and we want the stuff from spaghetti in the master branch? Just like in the merge, we can just checkout master and rebase the other way. Let's checkout master here. Master is now the current branch. It switched to green in the diagram. And now let's rebase. Actually, in this particular case I could either rebase or merge, and it would make no difference whatsoever. In both cases, Git can just fast-forward the branch. A rebase can be fast-forwarded just like a merge. So this is what we have now. Just like in a merge, we have all the commits that deal with the spaghetti and all the commits that deal with the pie in the same history. However, differently from a merge, we got that result not by letting multiple branches flow together, but by rearranging the branches so that they look like one single branch. The way I just described it, a rebase is pretty simple; however, to be honest, I'm making it a bit too simple maybe. It's actually slightly more complicated than that. Let's see why.
An Illusion of Movement
I didn't tell you the whole story about rebases. Let's take a small step back. I told you that when you rebase, Git detaches the current branch from its base and moves it to the top of the target branch. But actually this process cannot happen literally like that. That would be impossible in Git. You cannot detach a commit from another commit and move it elsewhere because commits are database objects, and database objects are immutable. Remember what we said in the beginning of this training? If you change anything in a commit, then you get a different hash, a different SHA1, which means a different commit. And if you want to move commits around, then you must change at least one piece of data inside the commit: its parent. So, you cannot do that. Let's imagine what happens if you change the parent of this commit. The parent's SHA1 is stored inside the commit, so the commit data must change, and the commit must get a new SHA1. Now that this commit has a new SHA1, this other commit also has to change because its own parent has changed, so it gets a new SHA1 as well, and so on for all the commits in the branch. So Git cannot just move the commits. The commits in the rebased branch must have different SHA1s, so they must be different objects in the database. In other words, new commits, and indeed that's what they are. Here is how rebasing really works. When you rebase, Git makes copies of the commits. It creates new commits with mostly the same data, actually exactly the same data except for their parents. So these new commits look almost exactly like the original commits, but they are new objects with new SHA1s, so they are new files with new file names in the database directory. And finally, Git moves the rebased branch to the new commits, leaving the old commits behind. 
Keep this in mind because as we will see in the rest of this training sometimes rebases can be tricky, and you can avoid some confusion if you remember that rebasing is an operation that creates new commits.
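The chain reaction of changing SHA1s can be reproduced with a short Python sketch that names commits the way Git does, as a SHA1 over the commit's content. The sketch follows the simplified description above and changes only the parent; in a real rebase other data stored in the copied commits, such as the tree, generally changes too. The tree ids, base ids, author line, and function names are all placeholders invented for the example.

```python
import hashlib


def commit_id(tree, parent, message):
    """Name a commit like Git does: SHA1 over 'commit <size>' + NUL + content."""
    body = (
        f"tree {tree}\n"
        f"parent {parent}\n"
        # Placeholder identity and timestamp, made up for this sketch.
        "author A U Thor <author@example.com> 0 +0000\n"
        "committer A U Thor <author@example.com> 0 +0000\n"
        f"\n{message}\n"
    ).encode()
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def replay(commits, base):
    """Create a chain of commits on top of base; return the new ids in order."""
    ids, parent = [], base
    for tree, message in commits:
        parent = commit_id(tree, parent, message)
        ids.append(parent)
    return ids


# The two spaghetti commits as (tree id, message) pairs, with made-up tree ids.
spaghetti = [("4" * 40, "Add spaghetti recipe"), ("5" * 40, "Add cooking time")]
old_base = "a" * 40    # where the branch originally started
new_base = "b" * 40    # the tip of master

original_ids = replay(spaghetti, old_base)
rebased_ids = replay(spaghetti, new_base)
print(original_ids)
print(rebased_ids)
```

Even the second commit, whose own data was not touched directly, gets a new id because its parent's id changed. Both sets of ids exist side by side: the rebase produced copies, it did not move anything.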
Taking out the Garbage
One more thing about this rebasing process. I just told you that rebasing copies the data in the old commits to create new commits, but what happens to the old commits then? That's an interesting question. It depends on the case. In this case here, these commits are not very useful. There is no branch pointing at them. The only branch that was pointing at them has moved over to the new commits, so these old commits are impossible to reach. Well, almost impossible, because there are a few ways to retrieve them. For example, if you had written down their hashes, then you could still check them out, but it's more likely that you will just lose track of them. So, why would Git waste disk space to keep around commits that cannot even be reached? In fact, Git doesn't keep them around. Every now and then, when you run a command that is likely to generate this kind of unreachable commit, Git takes some time to look at the objects in the database, identify unreachable objects (commits, but also blobs and trees in some scenarios), and delete them. So, if I keep working on this project and at some point in the future I look into the Git database, these commits might well have been deleted. This is a form of garbage collection. In most modern programming languages, a value that can't be reached through any reference, for example an object that cannot be reached through any variable, is considered dead and removed by a garbage collector. Well, the same thing happens in Git. As usual, Git doesn't like waste.
The Trade-offs of Merges
Now we know what a rebase looks like and how it actually works under the hood; however, you might still wonder why rebases even exist. I mean, we already have merging. Rebasing and merging seem to do something very similar. They both bring existing commits into the history of a branch. So, if I'm working on the apple pie recipe and I want to also get the spaghetti recipe, I can have both in the same history by merging or by rebasing. So why do we have two ways of doing something similar? The reason why we have both merging and rebasing is that they have different tradeoffs. Let's focus on merging first. The whole point of merging is that it preserves history exactly as it happened. In this case, for example, you can clearly see that the yellow commits and the blue commits were created independently, possibly in parallel, and then they were merged into one single timeline. If there were any conflicts during the merge, then this merge commit would include fixes to the conflicts. There is nothing else to understand really. It's this simple. But merging becomes a bit less simple when you're looking at a large project where it's used a lot. For example, let me show you the Git project for a popular open source library in Ruby. I'm using a tool called SourceTree to visualize the Git history. As you can see, there is a lot of branching and merging going on. Look at this area here. The developers seem to have been particularly merge happy in this period, so it can be very hard to follow the way that all of these branches diverged and then converged again. It's hard to understand, for example, which of these commits are contributing to which branches. Compare this graphical tool to the git log command that we've been using so far. In a project such as this one, git log can be misleading. The log is showing history as if it were a single long timeline, one commit after another, but that's not what the project history actually looks like. It looks like a graph, not a line. 
The log is squashing the real history somehow, interleaving unrelated commits from different branches as if they were connected to each other while they aren't. So merges preserve the project history, and in general that's good, except that the project history can be ugly and confusing, so that's not always necessarily a good thing. But one thing is for sure, merges never lie.
The Trade-offs of Rebases
Now let's look at rebasing. A rebased history looks really simple and neat. There is no reason for commands such as git log to arbitrarily squash commits into a single timeline because commits are arranged in a single timeline already. So, a project that uses a lot of rebasing generally looks more streamlined and clean than a project that uses a lot of merging, history-wise. Essentially, rebasing helps you refactor your project history so that it's always nice to look at. This neatness, however, comes at a cost. This nicely designed history is not real. It was forged by rebasing, which is a destructive operation. Rebasing creates new commits and leaves behind existing commits that might get garbage collected. So a rebased history looks cleaner, but it is a lie in its own way. For example, in this case, it looks like the yellow commits were created first and the blue commits were created later on top of them, but this is not what really happened. The yellow and blue commits were created in parallel in different branches. So in contrast to merges, rebases change the project history. This might not sound like a problem at all. You might say, who cares what the history looked like originally? Surely you only care about the final result. Well, actually there are a few situations when you do care about history. There are some advanced Git commands, for example, that become less useful if you tamper with project history. Also, changing history means creating new commits and moving branches, and there are some scenarios where all this trickery results in confusing situations, like multiple commits with the same commit message in the same branch. Most importantly, there is one common scenario when this rewriting of history can become truly painful. This scenario has to do with distribution, so I have to ask you to be patient. I will talk about it in the next module. For now, just remember this. Rebases make your history cleaner, but they can also cause unwanted side effects. 
If I had to condense the differences between merges and rebases in just a single recommendation, it would be this. When in doubt, just merge. Rebasing is a power tool. It is quite useful, but you should only use it if you know what you're doing and you understand the consequences. And that's it about merging and rebasing for now. I promise that in the next module I will show you a concrete example of how mindless rebasing can land you in trouble.
Tags in Brief
So far I talked about the best features that turn Git into a revision control system, branches, merges, rebases. We have just one last such feature to talk about, tags. Tags are very useful, but they are too small to deserve their own module. I will mention them here to complete our discussion on versioning. You might remember that we talked about tags already in the first module. We even created a tag. There it is. Back then I told you that tags are one of the four types of objects in the database, together with commits, trees, and blobs. Now, let's get a bit deeper. In Git there are actually two types of tags. In module one we talked about one of the two, annotated tags. The other kind of tag doesn't have a specific name, so people sometimes call them non-annotated tags or lightweight tags. Let's create one. Let's say that I want to mark the current point in my project history. For example, let's look at the very latest commit. In this commit we have both spaghetti alla carbonara and an updated apple pie. Let's say that we want to tag this commit with a tag named dinner. We could create an annotated tag. Maybe you still remember that we can do that with tag -a. This tag would contain a lot of useful information such as the date that the tag was created, who created it, a description, and so on. However, in some cases I could decide that there is no reason to have all that information. I might just want to mark this commit with a simple label for my own use. If that is the case, then lightweight tag is enough. I can create such a tag by skipping the -a option in the tag command. There, now we have a tag. There it is. I did not have to provide the message or anything. Now, let's peek inside the .git refs directory. There is heads directory here that we already know about, it contains the branches, and then there is a tags directory that contains the tags. There are two tags in there, the one we already had and the one we just created. 
They are two simple files that contain the SHA1 of an object in the database. See. A tag is a reference to an object, in this case a commit, just like a branch. I could actually turn this tag into a branch just by moving it to the refs/heads directory. This is a lightweight tag, so it contains the SHA1 of a commit. An annotated tag is similar, but it contains the SHA1 of a tag object in the database, and that object in turn references a commit, besides containing all the extra information like the tag description. If tags look just like branches, then what's the difference between a tag and a branch? Simply enough, while branches move, tags don't. If I create a new commit right now, then master will move to track it because it's the current branch, but the tag will just stick to the same object forever. And that's all I had to tell you about tags.
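To make the difference between the two kinds of tags concrete, here is a minimal sketch in Python. It is a toy model, not Git's real code: the object IDs ("c1", "c2", "t1") and the annotated tag name "release" are made up for illustration, and real Git identifies objects by their SHA1.

```python
# Toy model of Git references. The IDs and the "release" tag are hypothetical.

# The object database: one commit, plus a tag object that points at it.
objects = {
    "c1": {"type": "commit", "parent": None},
    "t1": {"type": "tag", "object": "c1", "message": "hypothetical description"},
}

# References are just names that hold the ID of an object.
refs = {
    "refs/heads/master": "c1",   # a branch
    "refs/tags/dinner": "c1",    # lightweight tag: points straight at a commit
    "refs/tags/release": "t1",   # annotated tag: points at a tag object
}


def resolve_commit(ref_name):
    """Follow a reference, through any tag objects, down to a commit ID."""
    object_id = refs[ref_name]
    while objects[object_id]["type"] == "tag":
        object_id = objects[object_id]["object"]
    return object_id


def commit_on(branch_ref, new_commit_id):
    """Add a commit on top of a branch. Only that branch reference moves."""
    objects[new_commit_id] = {"type": "commit", "parent": refs[branch_ref]}
    refs[branch_ref] = new_commit_id


commit_on("refs/heads/master", "c2")

for name in sorted(refs):
    print(name, "->", resolve_commit(name))
```

After the new commit, master resolves to the new commit, while both tags still resolve to the old one. That is the whole difference: branches move, tags don't.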
A Version Control System
So, let's recap. Branches, merges, rebases, tags, these are the main features that turn Git from a stupid content tracker into a full revision control system. It took us a lot of talking, two entire modules, half of this training to go through the versioning features, but we finally completed this layer of the onion. Now we only have one more layer to go, and then we will know the entire onion. We will have the full picture of Git and how it works, so let's talk about that last layer.
Distributed Version Control
Hello, and welcome to How Git Works, module 4. We're only missing one last layer in our description of the Git onion, but it's a really important one: distribution. So far we imagined that there is only one computer in the world, the computer that you're running Git on. Now let's see what happens if you use Git the way it's used in practice, to share projects across multiple computers.
A World of Peers
Imagine that you have a Git repository on a computer somewhere. It's this orange box here. And you also want the same repository somewhere else, probably on a different machine, so you want to have it here. I made it green. Now, the machine that hosts the green repository must be able to connect to the machine that hosts the orange repository, so you might have some technical setup to do here. You have to run a Git daemon process for the orange repo so that the green repo can connect to it, and so on and so forth. But in this training we don't care about these technical details, so let's make it easy. I just moved the old cookbook project to GitHub and removed it from my computer. So now the orange repo is in the cloud, so to say, a service hosted on the GitHub servers, and the green repo will be on my own computer. So, I want to get a copy of the project on this computer inside this empty directory. You probably know which command to use here. It's the git clone command. It takes the address of a Git repository, which I can copy and paste from GitHub. There, now I have the project. All the files are here. But I didn't just get the files. I got the entire .git directory as well and all the files it contains. Here is what git clone did. It created an empty directory for the cookbook, and it copied the .git directory from the GitHub project to this directory. I simplified here. It didn't literally copy each and every file. For example, in recent versions of Git, git clone only sets up one local branch, the master branch. If I want to work with the other branches on the remote repo, I need to give specific commands to do so, but that's a detail, an optimization if you wish. The important part is that Git did copy over the objects in the object database. They are in here. After copying this stuff, Git checked out the master branch to rebuild these files in the working area. Remember, the working area in Git is not very important.
You can always rebuild it on the fly from the content of the .git directory. And since the .git directory contains the entire repository, now we have a copy of the project and its history on this computer. This is an important point, so it's worth repeating it. Now that we have two clones of the repo, one on GitHub and one on this computer, both clones are equally good. Git is not like Subversion or other traditional revision control systems that need a centralized server, where everyone else is just talking to that one server. Instead, both computers now contain the whole project and its history. We could have as many of these clones as we want synchronizing with each other. Of course, you can still decide that one specific clone is the most important one. For example, if you had multiple developers working on the same software project, then you would probably decide that the repo on GitHub is the reference repo, the one that you build the releases from, and everybody must synchronize with that one. That's why I drew GitHub right on top. You can still synchronize the developers' repos directly with each other, but even then you probably want to appoint a well-known reference copy that everybody synchronizes with. However, in Git that's not a technical issue. It's a social issue; it's a convention. From a technical standpoint, all of these clones are peers.
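The idea that a clone copies the object database and then rebuilds the working area from it can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the object IDs ("blob1", "tree1", "commit1"), the file content, and the URL are invented, and the real git clone does much more than this.

```python
# Toy model of "git clone". IDs, file contents and the URL are hypothetical.

def checkout(repo, branch):
    """Rebuild the files of a branch from the object database."""
    commit = repo["objects"][repo["branches"][branch]]
    tree = repo["objects"][commit["tree"]]
    return {filename: repo["objects"][blob_id] for filename, blob_id in tree.items()}


def clone(remote, url):
    """Return a new local repository built from a remote one."""
    local = {
        # The object database is copied over as it is.
        "objects": dict(remote["objects"]),
        # Remember where the copy came from.
        "remotes": {"origin": url},
        # Start on the same master commit as the remote.
        "branches": {"master": remote["branches"]["master"]},
    }
    # The working area is not copied: it is rebuilt from the objects.
    local["working_area"] = checkout(local, "master")
    return local


orange = {
    "objects": {
        "blob1": "Apple Pie recipe\n",
        "tree1": {"apple_pie.txt": "blob1"},
        "commit1": {"tree": "tree1", "parent": None},
    },
    "branches": {"master": "commit1"},
}

green = clone(orange, "https://example.com/cookbook.git")
print(green["working_area"])
```

Note that the orange repo in this sketch has no working area at all, and the green one still ends up with the files. Both dictionaries hold the same objects, which is the sense in which the two clones are peers.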
Local and Remote
Now we have the same project in two separate repos, orange and green. We're working on green, so it would be useful if green could remember the address of orange, because we decided that orange is an important copy and we want to stay synchronized with it. Indeed, when we issued the git clone command, Git added a few lines to the configuration of our repository. It's here in the config file. We never looked at this configuration file before, but now that we cloned the repo we can find some useful information here. Each Git repository, such as this one, can remember information about other copies of the same repository. Each other copy is called a remote. You can define as many remotes as you want, but when you clone a project Git immediately defines a default remote and gives it a conventional name, origin. Here is the configuration of origin, and it points to the URL that we cloned the project from. The rest of the configuration is more complicated. We don't need to look at the details here. Just know that the default configuration says that we have one master branch that maps onto the master branch of the remote. You can tweak this configuration to change the policies that you use to synchronize with remotes, but the default is pretty obvious. So, now Git remembers which other repo or repos we want to synchronize with, but to synchronize, Git also needs to know the current state of origin: which branches are there on the remote, which commits those branches are currently pointing at, and so on. And in fact, Git does store that information. If we ask it for branches, then it will just show the local branches. We only have master now. But if you list the branches with the --all switch, then you see all the references, including the ones on the remote: the remote branches and the current position of HEAD. Git tracks remote branches exactly like it tracks local branches, by writing those branches as references in the refs folder.
If you look inside that folder, you will see an origin folder in here that contains the references to branches, tags, and the current HEAD pointer of origin. Git will automatically update this information when we connect to a remote. There is one wrinkle here. If you look inside this folder, you might find that some of the branches are missing. In this case, I can only see the remote's HEAD here and not the branches. That's because of a low-level optimization in Git. To avoid maintaining one small file for each branch, Git sometimes compacts some of them into a single file called packed-refs, here. There is no simple command to unpack this file, so you will have to take my word for it that the branches that are not in the refs directory must be in this file. This can happen for both local and remote branches. But in both cases, whether the branches are still individual files or packed together in packed-refs, they're still conceptually the same thing. All branches, local or remote, are still references to a commit, and Git tracks all of that. Since we cannot peek inside the files for some of these branches because they've been packed, let's use this plumbing command, git show-ref, to see which commits they're pointing at. git show-ref master lists all of the branches that have master in their names, which means the local master branch and the remote master branch. And as you can see, they're pointing at the same commit, while the lisa branch is still pointing at an older commit. So, bottom line, you know that a local branch in Git is just a reference to a commit. Well, a remote branch is exactly the same thing. Whenever you synchronize with a remote, Git updates the remote branches. Let's see how that synchronization happens in practice.
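For the curious, here is a small Python sketch of how references can be collected from both places, the individual files under refs and the packed-refs file. It is not how Git itself is implemented, just an illustration of the layout described above. The function name read_refs and the fake IDs (40 letters "a" or "b") are made up for the example; the demo builds a throwaway directory rather than touching a real repository.

```python
import os
import tempfile


def read_refs(git_dir):
    """Collect reference names and their content from a .git-like directory."""
    refs = {}

    # Packed references: one "<id> <name>" pair per line.
    packed_path = os.path.join(git_dir, "packed-refs")
    if os.path.exists(packed_path):
        with open(packed_path) as packed_file:
            for line in packed_file:
                line = line.strip()
                # Skip blanks, the "#" header, and "^" lines (peeled tags).
                if not line or line.startswith("#") or line.startswith("^"):
                    continue
                object_id, name = line.split(" ", 1)
                refs[name] = object_id

    # Loose references: one small file each. They win over packed ones.
    # A loose file can also hold "ref: <other name>", as a remote HEAD does.
    for folder, _subfolders, filenames in os.walk(os.path.join(git_dir, "refs")):
        for filename in filenames:
            path = os.path.join(folder, filename)
            name = os.path.relpath(path, git_dir).replace(os.sep, "/")
            with open(path) as ref_file:
                refs[name] = ref_file.read().strip()

    return refs


# Demo on a fake directory with hypothetical IDs.
with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write("a" * 40 + "\n")
    with open(os.path.join(git_dir, "packed-refs"), "w") as f:
        f.write("# pack-refs with: peeled fully-peeled sorted\n")
        f.write("a" * 40 + " refs/remotes/origin/master\n")
        f.write("b" * 40 + " refs/remotes/origin/lisa\n")
    all_refs = read_refs(git_dir)

for name in sorted(all_refs):
    print(all_refs[name][:7], name)
```

Whether a branch lives in its own file or in packed-refs, the result is the same kind of entry: a name and the ID of a commit.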
The Joy of Pushing
In the very first module of this training, we said that each Git object is just a sequence of bytes identified by a SHA1. I also insisted a lot that SHA1s are unique; I said unique in the universe. Finally, this is the point in our training where we can see how that uniqueness is truly useful. Look back at our two repositories. When we cloned, we copied the objects from the orange repo to the green repo. Now imagine that we added a few new objects to the green repo, for example a new commit and the associated blobs and trees. Synchronization is mostly about getting the same objects on all the clones. But now it's very easy to synchronize because each object is immutable and has a unique SHA1, so Git will never get confused. It can just copy the missing objects from one repo to the other. Well, okay, it's not quite that simple, because copying the objects around is not enough. Git also has to keep the branches synchronized on the various clones, and that's where things get a bit tricky. Let's see how this works. I will make a change to this repo. After some experimentation, I realize that having just a little bit of lemon juice in my apple pie makes it taste even better. I will add this to the recipe and commit it. There we are. So now we have a few new objects in the database: a new blob to represent the file I changed, a new tree that represents the updated project root folder that is pointing to that blob, and this new commit here. And the local master branch is pointing at the new commit, while the master branch on origin is still pointing at the previous commit. Of course, nobody changed that branch yet, and origin doesn't even have this commit, and neither does it have the other new database objects. So, let's send both the new objects and the updated branch to origin. You probably know the command that does that, git push. There we are. Now our new objects have been pushed to the remote, and the master branch on origin moved to point at the latest commit.
We can easily check that because Git updated our remote branches to align with the current state of origin.
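Here is a rough sketch, in Python, of why unique IDs make the object part of synchronization easy. Each toy database is a dictionary keyed by the SHA1 that Git computes for a blob (a hash over a small "blob <size>" header plus the content). The function names and the recipe text are made up for the example, and the sketch only handles blobs; a real push also sends trees and commits and then moves the branch.

```python
import hashlib


def store_blob(database, content):
    """Store bytes under their Git blob ID and return that ID."""
    header = f"blob {len(content)}\0".encode()
    object_id = hashlib.sha1(header + content).hexdigest()
    database[object_id] = content
    return object_id


def copy_missing_objects(source, target):
    """Copy over the objects that the target does not have yet."""
    missing = set(source) - set(target)
    for object_id in missing:
        target[object_id] = source[object_id]
    return missing


orange = {}
store_blob(orange, b"Apple Pie\n")

green = dict(orange)  # the clone starts with the same objects
store_blob(green, b"Apple Pie\nwith a bit of lemon juice\n")

sent = copy_missing_objects(green, orange)
print(len(sent), "object(s) sent")
print(orange == green)
```

Because an ID is derived from the content, two repos never disagree about what an ID means, so finding what to send is just a set difference.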
The Chore of Pulling
Now what happens when there are other repos pushing to origin, so the state of origin might change at any time? Now we cannot just write changes to the remote. We also must read the changes from the remote. Things get a bit more complicated here, so I will use a diagram instead of a demo. Imagine that we have a remote repo that looks like this. It's a single commit. I will use different colors for the commits, and I will not draw trees and blobs. I will skip them because they would make this diagram too busy. When we clone this repo, we get the same objects on our local client, and here are the branches. Now let's say that we add one commit and we push. If there are no changes to the remote's master branch, then things are easy. Git copies our new commit and the associated objects to the remote, and then updates the remote's master branch to point at the new commit, and it also reflects the change in the branches on origin by updating the origin/master branch on the local repo. This is what we did when we pushed our changes a few minutes ago. Now let's do it again. This is the initial situation. We add a commit, and we prepare to push, but this time imagine that something has changed on the remote as well. Someone pushed another commit to the remote. Now we can't just push. We have a conflict here. We have two different histories that need to be reconciled. In this case, we basically have two options. One option, which I would not recommend except in very special cases, is to force a push. We can do that with a command line switch on the git push command, git push -f, which stands for force. This means that we're forcing the remote to take our new objects and change its history to match our local history. So, we're probably losing data on origin. Here we're losing that very commit. No branch is pointing at that commit anymore, so it will be garbage collected eventually.
We're also creating a very confusing situation for all other people synchronizing to the same remote because now their local history will be conflicting with the history in origin. So, probably forcing a push is not a good idea. Let's do it again properly. This is the situation we had before the push. What we want to do in general is we want to fix the conflict on our own machine before we push. To do that, we need first to fetch the data from the remote. There is a command to do that called git fetch. We get the new objects from the remote, and we also update the current position of the remote branches, as usual. Now that we have the new commit and the related objects, we can merge our local changes with the remote history. So, we did a fetch. Now we do a merge. Of course during the merge we might have to fix merging conflicts and the like, but the important point here is that we are not rewriting history. Merges never do that. Instead, they just add the new objects. So, once we do the merge, our history is the history from the remote plus some more stuff, and we can push that new stuff to the remote without rewriting the remote's history. This is what you do most of the time. You fetch the changes from the remote, you merge them into your own repo, and then you push the result. This sequence of a git fetch followed by a git merge is so common that there is one single command that does both. It's called, you guessed it, git pull, a fetch followed by a merge.
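The fetch, merge, push sequence can be sketched as a toy model in Python. Everything here is a simplification under stated assumptions: commits are just hypothetical letters ("A", "B", "C", "M") mapped to their parents, each repo has only a master branch, and the merge never has conflicts. It is meant to show the rule "a plain push must not rewrite the remote's history", not to mirror Git's actual implementation.

```python
def ancestors(commits, commit_id):
    """Return the commit itself plus everything reachable through parents."""
    seen = set()
    todo = [commit_id]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(commits[current])
    return seen


def push(local, remote):
    """Send our commits, unless that would rewrite the remote's history."""
    ours = local["master"]
    if remote["master"] not in ancestors(local["commits"], ours):
        return False  # rejected: the remote has commits that we don't have
    remote["commits"].update(local["commits"])
    remote["master"] = ours
    local["origin/master"] = ours
    return True


def fetch(local, remote):
    """Get the remote's commits and update our remote branch."""
    local["commits"].update(remote["commits"])
    local["origin/master"] = remote["master"]


def merge(local, merge_commit_id):
    """Merge origin/master into master. This only ever adds objects."""
    ours, theirs = local["master"], local["origin/master"]
    if theirs in ancestors(local["commits"], ours):
        return  # already up to date
    if ours in ancestors(local["commits"], theirs):
        local["master"] = theirs  # fast-forward
        return
    local["commits"][merge_commit_id] = [ours, theirs]
    local["master"] = merge_commit_id


def pull(local, remote, merge_commit_id):
    """A pull is a fetch followed by a merge."""
    fetch(local, remote)
    merge(local, merge_commit_id)


remote = {"commits": {"A": []}, "master": "A"}
local = {"commits": {"A": []}, "master": "A", "origin/master": "A"}

local["commits"]["B"] = ["A"]   # we commit locally
local["master"] = "B"
remote["commits"]["C"] = ["A"]  # meanwhile, someone else pushes
remote["master"] = "C"

first_try = push(local, remote)
pull(local, remote, merge_commit_id="M")
second_try = push(local, remote)
print(first_try, second_try, remote["master"])
```

The first push is refused because the remote's commit is not part of our history. After the pull, our history contains the remote's history plus some more stuff, so the second push goes through.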
There is one more important thing to say about this process of pushing and pulling, and it has to do with rebasing. In the previous module of this training we talked about rebasing, and back then I told you that there are a few cases where rebases do not work very well. Now we can finally see why. Say that we have this repo freshly cloned with two branches that are both tracking branches on origin. We're working on the lisa branch, and we decide to roll the changes from master into lisa. You know that we can do this with either a merge or a rebase, so let's try the rebase this time. Git copies over the lisa commit so that its parent is now the latest commit on master, and there we are. However, remember that this new yellow commit that we have here is not the same commit as the previous yellow commit. Instead, it's a copy, a different database object. I marked it with an exclamation point to tell it apart from the original commit. The original commit will actually be garbage collected at some point. So, now we have a conflict again. We can't just push because we have different histories on our local repo and on origin. This particular conflict, however, doesn't seem like much. We can fix it easily, for example by doing a forced push or a pull followed by a push. In any case, we can work around this, and then we have the same stuff on origin that we have on local. Good job. We can call it a day. However, things break down when we introduce another user. Our friend, Annie, is also working on the same cookbook repository, and she still has the original yellow commit, the one without the exclamation mark, in her repository. Not only that, she also kept working on the lisa branch. She added a commit there, so now Annie has a pretty nasty conflict to sort out the next time she synchronizes with origin. She needs to understand what happened first, and then to solve the conflicts even though she didn't cause the conflicts herself.
There is a good chance that even after solving the conflicts she will end up with a confusing history that includes both yellow commits, even though they look exactly the same. So, this is the bottom line when it comes to rebasing. As a general rule, never rebase stuff that has been shared with some other repository. It's okay to rebase commits that you haven't shared yet, in general, but remember that it's easy to rebase shared commits by mistake, and then you can expect some trouble. And that's the major reason why I warned you about rebases in the previous module.
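Why is the rebased commit a different object, even though it looks the same? Because a commit's identity covers its parent. The Python sketch below uses a deliberately simplified, hypothetical ID (a SHA1 over the parent ID and the message only). A real Git commit hashes more data, such as the tree, the author, and the dates, but the consequence for rebasing is the same: change the parent and you get a new ID. The commit messages are invented.

```python
import hashlib


def toy_commit_id(parent_id, message):
    """Toy commit ID: a SHA1 over the parent ID and the message."""
    return hashlib.sha1(f"{parent_id}\n{message}".encode()).hexdigest()


old_master = toy_commit_id("", "Add apple pie")
new_master = toy_commit_id(old_master, "Add lemon juice")

# The lisa commit as it was first created, on top of the old master...
lisa_original = toy_commit_id(old_master, "Lisa's change")
# ...and the copy that a rebase creates on top of the new master.
lisa_rebased = toy_commit_id(new_master, "Lisa's change")

print(lisa_original[:7], lisa_rebased[:7])
print(lisa_original != lisa_rebased)
```

Anyone who already has the original commit, like Annie, now holds an object that origin no longer treats as part of the branch, and that is where the confusion starts.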
We are almost done with our discussion of Git distribution, but it's worth taking a few minutes to discuss two distribution-related features that are not features of Git. Instead, they are features of GitHub, but they are so essential for modern development, especially modern open source development, that it's a good idea to mention them just to avoid confusion. Imagine that there is this project on GitHub that we want to contribute to. It belongs to a user named Pluralsight. We could simply clone this project, but then it would be stuck on our local machine because we don't have write access to Pluralsight's repository, so we cannot push to it. What we can do from the GitHub web interface is to create our own copy of the project on GitHub. This is called a fork. A fork is kind of like a clone, but it's a remote clone. We are cloning the project from someone else's GitHub account to our own GitHub account. So now we have a new project in the cloud, and we can clone that one on our local machine. When we do that, Git creates a remote in our local repo called origin. Origin is pointing at our own GitHub project, not the original project, of course. Actually, from Git's point of view there is no connection at all between our project and the original project that we forked from. GitHub does know that the two projects are connected, but Git doesn't. So if we want to track changes to the original project, then we need to add another remote pointing at it. This is not something that Git does automatically. We have to do it ourselves. A common convention is to call this remote upstream. Now we have our local project with multiple remotes. We can work on it, and we can synchronize all our local changes with origin. If we commit local changes, we can just push those changes to origin. If there are changes on upstream, we can pull them into our local project, solve any conflicts, and then push them to origin.
One thing that we still cannot do, however, is to push changes to upstream. For example, we might like to contribute our orange commit to the original project, but we still do not have write access to upstream, so GitHub gives us an alternative. We can send a message to the maintainers of upstream and ask them to pull our changes. You know what this is called. It's a pull request. Once again, pull requests are not a Git feature. They're not even a version control feature, strictly speaking. In a way, they are a social network feature. You're just sending a message to people. If those people like your changes, then GitHub makes it easy for them to do a remote pull and get your changes from origin. And that's all about forks and pull requests, two of the most important tools in modern open source development.
The Whole Onion
At the beginning of this training, I promised that by the end of it you would understand how Git really works. Congratulations, now you do. To get here, we started right from the core of Git, a simple map of hashes to objects. Then we looked into those objects, and we got to the point where we could see Git as a stupid content tracker that tracks changes to your files and directories. From there we moved on to the revision control features of Git. We talked about branches and merges and rebases. And finally, we looked at the distribution-related features of Git that are probably the main reason why you use Git in the first place. And there it is, the whole onion. We wrapped our heads around the entire thing. This is all you need to start exploring Git on your own. Sure, if you're planning to use Git a lot, then there are still plenty of things that you might want to learn about: more advanced features, dozens of command line options, okay, actually hundreds of command line options. But now that you understand the fundamentals, you can confidently keep learning without fear of getting confused. And if you like this training, you might want to do two things. First, rate this training at Pluralsight, and if you have a minute, even leave me a comment here with feedback. I love ratings and I love feedback. And also, you can move on to this other training that is published here on Pluralsight. It's called Mastering Git, and it goes a few steps further than describing the Git onion. In a sense, it tells you how to cook the onion. It explains more sophisticated commands and techniques that you will want to know when you are using Git in your daily job. Thanks a lot for watching this screencast, and have fun on your way to Git mastery.
Paolo Perrotta is a traveling coach and a software mentor. He wrote
"Metaprogramming Ruby", widely praised as one of the essential books
Released 10 Mar 2016