Understanding Distributed Version Control Systems
-
A Brief History of Version Control
In this first module, we're going to look at a brief history of Version Control Systems, and along the way we're going to remind ourselves of some of the key principles behind Version Control. You can break Version Control Systems into three generations. The first generation operated using individual File Locks, the second generation is what's commonly known as Centralized Version Control Systems, abbreviated to CVCS, and the third generation, which this course is about, is Distributed Version Control Systems. This breakdown into three generations comes from a book by Eric Sink called Version Control by Example, which he has very kindly made available for free on his website. It's an excellent book and I highly recommend you take the time to read it. But for the purposes of this course, I'm actually going to add in an extra generation. I'm going to call it Generation 0. This is where you use no version control at all. And the reason I've put Generation 0 in is that this is in fact where most programmers start. Usually when you're taught to program, you're not taught about version control, so you don't use it at all. And many developers can go for several years before they start making use of version control systems. In fact, as I've visited user groups and talked about Distributed Version Control, I've been surprised to discover that there are quite a few developers out there who are well into their professional careers, but still aren't using any version control whatsoever.
-
Generation 0 - Working without Source Control
Now the main reason that many developers think that they can get away without using version control is because they're working alone. They have no need to be able to merge in changes that other people are making to their code because there are no other developers working on their code. So instead of using version control, they simply make regular backups of their code, or at least they try to remember to make regular backups of their code, and those backups are the only fallback if something goes wrong. So Generation 0 using no Version Control at all can sort of work, if you're a lone developer, although later in this course I'll be making the case for why you should still use version control even if you're working alone. But when you go from one to two developers, now you can start to get yourself in all kinds of trouble if you're not using a version control system. Let me explain what I mean. Imagine two developers want to both work on the same application, but they're not using version control. A common way to approach this would be to put the code on a Network Share. In the morning, our first developer, let's call him Stan, takes a copy of all the code on the Network Share onto his computer, and then Ollie also takes a copy from the Network Share onto his computer. Now Stan wants to work on database.c, so he says to Ollie, I'm working on database.c, and Ollie says that's fine, I don't need to work on that file, so Stan gets to work making some changes. Ollie wants to work on user-interface.c, so he says to Stan, I'm changing user-interface.c, and Stan says, that's fine, I'm not working on it, so Ollie starts to make some changes. At the end of the day, Stan copies his changed database.c onto the Network Share and Ollie copies his changed user-interface.c onto the Network Share as well. 
So far, so good, but just before Stan goes home he suddenly remembers that he also changed app.h and he thinks, well Ollie wasn't working on that so I can just copy it onto the Network Share. But the trouble is, just before he goes home, Ollie remembers that he also had changed app.h and he thinks to himself, Stan wasn't working on that, so I can copy mine onto the Network Share. Now Stan's changes on the Network Share have been overwritten and in the morning when they both copy off the Network Share again, suddenly there's a problem. Neither of their code compiles and Stan's changes from yesterday in app.h have been lost. So as you can see, trying to work without version control with two developers is extremely dangerous. Sooner or later, you're going to lose some code. In fact, I have even spoken to some people who have tried to scale this approach above two developers to five developers. All I can say is, that's absolute madness. You're going to run into all kinds of troubles if you try to coordinate multiple developers working on the same code base without using version control systems.
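To make the lost update concrete, here is a minimal Python sketch of Stan and Ollie's day. It is only a model under stated assumptions: the Network Share and each developer's working copy are plain dictionaries, and the `copy_to_share` helper and the file contents are made up for illustration.

```python
# Hypothetical model: the network share maps file names to file contents.
share = {"app.h": "original", "database.c": "original", "user-interface.c": "original"}

# In the morning both developers copy everything off the share.
stan = dict(share)
ollie = dict(share)

# Each edits the file he announced, plus app.h, which nobody announced.
stan["database.c"] = "Stan's database changes"
stan["app.h"] = "Stan's app.h changes"
ollie["user-interface.c"] = "Ollie's UI changes"
ollie["app.h"] = "Ollie's app.h changes"


def copy_to_share(working_copy, filenames):
    """Copy the named files onto the share, overwriting whatever is there."""
    for name in filenames:
        share[name] = working_copy[name]


# End of the day: the announced files go up without any trouble.
copy_to_share(stan, ["database.c"])
copy_to_share(ollie, ["user-interface.c"])

# Then each remembers app.h. The last copy wins and nothing warns anyone.
copy_to_share(stan, ["app.h"])
copy_to_share(ollie, ["app.h"])

print(share["app.h"])
```

Nothing in a plain file copy records that app.h had changed since Ollie took his copy in the morning, and that missing check is exactly what the later generations of version control add.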
-
Generation 1 - File Locks
So how do version control systems get us out of this mess? Well, the Generation 1 version control systems did so by the simplest means possible. They made use of File Locks. One developer could exclusively lock a file and only they could work on it until they relinquished the lock. And a couple of examples of Generation 1 version control systems are SCCS, which came out in the early 70s, and RCS, which came out in the early 80s. Now although this solves the problem of developers overwriting each other's changes, using exclusive file locks can cause more problems. In particular, it can cause bottlenecks. Sooner or later, you'll find that you want to work on a file that another developer has got exclusively locked. And so you'll be hassling them to hurry up and finish with the file, but that might not be possible. They might be off sick or on Holiday or they might be doing something that's taking a very long time. And in my experience, what often happens is that developers try to work round this by making the changes they want to on a temporary local copy, and the intention is that when that file becomes available, they will get a lock on it and they will manually merge in their changes, but this can be quite problematic because manual merges are extremely error prone.
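The exclusive lock idea can be sketched in a few lines of Python. This is not how SCCS or RCS are implemented; `LockingRepository` and its methods are hypothetical names used only to show the rule that one developer at a time may hold a file.

```python
class LockingRepository:
    """Toy model of Generation 1: a file can be locked by one developer at a time."""

    def __init__(self):
        self._locks = {}  # file name -> developer currently holding the lock

    def holder(self, filename):
        """Return who holds the lock on a file, or None if it is free."""
        return self._locks.get(filename)

    def lock(self, filename, developer):
        current = self._locks.get(filename)
        if current is not None and current != developer:
            raise PermissionError(f"{filename} is locked by {current}")
        self._locks[filename] = developer

    def unlock(self, filename, developer):
        if self._locks.get(filename) != developer:
            raise PermissionError(f"{developer} does not hold the lock on {filename}")
        del self._locks[filename]


repo = LockingRepository()
repo.lock("app.h", "Stan")

# Ollie is stuck until Stan lets go. This is the bottleneck.
try:
    repo.lock("app.h", "Ollie")
    ollie_got_lock = True
except PermissionError:
    ollie_got_lock = False

repo.unlock("app.h", "Stan")
repo.lock("app.h", "Ollie")  # now it succeeds
```

Notice that the only way out of the `PermissionError` is for Stan to call `unlock`, which is why a colleague who is off sick or on holiday can hold everyone else up.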
-
Generation 2 - Merge Before Commit
So the second generation of version control systems brought a significant benefit. It allowed concurrent edits. Two developers could actually make their changes to the same file at the same time. Let's look again at our timeline at some examples of centralized version control systems. One of the first was called CVS. It came out in 1990. Then there was IBM's Rational ClearCase, SourceSafe, Perforce, and in 2000 Subversion came out, and Subversion has become extremely popular. In many of the statistics that I looked at to see which version control system was the most popular, Subversion came out well on top. There's also SourceGear Vault, and in 2005 Microsoft finally replaced the much maligned SourceSafe with an improved centralized version control system, TFS. But how can a centralized version control system let two people work on the same file at the same time? Well, the way it does it is it forces you to merge before you commit if someone else has made a change to the file that you're changing. Let me explain what I mean by that. Let's imagine our same two developers are going to try to make changes to app.h at the same time again, but this time they've learned their lesson and they're using a centralized version control system. So now instead of a Network Share, they have a central server and they perform what's called a Get Latest operation. This is where they go to the central server and they ask for the latest version of all the files in the project. Once both of them have got the latest version, they can begin editing the files as they wish on their local machines. So Stan makes some changes to app.h and Ollie also makes some changes to app.h. When Ollie's ready, he does a commit, or a check-in as it's sometimes called, to the central server, and this goes nice and simply for him because he's the first one to check in. But what happens when Stan wants to commit his changes? Are his changes going to overwrite Ollie's?
Well, no, because when Stan tries to commit he's going to be blocked. The central server tells him, someone else has changed the file you're working on in the meantime, so you can't commit. What you need to do is perform a merge and the tool will help you to merge correctly. In fact, it may well be possible for the changes to be merged automatically. So now Stan has got on his machine a merged app.h, which has got his changes and Ollie's changes in it. And when he's happy that everything's still working correctly, he's now able to commit to the central server. And so we end up with the situation that we've got both people's changes on the central server. And this is what we mean by merge before commit. Stan was forced to do the merge before he was allowed to commit his changes to the central server.
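Here is a small Python sketch of merge before commit. It is a toy under stated assumptions: `CentralServer`, `OutOfDateError`, and `toy_merge` are hypothetical names, files are lists of lines, and the merge simply keeps the server's lines and appends any of mine that are missing. Real tools do a proper three-way merge.

```python
class OutOfDateError(Exception):
    """Raised when a commit is based on an old revision of a file."""


class CentralServer:
    """Toy central server that refuses commits made against a stale revision."""

    def __init__(self, files):
        self.files = {name: list(lines) for name, lines in files.items()}
        self.revision = {name: 1 for name in files}

    def get_latest(self, name):
        """Return a copy of the file and the revision number it came from."""
        return list(self.files[name]), self.revision[name]

    def commit(self, name, lines, base_revision):
        if base_revision != self.revision[name]:
            raise OutOfDateError(f"{name} changed on the server: merge before commit")
        self.files[name] = list(lines)
        self.revision[name] += 1


def toy_merge(mine, theirs):
    """Stand-in for a real merge: keep their lines, then add mine that they lack."""
    return theirs + [line for line in mine if line not in theirs]


server = CentralServer({"app.h": ["#define VERSION 1"]})

# Both developers do a Get Latest and edit app.h locally.
stan_lines, stan_base = server.get_latest("app.h")
ollie_lines, ollie_base = server.get_latest("app.h")
stan_lines.append("#define STAN 1")
ollie_lines.append("#define OLLIE 1")

# Ollie checks in first, so his commit goes straight through.
server.commit("app.h", ollie_lines, ollie_base)

# Stan is blocked because the file has moved on since his Get Latest.
try:
    server.commit("app.h", stan_lines, stan_base)
    stan_blocked = False
except OutOfDateError:
    stan_blocked = True

# Stan gets the latest again, merges locally, and only then commits.
latest, stan_base = server.get_latest("app.h")
stan_lines = toy_merge(stan_lines, latest)
server.commit("app.h", stan_lines, stan_base)
```

The important detail is the `base_revision` check. Stan's unmerged work exists only in his working copy until the merge succeeds, which is the weakness that commit before merge addresses.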
-
Generation 3 - DVCS Timeline
So what does Distributed Version Control bring to the table? Well, one of the big innovations is commit before merge. With a Distributed Version Control System, you can actually commit your changes before you have to worry about doing the merge, and we'll see how that's possible later in this course. Now let's bring up our timeline again and have a look at some of the key Distributed Version Control Systems. It's a little bit hard to find out who exactly was first, but one of the earliest was called BitKeeper in the late 90s. A few years later, one called Darcs came out, and 2005 in particular was a really good year for Distributed Version Control Systems. Three very successful open source projects began that year: Bazaar, Mercurial, and the most successful of them all, Git, were all introduced in 2005. Another very important driver in the success of Distributed Version Control Systems was the launch in 2008 of github.com. GitHub is an open source project hosting site, and what it does really well is showcase some of the unique powers and capabilities of Distributed Version Control Systems. GitHub has rapidly become one of the most popular places to store open source projects. There have been more Distributed Version Control Systems created in the meantime. A couple of examples would be Plastic SCM and Veracity. These version control systems are a little bit more focused on the enterprise, providing some of the features that companies are used to having with second generation version control systems and that may be missing in the open source implementations like Git or Mercurial.
-
DVCS - A Crazy Idea?
Now I have to be honest with you. When I first heard about Distributed Version Control Systems, I wasn't initially convinced by the idea. In fact, some of the things that people were presenting as benefits of Distributed Version Control seemed like rather crazy ideas to me. For example, they would say, with a Distributed Version Control System, you have a copy, not just of the latest code, but of your entire history, of every past version of every file, all on your machine. I was using SourceSafe at the time, and that was extremely slow just to get the latest version, so this sounded really, really slow to me and like a bit of a waste of disk space. Why would I need the historical version of every file? Or they would say, with a Distributed Version Control System, you don't have to have a central repository, you don't need to have one computer that everyone gets the latest from and commits their changes to. But that didn't sound like a good idea to me either. There's a good reason why we have a central repository, and that's because it's generally a very bad idea to release into production just from any random developer's PC. What you want is one place where the latest and greatest version of your code can always be found and built by the build machine. Or they would say, with a Distributed Version Control System, it doesn't matter if your central repository catches fire because you've got a backup on every computer, every computer has got the full history. But of course, any sane company that's using a centralized version control system has already got a good backup plan in place. There's no way that you would leave your central repository without a backup, or at least hopefully not. Another favorite advertising slogan for Distributed Version Control Systems is that you can code on a plane if you want to. You can be completely disconnected from the central server, write some code, and commit, all while you're disconnected.
Well that one didn't impress me too much either. I've never really felt the need to code on a plane and, in fact, I go most places on my bike, and I'm certainly not tempted to try and code while I'm cycling. And the final one that I heard a lot is people would say with a Distributed Version Control System, it's really easy to create loads and loads of branches. You can create hundreds of the things. But most people who have used centralized version control systems won't find that a very attractive proposition. More branches, that means more merges, and I've got enough problems as it is. So maybe you have similar reservations and are wondering, why exactly did we need a third generation of version control system, what's wrong with centralized version control systems? They do most of the things we need, they allow us to have multiple concurrent edits on the same file, they allow us to do branching and merging, and they're generally fairly well-understood by the developers who use them. So as this course progresses, I hope to persuade you that far from being a crazy idea, Distributed Version Control Systems really do offer some substantial benefits over the second generation centralized version control systems.
-
Module Summary
So let's summarize what we've looked at in this first module. We've been through a brief history of version control. We started off by looking at what it's like to work without version control, and in particular the madness of having several developers work on the same files without using source control. Then we looked very briefly at how Centralized Version Control Systems work, and in particular we saw how they allow multiple concurrent edits to the same file by requiring the second person who commits to do a merge before they can continue with that commit. And finally, we looked on the timeline at some Distributed Version Control Systems. And now we're going to move on to answer the questions, how do they work and what benefits can they offer me?
-
DVCS Basics
Module Introduction
Hi, my name is Mark Heath, and in this module we'll be looking at the basic principles and operations of Distributed Version Control Systems. We're actually going to start off by looking at what a DAG is. DAG stands for Directed Acyclic Graph, and once you understand what one of these is, it will really help you to see what's going on with Distributed Version Control. Then we're going to run through each of the basic operations of a Distributed Version Control System, Clone, Commit, Push, Pull, and Merge. And we're also going to look at a slightly more advanced topic of the different ways that you can do Branching with a Distributed Version Control System, and we're going to look at two different ways that you can do this using Clones as Branches and using Labels to identify Branches.
-
DAGs Explained
Now in my opinion, one of the best ways to understand what's going on with Distributed Version Control is to understand the concept of a DAG, which as I've already said stands for Directed Acyclic Graph. Now I know that sounds quite complicated, but bear with me because I hope to show in this module that they are, in fact, quite straightforward. So when we say Graph, all we mean is that we've got some nodes, and lines joining those nodes together. The lines are often called edges. So here's a node, I've drawn it as a circle and identified it as node number 1, and now I've joined node number 2 to it, and node number 3. The directed part means that each of the lines has an arrow. They're directed edges, and that's because there's a parent-child relationship going on here. Node 3 is the child of node 2 and node 2 is the child of node 1. And what does it mean that they're acyclic? Well, all it means is that if you follow the arrows you can't get back to where you started. If you follow the arrows from node 3, you can't get back to node 3. If we add an extra arrow to this graph, then it's still acyclic. You still can't get back to node 3 by following the arrows. But I can't add an arrow like this: node number 1 can't be the child of node number 3. DAGs can be a little bit more complicated than this. For example, this is a valid DAG as well, and there are a number of special types of node on display in this example. Node number 1 is what we would call a Root node. It's got no parents, and typically a DAG will have just one Root node, although it could have more. Node number 2 represents something that in source control we'd call a Branch. It's got three child nodes, nodes 3, 4, and 5, so it's branching out at this point. Node number 7 merges a couple of those branches back together, so it's got more than one parent. And finally, nodes number 8 and 9 are also special. They've got no child nodes, so we call them Heads or Leaf nodes.
Obviously child nodes could be added in the future to those Head nodes, or indeed to any of the other nodes in the graph. So how is a DAG useful for storing our version control history?
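If it helps to see the same ideas as code, here is a minimal Python sketch of a DAG stored as a mapping from each node to its parents. The exact edges are my assumption, chosen to match the description above (node 1 is the root, node 2 has children 3, 4, and 5, node 7 has two parents, and nodes 8 and 9 are heads). The acyclic check uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

# Each node maps to the list of its parents. The edges are assumed for illustration.
parents = {
    1: [],
    2: [1],
    3: [2], 4: [2], 5: [2],   # node 2 branches three ways
    6: [3],
    7: [4, 5],                # node 7 merges two branches
    8: [6],
    9: [7],
}


def children_of(parents):
    """Invert the parent mapping so we can ask for a node's children."""
    children = {node: [] for node in parents}
    for node, node_parents in parents.items():
        for parent in node_parents:
            children[parent].append(node)
    return children


def is_acyclic(parents):
    """True if following the parent links can never lead back to the start."""
    try:
        TopologicalSorter(parents).prepare()
        return True
    except CycleError:
        return False


children = children_of(parents)
roots = [n for n in parents if not parents[n]]             # no parents
heads = [n for n in parents if not children[n]]            # no children
branch_points = [n for n in parents if len(children[n]) > 1]
merges = [n for n in parents if len(parents[n]) > 1]

# Making node 1 a child of node 3 would create a cycle, so it is not allowed.
with_cycle = {**parents, 1: [3]}

print(roots, heads, branch_points, merges, is_acyclic(parents), is_acyclic(with_cycle))
```

Storing only the parent links is enough. The children, heads, and branch points can all be worked out from them, which is convenient because a new child can be added without touching any existing node.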
-
Version History as a DAG
Well, each node in the DAG represents a single commit or change we've made to the version control system. So here in this very simple example, we've made 3 changes, 3 commits, and as you'd expect, each node contains information about what files were changed and what lines were changed in those files. But just as importantly, each node also contains information about who its parent node is. So node 2, or version 2, is a change to version 1, and node 3 is a change to version 2. Now each node in a DAG is uniquely identified by a hash, and this hash usually uses the SHA-1 algorithm, although it could use a different hashing algorithm. The important thing is that the hash is calculated using the changes you've made to the code and using your parent node. So if you change anything about a commit node, including its parent, it will get a new hash. Now on my diagrams I've just shown individual numbers, but actually when you work with Distributed Version Control Systems, you'll see that their hashes are much longer than that. I've just used individual numbers because it makes it easier for me to talk about which node I'm referring to in these tutorials.
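Here is a small Python sketch of how a commit's identity can depend on both its changes and its parent. It uses SHA-1 from the standard library, but the way I feed the data into the hash is an assumption made for illustration and is not the actual object format of Git or Mercurial.

```python
import hashlib


def commit_hash(changes, parent_hashes):
    """Hash a commit from its changes plus the hashes of its parents (toy format)."""
    h = hashlib.sha1()
    for parent in parent_hashes:
        h.update(b"parent " + parent.encode("utf-8") + b"\n")
    h.update(changes.encode("utf-8"))
    return h.hexdigest()


node1 = commit_hash("add app.h", [])
node2 = commit_hash("fix typo in app.h", [node1])

# The same change on top of a different parent gets a different identity.
other_root = commit_hash("add readme", [])
node2_moved = commit_hash("fix typo in app.h", [other_root])

print(node2)
print(node2_moved)
```

Because the parent's hash is part of the input, a node can't be re-parented without becoming a different node. A SHA-1 hex digest is 40 characters long, which is why the identifiers you see in real tools are so much longer than the numbers on my diagrams.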
-
Cloning
So with this in mind, let's start to look at the basic workflow that you would use when working with Distributed Version Control, and for the sake of illustration, let's imagine that our developer from the last module, Stan, has gone to work at a new company that uses Distributed Version Control. Now Stan's first job is going to be to get the code off the server. Although we said in the first module that Distributed Version Control Systems don't need to have a central server, in practice almost everyone uses one. Every company will need a place where everyone knows the latest version of the code is stored. So, on the server we're going to find the latest version of the code, and that's stored using the DAG format that we just looked at. Now with a centralized version control system, what Stan would do is a Get Latest. He'd get just the version of the files as they were in version 3. But with a Distributed Version Control System, what he needs to do is called a Clone. And when he does a clone, he copies the entire DAG from the server onto his local machine. Now that might sound to you like it's going to be a little bit slow, but I can assure you, in actual practice you find that this works extremely quickly. After all, it's just a file copy. And repositories tend to be surprisingly efficient on disk space as well. So if you're concerned about speed and performance, I'd encourage you to draw your own conclusions after you've actually tried it. You might be pleasantly surprised. And you can see already here why people say it doesn't matter if your central server catches fire, because you really have got a copy of all the information. The entire DAG is now stored on Stan's local computer.
-
Making Commits
So now Stan has got a copy of the DAG on his local computer. He's ready to start some work. So let's say he does some work and he fixes a bug, and once he's happy that he's fixed the bug, he makes what's called a commit. In making the commit, he adds a new node to his DAG, we'll call it node number 4 here. Now the important difference from a centralized version control system is that this commit has not gone to the server yet. If we were to look at the server's DAG, we would see it's still only got nodes 1, 2, and 3 in it. Stan's commit is only on his local machine at this moment. Now this opens up an interesting possibility that you can't typically do in a centralized version control system. Stan can actually make another change and commit that too. He's now made two local changes and neither of them has made it to the central server yet. Now you may be thinking, what's the use of that? Surely if Stan's making some bug fixes and changes, he wants to share them with everyone else. So how do we get these onto the central server? Well, what Stan has to do is what's called a push. He's going to push the nodes in his DAG that aren't on the server up to the server. Currently, as we recall, the server's just got nodes 1, 2, and 3 on it, so we need to find out what nodes Stan has got that the server hasn't, and because of those hashes it's actually really easy to determine this. We work out that Stan has got nodes 4 and 5, so they need to get pushed to the server, and notice that both changes can be pushed in one go. All of the local changes that Stan has made can be pushed together up to the server, and this means the server and client are now synchronized again. They've both got exactly the same DAG.
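The clone, commit, and push steps can be sketched in Python by treating a repository as a dictionary from node to parents. This is a deliberately simple model: I use the small integers from the diagrams where a real system would use hashes, and `nodes_to_push` is a hypothetical helper name.

```python
# The server's DAG: node id -> tuple of parent ids.
server = {1: (), 2: (1,), 3: (2,)}

# Clone: Stan copies the whole DAG, not just the latest files.
stan = dict(server)

# Two local commits. The server knows nothing about them yet.
stan[4] = (3,)
stan[5] = (4,)
server_before_push = sorted(server)


def nodes_to_push(local, remote):
    """Return the nodes that exist locally but not on the remote."""
    return sorted(set(local) - set(remote))


# Push: send everything the server is missing, in one go.
missing = nodes_to_push(stan, server)
for node in missing:
    server[node] = stan[node]

print(server_before_push, missing, server == stan)
```

Working out what to send is just a set difference on node identifiers, which is why unique hashes make synchronizing two repositories so straightforward.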
-
Handling Conflicts
Now I'm sure you're thinking, that's all very well and good if no one else has made any conflicting changes, but what if someone else has made a change in the meantime? What if the server looks like this, and before Stan gets around to pushing his changes someone else has gotten there first and pushed nodes 10 and 11 onto the server's DAG? What will happen when Stan gets around to pushing nodes 4 and 5? Well, maybe what we'd like to happen is something like this. Stan's changes, nodes 4 and 5, just get pushed onto the end of the server's DAG, but unfortunately, that is not allowed, that's not going to work, and the reason is that node 4's parent isn't node 11, node 4's parent is in fact node 3. So we can't simply tag node 4 onto the end of the DAG without changing its hash, and changing its hash is as good as throwing it away and making a completely different node. So what's actually going to happen when Stan tries to push to the server is he's going to get blocked. In this sense it's no different to a centralized version control system. It doesn't let you do the commit because someone else has made a change. In this case, Stan has been able to do the commit locally, but he's been blocked from doing a push. What he needs to do is called a Pull from the Server or in git terminology what I'm about to describe is called a Fetch. So Stan, in his DAG, has got the following five nodes, but the server has got two more that he hasn't got. So what happens when Stan pulls from the server is that those nodes are added to his DAG. And you can see here that Stan ends up in a situation where his DAG has got two head nodes, 5 and 11, and this would be a problem on the server, because which version would the build machine build? Should it build version 5 or should it build version 11? What we want is a merge, a change that's got the changes from both nodes 4 and 5 and nodes 10 and 11 in it. 
So what Stan needs to do is perform a merge, and he can use the Merge command in his distributed version control system. What this will do is create a new node in his DAG, node number 6, whose parents are 5 and 11, and it will contain both sets of changes. Again, this isn't any different from the types of merge that you'll be used to from a centralized version control system. It will happen automatically if the changes don't conflict, and if the changes do conflict, it will bring up a visual tool that will allow you to decide what you want to do about those conflicts. But one big difference from centralized version control systems is that you're making the merge in complete isolation from the commits you made before. You'll see that we've added a new node to the DAG. We've not changed node number 5, so any of the work that Stan's already done can't be lost by a merge that's gone wrong. If something goes wrong in this merge, we can just throw away node number 6 and try it again. Again, this happens locally. The server doesn't know about the merge yet and won't until we perform another push, and so that's what Stan needs to do next. He now has three nodes in his DAG that the server doesn't have, so when he performs a push, we get the following situation: the server has still got a single head, so the build machine knows which node it needs to build from. It's node number 6.
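The whole blocked push, pull, merge, push sequence can be walked through with a toy `Repo` class in Python. The class and its methods are hypothetical, and the push rule is a simplification: I reject a push whenever the remote has nodes the pusher lacks, whereas real systems apply more refined checks.

```python
class Repo:
    """Toy repository: a DAG stored as node id -> tuple of parent ids."""

    def __init__(self, nodes=None):
        self.nodes = dict(nodes or {})

    def clone(self):
        return Repo(self.nodes)

    def commit(self, node_id, *parent_ids):
        self.nodes[node_id] = tuple(parent_ids)

    def heads(self):
        """Nodes that are not the parent of any other node."""
        used_as_parent = {p for ps in self.nodes.values() for p in ps}
        return sorted(n for n in self.nodes if n not in used_as_parent)

    def pull(self, remote):
        """Add any nodes the remote has that we don't. Existing nodes are untouched."""
        for node, node_parents in remote.nodes.items():
            self.nodes.setdefault(node, node_parents)

    def push(self, remote):
        """Simplified rule: refuse if the remote has work we have not pulled yet."""
        if set(remote.nodes) - set(self.nodes):
            raise RuntimeError("push rejected: pull and merge first")
        remote.nodes.update(self.nodes)


server = Repo({1: (), 2: (1,), 3: (2,)})
stan = server.clone()
stan.commit(4, 3)
stan.commit(5, 4)

# Someone else gets there first with nodes 10 and 11.
server.commit(10, 3)
server.commit(11, 10)

try:
    stan.push(server)
    push_blocked = False
except RuntimeError:
    push_blocked = True

stan.pull(server)
heads_after_pull = stan.heads()    # two heads: 5 and 11

stan.commit(6, 5, 11)              # the merge node has two parents
heads_after_merge = stan.heads()

stan.push(server)                  # nodes 4, 5 and 6 go up together
print(push_blocked, heads_after_pull, heads_after_merge, server.heads())
```

The merge is an ordinary commit with two parents, so nodes 5 and 11 are left exactly as they were. If the merge goes badly, deleting node 6 from the dictionary puts Stan straight back in the two-head state.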
-
Clones as Branches
That's almost all there is to it, but there's one other thing I've not talked about yet, and that's how you do Branching with Distributed Version Control Systems. There's actually two different approaches you can take. The first one I'm going to talk about is quite commonly used by people who use Mercurial, which is often abbreviated to hg after the abbreviation for mercury in the periodic table. And this approach is a very simple one. You basically just take another Clone of your repository and call that your Branch. So for example, if you've got a repository and its DAG has got these three nodes in, you could say, this is my version 1 repository, and then you'd clone it and you'd say this is my version 2 repository, and you'd host both of these on your central server. And so if someone wanted to make a change to version 1, they'd clone it, make their change, and push to version 1. And if someone wanted to work on version 2, they'd clone that, make a change, and push to version 2. The nice thing about this approach is that you don't have to learn any more special commands than the ones we looked at already. And another nice advantage of this approach is that you know that whenever you have two heads in your repository, they always need to be merged, because with this approach each repository is being used to represent a single branch in your code. And one of the implications of using this technique is that on your local development machine, you'd have one folder containing the version 1 code and a different folder containing the version 2 code. And if you're used to working with centralized version control systems, this is what you'll be used to anyway. Each branch is stored in its own folder, and depending on what type of development you're doing, that might be a useful feature or it might not be what you want.
-
Labels as Branches
The other approach to Branching is to have all your Branches stored in the same repository, in the same DAG, and this is the approach that Git tends to encourage. The way it allows you to do this is by using Labels that refer to specific nodes. So by default, Git will give you a master branch when you create your new repository. It will point initially to your first commit, but what happens when you make your second commit is that the master branch label moves to point to the next node, and if you add a third commit it moves again. And you can create your own branch at any point. Let's imagine I wanted to create a version 1 branch. Actually, all that would be happening when I did that is I'd be adding a label to node number 3. The version 1 label points to node number 3, and you need to tell Git which branch you're working on. So you tell it, check out version 1, and Git then knows we're now working on version 1. When we make another commit, the version 1 label moves along, but the master label, the master branch, is still pointing at the previous node. This allows us, if we want, to create another branch, and we could create the other branch based on where the master branch is. So we could have version 2 and we could tell it we're working on version 2. Then when we make another commit, node number 5, the version 2 label points at node number 5 and, again, the master branch is still on node number 3. So you can see here that really all that's happening is branches are giving us convenient labels that help us navigate our way around the DAG. The only slightly complicated thing about using this technique for branching is that my master branch label might be pointing to a different node than the master branch label on the repository I initially cloned from, and it will be if I've made my own local commits that haven't been pushed yet.
And the way Git handles this is that when I do a pull, and maybe get a couple of new nodes in, it has a way of showing me that the master branch on the origin, which is Git's name for the repository that I cloned from originally, is in a different place from my own master branch. This allows me to see quite easily that I need to merge my master with the origin's idea of master, so that we can synchronize where the master branch ought to be pointing. And in fact, Git's pull command makes this really easy. It not only fetches the nodes from the server that you don't have, but it also performs the merge for you if necessary.
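To show that a branch really is just a moving label, here is a minimal Python sketch. `LabelledRepo` and its methods are hypothetical names, and I model only the local labels, not the separate view of where the origin's labels point.

```python
class LabelledRepo:
    """Toy single repository where branches are labels pointing at nodes."""

    def __init__(self):
        self.parents = {}                  # node id -> parent id (None for the root)
        self.branches = {"master": None}   # label -> node id it points at
        self.current = "master"

    def commit(self, node_id):
        """Add a node on top of the current label, then move that label forward."""
        self.parents[node_id] = self.branches[self.current]
        self.branches[self.current] = node_id

    def branch(self, name, start):
        """Create a new label pointing wherever the start label points."""
        self.branches[name] = self.branches[start]

    def checkout(self, name):
        if name not in self.branches:
            raise KeyError(name)
        self.current = name


repo = LabelledRepo()
for node in (1, 2, 3):
    repo.commit(node)                 # master moves 1 -> 2 -> 3

repo.branch("version1", start="master")
repo.checkout("version1")
repo.commit(4)                        # only the version1 label moves

repo.branch("version2", start="master")
repo.checkout("version2")
repo.commit(5)                        # node 5 is a child of node 3

print(repo.branches)
```

Creating a branch copies nothing but a single pointer, which is why making lots of branches is so cheap with this approach.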
-
Module Summary
So to summarize what we've learned in this module, we've seen that repositories in a Distributed Version Control System are stored in a DAG structure, with each node having a hash as a unique identifier. This allows us to make local commits that we only push to the server when we're ready. And it also separates the merge from the commit. We can commit our changes and then perform the merge later in a separate commit. We've seen how we need to pull other people's commits from the server, and we need to push our own commits to the server, and we've also looked at two ways you can branch. You can make use of the fact that every clone effectively is a branch and so just have two repositories, one for each branch you want to work on. Or you can use the technique that makes use of labels, allowing a single repository to contain many branches. And although I said that the first of these is used in Mercurial and the second is used in git, you'll find, in fact, that you can use either in any Distributed Version Control System. And in the next module we're going to see how Distributed Version Control Systems are useful for single developer projects. And we'll also be demonstrating the use of the workflow that we've looked at in this module, how we can manipulate and explore the DAG using the commands of a Distributed Version Control System.
-
DVCS for Single Developer Projects
Module Introduction
Hi, my name is Mark Heath, and in this module we'll be looking at how we can use Distributed Version Control Systems on single developer projects. We're going to start this module off by looking at what the benefits are for using Distributed Version Control on your single developer projects, and I'm hoping to persuade you that it really does make sense to use it on all your projects. It's so simple to set up that you never really need to work without version control again. Then I'm going to do a demo and show you how to use a Distributed Version Control System to set up your repository and how to make changes and commits to it, and for this demo I'm going to be using Mercurial. Now we'll also be showing the use Git later on in this course, but I want to show you that the principles of Distributed Version Control are the same no matter which actual implementation you're using. And in fact, as we go through this demo, I'll be explaining what the equivalent Git commands are, and you'll see that there's a very close correspondence. For every command in Mercurial, there's a very similar one, often with exactly the same name in Git. And also, at the end of our demo I'll be showing how Distributed Version Control Systems make it really easy for us to backup our work to an offsite private repository. So all that source code sitting on your computer that's currently not under version control at all, you can have a very quick and easy way to keep it backed up and safe.
-
Single Developer Projects
What exactly do I mean by single developer projects? Well really all I'm talking about is anytime you are the only developer working on a software project. Now obviously if you're self-employed, for example, then all of the projects you work on may be single developer projects. But even if you work at a large company on a team of developers, it's very common for you to actually be often writing little bits of software on your own. For example, you might be creating a small utility that helps you get something related to your job done in an automated fashion, or maybe you're making a prototype of a new feature that you want to add to a larger application, or doing some kind of other experiment to try something out. Maybe you're writing some kind of test harness to test, again, a part of a larger application. Also, I very often create small learning projects, sometimes these only last for a few hours. Other times I'll work on them for a few days while I'm teaching myself a new technology. Or maybe you've come up with a great idea that you think is going to eventually turn into a business, and you're starting to make the software for that, just a little bit at a time in your spare time. And in my experience, most software developers have a number of these types of projects just sitting on their computer at work or at home and quite often they're not under any source control at all. And there's a number of reasons why that's the case. Let's have a look at a few of them. One common reason is to say I don't have a version control server. This is quite common if you're working on something at home. You've just got one computer that's your development environment and you don't have a server to submit your code to. Or maybe at your place of work there is a central server, but you don't think that the prototype or utility that you're writing belongs on there. You don't want to put it in there alongside the main production code. 
Or maybe you're thinking to yourself, this project isn't going to last very long, I'm only going to work on it for a couple of days, so it's not worth bothering with source control at all. Or maybe you think using version control will slow me down. I just want to code this out as fast as I can. It's quite common when you're prototyping to be rapidly writing new code. You don't want anything to slow you down. Well maybe you think of version control as something that's only really relevant if you've got multiple developers working on a project or maybe you think, I only need version control if I'm branching and for small single developer projects, there's usually no need to maintain multiple different branches of your software. So these are the types of things that are the barriers to people using version control on their single developer projects. Let's now look at the benefits and see why it's worth using even despite these objections that you might have.
-
Benefits of DVCS
So I've selected five benefits of using Distributed Version Control whenever you're working on these single developer projects. Let's go through each of them one by one. First of all, with Distributed Version Control System you don't actually need to set up a separate server. You don't need to have another computer that's going to host the repository, and you don't need to have a process that's always running. You can very easily just create a Distributed Version Control repository on your development machine. This means there is an absolutely minimal barrier to entry. You can try this out right now. Just install a Distributed Version Control System and you can start using it immediately. It means it's a great way to learn Distributed Version Control. There's no risk to trying it out. The second really great benefit is that using Distributed Version Control allows you to back out of mistakes easily. As I've said, when you're prototyping you're often working very fast, you're sometimes making sweeping changes to your application just on a whim. Well if you're using version control, then you can easily back out of any mistakes you make. If you thought something was a good idea and it turns out that it breaks everything, you can easily roll back to the previous version. And Distributed Version Control actually also makes it easy to do all kinds of experiments, even in parallel. You could have a few different ideas that you're trying out and switching between and then only merge them into the main branch when you're happy that those experimental ideas were actually successful. And if you're not using version control at all on these single developer projects, then you'll find very often that when you have made a mistake, you're completely reliant on the undo feature of your development environment to get you out of the mess you've made. The third benefit I want to highlight is that it gives you the ability to really easily pick up where you left off. 
Often I find when I write little utilities to help me with my work, they're useful for a couple of days, and then I forget about them, sometimes for months or even years, and then later I want to pick up and use them again. But what can sometimes happen when you're picking up a project that you haven't worked on for a number of years, is that you don't even know where the latest version of the code is. Maybe you've changed computers since last time you worked on it and now you've got to search through various backup archives or USB keys to try and find the code again. If you've been using Distributed Version Control Systems, you can easily examine every old copy of your application you find to see which one is the latest version. And also, you can easily see what you were doing last by having a look at the commit messages. So keeping these even apparently short-lived projects in source control can actually give you benefits a number of years down the line. The fourth benefit I want to talk about is the ability to synchronize between computers. Just because you're the only developer working on this project, doesn't mean that you're only going to be working on one computer. You might have a desktop and a laptop or one computer in the office and one computer at home, and you want to be able to work on the same source code, but from two different computers. If you're relying on things like putting the code on USB keys and ferrying it between computers, that again can be very unreliable and hard to work out which computer has got the latest version, but with Distributed Version Control it's really easy to backup your work to the cloud from one computer and pull it down from another. And it will even handle conflicts, those times where you made one change in one computer and another on another computer and you need to merge them together. And in our demo shortly, I'll be showing you a little bit about how that works. 
And the final benefit I want to mention is that you can do all this completely free. Most Distributed Version Control Systems are open source, and you can also host your repositories on the internet for free. Of course, there are some commercial tools that you may decide to buy to give you better experience, but again, there's minimal barrier to entry. If you want to try Distributed Version Control you can do so without having to pay out any money.
-
Demo - Introduction
So let's get started with our demo. As I've said, for this demo I'm going to be using Mercurial, and the way that I'm getting Mercurial installed on my PC is by installing this application called TortoiseHg. This includes Mercurial and puts the command line tool onto your path, but it also includes a user interface that allows you to view the history and view the changes, and also it gives you a Windows Explorer shell integration. We'll be looking later in this course at some other graphical tools that you can use and there are many equivalent programs for Git, but this is the one that I use when I'm working with Mercurial, and you can get it here at this website, tortoisehg.bitbucket.org. Let's have a quick look at the application that we're going to be adding to source control for this demo. It's a website that I made for my children to help them with their math homework. It simply allows them to attempt various math problems such as multiplications and additions and it tells them whether they got it right or wrong. Let's see if we can do this one. And I've made this site using ASP.NET MVC, AngularJS, and Bootstrap. And the reason I chose those technologies is not because I'm particularly good at using them, but because I've been watching some Pluralsight videos on how to use them and so I thought this would be a good way of me learning a bit. And because I'm learning as I go, having version control is going to be really useful for me because I'm probably going to make some mistakes along the way and it'll be good to be able to back out of those and to revert to a previous version. And as you can see here, I've already made a bit of a start to this application and it's high time I got it under source control, so let's do that now.
-
Demo - Creating a Repository
So here I've got a command prompt and I've navigated to the folder in which I've put the source code for this website. Now the command line tool for Mercurial is called hg. If I type hg here and press return, it will give me a list of the main available commands and I can get more detailed help on those commands by typing hg help and then the command name. Let's type hg status. As you can see, it complains that I haven't got a repository at this location, and that's because I haven't initialized a new repository yet. The command that I can do that with is called hg init. Now we've got a Mercurial repository at this location. If you're wondering what's actually happened when I did that, let's have a look at the contents of the folder. And you can see here that what's happened is a new folder has appeared. It's called .hg, and this is the folder that's going to store all of Mercurial's information about your repository. In particular, the DAG is contained in here. All the history of every file will be inside this folder. And you wouldn't normally go looking inside this folder, there's rarely a need to make any changes in there. And that's because it's the job of the Mercurial command line tool to manage the contents of what's in that folder. If we wanted to stop this from being a Mercurial repository, then we could do that by simply deleting the .hg folder, but only do that if you really don't mind losing all the history of every file in your repository. All you'll be left with is the version of each file that's contained in your source code folder. Now we haven't actually added any files yet and we can see what files are
-
Demo - Ignore Files
available to be added to our repository by typing the hg status command. As you can see, there's an awful lot of files there and, in fact, I can see some of these files I don't even want to be added to source control. The packages folder, for example, is something that NuGet can restore automatically, so I'd rather exclude that. So, often, one of the first things you'll need to do when you're adding a project to source control is to set up an ignore file. And for Mercurial, the file you need to create is a .hgignore file. Git's got a similar concept called a .gitignore file. Let's create one now. Now Mercurial actually gives us a choice of two different syntaxes for this file. I like the one that's called glob. I'm going to tell it that I don't want the packages directory to be included, also the bin and obj directories. I also don't want to include any user specific files. And also, I want to exclude the database files from my App_Data folder. Let's save that and run the status command again. As you can see now, we have a smaller list of files that it needs to add. There's still quite a lot here because the ASP.NET MVC template creates quite a lot of files for you by default, but I'm happy for all of these to be added into my repository. And the way we do this with Mercurial, is we call the hg add command. As you can see, all of these files currently have a ? next to them, and that's because Mercurial doesn't know whether we want them to be included in the repository or not. If I say hg add, that's going to tell it that I want all of these files to be added to my repository. However, they haven't been added yet. All that's happened is I've marked them for inclusion next time I commit. If I run status again, you'll see that now they've all got an A next to them. That means they're ready to be added when we do a commit. Before we do our commit, I want to show you two easier ways of setting up your .hgignore file. 
The first is to simply do a web search for one that someone's already created. For example, if I search for hgignore Visual Studio, then this Stack Overflow question has got a good example .hgignore file that you could use, and you can do the same for .gitignore files, although I tend to prefer to create my own, so that I know what I'm excluding and I'm only leaving out the things that I really don't want in my repository. But there's also another way. I can launch TortoiseHg's visual tool, which I can do with the thg command. And this brings up something that allows me to have a look at the files that are staged for commit. And we'll also be using this application later to look at our history. So let's quickly create a temporary file that we don't want included in our repository. When we come back into TortoiseHg Workbench, we can look down and we'll see that temporary file is now shown in the list with a ?. It doesn't know whether it should be added to the repository or not. What I can do is right-click it and click Ignore and it will give me the option to add a new filter to my ignore file. I could ignore exactly this file name or I could change it to ignore all .tmp files. So now when I refresh the file list, that tmp file will disappear because it knows that I don't want to include it in the repository. So you may find that an easy way of creating your ignore file if you don't like typing it directly.
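As a rough sketch of the steps described so far, the following creates a repository, writes an ignore file in glob syntax, and marks the remaining files for addition. The scratch folder, the file names, and the exact ignore patterns are assumptions made for illustration; the patterns used in the demo are not shown in the transcript.

```shell
# Hypothetical scratch project; folder and file names are placeholders.
project="$(mktemp -d)"
cd "$project"
mkdir -p bin packages
echo "compiled output" > bin/app.dll
echo "restored package" > packages/lib.txt
echo "source code" > main.txt

hg init                      # creates the .hg folder that holds the history

# Ignore patterns in glob syntax, one per line (assumed patterns).
cat > .hgignore <<'EOF'
syntax: glob
packages/*
bin/*
obj/*
*.user
*.tmp
EOF

hg status                    # untracked files are listed with a ?
hg add                       # mark every untracked, non-ignored file for the next commit
status_after_add="$(hg status)"
echo "$status_after_add"     # the remaining files are now listed with an A
```

Nothing is stored in the history at this point; the files are only marked for inclusion in the next commit.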
-
Demo - Making a Commit
So we're ready to make our first commit, and we do that with the hg commit command. Hg commit, and I can use the -m parameter to pass my commit message. And I'll just call this the "initial version." Now all of those files that were marked for addition have been added to the first commit in our DAG. If we type hg log, you'll see that we've got one change set. You can see its hash here, you can see the commit message that I added, the date and time. It's also got my user name and email address, and that's something you can easily set up in Mercurial using the TortoiseHg application. If we hg status now, it shows nothing, because I haven't made any more modifications since I last did a commit. Let's make some changes to this application and then we'll do another commit. So I've made a very simple change to this application. I've allowed you to select what problem types you can be shown in the quiz. Let's go back to our command prompt and see if we can commit these changes. The first thing I'm going to do is type hg status again. And you can see here that it's noticed that I've modified two files, that's what the M stands for. Now with Mercurial, I don't need to do anything else, I can just do a commit immediately. Git is slightly different. Git wouldn't assume that I wanted to necessarily commit those changes on my next commit, so I would have to call git add again and tell it that I want those two changes to be committed. But let's just add these changes as a new commit. Now if we do an hg log again, we'll see my second commit. It's got a different hash and it's got information about my commit message and the date and time that I did it on. If I want to see in a bit more detail what files are actually changed, then the TortoiseHg GUI application is going to be helpful for me. Let's launch that up again. 
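The commit cycle just described might look like this at the command line. This is a minimal sketch in a scratch repository: the file name, the commit messages, and the HGUSER identity are placeholders rather than the values used in the demo.

```shell
export HGUSER="Demo User <demo@example.com>"   # placeholder identity for this sketch
repo="$(mktemp -d)"
cd "$repo"
hg init

echo "first version" > quiz.txt
hg add quiz.txt
hg commit -m "initial version"        # -m supplies the commit message
hg log                                # one change set: hash, user, date, message

echo "second version" > quiz.txt      # edit a file that is already tracked
hg status                             # the modified file is listed with an M
hg commit -m "select problem types"   # no extra add step needed for modified files

commit_count="$(hg log --template '{rev}\n' | wc -l | tr -d ' ')"
echo "$commit_count"
```

In Git, as the narration notes, the modified file would need a `git add` before the second `git commit`.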
Here you can see it's showing the two commits that I've already made, the initial version with lots of files being added, and then my second commit with two modifications. And if I click on these two files that I modified, I can see the difference, the lines that I've added and the lines that I've modified. And now I've made another change. I've added two new problem types to NumberMaker. I've added multiplication and _____ Half problem types. Let's go back to our command line and commit these changes. Again, as usual, I'll type hg status to see what changes I've made. You can see that I've modified four files as part of doing this. And there's also a new file that I've added called SunProvider, and Mercurial has put the ? against that, because I haven't explicitly said that I want to add it to my repository. So let's use the hg add command again. Calling hg add with no parameters tells Mercurial to just add all of the untracked files, so it adds SunProvider. Now if I do hg status again, I'll see that all five files are now going to be part of my next commit. And if we do another hg log, we'll see that now my DAG has got three nodes in it, we've got three revisions. Now that we've got three commits in our repository,
-
Demo - Navigating the DAG
let's make use of one of the powerful features of Distributed Version Control to quickly go back and look at the repository as it was in a previous state. And to do that in Mercurial, I use the hg update command, and I need to tell it which revision I want to go back to. Now I could type the hash in, so if I wanted to go back to the beginning I would need to type in 16c97a5, so on, but actually Mercurial gives each change set its own simple identifier starting from 0. So if I want to go back to the very first one, I can just type hg update 0. If I do this, it's going to set the code in my local folder to the state that it was in at the time of the initial revision. So if I go back into Visual Studio, Visual Studio will notice that a number of files have been modified. I need to tell it to reload. Now when I run, we'll see that we're right back to the original version again. Probably the first time you do this it can make you feel a little bit nervous, what's happened to my latest code. Well don't worry, it's completely safe in that .hg folder and we can get back to it with another hg update command. Let's go back to the latest revision again, and let's go back to Visual Studio and make sure we've got everything back. Reload everything again, and compile and run. And as you can see, we're right back up to date. This is one of the commands that is slightly different in Git. Git uses a command called checkout and has a slightly different way of telling it where you want to move to in the repository.
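Moving backwards and forwards through the history can be sketched as follows. The repository, file name, and contents are hypothetical; only `hg update` and the local revision numbers starting from 0 come from the narration.

```shell
export HGUSER="Demo User <demo@example.com>"   # placeholder identity for this sketch
repo="$(mktemp -d)"
cd "$repo"
hg init

echo "version zero" > notes.txt
hg add notes.txt
hg commit -m "initial version"        # local revision 0
echo "version one" > notes.txt
hg commit -m "second change"          # local revision 1

hg update 0                           # working folder now matches revision 0
at_rev0="$(cat notes.txt)"
hg update                             # no argument: back to the latest change set
at_tip="$(cat notes.txt)"
echo "$at_rev0 / $at_tip"
```

The later revision is never lost while the working folder sits at revision 0; it stays in the .hg folder until the next update brings it back.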
-
Demo - Creating a BitBucket Repository
So the situation we're now in is that we've got our source code and all of the previous versions all stored locally on our PC, so we can easily go back to any previous point in time. But that hasn't really solved our backup problem, because if anything happens to this computer, we'll lose not only our latest version, but all the backup versions as well. Now you may well have a backup program that's backing up your hard disk, but what I want to show you is how you can synchronize this repository to one that's hosted on the web. And to do that, we're going to make use of a site called Bitbucket. The nice thing about Bitbucket is it allows you to have an unlimited number of private code repositories, either in Git or Mercurial. So let's create a repository on Bitbucket to store this application. If I click Create Repository, give it a name, I'm going to leave it as a private repository, and for this example, we're using Mercurial. And Bitbucket gives us an option for when we've got an existing project that needs to be pushed up. And here it gives me the command that I need to use to send my code up to this repository. I do an hg push and then the URL of this repository. So let's do that now. Mercurial compares the DAGs both on my local machine and on the internet, and it realizes that I've got three change sets on mine that need to be pushed, so it sends them up to Bitbucket. Let's have a look on Bitbucket and see if those changes made it up there. And here you can see my three changes and the messages and the times of them. If I click on the hash for one of these changes, it'll give me more information about what files were changed. And it gives me a really nice way of viewing the _____ diffs as well. Now it would be quite inconvenient to remember that URL every time we needed to push, so Mercurial allows us to store it as the default location to push to, and we do that by editing a file called hgrc that lives inside that .hg directory. Let's do that now. 
Now when I do an hg push, it knows where to push to, and this time there are no changes that I needed to push. I can also store my password in that hgrc file if I don't want to enter it every time. Git has a similar concept, but actually makes it even easier for you because Git has a command where you add what it calls a Remote and you tell it the URL of where you want to push to.
-
Demo - Working on a Second Computer
Now one of the great benefits of having done this is that it allows us to work on this project on another computer. So let's show how we would do that. Now I'm not actually going to switch to a different computer, but let's create a new folder and let's clone that repository into it as though we were working on a different computer. So on computer 2, we would do an hg clone, and then we'd need that URL of our Bitbucket repository. Because it's a private repository, I do need my password, even to be able to clone it. However, if it was a public repository, anyone could clone it if they had the URL. Let's do hg log and just check that all of those three commits are in this new repository. And we can see here that they're all present and correct. So let's load this up into Visual Studio. And this is going to be a very good test to see if we got our .hgignore file right, because if there were any critical files that we forgot to include within our repository, then this one isn't going to build. If we have a look at what files are in this folder, for example, you'll see that the packages directory isn't there. So let's run this and see if it works. And as you can see, NuGet is running to get all of the NuGet packages that are missing, because they weren't included in our repository. And finally it's loaded up, and as you can see, it all seems to be present and correct. So let's make another change to our application on this computer, and we're going to add a new problem type to NumberMaker for rounding to the nearest 10. Now let's do an hg status. As expected, we can see we've changed two files. So let's commit those changes. And I'll include in the message that we're doing this on computer 2. Now currently that commit is only on computer 2's repository. It's not on Bitbucket and it's not on our original computer's repository. To get it all the way back to our original computer, we first need to push those changes to Bitbucket. 
So let's do hg push, and even though I haven't edited the hgrc file on this repository, it knows where it needs to push to, because it knows where it originally cloned from, and the default push location is where you cloned from. And you can see we've pushed one change set up to Bitbucket. Let's go look on the Bitbucket site and see that change set. And here we can see the change from computer 2 has been pushed to Bitbucket. So now back on our original computer, how do we get that change that we've pushed to Bitbucket back into our repository? We need to do an hg pull. Again, because I edited the hgrc file, it knows where it needs to pull from. So when I do this, we should expect one commit to be pulled down and one new node to be added to our local DAG. And as you can see, that's exactly what we got. We got one new node. If we do an hg log, you'll see that now we've got the commit that was made on computer 2. One thing that Mercurial doesn't do by default on the hg pull command, is actually update you so that your working folder contains the code from that latest commit. And that's because the pull command might actually require a merge to be done. In this case it doesn't, because we hadn't made any changes on this computer in the meantime. So what I can do is call hg update to move to that new commit that we just pulled down. And you can see here that those two files were updated locally for me. So we've run through most of the main commands of Distributed Version Control. We've seen how we can add files and commit them, we've seen how we can navigate back in time through the repository, we've seen how we can push to an external repository and pull from it. The only thing we've not really looked at is merging, but I'm going to save that for the next module where we're talking about using Distributed Version Control in open source projects. But hopefully you've seen, it really is very easy to get started with Distributed Version Control Systems.
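The round trip between two computers can be tried out without a hosting account. In this sketch a local folder named `hosted` stands in for the Bitbucket repository, and `computer1` and `computer2` are two clones of it; all names, file contents, and the HGUSER identity are assumptions for illustration.

```shell
export HGUSER="Demo User <demo@example.com>"   # placeholder identity for this sketch
work="$(mktemp -d)"
cd "$work"

hg init hosted                        # stand-in for the hosted repository
hg clone hosted computer1
hg clone hosted computer2

cd "$work/computer1"
echo "initial" > app.txt
hg add app.txt
hg commit -m "initial version"
hg push                               # default target is where we cloned from

cd "$work/computer2"
hg pull                               # adds the change set to the local DAG only
hg update                             # now the working folder gets the file
echo "rounding to the nearest 10" >> app.txt
hg commit -m "add rounding problem type on computer 2"
hg push

cd "$work/computer1"
hg pull                               # one new change set arrives
before_update="$(cat app.txt)"        # working folder is still at the old version
hg update                             # move to the change set we just pulled
after_update="$(cat app.txt)"
echo "$after_update"
```

The two captured values show the point made in the narration: `hg pull` alone leaves the working folder untouched until `hg update` is run.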
-
Mercurial and Git Command Recap
Let's just recap the commands we looked at in the demo. I started off by installing Mercurial, and to do that I installed TortoiseHg. If you wanted to use Git, you'd use Msysgit. To create a new repository, we used hg init, and the Git command is git init. To find out which files in our working folder had been modified, added or deleted, we used the hg status command, and Git has a git status command. To tell Mercurial that we didn't want certain files to be included in our repository, we created a .hgignore file. And with Git, you'd create a .gitignore file. To access the GUI that showed us the history of our commits, we used the TortoiseHg Workbench, accessed with thg. With Git, there's one called gitk that you can use. In fact, there's quite a number of visual tools that are available for both Mercurial and Git and we will be seeing more of them later in this course. To tell Mercurial that I want to add new files to the repository, I use hg add. Git also has a command called git add, which not only tells it that you want to add files to the repository, but you also use it when you're telling it you've changed a file and you want that to be committed on the next commit. When you're ready to commit your changes, it's hg commit or git commit. To look at the changes that are already in your repository, you can use hg log, git log, and if you want to move backwards to a previous commit, you can use hg update or git checkout. To clone a repository that's on the web, such as one at Bitbucket, we used hg clone and the git equivalent is git clone. To pull down changes from a repository that's on the web, we used hg pull, and the git equivalent of that is git fetch. Git does have a git pull command which does a few additional things for you to save you a few steps. In this demo, we haven't yet shown how you do a merge, but in Mercurial it's hg merge and in Git it's git merge. 
So if after you did your hg pull you found you had two heads in your repository, then you would have needed to call hg merge. And finally, to push your changes to a repository, it's hg push and Git, again, uses the same terminology, git push. And as you can see from this table, although Mercurial and Git do have different ways of working in some cases, the basic concepts are actually very similar and if you've used one, you should be able to make the transition to use the other one without too much difficulty.
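For comparison, here is a rough Git version of the same workflow, with the Mercurial command from the recap noted beside each step. A local bare repository stands in for a hosted one, and the folder names, file contents, and identity settings are placeholders.

```shell
work="$(mktemp -d)"
cd "$work"
git init --bare hosted.git            # stand-in for a hosted repository
git clone hosted.git computer1        # hg clone  -> git clone
cd computer1
git config user.name "Demo User"      # placeholder identity for this sketch
git config user.email "demo@example.com"

echo "initial" > app.txt
git status                            # hg status -> git status
git add app.txt                       # hg add    -> git add
git commit -m "initial version"       # hg commit -> git commit
branch="$(git rev-parse --abbrev-ref HEAD)"   # name of the default branch
git push origin "$branch"             # hg push   -> git push

git clone "$work/hosted.git" "$work/computer2"

echo "changed" > app.txt
git add app.txt                       # Git needs git add for modified files too
git commit -m "second change"
git log --oneline                     # hg log    -> git log
git checkout HEAD~1                   # hg update -> git checkout (the first commit)
git checkout "$branch"                # back to the latest commit
git push origin "$branch"

cd "$work/computer2"
git fetch origin                      # hg pull   -> git fetch
git merge "origin/$branch"            # hg merge  -> git merge (a fast-forward here)
merged_content="$(cat app.txt)"
echo "$merged_content"
```

Running `git pull` in place of the last fetch and merge would combine the two steps, which is the shortcut the recap refers to.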
-
Module Summary
So in this module, I've tried to make the case for why you should use Distributed Version Control on single developer projects. It allows you to roll back mistakes, to experiment on branches, to synchronize your work between two computers, and to pick up from where you left off. And using Distributed Version Control for single developer projects, such as projects like my NumberMaker one, is a great way to learn how to use the tools. It's quite easy to get started and you're not having to deal with some of the more advanced concepts that you'd need to deal with if you were working in a large team. And we also saw how Bitbucket gives you free backup for your repository in the cloud. And, in fact, there's a wide variety of choices available to you for backing up your repositories to the cloud. Some of them are free and some of them are paid for. In the next module, we're going to see how Distributed Version Control Systems are really useful for open source projects.
-
DVCS for Open Source Projects
Module Introduction
In this module, we're going to be looking at using Distributed Version Control Systems with open source projects. We're going to start off by covering what the benefits of using Distributed Version Control with open source projects are, and we'll look from two perspectives. The benefits for the owners of the open source project and also the benefits for people who are making use of it and maybe making contributions. We'll also be looking at the workflow for making a contribution to an open source project, and we'll do that in two ways. We'll look at some DAG diagrams again like we did in an earlier module, showing what's going on when somebody makes a contribution to an open source project. And we'll also do a hands-on demo using GitHub as our example. We'll actually make a contribution to a GitHub project and I'll walk you through the whole process, both from the contributor's point of view and also from the project owner's point of view. And in this module we'll also talk a little bit more about merging and, in particular, we'll look at three types of merge. There's fast-forward merges, regular merges, and we'll also see what rebasing is. Now if you've made use of any open source projects, it's quite likely you already know that Distributed Version Control Systems are very big in the world of open source software. We've already mentioned GitHub, which is one of the largest open source project-hosting sites. And as its name suggests, it only allows you to use Git as the version control system for those projects. But if you look at some of the other popular open source hosting sites such as Google Code, Sourceforge, CodePlex or Bitbucket, you'll find that all of them allow you to use Git or other Distributed Version Control Systems such as Mercurial. And the trend seems to be more and more for new projects to use Distributed Version Control Systems instead of centralized, which used to be the most popular choice for open source projects. 
And as well as being used as the version control system for lots of open source projects, the tools themselves, such as the Git and Mercurial command line tools are also open source. In fact, almost all of the Distributed Version Control Systems that I know of are at least in part open source. But why has Distributed Version Control become so popular? What benefits does it offer that have caused so many open source project owners to switch to it?
-
Owner and Contributor Requirements
Well, to answer that question, let's start by thinking about open source project hosting from the perspective of the owners of those projects. What do they want from a version control system? Well one important thing is control over who has access to the source code repository. In the centralized version control systems this was about who has commit access, who's allowed to actually commit to your repository. Often on small open source projects, there's only one person who can and that's the project owner. Maybe larger open source projects would allow a team of trusted individuals to have commit access. And with a Distributed Version Control System, the question is who has push access? Who's allowed to push from their repository into the master repository? Owners of open source projects also want to be able to accept contributions from the community. And they'd like to be able to accept those contributions even from people who they haven't granted commit or push access to. But before accepting those contributions, there are a number of checks that need to take place. Is the code of sufficient quality? Does it meet the project coding standards? Does the feature that's being contributed even belong in this project? So owners need to be able to either accept or reject these contributions and often the owner of an open source project doesn't have a lot of free time to deal with these contributions. So what they really want is for the workflow of accepting a contribution to be as simple as possible. They want to be able to review the code easily, to see the changes that the person has made. I maintain a number of open source projects and occasionally people have sent me a zipped up copy of the entire code base and left me to try and work out what it is that they've changed. That's not very easy to work with. So you want to be able to quickly look at a _____ diff of the files that have changed. 
Also, if you're accepting the contribution, you'll need to merge it into the latest version of the code. This is important because sometimes the owner doesn't get around to dealing with a contribution until sometime later and there may need to be merges because of other changes made to the same files in the meantime. Again, you want your source control tool to do this hard work for you rather than relying on your own ability to manually merge someone else's changes into the latest version. And the users of an open source project also have requirements for what they want from the version control system. First and foremost, they want easy access to the source code. It should be really easy to get the latest version, and not only to get it once, but to be able to keep it up to date. Distributed Version Control allows us to do this with an initial clone, and then pulls to get the changes that have been made since we did the original clone, to keep our copy up to date. Also, sometimes the users of an open source library want to make a contribution, and we want to make it as easy as possible for them to do that. There should be a simple workflow that allows them to submit just their changes in a way that's easily accepted. But also it needs to be possible, if the owner of the open source project has rejected your changes for whatever reason, to improve them and then resubmit your patch or your contribution after that initial code review. Another really useful feature would be for you to be able to independently submit multiple bug fixes or features to the open source project, and this allows them to be accepted individually. So if, for example, you were to add 3 or 4 new features and the owner of the project wanted 2 of them, but not the others, it would be a pain if they were all rolled in together. And Distributed Version Control Systems make this easy by allowing you to easily create many branches and you could make each change on a different branch.
Finally, an important thing for anyone who's making a contribution to an open source project is that they get the proper credit for their contribution. With a centralized version control system, if you don't give someone commit access, then their name won't be listed as the committer of the change that they've made. They would have to be credited in the check-in comment instead, or something like that. One of the very interesting features of Distributed Version Control Systems is that it's possible for you to actually have your commit appear in the master repository without ever having been given push access. And we'll see how that's possible in a demo later on. Finally, sometimes you want to make changes to an open source project that the owner isn't interested in having at all. Maybe they don't like your idea or maybe they've finished working on this project and they're not accepting any more contributions. So what would be really useful in that case is the ability to create your own customized personal version, sometimes called a "fork", that you can add your own features into. But if the original project is still getting more bug fixes and more features added to it, what you'd also like is the ability to pull those into your personal fork, so that your fork contains all the latest and greatest features from the original project, and in addition, it's got your changes as well. And that's something that Distributed Version Control Systems also support really well, the ability for you to maintain your own fork and still be able to pull in changes from the original source. So what do Distributed Version Control Systems offer that make them particularly suitable for open source projects?
Well the three things that we're going to be looking at in this module are forks, which are basically clones of the repository, Pull Requests, which are where you ask the owner of the open source project to accept the changes you've made on your fork in the official repository, and we'll also see how the flexibility of Branching in Distributed Version Control Systems is also a real benefit when making contributions to open source software.
-
Contribution Workflow
So let's work through the first stage of contributing to an open source project. Let's imagine somebody is using an open source project and they've identified a contribution that they'd like to make to it. How do they do that? Well the first step is to create what we've called a "fork." And all a fork really is, is a publicly visible clone. GitHub makes it particularly easy to create a fork. You can go to any open source project and create your own fork. Now the benefit of creating a fork is that you own it. You as a contributor are able to push to your fork, even though you're not allowed to push to the original open source project because you don't have the rights to. Now what you can do is clone your fork locally onto your computer. You make the changes that you need to make, again, locally on your computer, and preferably you do this on a branch for the feature or bug fix that you're working on. Then you push your changes to your personal fork, so now they're visible on the web to anyone else who's interested in them. Finally, as a contributor you'd issue a pull request and this sends a message to the owner of the open source project that you've got some changes in your fork that you'd like them to pull into their repository, so they can be shared with everyone else. Let's have a look at that workflow, but using DAG diagrams to help us see what's going on. I've divided this diagram up into four quadrants. In the top left, we've got the repository that's on GitHub, the publicly visible one. This is, if you like, the official repository that everyone knows is the place to get the latest and greatest version from. And also in the bottom left I've shown what the project's owner might have on their local machine, which would just be a clone of what's on the official repository, unless they happen to be doing some work in progress.
When you create a fork, all you're doing initially is cloning the official repository and also hosting it on GitHub, but on your personal account so you have access rights to this fork, whereas you don't have access rights to the official repository. Now when you want to do some work, you can clone your fork locally and then you'd make your changes. So here I've shown we've added another commit and rather than doing it on the master branch, we've created a new branch called Issue123, which would be a good name for the branch if there was a GitHub Issue with that number that you were addressing. But of course, this is just on your local computer. You want to make it publicly visible, so you push and, again, you push to your fork, which you have push rights for. You can't push to the official repository, you wouldn't be allowed. Now we're ready to issue our pull request, but before we look at how the owner would handle the pull request, let's work through the process on GitHub and actually make a contribution to a GitHub project and issue a pull request. And for this example, I've taken that demo project that we made in the last module
-
Demo - Creating a Fork
called NumberMaker, and I've converted it to a Git repository and I've put it on GitHub. And it's on my personal GitHub account, so it's Mark Heath/number-maker. Now obviously, I already have permission to push to this repository, so we need another GitHub user who can make a contribution, someone who doesn't have push access to this repository. So I've created another user, and let's sign in as that user. (typing) This user is just called acontributor. And the important button is here in the top right. This is the button that's going to allow me to create a fork, so let's create a fork now. And here we have it, we've got a fork of number-maker, and you can see here that this fork is on acontributor's GitHub page, and it shows us that it's been forked from Mark Heath's number-maker. So the next step is that our contributor needs to take a clone of this fork and do some work on it. Now the URL that we need to use to make that clone is available to us here, down in the bottom right. So on the contributor's computer, we do a git clone and then we put that URL in. (typing) And now we're ready to make our changes. Now before we start making any changes to the code, remember we said we were going to try to do this on a branch and we said that we'd like to name the branch after the issue we're fixing. Well there aren't any issues on this project yet, so let's create one. And rather than creating the issue on my fork, I'm actually going to create the issue on the original open source project, because I want to tell the owner that I want to contribute this particular feature. So we go to the original one and we click Issues, and I'm going to create a New Issue. I want to add a new type of problem, which is rounding to the nearest whole number. And this is something that as a contributor I am allowed to do on the main site. I can create Issues, I just can't push commits to the code base. And we see here that this is Issue #1.
So let's create a branch in order to work on this feature. I'm going to use a command called git checkout to do this. And the -b switch means create a new branch and switch to that branch. And I'm going to call the branch issue-1. And you can see here that it tells me it's created the new branch and it's switched to it. So any commits I now make will be on that branch. So let's make my changes, and we should really test it to check it worked. Okay, that's good. Let's commit our changes. We'll use git status to see what changes have been made. We can see that it's picked up that I've modified two files. Unlike Mercurial though, Git doesn't assume that when I do my next commit I want to include these changes, so I either need to use the git add command or when I do a git commit, I need to use the -a switch to say I want to include those modified files, so let's do that. (typing) And so you can see here that we've added a new commit on the issue-1 branch that adds our new feature. Now I want to push those changes up to my fork. Remember here I'm the contributor, so I'm not able to push it to the main repository, but I should be able to push it to my personal fork. So to push our changes to our fork, we're going to use the git push command.
-
Demo - Pushing Changes and Issuing Pull Requests
And we're going to tell it that we want to push to origin, which is Git's name for where we cloned from, and we're going to push the issue-1 branch. (typing) And we can see that we've pushed our commit up to our fork and it's created a new branch on our fork. Let's go to the website and have a look. Here's our contributor's number-maker fork. Let's have a look at the recent commits. Now our new commit isn't visible here and that's because we're still looking at the master branch. Let's look down at the branches that are available in this repository. Here we can see there's an issue-1 branch, so let's switch to that branch. And now we can see that here's the feature that our contributor has added, adding the decimal rounding feature. Now we're ready to make our pull request, and to do that we need to go back to the original open source repository that we forked from, so we'll go to Mark Heath/number-maker. And you'll see that GitHub has really conveniently given us an easy-to-click button to do the pull request. It's noticed that I recently pushed the issue-1 branch, so it knows that I might want to do a pull request. So let's click this button. It gives me the option here to write a few additional notes for the owner of the open source project. And GitHub has also detected that there's going to be no problems merging this pull request, because nothing else has happened on the master repository in the meantime. So let's send the pull request. And we can see here on the main repository, that now there is an open pull request from acontributor, and we can have a look at the conversation around it. This is an opportunity for the owner of this project, and for other people who are interested, to comment on this pull request. What's more, it's really easy to have a look at what changes were made. You can look at the commits. In our case we've only made one commit, but there's no reason why you can't actually do several commits before you issue a pull request.
And we can also look at the Files Changed. We can see in this case we were just adding some new lines of code. So it makes it really easy for the owner, or for anyone else who's interested, to review this pull request and to comment on any problems that they notice in the code.
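Here's a rough sketch of the contributor's side of that demo as shell commands. It's only an illustration under some assumptions of mine: it runs entirely on local disk, so official.git and fork.git are made-up stand-ins for the two GitHub repositories, cloning official.git into fork.git plays the part of the Fork button, and the file name problems.txt and the commit messages are invented. The pull request itself happens on the website, so it isn't shown.

```shell
#!/bin/sh
# Contributor workflow sketch: fork, clone, branch, commit, push.
# All repository and file names below are hypothetical stand-ins.
set -eu
demo=$(mktemp -d)
cd "$demo"

# Seed an "official" repository with one commit on master.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master   # use the branch name from the course
git config user.name "Owner"
git config user.email "owner@example.com"
echo "addition" > problems.txt
git add problems.txt
git commit -q -m "Initial commit"
cd ..
git clone -q --bare seed official.git          # stands in for the owner's GitHub repository
git clone -q --bare official.git fork.git      # stands in for pressing the Fork button

# The contributor clones the fork, not the official repository.
git clone -q fork.git contributor
cd contributor
git config user.name "acontributor"
git config user.email "acontributor@example.com"

git checkout -q -b issue-1                 # -b: create the branch and switch to it
echo "rounding" >> problems.txt            # make the change
git status --short                         # lists problems.txt as modified
git commit -q -a -m "Add rounding problems"   # -a: include modified tracked files
git push -q origin issue-1                 # origin is the fork we cloned from
```

After this runs, the issue-1 branch exists on the fork but not on the official repository, which is exactly the state you need to be in before issuing a pull request.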
-
Accepting Pull Request Workflow
So we've got to the point now where the contributor has created a fork, they've pushed their changes to their fork, and they've issued a pull request. Now it's up to the owner of the open source project to deal with that pull request in some way. The first thing they'll probably want to do is perform a code review. And as we saw, the GitHub website makes that really easy. They may want to there and then request that further improvements are made before they're willing to accept this contribution or they may even tell the contributor that they're not interested in this particular feature being added. However, they may like what they see and want to include it in the main project. Now as we'll see shortly, GitHub does allow us to streamline this process, but what they would do if they wanted to properly check out this contribution is pull from the contributor's fork onto their local machine. This allows them to actually test the code that's been contributed. And because they've pulled into a branch on their local repository, they can discard these changes if they decide they don't want them. Assuming that the owner does like what they see, they then perform a merge to merge the changes that have been made by the contributor into the master branch. Again, this is all happening on the owner's own local repository, it's not happened on the official public repository yet. So now, once that's been done successfully, the owner can push the merged changes up to the master repository and everyone can benefit from this contribution. Let's have a look at that again using our DAG diagrams. So as you remember, the state we left these diagrams in was with the contributor's Issue123 fix pushed to their fork and a pull request issued. So the next step, as we've said, is for the owner to pull that pull request into their local repository like this. And it's at this point that they do that testing and verification, and if they're happy, they perform a merge.
And in this particular case, which we're going to look at in a bit more detail in a moment, we see that merge just means that the master branch moves along one node in the DAG to be at the same point that the Issue123 branch is at. And, in fact, now the Issue123 branch becomes redundant so we can get rid of that label. Now we can push that to the official repository, which is up on GitHub. And that's all there is to it. Now we can see that the contributor's commit has made it all the way up to the official repository.
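The owner's side of that workflow might look something like the following sketch. Again this is my own illustration on local disk: official.git and fork.git are hypothetical stand-ins for the GitHub repositories, app.txt is an invented file, and the grep line is just a placeholder for whatever real testing the owner would do. Fetching the contributor's branch into a local branch of the same name is one way of doing the "pull into a branch" step, not the only one.

```shell
#!/bin/sh
# Owner workflow sketch: fetch the contribution, test it, merge it, push it.
# All repository and file names below are hypothetical stand-ins.
set -eu
demo=$(mktemp -d)
cd "$demo"

# --- Setup: an official repository, a fork, and a contributed commit ---
git init -q owner
cd owner
git symbolic-ref HEAD refs/heads/master
git config user.name "Owner"
git config user.email "owner@example.com"
echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"
cd ..
git clone -q --bare owner official.git
git clone -q --bare official.git fork.git
git clone -q fork.git contributor
cd contributor
git config user.name "acontributor"
git config user.email "acontributor@example.com"
git checkout -q -b Issue123
echo "fix" >> app.txt
git commit -q -a -m "Fix issue 123"
git push -q origin Issue123

# --- The owner handles the pull request on their local machine ---
cd ../owner
git remote add origin "$demo/official.git"
git fetch -q "$demo/fork.git" Issue123:Issue123   # bring the branch in from the fork
git checkout -q Issue123
grep -q "fix" app.txt                 # placeholder for real testing and review
git checkout -q master
git merge -q --ff-only Issue123       # master simply moves along to the new node
git branch -q -d Issue123             # the label is now redundant
git push -q origin master             # publish to the official repository
```

Because the contribution was only ever on its own local branch until the merge, the owner could have deleted that branch instead and left master untouched.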
-
Three Types of Merging
Now before we go any further, I just want to talk about a number of different types of merge that might be used at this point when you're merging in a contribution from another user. There are actually three types of merge that are commonly used. The first I've just called a regular merge, the second is what we just saw an example of, which Git calls a fast-forward merge, and the third is called Rebase. Let's briefly look at what each of these three types of merge does. So, an example of when you might need a regular merge is where you've got a DAG that looks like this. The master branch is in one place, and then you've got another branch that you want to merge in with master. A regular merge just creates a new node on the DAG that's got two parents. One parent is the original master branch, the other parent is the feature branch, and after the merge we move the master branch pointer to point at the merge node. And this type of merge is what will normally happen where some other changes have been going on in the background while you were working on your feature branch. And, in fact, after you've done this merge you can delete the old branch. It's just a label, you don't need it anymore. But there is, in fact, a much simpler type of merge, which is often called a fast-forward merge. And this is the case that we looked at in our example. The new work that's been done on the feature branch has been completed before anyone else has done any further work on the master branch. Now it would be possible to just do a regular merge at this point, and create a new node that has two parents, one being the feature branch and the other being the original location of the master branch, but this is, in fact, unnecessary. If you look carefully at this diagram, you'll see that we don't actually really need a merge at all. We didn't really need to create the Issue123 branch because it turns out that we never did anything else on the master branch in the meantime.
So actually, we could perform a merge simply by moving the master label onto node number 4, like so. And again, now we don't need the Issue 123 branch anymore, so we can delete it. And that's all there is to a fast-forward merge. So if you see Git saying that it's done a fast-forward merge, that's what it means. It's just moved the branch pointer along, it didn't need to create a new commit in your repository. The third type of merge is the most controversial because it's actually changing the history of your repository. Imagine we have this situation again that we looked at for the regular merge. You've made a couple of changes on a feature branch and in the meantime, different changes have been made on the master branch. What a rebase does, instead of creating a new merge node, actually recreates the changes that you made against node number 2, but makes them against the master branch. Now the reason that I've not left the numbers 3 and 4 on these two new nodes is that they are actually different nodes, they've got a different hash. And the original nodes 3 and 4 can just be thrown away and forgotten. What this means is that now we're in a situation where we can do a fast-forward merge. So the master branch pointer can now point to the end. The reason people like this is because it leaves you with a nice linear history. Although, of course, this history can be slightly misleading, because it looks like these commits were all made in chronological order, when in fact, commits 3 prime and 4 prime could've actually been made before 5 and 6. And, in fact, some people when they're doing a rebase, squash all the commits together into one. Like this, for example, we've taken those commits 3 and 4, we've squashed them into one commit that's got a new parent of 6. Rebase is a really nice and powerful feature, but you should only use it if you know what you're doing. In particular, once you've pushed your nodes up onto a public repository, it's too late to rebase them. 
Nodes 3 and 4 can be thrown away, because currently they're only on a local repository. If they've been shared with other users, then it's too late to throw them away and you probably shouldn't be doing a rebase. Git includes rebase, by default, whereas most of the other Distributed Version Control Systems, like Mercurial, are much more conservative about changing history. You have to add the rebase command as an extension.
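To make the three types of merge a bit more concrete, here's a small sketch you could run in a scratch directory. The branch names, file names, and the little commit helper function are all invented for the illustration; the git commands themselves are standard ones.

```shell
#!/bin/sh
# Sketch of a fast-forward merge, a regular merge, and a rebase in one
# throwaway repository. Names and the commit helper are hypothetical.
set -eu
demo=$(mktemp -d)
cd "$demo"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master
git config user.name "Dev"
git config user.email "dev@example.com"

commit() {   # helper: append a line to file $1 and commit with message $2
  echo "$2" >> "$1"
  git add "$1"
  git commit -q -m "$2"
}

commit base.txt "node 1"
commit base.txt "node 2"

# 1. Fast-forward: nothing happened on master while we worked on the branch.
git checkout -q -b Issue123
commit feature.txt "node 3"
git checkout -q master
git merge -q --ff-only Issue123    # only moves the master label, no new commit
git branch -q -d Issue123

# 2. Regular merge: both branches gained commits, so a merge node is needed.
git checkout -q -b feature
commit feature.txt "feature work"
git checkout -q master
commit base.txt "master work"
git merge -q --no-edit feature     # creates a commit with two parents
git branch -q -d feature

# 3. Rebase: recreate local-only commits on top of the latest master.
git checkout -q -b topic
commit topic.txt "topic work"
git checkout -q master
commit base.txt "more master work"
old_tip=$(git rev-parse topic)     # remember the original commit's hash
git checkout -q topic
git rebase -q master               # same change, new parent, new hash
git checkout -q master
git merge -q --ff-only topic       # now a fast-forward is possible
```

Comparing old_tip with the tip of topic after the rebase shows that the rebased commit really is a different node, which is why it's only safe to do this with commits you haven't shared yet.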
-
Demo - Accepting Pull Request
So let's return to our number-maker project, and this time, I'll be putting myself in the shoes of the owner rather than the contributor. So let's log out of GitHub and let's log back in as the owner. And I can see here on my newsfeed that acontributor has opened a pull request to number-maker. And, in fact, GitHub will email the project owner whenever a pull request is issued. So let's have a look at this pull request. Here we can see I can have a look at the information that the contributor provided. I could respond to them and maybe ask for improvements to be made. And as we already saw, I can look at the commits and the changes. But let's imagine for this example, I'm happy with this commit and I want to merge it. This is where GitHub really comes in useful. It can short circuit the process that we just looked at. I don't need to pull this pull request onto my local machine and then push it to the master repository, I can just push this big green button and it will merge the pull request automatically. If I want to do it manually, I can click this link and it will actually run through all of the git commands I need to use to pull in those changes to my local repository, test them, and then push them back up to the master repository. But as I've said, in this case I'm happy to let GitHub do it for me. So let's click merge pull request. I've got the option if I want to, to modify the commit message, I'm just going to confirm it. And I'll leave a message for the contributor. Now let's go and have a look at our repository history. And what's really interesting is that acontributor has got their commit showing up in the history for Mark Heath/number-maker, even though they have no rights to push anything to this repository. And you can also see that the automatic Accept Pull Request button from GitHub actually does a regular merge here. GitHub has actually created a new merge commit on my behalf.
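If you wanted to reproduce roughly what that button did by hand, a sketch might look like this. I'm assuming here that a merge with --no-ff is a fair local approximation of the merge commit GitHub created in the demo; the single local repository, the file name, and the commit messages are made up for the illustration.

```shell
#!/bin/sh
# Sketch: force a regular merge (a merge commit) even where a fast-forward
# would be possible, and check who gets credit for each commit.
# Repository, file, and message names are hypothetical.
set -eu
demo=$(mktemp -d)
cd "$demo"
git init -q number-maker
cd number-maker
git symbolic-ref HEAD refs/heads/master
git config user.name "Owner"
git config user.email "owner@example.com"
echo "addition" > problems.txt
git add problems.txt
git commit -q -m "Initial commit"

# A commit made by the contributor on the issue-1 branch.
git checkout -q -b issue-1
echo "rounding" >> problems.txt
git -c user.name="acontributor" -c user.email="acontributor@example.com" \
    commit -q -a -m "Add rounding problems"

# The owner merges it, asking for a merge commit rather than a fast-forward.
git checkout -q master
git merge -q --no-ff --no-edit issue-1

# Show the author of every commit now in master's history.
git log --format='%an: %s'
```

The merge commit belongs to the owner, but the commit that actually adds the feature still carries the contributor's name.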
-
When Contributions are Rejected
Finally, let's switch back to the contributor and ask what they would do if their pull request wasn't accepted. Well, they've got a number of options. First of all, they might just need to make a few improvements to their code and reissue the pull request. Maybe next time it will be accepted. Or maybe they're just going to abandon that pull request and do some additional changes on a different branch. And that's one of the benefits of using branches. You can just leave them behind if something goes wrong with them. And maybe the owner of the open source project isn't interested in accepting any of your pull requests. Well one of the nice things about Distributed Version Control Systems is that even if the owner isn't going to accept your pull requests, you can still pull from the original repository to keep your fork up to date with all the latest and greatest changes. Let's briefly explore each of those three scenarios using DAG diagrams. So here's the diagram from earlier in this module where the contributor has pushed a change to their fork and has issued a pull request. But let's imagine in this scenario that the owner has rejected that pull request, maybe because it doesn't meet the coding standards of the project. What the contributor would do would be to make another change on the Issue123 branch. And now they would push that change up to their fork and then they can reissue the pull request. As you can see, it's quite straightforward to do this and you may find that you need to make multiple changes before your pull request is accepted. But let's imagine that this pull request doesn't get immediately accepted and the contributor would like to make another contribution, maybe implementing a different feature. What they would do is rather than doing it on the Issue123 branch, they would create another branch and start making changes on that. So here we've got an Issue456 branch.
And when that feature is ready, they can push it to their fork and issue a pull request from the Issue456 branch. And the owner, if they want, can accept just Issue123 or just Issue456 or can accept them both and merge them together. But let's imagine that for whatever reason the owner isn't interested in pulling either of our pull requests, what could we do then? Well to make our diagram a little bit simpler for you to follow, let's just go back to the situation where we've just got one pull request, Issue 123, and the owner has decided they don't want to accept this pull request. What might happen next is the owner of this open source project might make their own change. So here I've shown node 4 appears in the DAG. On our local fork, what we'd like is to have both our own bug fix and this new change that is in the official repository. And so the way we would do that is pull onto our local repository from the GitHub official repository. And we'd end up with a local DAG that looked like this. And what we'd then need to do is do a merge so that our DAG has got one head that's got both the latest change from the official repository and our bug fix on. And if we want to, we can push that up to our publicly visible fork. So that other people can benefit from our enhanced version, even if the original owner doesn't want to accept our change, and we can keep repeating this process as many times as we want to. Whenever new features get added to the official repository, we can pull them into our fork and merge them in to include our own customized changes.
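That last scenario, keeping your own fork up to date with the official repository, can be sketched like this. As before it's a local-disk illustration: official.git and fork.git are stand-ins for the GitHub repositories, the file names are invented, and upstream is just a conventional name I've picked for the extra remote pointing at the official repository.

```shell
#!/bin/sh
# Sketch: pull the owner's new work into a fork that carries our own change.
# Repository names, file names, and the remote name are hypothetical.
set -eu
demo=$(mktemp -d)
cd "$demo"

# --- Setup: official repository, fork, and our unaccepted fix ---
git init -q owner
cd owner
git symbolic-ref HEAD refs/heads/master
git config user.name "Owner"
git config user.email "owner@example.com"
echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"
cd ..
git clone -q --bare owner official.git
git clone -q --bare official.git fork.git
git clone -q fork.git contributor
cd contributor
git config user.name "acontributor"
git config user.email "acontributor@example.com"
git checkout -q -b Issue123
echo "our fix" > fix.txt
git add fix.txt
git commit -q -m "Fix issue 123"
git push -q origin Issue123

# --- The owner makes their own change on the official repository ---
cd ../owner
echo "v2" >> app.txt
git commit -q -a -m "Owner's new change"
git push -q "$demo/official.git" master

# --- We pull that change into our local clone and merge it ---
cd ../contributor
git remote add upstream "$demo/official.git"
git fetch -q upstream                        # now our DAG has two heads
git merge -q --no-edit upstream/master       # one head with both changes
git push -q origin Issue123                  # publish the result to our fork
```

You can repeat the fetch, merge, and push steps every time the official repository moves on, and the fork will keep both the owner's changes and yours.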
-
Module Summary
In this module, I've tried to show you why it is that Distributed Version Control Systems are so popular with open source projects. We've seen that it offers many benefits for the owner of the project. In particular, they can maintain control over who has access to the repository, while at the same time allowing anyone to make contributions. We've also seen how easy it is to code review the contributions that people make and websites like GitHub make this especially easy. And also, it allows the owner to process contributions whenever it's convenient for them. The version control tool is going to handle merging for us, so even if we've moved on the main repository since the contributor issued their pull request, we can still easily merge in their changes. We also saw that there were many benefits for the contributors as well. They get credit for their contribution. Their name appears in the commit history of the official repository. They can easily maintain their own personal fork if they want to, taking the project in new directions and keeping up to date with any changes in the original project. And also, if they want to submit many individual features or bug fixes they can do so individually by making use of how easy it is to branch with Distributed Version Control Systems. Along the way, we had a chance to look briefly at the three types of merge that you might encounter. The regular merge, where you add a new node into your DAG that merges the two branches, the fast-forward merge, where you simply move a branch pointer and don't need to add a new node to your DAG at all, and the rebase, where you actually throw away your original commits and make new ones that make the same changes, but to a different parent node. And this makes your DAG appear as a much more simple linear line instead of having lots of branches and merges in it. Let's move on in the next module to look at how Distributed Version Control Systems can be useful in a commercial environment.
-
DVCS for Commercial Projects
Module Introduction
Hi, my name is Mark Heath, and in this module we'll be looking at using Distributed Version Control Systems with commercial projects. Now in the last two modules we looked at how Distributed Version Control can be really useful for single developer projects and for open source projects. So in this module, we're going to focus particularly on how it's useful in the "Enterprise", particularly when you're working with large teams of developers. It's not uncommon in commercial environments to have 10 or 20 or even more developers, all working together on the same code base, and that can put significant demands on your version control system. For one thing, it's going to need to be really good at merging. Now of course, most commercial development companies have already got a version control system that they're working with, so you may well have a large amount of legacy code already stored in a centralized version control system. And all your build machines, software lifecycle management software, and deployment processes will be set up to work with that centralized version control system. And all the developers will be familiar with how they can get the latest and make commits using that system. So the question I find a lot of companies are asking is, is it really worth the effort to migrate to Distributed Version Control Systems? What benefits are there to making what is potentially quite a disruptive change? Often when you're working on these large projects in the Enterprise, you'll have quite complex branching requirements. For example, you may find that you need to provide Hotfixes for several different versions of your product that different customers are using. Or maybe you have a particular customer who wants their own customized special version of your software and they're willing to pay for it, so you may need to create a branch that just has the changes that they've asked for.
And this is something that Distributed Version Control Systems can really help us with. Now of course, there are companies who are already using Distributed Version Control Systems. I tried to find some statistics on this and perhaps the best set I could come up with came from the Eclipse Community Survey. What it showed was that in 2010, Subversion, which is probably the most popular centralized version control system, was used by 58% of their users, whereas Git, which is the most popular Distributed Version Control System, was used by 13%. But as you can see, in just two years to 2012, the use of Git almost doubled, and that was largely at the expense of centralized version control systems. So you can see from this, an increasing number of companies are beginning to see that there are benefits to using Distributed Version Control Systems instead of centralized. So in this module, I want to highlight seven benefits of Distributed Version Control Systems, which I think are particularly relevant for commercial projects. And these are Little and Often Commits, the use of Personal Branches, the ability to create Ad-Hoc Teams, complete flexibility to implement just about any branching strategy that you can think of, support for disconnected working, the elimination of a "Code Freeze", and also the ability to use it for automated deployment. And I'll explain briefly each of these as we go through this module. But you may also be asking, what's the catch? Are there any gotchas that I need to be aware of? What problems might we run into if we migrate from a centralized version control system to a distributed one? And whilst in this course I've been trying to emphasize how good Distributed Version Control Systems are, there are still some areas which may cause you some difficulties, and so I've selected 7 of those, which again, we'll briefly look at later in this module. 
And these are issues you might run into if you're using very large repositories or putting large files into them, or if you've got a _____ process that needs exclusive file locking. We'll look at the issues of getting a development team up to speed with how to use the tools and the workflows, we'll also see that the open source tools can be quite limited in their server administration and integration with software lifecycle management tools, and what you can do about that. And finally, we'll look at a problem that you might run into if you're trying to make changes to the history of your source control repository.
-
Little and Often Commits
So the first benefit I want to highlight is what I've called Little and Often Commits. With a centralized version control system, the moment at which you do a commit, or a check-in as it's sometimes called, you share it with all the rest of the developers. And that means it's important that you don't commit until you've made sure that the code compiles and is passing all its unit tests. Maybe you need to do some integration tests as well and maybe your process dictates that it should be code reviewed as well. And so with a centralized version control system, what can often happen is that developers can check code out and have it checked out for several days, maybe even weeks or months at a time. And this can cause a number of problems. For one, it can make merges more difficult as the commits are much bigger and contain more changes. And also, if you've had code checked out for several weeks and then you make a change that breaks everything, it can be really hard to roll back because your version control system doesn't have a record of what state the code was in yesterday. So the nice thing about Distributed Version Control Systems is that you can commit whenever you like. Whenever you want to save the work that you've done up to this point, you can do so, but you can delay sharing it with the rest of your development team until you're ready, and that's when you do a push. So this allows you to create what you can think of as save-points, a bit like when you're playing a computer game and you want to save where you're up to so that you can get back to that point if something goes wrong. The ability to do Little and Often commits allows you to very easily back out of the mistakes you make. If you're concerned that this will result in lots and lots of commits being added to the history when you do a push, you could use the rebase command that we talked about in the last module to combine all of these little commits into a single commit, if that's what you'd prefer.
And if you're concerned that these Little and Often commits are only stored on your local developer machine are not being backed up, well they could actually be pushed to the central server onto your own personal branch or maybe to a separate fork that you have for backup purposes.
-
Personal Branches
The second benefit I want to highlight is the ability to create multiple personal branches. With a centralized version control system, branches are very public. They're often represented as a folder on the central server, and they're usually long-lived as well. Once you create a branch, it stays there on the server for the lifetime of the project. So usually what that means with a centralized version control system is that developers won't be allowed to just create branches whenever they like. There'll need to be agreement from the whole team that we really do need another branch to be created. Now with Distributed Version Control Systems, you can still create public long-lived branches if that's what you want. But you can also very easily create local, short-lived branches, a branch just to do a little piece of work like a bug fix or add a new feature, and it's a branch that's just on your local repository. It never actually needs to get pushed to the central repository. In fact, after you've merged it into the main branch on your local repository, you can delete the branch because, as we said earlier in this course, a branch is simply a label pointing at a particular node in the DAG. And this opens up lots of interesting possibilities, such as the ability to create experimental branches. You could do what's sometimes called a spike, where you're just trying out an idea, coding really fast, and just seeing if it will work. Or maybe you've had an idea for some refactoring that you'd like to do and it'll take a little while to complete. Well, you could create a branch to do that on and work at it a little bit at a time. If you do take advantage of Distributed Version Control Systems' ability to let you create personal branches, then it's going to be really useful for you if you need to do a context switch. The classic example is that you're working away on a particular feature and your boss comes to you and asks you to divert all your attention onto fixing a bug.
Now with a centralized version control system, what you'd hope is that that bug can be fixed in a file that you're not working on at the moment so that you can make the modification and check in just that one file, but what if your code is currently in a non-compiling state? Well, what you can do with Distributed Version Control Systems is simply commit all of those changes and then create a new branch, starting from the point before you began work on your feature, to do the bug fix in. And then when you're finished, you can push that bug fix and then change back to the branch which had your work in progress. In fact, if you're using Git, Git has got a really nice feature called stashing that allows you to put the work that you were doing in progress off to one side without even having to commit it into your repository history. Let's just have a quick look at some DAG diagrams to show you how personal branches might be used. So imagine on your computer you've got a copy of the repository and there's the master branch and you want to add a new feature. You might make a couple of commits on a feature branch and then when you need to switch context to fix the bug, you'd go back to the master branch, which is at revision 2, create a new bug fix branch, and do your bug fix work on that. And all the while in the meantime, you may have had a long-running refactoring project going on that you plan to merge in later when it's ready. So as you can see, there's scope for you to create as many of these personal branches as you feel would be useful for whatever you're doing. Of course, if you think this seems rather complicated, you don't have to use it, but it's nice to know that this feature is available for the occasions where you do want to do this.
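If it helps to see the idea that a branch is simply a label pointing at a node in the DAG, here is a minimal sketch in Python. It is a toy model, not a real Git or Mercurial API: the `parents` and `branches` dictionaries and the `create_branch` and `commit` helpers are names made up for illustration. It replays the context switch just described, with a feature branch and a bug fix branch both starting from revision 2.

```python
# Toy model (assumed for illustration, not a real Git or Mercurial API):
# the DAG is a dict of commit id -> parent id, and a branch is a label on a node.
parents = {"1": None, "2": "1"}
branches = {"master": "2"}


def create_branch(name, start):
    """Create a new label pointing at an existing node."""
    branches[name] = start


def commit(branch, commit_id):
    """Add a node on top of the branch tip, then move the label forward."""
    parents[commit_id] = branches[branch]
    branches[branch] = commit_id


# Start the feature from master and make a couple of commits.
create_branch("feature", branches["master"])
commit("feature", "3")
commit("feature", "4")

# Context switch: go back to revision 2 and fix the bug on its own branch.
create_branch("bugfix", "2")
commit("bugfix", "5")

# master has not moved since revision 2, so merging the bug fix
# only needs the master label to move forward to the bug fix tip.
branches["master"] = branches["bugfix"]

# Deleting the branch removes the label only; node 5 stays in the DAG.
del branches["bugfix"]

print(branches)   # {'master': '5', 'feature': '4'}
```

Notice that the feature branch is untouched throughout, so you can switch back to it and carry on once the bug fix has been pushed.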
-
Ad-hoc Teams
The next benefit I want to discuss is the ability to form Ad-hoc Teams really easily. Sometimes when you're working on a large project you want to be able to share your work in progress code with some team members, but not with everybody. You want to work together with a small sub-team on a particular feature. Well, with Distributed Version Control Systems there are actually lots of different ways you can go about this. Probably the most obvious way is to use feature branches. So, for example, if we had a development branch that we were working on and we wanted to add two new features to it, we'd create one branch for feature 1 and a separate branch for feature 2. And then the developers who are working on feature 1 can make commits and push to the feature 1 branch, and the developers working on feature 2 could make commits and push to that, and that way they can share their work with each other without treading on the toes of the developers working on the different features. When a feature is complete, it can be merged back into the development branch and when the second feature is complete, then that can be merged in as well. As well as using feature branches, you can actually also do it using forks and this avoids the need to create any branches on the central server. You use the technique that we discussed earlier in this course of actually setting up clones of your repository or forks and using those as branches. So, for example, if you had your master repository, you could create one fork for feature 1 and a separate fork for feature 2, and then Alice and Bill, who are working on feature 1, can push and pull changes from the feature 1 fork and Carla and Duncan, who are working on feature 2, can push and pull changes from the feature 2 fork. And only when they've completed those features would they push their changes to the master repository. And, in fact, there is even a third way that you can go about this.
Most Distributed Version Control Systems actually make it possible for you to pull directly from another developer's machine, from their local repository. With Mercurial, you can use the hg serve command and the Git equivalent is called git daemon. So with this setup, we've got Alice and Bill who are going to be working on feature 1 and Carla and Duncan who are going to be working on feature 2, but when they want to share their code with each other, they can simply connect to each other's development machines and pull each other's changes from there. So as you can see, there are lots of possibilities for how you can allow sub-teams to work together on features without getting in each other's way.
-
Branching Flexibility
The next benefit I want to talk about is the complete branching flexibility that you have with Distributed Version Control. As I've said, commercial projects quite often have complicated branching requirements and with Distributed Version Control Systems you can easily branch from any node in your history. One of the features that I haven't yet mentioned of Distributed Version Control Systems is called labels. And you use labels to mark key revisions in your history. For example, you typically label every time you release a new version of your software. In fact, these labels are very similar to the branch labels that we talked about, with the one difference being that labels don't move from node to node like branch labels do. And so, if for example, you needed to Hotfix a version of your software that you released some time ago, you'd simply go back to the label that relates to that released version and create a new branch starting from that point. So, for example, here we've got our repository history and we've released version 1 and version 2 of our software, but we haven't actually created any branches yet. If we find we need to hotfix version 1, then we can simply go back and create a new branch coming off version 1, the version 1 hotfix branch, and we can easily do the same for version 2. It's also very easy to merge any two branches together. They don't have to be directly related to each other like they sometimes have to be with centralized version control systems, and this allows you to do useful things like creating integration branches. What do I mean by integration branch? Well, for example, imagine you've got a branch for feature 1 and a branch for feature 2, and you've also got a development branch, and you're interested in seeing whether these features are going to be able to integrate well into your development branch. 
Well, what you can actually do is create an integration branch that integrates feature 2 with development, for example, and give that to your testing team to see if they can find some problems. And you could do the same with feature 1 and development, and you could even, if you wanted to, see what feature 1 and feature 2 merged together would be like in another integration branch. So as you can see, you've got great flexibility, not only in the way you branch, but in the way you choose to merge as well. Now you may think that that diagram I just showed looks rather complicated and you could envision some developers getting confused as to what branch they were supposed to be working on and, when it comes time to merge, what branch they ought to be merging into. And so just the same as with centralized version control systems, it's really important that you have a well-understood branching and merging strategy for your project. One of the branching strategies that's gained a lot of traction amongst users of Distributed Version Control Systems is known as Git-flow. And you can read about Git-flow at this URL _____. Git-flow was devised by Vincent Driessen. As you can see here in this diagram he's created, there's a development branch and you can have feature branches that come off your development branch. You also have release branches and hotfix branches. And the rest of this article contains guidance on how you would go about creating these branches and merging them. So if you're looking for a good branching strategy to get you started, this would be a great place to begin.
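To make the difference between release labels (what Git and Mercurial call tags) and branch labels concrete, here is a small Python sketch of the hotfix scenario. Again this is only a toy model under assumed names (`parents`, `branches`, `tags`, `commit`), not anything the tools themselves expose. The point it shows is that committing moves a branch label forward, while the label marking the release stays exactly where it was.

```python
# Toy model (assumed for illustration): commit id -> parent id,
# plus two kinds of labels pointing at nodes.
parents = {"1": None, "2": "1", "3": "2"}
branches = {"master": "3"}          # branch labels move as you commit
tags = {"v1": "1", "v2": "3"}       # release labels stay where they were put


def commit(branch, commit_id):
    """Add a node on top of the branch tip, then move the branch label."""
    parents[commit_id] = branches[branch]
    branches[branch] = commit_id


# Hotfix version 1: create a branch starting at the node labelled v1.
branches["v1-hotfix"] = tags["v1"]
commit("v1-hotfix", "4")

print(branches["v1-hotfix"])   # 4  (the branch label moved)
print(tags["v1"])              # 1  (the release label did not)
print(parents["4"])            # 1  (the hotfix is based on the release)
```

The same steps would work for a version 2 hotfix, starting from the node labelled v2 instead.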
-
Disconnected Working
The next benefit I want to speak about is Distributed Version Control Systems' support for disconnected working. For example, it's sometimes necessary to work remotely. Maybe you need to work at home on a laptop and you haven't got access to the company network from your home. With a centralized version control system, that would mean that you couldn't make any commits at all, but with Distributed Version Control you can quite easily do a lot of work, make several different commits, and then push them when you get back into the office. Or maybe you're working on a customer site, actually trying to troubleshoot a problem by making changes to the code while you're there. In that case, as well, it would be really useful to be able to commit the changes you make so that when you get back to the office you can easily merge your changes into the main code base. Another scenario that's becoming increasingly common is the use of outsourced teams, maybe located in a different country. And very often if you're using centralized version control systems, it's not practical to set up a scenario where the outsourced team is able to directly connect to your central server, but with Distributed Version Control, that wouldn't even be necessary. They can work in their own fork, making their own commits, and creating their own branches as they need. And then you can pull their changes into your master repository at convenient times, and either perform the merge yourself or request that they perform the merge on their end, merging their work in with what you've been doing. And you may notice that this is actually very similar to the model that we talked about for open source projects. The outsourced team becomes a little bit like a contributor to an open source project and they're issuing pull requests whenever they've completed a section of work. In fact, some companies have taken this open source model and used it internally inside their company.
So, for example, if you've got a new starter or a junior developer, you may not want them necessarily to just be able to commit directly to your central repository. You may want all of their work to be code reviewed before it's committed. Well, what you could do is simply not give them push rights to the central repository and require that when they've done a piece of work they issue a pull request to someone else who'll code review it, and if it's accepted they will push it to the central repository.
-
Eliminate Code Freeze
If you've worked at a company that uses centralized version control systems, you may have encountered what's known as a "Code Freeze". A "code freeze" is where you tell all the developers that they're not allowed to make any commits to the central repository. Sometimes this is done because you want to stabilize the code base ready for a new release of your software. You only want critical bug fixes to be committed at that stage. Or maybe when a waterfall development strategy is being used, you want to prevent the developers from making any code changes before the requirements or design phase has been completed. And this can be quite frustrating for developers because it means that there's a large period of time where they're unable to do any work. They may have lots of things that they could be getting on with, but because there's a code freeze they can't actually do any work. Well, Distributed Version Control Systems eliminate the problems of code freeze because you can just create local personal branches that we talked about earlier. You can do your bug fixes, refactorings, or experiments in those while the code freeze is on, and when the code freeze is lifted, then you can push any of those changes that are ready, up to the central repository.
-
Automated Deployment
The final benefit I want to mention is actually an example of thinking outside the box. And this is making use of a Distributed Version Control System for deploying your software. And this is particularly popular with website deployment. What you would do is set up a situation in which you're able to push your code to a staging or to a production environment, maybe by pushing to a fork that's hosted on another computer or to a particular branch. And what you have is something that's waiting for changes to be pushed to those forks or branches, and when they are, it kicks off a build process and makes sure that the website that you've deployed goes live on that server. And a couple of examples of commercial systems that are making use of this are Heroku and Windows Azure. And this allows developers really easily to get their changes pushed up into a staging or production environment, and also to simply roll back if something goes wrong. It means that you can make use of the version control tools that you already know to do your deployments, rather than having to learn a separate tool.
-
Large Repositories
So far, I've painted a very rosy picture of the benefits of Distributed Version Control Systems in a commercial environment. So does that mean I think it's a complete no-brainer for you to simply upgrade your version control systems to Distributed Version Control tomorrow? Well, there are a few things that you do need to be aware of before you make the transition, and so let's have a quick look at seven of them. First of all, it's not uncommon for companies that use centralized version control systems to have extremely large repositories. Maybe they put hundreds of individual projects inside a single repository. And often, when you've got this situation, developers will make use of the ability in centralized version control systems to do a partial get latest. Maybe you select just a single folder and say get me the latest of that folder and you just work on that part of the repository and never pull down the rest of it onto your development machine. With Distributed Version Control, when you do a clone you get everything. So if you've got really, really large repositories, then you're forcing developers to get the entire source code tree onto their computer. Also, branching works slightly differently between distributed and centralized version control systems. In a centralized version control system, you can branch sub-folders of your repository, whereas in Distributed Version Control Systems branches exist at a global level, so that might actually result in rather confusing branch naming. If you've got multiple different projects in a single repository and they've all got different reasons to branch, you'd have one branch for version 1 of one project and another version 1 branch for a different project, both in the same repository, and that could get quite confusing.
So Distributed Version Control Systems tend to encourage more modularity. Instead of having one huge repository with many projects in, you create more repositories, smaller ones, just for single projects. If you need to batch many projects up together into one big combined product, then you can make use of features called submodules or subrepositories that allow you to have a top-level repository that consists of a number of smaller project-level repositories. Of course, doing that would involve a little bit of extra complexity in managing the top-level repository. But in my opinion, if you adopt this approach of having more, but smaller repositories, it will actually improve the architecture of your software because it will be composed of smaller, modular, and loosely-coupled components.
-
Large Files
Another thing that can happen in a centralized version control system is that people start adding extremely large individual files into their source control repository, and this really should be avoided if possible with Distributed Version Control Systems. So if you've got large amounts of test data, you've got binary dependencies that maybe are frequently changing or very large in size, or you've got a lot of graphics or video content that goes with your project, you should probably consider hosting them elsewhere if possible. Remember, with a Distributed Version Control System, when you do that initial clone, you're not just downloading the latest set of files, but you're downloading every file that's ever been added to the repository. So even if somebody added a huge file and then realized their mistake and removed it in a later commit, that would still be part of the history that needs to be downloaded on a clone. So one option that you might consider, if you're using Visual Studio for example, is making use of the NuGet technology to manage your dependencies, rather than committing them directly into the source control repository.
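A rough way to picture why a deleted large file still costs you on every clone is the following Python sketch. It is a deliberately simplified model with made-up file names and sizes, and it ignores the compression and delta storage that real tools apply, so treat the numbers as illustrative only.

```python
# Toy history (made-up file names and sizes in MB; compression is ignored).
history = [
    {"add": {"main.c": 1}, "remove": []},
    {"add": {"demo-video.mp4": 900}, "remove": []},   # added by mistake
    {"add": {}, "remove": ["demo-video.mp4"]},        # removed in a later commit
]


def working_copy_size(commits):
    """Size of the latest set of files only."""
    files = {}
    for c in commits:
        files.update(c["add"])
        for name in c["remove"]:
            files.pop(name, None)
    return sum(files.values())


def clone_size(commits):
    """Size of everything ever added, which is what a full clone has to fetch."""
    return sum(size for c in commits for size in c["add"].values())


print(working_copy_size(history))   # 1
print(clone_size(history))          # 901
```

The latest files come to 1 MB, but in this model a clone still has to transfer 901 MB, because the video remains part of the history.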
-
Exclusive Locks
Another feature present in many centralized version control systems is exclusive file locking. And this is useful because not every type of file can be easily merged. For example, if you have got graphics stored in your repository or maybe scripts for a Windows Installer product or help files, then if two people make changes to them at the same time, the merge algorithms may fail to merge them or may claim that they've successfully merged them, but produce files that are useless. And with centralized version control systems, the solution to this was to set these files up to be exclusively locked, so only one person could be working on them at one time. However, Distributed Version Control Systems typically don't support exclusive locking. So if this is something you're used to and it's part of your process, you will find it's missing with Distributed Version Control. There are some exceptions to this rule. For example, SourceGear Veracity does support locking files.
-
Learning Curve
The next challenge I want to talk about is the need to make sure that all your developers are up to speed on how to use Distributed Version Control Systems before you make the transition. In fact, whatever version control system you're using, it's really important that everyone understands how to use it. So you need to make sure everyone has learned the workflow, which with Distributed Version Control includes pulling and pushing, which aren't part of a centralized version control system, and also how to do branching and merging. Sometimes I find developers are very fearful of merging because maybe they've had a bad experience of it in the past, and so a bit of training on how to do merges will go a long way to helping them feel comfortable with using a Distributed Version control system. Another thing it's really important to make sure everyone has learned is how to get out of any mistakes they've made. Almost everyone who uses version control will at some point realize that they forgot to include a file that they should have included or included a file that they shouldn't have included or made a check-in into the wrong branch, or done a merge that went wrong. And it's important to know how you get out of those mistakes once you've made them. So I'd say any time invested in learning version control is time well-spent. Don't be afraid to take half a day or a day or even longer out of your normal development time to make sure everybody understands the tools that they're going to be using.
-
Server Administration and Software Lifecycle Management
If you are going to be using Distributed Version Control, you'll want a central server and you'll need somebody to be able to administer that server. They'll need to be able to create new repositories for new projects or forks of existing projects and they'll also need to be able to manage user permissions to give people read access or push access and create new users. Now one of the issues you'll find with the open source Distributed Version Control tools is that they typically don't come with a fully-featured web management interface that would let you easily do this kind of thing. So it can actually seem a little bit unfriendly compared to what companies are used to with their existing centralized version control systems. And a related problem is integration with your software lifecycle management. In a commercial environment, it's very likely that you'll want some kind of defect tracking system, either one built in to the source control or the ability to integrate with the one you've already got, and the open source tools like Git and Mercurial don't come with their own defect tracking system. Similarly, you might want code review support like the code review support that we saw on the GitHub website. And again, that doesn't come out of the box with Git or Mercurial. So what's the solution to these limitations of not particularly friendly server administration and integration with the software lifecycle management tools? We have actually got a couple of options. One is that you could buy into a cloud service. For example, GitHub not only hosts open source projects, but actually allows you to pay them to host your repositories privately. And if you did so as a company, you could then make use of their issue tracking and their code review features for your internal projects. Likewise, the Bitbucket site that we saw in a demo earlier in this course offers paid options that companies can make use of.
Another option is Fog Creek's Kiln product, which is actually built on top of both Git and Mercurial. And even Visual Studio now has the ability to host your source code in the cloud as a Git repository. Of course, not all companies will be happy with the idea of storing their source code in the cloud, so there are commercial products that you can host on-premises, which add extra value and features that you might have been used to with your centralized version control system. Examples of this that we've mentioned already briefly in this course are SourceGear Veracity and Plastic SCM, and in fact, Team Foundation Server, which I've just said can host Git repositories, can also be used on-premises. So you can see, there are a number of options available to you if you as a company want to make use of Distributed Version Control Systems and have a richer experience for administering the server and for integrating with other parts of the software lifecycle management tools.
-
Immutable History
And I'll round off this limitation section with one final gotcha that you might run into at some point. It's sometimes said that with Distributed Version Control Systems history is immutable, you can't change it, but that's not entirely true. It is actually possible to rewrite history. So if you accidentally checked in a file that you now wish had never been checked in at all, you could rewrite the entire history of your repository with that particular file excluded. But what you need to be aware of is that whenever you change history, you're also going to end up changing the hashes, so what you get, in terms of the DAG, is essentially a set of completely new nodes. And so having created new nodes, you need to delete the old nodes. And that would be easy in theory, unless somebody else had pulled a copy of those nodes that you wanted to delete, and when that happens, there's a danger that they can make a comeback. Let me show you what I mean with a DAG diagram. Imagine we've got a repository here with a master branch and we've made commit number 3, which has got some kind of a problem with it. We wish we hadn't made this commit and we'd like to roll it back and get rid of it, and one way we can get rid of it is to simply move the master branch pointer back to node number 2. Now node number 3 might still be in our DAG, but nobody's going to reference it and nobody's going to make any more commits based on it, and it can be safely deleted. But the trouble comes if somebody else on our development team actually pulled from our repository before we moved the master branch back. Now they've got node number 3 on their repository and if they make a commit, its parent will be node number 3. So when they push their changes, not only does their new change come, node number 4, but node number 3, which we tried to delete, makes a comeback. Now we're in a difficult position.
We can't just move the master branch pointer back again, because we want to keep the change we made in revision 4 and throw away the change we made in revision 3. So what you could do is do what's sometimes called a cherry pick to take the changes that were made in revision 4 and apply them to revision 2. So now we've got a new node, which I've called 4 prime, which has got the changes that were made in revision 4, but they're now made against revision 2. And now we have two nodes that we're hoping that nobody else is going to do any work based on. So as you can see from this example, it is possible to get rid of commits that you don't want from your DAG, but you do have to make sure that you coordinate yourselves well with the other developers, or otherwise you'll find that things that you were trying to get rid of keep making a comeback. I've seen that happen a number of times in open source projects where people have tried to get rid of a particular commit from the history and it keeps coming back time and again, because other people pulled from their repository and based their own work on that.
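The comeback problem can be sketched in a few lines of Python. This is a simplified model rather than how Git or Mercurial really compute their hashes (the real tools hash more than this, such as the author, date, and file contents), and the `commit_id` and `push` helpers are hypothetical names. What it does capture is that a node's id depends on its parent's id, so the same change applied to a different parent is a different node, and that a push brings along any ancestors the receiving repository is missing.

```python
import hashlib


def commit_id(parent_id, change):
    """Derive a node id from the parent's id plus the change (a simplification)."""
    return hashlib.sha1(f"{parent_id}:{change}".encode()).hexdigest()


n1 = commit_id("", "change 1")
n2 = commit_id(n1, "change 2")
n3 = commit_id(n2, "change 3 (unwanted)")
n4 = commit_id(n3, "change 4")            # a colleague's commit, based on node 3
parents = {n1: None, n2: n1, n3: n2, n4: n3}

# Our repository after moving master back to node 2 and deleting node 3.
ours = {n1, n2}


def push(tip, target):
    """Copy the tip and every ancestor that the target repository is missing."""
    node = tip
    while node is not None and node not in target:
        target.add(node)
        node = parents[node]


push(n4, ours)
print(n3 in ours)          # True: node 3 has made a comeback

# Cherry pick: the same change applied on top of node 2 is a brand new node.
n4_prime = commit_id(n2, "change 4")
parents[n4_prime] = n2
print(n4_prime != n4)      # True: 4 prime has a different id from 4
```

After the cherry pick, nodes 3 and 4 are the two nodes you're hoping nobody else bases any work on, which is why the coordination with the other developers matters.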
-
Module Summary
So let's briefly summarize what we learned in this module. We looked first at seven benefits of using Distributed Version Control Systems with commercial projects. And you can see from this list it brings a lot to the table and there's a lot to commend it for use in the commercial environment. But we also looked at a number of the problems that you might run into. In particular, we saw that some of the ways that you might have been used to working with centralized version control systems don't translate well to Distributed Version Control Systems, so you need to be prepared for a slight change in the way you work. And we also saw that you may wish to augment the open source tools with some commercial tools that add in some of the features that may be missing from the open source versions of Distributed Version Control Systems.
-
Taking it Further
Module Introduction
Choosing a DVCS
Working from the Command Line
Graphical Client Apps
IDE Integration
Resources for Mastering Git and Mercurial