“Alone we can do so little; together we can do so much.” – Helen Keller
Ever heard of version control? I certainly hadn’t when I started programming. I was giddily coding away at various data science tasks without realizing the importance of writing efficient code and the need to manage my overall codebase. It was only when I got into the industry that I understood how critical version control is.
The very first thing I learned was the value of Git and GitHub. While I always knew about them (I’ve often used them for cloning open source code from Google Research and other top data science organizations), I never really understood their real purpose.
The beauty of version control was a revelation to me. I could create a remote project and have all my team members work on different features in parallel, yet independently, and still have stable, running code at the end of the day. That left me spellbound. Suddenly I had found a panacea for all the problems that I used to face while collaborating on a project.
I’m really excited to share this article with you on the ins and outs of Git and GitHub. We’ll cover how both these tools work and how you can use them to make your data science projects easier to track. As a data scientist, you need to have a solid grasp of these tools. Not only will you face interview questions on them, but you’ll rely a lot on Git and GitHub in your data science role.
If you are collaborating with fellow data scientists on a project (which you will, more often than not), there will be times when you have to update a piece of code or a function. This is where Git & GitHub will help you to create a better workflow. Whatever changes you make, you can easily make them available to all the collaborators. And if you make a mistake, you can always roll back to a previous version.
Let’s dive into the world of Git and GitHub!
Git is a widely used Version Control System (VCS) that lets you keep track of all the modifications you make to your code. This means that if a new feature is causing any errors, you can easily roll back to a previous version.
But Git isn’t just any VCS; it’s a Distributed VCS. This means that every collaborator has a complete copy of the project, including its full history of changes, on their local machine. So people can work on different features of the project without having to communicate with the server hosting the remote version of the project. This is super efficient, and you can easily merge any changes made to the project with the remote copy.
Since it is written in the C language, speed and performance have been built into Git right from its inception. Besides this, Git gives you intermediate checkpoints, such as the staging area, before any change is actually saved to the project.
If you want to know more about Git, just head over to their official website and find answers to all your questions!
GitHub is a widely used platform for version control that uses Git at its core. It lets you host the remote version of your project from where all the collaborators can have access to it. Not just your own team members, but any member of GitHub can contribute to your code (that is of course if you choose to accept the changes made). We will discuss all of this in detail in this article.
GitHub is like a social platform where you can find a plethora of open-source projects along with their code. All the new and emerging technologies can be found on this platform. You can collaborate on amazing projects and have discussions on your contributions! This is the best open-source platform you’ll find and is a data scientist’s dream!
You can check out our monthly collection of the best open-source data science projects on GitHub here.
There is surely a lot that you can do on GitHub, so let’s get started.
A repository, or repo, is a folder that contains all the project files and the history of the revisions made to each file. There are two repositories that you will work with throughout the lifetime of your project: the remote repo, which is hosted on a server such as GitHub and shared with all the collaborators, and the local repo, which lives on your own machine.
git clone <Repo-URL>
Cloning means creating a copy of the remote repo on your local machine. Now you can make changes to the project on your local machine.
git commit -m "<commit message>"
When you commit a change, you save the changes you made to your files in the repo. When working with Git from your local machine, using the commit command will save your files in the local repo. To make those changes in the remote repo, you will use the push command.
git push origin <branch>
The push command allows you to transfer all the changes in your local repo to the remote repo. Now all your fellow developers will have access to the changes you made and they can update their local repositories.
git pull <remote-repo>
If push meant transferring code to the remote repo, the Pull command allows you to transfer all the changes from the remote repo to your local repo. So any changes that your fellow developer pushed to the remote repo, you can transfer them to your local repo using the pull command.
There are a few more terms that you will need to know but they are not required right now. We will cover them in detail in the latter part of this article. For now, let’s create our very first GitHub repository!
The first thing you should do is download Git on your system. Kudos to those who already came prepared! All the others, head over to the official Git website and download Git for your operating system. It is pretty straightforward and you will be done in no time.
Now, Git programs are designed to work with a Unix style command-line environment. Linux and macOS already have an interface for this in their native command-line terminals. So all the git commands that I will be using in this article should work fine with their terminals.
Windows, however, has a completely different command-line interface called Command Prompt which is not a Unix style command-line environment. So what do we do? Well, don’t worry, you already installed Git Bash when you installed Git.
Git Bash is a command-line interface for Windows that emulates a Unix-style command-line environment. So, as long as you are running Git commands inside Git Bash, you should be fine.
I will be using the terms terminal or command line interchangeably to refer to the command-line environment, for macOS and Linux users, and Git Bash, for Windows users.
The next thing I want you to do is to create a project folder where you will save your local repository. Then follow these steps to open your terminal inside that project folder:
You are ready to start working with Git!
A repository, or repo, is a folder that contains all the project files and the revisions made to each file. The project directory you made above isn’t a repository yet. A repo needs to be initialized using the git init command.
Once you do that, a hidden .git folder will be created inside your project/working directory. This is your local Git repo. If you don’t see it, it’s probably because it is hidden and you need to change some properties in the settings to make it visible. But don’t worry, it is still there even if you don’t see it. Git will store all the changes you make to your project files inside this folder.
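Here is a minimal sketch of that step. The folder name hello-world and the temporary location are placeholders that I picked for illustration, so swap in your own project folder:

```shell
# Create a throwaway project folder (placeholder path) and move into it.
project_dir="$(mktemp -d)/hello-world"
mkdir -p "$project_dir"
cd "$project_dir"

# Turn the folder into a Git repository.
git init

# The hidden .git folder only shows up when hidden files are listed.
ls -a
```

Listing with ls -a is the quickest way to confirm from the terminal that the .git folder exists, even when your file manager hides it.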
But before we make any changes to our repo, we want Git to know who we are. We can do that using the git config command. Using this we can set the user name and user email address. Now every time we make a commit, this information is saved by Git so that you know who made that change.
git config --global user.name "<user-name>"
git config --global user.email "<user-email>"
If you use the --global option, Git will save this information for every repository that you, as the current user, work with on this system. So, you can leave it out if you only want the information to be saved for this particular repository.
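To see the repository-level variant in action, here is a small sketch. The name and email are made-up placeholders, and the repo is created in a temporary folder so that nothing in your real setup is touched:

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q

# Without --global, the identity is stored in this repository's .git/config only.
git config user.name "Demo User"            # placeholder name
git config user.email "demo@example.com"    # placeholder email

# Read the values back; --local restricts the lookup to this repository.
git config --local user.name
git config --local user.email
```

A repository-level value takes precedence over a global one, which is handy when one project needs a different email address from the rest.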
Now, you can start building your project and adding files to your project directory. My project is about writing “Hello world” in different programming languages. So, I am going to add a few files to my project directory for that purpose. You should do the same for your project directory too as it is empty right now.
Once you have created and added files to your project directory, you can add them to your local Git repository using git add <file-name>
If you want to add more than one file at a time, list them all, as in git add <file-1> <file-2>, or use git add . to add everything in the current directory.
Are we done? Wasn’t the command for committing changes something else?
You are absolutely right! We haven’t added the files to the local repo yet. We have just told Git that some changes were made and we want to save these changes in the next commit/save. As of now, these “added” files are in a place called the Staging area.
The staging area is an intermediate place between your working directory and local Git repo where any changes that you made can be reviewed before you actually commit them to the repo.
You can check the state of the staging area using git status
You will see a message telling you that there are changes that need to be committed. All the changes staged for the next commit will be shown here.
Now you can take a snapshot of all the changes you made, which are reflected in the staging area, and save them in the Git repo using git commit -m "<commit message>"
Your commit message should be terse but lucid so that fellow developers can easily determine why you made that change.
Once you do that, you will get the following message:
Now all your files have been committed to the Git repo. You can check the status of the staging area and this time it will reflect that there is nothing to commit:
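The whole add, status, and commit cycle can be sketched as follows. The file names, the commit message, and the identity are placeholders of my own, and the repo lives in a temporary folder:

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

# Create two files in the working directory.
echo 'print("Hello world")' > hello.py
echo 'echo "Hello world"' > hello.sh

git status --short      # both files are listed as untracked (??)
git add hello.py hello.sh
git status --short      # both files are now staged (A)

git commit -q -m "Add Hello world in Python and shell"
git status --short      # prints nothing: there is nothing left to commit
```

I used git status --short here only to keep the output compact; the plain git status command gives the same information in a longer, more descriptive form.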
Henceforth, whenever you make a change to any of the files, like a bug fix or anything, and add it to the staging area, Git will know which files the changes were made to and will record the entire content of the file in the commit.
I am going to make a few changes to one of my code files and commit it after the changes are made:
As you can see, when I added my files again, Git was smart enough to know that I only made changes to a single file which is reflected in its output. The commit that I made after this only updated that specific file and not the others.
The whole point of version control is to keep a record of the changes that were made. You can do this using the git log command. It gives you a complete view of all the commits that were made in reverse chronological order:
As you can see, my name, email address, timestamp, and the commit message are all reflected in the logs. This makes it fairly easy to track who made what changes and determine when the bug was first introduced in the project.
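If you want to reproduce this yourself, here is a rough sketch that makes two commits, the second touching a single file, and then prints a compact log. The files, messages, and identity are placeholders, and the one-line log format is just my choice for readability:

```shell
log_dir="$(mktemp -d)"
cd "$log_dir"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

echo 'print("Hello world")' > hello.py
echo 'echo "Hello world"' > hello.sh
git add .
git commit -q -m "Add Hello world scripts"

# Change only one of the two files, then stage everything again.
echo 'print("Hello again")' >> hello.py
git add .
git status --short      # only hello.py is listed as modified (M)
git commit -q -m "Print a second greeting in Python"

# Newest commit first: short hash, author, email, date, and message.
git log --pretty=format:'%h | %an <%ae> | %ad | %s' --date=short
```

Plain git log prints the same fields in a multi-line layout, so use whichever form you find easier to scan.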
So far, we have been working on a local repository, meaning all the changes that were made were tracked on our local machine and our fellow developers were not able to see them yet. To make that happen, you need to create a remote repository, one that can be accessed from anywhere and by anyone. This is where GitHub comes in!
Your remote repository lives on the GitHub server and anybody can access it. So let’s create a remote repository!
Note: You would need to create a GitHub account for this.
2. On the next page, give a name to your repository and a short description. Once you are done, click Create repository:
Bravo! You just created your very first remote repository.
GitHub made private repositories free for individuals in January 2019. And in April 2020, GitHub made private repositories free for all, including organizations.
Once you have created your GitHub repository, GitHub will prompt you to upload your files to the remote repository:
As we have already created our local repository, we first need to connect it to the remote repo. We can do this using git remote add origin <URL>
The command creates a connection between the local and remote repos. Once we do that, we no longer have to refer to the remote repo by the URL every time. We can just use the name origin to refer to the remote repo.
Now that the remote repo has been added, all you have to do is push your commits from local repo to the remote repo so that all your fellow developers can view the changes.
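You transfer the local repo to the remote repo on the GitHub server using git push -u origin <branch-name>
Here, origin is the name we gave to the remote repository. The -u option sets it as the upstream for this branch, so later pushes and pulls from this branch can leave out the remote and branch names.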
Now all our fellow collaborators have access to this newly updated repository.
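To try the remote workflow without touching GitHub, here is a sketch that uses a bare repository on the local disk as a stand-in for the GitHub repo. That stand-in, the paths, and the identity are assumptions of mine; with a real project you would pass the GitHub URL to git remote add instead:

```shell
work="$(mktemp -d)"

# Stand-in for the GitHub repository: a bare repo on the local disk.
git init -q --bare "$work/remote.git"

# The local repository with one commit.
mkdir "$work/local"
cd "$work/local"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"
echo 'print("Hello world")' > hello.py
git add hello.py
git commit -q -m "Add Hello world in Python"

# Name the remote "origin"; with GitHub this would be the repository URL.
git remote add origin "$work/remote.git"
git remote -v

# Push the current branch and remember origin as its upstream (-u).
branch="$(git symbolic-ref --short HEAD)"   # master or main, depending on your Git setup
git push -q -u origin "$branch"
```

Looking up the branch name first keeps the sketch working whether your Git installation calls the default branch master or main.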
We saw how we can create our own local repository and push it to GitHub. Now, if you weren’t the one who created this repo, you would have to make a copy of this on your local machine.
Once a project has been uploaded onto the remote repository, developers can get their own copy of this repo using the git clone command. Developers can then work on their local copy of the repo, make changes to it, and upload it to the remote repo.
It is very easy to clone remote repos from GitHub. Just head over to whichever repository you want to clone and click the Clone or download button to copy the URL:
You could also download the repo directly from here as a Zip file, but we will use Git to download it from our terminal.
To clone it to your local machine, you need to head over to your terminal and provide the URL in the following command:
git clone <URL>
This will make a local copy of the repo on your local machine.
You will notice that a folder with the same name as the remote repository is created inside your current directory. This folder is your project/working directory containing the local Git repo. You need to navigate inside this directory to make changes to your local repo.
Use cd <repository-name>/ to navigate inside your working directory:
Now that you are inside your working directory, you can make any changes you like. If you like those changes and think they will solve a bug or add a really cool feature to the project, just commit them to your local repo first, and then push them to the remote repo on GitHub so that your fellow developers are up to date on the new change.
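The clone, change, commit, and push loop can be sketched like this. Since I can’t assume network access, a local bare repository named hello-world.git plays the role of the GitHub repo; that name, the seed commit, and both identities are hypothetical:

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"

# Stand-in for the repository on GitHub: a bare repo seeded with one commit.
git init -q --bare hello-world.git
git init -q seed
cd seed
git config user.name "Original Author"      # placeholder identity
git config user.email "author@example.com"
echo 'print("Hello world")' > hello.py
git add hello.py
git commit -q -m "Add Hello world in Python"
git push -q "$sandbox/hello-world.git" HEAD
cd "$sandbox"

# Clone it; Git creates a folder named after the repository.
git clone -q "$sandbox/hello-world.git"
cd hello-world/
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

# Make a change, commit it locally, then push it to the remote.
echo 'print("Hello from a clone")' >> hello.py
git add hello.py
git commit -q -m "Add a second greeting"
git push -q origin HEAD
```

Notice that the clone already knows the remote as origin, so there is no need to run git remote add after cloning.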
Branching is one of the most fundamental features of Git. Branches let you work on new features or bug fixes without touching the main project code that is present on the master branch.
A branch is essentially a reference to a commit. Any changes that you make on the branch build on that commit from that point onwards. Even if you mess up on the branch, rest assured that it will do no harm to your actual working code.
Branches let you experiment with new features or ideas and even let you create multiple branches in parallel to experiment with different features. Any number of people can work on a particular branch and you can have as many branches as you want.
You can create a new branch using git branch. This will contain all the files that are present in the master branch. You can make changes to files on this branch. Once you are confident that your code is working well, you can integrate it with the master branch using the git merge command.
So far we have been working on the master branch. Now we will see how to create a new branch using git branch <branch-name>
This will create a new branch and you can check that out using the git branch command:
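Git Bash will always keep you up to date on which branch you are working on by showing its name in parentheses at the end of the prompt, and in any terminal the git branch command marks the current branch with an asterisk. This will help ensure you are working on the right branch.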
Let us now work in a different branch using git checkout <branch-name>
Right now I am working in the loops branch. I want to make changes to one of the files in my repo by adding a for-loop to my code. I will do that and then commit the changes:
Once I have done that, let me just move back to the master branch and check the change that I had made:
Wait, what’s going on here? The change I made is not reflected in the master branch! And that is exactly what should have happened. When I made the commit, I was on the loops branch, so my changes were saved in that specific branch. Hence, the change did not get committed to the master branch. My main code is safe.
Next, since the code on the loops branch is free from bugs and I love the new “loops” feature, I want to bring those changes into my master branch. I can do that using the git merge <branch-name> -m "<message>" command, run from the branch I want to merge into (the master branch in this case). This will merge the loops branch into the master branch:
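Here is the branch, checkout, commit, and merge sequence as one sketch. The loops branch name comes from my example above, while the file content, messages, and identity are placeholders:

```shell
branch_dir="$(mktemp -d)"
cd "$branch_dir"
git init -q
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"
echo 'print("Hello world")' > hello.py
git add hello.py
git commit -q -m "Add Hello world in Python"
main_branch="$(git symbolic-ref --short HEAD)"   # master or main, depending on your Git setup

# Create the loops branch and switch to it.
git branch loops
git checkout -q loops

# Commit a for-loop on the loops branch only.
printf 'for i in range(3):\n    print("Hello world", i)\n' >> hello.py
git commit -q -a -m "Greet three times with a for-loop"

# Back on the main branch, the file does not contain the loop yet.
git checkout -q "$main_branch"
cat hello.py

# Merge loops into the current branch; the loop now shows up here too.
git merge -q -m "Merge the loops feature" loops
cat hello.py
```

Because the main branch had no new commits of its own in this sketch, Git simply fast-forwards it to the loops commit, and the merge message is only used when a separate merge commit is needed.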
Awesome! We have added a fully functional new feature to our main code. Now all that is left to do is to push these changes to the remote repository!
The last Git command you need to know is pull. This lets you fetch any update from the remote repo and merge it with the local repo. For instance, suppose that after you cloned the remote repo, someone made an update to it on some branch. Now you need to fetch these changes and merge them with your local repo so that you are up to date on this new change.
Let’s say that after I added a for-loop to the Python code file, someone else added a for-loop to the Java code file and pushed it, so my local copy of the Java file is still without one:
I can update my local repo using the pull command. All you have to do is type git pull origin <branch-name>
This will update my local repo. Now even I have a for-loop in my Java file!
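A rough way to rehearse this on a single machine is to make two clones of the same stand-in remote, one for me and one for a teammate. The folder names mine and teammate, the bare repo, the file contents, and both identities are all hypothetical:

```shell
pull_root="$(mktemp -d)"
cd "$pull_root"
git init -q --bare remote.git               # stand-in for the GitHub repo

# My clone: make the first commit and push it.
git clone -q "$pull_root/remote.git" mine
cd mine
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"
echo 'for i in range(3): print("Hello world")' > hello.py
git add hello.py
git commit -q -m "Add Hello world loop in Python"
branch="$(git symbolic-ref --short HEAD)"   # master or main, depending on your Git setup
git push -q -u origin "$branch"
cd "$pull_root"

# A teammate clones the repo, adds a Java file, and pushes.
git clone -q "$pull_root/remote.git" teammate
cd teammate
git config user.name "Teammate"             # placeholder identity
git config user.email "teammate@example.com"
echo 'class Hello { /* for-loop goes here */ }' > Hello.java
git add Hello.java
git commit -q -m "Add Hello world in Java"
git push -q origin "$branch"
cd "$pull_root/mine"

# My clone does not have Hello.java until I pull.
ls
git pull -q origin "$branch"
ls
```

Git may warn that the first clone is of an empty repository; that is expected here, since the stand-in remote starts out with no commits.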
With GitHub, you can explore anybody’s repository. That’s the beauty of open-source, right? There will be times when you will genuinely like someone’s project and would be tempted to contribute to it. Or if you want to suggest fixes to someone’s project, it is better to make those fixes and then send a pull request so that you can contribute to their project.
This can be achieved by a process called forking.
To fork someone’s repository, head to the repository page and on the top, click Fork. This will create a copy of the repository in your account.
I have forked Analytics Vidhya’s Data Science Hacks repo. Do check it out. It’s a great repo for all the data science hacks that you will ever need for greater efficiency!
Now you can clone this repository onto your local machine, make the changes you want to make or the features you want to add, and push it to your remote copy.
Finally, you can request the creator of the project to accept the fixes that you have made or the new features you added. This is called a pull request.
When you navigate to the repository in your profile, you will see an option called ‘New pull request’:
Click that button and GitHub will take you to the next page where it will show you the branches you want the original creator to merge:
Once you create a pull request, you will be prompted to define the changes that you made to the original creator’s repository. You can provide as much description as you want about the changes that you made. And finally, send the pull request.
Now the ball is in the original creator’s court. If they accept your changes and merge them into their branch, congratulations: you just made an open-source contribution and you should be proud of yourself! But even if they don’t, don’t be heartbroken. There are a plethora of other open-source projects waiting for your attention. So move on and keep making those contributions!
If you are looking to contribute to an open-source project and don’t know where to start, you can start with the repo that I created for this article – “Hello-world”. Clone it, make changes to it, and send pull requests. I will accept all of them! Let’s make it a one-stop-shop to learn every programming language!
We have really covered a lot here, and if you have patiently implemented everything that I did in the article, give yourself a pat on the back. You deserve it!
But this is really just the tip of the iceberg. There is a lot more to Git & GitHub than what I covered in this article. If you want repositories to explore and contribute to, I recommend going through this article which has a list of some of the most innovative machine learning GitHub projects.