The vast majority of Git resources discuss how to use Git. Very few describe how Git actually works and even fewer look under the hood at Git’s code. In this article, I’m going to examine the initial commit of Git’s code to help you understand Git from the code perspective. If you are unfamiliar with what an initial commit is, I recommend you check out my article detailing the concept of an initial commit in Git.
Background
There is a reason I decided to examine the first version of Git’s code instead of the current version. Git was created in 2005 by Linus Torvalds, who is also the creator of Linux. Git has been under active development for about 15 years. Therefore, the current codebase is fairly large and complicated. It includes hundreds of code files written in more than five programming languages, more than 58,000 commits by over 1,300 developers, and tens of thousands of lines of code. This makes sense since new features, enhancements, and optimizations are continuously being added by the open source community.
However, the code in Git’s initial commit is contained in only 11 files, totals fewer than 1,000 lines of code, and is written entirely in the C programming language. Most importantly, the code actually runs. The fact that Git’s original version is so small and simple to understand makes it a great resource for learning how Git fundamentally works. Before we dive into the code, let’s see how to obtain a copy of Git’s original codebase.
Obtaining a local copy of Git’s initial commit
I created a project called Baby Git, in which I checked out Git’s initial commit and fully documented it with inline code comments describing how each piece of code works. The only (minor) modifications made to the code are to facilitate compiling on modern operating systems. The codebase is hosted on Bitbucket and you can view it in your web browser using the following link:
https://bitbucket.org/jacobstopak/baby-git/src/master/
You can also clone it down to your local machine using the command:
`git clone https://bitbucket.org/jacobstopak/baby-git.git`
Alternatively, if you’d like a copy of Git’s initial commit in its purest, unadulterated form, you can get it by following these steps:
1) Clone the Git repository to your local machine using the command `git clone https://github.com/git/git.git`
2) Browse into the `git` directory
3) Run the command `git log --reverse` to see Git’s commit log in reverse order. The initial commit has an ID of e83c5163316f89bfbde7d9ab23ca2e25604af290
4) Run the command `git checkout e83c5163316f89bfbde7d9ab23ca2e25604af290` to check out Git’s initial commit into the working directory
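If you would rather script these steps, the following is a minimal Python sketch of the same sequence. The helper names `checkout_commands` and `run_all` are hypothetical and exist only for this illustration. The sketch assumes `git` is installed and on your `PATH`, and it skips the `git log --reverse` step because the commit ID is already known.

```python
import subprocess

# Values taken from the steps above.
GIT_REPO_URL = "https://github.com/git/git.git"
INITIAL_COMMIT = "e83c5163316f89bfbde7d9ab23ca2e25604af290"


def checkout_commands(repo_url=GIT_REPO_URL, commit=INITIAL_COMMIT, directory="git"):
    """Build the (argv, working directory) pairs for the clone and checkout steps."""
    return [
        # Step 1: clone the repository into `directory`.
        (["git", "clone", repo_url, directory], None),
        # Steps 2 and 4: run the checkout from inside the cloned directory.
        (["git", "checkout", commit], directory),
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv, cwd in commands:
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `run_all(checkout_commands())` performs the clone and the checkout. Building the command list separately from running it makes it easy to inspect what will be executed before anything touches the network.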
Now, we’ll explore what Git’s original revision actually contains and what it can do.
What is in Git’s initial commit?
If you list the contents of Git’s initial commit, you’ll see the following eight C code files with a `.c` extension:
- init-db.c
- update-cache.c
- write-tree.c
- commit-tree.c
- read-tree.c
- cat-file.c
- show-diff.c
- read-cache.c
The first seven of these `.c` files each directly correspond to one of the original seven Git commands. When the codebase is compiled, each of these seven `.c` files is compiled into its own executable. For clarity, we can relate each of these to the modern name of that command, as follows:
| Git’s Initial Commit | Current Git Version | Purpose |
| --- | --- | --- |
| init-db | git init | Initialize a Git repository |
| update-cache | git add | Add a file to the staging area |
| write-tree | git write-tree | Write a new tree object to the Git repository using the content in the staging index |
| commit-tree | git commit | Create a new commit object in the Git repository based on the specified tree |
| read-tree | git read-tree | Display the contents of a tree object from the Git repository |
| show-diff | git diff | Show the differences between staged Git files and their corresponding working directory versions |
| cat-file | git cat-file | Display the contents of objects stored in the Git repository |
The eighth `.c` file is `read-cache.c`. It defines various functions that are used program-wide in Git’s original revision. It also contains definitions of external (or global) variables used by the program.
In addition to these eight `.c` files, Git’s original codebase contains one `.h` header file called `cache.h`. This file is included in all the source files via the `#include` preprocessing directive. This file contains other `#include` directives of library header files, token definitions, declarations of external variables and structure templates, and function prototypes.
Finally, Git’s first commit contains a `Makefile` which is used to compile the code and a `README` with a humorous, yet very informative, description of Git’s core concepts.
Next, we’ll describe the different parts of a Git repository.
Components of a Git repository
There are four main components in an original Git repository:
- Objects (blobs, trees, commits)
- Object database
- Current directory cache
- Working directory
Objects
In order to store and track changes to our code files, Git needs a way to understand file content. Git does this by creating objects that identify file content and keep track of it over time. These objects come in three types: blobs, trees, and commits.
Blobs
The term “blob” stands for Binary Large Object. A blob is simply a file containing data in binary format. Since all computer files are ultimately stored as binary, any file stored on our computer can be treated as a blob. Git creates a blob for each file that we add to the repository. Furthermore, Git needs a way to be able to identify each blob and distinguish blobs from each other. This is done by naming and referencing each blob using the SHA-1 hash of its deflated (compressed) content. SHA-1 does not strictly guarantee uniqueness, but two different pieces of content are so unlikely to produce the same hash by accident that Git can safely reference objects this way.
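To make the naming scheme concrete, here is a minimal Python sketch of how a blob name could be derived in the style of Git’s initial commit. It is an illustration, not the original C code. The function name `blob_object_name` is hypothetical. The sketch assumes the object starts with the tag `blob`, a space, the size in decimal, and a NUL byte before the file content, and that everything is deflated at the highest compression level. Because deflate output can vary between zlib versions and call patterns, the resulting name may not match what the original binary would produce byte for byte.

```python
import hashlib
import zlib


def blob_object_name(content):
    """Return (hex name, deflated bytes) for a blob, in the style of Git's first commit."""
    # Assumed header layout: object tag, a space, the size in bytes, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    # Deflate the header and the content together (level 9 = best compression).
    deflated = zlib.compress(header + content, 9)
    # The name is the SHA-1 of the *deflated* bytes, as described above.
    name = hashlib.sha1(deflated).hexdigest()
    return name, deflated


name, deflated = blob_object_name(b"hello, baby git\n")
print(name)                       # 40 hexadecimal characters
print(zlib.decompress(deflated))  # the header followed by the original content
```

Note that later versions of Git compute the hash over the uncompressed bytes instead, so names produced by this sketch will not match the output of a modern `git hash-object`.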
A blob object has the following structure:
“`
‘blob’ (blob object tag)
‘ ‘ (single space)
size of blob (in bytes)
‘