Git's command interface is generally considered somewhat confusing and inconsistent. Luckily, beneath the command interface is a wonderfully consistent and straightforward object model. By gaining an understanding of the object model, we can easily wrap our heads around even the most complex Git operation by thinking in terms of these objects.
In this video, we'll lay the foundation of the Git object model, describing exactly how Git stores and references our code.
To help illustrate the points of our discussion, we'll create a file with the classic hello world as its contents.
$ echo 'hello world' > readme.md
To begin the discussion on the Git object model, we'll need a Git repository.
The "repository" is the hidden .git/ directory, which contains everything Git uses to track our revisions: objects, branches, remotes, etc.
Starting from an empty directory, we'll run:
$ git init
and Git kindly creates the repository for us.
Here are the contents of the .git/ directory immediately after running git init:
$ tree .git
.git
├── HEAD
├── config
├── description
├── hooks/
│   ├── applypatch-msg.sample
│   └── # ... (other hooks)
├── info/
│   └── exclude
├── objects/
│   ├── info/
│   └── pack/
└── refs/
    ├── heads/
    └── tags/
The .git directory contains a handful of files and subdirectories (ending with a /), even if nothing has been tracked yet. In this discussion, we'll be focusing on the following paths:
Name | Function |
---|---|
HEAD | A "pointer" to the currently checked out object (more on this later). |
objects/ | Where Git stores its representation of all files, directories, and commits. |
refs/ | Where Git stores all branches, tags, remotes, etc. |
config, description, hooks/, and info/ contain metadata, and we can ignore them during this discussion.
First, we'll be focusing on the objects/ directory. It's best to think about this directory as a mini file-based database where Git will store its representation of our files, directories, and commits.
If we take a peek at the objects/ directory, we'll see that it remains empty, even though we've created readme.md. Git does not run on its own in the background.
If we run git status, we see that Git is aware there is a new file, but has not started tracking it. Thus, the file is not in the .git/objects directory yet. To change that, we add the file:
$ git add readme.md
Now we'll take another look, and see that we have our first Git object!
$ tree -I "info|pack" .git/objects
.git/objects
└── 3b
    └── 18e512dba79e4c8300dd08aeb37f8e728b8dad
(The -I bit means ignore entries matching the provided pattern, here the info and pack directories. Those aren't relevant to our discussion.)
Here we can see a new directory and file. Every object in Git has a unique 40-character hex string as its name, and Git uses this name to store the file (this technique is called content-addressable storage). Git takes the first 2 characters of the object's name and uses them as the directory, with the remaining 38 as the file name. We'll ignore this subtlety going forward, as it is a performance optimization, but it is good to know for later on.
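As a minimal sketch of the naming scheme just described, we can split the hash from our example into its directory and file name by hand. The variable names are only for illustration, and this reflects the layout shown in the tree output above rather than anything Git exposes as a command.

```shell
# Split an object name into the directory and file name used under .git/objects.
hash=3b18e512dba79e4c8300dd08aeb37f8e728b8dad

dir=$(printf '%s' "$hash" | cut -c1-2)    # first 2 characters
file=$(printf '%s' "$hash" | cut -c3-)    # remaining 38 characters
object_path=".git/objects/$dir/$file"

echo "$object_path"
```

This should print the same path we saw in the tree output for our first object.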
This first object we've added to our Git repository is a "blob." Git uses blobs to track files, but note that a blob stores only the contents of our file, not its name. We can see this by asking Git to show us the contents of the first object it has stored, using the cat-file command and the object's name (we only need to provide the first eight or so characters of the name, not the whole name):
$ git cat-file -p 3b18e512
hello world
Again, we see here that Git is only storing the contents of the file, namely hello world. Other information like the mode, permissions, and file name is stored elsewhere.
Because Git only stores the contents of the file rather than the name or any metadata, it can tell when you are giving a different name to the same version of a file and therefore process it quickly. If we have two files with identical content but different names, the content has only been stored once!
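One way to check this for ourselves is with git hash-object, which prints the name Git would give a file's contents without storing anything. The sketch below assumes a scratch directory and two made-up file names; since both files hold the same content, we'd expect the same hash for each.

```shell
# Two files with different names but identical content.
workdir=$(mktemp -d)
echo 'hello world' > "$workdir/readme.md"
echo 'hello world' > "$workdir/copy-of-readme.md"

# hash-object computes the blob name from the content alone.
hash_original=$(git hash-object "$workdir/readme.md")
hash_copy=$(git hash-object "$workdir/copy-of-readme.md")

echo "$hash_original"
echo "$hash_copy"
```

Both lines should show the 3b18e512... hash from earlier, which is why Git only needs one blob for the two files.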
When storing our file, Git uses a "blob" which only concerns itself with the contents of the file, ignoring our filename (for now). However, it does need a name, and we've seen that Git uses a seemingly random 40-character hex string as the name.
It turns out this string is in fact not random at all, but is produced by taking the content that Git stores and running it through a "hash function," the SHA-1 hash function in this case. The hash function takes a string of any size as an input, and returns a 40-character hex string as the output.
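SHA-1 has a number of useful properties, but the most important to Git are that the same input always produces the same hash, and that two different inputs are extremely unlikely to produce the same hash by accident.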
The use of these hash values as the names for our objects makes operations like deep comparison of files and directories easy, as we're only comparing the hash values. Going forward, we'll refer to these names as a "hash."
We're close to fully understanding how Git stores our file, but if we try to replicate the hash function on the content "hello world," it doesn't match the 3b18e512... hash.
$ echo 'hello world' | shasum
22596363b3de40b06f981fb85d82312e8c0ed511 -
This is because Git prepends a bit of metadata to the object contents, specifically the object type, the content length in bytes (12 here, counting the trailing newline that echo adds), and a null-byte separator, and then passes the combined string through the hash function. We can reproduce this with shasum:
$ echo -e 'blob 12\0hello world' | shasum
3b18e512dba79e4c8300dd08aeb37f8e728b8dad -
Now we have a complete understanding of how Git names and stores our blob objects. Git uses a similar metadata structure for the other object types, but from here on we'll only discuss the primary content of the Git objects, ignoring the metadata.
This metadata has the benefit of allowing us (and Git!) to know the type of an object in isolation. Let's check the object's type by passing the -t flag (for "type") to cat-file:
$ git cat-file -t 3b18e512
blob
It wouldn't be very useful if we could only track the contents of files, and not even the file names or the structure. This is where "tree" objects come in. A "tree" in Git is an object (a file, really) which contains a list of pointers to blobs or other trees. Each line in the tree object's file contains a pointer (the object's hash) to one such object (tree or blob), while also providing the mode, object type, and a name for the file or directory.
$ git add readme.md
$ git commit -m 'Add readme'
$ tree -I "info|pack" .git/objects
.git/objects
├── 3b
│   └── 18e512dba79e4c8300dd08aeb37f8e728b8dad
├── 73
│   └── 94b8cc9ca916312a79ce8078c34b49b1617718
└── ef
    └── 34a153025fffb8a498fff540f7c93963937291
In committing our readme.md file, we create two new objects. One, ef34a15..., is the object for the commit itself, but the other, 7394b8c..., is the new "tree" object that represents our current directory. (Remember that the full hash for an object is its directory name prepended to the file name, 73 and 94b8cc... in this case). We can view it with:
$ git cat-file -t 7394b8c
tree
$ git cat-file -p 7394b8c
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad readme.md
Our new tree object consists of a single line, listing the mode/permissions, the type, the hash, and the file name of our readme.md file.
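If we'd like to see a tree without making a commit first, one option is git write-tree, which builds a tree object from whatever is currently staged. The sketch below assumes a throwaway repository created with mktemp, so it won't touch the repository from our walkthrough.

```shell
# Work in a scratch repository.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init --quiet

echo 'hello world' > readme.md
git add readme.md

# write-tree turns the staged files into a tree object and prints its hash.
tree_hash=$(git write-tree)

tree_type=$(git cat-file -t "$tree_hash")
tree_listing=$(git cat-file -p "$tree_hash")
echo "$tree_type"
echo "$tree_listing"
```

The listing should contain a single line pointing at the same blob hash as before, alongside the readme.md name that the blob itself doesn't carry.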
Again, we'll note that tree objects themselves do not have names, much like blobs. Parent trees associate names for subtrees, and the root tree, referred to as the "working tree" of a repository, in fact has no name.
This has two fun characteristics:
.git
repo
directory.
Trees require at least one file, and optionally have subtrees. An empty directory would yield an empty tree, and this is why we can't track empty directories in Git and have to use tricks like adding a hidden .gitkeep file into the directory. (Technically, there's no strict reason for this limitation; it's just that no one has fixed it... ¯\_(ツ)_/¯)
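We can watch this limitation in action in a scratch repository. This is a small sketch with a made-up directory name, empty_dir; the .gitkeep name is just the convention mentioned above, not something Git treats specially.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init --quiet

mkdir empty_dir
git add --all
staged_before=$(git ls-files)    # nothing staged: the directory is empty

touch empty_dir/.gitkeep
git add --all
staged_after=$(git ls-files)     # now the placeholder file is staged

echo "before: '$staged_before'"
echo "after:  '$staged_after'"
```

Only once the directory contains a file does anything show up in the staged list.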
We'll create a subtree with:
$ mkdir app
$ touch app/script.rb
$ git add --all
$ git status --short --branch
## master
A app/script.rb
$ git commit -m 'Another file in app dir'
Looking in the objects/ directory, there are now multiple new objects. Of most interest to us is the new tree object for our working directory, our current root tree. Since there are so many objects now, we can't just guess which is which, but we can ask Git directly to show us a specific tree using the ls-tree command.
$ git ls-tree master
040000 tree 67b21f78a4548b2ba3eab318bb3628d039e851e6 app
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad readme.md
We passed master as the "tree" to view since master, a branch, points at a commit, which points at a tree. By passing master as the argument, we're identifying the tree master indirectly points at.
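To make that indirection explicit, we can ask Git to resolve each step with rev-parse. This sketch uses a scratch repository and HEAD rather than master, since the default branch name varies between setups; the name and email are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init --quiet
git config user.name 'Example'                  # placeholder identity
git config user.email 'example@example.com'

echo 'hello world' > readme.md
git add readme.md
git commit --quiet -m 'Add readme'

commit_hash=$(git rev-parse HEAD)               # HEAD -> branch -> commit
root_tree=$(git rev-parse 'HEAD^{tree}')        # commit -> tree

# Listing via the commit and via the tree hash shows the same entries.
via_commit=$(git ls-tree HEAD)
via_tree=$(git ls-tree "$root_tree")
echo "$via_tree"
```

Passing a branch, a commit, or the tree hash itself to ls-tree all end up at the same tree.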
We have a new tree object and it contains a new line, the first line, identifying a subtree for our app subdirectory. We can view the tree by grabbing its hash and running:
$ git ls-tree 67b21f78a4548b2ba3eab318bb3628d039e851e6
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 script.rb
And here we see a single line for the script.rb file in the app directory. This rounds out our understanding of trees. To review:
Trees are stored in the objects/ directory by hashing their contents (the list of objects described above).
A commit is the final piece in our object puzzle, and it connects all of the other objects together, acting as the hub of our object graph. Just like with our other objects, we can use cat-file -p to get a pretty-printed view of how Git stores commit objects.
$ git log --oneline --decorate --graph
* f95b2fe (HEAD -> master) Another file in app dir
* ef34a15 Add readme
$ git cat-file -p f95b2fe
tree 0cae7dc167b255c0123c7c396fc48ce40fc35cfa
parent ef34a153025fffb8a498fff540f7c93963937291
author Chris Toomey <chris@ctoomey.com> 1441311544 -0400
committer Chris Toomey <chris@ctoomey.com> 1441311544 -0400
Another file in app dir
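A commit is a file, just like our blob and tree objects, with a specific structure. As the output above shows, that structure includes the hash of the commit's root tree, the hash of the parent commit (if there is one), the author and committer with their timestamps, and the commit message.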
This file is then passed through the SHA-1 hash function and saved alongside our other objects in the .git/objects directory.
It's worth noting that the commit object contains only a single reference to a tree, a snapshot of the whole working directory; Git doesn't store diffs. When diffing two commits, it compares the working trees of the commits, computing the diff on demand.
Commits are the core of the Git object graph, and we'll explore this by walking this graph just a bit:
# using the "parent" commit hash from above
$ git cat-file -p ef34a153025fffb8a498fff540f7c93963937291
tree 7394b8cc9ca916312a79ce8078c34b49b1617718
author Chris Toomey <chris@ctoomey.com> 1441311368 -0400
committer Chris Toomey <chris@ctoomey.com> 1441311368 -0400
Add readme
Here we've walked the graph and viewed the parent commit, which we see is the root commit (as it has no parent), and now we go a step further to view the working tree of this parent commit:
$ git cat-file -p 7394b8cc9ca916312a79ce8078c34b49b1617718
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad readme.md
Here we see the initial directory with the single readme.md file, and finally:
$ git cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
hello world
And with that, we've covered the primary objects Git uses to store our code: blobs, trees, and commits.
We've also seen how we can walk through the history by following the hash references from commits, trees, and eventually to blobs. This wraps up the foundation of the object model, but in the next video we'll see how branches, remotes, and tags fit in, as well as how Git commands operate on these objects.
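As a closing sketch, here is one way to automate the walk we just did by hand: start at HEAD, print each commit's tree, and follow the parent line until we reach the root commit. It assumes a scratch repository with a placeholder identity and a linear history; for merge commits, which have several parents, it would only follow the first.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init --quiet
git config user.name 'Example'                  # placeholder identity
git config user.email 'example@example.com'

echo 'hello world' > readme.md
git add readme.md
git commit --quiet -m 'Add readme'

mkdir app
touch app/script.rb
git add --all
git commit --quiet -m 'Another file in app dir'

commit=$(git rev-parse HEAD)
visited=0
while [ -n "$commit" ]; do
  tree=$(git rev-parse "$commit^{tree}")
  echo "commit $commit -> tree $tree"
  visited=$((visited + 1))
  # Read the first "parent" header; stop at the blank line before the message.
  commit=$(git cat-file -p "$commit" |
    awk '/^$/ { exit } $1 == "parent" { print $2; exit }')
done
echo "visited $visited commits"
```

The loop ends at the root commit, since it has no parent line to follow.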