Git's command interface is generally considered somewhat confusing and inconsistent. Luckily, beneath the command interface is a wonderfully consistent and straightforward object model. By gaining an understanding of the object model, we can easily wrap our heads around even the most complex Git operation by thinking in terms of these objects.
In this video, we'll lay the foundation of the Git object model, describing exactly how Git stores and references our code.
To help illustrate the points of our discussion, we'll create a file with the classic hello world as the contents:
$ echo 'hello world' > readme.md
To begin the discussion on the Git object model, we'll need a Git repository.
The "repository" is the hidden directory .git/, which contains everything Git uses to track our revisions: its objects, branches, remotes, etc.
Starting from an empty directory, we'll run:
$ git init
and Git kindly creates the repository for us.
Here are the contents of the .git/ directory immediately after running git init:
$ tree .git
.git
├── HEAD
├── config
├── description
├── hooks/
│   ├── applypatch-msg.sample
│   └── # ... (other hooks)
├── info/
│   └── exclude
├── objects/
│   ├── info/
│   └── pack/
└── refs/
    ├── heads/
    └── tags/
The .git directory contains a handful of files and subdirectories (ending with a /), even if nothing has been tracked yet. In this discussion, we'll be focusing on the following paths:
|HEAD|A "pointer" to the currently checked out object (more on this later).|
|objects/|Where Git stores its representation of all files, directories, and commits.|
|refs/|Where Git stores all branches, tags, remotes, etc.|
The remaining entries, such as config, description, hooks/, and info/, contain metadata, and we can
ignore them during this discussion.
First, we'll be focusing on the
objects/ directory. It's best to think about this
directory as a mini file-based database where Git will store its
representation of our files, directories, and commits.
If we take a peek at the objects/ directory, we'll see that it remains
empty even though we've created readme.md. Git does not run on its own in the
background; it only acts when we run a Git command.
If we run git status, we see that Git is aware there is a new file, but
has not started tracking it. Thus, the file is not in the repository yet.
To start tracking it, we'll run:
$ git add readme.md
Now we'll take another look, and see that we have our first Git object!
$ tree -I "info|pack" .git/objects
.git/objects
└── 3b
    └── 18e512dba79e4c8300dd08aeb37f8e728b8dad
(The -I bit means ignore the provided pattern. Those directories aren't relevant
to our discussion.)
Here we can see a new directory and file. Every object in Git has a unique 40-character hex string as its name, and Git uses this name to store the file (this technique is called content-addressable storage). Git takes the first 2 characters of the object's name and uses them as the directory, with the remaining 38 as the file name. We'll ignore this subtlety going forward, as it is a performance optimization, but it is good to know for later on.
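The directory layout described above can be sketched in a few lines of Python. This is only an illustration: the function name loose_object_path is our own invention, not something Git provides.

```python
from pathlib import PurePosixPath


def loose_object_path(object_hash):
    """Return where an object with this name lives, relative to the project root."""
    if len(object_hash) != 40:
        raise ValueError("expected a full 40-character hash")
    # The first two hex characters name the directory, the other 38 the file.
    return PurePosixPath(".git/objects") / object_hash[:2] / object_hash[2:]


print(loose_object_path("3b18e512dba79e4c8300dd08aeb37f8e728b8dad"))
# prints .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```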
This first object we've added to our Git repository is a "blob." Git uses
blobs to track files, but you should note that a blob stores only the contents
of our file, not the name. We can see this by asking Git to show us the contents
of the first object it has stored using the
cat-file command and the
object's name (note that we only need to provide the first eight or so characters
from the name, not the whole name):
$ git cat-file -p 3b18e512
hello world
Again, we see here that Git is only storing the contents of the file, namely
hello world. Other information like the mode, permissions, and file name is
stored elsewhere, as we'll see shortly.
Because Git only stores the contents of the file rather than the name or any metadata, it can tell when you are giving a different name to the same version of a file and therefore process it quickly. If we have two files with identical content but different names, the content has only been stored once!
When storing our file, Git uses a "blob" which only concerns itself with the contents of the file, ignoring our filename (for now). However, it does need a name, and we've seen that Git uses a seemingly random 40-character hex string as the name.
It turns out this string is in fact not random at all, but is produced by taking the content that Git stores and running it through a "hash function," the SHA-1 hash function in this case. The hash function takes a string of any size as an input, and returns a 40-character hex string as the output.
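SHA-1 has a number of useful properties, but the most important to Git are:
- The same input always produces the same output.
- Different inputs produce, for all practical purposes, different outputs, so a hash identifies exactly one piece of content.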
The use of these hash values as the names for our objects makes operations like deep comparison of files and directories easy, as we're only comparing the hash values. Going forward, we'll refer to these names as a "hash."
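To make the idea of content-addressable storage concrete, here is a toy store in Python. It is a sketch under simplifying assumptions: the ContentStore class and the second file name are made up for illustration, and it hashes the raw content with plain SHA-1, which is not quite what Git does.

```python
import hashlib


class ContentStore:
    """A toy content-addressable store: an object's name is derived from its content."""

    def __init__(self):
        self.objects = {}

    def put(self, content):
        # Plain SHA-1 of the content; Git hashes slightly more than this.
        key = hashlib.sha1(content).hexdigest()
        self.objects[key] = content
        return key


store = ContentStore()
# Two "files" with different names but identical content.
files = {"readme.md": b"hello world\n", "copy-of-readme.md": b"hello world\n"}
names = {filename: store.put(content) for filename, content in files.items()}

print(names["readme.md"] == names["copy-of-readme.md"])  # True
print(len(store.objects))  # 1
```

Both file names map to the same key, so the content is stored once, and comparing the two files is just comparing two short strings.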
We're close to fully understanding how Git stores our file, but if we try to
replicate the hash function on the content "hello world," it doesn't match
the name Git gave our blob:
$ echo 'hello world' | shasum
22596363b3de40b06f981fb85d82312e8c0ed511  -
This is because Git prepends a bit of metadata before the
object contents: the object type, a space, the content length in bytes, and a
null separator character. (The length is 12 here because echo appends a newline
to the 11 characters of hello world.) Git then passes the combined string
through the hash function:
$ echo -e 'blob 12\0hello world' | shasum
3b18e512dba79e4c8300dd08aeb37f8e728b8dad  -
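The same computation can be written in Python, which avoids having to count the bytes by hand. The helper name git_blob_hash is our own; the header format is the one described above.

```python
import hashlib


def git_blob_hash(content):
    """Hash content the way the text describes: 'blob <length>', a null byte, then the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The trailing newline matters: echo added one when we created readme.md.
print(git_blob_hash(b"hello world\n"))
# prints 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```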
Now we have a complete understanding of how Git names and stores our blob objects. Git uses a similar metadata structure for the other object types, but from here on we'll only discuss the primary content of the Git objects, ignoring the metadata.
This metadata has the benefit of allowing us (and Git!) to know the type of
an object in isolation. Let's check the object's type with the
-t ("type") flag passed to cat-file:
$ git cat-file -t 3b18e512
blob
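As a rough sketch of what a lookup like this involves, the Python below writes an object file into a temporary directory and reads its type back from the header. One detail goes beyond what the text covers: Git compresses object files with zlib, so the sketch does too. The function names are ours, and unlike Git the sketch only accepts full 40-character names.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir, obj_type, body):
    """Store header + body, zlib-compressed, under a name derived from its hash."""
    raw = f"{obj_type} {len(body)}".encode("ascii") + b"\0" + body
    name = hashlib.sha1(raw).hexdigest()
    path = objects_dir / name[:2] / name[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))
    return name


def read_loose_object(objects_dir, name):
    """Return (type, body) for an object, using only the metadata in the file itself."""
    raw = zlib.decompress((objects_dir / name[:2] / name[2:]).read_bytes())
    header, _, body = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    assert int(size) == len(body)  # the recorded length should match the content
    return obj_type, body


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    name = write_loose_object(objects_dir, "blob", b"hello world\n")
    obj_type, body = read_loose_object(objects_dir, name)

print(name)      # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(obj_type)  # blob
print(body)      # b'hello world\n'
```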
It wouldn't be very useful if we could only track the contents of files, and not even the file names or the structure. This is where "tree" objects come in. A "tree" in Git is an object (a file, really) which contains a list of pointers to blobs or other trees. Each line in the tree object's file contains a pointer (the object's hash) to one such object (tree or blob), while also providing the mode, object type, and a name for the file or directory.
$ git add readme.md
$ git commit -m 'Add readme'
$ tree -I "info|pack" .git/objects
.git/objects
├── 3b
│   └── 18e512dba79e4c8300dd08aeb37f8e728b8dad
├── 73
│   └── 94b8cc9ca916312a79ce8078c34b49b1617718
└── ef
    └── 34a153025fffb8a498fff540f7c93963937291
In committing our readme.md file, we create two new objects. One is the
object for the commit itself, ef34a15..., but the other, 7394b8c..., is the
new "tree" object that represents our current directory. (Remember that the
full hash for an object is its directory name prepended to the file name,
73 + 94b8cc9... in this case). We can view it with:
$ git cat-file -t 7394b8c
tree
$ git cat-file -p 7394b8c
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    readme.md
Our new tree object consists of a single line, listing the mode/permissions, the type, the hash, and the file name of our readme.md file.
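For the curious, here is a hedged Python sketch of how a tree like this could be serialized and hashed. It assumes the binary layout as we understand it: each entry is the mode, a space, the name, a null byte, and the 20 raw bytes of the hash, with the type column in the pretty-printed view derived rather than stored. The function names are ours, and the sorting is simplified.

```python
import hashlib


def tree_body(entries):
    """Serialize (mode, name, hex_hash) entries into the body of a tree object."""
    body = b""
    # Git keeps entries sorted by name; plain sorting is a simplification here.
    for mode, name, hex_hash in sorted(entries, key=lambda entry: entry[1]):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_hash)  # 20 raw bytes, not 40 hex characters
    return body


def git_object_hash(obj_type, body):
    """Prepend the '<type> <length>' header and a null byte, then hash with SHA-1."""
    header = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


readme_entry = ("100644", "readme.md", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad")
body = tree_body([readme_entry])
print(len(body))                      # 37
print(git_object_hash("tree", body))  # compare with the tree hash Git reported
```

Because the tree's hash depends on the hashes of its entries, changing one file's content changes the hash of every tree above it.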
Again, we'll note that tree objects themselves do not have names, much like blobs. Parent trees associate names for subtrees, and the root tree, referred to as the "working tree" of a repository, in fact has no name.
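This has two fun characteristics:
- The repository doesn't care what its containing directory is called; we can rename the project directory and Git won't notice.
- Renaming a subdirectory only changes the entry in its parent tree; the subtree's own contents, and therefore its hash, stay the same.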
Trees require at least one file, and optionally have subtrees. An empty
directory would yield an empty tree, and this is why we can't track empty
directories in Git and have to use tricks like adding a hidden placeholder file
(commonly named .gitkeep) into the directory. (Technically, there's no strict
reason for this limitation, it's just that no one has fixed it... ¯\_(ツ)_/¯)
We'll create a subtree with:
$ mkdir app
$ touch app/script.rb
$ git add --all
$ git status
## master
A  app/script.rb
$ git commit -m 'Another file in app dir'
Looking in the objects/ directory, there are now multiple new objects.
Of most interest to us is the new tree object for our working directory, our
current root tree. Since there are so many objects now, we can't just guess
which is which, but we can ask Git directly to show us a specific tree using
ls-tree:
$ git ls-tree master
040000 tree 67b21f78a4548b2ba3eab318bb3628d039e851e6    app
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    readme.md
We pass master as the "tree" to view since master, a branch, points at a
commit, which points at a tree. By passing master as the argument, we're
identifying the tree master indirectly points at.
We have a new tree object and it contains a new line, the first line,
identifying a subtree for our
app subdirectory. We can view the
tree by grabbing its hash and running:
$ git ls-tree 67b21f78a4548b2ba3eab318bb3628d039e851e6
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    script.rb
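And here we see a single line for the script.rb file in the app/
directory. This rounds out our understanding of trees. To review:
- Trees list the contents of a directory: its blobs and subtrees.
- For each entry, a tree records the mode, object type, hash, and name.
- Trees are stored in the objects/ directory by hashing their contents (the list of objects described above).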
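Since trees can point at other trees, listing every file in a project means walking them recursively. The sketch below does this over an in-memory copy of the two ls-tree listings above. The TREES dictionary and the flatten function are our own illustration, and the root tree's hash (0cae7dc...) is the one that appears in this text's commit output.

```python
# Trees from the text, keyed by hash; each entry is (mode, type, hash, name).
ROOT_TREE = "0cae7dc167b255c0123c7c396fc48ce40fc35cfa"
TREES = {
    ROOT_TREE: [
        ("040000", "tree", "67b21f78a4548b2ba3eab318bb3628d039e851e6", "app"),
        ("100644", "blob", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad", "readme.md"),
    ],
    "67b21f78a4548b2ba3eab318bb3628d039e851e6": [
        ("100644", "blob", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", "script.rb"),
    ],
}


def flatten(tree_hash, prefix=""):
    """Map every file path under a tree to the hash of its blob."""
    files = {}
    for _mode, obj_type, obj_hash, name in TREES[tree_hash]:
        path = prefix + name
        if obj_type == "tree":
            # A subtree: the parent supplies the name, the subtree supplies the contents.
            files.update(flatten(obj_hash, path + "/"))
        else:
            files[path] = obj_hash
    return files


for path, blob_hash in sorted(flatten(ROOT_TREE).items()):
    print(blob_hash, path)
```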
A commit is the final piece in our object puzzle, and it connects all
of the other objects together, acting as the hub of our object graph. Just like with
our other objects, we can use
cat-file -p to get a pretty printed view of
how Git stores commit objects.
$ git log --oneline --decorate
* f95b2fe (HEAD -> master) Another file in app dir
* ef34a15 Add readme
$ git cat-file -p f95b2fe
tree 0cae7dc167b255c0123c7c396fc48ce40fc35cfa
parent ef34a153025fffb8a498fff540f7c93963937291
author Chris Toomey <email@example.com> 1441311544 -0400
committer Chris Toomey <firstname.lastname@example.org> 1441311544 -0400

Another file in app dir
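A commit is a file, just like our blob and tree objects, with a specific structure. That structure includes:
- The hash of the tree that represents the root of our project at the time of the commit.
- The hash of the parent commit, if there is one.
- The author and committer, each with a timestamp.
- The commit message.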
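To show how regular this structure is, here is a small Python sketch that parses the pretty-printed commit from above into its parts. The parse_commit function is our own, and it assumes the simple layout seen here: one header per line, a blank line, then the message. Real commits can carry extra headers this sketch does not handle.

```python
COMMIT_TEXT = """\
tree 0cae7dc167b255c0123c7c396fc48ce40fc35cfa
parent ef34a153025fffb8a498fff540f7c93963937291
author Chris Toomey <email@example.com> 1441311544 -0400
committer Chris Toomey <firstname.lastname@example.org> 1441311544 -0400

Another file in app dir
"""


def parse_commit(text):
    """Split a pretty-printed commit into its header fields and its message."""
    head, _, message = text.partition("\n\n")
    fields = {"parents": []}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            # A root commit has no parent line, so this list may stay empty.
            fields["parents"].append(value)
        else:
            fields[key] = value
    fields["message"] = message.rstrip("\n")
    return fields


commit = parse_commit(COMMIT_TEXT)
print(commit["tree"])
print(commit["parents"])
print(commit["message"])
```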
This file is then passed through the SHA-1 hash function and saved alongside
our other objects in the objects/ directory.
It's worth noting that the commit object only contains a single reference to a working directory; Git doesn't store diffs. When diffing between two commits, it compares the working trees of the commits, computing the diff on demand.
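Computing a diff on demand can be sketched as a comparison of two snapshots, each a mapping from path to blob hash. The function diff_snapshots is a hypothetical helper of ours, and it only reports which paths changed; Git goes further and compares the contents of the changed blobs.

```python
def diff_snapshots(old, new):
    """Compare two {path: blob_hash} snapshots and report what changed."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # Same path, different hash: the content must have changed.
    modified = sorted(path for path in set(old) & set(new) if old[path] != new[path])
    return {"added": added, "removed": removed, "modified": modified}


# The files reachable from each of our two commits, as seen in the text.
first = {"readme.md": "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"}
second = {
    "readme.md": "3b18e512dba79e4c8300dd08aeb37f8e728b8dad",
    "app/script.rb": "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391",
}

print(diff_snapshots(first, second))
# {'added': ['app/script.rb'], 'removed': [], 'modified': []}
```

Unchanged files are skipped by comparing hashes alone, without reading any file contents.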
Commits are the core of the Git object graph, and we'll explore this by walking this graph just a bit:
# using the "parent" commit hash from above
$ git cat-file -p ef34a153025fffb8a498fff540f7c93963937291
tree 7394b8cc9ca916312a79ce8078c34b49b1617718
author Chris Toomey <email@example.com> 1441311368 -0400
committer Chris Toomey <firstname.lastname@example.org> 1441311368 -0400

Add readme
Here we've walked the graph and viewed the parent commit, which we see is the
root commit (as it has no
parent), and now we go a step further to view the
working tree of this parent commit:
$ git cat-file -p 7394b8cc9ca916312a79ce8078c34b49b1617718
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    readme.md
Here we see the initial directory with the single readme.md file, and
we can go one step further to view the contents of that file's blob:
$ git cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
hello world
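The walk we just did by hand can be sketched as a loop that follows parent references until it reaches the root commit. The COMMITS dictionary is a simplified stand-in for the two commits above: it uses abbreviated hashes for readability, where Git stores full ones, and it follows a single parent per commit.

```python
# Our two commits, keyed by abbreviated hash (an illustration, not Git's storage format).
COMMITS = {
    "f95b2fe": {"parent": "ef34a15", "tree": "0cae7dc", "message": "Another file in app dir"},
    "ef34a15": {"parent": None, "tree": "7394b8c", "message": "Add readme"},
}


def history(commit_hash):
    """Yield (hash, message) pairs, newest first, by following parent references."""
    while commit_hash is not None:
        commit = COMMITS[commit_hash]
        yield commit_hash, commit["message"]
        commit_hash = commit["parent"]  # None marks the root commit


for commit_hash, message in history("f95b2fe"):
    print(commit_hash, message)
# f95b2fe Another file in app dir
# ef34a15 Add readme
```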
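And with that, we've covered the primary objects Git uses to store our code:
- Blobs, which store the contents of our files.
- Trees, which list the blobs and subtrees in a directory along with their names and modes.
- Commits, which point at a root tree and a parent commit, and record the author, committer, and message.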
We've also seen how we can walk through the history by following the hash references from commits, trees, and eventually to blobs. This wraps up the foundation of the object model, but in the next video we'll see how branches, remotes, and tags fit in, as well as how Git commands operate on these objects.