Git Object Model

Notes

Git's command interface is generally considered somewhat confusing and inconsistent. Luckily, beneath the command interface is a wonderfully consistent and straightforward object model. By gaining an understanding of the object model, we can easily wrap our heads around even the most complex Git operation by thinking in terms of these objects.

In this video, we'll lay the foundation of the Git object model, describing exactly how Git stores and references our code.

To help illustrate the points of our discussion, we'll create a file with the classic hello world as the contents.

$ echo 'hello world' > readme.md

Git Repository

To begin the discussion on the Git object model, we'll need a Git repository. The "repository" is the hidden directory .git/, which contains everything Git uses to track our revisions: objects, branches, remotes, and so on. From the directory that holds our readme.md, we'll run:

$ git init

and Git kindly creates the repository for us.

Here are the contents of the .git/ directory immediately after running git init:

$ tree .git
.git
├── HEAD
├── config
├── description
├── hooks/
│   ├── applypatch-msg.sample
│   └── # ... (other hooks)
├── info/
│   └── exclude
├── objects/
│   ├── info/
│   └── pack/
└── refs/
    ├── heads/
    └── tags/

The .git directory contains a handful of files and subdirectories (ending with a /), even if nothing has been tracked yet. In this discussion, we'll be focusing on the following paths:

Name	Function
`HEAD`	A "pointer" to the currently checked out object (more on this later).
`objects/`	Where Git stores its representation of all files, directories, and commits.
`refs/`	Where Git stores all branches, tags, remotes, etc.

config, description, hooks/, and info/ contain metadata, and we can ignore them during this discussion.

Adding Our First Object

First, we'll be focusing on the objects/ directory. It's best to think about this directory as a mini file-based database where Git will store its representation of our files, directories, and commits.

If we take a peek at the objects/ directory, we'll see that it remains empty. Git does not run on its own in the background.

If we run git status, we see that Git is aware there is a new file but has not started tracking it, so the file is not in the .git/objects directory yet. To start tracking it, we add the file:

$ git add readme.md

Now we'll take another look, and see that we have our first Git object!

$ tree -I "info|pack" .git/objects
.git/objects
└── 3b
    └── 18e512dba79e4c8300dd08aeb37f8e728b8dad

(The -I option tells tree to ignore paths matching the given pattern. The info and pack directories aren't relevant to our discussion.)

Here we can see a new directory and file. Every object in Git has a unique 40-character hex string as its name, and Git uses this name to store the file (this technique is called [content-addressable storage][]). Git takes the first 2 characters of the object's name and uses them as the directory, with the remaining 38 as the file name. We'll ignore this subtlety going forward, as it is a performance optimization, but it is good to know for later on.
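The split between directory and file name can be sketched in a few lines of shell. This is only an illustration: object_path is a hypothetical helper, not a Git command, and it assumes the object is stored "loose" as described above.

```shell
# object_path is a hypothetical helper for illustration, not a Git command.
object_path() {
  hash=$1
  dir=$(printf '%s' "$hash" | cut -c1-2)   # first 2 hex characters -> directory
  file=$(printf '%s' "$hash" | cut -c3-)   # remaining 38 characters -> file name
  printf '.git/objects/%s/%s\n' "$dir" "$file"
}

object_path 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# prints .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```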

[content-addressable storage]: https://en.wikipedia.org/wiki/Content-addressable_storage

Blobs

This first object we've added to our Git repository is a "blob." Git uses blobs to track files, but you should note that a blob stores only the contents of our file, not the name. We can see this by asking Git to show us the contents of the first object it has stored using the cat-file command and the object's name (note that we don't need the whole name: any unambiguous prefix of at least four characters will do, and eight or so is usually plenty):

$ git cat-file -p 3b18e512
hello world

Again, we see here that Git is only storing the contents of the file, namely hello world. Other information like the mode, permissions, and file name is stored elsewhere.

Because a blob holds only the contents of a file, not its name or any other metadata, two files with identical contents map to the same blob. Git can therefore recognize the same version of a file under a different name and handle it quickly: if we have two files with identical content but different names, the content is stored only once!
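As a quick check of this claim, we can ask Git which object name it would give to two differently named files with the same contents. The sketch below uses git hash-object, which prints the name without storing anything; the temporary directory and the second file name are made up for the example.

```shell
# Two files with different names but identical contents.
workdir=$(mktemp -d)
printf 'hello world\n' > "$workdir/readme.md"
printf 'hello world\n' > "$workdir/copy-of-readme.md"

# git hash-object prints the blob name Git would use for each file's contents.
first=$(git hash-object "$workdir/readme.md")
second=$(git hash-object "$workdir/copy-of-readme.md")

echo "$first"
echo "$second"   # same name as above, so both files share one blob
```

Both lines print the same name, since the file name never enters the picture.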

Hashing Overview

When storing our file, Git uses a "blob" which only concerns itself with the contents of the file, ignoring our filename (for now). However, it does need a name, and we've seen that Git uses a seemingly random 40-character hex string as the name.

It turns out this string is in fact not random at all, but is produced by taking the content that Git stores and running it through a "hash function," the [SHA-1 hash function][] in this case. The hash function takes a string of any size as an input, and returns a 40-character hex string as the output.

SHA-1 has a number of useful properties, but the most important to Git are:

Determinism - The same input will always result in the same output.
Defined range - Every output is a 40-character hex string.
Uniformity - Outputs are evenly distributed over the possible space, and a small change in input yields a huge change in output.

The use of these hash values as the names for our objects makes operations like deep comparison of files and directories easy, as we're only comparing the hash values. Going forward, we'll refer to these names as a "hash."
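These properties are easy to observe from the command line. The following is a small sketch rather than anything Git runs itself: sha1 is a hypothetical wrapper that assumes either sha1sum or shasum is installed, and the capitalized input is just an arbitrary one-character change.

```shell
# sha1 is a hypothetical wrapper: it uses sha1sum where available, else shasum.
sha1() {
  if command -v sha1sum >/dev/null 2>&1; then
    sha1sum | cut -d' ' -f1
  else
    shasum | cut -d' ' -f1
  fi
}

first=$(printf 'hello world\n' | sha1)
again=$(printf 'hello world\n' | sha1)
changed=$(printf 'hello World\n' | sha1)   # one letter capitalized

echo "$first"      # determinism: identical to the next line
echo "$again"
echo "$changed"    # uniformity: a one-character change gives an unrelated-looking hash
echo "${#first}"   # defined range: prints 40
```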

[SHA-1 hash function]: https://en.wikipedia.org/wiki/SHA-1

Object Storage Subtleties

We're close to fully understanding how Git stores our file, but if we try to replicate the hash function on the content "hello world," it doesn't match with the 3b18e512... hash.

$ echo 'hello world' | shasum
22596363b3de40b06f981fb85d82312e8c0ed511  -

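This is because Git prepends a bit of metadata to the object contents: the object type, a space, the length of the contents in bytes, and a null byte as a separator. Git then passes the combined string through the hash function. Our contents are 12 bytes long (the 11 characters of hello world plus a trailing newline), so we can reproduce the hash with shasum:

$ printf 'blob 12\0hello world\n' | shasum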
3b18e512dba79e4c8300dd08aeb37f8e728b8dad  -

Now we have a complete understanding of how Git names and stores our blob objects. Git uses a similar metadata structure for the other object types, but from here on we'll only discuss the primary content of the Git objects, ignoring the metadata.
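To make the naming rule concrete for any file, not just our 12-byte example, here is a minimal sketch that builds the header from the file's size and hashes header plus contents. Both blob_hash and sha1 are hypothetical helpers written for this example, and the sketch assumes sha1sum or shasum is available.

```shell
# sha1 is a hypothetical wrapper: it uses sha1sum where available, else shasum.
sha1() {
  if command -v sha1sum >/dev/null 2>&1; then
    sha1sum | cut -d' ' -f1
  else
    shasum | cut -d' ' -f1
  fi
}

# blob_hash is a hypothetical helper that mimics how Git names a blob:
# "blob", a space, the size in bytes, a null byte, then the contents.
blob_hash() {
  size=$(wc -c < "$1" | tr -d ' ')
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1
}

file=$(mktemp)
printf 'hello world\n' > "$file"
blob_hash "$file"
# prints 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Running git hash-object on the same file should print the same name, which is a handy way to check the helper against Git itself.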

This metadata has the benefit of allowing us (and Git!) to know the type of an object in isolation. Let's check the object's type with the -t, for "type", flag passed to cat-file:

$ git cat-file -t 3b18e512
blob

Trees

It wouldn't be very useful if we could only track the contents of files, and not even the file names or the structure. This is where "tree" objects come in. A "tree" in Git is an object (a file, really) which contains a list of pointers to blobs or other trees. Each line in the tree object's file contains a pointer (the object's hash) to one such object (tree or blob), while also providing the mode, object type, and a name for the file or directory.

$ git add readme.md
$ git commit -m 'Add readme'
$ tree -I "info|pack" .git/objects
.git/objects
├── 3b
│   └── 18e512dba79e4c8300dd08aeb37f8e728b8dad
├── 73
│   └── 94b8cc9ca916312a79ce8078c34b49b1617718
└── ef
    └── 34a153025fffb8a498fff540f7c93963937291

In committing our readme.md file, we create two new objects. One, ef34a15..., is the object for the commit itself; the other, 7394b8c..., is the new "tree" object that represents our current directory. (Remember that the full hash for an object is its directory name prepended to the file name, 73 and 94b8c... in this case.) We can view it with:

$ git cat-file -t 7394b8c
tree

$ git cat-file -p 7394b8c
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    readme.md

Our new tree object consists of a single line, listing the mode/permissions, the type, the hash, and the file name of our readme.md file.
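To see that a tree really is just such a list, we can build one by hand with Git's lower-level commands instead of git commit. This is a sketch in a throwaway repository (the temporary directory is an assumption of the example): git hash-object -w stores a blob, and git mktree turns a listing in the format above into a tree object.

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Store the file contents as a blob (-w writes it into .git/objects).
blob=$(printf 'hello world\n' | git hash-object -w --stdin)

# Build a tree with one entry: mode, type, hash, a tab, then the name.
tree=$(printf '100644 blob %s\treadme.md\n' "$blob" | git mktree)

git cat-file -t "$tree"   # prints: tree
git cat-file -p "$tree"   # lists the single readme.md entry
```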

Again, we'll note that tree objects, much like blobs, do not store their own names. A parent tree supplies the names of its subtrees, and the root tree, which represents the top-level directory of the repository, in fact has no name.

This has two fun characteristics:

The repo doesn't care what you call it. You can rename your local directory that contains your repository to anything you'd like. Git is blissfully unaware of the name of the directory that contains the .git repo directory.
We can rename subtrees as much as we want, and only parent objects need to update. The subtree object itself and everything below remain untouched.
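The second point can be checked with a small experiment. The sketch below is an illustration in a throwaway repository; the app and lib directory names are arbitrary. It uses git write-tree to turn the staged files into tree objects, and git rev-parse with the tree:path syntax to look up the subtree's hash before and after a rename.

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir app
printf 'hello world\n' > app/script.rb
git add --all

# write-tree builds tree objects from the staged files and prints the root tree's hash.
root_before=$(git write-tree)
subtree_before=$(git rev-parse "$root_before:app")

# Rename the directory, then build the trees again.
git mv app lib
root_after=$(git write-tree)
subtree_after=$(git rev-parse "$root_after:lib")

echo "$subtree_before"
echo "$subtree_after"   # same hash: the subtree object was not rewritten
echo "$root_before"
echo "$root_after"      # different hash: only the parent records the new name
```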

[it's just that no one has fixed it]: https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F

Subtrees

In practice, a tree contains at least one file, either directly or somewhere beneath one of its subtrees. An empty directory would yield an empty tree, which Git does not record, and this is why we can't track empty directories in Git and have to use tricks like adding a hidden .gitkeep file into the directory. (Technically, there's no strict reason for this limitation, [it's just that no one has fixed it][]... ¯\_(ツ)_/¯)

We'll create a subtree with:

$ mkdir app
$ touch app/script.rb
$ git add --all
$ git status --short --branch
## master
A  app/script.rb
$ git commit -m 'Another file in app dir'

Looking in the objects/ directory, there are now multiple new objects.

Of most interest to us is the new tree object for our working directory, our current root tree. Since there are so many objects now, we can't just guess which is which, but we can ask Git directly to show us a specific tree using the ls-tree command.

$ git ls-tree master
040000 tree 67b21f78a4548b2ba3eab318bb3628d039e851e6    app
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    readme.md

We passed master as the "tree" to view since master, a branch, points at a commit, which points at a tree. By passing master as the argument, we're identifying the tree master indirectly points at.

We have a new tree object and it contains a new line, the first line, identifying a subtree for our app subdirectory. We can view the tree by grabbing its hash and running:

$ git ls-tree 67b21f78a4548b2ba3eab318bb3628d039e851e6
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    script.rb

And here we see a single line for the script.rb file in the app directory. This rounds out our understanding of trees. To review:

Trees list out the contents of a directory (blobs and subtrees).
For each entry, the mode (permissions), type, hash, and name are listed.
Tree objects must contain at least one blob or tree.
Trees can be nested to any depth.
Trees, like blobs, don't store their own names. The names are stored in parent trees.
Trees are named and stored in the objects/ directory by hashing their contents (the list of objects described above).

Commits

A commit is the final piece in our object puzzle, and it connects all of the other objects together, acting as the hub of our object graph. Just like with our other objects, we can use cat-file -p to get a pretty printed view of how Git stores commit objects.

$ git log --oneline --decorate --graph
* f95b2fe (HEAD -> master) Another file in app dir
* ef34a15 Add readme

$ git cat-file -p f95b2fe
tree 0cae7dc167b255c0123c7c396fc48ce40fc35cfa
parent ef34a153025fffb8a498fff540f7c93963937291
author Chris Toomey <chris@ctoomey.com> 1441311544 -0400
committer Chris Toomey <chris@ctoomey.com> 1441311544 -0400

Another file in app dir

A commit is a file, just like our blob and tree objects, with a specific structure. That structure includes:

The hash of the root tree - This is always exactly one tree. That tree can point to any number of subtrees, but for every version of our code there is one root tree.
The hash of the parent commit(s) - Git tracks the history by simply pointing at the previous commit. In the case of a merge, there can be multiple parent commits, or, in the case of the root commit, no parents.
The author info and date.
The committer info and date.
The full commit message.

This file is then passed through the SHA-1 hash function and saved alongside our other objects in the .git/objects directory.
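As with blobs, the name of a commit can be reproduced by hand: Git hashes a header (here with the type commit) followed by the commit's contents. The sketch below demonstrates this in a throwaway repository; the author identity is a placeholder, and sha1 is a hypothetical wrapper that assumes sha1sum or shasum is installed.

```shell
# sha1 is a hypothetical wrapper: it uses sha1sum where available, else shasum.
sha1() {
  if command -v sha1sum >/dev/null 2>&1; then
    sha1sum | cut -d' ' -f1
  else
    shasum | cut -d' ' -f1
  fi
}

# Create a throwaway repository with a single commit (placeholder identity).
repo=$(mktemp -d)
cd "$repo"
git init -q
printf 'hello world\n' > readme.md
git add readme.md
git -c user.name='Example Author' -c user.email='author@example.com' \
    -c commit.gpgsign=false commit -q -m 'Add readme'

# Dump the raw commit contents, then hash "commit <size>\0<contents>".
body=$(mktemp)
git cat-file commit HEAD > "$body"
size=$(wc -c < "$body" | tr -d ' ')
manual=$({ printf 'commit %s\0' "$size"; cat "$body"; } | sha1)

echo "$manual"
git rev-parse HEAD   # the same hash, as computed by Git
```

Because the contents include the author, committer, and timestamps, the hash printed here will differ from run to run, but the two lines always agree with each other.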

It's worth noting that the commit object contains only a single reference to a root tree, a complete snapshot of the project; Git doesn't store diffs. When diffing two commits, Git compares their trees and computes the diff on demand.

Walking the Git Object Graph

Commits are the core of the Git object graph, and we'll explore this by walking this graph just a bit:

# using the "parent" commit hash from above
$ git cat-file -p ef34a153025fffb8a498fff540f7c93963937291
tree 7394b8cc9ca916312a79ce8078c34b49b1617718
author Chris Toomey <chris@ctoomey.com> 1441311368 -0400
committer Chris Toomey <chris@ctoomey.com> 1441311368 -0400

Add readme

Here we've walked the graph and viewed the parent commit, which we can see is the root commit, as it has no parent of its own. Now we go a step further to view the root tree of this commit:

$ git cat-file -p 7394b8cc9ca916312a79ce8078c34b49b1617718
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    readme.md

Here we see the initial directory with the single readme.md file, and finally its contents:

$ git cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
hello world
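The same walk can be scripted. The following is a rough sketch, not part of the original walkthrough: it recreates a similar two-commit history in a throwaway repository (make_commit is a hypothetical helper and the author identity is a placeholder), then follows each commit to its tree and on to its blobs.

```shell
# Recreate a small two-commit history in a throwaway repository.
repo=$(mktemp -d)
cd "$repo"
git init -q

# make_commit is a hypothetical helper that commits with a placeholder identity.
make_commit() {
  git -c user.name='Example Author' -c user.email='author@example.com' \
      -c commit.gpgsign=false commit -q -m "$1"
}

printf 'hello world\n' > readme.md
git add --all
make_commit 'Add readme'

mkdir app
touch app/script.rb
git add --all
make_commit 'Another file in app dir'

# rev-list follows parent pointers from HEAD back to the root commit.
git rev-list HEAD | while read -r commit_hash; do
  tree=$(git rev-parse "$commit_hash^{tree}")   # the commit's single root tree
  echo "commit $commit_hash -> tree $tree"
  git ls-tree -r "$tree"                        # -r recurses into subtrees, listing blobs
done
```

The commit hashes will differ from the ones shown above, since they include the author and timestamps, while blob and tree hashes depend only on the contents they describe.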

Conclusion

And with that, we've covered the primary objects Git uses to store our code:

[Figure: Git base object model]

Blobs - Contain the contents of a file.
Trees - List the contents of a directory, connecting "blobs" with names and permissions to reference files and subdirectories.
Commits - Store a reference to a specific version of the code (a single tree), the direct parentage (parent commit hash(es)), and other metadata.

We've also seen how we can walk through the history by following the hash references from commits to trees and eventually to blobs. This wraps up the foundation of the object model, but in the next video we'll see how branches, remotes, and tags fit in, as well as how Git commands operate on these objects.
