---
title: Rebuilding Git in Ruby
teaser: From zero to commit! Let's rebuild Git with a more familiar approach for Rails
  developers to get a better understanding of how Git works under the hood.
tags: git,ruby
author: Joël Quenneville
published_on: 2016-03-14
---

Git is a distributed version control system (DVCS) that we use every day to
manage our code. It is a powerful tool, but have you ever wondered how it works
its magic? The Git internal docs can be intimidating and incomplete, and they
lack examples. Digging through Git's implementation can also be intimidating,
particularly if you aren't familiar with C.

Pulling apart the engine and putting it back together is one of the best ways to
understand how a system works. However, instead of writing C, let's use
something more familiar to us as Rails developers. Let's re-implement Git in
Ruby!

If you want to dig deeper into the implementation, check out the [RGit source on
GitHub][rgit].

[rgit]: https://github.com/JoelQ/rgit

## Git commands

Git is built in a modular fashion, following the [UNIX philosophy][unix philosophy]
of small, sharp tools. Each command is its own script file and the top level
`git` command simply proxies to them. Git ships with a number of built-in
commands, but custom commands can be written as long as they follow a given
naming convention: an executable named `git-<command>` on your `PATH`.

```ruby
#!/usr/bin/env ruby

# bin/rgit

command, *args = ARGV

if command.nil?
  $stderr.puts "Usage: rgit <command> [<args>]"
  exit 1
end

path_to_command = File.expand_path("../rgit-#{command}", __FILE__)
if !File.exist? path_to_command
  $stderr.puts "No such command"
  exit 1
end

exec path_to_command, *args
```

This script does one of three things when we call it:

- Outputs usage information if no subcommand was given
- Outputs an error message if no script for the subcommand was found
- Runs the given subcommand if it is found

Notice that we pass on any additional arguments to the subcommand.

As good UNIX citizens, we output messages to the standard error stream and
return a non-zero exit code when errors occur.
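
To make the naming convention concrete, here is a minimal sketch of the lookup
step in isolation. The `rgit-hello` subcommand and the `find_subcommand` helper
are hypothetical names invented for this illustration; `bin/rgit` inlines the
same check rather than calling a helper.

```ruby
require "tmpdir"

# Hypothetical helper: returns the path of the script backing `command`,
# or nil when no `rgit-<command>` file exists in `bin_dir`.
def find_subcommand(bin_dir, command)
  path = File.join(bin_dir, "rgit-#{command}")
  File.exist?(path) ? path : nil
end

# Build a throwaway bin directory containing one custom subcommand.
BIN_DIR = Dir.mktmpdir("rgit-bin")
HELLO_PATH = File.join(BIN_DIR, "rgit-hello")
File.write(HELLO_PATH, "#!/usr/bin/env ruby\nputs \"Hello from a custom command\"\n")
File.chmod(0o755, HELLO_PATH) # exec needs the script to be executable

p find_subcommand(BIN_DIR, "hello")   # the path to rgit-hello
p find_subcommand(BIN_DIR, "missing") # nil
```

Dropping an executable `rgit-hello` file next to `bin/rgit` is all it would
take for `rgit hello` to work; the dispatcher itself never needs to change.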

[unix philosophy]: https://en.wikipedia.org/wiki/Unix_philosophy

## Initializing a repository

Git stores all of its data and metadata in a `.git` directory in the root of the
repository. The `git init` command initializes the `.git` directory and a few
subdirectories as follows:

```tree
.git
├── HEAD
├── config
├── objects
│  ├── info
│  └── pack
└── refs
    ├── heads
    └── tags
```

`HEAD` is a file that has the hard-coded value `ref: refs/heads/master`. We'll
need this file later. `config` contains configuration for the repo. We'll ignore
it for now in the interest of simplicity. The remaining items in the tree are
empty directories.

Generating this structure is mostly a lot of calls to `Dir.mkdir`:

```ruby
#!/usr/bin/env ruby

# bin/rgit-init

RGIT_DIRECTORY = ".rgit".freeze
OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze
REFS_DIRECTORY = "#{RGIT_DIRECTORY}/refs".freeze

if Dir.exist? RGIT_DIRECTORY
  $stderr.puts "Existing RGit project"
  exit 1
end

def build_objects_directory
  Dir.mkdir OBJECTS_DIRECTORY
  Dir.mkdir "#{OBJECTS_DIRECTORY}/info"
  Dir.mkdir "#{OBJECTS_DIRECTORY}/pack"
end

def build_refs_directory
  Dir.mkdir REFS_DIRECTORY
  Dir.mkdir "#{REFS_DIRECTORY}/heads"
  Dir.mkdir "#{REFS_DIRECTORY}/tags"
end

def initialize_head
  File.open("#{RGIT_DIRECTORY}/HEAD", "w") do |file|
    file.puts "ref: refs/heads/master"
  end
end

Dir.mkdir RGIT_DIRECTORY
build_objects_directory
build_refs_directory
initialize_head

$stdout.puts "RGit initialized in #{RGIT_DIRECTORY}"
```

This script is called `rgit-init` in keeping with the conventions expected by
the `rgit` command we built. If there is already a `.rgit` directory, we output
an error message and exit with a non-zero exit code. Real Git allows you to
safely "re-initialize" a repository but let's opt out of this edge case for our
MVP.

The `init` command is a little verbose but very boring. It creates a bunch of
directories as well as the `HEAD` file.
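
If the verbosity bothers you, the same layout can be sketched more compactly
with `FileUtils.mkdir_p`, which creates intermediate directories as needed. The
`init_repository` helper and its `root` parameter are assumptions made for this
illustration so that the sketch can run against a temporary directory; this is
not how `bin/rgit-init` is written.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: builds the same layout as bin/rgit-init under `root`.
def init_repository(root)
  rgit = File.join(root, ".rgit")
  raise "Existing RGit project" if Dir.exist?(rgit)

  # mkdir_p creates the parents (.rgit, objects, refs) along the way.
  %w[objects/info objects/pack refs/heads refs/tags].each do |dir|
    FileUtils.mkdir_p File.join(rgit, dir)
  end
  File.write File.join(rgit, "HEAD"), "ref: refs/heads/master\n"
  rgit
end

Dir.mktmpdir do |root|
  rgit = init_repository(root)
  # List everything that was created, relative to the .rgit directory.
  puts Dir.glob("**/*", base: rgit).sort
end
```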

## Adding files to the staging area

Git lets us capture a snapshot of the current state of a file via the `git add`
command. The set of these snapshots is called the _staging area_. A list of
snapshots and their metadata is stored at `.rgit/index`. Staging a file
takes a few steps:

- Create a SHA based on the file contents
- Create a blob by compressing the file contents
- Save that blob as `.rgit/objects/<first-two-characters-of-sha>/<rest-of-sha>`
- Add the SHA and original file path to the index so we can retrieve it later

The index is a binary file that has the [following format][git-index-format]:

[git-index-format]: https://github.com/git/git/blob/master/Documentation/technical/index-format.txt

    DIRC <version_number> <number of entries>

    <ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>
    <ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>
    <ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>

    # more entries

A lot of this metadata comes in handy for calculations done by other commands.
If you try to open this file, however, you will see a bunch of gibberish.

<kbd>cat .git/index</kbd>

```git-index-format
bin/rgit-initTREE52 1?Ibin/rgitU?U?2????        ???
C??B=????''9bin2 0
?Cԣ̏k?i??`V:??3'9Z?6??赠xa?cǢbF
```

This is because the contents of the index file are stored in a _binary format_
for performance reasons.

For simplicity and human-readability, let's ignore most of the metadata and use
a text format. We can return and add these features as they become necessary in
the future.

For now, RGit's index format will look like:

    <SHA> <path>
    <SHA> <path>
    <SHA> <path>

    # more entries

Let's look at the actual Ruby code to do all this!

```ruby
#!/usr/bin/env ruby

# bin/rgit-add

require "digest"
require "zlib"
require "fileutils"

RGIT_DIRECTORY = ".rgit".freeze
OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze
INDEX_PATH = "#{RGIT_DIRECTORY}/index"

if !Dir.exist? RGIT_DIRECTORY
  $stderr.puts "Not an RGit project"
  exit 1
end

path = ARGV.first

if path.nil?
  $stderr.puts "No path specified"
  exit 1
end

file_contents = File.read(path)
sha = Digest::SHA1.hexdigest file_contents
blob = Zlib::Deflate.deflate file_contents
object_directory = "#{OBJECTS_DIRECTORY}/#{sha[0..1]}"
FileUtils.mkdir_p object_directory
blob_path = "#{object_directory}/#{sha[2..-1]}"

# Deflated data is binary, so open the file in binary mode.
File.open(blob_path, "wb") do |file|
  file.print blob
end

File.open(INDEX_PATH, "a") do |file|
  file.puts "#{sha} #{path}"
end
```

Let's start versioning RGit with RGit! First we need to add a file to the
staging area:

<kbd>rgit add bin/rgit</kbd>

Our `.rgit` directory now looks like:

```tree
.rgit
├── HEAD
├── index
├── objects
│   ├── b3
│   │   └── 02dd6f8cd2b385b170e78c14503342c0ba6ae8
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags
```

Notice that we now have a file in the `objects` directory. It contains the
compressed source of `bin/rgit`.

Finally, our index looks like:

<kbd>cat .rgit/index</kbd>

    b302dd6f8cd2b385b170e78c14503342c0ba6ae8 bin/rgit
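
To convince ourselves that nothing was lost, we can go the other way: look up a
path in the index, inflate the blob, and check that its contents hash back to
the same SHA. This is a minimal sketch and not part of RGit. `stage` and
`read_staged` are hypothetical helpers, and the sketch works in a temporary
directory rather than in a real `.rgit` directory.

```ruby
require "digest"
require "zlib"
require "fileutils"
require "tmpdir"

# Hypothetical helper mirroring bin/rgit-add: store a compressed blob and
# append "<SHA> <path>" to the index. `rgit_dir` stands in for `.rgit`.
def stage(rgit_dir, path, contents)
  sha = Digest::SHA1.hexdigest(contents)
  object_directory = File.join(rgit_dir, "objects", sha[0..1])
  FileUtils.mkdir_p object_directory
  File.binwrite File.join(object_directory, sha[2..-1]), Zlib::Deflate.deflate(contents)
  File.open(File.join(rgit_dir, "index"), "a") { |file| file.puts "#{sha} #{path}" }
  sha
end

# Hypothetical helper: find `path` in the index and inflate its blob.
# The index is append-only, so the last matching entry is the newest one.
def read_staged(rgit_dir, path)
  entry = File.readlines(File.join(rgit_dir, "index"), chomp: true)
              .map { |line| line.split(" ", 2) }
              .reverse
              .find { |_sha, staged_path| staged_path == path }
  return nil if entry.nil?

  sha = entry.first
  blob = File.binread File.join(rgit_dir, "objects", sha[0..1], sha[2..-1])
  Zlib::Inflate.inflate(blob)
end

Dir.mktmpdir do |rgit_dir|
  source = "puts 'hello'\n"
  sha = stage(rgit_dir, "bin/rgit", source)
  restored = read_staged(rgit_dir, "bin/rgit")

  puts restored == source                      # true
  puts Digest::SHA1.hexdigest(restored) == sha # true
end
```

Splitting each index line with a limit of two fields keeps paths that contain
spaces intact, which a bare `split` would not.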

## Committing files

Blobs are the contents of a particular file at a particular time. In order to
capture a snapshot of the entire project, Git bundles a bunch of these into a
_commit_.

In order to capture the directory structure of the project, Git creates a "tree"
object for each directory of a project. Each tree object contains a list of the
tracked files and their associated blobs as well as tree objects for
subdirectories.

This gives us a tree structure that mirrors the tracked project's filesystem.
Directories are represented by "tree" objects while files are "blobs". This
whole tree structure is then tied to a "commit" object so that we can refer to
it later.

The commit command does three things:

1. Build the tree/blob structure
2. Create a commit object that points to that structure
3. Update the current branch to point to this commit.

Because creating objects is a common task, I've extracted it to `RGit::Object`.

```ruby
# lib/rgit/object.rb

require "fileutils"

module RGit
  RGIT_DIRECTORY = "#{Dir.pwd}/.rgit".freeze
  OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze

  class Object
    def initialize(sha)
      @sha = sha
    end

    def write(&block)
      object_directory = "#{OBJECTS_DIRECTORY}/#{sha[0..1]}"
      FileUtils.mkdir_p object_directory
      object_path = "#{object_directory}/#{sha[2..-1]}"
      File.open(object_path, "w", &block)
    end

    private

    attr_reader :sha
  end
end
```

This class handles all of the directory/path related tasks as well as opening
the file. It then yields to the given block for the actual writing of the
object's contents.

With this refactor done, let's take a look at the commit command:

```ruby
#!/usr/bin/env ruby

# bin/rgit-commit

$LOAD_PATH << File.expand_path("../../lib", __FILE__)
require "digest"
require "time"
require "rgit/object"

RGIT_DIRECTORY = "#{Dir.pwd}/.rgit".freeze
INDEX_PATH = "#{RGIT_DIRECTORY}/index"
COMMIT_MESSAGE_TEMPLATE = <<-TXT
# Title
#
# Body
TXT

def index_files
  # File.foreach closes the file once iteration finishes.
  File.foreach(INDEX_PATH)
end

def index_tree
  index_files.each_with_object({}) do |line, obj|
    # Each index line has two fields: "<SHA> <path>".
    sha, path = line.split
    segments = path.split("/")
    segments.reduce(obj) do |memo, s|
      if s == segments.last
        memo[segments.last] = sha
        memo
      else
        memo[s] ||= {}
        memo[s]
      end
    end
  end
end

def build_tree(name, tree)
  sha = Digest::SHA1.hexdigest(Time.now.iso8601 + name)
  object = RGit::Object.new(sha)

  object.write do |file|
    tree.each do |key, value|
      if value.is_a? Hash
        dir_sha = build_tree(key, value)
        file.puts "tree #{dir_sha} #{key}"
      else
        file.puts "blob #{value} #{key}"
      end
    end
  end

  sha
end

def build_commit(tree:)
  commit_message_path = "#{RGIT_DIRECTORY}/COMMIT_EDITMSG"

  `echo "#{COMMIT_MESSAGE_TEMPLATE}" > #{commit_message_path}`
  `$VISUAL #{commit_message_path} >/dev/tty`

  message = File.read commit_message_path
  committer = "user"
  sha = Digest::SHA1.hexdigest(Time.now.iso8601 + committer)
  object = RGit::Object.new(sha)

  object.write do |file|
    file.puts "tree #{tree}"
    file.puts "author #{committer}"
    file.puts
    file.puts message
  end

  sha
end

def update_ref(commit_sha:)
  current_branch = File.read("#{RGIT_DIRECTORY}/HEAD").strip.split.last

  File.open("#{RGIT_DIRECTORY}/#{current_branch}", "w") do |file|
    file.print commit_sha
  end
end

def clear_index
  File.truncate INDEX_PATH, 0
end

if index_files.count == 0
  $stderr.puts "Nothing to commit"
  exit 1
end

root_sha = build_tree("root", index_tree)
commit_sha = build_commit(tree: root_sha)
update_ref(commit_sha: commit_sha)
clear_index
```

This file does several things:

1. Exits with error code and message if there are no files to commit
2. Creates all the necessary tree objects for the files in the index
3. Creates a commit object pointing to the root tree object
4. Updates the current branch to point to the commit
5. Clears the index

Building the tree is done in two passes. First the index is converted into a
hash structure representing the file tree. Secondly, this structure is converted
to tree objects on the filesystem. Both steps are done recursively.
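
The first pass is easier to picture with concrete data. Here is a minimal
sketch of that conversion, written as a standalone function rather than as
RGit's `index_tree`. The `nest_paths` name, the shortened SHAs, and the
`README.md` entry are made up for this illustration.

```ruby
# Hypothetical helper: turns [[sha, path], ...] into a nested hash where
# directories are hashes and files map to their blob SHA.
def nest_paths(entries)
  entries.each_with_object({}) do |(sha, path), root|
    *directories, filename = path.split("/")
    # Walk (and create) one hash per directory segment.
    leaf = directories.reduce(root) { |node, dir| node[dir] ||= {} }
    leaf[filename] = sha
  end
end

entries = [
  ["b302dd", "bin/rgit"],
  ["f960e7", "bin/rgit-add"],
  ["8cfe56", "README.md"],
]

tree = nest_paths(entries)
p tree.keys        # ["bin", "README.md"]
p tree["bin"].keys # ["rgit", "rgit-add"]
```

Splitting the path into directories and a filename by position, rather than
comparing each segment against the last one by name, also handles a file that
shares its name with its parent directory, such as `lib/lib`.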

For the commit message, we simply open a file using the user's
[`$VISUAL`](https://thoughtbot.com/blog/visual-ize-the-future) editor. Once
the user exits their editor, we read the file and put the contents into the
commit.

Let's see it all come together. Staging and committing `bin/rgit` and
`bin/rgit-add` gives us the following results in `.rgit`:

```tree
.rgit
├── COMMIT_EDITMSG
├── HEAD
├── index
├── objects
│   ├── 63
│   │   └── 45493c987e6144cc68142ad2405db681b28628
│   ├── 8c
│   │   └── fe566596683acae588039156f40ecaff282c30
│   ├── ae
│   │   └── 161568392ed9aa321466446a9bb01acb111e4f
│   ├── b3
│   │   └── 02dd6f8cd2b385b170e78c14503342c0ba6ae8
│   ├── f9
│   │   └── 60e7d48c47e86289a653b0afc0b7a13a9d372e
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags
```

In order to find the current state, we first look up what branch we are on by
checking `.rgit/HEAD`. This points to `.rgit/refs/heads/master`, the master
branch. The master branch points to its latest commit. The commit in turn points
to a tree object representing the root of the project. This tree object points
to another tree object representing the `bin/` directory which in turn points to
two blob objects containing the compressed contents of `bin/rgit` and
`bin/rgit-add` at the time of the commit.

![](https://images.thoughtbot.com/rebuilding-git-in-ruby/mTnEMJCNS2wM3a9JtbCw_rgit-commit-tree.png)

This structure of objects pointing to each other is what makes Git so powerful.
By simply changing a few of these pointing files, we can switch to different
points in history.
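
That chain of lookups can be sketched in a few lines of Ruby. This is an
illustration only: `resolve_head`, `read_object`, `tree_of`, and `write_object`
are hypothetical helpers, the SHAs are made up, and the sketch assumes the
plain-text object format written by `bin/rgit-commit` (a `tree <sha>` first
line in each commit).

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: HEAD -> branch file -> commit SHA.
def resolve_head(rgit_dir)
  ref = File.read(File.join(rgit_dir, "HEAD")).strip.split.last
  File.read(File.join(rgit_dir, ref)).strip
end

# Hypothetical helper: read an RGit object stored as plain text.
def read_object(rgit_dir, sha)
  File.read File.join(rgit_dir, "objects", sha[0..1], sha[2..-1])
end

# Hypothetical helper: commit SHA -> SHA of its root tree.
def tree_of(rgit_dir, commit_sha)
  read_object(rgit_dir, commit_sha).lines.first.split.last
end

# Hypothetical helper used only to set up the demo repository below.
def write_object(rgit_dir, sha, contents)
  directory = File.join(rgit_dir, "objects", sha[0..1])
  FileUtils.mkdir_p directory
  File.write File.join(directory, sha[2..-1]), contents
end

Dir.mktmpdir do |rgit_dir|
  commit_sha = "ae" + ("1" * 38) # made-up SHAs for illustration
  tree_sha = "63" + ("2" * 38)

  FileUtils.mkdir_p File.join(rgit_dir, "refs", "heads")
  File.write File.join(rgit_dir, "HEAD"), "ref: refs/heads/master\n"
  File.write File.join(rgit_dir, "refs", "heads", "master"), commit_sha
  write_object rgit_dir, commit_sha, "tree #{tree_sha}\nauthor user\n\nFirst commit\n"

  head_commit = resolve_head(rgit_dir)
  puts head_commit == commit_sha                  # true
  puts tree_of(rgit_dir, head_commit) == tree_sha # true
end
```

Moving a branch is then just a matter of writing a different commit SHA into
`refs/heads/master`; none of the objects need to change.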

