Git is a distributed version control system (DVCS) that we use every day to manage our code. It is a powerful tool, but have you ever wondered how it works its magic? The Git internals docs can be intimidating and incomplete, and they lack examples. Digging through Git’s implementation can also be intimidating, particularly if you aren’t familiar with C.
Pulling apart the engine and putting it back together is one of the best ways to understand how a system works. However, instead of writing C, let’s use something more familiar to us as Rails developers. Let’s re-implement Git in Ruby!
If you want to dig deeper into the implementation, check out the RGit source on GitHub.
Git commands
Git is built in a modular fashion, following the UNIX philosophy of small, sharp tools. Each command is its own script file and the top-level git command simply proxies to them. Git ships with a number of built-in commands, but custom commands can be written as long as they follow a given naming convention: an executable named git-<name> somewhere on your PATH.
#!/usr/bin/env ruby
# bin/rgit

command, *args = ARGV

if command.nil?
  $stderr.puts "Usage: rgit <command> [<args>]"
  exit 1
end

path_to_command = File.expand_path("../rgit-#{command}", __FILE__)

if !File.exist? path_to_command
  $stderr.puts "No such command"
  exit 1
end

exec path_to_command, *args
This script does one of three things when we call it:
- Outputs usage information if no subcommand was given
- Outputs an error message if no script for the subcommand was found
- Runs the given subcommand if it is found
Notice that we pass on any additional arguments to the subcommand.
As good UNIX citizens, we output messages to the standard error stream and return a non-zero exit code when errors occur.
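The lookup at the heart of this script can also be pulled into a small function, which makes it easy to try out in isolation. The sketch below is an illustration and not part of RGit: the name resolve_subcommand is hypothetical, and it assumes subcommands live in one directory as rgit-<name> files.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: return the path of the script for `command`
# inside `bin_dir`, or nil when no such script exists.
def resolve_subcommand(bin_dir, command)
  return nil if command.nil? || command.empty?

  path = File.join(bin_dir, "rgit-#{command}")
  File.file?(path) ? path : nil
end

# Try it against a throwaway directory containing one fake subcommand.
Dir.mktmpdir do |bin_dir|
  FileUtils.touch(File.join(bin_dir, "rgit-init"))

  p resolve_subcommand(bin_dir, "init")    # a path ending in "rgit-init"
  p resolve_subcommand(bin_dir, "missing") # nil
end
```

Returning nil instead of exiting keeps the lookup separate from the error reporting, which stays in the script itself.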
Initializing a repository
Git stores all of its data and metadata in a .git directory in the root of the repository. The git init command initializes the .git directory and a few subdirectories as follows:
.git
├── HEAD
├── config
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags
HEAD is a file that has the hard-coded value ref: refs/heads/master. We’ll need this file later. config contains configuration for the repo. We’ll ignore it for now in the interest of simplicity. The remaining items in the tree are empty directories.
Generating this structure is mostly a lot of calls to Dir.mkdir:
#!/usr/bin/env ruby
# bin/rgit-init

RGIT_DIRECTORY = ".rgit".freeze
OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze
REFS_DIRECTORY = "#{RGIT_DIRECTORY}/refs".freeze

# Dir.exist? replaces Dir.exists?, which was removed in Ruby 3.2.
if Dir.exist? RGIT_DIRECTORY
  $stderr.puts "Existing RGit project"
  exit 1
end

def build_objects_directory
  Dir.mkdir OBJECTS_DIRECTORY
  Dir.mkdir "#{OBJECTS_DIRECTORY}/info"
  Dir.mkdir "#{OBJECTS_DIRECTORY}/pack"
end

def build_refs_directory
  Dir.mkdir REFS_DIRECTORY
  Dir.mkdir "#{REFS_DIRECTORY}/heads"
  Dir.mkdir "#{REFS_DIRECTORY}/tags"
end

def initialize_head
  File.open("#{RGIT_DIRECTORY}/HEAD", "w") do |file|
    file.puts "ref: refs/heads/master"
  end
end

Dir.mkdir RGIT_DIRECTORY
build_objects_directory
build_refs_directory
initialize_head

$stdout.puts "RGit initialized in #{RGIT_DIRECTORY}"
This script is called rgit-init in keeping with the conventions expected by the rgit command we built. If there is already a .rgit directory, we output an error message and exit with a non-zero exit code. Real Git allows you to safely “re-initialize” a repository but let’s opt out of this edge case for our MVP.
The init command is a little verbose but very boring. It creates a bunch of directories as well as the HEAD file.
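The same layout can be built more compactly with FileUtils.mkdir_p, which creates missing parent directories in one call. This is only a sketch of an alternative, not how rgit-init is written above; the method name init_layout and its root argument are made up for the example.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: create the RGit directory skeleton under `root`.
def init_layout(root)
  rgit = File.join(root, ".rgit")
  raise "Existing RGit project" if Dir.exist?(rgit)

  # mkdir_p creates missing parents, so listing the leaf directories is enough.
  %w[objects/info objects/pack refs/heads refs/tags].each do |dir|
    FileUtils.mkdir_p(File.join(rgit, dir))
  end
  File.write(File.join(rgit, "HEAD"), "ref: refs/heads/master\n")
  rgit
end

# Build the skeleton in a throwaway directory and list what was created.
Dir.mktmpdir do |root|
  rgit = init_layout(root)
  puts Dir.glob("**/*", base: rgit).sort
end
```

Taking the root as an argument, rather than relying on the current directory, makes the layout easy to exercise in a temporary directory.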
Adding files to the staging area
Git lets you capture a snapshot of the current state of a file via the git add command. The set of these snapshots is called the staging area. A list of snapshots and their metadata is stored in the index file (.rgit/index for RGit). Staging a file takes a few steps:
- Create a SHA based on the file contents
- Create a blob by compressing the file contents
- Save that blob as .rgit/objects/<first-two-characters-of-sha>/<rest-of-sha>
- Add the SHA and original file path to the index so we can retrieve it later
In real Git, the index is a binary file that has the following format:
DIRC <version_number> <number of entries>
<ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>
<ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>
<ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>
# more entries
A lot of this metadata comes in handy for calculations done by other commands. If you try to open this file, however, you will see a bunch of gibberish.
cat .git/index
bin/rgit-initTREE52 1?Ibin/rgitU?U?2???? ???
C??B=????''9bin2 0
?Cԣ̏k?i??`V:??3'9Z?6??赠xa?cǢbF
This is because the contents of the index file are stored in a binary format for performance reasons.
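To see that those bytes are structured rather than random, we can decode just the header. In real Git's index, the first 12 bytes are the signature DIRC followed by a version number and an entry count, both stored as 32-bit big-endian integers. The sketch below parses only that header; the method name read_index_header is made up for this example, and the per-entry fields are left out entirely.

```ruby
# Hypothetical helper: decode the 12-byte header of a Git index.
def read_index_header(bytes)
  # a4 = 4 raw bytes, N = 32-bit unsigned integer in network (big-endian) order
  signature, version, entry_count = bytes.unpack("a4NN")
  raise ArgumentError, "not a Git index" unless signature == "DIRC"

  { version: version, entries: entry_count }
end

# Build a fake header in memory: version 2 with 3 entries.
header = ["DIRC", 2, 3].pack("a4NN")
p read_index_header(header)
```

Against a real repository, the same function could be fed File.binread(".git/index", 12).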
For simplicity and human-readability, let’s ignore most of the metadata and use a text format. We can return and add these features as they become necessary in the future.
For now, RGit’s index format will look like:
<SHA> <path>
<SHA> <path>
<SHA> <path>
# more entries
Let’s look at the actual Ruby code to do all this!
#!/usr/bin/env ruby
# bin/rgit-add

require "digest"
require "zlib"
require "fileutils"

RGIT_DIRECTORY = ".rgit".freeze
OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze
INDEX_PATH = "#{RGIT_DIRECTORY}/index"

if !Dir.exist? RGIT_DIRECTORY
  $stderr.puts "Not an RGit project"
  exit 1
end

path = ARGV.first

if path.nil?
  $stderr.puts "No path specified"
  exit 1
end

file_contents = File.read(path)
sha = Digest::SHA1.hexdigest file_contents
blob = Zlib::Deflate.deflate file_contents

# The first two characters of the SHA name the directory, the rest name the file.
object_directory = "#{OBJECTS_DIRECTORY}/#{sha[0..1]}"
FileUtils.mkdir_p object_directory
blob_path = "#{object_directory}/#{sha[2..-1]}"

# Compressed data is binary, so write it in binary mode.
File.open(blob_path, "wb") do |file|
  file.print blob
end

File.open(INDEX_PATH, "a") do |file|
  file.puts "#{sha} #{path}"
end
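One simplification is worth knowing about: RGit hashes the raw file contents, while real Git hashes a short header (the word blob, the size in bytes, and a NUL byte) followed by the contents. That is why RGit's SHAs will not match what git hash-object reports for the same file. A minimal sketch of the difference, with both method names invented for the example:

```ruby
require "digest"

# RGit's approach in this article: hash the raw contents.
def rgit_blob_sha(contents)
  Digest::SHA1.hexdigest(contents)
end

# Real Git's approach: hash "blob <size>\0" followed by the contents.
def git_blob_sha(contents)
  Digest::SHA1.hexdigest("blob #{contents.bytesize}\0#{contents}")
end

contents = "test content\n"
puts rgit_blob_sha(contents)
# The next value should match: echo 'test content' | git hash-object --stdin
puts git_blob_sha(contents)
```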
Let’s start versioning RGit with RGit! First we need to add a file to the staging area:
rgit add bin/rgit
Our .rgit directory now looks like:
.rgit
├── HEAD
├── index
├── objects
│   ├── b3
│   │   └── 02dd6f8cd2b385b170e78c14503342c0ba6ae8
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags
Notice that we now have a file in the objects directory. It contains the compressed source of bin/rgit.
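Since the object is just zlib-compressed data, reading it back is a matter of rebuilding the path from the SHA and inflating the file. The sketch below works under the same simplifications as rgit-add (SHA of the raw contents, no object header); write_blob and read_blob are names invented for this example.

```ruby
require "digest"
require "zlib"
require "fileutils"
require "tmpdir"

# Store contents the way rgit-add does and return the SHA.
def write_blob(objects_dir, contents)
  sha = Digest::SHA1.hexdigest(contents)
  dir = File.join(objects_dir, sha[0..1])
  FileUtils.mkdir_p(dir)
  File.binwrite(File.join(dir, sha[2..-1]), Zlib::Deflate.deflate(contents))
  sha
end

# Look an object up by SHA and return the original contents.
def read_blob(objects_dir, sha)
  path = File.join(objects_dir, sha[0..1], sha[2..-1])
  Zlib::Inflate.inflate(File.binread(path))
end

# Round trip through a throwaway objects directory.
Dir.mktmpdir do |objects_dir|
  sha = write_blob(objects_dir, "puts 'hello'\n")
  puts read_blob(objects_dir, sha)
end
```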
Finally, our index looks like:
cat .rgit/index
b302dd6f8cd2b385b170e78c14503342c0ba6ae8 bin/rgit
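Because the format is plain text, reading the index back takes only a few lines. The sketch below turns index text into a hash of path to SHA; parse_index is a hypothetical name. It splits on the first space only, so a path containing spaces would survive, which goes slightly beyond what the scripts in this article handle.

```ruby
# Hypothetical helper: map each staged path to its SHA.
def parse_index(text)
  text.each_line.with_object({}) do |line, entries|
    sha, path = line.chomp.split(" ", 2)
    next if path.nil? # skip blank or malformed lines

    entries[path] = sha # a later entry for the same path wins
  end
end

index = <<~INDEX
  b302dd6f8cd2b385b170e78c14503342c0ba6ae8 bin/rgit
INDEX

p parse_index(index)
```

Letting the last entry win matters because rgit-add appends to the index, so staging the same file twice leaves two lines for one path.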
Committing files
Blobs are the contents of a particular file at a particular time. In order to capture a snapshot of the entire project, Git bundles a bunch of these into a commit.
In order to capture the directory structure of the project, Git creates a “tree” object for each directory of a project. Each tree object contains a list of the tracked files and their associated blob as well as tree objects for subdirectories.
This gives us a tree structure that mirrors the tracked project’s filesystem. Directories are represented by “tree” objects while files are “blobs”. This whole tree structure is then tied to a “commit” object so that we can refer to it later.
The commit command does three things:
- Build the tree/blob structure
- Create a commit object that points to that structure
- Update the current branch to point to this commit
Because creating objects is a common task, I’ve extracted it to RGit::Object.
# lib/rgit/object.rb

require "fileutils"

module RGit
  RGIT_DIRECTORY = "#{Dir.pwd}/.rgit".freeze
  OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze

  class Object
    def initialize(sha)
      @sha = sha
    end

    # Opens the object's file for writing and yields it to the block.
    def write(&block)
      object_directory = "#{OBJECTS_DIRECTORY}/#{sha[0..1]}"
      FileUtils.mkdir_p object_directory
      object_path = "#{object_directory}/#{sha[2..-1]}"
      File.open(object_path, "w", &block)
    end

    private

    attr_reader :sha
  end
end
This class handles all of the directory/path related tasks as well as opening the file. It then yields to the given block for the actual writing of the object’s contents.
With this refactor done, let’s take a look at the commit command:
#!/usr/bin/env ruby
# bin/rgit-commit

$LOAD_PATH << File.expand_path("../../lib", __FILE__)
require "digest"
require "time"
require "rgit/object"

RGIT_DIRECTORY = "#{Dir.pwd}/.rgit".freeze
INDEX_PATH = "#{RGIT_DIRECTORY}/index"
COMMIT_MESSAGE_TEMPLATE = <<-TXT
# Title
#
# Body
TXT

# Each line of the index has the form "<SHA> <path>".
# A missing index (nothing staged yet) counts as empty.
def index_files
  File.exist?(INDEX_PATH) ? File.readlines(INDEX_PATH) : []
end

# Convert the flat index into a nested hash: directories become hashes,
# files map to their blob SHA.
def index_tree
  index_files.each_with_object({}) do |line, tree|
    sha, path = line.chomp.split(" ", 2)
    *directories, filename = path.split("/")
    leaf = directories.reduce(tree) { |memo, dir| memo[dir] ||= {} }
    leaf[filename] = sha
  end
end

# Write a tree object for this directory, recursing into subdirectories.
# Unlike real Git, the SHA is derived from the time and name, not the contents.
def build_tree(name, tree)
  sha = Digest::SHA1.hexdigest(Time.now.iso8601 + name)
  object = RGit::Object.new(sha)

  object.write do |file|
    tree.each do |key, value|
      if value.is_a? Hash
        dir_sha = build_tree(key, value)
        file.puts "tree #{dir_sha} #{key}"
      else
        file.puts "blob #{value} #{key}"
      end
    end
  end

  sha
end

def build_commit(tree:)
  commit_message_path = "#{RGIT_DIRECTORY}/COMMIT_EDITMSG"

  # Seed the message file with the template, then hand it to the user's editor.
  `echo "#{COMMIT_MESSAGE_TEMPLATE}" > #{commit_message_path}`
  `$VISUAL #{commit_message_path} >/dev/tty`
  message = File.read commit_message_path
  committer = "user"
  sha = Digest::SHA1.hexdigest(Time.now.iso8601 + committer)
  object = RGit::Object.new(sha)

  object.write do |file|
    file.puts "tree #{tree}"
    file.puts "author #{committer}"
    file.puts
    file.puts message
  end

  sha
end

# HEAD contains "ref: refs/heads/<branch>"; write the commit SHA to that ref.
def update_ref(commit_sha:)
  current_branch = File.read("#{RGIT_DIRECTORY}/HEAD").strip.split.last

  File.open("#{RGIT_DIRECTORY}/#{current_branch}", "w") do |file|
    file.print commit_sha
  end
end

def clear_index
  File.truncate INDEX_PATH, 0
end

if index_files.count == 0
  $stderr.puts "Nothing to commit"
  exit 1
end

root_sha = build_tree("root", index_tree)
commit_sha = build_commit(tree: root_sha)
update_ref(commit_sha: commit_sha)
clear_index
This file does several things:
- Exits with error code and message if there are no files to commit
- Creates all the necessary tree objects for the files in the index
- Creates a commit object pointing to the root tree object
- Updates the current branch to point to the commit
- Clears the index
Building the tree is done in two passes. First the index is converted into a hash structure representing the file tree. Secondly, this structure is converted to tree objects on the filesystem. Both steps are done recursively.
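To make the first pass concrete, here is a sketch of the intermediate hash for a small set of entries. It works on in-memory pairs rather than the index file. The name nest_paths, the shortened placeholder SHAs, and the README.md entry are all invented for the example.

```ruby
# Hypothetical helper: turn [sha, path] pairs into a nested hash where
# directories are hashes and files map to their blob SHA.
def nest_paths(entries)
  entries.each_with_object({}) do |(sha, path), tree|
    *directories, filename = path.split("/")
    # Walk down the directory segments, creating hashes as needed.
    leaf = directories.reduce(tree) { |node, dir| node[dir] ||= {} }
    leaf[filename] = sha
  end
end

entries = [
  ["aaa111", "bin/rgit"],
  ["bbb222", "bin/rgit-add"],
  ["ccc333", "README.md"],
]

p nest_paths(entries)
```

The result has a bin key holding a hash of its two files, plus a top-level README.md key. The second pass then only needs to check whether each value is a hash (a directory) or a string (a blob SHA).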
For the commit message, we simply open a file using the user’s $VISUAL editor. Once the user exits their editor, we read the file and put the contents into the commit.
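Note that the template’s comment lines end up in the commit exactly as written. Real Git, by default, strips lines starting with # from an edited message before storing it. Below is a sketch of that clean-up step, which RGit does not perform; clean_message is a hypothetical name and the sample message is made up.

```ruby
# Hypothetical helper: drop comment lines, then trim surrounding whitespace.
def clean_message(text)
  text.each_line
      .reject { |line| line.start_with?("#") }
      .join
      .strip
end

raw = <<~MSG
  # Title
  Add rgit-commit
  #
  # Body
  Builds tree objects from the index.
MSG

puts clean_message(raw)
```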
Let’s see it all come together. Staging and committing bin/rgit and bin/rgit-add gives us the following results in .rgit:
.rgit
├── COMMIT_EDITMSG
├── HEAD
├── index
├── objects
│   ├── 63
│   │   └── 45493c987e6144cc68142ad2405db681b28628
│   ├── 8c
│   │   └── fe566596683acae588039156f40ecaff282c30
│   ├── ae
│   │   └── 161568392ed9aa321466446a9bb01acb111e4f
│   ├── b3
│   │   └── 02dd6f8cd2b385b170e78c14503342c0ba6ae8
│   ├── f9
│   │   └── 60e7d48c47e86289a653b0afc0b7a13a9d372e
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags
In order to find the current state, we first look up what branch we are on by checking .rgit/HEAD. This points to .rgit/refs/heads/master, the master branch. The master branch points to its latest commit. The commit in turn points to a tree object representing the root of the project. This tree object points to another tree object representing the bin/ directory, which in turn points to two blob objects containing the compressed contents of bin/rgit and bin/rgit-add at the time of the commit.
This structure of objects pointing to each other is what makes Git so powerful. By simply changing a few of these pointers, we can switch to different points in history.
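Following those pointers by hand can be sketched in a few lines. The example builds a tiny fake repository in a temporary directory using the text formats from this article (uncompressed commit objects whose first line is tree <sha>), then resolves HEAD to a branch, a commit, and a root tree. All method names are invented for the example, and the SHAs are placeholders.

```ruby
require "fileutils"
require "tmpdir"

def object_path(rgit_dir, sha)
  File.join(rgit_dir, "objects", sha[0..1], sha[2..-1])
end

# HEAD contains "ref: refs/heads/<branch>"; return the ref path.
def current_ref(rgit_dir)
  File.read(File.join(rgit_dir, "HEAD")).strip.split.last
end

# Follow HEAD -> branch ref -> commit -> root tree.
def resolve_head(rgit_dir)
  ref = current_ref(rgit_dir)
  commit_sha = File.read(File.join(rgit_dir, ref)).strip
  tree_line = File.readlines(object_path(rgit_dir, commit_sha)).first
  { ref: ref, commit: commit_sha, tree: tree_line.split.last }
end

Dir.mktmpdir do |root|
  rgit = File.join(root, ".rgit")
  commit_sha = "c" * 40 # placeholder, not a real SHA
  tree_sha = "d" * 40   # placeholder, not a real SHA

  FileUtils.mkdir_p(File.join(rgit, "refs", "heads"))
  FileUtils.mkdir_p(File.dirname(object_path(rgit, commit_sha)))
  File.write(File.join(rgit, "HEAD"), "ref: refs/heads/master\n")
  File.write(File.join(rgit, "refs", "heads", "master"), commit_sha)
  File.write(object_path(rgit, commit_sha),
             "tree #{tree_sha}\nauthor user\n\nFirst commit\n")

  p resolve_head(rgit)
end
```

Writing a different commit SHA into refs/heads/master is all it takes for resolve_head to report a different snapshot.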