---
title: Rebuilding Git in Ruby
teaser: From zero to commit! Let's rebuild Git with a more familiar approach for Rails developers to get a better understanding of how Git works under the hood.
tags: git,ruby
author: Joël Quenneville
published_on: 2016-03-14
---

Git is a distributed version control system (DVCS) that we use every day to
manage our code. It is a powerful tool, but have you ever wondered how it works
its magic?

The Git internal docs can be intimidating and incomplete, and they lack
examples. Digging through Git's implementation can also be intimidating,
particularly if you aren't familiar with C.

Pulling apart the engine and putting it back together is one of the best ways
to understand how a system works. However, instead of writing C, let's use
something more familiar to us as Rails developers. Let's re-implement Git in
Ruby!

If you want to dig deeper into the implementation, check out the [RGit source on
GitHub][rgit].

[rgit]: https://github.com/JoelQ/rgit

## Git commands

Git is built in modular fashion following the [UNIX philosophy][unix philosophy]
of small, sharp tools. Each command is its own script file and the top level
`git` command simply proxies to them. Git ships with a number of built-in
commands, but custom commands can be written as long as they follow a given
naming convention.

```ruby
#!/usr/bin/env ruby
# bin/rgit

command, *args = ARGV

if command.nil?
  $stderr.puts "Usage: rgit <command> [<args>]"
  exit 1
end

# Subcommands live next to this script and are named rgit-<command>
path_to_command = File.expand_path("../rgit-#{command}", __FILE__)
if !File.exist? path_to_command
  $stderr.puts "No such command"
  exit 1
end

# Replace the current process with the subcommand, forwarding its arguments
exec path_to_command, *args
```

This script does one of three things when we call it:

- Outputs usage information if no subcommand was given
- Outputs an error message if no script for the subcommand was found
- Runs the given subcommand if it is found

Notice that we pass on any additional arguments to the subcommand.
As good UNIX citizens, we output messages to the standard error stream and
return a non-zero exit code when errors occur.

[unix philosophy]: https://en.wikipedia.org/wiki/Unix_philosophy

## Initializing a repository

Git stores all of its data and metadata in a `.git` directory in the root of the
repository. The `git init` command initializes the `.git` directory and a few
subdirectories as follows:

```tree
.git
├── HEAD
├── config
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags
```

`HEAD` is a file that has the hard-coded value `ref: refs/heads/master`. We'll
need this file later.

`config` contains configuration for the repo. We'll ignore it for now in the
interest of simplicity.

The remaining items in the tree are empty directories.

Generating this structure is mostly a lot of calls to `Dir.mkdir`:

```ruby
#!/usr/bin/env ruby
# bin/rgit-init

RGIT_DIRECTORY = ".rgit".freeze
OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze
REFS_DIRECTORY = "#{RGIT_DIRECTORY}/refs".freeze

# Dir.exist? replaces Dir.exists?, which was removed in Ruby 3.2
if Dir.exist? RGIT_DIRECTORY
  $stderr.puts "Existing RGit project"
  exit 1
end

def build_objects_directory
  Dir.mkdir OBJECTS_DIRECTORY
  Dir.mkdir "#{OBJECTS_DIRECTORY}/info"
  Dir.mkdir "#{OBJECTS_DIRECTORY}/pack"
end

def build_refs_directory
  Dir.mkdir REFS_DIRECTORY
  Dir.mkdir "#{REFS_DIRECTORY}/heads"
  Dir.mkdir "#{REFS_DIRECTORY}/tags"
end

def initialize_head
  File.open("#{RGIT_DIRECTORY}/HEAD", "w") do |file|
    file.puts "ref: refs/heads/master"
  end
end

Dir.mkdir RGIT_DIRECTORY
build_objects_directory
build_refs_directory
initialize_head

$stdout.puts "RGit initialized in #{RGIT_DIRECTORY}"
```

This script is called `rgit-init` in keeping with the conventions expected by
the `rgit` command we built.

If there is already a `.rgit` directory, we output an error message and exit
with a non-zero exit code. Real Git allows you to safely "re-initialize" a
repository, but let's opt out of this edge case for our MVP.

The `init` command is a little verbose but very boring.
It creates a bunch of directories as well as the `HEAD` file.

## Adding files to the staging area

Git allows you to capture a snapshot of the current state of a file via the
`git add` command. The set of these snapshots is called the _staging area_. A
list of snapshots and their metadata is stored at `.rgit/index`.

Staging a file takes a few steps:

- Create a SHA based on the file contents
- Create a blob by compressing the file contents
- Save that blob as `.rgit/objects/<first 2 SHA characters>/<remaining SHA characters>`
- Add the SHA and original file path to the index so we can retrieve it later.

The index is a binary file that has the [following format][git-index-format]:

[git-index-format]: https://github.com/git/git/blob/master/Documentation/technical/index-format.txt

    DIRC <version number> <number of entries>

    <ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <size> <SHA> <flags> <path>
    <ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <size> <SHA> <flags> <path>
    # more entries

A lot of this metadata comes in handy for calculations done by other commands.
If you try to open this file, however, you will see a bunch of gibberish.

    cat .git/index

```git-index-format
bin/rgit-initTREE52 1?Ibin/rgitU?U?2???? ??? C??B=????''9bin2 0 ?Cԣ̏k?i??`V:??3'9Z?6??赠xa?cǢbF
```

This is because the contents of the index file are stored in a _binary format_
for performance reasons.

For simplicity and human-readability, let's ignore most of the metadata and use
a text format. We can return and add these features as they become necessary in
the future. For now, RGit's index format will look like:

    <SHA> <path>
    <SHA> <path>
    # more entries

Let's look at the actual Ruby code to do all this!

```ruby
#!/usr/bin/env ruby
# bin/rgit-add

require "digest"
require "zlib"
require "fileutils"

RGIT_DIRECTORY = ".rgit".freeze
OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze
INDEX_PATH = "#{RGIT_DIRECTORY}/index"

if !Dir.exist? RGIT_DIRECTORY
  $stderr.puts "Not an RGit project"
  exit 1
end

path = ARGV.first

if path.nil?
  $stderr.puts "No path specified"
  exit 1
end

file_contents = File.read(path)
sha = Digest::SHA1.hexdigest file_contents
blob = Zlib::Deflate.deflate file_contents

# The first two characters of the SHA name the directory, the rest name the file
object_directory = "#{OBJECTS_DIRECTORY}/#{sha[0..1]}"
FileUtils.mkdir_p object_directory
blob_path = "#{object_directory}/#{sha[2..-1]}"

# The blob is compressed binary data, so write it in binary mode
File.open(blob_path, "wb") do |file|
  file.print blob
end

# Append an entry to the index in our simplified "<SHA> <path>" format
File.open(INDEX_PATH, "a") do |file|
  file.puts "#{sha} #{path}"
end
```

Let's start versioning RGit with RGit! First we need to add a file to the
staging area:

    rgit add bin/rgit

Our `.rgit` directory now looks like:

```tree
.rgit
├── HEAD
├── index
├── objects
│   ├── b3
│   │   └── 02dd6f8cd2b385b170e78c14503342c0ba6ae8
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags
```

Notice that we now have a file in the `objects` directory. It contains the
compressed source of `bin/rgit`.

Finally, our index looks like:

    cat .rgit/index
    b302dd6f8cd2b385b170e78c14503342c0ba6ae8 bin/rgit

## Committing files

Blobs are the contents of a particular file at a particular time. In order to
capture a snapshot of the entire project, Git bundles a bunch of these into a
_commit_.

In order to capture the directory structure of the project, Git creates a "tree"
object for each directory of a project. Each tree object contains a list of the
tracked files and their associated blobs as well as tree objects for
subdirectories.

This gives us a tree structure that mirrors the tracked project's filesystem.
Directories are represented by "tree" objects while files are "blobs". This
whole tree structure is then tied to a "commit" object so that we can refer to
it later.

The commit command does three things:

1. Build the tree/blob structure
2. Create a commit object that points to that structure
3. Update the current branch to point to this commit.

Because creating objects is a common task, I've extracted it to `RGit::Object`.
```ruby
# lib/rgit/object.rb
require "fileutils"

module RGit
  RGIT_DIRECTORY = "#{Dir.pwd}/.rgit".freeze
  OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze

  class Object
    def initialize(sha)
      @sha = sha
    end

    # Creates the object's directory if needed, opens the object file for
    # writing, and yields the file to the given block
    def write(&block)
      object_directory = "#{OBJECTS_DIRECTORY}/#{sha[0..1]}"
      FileUtils.mkdir_p object_directory
      object_path = "#{object_directory}/#{sha[2..-1]}"
      File.open(object_path, "w", &block)
    end

    private

    attr_reader :sha
  end
end
```

This class handles all of the directory/path related tasks as well as opening
the file. It then yields to the given block for the actual writing of the
object's contents.

With this refactor done, let's take a look at the commit command:

```ruby
#!/usr/bin/env ruby
# bin/rgit-commit

$LOAD_PATH << File.expand_path("../../lib", __FILE__)
require "digest"
require "time"
require "rgit/object"

RGIT_DIRECTORY = "#{Dir.pwd}/.rgit".freeze
INDEX_PATH = "#{RGIT_DIRECTORY}/index"
COMMIT_MESSAGE_TEMPLATE = <<-TXT
# Title
#
# Body
TXT

# Returns the index entries, or an empty list if nothing was ever staged
def index_files
  return [] unless File.exist?(INDEX_PATH)
  File.readlines(INDEX_PATH)
end

# First pass: turn the flat index into a nested hash mirroring the file tree
def index_tree
  index_files.each_with_object({}) do |line, obj|
    # Each index line is "<SHA> <path>", as written by rgit-add
    sha, path = line.split
    segments = path.split("/")
    segments.reduce(obj) do |memo, s|
      if s == segments.last
        memo[segments.last] = sha
        memo
      else
        memo[s] ||= {}
        memo[s]
      end
    end
  end
end

# Second pass: write a tree object for each directory, recursing into
# subdirectories
def build_tree(name, tree)
  sha = Digest::SHA1.hexdigest(Time.now.iso8601 + name)
  object = RGit::Object.new(sha)

  object.write do |file|
    tree.each do |key, value|
      if value.is_a? Hash
        dir_sha = build_tree(key, value)
        file.puts "tree #{dir_sha} #{key}"
      else
        file.puts "blob #{value} #{key}"
      end
    end
  end

  sha
end

def build_commit(tree:)
  commit_message_path = "#{RGIT_DIRECTORY}/COMMIT_EDITMSG"

  # Seed the message file with the template, then hand it to the user's editor
  File.write(commit_message_path, COMMIT_MESSAGE_TEMPLATE)
  `$VISUAL #{commit_message_path} >/dev/tty`
  message = File.read commit_message_path
  committer = "user"
  sha = Digest::SHA1.hexdigest(Time.now.iso8601 + committer)
  object = RGit::Object.new(sha)

  object.write do |file|
    file.puts "tree #{tree}"
    file.puts "author #{committer}"
    file.puts
    file.puts message
  end

  sha
end

# HEAD contains "ref: refs/heads/master"; write the commit SHA to that ref
def update_ref(commit_sha:)
  current_branch = File.read("#{RGIT_DIRECTORY}/HEAD").strip.split.last

  File.open("#{RGIT_DIRECTORY}/#{current_branch}", "w") do |file|
    file.print commit_sha
  end
end

def clear_index
  File.truncate INDEX_PATH, 0
end

if index_files.count == 0
  $stderr.puts "Nothing to commit"
  exit 1
end

root_sha = build_tree("root", index_tree)
commit_sha = build_commit(tree: root_sha)
update_ref(commit_sha: commit_sha)
clear_index
```

This file does several things:

1. Exits with error code and message if there are no files to commit
2. Creates all the necessary tree objects for the files in the index
3. Creates a commit object pointing to the root tree object
4. Updates the current branch to point to the commit
5. Clears the index

Building the tree is done in two passes. First, the index is converted into a
hash structure representing the file tree. Second, this structure is converted
to tree objects on the filesystem. The first pass walks each path one segment
at a time with `reduce`, while the second recurses into each subdirectory.

For the commit message, we simply open a file using the user's
[`$VISUAL`](https://thoughtbot.com/blog/visual-ize-the-future) editor. Once the
user exits their editor, we read the file and put the contents into the commit.

Let's see it all come together.
Staging and committing `bin/rgit` and `bin/rgit-add` gives
us the following results in `.rgit`:

```tree
.rgit
├── COMMIT_EDITMSG
├── HEAD
├── index
├── objects
│   ├── 63
│   │   └── 45493c987e6144cc68142ad2405db681b28628
│   ├── 8c
│   │   └── fe566596683acae588039156f40ecaff282c30
│   ├── ae
│   │   └── 161568392ed9aa321466446a9bb01acb111e4f
│   ├── b3
│   │   └── 02dd6f8cd2b385b170e78c14503342c0ba6ae8
│   ├── f9
│   │   └── 60e7d48c47e86289a653b0afc0b7a13a9d372e
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags
```

In order to find the current state, we first look up what branch we are on by
checking `.rgit/HEAD`. This points to `.rgit/refs/heads/master`, the master
branch. The master branch points to its latest commit. The commit in turn points
to a tree object representing the root of the project. This tree object points
to another tree object representing the `bin/` directory, which in turn points
to two blob objects containing the compressed contents of `bin/rgit` and
`bin/rgit-add` at the time of the commit.

![](https://images.thoughtbot.com/rebuilding-git-in-ruby/mTnEMJCNS2wM3a9JtbCw_rgit-commit-tree.png)

This structure of objects pointing to each other is what makes Git so powerful.
By simply changing a few of these pointing files, we can switch to different
points in history.
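
To make the chain of pointers from `.rgit/HEAD` down to the blobs concrete, here
is a minimal sketch of following it in Ruby. It assumes the plain-text tree and
commit formats written by `bin/rgit-commit` in this post. The helper names
`read_object`, `tracked_files`, and `current_files` are hypothetical and not
part of RGit.

```ruby
# A sketch only: read_object, tracked_files, and current_files are
# hypothetical helpers that assume RGit's simplified text object formats.

# Reads the raw contents of the object stored under the given SHA.
def read_object(rgit_dir, sha)
  File.read(File.join(rgit_dir, "objects", sha[0..1], sha[2..-1]))
end

# Recursively walks a tree object and returns a { path => blob SHA } hash.
def tracked_files(rgit_dir, tree_sha, prefix = "")
  read_object(rgit_dir, tree_sha).each_line.each_with_object({}) do |line, files|
    # Each tree line is "<type> <SHA> <name>"
    type, sha, name = line.split(" ", 3)
    path = prefix + name.chomp

    if type == "tree"
      files.merge!(tracked_files(rgit_dir, sha, "#{path}/"))
    else
      files[path] = sha
    end
  end
end

# Follows HEAD -> branch ref -> commit -> root tree -> blobs.
def current_files(rgit_dir = ".rgit")
  ref = File.read(File.join(rgit_dir, "HEAD")).strip.split.last
  commit_sha = File.read(File.join(rgit_dir, ref)).strip
  # The first line of a commit object is "tree <SHA>"
  tree_sha = read_object(rgit_dir, commit_sha).lines.first.split.last

  tracked_files(rgit_dir, tree_sha)
end
```

Calling `current_files` from the project root should return a hash mapping each
committed path to its blob SHA. Since blobs are stored deflated, a blob's
original contents could then be recovered by passing the stored bytes to
`Zlib::Inflate.inflate`.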