Fetching Source Index for http://rubygems.org/

Nick Quaranto

Like you, I’ve sat at my terminal watching Bundler emit this post’s title and do nothing for quite a while. Imagine what we could be doing instead of waiting for dependencies to resolve! I’m out of ideas already, I love resolving dependencies.

Why it’s slow

It’s actually not Bundler that is slow…it’s RubyGems itself. To understand why this process takes a long time, you need a bit of a history lesson on how RubyGems handles its index of gems. There are three indexes available:

  • Latest index (newest versions for a given gem on a given platform)
  • Big index (all versions for all gems on all platforms)
  • Prerelease index (only prerelease gems for all gems on all platforms)

Usually only the “latest” index is needed when you gem install something. Bundler, however, needs the big index, and the size difference is serious:

wget http://rubygems.org/latest_specs.4.8.gz
wget http://rubygems.org/specs.4.8.gz
du -h *
172K    latest_specs.4.8.gz
436K    specs.4.8.gz

These indexes are big arrays of gem name, version, and platform entries, gzipped and Marshal’d. Our first slowdown is actually in parsing this huge array.

irb -rubygems -rbenchmark
>> Benchmark.bmbm { |x| x.report { Marshal.load(Gem.gunzip(File.read("specs.4.8.gz"))) } }
2.250000   0.050000   2.300000 (  2.321536)
total: 2.300000sec

user     system      total        real
2.280000   0.030000   2.310000 (  2.299291)

Once unzipped/unpacked, the entries in that array usually look like:

["rails", Gem::Version.new("3.0.3"), "ruby"]
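To make that format concrete, here’s a minimal sketch that round-trips a tiny index the same way: Marshal, then gzip. The three entries are made up for illustration, and I’m using Zlib from the standard library directly instead of the Gem helpers.

```ruby
require "rubygems"
require "zlib"

# Made-up miniature index; the real specs.4.8.gz holds every version of every gem.
entries = [
  ["rails", Gem::Version.new("3.0.3"), "ruby"],
  ["rails", Gem::Version.new("3.0.0"), "ruby"],
  ["rake",  Gem::Version.new("0.8.7"), "ruby"],
]

# Pack it the way the index is described above: Marshal, then gzip.
blob = Zlib.gzip(Marshal.dump(entries))

# Unpack it again and group the versions by gem name.
loaded = Marshal.load(Zlib.gunzip(blob))
versions_by_name = Hash.new { |hash, name| hash[name] = [] }
loaded.each { |name, version, _platform| versions_by_name[name] << version }
versions_by_name.each_value(&:sort!)

versions_by_name.each do |name, versions|
  puts "#{name}: #{versions.join(', ')}"
end
```

Every client that wants the big index has to do this unpacking for the whole array, which is where the parse time in the benchmark comes from.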

Bundler also needs a given gem’s dependencies. If you haven’t noticed already, those dependencies aren’t in the index at all, they’re in the gemspecs, which are stored individually at a completely different location, also gzipped and Marshal’d.

irb -rubygems -ropen-uri -rpp
>> compressed = open("http://rubygems.org/quick/Marshal.4.8/rails-3.0.0.gemspec.rz").read
>> inflated = Gem.inflate(compressed)
>> unmarshalled = Marshal.load(inflated)
>> pp unmarshalled.dependencies
[Gem::Dependency.new("activesupport", Gem::Requirement.new(["= 3.0.0"]), :runtime),
 Gem::Dependency.new("actionpack", Gem::Requirement.new(["= 3.0.0"]), :runtime),
 Gem::Dependency.new("activerecord", Gem::Requirement.new(["= 3.0.0"]), :runtime),
 Gem::Dependency.new("activeresource", Gem::Requirement.new(["= 3.0.0"]), :runtime),
 Gem::Dependency.new("actionmailer", Gem::Requirement.new(["= 3.0.0"]), :runtime),
 Gem::Dependency.new("railties", Gem::Requirement.new(["= 3.0.0"]), :runtime),
 Gem::Dependency.new("bundler", Gem::Requirement.new(["~> 1.0.0"]), :runtime)]

So that’s basically how RubyGems figures out dependencies N levels deep: it has to make a separate request for each gemspec and keep following dependencies until all possibilities are exhausted. Next time you gem install a gem, add -V and you’ll see all of these requests happening.
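Here’s a rough sketch of that walk, counting one simulated request per gemspec. The FAKE_GEMSPECS table and walk_dependencies are hypothetical stand-ins I made up for this example; the dependency lists are trimmed down, and versions are ignored entirely, so this is far simpler than what RubyGems or Bundler really do.

```ruby
require "set"

# Hypothetical stand-in for the individual gemspec files. In real life each
# lookup here would be a separate HTTP request. Versions are ignored.
FAKE_GEMSPECS = {
  "rails"         => %w[activesupport actionpack railties],
  "actionpack"    => %w[activesupport rack],
  "railties"      => %w[activesupport rake],
  "activesupport" => [],
  "rack"          => [],
  "rake"          => [],
}.freeze

# Follow dependencies breadth-first until nothing new turns up.
def walk_dependencies(root, specs)
  seen = Set.new
  queue = [root]
  requests = 0

  until queue.empty?
    name = queue.shift
    next if seen.include?(name)

    seen << name
    requests += 1 # one simulated gemspec download per distinct gem
    queue.concat(specs.fetch(name))
  end

  [seen.to_a, requests]
end

gems, requests = walk_dependencies("rails", FAKE_GEMSPECS)
puts "#{requests} gemspec requests to discover #{gems.size} gems"
```

Even in this toy graph, the request count grows with every distinct gem the walk touches.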

Those requests obviously take a lot of time, no matter how good Bundler’s resolver algorithm gets. I think we’ve pushed this system to its limits, and the fact that it does complete resolves in a reasonable amount of time is impressive.

What you can do

So it’s still slow. My general advice is to:

  • Check in your vendor/cache directory with your .gem files. If bundle install doesn’t make one, force it with bundle pack.
  • On new installs, CI runs, and deploys, use bundle --local, which will attempt to resolve using only vendor/cache.
  • Lock down to specific versions (or use the pessimistic operator) in your Gemfile
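To illustrate the last point, here’s a small sketch of how an exact pin and the pessimistic operator narrow down a list of candidate versions. It uses Gem::Requirement directly instead of a Gemfile so it can run on its own, and the candidate versions are picked just for the example.

```ruby
require "rubygems"

exact       = Gem::Requirement.new("= 3.0.0")
pessimistic = Gem::Requirement.new("~> 3.0.0") # same as >= 3.0.0 and < 3.1

# Example candidates only; a real index has many more versions.
candidates = %w[3.0.0 3.0.3 3.1.0].map { |v| Gem::Version.new(v) }

exact_matches       = candidates.select { |v| exact.satisfied_by?(v) }
pessimistic_matches = candidates.select { |v| pessimistic.satisfied_by?(v) }

puts "=  3.0.0 allows #{exact_matches.join(', ')}"
puts "~> 3.0.0 allows #{pessimistic_matches.join(', ')}"
```

The tighter the requirement, the fewer versions are left for the resolver to consider.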

What we have done about it

From the RubyGems side, I think we’ve done a good thing by making the long requests go out to CloudFront, so big gems get a CDN boost. However, all requests are still being made to the Gemcutter server at Rackspace before being redirected to S3/CloudFront, so the network latency of that first request still slows things down for those outside of the US.

At Cape Code, Matt and I worked on a new resolver endpoint for Bundler. The idea was that Bundler could make a request to this new API that would return one level of dependencies for a given set of gems. We can’t move the entire Bundler resolver algorithm to the server side, but this could cut down the number of requests it needs to make for gemspecs.
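To show why batching helps, here’s a sketch of a client that asks for one level of dependencies at a time for a whole set of gems. Everything in it is an assumption for illustration: fetch_one_level stands in for the endpoint (whose real request and response format isn’t described here), FAKE_DEPENDENCIES is made-up data, and versions are ignored.

```ruby
# Made-up dependency data standing in for what the server knows.
FAKE_DEPENDENCIES = {
  "rails"         => %w[activesupport actionpack railties],
  "actionpack"    => %w[activesupport rack],
  "railties"      => %w[activesupport rake],
  "activesupport" => [],
  "rack"          => [],
  "rake"          => [],
}.freeze

# Hypothetical endpoint: one call returns the direct dependencies of every
# gem name passed in.
def fetch_one_level(names, index)
  names.map { |name| [name, index.fetch(name)] }.to_h
end

# Ask for one level at a time until no unknown gems remain.
def resolve_in_batches(roots, index)
  known = {}
  pending = roots.uniq
  requests = 0

  until pending.empty?
    batch = fetch_one_level(pending, index)
    requests += 1 # one simulated request per level, not per gem
    known.merge!(batch)
    pending = batch.values.flatten.uniq.reject { |name| known.key?(name) }
  end

  [known, requests]
end

known, requests = resolve_in_batches(["rails"], FAKE_DEPENDENCIES)
puts "#{requests} batched requests to discover #{known.size} gems"
```

In this toy graph the client makes one request per level of the tree instead of one per gem, though the real resolver still has to do all the version matching on the client side.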

This will speed things up a bit, but it doesn’t solve the root problem here.

What needs to happen

What we really need is:

  1. A better indexing scheme
  2. A mirroring system that isn’t horrible (read: round robin DNS)

RubyGems definitely needs a better indexing scheme, but this is difficult since making the client support it is going to be rough (and we have to worry about backwards compatibility!).

Thankfully, our server is now in Ruby (one of the first goals of the Gemcutter project) so we can iterate rapidly and drop the changes into a gem plugin (think gem fast_install rails). I’ve been talking to some fellow robots here about some possibilities (differential indices for one) but we need to bang some code out soon.

I’m looking into getting a mirroring system set up, but as always, we need contributors to help. My first stop has been with MirrorBrain, but I’m open to anything that works and will be easy to set up. My only real requirement is that it takes less than 1 minute to get a gem distributed. Perhaps we need BitTorrent? The gem files are small (most are way under 1MB) so I can’t see that as being hard to accomplish.

My goal is to get rid of at least one of these problems in 2011. Want to help? Hop on IRC (#rubygems on irc.freenode.net) and the Gemcutter mailing list as well.