---
title: Block Web Crawlers with Rails
teaser: 'Search engines "crawl" and "index" web content through programs called robots
  (a.k.a. crawlers or spiders). Here are some approaches to blocking them in Ruby
  on Rails apps.

  '
tags: rails,ruby,seo,web
author: Dan Croak
published_on: 2017-02-10
---

Search engines "crawl" and "index" web content
through programs called robots (a.k.a. crawlers or spiders).
This may be problematic for our projects in situations such as:

* a staging environment
* migrating data from a legacy system to new locations
* rolling out alpha or beta features

Approaches to blocking crawlers in these scenarios include:

* authentication (best)
* `robots.txt` (crawling)
* `X-Robots-Tag` (indexing)

## Problem: duplicate content

With multiple environments or during a data migration period,
[duplicate content][dup] may be accessible to crawlers.
Search engines will have to guess which version to index,
assign authority,
and rank in query results.

[dup]: https://moz.com/learn/seo/duplicate-content

For example, we periodically back up our production data
to the [staging environment][env] using [Parity]:

[env]: http://12factor.net/dev-prod-parity
[Parity]: https://github.com/thoughtbot/parity

<kbd>
production backup
<br />
staging restore production
</kbd>

## Things search engines do

In order to provide results,
a search engine may prepare by doing these things:

1. check a domain's robots settings (e.g. `http://example.com/robots.txt`)
1. request a webpage on the domain (e.g. `http://example.com/`)
1. check the webpage's `X-Robots-Tag` HTTP response header
1. cache the webpage (saving its response body)
1. index the webpage (extract keywords from the response body for fast lookup)
1. follow links on the webpage to other webpages and repeat

Steps 1, 2, 3, and 6 are generally "crawling" steps
and steps 4 and 5 are generally "indexing" steps.
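
As a rough illustration of the steps above,
here is a toy model in plain Ruby.
The `SITE` hash, the `crawl` method, and the paths are all made up for this sketch;
real crawlers fetch over HTTP, cache response bodies, and handle far more cases.

```ruby
require "set"

# Hypothetical in-memory "website": path => response headers and outgoing links.
SITE = {
  "/"        => { headers: {}, links: ["/about", "/beta"] },
  "/about"   => { headers: {}, links: ["/", "/private"] },
  "/beta"    => { headers: { "X-Robots-Tag" => "none" }, links: ["/secret"] },
  "/private" => { headers: {}, links: [] },
}.freeze

# Walk the site roughly the way the numbered steps describe.
# `disallowed` stands in for already-parsed robots.txt rules (step 1).
def crawl(site, start:, disallowed: [])
  queue = [start]
  seen = Set.new
  crawled = []
  indexed = []

  until queue.empty?
    path = queue.shift
    next if seen.include?(path)
    seen << path

    # Step 1: respect robots.txt before requesting anything.
    next if disallowed.any? { |prefix| path.start_with?(prefix) }

    # Step 2: "request" the page (unknown paths are skipped).
    page = site[path]
    next if page.nil?
    crawled << path

    # Step 3: read the X-Robots-Tag header from the response.
    directives = page[:headers].fetch("X-Robots-Tag", "").split(",").map(&:strip)
    blocked = directives.include?("none")

    # Steps 4 and 5: cache and index, unless the header forbids it.
    indexed << path unless blocked || directives.include?("noindex")

    # Step 6: follow links, unless the header forbids it.
    queue.concat(page[:links]) unless blocked || directives.include?("nofollow")
  end

  { crawled: crawled, indexed: indexed }
end

crawl(SITE, start: "/", disallowed: ["/private"])
```

In this model, `/private` is never requested because of the `robots.txt` rule,
while `/beta` is requested but not indexed because of its header.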

## Solution: authentication (best)

The most reliable way to hide content from a crawler is
with authentication such as [HTTP Basic authentication][basic]:

[basic]: http://api.rubyonrails.org/classes/ActionController/HttpAuthentication/Basic.html

```ruby
class ApplicationController < ActionController::Base
  if ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
    http_basic_authenticate_with(
      name: ENV.fetch("BASIC_AUTH_USERNAME"),
      password: ENV.fetch("BASIC_AUTH_PASSWORD"),
    )
  end
end
```

This is often all we need for situations such as a staging environment.
The approaches that follow are more limited
but may be more suitable for other situations.

Notice we can control whether crawlers are allowed to access content
via [config in the environment][config].
We can use Parity again to add configuration to Heroku staging:

[config]: http://12factor.net/config

<kbd>
staging config:set DISALLOW_ALL_WEB_CRAWLERS=true
</kbd>
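
One subtlety: `present?` is true for any non-blank string,
so setting `DISALLOW_ALL_WEB_CRAWLERS=false` would still turn blocking on.
If we would rather treat the variable as an explicit on/off flag,
a stricter check might look like this sketch.
The `env_flag?` helper and the `TRUTHY_VALUES` list are hypothetical,
not part of Rails or Parity.

```ruby
# Hypothetical helper: only these strings count as "enabled".
TRUTHY_VALUES = %w[1 true yes on].freeze

# `env` defaults to the real environment but can be any Hash-like object,
# which keeps the helper easy to try out.
def env_flag?(name, env = ENV)
  TRUTHY_VALUES.include?(env[name].to_s.strip.downcase)
end

env_flag?("DISALLOW_ALL_WEB_CRAWLERS", { "DISALLOW_ALL_WEB_CRAWLERS" => "true" })  # enabled
env_flag?("DISALLOW_ALL_WEB_CRAWLERS", { "DISALLOW_ALL_WEB_CRAWLERS" => "false" }) # disabled
env_flag?("DISALLOW_ALL_WEB_CRAWLERS", {})                                         # disabled
```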

## Solution: robots.txt (crawling)

The [robots exclusion standard][standard]
helps robots decide what action to take.
A robot first looks at the [`/robots.txt`][txt] file on the domain
before crawling it.

[standard]: https://en.wikipedia.org/wiki/Robots_exclusion_standard
[txt]: http://www.robotstxt.org/robotstxt.html

It began as a de facto standard
(it was only formalized, as RFC 9309, in 2022)
and is opt-in by robots.
Mainstream robots such as `Googlebot`
respect the standard
but bad actors may not.

An example `/robots.txt` file looks like this:

```txt
User-agent: *
Disallow: /
```

This blocks (disallows) all content (`/`) to all crawlers (`User-agent`s).
See [this list of Google crawlers][google] for examples of user agent tokens.

[google]: https://support.google.com/webmasters/answer/1061943

The original standard does not support globbing or regular expressions in this file,
though some crawlers such as Googlebot accept limited wildcards (`*` and `$`).
[See what can go in it][docs].

[docs]: http://www.robotstxt.org/robotstxt.html
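
To make the matching rules concrete,
here is a minimal sketch of how a robot might interpret the file.
The `parse_robots_txt` and `allowed?` methods are hypothetical names for this example,
and the logic is deliberately simplified:
it handles only `User-agent` and `Disallow` lines,
matches user agent tokens exactly,
and treats each `Disallow` value as a plain path prefix.

```ruby
# Parse robots.txt text into { "user-agent token" => [disallowed path prefixes] }.
def parse_robots_txt(text)
  rules = {}
  current_agents = []
  in_agent_lines = false

  text.each_line do |line|
    # Drop comments, then split "Field: value" on the first colon.
    field, value = line.sub(/#.*/, "").split(":", 2).map(&:strip)
    next if value.nil?

    case field.downcase
    when "user-agent"
      # Consecutive User-agent lines share the rules that follow them.
      current_agents = [] unless in_agent_lines
      in_agent_lines = true
      agent = value.downcase
      current_agents << agent
      rules[agent] ||= []
    when "disallow"
      in_agent_lines = false
      # An empty Disallow value means "nothing is disallowed".
      current_agents.each { |name| rules[name] << value } unless value.empty?
    end
  end

  rules
end

# Use the group for this agent if there is one, otherwise the "*" group.
def allowed?(rules, user_agent, path)
  prefixes = rules.fetch(user_agent.downcase) { rules.fetch("*", []) }
  prefixes.none? { |prefix| path.start_with?(prefix) }
end

block_all = parse_robots_txt("User-agent: *\nDisallow: /\n")
allowed?(block_all, "Googlebot", "/anything") # false: every path starts with "/"

beta_only = parse_robots_txt("User-agent: Googlebot\nDisallow: /beta\n")
allowed?(beta_only, "Googlebot", "/beta/feature") # false: prefix match
allowed?(beta_only, "Googlebot", "/about")        # true
```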

Add [Climate Control][cc] to the `Gemfile`
to control environment variables in tests:

[cc]: https://github.com/thoughtbot/climate_control

```ruby
gem "climate_control"
```

In `spec/requests/robots_txt_spec.rb`:

```ruby
require "rails_helper"

describe "robots.txt" do
  context "when not blocking all web crawlers" do
    it "allows all crawlers" do
      get "/robots.txt"

      expect(response.code).to eq "404"
      expect(response.headers["X-Robots-Tag"]).to be_nil
    end
  end

  context "when blocking all web crawlers" do
    it "blocks all crawlers" do
      ClimateControl.modify "DISALLOW_ALL_WEB_CRAWLERS" => "true" do
        get "/robots.txt"
      end

      expect(response).to render_template "disallow_all"
      expect(response.headers["X-Robots-Tag"]).to eq "none"
    end
  end
end
```

[Google recommends no robots.txt][rec]
if we want all our content to be crawled.

[rec]: https://developers.google.com/webmasters/control-crawl-index/docs/getting_started#some-sample-robotstxt-files

In `config/routes.rb`:

```ruby
get "/robots.txt" => "robots_txts#show"
```

In `app/controllers/robots_txts_controller.rb`:

```ruby
class RobotsTxtsController < ApplicationController
  def show
    if disallow_all_crawlers?
      render "disallow_all", layout: false, content_type: "text/plain"
    else
      # `render nothing: true` was removed in Rails 5.1; `head` sends an empty response.
      head :not_found
    end
  end

  private

  def disallow_all_crawlers?
    ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
  end
end
```

If we're using an authentication library such as [Clearance] site-wide,
we'll want to skip its filter in our controller:

[Clearance]: https://github.com/thoughtbot/clearance

```ruby
class ApplicationController < ActionController::Base
  before_action :require_login
end

class RobotsTxtsController < ApplicationController
  skip_before_action :require_login
end
```

Remove the default Rails `robots.txt` and prepare the custom directory:

<kbd>
rm public/robots.txt
<br />
mkdir app/views/robots_txts
</kbd>

In `app/views/robots_txts/disallow_all.erb`:

```txt
User-agent: *
Disallow: /
```

## Solution: `X-Robots-Tag` (indexing)

It is possible for search engines to [index content without crawling it][wo]
because other websites might link to it.
So, our `robots.txt` technique blocks crawling, but not indexing.

[wo]: https://www.quora.com/unanswered/How-can-a-search-engine-index-without-crawling

Adding an [`X-Robots-Tag` header][x-head] to our responses
covers the indexing side:
well-behaved crawlers that fetch a response with this header
won't index it or follow its links.
A crawler only sees the header on URLs it is allowed to request.

[x-head]: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag#using-the-x-robots-tag-http-header

You may have seen meta tags like this in projects you've worked on:

```html
<meta name="robots" content="noindex,nofollow">
```

The `X-Robots-Tag` header has the same effect as the `robots` meta tag
but applies to all content types in our app (e.g. images, scripts, styles),
not only HTML files.

To block robots in our environment, we want a header like this:

```txt
X-Robots-Tag: none
```

The [`none` directive is equivalent to `noindex, nofollow`][equiv].
It tells robots not to index the page or follow its links.
Caching is controlled separately by the `noarchive` directive.

[equiv]: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag#using-the-x-robots-tag-http-header
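
A small sketch of how a robot might read this header value,
including the `none` shorthand.
The method names here are hypothetical,
and the parsing is simplified:
it ignores user-agent-specific prefixes such as `googlebot: noindex`.

```ruby
# Expand an X-Robots-Tag value into a list of individual directives,
# replacing the `none` shorthand with the two directives it stands for.
def robots_directives(header_value)
  header_value.to_s.downcase.split(",").map(&:strip).reject(&:empty?)
    .flat_map { |directive| directive == "none" ? %w[noindex nofollow] : [directive] }
    .uniq
end

def indexable?(header_value)
  !robots_directives(header_value).include?("noindex")
end

def follow_links?(header_value)
  !robots_directives(header_value).include?("nofollow")
end

robots_directives("none")              # ["noindex", "nofollow"]
robots_directives("noindex, nofollow") # ["noindex", "nofollow"]
indexable?(nil)                        # true: no header means no restrictions
```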

In `lib/rack_x_robots_tag.rb`:

```ruby
module Rack
  class XRobotsTag
    def initialize(app)
      @app = app
    end

    def call(env)
      status, headers, response = @app.call(env)

      if ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
        headers["X-Robots-Tag"] = "none"
      end

      [status, headers, response]
    end
  end
end
```
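
Because a Rack middleware is a plain Ruby object,
we can try out its behavior without booting Rails.
The sketch below repeats the middleware's shape under a different name (`XRobotsTagDemo`)
so that it runs on its own.
The `env_vars` argument and the `STUB_APP` lambda are assumptions added for the demo;
the middleware above reads `ENV` directly.

```ruby
# Demo copy of the middleware with an injectable set of environment variables.
class XRobotsTagDemo
  def initialize(app, env_vars = ENV)
    @app = app
    @env_vars = env_vars
  end

  def call(env)
    status, headers, response = @app.call(env)

    # Plain-Ruby stand-in for `.present?`: set and not blank.
    unless @env_vars["DISALLOW_ALL_WEB_CRAWLERS"].to_s.strip.empty?
      headers["X-Robots-Tag"] = "none"
    end

    [status, headers, response]
  end
end

# A Rack app is anything that responds to `call(env)` and returns
# [status, headers, body], so a lambda is enough for a stub.
STUB_APP = ->(_env) { [200, { "Content-Type" => "text/plain" }, ["OK"]] }

blocking = XRobotsTagDemo.new(STUB_APP, { "DISALLOW_ALL_WEB_CRAWLERS" => "true" })
_status, headers, _body = blocking.call({})
headers["X-Robots-Tag"] # "none"

open_site = XRobotsTagDemo.new(STUB_APP, {})
_status, headers, _body = open_site.call({})
headers.key?("X-Robots-Tag") # false
```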

In `config/application.rb`:

```ruby
require_relative "../lib/rack_x_robots_tag"

module YourAppName
  class Application < Rails::Application
    # ...
    config.middleware.use Rack::XRobotsTag
  end
end
```

Our specs will now pass.

## Conclusion

We can block crawling and indexing of our environment's content
in three different ways.
`robots.txt` and `X-Robots-Tag` rely on
web robots that respect the robots exclusion standard
(most importantly Google),
while authentication works against any client.

Use authentication to entirely hide the content,
or `robots.txt` plus the `X-Robots-Tag` for more granular control.
