Search engines “crawl” and “index” web content through programs called robots (a.k.a. crawlers or spiders). This may be problematic for our projects in situations such as:
- a staging environment
- migrating data from a legacy system to new locations
- rolling out alpha or beta features
Approaches to blocking crawlers in these scenarios include:
- authentication (best)
- a robots.txt file
- an X-Robots-Tag response header
With multiple environments or during a data migration period, duplicate content may be accessible to crawlers. Search engines will have to guess which version to index, assign authority, and rank in query results.
For example, we might use Parity to restore a production database backup into staging, making production content reachable at the staging URL:

```
staging restore production
```
In order to provide results, a search engine may prepare by doing these things:
1. check a domain’s robots settings (e.g. https://example.com/robots.txt)
2. request a webpage on the domain (e.g. https://example.com/some-page)
3. check the webpage’s X-Robots-Tag HTTP response header
4. cache the webpage (saving its response body)
5. index the webpage (extract keywords from the response body for fast lookup)
6. follow links on the webpage to other webpages and repeat
Steps 1, 2, 3, and 6 are generally “crawling” steps and steps 4 and 5 are generally “indexing” steps.
The most reliable way to hide content from a crawler is with authentication such as HTTP Basic authentication:
```ruby
# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  if ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
    http_basic_authenticate_with(
      name: ENV.fetch("BASIC_AUTH_USERNAME"),
      password: ENV.fetch("BASIC_AUTH_PASSWORD"),
    )
  end
end
```
This often is all we need for situations such as a staging environment. The following approaches are more limited but may be more suitable for other situations.
Notice we can control whether crawlers are allowed to access content via config in the environment. We can use Parity again to add configuration to Heroku staging:
```
staging config:set DISALLOW_ALL_WEB_CRAWLERS=true
```
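The Basic auth credentials are read from the environment as well, so they need to be set wherever crawlers should be locked out. A minimal sketch, assuming we choose the values ourselves (the username and password below are placeholders):

```
staging config:set BASIC_AUTH_USERNAME=staging BASIC_AUTH_PASSWORD=a-long-random-password
```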
The robots exclusion standard (robots.txt) is a de-facto standard (not owned by a standards body) and is opt-in by robots. Mainstream robots such as Googlebot respect the standard, but bad actors may not.
The most restrictive /robots.txt file looks like this:

```
User-agent: *
Disallow: /
```
This blocks (disallows) all content (/) to all crawlers (User-agent: *).
See this list of Google crawlers for examples of user agent tokens.
Globbing and regular expressions are not supported in this file. See what can go in it.
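Within those limits, a file can still set different rules per user agent token by listing simple path prefixes. A sketch (the agent token and paths here are only illustrative):

```
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /admin/
```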
Add Climate Control to the Gemfile to control environment variables in tests:
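A minimal sketch of that change, assuming the gem is only needed in the test group:

```ruby
# Gemfile
group :test do
  gem "climate_control"
end
```

Then write a request spec for /robots.txt: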
```ruby
require "rails_helper"

describe "robots.txt" do
  context "when not blocking all web crawlers" do
    it "allows all crawlers" do
      get "/robots.txt"

      expect(response.code).to eq "404"
      expect(response.headers["X-Robots-Tag"]).to be_nil
    end
  end

  context "when blocking all web crawlers" do
    it "blocks all crawlers" do
      ClimateControl.modify "DISALLOW_ALL_WEB_CRAWLERS" => "true" do
        get "/robots.txt"
      end

      expect(response).to render_template "disallow_all"
      expect(response.headers["X-Robots-Tag"]).to eq "none"
    end
  end
end
```
Google recommends no robots.txt at all if we want all our content to be crawled, so instead of a static file we’ll serve robots.txt dynamically from a controller. Add a route:

```ruby
# config/routes.rb
get "/robots.txt" => "robots_txts#show"
```
Then add the controller, which renders a disallow-all response when crawlers are blocked and returns a 404 otherwise:

```ruby
# app/controllers/robots_txts_controller.rb
class RobotsTxtsController < ApplicationController
  def show
    if disallow_all_crawlers?
      render "disallow_all", layout: false, content_type: "text/plain"
    else
      render nothing: true, status: 404
    end
  end

  private

  def disallow_all_crawlers?
    ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
  end
end
```
If we’re using an authentication library such as Clearance site-wide, we’ll want to skip its filter in our controller:
```ruby
class ApplicationController < ActionController::Base
  before_action :require_login
end

class RobotsTxtsController < ApplicationController
  skip_before_action :require_login
end
```
Remove the default Rails robots.txt and prepare the custom views directory:
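A sketch of those commands, assuming the view lives under app/views/robots_txts/ to match the controller name:

```
rm public/robots.txt
mkdir app/views/robots_txts
```

Then add a disallow_all view containing the directives from earlier: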
```
User-agent: *
Disallow: /
```
It is possible for search engines to index content without crawling it, because other websites might link to it. The robots.txt technique blocked crawling, but not indexing: it short-circuits the entire process for well-behaved crawlers, which won’t make HTTP requests at all to content on the domain, yet a URL can still end up indexed via links from elsewhere. Adding an X-Robots-Tag header to our responses tells robots not to index or follow the content they do fetch.
You may have seen meta tags like this in projects you’ve worked on:
```html
<meta name="robots" content="noindex,nofollow">
```
The X-Robots-Tag header has the same effect as the robots meta tag, but applies to all content types in our app (e.g. images, scripts, styles), not only HTML files.
To block robots in our environment, we want a header like this:
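```
X-Robots-Tag: none
```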
The none directive is equivalent to noindex, nofollow. It tells robots not to index, follow links, or cache.
We can set the header on every response with a small piece of Rack middleware:

```ruby
# lib/rack_x_robots_tag.rb
module Rack
  class XRobotsTag
    def initialize(app)
      @app = app
    end

    def call(env)
      status, headers, response = @app.call(env)

      if ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
        headers["X-Robots-Tag"] = "none"
      end

      [status, headers, response]
    end
  end
end
```
Then require and use it in the application config:

```ruby
# config/application.rb
require_relative "../lib/rack_x_robots_tag"

module YourAppName
  class Application < Rails::Application
    # ...

    config.middleware.use Rack::XRobotsTag
  end
end
```
Our specs will now pass.
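As a sanity check on a deployed environment, we can also inspect the header by hand; the staging URL here is a placeholder:

```
curl -sI https://staging.example.com | grep -i x-robots-tag
```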
Our environment’s content can now be blocked from crawling and indexing, in three different ways, by web robots that respect the robots exclusion standard (most importantly, Google).
Use authentication to hide it entirely, or robots.txt plus the X-Robots-Tag header for more granular control.