---
title: Block Web Crawlers with Rails
teaser: 'Search engines "crawl" and "index" web content through programs called robots (a.k.a. crawlers or spiders). Here are some approaches to blocking them in Ruby on Rails apps. '
tags: rails,ruby,seo,web
author: Dan Croak
published_on: 2017-02-10
---

Search engines "crawl" and "index" web content through programs called robots (a.k.a. crawlers or spiders). This may be problematic for our projects in situations such as:

* a staging environment
* migrating data from a legacy system to new locations
* rolling out alpha or beta features

Approaches to blocking crawlers in these scenarios include:

* authentication (best)
* `robots.txt` (crawling)
* `X-Robots-Tag` (indexing)

## Problem: duplicate content

With multiple environments or during a data migration period, [duplicate content][dup] may be accessible to crawlers. Search engines will have to guess which version to index, assign authority, and rank in query results.

[dup]: https://moz.com/learn/seo/duplicate-content

For example, we periodically back up our production data to the [staging environment][env] using [Parity]:

[env]: http://12factor.net/dev-prod-parity
[Parity]: https://github.com/thoughtbot/parity

    production backup
    staging restore production

## Things search engines do

In order to provide results, a search engine may prepare by doing these things:

1. check a domain's robots settings (e.g. `http://example.com/robots.txt`)
1. request a webpage on the domain (e.g. `http://example.com/`)
1. check the webpage's `X-Robots-Tag` HTTP response header
1. cache the webpage (saving its response body)
1. index the webpage (extract keywords from the response body for fast lookup)
1. follow links on the webpage to other webpages and repeat

Steps 1, 2, 3, and 6 are generally "crawling" steps and steps 4 and 5 are generally "indexing" steps.

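The steps above can be sketched as a small decision function for a single URL. This is only an illustration of the ordering, not how any real search engine is implemented: `crawler_actions`, its keyword arguments, and the returned symbols are hypothetical names made up for this example, and it only looks at the `none`, `noindex`, and `nofollow` directives.

```ruby
# Hypothetical sketch of a well-behaved robot's decisions for one URL.
#
# robots_txt_allows: outcome of checking the domain's robots.txt (step 1)
# x_robots_tag:      value of the X-Robots-Tag response header, or nil (step 3)
def crawler_actions(robots_txt_allows:, x_robots_tag: nil)
  # Steps 1-2: a disallowed URL is never requested, so nothing else happens.
  return [] unless robots_txt_allows

  actions = [:request]
  directives = x_robots_tag.to_s.downcase.split(",").map(&:strip)

  # Steps 4-5: "none" and "noindex" both ask robots to skip caching/indexing.
  unless directives.include?("none") || directives.include?("noindex")
    actions += [:cache, :index]
  end

  # Step 6: "none" and "nofollow" both ask robots not to follow links.
  unless directives.include?("none") || directives.include?("nofollow")
    actions << :follow_links
  end

  actions
end

crawler_actions(robots_txt_allows: true)
# => [:request, :cache, :index, :follow_links]
crawler_actions(robots_txt_allows: true, x_robots_tag: "none")
# => [:request]
crawler_actions(robots_txt_allows: false)
# => []
```

Note that in this sketch the header is only consulted after the page has been requested, since it is part of the HTTP response.
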
## Solution: authentication (best)

The most reliable way to hide content from a crawler is with authentication such as [HTTP Basic authentication][basic]:

[basic]: http://api.rubyonrails.org/classes/ActionController/HttpAuthentication/Basic.html

```ruby
class ApplicationController < ActionController::Base
  if ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
    http_basic_authenticate_with(
      name: ENV.fetch("BASIC_AUTH_USERNAME"),
      password: ENV.fetch("BASIC_AUTH_PASSWORD"),
    )
  end
end
```

This often is all we need for situations such as a staging environment. The following approaches are more limited but may be more suitable for other situations.

Notice we can control whether crawlers are allowed to access content via [config in the environment][config]. We can use Parity again to add configuration to Heroku staging:

[config]: http://12factor.net/config

    staging config:set DISALLOW_ALL_WEB_CRAWLERS=true

## Solution: robots.txt (crawling)

The [robots exclusion standard][standard] helps robots decide what action to take. A robot first looks at the [`/robots.txt`][txt] file on the domain before crawling it.

[standard]: https://en.wikipedia.org/wiki/Robots_exclusion_standard
[txt]: http://www.robotstxt.org/robotstxt.html

It is a de-facto standard (not owned by a standards body) and is opt-in by robots. Mainstream robots such as `Googlebot` respect the standard but bad actors may not.

An example `/robots.txt` file looks like this:

```txt
User-agent: *
Disallow: /
```

This blocks (disallows) all content (`/`) to all crawlers (`User-agent`s). See [this list of Google crawlers][google] for examples of user agent tokens.

[google]: https://support.google.com/webmasters/answer/1061943

Globbing and regular expressions are not supported in this file. [See what can go in it][docs].

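To make the "no globbing" point concrete, here is a rough sketch of how a robot might evaluate such a file using plain prefix matching. It is a simplification for illustration only: `robots_disallowed?` is a hypothetical helper, not part of Rails or any gem, and it ignores `Allow` lines and any crawler-specific extensions.

```ruby
# Hypothetical helper: does robots.txt disallow `path` for `user_agent`?
# Simplified sketch: plain prefix matching, no wildcards, no Allow lines.
def robots_disallowed?(robots_txt, user_agent, path)
  groups = []      # each group: { agents: [...], disallows: [...] }
  current = nil
  last_field = nil

  robots_txt.each_line do |line|
    line = line.sub(/#.*/, "").strip # drop comments and whitespace
    next if line.empty?

    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?

    field = field.downcase
    case field
    when "user-agent"
      # Consecutive User-agent lines share one group of rules.
      unless last_field == "user-agent"
        current = { agents: [], disallows: [] }
        groups << current
      end
      current[:agents] << value.downcase unless value.empty?
    when "disallow"
      # An empty Disallow value means "nothing is disallowed".
      current[:disallows] << value if current && !value.empty?
    end
    last_field = field
  end

  ua = user_agent.downcase
  # Prefer a group that names this robot, otherwise fall back to "*".
  group = groups.find { |g| g[:agents].any? { |a| a != "*" && ua.include?(a) } } ||
          groups.find { |g| g[:agents].include?("*") }
  return false unless group

  group[:disallows].any? { |prefix| path.start_with?(prefix) }
end

robots_disallowed?("User-agent: *\nDisallow: /\n", "Googlebot", "/about")
# => true
```

Because matching is by prefix, `Disallow: /` covers every path on the domain, which is why the two-line file above is enough to block everything for robots that honor it.
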

[docs]: http://www.robotstxt.org/robotstxt.html

Add [Climate Control][cc] to the `Gemfile` to control environment variables in tests:

[cc]: https://github.com/thoughtbot/climate_control

```ruby
gem "climate_control"
```

In `spec/requests/robots_txt_spec.rb`:

```ruby
require "rails_helper"

describe "robots.txt" do
  context "when not blocking all web crawlers" do
    it "allows all crawlers" do
      get "/robots.txt"

      expect(response.code).to eq "404"
      expect(response.headers["X-Robots-Tag"]).to be_nil
    end
  end

  context "when blocking all web crawlers" do
    it "blocks all crawlers" do
      ClimateControl.modify "DISALLOW_ALL_WEB_CRAWLERS" => "true" do
        get "/robots.txt"
      end

      expect(response).to render_template "disallow_all"
      expect(response.headers["X-Robots-Tag"]).to eq "none"
    end
  end
end
```

[Google recommends no robots.txt][rec] if we want all our content to be crawled.

[rec]: https://developers.google.com/webmasters/control-crawl-index/docs/getting_started#some-sample-robotstxt-files

In `config/routes.rb`:

```ruby
get "/robots.txt" => "robots_txts#show"
```

In `app/controllers/robots_txts_controller.rb`:

```ruby
class RobotsTxtsController < ApplicationController
  def show
    if disallow_all_crawlers?
      render "disallow_all", layout: false, content_type: "text/plain"
    else
      # No robots.txt to serve: respond with an empty 404.
      # (`render nothing: true` was removed in Rails 5.1.)
      head :not_found
    end
  end

  private

  def disallow_all_crawlers?
    ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
  end
end
```

If we're using an authentication library such as [Clearance] site-wide, we'll want to skip its filter in our controller:

[Clearance]: https://github.com/thoughtbot/clearance

```ruby
class ApplicationController < ActionController::Base
  before_action :require_login
end

class RobotsTxtsController < ApplicationController
  skip_before_action :require_login
end
```

Remove the default Rails `robots.txt` and prepare the custom directory:

    rm public/robots.txt
    mkdir app/views/robots_txts

In `app/views/robots_txts/disallow_all.erb`:

```txt
User-agent: *
Disallow: /
```

## Solution: `X-Robots-Tag` (indexing)

It is possible for search engines to [index content without crawling it][wo] because websites might link to it. So, our `robots.txt` technique blocked crawling, but not indexing.

[wo]: https://www.quora.com/unanswered/How-can-a-search-engine-index-without-crawling

Adding an [`X-Robots-Tag` header][x-head] to our responses tells well-behaved crawlers not to index the content they receive or follow its links. Because the header arrives with the HTTP response, crawlers only see it for URLs they actually request.

[x-head]: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag#using-the-x-robots-tag-http-header

You may have seen meta tags like this in projects you've worked on:

```html
<meta name="robots" content="noindex,nofollow">
```

The `X-Robots-Tag` header has the same effect as the `robots` meta tag but applies to all content types in our app (e.g. images, scripts, styles), not only HTML files.

To block robots in our environment, we want a header like this:

```txt
X-Robots-Tag: none
```

The [`none` directive is equivalent to `noindex, nofollow`][equiv]. It tells robots not to index, follow links, or cache.

[equiv]: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag#using-the-x-robots-tag-http-header

In `lib/rack_x_robots_tag.rb`:

```ruby
module Rack
  class XRobotsTag
    def initialize(app)
      @app = app
    end

    def call(env)
      status, headers, response = @app.call(env)

      if ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
        headers["X-Robots-Tag"] = "none"
      end

      [status, headers, response]
    end
  end
end
```

In `config/application.rb`:

```ruby
require_relative "../lib/rack_x_robots_tag"

module YourAppName
  class Application < Rails::Application
    # ...
    config.middleware.use Rack::XRobotsTag
  end
end
```

Our specs will now pass.

## Conclusion

Our environment's content can be blocked in three different ways from crawling and indexing by web robots that respect the robots exclusion standard (most importantly Google). Use authentication to entirely hide it, or robots.txt plus the `X-Robots-Tag` for more granular control.