The Sitemap protocol was introduced by Google in 2005, but is now supported by all of the major search engines. Unrelated to a traditional website sitemap navigation page, it defines an XML schema for listing the URLs within a site, including metadata such as when a URL was last updated, which allows search engines to crawl the site more intelligently.
Using the Sitemap protocol does not guarantee that web pages are included in search engines, but it provides hints that help web crawlers do a better job of crawling your site. It’s easy to add support for a dynamically generated Sitemap to a Rails application. This post documents how I went about it for this site, where blog posts are stored in instances of a Post model. You’ll likely need to adapt some of these instructions for your own application.
The first step is to set up a dedicated route and controller for the sitemap. Add the following route towards the bottom of config/routes.rb:
map.sitemap '/sitemap.xml', :controller => 'sitemap'
This routes all requests for /sitemap.xml to a controller dedicated to serving the sitemap. Next, create app/controllers/sitemap_controller.rb:
class SitemapController < ApplicationController
  layout nil

  def index
    headers['Content-Type'] = 'application/xml'
    latest = Post.last
    if stale?(:etag => latest, :last_modified => latest.updated_at.utc)
      respond_to do |format|
        format.xml { @posts = Post.sitemap.published }
      end
    end
  end
end
The index action gets the latest Post model instance and then checks the HTTP request for staleness using ActionController’s stale? method, which compares the ETag and Last-Modified values derived from that post against the If-None-Match and If-Modified-Since headers sent by the client. The Sitemap is therefore only served if the client’s copy is out of date; otherwise an HTTP 304 Not Modified status is returned. The @posts instance variable is set to the result of executing the chained sitemap and published named scopes within the Post model. This is how those named scopes are defined in app/models/post.rb:
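To make the freshness check more concrete, here is a minimal plain-Ruby sketch of the kind of comparison a conditional GET involves. It is not Rails’s actual implementation: the LatestPost struct, the validators_for and fresh? helpers, and the way the ETag is derived are all assumptions made for illustration.

```ruby
require 'digest/md5'
require 'time'

# Hypothetical stand-in for the most recent Post record.
LatestPost = Struct.new(:id, :updated_at)

# Builds the validators a server would send alongside the sitemap.
# The ETag recipe here is an assumption, not what Rails generates.
def validators_for(record)
  digest = Digest::MD5.hexdigest("#{record.id}-#{record.updated_at.to_i}")
  { etag: %("#{digest}"), last_modified: record.updated_at.utc.httpdate }
end

# Returns true when the client's cached copy is still valid, i.e. the
# server could answer with 304 Not Modified instead of the full body.
def fresh?(request_headers, validators)
  if_none_match     = request_headers['If-None-Match']
  if_modified_since = request_headers['If-Modified-Since']
  # A request without conditional headers always gets the full response.
  return false if if_none_match.nil? && if_modified_since.nil?

  etag_ok = if_none_match.nil? || if_none_match == validators[:etag]
  time_ok = if_modified_since.nil? ||
            Time.httpdate(if_modified_since) >= Time.httpdate(validators[:last_modified])
  etag_ok && time_ok
end

post       = LatestPost.new(42, Time.utc(2010, 2, 2, 9, 30, 0))
validators = validators_for(post)

first_visit  = {}
repeat_visit = { 'If-None-Match'     => validators[:etag],
                 'If-Modified-Since' => validators[:last_modified] }

puts fresh?(first_visit, validators)  # false: send the sitemap
puts fresh?(repeat_visit, validators) # true: a 304 would do
```

In the controller, stale? is the inverse of this check: it returns true when the full response needs to be rendered.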
class Post < ActiveRecord::Base
  named_scope :published, :conditions => { :published => true }
  named_scope :sitemap, :select => 'slug, created_at, updated_at',
                        :limit => 49999 # +1 for About page to make 50,000
end
The sitemap named scope only selects the slug, created_at and updated_at columns because they’re all that’s required within the generated XML. I also limit it to 49,999 results because, as you’ll see shortly, the view template includes a hard-coded reference to my site’s static About page. The Sitemap protocol specifies that each Sitemap file contain no more than 50,000 URLs, hence the limit. I’ll worry about how to handle more than 50,000 posts in the extremely unlikely event that I write that many!
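For what it’s worth, the protocol’s answer to the 50,000 URL limit is to split the URLs across several Sitemap files and list those files in a Sitemap index. The following is only a rough plain-Ruby sketch of that idea, not something this site does; the sitemap_chunks and sitemap_index_xml helpers and the sitemap-N.xml file naming are hypothetical.

```ruby
# The per-file URL limit from the Sitemap protocol.
MAX_URLS_PER_SITEMAP = 50_000

# Splits a list of URLs into groups that each fit within one Sitemap file.
def sitemap_chunks(urls, max = MAX_URLS_PER_SITEMAP)
  urls.each_slice(max).to_a
end

# Builds a Sitemap index that points at one file per chunk.
# The sitemap-N.xml naming scheme is an assumption for illustration.
def sitemap_index_xml(base_url, chunk_count)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
  (1..chunk_count).each do |n|
    lines << '  <sitemap>'
    lines << "    <loc>#{base_url}/sitemap-#{n}.xml</loc>"
    lines << '  </sitemap>'
  end
  lines << '</sitemapindex>'
  lines.join("\n")
end

# 120,001 made-up URLs need three Sitemap files.
urls   = (1..120_001).map { |n| "http://example.com/posts/#{n}" }
chunks = sitemap_chunks(urls)
puts chunks.map(&:size).inspect
puts sitemap_index_xml('http://example.com', chunks.size)
```

Each chunk would then be rendered by the same kind of template as the single-file Sitemap, with the index served in place of /sitemap.xml.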
The final piece of the puzzle is the view template. Although I originally used Builder for view generation, I switched to Haml (HTML Abstraction Markup Language) because it’s simpler and faster. Haml is based on the idea of removing all duplication from markup and of using meaningful indentation to describe structure. This is what the app/views/sitemap/index.xml.haml file looks like:
- base_url = "http://#{request.host_with_port}"
!!! XML
%urlset{:xmlns => "http://www.sitemaps.org/schemas/sitemap/0.9"}
  - for post in @posts
    %url
      %loc #{base_url}#{post.permalink}
      %lastmod= post.last_modified
      %changefreq monthly
      %priority 0.5
  -# About page
  %url
    %loc #{base_url}/about
    %lastmod 2009-08-28
    %changefreq monthly
    %priority 0.5
This small quantity of Haml generates a Sitemap XML file that looks like the extract below. Job done!
<?xml version='1.0' encoding='utf-8' ?>
<urlset xmlns='http://www.sitemaps.org/schemas/sitemap/0.9'>
  <url>
    <loc>http://johntopley.com/2010/02/02/the-apple-ipad</loc>
    <lastmod>2010-02-02</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://johntopley.com/2010/01/14/the-best-of-twitter-2009</loc>
    <lastmod>2010-01-22</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
  ...
  <url>
    <loc>http://johntopley.com/about</loc>
    <lastmod>2009-08-28</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
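If you’d rather see the same structure without any templating language, the sketch below builds an equivalent document in plain Ruby outside of Rails. It’s an illustration rather than what this site runs: the SitemapEntry struct, the xml_escape and sitemap_xml helpers and the example.com URLs are all made up for the example.

```ruby
require 'date'

# Characters that must be escaped inside XML text nodes and attributes.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
                '"' => '&quot;', "'" => '&apos;' }.freeze

def xml_escape(text)
  text.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Hypothetical stand-in for a post: a site-relative path plus the date
# the page was last modified.
SitemapEntry = Struct.new(:path, :lastmod)

# Builds a Sitemap with the same fixed changefreq and priority values
# that the Haml template uses.
def sitemap_xml(base_url, entries)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
  entries.each do |entry|
    lines << '  <url>'
    lines << "    <loc>#{xml_escape(base_url + entry.path)}</loc>"
    lines << "    <lastmod>#{entry.lastmod.strftime('%Y-%m-%d')}</lastmod>"
    lines << '    <changefreq>monthly</changefreq>'
    lines << '    <priority>0.5</priority>'
    lines << '  </url>'
  end
  lines << '</urlset>'
  lines.join("\n")
end

entries = [
  SitemapEntry.new('/2010/02/02/an-example-post', Date.new(2010, 2, 2)),
  SitemapEntry.new('/about', Date.new(2009, 8, 28))
]
puts sitemap_xml('http://example.com', entries)
```

The escaping step matters once a URL contains a character such as an ampersand, which the protocol requires to be entity-escaped.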