Now that my new blog is live I needed a way to take daily back-ups of the MySQL database to a remote server for safekeeping. Writing a custom Rake task is perfect for this purpose. The plan I decided on was to use the mysqldump program to dump the contents of the relevant database tables into a (compressed) SQL file, which would then get uploaded to Amazon S3 using Marcel Molina’s wonderful AWS-S3 Gem. Then I would set up cron to run the Rake task every night. As you've probably guessed from that last sentence, these instructions assume that the Rails application is running on a UNIX-like operating system. However, it wouldn’t be much work to adapt them for Windows.
I'd been wanting to try Amazon S3 for a while because it's a well-proven, mature and resilient way of storing data in "The Cloud" for minimal cost. You only get charged for what you use and the sign-up process is trivial. Finally, there are now numerous applications that can talk to S3—on the Mac I use Transmit and S3Hub—and client libraries available for all major programming languages.
I decided to separate the configuration needed by the Rake task (things like my S3 credentials) into a YAML file. The YAML file format should be familiar to every Rails programmer because pretty much every Rails application uses one to look up the database connection details. I rather like YAML because it's as simple to understand and as readable as the old Windows INI file format (with the addition of some powerful extra features), and you don't have to pay the angle-bracket tax that XML imposes, which bloats the file size. YAML is also great because with a couple of method calls you can turn the contents of a YAML file into a Ruby hash; indeed, the Rake task I'm about to show you does exactly that.
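A couple of lines in irb are enough to prove the point. This is just a minimal sketch, using the s3.yml file shown below:

  require 'yaml'

  # Read the YAML file straight into a nested Ruby hash
  config = YAML.load(File.read('config/s3.yml'))
  config['production']['dump_dir'] # => "/home/deploy/datadumps"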
This is what the s3.yml configuration file I have in the blog application’s /config directory* looks like:
  defaults: &defaults
    access_key_id: SECRET
    secret_access_key: SECRET
    bucket: datadumps.johntopley.com

  development:
    <<: *defaults
    dump_dir: /Users/john/rails
    log_file: /Users/john/Desktop/datadump.log

  test:
    <<: *defaults
    dump_dir: /Users/john/rails
    log_file: /Users/john/Desktop/datadump.log

  production:
    <<: *defaults
    dump_dir: /home/deploy/datadumps
    log_file: /home/deploy/public_html/railsapp/shared/log/datadump.log
Most of this should look familiar to anyone who has ever seen a Rails database.yml file. Three sets of configuration are specified for the three environments that Rails provides out of the box, i.e. development, test and production. Of course, I could have just set it up for production because I don't need to back up the contents of my development database to S3, but doing it this way means I can test the Rake task locally.
This file also uses YAML merge keys, which some of you may not have seen before. The idea is that you specify a default configuration—i.e. the defaults: section—and then just override the bits that need to change for each environment.
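To make that concrete, once the file above has been loaded the production section effectively expands to a hash containing all five settings:

  s3 = YAML::load(open('config/s3.yml'))
  s3['production']
  # => { 'access_key_id'     => 'SECRET',
  #      'secret_access_key' => 'SECRET',
  #      'bucket'            => 'datadumps.johntopley.com',
  #      'dump_dir'          => '/home/deploy/datadumps',
  #      'log_file'          => '/home/deploy/public_html/railsapp/shared/log/datadump.log' }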
As you can see, there are five pieces of configuration in the file: the unique Amazon S3 Access Key ID and Secret Access Key, the name of the S3 bucket (effectively a folder) that will hold the dump files, the directory where the dump file will be held prior to being transferred to S3 and finally the path of the log file that the Rake task will use to keep a record of what it's doing.
Now let's take a look at the Rake task itself. The file data_dump.rake goes in the application's lib/tasks directory.
Here’s what the Rake task looks like:
  namespace :db do
    namespace :data do
      desc 'Dumps the database for the current RAILS_ENV, compresses it and uploads to Amazon S3'
      task :dump => :environment do
        require 'aws/s3'

        TIMESTAMP = '%Y%m%d-%H%M'
        TABLES = 'categories_posts comments messages posts'

        # Load the database and S3 settings for the current environment
        db = YAML::load(open("#{RAILS_ROOT}/config/database.yml"))
        s3 = YAML::load(open("#{RAILS_ROOT}/config/s3.yml"))
        logger = Logger.new(s3["#{RAILS_ENV}"]['log_file'], 5, 10 * 1024)

        # e.g. /home/deploy/datadumps/production-20090914-0100.sql.gz
        file = "#{s3["#{RAILS_ENV}"]['dump_dir']}/#{RAILS_ENV}" \
               "-#{Time.now.utc.strftime(TIMESTAMP)}.sql.gz"

        # Dump only the listed tables and pipe the output through gzip
        cmd = "mysqldump -u #{db["#{RAILS_ENV}"]['username']}" \
              " -p#{db["#{RAILS_ENV}"]['password']}" \
              " -h #{db["#{RAILS_ENV}"]['host']}" \
              " --add-drop-table --add-locks --extended-insert" \
              " --lock-tables #{db["#{RAILS_ENV}"]['database']}" \
              " #{TABLES} | gzip -cf9 > #{file}"

        logger.info 'Getting ready to create a database dump'
        `#{cmd}`

        bucket = "#{s3["#{RAILS_ENV}"]['bucket']}"
        logger.info "Data dump created, transferring '#{File.basename(file)}' to " \
                    "Amazon S3 '#{bucket}' bucket"

        AWS::S3::Base.establish_connection!(
          :access_key_id     => "#{s3["#{RAILS_ENV}"]['access_key_id']}",
          :secret_access_key => "#{s3["#{RAILS_ENV}"]['secret_access_key']}"
        )
        AWS::S3::S3Object.store(File.basename(file), open(file), bucket,
                                :content_type => 'application/x-gzip')

        logger.info 'Removing local data dump file'
        cmd = "rm -f #{file}"
        `#{cmd}`
      end
    end
  end
The two namespace statements mean that the task itself is run using the command rake db:data:dump. The current Rails environment is passed in to the task, where it can be referred to using the RAILS_ENV variable.
The TABLES constant is just a space-separated list of the database tables that should end up in the dump file. One thing I'm doing here to save space and bandwidth is only dumping those tables that contain dynamic data. For example, I'm excluding the categories table that holds the names of the 18 blog categories that you can see down there at the bottom of this page, but I am including the categories_posts table, which is the join table bridging the many-to-many relationship between posts and their categories. So the generated SQL file on its own is not sufficient to repopulate the database; I have to re-run my ActiveRecord migrations first and then reload the seed data. Of course, you may choose to back up all of your database tables by listing them on this line. Just be aware of the effect on the dump file size.
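For what it's worth, a restore in my set-up would therefore look something like this. This is a sketch only: the database name and MySQL user are illustrative, and how you reload seed data will depend on your application:

  RAILS_ENV=production rake db:migrate
  # ...reload the seed data by whatever means your application uses...
  gunzip -c production-20090914-0100.sql.gz | mysql -u username -p blog_production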
Next, the config/database.yml and config/s3.yml files are opened and assigned to a couple of variables, and a logger is instantiated. You'll notice that the first argument to the Logger constructor is the location of the log file, which is read from the s3 variable using hash syntax: it looks up the value stored under the log_file key within the section that corresponds to the current Rails environment.
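The second and third arguments to the Logger constructor are worth a mention too: they switch on the log rotation built into Ruby's standard library. A minimal sketch (the path here is illustrative):

  require 'logger'

  # Logger.new(logdev, shift_age, shift_size): keep up to 5 old log
  # files, rolling over once the current file exceeds 10 * 1024 bytes.
  logger = Logger.new('/tmp/datadump.log', 5, 10 * 1024)
  logger.info 'This line goes to the current log file'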
The variable named file is assigned the path of the dump file, built by combining the configured dump_dir with the name of the current Rails environment and the current UTC date and time. The resulting filenames end up looking like production-20090914-0100.sql.gz. Feel free to customize to taste.
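Here's the timestamp formatting in isolation, using the TIMESTAMP constant from the task above:

  TIMESTAMP = '%Y%m%d-%H%M'
  Time.now.utc.strftime(TIMESTAMP) # => "20090914-0100", for example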
The next statement builds the command line for the mysqldump program and stores it in the cmd variable. The output from mysqldump is piped into the gzip program, whose output is in turn redirected to the file path held in the file variable. The command is executed by enclosing it in backquotes, which is Ruby's handy way of letting you run shell commands.
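If you haven't come across them before, backquotes run the enclosed string through the shell and return whatever the command wrote to standard output:

  `echo hello`       # => "hello\n"
  listing = `ls /tmp` # listing holds the directory contents as a string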
You may notice that the logging in this Rake task is very minimal. It just tells you what it's about to do before it does it, so if something goes wrong the only information you get is how far the script got. I'll leave more comprehensive logging and the logging of error conditions as an exercise for the reader. Suffice it to say that I haven't had any problems yet running this script as-is in production, so it's probably not worth going overboard trying to anticipate every possible error condition.
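If you want a starting point, Ruby exposes the exit status of the last backquoted command through the $? global, so one possible refinement (not part of the task as published) would be:

  `#{cmd}`
  # Note: for a shell pipeline, $? reflects the exit status of the *last*
  # command (gzip here), so a mysqldump failure could still slip through.
  unless $?.success?
    logger.error "Dump command failed with exit status #{$?.exitstatus}"
    raise 'Database dump failed'
  end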
Now that the dump file has been created, it's time to upload it to Amazon S3. The establish_connection! method is invoked with an options hash containing two values, i.e. the S3 Access Key ID and Secret Access Key looked up from the configuration file. Then the store method is called, being passed the file name, the file contents, the name of the S3 bucket and finally the MIME type for the file. The last action the task performs is executing the rm -f command to remove the generated dump file from the local filesystem.
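If you want a little extra reassurance, the AWS-S3 Gem provides an exists? check that could be used to verify the upload before the local file is deleted. A sketch, not part of the original task:

  if AWS::S3::S3Object.exists?(File.basename(file), bucket)
    logger.info 'Upload to S3 verified'
  else
    logger.error "#{File.basename(file)} not found in S3 bucket after upload"
  end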
So there you have it: a fairly simple Rake task that will dump the contents of a MySQL database to Amazon S3. Run the task from cron using RAILS_ENV=production rake db:data:dump. I hope you found this helpful.
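For completeness, the crontab entry might look something like this; the application path is illustrative, and depending on your server's cron environment you may need to give the full path to the rake binary:

  # Run the database dump at 1 a.m. every night
  0 1 * * * cd /home/deploy/public_html/railsapp/current && RAILS_ENV=production rake db:data:dump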
* On the server it’s held elsewhere and I have Capistrano create a symbolic link to the configuration file in the correct location.