Managing Zero-Downtime Schema Changes and Performant Data Migrations in a Rails App - A Case Study

Fun Fax

Here at Iora Health, we receive over 1,300 faxes per day — just a tiny slice of the millions of faxes sent every day in industries like healthcare where faxing is still used as a primary method of inter-office communication. While the broader healthcare industry is working towards a “fax free” world, our aim as software engineers working on Iora’s collaborative care platform, Chirp, is to try to streamline the current process as much as possible for the team members that handle these documents daily on behalf of our patients.

Gandalf against an army

Us beating back the tide of faxes.

At the outset of the project, however, we immediately faced an obstacle in our data modeling that prevented us from building the planned functionality. We had technical debt in the initial structure of two tables that required us to convert the column type and backfill the data of a large and heavily utilized PostgreSQL table in Production — the incoming_faxes table, which held around 222,000 records at the time of the migration and is accessed hundreds, sometimes thousands, of times a day. This article describes how we resolved the schema changes and backfilled the data without any detrimental impact to the application or interruption in the stream of faxes coming our way.

Facing the Fax

When Iora receives a fax, it is routed to Sfax for viewing and processing. The Chirp team had already further automated the processing step by configuring webhooks in Sfax that call our Rails app whenever a fax is received. This Sfax Webhooks controller creates an incoming fax record in our database and then enqueues an asynchronous Resque job to apply optical character recognition (OCR) to the fax — making its text content available for indexing, searching, and storage — and then creates a Box file (both an actual file in Box and a corresponding metadata record in our database) and moves the fax into a Box folder for incoming faxes. From there our Operations Assistants (OAs) review and manage the faxes in that Box folder. The #ShadowFax team hoped to further automate the process by providing OAs with a way to review and manage incoming faxes directly in the Chirp user interface (UI). With this in place, OAs would only need to access the Box UI for later retrieval of the file.

Old and new fax processing workflows

Simplified fax processing workflows

At the outset there was no formal relationship between the incoming_faxes and box_files tables in the database underlying Chirp. The only link between the two was a unidirectional belongs_to ActiveRecord association. Our proposed feature set would rely on database queries joining the two tables to filter box files by the initial fax’s location in order to display only relevant incoming faxes for a given user — impossible in the current configuration, as there was no foreign key relationship to enable the join. In order to build that feature, we first needed to convert the column type of incoming_faxes.box_file_id to a UUID, and then establish a relationship between the two in the database itself by making it a foreign key linked to box_files.id.

Data model of faxes and Box files


Simplified Active Record models at the outset:

class IncomingFax < ApplicationRecord
  belongs_to :box_file
end

class BoxFile < ApplicationRecord
end

Not so Fax

My approach for this process followed a modified version of the Change Column Type approach laid out in this excellent blog post from Braintree about zero-downtime schema changes:

  1. Add a new column (box_file_uid) and backfill it with a copy of the old column’s (box_file_id) values. (Note: We simplified this set of steps from the suggested approach because the code using this field was limited to a single case, worked automatically upon switching to a new column, and any failures could be retried in Resque.)
  2. Ignore the old column (box_file_id) in ActiveRecord to ensure it is no longer in use anywhere
  3. Drop the old column (box_file_id)

I released the initial schema migration as follows:

class AddBoxFileUidToIncomingFax < ActiveRecord::Migration[5.2]
  def up
    add_column :incoming_faxes, :box_file_uid, :uuid
    add_index :incoming_faxes, :box_file_uid, unique: true
  end

  def down
    remove_column :incoming_faxes, :box_file_uid
  end
end
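For busier or larger tables, a variant worth considering builds the unique index concurrently so writes are not blocked while it is created. This is a sketch, not the migration we shipped; note that `CREATE INDEX CONCURRENTLY` must run outside a transaction, hence `disable_ddl_transaction!`:

```ruby
# Hedged alternative: build the index without blocking concurrent writes.
class AddBoxFileUidToIncomingFax < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def up
    add_column :incoming_faxes, :box_file_uid, :uuid
    # algorithm: :concurrently maps to CREATE UNIQUE INDEX CONCURRENTLY
    add_index :incoming_faxes, :box_file_uid, unique: true, algorithm: :concurrently
  end

  def down
    remove_column :incoming_faxes, :box_file_uid
  end
end
```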

And the backfill task:

namespace :one_offs do
  task backfill_incoming_fax_box_file_uid: :environment do
    verify_convertable!
    IncomingFax.find_each do |fax|
      fax.update!(box_file_uid: fax.box_file_id)
    end
  end

  def verify_convertable!
    unconvertable_rows_count = IncomingFax.where("box_file_id::uuid::text <> box_file_id").count
    if unconvertable_rows_count > 0
      raise <<-ERROR
        Unable to convert incoming_faxes.box_file_id because #{unconvertable_rows_count} rows are not valid uuids
      ERROR
    end
  end
end

And updated the models to establish an explicit relationship through the new foreign key:

class BoxFile < ApplicationRecord
  has_one :incoming_fax, foreign_key: :box_file_uid
end

class IncomingFax < ApplicationRecord
  belongs_to :box_file, foreign_key: :box_file_uid
end
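With the association declared on both sides, the kind of join query the feature needed becomes straightforward. A hypothetical example (the location-scoping columns here are illustrative, not Chirp’s actual schema):

```ruby
# Join incoming faxes to their Box files and scope to a user's location.
# `location_id` and `current_user` are stand-ins for illustration only.
relevant_faxes = IncomingFax
  .joins(:box_file)
  .where(location_id: current_user.location_id)
```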

The schema changes worked from the get-go thanks to the short duration of the ACCESS EXCLUSIVE lock placed on the table during an ADD COLUMN operation. And the backfill task looked promising due to the use of the Rails method find_each, which leverages find_in_batches under the hood to load records in batches of 1,000 rather than all at once.

However, upon running the backfill task in our pre-production environment it became apparent that the code was not performant enough to run with confidence in production. The task ran for over an hour before I interrupted it and decided it was time to go back to the drawing board (or coding board…because that’s kind of what a monitor is…right?).

Back on the Fax Track

With the guidance of the trusty Ruby Benchmark module to assess a variety of different refactors, I ultimately landed on an approach using pluck to obtain the relevant IDs without loading other irrelevant columns for the records in question, and using raw SQL for the UPDATE statement. A second pass at this was released as follows:
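The benchmarking itself looked roughly like the following — a generic sketch with stand-in blocks, since the real comparison ran the candidate ActiveRecord queries against pre-production data:

```ruby
require "benchmark"

# Compare two candidate implementations side by side. Benchmark.bm prints
# user/CPU/real times for each labeled block, making the faster approach
# easy to spot. The workloads below are placeholders, not the real queries.
Benchmark.bm(14) do |bm|
  bm.report("full load:")   { 50_000.times.map { |i| { id: i, extra: "x" * 50 } } }
  bm.report("pluck-style:") { 50_000.times.map { |i| i } }
end
```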

namespace :one_offs do
  task backfill_incoming_fax_box_file_uid: :environment do
    verify_convertable!

    ids = IncomingFax.pluck(:id, :box_file_id, :box_file_uid)
    ids_count = ids.count
    ids.each_with_index do |incoming_fax_ids, index|
      puts "#{index + 1} of #{ids_count} updating…"
      id, box_file_id, box_file_uid = incoming_fax_ids
      next if box_file_id.blank? || box_file_uid.present?

      sql = <<-SQL
        UPDATE incoming_faxes SET box_file_uid = '#{box_file_id}' WHERE id = '#{id}'
      SQL
      ActiveRecord::Base.connection.execute(sql)
    end
  end
end

This approach worked in part because the number of records being loaded into memory in this case is still relatively small, but larger tables — with millions instead of hundreds of thousands of rows — might be better served with those initial batching mechanisms.
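For a table where even pluck-ing every ID would strain memory, one such batching sketch uses in_batches (available since Rails 5) to issue a set-based UPDATE per batch instead of one statement per row. This assumes the same column names as above and is illustrative, not what we shipped:

```ruby
# Walk the table in primary-key batches; each iteration issues a single
# UPDATE covering the whole batch rather than one UPDATE per record.
IncomingFax
  .where.not(box_file_id: nil)
  .where(box_file_uid: nil)
  .in_batches(of: 1_000) do |batch|
    batch.update_all("box_file_uid = box_file_id::uuid")
  end
```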

By this time there had been significant back-and-forth on the pull request (PR) and many iterations of the task done locally, so the rake task was optimized for performance and not code style. But performant it was — the task went from running for an hour-plus to 36 seconds in the pre-production environment.

Once this was shipped and the new column was fully populated, I moved on to Step 2: ignore the old column. I opened another PR to ignore the legacy column using the handy Rails 5 class attribute ignored_columns. Setting it prevents Active Record from defining attribute accessors for the column and from including it in generated SQL queries. This provides an extra layer of safety to ensure no code is accessing the old column before dropping it entirely.

The simplified IncomingFax model:

class IncomingFax < ApplicationRecord
  belongs_to :box_file, foreign_key: :box_file_uid
  self.ignored_columns = %w(box_file_id)
end

Finally, after ensuring that the new column was working successfully and the old column was being fully ignored, I was able to open the final PR for Step 3: drop the old column. By iterating carefully through the preceding steps, sunsetting this legacy column becomes significantly less risky and can be done without breaking a (proverbial) sweat.
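The Step 3 migration is a near one-liner along these lines (the class name is illustrative, and the down method assumes the legacy column was a string):

```ruby
# Final step: drop the legacy column now that nothing references it.
class RemoveBoxFileIdFromIncomingFaxes < ActiveRecord::Migration[5.2]
  def up
    remove_column :incoming_faxes, :box_file_id
  end

  def down
    # Restores the column's shape only; the data itself is not recoverable.
    add_column :incoming_faxes, :box_file_id, :string
  end
end
```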

Fax Forward

This work, while not particularly glamorous, unlocked huge potential for dealing with faxes and box files in a sensible way on the back end. It enabled us to develop an initial pilot of the fax processing features, complete with:

  1. A navigation element with notifications of incoming faxes scoped to a user’s location
  2. A list of incoming faxes and metadata by location
  3. The ability to view and search a single fax from within Chirp
  4. The ability to rename, assign to a patient, move within Box, and create a task from within Chirp
Chirp screenshot - viewing list of incoming faxes

Viewing incoming faxes for a fake patient in our test environment

Chirp screenshot - viewing and processing a single fax

Viewing and processing a single fax for a fake patient in our test environment

And all of this is now possible because we were able to modify the underlying tables without any disruption in a careful, stepwise fashion.

Acknowledgements

Huge thanks to my fellow Chirp Team engineers Max Fierke and Matt Doller for providing insightful feedback on the approach and code detailed in this article, and for more generally developing a culture of thoughtfulness and intentionality in how we treat our mission-critical data at Iora Health. Thanks also to the team at Braintree for their handy overview of best practices for safe database schema migrations in PostgreSQL.