September 17, 2024
We are building NeetoCal, a Calendly alternative. Recently, we deployed the latest code to production. The code change involved deleting a table. To our horror, during the deployment, we noticed that some users experienced errors with status code 500 for a few minutes. This happened because the migration to drop the table ran quickly and the table got deleted, but the old code was still referring to that table.
This kind of issue is pretty common with schema migrations, especially when you're dropping tables or columns. If the migration finishes before the code deployment catches up, you end up with old processes still trying to access tables or columns that no longer exist. This mismatch causes temporary errors, like the 500s we saw.
The safest bet might be to turn on maintenance mode every time we run a schema migration. However, at NeetoCal we deploy changes to production every day, and we only want to schedule downtime when it's absolutely necessary, so this option was ruled out. We also heard that some companies manually restart their dynos during schema migrations to roll out the new code changes. That did not sit well with us either.
Most of the people we talked to solve this problem by having two-step deployments.
Deployment 1: Deploy the code that is not using the table.
Deployment 2: Drop the table.
This can work and it does work. However, we were worried about a potential edge case. Let's say that a piece of code is still referring to the table. After deployment 1, this code keeps working and we don't see anything going wrong. It could be due to recent merges slipping in unnoticed, as we ship very fast.
However, when we do deployment 2, the migrations run first, dropping the table, followed by the deployment of the code. When the new code boots up, we realize that one part of the app is not working.
Now we are in trouble, because the table is gone. If we have taken a database backup, we can restore it, but that causes all kinds of issues, because we might not catch this bug for some time and in the meantime other tables are getting new data. So restoring a backup is a messy solution.
The only solution is to fix the code, and now we need to fix it in a rush. That's what we want to avoid. Before we look at our solution, let's look at what we found when we looked at other solutions.
At NeetoCal, we are using the strong_migrations gem to catch unsafe migrations. The gem catches unsafe migrations like removing a column, but it doesn't catch unsafe operations like dropping a table.
Upon some digging, we found this issue where the author of the gem expressed unwillingness to add drop_table as an unsafe operation.
No worries. We can add dropping of a table as an unsafe operation in strong_migrations ourselves. Here's how it can be done.
# config/initializers/strong_migrations.rb
StrongMigrations.add_check do |method, args|
  if method == :drop_table
    stop! "Dropping tables via migrations is discouraged."
  end
end
To drop the table, we can use the safety_assured block provided by the strong_migrations gem to mark the step as safe.
# db/migrate/20240809131941_drop_users.rb
class DropUsers < ActiveRecord::Migration[8.0]
  def change
    safety_assured { drop_table :users }
  end
end
While this gets the work done, it doesn't solve the problem of some code still referring to the table. Hence, this solution was a "no go" from our side.
Sam Saffron had run into similar problems. He came up with a solution and he wrote about it in this blog.
His solution was not to drop the tables and columns immediately. Instead, he used "defer drops" to drop a column or table at least 30 minutes after the particular migration was run.
He introduced ColumnDropper and TableDropper to get this work done.
We felt that this solution adds an extra layer of complexity, so we rejected it. In fact, we later found that they ran into some issues with "defer drops", as discussed here.
After some internal discussion, we decided to follow a three-step deployment process to ensure zero downtime and easy rollback without any data loss.
In "Deployment 1", we remove all the code that refers to the table we want to drop. This ensures that nothing in the application is depending on that table anymore.
In "Deployment 2", the table will be renamed. For example, table users
will be
renamed to users-deprecated-on-2024-08-09
. This step helps catch any dangling
code that is still referring the old table. If any part of the app still tries
to use the table, the errors will show up, and we can fix the problem one of the
two ways. We can revert the migration and the code. Or we can change the code.
We have a choice. If we delete the table then we don't have a choice.
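As a rough sketch, the "Deployment 2" migration could look like the following. The file name, timestamp, and renamed table name are illustrative; strong_migrations flags rename_table as unsafe, hence the safety_assured wrapper.

# db/migrate/20240809140000_rename_users_to_deprecated.rb (illustrative)
class RenameUsersToDeprecated < ActiveRecord::Migration[8.0]
  def change
    # rename_table is also flagged by strong_migrations, so mark it as safe
    safety_assured { rename_table :users, :users_deprecated_on_2024_08_09 }
  end
end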
Finally, in "Deployment 3", once we're confident that the table is no longer in use, we can drop it completely. Since the table follows a specific naming pattern, it's clear that it's ready to be safely deleted.
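A matching sketch for the "Deployment 3" drop, again with an illustrative file name:

# db/migrate/20240816090000_drop_deprecated_users.rb (illustrative)
class DropDeprecatedUsers < ActiveRecord::Migration[8.0]
  def change
    safety_assured { drop_table :users_deprecated_on_2024_08_09 }
  end
end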
We can follow a similar approach when dropping columns. To add an extra layer of safety, we mark the column that we need to drop as ignored using ActiveRecord's ignored_columns method. For example, if we need to drop the display_name column from the users table, start by marking it as ignored:
class User < ActiveRecord::Base
  self.ignored_columns += [:display_name]
end
By doing this, even if the display_name column still exists in some lingering code, our model won't recognize it. This helps avoid any accidental references to the column in our code. Once we've successfully dropped the column, we can remove this line from the model.
If our model won't recognize the column, then why do we need a RuboCop cop for dropping a column? Once again, the answer is to avoid an edge case. Let's say that we are executing SQL directly, and this SQL refers to the display_name column. Since raw SQL is being used, adding the column to ignored_columns has no impact. We will only get to know about the error once the column is deleted.
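Here is a contrived example of the kind of code that ignored_columns cannot protect; the query itself is purely illustrative:

# ignored_columns only affects the ActiveRecord model; raw SQL like this
# still hits the real column and will start failing once it is dropped.
ActiveRecord::Base.connection.select_all(
  "SELECT id, display_name FROM users WHERE display_name IS NOT NULL"
)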
By renaming the column, we retain the data, but at the same time all the lingering code starts failing.
Now the task was to build custom RuboCop cops to enforce this policy.
# bad
drop_table :users

# bad
drop_table :users do |t|
  t.string :email, null: false
  t.string :first_name, null: false
end

# good
drop_table :users_deprecated_on_2024_08_09

# good
drop_table :users_deprecated_on_2024_08_09 do |t|
  t.string :email, null: false
  t.string :first_name, null: false
end
We need to handle the removal of columns similarly.
# bad
remove_column :users, :email

# bad
change_table :users do |t|
  t.remove :email
end

# good
remove_column :users, :email_deprecated_on_2024_08_09

# good
change_table :users do |t|
  t.remove :email_deprecated_on_2024_08_09
end
We added these two cops to our rubocop-neeto repo.
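For illustration, a simplified version of such a cop could look like the sketch below. The class name, namespace, and offense message are assumptions made for this example and not the actual implementation from rubocop-neeto.

# An illustrative cop, not the actual one from rubocop-neeto.
# It flags drop_table calls whose table name lacks a "deprecated_on" suffix.
module RuboCop
  module Cop
    module Neeto
      class UnsafeDropTable < Base
        MSG = "Rename the table with a `deprecated_on` suffix before dropping it."

        # Matches `drop_table :users` and `drop_table "users"`.
        def_node_matcher :drop_table_call, <<~PATTERN
          (send nil? :drop_table ${sym str} ...)
        PATTERN

        def on_send(node)
          drop_table_call(node) do |table_name|
            add_offense(node) unless table_name.value.to_s.include?("deprecated_on")
          end
        end
      end
    end
  end
end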
Renaming a column brings the same challenges as we have discussed in the previous sections. Renaming a column directly will cause temporary downtime since the new code references the new column name while the old code refers to the old column name. To avoid downtime, we need to deliberately carry out this operation in multiple deployments.
Here are the steps to rename the username column to display_name in the users table:
Deployment 1

Add a new display_name column to the table:

class AddDisplayNameToUsers < ActiveRecord::Migration[8.0]
  def change
    add_column :users, :display_name, :string
  end
end

Write to both the columns. ActiveRecord callbacks can help with this:

class User < ApplicationRecord
  before_save do
    self.display_name = username if will_save_change_to_username?
  end
end

Backfill data from the username column to the display_name column:

User.update_all('display_name = username')
Deployment 2

Update all code references to use the new display_name column instead of the old username column, and then remove the double writes to both columns.
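For example, after this deployment the User model no longer carries the double-write callback. Marking the old column as ignored is an optional extra safety step in line with the earlier section, not part of the original list:

class User < ApplicationRecord
  # The before_save double-write callback is removed; the app now reads and
  # writes display_name directly. Ignoring username guards against any
  # lingering ActiveRecord references to the old column.
  self.ignored_columns += [:username]
end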
Deployment 3

Finally, drop the old username column:
class DropUsernameFromUsers < ActiveRecord::Migration[8.0]
  def change
    remove_column :users, :username
  end
end
This approach might seem tedious, but it's essential for achieving zero downtime during the migration and avoiding edge cases. We can apply the same steps when renaming tables as well. For more details on that process, check out the steps mentioned in the strong_migrations documentation.
Running schema migrations can be scary, especially when they involve dropping tables and columns. But with the right safeguards in place, we can confidently deploy updates without worrying about any surprises.
If this blog was helpful, check out our full blog archive.