Commit 8d80d5d

Initial guide updates
1 parent 1e5d0ef commit 8d80d5d

3 files changed

Lines changed: 55 additions & 121 deletions

guides/backfilling_data.md

Lines changed: 49 additions & 39 deletions
@@ -38,27 +38,27 @@ defmodule MyApp.Repo.DataMigrations.BackfillPosts do
3838
end
3939
```
4040

41-
The problem is the code and schema may change over time. However, migrations are using a snapshot of your schemas at the time it's written. In the future, many assumptions may no longer be true. For example, the new_data column may not be present anymore in the schema causing the query to fail if this migration is run months later.
41+
The problem is the code and schema may change over time. However, migrations are using a snapshot of your schemas at the time it's written. In the future, many assumptions may no longer be true. For example, the `new_data` column may not be present anymore in the schema causing the query to fail if this migration is run months later.
4242

43-
Additionally, in your development environment, you might have 10 records to migrate; in staging, you might have 100; in production, you might have 1 billion to migrate. Scaling your approach matters.
43+
Additionally, in your development environment, you might have 10 records to migrate. In staging, you might have 100. In production, you might have 1 billion to migrate. Scaling your approach matters.
4444

4545
Ultimately, there are several bad practices here:
4646

4747
1. The Ecto schema in the query may change after this migration was written.
48-
1. If you try to backfill the data all at once, it may exhaust the database memory and/or CPU if it's changing a large data set.
49-
1. Backfilling data inside a transaction for the migration locks row updates for the duration of the migration, even if you are updating in batches.
50-
1. Disabling the transaction for the migration and only batching updates may still spike the database CPU to 100%, causing other concurrent reads or writes to time out.
48+
2. If you try to backfill the data all at once, it may exhaust the database memory and/or CPU if it's changing a large data set.
49+
3. Backfilling data inside a transaction for the migration locks row updates for the duration of the migration, even if you are updating in batches.
50+
4. Disabling the transaction for the migration and only batching updates may still spike the database CPU to 100%, causing other concurrent reads or writes to time out.
5151

5252
## Good
5353

5454
There are four keys to backfilling safely:
5555

5656
1. running outside a transaction
57-
1. batching
58-
1. throttling
59-
1. resiliency
57+
2. batching
58+
3. throttling
59+
4. resiliency
6060

61-
As we've learned in this guide, it's straight-forward to disable the migration transactions. Add these options to the migration:
61+
To disable the migration transactions, add these options to the top of your migration:
6262

6363
```elixir
6464
@disable_ddl_transaction true
@@ -72,7 +72,8 @@ We'll start with how do we paginate efficiently: `LIMIT`/`OFFSET` by itself is a
7272
For querying and updating the data, there are two ways to "snapshot" your schema at the time of the migration. We'll use both options below in the examples:
7373

7474
1. Execute raw SQL that represents the table at that moment. Do not use Ecto schemas. Prefer this approach when you can. Your application's Ecto schemas will change over time, but your migration should not, therefore it's not a true snapshot of the data at the time.
75-
1. Write a small Ecto schema module inside the migration that only uses what you need. Then use that in your data migration. This is helpful if you prefer the Ecto API and decouples from your application's Ecto schemas as it evolves separately.
75+
76+
2. Write a small Ecto schema module inside the migration that only uses what you need. Then use that in your data migration. This is helpful if you prefer the Ecto API and decouples from your application's Ecto schemas as it evolves separately.
7677

7778
For throttling, we can simply add a `Process.sleep(@throttle)` for each page.
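
As a rough illustration of batching plus throttling, here is a minimal sketch that pages through an in-memory list instead of a database table. The module name `BackfillSketch`, the `@batch_size` and `@throttle_ms` values, and the `update_fun` callback are all hypothetical and not part of the guide; a real data migration would query the repo instead.

```elixir
defmodule BackfillSketch do
  # Hypothetical values for illustration; tune them for your own database.
  @batch_size 2
  @throttle_ms 10

  # Walks `rows` (maps with an integer :id) page by page, keyset-style:
  # each page holds the rows whose id is greater than the last id seen.
  def run(rows, update_fun, last_id \\ 0) do
    case page(rows, last_id) do
      [] ->
        :ok

      batch ->
        Enum.each(batch, update_fun)
        # Throttle between pages so other work gets a chance to run.
        Process.sleep(@throttle_ms)
        run(rows, update_fun, List.last(batch).id)
    end
  end

  # In-memory stand-in for `WHERE id > last_id ORDER BY id LIMIT batch_size`.
  defp page(rows, last_id) do
    rows
    |> Enum.filter(&(&1.id > last_id))
    |> Enum.sort_by(& &1.id)
    |> Enum.take(@batch_size)
  end
end
```

Calling `BackfillSketch.run(rows, &IO.inspect/1)` with rows such as `%{id: 1}` visits each row once in `id` order and pauses between pages.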
7879

@@ -81,9 +82,9 @@ For resiliency, we need to ensure that we handle errors without losing our progr
8182
Finally, to manage these data migrations separately, we need to:
8283

8384
1. Store data migrations separately from your schema migrations.
84-
1. Run the data migrations manually.
85+
2. Run the data migrations manually.
8586

86-
To achieve this, be inspired by [Ecto's documentation on creating a Release module](`Ecto.Migrator`), and extend your release module to allow options to pass into `Ecto.Migrator` that specifies the version to migrate and the data migrations' file path, for example:
87+
If you have `mix` available in production, you can use `mix ecto.migrate --migrations-path "priv/repo/data_migrations"`. However, most applications use releases in production, so you need to extend your release module (see [Ecto's documentation on creating a Release module](`Ecto.Migrator`)). The idea is to provide a `migrate_data` function that specifies the version to migrate and the data migrations' file path, for example:
8788

8889
```elixir
8990
defmodule MyApp.Release do
@@ -108,21 +109,15 @@ If the data can be queried with a condition that is removed after update then yo
108109
Here's how we can manage the backfill:
109110

110111
1. Disable migration transactions.
111-
1. Use keyset pagination: Order the data, find rows greater than the last mutated row and limit by batch size.
112-
1. For each page, mutate the records.
113-
1. Check for failed updates and handle it appropriately.
114-
1. Use the last mutated record's ID as the starting point for the next page. This helps with resiliency and prevents looping on the same record over and over again.
115-
1. Arbitrarily sleep to throttle and prevent exhausting the database.
116-
1. Rinse and repeat until there are no more records
112+
2. Use keyset pagination: Order the data, find rows greater than the last mutated row and limit by batch size.
113+
3. For each page, mutate the records.
114+
4. Check for failed updates and handle it appropriately.
115+
5. Use the last mutated record's ID as the starting point for the next page. This helps with resiliency and prevents looping on the same record over and over again.
116+
6. Arbitrarily sleep to throttle and prevent exhausting the database.
117+
7. Rinse and repeat until there are no more records
117118

118119
For example:
119120

120-
```bash
121-
mix ecto.gen.migration --migrations-path=priv/repo/data_migrations backfill_posts
122-
```
123-
124-
And modify the migration:
125-
126121
```elixir
127122
defmodule MyApp.Repo.DataMigrations.BackfillPosts do
128123
use Ecto.Migration
@@ -184,6 +179,12 @@ defmodule MyApp.Repo.DataMigrations.BackfillPosts do
184179
end
185180
```
186181

182+
To test it in development/test environments:
183+
184+
```bash
185+
mix ecto.gen.migration --migrations-path=priv/repo/data_migrations backfill_posts
186+
```
187+
187188
## Batching Arbitrary Data
188189

189190
If the data being updated does not indicate it's already been updated, then we need to take a snapshot of the current data and store it temporarily. For example, if all rows should increment a column's value by 10, how would you know if a record was already updated? You could load a list of IDs into the application during the migration, but what if the process crashes? Instead we're going to keep the data we need in the database.
@@ -193,23 +194,26 @@ To do this, it works well if we can pick a specific point in time where all reco
193194
Here's how we'll manage the backfill:
194195

195196
1. Create a "temporary" table. In this example, we're creating a real table that we'll drop at the end of the data migration. In Postgres, there are [actual temporary tables](https://www.postgresql.org/docs/12/sql-createtable.html) that are discarded after the session is over; we're not using those because we need resiliency in case the data migration encounters an error. The error would cause the session to be over, and therefore the temporary table tracking progress would be lost. Real tables don't have this problem. Likewise, we don't want to store IDs in application memory during the migration for the same reason.
196-
1. Populate that temporary table with IDs of records that need to update. This query only requires a read of the current records, so there are no consequential locks occurring when populating, but be aware this could be a lengthy query. Populating this table can occur at creation or afterwards; in this example we'll populate it at table creation.
197-
1. Ensure there's an index on the temporary table so it's fast to delete IDs from it. I use an index instead of a primary key because it's easier to re-run the migration in case there's an error. There isn't a straight-forward way to `CREATE IF NOT EXIST` on a primary key; but you can do that easily with an index.
198-
1. Use keyset pagination to pull batches of IDs from the temporary table. Do this inside a database transaction and lock records for updates. Each batch should read and update within milliseconds, so this should have little impact on concurrent reads and writes.
199-
1. For each batch of records, determine the data changes that need to happen. This can happen for each record.
200-
1. [Upsert](https://wiki.postgresql.org/wiki/UPSERT) those changes to the real table. This insert will include the ID of the record that already exists and a list of attributes to change for that record. Since these insertions will conflict with existing records, we'll instruct Postgres to replace certain fields on conflicts.
201-
1. Delete those IDs from the temporary table since they're updated on the real table. Close the database transaction for that batch.
202-
1. Throttle so we don't overwhelm the database, and also give opportunity to other concurrent processes to work.
203-
1. Rinse and repeat until the temporary table is empty.
204-
1. Finally, drop the temporary table when empty.
205197

206-
Let's see how this can work:
198+
2. Populate that temporary table with IDs of records that need to update. This query only requires a read of the current records, so there are no consequential locks occurring when populating, but be aware this could be a lengthy query. Populating this table can occur at creation or afterwards; in this example we'll populate it at table creation.
207199

208-
```bash
209-
mix ecto.gen.migration --migrations-path=priv/repo/data_migrations backfill_weather
210-
```
200+
3. Ensure there's an index on the temporary table so it's fast to delete IDs from it. I use an index instead of a primary key because it's easier to re-run the migration in case there's an error. There isn't a straightforward way to use `IF NOT EXISTS` when adding a primary key, but you can do that easily with an index.
201+
202+
4. Use keyset pagination to pull batches of IDs from the temporary table. Do this inside a database transaction and lock records for updates. Each batch should read and update within milliseconds, so this should have little impact on concurrent reads and writes.
203+
204+
5. For each batch of records, determine the data changes that need to happen. This can happen for each record.
205+
206+
6. [Upsert](https://wiki.postgresql.org/wiki/UPSERT) those changes to the real table. This insert will include the ID of the record that already exists and a list of attributes to change for that record. Since these insertions will conflict with existing records, we'll instruct Postgres to replace certain fields on conflicts.
207+
208+
7. Delete those IDs from the temporary table since they're updated on the real table. Close the database transaction for that batch.
209+
210+
8. Throttle so we don't overwhelm the database, and also give opportunity to other concurrent processes to work.
211+
212+
9. Rinse and repeat until the temporary table is empty.
211213

212-
Modify the migration:
214+
10. Finally, drop the temporary table when empty.
215+
216+
Let's see how this can work:
213217

214218
```elixir
215219
# Both of these modules are in the same migration file
@@ -355,6 +359,12 @@ defmodule MyApp.Repo.DataMigrations.BackfillWeather do
355359
end
356360
```
357361

362+
And to test in development/test environments:
363+
364+
```bash
365+
mix ecto.gen.migration --migrations-path=priv/repo/data_migrations backfill_weather
366+
```
367+
358368
---
359369

360-
This guide was originally published on [Fly.io Phoenix Files](https://fly.io/phoenix-files/backfilling-data/).
370+
This guide was originally published on [Fly.io Phoenix Files](https://fly.io/phoenix-files/backfilling-data/). See [Automatic and Manual Ecto Migrations by Wojtek Mach](https://dashbit.co/blog/automatic-and-manual-ecto-migrations) for more examples on running Ecto migrations.

guides/safe_migrations.md

Lines changed: 1 addition & 79 deletions
@@ -27,7 +27,7 @@ milliseconds which could be acceptable for you. However, once your table has
2727
100+ million records, the difference becomes seconds which is more likely to be
2828
felt and cause timeouts. Therefore, err on the side of safety, but
2929
**always benchmark for your own database**. Also consider the hardware the
30-
database is running; eg a Raspberry Pi 2B on a microSD will run much slower.
30+
database is running: for example, a Raspberry Pi 2B on a microSD will run much slower.
3131

3232
## Table of Contents
3333

@@ -48,11 +48,6 @@ database is running; eg a Raspberry Pi 2B on a microSD will run much slower.
4848
- [Adding a PostgreSQL extension](#adding-a-postgresql-extension)
4949
- [Squashing migrations](#squashing-migrations)
5050

51-
Read more about safe migration techniques:
52-
53-
- [Migration locks in "Anatomy of a Migration"](migration_anatomy.html)
54-
- [How to backfill data and change data in bulk (aka: DML)](backfilling_data.html)
55-
5651
## Adding an index
5752

5853
Creating an index will [block writes](https://www.postgresql.org/docs/8.2/sql-createindex.html) to the table in Postgres.
@@ -801,76 +796,6 @@ end
801796
>
802797
> Creating extensions typically requires superuser privileges. In managed database services (AWS RDS, Heroku), some extensions may not be available.
803798
804-
## Squashing Migrations
805-
806-
If you have a long list of migrations, sometimes it can take a while to migrate
807-
each of those files every time the project is reset or spun up by a new
808-
developer. Thankfully, Ecto comes with mix tasks to `dump` and `load` a database
809-
structure which will represent the state of the database up to a certain point
810-
in time, not including content.
811-
812-
- `mix ecto.dump`
813-
- `mix ecto.load`
814-
815-
Schema dumping and loading is only supported by external binaries `pg_dump` and
816-
`mysqldump`, which are used by the Postgres, MyXQL, and MySQL Ecto adapters (not
817-
supported in MSSQL adapter).
818-
819-
For example:
820-
821-
```
822-
20210101000000 - First Migration
823-
20210201000000 - Second Migration
824-
20210701000000 - Third Migration <-- we are here now. run `mix ecto.dump`
825-
```
826-
827-
We can "squash" the migrations up to the current day which will effectively
828-
fast-forward migrations to that structure. The Ecto Migrator will detect that
829-
the database is already migrated to the third migration, and so it begins there
830-
and migrates forward.
831-
832-
Let's add a new migration:
833-
834-
```
835-
20210101000000 - First Migration
836-
20210201000000 - Second Migration
837-
20210701000000 - Third Migration <-- `structure.sql` represents up to here
838-
20210801000000 - New Migration <-- This is where migrations will begin
839-
```
840-
841-
The new migration will still run, but the first-through-third migrations will
842-
not need to be run since the structure already represents the changes applied by
843-
those migrations. At this point, you can safely delete the first, second, and
844-
third migration files or keep them for historical auditing.
845-
846-
Let's make this work:
847-
848-
1. Run `mix ecto.dump` which will dump the current structure into
849-
`priv/repo/structure.sql` by default. Check `mix help ecto.dump` for more
850-
options.
851-
2. During project setup with an empty database, run `mix ecto.load` to load
852-
`structure.sql`.
853-
3. Run `mix ecto.migrate` to run any additional migrations created after the
854-
structure was dumped.
855-
856-
To simplify these actions into one command, we can leverage mix aliases:
857-
858-
```elixir
859-
# mix.exs
860-
861-
defp aliases do
862-
[
863-
"ecto.reset": ["ecto.drop", "ecto.setup"],
864-
"ecto.setup": ["ecto.load", "ecto.migrate"],
865-
# ...
866-
]
867-
end
868-
```
869-
870-
Now you can run `mix ecto.setup` and it will load the database structure and run
871-
remaining migrations. Or, run `mix ecto.reset` and it will drop and run setup.
872-
Of course, you can continue running `mix ecto.migrate` as you create them.
873-
874799
## Credits
875800

876801
Created and written by David Bernheisel with recipes heavily inspired from Andrew Kane and his library [strong_migrations](https://github.com/ankane/strong_migrations).
@@ -879,8 +804,5 @@ Created and written by David Bernheisel with recipes heavily inspired from Andre
879804
- [Strong Migrations by Andrew Kane](https://github.com/ankane/strong_migrations)
880805
- [Adding a NOT NULL CONSTRAINT on PG Faster with Minimal Locking by Christophe Escobar](https://medium.com/doctolib/adding-a-not-null-constraint-on-pg-faster-with-minimal-locking-38b2c00c4d1c)
881806
- [Postgres Runtime Configuration](https://www.postgresql.org/docs/current/runtime-config-client.html)
882-
- [Automatic and Manual Ecto Migrations by Wojtek Mach](https://dashbit.co/blog/automatic-and-manual-ecto-migrations)
883807

884808
Special thanks for sponsorship: Fly.io
885-
886-
Special thanks for the reviewers.

mix.exs

Lines changed: 5 additions & 3 deletions
@@ -191,14 +191,16 @@ defmodule EctoSQL.MixProject do
191191
source_url: @source_url,
192192
extras: [
193193
"CHANGELOG.md",
194-
"guides/safe_migrations.md",
195194
"guides/migration_anatomy.md",
195+
"guides/safe_migrations.md",
196+
"guides/squashing_migrations.md",
196197
"guides/backfilling_data.md"
197198
],
198199
groups_for_extras: [
199-
Guides: [
200-
"guides/safe_migrations.md",
200+
"Migration Guides": [
201201
"guides/migration_anatomy.md",
202+
"guides/safe_migrations.md",
203+
"guides/squashing_migrations.md",
202204
"guides/backfilling_data.md"
203205
]
204206
],
