guides/backfilling_data.md
+49 -39 (49 additions & 39 deletions)
@@ -38,27 +38,27 @@ defmodule MyApp.Repo.DataMigrations.BackfillPosts do
38
38
end
39
39
```
40
40
41
-
The problem is the code and schema may change over time. However, migrations are using a snapshot of your schemas at the time it's written. In the future, many assumptions may no longer be true. For example, the new_data column may not be present anymore in the schema causing the query to fail if this migration is run months later.
41
+
The problem is that the code and schema may change over time. However, migrations use a snapshot of your schemas at the time they're written. In the future, many assumptions may no longer be true. For example, the `new_data` column may no longer be present in the schema, causing the query to fail if this migration is run months later.
42
42
43
-
Additionally, in your development environment, you might have 10 records to migrate; in staging, you might have 100; in production, you might have 1 billion to migrate. Scaling your approach matters.
43
+
Additionally, in your development environment, you might have 10 records to migrate. In staging, you might have 100. In production, you might have 1 billion to migrate. Scaling your approach matters.
44
44
45
45
Ultimately, there are several bad practices here:
46
46
47
47
1. The Ecto schema in the query may change after this migration was written.
48
-
1. If you try to backfill the data all at once, it may exhaust the database memory and/or CPU if it's changing a large data set.
49
-
1. Backfilling data inside a transaction for the migration locks row updates for the duration of the migration, even if you are updating in batches.
50
-
1. Disabling the transaction for the migration and only batching updates may still spike the database CPU to 100%, causing other concurrent reads or writes to time out.
48
+
2. If you try to backfill the data all at once, it may exhaust the database memory and/or CPU if it's changing a large data set.
49
+
3. Backfilling data inside a transaction for the migration locks row updates for the duration of the migration, even if you are updating in batches.
50
+
4. Disabling the transaction for the migration and only batching updates may still spike the database CPU to 100%, causing other concurrent reads or writes to time out.
51
51
52
52
## Good
53
53
54
54
There are four keys to backfilling safely:
55
55
56
56
1. running outside a transaction
57
-
1. batching
58
-
1. throttling
59
-
1. resiliency
57
+
2. batching
58
+
3. throttling
59
+
4. resiliency
60
60
61
-
As we've learned in this guide, it's straight-forward to disable the migration transactions. Add these options to the migration:
61
+
To disable the migration transactions, add these options to the top of your migration:
62
62
63
63
```elixir
64
64
@disable_ddl_transaction true
@@ -72,7 +72,8 @@ We'll start with how do we paginate efficiently: `LIMIT`/`OFFSET` by itself is a
72
72
For querying and updating the data, there are two ways to "snapshot" your schema at the time of the migration. We'll use both options below in the examples:
73
73
74
74
1. Execute raw SQL that represents the table at that moment. Do not use Ecto schemas. Prefer this approach when you can. Your application's Ecto schemas will change over time, but your migration should not; a migration that depends on them is therefore not a true snapshot of the data at the time it was written.
75
-
1. Write a small Ecto schema module inside the migration that only uses what you need. Then use that in your data migration. This is helpful if you prefer the Ecto API and decouples from your application's Ecto schemas as it evolves separately.
75
+
76
+
2. Write a small Ecto schema module inside the migration that only uses what you need. Then use that in your data migration. This is helpful if you prefer the Ecto API and decouples from your application's Ecto schemas as it evolves separately.
76
77
77
78
For throttling, we can simply add a `Process.sleep(@throttle)` for each page.
78
79
@@ -81,9 +82,9 @@ For resiliency, we need to ensure that we handle errors without losing our progr
81
82
Finally, to manage these data migrations separately, we need to:
82
83
83
84
1. Store data migrations separately from your schema migrations.
84
-
1. Run the data migrations manually.
85
+
2. Run the data migrations manually.
85
86
86
-
To achieve this, be inspired by [Ecto's documentation on creating a Release module](`Ecto.Migrator`), and extend your release module to allow options to pass into `Ecto.Migrator` that specifies the version to migrate and the data migrations' file path, for example:
87
+
If you have `mix` available in production, you can use `mix ecto.migrate --migrations-path "priv/repo/data_migrations"`. However, most applications use releases in production, so you need to extend your release module (see [Ecto's documentation on creating a Release module](`Ecto.Migrator`)). The idea is to provide a `migrate_data` function that specifies the version to migrate and the data migrations' file path, for example:
87
88
88
89
```elixir
89
90
defmodule MyApp.Release do
@@ -108,21 +109,15 @@ If the data can be queried with a condition that is removed after update then yo
108
109
Here's how we can manage the backfill:
109
110
110
111
1. Disable migration transactions.
111
-
1. Use keyset pagination: Order the data, find rows greater than the last mutated row and limit by batch size.
112
-
1. For each page, mutate the records.
113
-
1. Check for failed updates and handle it appropriately.
114
-
1. Use the last mutated record's ID as the starting point for the next page. This helps with resiliency and prevents looping on the same record over and over again.
115
-
1. Arbitrarily sleep to throttle and prevent exhausting the database.
116
-
1. Rinse and repeat until there are no more records
112
+
2. Use keyset pagination: Order the data, find rows greater than the last mutated row and limit by batch size.
113
+
3. For each page, mutate the records.
114
+
4. Check for failed updates and handle them appropriately.
115
+
5. Use the last mutated record's ID as the starting point for the next page. This helps with resiliency and prevents looping on the same record over and over again.
116
+
6. Arbitrarily sleep to throttle and prevent exhausting the database.
117
+
7. Rinse and repeat until there are no more records.
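
The steps above can be sketched as a small loop. This is a minimal sketch rather than the guide's implementation: `KeysetBackfillSketch`, `fetch_page`, and `update_page` are hypothetical names, the two callbacks stand in for the real queries, and positive integer IDs are assumed so that the cursor can start at 0.

```elixir
defmodule KeysetBackfillSketch do
  @moduledoc """
  Sketch of a keyset-paginated, throttled backfill loop.

  `fetch_page` and `update_page` are hypothetical callbacks standing in for
  the real queries. In an Ecto migration, `fetch_page` might look roughly like:

      from(p in "posts",
        where: p.id > ^last_id,
        order_by: [asc: p.id],
        limit: ^batch_size,
        select: p.id
      )
  """

  # Walks the data page by page and returns the total number of records mutated.
  def run(fetch_page, update_page, opts \\ []) do
    batch_size = Keyword.get(opts, :batch_size, 1000)
    throttle_ms = Keyword.get(opts, :throttle_ms, 100)
    # Assumption: IDs are positive integers, so 0 sorts before every row.
    loop(fetch_page, update_page, 0, batch_size, throttle_ms, 0)
  end

  defp loop(fetch_page, update_page, last_id, batch_size, throttle_ms, total) do
    case fetch_page.(last_id, batch_size) do
      [] ->
        # No rows beyond the cursor: the backfill is complete.
        total

      ids ->
        # Mutate the page; the callback returns `{count, _}`, the same shape
        # that `Repo.update_all/3` returns.
        {updated, _} = update_page.(ids)

        # Fail loudly on a partial update instead of silently skipping rows.
        if updated != length(ids) do
          raise "expected #{length(ids)} updates, got #{updated}"
        end

        # The last ID of this page becomes the cursor for the next page.
        next_id = List.last(ids)

        # Throttle so that concurrent reads and writes get a chance to run.
        Process.sleep(throttle_ms)
        loop(fetch_page, update_page, next_id, batch_size, throttle_ms, total + updated)
    end
  end
end

# Usage with an in-memory stand-in for the table (IDs 1..10).
rows = Enum.to_list(1..10)

fetch = fn last_id, limit ->
  rows |> Enum.filter(&(&1 > last_id)) |> Enum.take(limit)
end

update = fn ids -> {length(ids), nil} end

KeysetBackfillSketch.run(fetch, update, batch_size: 4, throttle_ms: 1)
```

Passing the queries in as functions keeps the loop itself free of database details, which makes the pagination and throttling logic easy to exercise without a running database.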
If the data being updated does not indicate it's already been updated, then we need to take a snapshot of the current data and store it temporarily. For example, if all rows should increment a column's value by 10, how would you know if a record was already updated? You could load a list of IDs into the application during the migration, but what if the process crashes? Instead we're going to keep the data we need in the database.
@@ -193,23 +194,26 @@ To do this, it works well if we can pick a specific point in time where all reco
193
194
Here's how we'll manage the backfill:
194
195
195
196
1. Create a "temporary" table. In this example, we're creating a real table that we'll drop at the end of the data migration. In Postgres, there are [actual temporary tables](https://www.postgresql.org/docs/12/sql-createtable.html) that are discarded after the session is over; we're not using those because we need resiliency in case the data migration encounters an error. The error would cause the session to be over, and therefore the temporary table tracking progress would be lost. Real tables don't have this problem. Likewise, we don't want to store IDs in application memory during the migration for the same reason.
196
-
1. Populate that temporary table with IDs of records that need to update. This query only requires a read of the current records, so there are no consequential locks occurring when populating, but be aware this could be a lengthy query. Populating this table can occur at creation or afterwards; in this example we'll populate it at table creation.
197
-
1. Ensure there's an index on the temporary table so it's fast to delete IDs from it. I use an index instead of a primary key because it's easier to re-run the migration in case there's an error. There isn't a straight-forward way to `CREATE IF NOT EXIST` on a primary key; but you can do that easily with an index.
198
-
1. Use keyset pagination to pull batches of IDs from the temporary table. Do this inside a database transaction and lock records for updates. Each batch should read and update within milliseconds, so this should have little impact on concurrent reads and writes.
199
-
1. For each batch of records, determine the data changes that need to happen. This can happen for each record.
200
-
1. [Upsert](https://wiki.postgresql.org/wiki/UPSERT) those changes to the real table. This insert will include the ID of the record that already exists and a list of attributes to change for that record. Since these insertions will conflict with existing records, we'll instruct Postgres to replace certain fields on conflicts.
201
-
1. Delete those IDs from the temporary table since they're updated on the real table. Close the database transaction for that batch.
202
-
1. Throttle so we don't overwhelm the database, and also give opportunity to other concurrent processes to work.
203
-
1. Rinse and repeat until the temporary table is empty.
204
-
1. Finally, drop the temporary table when empty.
205
197
206
-
Let's see how this can work:
198
+
2. Populate that temporary table with IDs of records that need to be updated. This query only requires a read of the current records, so there are no consequential locks occurring when populating, but be aware this could be a lengthy query. Populating this table can occur at creation or afterwards; in this example we'll populate it at table creation.
3. Ensure there's an index on the temporary table so it's fast to delete IDs from it. I use an index instead of a primary key because it's easier to re-run the migration in case there's an error. There isn't a straightforward way to create a primary key only if it does not already exist, but you can do that easily for an index with `CREATE INDEX IF NOT EXISTS`.
201
+
202
+
4. Use keyset pagination to pull batches of IDs from the temporary table. Do this inside a database transaction and lock records for updates. Each batch should read and update within milliseconds, so this should have little impact on concurrent reads and writes.
203
+
204
+
5. For each batch of records, determine the data changes that need to happen. This can happen for each record.
205
+
206
+
6. [Upsert](https://wiki.postgresql.org/wiki/UPSERT) those changes to the real table. This insert will include the ID of the record that already exists and a list of attributes to change for that record. Since these insertions will conflict with existing records, we'll instruct Postgres to replace certain fields on conflicts.
207
+
208
+
7. Delete those IDs from the temporary table since they're updated on the real table. Close the database transaction for that batch.
209
+
210
+
8. Throttle so we don't overwhelm the database, and also give opportunity to other concurrent processes to work.
211
+
212
+
9. Rinse and repeat until the temporary table is empty.
211
213
212
-
Modify the migration:
214
+
10. Finally, drop the temporary table when empty.
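
As a rough illustration of steps 1 to 3 and step 10, the SQL involved might look like the following. This is a sketch under assumptions, not the guide's code: the module name, the `records_to_update` table name, the `weather` source table, and the `inserted_at` cutoff are all placeholders. Every statement uses `IF NOT EXISTS` or `IF EXISTS`, so re-running after a crash is harmless.

```elixir
defmodule TempTableSketch do
  # Hypothetical name for the progress-tracking table.
  @temp_table "records_to_update"

  # Steps 1 and 2: create a real table and populate it in one statement.
  # `weather` and the `inserted_at` cutoff are placeholders for your own
  # table and point-in-time condition.
  def create_and_populate_sql do
    """
    CREATE TABLE IF NOT EXISTS "#{@temp_table}" AS
    SELECT id FROM weather WHERE inserted_at < '2021-08-21T00:00:00'
    """
  end

  # Step 3: creating the index is idempotent thanks to IF NOT EXISTS.
  def create_index_sql do
    ~s|CREATE INDEX IF NOT EXISTS "#{@temp_table}_id_index" ON "#{@temp_table}" (id)|
  end

  # Step 10: drop the tracking table once it is empty.
  def drop_sql do
    ~s|DROP TABLE IF EXISTS "#{@temp_table}"|
  end
end

# Inside a migration these strings would be passed to the repo, for example
# with `repo().query!(sql)`; here they are only printed.
IO.puts(TempTableSketch.create_and_populate_sql())
IO.puts(TempTableSketch.create_index_sql())
IO.puts(TempTableSketch.drop_sql())
```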
215
+
216
+
Let's see how this can work:
213
217
214
218
```elixir
215
219
# Both of these modules are in the same migration file
@@ -355,6 +359,12 @@ defmodule MyApp.Repo.DataMigrations.BackfillWeather do
This guide was originally published on [Fly.io Phoenix Files](https://fly.io/phoenix-files/backfilling-data/).
370
+
This guide was originally published on [Fly.io Phoenix Files](https://fly.io/phoenix-files/backfilling-data/). See [Automatic and Manual Ecto Migrations by Wojtek Mach](https://dashbit.co/blog/automatic-and-manual-ecto-migrations) for more examples on running Ecto migrations.
guides/safe_migrations.md
+1 -79 (1 addition & 79 deletions)
@@ -27,7 +27,7 @@ milliseconds which could be acceptable for you. However, once your table has
27
27
100+ million records, the difference becomes seconds which is more likely to be
28
28
felt and cause timeouts. Therefore, err on the side of safety, but
29
29
**always benchmark for your own database**. Also consider the hardware the
30
-
database is running; eg a Raspberry Pi 2B on a microSD will run much slower.
30
+
database is running on: for example, a Raspberry Pi 2B on a microSD will run much slower.
31
31
32
32
## Table of Contents
33
33
@@ -48,11 +48,6 @@ database is running; eg a Raspberry Pi 2B on a microSD will run much slower.
48
48
-[Adding a PostgreSQL extension](#adding-a-postgresql-extension)
49
49
-[Squashing migrations](#squashing-migrations)
50
50
51
-
Read more about safe migration techniques:
52
-
53
-
-[Migration locks in "Anatomy of a Migration"](migration_anatomy.html)
54
-
-[How to backfill data and change data in bulk (aka: DML)](backfilling_data.html)
55
-
56
51
## Adding an index
57
52
58
53
Creating an index will [block writes](https://www.postgresql.org/docs/8.2/sql-createindex.html) to the table in Postgres.
@@ -801,76 +796,6 @@ end
801
796
>
802
797
> Creating extensions typically requires superuser privileges. In managed database services (AWS RDS, Heroku), some extensions may not be available.
803
798
804
-
## Squashing Migrations
805
-
806
-
If you have a long list of migrations, sometimes it can take a while to migrate
807
-
each of those files every time the project is reset or spun up by a new
808
-
developer. Thankfully, Ecto comes with mix tasks to `dump` and `load` a database
809
-
structure which will represent the state of the database up to a certain point
810
-
in time, not including content.
811
-
812
-
-`mix ecto.dump`
813
-
-`mix ecto.load`
814
-
815
-
Schema dumping and loading is only supported by external binaries `pg_dump` and
816
-
`mysqldump`, which are used by the Postgres, MyXQL, and MySQL Ecto adapters (not
817
-
supported in MSSQL adapter).
818
-
819
-
For example:
820
-
821
-
```
822
-
20210101000000 - First Migration
823
-
20210201000000 - Second Migration
824
-
20210701000000 - Third Migration <-- we are here now. run `mix ecto.dump`
825
-
```
826
-
827
-
We can "squash" the migrations up to the current day which will effectively
828
-
fast-forward migrations to that structure. The Ecto Migrator will detect that
829
-
the database is already migrated to the third migration, and so it begins there
830
-
and migrates forward.
831
-
832
-
Let's add a new migration:
833
-
834
-
```
835
-
20210101000000 - First Migration
836
-
20210201000000 - Second Migration
837
-
20210701000000 - Third Migration <-- `structure.sql` represents up to here
838
-
20210801000000 - New Migration <-- This is where migrations will begin
839
-
```
840
-
841
-
The new migration will still run, but the first-through-third migrations will
842
-
not need to be run since the structure already represents the changes applied by
843
-
those migrations. At this point, you can safely delete the first, second, and
844
-
third migration files or keep them for historical auditing.
845
-
846
-
Let's make this work:
847
-
848
-
1. Run `mix ecto.dump` which will dump the current structure into
849
-
`priv/repo/structure.sql` by default. Check `mix help ecto.dump` for more
850
-
options.
851
-
2. During project setup with an empty database, run `mix ecto.load` to load
852
-
`structure.sql`.
853
-
3. Run `mix ecto.migrate` to run any additional migrations created after the
854
-
structure was dumped.
855
-
856
-
To simplify these actions into one command, we can leverage mix aliases:
857
-
858
-
```elixir
859
-
# mix.exs
860
-
861
-
defp aliases do
862
-
[
863
-
"ecto.reset": ["ecto.drop", "ecto.setup"],
864
-
"ecto.setup": ["ecto.load", "ecto.migrate"],
865
-
# ...
866
-
]
867
-
end
868
-
```
869
-
870
-
Now you can run `mix ecto.setup` and it will load the database structure and run
871
-
remaining migrations. Or, run `mix ecto.reset` and it will drop and run setup.
872
-
Of course, you can continue running `mix ecto.migrate` as you create them.
873
-
874
799
## Credits
875
800
876
801
Created and written by David Bernheisel with recipes heavily inspired from Andrew Kane and his library [strong_migrations](https://github.com/ankane/strong_migrations).
@@ -879,8 +804,5 @@ Created and written by David Bernheisel with recipes heavily inspired from Andre
879
804
-[Strong Migrations by Andrew Kane](https://github.com/ankane/strong_migrations)
880
805
-[Adding a NOT NULL CONSTRAINT on PG Faster with Minimal Locking by Christophe Escobar](https://medium.com/doctolib/adding-a-not-null-constraint-on-pg-faster-with-minimal-locking-38b2c00c4d1c)