Initial guide updates

josevalim · josevalim · commit 8d80d5d82daa · 2026-05-18T13:37:02.000+02:00
diff --git a/guides/backfilling_data.md b/guides/backfilling_data.md
@@ -38,27 +38,27 @@ defmodule MyApp.Repo.DataMigrations.BackfillPosts do
 end
 ```
 
-The problem is the code and schema may change over time. However, migrations are using a snapshot of your schemas at the time it's written. In the future, many assumptions may no longer be true. For example, the new_data column may not be present anymore in the schema causing the query to fail if this migration is run months later.
+The problem is that the code and schema may change over time, while a migration uses a snapshot of your schemas at the time it was written. In the future, many assumptions may no longer be true. For example, the `new_data` column may no longer be present in the schema, causing the query to fail if this migration is run months later.
 
-Additionally, in your development environment, you might have 10 records to migrate; in staging, you might have 100; in production, you might have 1 billion to migrate. Scaling your approach matters.
+Additionally, in your development environment, you might have 10 records to migrate. In staging, you might have 100. In production, you might have 1 billion to migrate. Scaling your approach matters.
 
 Ultimately, there are several bad practices here:
 
 1. The Ecto schema in the query may change after this migration was written.
-1. If you try to backfill the data all at once, it may exhaust the database memory and/or CPU if it's changing a large data set.
-1. Backfilling data inside a transaction for the migration locks row updates for the duration of the migration, even if you are updating in batches.
-1. Disabling the transaction for the migration and only batching updates may still spike the database CPU to 100%, causing other concurrent reads or writes to time out.
+2. If you try to backfill the data all at once, it may exhaust the database memory and/or CPU if it's changing a large data set.
+3. Backfilling data inside a transaction for the migration locks row updates for the duration of the migration, even if you are updating in batches.
+4. Disabling the transaction for the migration and only batching updates may still spike the database CPU to 100%, causing other concurrent reads or writes to time out.
 
 ##  Good
 
 There are four keys to backfilling safely:
 
 1. running outside a transaction
-1. batching
-1. throttling
-1. resiliency
+2. batching
+3. throttling
+4. resiliency
 
-As we've learned in this guide, it's straight-forward to disable the migration transactions. Add these options to the migration:
+To disable the migration transactions, add these options to the top of your migration:
 
 ```elixir
 @disable_ddl_transaction true
@@ -72,7 +72,8 @@ We'll start with how do we paginate efficiently: `LIMIT`/`OFFSET` by itself is a
 For querying and updating the data, there are two ways to "snapshot" your schema at the time of the migration. We'll use both options below in the examples:
 
 1. Execute raw SQL that represents the table at that moment. Do not use Ecto schemas. Prefer this approach when you can. Your application's Ecto schemas will change over time, but your migration should not, therefore it's not a true snapshot of the data at the time.
-1. Write a small Ecto schema module inside the migration that only uses what you need. Then use that in your data migration. This is helpful if you prefer the Ecto API and decouples from your application's Ecto schemas as it evolves separately.
+
+2. Write a small Ecto schema module inside the migration that only uses what you need. Then use that in your data migration. This is helpful if you prefer the Ecto API, and it decouples the migration from your application's Ecto schemas, which evolve separately.
 
 For throttling, we can simply add a `Process.sleep(@throttle)` for each page.
 
@@ -81,9 +82,9 @@ For resiliency, we need to ensure that we handle errors without losing our progr
 Finally, to manage these data migrations separately, we need to:
 
 1. Store data migrations separately from your schema migrations.
-1. Run the data migrations manually.
+2. Run the data migrations manually.
 
-To achieve this, be inspired by [Ecto's documentation on creating a Release module](`Ecto.Migrator`), and extend your release module to allow options to pass into `Ecto.Migrator` that specifies the version to migrate and the data migrations' file path, for example:
+If you have `mix` available in production, you can use `mix ecto.migrate --migrations-path "priv/repo/data_migrations"`. However, most applications use releases in production, so you need to extend your release module (see [Ecto's documentation on creating a Release module](`Ecto.Migrator`)). The idea is to provide a `migrate_data` function that specifies the version to migrate and the data migrations' file path, for example:
 
 ```elixir
 defmodule MyApp.Release do
@@ -108,21 +109,15 @@ If the data can be queried with a condition that is removed after update then yo
 Here's how we can manage the backfill:
 
 1. Disable migration transactions.
-1. Use keyset pagination: Order the data, find rows greater than the last mutated row and limit by batch size.
-1. For each page, mutate the records.
-1. Check for failed updates and handle it appropriately.
-1. Use the last mutated record's ID as the starting point for the next page. This helps with resiliency and prevents looping on the same record over and over again.
-1. Arbitrarily sleep to throttle and prevent exhausting the database.
-1. Rinse and repeat until there are no more records
+2. Use keyset pagination: Order the data, find rows greater than the last mutated row and limit by batch size.
+3. For each page, mutate the records.
+4. Check for failed updates and handle them appropriately.
+5. Use the last mutated record's ID as the starting point for the next page. This helps with resiliency and prevents looping on the same record over and over again.
+6. Arbitrarily sleep to throttle and prevent exhausting the database.
+7. Rinse and repeat until there are no more records.
 
 For example:
 
-```bash
-mix ecto.gen.migration --migrations-path=priv/repo/data_migrations backfill_posts
-```
-
-And modify the migration:
-
 ```elixir
 defmodule MyApp.Repo.DataMigrations.BackfillPosts do
   use Ecto.Migration
@@ -184,6 +179,12 @@ defmodule MyApp.Repo.DataMigrations.BackfillPosts do
 end
 ```
 
+To test it in development/test environments:
+
+```bash
+mix ecto.migrate --migrations-path=priv/repo/data_migrations
+```
+
 ## Batching Arbitrary Data
 
 If the data being updated does not indicate it's already been updated, then we need to take a snapshot of the current data and store it temporarily. For example, if all rows should increment a column's value by 10, how would you know if a record was already updated? You could load a list of IDs into the application during the migration, but what if the process crashes? Instead we're going to keep the data we need in the database.
@@ -193,23 +194,26 @@ To do this, it works well if we can pick a specific point in time where all reco
 Here's how we'll manage the backfill:
 
 1. Create a "temporary" table. In this example, we're creating a real table that we'll drop at the end of the data migration. In Postgres, there are [actual temporary tables](https://www.postgresql.org/docs/12/sql-createtable.html) that are discarded after the session is over; we're not using those because we need resiliency in case the data migration encounters an error. The error would cause the session to be over, and therefore the temporary table tracking progress would be lost. Real tables don't have this problem. Likewise, we don't want to store IDs in application memory during the migration for the same reason.
-1. Populate that temporary table with IDs of records that need to update. This query only requires a read of the current records, so there are no consequential locks occurring when populating, but be aware this could be a lengthy query. Populating this table can occur at creation or afterwards; in this example we'll populate it at table creation.
-1. Ensure there's an index on the temporary table so it's fast to delete IDs from it. I use an index instead of a primary key because it's easier to re-run the migration in case there's an error. There isn't a straight-forward way to `CREATE IF NOT EXIST` on a primary key; but you can do that easily with an index.
-1. Use keyset pagination to pull batches of IDs from the temporary table. Do this inside a database transaction and lock records for updates. Each batch should read and update within milliseconds, so this should have little impact on concurrent reads and writes.
-1. For each batch of records, determine the data changes that need to happen. This can happen for each record.
-1. [Upsert](https://wiki.postgresql.org/wiki/UPSERT) those changes to the real table. This insert will include the ID of the record that already exists and a list of attributes to change for that record. Since these insertions will conflict with existing records, we'll instruct Postgres to replace certain fields on conflicts.
-1. Delete those IDs from the temporary table since they're updated on the real table. Close the database transaction for that batch.
-1. Throttle so we don't overwhelm the database, and also give opportunity to other concurrent processes to work.
-1. Rinse and repeat until the temporary table is empty.
-1. Finally, drop the temporary table when empty.
 
-Let's see how this can work:
+2. Populate that temporary table with the IDs of records that need to be updated. This query only requires a read of the current records, so no consequential locks are taken while populating, but be aware this could be a lengthy query. Populating this table can occur at creation or afterwards; in this example we'll populate it at table creation.
 
-```bash
-mix ecto.gen.migration --migrations-path=priv/repo/data_migrations backfill_weather
-```
+3. Ensure there's an index on the temporary table so it's fast to delete IDs from it. I use an index instead of a primary key because it's easier to re-run the migration in case there's an error. There isn't a straightforward way to use `IF NOT EXISTS` when adding a primary key, but you can do that easily with an index.
+
+4. Use keyset pagination to pull batches of IDs from the temporary table. Do this inside a database transaction and lock records for updates. Each batch should read and update within milliseconds, so this should have little impact on concurrent reads and writes.
+
+5. For each batch of records, determine the data changes that need to happen. This can happen for each record.
+
+6. [Upsert](https://wiki.postgresql.org/wiki/UPSERT) those changes to the real table. This insert will include the ID of the record that already exists and a list of attributes to change for that record. Since these insertions will conflict with existing records, we'll instruct Postgres to replace certain fields on conflicts.
+
+7. Delete those IDs from the temporary table since they're updated on the real table. Close the database transaction for that batch.
+
+8. Throttle so we don't overwhelm the database, and to give other concurrent processes an opportunity to work.
+
+9. Rinse and repeat until the temporary table is empty.
 
-Modify the migration:
+10. Finally, drop the temporary table when empty.
+
+Let's see how this can work:
 
 ```elixir
 # Both of these modules are in the same migration file
@@ -355,6 +359,12 @@ defmodule MyApp.Repo.DataMigrations.BackfillWeather do
 end
 ```
 
+And to test in development/test environments:
+
+```bash
+mix ecto.migrate --migrations-path=priv/repo/data_migrations
+```
+
 ---
 
-This guide was originally published on [Fly.io Phoenix Files](https://fly.io/phoenix-files/backfilling-data/).
+This guide was originally published on [Fly.io Phoenix Files](https://fly.io/phoenix-files/backfilling-data/). See [Automatic and Manual Ecto Migrations by Wojtek Mach](https://dashbit.co/blog/automatic-and-manual-ecto-migrations) for more examples on running Ecto migrations.
diff --git a/guides/safe_migrations.md b/guides/safe_migrations.md
@@ -27,7 +27,7 @@ milliseconds which could be acceptable for you. However, once your table has
 100+ million records, the difference becomes seconds which is more likely to be 
 felt and cause timeouts. Therefore, err on the side of safety, but 
 **always benchmark for your own database**. Also consider the hardware the
-database is running; eg a Raspberry Pi 2B on a microSD will run much slower.
+database is running on: for example, a Raspberry Pi 2B on a microSD will run much slower.
 
 ## Table of Contents
 
@@ -48,11 +48,6 @@ database is running; eg a Raspberry Pi 2B on a microSD will run much slower.
 - [Adding a PostgreSQL extension](#adding-a-postgresql-extension)
 - [Squashing migrations](#squashing-migrations)
 
-Read more about safe migration techniques:
-
-- [Migration locks in "Anatomy of a Migration"](migration_anatomy.html)
-- [How to backfill data and change data in bulk (aka: DML)](backfilling_data.html)
-
 ## Adding an index
 
 Creating an index will [block writes](https://www.postgresql.org/docs/8.2/sql-createindex.html) to the table in Postgres.
@@ -801,76 +796,6 @@ end
 >
 > Creating extensions typically requires superuser privileges. In managed database services (AWS RDS, Heroku), some extensions may not be available.
 
-## Squashing Migrations
-
-If you have a long list of migrations, sometimes it can take a while to migrate
-each of those files every time the project is reset or spun up by a new
-developer. Thankfully, Ecto comes with mix tasks to `dump` and `load` a database
-structure which will represent the state of the database up to a certain point
-in time, not including content.
-
-- `mix ecto.dump`
-- `mix ecto.load`
-
-Schema dumping and loading is only supported by external binaries `pg_dump` and
-`mysqldump`, which are used by the Postgres, MyXQL, and MySQL Ecto adapters (not
-supported in MSSQL adapter).
-
-For example:
-
-```
-20210101000000 - First Migration
-20210201000000 - Second Migration
-20210701000000 - Third Migration <-- we are here now. run `mix ecto.dump`
-```
-
-We can "squash" the migrations up to the current day which will effectively
-fast-forward migrations to that structure. The Ecto Migrator will detect that
-the database is already migrated to the third migration, and so it begins there
-and migrates forward.
-
-Let's add a new migration:
-
-```
-20210101000000 - First Migration
-20210201000000 - Second Migration
-20210701000000 - Third Migration <-- `structure.sql` represents up to here
-20210801000000 - New Migration <-- This is where migrations will begin
-```
-
-The new migration will still run, but the first-through-third migrations will
-not need to be run since the structure already represents the changes applied by
-those migrations. At this point, you can safely delete the first, second, and
-third migration files or keep them for historical auditing.
-
-Let's make this work:
-
-1. Run `mix ecto.dump` which will dump the current structure into
-   `priv/repo/structure.sql` by default. Check `mix help ecto.dump` for more
-   options.
-2. During project setup with an empty database, run `mix ecto.load` to load
-   `structure.sql`.
-3. Run `mix ecto.migrate` to run any additional migrations created after the
-   structure was dumped.
-
-To simplify these actions into one command, we can leverage mix aliases:
-
-```elixir
-# mix.exs
-
-defp aliases do
-  [
-    "ecto.reset": ["ecto.drop", "ecto.setup"],
-    "ecto.setup": ["ecto.load", "ecto.migrate"],
-    # ...
-  ]
-end
-```
-
-Now you can run `mix ecto.setup` and it will load the database structure and run
-remaining migrations. Or, run `mix ecto.reset` and it will drop and run setup.
-Of course, you can continue running `mix ecto.migrate` as you create them.
-
 ## Credits
 
 Created and written by David Bernheisel with recipes heavily inspired from Andrew Kane and his library [strong_migrations](https://github.com/ankane/strong_migrations).
@@ -879,8 +804,5 @@ Created and written by David Bernheisel with recipes heavily inspired from Andre
 - [Strong Migrations by Andrew Kane](https://github.com/ankane/strong_migrations)
 - [Adding a NOT NULL CONSTRAINT on PG Faster with Minimal Locking by Christophe Escobar](https://medium.com/doctolib/adding-a-not-null-constraint-on-pg-faster-with-minimal-locking-38b2c00c4d1c)
 - [Postgres Runtime Configuration](https://www.postgresql.org/docs/current/runtime-config-client.html)
-- [Automatic and Manual Ecto Migrations by Wojtek Mach](https://dashbit.co/blog/automatic-and-manual-ecto-migrations)
 
 Special thanks for sponsorship: Fly.io
-
-Special thanks for the reviewers.
diff --git a/mix.exs b/mix.exs
@@ -191,14 +191,16 @@ defmodule EctoSQL.MixProject do
       source_url: @source_url,
       extras: [
         "CHANGELOG.md",
-        "guides/safe_migrations.md",
         "guides/migration_anatomy.md",
+        "guides/safe_migrations.md",
+        "guides/squashing_migrations.md",
         "guides/backfilling_data.md"
       ],
       groups_for_extras: [
-        Guides: [
-          "guides/safe_migrations.md",
+        "Migration Guides": [
           "guides/migration_anatomy.md",
+          "guides/safe_migrations.md",
+          "guides/squashing_migrations.md",
           "guides/backfilling_data.md"
         ]
       ],