
build: rebalance unit test shards to reduce CI critical path (~29m --> ~23m) and use fewer workers (16 --> 10)#38287

Open
feanil wants to merge 8 commits into master from feanil/collect-test-timings


feanil commented Apr 4, 2026

Summary

Rebalances the unit test shard configuration to reduce the CI critical path from ~29 minutes down to ~23 minutes, measured across 3 consistent runs.

What changed

  • Replaced 16 uneven shards with 10 shards: nine balanced at ~19–23 min each, plus one small CMS shard (`cms-1`) at ~5–6 min
  • New shard layout: 5 LMS shards, 2 shared-with-LMS shards, 1 shared-with-CMS shard, and 2 CMS shards
    • `cms-1`: small CMS apps (api, cms_user_tasks, course_creators, envs, lib, etc.) — ~5–6 min
    • `cms-2`: `contentstore/` only — ~19–20 min (split out to avoid OOM on runners when combined with other CMS apps)
  • Also includes test isolation fixes, which are duplicated in #38347 ("fix: test isolation fixes for certificates and openedx-events test mixins") so they can be merged independently:
    • Moved `CourseFactory()` calls from `setUp` to `setUpClass` in three certificate test classes to avoid exhausting MongoDB connections across test methods
    • Updated to openedx-events 11.1.1, which adds `tearDownClass` to `OpenEdxEventsTestMixin`, preventing events from being left globally disabled between test classes

Timing data — 3 runs on the new config (all passed)

| Shard | Run 1 | Run 2 | Run 3 | Avg |
| --- | --- | --- | --- | --- |
| shared-with-cms-1 | 22m48s | 22m33s | 23m09s | ~22m50s |
| shared-with-lms-1 | 21m57s | 22m49s | 23m01s | ~22m36s |
| shared-with-lms-2 | 20m06s | 20m54s | 22m05s | ~21m02s |
| lms-4 | 22m28s | 21m07s | 21m27s | ~21m41s |
| lms-1 | 21m18s | 19m23s | 22m25s | ~21m02s |
| lms-5 | 19m50s | 21m11s | 19m22s | ~20m08s |
| cms-2 | 19m31s | 18m13s | 20m19s | ~19m21s |
| lms-2 | 19m16s | 19m19s | 18m43s | ~19m06s |
| lms-3 | 18m43s | 18m48s | 18m52s | ~18m48s |
| cms-1 | 5m40s | 5m27s | 5m32s | ~5m33s |
| Critical path | ~23m | ~23m | ~23m | ~23m |

For comparison, the old config's critical path (visible in the #38347 run on unmodified master) was ~29m on lms-4.
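
Since the shards run in parallel, the critical path of a run is simply the duration of its slowest shard. As a rough sketch (not part of this PR), the figures in the timing table can be rechecked like this; the helper names `to_seconds`, `fmt`, `critical_path`, and `average_seconds` are illustrative, and the durations are copied from the table.

```python
import re

# Shard timings for runs 1-3, copied from the timing table in this PR.
TIMINGS = {
    "shared-with-cms-1": ["22m48s", "22m33s", "23m09s"],
    "shared-with-lms-1": ["21m57s", "22m49s", "23m01s"],
    "shared-with-lms-2": ["20m06s", "20m54s", "22m05s"],
    "lms-4": ["22m28s", "21m07s", "21m27s"],
    "lms-1": ["21m18s", "19m23s", "22m25s"],
    "lms-5": ["19m50s", "21m11s", "19m22s"],
    "cms-2": ["19m31s", "18m13s", "20m19s"],
    "lms-2": ["19m16s", "19m19s", "18m43s"],
    "lms-3": ["18m43s", "18m48s", "18m52s"],
    "cms-1": ["5m40s", "5m27s", "5m32s"],
}


def to_seconds(text):
    """Convert a duration such as '22m48s' into whole seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def fmt(seconds):
    """Format whole seconds back into the 'XmYYs' style used in the table."""
    return f"{seconds // 60}m{seconds % 60:02d}s"


def critical_path(run_index):
    """Return (shard, seconds) for the slowest shard of one run (0-based)."""
    return max(
        ((shard, to_seconds(runs[run_index])) for shard, runs in TIMINGS.items()),
        key=lambda item: item[1],
    )


def average_seconds(shard):
    """Mean duration of one shard across all runs, rounded to whole seconds."""
    values = [to_seconds(text) for text in TIMINGS[shard]]
    return round(sum(values) / len(values))


for run in range(3):
    shard, seconds = critical_path(run)
    print(f"Run {run + 1}: slowest shard is {shard} at {fmt(seconds)}")
```

The slowest shard differs between runs, which is why the critical path is reported as a rounded ~23m rather than tied to one shard.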

@feanil feanil force-pushed the feanil/collect-test-timings branch 2 times, most recently from afef03c to c17633d Compare April 4, 2026 18:45

feanil commented Apr 6, 2026

The fix needs to be made upstream in openedx/openedx-events#559. Waiting for that to be merged and released before coming back to this PR.

@feanil feanil force-pushed the feanil/collect-test-timings branch 2 times, most recently from 676809c to a89e30c Compare April 10, 2026 21:11
@feanil feanil force-pushed the feanil/collect-test-timings branch from a89e30c to 19c5643 Compare April 11, 2026 02:20
feanil and others added 8 commits April 11, 2026 11:52
Run pytest with extra reporting enabled to generate files with per-test
durations. The file is uploaded as a CI artifact so timing data can be
downloaded and used to drive optimal shard rebalancing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
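
To illustrate how the per-test durations from the commit above could be turned into per-directory totals, here is a minimal sketch. It assumes the report file is JSON Lines with one object per line, where test reports carry `$report_type`, `nodeid`, `when`, and `duration` fields (my understanding of the pytest-reportlog output; verify against a real artifact). The sample records and app paths are made up, and `durations_by_prefix` is a hypothetical helper, not code from this PR.

```python
import io
import json
from collections import defaultdict

# Hand-written sample records standing in for a downloaded report log.
# The paths and durations are invented for illustration only.
SAMPLE_RECORDS = [
    {"$report_type": "SessionStart"},
    {"$report_type": "TestReport", "when": "setup", "duration": 0.5,
     "nodeid": "lms/djangoapps/example_app/tests/test_a.py::test_one"},
    {"$report_type": "TestReport", "when": "call", "duration": 2.0,
     "nodeid": "lms/djangoapps/example_app/tests/test_a.py::test_one"},
    {"$report_type": "TestReport", "when": "teardown", "duration": 0.25,
     "nodeid": "lms/djangoapps/example_app/tests/test_a.py::test_one"},
    {"$report_type": "TestReport", "when": "call", "duration": 1.5,
     "nodeid": "cms/djangoapps/other_app/tests/test_b.py::test_two"},
]
SAMPLE_LOG = "\n".join(json.dumps(record) for record in SAMPLE_RECORDS)


def durations_by_prefix(lines, depth=3):
    """Sum setup, call, and teardown time per directory prefix of `depth` parts."""
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip session-level and collection records; only tests carry timings.
        if record.get("$report_type") != "TestReport":
            continue
        file_path = record["nodeid"].split("::", 1)[0]
        prefix = "/".join(file_path.split("/")[:depth])
        totals[prefix] += record["duration"]
    return dict(totals)


totals = durations_by_prefix(io.StringIO(SAMPLE_LOG))
for prefix, seconds in sorted(totals.items()):
    print(f"{prefix}: {seconds:.2f}s")
```

Summing all three phases matters here because fixture-heavy tests can spend most of their time in setup rather than in the test body.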
Redistribute test paths across 9 shards (down from 16) using a greedy
bin-packing optimiser driven by real per-test timing data from
pytest-reportlog. Predicted critical path: ~18.7m (down from ~29m).

Key changes:
- Rename shard groups to reflect semantic meaning: lms-*, shared-with-lms-*,
  shared-with-cms-*, cms-* (openedx/common/xmodule paths explicitly separated
  from lms-only and cms-only paths)
- Split lms/djangoapps/discussion/ into its 4 subdirectories so the heavy
  rest_api/ shard (15.7m) can be distributed across bins independently
- Remove outdated comment referencing unit-tests-gh-hosted.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
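
For readers unfamiliar with greedy bin-packing, the sketch below shows one common variant: sort paths by duration, longest first, and always assign the next path to the currently lightest shard. This is an assumption about the general approach, not the optimiser actually used for this PR. Only the 15.7m figure for `rest_api/` comes from the commit message; the `example/...` paths and their durations are invented.

```python
import heapq


def pack_greedy(durations, shard_count):
    """Assign each path to the lightest shard so far, longest paths first.

    Returns (shards, loads): the paths per shard and the total minutes per shard.
    """
    # Min-heap of (current load in minutes, shard index).
    heap = [(0.0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]

    longest_first = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    for path, minutes in longest_first:
        load, index = heapq.heappop(heap)
        shards[index].append(path)
        heapq.heappush(heap, (load + minutes, index))

    loads = [0.0] * shard_count
    for load, index in heap:
        loads[index] = load
    return shards, loads


# Illustrative input: only the rest_api/ duration is taken from this PR.
EXAMPLE_MINUTES = {
    "lms/djangoapps/discussion/rest_api/": 15.7,
    "example/path_a/": 10.0,
    "example/path_b/": 9.0,
    "example/path_c/": 6.0,
    "example/path_d/": 5.0,
    "example/path_e/": 2.0,
}

shards, loads = pack_greedy(EXAMPLE_MINUTES, shard_count=3)
for index, (paths, load) in enumerate(zip(shards, loads), start=1):
    print(f"shard {index}: {load:.1f}m {paths}")
print(f"predicted critical path: {max(loads):.1f}m")
```

The heuristic cannot split a single path, so the largest item sets a floor on the critical path. That is why splitting `discussion/` into its subdirectories gives the packer more room to balance.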
…ubclasses

Three test classes in the certificates app were calling CourseFactory() in
setUp() despite extending SharedModuleStoreTestCase. Unlike ModuleStoreTestCase,
SharedModuleStoreTestCase shares a single modulestore across all tests in the
class and only closes MongoDB connections at tearDownClass. Calling
CourseFactory() in setUp() created a new MongoDB course (and opened connections)
for every test method without releasing them, causing connection accumulation
across the full test run.

Affected classes:
- CertificateFiltersTest (test_filters.py)
- CertificateInvalidationTest (test_models.py)
- CertificateAllowlistTest (test_models.py)

In each case the course is only read by test methods (test data such as users,
enrollments and certificates is written via Django ORM and rolled back between
tests), so sharing a single course across the class is correct.

See: https://github.com/openedx/openedx-platform/blob/master/xmodule/modulestore/tests/django_utils.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
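
The difference between the two fixture placements can be seen without Django or MongoDB. The sketch below uses plain `unittest` and a hypothetical `FakeCourseFactory` that only counts how many times it is called; it stands in for `CourseFactory` and does not reproduce the modulestore behaviour itself.

```python
import io
import unittest


class FakeCourseFactory:
    """Hypothetical stand-in for CourseFactory that counts creations."""

    created = 0

    @classmethod
    def create(cls):
        cls.created += 1
        return {"id": cls.created}


class PerTestCourse(unittest.TestCase):
    """Old pattern: a new course is created before every test method."""

    def setUp(self):
        self.course = FakeCourseFactory.create()

    def test_a(self):
        self.assertIn("id", self.course)

    def test_b(self):
        self.assertIn("id", self.course)

    def test_c(self):
        self.assertIn("id", self.course)


class SharedCourse(unittest.TestCase):
    """New pattern: one course is created once and only read by the tests."""

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.course = FakeCourseFactory.create()

    def test_a(self):
        self.assertIn("id", self.course)

    def test_b(self):
        self.assertIn("id", self.course)

    def test_c(self):
        self.assertIn("id", self.course)


def count_creations(test_class):
    """Run one test class quietly and report how many courses it created."""
    FakeCourseFactory.created = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_class)
    unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return FakeCourseFactory.created


print("setUp:", count_creations(PerTestCourse))
print("setUpClass:", count_creations(SharedCourse))
```

Sharing is only safe because, as noted above, the test methods read the course and never modify it.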
…vents.testing

openedx_events/tests/utils.py was moved to openedx_events/testing.py in
openedx/openedx-events#559 so the test utilities are included in the
installed package (setup.py excludes the tests/ subpackage from the wheel).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When OpenEdxEventsTestMixin was listed after a TestCase subclass (e.g.
Foo(SharedModuleStoreTestCase, OpenEdxEventsTestMixin)), it landed after
unittest.case.TestCase in the MRO. Since unittest.case.TestCase.setUpClass
and tearDownClass do not call super(), the mixin's lifecycle methods never
ran. The workaround was to manually call cls.start_events_isolation() in
each class's setUpClass, but there was no corresponding tearDownClass to
restore event state, causing events disabled by one test class to leak into
subsequent classes in the same process.

Fix by placing OpenEdxEventsTestMixin first in the base class list so it
appears before unittest.case.TestCase in the MRO. This lets setUpClass and
tearDownClass run automatically through the cooperative super() chain,
removing the need for manual start_events_isolation() calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
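
The MRO behaviour described above can be reproduced with plain `unittest`. In this sketch `EventsMixin` is a hypothetical stand-in for `OpenEdxEventsTestMixin` that just records when its class-level hooks run; the real mixin's internals are not shown here.

```python
import io
import unittest

CALLS = []


class EventsMixin:
    """Hypothetical stand-in for OpenEdxEventsTestMixin."""

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        CALLS.append("setUpClass")

    @classmethod
    def tearDownClass(cls):
        CALLS.append("tearDownClass")
        super().tearDownClass()


class MixinLast(unittest.TestCase, EventsMixin):
    """Mixin after TestCase: TestCase.setUpClass wins and never calls super()."""

    def test_noop(self):
        pass


class MixinFirst(EventsMixin, unittest.TestCase):
    """Mixin before TestCase: the mixin hooks run, then delegate to TestCase."""

    def test_noop(self):
        pass


def hooks_run_for(test_class):
    """Run one test class quietly and return the mixin hooks that fired."""
    CALLS.clear()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_class)
    unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return list(CALLS)


print("mixin last: ", hooks_run_for(MixinLast))
print("mixin first:", hooks_run_for(MixinFirst))
print([klass.__name__ for klass in MixinFirst.__mro__])
```

With the mixin listed last, neither hook fires, which matches the leak described in the commit message: isolation had to be started by hand and nothing restored it afterwards.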
This mixin is already included via one of the other mixins on this test
class so including it again was messing with the MRO for the test
classes.
contentstore/ is large enough that the cms-1 runner was being killed
mid-run in CI (OOM or runner-level timeout). Splitting it into its own
shard keeps each job under the ~20-25 min target.

No changes needed to gha_unit_tests_collector.py — it already classifies
any shard whose first path starts with "cms/" as a CMS shard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@feanil feanil force-pushed the feanil/collect-test-timings branch from 19c5643 to 1422307 Compare April 11, 2026 15:53
@feanil feanil changed the title from "build: collect per-test timing data via --report-log" to "build: rebalance unit test shards to reduce CI critical path (~29m → ~23m)" Apr 11, 2026
@feanil feanil marked this pull request as ready for review April 11, 2026 17:07
@feanil feanil requested review from a team, farhan, irtazaakram and salman2013 as code owners April 11, 2026 17:07
@feanil feanil requested a review from a team April 11, 2026 17:07
@feanil feanil changed the title from "build: rebalance unit test shards to reduce CI critical path (~29m → ~23m)" to "build: rebalance unit test shards to reduce CI critical path (~29m → ~23m) and use fewer workers (16 -> 10)" Apr 11, 2026
@feanil feanil changed the title from "build: rebalance unit test shards to reduce CI critical path (~29m → ~23m) and use fewer workers (16 -> 10)" to "build: rebalance unit test shards to reduce CI critical path (~29m --> ~23m) and use fewer workers (16 --> 10)" Apr 11, 2026