Skip to content

#50 implement DRY principle#51

Merged
Asifdotexe merged 8 commits into
mainfrom
50-code-optimisations
Jun 20, 2026
Merged

#50 implement DRY principle#51
Asifdotexe merged 8 commits into
mainfrom
50-code-optimisations

Conversation

@Asifdotexe

@Asifdotexe Asifdotexe commented Jun 16, 2026

Copy link
Copy Markdown
Owner

Summary by CodeRabbit

  • New Features

    • Added fossil update auto-detection: performs full data backfill if genesis is missing; otherwise runs incremental refresh.
    • Added manual GitHub Actions workflow trigger for fossil updates.
  • Refactor

    • Reorganized pipeline workflow to use artifact-based data distribution.
    • Simplified internal path resolution and helper utilities.
    • Removed unused repository cloning logic.
  • Tests

    • Data integrity tests now skip gracefully when no processed data is available.
  • Chores

    • Updated workflow action versions.
    • Added temporary directories to gitignore.

@Asifdotexe Asifdotexe self-assigned this Jun 16, 2026
@Asifdotexe Asifdotexe linked an issue Jun 16, 2026 that may be closed by this pull request
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@Asifdotexe, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 40 minutes and 39 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: de8bfe7a-154f-46ed-aebb-10c86d9780db

📥 Commits

Reviewing files that changed from the base of the PR and between ab7fee2 and c683e1f.

📒 Files selected for processing (5)
  • .github/workflows/theseus-engine.yml
  • .github/workflows/unit-tests.yml
  • .github/workflows/update-fossils.yml
  • scripts/_blame.py
  • scripts/workflow.py
📝 Walkthrough

Walkthrough

The PR removes the _path_guard sys.path injection pattern from all Python scripts, migrates intra-package imports to scripts.* namespaced paths, simplifies _blame.py and _utils.py logic, makes ensure_repo_ready public in analyse_repository.py, adds an auto_update_fossils dispatcher in add_fossils.py, reworks theseus-engine.yml to an artifact-based pipeline, adds a new update-fossils.yml workflow, and removes dead code from app.js.

Changes

app.js Dead Code Removal

Layer / File(s) Summary
renderMilestoneMarkers, updateInsights, and createSVGElement cleanup
app.js
Removes stray const tooltip = this; and adds proper repoInfo lookup in renderMilestoneMarkers; drops unused first local in updateInsights; deletes the createSVGElement helper method from TheseusVisualizer.

Python Pipeline Refactoring and Orchestration

Layer / File(s) Summary
_path_guard removal and scripts.* import migration
scripts/_path_guard.py, scripts/__init__.py, scripts/_blame.py, scripts/add_fossils.py, scripts/analyse_repository.py, scripts/cleanup_data.py, scripts/workflow.py, scripts/run_pipeline.py, scripts/_utils.py
Empties _path_guard.py and removes sys.path injection from __init__.py; drops all import _path_guard side-effect imports; switches all intra-package imports to scripts.* namespaced paths; updates schema doc references to point to _data_io.py.
_blame.py logic simplification and _utils.py remove_path rewrite
scripts/_blame.py, scripts/_utils.py
Rewrites _is_hash() to a compact len+strip check; simplifies fossil commit derivation to a direct string slice; removes blame_single_file_year_counts and blame_year_counts; changes BlameRunner.__init__ max_workers to int | None; replaces remove_path with a single shutil.rmtree(path, ignore_errors=True) call.
analyse_repository.py: ensure_repo_ready public, max_workers: None
scripts/analyse_repository.py
Removes local clone_repository; changes max_workers parameter type to int | None in three blame/guard helpers; sets max_workers = None unconditionally in analyze_single_snapshot; replaces private _ensure_repo_ready with a public ensure_repo_ready that clones-or-fetches and checks out the default branch; updates process_repository to call the new function.
auto_update_fossils dispatcher and workflow.py cleanup
scripts/add_fossils.py, scripts/workflow.py
Adds _check_genesis and auto_update_fossils to add_fossils.py: per-repo dispatcher that runs backfill_fossils when genesis is missing or update_survivor_fossils when it exists; main() calls auto_update_fossils by default. Removes discover-repos CLI subcommand from workflow.py.
CI workflow rework and new update-fossils workflow
.github/workflows/theseus-engine.yml, .github/workflows/update-fossils.yml, .github/workflows/unit-tests.yml, .gitignore
theseus-engine.yml: discover-repos uses jq over theseus.config.json; analyze job uploads per-repo artifacts instead of pushing to shared branch; create-pr job downloads all artifacts then always commits and force-pushes. New update-fossils.yml: manual workflow that runs add_fossils.py + cleanup_data.py and force-pushes to chore/fossil-update. unit-tests.yml unpins action versions from commit SHAs. Adds temp_fossil_repos/ to .gitignore.
test_data_integrity graceful skip
tests/test_data_integrity.py
Adds pytest import; replaces hard assertion failure with pytest.skip(...) when no JSON data files are found.

Sequence Diagram(s)

sequenceDiagram
    participant GHA as GitHub Actions
    participant discover as discover-repos job
    participant analyze as analyze job (matrix)
    participant artifacts as GitHub Artifacts
    participant create_pr as create-pr job
    participant branch as chore/monthly-data-update

    GHA->>discover: trigger theseus-engine workflow
    discover->>discover: jq theseus.config.json → repos matrix
    discover->>analyze: pass repos matrix
    loop per repo
        analyze->>analyze: poetry run python -m scripts.run_pipeline
        analyze->>analyze: write data/.status/<repo>.json
        analyze->>artifacts: upload data-<repo> artifact (1-day retention)
    end
    artifacts->>create_pr: download all data-* artifacts
    create_pr->>create_pr: workflow.py build-pr-body
    create_pr->>create_pr: workflow.py validate-graph-files
    create_pr->>branch: commit data/ and force-push
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Asifdotexe/Theseus#9: The createSVGElement helper removed in this PR was originally introduced in that PR, making them directly code-level related.
  • Asifdotexe/Theseus#43: Both PRs modify .github/workflows/theseus-engine.yml's CI pipeline structure, specifically how the analyze job passes data and how create-pr assembles it.
  • Asifdotexe/Theseus#46: This PR removes the _path_guard/sys.path bootstrapping scaffolding and the scripts/workflow.py discover-repos subcommand that PR #46 introduced.

Suggested labels

enhancement

🐇 Hops through the code with glee,
_path_guard no more — imports run free!
Dead code swept clean from app.js bright,
Artifacts carry data through the night.
auto_update_fossils picks the right way,
A tidy pipeline greets the new day! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title "#50 implement DRY principle" is vague and generic, using non-descriptive language that does not convey the specific changes made in the changeset. Provide a more specific title that describes the actual changes, such as: "Remove sys.path manipulation and refactor imports to use package-relative paths" or "Simplify helper functions and restructure fossil update logic."
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 50-code-optimisations

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread scripts/_repo.py Outdated
logger = logging.getLogger(__name__)


def clone_repository(repo_slug: str, clone_dir: str) -> None:

@Asifdotexe Asifdotexe Jun 16, 2026

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add docstrings

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/_utils.py`:
- Around line 199-205: The remove_path function uses Path(path) without
importing it from the pathlib module, causing a NameError at runtime. Add an
import statement at the top of scripts/_utils.py to import Path from pathlib,
ensuring it is available when the remove_path function executes.

In `@scripts/analyse_repository.py`:
- Around line 35-38: The bare internal imports (from _blame, from _data_io, from
_repo, and from _utils) are no longer safe after removing _path_guard and will
cause import failures when the module is loaded through scripts.run_pipeline.
Update all these import statements to use package-qualified paths by prefixing
each module name with "scripts." (for example, change "from _blame import" to
"from scripts._blame import", and similarly for _data_io, _repo, and _utils
modules to ensure consistent and safe imports).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: be6d1ec2-7eed-43cc-9aaf-db580843e1b4

📥 Commits

Reviewing files that changed from the base of the PR and between 07f6681 and cdd499d.

📒 Files selected for processing (16)
  • app.js
  • data/raw/claude-code_data.json
  • data/raw/langchain_data.json
  • data/raw/numpy_data.json
  • data/raw/react_data.json
  • scripts/__init__.py
  • scripts/_blame.py
  • scripts/_path_guard.py
  • scripts/_repo.py
  • scripts/_utils.py
  • scripts/add_fossils.py
  • scripts/analyse_repository.py
  • scripts/cleanup_data.py
  • scripts/run_pipeline.py
  • scripts/workflow.py
  • tests/test_data_integrity.py
💤 Files with no reviewable changes (5)
  • scripts/_path_guard.py
  • scripts/init.py
  • scripts/workflow.py
  • scripts/cleanup_data.py
  • app.js

Comment thread scripts/_utils.py Outdated
Comment thread scripts/analyse_repository.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/update-fossils.yml:
- Around line 50-53: The git rm -rf --cached . command on line 52 in the else
block stages the deletion of all tracked files in the repository when creating
the chore/fossil-update branch, which would result in a commit that deletes
application and workflow files. Replace this overly broad command with a more
targeted operation that only unstages the fossil data files that need to be
re-added (typically in the data/ directory), or remove this command entirely if
it is not necessary for the update workflow.
- Around line 16-21: The workflow uses mutable action version tags and exposes
the write token to intermediate steps. Pin actions/checkout and
actions/setup-python to their full commit SHAs instead of version tags. In the
checkout action, add persist-credentials: false to prevent the token from being
exposed to the pipx install, poetry install, and poetry run python
scripts/update_fossils.py steps. At the final push step (around line 68),
explicitly pass the token using env or with parameters instead of relying on the
persisted credential from checkout.

In `@scripts/update_fossils.py`:
- Around line 113-116: The generate_graph_data function call checks for errors
and logs them, but the script continues to exit successfully even when
graph_error is present. To prevent stale or incomplete processed graph data from
being committed, add a process exit with a non-zero exit code (such as
process.exit(1) or sys.exit(1) depending on the language context) immediately
after the logger.error call when graph_error is truthy, ensuring the script
fails when graph regeneration encounters errors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 82aea780-3925-40dc-bd51-32a4e21b3fb9

📥 Commits

Reviewing files that changed from the base of the PR and between cdd499d and b670436.

📒 Files selected for processing (3)
  • .github/workflows/update-fossils.yml
  • .gitignore
  • scripts/update_fossils.py
✅ Files skipped from review due to trivial changes (1)
  • .gitignore

Comment thread .github/workflows/update-fossils.yml Outdated
Comment thread .github/workflows/update-fossils.yml Outdated
Comment thread scripts/update_fossils.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/_blame.py (1)

202-206: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

TypeError when max_workers is None due to %d format specifier.

self.max_workers can now be None, but the log statement uses %d which requires an integer. This will raise TypeError: %d format: a real number is required, not NoneType.

🐛 Proposed fix
         logger.info(
-            "  Blaming %d changed files (%d workers)...",
+            "  Blaming %d changed files (%s workers)...",
             len(files),
-            self.max_workers,
+            self.max_workers if self.max_workers is not None else "default",
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/_blame.py` around lines 202 - 206, The logger.info call uses the %d
format specifier for self.max_workers which can now be None, causing a
TypeError. Replace self.max_workers in the format string with a conditional
expression that provides a sensible default or string representation when it is
None, or change the %d format specifier to %s which can handle both integer and
None values appropriately.
🧹 Nitpick comments (1)
.github/workflows/theseus-engine.yml (1)

58-66: Pin artifact actions to commit SHAs for supply-chain security.

actions/upload-artifact@v4 (line 59) and actions/download-artifact@v4 (line 79) use mutable version tags. Pinning to full commit SHAs prevents potential supply-chain attacks if the action maintainer's account is compromised.

Latest versions available: actions/upload-artifact@v7.0.1 (043fb46d1a93c77aae656e7c1c64a875d1fc6a0a) and actions/download-artifact@v8.0.1 (3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c).

Also applies to: 78-83

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/theseus-engine.yml around lines 58 - 66, Replace the
mutable version tags for GitHub Actions with pinned commit SHAs to improve
supply-chain security. Update the uses field for actions/upload-artifact from
`@v4` to the full commit SHA 043fb46d1a93c77aae656e7c1c64a875d1fc6a0a, and update
actions/download-artifact from `@v4` to the full commit SHA
3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c. The format should be
actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a and
actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/theseus-engine.yml:
- Around line 49-56: The matrix.repo value is being directly interpolated into
shell commands in the poetry run command and the echo statement, which creates a
template injection vulnerability if the repo name contains shell metacharacters.
Replace the direct interpolation of ${{ matrix.repo }} with an environment
variable by adding an env section to these steps that assigns matrix.repo to an
environment variable (e.g., REPO_NAME), then reference the environment variable
(e.g., $REPO_NAME) instead of the direct interpolation in both the poetry run
command on line 49 and the echo command on line 56.

In @.github/workflows/unit-tests.yml:
- Line 15: Replace the version tag pinning in the GitHub Actions workflow with
full commit SHAs to prevent supply-chain attacks. For the actions/checkout@v4
reference on line 15 and the actions/setup-python reference on line 20, visit
their respective GitHub releases pages (github.com/actions/checkout/releases and
github.com/actions/setup-python/releases), find the full commit SHA for the
version tag being used, and update each action reference from the tag format
(e.g., `@v4`) to the complete commit SHA format (e.g., `@abc123def456`...).

---

Outside diff comments:
In `@scripts/_blame.py`:
- Around line 202-206: The logger.info call uses the %d format specifier for
self.max_workers which can now be None, causing a TypeError. Replace
self.max_workers in the format string with a conditional expression that
provides a sensible default or string representation when it is None, or change
the %d format specifier to %s which can handle both integer and None values
appropriately.

---

Nitpick comments:
In @.github/workflows/theseus-engine.yml:
- Around line 58-66: Replace the mutable version tags for GitHub Actions with
pinned commit SHAs to improve supply-chain security. Update the uses field for
actions/upload-artifact from `@v4` to the full commit SHA
043fb46d1a93c77aae656e7c1c64a875d1fc6a0a, and update actions/download-artifact
from `@v4` to the full commit SHA 3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c. The
format should be
actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a and
actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e15bf646-86bf-434b-8f2a-f227530982de

📥 Commits

Reviewing files that changed from the base of the PR and between b670436 and ab7fee2.

📒 Files selected for processing (10)
  • .github/actions/setup-python-poetry/action.yml
  • .github/workflows/theseus-engine.yml
  • .github/workflows/unit-tests.yml
  • .github/workflows/update-fossils.yml
  • scripts/_blame.py
  • scripts/_utils.py
  • scripts/add_fossils.py
  • scripts/analyse_repository.py
  • scripts/cleanup_data.py
  • scripts/workflow.py
💤 Files with no reviewable changes (1)
  • .github/actions/setup-python-poetry/action.yml

Comment thread .github/workflows/theseus-engine.yml Outdated
Comment thread .github/workflows/unit-tests.yml Outdated
@Asifdotexe Asifdotexe merged commit b644f30 into main Jun 20, 2026
2 checks passed
@Asifdotexe Asifdotexe deleted the 50-code-optimisations branch June 20, 2026 06:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Code optimisations

1 participant