Attempt to make service fetching more efficient (using asyncio) #228

Merged
gilesknap merged 33 commits into Implement-k8s-service-labels from more-efficient-fetching
Mar 27, 2026

@OCopping
Contributor

No description provided.

OCopping requested a review from gilesknap January 29, 2026 09:15
OCopping changed the base branch from Implement-k8s-service-labels to update-copier-template January 29, 2026 09:17
OCopping changed the base branch from update-copier-template to Implement-k8s-service-labels January 29, 2026 09:18
OCopping force-pushed the more-efficient-fetching branch from fe01f5b to 8524908 January 29, 2026 09:24
@gilesknap
Member

Sorry I missed this. Remind me to take a look at it during the epics-containers sprint next week.

OCopping force-pushed the more-efficient-fetching branch from eb0f146 to 4f48a10 March 24, 2026 13:26
@gilesknap
Member

@OCopping in testing I'm not seeing any improvement on the time to run ec ps.

Let me know when you are around and I'll drop by for a demo.

Member

gilesknap left a comment

Review

Sorry to use Claude - but he is way better at reading other people's code than me! Claude agrees with our analysis that the subprocess calls are not concurrent. Look out for point 3 too. I think these points are generally pretty good.

Overview

This PR attempts to speed up service fetching by running argocd app manifests calls concurrently via asyncio. It also improves the TUI monitor with a loading indicator, batch updates, and selective cell updates.

Issues

1. asyncio provides no actual concurrency here (Critical)

The _extract_app_manifests method is async but calls shell.run_command() — a synchronous blocking call. asyncio.TaskGroup only provides concurrency for await-based I/O. Since nothing is awaited in the hot path, all tasks run sequentially on the event loop, giving zero speedup. This likely explains the observation that ec ps shows no improvement.

To get actual concurrency, you'd need either:

  • await asyncio.to_thread(shell.run_command, ...) to run each shell call in a thread pool
  • Or skip asyncio entirely and use concurrent.futures.ThreadPoolExecutor directly
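
A minimal sketch of the first option, assuming a blocking function with roughly the shape of shell.run_command. The run_command below is a hypothetical stand-in that just sleeps, and asyncio.gather is used in place of the PR's TaskGroup to keep the example short; the same asyncio.to_thread call works inside a TaskGroup.

```python
import asyncio
import time


def run_command(command: str) -> str:
    """Hypothetical blocking stand-in for shell.run_command."""
    time.sleep(0.2)  # simulates waiting on an external process
    return f"output of {command}"


async def fetch_blocking(commands: list[str]) -> list[str]:
    # Nothing is awaited here, so each call blocks the event loop in turn.
    return [run_command(command) for command in commands]


async def fetch_in_threads(commands: list[str]) -> list[str]:
    # Each blocking call runs in the default thread pool; gather keeps the
    # results in the same order as the inputs.
    return await asyncio.gather(
        *(asyncio.to_thread(run_command, command) for command in commands)
    )


if __name__ == "__main__":
    apps = [f"app-{i}" for i in range(5)]
    for fetch in (fetch_blocking, fetch_in_threads):
        start = time.perf_counter()
        asyncio.run(fetch(apps))
        print(f"{fetch.__name__}: {time.perf_counter() - start:.2f}s")
```

With five commands the blocking version should take about five times as long as the threaded one, which is the difference ec ps is currently missing.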

2. _get_services_df in monitor.py doesn't actually await (Bug)

In monitor.py, _get_services_df is now async but the body just calls self.commands._get_services_df(running_only) synchronously — no await. The return value is a plain DataFrame, not a coroutine. The async/await wrappers are decorative here.

3. Shared mutable state without proper protection (Bug)

ArgoCommands stores results on self.services_df and self.app_dicts as instance attributes, and _extract_app_manifests mutates self.services_df via .extend(). The self.async_lock is declared but never used in _extract_app_manifests. If this were truly concurrent, multiple tasks would race on self.services_df. Also, services_df is never reset between calls, so repeated invocations would accumulate duplicate rows.
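
One way to sidestep both the race and the accumulation is to have each task return its rows and build a fresh result on every call. This is a sketch under assumptions: extract_app_manifest, get_service_rows and the row fields are hypothetical names, not the PR's code.

```python
import asyncio


async def extract_app_manifest(app_name: str) -> list[dict[str, str]]:
    """Hypothetical per-app fetch that returns rows instead of mutating self."""
    await asyncio.sleep(0)  # placeholder for the real awaited I/O
    return [{"app": app_name, "service": f"{app_name}-ioc"}]


async def get_service_rows(app_names: list[str]) -> list[dict[str, str]]:
    # The result is rebuilt on every call, so repeated polls cannot accumulate
    # rows, and no lock is needed because the tasks share no mutable state.
    per_app = await asyncio.gather(
        *(extract_app_manifest(name) for name in app_names)
    )
    return [row for rows in per_app for row in rows]


if __name__ == "__main__":
    for _ in range(2):
        print(len(asyncio.run(get_service_rows(["bl01t", "bl02t"]))))
```

If the results do have to live on the instance, assigning the finished list once at the end of the poll keeps the same property.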

4. _get_services_df event loop detection is fragile (Design)

try:
    asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        future = pool.submit(asyncio.run, self._get_service_data())
        future.result()
except RuntimeError:
    asyncio.run(self._get_service_data())

Spawning a new event loop inside a thread pool to work around an existing event loop is a code smell. This pattern can deadlock or cause subtle bugs with Textual's own event loop. Consider restructuring so the async boundary is cleaner.
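
A possible shape for a cleaner boundary, sketched with a hypothetical ServiceFetcher class: the async method is the single implementation, sync callers with no running loop go through one asyncio.run wrapper, and callers already on an event loop await the coroutine directly. No loop detection is needed.

```python
import asyncio


class ServiceFetcher:
    """Hypothetical class for illustration; not the PR's ArgoCommands."""

    async def get_service_data(self) -> list[str]:
        await asyncio.sleep(0)  # placeholder for the real awaited I/O
        return ["service-a", "service-b"]

    def get_service_data_sync(self) -> list[str]:
        # Only for callers with no running loop, such as a CLI command.
        return asyncio.run(self.get_service_data())


async def tui_poll(fetcher: ServiceFetcher) -> list[str]:
    # A caller that already runs on an event loop awaits the coroutine directly.
    return await fetcher.get_service_data()


if __name__ == "__main__":
    fetcher = ServiceFetcher()
    print(fetcher.get_service_data_sync())
    print(asyncio.run(tui_poll(fetcher)))
```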

5. self.services_df not reset between polls

Each call to _extract_app_manifests appends to self.services_df. On the second poll cycle, _get_service_data runs again but self.services_df still has data from the first cycle, leading to duplicate rows.

6. _check_service no longer uses the same code path as _ps

The base class _check_service calls _get_services_df, but ArgoCommands._check_service now calls _get_services() and reads self.app_dicts directly. This divergence means _check_service and _ps could give inconsistent results, and the base class override is easy to miss.

7. Minor: _get_services on base class is not abstract

commands.py adds _get_services as a plain method raising NotImplementedError (not @abstractmethod), while _get_services_df keeps @abstractmethod. This inconsistency means subclasses aren't forced to implement _get_services.
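
For illustration, marking both methods with @abstractmethod makes the omission fail at instantiation rather than at call time. The class names and return types here are hypothetical; only the two method names come from the PR.

```python
from abc import ABC, abstractmethod


class BaseCommands(ABC):
    """Hypothetical base class mirroring the two methods named in the review."""

    @abstractmethod
    def _get_services(self) -> list[str]:
        """Return the names of the available services."""

    @abstractmethod
    def _get_services_df(self, running_only: bool) -> list[dict]:
        """Return one row per service."""


class IncompleteCommands(BaseCommands):
    # _get_services is missing, so this class cannot be instantiated.
    def _get_services_df(self, running_only: bool) -> list[dict]:
        return []


class CompleteCommands(IncompleteCommands):
    def _get_services(self) -> list[str]:
        return ["service-a"]


if __name__ == "__main__":
    try:
        IncompleteCommands()
    except TypeError as error:
        print(f"rejected: {error}")
    print(CompleteCommands()._get_services())
```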

Positives

  • Loading indicator in the TUI is a good UX improvement — the app no longer hangs on startup.
  • batch_update() and selective cell updates (if str(current) != str(cell["contents"])) in populate_table are solid optimizations that reduce unnecessary redraws.
  • Caching Color.parse("white") as a module-level constant avoids repeated parsing.
  • Separating service list fetching (_get_services) from manifest extraction is a good structural direction.
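
The selective update idea can be shown without Textual. In this sketch the table is modelled as a plain dict keyed by (row, column), which is an assumption for illustration and not how populate_table stores its cells; only cells whose string form changed are returned for redrawing.

```python
Cell = tuple[str, str]  # (row key, column key); a hypothetical cell address


def changed_cells(current: dict[Cell, object], new: dict[Cell, object]) -> dict[Cell, object]:
    """Return only the cells that are new or whose displayed text differs."""
    return {
        cell: contents
        for cell, contents in new.items()
        if cell not in current or str(current[cell]) != str(contents)
    }


if __name__ == "__main__":
    shown = {("ioc-1", "cpu"): "0.1", ("ioc-1", "state"): "Running"}
    latest = {("ioc-1", "cpu"): "0.3", ("ioc-1", "state"): "Running"}
    print(changed_cells(shown, latest))
```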

Recommendation

The core async approach needs rework — the blocking shell.run_command calls need to be run via asyncio.to_thread() or a thread pool to achieve actual parallelism. I'd suggest addressing the concurrency issues before merging, since without them this adds complexity for no performance gain.

@gilesknap
Member

ec ps is now super fast but this has introduced a few issues:

Updated PR 228 Review

Good progress: the core issue from the first review (blocking shell.run_command) is addressed by converting it to use asyncio.create_subprocess_shell. However, the async conversion is incomplete, introducing several broken code paths.

Issues

  1. CLI callers never await async methods (Critical, broken at runtime)

cli.py is unchanged but now calls methods that are async:
backend.commands.delete(service_name) # line 85 — returns unawaited coroutine
backend.commands.deploy(...) # line 132
backend.commands.log_history(service_name) # line 225
backend.commands.restart(service_name) # line 282
backend.commands.start(...) # line 297
backend.commands.stop(...) # line 315
These will silently do nothing — the coroutine is created but never executed. Every user-facing command except ps and logs is broken.
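
A small demonstration of the failure mode and one possible fix. The restart coroutine is a hypothetical stand-in for backend.commands.restart, and the cli_* functions are illustrative names rather than the real cli.py entry points.

```python
import asyncio
import inspect


async def restart(service_name: str) -> str:
    """Hypothetical async command standing in for backend.commands.restart."""
    await asyncio.sleep(0)
    return f"restarted {service_name}"


def cli_restart_unawaited(service_name: str) -> bool:
    # Calling an async function only creates a coroutine object; its body
    # has not run at this point.
    pending = restart(service_name)
    is_coroutine = inspect.iscoroutine(pending)
    pending.close()  # avoids the "never awaited" RuntimeWarning in this demo
    return is_coroutine


def cli_restart(service_name: str) -> str:
    # A sync CLI entry point has to drive the coroutine to completion itself.
    return asyncio.run(restart(service_name))


if __name__ == "__main__":
    print(cli_restart_unawaited("example-ioc"))
    print(cli_restart("example-ioc"))
```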

  2. get_patches() calls async shell.run_command without await (Critical, broken at runtime)

get_patches() (line 39) is still a regular function but shell.run_command is now async. app_resp will be a coroutine object, not a string. YAML.load() will then fail or produce nonsense. Since get_patches is called inside push_remove_key, that entire path is also broken.

  3. push_value and push_remove_key miss await on their first shell.run_command (Bug)

Both functions are async but their initial app_resp = shell.run_command(...) calls (lines ~87, ~107) are not awaited. Same issue — coroutine assigned to app_resp instead of the actual string result.

  4. do_retry wrapper breaks async functions (Critical)

do_retry wraps async functions (patch_value, push_value, push_remove_key) but calls them with cmd(*args, **kwargs), which returns a coroutine without awaiting it. The _do_retry wrapper is sync, so it can never properly execute the async function body. The retry logic is effectively dead, and the wrapped functions do nothing.
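
For comparison, a retry decorator that works with coroutines needs an async wrapper that awaits the call inside the try block. This is a sketch only: retry_async, its attempts and delay parameters, and the catch-all Exception are assumptions, not the signature or behaviour of the project's do_retry.

```python
import asyncio
import functools


def retry_async(attempts: int = 3, delay: float = 0.0):
    """Hypothetical async-aware retry decorator (not the project's do_retry)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(cmd):
        @functools.wraps(cmd)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    # Awaiting here is what actually runs the wrapped coroutine.
                    return await cmd(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    await asyncio.sleep(delay)

        return wrapper

    return decorator


if __name__ == "__main__":
    calls = {"count": 0}

    @retry_async(attempts=3)
    async def flaky_push() -> str:
        calls["count"] += 1
        if calls["count"] < 3:
            raise RuntimeError("transient failure")
        return "pushed"

    print(asyncio.run(flaky_push()), "after", calls["count"], "calls")
```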

  5. self.services_df still not reset between polls (Bug from v1, unfixed)

_get_service_data calls _extract_app_manifests which appends to self.services_df, but it's never cleared before a new poll cycle. Each poll accumulates duplicate rows.

  6. k8s_commands.py, helm.py, git.py all call shell.run_command without await (Critical)

These files have ~20+ calls to shell.run_command that are all non-awaited. The entire k8s backend and helm deployment are broken since shell.run_command is now unconditionally async.

  7. do_polling is @work(thread=True) + async def (Bug)

@work(thread=True) runs the function in a thread. Making it async def means it returns a coroutine from that thread, which Textual's worker won't automatically await. The polling loop likely never executes.

  8. asyncio.create_subprocess_shell passed shell=True (Minor)

create_subprocess_shell always runs through the shell, so passing shell=True is redundant (shell is a subprocess.Popen parameter). In current CPython the value True appears to be accepted with no effect, while shell=False raises ValueError, so the argument is best removed.

Positives (carried over from v1, still good)

  • Loading indicator, batch updates, and selective cell updates in the TUI remain solid improvements.
  • The async subprocess approach is the right direction for achieving concurrency.

Recommendation

The async conversion needs to be completed across the entire codebase — right now only argo_commands.py methods are partially converted while all other callers are broken. Consider:

  1. Keep shell.run_command synchronous and add a separate shell.run_command_async method. This way only the code that needs concurrency (the _extract_app_manifests TaskGroup) uses async, and everything else continues to work unchanged.
  2. Alternatively, commit to full async but then cli.py, k8s_commands.py, helm.py, git.py, and do_retry all need updating too.

Option 1 is much less invasive and easier to get right.
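
Option 1 might look roughly like the sketch below. The function bodies are assumptions about what shell.py could contain, not its actual code, and the example commands assume a POSIX shell. Note that create_subprocess_shell needs no shell argument and returns bytes, so the output is decoded by hand.

```python
import asyncio
import subprocess


def run_command(command: str) -> str:
    """Synchronous path: existing callers keep using this unchanged."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout


async def run_command_async(command: str) -> str:
    """Async path: only the concurrent manifest fetch needs this."""
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await process.communicate()
    if process.returncode != 0:
        raise subprocess.CalledProcessError(
            process.returncode, command, stdout, stderr
        )
    return stdout.decode()


async def run_all(commands: list[str]) -> list[str]:
    # The subprocesses run concurrently; results keep the input order.
    return await asyncio.gather(*(run_command_async(c) for c in commands))


if __name__ == "__main__":
    print(run_command("echo sync").strip())
    print([out.strip() for out in asyncio.run(run_all(["echo a", "echo b"]))])
```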

gilesknap merged commit 68964dc into Implement-k8s-service-labels on Mar 27, 2026
5 of 8 checks passed

2 participants