Attempt to make service fetching more efficient (using asyncio)#228
Conversation
fe01f5b to
8524908
Compare
|
Sorry I missed this. Remind me to take a look at it during the epics-containers sprint next week. |
6a283d7 to
65cf66c
Compare
eb0f146 to
4f48a10
Compare
… the table is more efficient Also added a loading indicator
|
@OCopping in testing I'm not seeing any improvement on the time to run Let me know when you are around and I'll drop by for a demo. |
There was a problem hiding this comment.
Review
Sorry to use Claude - but he is way better at reading other people's code than me! Cluade agrees with our analysis that the subprocess calls are not concurrent. Look out for point 3 too. I think these points are generally pretty good.
Overview
This PR attempts to speed up service fetching by running argocd app manifests calls concurrently via asyncio. It also improves the TUI monitor with a loading indicator, batch updates, and selective cell updates.
Issues
1. asyncio provides no actual concurrency here (Critical)
The _extract_app_manifests method is async but calls shell.run_command() — a synchronous blocking call. asyncio.TaskGroup only provides concurrency for await-based I/O. Since nothing is awaited in the hot path, all tasks run sequentially on the event loop, giving zero speedup. This likely explains the observation that ec ps shows no improvement.
To get actual concurrency, you'd need either:
await asyncio.to_thread(shell.run_command, ...)to run each shell call in a thread pool- Or skip asyncio entirely and use
concurrent.futures.ThreadPoolExecutordirectly
2. _get_services_df in monitor.py doesn't actually await (Bug)
In monitor.py, _get_services_df is now async but the body just calls self.commands._get_services_df(running_only) synchronously — no await. The return value is a plain DataFrame, not a coroutine. The async/await wrappers are decorative here.
3. Shared mutable state without proper protection (Bug)
ArgoCommands stores results on self.services_df and self.app_dicts as instance attributes, and _extract_app_manifests mutates self.services_df via .extend(). The self.async_lock is declared but never used in _extract_app_manifests. If this were truly concurrent, multiple tasks would race on self.services_df. Also, services_df is never reset between calls, so repeated invocations would accumulate duplicate rows.
4. _get_services_df event loop detection is fragile (Design)
try:
asyncio.get_running_loop()
with concurrent.futures.ThreadPoolExecutor() as pool:
future = pool.submit(asyncio.run, self._get_service_data())
future.result()
except RuntimeError:
asyncio.run(self._get_service_data())Spawning a new event loop inside a thread pool to work around an existing event loop is a code smell. This pattern can deadlock or cause subtle bugs with Textual's own event loop. Consider restructuring so the async boundary is cleaner.
5. self.services_df not reset between polls
Each call to _extract_app_manifests appends to self.services_df. On the second poll cycle, _get_service_data runs again but self.services_df still has data from the first cycle, leading to duplicate rows.
6. _check_service no longer uses the same code path as _ps
The base class _check_service calls _get_services_df, but ArgoCommands._check_service now calls _get_services() and reads self.app_dicts directly. This divergence means _check_service and _ps could give inconsistent results, and the base class override is easy to miss.
7. Minor: _get_services on base class is not abstract
commands.py adds _get_services as a plain method raising NotImplementedError (not @abstractmethod), while _get_services_df keeps @abstractmethod. This inconsistency means subclasses aren't forced to implement _get_services.
Positives
- Loading indicator in the TUI is a good UX improvement — the app no longer hangs on startup.
batch_update()and selective cell updates (if str(current) != str(cell["contents"])) inpopulate_tableare solid optimizations that reduce unnecessary redraws.- Caching
Color.parse("white")as a module-level constant avoids repeated parsing. - Separating service list fetching (
_get_services) from manifest extraction is a good structural direction.
Recommendation
The core async approach needs rework — the blocking shell.run_command calls need to be run via asyncio.to_thread() or a thread pool to achieve actual parallelism. I'd suggest addressing the concurrency issues before merging, since without them this adds complexity for no performance gain.
|
ec ps is now super fast but this has introduced a few issues: Updated PR 228 Review Good progress — the core issue from the first review (blocking shell.run_command) is addressed by converting it to use asyncio.create_subprocess_shell. However, the async conversion is incomplete, introducing several Issues
cli.py is unchanged but now calls methods that are async:
get_patches() (line 39) is still a regular function but shell.run_command is now async. app_resp will be a coroutine object, not a string. YAML.load() will then fail or produce nonsense. Since get_patches is called
Both functions are async but their initial app_resp = shell.run_command(...) calls (lines ~87, ~107) are not awaited. Same issue — coroutine assigned to app_resp instead of the actual string result.
do_retry wraps async functions (patch_value, push_value, push_remove_key) but calls them with cmd(*args, **kwargs) — this returns a coroutine without awaiting it. The _do_retry wrapper is sync, so it can never
_get_service_data calls _extract_app_manifests which appends to self.services_df, but it's never cleared before a new poll cycle. Each poll accumulates duplicate rows.
These files have ~20+ calls to shell.run_command that are all non-awaited. The entire k8s backend and helm deployment are broken since shell.run_command is now unconditionally async.
@work(thread=True) runs the function in a thread. Making it async def means it returns a coroutine from that thread, which Textual's worker won't automatically await. The polling loop likely never executes.
create_subprocess_shell always runs through the shell — shell=True is not a valid parameter for it (it's a subprocess.Popen parameter). This may be silently ignored or could error depending on the Python version. Positives (carried over from v1, still good)
Recommendation The async conversion needs to be completed across the entire codebase — right now only argo_commands.py methods are partially converted while all other callers are broken. Consider:
Option 1 is much less invasive and easier to get right. |
Missing await/async
also add useful comment
No description provided.