Skip to content

Comprehensive SDK audit: 46 findings across 8 PRs (reader crash safety, reconnect storm, error handling, asyncio lifecycle) #72

@mwolter805

Description

@mwolter805

Over the past week I've performed a comprehensive audit of meshcore_py at v2.3.6 (fbf84cb), motivated by stability issues I was hitting in the meshcore-ha Home Assistant integration. The audit found 46 unique issues across the SDK — 4 critical, 14 high, 18 medium, and 10 low/info.

I've implemented fixes for all 46 findings across 8 independent branches, each intended as a separate PR. All branches are based on upstream/main at fbf84cb. A combined integration branch merges all 8 together, and 136 tests pass with zero regressions (the full test suite completes in under 5 seconds with the test fixture fix) that has extensively tested on my live ha instance.

I'm filing this issue first to share the full plan and the forensics report before opening PRs, so you can review the scope and approach before being buried in diffs.

Forensics report

The full report is attached below. It covers:

  • Severity rubric and executive summary
  • 5 structural themes found across the codebase
  • All 46 findings organized by fix-group, with file/line references, code snippets, and recommended fixes
  • Extensively code/communication flow verified against companion_radio firmware at companion-v1.14.1

Full forensics report:
https://gist.github.com/mwolter805/ac228fb36471d722d096c82a0e980c21

PR plan — 8 branches in submission order

Each branch is scoped to a single theme and is independently reviewable/mergeable. They share some files (noted below), so later PRs will be rebased onto main after each merge.

# Branch Theme Key fixes Files touched Related issues
0 fix/test-timeout-waste Test infrastructure Mock dispatcher resolves events immediately; suite drops from ~8 min to <5s tests/unit/test_commands.py
1 fix/reader-parser-crash-safety Reader crash protection Umbrella try/except on handle_rx, NameError fix (LOGIN_FAILED), silent corruption guards (BATTERY, STATUS_RESPONSE), parser length validation reader.py, meshcore_parser.py meshcore_py#23, meshcore_py#19, meshcore-ha#154
2 fix/reconnect-path Reconnect storm TCP connect() Future→string, reconnect loop restructure, post-reconnect send_appstart() callback tcp_cx.py, connection_manager.py, meshcore.py meshcore-ha#204, meshcore-ha#170, meshcore-ha#176, meshcore-ha#36, meshcore-ha#45
3 fix/error-response-handling Error event handling Add EventType.ERROR to all command expected_events lists, fix fall-through crashes on error payloads events.py, commands/base.py, commands/device.py, commands/messaging.py meshcore_py#4 (extends earlier fix)
4 fix/transport-symmetry Transport parity Symmetric disconnect signaling on send failure, serial connect timeout, BLE callback re-registration, oversize-frame recovery tcp_cx.py, serial_cx.py, ble_cx.py meshcore_py#54 (related class)
5 fix/asyncio-lifecycle Asyncio hygiene Tracked background tasks (no more fire-and-forget), deferred Queue/Lock construction, get_running_loop() transports + events.py + commands/base.py meshcore_py#36 (related class)
6 fix/protocol-surface-gaps Missing protocol coverage CONTACT_DELETED/CONTACTS_FULL handlers, TUNING_PARAMS support, send_trace padding fix, orphaned command wrappers packets.py, reader.py, events.py, commands/device.py, commands/messaging.py, commands/contact.py
7 fix/standalone-bugs-and-cleanup Catch-all fixes Remove broken req_mma, bump DEFAULT_TIMEOUT 5→15s, TypeError guard, dead code removal, binary request pre-registration commands/binary.py, commands/base.py, commands/messaging.py, commands/contact.py, meshcore.py meshcore_py#52 (extends earlier fix)

Potential related issues this work addresses

meshcore_py

meshcore-ha (downstream)

Testing performed

  • All 136 unit tests pass on dev/combined (all 8 branches merged), zero regressions vs. upstream/main
  • Each branch individually passes its own tests plus all pre-existing tests
  • Test suite runs in <5 seconds (down from ~8 minutes) with the test fixture fix
  • Integration tested against meshcore-ha with real hardware (TCP companion)

Proposed submission cadence

I plan to submit PRs sequentially: test fix first (easiest to review, zero production code), then the remaining 7 in the order listed above. Each subsequent PR will be rebased onto the latest main after the previous one merges. I'm happy to adjust this cadence based on your preferences — if you'd rather review them all at once, I can push all branches and open all PRs simultaneously. I also have a dev/combined branch with everything merged together if you'd like to test the full set before merging individually.

Let me know how you'd like to proceed, and whether the scope/approach looks reasonable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions