Skip to content

tests: make test_networkevents portable on macOS#9140

Open
aryarathoree wants to merge 3 commits into
ElementsProject:masterfrom
aryarathoree:fix-macos-networkevents
Open

tests: make test_networkevents portable on macOS#9140
aryarathoree wants to merge 3 commits into
ElementsProject:masterfrom
aryarathoree:fix-macos-networkevents

Conversation

@aryarathoree
Copy link
Copy Markdown

@aryarathoree aryarathoree commented May 18, 2026

The test previously relied on Linux-specific socket error strings such as:

  • Connection refused
  • Connection timed out

On macOS (observed on Apple Silicon / Tahoe), failed connection attempts can instead surface as:

  • Bad file descriptor

This caused the test to fail despite the underlying behavior being correct.

Changes

  • Relaxed regex assertions to accept valid macOS socket error messages
  • Replaced fragile exact dictionary equality checks with:
    • explicit assertions for stable event fields
    • regex validation for platform-dependent error text

Result : The test now passes on both Linux and macOS without changing runtime behavior.

Reproduction

uv run pytest -vvvv tests/test_connection.py::test_networkevents

Before:

  • failed on macOS with Bad file descriptor

After:

  • passes successfully on macOS

Closes #9067

Checklist

  • The changelog has been updated in the relevant commit(s) according to the guidelines.
  • Tests have been added or modified to reflect the changes.
  • Documentation has been reviewed and updated as needed.
  • Related issues have been listed and linked, including any that this PR closes.
  • Important: All PRs must consider how to reverse any persistent changes for tools/lightning-downgrade

Notes

  • No changelog entry was added because this is a test portability fix.
  • No documentation updates were necessary.
  • No persistent/runtime behavior changes were introduced.

Copy link
Copy Markdown
Collaborator

@nGoline nGoline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can reproduce the failure on macOS Tahoe and agree the test needs to handle platform differences. However, I'd like to flag a concern before we merge.

EBADF ("Bad file descriptor", errno 9) is not a network-level failure. The two errors this test was designed to distinguish, ECONNREFUSED (remote port closed) and ETIMEDOUT (host unreachable), come from the kernel's TCP stack. EBADF instead means a system call was issued on a file descriptor that was already closed or invalid. That's an internal programming error, not a connection outcome.

Looking at the code path in ccan/ccan/io/io.c:

static int do_connect(int fd, struct io_plan_arg *arg) {
    ret = getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    if (ret < 0)
        return -1;  // errno here is from getsockopt itself, not connect()

If getsockopt is failing with EBADF, it means the fd was invalidated between when the event loop polled it writable and when do_connect ran. That's likely a race or double-close in ccan/io's async connect path on macOS: a real bug, not expected OS behaviour.

By accepting EBADF in the test, we're being tolerate to an internal error that shouldn't happen. Two specific risks:

  1. The two failure modes are no longer distinguished on macOS. The test currently accepts EBADF for both "refused" (port 1, loopback) and "unreachable" (1.1.1.1:8081). If the actual errors were swapped or completely wrong, this test would still pass on macOS.
  2. The bug could affect real users on macOS, who might see "Connection establishment: Bad file descriptor" in RPC error messages, which is a confusing and unhelpful message.

Suggested path:

  • Confirm whether this is ccan/io's do_connect producing the EBADF (add a strerror(errno) log right after the getsockopt failure). If so, this is a ccan/io bug on macOS.
  • The fix belongs in ccan/io: preserve the SO_ERROR value before the fd is closed, or guard against the fd becoming invalid between poll and callback.
  • A workaround-only approach could use pytest.mark.skipif(sys.platform == 'darwin', reason="ccan/io EBADF bug on macOS, see #XXXX") to make the gap visible while the underlying issue is tracked.

I don't want to block the PR, but if we go with the regex approach now, can we at least open a follow-up issue to investigate the EBADF root cause? Can @rustyrussell offer some insight regarding ccan/io?

Comment thread tests/test_connection.py Outdated
Comment thread tests/test_connection.py
Comment thread tests/test_connection.py
Comment thread tests/test_connection.py
Comment on lines +4818 to +4820
with pytest.raises(
RpcError,
match=r"Connection establishment: (Connection refused|Bad file descriptor)."):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment explaining the macOS behavior, e.g.:

Suggested change
with pytest.raises(
RpcError,
match=r"Connection establishment: (Connection refused|Bad file descriptor)."):
# macOS (Tahoe) returns EBADF instead of ECONNREFUSED/ETIMEDOUT for failed connections
with pytest.raises(
RpcError,
match=r"Connection establishment: (Connection refused|Bad file descriptor)."):

This helps future maintainers know this is intentional and not an oversight.

@aryarathoree
Copy link
Copy Markdown
Author

Thanks for the detailed analysis, that makes a lot of sense. I agree EBADF is very different from ECONNREFUSED or ETIMEDOUT, and the current macOS behavior does look suspicious.

My intention with the regex approach was mainly to make the test runnable on macOS again without losing the rest of the validation. The test was otherwise failing deterministically on Tahoe, so I tried to keep the assertions around the stable event fields intact and only relax the platform dependent error text.

I also agree that getting EBADF for both localhost:1 and 1.1.1.1:8081 suggests the actual connection outcome is being lost somewhere before it reaches the RPC layer. Your point about getsockopt(... SO_ERROR ...) returning after the fd may already be invalidated seems very plausible.

I still think there’s value in keeping a temporary portability fix so macOS CI/users retain coverage for this path, but I agree this should not be treated as expected long term behavior.

Opening a follow-up issue to investigate the EBADF root cause would definitely help, and I’m happy to file one and link it from this PR so the workaround stays clearly tracked as temporary.

@nGoline
Copy link
Copy Markdown
Collaborator

nGoline commented May 18, 2026

Thank you for the fixes! I guess that making the test pass was the main issue here, and it was completed.
Can you please clean-up the commits and add the line Changelog-None?

To clean-up you can fixup! all the fixes into a single commit, then edit the message to contain the changelog entry.
You can check coding-style-guidelines for detailed instructions.

Handle macOS-specific connection failures in
tests/test_connection.py::test_networkevents.

On macOS Tahoe, failed connect attempts can surface as
"Bad file descriptor" instead of the Linux-specific
errors previously expected by the test.

The assertions are updated to validate stable event
fields directly while allowing platform-dependent
connection error text.

Changelog-None
@aryarathoree aryarathoree force-pushed the fix-macos-networkevents branch from a5309d2 to f0626c9 Compare May 19, 2026 06:59
@aryarathoree
Copy link
Copy Markdown
Author

I have added Changelog-None to the commit message as suggested.
Also, I wanted to ask about joining the Core Lightning Slack. I tried the methods mentioned on the website, but I wasn’t able to get a working invite link. Is there an updated way to join?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pytest: *FAILED* in test_networkevents

2 participants