
Turn the search index into a binary file #784

Open
jviotti wants to merge 2 commits into main from search-binary

Conversation

@jviotti (Member) commented Mar 25, 2026

Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>

@jviotti jviotti marked this pull request as ready for review March 25, 2026 20:42
Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>

@cubic-dev-ai bot left a comment


1 issue found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/search/search.cc">

<violation number="1" location="src/search/search.cc:70">
P1: String lengths are truncated to `std::uint16_t` without validation. If any field exceeds 65535 bytes, the length is silently truncated, corrupting the binary index and causing incorrect reads. Consider adding validation or using larger length fields.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


augmentcode bot commented Mar 25, 2026

🤖 Augment PR Summary

Summary: Converts the Explorer search index payload from JSONL text into a compact binary format.
Changes:

  • Introduce packed binary headers (SearchIndexHeader, SearchRecordHeader) in sourcemeta/one/search.h
  • Rewrite make_search() to emit: header + offset table + variable-length records (path/title/description)
  • Update search() to parse the binary payload and run case-insensitive matching per field
  • Switch generated search index MIME type to application/octet-stream
  • Update CLI integration test to validate schema paths extracted from the binary payload
  • Split/expand unit tests into build-format tests and query/robustness tests for malformed payloads
Technical Notes: Result ordering is preserved by pre-sorting entries (metadata completeness, then health, then path), and search results are still capped at 10.
Considerations: The binary layout uses an offset table to enable O(1) record lookup and bounded parsing during streaming.



@augmentcode bot left a comment


Review completed. 3 suggestions posted.


Comment augment review to trigger a new review at any time.

  payload.size()}};
  sourcemeta::one::metapack_write_text(
-     action.destination, payload_view, "application/jsonl",
+     action.destination, payload_view, "application/octet-stream",

metapack_write_text() appends a trailing \n (and copies into a std::string), so using it for this binary search index will mutate the payload bytes (and even an “empty” index becomes a 1-byte payload). For a binary format this can be surprising and may break any consumer expecting the payload to end exactly at the last record.

Severity: medium


std::memcpy(offset_table + entry_index * sizeof(std::uint32_t),
&record_offset, sizeof(std::uint32_t));

// Write record header

The record field lengths are truncated to std::uint16_t without checking entry.*.size() <= UINT16_MAX, which will silently corrupt/truncate large paths/titles/descriptions. Because the writer still advances by the full std::string::size(), this can also make the on-disk format internally inconsistent for long fields.

Severity: medium


- if (payload_size == 0) {
+ if (payload == nullptr || payload_size < sizeof(SearchIndexHeader)) {
    return result;
  }

payload is raw bytes from a file, but it’s being parsed via reinterpret_cast<...*>(payload) and dereferenced (header, offset_table, record_header), which relies on type-punning/object-lifetime assumptions and can be undefined behavior under optimization. Consider decoding these fixed-size structs/integers via std::memcpy into local trivially-copyable values instead.

Severity: medium

Other Locations
  • src/search/search.cc:125
  • src/search/search.cc:136


Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>

@github-actions bot left a comment


Benchmark Index (community)

| Benchmark suite | Current: 7f0b3f1 | Previous: 00a0767 | Ratio |
|---|---|---|---|
| Add one schema (0 existing) | 21 ms | 19 ms | 1.11 |
| Add one schema (100 existing) | 27 ms | 24 ms | 1.13 |
| Add one schema (1000 existing) | 86 ms | 78 ms | 1.10 |
| Add one schema (10000 existing) | 723 ms | 819 ms | 0.88 |
| Update one schema (1 existing) | 21 ms | 17 ms | 1.24 |
| Update one schema (101 existing) | 27 ms | 24 ms | 1.13 |
| Update one schema (1001 existing) | 93 ms | 77 ms | 1.21 |
| Update one schema (10001 existing) | 748 ms | 673 ms | 1.11 |
| Cached rebuild (1 existing) | 11 ms | 10 ms | 1.10 |
| Cached rebuild (101 existing) | 13 ms | 12 ms | 1.08 |
| Cached rebuild (1001 existing) | 38 ms | 33 ms | 1.15 |
| Cached rebuild (10001 existing) | 299 ms | 285 ms | 1.05 |
| Index 100 schemas | 121 ms | 109 ms | 1.11 |
| Index 1000 schemas | 1119 ms | 888 ms | 1.26 |
| Index 10000 schemas | 14037 ms | 13507 ms | 1.04 |

This comment was automatically generated by workflow using github-action-benchmark.


@github-actions bot left a comment


Benchmark Index (enterprise)

| Benchmark suite | Current: 7f0b3f1 | Previous: 00a0767 | Ratio |
|---|---|---|---|
| Add one schema (0 existing) | 24 ms | 22 ms | 1.09 |
| Add one schema (100 existing) | 27 ms | 28 ms | 0.96 |
| Add one schema (1000 existing) | 84 ms | 84 ms | 1.00 |
| Add one schema (10000 existing) | 765 ms | 805 ms | 0.95 |
| Update one schema (1 existing) | 21 ms | 20 ms | 1.05 |
| Update one schema (101 existing) | 27 ms | 27 ms | 1.00 |
| Update one schema (1001 existing) | 86 ms | 87 ms | 0.99 |
| Update one schema (10001 existing) | 720 ms | 705 ms | 1.02 |
| Cached rebuild (1 existing) | 19 ms | 11 ms | 1.73 |
| Cached rebuild (101 existing) | 14 ms | 14 ms | 1.00 |
| Cached rebuild (1001 existing) | 38 ms | 39 ms | 0.97 |
| Cached rebuild (10001 existing) | 317 ms | 300 ms | 1.06 |
| Index 100 schemas | 124 ms | 124 ms | 1.00 |
| Index 1000 schemas | 992 ms | 1065 ms | 0.93 |
| Index 10000 schemas | 13976 ms | 14304 ms | 0.98 |

This comment was automatically generated by workflow using github-action-benchmark.
