
Self-Describing Binary Log Format (v3) #13231

Draft
masaori335 wants to merge 2 commits into apache:master from masaori335:binary-format-v3

@masaori335
Contributor

Motivation

In version 2, a segment header carries the field symbols (fmt_fieldlist,
e.g. "chi cqu pssc") and a printf-style template (fmt_printf) but
not the field types. To decode an entry, a reader has to already know the
type of each symbol, because the value encodings are self-delimiting only once
the type is known (an IP value is variable length, for example). That couples
every out-of-tree parser to the exact ATS build that wrote the log.

Version 3 adds one thing: a per-segment field-type schema that lists the
wire type of every field, in field order. Decoding then needs only the symbols
(as keys) and the schema (for types).
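
To make the idea concrete, here is a rough sketch of schema-driven decoding. The type codes follow the values quoted elsewhere in this PR; everything else is an assumption for illustration only: the names `WireType` and `decode_entry` are hypothetical, and the value encodings used here (8-byte host-order integers, NUL-terminated strings padded to a multiple of 8 bytes) are simplified placeholders, not the documented ATS marshalling. IP values are deliberately not modelled.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Wire type codes as quoted in this PR (INVALID = 0 is reserved).
enum class WireType : uint8_t { INVALID = 0, sINT = 1, dINT = 2, STRING = 3, IP = 4 };

using Decoded = std::vector<std::pair<std::string, std::string>>;

// ASSUMPTION (illustration only): integers are 8 bytes in host order and
// strings are NUL-terminated, padded to a multiple of 8 bytes.
std::optional<Decoded>
decode_entry(const std::vector<std::string> &symbols, const std::vector<WireType> &schema,
             const unsigned char *buf, size_t len)
{
  if (symbols.size() != schema.size()) {
    return std::nullopt; // schema and symbol list must describe the same fields
  }
  Decoded out;
  size_t  pos = 0; // invariant: pos <= len
  for (size_t i = 0; i < schema.size(); ++i) {
    switch (schema[i]) {
    case WireType::sINT: {
      if (len - pos < 8) {
        return std::nullopt;
      }
      int64_t v;
      std::memcpy(&v, buf + pos, 8);
      pos += 8;
      out.emplace_back(symbols[i], std::to_string(v));
      break;
    }
    case WireType::dINT: {
      if (len - pos < 16) {
        return std::nullopt;
      }
      int64_t a, b;
      std::memcpy(&a, buf + pos, 8);
      std::memcpy(&b, buf + pos + 8, 8);
      pos += 16;
      out.emplace_back(symbols[i], std::to_string(a) + "," + std::to_string(b));
      break;
    }
    case WireType::STRING: {
      const void *nul = std::memchr(buf + pos, 0, len - pos);
      if (nul == nullptr) {
        return std::nullopt; // unterminated string
      }
      size_t n      = static_cast<const unsigned char *>(nul) - (buf + pos);
      size_t padded = (n + 1 + 7) & ~size_t{7};
      if (len - pos < padded) {
        return std::nullopt;
      }
      out.emplace_back(symbols[i], std::string(reinterpret_cast<const char *>(buf + pos), n));
      pos += padded;
      break;
    }
    default:
      return std::nullopt; // IP and unknown codes are not modelled in this sketch
    }
  }
  return out;
}
```

The point of the sketch is that the loop never consults a symbol-to-type table: the symbol is used only as the output key, and the schema alone decides how many bytes each value consumes.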


Depends on #13223

Seven boolean/counter fields were declared dINT, but their marshal
functions write a single int, and proxy_protocol_version (ppv) was
declared dINT while it actually marshals a string. The dINT type
wrongly excludes these fields from log filters and aggregates, and
the ppv mislabeling misrepresents variable-length string bytes as
two fixed ints to any type-driven consumer. Retype the single-int
fields as sINT and ppv as STRING so the declared type matches what
each marshal function emits.

Publish each field's type in a per-segment schema so a generic reader can
decode a .blog from the file alone, without an embedded ATS symbol-to-type
table that must track the writer in lockstep. The per-field code is
LogField::Type serialized directly (now an enum class : uint8_t with INVALID=0
reserved and sINT..IP = 1..4 as the frozen wire codes); a static_assert pins
the values. This relies on each field's declared type matching its marshalled
framing, which the parent commit ("Fix mismatched sINT/dINT log field types")
establishes.
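
As an illustration of what "frozen wire codes pinned by a static_assert" can look like, the following sketch mirrors the values named above. `FieldType` and `make_schema` are hypothetical stand-ins, not the actual `LogField::Type` or writer code, and the one-byte-per-field layout with no padding is an assumption made for this example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical mirror of the wire codes described above.
enum class FieldType : uint8_t { INVALID = 0, sINT = 1, dINT = 2, STRING = 3, IP = 4 };

// Pin the on-disk values so a reorder or insertion fails to compile.
static_assert(static_cast<uint8_t>(FieldType::INVALID) == 0, "INVALID is reserved as 0");
static_assert(static_cast<uint8_t>(FieldType::sINT) == 1, "wire code for sINT is frozen");
static_assert(static_cast<uint8_t>(FieldType::dINT) == 2, "wire code for dINT is frozen");
static_assert(static_cast<uint8_t>(FieldType::STRING) == 3, "wire code for STRING is frozen");
static_assert(static_cast<uint8_t>(FieldType::IP) == 4, "wire code for IP is frozen");

// ASSUMPTION: the schema is one byte per field, in field order, no padding.
std::vector<uint8_t>
make_schema(const std::vector<FieldType> &fields)
{
  std::vector<uint8_t> out;
  out.reserve(fields.size());
  for (FieldType t : fields) {
    out.push_back(static_cast<uint8_t>(t)); // the enum value is the wire code
  }
  return out;
}
```

Because the enum value is written directly, new types can only be appended; renumbering an existing enumerator would silently change the meaning of logs already on disk, which is what the assertions guard against.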

Readers (LogBufferIterator, logcat, logstats, the ASCII output paths) accept
both v2 and v3 segments, sizing the header read to the on-disk version, so a v3
build keeps decoding logs written by an older one. Integer values stay in host
byte order, as in v2 (no endianness change). The public TSLogType enum is given
the same values as LogField::Type so TSLogFieldRegister can static_cast between
them; static_asserts in InkAPI.cc (the only TU that sees both) pin the
alignment so a future reorder fails to compile.
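
A minimal sketch of the alignment technique, under the assumption that the public enum deliberately skips the dINT code. `InternalType`, `PublicLogType`, and `to_internal` are hypothetical names chosen for this example; they are not the identifiers used in InkAPI.cc.

```cpp
#include <cstdint>

// Hypothetical internal enum carrying the wire codes.
enum class InternalType : uint8_t { INVALID = 0, sINT = 1, dINT = 2, STRING = 3, IP = 4 };

// Hypothetical public enum; value 2 (dINT) is intentionally absent.
enum PublicLogType {
  PUBLIC_LOG_TYPE_INT    = 1,
  PUBLIC_LOG_TYPE_STRING = 3,
  PUBLIC_LOG_TYPE_ADDR   = 4,
};

// If either enum is renumbered, compilation stops here.
static_assert(static_cast<int>(PUBLIC_LOG_TYPE_INT) == static_cast<int>(InternalType::sINT), "INT must match sINT");
static_assert(static_cast<int>(PUBLIC_LOG_TYPE_STRING) == static_cast<int>(InternalType::STRING), "STRING must match");
static_assert(static_cast<int>(PUBLIC_LOG_TYPE_ADDR) == static_cast<int>(InternalType::IP), "ADDR must match IP");

// Validate a caller-supplied value before casting. Taking an int avoids
// relying on out-of-range values of the public enum type.
constexpr InternalType
to_internal(int public_type)
{
  switch (public_type) {
  case PUBLIC_LOG_TYPE_INT:
  case PUBLIC_LOG_TYPE_STRING:
  case PUBLIC_LOG_TYPE_ADDR:
    return static_cast<InternalType>(public_type);
  default:
    return InternalType::INVALID; // includes 0 and the unexposed dINT code
  }
}
```

The cast itself is trivial once the values agree; the assertions exist so the agreement cannot drift unnoticed.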

The writer version is per-LogObject: logging.yaml "binary_log_version: 2"
pins a binary log to the pre-v3 layout (no schema, shorter header) so a
not-yet-upgraded downstream parser keeps working during a migration; the
default is v3.

Decoding untrusted .blog input is bounded: LogBufferIterator validates
data_offset and each entry against the segment, and the JSON decoder validates
the schema offset alignment and cross-checks field_count against the symbol
list.
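
The kind of check described here can be sketched as follows. `SegmentView`, `data_offset_ok`, and `entry_in_bounds` are hypothetical names, and the three fields are an assumed minimal view of a segment header, not the real LogBufferHeader layout.

```cpp
#include <cstdint>

// Hypothetical minimal view of a segment: total size, header size, and the
// offset at which entries begin.
struct SegmentView {
  uint32_t byte_count;
  uint32_t header_size;
  uint32_t data_offset;
};

// The entry area must start after the header and inside the segment.
bool
data_offset_ok(const SegmentView &s)
{
  return s.data_offset >= s.header_size && s.data_offset <= s.byte_count;
}

// An entry occupying [offset, offset + entry_len) must lie entirely inside
// the entry area. The subtraction form avoids overflow in offset + entry_len.
bool
entry_in_bounds(const SegmentView &s, uint32_t offset, uint32_t entry_len)
{
  if (!data_offset_ok(s)) {
    return false;
  }
  if (offset < s.data_offset || offset > s.byte_count) {
    return false;
  }
  // A zero-length entry would make an iterator loop forever, so reject it.
  return entry_len != 0 && entry_len <= s.byte_count - offset;
}
```

Writing the length check as `entry_len <= byte_count - offset` rather than `offset + entry_len <= byte_count` matters for untrusted input, since the latter can wrap around for large lengths.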

masaori335 added this to the 11.0.0 milestone Jun 3, 2026
masaori335 self-assigned this Jun 3, 2026
Copilot AI review requested due to automatic review settings June 3, 2026 06:31
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces binary log format v3 for Apache Traffic Server logging, making .blog segments self-describing by embedding a per-segment field-type schema. This decouples binary log decoding from the exact ATS build that produced the log and adds tooling/tests/docs around the new format.

Changes:

  • Add a v3 per-segment field-type schema to the binary log segment header and make segment header reads version-sized (v2/v3 compatibility).
  • Extend traffic_logcat with a schema-driven JSON output mode (-j/--json) and harden iteration/decoding against malformed segments.
  • Add unit + gold tests and documentation for the v3 on-disk format and configuration (binary_log_version).

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/gold_tests/logging/gold/binary_log_v3_json.gold | Gold output for traffic_logcat -j JSON decoding of v3 logs. |
| tests/gold_tests/logging/gold/binary_log_v3_ascii.gold | Gold output for ASCII decoding of v2/v3 logs. |
| tests/gold_tests/logging/binary_log_v3.test.py | End-to-end gold test exercising v2/v3 binary logs and v3 JSON output. |
| src/traffic_logstats/logstats.cc | Read segment headers using on-disk header size; add corruption guards. |
| src/traffic_logcat/unit-tests/test_LogEntryJson.cc | Unit tests for schema-driven v3 JSON reference decoder. |
| src/traffic_logcat/LogEntryJson.h | Public interface/contract for v3 entry-to-JSON reference decoder. |
| src/traffic_logcat/LogEntryJson.cc | Implementation of schema-only v3 JSON decoder with bounds checks. |
| src/traffic_logcat/logcat.cc | Add -j/--json option and version-sized header reads; emit JSON lines. |
| src/traffic_logcat/CMakeLists.txt | Build LogEntryJson into traffic_logcat and add Catch2 unit test target. |
| src/proxy/logging/YamlLogConfig.cc | Add binary_log_version logging.yaml key parsing and propagation to LogObject. |
| src/proxy/logging/unit-tests/test_LogBuffer.cc | Unit tests covering v3 schema/type alignment and version-sized header sizing/iteration. |
| src/proxy/logging/LogObject.cc | Extend LogObject ctor to store per-object binary_log_version. |
| src/proxy/logging/LogFormat.cc | Update type enum usage for aggregation checks. |
| src/proxy/logging/LogFilter.cc | Update type enum usage and improve unknown-type error reporting. |
| src/proxy/logging/LogFile.cc | Accept v2/v3 segment versions for ASCII conversion paths. |
| src/proxy/logging/LogField.cc | Convert type to scoped enum and update assertions/display formatting. |
| src/proxy/logging/LogBuffer.cc | Write v3 type schema, set per-object segment version, and harden iterator. |
| src/proxy/logging/LogAccess.cc | Fix pointer advancement bug in unmarshal_http_version. |
| src/proxy/logging/Log.cc | Align field declarations to new LogField::Type and fix several field types (incl. ppv). |
| src/proxy/logging/CMakeLists.txt | Add Catch2 unit test target for LogBuffer v3. |
| src/api/InkAPI.cc | Pin TSLogType-to-LogField::Type relationship and validate plugin-provided types. |
| include/ts/apidefs.h.in | Update TSLogType enum values/comments to mirror v3 wire-type codes. |
| include/proxy/logging/LogObject.h | Add binary_log_version ctor arg/default and accessor/storage. |
| include/proxy/logging/LogFormat.h | Expose field_list() for writer-side schema emission in v3. |
| include/proxy/logging/LogField.h | Redefine LogField::Type as a stable, append-only wire-code enum class. |
| include/proxy/logging/LogBuffer.h | Bump LOG_SEGMENT_VERSION to 3; add schema struct and helper sizing/version utilities. |
| doc/developer-guide/logging-architecture/index.en.rst | Add v3 format page to logging architecture docs index. |
| doc/developer-guide/logging-architecture/binary-log-v3-format.en.rst | New specification document for v3 on-disk format and decoding rules. |
| doc/appendices/command-line/traffic_logstats.en.rst | Document v2/v3 support and reference v3 format spec. |
| doc/appendices/command-line/traffic_logcat.en.rst | Document v2/v3 support and new -j/--json option behavior. |
| doc/admin-guide/files/logging.yaml.en.rst | Document binary_log_version key and provide v2/v3 configuration examples. |

Comment thread tests/gold_tests/logging/binary_log_v3.test.py
Comment thread src/traffic_logcat/logcat.cc
Comment thread src/proxy/logging/YamlLogConfig.cc
Comment on lines +91 to +99
char *
LogBufferHeader::fmt_fieldtypes()
{
  char *addr = nullptr;
  if (fmt_fieldtypes_offset) {
    addr = reinterpret_cast<char *>(this) + fmt_fieldtypes_offset;
  }
  return addr;
}
Comment on lines +178 to +188
for (const char *p = s; p < nul; ++p) {
  // Minimal JSON escaping for structural characters.
  if (*p == '"' || *p == '\\') {
    if (!put_ch('\\')) {
      return -1;
    }
  }
  if (!put_ch(*p)) {
    return -1;
  }
}
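
The excerpt above escapes only the double quote and the backslash. RFC 8259 additionally requires control characters below U+0020 to be escaped inside JSON strings. The following is a minimal sketch of a fuller escaper, not the PR's code: `json_escape` is a hypothetical helper that builds a `std::string` instead of using the `put_ch` callback, and it passes bytes at or above 0x80 through unchanged, so it does not validate UTF-8.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: escape a byte string for use inside a JSON string literal.
std::string
json_escape(std::string_view in)
{
  std::string out;
  out.reserve(in.size());
  for (unsigned char c : in) {
    switch (c) {
    case '"':
      out += "\\\"";
      break;
    case '\\':
      out += "\\\\";
      break;
    case '\n':
      out += "\\n";
      break;
    case '\r':
      out += "\\r";
      break;
    case '\t':
      out += "\\t";
      break;
    default:
      if (c < 0x20) {
        // Remaining control characters use the six-character \u00XX form.
        char buf[7];
        std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
        out += buf;
      } else {
        out += static_cast<char>(c); // printable ASCII and bytes >= 0x80 pass through
      }
    }
  }
  return out;
}
```

Log fields such as URLs or header values can carry arbitrary bytes, so whether control characters are escaped decides whether each output line remains parseable JSON.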
Comment thread include/ts/apidefs.h.in
Comment on lines 1652 to 1657
 enum TSLogType {
-  TS_LOG_TYPE_INT,
+  TS_LOG_TYPE_INT = 1, ///< LogField::Type::sINT
   // DINT is omitted from the public API for now, until we decide whether we keep the type
-  TS_LOG_TYPE_STRING = 2,
-  TS_LOG_TYPE_ADDR = 3,
+  TS_LOG_TYPE_STRING = 3, ///< LogField::Type::STRING
+  TS_LOG_TYPE_ADDR = 4, ///< LogField::Type::IP
 };
Contributor Author


This renumbering is fine because TSLogType will first be released in 11.0.0; it has not been published yet.

masaori335 marked this pull request as draft June 3, 2026 23:44