Skip to content

[c++] avrogencpp: emit deterministic include guard#3778

Open
travisdowns wants to merge 1 commit into
apache:mainfrom
travisdowns:stable-avrogen-include-guard
Open

[c++] avrogencpp: emit deterministic include guard#3778
travisdowns wants to merge 1 commit into
apache:mainfrom
travisdowns:stable-avrogen-include-guard

Conversation

@travisdowns
Copy link
Copy Markdown

What

CodeGen::guard() in lang/c++/impl/avrogencpp.cc suffixes the generated header's #ifndef guard with the output of an std::mt19937 seeded from ::time(nullptr):

string CodeGen::guard() {
    string h = headerFile_;
    makeCanonical(h, true);
    return h + "_" + std::to_string(random_()) + "_H";
}

So two invocations on the same schema produce headers whose guards differ:

#ifndef FOO_AVROGEN_H_3350718792_H
#ifndef FOO_AVROGEN_H_2362587291_H

This PR drops the random suffix.

Why

  1. Generated output is non-deterministic. Re-running avrogencpp -i schema.avsc -o foo.h on the same input produces a different file every time, which is surprising for a codegen and makes side-by-side diff / git review difficult.

  2. It breaks content-addressed build systems. Bazel's remote cache and the Nix store both key on input-content digests. With a randomised include guard, every consumer of the generated header sees a different input digest on every build — so on byte-identical schemas the entire downstream .cc.o.a → binary chain has to recompile and re-link, even though the schema hasn't changed. This was the trigger for filing the PR: a hermetic two-output-base Bazel build comparison flagged manifest_file.avrogen.h as a root non-hermetic action and traced ~hundreds of downstream cascade rebuilds back to it.

Why this is safe

headerFile_ is already guaranteed-unique per output (it's the path of the file we're about to write). makeCanonical(h, true) turns that path into a valid C identifier, which on its own is a fine guard name. The RNG suffix only added entropy, not uniqueness — there's no scenario where two avrogencpp runs producing headers at the same output path are supposed to coexist with conflicting guards.

After the change, the guard is <canonicalised-path>_H, deterministic across runs.

Test

Build avrogencpp, run it twice against the same schema (echo $RANDOM > /tmp/x.avsc is irrelevant — same schema, same path), check the outputs are byte-identical. Before the change they differ; after, they match.

CodeGen::guard() in avrogencpp.cc was suffixing the generated header's
include guard with the output of std::mt19937 seeded from ::time(nullptr).
That produced a different guard on every avrogen invocation, e.g.:

  #ifndef FOO_AVROGEN_H_3350718792_H
  #ifndef FOO_AVROGEN_H_2362587291_H

Two consequences:

1. Generated headers were non-deterministic. Repeated runs on the same
   schema produced different bytes, which is surprising for a codegen and
   makes side-by-side diff / git review difficult.

2. Build systems that key their cache on input-content digests
   (e.g. Bazel's remote cache, the Nix store) saw every consumer of the
   generated header miss the cache on every build, even when the schema
   was byte-identical. In a hermetic two-output-base Bazel build of a
   downstream project, this surfaced as a chain of cascade rebuilds
   that started at manifest_file.avrogen.h and propagated through every
   .cc that included it.

headerFile_ is already guaranteed-unique per output (it's the path of
the file we're about to write). makeCanonical(h, true) turns it into a
valid C identifier, which is already a fine guard name on its own; the
random suffix doesn't add uniqueness, only entropy.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C++ Pull Requests for C++ binding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant