feat: Gc benchmarking#421

Draft
stanbrub wants to merge 22 commits into deephaven:main from stanbrub:gc-benchmarking

Conversation

@stanbrub
Collaborator

  • Added "training tests" that are representative benchmarks for comparing JDK versions, GC types, Python versions, etc. They are meant to provide as much coverage as possible with the fewest tests
  • Added a LocalParquetGenerator to generate very large parquet files into the DHC data directory. The typical standard benchmarks generate data through DHC, which works well for small to mid-sized data sets but not for the very large ones these tests need
  • Tests added: AggBy, Filter, Join, UpdateBy, Formula

@stanbrub stanbrub self-assigned this Mar 26, 2026

@cpwright cpwright left a comment


All of the benchmarks are going to do some work, but I have concerns that we might do too much of it in the timestamp calculation and in traversing a UnionSourceManager (merge) where it isn't necessary.

merge([
    read('/data/timed.parquet').view(formulas=[${loadColumns}])${headRows}
] * ${scaleFactor}).update_view([
    'timestamp=timestamp.plusMillis((long)(ii / ${rows}) * ${rows})'
])
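As an aside, the offset arithmetic in that update_view formula can be sketched in plain Python: each merged copy of the source table gets its timestamps shifted forward by a whole multiple of `${rows}` milliseconds, so the copies don't overlap in time. The concrete `rows` and `scale_factor` values below are hypothetical, and this is ordinary Python, not the Deephaven API.

```python
# Plain-Python sketch of the update_view offset arithmetic:
#   timestamp.plusMillis((long)(ii / rows) * rows)
# ii is the global row index across the merged table; integer
# division mirrors the (long) cast in the query formula.

def offset_millis(ii: int, rows: int) -> int:
    """Millisecond offset applied to the row at global index ii."""
    return (ii // rows) * rows

rows = 1000          # rows per source copy (hypothetical value)
scale_factor = 3     # number of merged copies (hypothetical value)

# Rows in the first copy get offset 0, the second copy `rows`,
# the third copy 2 * rows, and so on.
offsets = {offset_millis(ii, rows) for ii in range(rows * scale_factor)}
print(sorted(offsets))  # [0, 1000, 2000]
```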

Is there a reason we can't use the timestamp from the file? I have a few worries about doing rowset calculation as part of the benchmark (to come up with ii).

For the actual test benchmarks, without a select we would also just prefer more/bigger parquet files to avoid the overhead of going through the merge data structures. We might even be able to get away with symlinks to have the data just repeat itself.
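The symlink idea could look something like this hypothetical sketch (the helper and file names are illustrative, not part of this PR): one generated parquet file is exposed under several names, so a directory-based read sees the data repeated N times without duplicating it on disk.

```python
import os
import tempfile

# Hypothetical sketch: make one large parquet file appear N times
# via symlinks, so a directory read sees the data repeated without
# paying N times the disk space or generation time.
def replicate_via_symlinks(source: str, dest_dir: str, copies: int) -> list:
    links = []
    for i in range(copies):
        link = os.path.join(dest_dir, f"timed_{i}.parquet")
        os.symlink(source, link)  # each name resolves to the same bytes
        links.append(link)
    return links

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "timed.parquet")
    open(src, "wb").close()  # stand-in for a real generated file
    links = replicate_via_symlinks(src, d, 3)
    results = [os.path.islink(p) for p in links]

print(results)  # [True, True, True]
```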

Collaborator Author


For the "train" benchmarks, since we don't use Scale Factors, that section of code will not be hit; it is only used when we are doing merges to simulate larger data sets. For the nightly runs, the merge happens BEFORE the "select" into memory, which is not included in the measurement. The "train" benchmarks only read timestamps directly from the parquet file(s), and only when a benchmark actually uses them (like rollingtime).
