Conversation
stanbrub
commented
Mar 26, 2026
- Added "training tests": representative benchmarks for comparing JDK versions, GC types, Python versions, etc. They are meant to provide as much coverage as possible with the fewest tests.
- Added a LocalParquetGenerator to generate very large parquet files into the DHC data directory. The typical standard benchmarks generate data through DHC, which works well for small- to mid-sized data sets but not for very large ones.
- Tests added: AggBy, Filter, Join, UpdateBy, Formula
cpwright
left a comment
All the benchmarks are going to do work. I have some concerns that we might do too much work related to the timestamp calculation and traversing a UnionSourceManager (merge) where it is not necessary.
```
merge([
    read('/data/timed.parquet').view(formulas=[${loadColumns}])${headRows}
] * ${scaleFactor}).update_view([
    'timestamp=timestamp.plusMillis((long)(ii / ${rows}) * ${rows})'
```
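The `update_view` formula above shifts each merged copy of the source table forward by a whole multiple of `${rows}` milliseconds, so the repeated copies read as consecutive time periods rather than overlapping ones. A minimal Python sketch of the integer arithmetic (the values are hypothetical, not from the PR):

```python
# Sketch of '(long)(ii / rows) * rows': floor the flat row index ii to the
# start of the merged copy it belongs to, so copy k is shifted k * rows ms.
rows = 1_000  # hypothetical row count per source file

def shift_millis(ii: int) -> int:
    # For non-negative ii, Java's (long)(ii / rows) matches Python's ii // rows.
    return (ii // rows) * rows

# Rows 0..999 fall in copy 0 (no shift), rows 1000..1999 in copy 1
# (shifted 1000 ms), rows 2000..2999 in copy 2 (shifted 2000 ms), etc.
```

Computing this per row is what requires knowing `ii`, which is why the review flags the row-set traversal cost.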
Is there a reason we can't use the timestamp from the file? I have a few worries about doing row-set calculation as part of the benchmark (to come up with `ii`).
For the actual test benchmarks, without a select we would just prefer more/bigger parquet files to avoid the overhead of going through the merge data structures. We might even be able to get away with symlinks so the data just repeats itself.
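The symlink suggestion can be sketched as follows: instead of merging the same table `${scaleFactor}` times in the engine, link one parquet file several times into a directory the engine reads as a partitioned table. The paths and the scale factor of 3 here are illustrative stand-ins, not from the PR:

```python
# Hypothetical sketch: "repeat" a data set on disk with symlinks rather than
# a merge() in the engine, so no UnionSourceManager is traversed at read time.
import os
import tempfile

scale_factor = 3  # illustrative; stands in for ${scaleFactor}

src = tempfile.NamedTemporaryFile(suffix=".parquet", delete=False).name  # stand-in for /data/timed.parquet
dest = tempfile.mkdtemp()  # stand-in for the directory the engine reads
for i in range(scale_factor):
    os.symlink(src, os.path.join(dest, f"part_{i}.parquet"))
```

The trade-off is that the symlinked copies share identical timestamps, so this only works where the benchmark does not need the per-copy timestamp shift shown above.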
For the "train" benchmarks, since we don't use Scale Factors, that section of code will not be hit. It is only used when we merge copies to simulate larger data sets. For the nightly runs, this happens BEFORE the "select" into memory, which is not included in the measurement. For the "train" benchmarks, we read timestamps directly from the parquet file(s), and only when the benchmark uses them (like rollingtime).