25 changes: 24 additions & 1 deletion .github/workflows/tests.yml
@@ -7,12 +7,33 @@ on:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5
        with:
          python-version: "3.11"
          enable-cache: true

      - name: Install dev dependencies
        run: uv sync --group dev

      - name: Ruff check
        run: uv run ruff check src/ tests/

      - name: Ruff format check
        run: uv run ruff format --check src/ tests/

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      fail-fast: false
      matrix:
-       python-version: ["3.8", "3.9", "3.10", "3.11", "3.12", "3.13"]
+       python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]

    steps:
      - uses: actions/checkout@v4
@@ -21,6 +42,7 @@ jobs:
        uses: astral-sh/setup-uv@v5
        with:
          python-version: ${{ matrix.python-version }}
          enable-cache: true

      - name: Install dependencies
        run: uv sync --group dev
@@ -66,6 +88,7 @@ jobs:
        uses: astral-sh/setup-uv@v5
        with:
          python-version: "3.11"
          enable-cache: true

      - name: Install dependencies (${{ matrix.engine }})
        run: uv sync --group dev ${{ matrix.extras_flags }}
3 changes: 3 additions & 0 deletions .gitignore
@@ -79,3 +79,6 @@ __lakebench_cli_cache__/
# Optional: Docs builds
site/
docs/_build/

# Personal scratch / scratchpads (workspace-specific drivers, demo captures)
scratch/
18 changes: 18 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,18 @@
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-toml
      - id: check-merge-conflict
      - id: check-added-large-files
        args: [--maxkb=500]
320 changes: 320 additions & 0 deletions docs/architecture.md

Large diffs are not rendered by default.

253 changes: 253 additions & 0 deletions docs/cli-quickstart.md
@@ -0,0 +1,253 @@
# LakeBench CLI — Quick Start

A 5-minute tour of the `lakebench` CLI. Get from zero to a measured benchmark
run on your laptop without touching any Python.

---

## 1. Install

```bash
# pip — pick the engines you want; DuckDB has the smallest footprint
pip install 'lakebench[duckdb,tpch_datagen]'
```

Verify:

```bash
lakebench --version
lakebench --help
```

> **Using `uv` instead of `pip`?** Every command below works with the same
> arguments — just prefix with `uv run`, e.g. `uv run lakebench --version`.
> To set up the dev environment from a clone:
> `uv sync --group dev --extra duckdb --extra tpch_datagen`
> Install `uv` with `curl -LsSf https://astral.sh/uv/install.sh | sh`.

---

## 2. Generate some data (optional)

```bash
lakebench datagen \
  --benchmark tpch \
  --scale-factor 1 \
  --output /tmp/tpch_sf1
```

This writes the eight TPC-H tables as Parquet files under `/tmp/tpch_sf1/`. Use
`--scale-factor 0.1` if you want it to finish in seconds.

---

## 3. Run a benchmark — zero config

You can run with no profile at all:

```bash
lakebench run \
  --engine duckdb \
  --benchmark tpch --scenario sf1 --scale-factor 1 \
  --input-uri /tmp/tpch_sf1
```

`--engine` builds an ad-hoc profile inline. Local engines (`duckdb`, `polars`,
`daft`, `sail`) get a working-directory URI under `$TMPDIR/lakebench-scratch`
unless you override with `-E schema_or_working_directory_uri=...`.

Drop `--engine` and the CLI will **auto-create `~/.lakebench.json`** the first
time, picking the first installed local engine (priority: duckdb → polars →
daft → spark → sail). You'll see one warning line:

```
WARNING lakebench: No profile config found — created starter at /home/you/.lakebench.json
(re-run with --engine to override).
```

After that, future runs use the saved default with no flags needed.
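
The priority rule above can be pictured with a short Python sketch. This is not LakeBench's actual code: the module names in `ENGINE_MODULES`, the helper names, and the shape of the starter config (borrowed from the profile example in section 4) are all assumptions made for illustration.

```python
from importlib.util import find_spec

# Priority order quoted in the text above. The importable module name for
# each engine is an assumption, not taken from LakeBench.
ENGINE_MODULES = {
    "duckdb": "duckdb",
    "polars": "polars",
    "daft": "daft",
    "spark": "pyspark",
    "sail": "pysail",
}


def first_installed_engine(is_installed=lambda module: find_spec(module) is not None):
    """Return the first engine whose module is importable, or None."""
    for engine, module in ENGINE_MODULES.items():
        if is_installed(module):
            return engine
    return None


def starter_config(engine):
    """Build a minimal config document (hypothetical shape, see section 4)."""
    name = f"local-{engine}"
    return {
        "defaults": {"profile": name},
        "profiles": {name: {"engine": engine}},
    }
```

Passing a custom `is_installed` callable makes the selection easy to exercise without installing any engine.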

---

## 4. Create a named profile (for repeated runs)

For more than one engine or non-default settings, create
`./lakebench.json` in the repo root (project-level):

```json
{
  "defaults": { "profile": "local-duckdb" },
  "profiles": {
    "local-duckdb": {
      "engine": "duckdb",
      "engine_options": {
        "schema_or_working_directory_uri": "/tmp/lakebench-duckdb"
      }
    }
  }
}
```

Inspect what the CLI actually sees:

```bash
lakebench profiles list
lakebench profiles show local-duckdb
```
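
If you want to read the same file from a script, the lookup is roughly as follows. This is a minimal sketch that assumes only the JSON layout shown above; `load_config` and `resolve_profile` are hypothetical helper names, not part of the LakeBench API.

```python
import json
from pathlib import Path


def load_config(path="lakebench.json"):
    """Read a profile config file with the layout shown above."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def resolve_profile(config, name=None):
    """Return (profile_name, profile), falling back to defaults.profile."""
    name = name or config.get("defaults", {}).get("profile")
    if name is None:
        raise KeyError("no profile requested and defaults.profile is not set")
    profiles = config.get("profiles", {})
    if name not in profiles:
        raise KeyError(f"profile {name!r} not found")
    return name, profiles[name]
```

For example, `resolve_profile(load_config())` returns the `local-duckdb` entry for the file above, because `defaults.profile` names it.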

---

## 5. Run with the profile

```bash
lakebench run \
  --benchmark tpch \
  --scenario sf1 \
  --scale-factor 1 \
  --input-uri /tmp/tpch_sf1
```

Because `defaults.profile` is set, `--profile` is not needed. To see the merged
config without launching an engine, add `--print-config` (or `--dry-run`)
first:

```bash
lakebench run --benchmark tpch --scenario sf1 \
  --scale-factor 1 --input-uri /tmp/tpch_sf1 --print-config
```

---

## 6. Inspect results

```bash
lakebench results latest # most recent run
lakebench results list --benchmark tpch # filter
lakebench results show <run_id_prefix> # 6-char prefix is enough
lakebench results stats --benchmark tpch # n / mean / p50 / p95
```

Runs land in `./results/` by default — change with `--results-dir DIR` or
`LAKEBENCH_RESULTS_DIR`.
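
To make the `n / mean / p50 / p95` summary concrete, here is one way such figures can be computed with the Python standard library. It is a sketch only: how `lakebench results stats` actually interpolates percentiles is not stated here, so the `inclusive` method and the `summarize` helper are assumptions.

```python
import statistics


def summarize(durations_s):
    """Return n / mean / p50 / p95 for a list of run durations in seconds."""
    if not durations_s:
        raise ValueError("need at least one duration")
    if len(durations_s) == 1:
        only = durations_s[0]
        return {"n": 1, "mean": only, "p50": only, "p95": only}
    # quantiles(n=100) returns the 99 cut points p1..p99, so index 94 is p95.
    cuts = statistics.quantiles(durations_s, n=100, method="inclusive")
    return {
        "n": len(durations_s),
        "mean": statistics.fmean(durations_s),
        "p50": statistics.median(durations_s),
        "p95": cuts[94],
    }
```

With only a handful of runs, p95 is mostly an interpolation between the two slowest runs, so treat it as indicative until you have more samples.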

---

## 6a. Discover datasets already in your lakehouse

Pointing LakeBench at a Fabric workspace or Databricks catalog for the first
time? Ask it what's there:

```bash
lakebench discover --profile my-fabric
```

Example output:

```
catalog        schema        benchmark         confidence  matched/expected
spark_catalog  tpcds_sf1000  tpcds | eltbench  100%        24/24
spark_catalog  tpch_sf1000   tpch              100%        8/8
spark_catalog  clickbench    clickbench        100%        1/1
```

Now you know which schema to pass as `--input-uri` / `schema_name` in a
subsequent `lakebench run`. `discover` also works with `--engine duckdb`
against a local scratch directory. `--min-confidence 0.8` hides partial
matches; `--format json` emits machine-readable output for scripting.
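
One plausible reading of the `confidence` column is the share of a benchmark's expected tables found in a schema (`matched/expected`). The sketch below illustrates filtering on that basis. The `Match` class, its field names, and the ratio formula are assumptions inferred from the example output, not the CLI's documented behaviour or its JSON schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    schema: str
    benchmark: str
    matched: int   # expected tables actually found in the schema
    expected: int  # tables the benchmark defines

    @property
    def confidence(self):
        # Assumed definition: fraction of expected tables that were found.
        return self.matched / self.expected


def filter_matches(matches, min_confidence=0.0):
    """Keep matches at or above the threshold, highest confidence first."""
    kept = [m for m in matches if m.confidence >= min_confidence]
    return sorted(kept, key=lambda m: m.confidence, reverse=True)
```

Under this reading, a schema holding 5 of the 8 TPC-H tables scores 0.625 and is hidden by a threshold of 0.8.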

### Benchmark against an existing database

Once `discover` tells you what's in the lakehouse, run queries against it
without re-loading. Use `--mode query`, `--database <schema>`, and (for
multi-catalog engines) `--catalog <name>`:

```bash
# Fabric / Synapse / HDInsight via Livy
lakebench run --profile my-fabric \
  --benchmark tpcds --scenario sf1000 --scale-factor 1000 \
  --database tpcds_sf1000 --mode query

# Databricks (Unity Catalog or hive_metastore)
lakebench run --profile my-databricks \
  --benchmark tpch --scenario sf100 --scale-factor 100 \
  --catalog hive_metastore --database tpch_sf100 --mode query
```

`--database` (alias: `--schema`) overlays onto `engine_options.schema_name`,
and `--catalog` onto `engine_options.catalog_name`. Queries are auto-qualified
with the resolved catalog/schema, so no SQL edits are required.

---

## 7. Check your environment

Before debugging a flaky run, ask the CLI to self-check:

```bash
lakebench doctor
lakebench doctor --profile local-duckdb
```

It catches missing extras, a broken profile, datagen tools missing from `PATH`,
an unwritable results directory, and a missing or unauthenticated `az` CLI when
any profile uses `auth: az` (Fabric / Databricks / Synapse / HDInsight).

---

## 8. Tweak engine settings without editing the profile

Two override flags, last-one-wins, deep-merged into the profile:

```bash
# -E: any key under engine_options (JSON-aware, dotted nesting)
lakebench run --benchmark tpch --scenario sf1 \
  --scale-factor 1 --input-uri /tmp/tpch_sf1 \
  -E "compute_stats_all_cols=true"

# --conf: shortcut for engine_options.session_conf.<key>
lakebench run --benchmark tpch --scenario sf1 ... \
  --conf spark.sql.shuffle.partitions=200
```

Both also have file forms: `--engine-options-file foo.json`,
`--conf-file foo.properties`.
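
The merge semantics described in this section (JSON-aware values, dotted nesting, deep merge, last one wins) can be sketched in a few lines of Python. This illustrates the idea only: the function names are hypothetical, and applying `-E` before `--conf` and keeping `--conf` values as strings are assumptions.

```python
import json


def parse_override(text):
    """Turn 'a.b=value' into {'a': {'b': value}}, decoding JSON when possible."""
    key, sep, raw = text.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE, got {text!r}")
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        value = raw  # not valid JSON, keep it as a plain string
    nested = value
    for part in reversed(key.split(".")):
        nested = {part: nested}
    return nested


def deep_merge(base, override):
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def apply_overrides(engine_options, e_flags=(), conf_flags=()):
    """Apply -E style flags, then --conf style flags, in the order given."""
    result = dict(engine_options)
    for flag in e_flags:
        result = deep_merge(result, parse_override(flag))
    for flag in conf_flags:
        key, _, value = flag.partition("=")
        # --conf keys such as spark.sql.shuffle.partitions contain dots,
        # so they are stored whole under session_conf rather than split.
        result = deep_merge(result, {"session_conf": {key: value}})
    return result
```

Because each flag is merged in turn, repeating a key later on the command line replaces the earlier value while leaving sibling keys untouched.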

---

## 9. Tab completion (optional)

```bash
# bash
eval "$(lakebench --shell-init bash)"
# zsh
eval "$(lakebench --shell-init zsh)"
# fish
lakebench --shell-init fish | source
```

This requires `argcomplete` (`pip install argcomplete`); without it, the command is a no-op.

---

## Common recipes

| Task | Command |
|---|---|
| List supported run modes for a benchmark | `lakebench list-modes tpch` |
| Compare two runs side-by-side | `lakebench results compare <a> <b>` |
| Tag a run | `lakebench results tag <run_id> baseline production` |
| Add a note | `lakebench results notes <run_id> "warm cache, after vacuum"` |
| Export to CSV / Markdown | `lakebench results export --format md --output report.md` |
| Purge old runs | `lakebench results purge --older-than 30d` |
| Get full traceback on error | add `--debug` |
| Continue past engine crash, exit 2 instead of 3 | add `--continue-on-error` |

---

## Where to next

- **`docs/cli-reference.md`** — every flag, every subcommand, all defaults.
- **`docs/install-fabric.md`** — Fabric-specific install + first run.
- **`docs/install-databricks.md`** — Databricks-specific install + first run.
- **`README.md`** — Python-API usage, custom benchmarks/engines.
- **`lakebench doctor`** — first stop when something doesn't work.