diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 8a0c961..8a71eb1 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -1,237 +1,53 @@ -# Dataframely - Coding Agent Instructions +# Dataframely -## Project Overview +## Package Management -Dataframely is a declarative, polars-native data frame validation library. It validates schemas and data content in -polars DataFrames using native polars expressions and a custom Rust-based polars plugin for high performance. It -supports validating individual data frames via `Schema` classes and interconnected data frames via `Collection` classes. +This repository uses the Pixi package manager. When editing `pixi.toml`, run `pixi lock` afterwards. -## Tech Stack +When running any commands (like `pytest`), prepend them with `pixi run`. -### Core Technologies +## Code Style -- **Python**: Primary language for the public API -- **Rust**: Backend for polars plugin and custom regex operations -- **Polars**: Only supported data frame library -- **pyo3 & maturin**: Rust-Python bindings and build system -- **pixi**: Primary environment and task manager (NOT pip/conda directly) +### Documentation -### Build System +- Document all public functions/methods and classes using docstrings + - For functions & methods, use Google Docstrings and include `Args` (if there are any arguments) and `Returns` (if + there is a return type). + - Do not include type hints in the docstrings + - Do not mention default values in the docstrings +- Do not write docstrings for private functions/methods unless the function is highly complex -- **maturin**: Builds the Rust extension module `dataframely._native` -- **Cargo**: Rust dependency management -- Rust toolchain specified in `rust-toolchain.toml` with clippy and rustfmt components +### License Headers -## Environment Setup +Do not manually adjust or add license headers. A pre-commit hook will take care of this. 
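+The docstring rules above can be sketched as follows — a minimal, hypothetical example (not part of dataframely's API) showing a Google-style docstring with `Args` and `Returns`, no type hints, and no mention of defaults: + +```python +def scale(values: list[float], factor: float = 1.0) -> list[float]: +    """Scale each value by a constant factor. + +    Args: +        values: The values to scale. +        factor: The multiplicative factor applied to each value. + +    Returns: +        The scaled values, in the original order. +    """ +    # Type hints live in the signature only; the docstring stays prose. +    return [v * factor for v in values] +``` +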
-**CRITICAL**: Always use `pixi` commands - never run `pip`, `conda`, `python`, or `cargo` directly unless specifically -required for Rust-only operations. +## Testing -### Initial Setup +- Never use classes for pytest, but only free functions +- Do not put `__init__.py` files into test directories +- Tests should not have docstrings unless they are very complicated or very specific, i.e. warrant a description beyond + the test's name +- All tests should follow the arrange-act-assert pattern. The respective logical blocks should be distinguished via + code comments as follows: -Unless already performed via external setup steps: + ```python + def test_method() -> None: + # Arrange + ... -```bash -# Install Rust toolchain -rustup show + # Act + ... -# Install pixi environment and dependencies -pixi install + # Assert + ... + ``` -# Build and install the package locally (REQUIRED after Rust changes) -pixi run postinstall -``` +- If two or more tests are structurally equivalent, they should be merged into a single test and parametrized with + `@pytest.mark.parametrize` +- If at least two tests share the same logic in the "arrange" step, the respective logic should be extracted into a + fixture -### After Rust Code Changes +## Reviewing -**Always run** `pixi run postinstall` after modifying any Rust code in `src/` to rebuild the native extension. - -## Development Workflow - -### Running Tests - -```bash -# Run all tests (excludes S3 tests by default) -pixi run test - -# Run tests with S3 backend (requires moto server) -pixi run test -m s3 - -# Run specific test file or directory -pixi run test tests/schema/ - -# Run with coverage -pixi run test-coverage - -# Run benchmarks -pixi run test-bench -``` - -### Code Quality - -**NEVER** run linters/formatters directly. 
Use pre-commit: - -```bash -# Run all pre-commit hooks -pixi run pre-commit run -``` - -Pre-commit handles: - -- **Python**: ruff (lint & format), mypy (type checking), docformatter -- **Rust**: cargo fmt, cargo clippy -- **Other**: prettier (md/yml), taplo (toml), license headers, trailing whitespace - -### Building Documentation - -```bash -# Build documentation -pixi run -e docs postinstall -pixi run docs - -# Open in browser (macOS) -open docs/_build/html/index.html -``` - -## Project Structure - -``` -dataframely/ # Python package - schema.py # Core Schema class for DataFrame validation - collection/ # Collection class for validating multiple interconnected DataFrames - columns/ # Column type definitions (String, Integer, Float, etc.) - testing/ # Testing utilities (factories, masks, storage mocks) - _storage/ # Storage backends (Parquet, Delta Lake) - _rule.py # Rule decorator for validation rules - _plugin.py # Polars plugin registration - _native.pyi # Type stubs for Rust extension - -src/ # Rust source code - lib.rs # PyO3 module definition - polars_plugin/ # Custom polars plugin for validation - regex/ # Custom regex operations - -tests/ # Unit tests (mirrors dataframely/ structure) - benches/ # Benchmark tests - conftest.py # Shared pytest fixtures (including s3_server) - -docs/ # Sphinx documentation - guides/ # User guides and examples - api/ # Auto-generated API reference -``` - -## Pixi Environments - -Multiple environments for different purposes: - -- **default**: Base Python + core dependencies -- **dev**: Includes jupyter for notebooks -- **test**: Testing dependencies (pytest, moto, boto3, etc.) -- **docs**: Documentation building (sphinx, myst-parser, etc.) 
-- **lint**: Linting and formatting tools -- **optionals**: Optional dependencies (pydantic, deltalake, pyarrow, sqlalchemy) -- **py310-py314**: Python version-specific environments - -Use `-e ` to run commands in specific environments: - -```bash -pixi run -e test test -pixi run -e docs docs -``` - -## API Design Principles - -### Critical Guidelines - -1. **NO BREAKING CHANGES**: Public API must remain backward compatible -2. **100% Test Coverage**: All new code requires tests -3. **Documentation Required**: All public features need docstrings + API docs -4. **Cautious API Extension**: Avoid adding to public API unless necessary - -### Public API - -Public exports are in `dataframely/__init__.py`. Main components: - -- **Schema classes**: `Schema` for DataFrame validation -- **Collection classes**: `Collection`, `CollectionMember` for multi-DataFrame validation -- **Column types**: `String`, `Integer`, `Float`, `Bool`, `Date`, `Datetime`, etc. -- **Decorators**: `@rule()`, `@filter()` -- **Type hints**: `DataFrame[Schema]`, `LazyFrame[Schema]`, `Validation` - -## Common Pitfalls & Solutions - -### S3 Testing - -The `s3_server` fixture in `tests/conftest.py` uses `subprocess.Popen` to start moto_server on port 9999. This is a **workaround** for a polars issue with ThreadedMotoServer. When the polars issue is fixed, it should be replaced with ThreadedMotoServer (code is commented in the file). - -**Note**: CI skips S3 tests by default. Run with `pixi run test -m s3` when modifying storage backends. 
- -## Testing Strategy - -- Tests are organized by module, mirroring the `dataframely/` structure -- Use `dy.Schema.sample()` for generating test data -- Test both eager (`DataFrame`) and lazy (`LazyFrame`) execution -- S3 tests use moto server fixture from `conftest.py` -- Benchmark tests in `tests/benches/` use pytest-benchmark - -## Validation Pattern - -Typical usage pattern: - -```python -class MySchema(dy.Schema): - col = dy.String(nullable=False) - - @dy.rule() - def my_rule(cls) -> pl.Expr: - return pl.col("col").str.len_chars() > 0 - -# Validate and cast -validated_df: dy.DataFrame[MySchema] = MySchema.validate(df, cast=True) -``` - -## Key Configuration Files - -- `pixi.toml`: Environment and task definitions -- `pyproject.toml`: Python package metadata, tool configurations (ruff, mypy, pytest) -- `Cargo.toml`: Rust dependencies and build settings -- `.pre-commit-config.yaml`: All code quality checks -- `rust-toolchain.toml`: Rust nightly version specification - -## When Making Changes - -1. **Python code**: Run `pixi run pre-commit run` before committing -2. **Rust code**: Run `pixi run postinstall` to rebuild, then run tests -3. **Tests**: Ensure `pixi run test` passes. If changes might affect storage backends, use `pixi run test -m s3`. -4. **Documentation**: Update docstrings -5. 
**API changes**: Ensure backward compatibility or document migration path - -### Pull request titles (required) - -Pull request titles must follow the Conventional Commits format: `<type>[!]: <Subject>` - -Allowed `type` values: - -- `feat`: A new feature -- `fix`: A bug fix -- `docs`: Documentation only changes -- `style`: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc) -- `refactor`: A code change that neither fixes a bug nor adds a feature -- `perf`: A code change that improves performance -- `test`: Adding missing tests or correcting existing tests -- `build`: Changes that affect the build system or external dependencies -- `ci`: Changes to our CI configuration files and scripts -- `chore`: Other changes that don't modify src or test files -- `revert`: Reverts a previous commit - -Additional rules: - -- Use `!` only for **breaking changes** -- `Subject` must start with an **uppercase** letter and must **not** end with `.` or a trailing space - -## Performance Considerations - -- Validation uses native polars expressions for performance -- Custom Rust plugin for advanced validation logic -- Lazy evaluation supported via `LazyFrame` for large datasets -- Avoid materializing data unnecessarily in validation rules +When reviewing code changes, make sure that the `SKILL.md` is up-to-date and in line with the public API of this +package. diff --git a/SKILL.md b/SKILL.md new file mode 100644 index 0000000..d9eb1c0 --- /dev/null +++ b/SKILL.md @@ -0,0 +1,238 @@ +--- +name: dataframely +description: Best practices for polars data processing with dataframely. Covers definitions of Schema and Collection, usage of + .validate() and .filter(), type hints, and testing. Use when writing or modifying code involving dataframely or + polars data frames.
+license: BSD-3-Clause +user-invocable: false +--- + +# Overview + +`dataframely` provides two types: + +- `dy.Schema` documents and enforces the structure of a single data frame +- `dy.Collection` documents and enforces the relationships between multiple related data frames that each have their + own `dy.Schema` + +## `dy.Schema` + +A subclass of `dy.Schema` describes the structure of a single data frame. + +```python +class MyHouseSchema(dy.Schema): + """A schema for a dataframe describing houses.""" + + street = dy.String(primary_key=True) + number = dy.UInt16(primary_key=True) + #: Description of the number of rooms. + rooms = dy.UInt8() + #: Description of the area of the house. + area = dy.UInt16() +``` + +The schema can be used in type hints via `dy.DataFrame[MyHouseSchema]` and `dy.LazyFrame[MyHouseSchema]` to express +schema adherence statically. It can also be used to validate the structure and contents of a data frame at runtime +using validation and filtering. + +`dy.DataFrame[...]` and `dy.LazyFrame[...]` are typically referred to as "typed data frames". They are typing-only +wrappers around `pl.DataFrame` and `pl.LazyFrame`, respectively, and only express intent. They are never initialized at +runtime. + +### Defining Constraints + +Persist all implicit assumptions on the data as constraints in the schema. Use docstrings purely to answer the "what" +about the column contents. + +- Use the most specific type possible for each column (e.g. `dy.Enum` instead of `dy.String` when applicable). +- Use pre-defined arguments (e.g. `nullable`, `min`, `regex`) for column-level constraints if possible. +- Use the `check` argument for non-standard column-level constraints that cannot be expressed using pre-defined + arguments. Prefer defining the check as a dictionary with keys describing the type of check: + + ```python + class MySchema(dy.Schema): + col = dy.UInt8(check={"divisible_by_two": lambda col: (col % 2) == 0}) + ``` + +- Use rules (i.e.
methods decorated with `@dy.rule`) for cross-column constraints. Use expressive names for the rules + and use `cls` to refer to the schema: + + ```python + class MySchema(dy.Schema): + col1 = dy.UInt8() + col2 = dy.UInt8() + + @dy.rule() + def col1_greater_col2(cls) -> pl.Expr: + return cls.col1.col > cls.col2.col + ``` + +- Use group rules (i.e. methods decorated with `@dy.rule(group_by=...)`) for cross-row constraints beyond primary key + checks. + +### Referencing Columns + +When referencing columns of the schema anywhere in the code, always reference the column as an attribute of the schema class: + +- Use `Schema.column.col` instead of `pl.col("column")` to obtain a `pl.Expr` referencing the column. +- Use `Schema.column.name` to reference the column name as a string. + +This allows for easier refactorings and enables lookups on column definitions and constraints via LSP. + +## `dy.Collection` + +A subclass of `dy.Collection` describes a set of related data frames, each described by a `dy.Schema`. Data frames in a +collection should share at least a subset of their primary key. + +```python +class MyStreetSchema(dy.Schema): + """A schema for a dataframe describing streets.""" + + # Shared primary key component with MyHouseSchema + street = dy.String(primary_key=True) + city = dy.String() + + +class MyCollection(dy.Collection): + """A collection of related dataframes.""" + + houses: dy.LazyFrame[MyHouseSchema] + streets: dy.LazyFrame[MyStreetSchema] +``` + +The collection can be used in a standalone manner (much like a dataclass). It can also be used to validate the +structure and contents of its members and their relationships at runtime using validation and filtering. + +### Defining Constraints + +Persist all implicit assumptions about the relationships between the collection's data frames as constraints in the +collection. + +- Use filters (i.e.
1:1, 1:N) + between the collection's data frames. Leverage `dy.functional` for writing filter logic. + + ```python + class MyCollection(dy.Collection): + houses: dy.LazyFrame[MyHouseSchema] + streets: dy.LazyFrame[MyStreetSchema] + + @dy.filter() + def all_houses_on_known_streets(cls) -> pl.LazyFrame: + return dy.functional.require_relationship_one_to_at_least_one( + cls.streets, cls.houses, on="street" + ) + ``` + +# Usage Conventions + +## Clear Interfaces + +Structure data processing code with clear interfaces documented using `dataframely` type hints: + +```python +def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]: + # Internal data frames do not require schemas + df: pl.LazyFrame = ... + return MyPreprocessedSchema.validate(df, cast=True) +``` + +- Use schemas for all input and output data frames in a function. Omit type hints if the function is a private helper + (prefixed with `_`) unless the schema critically improves readability or testability. +- Omit schemas for short-lived temporary data frames. Never define schemas for function-local data frames. + +## Validation and Filtering + +Both `.validate` and `.filter` enforce the schema at runtime. Pass `cast=True` for safe type-casting. + +- **`Schema.validate`** — raises on failure. Use when failures are unexpected (e.g. transforming already-validated + data). +- **`Schema.filter`** — returns valid rows plus a `FailureInfo` describing filtered-out rows. Use when failures are + possible and should be handled gracefully. Failures should either be kept around or logged for introspection.
The + `FailureInfo` object provides several utility methods to obtain information about the failures: + - `len(failure)` provides the total number of failures + - `failure.counts()` provides the number of violations by rule + - `failure.invalid()` provides the data frame of invalid rows + - `failure.details()` provides the data frame of invalid rows with additional columns providing information on which + rules were violated + +When performing validation or filtering, prefer using `pipe` to clarify the flow of data: + +```python +result = df.pipe(MySchema.validate) +out, failures = df.pipe(MySchema.filter) +``` + +### Pure Casting + +Use `Schema.cast` as an escape hatch when it is already known that the data frame conforms to the schema and the +runtime cost of the validation should not be incurred. Generally, prefer using `Schema.validate` or `Schema.filter`. + +## Testing + +Unless otherwise specified by the user or the project context, add unit tests for all (non-private) methods performing +data transformations. + +- Do not test properties already guaranteed by the schema (e.g. data types, nullability, value constraints). + +### Test structure + +Write tests with the following structure: + +1. "Arrange": Define synthetic input data and expected output +2. "Act": Execute the transformation +3. "Assert": Compare expected and actual output using `assert_frame_equal` from `polars.testing` + +```python +from polars.testing import assert_frame_equal + + +def test_grouped_sum() -> None: + # Arrange + df = pl.DataFrame({ + "col1": [1, 2, 3], + "col2": ["a", "a", "b"], + }).pipe(MyInputSchema.validate, cast=True) + + expected = pl.DataFrame({ + "col2": ["a", "b"], + "col1": [3, 3], + }) + + # Act + result = my_code(df) + + # Assert + assert_frame_equal(expected, result) +``` + +### Generating Synthetic Test Data + +Use `dataframely`'s synthetic data generation for creating inputs to functions requiring typed data frames in their +input.
Generate synthetic data for schemas as follows: + +- Use `MySchema.sample(num_rows=...)` to generate fully random data when exact contents don't matter. +- Use `MySchema.sample(overrides=...)` to generate random data with specific columns pinned to certain values for + testing specific functionality. Prefer using dicts of lists for overrides unless specifically prompted otherwise. + - When using dicts of lists: for providing overrides that are constant across all rows, provide scalar values instead + of lists of equal values. +- Always use `MySchema.create_empty()` instead of sampling with empty overrides when an empty data frame is needed. + +Synthetic data for collections should be generated as follows: + +- Use `MyCollection.sample(num_rows=...)` to generate fully random data when exact contents don't matter. +- Use `MyCollection.sample(overrides=...)` to generate random data where certain values of the collection members + matter. Use lists of dicts for providing overrides as "objects" spanning the collection members. + - Values for shared primary keys must be provided at the root of the dictionaries + - Values for individual collection members must be provided in nested dictionaries under the keys corresponding to + the collection member names. +- Always use `MyCollection.create_empty()` instead of sampling with empty overrides when an empty collection is needed. + +## I/O Conventions + +When writing typed data frames to disk, prefer using `MySchema.write_...` instead of using `write_...` directly on the +data frame. This ensures that schema metadata is persisted alongside the data and can be leveraged when reading the +data back in. + +When reading typed data frames from disk, prefer using `MySchema.read_...` instead of using `pl.read_...` directly from polars. + +# Getting more information + +`dataframely` provides clear function signatures, type hints and docstrings for the full public API. For more +information, inspect the source code in the site packages.
If available, always use the LSP tool to find documentation. diff --git a/docs/conf.py b/docs/conf.py index 73dc611..6e49c8b 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -22,7 +22,6 @@ _mod = importlib.import_module("dataframely") - project = "dataframely" copyright = f"{datetime.date.today().year}, QuantCo, Inc" author = "QuantCo, Inc." diff --git a/docs/guides/coding-agents.md b/docs/guides/coding-agents.md new file mode 100644 index 0000000..21eb4ec --- /dev/null +++ b/docs/guides/coding-agents.md @@ -0,0 +1,75 @@ +# Using `dataframely` with coding agents + +Coding agents like [Claude Code](https://code.claude.com/), [Codex](https://openai.com/codex/) and +[GitHub Copilot](https://github.com/features/copilot) are particularly powerful when two criteria are met: + +1. The agent has access to the full context required to solve the problem, i.e. does not have to guess. +2. The results of the agent's work can be easily verified. + +When writing data processing logic, `dataframely` helps to fulfill these criteria. + +To help your coding agent write idiomatic `dataframely` code, we provide a `dataframely` +[skill](https://raw.githubusercontent.com/Quantco/dataframely/refs/heads/main/SKILL.md) following the +[`agentskills.io` spec](https://agentskills.io/specification). You can install it by placing it where your agent can +find it. For example, if you are using Claude Code: + +```bash +mkdir -p .claude/skills/dataframely/ +curl -o .claude/skills/dataframely/SKILL.md https://raw.githubusercontent.com/Quantco/dataframely/refs/heads/main/SKILL.md +``` + +or if you are using [skills.sh](https://skills.sh/) to manage your skills: + +```bash +npx skills add Quantco/dataframely +``` + +Refer to the documentation of your coding agent for instructions on how to add custom skills. 
+ +## Tell the agent about your data with `dataframely` schemas + +`dataframely` schemas provide a clear format for documenting dataframe structure and contents, which helps coding +agents understand your code base. We recommend structuring your data processing code using clear interfaces that are +documented using `dataframely` type hints. This streamlines your coding agent's ability to find the right schema at the +right time. + +For example: + +```python +def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]: + ... +``` + +gives a coding agent much more information than the schema-less alternative: + +```python +def preprocess(raw: pl.LazyFrame) -> pl.DataFrame: + ... +``` + +This convention also makes your code more readable and maintainable for human developers. + +If there is additional domain information that is not natively expressed through the structure of the schema, we +recommend documenting this in comments on the definitions of the schema columns. One common example would be the +semantic meanings of enum values referring to conventions in the data: + +```python +class HospitalStaySchema(dy.Schema): + # Reason for admission to the hospital + # N = Emergency + # V = Transfer from another hospital + # ... + admission_reason = dy.Enum(["N", "V", ...]) +``` + +## Verifying results + +`dataframely` supports you and your coding agent in writing unit tests for individual pieces of logic. One significant +bottleneck is the generation of appropriate test data. Check out +[our documentation on synthetic data generation](./features/data-generation.md) to see how `dataframely` can help you +generate realistic test data that meets the constraints of your schema. We recommend requiring your coding agent to +write tests using this functionality to verify its work. + +> [!NOTE] +> The official skill already tells your coding agent how best to write unit tests with dataframely.
diff --git a/docs/guides/index.md b/docs/guides/index.md index d0e20eb..538b63e 100644 --- a/docs/guides/index.md +++ b/docs/guides/index.md @@ -7,6 +7,7 @@ quickstart examples/index features/index +coding-agents development migration/index faq