Feature Type
Problem Description
pd.col() currently only supports selecting a single column, even though its
behavior allows expressions to be chained in ways that naturally suggest
multi-column operations should also work.
The function is defined as:
def col(col_name: Hashable) -> Expression:
Internally it constructs an Expression whose evaluation function does:
def func(df: DataFrame) -> Series:
    if col_name not in df.columns:
        raise ValueError(...)
    return df[col_name]
This always returns a Series, meaning pd.col() is fundamentally
single-column oriented.
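For illustration, the single-column behavior can be reproduced with a plain closure. This is a minimal sketch rather than the pandas source: make_single_col_func is a hypothetical name, and the error message is invented for the example.

```python
from collections.abc import Callable, Hashable

import pandas as pd


def make_single_col_func(col_name: Hashable) -> Callable[[pd.DataFrame], pd.Series]:
    """Hypothetical stand-in for the evaluation function built by pd.col()."""

    def func(df: pd.DataFrame) -> pd.Series:
        # Mirror the membership check quoted in the issue text.
        if col_name not in df.columns:
            # The message below is illustrative, not the real pandas message.
            raise ValueError(f"Column {col_name!r} not found")
        return df[col_name]

    return func


df = pd.DataFrame({"price": [10.0, 20.0], "discount": [1.0, 2.0]})
price = make_single_col_func("price")(df)
print(type(price).__name__)
```

Selecting one label yields a one-dimensional result, so everything chained after it is a Series method.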
Inconsistent behavior with lists
Passing a list currently does not raise a clear error:
pd.col(["price", "discount"])
The expression is created successfully, but when evaluated it produces:
df[["price", "discount"]] # returns a DataFrame
This creates inconsistent behavior when chaining operations such as:
df.assign(total=pd.col(["price", "discount"]).sum())
because the expression now operates on a DataFrame instead of a Series.
The behavior becomes unclear and differs from user expectations.
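The difference can be seen with plain pandas, without pd.col() at all. The sketch below assumes the list-valued expression evaluates to df[["price", "discount"]], as described above, and compares the default column-wise sum with the row-wise one.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "discount": [1.0, 2.0]})
base = df[["price", "discount"]]  # what a list-valued selection evaluates to

col_totals = base.sum()        # one value per column, indexed by column name
row_totals = base.sum(axis=1)  # one value per row, indexed like df

# assign aligns on the index, so the column-wise result matches no row label.
misaligned = df.assign(total=col_totals)
aligned = df.assign(total=row_totals)

print(misaligned["total"].isna().all())
print(aligned["total"].tolist())
```

On a Series, .sum() would return a scalar that broadcasts to every row; on a DataFrame it returns a Series keyed by column name, which is why the same chained call behaves so differently.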
Missing regex selection
There is currently no way to express pattern-based column selection inside
pd.col():
pd.col("^price_", regex=True)
This fails with:
TypeError: col() got an unexpected keyword argument 'regex'
Users must instead rely on eager operations such as:
df.filter(regex="^price_")
which breaks the composable expression style that pd.col() aims to provide.
Summary of limitations
Currently pd.col() does not support:
- Multi-column selection
- Regex-based column selection
- dtype-based column selection
- Any expression whose base resolves to a DataFrame
This limits the usefulness of the expression API for many real-world workflows.
Feature Description
Extend pd.col() so it can reference multiple columns and column groups
while still returning an Expression.
Proposed signature
def col(
    col_name: Hashable | list[Hashable] | None = None,
    *,
    regex: str | None = None,
    dtype: str | type | None = None,
) -> Expression:
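One way the three selectors could be dispatched is sketched below as a plain function factory. Everything here is an assumption made for illustration: resolve_columns is a hypothetical name, the rule that exactly one selector may be passed is not stated in the proposal, and the error messages are invented. Only DataFrame.filter and DataFrame.select_dtypes are existing pandas calls.

```python
from __future__ import annotations

from collections.abc import Callable, Hashable

import pandas as pd


def resolve_columns(
    col_name: Hashable | list[Hashable] | None = None,
    *,
    regex: str | None = None,
    dtype: str | type | None = None,
) -> Callable[[pd.DataFrame], pd.Series | pd.DataFrame]:
    """Hypothetical resolver: returns a function that selects from a DataFrame."""
    # Assumption: exactly one selector may be given per call.
    given = [arg is not None for arg in (col_name, regex, dtype)]
    if sum(given) != 1:
        raise TypeError("pass exactly one of col_name, regex, or dtype")

    def func(df: pd.DataFrame) -> pd.Series | pd.DataFrame:
        if regex is not None:
            return df.filter(regex=regex)  # DataFrame
        if dtype is not None:
            return df.select_dtypes(include=dtype)  # DataFrame
        if isinstance(col_name, list):
            missing = [c for c in col_name if c not in df.columns]
            if missing:
                raise ValueError(f"Columns not found: {missing}")
            return df[col_name]  # DataFrame
        if col_name not in df.columns:
            raise ValueError(f"Column {col_name!r} not found")
        return df[col_name]  # Series

    return func


df = pd.DataFrame({"price_a": [1.0, 2.0], "price_b": [3.0, 4.0], "qty": [1, 2]})
print(resolve_columns("qty")(df).tolist())
print(list(resolve_columns(["price_a", "qty"])(df).columns))
print(list(resolve_columns(regex="^price_")(df).columns))
print(list(resolve_columns(dtype="float64")(df).columns))
```

Only the single-label branch returns a Series; the other three return a DataFrame, which is the case the rest of the proposal depends on.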
Behavior
- Single column (current behavior)
pd.col("price")
Resolves to:
df["price"] # Series
- Multi-column list
pd.col(["price", "discount"])
Resolves to:
df[["price", "discount"]] # DataFrame
which enables operations like:
df.assign(
    total=pd.col(["price", "discount"]).sum(axis=1)
)
- Regex selection
pd.col(regex="^price_")
Resolves to:
df.filter(regex="^price_")
Example usage:
df.assign(
    total_price=pd.col(regex="^price_").sum(axis=1)
)
- dtype selection
pd.col(dtype="float64")
Resolves to:
df.select_dtypes(include="float64")
Why this works with the existing Expression system
The Expression evaluation pipeline already supports this:
result = expr._eval_expression(df)
If the base expression returns a DataFrame, chained operations like
.sum(axis=1) naturally return a Series, which is compatible with
assign, loc, and other pandas APIs.
Example:
df.assign(
    total=pd.col(["a", "b"]).sum(axis=1)
)
The final expression resolves to a Series and integrates seamlessly
with the existing execution flow.
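The mechanism can be sketched with a tiny deferred-call wrapper. This is not the pandas Expression class: DeferredExpr and cols are hypothetical names, and the sketch relies on DataFrame.assign calling any callable value with the DataFrame rather than on _eval_expression.

```python
from __future__ import annotations

from collections.abc import Callable
from typing import Any

import pandas as pd


class DeferredExpr:
    """Hypothetical, simplified stand-in for an Expression (not the pandas class)."""

    def __init__(self, func: Callable[[pd.DataFrame], Any]) -> None:
        self._func = func

    def __call__(self, df: pd.DataFrame) -> Any:
        # DataFrame.assign calls any callable value with the DataFrame.
        return self._func(df)

    def sum(self, **kwargs: Any) -> DeferredExpr:
        # Defer the method call until a DataFrame is available.
        return DeferredExpr(lambda df: self._func(df).sum(**kwargs))


def cols(names: list[str]) -> DeferredExpr:
    """Hypothetical multi-column selector whose base resolves to a DataFrame."""
    return DeferredExpr(lambda df: df[names])


df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
result = df.assign(total=cols(["a", "b"]).sum(axis=1))
print(result["total"].tolist())
```

The wrapper never inspects the type of the base result; the chained .sum(axis=1) simply runs on whatever the base returns, which is why a DataFrame base needs no special handling in this sketch.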
Alternative Solutions
Without this feature, users must fall back to less expressive patterns.
Lambda workaround
The most common workaround is using a lambda:
df.assign(
    total=lambda df: df[["price", "discount"]].sum(axis=1)
)
While functional, this approach has drawbacks:
- Lambdas are not reusable
- They cannot be easily inspected or composed
- They break the uniform expression style introduced by pd.col()
Manual column arithmetic
Users can sometimes express operations using separate expressions:
df.assign(
    total=pd.col("price") + pd.col("discount")
)
However, this only works for simple arithmetic and does not scale to
aggregations across many columns.
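To make the scaling problem concrete, the sketch below generalizes the chained "+" with functools.reduce on plain Series and compares it with a row-wise sum. The column names and data are invented for the example. It also shows one behavioral difference: "+" propagates missing values, while sum(axis=1) skips them by default.

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame(
    {
        "price_a": [1.0, 2.0],
        "price_b": [3.0, None],
        "price_c": [5.0, 6.0],
    }
)
price_cols = [c for c in df.columns if c.startswith("price_")]

# Chained "+" generalized to any number of columns.
added = reduce(operator.add, (df[c] for c in price_cols))

# Row-wise aggregation over the same columns (skipna=True by default).
summed = df[price_cols].sum(axis=1)

print(added.tolist())
print(summed.tolist())
```

The reduce form also has no counterpart for other aggregations such as mean or max, which is where a DataFrame-valued base becomes necessary.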
Precomputing outside assign
Another workaround is eager computation:
price_cols = df.filter(regex="^price_").sum(axis=1)
df.assign(total_price=price_cols)
This approach breaks the lazy evaluation pattern of assign,
since the computation occurs before the DataFrame pipeline is executed.
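A small example of where the eager form diverges: when an earlier step in the same chain creates a matching column, the precomputed Series cannot see it, while a callable evaluated inside assign can. The column price_c and the data are invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"price_a": [1.0, 2.0], "price_b": [3.0, 4.0]})

# Eager: computed from df as it exists now, before the pipeline runs.
eager_total = df.filter(regex="^price_").sum(axis=1)

# Lazy: the callable runs inside assign and sees the intermediate frame,
# including price_c, which is created earlier in the same chain.
result = (
    df.assign(price_c=lambda d: d["price_a"] * 2)
    .assign(
        eager_total=eager_total,
        lazy_total=lambda d: d.filter(regex="^price_").sum(axis=1),
    )
)
print(result[["eager_total", "lazy_total"]])
```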
Summary
All existing alternatives either:
- abandon the expression API
- require verbose lambdas
- perform eager evaluation outside the pipeline
A native multi-column pd.col() would provide a cleaner and more
consistent solution.
Additional Context
Many dataframe libraries already support multi-column expressions.
| Library | Multi-column selection | Regex selection | dtype selection |
|---|---|---|---|
| Polars | pl.col(["a","b"]) | pl.col("^price_") | pl.col(pl.Float64) |
| DuckDB | COLUMNS('^price_') | Yes | Yes |
| Spark | col("a") + col("b") | Partial | Limited |
| pandas (pd.col) | Not supported | Not supported | Not supported |
Because pd.col() was introduced to improve composability and readability
in pandas expressions, extending it to support multi-column references would
make it significantly more useful in real-world data workflows.
Typical real-world workflow
df.assign(
    subtotal=pd.col("price") * pd.col("qty"),
    tax=pd.col("subtotal") * 0.1,
    total=pd.col(["subtotal", "tax"]).sum(axis=1)
)
The final step cannot currently be expressed with pd.col(),
forcing users to revert to lambdas.
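For reference, the same workflow written with lambdas throughout runs on plain pandas. This sketch uses invented data and replaces every pd.col() call with a lambda so that it does not depend on the proposed feature.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "qty": [2, 3]})

result = df.assign(
    subtotal=lambda d: d["price"] * d["qty"],
    tax=lambda d: d["subtotal"] * 0.1,
    # The step the proposal would write as pd.col(["subtotal", "tax"]).sum(axis=1).
    total=lambda d: d[["subtotal", "tax"]].sum(axis=1),
)
print(result)
```

Each keyword is evaluated in order, so later lambdas can refer to columns created by earlier ones within the same assign call.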
Test coverage gap
Current tests for pd.col() focus on:
- arithmetic operators
- logical operators
- accessor chaining (.str, .dt)
- conditional expressions
There are no tests covering:
pd.col(["a", "b"])
pd.col(regex="^pattern")
pd.col(dtype="float64")
Adding support for these would likely require corresponding
test cases to ensure correct behavior when expressions
resolve to a DataFrame instead of a Series.