ENH: Proposal for pd.col() for multi-column and regex column selection #64627

@lukhi-laksh

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

pd.col() currently supports selecting only a single column, even though its
chainable expression style naturally suggests that multi-column operations
should also work.

The function is defined as:

def col(col_name: Hashable) -> Expression:

Internally it constructs an Expression whose evaluation function does:

def func(df: DataFrame) -> Series:
    if col_name not in df.columns:
        raise ValueError(...)
    return df[col_name]

This always returns a Series, meaning pd.col() is fundamentally
single-column oriented.

Inconsistent behavior with lists

Passing a list currently does not raise a clear error:

pd.col(["price", "discount"])

The expression is created successfully, but when evaluated it produces:

df[["price", "discount"]]   # returns a DataFrame

This creates inconsistent behavior when chaining operations such as:

df.assign(total=pd.col(["price", "discount"]).sum())

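because the expression now operates on a DataFrame instead of a Series.
A bare .sum() on a DataFrame reduces down each column (axis=0) and returns
one value per column rather than one per row, which is not what a user
assigning a new column expects.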
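To make the mismatch concrete, here is a minimal sketch that uses plain DataFrame indexing in place of pd.col(). The sample data is made up for illustration. It compares the default column-wise reduction with a row-wise one and shows what assign does with each.

```python
import pandas as pd

# Illustrative data only; column names follow the examples in this issue.
df = pd.DataFrame({"price": [10.0, 20.0], "discount": [1.0, 2.0]})

selected = df[["price", "discount"]]   # a DataFrame, not a Series

col_totals = selected.sum()            # default axis=0: one value per column
row_totals = selected.sum(axis=1)      # axis=1: one value per row

# The column-wise result is indexed by column name, so assign() aligns it
# against the row labels 0 and 1, finds no matches, and fills with NaN.
misaligned = df.assign(total=col_totals)
aligned = df.assign(total=row_totals)

print(col_totals.index.tolist())
print(misaligned["total"].isna().all())
print(aligned["total"].tolist())
```

The row-wise form is the one a user almost certainly wants here, which is why the examples later in this proposal pass axis=1 explicitly.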

Missing regex selection

There is currently no way to express pattern-based column selection inside
pd.col():

pd.col("^price_", regex=True)

This fails with:

TypeError: col() got an unexpected keyword argument 'regex'

Users must instead rely on eager operations such as:

df.filter(regex="^price_")

which breaks the composable expression style that pd.col() aims to provide.

Summary of limitations

Currently pd.col() does not support:

  • Multi-column selection
  • Regex-based column selection
  • dtype-based column selection
  • Any expression whose base resolves to a DataFrame

This limits the usefulness of the expression API for many real-world workflows.

Feature Description

Extend pd.col() so it can reference multiple columns and column groups
while still returning an Expression.

Proposed signature

def col(
    col_name: Hashable | list[Hashable] | None = None,
    *,
    regex: str | None = None,
    dtype: str | type | None = None,
) -> Expression:

Behavior

  1. Single column (current behavior)

pd.col("price")

Resolves to:

df["price"]   # Series

  2. Multi-column list

pd.col(["price", "discount"])

Resolves to:

df[["price", "discount"]]   # DataFrame

which enables operations like:

df.assign(
    total=pd.col(["price", "discount"]).sum(axis=1)
)

  3. Regex selection

pd.col(regex="^price_")

Resolves to:

df.filter(regex="^price_")

Example usage:

df.assign(
    total_price=pd.col(regex="^price_").sum(axis=1)
)

  4. dtype selection

pd.col(dtype="float64")

Resolves to:

df.select_dtypes(include="float64")

Why this works with the existing Expression system

The Expression evaluation pipeline already supports this:

result = expr._eval_expression(df)

If the base expression returns a DataFrame, chained operations like
.sum(axis=1) naturally return a Series, which is compatible with
assign, loc, and other pandas APIs.

Example:

df.assign(
    total=pd.col(["a", "b"]).sum(axis=1)
)

The final expression resolves to a Series and integrates seamlessly
with the existing execution flow.
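The argument can be illustrated with a toy deferred expression. ToyExpression and toy_col below are hypothetical names written for this sketch and are much simpler than pandas' internal Expression class. The only pandas behavior relied on is that assign accepts any callable and passes it the DataFrame.

```python
import pandas as pd


class ToyExpression:
    """Hypothetical miniature expression: wraps a function of a DataFrame."""

    def __init__(self, func):
        self._func = func

    def __call__(self, df):
        # assign() calls any callable value with the DataFrame.
        return self._func(df)

    def sum(self, **kwargs):
        # Defer the reduction until the base has been evaluated.
        return ToyExpression(lambda df: self._func(df).sum(**kwargs))


def toy_col(names):
    """Hypothetical: a str resolves to a Series, a list to a DataFrame."""
    return ToyExpression(lambda df: df[names])


# Illustrative data only.
df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

expr = toy_col(["a", "b"]).sum(axis=1)   # DataFrame base, Series result
result = df.assign(total=expr)

print(result["total"].tolist())
```

Nothing in the chaining step needs to know whether the base is a Series or a DataFrame, so long as the method being chained exists on both.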

Alternative Solutions

Without this feature, users must fall back to less expressive patterns.

Lambda workaround

The most common workaround is using a lambda:

df.assign(
    total=lambda df: df[["price", "discount"]].sum(axis=1)
)

While functional, this approach has drawbacks:

  • Inline lambdas are awkward to reuse across call sites
  • They cannot be easily inspected or composed
  • They break the uniform expression style introduced by pd.col()

Manual column arithmetic

Users can sometimes express operations using separate expressions:

df.assign(
    total=pd.col("price") + pd.col("discount")
)

However, this only works for simple arithmetic and does not scale to
aggregations across many columns.
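Besides verbosity, chained addition and a row-wise sum differ in how they treat missing values. The sketch below uses plain column access in place of pd.col() and made-up data. Chaining + propagates NaN, while sum(axis=1) skips NaN by default.

```python
import functools
import operator

import pandas as pd

# Illustrative data only; one missing value in column "a".
df = pd.DataFrame(
    {"a": [1.0, float("nan")], "b": [2.0, 3.0], "c": [4.0, 5.0]}
)
cols = ["a", "b", "c"]

# What "a + b + c" amounts to when written for an arbitrary column list.
chained = functools.reduce(operator.add, (df[c] for c in cols))

# Row-wise aggregation over the same columns (skipna=True by default).
reduced = df[cols].sum(axis=1)

print(chained.isna().tolist())
print(reduced.tolist())
```

So the two spellings are not interchangeable when the data contains missing values, which is one more reason a multi-column selection is useful in its own right.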

Precomputing outside assign

Another workaround is eager computation:

price_total = df.filter(regex="^price_").sum(axis=1)

df.assign(total_price=price_total)

This approach breaks the lazy evaluation pattern of assign,
since the computation occurs before the DataFrame pipeline is executed.
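A small sketch of why the timing matters, using made-up data and a lambda to stand in for a deferred expression. When an earlier pipeline step changes the columns, the precomputed value is stale, whereas a callable sees the frame as it exists at the assign step.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"price_a": [1.0, 2.0], "price_b": [3.0, 4.0]})

# Eager: computed from the original df, before the pipeline doubles the prices.
eager_total = df.filter(regex="^price_").sum(axis=1)
eager = df.mul(2).assign(total_price=eager_total)

# Deferred: the callable receives the already-doubled frame.
deferred = df.mul(2).assign(
    total_price=lambda d: d.filter(regex="^price_").sum(axis=1)
)

print(eager["total_price"].tolist())
print(deferred["total_price"].tolist())
```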

Summary

All existing alternatives either:

  • abandon the expression API
  • require verbose lambdas
  • perform eager evaluation outside the pipeline

A native multi-column pd.col() would provide a cleaner and more
consistent solution.

Additional Context

Many dataframe libraries already support multi-column expressions.

| Library         | Multi-column selection | Regex selection      | dtype selection    |
|-----------------|------------------------|----------------------|--------------------|
| Polars          | pl.col(["a","b"])      | pl.col("^price_.*$") | pl.col(pl.Float64) |
| DuckDB          | COLUMNS('^price_')     | Yes                  | Yes                |
| Spark           | col("a") + col("b")    | Partial              | Limited            |
| pandas (pd.col) | Not supported          | Not supported        | Not supported      |

Because pd.col() was introduced to improve composability and readability
in pandas expressions, extending it to support multi-column references would
make it significantly more useful in real-world data workflows.

Typical real-world workflow

df.assign(
    subtotal=pd.col("price") * pd.col("qty"),
    tax=pd.col("subtotal") * 0.1,
    total=pd.col(["subtotal", "tax"]).sum(axis=1)
)

The final step cannot currently be expressed as a single multi-column
pd.col() selection, so users must either add the columns one by one or
revert to a lambda.
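For comparison, this is roughly how the same workflow has to be written today with lambdas. The data is made up for illustration. It relies on assign evaluating keyword arguments in order, so later entries can read columns created by earlier ones.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"price": [10.0, 20.0], "qty": [2, 3]})

out = df.assign(
    subtotal=lambda d: d["price"] * d["qty"],
    tax=lambda d: d["subtotal"] * 0.1,
    # The multi-column step: select two derived columns, then reduce row-wise.
    total=lambda d: d[["subtotal", "tax"]].sum(axis=1),
)

print(out[["subtotal", "tax", "total"]])
```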

Test coverage gap

Current tests for pd.col() focus on:

  • arithmetic operators
  • logical operators
  • accessor chaining (.str, .dt)
  • conditional expressions

There are no tests covering:

pd.col(["a", "b"])
pd.col(regex="^pattern")
pd.col(dtype="float64")

Adding support for these would likely require corresponding
test cases to ensure correct behavior when expressions
resolve to a DataFrame instead of a Series.
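As a rough sketch of what such tests might assert, the checks below exercise the underlying pandas operations directly, since the proposed pd.col() forms do not exist yet. check_rowwise_sum is a hypothetical helper and the data is made up. Real tests would go through pd.col() and follow the conventions of the pandas test suite.

```python
import pandas as pd
import pandas.testing as tm


def check_rowwise_sum(selected: pd.DataFrame, df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: a DataFrame base must reduce to an aligned Series."""
    assert isinstance(selected, pd.DataFrame)
    reduced = selected.sum(axis=1)
    assert isinstance(reduced, pd.Series)
    tm.assert_index_equal(reduced.index, df.index)
    return reduced


# Illustrative data only; "label" is a non-numeric column that must be excluded.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "label": ["x", "y"]})

from_list = check_rowwise_sum(df[["a", "b"]], df)
from_regex = check_rowwise_sum(df.filter(regex="^[ab]$"), df)
from_dtype = check_rowwise_sum(df.select_dtypes(include="float64"), df)

# All three selection styles should agree on the row-wise totals.
expected = pd.Series([4.0, 6.0])
for result in (from_list, from_regex, from_dtype):
    tm.assert_series_equal(result, expected)
```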

Metadata

    Labels

    Enhancement, Needs Discussion (Requires discussion from core team before further action), expressions (pd.eval, query, pd.col)
