ENH: Proposal for pd.col() for multi-column and regex column selection

### Feature Type

- [x] Adding new functionality to pandas

- [ ] Changing existing functionality in pandas

- [ ] Removing existing functionality in pandas


### Problem Description

`pd.col()` currently only supports selecting a **single column**, even though its
behavior allows expressions to be chained in ways that naturally suggest
multi-column operations should also work.

The function is defined as:

```python
def col(col_name: Hashable) -> Expression:
```

Internally it constructs an `Expression` whose evaluation function does:

```python
def func(df: DataFrame) -> Series:
    if col_name not in df.columns:
        raise ValueError(...)
    return df[col_name]
```

This always returns a **Series**, meaning `pd.col()` is fundamentally
single-column oriented.

### Inconsistent behavior with lists

Passing a list currently does not raise a clear error:

```python
pd.col(["price", "discount"])
```

The expression is created successfully, but when evaluated it produces:

```python
df[["price", "discount"]]   # returns a DataFrame
```

This creates inconsistent behavior when chaining operations such as:

```python
df.assign(total=pd.col(["price", "discount"]).sum())
```

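because the expression now operates on a **DataFrame instead of a Series**.
Without `axis=1`, `.sum()` on a DataFrame reduces each column and returns a
Series indexed by column label rather than one value per row, so the result
does not line up with the rows of `df` as a user would expect.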
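
To make the mismatch concrete, here is a small sketch in plain pandas that works directly on the DataFrame such an expression would resolve to. It does not call `pd.col()`, and the sample values are made up for illustration.

```python
import pandas as pd

# Illustrative data; the values are assumptions made for this sketch.
df = pd.DataFrame({"price": [10.0, 20.0], "discount": [1.0, 2.0]})

# Default axis=0: one total per column, indexed by column label.
column_totals = df[["price", "discount"]].sum()

# axis=1: one total per row, indexed like df.
row_totals = df[["price", "discount"]].sum(axis=1)

# Assigning the per-column result aligns on index labels, so no row of df
# matches and the new column is filled with NaN.
misaligned = df.assign(total=column_totals)
```

The per-row form (`axis=1`) is the one that fits `assign`, which is why the proposal below uses it throughout.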

### Missing regex selection

There is currently no way to express pattern-based column selection inside
`pd.col()`:

```python
pd.col("^price_", regex=True)
```

This fails with:

```python
TypeError: col() got an unexpected keyword argument 'regex'
```

Users must instead rely on eager operations such as:

```python
df.filter(regex="^price_")
```

which breaks the composable expression style that `pd.col()` aims to provide.

### Summary of limitations

Currently `pd.col()` does **not support**:

- Multi-column selection
- Regex-based column selection
- dtype-based column selection
- Any expression whose base resolves to a DataFrame

This limits the usefulness of the expression API for many real-world workflows.


### Feature Description

Extend `pd.col()` so it can reference **multiple columns and column groups**
while still returning an `Expression`.

### Proposed signature

```python
def col(
    col_name: Hashable | list[Hashable] | None = None,
    *,
    regex: str | None = None,
    dtype: str | type | None = None,
) -> Expression:
```

### Behavior

1. **Single column (current behavior)**

```python
pd.col("price")
```

Resolves to:

```python
df["price"]   # Series
```

2. **Multi-column list**

```python
pd.col(["price", "discount"])
```

Resolves to:

```python
df[["price", "discount"]]   # DataFrame
```

which enables operations like:

```python
df.assign(
    total=pd.col(["price", "discount"]).sum(axis=1)
)
```

3. **Regex selection**

```python
pd.col(regex="^price_")
```

Resolves to:

```python
df.filter(regex="^price_")
```

Example usage:

```python
df.assign(
    total_price=pd.col(regex="^price_").sum(axis=1)
)
```

4. **dtype selection**

```python
pd.col(dtype="float64")
```

Resolves to:

```python
df.select_dtypes(include="float64")
```
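
As a rough sketch of how the four cases above could be dispatched, the helper below resolves a selector against a DataFrame using only existing pandas calls (`df[...]`, `DataFrame.filter`, `DataFrame.select_dtypes`). The name `resolve_selection` is hypothetical, and the rule that exactly one selector must be given is an assumption, since the proposal does not say how selectors should combine.

```python
from __future__ import annotations

from collections.abc import Hashable

import pandas as pd


def resolve_selection(
    df: pd.DataFrame,
    col_name: Hashable | list[Hashable] | None = None,
    *,
    regex: str | None = None,
    dtype: str | type | None = None,
) -> pd.Series | pd.DataFrame:
    """Hypothetical helper: resolve one selector against ``df``."""
    provided = sum(arg is not None for arg in (col_name, regex, dtype))
    if provided != 1:
        # Assumption: combining selectors is left undefined by the proposal.
        raise TypeError("pass exactly one of col_name, regex, or dtype")
    if regex is not None:
        return df.filter(regex=regex)  # DataFrame of matching columns
    if dtype is not None:
        return df.select_dtypes(include=dtype)  # DataFrame of matching dtypes
    if isinstance(col_name, list):
        missing = [name for name in col_name if name not in df.columns]
        if missing:
            raise ValueError(f"columns not found: {missing}")
        return df[col_name]  # DataFrame
    if col_name not in df.columns:
        raise ValueError(f"column not found: {col_name!r}")
    return df[col_name]  # Series, as today


df = pd.DataFrame({"price_a": [1.0, 2.0], "price_b": [3.0, 4.0], "qty": [1, 2]})
single = resolve_selection(df, "qty")
listed = resolve_selection(df, ["price_a", "qty"])
by_regex = resolve_selection(df, regex="^price_")
by_dtype = resolve_selection(df, dtype="float64")
```

In an actual implementation this logic would presumably live inside the function wrapped by the `Expression`, so that resolution still happens only at evaluation time.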

### Why this works with the existing Expression system

The `Expression` evaluation pipeline already supports this:

```python
result = expr._eval_expression(df)
```

If the base expression returns a **DataFrame**, chained operations like
`.sum(axis=1)` naturally return a **Series**, which is compatible with
`assign`, `loc`, and other pandas APIs.

Example:

```python
df.assign(
    total=pd.col(["a", "b"]).sum(axis=1)
)
```

The final expression resolves to a Series and integrates seamlessly
with the existing execution flow.
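
The idea can be illustrated without touching pandas internals. The toy class below is not pandas' `Expression`; it is a hypothetical stand-in (`DeferredSelection` and `select` are names invented for this sketch) that relies only on the documented fact that `DataFrame.assign` calls any callable value with the DataFrame being built.

```python
from __future__ import annotations

from typing import Any, Callable

import pandas as pd


class DeferredSelection:
    """Toy stand-in for a deferred expression; not pandas' Expression class."""

    def __init__(self, func: Callable[[pd.DataFrame], Any]) -> None:
        self._func = func

    def sum(self, **kwargs: Any) -> DeferredSelection:
        # Record the reduction; nothing is computed until evaluation.
        return DeferredSelection(lambda df: self._func(df).sum(**kwargs))

    def __call__(self, df: pd.DataFrame) -> Any:
        # DataFrame.assign invokes callables with the frame being built.
        return self._func(df)


def select(columns: list[str]) -> DeferredSelection:
    """Hypothetical multi-column selector whose base resolves to a DataFrame."""
    return DeferredSelection(lambda df: df[columns])


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
result = df.assign(total=select(["a", "b"]).sum(axis=1))
```

The base selection yields a DataFrame, the chained `.sum(axis=1)` reduces it to a Series, and `assign` receives a value it already knows how to handle.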

### Alternative Solutions


Without this feature, users must fall back to less expressive patterns.

### Lambda workaround

The most common workaround is using a lambda:

```python
df.assign(
    total=lambda df: df[["price", "discount"]].sum(axis=1)
)
```

While functional, this approach has drawbacks:

- Inline lambdas are awkward to reuse across pipelines
- They cannot be easily inspected or composed
- They break the uniform expression style introduced by `pd.col()`


### Manual column arithmetic

Users can sometimes express operations using separate expressions:

```python
df.assign(
    total=pd.col("price") + pd.col("discount")
)
```

However, this only works for simple arithmetic and does not scale to
aggregations across many columns.


### Precomputing outside `assign`

Another workaround is eager computation:

```python
price_cols = df.filter(regex="^price_").sum(axis=1)

df.assign(total_price=price_cols)
```

This approach gives up the **deferred evaluation** that `assign` offers for
callables: the sum is computed up front against the original `df`, so it
cannot see columns or rows produced by earlier steps of a method chain.
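
A short sketch of that difference, using only existing pandas calls and made-up data: the eager total is fixed before the chain runs, while the deferred version is evaluated against the intermediate frame and therefore picks up a column added earlier in the chain.

```python
import pandas as pd

# Illustrative data; the values are assumptions made for this sketch.
df = pd.DataFrame({"price_a": [1.0, 2.0, 3.0], "price_b": [10.0, 20.0, 30.0]})

# Eager: evaluated immediately, against the original df only.
eager_total = df.filter(regex="^price_").sum(axis=1)

result = (
    df.assign(price_c=lambda d: d["price_a"] * 2)
    .assign(
        total_eager=eager_total,  # does not include price_c
        total_deferred=lambda d: d.filter(regex="^price_").sum(axis=1),
    )
)
```

Here `total_eager` sums two columns and `total_deferred` sums three, even though both use the same regex.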


### Summary

All existing alternatives either:

- abandon the expression API
- require verbose lambdas
- perform eager evaluation outside the pipeline

A native multi-column `pd.col()` would provide a cleaner and more
consistent solution.

### Additional Context


Many dataframe libraries already support multi-column expressions.

| Library | Multi-column selection | Regex selection | dtype selection |
|-------|-----------------------|---------------|---------------|
| Polars | `pl.col(["a","b"])` | `pl.col("^price_.*$")` | `pl.col(pl.Float64)` |
| DuckDB | `COLUMNS('^price_')` | Yes | Yes |
| Spark | `col("a") + col("b")` | Partial | Limited |
| pandas (`pd.col`) | Not supported | Not supported | Not supported |

Because `pd.col()` was introduced to improve composability and readability
in pandas expressions, extending it to support multi-column references would
make it significantly more useful in real-world data workflows.

### Typical real-world workflow

```python
df.assign(
    subtotal=pd.col("price") * pd.col("qty"),
    tax=pd.col("subtotal") * 0.1,
    total=pd.col(["subtotal", "tax"]).sum(axis=1)
)
```

The final step cannot currently be expressed with `pd.col()`,
forcing users to revert to lambdas.

### Test coverage gap

Current tests for `pd.col()` focus on:

- arithmetic operators
- logical operators
- accessor chaining (`.str`, `.dt`)
- conditional expressions

There are **no tests covering**:

```python
pd.col(["a", "b"])
pd.col(regex="^pattern")
pd.col(dtype="float64")
```

Adding support for these would likely require corresponding
test cases to ensure correct behavior when expressions
resolve to a DataFrame instead of a Series.
