RFC-0025 Derived column #61

Open
ScrapCodes wants to merge 3 commits into prestodb:main from ScrapCodes:derived-column-spec

@ScrapCodes ScrapCodes commented Apr 24, 2026

What is a derived column?

A column created by applying a SQL expression or a UDF to an existing column in a table.

Why do we need it, when we can always apply a UDF to a column during a project, filter, or join?

A derived column does consume O(N) additional storage, where N is the number of rows in the table. We still need derived columns because the performance benefits outweigh the cost of that extra storage. Consider the following use case:

A compute engine like Presto can easily push down a filter predicate such as `SELECT col1, col2 FROM T1 WHERE col1 = 'constant_value'`. Pushing down `WHERE col1 = 'constant_value'` prunes the rows that the TableScan has to read. This no longer holds when a UDF is involved in the filter predicate, for example `SELECT col1, col2 FROM T1 WHERE lower(col1) = 'constant_value'`. The optimizer can still push the predicate down, but the predicate cannot be evaluated against lower and upper bound metrics such as Iceberg manifest statistics and Parquet row group statistics, because those bounds describe col1 rather than lower(col1). As a result, we end up scanning a large number of rows.

So, by supporting push down of certain predicates (with UDFs in them) and reducing the amount of data scanned, derived columns can bring large performance improvements. Derived columns are already proven in RDBMS systems, e.g. DB2 [1], and we now intend to bring them to Presto.
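
To make the pruning argument concrete, here is a minimal Python sketch. It is not Presto or Iceberg code: the `RowGroup` class, the `with_derived` and `groups_to_scan` helpers, and the sample values are all hypothetical, and each "row group" is just a list of strings whose min/max stand in for Parquet or Iceberg bounds. It only illustrates why bounds on col1 cannot be used for `lower(col1) = 'melon'`, while bounds on a persisted derived column can.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group or Iceberg data file."""
    col1: List[str]
    # Persisted derived column lower(col1); None if the table has no derived column.
    col1_lower: Optional[List[str]] = None


def with_derived(values: List[str]) -> RowGroup:
    """Build a row group and materialize the derived column lower(col1)."""
    return RowGroup(col1=values, col1_lower=[v.lower() for v in values])


def may_contain(values: List[str], constant: str) -> bool:
    """Min/max check: a group can be skipped when constant is outside [min, max]."""
    return min(values) <= constant <= max(values)


def groups_to_scan(groups: List[RowGroup], constant: str, use_derived: bool) -> List[int]:
    """Indexes of the groups that must be read for lower(col1) = constant."""
    selected = []
    for index, group in enumerate(groups):
        if use_derived and group.col1_lower is not None:
            # Bounds of the derived column describe lower(col1) directly.
            if may_contain(group.col1_lower, constant):
                selected.append(index)
        else:
            # Bounds of col1 say nothing reliable about lower(col1): scan everything.
            selected.append(index)
    return selected


groups = [
    with_derived(["Apple", "BANANA"]),
    with_derived(["MELON", "Mango"]),
    with_derived(["Zebra", "yak"]),
]

scanned_with_derived = groups_to_scan(groups, "melon", use_derived=True)
scanned_without_derived = groups_to_scan(groups, "melon", use_derived=False)
```

In this toy data set only the second group is read when the derived column's bounds are used, while all three groups are read without it. Note that `may_contain(groups[1].col1, "melon")` is false even though that group holds a matching row ("MELON"), which is why the bounds of col1 itself cannot safely be reused for the `lower(col1)` predicate.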

@prestodb-ci prestodb-ci added the from:IBM PRs from IBM label Apr 24, 2026
@prestodb-ci prestodb-ci requested review from a team, BryanCutler and infvg and removed request for a team April 24, 2026 07:35
@ScrapCodes ScrapCodes marked this pull request as draft April 24, 2026 07:35
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch 4 times, most recently from 474cf06 to 221e8c0 Compare April 24, 2026 11:37
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch from 221e8c0 to 8690c4e Compare May 4, 2026 09:02
@ScrapCodes ScrapCodes changed the title [WIP] RFC-0025 Derived column RFC-0025 Derived column May 4, 2026
@ScrapCodes ScrapCodes marked this pull request as ready for review May 4, 2026 09:58
@prestodb-ci prestodb-ci requested review from a team, infvg and wanglinsong and removed request for a team May 4, 2026 09:58
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch 3 times, most recently from 3bdd1ef to 2448a60 Compare May 4, 2026 12:23
@jja725 jja725 self-requested a review May 5, 2026 18:25
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch from 2448a60 to 8c6c4a4 Compare May 6, 2026 16:39
@jja725 jja725 left a comment

Agreed: how writes work will be the main concern here, given the need for compatibility with all the engines.

@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch from 8c6c4a4 to 03935be Compare May 7, 2026 16:11
@ScrapCodes
Contributor Author

@tdcmeehan has volunteered to be a co-author! Yay!

@ScrapCodes ScrapCodes requested a review from tdcmeehan May 19, 2026 06:47
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch from 82feae1 to 334810b Compare May 19, 2026 10:35
of extra storage it consumes. Let us understand with the following use case example:

A compute engine like Presto can easily push down a filter predicate e.g. `SELECT col1, col2 FROM T1 WHERE col1 = 'constant_value'`, this allows for pruning the number of
rows required for TableScan by applying the filter WHERE col1 = 'constant_value'. This is not true when a UDF is involved in the filter predicate, let us take an
Contributor

Nit: It's not necessarily a "UDF", but any SQL expression involving the columns of the table.

Contributor Author

Please correct my understanding here. Do you mean that a simple expression, for example c1 > 100, will also be affected? I would guess not. So is it always expressions involving UDFs (in Presto's case, an IF statement is also a UDF)?

result when the derived column feature flags are enabled.
3. We can provide a command to sync derived columns when they do go out of sync. (Similar to Materialized views REFRESH).

### What is allowed in derived column expressions
Contributor

Maybe not a big deal, but are IS NULL and IS NOT NULL allowed for derived columns?

Contributor Author

Yes

2. Subquery expressions.
3. IN queries and EXISTS queries.

### Table properties
Contributor

Table properties are Iceberg-specific. Do you have a design for Hive as well?

Contributor Author

Good point. As I have mentioned, the current design focuses only on Iceberg, and all other connectors are covered under Future Work.

Supporting Hive will be interesting work, and may be the next phase of the RFC.

Alter Table:

```sql
ALTER TABLE [ IF EXISTS ] name ADD COLUMN [ IF NOT EXISTS ] column_name data_type [GENERATED ALWAYS] AS ( <expression> ) [VIRTUAL | PERSISTENT]
```
Contributor

What is the difference between VIRTUAL and PERSISTENT? Can you please elaborate?

Contributor Author

I will add comments!


```sql
ALTER TABLE [ IF EXISTS ] name ADD COLUMN [ IF NOT EXISTS ] column_name data_type [GENERATED ALWAYS] AS ( <expression> ) [VIRTUAL | PERSISTENT]
```
Contributor

Do the derived column expressions show up in DESCRIBE TABLE / SHOW COLUMNS output? If yes, what will the syntax of the expressions be?

Contributor Author

Will update this soon.

"derived-columns.spec.expression.json" = '{
"expressionSpecList" : [ {
"derivedColumnType" : "PERSISTENT",
"derivedColumnExpression" : "if(lower(c2)=''c'',''g'',''e'')",
Contributor

How do you handle a type mismatch between the column type the user specifies and the expression's return type?

Contributor Author

If I understand correctly, the expression's return type is interpreted, so this situation should not occur.

"derived-columns.spec.expression.json" = '{
"expressionSpecList" : [ {
"derivedColumnType" : "PERSISTENT",
"derivedColumnExpression" : "if(lower(c2)=''c'',''g'',''e'')",
Contributor

This seems like a very contrived expression. Can we use something simple, like just "lower(column)", instead?

Contributor Author

I had a purpose in mind for this contrived expression: I wanted to show how sub-expression matching works. There is a usage guide section where I have covered what users will generally see as an example.
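
For readers following this thread, the idea of sub-expression matching can be sketched roughly as follows. This is only an illustration under stated assumptions, not the matching algorithm of the RFC or of Presto's optimizer: expressions are modeled as nested Python tuples of the form `(function, arg, ...)`, and the `rewrite` helper and the sample predicate are hypothetical. The sketch replaces any sub-tree of a predicate that is structurally equal to a derived column's expression with a reference to that column.

```python
from typing import Dict, Tuple, Union

# A column name or literal is a str; a call or operator is (name, arg1, arg2, ...).
Expr = Union[str, Tuple]


def rewrite(expr: Expr, derived: Dict[str, Expr]) -> Expr:
    """Replace sub-expressions equal to a derived column's expression with that column."""
    for column, definition in derived.items():
        if expr == definition:
            return column
    if isinstance(expr, tuple):
        name, *args = expr
        # Keep the function/operator name and rewrite each argument recursively.
        return (name, *(rewrite(arg, derived) for arg in args))
    return expr


# Derived column from the example in this thread: if(lower(c2) = 'c', 'g', 'e').
derived = {
    "c2_derived": ("if", ("=", ("lower", "c2"), "'c'"), "'g'", "'e'"),
}

# Hypothetical predicate: if(lower(c2) = 'c', 'g', 'e') = 'g' AND c1 > 100
predicate = (
    "and",
    ("=", ("if", ("=", ("lower", "c2"), "'c'"), "'g'", "'e'"), "'g'"),
    (">", "c1", "100"),
)

rewritten = rewrite(predicate, derived)
```

Here `rewritten` is the tuple form of `c2_derived = 'g' AND c1 > 100`. A predicate that contains only a fragment of the derived expression, such as `lower(c2) = 'c'`, is left unchanged, because this sketch matches whole sub-trees only.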


### Table properties

The following table properties are added.
Contributor

Are these in the metadata.json file?

Contributor Author

Yes, will include examples.

{
"udfSpecList" : [ {
"derivedColumnType" : "PERSISTENT",
"derivedColumnExpression" : "SQL expression",
Contributor

Can you give more info about the SQL dialect of this expression? It seems like you want at least Presto and Spark to understand it.

Contributor

To be clear, deriving a common subset of expressions that are interpretable by both Spark and Presto is hard and likely outside of the scope of this RFC. I think the most straightforward thing is to treat them like views, which defer on cross-platform interpretability and force any consumer of the view SQL to understand Presto's dialect. Cross platform expressions can be considered an orthogonal yet important task.


## Alternatives considered.

### 1. Use table properties to configure derived column information (instead of alter column syntax).
Contributor

I thought this was your actual solution. This writeup gives the impression that it's just an alternative considered. Can you please elaborate?

Contributor Author

This was my previous solution, where a user would manually enter the table properties. That is no longer the case: we now support CREATE TABLE and ALTER TABLE syntax to achieve this. Internally, table properties are still used, but they are not editable by end users.

CREATE TABLE iceberg.perf_test.test2 (
"c1" bigint,
"c2" varchar,
"c2_derived" varchar,
Contributor Author

@ScrapCodes ScrapCodes May 20, 2026

Will soon have an update to showQueryRewrite with

CREATE TABLE iceberg.perf_test.test2 (
    "c1" bigint,
    "c2" varchar,
    "c2_derived" varchar AS if(lower(c2) = 'c', 'g', 'e') PERSISTENT,
