RFC-0025 Derived column by ScrapCodes · Pull Request #61 · prestodb/rfcs

ScrapCodes · 2026-04-24T07:35:00Z

What is a derived column?

A column created by applying a SQL expression or a UDF to an existing column in a table.

Why do we need that, since we can always apply a UDF to a column during project, filter or join?

Indeed, a derived column consumes O(N) storage, where N is the number of rows in the table. We still need derived columns because the performance benefits outweigh the extra storage they consume. Consider the following use case:

A compute engine like Presto can easily push down a filter predicate, e.g. `SELECT col1, col2 FROM T1 WHERE col1 = 'constant_value'`. This prunes the rows a TableScan has to read by applying the filter `WHERE col1 = 'constant_value'`. The same is not true when a UDF is involved in the filter predicate, for example `SELECT col1, col2 FROM T1 WHERE lower(col1) = 'constant_value'`. Although the optimizer can still push the predicate down, the predicate cannot be evaluated against lower- and upper-bound metrics such as Iceberg manifest statistics and Parquet row group statistics. As a result, we end up scanning a large number of rows.

So, by supporting pushdown of certain predicates (those with UDFs in them) and reducing the amount of data scanned, derived columns can bring significant performance improvements. Derived columns are already proven in RDBMS systems, e.g. DB2 [1], and we now intend to bring them to Presto.
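To make the statistics argument concrete, here is a minimal sketch in Python. It is a toy model, not Presto, Iceberg, or Parquet code: the row groups, the `col1_lower` derived column, and the helper names are all assumptions made for illustration. It shows that min/max bounds recorded for `col1` cannot answer a predicate on `lower(col1)`, while bounds recorded for a persisted derived column can.

```python
# Toy model: each row group holds col1 values; min/max bounds are kept per group.
ROW_GROUPS = [
    ["Apple", "Banana", "CHERRY"],
    ["apple", "banana", "cherry"],
    ["Mango", "PEACH", "Zebra"],
]


def min_max(values):
    """Lower and upper bound, as a writer might record them per row group."""
    return min(values), max(values)


def candidate_groups(stats, target):
    """Indexes of row groups whose [min, max] range may contain target."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= target <= hi]


def matching_groups(target):
    """Ground truth for lower(col1) = target, found by reading every row."""
    return [
        i for i, group in enumerate(ROW_GROUPS)
        if any(value.lower() == target for value in group)
    ]


col1_stats = [min_max(group) for group in ROW_GROUPS]
# Hypothetical persisted derived column col1_lower = lower(col1), with its own bounds.
derived_stats = [min_max([v.lower() for v in group]) for group in ROW_GROUPS]

target = "cherry"
plain = candidate_groups(col1_stats, target)       # WHERE col1 = 'cherry'
derived = candidate_groups(derived_stats, target)  # WHERE col1_lower = 'cherry'
truth = matching_groups(target)                    # WHERE lower(col1) = 'cherry'

print(plain)    # [1]
print(derived)  # [0, 1]
print(truth)    # [0, 1]
```

Reusing the `col1` bounds for `lower(col1) = 'cherry'` would keep only group 1 and silently drop group 0, which contains `CHERRY`. That is why an engine has to fall back to scanning every group unless bounds exist for the derived values themselves.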

jja725

Agreed that how writes work will be the main concern here, along with compatibility across all the engines.

ScrapCodes · 2026-05-07T16:17:55Z

@tdcmeehan has volunteered to be a co-author! Yay!

aditi-pandit · 2026-05-19T18:58:44Z

+of extra storage it consumes. Let us understand with the following use case example:
+
+A compute engine like Presto can easily push down a filter predicate e.g. `SELECT col1, col2, FROM table T1 WHERE col1='constant_value'` , this allows for pruning the number of
+rows required for TableScan by applying the filtering WHERE col1=’constant_value’. This is not true of when a UDF is involved in the filter predicate, let us take an


Nit: It's not necessarily a "UDF" but any SQL expression involving the columns of the table.

Please correct my understanding: do you mean that a simple expression such as c1 > 100 will be affected? I guess not. So is it always expressions involving UDFs (in Presto's case an IF statement is also a UDF)?

aditi-pandit · 2026-05-19T19:04:44Z

+   result when the derived column feature flags are enabled.
+3. We can provide a command to sync derived columns when they do go out of sync. (Similar to Materialized views REFRESH).
+
+### What is allowed in derived column expressions


Maybe not a big deal, but are IS NULL and IS NOT NULL allowed for derived columns?

aditi-pandit · 2026-05-19T19:05:43Z

+2. A sub query expressions.
+3. IN Query and Exists query.
+
+### Table properties


Table properties are Iceberg-specific. Do you have a design for Hive as well?

Good point. As I have mentioned, the current design focuses only on Iceberg, and all other connectors are listed under Future Work.

Supporting Hive will be interesting work, and maybe that will be the next phase of the RFC.

aditi-pandit · 2026-05-19T19:06:26Z

+Alter Table:
+
+```sql 
+ALTER TABLE [ IF EXISTS ] name ADD COLUMN [ IF NOT EXISTS ] column_name data_type [GENERATED ALWAYS]  AS   ( <expression> ) [VIRTUAL | PERSISTENT]


What is the difference between VIRTUAL and PERSISTENT? Could you please elaborate?

I will add comments!

aditi-pandit · 2026-05-19T19:07:55Z

+
+```sql 
+ALTER TABLE [ IF EXISTS ] name ADD COLUMN [ IF NOT EXISTS ] column_name data_type [GENERATED ALWAYS]  AS   ( <expression> ) [VIRTUAL | PERSISTENT]
+``` 


Do the derived column expressions show up in DESCRIBE TABLE / SHOW COLUMNS output? If yes, what will the syntax of the expressions be?

Will update this soon.

aditi-pandit · 2026-05-19T19:11:25Z

+    "derived-columns.spec.expression.json" = '{                                                   
+   "expressionSpecList" : [ {                                                                     
+     "derivedColumnType" : "PERSISTENT",                                                          
+     "derivedColumnExpression" : "if(lower(c2)=''c'',''g'',''e'')",                               


How do you handle a type mismatch between the column type the user specifies and the expression's return type?

If I understand correctly, the expression's return type is inferred, so this situation should not occur.

aditi-pandit · 2026-05-19T19:12:57Z

+    "derived-columns.spec.expression.json" = '{                                                   
+   "expressionSpecList" : [ {                                                                     
+     "derivedColumnType" : "PERSISTENT",                                                          
+     "derivedColumnExpression" : "if(lower(c2)=''c'',''g'',''e'')",                               


This seems like a very contrived expression. Can we use something simple, like just "lower(column)", instead?

I had a purpose in mind for this contrived expression: I wanted to show how sub-expression matching works. There is a usage guide section where I cover what users will typically see as an example.
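As a rough illustration of sub-expression matching, here is a minimal Python sketch. It is not the POC's implementation. The tuple-based expression encoding and the `rewrite` helper are assumptions for illustration. Only the expression `if(lower(c2)='c','g','e')` and the column name `c2_derived` come from this PR. The idea is that any subtree of a query predicate that equals a derived column's defining expression is replaced by a reference to that column.

```python
# Expressions are nested tuples: ("call", name, *args), ("col", name), ("lit", value),
# or (operator, left, right). This encoding is an assumption made for the sketch.
DERIVED_EXPR = (
    "call", "if",
    ("eq", ("call", "lower", ("col", "c2")), ("lit", "c")),
    ("lit", "g"),
    ("lit", "e"),
)

# Maps a defining expression to the derived column that stores its result.
DERIVED_COLUMNS = {DERIVED_EXPR: "c2_derived"}


def rewrite(expr, derived=DERIVED_COLUMNS):
    """Replace every subtree that equals a derived-column expression."""
    if isinstance(expr, tuple):
        if expr in derived:
            return ("col", derived[expr])
        return tuple(rewrite(part, derived) for part in expr)
    return expr  # operator names and literal values are left untouched


# WHERE if(lower(c2)='c','g','e') = 'g' AND c1 > 100
predicate = (
    "and",
    ("eq", DERIVED_EXPR, ("lit", "g")),
    ("gt", ("col", "c1"), ("lit", 100)),
)

rewritten = rewrite(predicate)
# Equivalent to: WHERE c2_derived = 'g' AND c1 > 100
print(rewritten)
```

This sketch matches only structurally identical subtrees. A real matcher would probably need to canonicalize expressions first (for example, argument order of commutative operators or implicit casts), which is left out here.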

aditi-pandit · 2026-05-19T19:18:44Z

+
+### Table properties
+
+Following table properties are added.


Are these in the metadata.json file?

Yes, I will include examples.

aditi-pandit · 2026-05-19T19:19:25Z

+   {
+            "udfSpecList" : [ {
+               "derivedColumnType" : "PERSISTENT",
+               "derivedColumnExpression" : "SQL expression",


Can you give more info about the SQL dialect of this expression? It seems you want at least Presto and Spark to understand it.

To be clear, deriving a common subset of expressions that are interpretable by both Spark and Presto is hard and likely outside the scope of this RFC. I think the most straightforward thing is to treat them like views, which defer on cross-platform interpretability and force any consumer of the view SQL to understand Presto's dialect. Cross-platform expressions can be considered an orthogonal yet important task.

aditi-pandit · 2026-05-19T19:20:19Z

+
+## Alternatives considered.
+
+### 1. Use table properties to configure derived column information (instead of alter column syntax).


I thought this was your actual solution. This writeup gives the impression that it's just an alternative considered. Could you please elaborate?

This was my previous solution, where a user would manually enter the table properties. That is no longer true: we now support CREATE TABLE and ALTER TABLE syntax to achieve this. Internally, table properties are still used, but they are not editable by end users.
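For readers who want to see what consuming the internal property might look like, here is a small Python sketch. It assumes the property value is the JSON document quoted in the review hunks above, and it reads only the two fields visible there (`derivedColumnType` and `derivedColumnExpression` under `expressionSpecList`). The function name, the validation, and the sample value are assumptions, and the real spec may carry more fields.

```python
import json

# The two kinds named in the proposed syntax: [VIRTUAL | PERSISTENT].
ALLOWED_TYPES = {"VIRTUAL", "PERSISTENT"}


def parse_derived_specs(property_value):
    """Return (type, expression) pairs from the JSON property value."""
    document = json.loads(property_value)
    specs = []
    for entry in document["expressionSpecList"]:
        kind = entry["derivedColumnType"]
        if kind not in ALLOWED_TYPES:
            raise ValueError(f"unknown derivedColumnType: {kind!r}")
        specs.append((kind, entry["derivedColumnExpression"]))
    return specs


# Sample value built with json.dumps so the quotes inside the SQL expression
# do not need manual escaping.
property_value = json.dumps({
    "expressionSpecList": [
        {
            "derivedColumnType": "PERSISTENT",
            "derivedColumnExpression": "if(lower(c2)='c','g','e')",
        }
    ]
})

specs = parse_derived_specs(property_value)
print(specs)
```

Because the DDL is now the only entry point, validation of this kind would run when CREATE TABLE or ALTER TABLE is processed rather than when a user edits the property by hand.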

ScrapCodes · 2026-05-20T13:16:59Z

+ CREATE TABLE iceberg.perf_test.test2 (                                                           
+    "c1" bigint,                                                                                  
+    "c2" varchar,                                                                                 
+    "c2_derived" varchar,                                                                         


Will soon have an update to showQueryRewrite with

CREATE TABLE iceberg.perf_test.test2 ( "c1" bigint, "c2" varchar, "c2_derived" varchar AS if(lower(c2) = 'c', 'g', 'e') PERSISTENT,

prestodb-ci added the from:IBM PRs from IBM label Apr 24, 2026

prestodb-ci requested review from a team, BryanCutler and infvg and removed request for a team April 24, 2026 07:35

ScrapCodes marked this pull request as draft April 24, 2026 07:35

ScrapCodes removed request for BryanCutler and infvg April 24, 2026 07:35

ScrapCodes force-pushed the derived-column-spec branch 4 times, most recently from 474cf06 to 221e8c0 April 24, 2026 11:37

ScrapCodes force-pushed the derived-column-spec branch from 221e8c0 to 8690c4e May 4, 2026 09:02

ScrapCodes changed the title ~~[WIP] RFC-0025 Derived column~~ RFC-0025 Derived column May 4, 2026

ScrapCodes marked this pull request as ready for review May 4, 2026 09:58

prestodb-ci requested review from a team, infvg and wanglinsong and removed request for a team May 4, 2026 09:58

ScrapCodes force-pushed the derived-column-spec branch 3 times, most recently from 3bdd1ef to 2448a60 May 4, 2026 12:23

aditi-pandit reviewed May 4, 2026

jja725 self-requested a review May 5, 2026 18:25

ScrapCodes force-pushed the derived-column-spec branch from 2448a60 to 8c6c4a4 May 6, 2026 16:39

jja725 reviewed May 7, 2026

ScrapCodes force-pushed the derived-column-spec branch from 8c6c4a4 to 03935be May 7, 2026 16:11

tdcmeehan reviewed May 7, 2026

tdcmeehan reviewed May 8, 2026

RFC-0025 Derived column

6629e4e

ScrapCodes force-pushed the derived-column-spec branch from 03935be to 6629e4e May 18, 2026 10:54

ScrapCodes requested review from aditi-pandit, jja725 and tdcmeehan May 18, 2026 17:52

tdcmeehan reviewed May 18, 2026

ScrapCodes requested a review from tdcmeehan May 19, 2026 06:47

Added POC details and usage guide.

334810b

ScrapCodes force-pushed the derived-column-spec branch from 82feae1 to 334810b May 19, 2026 10:35

aditi-pandit reviewed May 19, 2026

ScrapCodes commented May 20, 2026

Fixed show query rewrite to not show internal table properties.

e9533da


5 participants
