RFC-0025 Derived column #61
jja725
left a comment
Agreed that how writes work would be the main concern here, given the need for compatibility with all the engines.
@tdcmeehan has volunteered to be a co-author! Yay!
| of extra storage it consumes. Let us understand with the following use case example: | ||
| A compute engine like Presto can easily push down a filter predicate e.g. `SELECT col1, col2 FROM table T1 WHERE col1='constant_value'`, this allows for pruning the number of | ||
| rows required for TableScan by applying the filtering WHERE col1='constant_value'. This is not true when a UDF is involved in the filter predicate, let us take an |
Nit: It's not necessarily a "UDF" but any SQL expression involving the columns of the table.
Please correct my understanding here: do you mean that a simple expression such as c1 > 100 will be affected? I would guess not. So is it always expressions involving UDFs (in Presto's case, an IF statement is also a UDF)?
| result when the derived column feature flags are enabled. | ||
| 3. We can provide a command to sync derived columns when they do go out of sync. (Similar to Materialized views REFRESH). | ||
| ### What is allowed in derived column expressions |
Maybe not a big deal, but are IS NULL and IS NOT NULL allowed for derived columns?
| 2. Subquery expressions. | ||
| 3. IN queries and EXISTS queries. | ||
| ### Table properties |
Table properties are Iceberg-specific. Do you have a design for Hive as well?
Good point. As I have mentioned, the current design focuses only on Iceberg, and all other connectors are covered in Future Work.
Supporting Hive will be interesting work, and may be the next phase of the RFC.
| Alter Table: | ||
| ```sql | ||
| ALTER TABLE [ IF EXISTS ] name ADD COLUMN [ IF NOT EXISTS ] column_name data_type [GENERATED ALWAYS] AS ( <expression> ) [VIRTUAL | PERSISTENT] |
What is the difference between VIRTUAL and PERSISTENT? Could you please elaborate?
I will add comments!
| ```sql | ||
| ALTER TABLE [ IF EXISTS ] name ADD COLUMN [ IF NOT EXISTS ] column_name data_type [GENERATED ALWAYS] AS ( <expression> ) [VIRTUAL | PERSISTENT] | ||
| ``` |
Do the derived column expressions show up in DESCRIBE TABLE / SHOW COLUMNS output? If so, what will the syntax of the expressions be?
Will update this soon.
| "derived-columns.spec.expression.json" = '{ | ||
| "expressionSpecList" : [ { | ||
| "derivedColumnType" : "PERSISTENT", | ||
| "derivedColumnExpression" : "if(lower(c2)=''c'',''g'',''e'')", |
How do you handle a type mismatch between the column type the user specifies and the return type of the expression?
If I understand correctly, the expression's return type is interpreted, so this situation should not occur.
| "derived-columns.spec.expression.json" = '{ | ||
| "expressionSpecList" : [ { | ||
| "derivedColumnType" : "PERSISTENT", | ||
| "derivedColumnExpression" : "if(lower(c2)=''c'',''g'',''e'')", |
This seems like a very contrived expression. Can we use something simple, like just "lower(column)", instead?
I had a purpose in mind for this contrived expression: I wanted to show how sub-expression matching works. The usage guide section covers what users in general will see as an example.
| ### Table properties | ||
| Following table properties are added. |
Are these in the metadata.json file?
Yes, will include examples.
| { | ||
| "udfSpecList" : [ { | ||
| "derivedColumnType" : "PERSISTENT", | ||
| "derivedColumnExpression" : "SQL expression", |
Can you give more information about the SQL dialect of this expression? It seems you want at least Presto and Spark to understand it.
To be clear, deriving a common subset of expressions that are interpretable by both Spark and Presto is hard and likely outside the scope of this RFC. I think the most straightforward approach is to treat them like views, which defer on cross-platform interpretability and force any consumer of the view SQL to understand Presto's dialect. Cross-platform expressions can be considered an orthogonal yet important task.
| ## Alternatives considered. | ||
| ### 1. Use table properties to configure derived column information (instead of alter column syntax). |
I thought this was your actual solution. This writeup gives the impression that it's just an alternative considered. Could you please elaborate?
This was my previous solution, where a user would manually enter the table properties. That is no longer the case: we now support CREATE TABLE and ALTER TABLE syntax to achieve this. Table properties are still used internally, but they are not editable by end users.
| CREATE TABLE iceberg.perf_test.test2 ( | ||
| "c1" bigint, | ||
| "c2" varchar, | ||
| "c2_derived" varchar, |
Will soon have an update to showQueryRewrite with
CREATE TABLE iceberg.perf_test.test2 (
"c1" bigint,
"c2" varchar,
"c2_derived" varchar AS if(lower(c2) = 'c', 'g', 'e') PERSISTENT,
What is a derived column?
A column created by applying a SQL expression or a UDF to an existing column in a table.
Why do we need that, since we can always apply a UDF to a column during project, filter or join?
Indeed, a derived column consumes O(N) storage, where N is the number of rows in the table. We still need them because the performance benefits outweigh the cost of the extra storage they consume. Let us understand this with the following use case:
A compute engine like Presto can easily push down a filter predicate, e.g. SELECT col1, col2 FROM table T1 WHERE col1='constant_value'. This prunes the number of rows required for the TableScan by applying the filter WHERE col1='constant_value'. This is not true when a UDF is involved in the filter predicate. Take, for example, SELECT col1, col2 FROM table T1 WHERE lower(col1)='constant_value'. While optimizers can easily push down the filter predicate, it cannot be used for filtering with the lower and upper bound metrics, for example Iceberg manifest statistics and Parquet row group statistics. As a result, we end up scanning a large number of rows.
So, by supporting pushdown of certain predicates (with UDFs in them) and reducing the amount of data scanned, derived columns can bring large performance improvements. Derived columns have already been proven in RDBMS systems, e.g. DB2 [1], and now we intend to bring them to Presto.
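To make the pruning argument concrete, here is a minimal Python sketch of min/max based file skipping. It is not Presto or Iceberg code: `FileStats`, `may_contain`, `files_to_scan`, the file names, and the derived column name `col1_lower` are all hypothetical, and the bounds simply stand in for Iceberg manifest statistics or Parquet row group statistics. The assumption is that bounds exist only for stored columns, so a predicate on `lower(col1)` has nothing to prune against, while the same predicate rewritten onto a persisted derived column does.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file bounds: column name -> (min_value, max_value)."""
    path: str
    bounds: dict


def may_contain(stats, column, value):
    """Return False only when the bounds prove `column = value` matches no row."""
    if column not in stats.bounds:
        # No statistics are stored for this column or expression,
        # so the file cannot be skipped.
        return True
    low, high = stats.bounds[column]
    return low <= value <= high


def files_to_scan(files, column, value):
    """List the files that survive min/max pruning for `column = value`."""
    return [f.path for f in files if may_contain(f, column, value)]


# Each file stores bounds for col1 and for the (assumed) derived column col1_lower.
files = [
    FileStats("f1.parquet", {"col1": ("Apple", "banana"), "col1_lower": ("apple", "banana")}),
    FileStats("f2.parquet", {"col1": ("Cherry", "grape"), "col1_lower": ("cherry", "grape")}),
    FileStats("f3.parquet", {"col1": ("Kiwi", "mango"), "col1_lower": ("kiwi", "mango")}),
]

# WHERE lower(col1) = 'cherry': there are no bounds for the expression itself.
without_derived = files_to_scan(files, "lower(col1)", "cherry")

# Same predicate rewritten to the derived column: WHERE col1_lower = 'cherry'.
with_derived = files_to_scan(files, "col1_lower", "cherry")

print(without_derived)
print(with_derived)
```

In this toy setup the expression predicate keeps every file, while the rewritten predicate keeps only the file whose `col1_lower` bounds cover the constant. The real savings depend on how the data is laid out and on which statistics the table format actually records.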