Summary
The expectedSparkSql pinned by CalcitePPLScalarSubqueryTest.testCorrelatedScalarSubqueryInWhere is not valid Spark SQL — Spark, DuckDB, and SQLite all refuse to analyze it. The unit test passes only because verifyPPLToSparkSQL checks the printed SQL string and never executes it. This makes the assertion misleading: the Spark in the helper name implies an executable contract that doesn't exist.
Where
- File:
ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLScalarSubqueryTest.java
- Test:
testCorrelatedScalarSubqueryInWhere (around line 119)
- Helper:
CalcitePPLAbstractTest.verifyPPLToSparkSQL (line 185) — only string-equals on sqlNode.toSqlString(OpenSearchSparkSqlDialect.DEFAULT); never analyzes/runs against Spark.
The PPL under test
source=EMP
| where SAL > [
source=SALGRADE | where SAL = HISAL | stats AVG(SAL)
]
The expected Spark SQL pinned by the test
SELECT *
FROM `scott`.`EMP`
WHERE `SAL` > (((SELECT AVG(`EMP`.`SAL`) AS `AVG(SAL)`
FROM `scott`.`SALGRADE`
WHERE `EMP`.`SAL` = `HISAL`)))
Why it is invalid
The inner SELECT has FROM scott.SALGRADE, but SALGRADE does not have a SAL column. Both SAL references inside the subquery silently bind to the outer EMP.SAL (a correlated reference). The aggregate AVG(EMP.SAL) is therefore an aggregate over an outer column, which the SQL standard treats as an aggregate of the outer query block, not the inner one. That cannot be placed inside the outer WHERE clause.
The Calcite logical plan agrees that both SAL references are outer references:
LogicalFilter(condition=[>($7, $SCALAR_QUERY({
LogicalAggregate(group=[{}], AVG(SAL)=[AVG($0)])
LogicalProject($f0=[$cor0.SAL]) <-- AVG over outer column
LogicalFilter(condition=[=($cor0.SAL, $0)]) <-- outer = SALGRADE.HISAL
...SALGRADE...
})], variablesSet=[[$cor0]])
...EMP...
Reproduction with real engines
Spark 4.1 (PySpark, the dialect the test claims to emit)
[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.CORRELATED_REFERENCE]
Unsupported subquery expression: Expressions referencing the outer query
are not supported outside of WHERE/HAVING clauses:
"avg(SAL) AS `AVG(SAL)`". SQLSTATE: 0A000
Spark plan fragment before analysis fails:
Filter (SAL > scalar-subquery [avg(SAL) && SAL])
Aggregate [avg(outer(SAL)) AS AVG(SAL)]
Filter (outer(SAL) = HISAL)
Relation scott.salgrade[GRADE,LOSAL,HISAL]
A valid SQL query equivalent to the physical plan
The optimized physical plan from opensearch _explain already de-correlates the query into a flat join (nested-loop because of the > predicate). The following SQL is accepted by Spark and matches that physical plan:
SELECT e.*
FROM scott.EMP AS e
JOIN (
SELECT s.SAL AS sal_key,
AVG(s.SAL) AS avg_sal
FROM (SELECT DISTINCT SAL FROM scott.EMP) AS s
CROSS JOIN scott.SALGRADE AS g
WHERE g.HISAL IS NOT NULL
GROUP BY s.SAL
) AS sub
ON e.SAL <=> sub.sal_key -- IS NOT DISTINCT FROM (null-safe equi-key, matches plan)
AND e.SAL > sub.avg_sal -- nested-loop range predicate, matches plan
Summary
The
expectedSparkSqlpinned byCalcitePPLScalarSubqueryTest.testCorrelatedScalarSubqueryInWhereis not valid Spark SQL — Spark, DuckDB, and SQLite all refuse to analyze it. The unit test passes only becauseverifyPPLToSparkSQLchecks the printed SQL string and never executes it. This makes the assertion misleading: theSparkin the helper name implies an executable contract that doesn't exist.Where
ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLScalarSubqueryTest.javatestCorrelatedScalarSubqueryInWhere(around line 119)CalcitePPLAbstractTest.verifyPPLToSparkSQL(line 185) — only string-equals onsqlNode.toSqlString(OpenSearchSparkSqlDialect.DEFAULT); never analyzes/runs against Spark.The PPL under test
The expected Spark SQL pinned by the test
Why it is invalid
The inner
SELECThasFROM scott.SALGRADE, butSALGRADEdoes not have aSALcolumn. BothSALreferences inside the subquery silently bind to the outerEMP.SAL(a correlated reference). The aggregateAVG(EMP.SAL)is therefore an aggregate over an outer column, which the SQL standard treats as an aggregate of the outer query block, not the inner one. That cannot be placed inside the outerWHEREclause.The Calcite logical plan agrees that both
SALreferences are outer references:Reproduction with real engines
Spark 4.1 (PySpark, the dialect the test claims to emit)
Spark plan fragment before analysis fails:
A valid SQL query equivalent to the physical plan
The optimized physical plan from opensearch
_explainalready de-correlates the query into a flat join (nested-loop because of the>predicate). The following SQL is accepted by Spark and matches that physical plan: