Skip to content

verifyPPLToSparkSQL pins invalid Spark SQL in CalcitePPLScalarSubqueryTest.testCorrelatedScalarSubqueryInWhere #5470

@penghuo

Description

@penghuo

Summary

The expectedSparkSql pinned by CalcitePPLScalarSubqueryTest.testCorrelatedScalarSubqueryInWhere is not valid Spark SQL — Spark, DuckDB, and SQLite all refuse to analyze it. The unit test passes only because verifyPPLToSparkSQL checks the printed SQL string and never executes it. This makes the assertion misleading: the Spark in the helper name implies an executable contract that doesn't exist.

Where

  • File: ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLScalarSubqueryTest.java
  • Test: testCorrelatedScalarSubqueryInWhere (around line 119)
  • Helper: CalcitePPLAbstractTest.verifyPPLToSparkSQL (line 185) — only string-equals on sqlNode.toSqlString(OpenSearchSparkSqlDialect.DEFAULT); never analyzes/runs against Spark.

The PPL under test

source=EMP
| where SAL > [
    source=SALGRADE | where SAL = HISAL | stats AVG(SAL)
  ]

The expected Spark SQL pinned by the test

SELECT *
FROM `scott`.`EMP`
WHERE `SAL` > (((SELECT AVG(`EMP`.`SAL`) AS `AVG(SAL)`
                 FROM `scott`.`SALGRADE`
                 WHERE `EMP`.`SAL` = `HISAL`)))

Why it is invalid

The inner SELECT has FROM scott.SALGRADE, but SALGRADE does not have a SAL column. Both SAL references inside the subquery silently bind to the outer EMP.SAL (a correlated reference). The aggregate AVG(EMP.SAL) is therefore an aggregate over an outer column, which the SQL standard treats as an aggregate of the outer query block, not the inner one. That cannot be placed inside the outer WHERE clause.

The Calcite logical plan agrees that both SAL references are outer references:

LogicalFilter(condition=[>($7, $SCALAR_QUERY({
  LogicalAggregate(group=[{}], AVG(SAL)=[AVG($0)])
    LogicalProject($f0=[$cor0.SAL])              <-- AVG over outer column
      LogicalFilter(condition=[=($cor0.SAL, $0)]) <-- outer = SALGRADE.HISAL
        ...SALGRADE...
})], variablesSet=[[$cor0]])
  ...EMP...

Reproduction with real engines

Spark 4.1 (PySpark, the dialect the test claims to emit)

[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.CORRELATED_REFERENCE]
Unsupported subquery expression: Expressions referencing the outer query
are not supported outside of WHERE/HAVING clauses:
"avg(SAL) AS `AVG(SAL)`". SQLSTATE: 0A000

Spark plan fragment before analysis fails:

Filter (SAL > scalar-subquery [avg(SAL) && SAL])
  Aggregate [avg(outer(SAL)) AS AVG(SAL)]
    Filter (outer(SAL) = HISAL)
      Relation scott.salgrade[GRADE,LOSAL,HISAL]

A valid SQL query equivalent to the physical plan

The optimized physical plan from opensearch _explain already de-correlates the query into a flat join (nested-loop because of the > predicate). The following SQL is accepted by Spark and matches that physical plan:

SELECT e.*
FROM   scott.EMP AS e
JOIN (
  SELECT s.SAL              AS sal_key,
         AVG(s.SAL)         AS avg_sal
  FROM   (SELECT DISTINCT SAL FROM scott.EMP) AS s
  CROSS JOIN scott.SALGRADE AS g
  WHERE  g.HISAL IS NOT NULL
  GROUP  BY s.SAL
) AS sub
  ON e.SAL <=> sub.sal_key   -- IS NOT DISTINCT FROM (null-safe equi-key, matches plan)
 AND e.SAL  > sub.avg_sal    -- nested-loop range predicate, matches plan

Metadata

Metadata

Assignees

No one assigned

    Labels

    PPLPiped processing language

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status

    Not Started

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions