verifyPPLToSparkSQL pins invalid Spark SQL in CalcitePPLScalarSubqueryTest.testCorrelatedScalarSubqueryInWhere

### Summary

The `expectedSparkSql` pinned by `CalcitePPLScalarSubqueryTest.testCorrelatedScalarSubqueryInWhere` is **not valid Spark SQL** — Spark, DuckDB, and SQLite all refuse to analyze it. The unit test passes only because `verifyPPLToSparkSQL` checks the printed SQL string and never executes it. This makes the assertion misleading: the `Spark` in the helper name implies an executable contract that doesn't exist.

### Where

- File: `ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLScalarSubqueryTest.java`
- Test: `testCorrelatedScalarSubqueryInWhere` (around line 119)
- Helper: `CalcitePPLAbstractTest.verifyPPLToSparkSQL` (line 185) — only string-equals on `sqlNode.toSqlString(OpenSearchSparkSqlDialect.DEFAULT)`; never analyzes/runs against Spark.

### The PPL under test

```ppl
source=EMP
| where SAL > [
    source=SALGRADE | where SAL = HISAL | stats AVG(SAL)
  ]
```

### The expected Spark SQL pinned by the test

```sql
SELECT *
FROM `scott`.`EMP`
WHERE `SAL` > (((SELECT AVG(`EMP`.`SAL`) AS `AVG(SAL)`
                 FROM `scott`.`SALGRADE`
                 WHERE `EMP`.`SAL` = `HISAL`)))
```

### Why it is invalid

The inner `SELECT` has `FROM scott.SALGRADE`, but `SALGRADE` does not have a `SAL` column. Both `SAL` references inside the subquery silently bind to the **outer** `EMP.SAL` (a correlated reference). The aggregate `AVG(EMP.SAL)` is therefore an aggregate over an outer column, which the SQL standard treats as an aggregate of the **outer** query block, not the inner one. That cannot be placed inside the outer `WHERE` clause.

The Calcite logical plan agrees that both `SAL` references are outer references:

```
LogicalFilter(condition=[>($7, $SCALAR_QUERY({
  LogicalAggregate(group=[{}], AVG(SAL)=[AVG($0)])
    LogicalProject($f0=[$cor0.SAL])              <-- AVG over outer column
      LogicalFilter(condition=[=($cor0.SAL, $0)]) <-- outer = SALGRADE.HISAL
        ...SALGRADE...
})], variablesSet=[[$cor0]])
  ...EMP...
```

### Reproduction with real engines

#### Spark 4.1 (PySpark, the dialect the test claims to emit)

```
[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.CORRELATED_REFERENCE]
Unsupported subquery expression: Expressions referencing the outer query
are not supported outside of WHERE/HAVING clauses:
"avg(SAL) AS `AVG(SAL)`". SQLSTATE: 0A000
```

Spark plan fragment before analysis fails:

```
Filter (SAL > scalar-subquery [avg(SAL) && SAL])
  Aggregate [avg(outer(SAL)) AS AVG(SAL)]
    Filter (outer(SAL) = HISAL)
      Relation scott.salgrade[GRADE,LOSAL,HISAL]
```

### A valid SQL query equivalent to the physical plan

The optimized physical plan from opensearch `_explain` already de-correlates the query into a flat join (nested-loop because of the `>` predicate). The following SQL is **accepted by Spark** and matches that physical plan:

```sql
SELECT e.*
FROM   scott.EMP AS e
JOIN (
  SELECT s.SAL              AS sal_key,
         AVG(s.SAL)         AS avg_sal
  FROM   (SELECT DISTINCT SAL FROM scott.EMP) AS s
  CROSS JOIN scott.SALGRADE AS g
  WHERE  g.HISAL IS NOT NULL
  GROUP  BY s.SAL
) AS sub
  ON e.SAL <=> sub.sal_key   -- IS NOT DISTINCT FROM (null-safe equi-key, matches plan)
 AND e.SAL  > sub.avg_sal    -- nested-loop range predicate, matches plan
```




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

verifyPPLToSparkSQL pins invalid Spark SQL in CalcitePPLScalarSubqueryTest.testCorrelatedScalarSubqueryInWhere #5470

Summary

Where

The PPL under test

The expected Spark SQL pinned by the test

Why it is invalid

Reproduction with real engines

Spark 4.1 (PySpark, the dialect the test claims to emit)

A valid SQL query equivalent to the physical plan

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

verifyPPLToSparkSQL pins invalid Spark SQL in CalcitePPLScalarSubqueryTest.testCorrelatedScalarSubqueryInWhere #5470

Description

Summary

Where

The PPL under test

The expected Spark SQL pinned by the test

Why it is invalid

Reproduction with real engines

Spark 4.1 (PySpark, the dialect the test claims to emit)

A valid SQL query equivalent to the physical plan

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions