Skip to content

fix joins for missing key columns#187

Open
Anamika1608 wants to merge 7 commits intoDataHaskell:mainfrom
Anamika1608:fix/missing-join-columns
Open

fix joins for missing key columns#187
Anamika1608 wants to merge 7 commits intoDataHaskell:mainfrom
Anamika1608:fix/missing-join-columns

Conversation

@Anamika1608
Copy link
Contributor

closes #118

summary

this fixes the join behavior when a requested join key does not exist. instead of silently hashing an empty key set and producing cartesian style output, joins now fail fast with the existing columnnotfoundexception.

what changed

  • added a shared validation helper in join.hs that checks requested join keys against dataframe column indices before hashes are computed
  • routed innerjoin, leftjoin, rightjoin, fullouterjoin, and buildhashcolumn through that validation path
  • kept the public join signatures unchanged
  • preserved the correct rightjoin callpoint by routing it through an internal leftjoin helper with an explicit callpoint
  • added regression tests for missing join keys across inner, left, right, and full outer joins
  • updated subset.hs to use randomgen and split so the test suite builds under the current nix toolchain

behavior

before this change, a missing join column could fall through to hashing zero key columns, which made every row hash the same and could expand into cartesian style output. after this change, the join fails immediately with the missing column name and the existing available column context.

verification

  • source /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh
  • nix --extra-experimental-features 'nix-command flakes' develop --no-write-lock-file -c cabal test test:tests --test-show-details=direct --test-options='-p Operations.Join'\n- nix --extra-experimental-features 'nix-command flakes' develop --no-write-lock-file -c cabal test test:tests --test-show-details=direct\n\n## results\n- full tests suite passed\n- 445 hunit cases passed\n- 0 failures\n- 0 errors\n- all quickcheck properties passed

@Anamika1608
Copy link
Contributor Author

@daikonradish @mchav, can you review this pr?

@daikonradish
Copy link
Contributor

We've left some comments, but in general your code looks good. Some of it is very stylish! Thank you for your effort.

@Anamika1608
Copy link
Contributor Author

@mchav @daikonradish fixed all the issues. check once?

import Type.Reflection
import Prelude hiding (filter, take)

#if MIN_VERSION_random(1,3,0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this prevent compilation? What ghc and build tool are you using?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no for the repo’s normal build. this module enables CPP, and MIN_VERSION_random(...) is provided by cabal preprocessing. i tested it with ghc 9.10.3 and cabal-install 3.16.1.0 in the nix dev shell, and the repo ci also builds with cabal. if someone compiles the file with bare ghc outside cabal, then that macro would not be defined.

@Anamika1608
Copy link
Contributor Author

@mchav check now?

(callingFunctionName context)
errorString
show (ColumnNotFoundException columnName callPoint availableColumns) = columnNotFound columnName callPoint availableColumns
show (ColumnsNotFoundException columnNames callPoint availableColumns) = columnsNotFound columnNames callPoint availableColumns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant we should consolidate the two error paths since the two exceptions are practically the same. ColumnNotFound should be a special case of columns not found with a single element lost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] joins with incorrect names should error out, not OUTER JOIN.

3 participants