Skip to content

Text::CSV CPAN module support#435

Merged
fglock merged 28 commits intomasterfrom
feature/text-csv-support
Apr 4, 2026
Merged

Text::CSV CPAN module support#435
fglock merged 28 commits intomasterfrom
feature/text-csv-support

Conversation

@fglock
Copy link
Copy Markdown
Owner

@fglock fglock commented Apr 4, 2026

Summary

Comprehensive fixes to support the CPAN Text::CSV 2.06 test suite (which delegates to Text::CSV_PP pure Perl). This brings the pass rate from ~4/40 test programs to 30/40, with ~30,700 subtests running.

Fixes by phase

  • Phase 1: %_ strict vars exemption + use lib prepend-with-dedup ordering
  • Phase 2: @INC ordering (-I > PERL5LIB > ~/.perlonjava/lib > jar) + blib support in ExtUtils::MakeMaker
  • Phase 3a: last/next/redo inside do {} while inside a true loop — bytecode compiler now skips do-while pseudo-loops to find the innermost true loop
  • Phase 3b: bytes::length, bytes::chr, bytes::ord, bytes::substr as callable Perl subroutines
  • Phase 3c: Bare glob (*FH) method dispatch — auto-bless to IO::File
  • Phase 3 extras: Bytecode HINT_BYTES parity (FC/LC/UC/LCFIRST/UCFIRST_BYTES opcodes) + raw-bytes DATA section via Latin-1 extraction
  • Phase 4: Logical operator (&&/||///) VOID context propagation to RHS + PerlIO::get_layers NPE fix
  • Phase 4b: local %hash glob slot restoration via new GlobalRuntimeHash
  • Phase 5: Readline BYTE_STRING propagation for handles without encoding layers
  • Phase 5b: untie retains last FETCH value + UTF-16/32 encoding layer support
  • Phase 5c: UTF-8 encode wide characters on binary handles + utf8::decode for non-octets
  • Phase 5d: use bytes regex matching + Latin-1 source encoding detection
  • Phase 5e: "Wide character in print" warning + utf8::upgrade preserves content
  • Phase 5f: print reads internal ORS/OFS copy — fixes for $\ (@list) aliasing bug where print incorrectly used the aliased $\ value instead of the internal output record separator

Key new files

  • OutputRecordSeparator.java / OutputFieldSeparator.java — special variable classes for $\ and $, with internal values that print reads, immune to aliasing
  • GlobalRuntimeHash.javalocal %hash save/restore for global symbol table hashes
  • FileUtils.java — Latin-1 source encoding detection utilities

Test results

Test Before After
Programs passing ~4/40 30/40
t/45_eol.t - 1176/1182
t/46_eol_si.t - 562/562
t/47_comment.t - 71/71
t/85_util.t - 330/330

Remaining failures (10 test files)

  • t/70_rt.t — needs encoding-aware lexer for raw \xab/\xbb bytes in source
  • t/75_hashref.t — needs Scalar::Util::readonly() implementation
  • t/51_utf8.t — UTF-8 flag tracking issues (35 failures)
  • t/20_file.t, t/21_lexicalio.t, t/22_scalario.t — 1 failure each (EOL content comparison)
  • t/45_eol.t — 6 remaining failures (eol(undef) + quoted field reset interaction)
  • t/50_utf8.t — 1 failure (use bytes + regex)
  • t/76_magic.t — 1 failure (TieScalar edge case)

See dev/modules/text_csv_fix_plan.md for detailed design and progress tracking.

Test plan

  • make passes (all unit tests)
  • Text::CSV t/45_eol.t: 1176/1182 pass
  • Text::CSV t/46_eol_si.t: 562/562 pass (all)
  • 30/40 Text::CSV test programs pass

Generated with Devin

@fglock fglock force-pushed the feature/text-csv-support branch 2 times, most recently from 6ed310d to 412bbcd Compare April 4, 2026 14:47
fglock and others added 25 commits April 4, 2026 18:24
- Add %_ to strict vars exemption lists (EmitVariable, BytecodeCompiler,
  Variable) - %_ is a valid Perl global hash like $_ and @_
- Fix Lib.java to unshift (prepend) directories instead of push (append),
  matching Perl lib.pm semantics. This allows use lib qw(./lib) in
  Makefile.PL to override bundled modules.
- Add Text::CSV fix plan documenting remaining issues

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Reorder @inc so user-installed modules override bundled ones:
  -I args > PERL5LIB > ~/.perlonjava/lib > jar:PERL5LIB
  This mirrors Perl 5 site_perl > core pattern.
- Add blib/lib population to MakeMaker-generated Makefiles so
  make test can find modules via PERL5LIB=./blib/lib

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The bytecode compiler used loopStack.peek() for unlabeled last/next/redo,
which returned do-while pseudo-loops (isTrueLoop=false). This caused
errors when last was used inside a do-while nested in a real while loop.

Fix: iterate loopStack to find the first isTrueLoop=true entry, matching
the JVM backend findInnermostTrueLoopLabels behavior.

Impact: unblocks Text::CSV_PP core parser. Tests go from ~4 to 19/40.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- BytesPragma.java: Register bytes::length, bytes::chr, bytes::ord,
  bytes::substr as callable subroutines, delegating to existing
  StringOperators/ScalarOperators byte-aware methods.
  Text::CSV_PP calls bytes::length() directly at lines 1989/1995.

- RuntimeCode.java: Add GLOB type handling in method dispatch.
  Bare typeglobs (*FH, *DATA) used as method invocants now auto-bless
  to IO::File, matching the existing GLOBREFERENCE behavior.
  This fixes *FH->print(), *DATA->getline(), etc.

Text::CSV tests: 24/40 pass (up from 19), 31019 subtests ran.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
24/40 test programs pass, 31019 subtests ran, 118 actual failures.
Documented remaining issues: binary source reading (t/70_rt.t),
Scalar::Util::readonly (t/75_hashref.t), TieScalar (t/76_magic.t),
utf-32 encoding (t/85_util.t), and UTF-8 handling edge cases.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Bytecode compiler changes:
- Add isBytesEnabled() helper to BytecodeCompiler
- Check HINT_BYTES for length/chr/ord/fc/lc/uc/lcfirst/ucfirst
  and emit *_BYTES opcodes when 'use bytes' is active
- Add FC_BYTES, LC_BYTES, UC_BYTES, LCFIRST_BYTES, UCFIRST_BYTES
  opcodes with handler and disassembly support

DATA section changes:
- Store raw file bytes (after BOM removal) in CompilerOptions
- Extract DATA section content from raw bytes instead of
  UTF-8-decoded tokens, preserving non-UTF-8 bytes (e.g. Latin-1)
- Fall back to token-based extraction for eval/string contexts

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Pass VOID context through to RHS of &&/and, ||/or, // operators in
  both JVM backend (EmitLogicalOperator) and bytecode compiler
  (CompileBinaryOperator). Previously VOID was converted to SCALAR,
  causing side-effect-only expressions to leave values on the stack.
  Fixes t/80_diag.t tests 113-114.

- Add null check in PerlIO::get_layers for non-GLOB arguments,
  throwing "Not a GLOB reference" instead of NPE.
  Fixes t/90_csv.t test 104.

Text::CSV results: 27/40 programs pass (was 16/40).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Previously, `local %hash` only saved the hash contents internally
(via RuntimeHash.dynamicSaveState), but did not save the globalHashes
map entry. When `*glob = \%other` replaced the map entry via glob slot
assignment, the scope-exit restore put the saved contents into the
orphaned original hash, not the one in the global map.

This adds GlobalRuntimeHash (following the GlobalRuntimeScalar pattern)
which saves and restores the actual globalHashes map entry, including
glob alias handling.

Applied in both the JVM backend (EmitOperatorLocal.java) and the
bytecode interpreter (BytecodeInterpreter.java LOCAL_HASH handler).

Fixes Text::CSV t/91_csv_cb.t test 45 (%_ restoration after
`local %_` + `*_ = $hashref`).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…yers

In Perl, reading from file handles without encoding layers (e.g., :raw,
:bytes, or default mode) produces byte strings with the UTF-8 flag off.
PerlOnJava's readline methods (readUntilCharacter, readUntilString,
readParagraphMode, readFixedLength) were always creating STRING-typed
results, which made utf8::is_utf8() return true for all readline output.

This caused Text::CSV_PP's binary character detection to fail: CSV_PP
checks utf8::is_utf8($data) to decide whether to skip binary validation,
so bytes like \x08 (backspace) were silently accepted instead of
raising error 2037.

Changes:
- LayeredIOHandle: add hasEncodingLayer() to detect :utf8/:encoding(...)
- RuntimeIO: add isByteMode() to check if handle produces byte data
- Readline: all four read methods now check isByteMode() and set
  BYTE_STRING type on results when no encoding layers are active

Impact on Text::CSV tests:
- t/20_file.t, t/21_lexicalio.t: 104/109 -> 108/109 (+4 each)
- t/22_scalario.t: 131/136 -> 135/136 (+4)
- t/47_comment.t: 56/71 -> 71/71 (+15, all pass)
- t/51_utf8.t: 128/207 -> 132/167 (+4)
- t/85_util.t: 318/1448 -> 330/330 (all pass)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
30/40 test programs now pass (up from 27/40).
Phase 5 (readline BYTE_STRING propagation) fixed 27 subtest
failures across 6 test files. Notable improvements:
- t/47_comment.t: 71/71 (was 56/71)
- t/85_util.t: 330/330 (was 318/1448)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- TieScalar: cache last FETCHd value; untie restores it (not pre-tie
  value), matching Perl 5 behavior. Fixes t/76_magic.t (44/44).

- LayeredIOHandle: add decoded character buffer to prevent character loss
  when encoding layer decodes more characters than requested. Previously,
  reading 4 bytes of UTF-16BE produced 2 chars but only 1 was consumed;
  the other was silently discarded. Now excess chars are buffered for the
  next read. Also clear buffer on binmode/seek/close.

- Encode: add UTF-32, UTF-32BE, UTF-32LE charset aliases.

- perl_test_runner.pl: handle CPAN module paths with absolute directories
  so require ./t/util.pl works correctly.

Text::CSV t/85_util.t: 330/1448 -> 1350/1448

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
… for non-octets

- All IO write() methods (CustomFileChannel, StandardIO, PipeOutputChannel,
  CustomOutputStreamHandle): detect characters > 255 and auto-encode to
  UTF-8, matching Perl 5 'Wide character in print' behavior. Previously,
  wide chars were truncated to their low byte (e.g., U+FEFF -> 0xFF).

- Utf8.java decode(): return false without modification when string
  contains characters > 0xFF, since they cannot be valid UTF-8 octets.
  Previously, getBytes(ISO_8859_1) silently replaced them with '?',
  corrupting Text::CSV sep/quote chars and causing sanity check failures.

Text::CSV t/85_util.t: 1350 -> 1356/1448
Text::CSV t/51_utf8.t: 167/207 (crashed) -> 198/207 (all run)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two fixes that significantly improve Text::CSV test pass rates:

1. use bytes regex matching: Under use bytes pragma, regex character
   classes like [\x7f-\xa0] now match against UTF-8 byte representation
   of strings rather than Unicode characters. This fixes Text::CSV_PP
   quote_binary detection for multi-byte characters (e.g., euro sign).
   Added toBytesString() to StringOperators, with support in both JVM
   and interpreter backends.

2. Latin-1 source encoding detection: Source files containing non-ASCII
   bytes that are not valid UTF-8 are now detected and read as ISO-8859-1
   instead of UTF-8. This matches Perl 5 behavior where source files
   without use utf8 are treated as Latin-1. Files are marked with
   isByteStringSource so the string parser does not re-encode characters.

Test improvements:
- t/50_utf8.t: 92/93 -> 93/93 (use bytes regex fix)
- t/20_file.t: 108/109 -> 109/109 (Latin-1 fix)
- t/21_lexicalio.t: 108/109 -> 109/109 (Latin-1 fix)
- t/22_scalario.t: 135/136 -> 136/136 (Latin-1 fix)
- t/70_rt.t: 1/20469 -> 20466/20469 (Latin-1 fix, +20465 tests!)
- Overall: 32255 total -> 52723 total tests, 9 -> 5 failing programs

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Add Wide character in print warning to RuntimeIO.write() when writing
  characters > 0xFF to a filehandle without a UTF-8 encoding layer. The
  warning is on by default (matching Perl 5) and suppressible with
  no warnings utf8. It goes through WarnDie.warn() so it is catchable
  by $SIG{__WARN__} handlers.

- Fix utf8::upgrade() to simply flip the type from BYTE_STRING to STRING
  without decoding the bytes as UTF-8. In Perl 5, utf8::upgrade() only
  changes the internal storage flag; character codepoints remain identical.
  Previously, bytes like 0xE2,0x82,0xAC (UTF-8 for euro sign) were
  incorrectly decoded back to U+20AC, reversing a prior utf8::encode().

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
In Perl, `print` uses an internal copy of $\ (PL_ors_sv) that is only
updated by direct assignment to $\. Aliasing via `for $\ (@list)` changes
the Perl-visible variable but not the internal value print uses.

PerlOnJava was reading $\ directly from the global variable map, so
`for $\ ($rs) { print $fh $str }` would incorrectly append the aliased
iterator value instead of the original $\ value.

Fix: Create OutputRecordSeparator and OutputFieldSeparator classes that
maintain a static internal value updated only by set(). print reads these
internal values instead of the map entries. GlobalRuntimeScalar handles
save/restore of internal values during local/for scoping.

This fixes 12 failures in Text::CSV t/45_eol.t and all 12 failures in
t/46_eol_si.t.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
withCapturedVars() created a copy of InterpretedCode for closures but
didn't copy gotoLabelPcs or usesLocalization. This caused goto LABEL
to fail in interpreter-fallback subroutines that have closure variables
(like Text::CSV_PP's ____parse, because the label map was silently
dropped when binding captured variables.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
EOF
)
- RuntimeTransliterate: both /r return path and in-place modification
  path now preserve BYTE_STRING type from the input scalar
- RuntimeSubstrLvalue: substr() on a BYTE_STRING parent now returns
  BYTE_STRING instead of hardcoded STRING type

These fixes ensure that byte-oriented string operations maintain
their binary semantics, fixing Text::CSV t/51_utf8.t tests 122,
134, 144 where multi-byte separators were garbled.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…ions

- chomp/chop: preserve BYTE_STRING after removing separator
- Regex captures ($1, $2, $&, etc.): track lastMatchWasByteString flag
  and propagate to ScalarSpecialVariable results and list-context returns
- split: all result elements inherit BYTE_STRING from input string
- s///: preserve BYTE_STRING for both normal and /r substitution
- lc/uc/lcfirst/ucfirst/fc/quotemeta: preserve type from input
- reverse/repeat (x): preserve BYTE_STRING from input
- utf8::is_utf8: resolve ScalarSpecialVariable proxy before checking type
- RegexState: save/restore lastMatchWasByteString across scope boundaries

These fixes ensure binary-mode string operations maintain their byte
semantics throughout the parsing pipeline. Fixes Text::CSV t/51_utf8.t
(all 207 tests now pass, was 4 failures) and reduces t/85_util.t
from 92 to 24 failures (remaining are UTF-16/32 encoding layer issues).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Perl Encode::decode silently drops incomplete trailing code units
for fixed-width encodings (UTF-16, UTF-32). Java String(byte[],
Charset) replaces them with U+FFFD replacement characters instead.

This caused Text::CSV t/85_util.t to fail 24 tests when reading
BOM-prefixed UTF-16LE/UTF-32LE files with CR line endings: binary
readline consumed the entire file, CSV_PP header() padded the header
with a null byte for alignment, and the extra U+FFFD in the decoded
string was parsed as a second data row.

Fix: trim input bytes to a multiple of the code unit size (2 for
UTF-16, 4 for UTF-32) before decoding. Applied to decode(),
encoding_decode(), and from_to().

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- RuntimeRegex.java: When s/// result contains wide chars (codepoint > 255),
  upgrade from BYTE_STRING to STRING instead of preserving byte type.
  Fixes re/subst.t regression where e.g. s/a/\x{100}/g on a byte string
  incorrectly kept BYTE_STRING type.

- LayeredIOHandle.java: For non-encoding layers like :crlf, read
  conservatively (bytesToRead = charactersNeeded) to avoid over-consuming
  from the delegate, which made tell() inaccurate. Encoding layers
  (UTF-16/32) still read extra bytes to handle multi-byte characters.
  Fixes io/crlf.t regression.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
All 5 reported regressions for PR #424 investigated:
- re/subst.t: fixed (s/// wide char BYTE_STRING upgrade)
- io/crlf.t: fixed (:crlf read over-consumption)
- re/pat_advanced.t: not a regression (matches master)
- comp/parser_run.t: not a regression (matches master)
- op/anonsub.t: not a regression (pre-existing env issue)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Replace ICU4J's UnicodeSet.toPattern(false) with custom
unicodeSetToJavaPattern() that:

1. Uses \x{XXXX} notation for supplementary characters (U+10000+)
   to avoid Java misinterpreting UTF-16 surrogate pairs in char
   class ranges
2. Escapes # and whitespace characters so patterns work correctly
   when recompiled with Pattern.COMMENTS flag (Java's /x mode

Root cause: When an empty regex // reuses the last successful
pattern with different flags (e.g., adding /x), the pattern is
recompiled with Pattern.COMMENTS. Java's COMMENTS mode treats #
as a comment delimiter even inside character classes, breaking
ranges like [!-#] in the expanded \p{IsPunct} pattern.

This fixes the re/pat_advanced.t crash that killed the test at
test ~1521, preventing 157 remaining tests from running. Now all
1678 tests complete (1316 pass, matching master's test count).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
EOF
)
…dvanced.t

- B.pm: wrap require Sub::Util in eval in _introspect() so that
  Sub::Util loading failures (due to @inc reordering) fall back to
  __ANON__ defaults instead of dying (fixes op/anonsub.t test 9)
- IdentifierParser: format non-ASCII bytes as \xNN (uppercase, no braces)
  inside ${...} contexts to match Perl diagnostic format
  (fixes comp/parser_run.t test 66)
- re/pat_advanced.t: no longer crashes - the unicodeSetToJavaPattern()
  fix from previous commit properly handles supplementary characters

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock force-pushed the feature/text-csv-support branch from 30ce422 to 5256681 Compare April 4, 2026 16:25
fglock and others added 2 commits April 4, 2026 18:41
…ions

Replace the no-op stub with a working implementation that:
- Uses B::Hooks::EndOfScope to register cleanup at end of compilation
- Uses Sub::Util::subname (XS) to detect imported vs local functions
- Removes imported functions from the stash while preserving methods
- Supports -cleanee, -also, -except parameters

This fixes DateTime test t/48rt-115983.t which verifies that
Try::Tiny's catch/try don't leak as callable methods on DateTime
objects. Previously the no-op stub left them in the namespace.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…e::autoclean

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Functions installed from companion packages (e.g. DateTime::PP into
DateTime) via glob assignment are now recognized as intentional methods,
not imports. The heuristic: if the origin package is a sub-package of
the cleanee (DateTime::PP starts with DateTime::), keep it.

This fixes DateTime::_ymd2rd (from DateTime::PP) being incorrectly
cleaned, which caused 'Can't locate object method _ymd2rd' errors.
Try::Tiny imports (try, catch) are still correctly cleaned.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock merged commit 26ae675 into master Apr 4, 2026
2 checks passed
@fglock fglock deleted the feature/text-csv-support branch April 4, 2026 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant