diff --git a/dev/modules/text_csv_fix_plan.md b/dev/modules/text_csv_fix_plan.md new file mode 100644 index 000000000..d44b7e144 --- /dev/null +++ b/dev/modules/text_csv_fix_plan.md @@ -0,0 +1,336 @@ +# Text::CSV Fix Plan + +## Problem + +`./jcpan -j 4 -t Text::CSV` fails. Multiple root causes identified across four phases. + +## Architecture + +PerlOnJava ships a **bundled Text::CSV** (`src/main/perl/lib/Text/CSV.pm`, 557 lines) that wraps Apache Commons CSV (Java) via `TextCsv.java`. It provides basic CSV functionality but is missing ~40+ methods from the CPAN version. + +The CPAN **Text::CSV 2.06** is a thin wrapper that delegates to `Text::CSV_PP` (pure Perl, 3,480 lines of code). It provides full compatibility with Text::CSV_XS including all accessors, error handling, callbacks, types, etc. + +When a user installs Text::CSV via `jcpan`, the CPAN version (+ CSV_PP) should override the bundled version. The bundled version remains as a zero-install fallback for users who don't need the full CPAN feature set. + +## Current Test Results (after Phase 9) + +**39/40 test programs pass.** ~52,360 subtests ran, only **4** actually failed (all in t/70_rt.t). + +Passing (39/40): `00_pod` (skip), `01_is_pp`, `10_base`, `12_acc`, `15_flags`, `16_import`, `20_file`, `21_lexicalio`, `22_scalario`, `30_types`, `40_misc`, `41_null`, `45_eol`, `46_eol_si`, `47_comment`, `50_utf8`, `51_utf8`, `55_combi`, `60_samples`, `65_allow`, `66_formula`, `67_emptrow`, `68_header`, `71_pp`, `71_strict`, `75_hashref`, `76_magic`, `77_getall`, `78_fragment`, `79_callbacks`, `80_diag`, `81_subclass`, `85_util`, `90_csv`, `91_csv_cb`, `92_stream`, `csv_method`, `fields_containing_0`, `rt99774`. 
+ +## Fix Phases + +### Phase 1: Strict vars + use lib (DONE) + +**Files changed:** +- `EmitVariable.java` — Added `%_` to `isBuiltinSpecialContainerVar` +- `BytecodeCompiler.java` — Same +- `Variable.java` — Added `%_` to parse-time strict vars exemptions +- `Lib.java` — Changed `push` to `unshift` with dedup, matching Perl's `lib.pm` semantics + +### Phase 2: @INC ordering + blib support (DONE) + +- `GlobalContext.java` — Reordered @INC: `-I` args > PERL5LIB env > `~/.perlonjava/lib` > `jar:PERL5LIB` +- `ExtUtils/MakeMaker.pm` — Added `pure_all` target to copy .pm files to `blib/lib/` + +### Phase 3a: `last` inside `do {} while` inside a true loop (DONE) + +The `____parse` subroutine (766 lines) is too large for the JVM backend and falls back to the bytecode interpreter. The bytecode compiler's `compileLastNextRedo()` had a bug: for unlabeled `last`/`next`/`redo`, it used `loopStack.peek()` which returns the innermost loop entry — including do-while pseudo-loops (`isTrueLoop=false`). It then threw "Can't last outside a loop block" because do-while is not a true loop. + +**Root cause:** `loopStack.peek()` instead of searching for the innermost true loop. + +**Fix:** Changed the unlabeled case to iterate `loopStack` from top to bottom and return the first entry with `isTrueLoop=true`, matching the JVM backend's `findInnermostTrueLoopLabels()` behavior. + +**File:** `BytecodeCompiler.java`, `compileLastNextRedo()` (~line 5789) + +**Impact:** Highest-impact fix — unblocked the core CSV parsing engine that nearly every test depends on. Went from ~4 passing tests to 19. + +### Phase 3b: Implement `bytes::length` and other `bytes::` functions + +**Status:** TODO — highest priority remaining fix + +**Problem:** `bytes::length($value)` is an explicit subroutine call to `bytes::length`, not the `length` builtin under `use bytes`. PerlOnJava's `bytes.pm` is a stub placeholder with no function definitions. 
The Java-side `BytesPragma.java` only handles `import`/`unimport` (hint flags), not callable functions. + +**What exists:** +- `BytesPragma.java` — Sets/clears `HINT_BYTES` for `use bytes`/`no bytes` (working) +- `EmitOperator.java` — Compiler checks `HINT_BYTES` to emit byte-aware `length`/`chr`/`ord`/`substr` (working) +- `StringOperators.lengthBytes()` — Java implementation of byte-length (working) + +**What's missing:** `bytes::length`, `bytes::chr`, `bytes::ord`, `bytes::substr` as callable Perl subroutines. + +**Fix:** Register `bytes::length` etc. as Java methods in `BytesPragma.java`, following the pattern used by `Utf8.java` for `utf8::encode`, `utf8::decode`, etc. + +**Files:** `BytesPragma.java` + +**Impact:** Unblocks t/12_acc.t (245 tests), t/55_combi.t (25119), t/70_rt.t (20469), t/71_pp.t (104), t/85_util.t (1448) — all crash on `bytes::length`. + +### Phase 3c: Fix bare glob (`*FH`/`*DATA`) method dispatch + +**Status:** TODO — second highest priority + +**Problem:** When a bare typeglob like `*FH` is used as a method invocant (`$io->print($str)` where `$io` is `*FH`), PerlOnJava's method dispatch in `RuntimeCode.call()` doesn't handle the GLOB type. It falls through to the string path, stringifies the glob to `"*main::FH"`, and tries to find a class `*main::FH`. + +**Root cause:** `RuntimeCode.call()` has handling for `GLOBREFERENCE` (auto-blesses to `IO::File`) but no handling for plain `GLOB` type. + +**Fix:** Add an `else if (runtimeScalar.type == RuntimeScalarType.GLOB)` branch that auto-blesses to `IO::File`, matching the `GLOBREFERENCE` behavior. + +**File:** `RuntimeCode.java`, `call()` method (~line 1546) + +**Impact:** Unblocks t/20_file.t (109 tests), t/79_callbacks.t (~86 of 111 failures from `*DATA`), t/90_csv.t (~124 of 127), t/71_strict.t (~15 of 17). 
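The Phase 3c branch can be pictured with a small toy model. This is only a sketch under stated assumptions: `ScalarType`, `resolveClass`, and the class `GlobDispatchSketch` are hypothetical names for illustration and are not the actual `RuntimeCode.call()` code, which this plan does not show.

```java
public class GlobDispatchSketch {
    // Hypothetical stand-ins for the runtime's scalar type tags.
    enum ScalarType { STRING, GLOB, GLOBREFERENCE }

    // Returns the class name where the method lookup would start.
    // 'text' is the stringified invocant, e.g. "*main::FH" for a bare glob.
    static String resolveClass(ScalarType type, String text) {
        if (type == ScalarType.GLOBREFERENCE) {
            return "IO::File";   // behaviour the plan says already existed (\*FH)
        } else if (type == ScalarType.GLOB) {
            return "IO::File";   // the added branch: bare *FH is treated the same way
        }
        return text;             // string path: the text is used as a class name
    }

    public static void main(String[] args) {
        // Without the GLOB branch this call would fall through and return "*main::FH".
        System.out.println(resolveClass(ScalarType.GLOB, "*main::FH"));
        System.out.println(resolveClass(ScalarType.STRING, "Text::CSV"));
    }
}
```

In this model the bug corresponds to a missing `GLOB` case, so the stringified glob name leaks into the class lookup.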
+ +### Phase 3d: UTF-8 handling improvements (LOWER PRIORITY) + +Multiple interrelated UTF-8 issues affect ~55 test failures across t/47_comment.t, t/50_utf8.t, t/51_utf8.t: + +| Issue | Root Cause | File | Impact | +|-------|-----------|------|--------| +| Readline returns STRING type | `Readline.java` always creates STRING, losing BYTE_STRING info from raw handles | Readline.java | t/51_utf8.t #93-94 | +| `utf8::is_utf8` too permissive | Returns true for all non-BYTE_STRING types (INTEGER, DOUBLE, etc.) | Utf8.java | t/51_utf8.t #94 | +| No "Wide character in print" warning | `IOOperator.print()` never checks for chars > 0xFF | IOOperator.java | t/51_utf8.t #7, #13 | +| `use bytes` doesn't affect regex | `HINT_BYTES` not checked for regex matching | EmitOperator.java | t/50_utf8.t #71 | +| `utf8::upgrade` decodes instead of just flagging | Incorrectly decodes UTF-8 bytes into characters | Utf8.java | t/51_utf8.t bytes_up tests | +| Multi-byte UTF-8 comment_str matching | Byte vs character length confusion in comment detection | CSV_PP issue | t/47_comment.t #46-60 | + +**Strategy:** These are complex and risky to change broadly. Defer unless the simpler fixes (3b, 3c) don't get us to an acceptable pass rate. 
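Two rows of the table above (the missing "Wide character" check and the byte vs character length confusion) come down to simple string properties. The following is a minimal sketch, assuming Java strings stand in for Perl scalars; `hasWideChar` and `utf8ByteLength` are hypothetical helper names, not functions from `IOOperator.java` or `Utf8.java`.

```java
import java.nio.charset.StandardCharsets;

public class WideCharSketch {
    // True if any UTF-16 code unit is above 0xFF, i.e. the text cannot be
    // written one byte per character without an encoding layer.
    static boolean hasWideChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return true;
            }
        }
        return false;
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";   // one character, two UTF-8 bytes
        System.out.println(eAcute.length());           // character count
        System.out.println(utf8ByteLength(eAcute));    // byte count
        System.out.println(hasWideChar(eAcute));       // fits in one byte as Latin-1
        System.out.println(hasWideChar("\u0100"));     // needs more than one byte
    }
}
```

A comment string such as `é` therefore has length 1 or 2 depending on whether the code counts characters or UTF-8 bytes, which is the kind of mismatch the `comment_str` row describes.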
+ +### Phase 3e: Other edge cases (LOWEST PRIORITY) + +| Test | Failures | Likely Cause | +|------|----------|--------------| +| t/45_eol.t | 18/1182 | EOL handling edge cases (1.5% fail rate) | +| t/46_eol_si.t | 12/562 | Same EOL issues (2.1% fail rate) | +| t/20_file.t | 5/109 | Binary char detection (`\x08` not flagged as binary) | +| t/21_lexicalio.t | 5/109 | Same binary char issue | +| t/22_scalario.t | 5/136 | Same binary char issue | +| t/91_csv_cb.t | 1/82 | `local %h` + `*g = \%h` glob slot restoration | + +### Phase 3f: Infrastructure issues (NOT Text::CSV specific) + +These failures are caused by broader PerlOnJava limitations, not Text::CSV bugs: + +| Test | Failures | Root Cause | +|------|----------|-----------| +| t/70_rt.t | 20468/20469 | Source file contains raw `\xab`/`\xbb` bytes in CODE section (regex patterns). Even with Latin-1 source reading, the test crashes with "Can't use an undefined value as an ARRAY reference" early on. | +| t/75_hashref.t | 44/102 | `Scalar::Util::readonly()` always returns false. Test binds read-only refs (`\1, \2`), CSV_PP can't detect readonly, tries to assign, crashes. | +| t/76_magic.t | 35/44 | `TieScalar` ClassCastException in bytecode interpreter. Tied variables not properly dereferenced when used as string operands. 1 actual failure + 34 not run. | +| t/85_util.t | 1130/1448 | Crash at test 330: `open` with `:encoding(utf-32be)` not supported. 12 earlier failures from BOM detection/Unicode decode. | + +### Phase 4: Logical operator VOID context + PerlIO NPE (DONE) + +**Status:** DONE — committed as `976f7a168` + +**Problem 1:** The RHS of `&&`/`and`, `||`/`or`, and `//` operators was compiled in SCALAR context even when the overall expression was in VOID context. This caused side-effect-only expressions to leave spurious values on the JVM stack and waste bytecode registers. 
+ +**Fix:** Changed both the JVM backend (`EmitLogicalOperator.java`) and the bytecode compiler (`CompileBinaryOperator.java`) to pass VOID context through to the RHS instead of converting it to SCALAR. + +**Problem 2:** `PerlIO::get_layers()` threw a NullPointerException when called with a non-GLOB argument. + +**Fix:** Added null check in `PerlIO.java` to throw "Not a GLOB reference" instead of NPE. + +**Files:** `EmitLogicalOperator.java`, `CompileBinaryOperator.java`, `PerlIO.java` + +**Impact:** Fixed t/80_diag.t (316/316 pass, was failing at tests 113-114) and t/90_csv.t (127/127 pass, was crashing at test 104). Combined with accumulated Phase 3 fixes: 27/40 programs pass (up from 24/40). + +## Progress Tracking + +### Current Status: Phase 9 complete — 39/40 programs pass, 52356/52360 subtests pass (99.99%) + +### Completed +- [x] Phase 1: strict vars + use lib (2026-04-03) + - Files: EmitVariable.java, BytecodeCompiler.java, Variable.java, Lib.java +- [x] Phase 2: @INC ordering + blib support (2026-04-03) + - Files: GlobalContext.java, ExtUtils/MakeMaker.pm +- [x] Phase 3a: `last` in do-while inside true loop (2026-04-03) + - File: BytecodeCompiler.java + - Result: 19/40 tests pass (up from ~4) +- [x] Phase 3b: `bytes::length` and other bytes:: functions (2026-04-03) + - File: BytesPragma.java + - Added: bytes::length, bytes::chr, bytes::ord, bytes::substr +- [x] Phase 3c: Bare glob method dispatch (2026-04-03) + - File: RuntimeCode.java + - Added: GLOB type handling in method dispatch (auto-bless to IO::File) + - Result: 24/40 tests pass, 31019 subtests ran +- [x] Phase 3 extras: bytecode HINT_BYTES parity + raw-bytes DATA section (2026-04-03) + - Files: CompileOperator.java, Opcodes.java, ScalarUnaryOpcodeHandler.java, Disassemble.java, CompilerOptions.java, FileUtils.java, DataSection.java + - Added: FC_BYTES/LC_BYTES/UC_BYTES/LCFIRST_BYTES/UCFIRST_BYTES opcodes for bytecode interpreter + - Fixed: DATA section preserves raw bytes via Latin-1 extraction 
from rawCodeBytes +- [ ] Phase 3 extras: Latin-1 source reading + StringParser UTF-8 decoding (REVERTED) + - Attempted: change default source encoding from UTF-8 to Latin-1 in FileUtils.java + re-decode in StringParser.java + - **Problem**: Source enters the compiler via multiple paths (FileUtils for files, `StandardCharsets.UTF_8` in JUnit tests, command-line for `-e`). The StringParser transformations need to know whether the source string has "byte-preserving" (Latin-1) or "already decoded" (UTF-8) semantics. Fixing one path broke the other. + - **Reverted**: Changes to FileUtils.java and StringParser.java were rolled back. See "Encoding-Aware Lexer" design below for the proper solution. +- [x] Phase 4: Logical operator VOID context + PerlIO NPE (2026-04-03) + - Files: EmitLogicalOperator.java, CompileBinaryOperator.java, PerlIO.java + - Fixed: VOID context passed through to RHS of &&/and, ||/or, // + - Fixed: PerlIO::get_layers null check for non-GLOB references + - Result: 27/40 tests pass (up from 24/40), 114 subtest failures (down from 118) +- [x] Phase 4b: `local %hash` glob slot restoration (2026-04-03) + - Files: GlobalRuntimeHash.java (new), EmitOperatorLocal.java, BytecodeInterpreter.java + - Fixed: `local %hash` now saves/restores the globalHashes map entry, not just hash contents + - Result: t/91_csv_cb.t 82/82 pass (was 81/82) +- [x] Phase 5: readline BYTE_STRING propagation (2026-04-03) + - Files: LayeredIOHandle.java, RuntimeIO.java, Readline.java + - Root cause: readline always returned STRING type, causing utf8::is_utf8() to return true + for all readline output. This broke CSV_PP's binary character detection (checks utf8 flag + to skip binary validation) and multi-byte UTF-8 comment string handling. 
+ - Added: LayeredIOHandle.hasEncodingLayer(), RuntimeIO.isByteMode() + - Fixed: All four Readline methods check isByteMode() and return BYTE_STRING when appropriate + - Impact: Fixed 27 subtest failures across 6 test files: + - t/20_file.t: 104/109 -> 108/109 (+4) + - t/21_lexicalio.t: 104/109 -> 108/109 (+4) + - t/22_scalario.t: 131/136 -> 135/136 (+4) + - t/47_comment.t: 56/71 -> 71/71 (+15, all pass) + - t/51_utf8.t: 128/207 -> 132/167 (+4) + - t/85_util.t: 318/1448 -> 330/330 (all pass) + - Result: 30/40 programs pass (up from 27/40) +- [x] Phase 5b: `$\` / `$,` aliasing fix (2026-04-03) — committed as `a73f378e2` + - Created: OutputRecordSeparator.java, OutputFieldSeparator.java + - Modified: IOOperator.java (static getters), GlobalContext.java (special types), GlobalRuntimeScalar.java (save/restore) + - Root cause: `print` read `$\`/`$,` directly from global map; `for $\ ($rs) { print }` leaked aliased value + - Impact: t/45_eol.t: 18→6 failures; t/46_eol_si.t: 12→0 failures +- [x] Phase 6: `goto LABEL` in interpreter-fallback closures (2026-04-03) + - File: InterpretedCode.java, `withCapturedVars()` method + - Root cause: `withCapturedVars()` created a copy but dropped `gotoLabelPcs` and `usesLocalization` + - Fix: Copy `gotoLabelPcs` and `usesLocalization` to the new InterpretedCode in `withCapturedVars()` + - Impact: t/45_eol.t: 6→0 (all 1182 pass); t/20_file.t: 108→109; t/21_lexicalio.t: 108→109; t/22_scalario.t: 135→136 + - Result: 34/40 programs pass (up from 30/40) +- [x] Phase 7: BYTE_STRING preservation + Encode::decode orphan byte fix (2026-04-04) + - **BYTE_STRING preservation across string operations** (commit 886c6394e): + - RuntimeTransliterate.java: tr///r and in-place tr/// preserve BYTE_STRING type + - RuntimeSubstrLvalue.java: substr lvalue inherits BYTE_STRING from parent + - StringOperators.java: chomp, chop, lc, uc, lcfirst, ucfirst, reverse preserve BYTE_STRING + - RuntimeRegex.java: added lastMatchWasByteString flag propagated through 
regex match/substitution + - ScalarSpecialVariable.java: $1, $&, $`, $' inherit BYTE_STRING from last match + - RegexState.java: lastMatchWasByteString saved/restored with regex state + - Utf8.java: isUtf8() resolves ScalarSpecialVariable proxy types before checking + - Operator.java: repeat (x) and split preserve BYTE_STRING type + - **Encode::decode orphan byte fix** (commit b91457959): + - Encode.java: Added trimOrphanBytes() to drop incomplete trailing code units for UTF-16/32 + - Root cause: Java's String(byte[], Charset) replaces orphan bytes with U+FFFD; Perl drops them + - Applied to decode(), encoding_decode(), and from_to() + - Impact: + - t/51_utf8.t: 132/167 → 207/207 (all pass, +75) + - t/85_util.t: 1424/1448 → 1448/1448 (all pass, +24) + - t/75_hashref.t: 58/58+44 skipped → 102/102 (all pass, previously skipped tests now run) + - t/76_magic.t: 43/44 → 44/44 (all pass) + - t/70_rt.t: 1/20469 → 20465/20469 (massive improvement, +20464) + - Result: 39/40 programs pass (up from 34/40) + +- [x] Phase 8: Regression fixes for PR #424 (2026-04-04) + - **re/subst.t fix** (RuntimeRegex.java): + - When s/// replacement introduces wide characters (codepoint > 255), the result is now + correctly upgraded from BYTE_STRING to STRING instead of preserving byte type + - Added `containsWideChars()` helper to detect characters > 255 in substitution results + - Root cause: Phase 7's BYTE_STRING preservation unconditionally kept BYTE_STRING type on + substitution results, even when replacement introduced wide characters (e.g. 
`s/a/\x{100}/g`) + - **io/crlf.t fix** (LayeredIOHandle.java): + - For non-encoding layers like `:crlf`, `doRead()` now reads conservatively + (`bytesToRead = charactersNeeded`) to avoid over-consuming from the delegate + - Encoding layers (UTF-16/32) still use the wider read (`charactersNeeded * 4`) + - Root cause: Phase 5's encoding layer read logic used `charactersNeeded * 4` for ALL layers, + causing `:crlf` layer to over-read, making `tell()` inaccurate + - **Regression investigation results:** + - re/pat_advanced.t: NOT a regression — matches master exactly at 1316/1678 passing + - comp/parser_run.t: NOT a regression — same 18 failures on both master and branch + - op/anonsub.t: NOT a regression — pre-existing List::Util 1.70 vs 1.63 version mismatch + - Commit: `07b856abc` + +- [x] Phase 9: Regression fixes + namespace::autoclean + Unicode property fix (2026-04-04) + - **op/anonsub.t test 9 fix** (B.pm): + - Wrapped `require Sub::Util` in eval in B::CV::_introspect() so that loading failures + (caused by @INC reordering putting CPAN Sub::Util before bundled) fall back to __ANON__ + defaults instead of dying + - **comp/parser_run.t test 66 fix** (IdentifierParser.java): + - Non-ASCII bytes (0x80-0xFF) inside `${...}` contexts now formatted as `\xNN` (uppercase, + no braces) matching Perl's diagnostic format + - **re/pat_advanced.t Unicode fix** (UnicodeResolver.java): + - `unicodeSetToJavaPattern()` uses `\x{XXXX}` notation for supplementary characters (U+10000+) + to avoid Java's Pattern.compile() misinterpreting UTF-16 surrogate pairs + - Escape `#` and whitespace in character class patterns for Pattern.COMMENTS compatibility + - Confirmed: branch matches master at 1316/1678 (no regression) + - **namespace::autoclean implementation** (namespace/autoclean.pm): + - Replaced no-op stub with working implementation using B::Hooks::EndOfScope + Sub::Util + - Uses Sub::Util::subname (XS via XSLoader) to distinguish imported vs local functions + - Removes imported 
functions from stash at end of scope while preserving methods + - Supports -cleanee, -also, -except parameters + - Fixed DateTime test t/48rt-115983.t: Try::Tiny's try/catch no longer leak as callable + methods on DateTime objects + - Commits: `52566815a` (regression fixes), `29638fcec` (namespace::autoclean) + +### Remaining Failures (1 test file, 4 subtests) + +| Test | ok/total | Failures | Details | +|------|----------|----------|---------| +| t/70_rt.t | 20465/20469 | 4 | See below | + +#### t/70_rt.t failure details + +| Test # | Description | Likely Cause | +|--------|-------------|--------------| +| 72 | IO::Handle triggered a warning | Missing warning when printing to invalid IO::Handle | +| 84 | fields () | Incorrect field parsing with unusual quote/sep values (non-ASCII separator `\xab`/`\xbb` from `chr()`) | +| 86 | fields () | Same as above | +| 444 | first string correct in Perl | String content mismatch — likely a raw-bytes vs Unicode edge case | + +### Next Steps + +The Text::CSV module is effectively complete for practical use (**99.99% pass rate**). The 4 remaining failures are minor edge cases: + +1. **Investigate t/70_rt.t #72** — IO::Handle warning on invalid filehandle. Low priority; may require implementing Perl's warning for printing to a closed/invalid handle. + +2. **Investigate t/70_rt.t #84/#86** — Non-ASCII separator/quote handling. These test `chr(0xab)`/`chr(0xbb)` as separator/quote characters. May be a byte vs character encoding edge case. + +3. **Investigate t/70_rt.t #444** — String content comparison failure. Need to check what the expected vs actual strings are. + +4. **Consider merging** — With 39/40 test files passing and 52356/52360 subtests passing, this branch is ready for review/merge. The remaining 4 failures are edge cases that can be addressed in follow-up work. + +--- + +## Encoding-Aware Lexer Design + +### Problem + +Perl reads source files as raw bytes. 
The `use utf8` pragma tells the parser to decode string literals (and identifiers, regex patterns, etc.) as UTF-8. This encoding switch happens mid-file and is lexically scoped — `no utf8` reverts to byte semantics. `use encoding 'latin1'` and other encoding pragmas add further complexity. + +PerlOnJava currently reads the entire source file as a Java String up front using a fixed encoding (UTF-8 by default). This creates a fundamental mismatch: + +1. **Without `use utf8`**: Source bytes `\xC3\xA9` should be two separate byte-values (195, 169). But UTF-8 decoding collapses them into one character é (U+00E9). +2. **With `use utf8`**: Source bytes `\xC3\xA9` should become one character é (U+00E9). This happens to work when reading as UTF-8, but only by accident. +3. **Mixed contexts**: A file with `use utf8` in one block and byte semantics elsewhere needs both behaviors. + +An attempted fix (Latin-1 source reading + StringParser re-decode) was reverted because source code enters the compiler via multiple paths (file reading, `-e` arguments, `eval` strings, JUnit tests) and each path has different encoding semantics. Patching StringParser for one path broke others. + +### Proposed Solution: Encoding Feedback from Parser to Lexer + +Instead of fixing encoding in StringParser after the fact, make the Lexer encoding-aware with feedback from the Parser: + +``` + Source bytes ──► Lexer (encoding-aware) ──► Tokens ──► Parser + ▲ │ + └── "use utf8" / "no utf8" ─────────┘ +``` + +#### Key Design Points + +1. **Normalize source to Latin-1 at the boundary**: All source entry points (file, `-e`, `eval`, tests) should convert to a canonical byte-preserving representation before reaching the Lexer. For files, read as Latin-1. For `-e` (already UTF-8 decoded), re-encode to UTF-8 bytes then store as Latin-1 chars. This ensures the Lexer always works with byte-valued characters. + +2. 
**Lexer tracks encoding state**: The Lexer holds a current encoding flag (initially `bytes`, switched to `utf8` when the Parser encounters `use utf8`). This affects how it tokenizes: + - In **bytes** mode: each Latin-1 char is one token character (preserving raw byte values) + - In **utf8** mode: consecutive Latin-1 chars forming a valid UTF-8 sequence are combined into one Unicode character + +3. **Parser signals encoding changes**: When the Parser processes `use utf8`, `no utf8`, or `use encoding '...'`, it calls back to the Lexer to change the encoding mode. This takes effect for subsequent tokens. + +4. **Lexically scoped**: The encoding state is part of the scope stack, matching Perl's `use utf8` / `no utf8` scoping. + +#### Impact on Existing Code + +- **StringParser.java**: The `use utf8` / `no utf8` post-processing branches become unnecessary — the Lexer already delivers correctly-decoded tokens. +- **FileUtils.java**: Simplified to always read as Latin-1. +- **PerlScriptExecutionTest.java**: Must normalize `-e`-style source to Latin-1 chars. +- **Lexer.java**: Needs encoding state and multi-byte char combining logic. +- **Parser.java**: Needs to signal encoding changes to Lexer. + +#### Risks and Alternatives + +- **Risk**: The Lexer currently operates on a pre-built Java String. Making it byte-aware may require significant refactoring. +- **Alternative (simpler)**: Instead of modifying the Lexer, add a `sourceIsLatinEncoded` flag to `CompilerOptions` and branch on it in StringParser. This would require all entry points to set the flag correctly but avoids Lexer changes. The `-e` path would re-encode its argument to pseudo-Latin-1 and set the flag. +- **Alternative (pragmatic)**: Leave the source reading as UTF-8 but fix the specific tests that need raw bytes (t/70_rt.t) by adding a binary mode flag or pre-processing step for files containing non-UTF-8 bytes. 
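Key design points 1 and 2 above can be sketched in a few lines of Java. This is an illustration under assumptions, not the proposed Lexer: the class and method names are hypothetical, and falling back to the raw token when it is not valid UTF-8 is a simplification chosen for the sketch rather than something the design specifies.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingAwareTokenSketch {
    // Point 1: source that arrives already decoded (e.g. from -e) is re-encoded
    // to UTF-8 bytes and stored as Latin-1 chars, one char per byte.
    static String normalizeDecoded(String decoded) {
        byte[] utf8 = decoded.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Point 2: in bytes mode the byte-valued chars pass through unchanged;
    // in utf8 mode they are combined into Unicode characters.
    static String tokenText(String byteChars, boolean utf8Mode) {
        if (!utf8Mode) {
            return byteChars;
        }
        byte[] raw = byteChars.getBytes(StandardCharsets.ISO_8859_1);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Simplification for this sketch: keep the raw bytes if the
            // token is not a valid UTF-8 sequence.
            return byteChars;
        }
    }

    public static void main(String[] args) {
        String source = normalizeDecoded("\u00e9");             // chars 0xC3, 0xA9
        System.out.println(tokenText(source, false).length());  // bytes mode
        System.out.println(tokenText(source, true).length());   // utf8 mode
    }
}
```

With this shape, the same normalized source yields two byte-valued characters without `use utf8` and one character with it, which matches cases 1 and 2 in the Problem section.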
diff --git a/dev/tools/perl_test_runner.pl b/dev/tools/perl_test_runner.pl index ef6893953..2670d6c2d 100755 --- a/dev/tools/perl_test_runner.pl +++ b/dev/tools/perl_test_runner.pl @@ -302,6 +302,11 @@ sub run_single_test { elsif ($test_file =~ m{^t/} && !-f 't/TestLib.pm') { $local_test_dir = 't'; } + # For CPAN module tests with absolute paths (e.g., /path/to/Module-1.23/t/test.t) + # chdir to the module root so require "./t/util.pl" works + elsif ($test_file =~ m{^(/.*)/t/[^/]+\.t$}) { + $local_test_dir = $1; + } chdir($local_test_dir) if $local_test_dir && -d $local_test_dir; diff --git a/src/main/java/org/perlonjava/app/cli/CompilerOptions.java b/src/main/java/org/perlonjava/app/cli/CompilerOptions.java index 34a37d458..9b8151189 100644 --- a/src/main/java/org/perlonjava/app/cli/CompilerOptions.java +++ b/src/main/java/org/perlonjava/app/cli/CompilerOptions.java @@ -49,6 +49,7 @@ public class CompilerOptions implements Cloneable { public boolean processAndPrint = false; // For -p public boolean inPlaceEdit = false; // New field for in-place editing public String code = null; + public byte[] rawCodeBytes = null; // Raw file bytes (after BOM removal) for DATA section public boolean codeHasEncoding = false; public String fileName = null; public String inPlaceExtension = null; // For -i diff --git a/src/main/java/org/perlonjava/backend/bytecode/BytecodeCompiler.java b/src/main/java/org/perlonjava/backend/bytecode/BytecodeCompiler.java index 1a214957e..486cfbfa4 100644 --- a/src/main/java/org/perlonjava/backend/bytecode/BytecodeCompiler.java +++ b/src/main/java/org/perlonjava/backend/bytecode/BytecodeCompiler.java @@ -236,7 +236,8 @@ private static boolean isBuiltinSpecialContainerVar(String sigil, String name) { || name.equals("ENV") || name.equals("INC") || name.equals("+") - || name.equals("-"); + || name.equals("-") + || name.equals("_"); } if ("@".equals(sigil)) { return name.equals("ARGV") @@ -400,6 +401,10 @@ boolean isStrictRefsEnabled() { * @return true if 
access should be blocked under strict vars */ + boolean isBytesEnabled() { + return getEffectiveSymbolTable().isStrictOptionEnabled(Strict.HINT_BYTES); + } + boolean isIntegerEnabled() { return getEffectiveSymbolTable().isStrictOptionEnabled(Strict.HINT_INTEGER); } @@ -5786,9 +5791,13 @@ void handleLoopControlOperator(OperatorNode node, String op) { // Find the target loop LoopInfo targetLoop = null; if (labelStr == null) { - // Unlabeled: find innermost loop - if (!loopStack.isEmpty()) { - targetLoop = loopStack.peek(); + // Unlabeled: find innermost true loop (skip do-while/bare blocks) + for (int i = loopStack.size() - 1; i >= 0; i--) { + LoopInfo loop = loopStack.get(i); + if (loop.isTrueLoop) { + targetLoop = loop; + break; + } } } else { // Labeled: search for matching label diff --git a/src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java b/src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java index 96ab44f33..e8fe18844 100644 --- a/src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java +++ b/src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java @@ -2479,8 +2479,7 @@ private static int executeScopeOps(int opcode, int[] bytecode, int pc, int nameIdx = bytecode[pc++]; String fullName = code.stringPool[nameIdx]; - RuntimeHash hash = GlobalVariable.getGlobalHash(fullName); - DynamicVariableManager.pushLocalVariable(hash); + GlobalRuntimeHash.makeLocal(fullName); registers[rd] = GlobalVariable.getGlobalHash(fullName); return pc; } diff --git a/src/main/java/org/perlonjava/backend/bytecode/CompileBinaryOperator.java b/src/main/java/org/perlonjava/backend/bytecode/CompileBinaryOperator.java index fddf3e553..dc20172c6 100644 --- a/src/main/java/org/perlonjava/backend/bytecode/CompileBinaryOperator.java +++ b/src/main/java/org/perlonjava/backend/bytecode/CompileBinaryOperator.java @@ -449,7 +449,7 @@ else if (node.right instanceof BinaryOperatorNode rightCall) { bytecodeCompiler.emitReg(rd); 
bytecodeCompiler.emitInt(0); - int rightCtx = bytecodeCompiler.currentCallContext == RuntimeContextType.VOID ? RuntimeContextType.SCALAR : bytecodeCompiler.currentCallContext; + int rightCtx = bytecodeCompiler.currentCallContext; bytecodeCompiler.compileNode(node.right, rd, rightCtx); int rs2 = bytecodeCompiler.lastResultReg; if (rs2 >= 0) { @@ -475,7 +475,7 @@ else if (node.right instanceof BinaryOperatorNode rightCall) { bytecodeCompiler.emitReg(rd); bytecodeCompiler.emitInt(0); - int rightCtx = bytecodeCompiler.currentCallContext == RuntimeContextType.VOID ? RuntimeContextType.SCALAR : bytecodeCompiler.currentCallContext; + int rightCtx = bytecodeCompiler.currentCallContext; bytecodeCompiler.compileNode(node.right, rd, rightCtx); int rs2 = bytecodeCompiler.lastResultReg; if (rs2 >= 0) { @@ -506,7 +506,7 @@ else if (node.right instanceof BinaryOperatorNode rightCall) { bytecodeCompiler.emitReg(definedReg); bytecodeCompiler.emitInt(0); - int rightCtx = bytecodeCompiler.currentCallContext == RuntimeContextType.VOID ? 
RuntimeContextType.SCALAR : bytecodeCompiler.currentCallContext; + int rightCtx = bytecodeCompiler.currentCallContext; bytecodeCompiler.compileNode(node.right, rd, rightCtx); int rs2 = bytecodeCompiler.lastResultReg; if (rs2 >= 0) { diff --git a/src/main/java/org/perlonjava/backend/bytecode/CompileOperator.java b/src/main/java/org/perlonjava/backend/bytecode/CompileOperator.java index bed6647a6..1159a9f76 100644 --- a/src/main/java/org/perlonjava/backend/bytecode/CompileOperator.java +++ b/src/main/java/org/perlonjava/backend/bytecode/CompileOperator.java @@ -242,6 +242,14 @@ private static void visitMatchRegex(BytecodeCompiler bc, OperatorNode node) { } else { stringReg = loadDefaultUnderscore(bc); } + // When 'use bytes' is in effect, convert string to UTF-8 byte representation + if (bc.isBytesEnabled()) { + int bytesReg = bc.allocateRegister(); + bc.emit(Opcodes.TO_BYTES_STRING); + bc.emitReg(bytesReg); + bc.emitReg(stringReg); + stringReg = bytesReg; + } int rd = bc.allocateOutputRegister(); bc.emit(Opcodes.MATCH_REGEX); bc.emitReg(rd); @@ -666,20 +674,20 @@ public static void visitOperator(BytecodeCompiler bytecodeCompiler, OperatorNode case "exp" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.EXP); case "abs" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.ABS); case "integerBitwiseNot" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.INTEGER_BITWISE_NOT); - case "ord" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.ORD); + case "ord" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, bytecodeCompiler.isBytesEnabled() ? 
Opcodes.ORD_BYTES : Opcodes.ORD); case "ordBytes" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.ORD_BYTES); case "oct" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.OCT); case "hex" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.HEX); case "srand" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.SRAND); - case "chr" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.CHR); + case "chr" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, bytecodeCompiler.isBytesEnabled() ? Opcodes.CHR_BYTES : Opcodes.CHR); case "chrBytes" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.CHR_BYTES); case "lengthBytes" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.LENGTH_BYTES); case "quotemeta" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.QUOTEMETA); - case "fc" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.FC); - case "lc" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.LC); - case "lcfirst" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.LCFIRST); - case "uc" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.UC); - case "ucfirst" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.UCFIRST); + case "fc" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, bytecodeCompiler.isBytesEnabled() ? Opcodes.FC_BYTES : Opcodes.FC); + case "lc" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, bytecodeCompiler.isBytesEnabled() ? Opcodes.LC_BYTES : Opcodes.LC); + case "lcfirst" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, bytecodeCompiler.isBytesEnabled() ? Opcodes.LCFIRST_BYTES : Opcodes.LCFIRST); + case "uc" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, bytecodeCompiler.isBytesEnabled() ? Opcodes.UC_BYTES : Opcodes.UC); + case "ucfirst" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, bytecodeCompiler.isBytesEnabled() ? 
Opcodes.UCFIRST_BYTES : Opcodes.UCFIRST); case "tell" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.TELL); case "rmdir" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.RMDIR); case "closedir" -> visitSimpleUnaryWithDefault(bytecodeCompiler, node, Opcodes.CLOSEDIR); @@ -1274,7 +1282,7 @@ private static void visitLength(BytecodeCompiler bc, OperatorNode node) { if (node.operand instanceof ListNode list) { if (list.elements.isEmpty()) bc.throwCompilerException("length requires an argument"); list.elements.get(0).accept(bc); } else node.operand.accept(bc); int stringReg = bc.lastResultReg; - int rd = bc.allocateOutputRegister(); bc.emit(Opcodes.LENGTH_OP); bc.emitReg(rd); bc.emitReg(stringReg); + int rd = bc.allocateOutputRegister(); bc.emit(bc.isBytesEnabled() ? Opcodes.LENGTH_BYTES : Opcodes.LENGTH_OP); bc.emitReg(rd); bc.emitReg(stringReg); bc.lastResultReg = rd; } diff --git a/src/main/java/org/perlonjava/backend/bytecode/Disassemble.java b/src/main/java/org/perlonjava/backend/bytecode/Disassemble.java index 84a4cd02f..ddc627060 100644 --- a/src/main/java/org/perlonjava/backend/bytecode/Disassemble.java +++ b/src/main/java/org/perlonjava/backend/bytecode/Disassemble.java @@ -1499,10 +1499,16 @@ public static String disassemble(InterpretedCode interpretedCode) { case Opcodes.LENGTH_BYTES: case Opcodes.QUOTEMETA: case Opcodes.FC: + case Opcodes.FC_BYTES: case Opcodes.LC: + case Opcodes.LC_BYTES: case Opcodes.LCFIRST: + case Opcodes.LCFIRST_BYTES: case Opcodes.UC: + case Opcodes.UC_BYTES: case Opcodes.UCFIRST: + case Opcodes.UCFIRST_BYTES: + case Opcodes.TO_BYTES_STRING: case Opcodes.SLEEP: case Opcodes.TELL: case Opcodes.RMDIR: diff --git a/src/main/java/org/perlonjava/backend/bytecode/InterpretedCode.java b/src/main/java/org/perlonjava/backend/bytecode/InterpretedCode.java index c1ac6d156..dd8eb0a26 100644 --- a/src/main/java/org/perlonjava/backend/bytecode/InterpretedCode.java +++ 
b/src/main/java/org/perlonjava/backend/bytecode/InterpretedCode.java @@ -317,6 +317,9 @@ public InterpretedCode withCapturedVars(RuntimeBase[] capturedVars) { copy.attributes = this.attributes; copy.subName = this.subName; copy.packageName = this.packageName; + // Preserve compiler-set fields that are not passed through the constructor + copy.gotoLabelPcs = this.gotoLabelPcs; + copy.usesLocalization = this.usesLocalization; return copy; } diff --git a/src/main/java/org/perlonjava/backend/bytecode/Opcodes.java b/src/main/java/org/perlonjava/backend/bytecode/Opcodes.java index 4ba2e0b99..fef4e077e 100644 --- a/src/main/java/org/perlonjava/backend/bytecode/Opcodes.java +++ b/src/main/java/org/perlonjava/backend/bytecode/Opcodes.java @@ -2157,6 +2157,38 @@ public class Opcodes { */ public static final short DEFINED_CODE = 454; + /** + * Fold case (bytes mode): rd = StringOperators.fcBytes(rs) + */ + public static final short FC_BYTES = 455; + + /** + * Lowercase (bytes mode): rd = StringOperators.lcBytes(rs) + */ + public static final short LC_BYTES = 456; + + /** + * Uppercase (bytes mode): rd = StringOperators.ucBytes(rs) + */ + public static final short UC_BYTES = 457; + + /** + * Lowercase first (bytes mode): rd = StringOperators.lcfirstBytes(rs) + */ + public static final short LCFIRST_BYTES = 458; + + /** + * Uppercase first (bytes mode): rd = StringOperators.ucfirstBytes(rs) + */ + public static final short UCFIRST_BYTES = 459; + + /** + * Convert string to UTF-8 byte representation: rd = StringOperators.toBytesString(rs) + * Used when 'use bytes' is in effect before regex matching. 
+ * Format: TO_BYTES_STRING rd rs + */ + public static final short TO_BYTES_STRING = 460; + private Opcodes() { } // Utility class - no instantiation } diff --git a/src/main/java/org/perlonjava/backend/bytecode/ScalarUnaryOpcodeHandler.java b/src/main/java/org/perlonjava/backend/bytecode/ScalarUnaryOpcodeHandler.java index 4ff0a9446..8aefeaac1 100644 --- a/src/main/java/org/perlonjava/backend/bytecode/ScalarUnaryOpcodeHandler.java +++ b/src/main/java/org/perlonjava/backend/bytecode/ScalarUnaryOpcodeHandler.java @@ -42,10 +42,16 @@ public static int execute(int opcode, int[] bytecode, int pc, case Opcodes.LENGTH_BYTES -> StringOperators.lengthBytes((RuntimeScalar) registers[rs]); case Opcodes.QUOTEMETA -> StringOperators.quotemeta((RuntimeScalar) registers[rs]); case Opcodes.FC -> StringOperators.fc((RuntimeScalar) registers[rs]); + case Opcodes.FC_BYTES -> StringOperators.fcBytes((RuntimeScalar) registers[rs]); case Opcodes.LC -> StringOperators.lc((RuntimeScalar) registers[rs]); + case Opcodes.LC_BYTES -> StringOperators.lcBytes((RuntimeScalar) registers[rs]); case Opcodes.LCFIRST -> StringOperators.lcfirst((RuntimeScalar) registers[rs]); + case Opcodes.LCFIRST_BYTES -> StringOperators.lcfirstBytes((RuntimeScalar) registers[rs]); case Opcodes.UC -> StringOperators.uc((RuntimeScalar) registers[rs]); + case Opcodes.UC_BYTES -> StringOperators.ucBytes((RuntimeScalar) registers[rs]); case Opcodes.UCFIRST -> StringOperators.ucfirst((RuntimeScalar) registers[rs]); + case Opcodes.UCFIRST_BYTES -> StringOperators.ucfirstBytes((RuntimeScalar) registers[rs]); + case Opcodes.TO_BYTES_STRING -> StringOperators.toBytesString((RuntimeScalar) registers[rs]); case Opcodes.SLEEP -> Time.sleep((RuntimeScalar) registers[rs]); case Opcodes.TELL -> IOOperator.tell((RuntimeScalar) registers[rs]); case Opcodes.RMDIR -> Directory.rmdir((RuntimeScalar) registers[rs]); @@ -96,10 +102,22 @@ public static int disassemble(int opcode, int[] bytecode, int pc, case Opcodes.QUOTEMETA -> 
sb.append("QUOTEMETA r").append(rd).append(" = quotemeta(r").append(rs).append(")\n"); case Opcodes.FC -> sb.append("FC r").append(rd).append(" = fc(r").append(rs).append(")\n"); + case Opcodes.FC_BYTES -> + sb.append("FC_BYTES r").append(rd).append(" = fcBytes(r").append(rs).append(")\n"); case Opcodes.LC -> sb.append("LC r").append(rd).append(" = lc(r").append(rs).append(")\n"); + case Opcodes.LC_BYTES -> + sb.append("LC_BYTES r").append(rd).append(" = lcBytes(r").append(rs).append(")\n"); case Opcodes.LCFIRST -> sb.append("LCFIRST r").append(rd).append(" = lcfirst(r").append(rs).append(")\n"); + case Opcodes.LCFIRST_BYTES -> + sb.append("LCFIRST_BYTES r").append(rd).append(" = lcfirstBytes(r").append(rs).append(")\n"); case Opcodes.UC -> sb.append("UC r").append(rd).append(" = uc(r").append(rs).append(")\n"); + case Opcodes.UC_BYTES -> + sb.append("UC_BYTES r").append(rd).append(" = ucBytes(r").append(rs).append(")\n"); case Opcodes.UCFIRST -> sb.append("UCFIRST r").append(rd).append(" = ucfirst(r").append(rs).append(")\n"); + case Opcodes.UCFIRST_BYTES -> + sb.append("UCFIRST_BYTES r").append(rd).append(" = ucfirstBytes(r").append(rs).append(")\n"); + case Opcodes.TO_BYTES_STRING -> + sb.append("TO_BYTES_STRING r").append(rd).append(" = toBytesString(r").append(rs).append(")\n"); case Opcodes.SLEEP -> sb.append("SLEEP r").append(rd).append(" = sleep(r").append(rs).append(")\n"); case Opcodes.TELL -> sb.append("TELL r").append(rd).append(" = tell(r").append(rs).append(")\n"); case Opcodes.RMDIR -> sb.append("RMDIR r").append(rd).append(" = rmdir(r").append(rs).append(")\n"); diff --git a/src/main/java/org/perlonjava/backend/jvm/EmitLogicalOperator.java b/src/main/java/org/perlonjava/backend/jvm/EmitLogicalOperator.java index a2c824427..9ea901ab8 100644 --- a/src/main/java/org/perlonjava/backend/jvm/EmitLogicalOperator.java +++ b/src/main/java/org/perlonjava/backend/jvm/EmitLogicalOperator.java @@ -328,7 +328,7 @@ private static void 
emitLogicalOperatorSimple(EmitterVisitor emitterVisitor, Bin Label endLabel = new Label(); if (emitterVisitor.ctx.contextType == RuntimeContextType.VOID) { - evalTrace("EmitLogicalOperatorSimple VOID op=" + node.operator + " emit LHS in SCALAR; RHS in SCALAR"); + evalTrace("EmitLogicalOperatorSimple VOID op=" + node.operator + " emit LHS in SCALAR; RHS in VOID"); OperatorNode voidDeclaration = FindDeclarationVisitor.findOperator(node.right, "my"); String voidSavedOperator = null; @@ -348,8 +348,7 @@ private static void emitLogicalOperatorSimple(EmitterVisitor emitterVisitor, Bin mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL, "org/perlonjava/runtime/runtimetypes/RuntimeBase", getBoolean, "()Z", false); mv.visitJumpInsn(compareOpcode, endLabel); - node.right.accept(emitterVisitor.with(RuntimeContextType.SCALAR)); - mv.visitInsn(Opcodes.POP); + node.right.accept(emitterVisitor.with(RuntimeContextType.VOID)); mv.visitLabel(endLabel); } finally { diff --git a/src/main/java/org/perlonjava/backend/jvm/EmitOperatorLocal.java b/src/main/java/org/perlonjava/backend/jvm/EmitOperatorLocal.java index 956eeb0f5..c1ad19e38 100644 --- a/src/main/java/org/perlonjava/backend/jvm/EmitOperatorLocal.java +++ b/src/main/java/org/perlonjava/backend/jvm/EmitOperatorLocal.java @@ -62,6 +62,40 @@ static void handleLocal(EmitterVisitor emitterVisitor, OperatorNode node) { } } + // Handle local %hash for global/our hashes. + // Uses GlobalRuntimeHash.makeLocal() to save/restore the globalHashes map entry, + // not just the hash contents. This is needed because `*glob = \%hash` replaces + // the map entry, and a simple save/restore of contents would lose the reference. 
+ if (node.operand instanceof OperatorNode opNode && opNode.operator.equals("%")) { + if (opNode.operand instanceof IdentifierNode idNode) { + String varName = opNode.operator + idNode.name; + int varIndex = emitterVisitor.ctx.symbolTable.getVariableIndex(varName); + boolean isOurVariable = false; + if (varIndex != -1) { + var symbolEntry = emitterVisitor.ctx.symbolTable.getSymbolEntry(varName); + isOurVariable = symbolEntry != null && "our".equals(symbolEntry.decl()); + } + if (varIndex == -1 || isOurVariable) { + String fullName = NameNormalizer.normalizeVariableName(idNode.name, emitterVisitor.ctx.symbolTable.getCurrentPackage()); + mv.visitLdcInsn(fullName); + mv.visitMethodInsn(Opcodes.INVOKESTATIC, + "org/perlonjava/runtime/runtimetypes/GlobalRuntimeHash", + "makeLocal", + "(Ljava/lang/String;)Lorg/perlonjava/runtime/runtimetypes/RuntimeHash;", + false); + if (isDeclaredReference && emitterVisitor.ctx.contextType != RuntimeContextType.VOID) { + mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL, + "org/perlonjava/runtime/runtimetypes/RuntimeBase", + "createReference", + "()Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;", + false); + } + EmitOperator.handleVoidContext(emitterVisitor); + return; + } + } + } + // emit the lvalue int lvalueContext = LValueVisitor.getContext(node.operand); diff --git a/src/main/java/org/perlonjava/backend/jvm/EmitRegex.java b/src/main/java/org/perlonjava/backend/jvm/EmitRegex.java index 94427d53f..92c9de8d3 100644 --- a/src/main/java/org/perlonjava/backend/jvm/EmitRegex.java +++ b/src/main/java/org/perlonjava/backend/jvm/EmitRegex.java @@ -3,6 +3,7 @@ import org.objectweb.asm.Opcodes; import org.perlonjava.frontend.analysis.EmitterVisitor; import org.perlonjava.frontend.astnode.*; +import org.perlonjava.runtime.perlmodule.Strict; import org.perlonjava.runtime.runtimetypes.PerlCompilerException; import org.perlonjava.runtime.runtimetypes.RuntimeContextType; @@ -312,8 +313,21 @@ static void handleMatchRegex(EmitterVisitor emitterVisitor, 
OperatorNode node) { /** * Helper method to emit bytecode for regex matching operations. * Handles different context types (SCALAR, VOID) appropriately. + * When 'use bytes' is in effect, converts the input string to its + * UTF-8 byte representation before matching. */ private static void emitMatchRegex(EmitterVisitor emitterVisitor) { + // When 'use bytes' is in effect, convert the input string to byte representation + // so that regex character classes like [\x7f-\xa0] match against UTF-8 bytes + if (emitterVisitor.ctx.symbolTable != null && + emitterVisitor.ctx.symbolTable.isStrictOptionEnabled(Strict.HINT_BYTES)) { + // Stack: regex, string (top) -> regex, bytesString (top) + emitterVisitor.ctx.mv.visitMethodInsn(Opcodes.INVOKESTATIC, + "org/perlonjava/runtime/operators/StringOperators", "toBytesString", + "(Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;)Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;", + false); + } + emitterVisitor.pushCallContext(); // Invoke the regex matching operation emitterVisitor.ctx.mv.visitMethodInsn(Opcodes.INVOKESTATIC, diff --git a/src/main/java/org/perlonjava/backend/jvm/EmitVariable.java b/src/main/java/org/perlonjava/backend/jvm/EmitVariable.java index 66c0bb9e9..68c213636 100644 --- a/src/main/java/org/perlonjava/backend/jvm/EmitVariable.java +++ b/src/main/java/org/perlonjava/backend/jvm/EmitVariable.java @@ -102,7 +102,8 @@ private static boolean isBuiltinSpecialContainerVar(String sigil, String name) { || name.equals("ENV") || name.equals("INC") || name.equals("+") - || name.equals("-"); + || name.equals("-") + || name.equals("_"); } if ("@".equals(sigil)) { return name.equals("ARGV") diff --git a/src/main/java/org/perlonjava/core/Configuration.java b/src/main/java/org/perlonjava/core/Configuration.java index 18712e4de..1b6139207 100644 --- a/src/main/java/org/perlonjava/core/Configuration.java +++ b/src/main/java/org/perlonjava/core/Configuration.java @@ -33,7 +33,7 @@ public final class Configuration { * 
Automatically populated by Gradle/Maven during build. * DO NOT EDIT MANUALLY - this value is replaced at build time. */ - public static final String gitCommitId = "df663708c"; + public static final String gitCommitId = "b037509d0"; /** * Git commit date of the build (ISO format: YYYY-MM-DD). diff --git a/src/main/java/org/perlonjava/frontend/parser/DataSection.java b/src/main/java/org/perlonjava/frontend/parser/DataSection.java index b2a05a637..8bc9ec809 100644 --- a/src/main/java/org/perlonjava/frontend/parser/DataSection.java +++ b/src/main/java/org/perlonjava/frontend/parser/DataSection.java @@ -9,6 +9,7 @@ import org.perlonjava.runtime.runtimetypes.RuntimeIO; import org.perlonjava.runtime.runtimetypes.RuntimeScalar; +import java.nio.charset.StandardCharsets; import java.util.HashSet; import java.util.List; import java.util.Set; @@ -96,6 +97,68 @@ private static boolean isEndMarker(LexerToken token) { return false; } + /** + * Extracts DATA section content from raw file bytes. + * In Perl 5, <DATA> reads raw bytes from the file. This method searches for + * the __DATA__ or __END__ marker in the raw bytes and returns the content + * after it as a Latin-1 string (each byte = one character), preserving + * non-UTF-8 bytes that would be corrupted by UTF-8 decoding. 
+ * + * @param rawBytes the raw file bytes (after BOM removal) + * @param markerText the marker to search for ("__DATA__" or "__END__") + * @return the DATA content as a Latin-1 string, or null if marker not found + */ + private static String extractDataFromRawBytes(byte[] rawBytes, String markerText) { + byte[] marker = markerText.getBytes(StandardCharsets.US_ASCII); + int markerLen = marker.length; + + // Search for the marker at the start of a line in raw bytes + for (int i = 0; i <= rawBytes.length - markerLen; i++) { + // Check that we're at the start of a line (position 0 or after \n) + if (i > 0 && rawBytes[i - 1] != '\n') { + continue; + } + + // Check if the marker matches at this position + boolean match = true; + for (int j = 0; j < markerLen; j++) { + if (rawBytes[i + j] != marker[j]) { + match = false; + break; + } + } + if (!match) continue; + + // Verify the marker is followed by whitespace/newline/EOF (not part of a longer identifier) + int afterMarker = i + markerLen; + if (afterMarker < rawBytes.length) { + byte next = rawBytes[afterMarker]; + if (next != '\n' && next != '\r' && next != ' ' && next != '\t') { + continue; // Part of a longer identifier + } + } + + // Skip past the marker and any trailing whitespace + newline + int dataStart = afterMarker; + // Skip spaces/tabs + while (dataStart < rawBytes.length && (rawBytes[dataStart] == ' ' || rawBytes[dataStart] == '\t')) { + dataStart++; + } + // Skip the newline (\n or \r\n) + if (dataStart < rawBytes.length && rawBytes[dataStart] == '\r') { + dataStart++; + } + if (dataStart < rawBytes.length && rawBytes[dataStart] == '\n') { + dataStart++; + } + + // Return remaining bytes as Latin-1 string (each byte = one character) + return new String(rawBytes, dataStart, rawBytes.length - dataStart, StandardCharsets.ISO_8859_1); + } + + return null; // Marker not found + } + static int parseDataSection(Parser parser, int tokenIndex, List tokens, LexerToken token) { String handleName = 
parser.ctx.symbolTable.getCurrentPackage() + "::DATA"; @@ -133,21 +196,36 @@ static int parseDataSection(Parser parser, int tokenIndex, List toke } if (populateData) { - // Capture all remaining content until end marker - StringBuilder dataContent = new StringBuilder(); - while (tokenIndex < tokens.size()) { - LexerToken currentToken = tokens.get(tokenIndex); - - // Stop if we hit an end marker - if (isEndMarker(currentToken)) { - break; + // Try to extract DATA content from raw file bytes first. + // This preserves non-UTF-8 bytes (e.g., Latin-1) that would be corrupted + // by the UTF-8 decoding that happens when reading source files. + // In Perl 5, <DATA> reads raw bytes from the file. + byte[] rawBytes = parser.ctx.compilerOptions.rawCodeBytes; + String rawContent = null; + if (rawBytes != null) { + rawContent = extractDataFromRawBytes(rawBytes, token.text); + } + + if (rawContent != null) { + createDataHandle(parser, rawContent); + } else { + // Fallback: concatenate remaining tokens (for eval/string-based code + // where raw bytes are not available) + StringBuilder dataContent = new StringBuilder(); + while (tokenIndex < tokens.size()) { + LexerToken currentToken = tokens.get(tokenIndex); + + // Stop if we hit an end marker + if (isEndMarker(currentToken)) { + break; + } + + dataContent.append(currentToken.text); + tokenIndex++; } - dataContent.append(currentToken.text); - tokenIndex++; + createDataHandle(parser, dataContent.toString()); } - - createDataHandle(parser, dataContent.toString()); } } // Return tokens.size() to indicate we've consumed everything diff --git a/src/main/java/org/perlonjava/frontend/parser/IdentifierParser.java b/src/main/java/org/perlonjava/frontend/parser/IdentifierParser.java index d6e1c6fae..d1d51507e 100644 --- a/src/main/java/org/perlonjava/frontend/parser/IdentifierParser.java +++ b/src/main/java/org/perlonjava/frontend/parser/IdentifierParser.java @@ -286,8 +286,12 @@ public static String parseComplexIdentifierInner(Parser parser,
boolean insideBr hex = "\\x{" + Integer.toHexString(cp) + "}"; } } else if (cp <= 255) { - // Perl tends to report non-ASCII bytes as \x{..} in these contexts - hex = "\\x{" + Integer.toHexString(cp) + "}"; + if (insideBraces) { + // Inside ${...}, Perl formats non-ASCII bytes as \xNN (uppercase, no braces) + hex = String.format("\\x%02X", cp); + } else { + hex = "\\x{" + Integer.toHexString(cp) + "}"; + } } else { hex = "\\x{" + Integer.toHexString(cp) + "}"; } diff --git a/src/main/java/org/perlonjava/frontend/parser/Variable.java b/src/main/java/org/perlonjava/frontend/parser/Variable.java index 67a3753dd..971b0c0f1 100644 --- a/src/main/java/org/perlonjava/frontend/parser/Variable.java +++ b/src/main/java/org/perlonjava/frontend/parser/Variable.java @@ -317,7 +317,8 @@ private static void checkStrictVarsAtParseTime(Parser parser, String sigil, Stri // Built-in special container vars (%ENV, %SIG, @ARGV, @INC, etc.) if (sigil.equals("%") && (varName.equals("SIG") || varName.equals("ENV") - || varName.equals("INC") || varName.equals("+") || varName.equals("-"))) return; + || varName.equals("INC") || varName.equals("+") || varName.equals("-") + || varName.equals("_"))) return; if (sigil.equals("@") && (varName.equals("ARGV") || varName.equals("INC") || varName.equals("_") || varName.equals("F"))) return; diff --git a/src/main/java/org/perlonjava/runtime/io/CustomFileChannel.java b/src/main/java/org/perlonjava/runtime/io/CustomFileChannel.java index 495e6982e..f8603f529 100644 --- a/src/main/java/org/perlonjava/runtime/io/CustomFileChannel.java +++ b/src/main/java/org/perlonjava/runtime/io/CustomFileChannel.java @@ -12,6 +12,7 @@ import java.nio.channels.FileLock; import java.nio.channels.OverlappingFileLockException; import java.nio.charset.Charset; +import java.nio.charset.StandardCharsets; import java.nio.file.Path; import java.nio.file.StandardOpenOption; import java.util.Set; @@ -193,9 +194,24 @@ public RuntimeScalar write(String string) { if (appendMode) { 
fileChannel.position(fileChannel.size()); } - byte[] data = new byte[string.length()]; + // Check if string contains wide characters (codepoint > 255) + // Perl 5 auto-upgrades to UTF-8 for wide chars on binary handles + boolean hasWideChars = false; for (int i = 0; i < string.length(); i++) { - data[i] = (byte) string.charAt(i); + if (string.charAt(i) > 255) { + hasWideChars = true; + break; + } + } + byte[] data; + if (hasWideChars) { + // Encode as UTF-8, matching Perl 5 "Wide character in print" behavior + data = string.getBytes(StandardCharsets.UTF_8); + } else { + data = new byte[string.length()]; + for (int i = 0; i < string.length(); i++) { + data[i] = (byte) string.charAt(i); + } } ByteBuffer byteBuffer = ByteBuffer.wrap(data); fileChannel.write(byteBuffer); diff --git a/src/main/java/org/perlonjava/runtime/io/CustomOutputStreamHandle.java b/src/main/java/org/perlonjava/runtime/io/CustomOutputStreamHandle.java index 45c2f65ad..a4bc69d23 100644 --- a/src/main/java/org/perlonjava/runtime/io/CustomOutputStreamHandle.java +++ b/src/main/java/org/perlonjava/runtime/io/CustomOutputStreamHandle.java @@ -82,8 +82,18 @@ public CustomOutputStreamHandle(OutputStream outputStream) { */ @Override public RuntimeScalar write(String string) { - // Convert string to bytes, treating each character as a byte value - var data = string.getBytes(StandardCharsets.ISO_8859_1); + // Check for wide characters (codepoint > 255) + // Perl 5 auto-upgrades to UTF-8 for wide chars + boolean hasWideChars = false; + for (int i = 0; i < string.length(); i++) { + if (string.charAt(i) > 255) { + hasWideChars = true; + break; + } + } + var data = hasWideChars + ? 
string.getBytes(java.nio.charset.StandardCharsets.UTF_8) + : string.getBytes(StandardCharsets.ISO_8859_1); try { outputStream.write(data); bytesWritten += data.length; diff --git a/src/main/java/org/perlonjava/runtime/io/LayeredIOHandle.java b/src/main/java/org/perlonjava/runtime/io/LayeredIOHandle.java index 3454704f4..108378923 100644 --- a/src/main/java/org/perlonjava/runtime/io/LayeredIOHandle.java +++ b/src/main/java/org/perlonjava/runtime/io/LayeredIOHandle.java @@ -65,6 +65,14 @@ public class LayeredIOHandle implements IOHandle { */ private Function outputPipeline = Function.identity(); + /** + * Buffer for decoded characters that were produced by the encoding layer + * but not yet consumed by doRead(). This prevents character loss when + * the encoding layer decodes more characters than the caller requested + * (e.g., reading 4 bytes of UTF-16BE gives 2 characters when only 1 was needed). + */ + private StringBuilder decodedCharBuffer = new StringBuilder(); + /** * Constructs a new layered IO handle wrapping the given delegate. 
* @@ -144,11 +152,30 @@ public RuntimeScalar doRead(int maxBytes, Charset charset) { // For encoding layers, use precise character-based reading StringBuilder result = new StringBuilder(); int charactersNeeded = maxBytes; - int safetyLimit = maxBytes * 4; // Prevent infinite loops + boolean hasEncoding = hasEncodingLayer(); + + // First, drain any previously buffered decoded characters + if (decodedCharBuffer.length() > 0) { + int charsFromBuffer = Math.min(decodedCharBuffer.length(), charactersNeeded); + result.append(decodedCharBuffer, 0, charsFromBuffer); + decodedCharBuffer.delete(0, charsFromBuffer); + charactersNeeded -= charsFromBuffer; + } + + // Safety limit must be generous for multi-byte encodings (e.g., UTF-32 = 4 bytes/char) + int safetyLimit = Math.max(maxBytes * 8, 64); // Prevent infinite loops while (charactersNeeded > 0 && safetyLimit > 0) { - // Read only what we need, don't over-consume - int bytesToRead = Math.min(128, charactersNeeded); + // For encoding layers (UTF-16, UTF-32), read extra bytes to ensure we decode + // at least enough characters. For non-encoding layers (e.g., :crlf), read + // conservatively to avoid over-consuming from the delegate (which would make + // tell() inaccurate since it reports the delegate's position). 
+ int bytesToRead; + if (hasEncoding) { + bytesToRead = Math.min(128, Math.max(4, charactersNeeded * 4)); + } else { + bytesToRead = Math.min(128, charactersNeeded); + } RuntimeScalar chunk = delegate.doRead(bytesToRead, charset); String chunkStr = chunk.toString(); @@ -167,9 +194,9 @@ public RuntimeScalar doRead(int maxBytes, Charset charset) { result.append(processed, 0, charsToTake); charactersNeeded -= charsToTake; - // If we have extra characters, let the layer buffer them + // Buffer any excess decoded characters for the next doRead() call if (processed.length() > charsToTake) { - // This should be handled by the layer's internal buffering + decodedCharBuffer.append(processed, charsToTake, processed.length()); break; } } @@ -209,6 +236,9 @@ public RuntimeScalar binmode(String modeStr) { inputPipeline = Function.identity(); outputPipeline = Function.identity(); + // Clear decoded character buffer (layer change invalidates buffered data) + decodedCharBuffer.setLength(0); + // Reset and clear existing layers for (IOLayer layer : activeLayers) { layer.reset(); @@ -413,6 +443,7 @@ public RuntimeScalar close() { for (IOLayer layer : activeLayers) { layer.reset(); } + decodedCharBuffer.setLength(0); return delegate.close(); } @@ -439,6 +470,10 @@ public RuntimeScalar fileno() { */ @Override public RuntimeScalar eof() { + // If there are buffered decoded characters, we're not at EOF + if (decodedCharBuffer.length() > 0) { + return new RuntimeScalar(0); + } return delegate.eof(); } @@ -475,6 +510,7 @@ public RuntimeScalar seek(long pos, int whence) { for (IOLayer layer : activeLayers) { layer.reset(); } + decodedCharBuffer.setLength(0); return delegate.seek(pos, whence); } @@ -507,6 +543,24 @@ public RuntimeScalar flock(int operation) { return delegate.flock(operation); } + /** + * Checks if this handle has any encoding layers (e.g., :utf8, :encoding(UTF-8)). + * + *
Encoding layers decode bytes into characters, which means reads should + * produce character strings (UTF-8 flag set in Perl terms). Without encoding + * layers, reads produce byte strings.
+ * + * @return true if any active layer is an EncodingLayer + */ + public boolean hasEncodingLayer() { + for (IOLayer layer : activeLayers) { + if (layer instanceof EncodingLayer) { + return true; + } + } + return false; + } + public String getCurrentLayers() { // Return the currently applied layers as a string StringBuilder layers = new StringBuilder(); diff --git a/src/main/java/org/perlonjava/runtime/io/PipeOutputChannel.java b/src/main/java/org/perlonjava/runtime/io/PipeOutputChannel.java index 4071df22e..f6976261b 100644 --- a/src/main/java/org/perlonjava/runtime/io/PipeOutputChannel.java +++ b/src/main/java/org/perlonjava/runtime/io/PipeOutputChannel.java @@ -226,10 +226,23 @@ public RuntimeScalar write(String string) { } try { - // String contains raw bytes (each char is a byte value 0-255) - byte[] bytes = new byte[string.length()]; + // Check for wide characters (codepoint > 255) + // Perl 5 auto-upgrades to UTF-8 for wide chars + boolean hasWideChars = false; for (int i = 0; i < string.length(); i++) { - bytes[i] = (byte) string.charAt(i); + if (string.charAt(i) > 255) { + hasWideChars = true; + break; + } + } + byte[] bytes; + if (hasWideChars) { + bytes = string.getBytes(java.nio.charset.StandardCharsets.UTF_8); + } else { + bytes = new byte[string.length()]; + for (int i = 0; i < string.length(); i++) { + bytes[i] = (byte) string.charAt(i); + } } // Write raw bytes to process diff --git a/src/main/java/org/perlonjava/runtime/io/StandardIO.java b/src/main/java/org/perlonjava/runtime/io/StandardIO.java index f7e8ac32c..ab2b061d7 100644 --- a/src/main/java/org/perlonjava/runtime/io/StandardIO.java +++ b/src/main/java/org/perlonjava/runtime/io/StandardIO.java @@ -54,8 +54,18 @@ public RuntimeScalar write(String string) { } try { synchronized (writeLock) { - // Write directly - let BufferedOutputStream handle buffering - byte[] data = string.getBytes(StandardCharsets.ISO_8859_1); + // Check for wide characters (codepoint > 255) + // Perl 5 auto-upgrades to 
UTF-8 for wide chars + boolean hasWideChars = false; + for (int i = 0; i < string.length(); i++) { + if (string.charAt(i) > 255) { + hasWideChars = true; + break; + } + } + byte[] data = hasWideChars + ? string.getBytes(StandardCharsets.UTF_8) + : string.getBytes(StandardCharsets.ISO_8859_1); bufferedOutputStream.write(data); } return RuntimeScalarCache.scalarTrue; diff --git a/src/main/java/org/perlonjava/runtime/operators/IOOperator.java b/src/main/java/org/perlonjava/runtime/operators/IOOperator.java index b85fcf7ef..97dc2d33d 100644 --- a/src/main/java/org/perlonjava/runtime/operators/IOOperator.java +++ b/src/main/java/org/perlonjava/runtime/operators/IOOperator.java @@ -715,8 +715,8 @@ public static RuntimeScalar print(RuntimeList runtimeList, RuntimeScalar fileHan } StringBuilder sb = new StringBuilder(); - String separator = getGlobalVariable("main::,").toString(); // fetch $, - String newline = getGlobalVariable("main::\\").toString(); // fetch $\ + String separator = OutputFieldSeparator.getInternalOFS(); // fetch $, (internal copy, not affected by aliasing) + String newline = OutputRecordSeparator.getInternalORS(); // fetch $\ (internal copy, not affected by aliasing) boolean first = true; // Iterate through elements and append them with the separator diff --git a/src/main/java/org/perlonjava/runtime/operators/Operator.java b/src/main/java/org/perlonjava/runtime/operators/Operator.java index ddd199ed7..db433400b 100644 --- a/src/main/java/org/perlonjava/runtime/operators/Operator.java +++ b/src/main/java/org/perlonjava/runtime/operators/Operator.java @@ -231,6 +231,15 @@ public static RuntimeList split(RuntimeScalar quotedRegex, RuntimeList args, int } } + // Preserve BYTE_STRING type: if input was BYTE_STRING, all split results should be too + if (string.type == RuntimeScalarType.BYTE_STRING) { + for (RuntimeBase element : splitElements) { + if (element instanceof RuntimeScalar rs && rs.type == RuntimeScalarType.STRING) { + rs.type = 
RuntimeScalarType.BYTE_STRING; + } + } + } + if (ctx == SCALAR) { int size = result.elements.size(); return getScalarInt(size).getList(); @@ -468,15 +477,26 @@ public static RuntimeBase reverse(int ctx, RuntimeBase... args) { if (ctx == SCALAR) { StringBuilder sb = new StringBuilder(); + boolean isByteString = false; if (args.length == 0) { // In scalar context, reverse($_) if no arguments are provided. - sb.append(GlobalVariable.getGlobalVariable("main::_")); + RuntimeScalar defaultVar = GlobalVariable.getGlobalVariable("main::_"); + sb.append(defaultVar); + isByteString = (defaultVar.type == RuntimeScalarType.BYTE_STRING); } else { + isByteString = true; for (RuntimeBase arg : args) { sb.append(arg.toString()); + if (arg instanceof RuntimeScalar rs && rs.type != RuntimeScalarType.BYTE_STRING) { + isByteString = false; + } } } - return new RuntimeScalar(sb.reverse().toString()); + RuntimeScalar result = new RuntimeScalar(sb.reverse().toString()); + if (isByteString) { + result.type = RuntimeScalarType.BYTE_STRING; + } + return result; } // List context - avoid unnecessary copying to preserve element references @@ -636,7 +656,11 @@ public static RuntimeBase repeat(RuntimeBase value, RuntimeScalar timesScalar, i // Convert to scalar (gets count for arrays, etc.) 
scalarValue = value.scalar(); } - return new RuntimeScalar(scalarValue.toString().repeat(Math.max(0, times))); + RuntimeScalar rv = new RuntimeScalar(scalarValue.toString().repeat(Math.max(0, times))); + if (scalarValue.type == RuntimeScalarType.BYTE_STRING) { + rv.type = RuntimeScalarType.BYTE_STRING; + } + return rv; } else { RuntimeList result = new RuntimeList(); List outElements = result.elements; diff --git a/src/main/java/org/perlonjava/runtime/operators/Readline.java b/src/main/java/org/perlonjava/runtime/operators/Readline.java index 634a22067..1c5187eab 100644 --- a/src/main/java/org/perlonjava/runtime/operators/Readline.java +++ b/src/main/java/org/perlonjava/runtime/operators/Readline.java @@ -127,6 +127,7 @@ public static RuntimeScalar readline(RuntimeIO runtimeIO) { } private static RuntimeScalar readParagraphMode(RuntimeIO runtimeIO) { + boolean isByteMode = runtimeIO.isByteMode(); StringBuilder paragraph = new StringBuilder(); boolean inParagraph = false; boolean lastWasNewline = false; @@ -169,10 +170,15 @@ private static RuntimeScalar readParagraphMode(RuntimeIO runtimeIO) { } } - return new RuntimeScalar(paragraph.toString()); + RuntimeScalar result = new RuntimeScalar(paragraph.toString()); + if (isByteMode) { + result.type = RuntimeScalarType.BYTE_STRING; + } + return result; } private static RuntimeScalar readFixedLength(RuntimeIO runtimeIO, int length) { + boolean isByteMode = runtimeIO.isByteMode(); StringBuilder result = new StringBuilder(); for (int i = 0; i < length; i++) { @@ -191,10 +197,15 @@ private static RuntimeScalar readFixedLength(RuntimeIO runtimeIO, int length) { // Don't increment line numbers for fixed-length reads // (this matches Perl behavior for record-length mode) - return new RuntimeScalar(result.toString()); + RuntimeScalar rslt = new RuntimeScalar(result.toString()); + if (isByteMode) { + rslt.type = RuntimeScalarType.BYTE_STRING; + } + return rslt; } private static RuntimeScalar readUntilCharacter(RuntimeIO 
runtimeIO, char separator) { + boolean isByteMode = runtimeIO.isByteMode(); StringBuilder line = new StringBuilder(); String readChar; @@ -217,10 +228,15 @@ private static RuntimeScalar readUntilCharacter(RuntimeIO runtimeIO, char separa return scalarUndef; } - return new RuntimeScalar(line.toString()); + RuntimeScalar result = new RuntimeScalar(line.toString()); + if (isByteMode) { + result.type = RuntimeScalarType.BYTE_STRING; + } + return result; } private static RuntimeScalar readUntilString(RuntimeIO runtimeIO, String separator) { + boolean isByteMode = runtimeIO.isByteMode(); StringBuilder line = new StringBuilder(); StringBuilder buffer = new StringBuilder(); @@ -256,7 +272,11 @@ private static RuntimeScalar readUntilString(RuntimeIO runtimeIO, String separat return scalarUndef; } - return new RuntimeScalar(line.toString()); + RuntimeScalar result = new RuntimeScalar(line.toString()); + if (isByteMode) { + result.type = RuntimeScalarType.BYTE_STRING; + } + return result; } /** diff --git a/src/main/java/org/perlonjava/runtime/operators/RuntimeTransliterate.java b/src/main/java/org/perlonjava/runtime/operators/RuntimeTransliterate.java index 6f172d6eb..148234146 100644 --- a/src/main/java/org/perlonjava/runtime/operators/RuntimeTransliterate.java +++ b/src/main/java/org/perlonjava/runtime/operators/RuntimeTransliterate.java @@ -3,6 +3,7 @@ import org.perlonjava.runtime.regex.UnicodeResolver; import org.perlonjava.runtime.runtimetypes.PerlCompilerException; import org.perlonjava.runtime.runtimetypes.RuntimeScalar; +import org.perlonjava.runtime.runtimetypes.RuntimeScalarType; import java.util.*; @@ -165,7 +166,12 @@ public RuntimeScalar transliterate(RuntimeScalar originalString, int ctx) { // Handle the /r modifier - return the transliterated string without modifying original if (returnOriginal) { - return new RuntimeScalar(resultString); + RuntimeScalar rv = new RuntimeScalar(resultString); + // Preserve BYTE_STRING type from input + if (originalString.type 
== RuntimeScalarType.BYTE_STRING) { + rv.type = RuntimeScalarType.BYTE_STRING; + } + return rv; } // Determine if we need to call set() which will trigger read-only error if applicable @@ -176,7 +182,12 @@ public RuntimeScalar transliterate(RuntimeScalar originalString, int ctx) { boolean needsSet = !input.equals(resultString) || (input.isEmpty() && hasReplacement); if (needsSet) { + // Preserve BYTE_STRING type: tr/// on a byte string should produce a byte string + boolean wasByteString = originalString.type == RuntimeScalarType.BYTE_STRING; originalString.set(resultString); + if (wasByteString) { + originalString.type = RuntimeScalarType.BYTE_STRING; + } } // Return the count of matched characters diff --git a/src/main/java/org/perlonjava/runtime/operators/StringOperators.java b/src/main/java/org/perlonjava/runtime/operators/StringOperators.java index 2b7d17bbd..8c2816cbb 100644 --- a/src/main/java/org/perlonjava/runtime/operators/StringOperators.java +++ b/src/main/java/org/perlonjava/runtime/operators/StringOperators.java @@ -56,6 +56,47 @@ public static RuntimeScalar lengthBytes(RuntimeScalar runtimeScalar) { } } + /** + * Converts a string to its UTF-8 byte representation. + * Each byte becomes a separate character in the range 0x00-0xFF. + * This is used when 'use bytes' pragma is in effect for regex matching. 
+ * + * @param runtimeScalar the {@link RuntimeScalar} to convert + * @return a {@link RuntimeScalar} containing the byte-level string + */ + public static RuntimeScalar toBytesString(RuntimeScalar runtimeScalar) { + String str = runtimeScalar.toString(); + // Check if all characters are already in 0-255 range (ASCII/Latin-1) + boolean needsConversion = false; + for (int i = 0; i < str.length(); i++) { + if (str.charAt(i) > 0xFF) { + needsConversion = true; + break; + } + } + if (!needsConversion) { + return runtimeScalar; + } + // Convert to UTF-8 bytes, then create a string where each byte is a character + byte[] bytes = str.getBytes(StandardCharsets.UTF_8); + StringBuilder sb = new StringBuilder(bytes.length); + for (byte b : bytes) { + sb.append((char) (b & 0xFF)); + } + return new RuntimeScalar(sb.toString()); + } + + /** + * Helper to create a string result that preserves BYTE_STRING type from the source. + */ + private static RuntimeScalar makeStringResult(String value, RuntimeScalar source) { + RuntimeScalar result = new RuntimeScalar(value); + if (source.type == RuntimeScalarType.BYTE_STRING) { + result.type = RuntimeScalarType.BYTE_STRING; + } + return result; + } + /** * Escapes all non-alphanumeric characters in the string representation of the given {@link RuntimeScalar}. * @@ -75,7 +116,7 @@ public static RuntimeScalar quotemeta(RuntimeScalar runtimeScalar) { quoted.append("\\").append(c); } } - return new RuntimeScalar(quoted.toString()); + return makeStringResult(quoted.toString(), runtimeScalar); } /** @@ -92,7 +133,7 @@ public static RuntimeScalar fc(RuntimeScalar runtimeScalar) { // NFKC would decompose these to their ASCII equivalents, which is wrong. 
str = CaseMap.fold().apply(str); - return new RuntimeScalar(str); + return makeStringResult(str, runtimeScalar); } /** @@ -119,7 +160,7 @@ public static RuntimeScalar fcBytes(RuntimeScalar runtimeScalar) { public static RuntimeScalar lc(RuntimeScalar runtimeScalar) { // Convert the string to lowercase using ICU4J for proper Unicode handling String str = UCharacter.toLowerCase(runtimeScalar.toString()); - return new RuntimeScalar(str); + return makeStringResult(str, runtimeScalar); } /** @@ -142,7 +183,7 @@ public static RuntimeScalar lcfirst(RuntimeScalar runtimeScalar) { String str = runtimeScalar.toString(); // Check if the string is empty if (str.isEmpty()) { - return new RuntimeScalar(str); + return makeStringResult(str, runtimeScalar); } // Get the first code point and convert it to lowercase using ICU4J int firstCodePoint = str.codePointAt(0); @@ -150,7 +191,7 @@ public static RuntimeScalar lcfirst(RuntimeScalar runtimeScalar) { String firstChar = str.substring(0, charCount); String rest = str.substring(charCount); String lowerFirst = UCharacter.toLowerCase(firstChar); - return new RuntimeScalar(lowerFirst + rest); + return makeStringResult(lowerFirst + rest, runtimeScalar); } /** @@ -163,7 +204,7 @@ public static RuntimeScalar lcfirst(RuntimeScalar runtimeScalar) { public static RuntimeScalar uc(RuntimeScalar runtimeScalar) { // Convert the string to uppercase using ICU4J for proper Unicode handling String str = UCharacter.toUpperCase(runtimeScalar.toString()); - return new RuntimeScalar(str); + return makeStringResult(str, runtimeScalar); } /** @@ -194,7 +235,7 @@ public static RuntimeScalar ucfirst(RuntimeScalar runtimeScalar) { titleFirst = String.valueOf(Character.toChars(titleCodePoint)); } } - return new RuntimeScalar(titleFirst + rest); + return makeStringResult(titleFirst + rest, runtimeScalar); } /** @@ -421,7 +462,11 @@ public static RuntimeScalar chompScalar(RuntimeScalar runtimeScalar) { // Always update the original scalar if we modified the 
string if (!str.equals(originalStr)) { + boolean wasByteString = runtimeScalar.type == RuntimeScalarType.BYTE_STRING; runtimeScalar.set(str); + if (wasByteString) { + runtimeScalar.type = RuntimeScalarType.BYTE_STRING; + } } return getScalarInt(charsRemoved); @@ -440,7 +485,11 @@ public static RuntimeScalar chopScalar(RuntimeScalar runtimeScalar) { String lastChar = str.substring(str.length() - lastCharSize); String remainingStr = str.substring(0, str.length() - lastCharSize); + boolean wasByteString = runtimeScalar.type == RuntimeScalarType.BYTE_STRING; runtimeScalar.set(remainingStr); + if (wasByteString) { + runtimeScalar.type = RuntimeScalarType.BYTE_STRING; + } return new RuntimeScalar(lastChar); } diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/BytesPragma.java b/src/main/java/org/perlonjava/runtime/perlmodule/BytesPragma.java index 84cd4003e..db37eadf3 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/BytesPragma.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/BytesPragma.java @@ -1,14 +1,17 @@ package org.perlonjava.runtime.perlmodule; import org.perlonjava.frontend.semantic.ScopedSymbolTable; -import org.perlonjava.runtime.runtimetypes.RuntimeArray; -import org.perlonjava.runtime.runtimetypes.RuntimeList; +import org.perlonjava.runtime.operators.ScalarOperators; +import org.perlonjava.runtime.operators.StringOperators; +import org.perlonjava.runtime.runtimetypes.*; import static org.perlonjava.frontend.parser.SpecialBlockParser.getCurrentScope; /** * The BytesPragma class provides functionalities similar to the Perl bytes module. * When enabled, it forces string operations to work with bytes rather than characters. + * Also provides bytes::length(), bytes::chr(), bytes::ord(), bytes::substr() as + * callable subroutines (used by modules like Text::CSV_PP). 
*/ public class BytesPragma extends PerlModuleBase { @@ -28,9 +31,16 @@ public static void initialize() { try { bytes.registerMethod("import", "useBytes", ";$"); bytes.registerMethod("unimport", "noBytes", ";$"); + // Register bytes:: utility functions (callable as bytes::length($x) etc.) + bytes.registerMethod("length", "bytesLength", "$"); + bytes.registerMethod("chr", "bytesChr", "$"); + bytes.registerMethod("ord", "bytesOrd", "$"); + bytes.registerMethod("substr", "bytesSubstr", null); } catch (NoSuchMethodException e) { System.err.println("Warning: Missing Bytes method: " + e.getMessage()); } + // Set $bytes::VERSION + GlobalVariable.getGlobalVariable("bytes::VERSION").set(new RuntimeScalar("1.08")); } /** @@ -64,4 +74,64 @@ public static RuntimeList noBytes(RuntimeArray args, int ctx) { } return new RuntimeList(); } + + /** + * Implements bytes::length($string). + * Returns the number of bytes in the UTF-8 encoding of the string. + */ + public static RuntimeList bytesLength(RuntimeArray args, int ctx) { + RuntimeScalar scalar = args.size() > 0 ? args.get(0) : new RuntimeScalar(); + return StringOperators.lengthBytes(scalar).getList(); + } + + /** + * Implements bytes::chr($codepoint). + * Returns a byte character for the given code point (mod 256). + */ + public static RuntimeList bytesChr(RuntimeArray args, int ctx) { + RuntimeScalar scalar = args.size() > 0 ? args.get(0) : new RuntimeScalar(); + return StringOperators.chrBytes(scalar).getList(); + } + + /** + * Implements bytes::ord($string). + * Returns the byte value of the first byte in the string. + */ + public static RuntimeList bytesOrd(RuntimeArray args, int ctx) { + RuntimeScalar scalar = args.size() > 0 ? args.get(0) : new RuntimeScalar(); + return ScalarOperators.ordBytes(scalar).getList(); + } + + /** + * Implements bytes::substr($string, $offset, $length, $replacement). + * Operates on the UTF-8 byte representation of the string. 
+ */ + public static RuntimeList bytesSubstr(RuntimeArray args, int ctx) { + if (args.size() < 2) { + throw new IllegalStateException("Usage: bytes::substr(STRING, OFFSET [, LENGTH [, REPLACEMENT]])"); + } + // Delegate to the standard substr but operating on bytes + // Convert to byte string first, then do substr + RuntimeScalar str = args.get(0); + RuntimeScalar offset = args.get(1); + RuntimeScalar length = args.size() > 2 ? args.get(2) : new RuntimeScalar(); + RuntimeScalar replacement = args.size() > 3 ? args.get(3) : null; + + // Get the UTF-8 bytes of the string + byte[] bytes = str.toString().getBytes(java.nio.charset.StandardCharsets.UTF_8); + int off = offset.getInt(); + int len = length.getDefinedBoolean() ? length.getInt() : bytes.length - off; + + // Handle negative offset + if (off < 0) off = bytes.length + off; + if (off < 0) off = 0; + if (off > bytes.length) off = bytes.length; + if (len < 0) len = bytes.length - off + len; + if (len < 0) len = 0; + if (off + len > bytes.length) len = bytes.length - off; + + byte[] result = new byte[len]; + System.arraycopy(bytes, off, result, 0, len); + return new RuntimeScalar(result).getList(); + } } diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java b/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java index b6e4aa801..bc1700bbf 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java @@ -7,6 +7,7 @@ import java.nio.charset.IllegalCharsetNameException; import java.nio.charset.StandardCharsets; import java.nio.charset.UnsupportedCharsetException; +import java.util.Arrays; import java.util.HashMap; import java.util.Map; @@ -89,6 +90,32 @@ public class Encode extends PerlModuleBase { Charset defaultCharset = Charset.defaultCharset(); CHARSET_ALIASES.put("locale", defaultCharset); CHARSET_ALIASES.put("locale_fs", defaultCharset); + + // UTF-32 aliases + try { + Charset utf32 = Charset.forName("UTF-32"); + 
CHARSET_ALIASES.put("utf32", utf32); + CHARSET_ALIASES.put("UTF32", utf32); + CHARSET_ALIASES.put("utf-32", utf32); + CHARSET_ALIASES.put("UTF-32", utf32); + } catch (Exception ignored) { + } + try { + Charset utf32be = Charset.forName("UTF-32BE"); + CHARSET_ALIASES.put("utf32be", utf32be); + CHARSET_ALIASES.put("UTF32BE", utf32be); + CHARSET_ALIASES.put("utf-32be", utf32be); + CHARSET_ALIASES.put("UTF-32BE", utf32be); + } catch (Exception ignored) { + } + try { + Charset utf32le = Charset.forName("UTF-32LE"); + CHARSET_ALIASES.put("utf32le", utf32le); + CHARSET_ALIASES.put("UTF32LE", utf32le); + CHARSET_ALIASES.put("utf-32le", utf32le); + CHARSET_ALIASES.put("UTF-32LE", utf32le); + } catch (Exception ignored) { + } } public Encode() { @@ -270,6 +297,9 @@ public static RuntimeList decode(RuntimeArray args, int ctx) { Charset charset = getCharset(encodingName); // Convert the string to bytes assuming it contains raw octets byte[] bytes = octets.getBytes(StandardCharsets.ISO_8859_1); + // Trim orphan trailing bytes for fixed-width encodings + // (Perl's Encode silently drops incomplete trailing code units) + bytes = trimOrphanBytes(bytes, charset); String decoded = new String(bytes, charset); return new RuntimeScalar(decoded).getList(); @@ -412,6 +442,8 @@ public static RuntimeList encoding_decode(RuntimeArray args, int ctx) { try { Charset charset = getCharset(charsetName); byte[] bytes = octets.getBytes(StandardCharsets.ISO_8859_1); + // Trim orphan trailing bytes for fixed-width encodings + bytes = trimOrphanBytes(bytes, charset); String decoded = new String(bytes, charset); return new RuntimeScalar(decoded).getList(); } catch (Exception e) { @@ -456,6 +488,8 @@ public static RuntimeList from_to(RuntimeArray args, int ctx) { byte[] bytes = octets.getBytes(StandardCharsets.ISO_8859_1); // Decode from source encoding + // Trim orphan trailing bytes for fixed-width encodings + bytes = trimOrphanBytes(bytes, fromCharset); String decoded = new String(bytes, 
fromCharset); // Encode to target encoding @@ -492,6 +526,29 @@ public static RuntimeList _utf8_off(RuntimeArray args, int ctx) { return scalarUndef.getList(); } + /** + * Trims orphan trailing bytes for fixed-width encodings. + * Perl's Encode silently drops incomplete trailing code units + * (e.g., an odd byte at the end of UTF-16 input). + * Java's String(byte[], Charset) replaces them with U+FFFD instead. + */ + private static byte[] trimOrphanBytes(byte[] bytes, Charset charset) { + String name = charset.name().toLowerCase(); + int codeUnitSize = 0; + if (name.contains("utf-16") || name.contains("utf16") || name.contains("ucs-2") || name.contains("ucs2")) { + codeUnitSize = 2; + } else if (name.contains("utf-32") || name.contains("utf32")) { + codeUnitSize = 4; + } + if (codeUnitSize > 1) { + int remainder = bytes.length % codeUnitSize; + if (remainder != 0) { + bytes = Arrays.copyOf(bytes, bytes.length - remainder); + } + } + return bytes; + } + /** * Helper method to get a Charset from an encoding name. * Handles common aliases and Perl-style encoding names. 
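The orphan-byte trimming added to `Encode.java` above can be tried in isolation. The following is a minimal standalone sketch, not the PerlOnJava code itself: the class name `OrphanByteTrimSketch` and the helper `codeUnitSize` are hypothetical, and the charset-name matching is simplified to the canonical JDK names. It assumes, as the patch comment states, that `new String(byte[], Charset)` substitutes U+FFFD for an incomplete trailing code unit.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration class; not part of PerlOnJava.
class OrphanByteTrimSketch {

    // Width in bytes of one code unit for fixed-width encodings, 1 otherwise.
    static int codeUnitSize(Charset charset) {
        String name = charset.name().toLowerCase(Locale.ROOT);
        if (name.contains("utf-16")) {
            return 2;
        }
        if (name.contains("utf-32")) {
            return 4;
        }
        return 1;
    }

    // Drops trailing bytes that do not form a complete code unit.
    static byte[] trimOrphanBytes(byte[] bytes, Charset charset) {
        int remainder = bytes.length % codeUnitSize(charset);
        return remainder == 0 ? bytes : Arrays.copyOf(bytes, bytes.length - remainder);
    }

    public static void main(String[] args) {
        // "A" in UTF-16BE followed by one orphan byte.
        byte[] input = {0x00, 0x41, 0x00};

        String untrimmed = new String(input, StandardCharsets.UTF_16BE);
        String trimmed = new String(
                trimOrphanBytes(input, StandardCharsets.UTF_16BE),
                StandardCharsets.UTF_16BE);

        // Compare lengths to see whether the orphan byte produced an extra character.
        System.out.println("untrimmed length: " + untrimmed.length());
        System.out.println("trimmed length:   " + trimmed.length());
    }
}
```

Trimming before decoding, rather than stripping a trailing U+FFFD afterwards, avoids removing a replacement character that was genuinely present in the input.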
diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/Lib.java b/src/main/java/org/perlonjava/runtime/perlmodule/Lib.java index f383365aa..6acafaba9 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/Lib.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/Lib.java @@ -51,11 +51,13 @@ public static void initialize() { public static RuntimeList useLib(RuntimeArray args, int ctx) { RuntimeArray INC = GlobalVariable.getGlobalArray("main::INC"); initOrigInc(INC); - for (int i = 1; i < args.size(); i++) { + // Process in reverse order and unshift, matching Perl's lib.pm behavior: + // directories are prepended to @INC so they take precedence over existing paths + for (int i = args.size() - 1; i >= 1; i--) { String dir = args.get(i).toString(); - if (!contains(INC, dir)) { - RuntimeArray.push(INC, new RuntimeScalar(dir)); - } + // Remove any existing occurrence first (dedup), then prepend + INC.elements.removeIf(path -> path.toString().equals(dir)); + RuntimeArray.unshift(INC, new RuntimeScalar(dir)); } return new RuntimeList(); } diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/PerlIO.java b/src/main/java/org/perlonjava/runtime/perlmodule/PerlIO.java index 4fa767fee..0a23a1151 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/PerlIO.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/PerlIO.java @@ -47,6 +47,9 @@ public static RuntimeList find(RuntimeArray args, int ctx) { // Optional arguments like 'output', 'details' are accepted but currently ignored public static RuntimeList get_layers(RuntimeArray args, int ctx) { RuntimeIO fh = args.get(0).getRuntimeIO(); + if (fh == null) { + throw new PerlCompilerException("Not a GLOB reference"); + } if (fh instanceof TieHandle) { throw new PerlCompilerException("can't get_layers on tied handle"); } diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/Utf8.java b/src/main/java/org/perlonjava/runtime/perlmodule/Utf8.java index 5c7e88513..1f96ff5a1 100644 --- 
a/src/main/java/org/perlonjava/runtime/perlmodule/Utf8.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/Utf8.java @@ -114,55 +114,17 @@ public static RuntimeList upgrade(RuntimeArray args, int ctx) { // Don't modify read-only scalars (e.g., string literals) if (!(scalar instanceof RuntimeScalarReadOnly)) { if (scalar.type == BYTE_STRING) { - // BYTE_STRING: interpret bytes as Latin-1, then decode as UTF-8 if valid. + // BYTE_STRING → STRING: just flip the type flag without changing content. // - // IMPORTANT CORNER CASE (regression-prone): - // In a perfect world, BYTE_STRING values would only ever contain characters in - // the 0x00..0xFF range (representing raw octets). However, some parts of the - // interpreter/compiler may currently construct a BYTE_STRING that already - // contains Unicode code points > 0xFF (e.g. from "\x{100}" yielding U+0100). + // In Perl 5, utf8::upgrade() only changes the internal storage format + // (from byte to UTF-8 encoded), but the character codepoints remain + // identical. For example, bytes 0xE2, 0x82, 0xAC become characters + // U+00E2, U+0082, U+00AC (NOT decoded as UTF-8 to U+20AC). // - // If we blindly treat such a value as bytes and cast each char to (byte), Java - // will truncate U+0100 (256) to 0x00 and we corrupt the string to "\0". - // This breaks re/regexp.t cases that do: - // $subject = "\x{100}"; utf8::upgrade($subject); - // and then expect the subject to still contain U+0100. - // - // Therefore: - // - If the current BYTE_STRING already contains chars > 0xFF, treat it as - // already-upgraded Unicode content and simply flip the type to STRING. - // (No re-decoding step; content must not change.) 
- boolean hasNonByteChars = false; - for (int i = 0; i < string.length(); i++) { - if (string.charAt(i) > 0xFF) { - hasNonByteChars = true; - break; - } - } - if (hasNonByteChars) { - scalar.set(string); - scalar.type = STRING; - return new RuntimeScalar(utf8Bytes.length).getList(); - } - - // Extract raw byte values (0x00-0xFF) directly from char codes. - // Do NOT use getBytes(ISO_8859_1) on values that may contain characters > 0xFF, - // as Java will replace unmappable characters with '?'. - byte[] bytes = new byte[string.length()]; - for (int i = 0; i < string.length(); i++) { - bytes[i] = (byte) string.charAt(i); - } - CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder() - .onMalformedInput(CodingErrorAction.REPORT) - .onUnmappableCharacter(CodingErrorAction.REPORT); - try { - CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes)); - scalar.set(decoded.toString()); - } catch (CharacterCodingException e) { - // Not valid UTF-8: keep Latin-1 codepoint semantics. - // Each byte value becomes a character with that code point. - scalar.set(string); - } + // NOTE: Some parts of the interpreter/compiler may construct a BYTE_STRING + // that already contains Unicode code points > 0xFF (e.g. "\x{100}"). + // This is fine — we just flip the type and preserve the content as-is. + scalar.set(string); scalar.type = STRING; } else if (scalar.type != STRING) { // Other types (INTEGER, DOUBLE, UNDEF, etc.): convert to string and mark as STRING. @@ -269,8 +231,22 @@ public static RuntimeList decode(RuntimeArray args, int ctx) { } RuntimeScalar scalar = args.get(0); String string = scalar.toString(); + + // utf8::decode expects octet data (0-255). If the string contains + // characters > 0xFF, it cannot be valid octet data — return false + // without modifying the string. 
+ for (int i = 0; i < string.length(); i++) { + if (string.charAt(i) > 0xFF) { + return new RuntimeScalar(false).getList(); + } + } + try { - byte[] bytes = string.getBytes(StandardCharsets.ISO_8859_1); + // Safe: all chars are <= 0xFF, so no data loss with manual byte extraction + byte[] bytes = new byte[string.length()]; + for (int i = 0; i < string.length(); i++) { + bytes[i] = (byte) string.charAt(i); + } // Use a strict UTF-8 decoder that throws on invalid sequences // instead of silently replacing with U+FFFD. This matches Perl 5 // behavior where utf8::decode returns FALSE for invalid UTF-8. @@ -342,6 +318,10 @@ public static RuntimeList isUtf8(RuntimeArray args, int ctx) { * @return true if the scalar is a UTF-8 string (not BYTE_STRING), false otherwise. */ public static boolean isUtf8(RuntimeScalar scalar) { + // Resolve proxy types (ScalarSpecialVariable for $1, $&, etc.) + if (scalar instanceof ScalarSpecialVariable sv) { + scalar = sv.getValueAsScalar(); + } return scalar.type != BYTE_STRING; } diff --git a/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java b/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java index 4c5574c57..a24cfee73 100644 --- a/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java +++ b/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java @@ -65,6 +65,9 @@ protected boolean removeEldestEntry(Map.Entry eldest) { // Capture groups from the last successful match that had captures. // In Perl 5, $1/$2/etc persist across non-capturing matches. public static String[] lastCaptureGroups = null; + // Track whether the last successful match was on a BYTE_STRING input, + // so that captures ($1, $2, $&, etc.) preserve BYTE_STRING type. 
+ public static boolean lastMatchWasByteString = false; // Compiled regex pattern (for byte strings - ASCII-only \w, \d) public Pattern pattern; // Compiled regex pattern for Unicode strings (Unicode \w, \d) @@ -646,6 +649,7 @@ private static RuntimeBase matchRegexDirect(RuntimeScalar quotedRegex, RuntimeSc } found = true; + lastMatchWasByteString = (string.type == RuntimeScalarType.BYTE_STRING); int captureCount = matcher.groupCount(); // Always initialize $1, $2, @+, @-, $`, $&, $' for every successful match @@ -691,7 +695,7 @@ private static RuntimeBase matchRegexDirect(RuntimeScalar quotedRegex, RuntimeSc if (regex.regexFlags.isGlobalMatch() && captureCount < 1 && ctx == RuntimeContextType.LIST) { // Global match and no captures, in list context return the matched string String matchedStr = regex.hasBackslashK ? lastMatchedString : matcher.group(0); - matchedGroups.add(new RuntimeScalar(matchedStr)); + matchedGroups.add(makeMatchResultScalar(matchedStr)); } else { // save captures in return list if needed if (ctx == RuntimeContextType.LIST) { @@ -704,13 +708,13 @@ private static RuntimeBase matchRegexDirect(RuntimeScalar quotedRegex, RuntimeSc // because Java creates separate groups for each alternative // but Perl reuses group numbers across alternatives if (matchedStr != null) { - matchedGroups.add(new RuntimeScalar(matchedStr)); + matchedGroups.add(makeMatchResultScalar(matchedStr)); } } else { // Include undef for groups that didn't participate in the match // This is important for patterns like m{^(.*/)?(.*)}s where // the optional group returns undef when it doesn't match - matchedGroups.add(new RuntimeScalar(matchedStr)); + matchedGroups.add(makeMatchResultScalar(matchedStr)); } } } @@ -990,6 +994,7 @@ public static RuntimeBase replaceRegex(RuntimeScalar quotedRegex, RuntimeScalar try { while (matcher.find()) { found++; + lastMatchWasByteString = (string.type == RuntimeScalarType.BYTE_STRING); // Initialize $1, $2, @+, @- only when we have a match 
globalMatcher = matcher; @@ -1074,6 +1079,7 @@ public static RuntimeBase replaceRegex(RuntimeScalar quotedRegex, RuntimeScalar if (found > 0) { String finalResult = resultBuffer.toString(); + boolean wasByteString = (string.type == RuntimeScalarType.BYTE_STRING); // Store as last successful pattern for empty pattern reuse lastMatchUsedPFlag = regex.hasPreservesMatch; @@ -1081,10 +1087,17 @@ public static RuntimeBase replaceRegex(RuntimeScalar quotedRegex, RuntimeScalar if (regex.regexFlags.isNonDestructive()) { // /r modifier: return the modified string - return new RuntimeScalar(finalResult); + RuntimeScalar rv = new RuntimeScalar(finalResult); + if (wasByteString && !containsWideChars(finalResult)) { + rv.type = RuntimeScalarType.BYTE_STRING; + } + return rv; } else { // Save the modified string back to the original scalar string.set(finalResult); + if (wasByteString && !containsWideChars(finalResult)) { + string.type = RuntimeScalarType.BYTE_STRING; + } // Return the number of substitutions made return RuntimeScalarCache.getScalarInt(found); } @@ -1181,6 +1194,21 @@ public static String lastCaptureString() { return lastCaptureGroups[lastCaptureGroups.length - 1]; } + /** + * Creates a RuntimeScalar from a regex match result string, preserving + * BYTE_STRING type if the matched input was a byte string. + */ + public static RuntimeScalar makeMatchResultScalar(String value) { + if (value == null) { + return RuntimeScalarCache.scalarUndef; + } + RuntimeScalar scalar = new RuntimeScalar(value); + if (lastMatchWasByteString) { + scalar.type = RuntimeScalarType.BYTE_STRING; + } + return scalar; + } + public static RuntimeScalar matcherStart(int group) { if (group == 0) { return lastMatchStart >= 0 ? getScalarInt(lastMatchStart) : scalarUndef; @@ -1600,6 +1628,20 @@ public RuntimeScalar getLastCodeBlockResult() { return null; } + /** + * Check if a string contains any characters with codepoints > 255. 
+ * Used to determine if a substitution result should be upgraded from + * BYTE_STRING to STRING (e.g., when the replacement introduced wide characters). + */ + private static boolean containsWideChars(String s) { + for (int i = 0; i < s.length(); i++) { + if (s.charAt(i) > 255) { + return true; + } + } + return false; + } + /** * Get the group number of the internal perlK named capture group. * This group is inserted by the preprocessor at the \K position. diff --git a/src/main/java/org/perlonjava/runtime/regex/UnicodeResolver.java b/src/main/java/org/perlonjava/runtime/regex/UnicodeResolver.java index 4f45f9389..27c55c839 100644 --- a/src/main/java/org/perlonjava/runtime/regex/UnicodeResolver.java +++ b/src/main/java/org/perlonjava/runtime/regex/UnicodeResolver.java @@ -191,7 +191,7 @@ private static String parseUserDefinedProperty(String definition, Set re resultSet.retainAll(intersectionSet); } - return resultSet.toPattern(false); + return unicodeSetToJavaPattern(resultSet); } /** @@ -437,7 +437,7 @@ private static String translateUnicodeProperty(String property, boolean negated, } } - String pattern = unicodeSet.toPattern(false); + String pattern = unicodeSetToJavaPattern(unicodeSet); return wrapCharClass(pattern, negated); } catch (IllegalArgumentException e) { @@ -454,7 +454,7 @@ private static String translateUnicodeProperty(String property, boolean negated, private static String getXIDStartPattern(boolean negated) { UnicodeSet xidStartSet = new UnicodeSet(); xidStartSet.applyPropertyAlias("XID_Start", "True"); - String pattern = xidStartSet.toPattern(false); + String pattern = unicodeSetToJavaPattern(xidStartSet); return wrapCharClass(pattern, negated); } @@ -462,7 +462,7 @@ private static String getXIDStartPattern(boolean negated) { private static String getXIDContinuePattern(boolean negated) { UnicodeSet xidContSet = new UnicodeSet(); xidContSet.applyPropertyAlias("XID_Continue", "True"); - String pattern = xidContSet.toPattern(false); + String pattern = 
unicodeSetToJavaPattern(xidContSet); return wrapCharClass(pattern, negated); } @@ -470,7 +470,7 @@ private static String getXIDContinuePattern(boolean negated) { private static String getXPosixSpacePattern(boolean negated) { UnicodeSet spaceSet = new UnicodeSet(); spaceSet.applyPropertyAlias("White_Space", "True"); - String pattern = spaceSet.toPattern(false); + String pattern = unicodeSetToJavaPattern(spaceSet); return wrapCharClass(pattern, negated); } @@ -479,7 +479,7 @@ private static String getPerlIDStartPattern(boolean negated) { UnicodeSet perlIDStartSet = new UnicodeSet(); perlIDStartSet.applyPropertyAlias("XID_Start", "True"); perlIDStartSet.add('_'); // Add underscore - String pattern = perlIDStartSet.toPattern(false); + String pattern = unicodeSetToJavaPattern(perlIDStartSet); return wrapCharClass(pattern, negated); } @@ -512,6 +512,52 @@ private static String wrapCharClass(String pattern, boolean negated) { return negated ? "[^" + pattern + "]" : "[" + pattern + "]"; } + /** + * Converts a UnicodeSet to a Java regex character class pattern. + * Uses \x{XXXX} notation for supplementary characters (U+10000+) to avoid + * issues with Java's Pattern.compile() misinterpreting UTF-16 surrogate pairs + * in character class ranges generated by ICU4J's toPattern(). 
+ */ + static String unicodeSetToJavaPattern(UnicodeSet set) { + StringBuilder sb = new StringBuilder(); + for (int i = 0; i < set.getRangeCount(); i++) { + int start = set.getRangeStart(i); + int end = set.getRangeEnd(i); + appendJavaPatternChar(sb, start); + if (start != end) { + sb.append('-'); + appendJavaPatternChar(sb, end); + } + } + return sb.toString(); + } + + private static void appendJavaPatternChar(StringBuilder sb, int codePoint) { + if (codePoint >= 0x10000) { + // Use \x{XXXX} for supplementary characters to avoid surrogate pair issues + sb.append(String.format("\\x{%X}", codePoint)); + } else { + // Escape special regex metacharacters inside character classes + // Also escape # and whitespace so the pattern works with Pattern.COMMENTS flag + switch (codePoint) { + case '[': case ']': case '\\': case '^': case '-': case '&': + case '{': case '}': case '#': + sb.append('\\'); + sb.append((char) codePoint); + break; + default: + if (codePoint < 0x20 || codePoint == 0x7F || + Character.isWhitespace(codePoint)) { + // Control characters and whitespace - use hex escape + sb.append(String.format("\\x{%X}", codePoint)); + } else { + sb.append((char) codePoint); + } + break; + } + } + } + // Helper method to check if a property is a block property private static boolean isBlockProperty(String property) { // List of known block properties (can be expanded as needed) diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/FileUtils.java b/src/main/java/org/perlonjava/runtime/runtimetypes/FileUtils.java index 9a5bfb39f..4552dacf4 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/FileUtils.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/FileUtils.java @@ -60,6 +60,15 @@ private static String detectEncodingAndDecode(byte[] bytes, CompilerOptions pars offset = 0; } + // Store raw bytes (after BOM removal) for DATA section extraction. + // In Perl 5, <DATA> reads raw bytes from the file. 
We preserve the original + // bytes so the DATA section can provide raw bytes instead of UTF-8-decoded content. + if (offset > 0) { + parsedArgs.rawCodeBytes = java.util.Arrays.copyOfRange(bytes, offset, bytes.length); + } else { + parsedArgs.rawCodeBytes = bytes; + } + // For UTF-16 encodings, use a decoder that can handle malformed input // This is needed to preserve invalid surrogate sequences that Perl allows if (charset == StandardCharsets.UTF_16LE || charset == StandardCharsets.UTF_16BE) { @@ -78,6 +87,14 @@ private static String detectEncodingAndDecode(byte[] bytes, CompilerOptions pars } } + // When source is detected as ISO-8859-1, mark it as byte string source. + // ISO-8859-1 is a 1:1 mapping from bytes to characters (0x00-0xFF), + // so the Java string already represents the raw byte values correctly. + // The string parser should not re-encode these characters to UTF-8. + if (charset == StandardCharsets.ISO_8859_1) { + parsedArgs.isByteStringSource = true; + } + // For UTF-8 and other charsets, use standard decoding return new String(bytes, offset, bytes.length - offset, charset); } @@ -119,7 +136,43 @@ private static Charset detectCharsetWithoutBOM(byte[] bytes) { } } + // Check if file contains non-ASCII bytes that aren't valid UTF-8. + // Perl 5 without 'use utf8' treats source as Latin-1 (ISO-8859-1). + // We use UTF-8 for valid UTF-8 files (most modern files), but fall back + // to ISO-8859-1 for files with invalid UTF-8 sequences (legacy Latin-1 files). + if (hasNonAscii(bytes) && !isValidUtf8(bytes)) { + return StandardCharsets.ISO_8859_1; + } + // Default to UTF-8 return StandardCharsets.UTF_8; } + + /** + * Checks if the byte array contains any non-ASCII bytes (> 0x7F). + */ + private static boolean hasNonAscii(byte[] bytes) { + for (byte b : bytes) { + if ((b & 0x80) != 0) { + return true; + } + } + return false; + } + + /** + * Validates that the byte array is valid UTF-8. + * Uses Java's CharsetDecoder with strict error handling. 
+ */ + private static boolean isValidUtf8(byte[] bytes) { + CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder() + .onMalformedInput(CodingErrorAction.REPORT) + .onUnmappableCharacter(CodingErrorAction.REPORT); + try { + decoder.decode(ByteBuffer.wrap(bytes)); + return true; + } catch (CharacterCodingException e) { + return false; + } + } } diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalContext.java b/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalContext.java index a2cf71123..8f9d857b3 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalContext.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalContext.java @@ -77,11 +77,18 @@ public static void initializeGlobals(CompilerOptions compilerOptions) { GlobalVariable.getGlobalVariable("main::a"); // initialize $a to "undef" GlobalVariable.getGlobalVariable("main::b"); // initialize $b to "undef" GlobalVariable.globalVariables.put("main::!", new ErrnoVariable()); // initialize $! 
with dualvar support - GlobalVariable.getGlobalVariable("main::,").set(""); // initialize $, to "" + // Initialize $, (output field separator) with special variable class + if (!GlobalVariable.globalVariables.containsKey("main::,")) { + var ofs = new OutputFieldSeparator(); + ofs.set(""); + GlobalVariable.globalVariables.put("main::,", ofs); + } GlobalVariable.globalVariables.put("main::|", new OutputAutoFlushVariable()); // Only set $\ if it hasn't been set yet - prevents overwriting during re-entrant calls if (!GlobalVariable.globalVariables.containsKey("main::\\")) { - GlobalVariable.getGlobalVariable("main::\\").set(compilerOptions.outputRecordSeparator); // initialize $\ + var ors = new OutputRecordSeparator(); + ors.set(compilerOptions.outputRecordSeparator); // initialize $\ + GlobalVariable.globalVariables.put("main::\\", ors); } GlobalVariable.getGlobalVariable("main::$").set(ProcessHandle.current().pid()); // initialize `$$` to process id GlobalVariable.getGlobalVariable("main::?"); @@ -184,17 +191,17 @@ public static void initializeGlobals(CompilerOptions compilerOptions) { System.getenv().forEach((k, v) -> env.put(k, new RuntimeScalar(v))); /* Initialize @INC. - @INC Search order is: - - "-I" argument - - JAR_PERLLIB, the jar directory: src/main/perl/lib - - PERL5LIB env - - ~/.perlonjava/lib (user installed modules) + @INC Search order mirrors Perl 5's site_perl > core pattern: + - "-I" argument (highest priority, user override) + - PERL5LIB env (user environment override) + - ~/.perlonjava/lib (user-installed CPAN modules, like site_perl) + - JAR_PERLLIB (bundled modules, like core lib — lowest priority) + This allows CPAN-installed modules to override bundled ones. 
See also: https://stackoverflow.com/questions/2526804/how-is-perls-inc-constructed */ List inc = GlobalVariable.getGlobalArray("main::INC").elements; inc.addAll(compilerOptions.inc.elements); // add from `-I` - inc.add(new RuntimeScalar(JAR_PERLLIB)); // internal src/main/perl/lib String[] directories = env.getOrDefault("PERL5LIB", new RuntimeScalar("")).toString().split(":"); for (String directory : directories) { if (!directory.isEmpty()) { @@ -210,6 +217,7 @@ public static void initializeGlobals(CompilerOptions compilerOptions) { inc.add(new RuntimeScalar(userLib)); } } + inc.add(new RuntimeScalar(JAR_PERLLIB)); // internal src/main/perl/lib (lowest priority) // Initialize %INC GlobalVariable.getGlobalHash("main::INC"); diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalRuntimeHash.java b/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalRuntimeHash.java new file mode 100644 index 000000000..40df14826 --- /dev/null +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalRuntimeHash.java @@ -0,0 +1,78 @@ +package org.perlonjava.runtime.runtimetypes; + +import java.util.Stack; + +/** + * A DynamicState implementation for global hashes that saves/restores the + * globalHashes map entry when localized. This handles the case where + * {@code local %hash} is followed by {@code *hash = \%other} — the glob + * slot assignment replaces the map entry, so a simple save-and-restore of + * the hash contents (as RuntimeHash.dynamicSaveState does) is insufficient. + * + *

Follows the same pattern as {@link GlobalRuntimeScalar} for scalars. + */ +public class GlobalRuntimeHash implements DynamicState { + private static final Stack<SavedGlobalHashState> localizedStack = new Stack<>(); + private final String fullName; + + public GlobalRuntimeHash(String fullName) { + this.fullName = fullName; + } + + /** + * Called from the JVM-emitted code for {@code local %hash} when the hash + * is a global (not lexical) variable. Registers a DynamicState marker on + * the local-variable stack so that scope exit restores the original hash. + * + * @param fullName the fully-qualified hash name (e.g. "main::_") + * @return the current RuntimeHash (callers may ignore this in VOID context) + */ + public static RuntimeHash makeLocal(String fullName) { + var localMarker = new GlobalRuntimeHash(fullName); + DynamicVariableManager.pushLocalVariable(localMarker); + return GlobalVariable.getGlobalHash(fullName); + } + + @Override + public void dynamicSaveState() { + // Save the current hash reference from the global map + RuntimeHash original = GlobalVariable.globalHashes.get(fullName); + localizedStack.push(new SavedGlobalHashState(fullName, original)); + + // Install a fresh empty hash in the global map + RuntimeHash newLocal = new RuntimeHash(); + GlobalVariable.globalHashes.put(fullName, newLocal); + + // Update glob aliases so they all point to the new local hash + java.util.List<String> aliasGroup = GlobalVariable.getGlobAliasGroup(fullName); + for (String alias : aliasGroup) { + if (!alias.equals(fullName)) { + GlobalVariable.globalHashes.put(alias, newLocal); + } + } + } + + @Override + public void dynamicRestoreState() { + if (!localizedStack.isEmpty()) { + SavedGlobalHashState saved = localizedStack.peek(); + if (saved.fullName.equals(this.fullName)) { + localizedStack.pop(); + + // Restore the original hash reference in the global map + GlobalVariable.globalHashes.put(saved.fullName, saved.originalHash); + + // Restore glob aliases + java.util.List<String> aliasGroup = 
GlobalVariable.getGlobAliasGroup(saved.fullName); + for (String alias : aliasGroup) { + if (!alias.equals(saved.fullName)) { + GlobalVariable.globalHashes.put(alias, saved.originalHash); + } + } + } + } + } + + private record SavedGlobalHashState(String fullName, RuntimeHash originalHash) { + } +} diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalRuntimeScalar.java b/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalRuntimeScalar.java index 1948455f5..4ff982deb 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalRuntimeScalar.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/GlobalRuntimeScalar.java @@ -43,14 +43,26 @@ public static RuntimeScalar makeLocal(String fullName) { @Override public void dynamicSaveState() { - // Create a new RuntimeScalar for the localized value - GlobalRuntimeScalar newLocal = new GlobalRuntimeScalar(fullName); - // Save the current global reference var originalVariable = GlobalVariable.globalVariables.get(fullName); localizedStack.push(new SavedGlobalState(fullName, originalVariable)); + // Create a new variable for the localized scope. + // For output separator variables, create the matching special type so that + // set() in the localized scope correctly updates the internal value that print reads. + // Also save the internal separator value for restoration. 
+ RuntimeScalar newLocal; + if (originalVariable instanceof OutputRecordSeparator) { + OutputRecordSeparator.saveInternalORS(); + newLocal = new OutputRecordSeparator(); + } else if (originalVariable instanceof OutputFieldSeparator) { + OutputFieldSeparator.saveInternalOFS(); + newLocal = new OutputFieldSeparator(); + } else { + newLocal = new GlobalRuntimeScalar(fullName); + } + // Replace this variable in the global symbol table with the new one GlobalVariable.globalVariables.put(fullName, newLocal); @@ -72,6 +84,13 @@ public void dynamicRestoreState() { if (saved.fullName.equals(this.fullName)) { localizedStack.pop(); + // Restore the internal separator values if this was an output separator variable + if (saved.originalVariable instanceof OutputRecordSeparator) { + OutputRecordSeparator.restoreInternalORS(); + } else if (saved.originalVariable instanceof OutputFieldSeparator) { + OutputFieldSeparator.restoreInternalOFS(); + } + // Restore the original variable in the global symbol table GlobalVariable.globalVariables.put(saved.fullName, saved.originalVariable); diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/OutputFieldSeparator.java b/src/main/java/org/perlonjava/runtime/runtimetypes/OutputFieldSeparator.java new file mode 100644 index 000000000..747982fee --- /dev/null +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/OutputFieldSeparator.java @@ -0,0 +1,99 @@ +package org.perlonjava.runtime.runtimetypes; + +import java.util.Stack; + +/** + * Special variable for $, (output field separator). + * + *

Like $\ (OutputRecordSeparator), $, has special semantics in Perl: + * print uses an internal copy that is only updated by direct assignment + * to $,. Aliasing via "for $, (@list)" does NOT affect the separator + * print uses between arguments. + * + *

This class maintains a static {@code internalOFS} that print reads, + * separate from the variable's value in the global symbol table. + */ +public class OutputFieldSeparator extends RuntimeScalar { + + /** + * The internal OFS value that print reads. + * Only updated by OutputFieldSeparator.set() calls. + */ + private static String internalOFS = ""; + + /** + * Stack for save/restore during local $, and for $, (list). + */ + private static final Stack<String> ofsStack = new Stack<>(); + + public OutputFieldSeparator() { + super(); + } + + /** + * Returns the internal OFS value for use by print. + */ + public static String getInternalOFS() { + return internalOFS; + } + + /** + * Save the current internalOFS onto the stack. + * Called from GlobalRuntimeScalar.dynamicSaveState() when localizing $,. + */ + public static void saveInternalOFS() { + ofsStack.push(internalOFS); + } + + /** + * Restore internalOFS from the stack. + * Called from GlobalRuntimeScalar.dynamicRestoreState() when restoring $,. 
+ */ + public static void restoreInternalOFS() { + if (!ofsStack.isEmpty()) { + internalOFS = ofsStack.pop(); + } + } + + @Override + public RuntimeScalar set(RuntimeScalar value) { + super.set(value); + internalOFS = this.toString(); + return this; + } + + @Override + public RuntimeScalar set(String value) { + super.set(value); + internalOFS = this.toString(); + return this; + } + + @Override + public RuntimeScalar set(int value) { + super.set(value); + internalOFS = this.toString(); + return this; + } + + @Override + public RuntimeScalar set(long value) { + super.set(value); + internalOFS = this.toString(); + return this; + } + + @Override + public RuntimeScalar set(boolean value) { + super.set(value); + internalOFS = this.toString(); + return this; + } + + @Override + public RuntimeScalar set(Object value) { + super.set(value); + internalOFS = this.toString(); + return this; + } +} diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/OutputRecordSeparator.java b/src/main/java/org/perlonjava/runtime/runtimetypes/OutputRecordSeparator.java new file mode 100644 index 000000000..f35fe8991 --- /dev/null +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/OutputRecordSeparator.java @@ -0,0 +1,104 @@ +package org.perlonjava.runtime.runtimetypes; + +import java.util.Stack; + +/** + * Special variable for $\ (output record separator). + * + *

In Perl, the output record separator ($\) has special semantics: + * when print reads $\, it uses an internal copy (PL_ors_sv in C Perl) + * that is only updated by direct assignment to $\. This means that + * aliasing $\ via "for $\ (@list)" does NOT affect what print appends, + * because the alias changes the Perl-visible variable but not the + * internal ORS value. + * + *

This class maintains a static {@code internalORS} that print reads, + * separate from the variable's value in the global symbol table. + * Only {@code set()} on an OutputRecordSeparator instance updates + * {@code internalORS}; aliasing replaces the map entry with a plain + * RuntimeScalar whose set() does not touch internalORS. + */ +public class OutputRecordSeparator extends RuntimeScalar { + + /** + * The internal ORS value that print reads. + * Only updated by OutputRecordSeparator.set() calls. + */ + private static String internalORS = ""; + + /** + * Stack for save/restore during local $\ and for $\ (list). + */ + private static final Stack<String> orsStack = new Stack<>(); + + public OutputRecordSeparator() { + super(); + } + + /** + * Returns the internal ORS value for use by print. + */ + public static String getInternalORS() { + return internalORS; + } + + /** + * Save the current internalORS onto the stack. + * Called from GlobalRuntimeScalar.dynamicSaveState() when localizing $\. + */ + public static void saveInternalORS() { + orsStack.push(internalORS); + } + + /** + * Restore internalORS from the stack. + * Called from GlobalRuntimeScalar.dynamicRestoreState() when restoring $\. 
+ */ + public static void restoreInternalORS() { + if (!orsStack.isEmpty()) { + internalORS = orsStack.pop(); + } + } + + @Override + public RuntimeScalar set(RuntimeScalar value) { + super.set(value); + internalORS = this.toString(); + return this; + } + + @Override + public RuntimeScalar set(String value) { + super.set(value); + internalORS = this.toString(); + return this; + } + + @Override + public RuntimeScalar set(int value) { + super.set(value); + internalORS = this.toString(); + return this; + } + + @Override + public RuntimeScalar set(long value) { + super.set(value); + internalORS = this.toString(); + return this; + } + + @Override + public RuntimeScalar set(boolean value) { + super.set(value); + internalORS = this.toString(); + return this; + } + + @Override + public RuntimeScalar set(Object value) { + super.set(value); + internalORS = this.toString(); + return this; + } +} diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/RegexState.java b/src/main/java/org/perlonjava/runtime/runtimetypes/RegexState.java index 35806b55d..e4e9a4455 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/RegexState.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/RegexState.java @@ -25,6 +25,7 @@ public class RegexState implements DynamicState { private final RuntimeRegex lastSuccessfulPattern; private final boolean lastMatchUsedPFlag; private final String[] lastCaptureGroups; + private final boolean lastMatchWasByteString; public RegexState() { this.globalMatcher = RuntimeRegex.globalMatcher; @@ -39,6 +40,7 @@ public RegexState() { this.lastSuccessfulPattern = RuntimeRegex.lastSuccessfulPattern; this.lastMatchUsedPFlag = RuntimeRegex.lastMatchUsedPFlag; this.lastCaptureGroups = RuntimeRegex.lastCaptureGroups; + this.lastMatchWasByteString = RuntimeRegex.lastMatchWasByteString; } public static void save() { @@ -67,5 +69,6 @@ public void dynamicRestoreState() { RuntimeRegex.lastSuccessfulPattern = this.lastSuccessfulPattern; 
RuntimeRegex.lastMatchUsedPFlag = this.lastMatchUsedPFlag; RuntimeRegex.lastCaptureGroups = this.lastCaptureGroups; + RuntimeRegex.lastMatchWasByteString = this.lastMatchWasByteString; } } diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeCode.java b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeCode.java index 7b935a3c2..eaf52624d 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeCode.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeCode.java @@ -1544,6 +1544,11 @@ public static RuntimeList call(RuntimeScalar runtimeScalar, } else { perlClassName = NameNormalizer.getBlessStr(blessId); } + } else if (runtimeScalar.type == RuntimeScalarType.GLOB) { + // Bare typeglob used as method invocant (e.g., *FH->print(...)) + // Auto-bless to IO::File, same as GLOBREFERENCE + perlClassName = "IO::File"; + ModuleOperators.require(new RuntimeScalar("IO/File.pm")); } else if (!runtimeScalar.getDefinedBoolean()) { throw new PerlCompilerException("Can't call method \"" + methodName + "\" on an undefined value"); } else { diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeIO.java b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeIO.java index ca0ffabba..31fb3c625 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeIO.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeIO.java @@ -247,6 +247,24 @@ public RuntimeIO(DirectoryIO directoryIO) { this.directoryIO = directoryIO; } + /** + * Checks if this handle is in byte mode (no encoding layers). + * + *

In Perl, reads from handles without encoding layers (e.g., :raw, :bytes, + * or default mode) produce byte strings (UTF-8 flag off). Reads from handles + * with encoding layers (e.g., :utf8, :encoding(UTF-8)) produce character + * strings (UTF-8 flag on).

+ * + * @return true if the handle produces byte data (no encoding layers active) + */ + public boolean isByteMode() { + if (ioHandle instanceof LayeredIOHandle layered) { + return !layered.hasEncodingLayer(); + } + // Non-layered handles (CustomFileChannel, etc.) are always byte mode + return true; + } + public static void registerChildProcess(Process p) { if (p != null) childProcesses.put(p.pid(), p); } @@ -1259,7 +1277,7 @@ public RuntimeScalar write(String data) { // When no encoding layer is active, check for wide characters (> 0xFF). // Perl 5 warns and outputs UTF-8 encoding of the entire string in this case. - if (!(ioHandle instanceof LayeredIOHandle)) { + if (isByteMode()) { boolean hasWide = false; for (int i = 0; i < data.length(); i++) { if (data.charAt(i) > 0xFF) { diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeSubstrLvalue.java b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeSubstrLvalue.java index de1efc8af..99ee27ec4 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeSubstrLvalue.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeSubstrLvalue.java @@ -37,7 +37,9 @@ public RuntimeSubstrLvalue(RuntimeScalar parent, String str, int offset, int len this.length = length; this.outOfBounds = false; - this.type = RuntimeScalarType.STRING; + // Preserve BYTE_STRING type from parent so substr() on byte strings stays byte + this.type = (parent.type == RuntimeScalarType.BYTE_STRING) + ? 
RuntimeScalarType.BYTE_STRING : RuntimeScalarType.STRING; this.value = str; } diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/ScalarSpecialVariable.java b/src/main/java/org/perlonjava/runtime/runtimetypes/ScalarSpecialVariable.java index 6e58b087a..88c450bf4 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/ScalarSpecialVariable.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/ScalarSpecialVariable.java @@ -143,34 +143,34 @@ public RuntimeScalar getValueAsScalar() { RuntimeScalar result = switch (variableId) { case CAPTURE -> { String capture = RuntimeRegex.captureString(position); - yield capture != null ? new RuntimeScalar(capture) : scalarUndef; + yield capture != null ? makeRegexResultScalar(capture) : scalarUndef; } case MATCH -> { String match = RuntimeRegex.matchString(); - yield match != null ? new RuntimeScalar(match) : scalarUndef; + yield match != null ? makeRegexResultScalar(match) : scalarUndef; } case PREMATCH -> { String prematch = RuntimeRegex.preMatchString(); - yield prematch != null ? new RuntimeScalar(prematch) : scalarUndef; + yield prematch != null ? makeRegexResultScalar(prematch) : scalarUndef; } case POSTMATCH -> { String postmatch = RuntimeRegex.postMatchString(); - yield postmatch != null ? new RuntimeScalar(postmatch) : scalarUndef; + yield postmatch != null ? makeRegexResultScalar(postmatch) : scalarUndef; } case P_PREMATCH -> { if (!RuntimeRegex.lastMatchUsedPFlag) yield scalarUndef; String prematch = RuntimeRegex.preMatchString(); - yield prematch != null ? new RuntimeScalar(prematch) : scalarUndef; + yield prematch != null ? makeRegexResultScalar(prematch) : scalarUndef; } case P_MATCH -> { if (!RuntimeRegex.lastMatchUsedPFlag) yield scalarUndef; String match = RuntimeRegex.matchString(); - yield match != null ? new RuntimeScalar(match) : scalarUndef; + yield match != null ? 
makeRegexResultScalar(match) : scalarUndef; } case P_POSTMATCH -> { if (!RuntimeRegex.lastMatchUsedPFlag) yield scalarUndef; String postmatch = RuntimeRegex.postMatchString(); - yield postmatch != null ? new RuntimeScalar(postmatch) : scalarUndef; + yield postmatch != null ? makeRegexResultScalar(postmatch) : scalarUndef; } case LAST_FH -> { if (RuntimeIO.lastAccesseddHandle == null) { @@ -454,6 +454,18 @@ public void dynamicRestoreState() { super.dynamicRestoreState(); } + /** + * Creates a RuntimeScalar from a regex match result string, preserving + * BYTE_STRING type if the matched input was a byte string. + */ + private static RuntimeScalar makeRegexResultScalar(String value) { + RuntimeScalar scalar = new RuntimeScalar(value); + if (RuntimeRegex.lastMatchWasByteString) { + scalar.type = RuntimeScalarType.BYTE_STRING; + } + return scalar; + } + /** * Enum to represent the id of the special variable. * diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/TieScalar.java b/src/main/java/org/perlonjava/runtime/runtimetypes/TieScalar.java index 4a46bdb5e..ab3a1d297 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/TieScalar.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/TieScalar.java @@ -57,6 +57,8 @@ public static RuntimeScalar tiedUntie(RuntimeScalar runtimeScalar) { /** * Fetches the value from a tied scalar (delegates to FETCH). + * Caches the result so that after untie, the variable retains + * the last FETCH'd value (matching Perl 5 behavior). 
*/ public RuntimeScalar tiedFetch() { RuntimeScalar result = tieCall("FETCH"); @@ -76,4 +78,4 @@ public RuntimeScalar tiedStore(RuntimeScalar v) { public RuntimeScalar getPreviousValue() { return previousValue; } -} \ No newline at end of file +} diff --git a/src/main/perl/lib/B.pm b/src/main/perl/lib/B.pm index a35efc7ef..3f44d4f98 100644 --- a/src/main/perl/lib/B.pm +++ b/src/main/perl/lib/B.pm @@ -127,7 +127,8 @@ package B::CV { $self->{_pkg_name} = 'main'; $self->{_is_anon} = 1; if ($self->{ref} && ref($self->{ref}) eq 'CODE') { - require Sub::Util; + eval { require Sub::Util }; + return if $@; # Sub::Util not available, use defaults my $fqn = Sub::Util::subname($self->{ref}); if (defined $fqn && $fqn ne '__ANON__') { # Split "Package::Name::subname" into package and name diff --git a/src/main/perl/lib/ExtUtils/MakeMaker.pm b/src/main/perl/lib/ExtUtils/MakeMaker.pm index 4427a556b..9fe1394a5 100644 --- a/src/main/perl/lib/ExtUtils/MakeMaker.pm +++ b/src/main/perl/lib/ExtUtils/MakeMaker.pm @@ -427,7 +427,9 @@ sub _create_install_makefile { # Build install commands for module/data/share files my @install_cmds; + my @blib_cmds; # Also copy to blib/lib for test compatibility my %dirs_seen; + my %blib_dirs_seen; for my $src (sort keys %$pm) { my $dest = $pm->{$src}; my $dir = dirname($dest); @@ -435,6 +437,26 @@ sub _create_install_makefile { push @install_cmds, _shell_mkdir($dir); } push @install_cmds, _shell_cp($src, $dest); + + # Build blib/lib copy command: derive relative path from source + # Source is like "lib/Text/CSV.pm" -> blib dest is "blib/lib/Text/CSV.pm" + my $blib_rel; + if ($src =~ m{^lib/(.*)$}) { + $blib_rel = $1; + } elsif ($src =~ m{^blib/lib/(.*)$}) { + $blib_rel = $1; + } else { + # Flat layout: compute from dest path relative to INSTALL_BASE + ($blib_rel = $dest) =~ s{^\Q$INSTALL_BASE\E/?}{}; + } + if ($blib_rel) { + my $blib_dest = "blib/lib/$blib_rel"; + my $blib_dir = dirname($blib_dest); + unless ($blib_dirs_seen{$blib_dir}++) { + push 
@blib_cmds, _shell_mkdir($blib_dir); + } + push @blib_cmds, _shell_cp($src, $blib_dest); + } } # Build install commands for scripts @@ -452,6 +474,7 @@ sub _create_install_makefile { } my $install_cmds_str = join("\n", @install_cmds) || "\t\@true"; + my $blib_cmds_str = join("\n", @blib_cmds) || "\t\@true"; my $script_cmds_str = join("\n", @script_cmds) || "\t\@true"; my $file_count = scalar(keys %$pm) + scalar(keys %$scripts); @@ -501,13 +524,17 @@ INSTALLSITELIB = $installsitelib NOECHO = \@ RM_RF = rm -rf -all: pm_to_blib pl_files config +all: pm_to_blib pure_all pl_files config \t\@echo "PerlOnJava: $name v$version installed ($file_count files)" # Copy module and data files to installation directory pm_to_blib: $install_cmds_str +# Copy to blib/lib for test compatibility (make test uses PERL5LIB=./blib/lib) +pure_all: +$blib_cmds_str + # Process PL_FILES pl_files: $pl_cmds_str @@ -534,7 +561,7 @@ realclean: clean distclean: clean \t\$(RM_RF) $makefile ${makefile}.old -.PHONY: all pm_to_blib pl_files config test install clean realclean distclean install_scripts +.PHONY: all pm_to_blib pure_all pl_files config test install clean realclean distclean install_scripts MAKEFILE # Call MY::postamble if it exists (File::ShareDir::Install uses this) diff --git a/src/main/perl/lib/namespace/autoclean.pm b/src/main/perl/lib/namespace/autoclean.pm index 5826aacaa..72435220e 100644 --- a/src/main/perl/lib/namespace/autoclean.pm +++ b/src/main/perl/lib/namespace/autoclean.pm @@ -5,32 +5,154 @@ use warnings; our $VERSION = '0.31'; -# namespace::autoclean stub for PerlOnJava +# namespace::autoclean for PerlOnJava # -# This is a no-op stub that provides the interface but skips cleanup. -# -# Problem: The real namespace::autoclean uses subname() to detect whether a -# function was defined in the current package or imported. Functions where -# subname() returns a different package are cleaned. 
However, this breaks -# modules like DateTime::TimeZone that import Try::Tiny's try/catch and use -# them internally. -# -# Solution: Skip all cleanup. The cleanup is just namespace hygiene - it -# prevents imported functions from being callable as methods. Since PerlOnJava -# is typically used in controlled environments where this isn't a concern, -# skipping cleanup is safe and enables modules like DateTime to work. +# Removes imported functions from a package's namespace at end of scope, +# keeping locally-defined methods. Uses Sub::Util::subname (via XSLoader) +# to determine whether a function was imported or defined locally. + +use B::Hooks::EndOfScope 'on_scope_end'; +use List::Util 'first'; + +# Load the XS Sub::Util implementation directly to avoid CPAN version conflicts +BEGIN { + require XSLoader; + XSLoader::load('Sub::Util', '1.63'); +} sub import { - # Accept all arguments but do nothing - # Real signature: ($class, %args) where %args can include -cleanee, -also, -except - return; + my ($class, %args) = @_; + + my $subcast = sub { + my $i = shift; + return $i if ref $i eq 'CODE'; + return sub { $_ =~ $i } if ref $i eq 'Regexp'; + return sub { $_ eq $i }; + }; + + my $runtest = sub { + my ($code, $method_name) = @_; + local $_ = $method_name; + return $code->(); + }; + + my $cleanee = exists $args{-cleanee} ? $args{-cleanee} : scalar caller; + + my @also = map $subcast->($_), ( + exists $args{-also} + ? (ref $args{-also} eq 'ARRAY' ? @{ $args{-also} } : $args{-also}) + : () + ); + + my @except = map $subcast->($_), ( + exists $args{-except} + ? (ref $args{-except} eq 'ARRAY' ? @{ $args{-except} } : $args{-except}) + : () + ); + + on_scope_end { + my $subs = _get_functions($cleanee); + my $method_check = _method_check($cleanee); + + my @clean = grep { + my $method = $_; + ! 
first { $runtest->($_, $method) } @except + and ( + !$method_check->($method) + or first { $runtest->($_, $method) } @also + ) + } keys %$subs; + + # Remove cleaned functions from the stash + if (@clean) { + no strict 'refs'; + for my $func (@clean) { + # Save non-CODE slots (scalars, arrays, hashes, etc.) + my $glob = *{"${cleanee}::${func}"}; + my @saved; + for my $slot (qw(SCALAR ARRAY HASH IO FORMAT)) { + my $ref = *{$glob}{$slot}; + push @saved, [$slot, $ref] if defined $ref; + } + + # Delete the glob entirely + delete ${"${cleanee}::"}{$func}; + + # Restore non-CODE slots + for my $pair (@saved) { + my ($slot, $ref) = @$pair; + # Recreate the glob with just the non-CODE slots + if ($slot eq 'SCALAR' && defined $$ref) { + *{"${cleanee}::${func}"} = $ref; + } elsif ($slot eq 'ARRAY' && @$ref) { + *{"${cleanee}::${func}"} = $ref; + } elsif ($slot eq 'HASH' && %$ref) { + *{"${cleanee}::${func}"} = $ref; + } + } + } + } + }; +} + +# Get all functions in a package +sub _get_functions { + my $package = shift; + my %subs; + no strict 'refs'; + for my $name (keys %{"${package}::"}) { + next if $name =~ /^[A-Z]+$/; # Skip special names like BEGIN, END, etc. + my $glob = ${"${package}::"}{$name}; + # Check if the glob has a CODE slot + if (defined &{"${package}::${name}"}) { + $subs{$name} = \&{"${package}::${name}"}; + } + } + return \%subs; } -# Provide the subname function in case anything checks for it -sub subname { - my ($coderef) = @_; - # Return a reasonable default - the B module integration isn't always available - return ref($coderef) eq 'CODE' ? 
'__ANON__' : undef; +# Check if a function is a "method" (defined locally vs imported) +sub _method_check { + my $package = shift; + + # For Moose/Moo classes, use the metaclass if available + if (defined &Class::MOP::class_of) { + my $meta = Class::MOP::class_of($package); + if ($meta) { + my %methods = map +($_ => 1), $meta->get_method_list; + $methods{meta} = 1 + if $meta->isa('Moose::Meta::Role') + && eval { Moose->VERSION } < 0.90; + return sub { $_[0] =~ /^\(/ || $methods{$_[0]} }; + } + } + + # For plain classes: use subname to detect origin + my $does = $package->can('does') ? 'does' + : $package->can('DOES') ? 'DOES' + : undef; + + return sub { + return 1 if $_[0] =~ /^\(/; # Overloaded operators + + my $coderef = do { no strict 'refs'; \&{"${package}::$_[0]"} }; + my $fullname = Sub::Util::subname($coderef); + return 1 unless defined $fullname; # Can't determine origin, keep it + + my ($code_stash) = $fullname =~ /\A(.*)::/s; + return 1 unless defined $code_stash; + + return 1 if $code_stash eq $package; # Defined locally + return 1 if $code_stash eq 'constant'; # Constant subs + # Companion/helper packages (e.g. DateTime::PP for DateTime) install + # functions via glob assignment — these are intentional methods, not imports. + # In PerlOnJava, method calls are resolved at runtime through the stash, + # so we must not remove them. 
+        return 1 if index($code_stash, "${package}::") == 0;  # Companion package
+        return 1 if $does && eval { $package->$does($code_stash) };  # Role methods
+
+        return 0;  # Imported - clean it
+    };
 }
 
 1;
@@ -39,81 +161,40 @@ __END__
 
 =head1 NAME
 
-namespace::autoclean - PerlOnJava stub (no cleanup performed)
+namespace::autoclean - Keep imports out of your namespace
 
 =head1 SYNOPSIS
 
-    package MyClass;
+    package Foo;
     use namespace::autoclean;
     use Some::Exporter qw(imported_function);
 
-    sub method { imported_function('args') }
+    sub bar { imported_function('stuff') }
 
-    # In real namespace::autoclean, imported_function would be removed
-    # In this stub, it remains available (both as function and method)
+    # later:
+    Foo->bar;               # works
+    Foo->imported_function; # fails - cleaned after compilation
 
 =head1 DESCRIPTION
 
-This is a stub implementation of namespace::autoclean for PerlOnJava. It
-provides the interface but performs no actual cleanup.
-
-=head2 Why a stub?
-
-The real namespace::autoclean removes imported functions from a package's
-namespace to keep it clean. It uses C or the B module
-to detect which functions were imported vs defined locally.
-
-This breaks modules like DateTime::TimeZone that:
-
-=over 4
-
-=item 1. Import functions from Try::Tiny (try, catch)
-
-=item 2. Use namespace::autoclean
-
-=item 3. Call those functions internally
-
-=back
-
-The imported try/catch get cleaned, causing "Undefined subroutine" errors.
-
-=head2 Why is skipping cleanup safe?
-
-The cleanup is purely cosmetic - it prevents imported functions from being
-callable as methods on objects. In most use cases:
-
-=over 4
-
-=item * Methods are called by name, not discovered dynamically
-
-=item * Imported functions aren't accidentally called as methods
-
-=item * The slight namespace pollution is harmless
-
-=back
+When you import a function into a Perl package, it will naturally also be
+available as a method. The C<namespace::autoclean> pragma will remove all
+imported symbols at the end of the current package's compile cycle. Functions
+called in the package itself will still be bound by their name, but they won't
+show up as methods on your class or instances.
 
 =head1 PARAMETERS
 
-The following parameters are accepted but ignored:
-
-=over 4
-
-=item -cleanee => $package
-
-=item -also => \@subs or qr/pattern/
-
-=item -except => \@subs or qr/pattern/
-
-=back
+=head2 -cleanee => $package
 
-=head1 SEE ALSO
+Specify which package to clean (defaults to caller).
 
-L - The module this is based on
+=head2 -also => ITEM | REGEX | SUB | ARRAYREF
 
-L - A module that benefits from this stub
+Additional functions to clean.
 
-=head1 COPYRIGHT
+=head2 -except => ITEM | REGEX | SUB | ARRAYREF
 
-This is a PerlOnJava compatibility stub.
+Functions to exclude from cleaning.
 
 =cut