154 changes: 154 additions & 0 deletions .agents/docs/ARCHITECTURE.md
@@ -0,0 +1,154 @@
# Architecture

## Project Overview

jntajis-python is a Python library for transliterating and encoding/decoding characters across three Japanese character set standards: JIS X 0208, JIS X 0213, and Unicode. It also supports transliteration via the MJ (Moji Joho) character table and shrink conversion maps.

## Directory Layout

```
jntajis-python/
setup.py # setuptools + Cython extension build
setup.cfg # Package metadata, dependencies, dev extras
Makefile # Data pipeline: download -> parse -> codegen
src/jntajis/
__init__.py # Public Python API surface (enums + re-exports)
_jntajis.pyx # Cython implementation (core logic)
_jntajis.h # Generated C header (lookup tables)
_jntajis.pyi # Type stubs for the Cython extension
_jntajis.c # Cython-generated C source (not committed normally)
gen.py # Code generator: Excel/JSON -> _jntajis.h
py.typed # PEP 561 marker
tests/
test_encoder.py # Tests for encoding/decoding and IncrementalEncoder
test_mj_translit.py # Tests for MJ shrink candidate transliteration
xlsx_parser/
__init__.py # Re-exports read_xlsx
parser.py # Streaming OpenXML XLSX reader
xmlutils.py # SAX-style XML parser framework (expat-based)
docs/
source/
api.rst # Sphinx API documentation
conf.py # Sphinx configuration
_static/images/ # SVG diagrams
.github/workflows/
main.yml # CI entry point (PR + push + tag triggers)
tests.yml # Lint (black, flake8, mypy) + test job
wheels.yml # cibuildwheel multi-platform wheel builds
```

## High-Level Architecture

The system has three distinct phases: **data pipeline** (build-time), **native extension** (compile-time), and **runtime API** (user-facing).

### 1. Data Pipeline (build-time, `Makefile` + `gen.py`)

External data sources are downloaded and processed into a single generated C header file:

```
[JNTA Excel] ---+
[MJ Excel] ---+--> gen.py (Jinja2 template) --> _jntajis.h (C lookup tables)
[MJ Shrink JSON]+
```

- **JNTA Excel** (`jissyukutaimap1_0_0.xlsx`): NTA shrink conversion map. Downloaded from NTA.
- **MJ Excel** (`mji.00601.xlsx`): MJ character table. Downloaded from CITPC/IPA.
- **MJ Shrink JSON** (`MJShrinkMap.1.2.0.json`): MJ shrink conversion map. Downloaded from CITPC/IPA.

`gen.py` uses a custom `xlsx_parser` to read the Excel files, processes the data into optimized lookup structures, and renders `_jntajis.h` via a Jinja2 template. The generated header contains:

- `tx_mappings[]`: 2*94*94 entries, one per JIS X 0213 codepoint (men-ku-ten)
- `urange_to_jis_mappings[]`: Sorted ranges for Unicode-to-JIS binary search
- `sm_uni_to_jis_mapping()`: State machine for multi-codepoint Unicode-to-JIS mapping
- `urange_to_mj_mappings[]`: Sorted ranges for Unicode-to-MJ-mapping-set binary search
- `mj_shrink_mappings[]`: MJ shrink mapping unicode sets indexed by MJ code

### 2. Native Extension (compile-time, Cython)

`_jntajis.pyx` is a Cython file compiled into a C extension module. It:

- Includes `_jntajis.h` via `cdef extern` to access the generated lookup tables
- Calls the CPython C API directly, including private helpers (`_PyUnicodeWriter`, `_PyBytesWriter`) and string access macros (`PyUnicode_READ`, etc.), for high-performance string construction
- Compiles with safety checks disabled (`boundscheck=False`, `wraparound=False`, `cdivision=True`)

The build process is: `_jntajis.pyx` + `_jntajis.h` --> Cython --> `_jntajis.c` --> C compiler --> `_jntajis.so` (`.pyd` on Windows).

### 3. Runtime API

The public API is exposed via `__init__.py` which re-exports from the Cython extension:

| Symbol | Type | Description |
|--------|------|-------------|
| `jnta_encode()` | function | Unicode -> JIS byte sequence |
| `jnta_decode()` | function | JIS byte sequence -> Unicode |
| `jnta_shrink_translit()` | function | JNTA shrink transliteration (Unicode -> Unicode) |
| `mj_shrink_candidates()` | function | MJ-based shrink transliteration candidates |
| `IncrementalEncoder` | class | Stateful encoder (codec-compatible) |
| `TransliterationError` | exception | Raised on transliteration failure |
| `ConversionMode` | enum | Encoding mode selection |
| `MJShrinkScheme` | enum | Individual MJ shrink scheme identifiers |
| `MJShrinkSchemeCombo` | flag enum | Combinable MJ shrink scheme selectors |

## Key Data Structures

### JIS Code Representation

JIS codepoints are packed into a `uint16_t` as: `(men - 1) * 94 * 94 + (ku - 1) * 94 + (ten - 1)`, where men is 1 or 2 (JIS X 0213 plane), ku is 1-94 (row), ten is 1-94 (column).
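
The packing formula above can be checked with a short Python sketch. The helper names `pack_jis` and `unpack_jis` are hypothetical and exist only for illustration; the library itself does this arithmetic in C/Cython.

```python
from typing import Tuple

PLANE_SIZE = 94 * 94  # number of ku-ten cells per men (plane)


def pack_jis(men: int, ku: int, ten: int) -> int:
    """Pack a 1-based men-ku-ten triple into a single integer code."""
    if men not in (1, 2) or not 1 <= ku <= 94 or not 1 <= ten <= 94:
        raise ValueError(f"invalid men-ku-ten: {men}-{ku}-{ten}")
    return (men - 1) * PLANE_SIZE + (ku - 1) * 94 + (ten - 1)


def unpack_jis(code: int) -> Tuple[int, int, int]:
    """Inverse of pack_jis: recover the 1-based men-ku-ten triple."""
    if not 0 <= code < 2 * PLANE_SIZE:
        raise ValueError(f"packed code out of range: {code}")
    men, rest = divmod(code, PLANE_SIZE)
    ku, ten = divmod(rest, 94)
    return men + 1, ku + 1, ten + 1
```

The largest packed value is `2 * 94 * 94 - 1 = 17671`, which fits comfortably in a `uint16_t`.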

### ShrinkingTransliterationMapping

Each JIS X 0213 position has an entry:
- `jis`: packed men-ku-ten code
- `us[2]`: primary Unicode codepoint(s)
- `sus[2]`: secondary (similar glyph) Unicode codepoint(s)
- `class_`: JIS character class (level 1-4, non-kanji, reserved)
- `tx_jis[4]`/`tx_us[4]`: transliterated form (JIS and Unicode)

### Unicode-to-JIS Reverse Lookup

Uses sorted range tables (`URangeToJISMapping`) with binary search. Multi-codepoint sequences (e.g. base + combining mark) use a state machine (`sm_uni_to_jis_mapping()`).

### MJ Mapping Structures

- `MJMapping`: Maps an MJ code to Unicode codepoints + IVS (Ideographic Variation Sequence) pairs
- `MJMappingSet`: A set of MJ mappings for a single Unicode codepoint
- `URangeToMJMappings`: Sorted range table for Unicode-to-MJ binary search
- `MJShrinkMappingUnicodeSet`: Per-MJ-code shrink targets, one array per scheme (4 schemes)

## Component Interactions

```
User code
|
v
__init__.py (Python enums + re-exports)
|
v
_jntajis.pyx (Cython: encoding, decoding, transliteration logic)
|
v
_jntajis.h (Generated C: static lookup tables + state machine)
```

## xlsx_parser Sub-package

A lightweight, streaming, read-only XLSX parser. It avoids heavyweight dependencies like openpyxl by:

1. Opening XLSX as a zip file (`zipfile.ZipFile`)
2. Parsing `xl/sharedStrings.xml` for the shared string table
3. Parsing `xl/worksheets/sheetN.xml` incrementally via SAX-style handlers
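
The first two steps can be sketched with the standard library alone. This is a simplified illustration, not the code in `parser.py`: the function name `read_shared_strings` is hypothetical, and the sketch ignores details such as phonetic runs that a real reader would need to handle.

```python
import io
import zipfile
from typing import List
from xml.parsers import expat


def read_shared_strings(xlsx_bytes: bytes) -> List[str]:
    """Collect the text of every <si> entry in xl/sharedStrings.xml."""
    strings: List[str] = []
    parts: List[str] = []  # text fragments of the <si> being read
    in_t = False           # True while inside a <t> element

    def start(name, attrs):
        nonlocal in_t
        local = name.rpartition("\n")[2]  # drop the namespace URI prefix
        if local == "si":
            parts.clear()
        elif local == "t":
            in_t = True

    def end(name):
        nonlocal in_t
        local = name.rpartition("\n")[2]
        if local == "t":
            in_t = False
        elif local == "si":
            strings.append("".join(parts))

    def chars(data):
        if in_t:
            parts.append(data)

    parser = expat.ParserCreate(namespace_separator="\n")
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as zf:
        with zf.open("xl/sharedStrings.xml") as f:
            parser.ParseFile(f)  # expat reads the member in chunks
    return strings
```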

The XML parsing framework in `xmlutils.py` provides:
- A hierarchical `Handlers`/`HandlersBase` abstract pattern where each nesting level of XML is handled by a different handler class
- `HandlerShim` wraps handlers to dynamically switch the active handler as XML nesting changes
- `read_xml_incremental()` enables pull-style iteration over worksheet rows
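
The handler-switching idea can be illustrated with a toy example. All class names below (`Handler`, `RootHandler`, `ItemHandler`, `Shim`) are hypothetical stand-ins, not the classes in `xmlutils.py`; the sketch only shows how a shim can swap the active handler whenever a callback returns a non-`None` handler.

```python
from xml.parsers import expat


class Handler:
    """Base handler: every callback may return the next active handler."""

    def __init__(self, outer=None):
        self.outer = outer  # parent handler, restored when this level closes

    def start_element(self, name, attrs):
        return None

    def end_element(self, name):
        return None

    def cdata(self, data):
        return None


class RootHandler(Handler):
    def __init__(self):
        super().__init__()
        self.items = []

    def start_element(self, name, attrs):
        # Descend one level: hand control to a child handler.
        return ItemHandler(self) if name == "item" else None


class ItemHandler(Handler):
    def __init__(self, outer):
        super().__init__(outer)
        self.text = []

    def cdata(self, data):
        self.text.append(data)

    def end_element(self, name):
        self.outer.items.append("".join(self.text))
        return self.outer  # ascend: give control back to the parent


class Shim:
    """Adapts the switching protocol to expat's flat callbacks."""

    def __init__(self, handler):
        self.handler = handler

    def _switch(self, nxt):
        if nxt is not None:
            self.handler = nxt

    def start(self, name, attrs):
        self._switch(self.handler.start_element(name, attrs))

    def end(self, name):
        self._switch(self.handler.end_element(name))

    def cdata(self, data):
        self._switch(self.handler.cdata(data))


def parse_items(xml: str):
    root = RootHandler()
    shim = Shim(root)
    parser = expat.ParserCreate()
    parser.StartElementHandler = shim.start
    parser.EndElementHandler = shim.end
    parser.CharacterDataHandler = shim.cdata
    parser.Parse(xml, True)
    return root.items
```

Each handler only needs to know about its own nesting level, which keeps per-level logic small at the cost of one class per level.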

## CI/CD

- **Trigger** (`main.yml`): On PR open, push to main, or version tag push (`v*`)
- **Lint & Test** (`tests.yml`): black + flake8 + mypy on Python 3.11
- **Wheels** (`wheels.yml`): cibuildwheel across Ubuntu, Windows, macOS (11/12/13), excluding PyPy. Only runs on tag push.

## Documentation

Sphinx with `sphinx_rtd_theme`, hosted on Read the Docs. API docs are manually authored in `api.rst` (not autodoc).
238 changes: 238 additions & 0 deletions .agents/docs/IMPLEMENTATION.md
@@ -0,0 +1,238 @@
# Implementation Details

## Code Generation (`gen.py`)

### Entry Point

`gen.py` provides a CLI via `click`:

```
python -m jntajis.gen -- <dest> <src_jnta> <src_mj> <src_mj_shrink>
```

### Input Parsing

Three source data files are read:

1. **`read_jnta_excel_file()`** parses the NTA shrink map Excel:
   - Validates header rows match expected Japanese column names
   - For each row: parses men-ku-ten code, Unicode codepoint(s), JIS character class, transliteration target (single or multi-char)
   - Fills gaps between consecutive JIS codes with `RESERVED` entries
   - Extracts secondary Unicode mappings from memo fields via regex

2. **`read_mj_excel_file()`** parses the MJ character table Excel:
   - Extracts MJ code, corresponding UCS, implemented UCS, IVS pairs (Moji_Joho collection + SVS)
   - Builds `UIVSPair` tuples (Unicode codepoint + variation selector number)
   - Tracks max variant count across all entries

3. **`read_mj_shrink_file()`** parses the MJ shrink map JSON:
   - Reads target Unicode codepoints for each of the 4 shrink schemes
   - Groups by source MJ code

### Data Structure Construction

1. **`build_reverse_mappings()`**: Builds Unicode-to-JIS reverse lookup:
   - Sorts all mappings by primary Unicode codepoint
   - Groups contiguous codepoints into ranges (`URangeToJISMapping`), splitting at gaps >= `gap_thr` (default 256)
   - Separately collects multi-codepoint sequences into `Outer` groups for the state machine

2. **`build_digested_shrink_mappings()`**: Linearizes MJ shrink mappings:
   - Creates a dense array indexed by MJ code
   - Fills gaps with empty tuples
   - Tracks per-scheme maximum array lengths

3. **`build_chunked_mj_mappings()`**: Builds Unicode-to-MJ reverse lookup:
   - Groups all MJ mappings by Unicode codepoint
   - Chunks contiguous ranges, splitting at gaps >= 64
   - Returns `URangeToMJMappings` list + max mapping set size
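
The range-chunking step shared by `build_reverse_mappings()` and `build_chunked_mj_mappings()` might look roughly like the sketch below. The function name `chunk_ranges` is hypothetical, and the sketch assumes that "gap" means the difference between one codepoint and the previous one in sorted order.

```python
from typing import Iterable, List, Tuple


def chunk_ranges(codepoints: Iterable[int], gap_thr: int = 256) -> List[Tuple[int, int]]:
    """Group sorted codepoints into (start, end) ranges, end inclusive.

    A new range starts whenever the distance to the previous codepoint
    is at least gap_thr; smaller holes stay inside the current range.
    """
    ranges: List[List[int]] = []
    for cp in sorted(set(codepoints)):
        if ranges and cp - ranges[-1][1] < gap_thr:
            ranges[-1][1] = cp  # extend the current range
        else:
            ranges.append([cp, cp])  # gap too large: open a new range
    return [(start, end) for start, end in ranges]
```

A larger threshold yields fewer ranges (a shorter binary search) but more wasted slots in each dense sub-array, so the threshold trades table size against lookup depth.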

### Template Rendering

Uses Jinja2 to render the C header from `code_template`. The template generates:

- `JISCharacterClass` enum
- `ShrinkingTransliterationMapping` struct and the `tx_mappings[]` array (2 * 94 * 94 entries)
- Per-range `uint16_t` arrays for Unicode-to-JIS lookup
- `URangeToJISMapping` array for binary search
- `sm_uni_to_jis_mapping()` function: a C switch-based state machine for multi-codepoint Unicode sequences
- MJ-related structs and arrays (`MJMapping`, `MJMappingSet`, `URangeToMJMappings`, `MJShrinkMappingUnicodeSet`)

## Cython Extension (`_jntajis.pyx`)

### Compiler Directives

```cython
# cython: language_level=3, cdivision=True, boundscheck=False, wraparound=False, embedsignature=True
```

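Bounds checking and negative-index wraparound are disabled for performance, and `cdivision=True` selects C division semantics (no `ZeroDivisionError` check, C-style results for operands of opposite sign). `embedsignature=True` embeds Python signatures in docstrings.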

### Core Internal Types

- **`JNTAJISIncrementalEncoder`**: Struct holding encoder state:
  - `encoding`: Python string (ref-counted) for error reporting
  - `replacement`: Fallback JIS code (0xFFFF = no replacement)
  - `put_jis`: Function pointer selecting the output strategy
  - `la[32]`/`lal`: Lookahead buffer for multi-codepoint sequences
  - `shift_state`/`state`: State machine state

- **`JNTAJISIncrementalEncoderContext`**: Per-call context wrapping the encoder + `_PyBytesWriter` for output construction

- **`JNTAJISShrinkingTransliteratorContext`**: Per-call context for `jnta_shrink_translit`, using `_PyUnicodeWriter` for output

- **`MJShrinkCandidates`**: Manages cartesian product enumeration for `mj_shrink_candidates`

### Encoding Flow (`jnta_encode` / `IncrementalEncoder.encode`)

1. Initialize `_PyBytesWriter` with estimated size (2 * input length)
2. For each Unicode codepoint in input:
   a. Feed to `sm_uni_to_jis_mapping()` state machine
   b. If state machine returns a JIS code (state == -1): call `put_jis` function pointer
   c. If state machine is still consuming (state > 0): buffer in lookahead
   d. If state machine returns to state 0 with buffered chars: flush lookahead via reverse table lookup
3. On flush: flush remaining lookahead, emit shift-out if in SISO mode
4. Finalize bytes writer
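
The lookahead logic in step 2 can be illustrated with a toy Python loop. This is a simplified sketch under stated assumptions: it replaces the generated C state machine with a dictionary-based trie, uses made-up output codes, and does not handle sequences that are prefixes of longer sequences. The function name `encode_codes` and both tables are hypothetical.

```python
from typing import Dict, List, Tuple


def encode_codes(
    text: str,
    single: Dict[int, int],
    sequences: Dict[Tuple[int, ...], int],
) -> List[int]:
    """Map text to output codes, preferring multi-codepoint sequences."""
    # Build a trie over the multi-codepoint sequences; the key None
    # marks a node at which a complete sequence ends.
    root: dict = {}
    for seq, code in sequences.items():
        node = root
        for cp in seq:
            node = node.setdefault(cp, {})
        node[None] = code

    out: List[int] = []
    la: List[int] = []  # lookahead buffer of consumed but unresolved codepoints
    node = root

    def flush() -> None:
        # No sequence matched: resolve buffered codepoints one by one.
        for buffered in la:
            out.append(single[buffered])
        la.clear()

    for cp in map(ord, text):
        while True:
            nxt = node.get(cp)
            if nxt is not None:
                la.append(cp)
                node = nxt
                if None in node and len(node) == 1:
                    out.append(node[None])  # full sequence recognised
                    la.clear()
                    node = root
                break
            if node is root:
                out.append(single[cp])  # KeyError signals an unmapped character
                break
            flush()      # mismatch in mid-sequence: emit what was buffered,
            node = root  # then retry the current codepoint from the start
    flush()  # end of input: resolve anything still buffered
    return out
```

The final `flush()` corresponds to step 3: without it, a trailing base character that could have started a sequence would be lost.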

### Output Strategies (`put_jis` function pointers)

| Function | ConversionMode | Behavior |
|----------|---------------|----------|
| `jis_put_siso` | SISO | Emits SI/SO escape bytes for plane switching + 2-byte JIS |
| `jis_put_men_1` | MEN1 | Only allows plane 1; rejects plane 2 characters |
| `jis_put_jisx0208` | JISX0208 | Only allows level 1/2 kanji and JIS X 0208 non-kanji |
| `jis_put_jisx0208_translit` | JISX0208_TRANSLIT | Like JISX0208, but falls back to `tx_jis[]`/`tx_us[]` transliteration for non-0208 chars |
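
The strategy selection can be sketched in Python with bound methods standing in for C function pointers. Only two modes are shown. The byte layout is an assumption made for illustration: each index is offset by 0x21, SO (0x0E) selects plane 2, and SI (0x0F) returns to plane 1. The class and helper names are hypothetical.

```python
from enum import IntEnum

SO, SI = 0x0E, 0x0F  # ASCII shift-out / shift-in control codes
PLANE_SIZE = 94 * 94


class Mode(IntEnum):
    """Stand-in for two of the ConversionMode members."""
    SISO = 0
    MEN1 = 1


def jis_bytes(code: int) -> bytes:
    """Two-byte form of a packed code (assumed 0x21 offset per byte)."""
    ku, ten = divmod(code % PLANE_SIZE, 94)
    return bytes((0x21 + ku, 0x21 + ten))


class ToyEncoder:
    def __init__(self, mode: Mode) -> None:
        self.out = bytearray()
        self.shifted = False  # True while plane 2 is selected
        # Pick the output strategy once, like assigning a function pointer.
        self.put_jis = {Mode.SISO: self._put_siso, Mode.MEN1: self._put_men_1}[mode]

    def _put_siso(self, code: int) -> None:
        plane2 = code >= PLANE_SIZE
        if plane2 != self.shifted:
            self.out.append(SO if plane2 else SI)  # emit a shift on plane change
            self.shifted = plane2
        self.out += jis_bytes(code)

    def _put_men_1(self, code: int) -> None:
        if code >= PLANE_SIZE:
            raise ValueError("plane 2 character is not allowed in MEN1 mode")
        self.out += jis_bytes(code)
```

Selecting the strategy at construction time keeps the per-character loop free of mode checks.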

### Decoding Flow (`jnta_decode`)

1. Initialize `_PyUnicodeWriter`
2. Parse byte pairs as JIS row+column codes
3. Handle SO (0x0E) / SI (0x0F) shift bytes in SISO mode
4. Look up `tx_mappings[jis]` to get Unicode codepoint(s)
5. Write 1 or 2 Unicode codepoints per JIS code
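
A minimal Python sketch of this loop follows. It rests on assumptions not stated above: each byte is offset by 0x21 from its zero-based ku/ten index, SO selects plane 2, and SI returns to plane 1. The `table` argument is a toy stand-in for `tx_mappings[]`, and the name `decode_jis` is hypothetical.

```python
from typing import Dict, List

SO, SI = 0x0E, 0x0F
PLANE_SIZE = 94 * 94


def decode_jis(data: bytes, table: Dict[int, str], siso: bool = True) -> str:
    """Decode byte pairs into text using a packed-code -> string table."""
    out: List[str] = []
    plane = 0  # 0 = men 1, 1 = men 2
    i = 0
    while i < len(data):
        b = data[i]
        if siso and b in (SO, SI):
            plane = 1 if b == SO else 0  # shift byte: switch plane, emit nothing
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated byte pair at end of input")
        ku, ten = b - 0x21, data[i + 1] - 0x21
        if not (0 <= ku < 94 and 0 <= ten < 94):
            raise ValueError(f"invalid byte pair at offset {i}")
        # Each table entry may hold one or two codepoints.
        out.append(table[plane * PLANE_SIZE + ku * 94 + ten])
        i += 2
    return "".join(out)
```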

### JNTA Shrink Transliteration (`jnta_shrink_translit`)

1. Initialize `_PyUnicodeWriter`
2. For each Unicode codepoint: use `sm_uni_to_jis_mapping()` to find JIS code
3. If the JIS code maps to a level 3/4 or non-kanji-extended character with a transliteration entry: output the transliterated form (`tx_us[]`)
4. Otherwise: output the original Unicode codepoint(s) from `us[]`
5. If no mapping found: use replacement string or passthrough

### MJ Shrink Candidates (`mj_shrink_candidates`)

This is the most complex function. It:

1. Allocates per-character candidate arrays (`UIVSPair[20]` per position)
2. For each input character (possibly with trailing IVS):
   a. Look up `urange_to_mj_mappings` to find candidate `MJMapping` entries
   b. If IVS present: filter to exact IVS match
   c. If no IVS: collect all non-IVS variants
   d. For each matching MJ code, look up `mj_shrink_mappings` and collect target Unicode codepoints per selected scheme (combo bitmask)
   e. Also include the original Unicode variants from the MJ mapping itself
   f. If no candidates: keep the original character
3. Enumerate the cartesian product of per-character candidates (up to `limit`) using carry-based iteration
4. Build result strings using `_PyUnicodeWriter`
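
Step 3 can be sketched as an odometer over per-position candidate lists. This illustrates carry-based iteration in general; the enumeration order (rightmost position varies fastest) and the name `enumerate_candidates` are assumptions, not taken from the extension.

```python
from typing import List, Sequence


def enumerate_candidates(per_char: Sequence[Sequence[str]], limit: int) -> List[str]:
    """Return up to `limit` strings from the cartesian product of candidates."""
    if any(not cands for cands in per_char):
        return []  # a position without candidates makes the product empty
    idx = [0] * len(per_char)  # one counter per input position
    results: List[str] = []
    while len(results) < limit:
        results.append("".join(c[i] for c, i in zip(per_char, idx)))
        # Increment the rightmost counter and propagate carries leftwards.
        pos = len(idx) - 1
        while pos >= 0:
            idx[pos] += 1
            if idx[pos] < len(per_char[pos]):
                break
            idx[pos] = 0
            pos -= 1
        if pos < 0:
            break  # carried past the leftmost counter: product exhausted
    return results
```

Because the counters are advanced lazily, the loop never materialises more than `limit` combinations, which matters when several positions each have many candidates.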

### Binary Search Pattern

Both `lookup_rev_table()` and `lookup_mj_mapping_table()` use the same pattern:
- Binary search over sorted range arrays
- Each range has `start`, `end`, and a pointer to a dense sub-array
- Index into sub-array as `array[u - start]`
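
The same pattern in Python, using `bisect` over a toy table. The table contents are made up, and treating `end` as inclusive is an assumption; holes inside a range are represented here by `None`.

```python
from bisect import bisect_right
from typing import Optional

# (start, end, dense sub-array) triples sorted by start; values are made up.
RANGES = [
    (0x41, 0x43, [10, None, 12]),
    (0x3042, 0x3043, [20, 21]),
]
STARTS = [start for start, _, _ in RANGES]


def lookup(u: int) -> Optional[int]:
    """Find the entry for codepoint u, or None if no range covers it."""
    i = bisect_right(STARTS, u) - 1  # last range whose start is <= u
    if i < 0:
        return None
    start, end, sub = RANGES[i]
    if u > end:
        return None  # u falls in the gap after this range
    return sub[u - start]
```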

### Unicode String Internals Access

The extension calls the CPython C API directly to read string data without copying:
- `PyUnicode_KIND()`: Get the internal storage width (1, 2, or 4 bytes per character)
- `PyUnicode_DATA()`: Get raw buffer pointer
- `PyUnicode_READ()`: Read a codepoint at an index
- `_PyUnicodeWriter` / `_PyBytesWriter`: Private (underscore-prefixed) buffer builders that handle memory allocation and string compaction

This makes the code CPython-specific: it cannot run on other Python implementations, and the private writer APIs carry no stability guarantee across CPython versions.

## xlsx_parser Implementation

### xmlutils.py - XML Framework

The framework builds a hierarchical SAX handler system:

- **`Handlers`** (ABC): Defines `start_element()`, `end_element()`, `cdata()` -- each returns `Optional[Handlers]` to signal handler switching
- **`HandlersBase`**: Concrete base with `outer` (parent handler), `parser` ref, `path` tuple for error reporting, and `next()` for creating child handlers
- **`HandlerShim`**: Adapts the handler-switching protocol to expat's flat callback interface; stores the current handler and swaps it when a method returns non-None
- **`wrap_start_element_handler`**: Decorator that splits `namespace\nlocal_name` and converts attlist to `OrderedDict`
- **`read_xml_incremental()`**: Drives expat parsing in 4KB chunks, yielding events from a `pull_events` callback between chunks
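
The chunked, pull-style parsing described for `read_xml_incremental()` can be approximated with a generator around expat. This sketch is not the function's actual signature: `iter_rows` is a hypothetical name, and it collects `<row>` attributes directly instead of using a `pull_events` callback.

```python
from typing import BinaryIO, Dict, Iterator, List
from xml.parsers import expat


def iter_rows(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[Dict[str, str]]:
    """Yield the attributes of each <row> element as soon as it is parsed."""
    pending: List[Dict[str, str]] = []  # events produced by the latest chunk

    def start(name, attrs):
        if name == "row":
            pending.append(dict(attrs))

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    while True:
        chunk = stream.read(chunk_size)
        final = not chunk
        parser.Parse(chunk, final)  # callbacks fire only inside Parse()
        yield from pending          # hand events to the caller between chunks
        pending.clear()
        if final:
            break
```

The caller iterates with an ordinary `for` loop, so memory use stays bounded by the chunk size plus the events of one chunk rather than by the size of the worksheet.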

### parser.py - XLSX Parser

Layered handler hierarchy for each XML document:

**Shared strings** (`xl/sharedStrings.xml`):
- Level 0 (`SharedStringsReader_0`): Expects `<sst>`
- Level 1 (`SharedStringsReader_1`): Iterates `<si>` elements
- Level 2 (`SharedStringsReader_2`): Extracts text from `<t>` within `<si>`

**Worksheet** (`xl/worksheets/sheetN.xml`):
- Level 0 (`WorksheetReader_0`): Expects `<worksheet>`
- Level 1 (`WorksheetReader_1`): Handles `<dimension>` and `<sheetData>`
- Level 2 (`WorksheetReader_2`): Iterates `<row>` elements
- Level 3 (`WorksheetReader_3`): Iterates `<c>` (cell) elements within a row
- Level 4 (`WorksheetReader_4`): Extracts `<v>` (value) or `<f>` (formula) content

**`StreamingWorksheetReader`**: Resolves shared string references (`t="s"`) and pads sparse rows into dense arrays based on cell references (e.g. "A1", "C3").
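
The row-padding step can be sketched as follows. The helper names are hypothetical, and using `None` as the fill value for missing cells is an assumption.

```python
import re
from typing import Any, List, Optional, Sequence, Tuple

_CELL_REF = re.compile(r"([A-Z]+)([0-9]+)")


def column_index(ref: str) -> int:
    """Zero-based column index of a cell reference such as 'A1' or 'AA10'."""
    m = _CELL_REF.fullmatch(ref)
    if m is None:
        raise ValueError(f"invalid cell reference: {ref!r}")
    n = 0
    for ch in m.group(1):  # column letters are a bijective base-26 number
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n - 1


def densify(cells: Sequence[Tuple[str, Any]]) -> List[Optional[Any]]:
    """Turn sparse (reference, value) pairs from one row into a dense list."""
    by_col = {column_index(ref): value for ref, value in cells}
    width = max(by_col) + 1 if by_col else 0
    return [by_col.get(i) for i in range(width)]
```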

**`ReadonlyWorkbook`/`ReadonlyWorksheet`**: Top-level API wrapping zipfile access with lazy shared string loading and incremental row iteration.

## Python API Layer (`__init__.py`)

### Enums

- **`ConversionMode`** (`IntEnum`): SISO=0, MEN1=1, JISX0208=2, JISX0208_TRANSLIT=3
- **`MJShrinkScheme`** (`IntEnum`): Four MJ shrink scheme identifiers (0-3)
- **`MJShrinkSchemeCombo`** (`IntFlag`): Bitmask flags (1, 2, 4, 8) for combining MJ shrink schemes
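
The values listed above suggest that each flag is `1 << scheme`. The sketch below illustrates that relationship with placeholder member names; the real member names are not given here, and the helpers `combo_of` and `schemes_in` are hypothetical.

```python
from enum import IntEnum, IntFlag
from typing import List


class Scheme(IntEnum):
    """Placeholder stand-in for MJShrinkScheme (indices 0-3)."""
    SCHEME_0 = 0
    SCHEME_1 = 1
    SCHEME_2 = 2
    SCHEME_3 = 3


class SchemeCombo(IntFlag):
    """Placeholder stand-in for MJShrinkSchemeCombo (bits 1, 2, 4, 8)."""
    SCHEME_0 = 1
    SCHEME_1 = 2
    SCHEME_2 = 4
    SCHEME_3 = 8


def combo_of(scheme: Scheme) -> SchemeCombo:
    """Flag corresponding to a single scheme index."""
    return SchemeCombo(1 << scheme)


def schemes_in(combo: SchemeCombo) -> List[Scheme]:
    """Scheme indices selected by a combined flag value."""
    return [s for s in Scheme if combo & (1 << s)]
```

Keeping the index enum and the flag enum separate lets per-scheme arrays be indexed directly while callers combine schemes with `|`.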

The Cython extension symbols are imported with a `try/except ImportError` guard so the package can be imported even when the native extension is not built (e.g. for documentation generation).

## Build System

### setup.py / setup.cfg

- Uses `setuptools-scm` for version management (from git tags matching `v*`)
- Declares a single Cython extension: `jntajis._jntajis` from `src/jntajis/_jntajis.pyx`
- Requires Cython >= 0.29 at build time
- No runtime dependencies

### Makefile

Defines the data pipeline with proper dependency tracking:

```
_jntajis.h <-- gen.py + jissyukutaimap1_0_0.xlsx + mji.00601.xlsx + MJShrinkMap.1.2.0.json
jissyukutaimap1_0_0.xlsx <-- syukutaimap1_0_0.zip (curl from NTA)
mji.00601.xlsx <-- mji.00601-xlsx.zip (curl from CITPC)
MJShrinkMap.1.2.0.json <-- MJShrinkMapVer.1.2.0.zip (curl from CITPC)
```

### CI/CD

- Lint + test runs on every PR and push to main
- Wheel builds only on tag push (`v*`)
- Wheels built via `cibuildwheel` on: Ubuntu 20.04, Windows 2019, macOS 11/12/13
- PyPy wheels are skipped (`CIBW_SKIP: pp*`)

## Testing

Two test modules using pytest:

- **`test_encoder.py`**: Tests `jnta_encode()` and `IncrementalEncoder` across all `ConversionMode` values. Covers:
  - Unmapped character encoding errors
  - Single and multi-codepoint sequences (e.g. katakana with combining marks)
  - Transliteration fallback (JISX0208_TRANSLIT mode)
  - Incremental encoding with flush behavior
  - SISO mode with plane switching
  - Supplementary plane characters

- **`test_mj_translit.py`**: Tests `mj_shrink_candidates()`. Covers:
  - Characters with and without IVS
  - Different shrink scheme combinations
  - Characters with multiple shrink candidates
  - Supplementary plane characters (e.g. U+2AC2A)
2 changes: 0 additions & 2 deletions .black.ini

This file was deleted.

2 changes: 0 additions & 2 deletions .flake8

This file was deleted.
