Incheonkirin/README.md

Mingi Jeong

ML/LLM Engineer — retrieval training internals, LLM serving correctness, and search & RAG systems

Previously 5.5 years on the search team at 42Maru — Korean hybrid retrieval (BM25 + dense, learning-to-rank, hard negatives), MRC (machine reading comprehension), RAG, and open-source LLM fine-tuning.

LinkedIn Email


Currently, I use Korean/CJK and insurance-domain retrieval as a stress test for boundary bugs in production AI systems: mixed-precision and distributed embedding losses, hard-negative mining, byte-level tokenizers, tool-call parsers, and analyzers.

🔧 Upstream contributions

These were found by dogfooding my own Korean RAG + evaluation stack, search_system, a Korean insurance-clause retrieval testbed (nori BM25 + BGE-M3 hybrid, a real-query failure catalog, analyzer benchmarks). Most of them share one shape:

Data that is valid on one side of a representation boundary silently breaks the other — NFD Hangul vs. the analyzer, stop strings vs. byte-fragment tokens, a literal </tool_call> vs. the tool-call parser, bf16 logits vs. a float32 loss. Korean hits these boundaries constantly; English-only test suites never do.
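
As a minimal illustration of the NFD Hangul boundary, the standard-library sketch below shows that the same visible word is a different code-point sequence in NFC and NFD. It only demonstrates the representation gap; it does not reproduce any analyzer's behavior, and the sample word is an arbitrary choice.

```python
import unicodedata

# Precomposed (NFC) Hangul: one code point per syllable.
word_nfc = "한글"

# Decomposed (NFD) Hangul: each syllable becomes its conjoining jamo.
word_nfd = unicodedata.normalize("NFD", word_nfc)

# Same rendering on screen, different sequences underneath.
same_codepoints = word_nfc == word_nfd
nfc_length = len(word_nfc)
nfd_length = len(word_nfd)

# Recomposing before analysis restores the precomposed form.
recomposed = unicodedata.normalize("NFC", word_nfd)

print(nfc_length, nfd_length, same_codepoints, recomposed == word_nfc)
```

Any component that matches on code points (a dictionary lookup, a stop list, an exact-match filter) sees two unrelated strings unless one side normalizes first.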

Lately the same probes audit the training signal itself: false-negative masking and distributed positive alignment in contrastive losses (#3817).
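
The failure mode can be sketched in plain Python, assuming a toy one-row contrastive loss in which suspected false negatives are masked to negative infinity. This is not the sentence-transformers implementation; `cross_entropy`, `mask_columns`, and the scores are hypothetical names and values chosen for illustration.

```python
import math

NEG_INF = float("-inf")

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)  # assumes at least one finite logit remains
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]

def mask_columns(logits, columns):
    """Replace the given columns with -inf so they drop out of the softmax."""
    return [NEG_INF if i in columns else x for i, x in enumerate(logits)]

scores = [0.9, 0.7, 0.2]  # column 0 is the true positive
target = 0

# Intended behavior: only the suspected false negative (column 1) is masked.
healthy = cross_entropy(mask_columns(scores, {1}), target)

# Faulty behavior: the positive itself is masked along with it.
broken = cross_entropy(mask_columns(scores, {0, 1}), target)

print(healthy, broken)
```

In this toy version the loss for the affected row is no longer finite once the positive is masked. How such a value propagates through a real training loop is outside the scope of the sketch.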

Retrieval training and embedding losses

  • sentence-transformers #3800 — bf16/fp16 training crash across six learning-to-rank losses. (merged)
  • sentence-transformers #3817 — on multi-GPU gather_across_devices, gathered positives in GISTEmbedLoss/CachedGISTEmbedLoss were masked as false negatives, so the cross-entropy target collapsed to -inf and the training signal silently vanished on rank > 0. Surfaced with a Korean polarity probe; it also covered a regression the earlier in-batch-negative fix (#3453) had left in the GIST losses. (merged)
  • sentence-transformers #3816 — avoid materializing the full non-FAISS hard-negative mining similarity matrix. (merged)
  • sentence-transformers #3812 — MPS support for cached-loss RandContext. (merged)
  • sentence-transformers #3821 — hard-negative mining's relative-margin threshold was sign-dependent and inverted on negative positive-scores; made it sign-independent (#3819). (open)
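
The sign dependence in the last item can be sketched as follows, assuming the relative margin is applied by scaling the positive score. Both functions are hypothetical illustrations rather than the library's code, and the absolute-value form is one possible sign-independent variant, not necessarily the one the PR adopts.

```python
def scaled_threshold(positive_score, relative_margin):
    """Assumed sign-dependent form: scale the positive score directly."""
    return positive_score * (1 - relative_margin)

def magnitude_threshold(positive_score, relative_margin):
    """Assumed sign-independent form: subtract a margin sized by |score|."""
    return positive_score - relative_margin * abs(positive_score)

margin = 0.1

for positive in (0.8, -0.5):
    scaled = scaled_threshold(positive, margin)
    by_magnitude = magnitude_threshold(positive, margin)
    # A usable ceiling for negatives must sit below the positive score.
    print(positive, scaled < positive, by_magnitude < positive)
```

For a positive score of 0.8 both forms give a ceiling just below the positive. For -0.5 the scaled form yields -0.45, which is above the positive, so candidates scoring higher than the positive would pass the filter.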

LLM serving and model internals

Search analyzers and query normalization

  • apache/lucene #16242 — new HangulCompositionCharFilter for analysis-nori: NFD-form Hangul was silently unanalyzable as Korean (#16241). (open)
  • elastic/elasticsearch #151094 — nori's default XPN stop tag silently deletes meaning-bearing Korean prefixes, so 비급여 (non-covered) analyzes to 급여 (covered); fix is a maintainer-invited docs warning (#151157). (open)
  • elastic/elasticsearch #151008 — wildcard queries: re-escape operator characters produced by the normalizer. (open)
  • explosion/spaCy #13974 — Korean tokenizer collapsed whitespace runs, breaking doc.text round-trips and offsets. (open)
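
The round-trip problem in the last item can be illustrated with a standard-library sketch. It does not use spaCy or reflect its internals; the sample text and variable names are assumptions chosen to show why collapsing whitespace runs shifts character offsets.

```python
import re

# Two spaces separate the first two words.
text = "보험  약관 제3조"

# Lossy: splitting on whitespace and rejoining collapses the run.
collapsed = " ".join(text.split())

# Lossless: keep whitespace runs as their own pieces.
pieces = re.findall(r"\S+|\s+", text)
preserved = "".join(pieces)

# Offsets into the original text stay valid only if nothing was collapsed.
spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
offset_in_original = text.index("약관")
offset_in_collapsed = collapsed.index("약관")

print(preserved == text, collapsed == text, spans)
print(offset_in_original, offset_in_collapsed)
```

Any downstream annotation stored as character offsets (entity spans, highlights, evidence spans) points at the wrong characters once the tokenized text no longer matches the input.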

Vector search and evaluation infrastructure

  • facebookresearch/faiss #5272 — diagnosed that musllinux wheels were dropped during the move to official PyPI wheels (*-musllinux_* remained in the cibuildwheel skip list) and outlined the restore path; upstream shipped the fix in faiss-cpu 1.14.3 via #5299. (resolved upstream)

Also in MLflow: OpenTelemetry retriever-span reassembly (mlflow #23818) and restoring dataset expectation/tag logging in genai.evaluate(scorers=[]) (mlflow #23957), plus ragas #2759 and BentoML #5632 / #5633.


📊 Public datasets — NIA AI Hub (42Maru)

Drove these government-published Korean NLP datasets end-to-end at 42Maru — proposal, post-award schema/annotation and difficulty design, then AI modeling and validation. ~2.3M labeled QA pairs plus a ~300M-token corpus across five public datasets, all downloadable on AI Hub.

  • 뉴스 기사 기계독해 (news-article MRC) — 2021, 42Maru lead. Four answer regimes including inference and unanswerable, with evidence spans for the hard cases. data
  • 국가기록물 초거대 AI 말뭉치 (national-archives LLM corpus) — 2023, 42Maru lead. Instruction-tuning data for an LLM (Llama2), with four personas (formal/casual × written/spoken) and length-controlled answers. data
  • 금융·법률 문서 기계독해 (finance/legal MRC) — 2022, data design + modeling. Text+table multimodal QA with cell-coordinate table answers and multiple-choice. data
  • 숫자연산 기계독해 (numeric-reasoning MRC) — 2022, data design + modeling. Numeric reasoning — arithmetic, ratio, date, and multi-fact comparison. data
  • 표 정보 질의응답 (table-information QA) — 2022, 42Maru participant. Complex-table QA with unanswerable cases and complexity tiers, informed by English table-QA benchmarks. data

🏢 Enterprise NLP/QA at 42Maru (press)

Closed-source enterprise deployments I worked on as a technical planner — experiment design, evaluation, and search-quality / data improvements with the research and engineering teams.

  • AI ship-sales design-support system — Daewoo Shipbuilding (DSME): semantic QA over ~100K historical records for shipowners' pre-contract technical inquiries. press
  • AML / trade-based transaction detection — Hana Bank: OCR-NLP over cross-border remittance invoices. press

🧭 Repo map

Private prototypes stay private until they produce either a reproducible upstream bug, a clean public artifact, or a result worth explaining without the scaffolding.


🧰 Stack

Python PyTorch vLLM Elasticsearch / Lucene SFT / DPO / LoRA

Pinned

  1. Incheonkirin.github.io (public)

    Personal site — portfolio and notes.

    TypeScript