HzaCode · HzaCode · Jun 14, 2026
diff --git a/.gitignore b/.gitignore
@@ -49,3 +49,6 @@ docs/.doctrees/
 .DS_Store
 Thumbs.db
 
+# Claude Code (personal/local config)
+.claude/settings.local.json
+
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -58,6 +58,15 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/), and this
   in the `README.md` Roadmap section and the `flake8 onecite tests`
   validation check.
 
+### Removed
+- `onecite process` no longer accepts `--google-scholar`, and
+  `process_references()` no longer accepts the `use_google_scholar`
+  parameter. Google Scholar was never consulted from the authoritative
+  `process` path, so the flag and parameter were no-ops there. Google
+  Scholar remains available as an opt-in, best-effort fallback on
+  `onecite suggest --google-scholar` /
+  `suggest_references(use_google_scholar=True)`.
+
 ### Fixed
 - Corrected the benchmark Nature DQN DOI fixture from
   `10.1038/nature14539` to `10.1038/nature14236`, and added regression
@@ -87,6 +96,20 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/), and this
 - Clarified that `onecite benchmark --json` is the deterministic offline
   health check, while `onecite process ...` may contact upstream APIs
   unless fixtures or mocks are explicitly configured.
+- DOI-backed BibTeX input now keeps the canonical CrossRef/DataCite field
+  values instead of letting the original entry override them; original
+  fields still fill gaps the API leaves empty, and the existing citation
+  key is still preserved.
+- A CrossRef 404 now always falls back to DataCite instead of only doing so
+  for a short hardcoded prefix list, so dataset/software/thesis DOIs
+  registered under other DataCite prefixes resolve.
+- `suggest` no longer routes queries containing words such as "synthesis",
+  "hypothesis", or "parenthesis" to the thesis search (whole-word match for
+  "thesis"/"dissertation").
+- GitHub clone URLs ending in `.git` now resolve to the correct repository.
+- Plain-text entry ids stay contiguous when entries are separated by more
+  than one blank line, and a dead PLOS article-id branch was removed from
+  the text parser.
 
 ## [0.1.1] - 2026-04-17
 

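The whole-word fix for `suggest` routing in the Fixed list above can be pictured with a word-boundary regex. This is a minimal sketch and not OneCite's actual code: `looks_like_thesis_query` is a hypothetical helper name, and the real routing may combine this check with other signals.

```python
import re

# Hypothetical helper, for illustration only; not part of OneCite's API.
# \b anchors each keyword at word boundaries, so "synthesis" and
# "hypothesis" no longer match just because they end in "thesis".
_THESIS_WORDS = re.compile(r"\b(thesis|dissertation)\b", re.IGNORECASE)


def looks_like_thesis_query(query: str) -> bool:
    """Return True only when "thesis" or "dissertation" is a whole word."""
    return _THESIS_WORDS.search(query) is not None
```

A plain substring test such as `"thesis" in query.lower()` would accept "parenthesis"; the word-boundary form rejects it while still accepting "PhD thesis".
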
diff --git a/README.md b/README.md
@@ -36,13 +36,13 @@
 ---
 
 <p align="center">
-  OneCite is a command-line tool and Python library for citation management. It accepts DOIs, paper titles, arXiv IDs, and mixed inputs, and outputs formatted bibliographic entries.
+  OneCite is a command-line tool and Python library for citation management. It resolves strong identifiers such as DOIs, PMIDs, arXiv IDs, ISBNs, GitHub URLs, and data DOIs into formatted bibliographic entries, while plain-text title searches are handled by the separate candidate-only suggest command.
 </p>
 
 ---
 
 
- Researchers frequently accumulate reference lists in ad-hoc formats—DOIs copied from browser tabs, arXiv IDs from paper PDFs, titles typed by hand, and BibTeX fragments from various sources. Cleaning these into consistent BibTeX output is tedious and error-prone. OneCite parses raw reference text and attempts metadata lookup against configured sources such as CrossRef, PubMed, arXiv, and Semantic Scholar. The result is a reproducible processing layer that reports unresolved entries and produces auditable BibTeX where metadata can be found.
+ Researchers frequently accumulate reference lists in ad-hoc formats—DOIs copied from browser tabs, arXiv IDs from paper PDFs, PMIDs, ISBNs, software URLs, data DOIs, and BibTeX fragments from various sources. Cleaning these into consistent BibTeX output is tedious and error-prone. OneCite parses raw reference text and resolves strong identifiers against configured sources such as CrossRef, PubMed, arXiv, DataCite, GitHub, and Google Books. Plain-text title searches are exposed through `onecite suggest` so candidates can be reviewed without being mistaken for verified BibTeX. The result is a reproducible processing layer that reports unresolved entries and produces auditable BibTeX where metadata can be found.
 
 
 
@@ -54,14 +54,13 @@
 
 | Feature                 | Description                                                                                             |
 | ----------------------- | ------------------------------------------------------------------------------------------------------- |
-| **Fuzzy Matching**          | Attempt to match incomplete references against configured academic metadata sources.                 |
+| **Candidate Suggestions**   | Search incomplete plain-text references with `onecite suggest` without resolving them to BibTeX.     |
 | **Multiple Formats**        | Input `.txt`/`.bib` → Output **BibTeX**.                                                             |
 | **4-stage Pipeline**        | A 4-stage process (clean → query → validate → format) to produce consistent output.                  |
 | **Field Completion**        | Fill available fields returned by metadata sources, such as journal, volume, pages, authors, and abstract. |
 | 🎓 **7+ Citation Types**    | Handles journal articles, conference papers, books, software, datasets, theses, and preprints.        |
 | **Multi-Source Lookup**     | Uses source-specific routes for CrossRef, arXiv, PubMed, Semantic Scholar, Google Books, and others. |
-| **Many Identifier Types**   | Accepts DOI, PMID, arXiv ID, ISBN, GitHub URL, Zenodo DOI, or plain text queries.                    |
-| 🎛️ **Interactive Mode**    | Manually select the correct entry when multiple potential matches are found.                          |
+| **Many Identifier Types**   | Resolves DOI, PMID, arXiv ID, ISBN, GitHub URL, Zenodo DOI, and DataCite DOI inputs.                 |
 | **Custom Templates**        | YAML-based presets that provide a fallback BibTeX entry type when auto-detection is inconclusive.    |
 
 
@@ -97,9 +96,9 @@ Create a file named `references.txt` with your mixed-format references:
 
 10.1038/nature14539
 
-Attention is all you need, Vaswani et al., NIPS 2017
+arXiv:1706.03762
 
-Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
+ISBN:9780262035613
 
 https://github.com/tensorflow/tensorflow
 
@@ -157,7 +156,7 @@ Your `results.bib` file now contains entries of different types.
 
 ```bash
 onecite process "10.1038/nature14539"
-onecite process "Attention is all you need, Vaswani et al., NIPS 2017"
+onecite suggest "Attention is all you need, Vaswani et al., NIPS 2017"
 echo "10.1038/nature14539" | onecite process -
 ```
 </details>
@@ -198,16 +197,12 @@ Use OneCite directly in your Python scripts.
 ```python
 from onecite import process_references
 
-# A callback can be used for non-interactive selection (e.g., always choose the best match)
-def auto_select_callback(candidates):
-    return 0 # Index of the best candidate
-
 result = process_references(
-    input_content="Deep learning review\nLeCun, Bengio, Hinton\nNature 2015",
+    input_content="10.1038/nature14539",
     input_type="txt",
     template_name="journal_article_full",
     output_format="bibtex",
-    interactive_callback=auto_select_callback
+    interactive_callback=lambda candidates: -1
 )
 
 print('\n\n'.join(result['results']))
@@ -229,7 +224,7 @@ onecite process <input_file> [OPTIONS]
 ```
 
 **Arguments:**
-- `input_file` - Input file path, `-` for stdin, or a reference string (e.g., DOI, title)
+- `input_file` - Input file path, `-` for stdin, or a strong identifier/reference string
 
 **Options:**
 | Option | Short | Description | Default |
@@ -243,7 +238,6 @@ onecite process <input_file> [OPTIONS]
 | `--json` | | Print a stable JSON envelope instead of BibTeX text | `False` |
 | `--ndjson` | | Print newline-delimited JSON events for streaming automation workflows | `False` |
 | `--fail-on-unresolved` | | Return exit code `2` when any entry cannot be resolved | `False` |
-| `--google-scholar` | | Enable Google Scholar as an additional data source (requires scholarly package) | `False` |
 
 **Examples:**
 ```bash
@@ -253,9 +247,6 @@ onecite process references.txt -o results.bib
 # Process a BibTeX file with auto-detection
 onecite process references.bib
 
-# Process with interactive mode
-onecite process ambiguous.txt --interactive
-
 # Use stdin
 echo "10.1038/nature14539" | onecite process -
 
@@ -265,9 +256,6 @@ onecite process "10.1038/nature14539"
 # Process with custom template
 onecite process references.txt --template conference_paper
 
-# Enable Google Scholar (requires scholarly package)
-onecite process references.txt --google-scholar
-
 # Quiet mode for scripts
 onecite process references.txt -o results.bib --quiet
 
@@ -278,6 +266,28 @@ onecite process references.txt --json --fail-on-unresolved
 onecite process references.txt --ndjson
 ```
 
+### `onecite suggest`
+
+Search for candidate matches without producing BibTeX or returning a
+validation `passed` status.
+
+```bash
+onecite suggest "Attention is all you need, Vaswani et al., NIPS 2017" --json
+```
+
+**Optional Google Scholar fallback.** `suggest` accepts `--google-scholar`
+(requires the optional `scholarly` package: `pip install onecite[scholar]`).
+It is consulted only as a best-effort fallback when CrossRef and Semantic
+Scholar return nothing. Because it scrapes a service with no public API, it
+is **off by default, may be rate-limited or blocked by a CAPTCHA, and is not
+guaranteed to be reproducible**. It is exposed only on `suggest` (candidates
+for human review), never on `process` (authoritative output).
+
+```bash
+pip install onecite[scholar]
+onecite suggest "some obscure title" --google-scholar
+```
+
 ### `onecite --version`
 
 Display the installed OneCite version.

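The README changes above split inputs between `onecite process` (strong identifiers) and `onecite suggest` (plain-text titles). The sketch below shows one way a script could decide which command to call for a given line. It rests on assumptions: `choose_command` is a hypothetical name, and the patterns are simplified illustrations rather than OneCite's own detection rules.

```python
import re

# Simplified, illustrative patterns; OneCite's real detection may differ.
_STRONG_ID_PATTERNS = [
    # DOI, bare or as a doi.org URL
    re.compile(r"^(https?://(dx\.)?doi\.org/)?10\.\d{4,9}/\S+$", re.IGNORECASE),
    # arXiv ID such as arXiv:1706.03762
    re.compile(r"^arxiv:\s*\d{4}\.\d{4,5}(v\d+)?$", re.IGNORECASE),
    # ISBN-10 or ISBN-13, with optional hyphens
    re.compile(r"^isbn:?\s*[\dXx-]{10,17}$", re.IGNORECASE),
    # PMID with an explicit prefix
    re.compile(r"^pmid:?\s*\d+$", re.IGNORECASE),
    # GitHub repository URL
    re.compile(r"^https?://github\.com/[\w.-]+/[\w.-]+/?$", re.IGNORECASE),
]


def choose_command(reference: str) -> str:
    """Return "process" for a strong identifier, otherwise "suggest"."""
    text = reference.strip()
    if any(pattern.match(text) for pattern in _STRONG_ID_PATTERNS):
        return "process"
    return "suggest"
```

With these assumed patterns, `choose_command("arXiv:1706.03762")` selects `process`, while a free-text title falls through to `suggest` for human review.
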
diff --git a/docs/api/core.rst b/docs/api/core.rst
@@ -19,7 +19,6 @@ The primary function for processing citations.
         template_name: str,
         output_format: str,
         interactive_callback: Callable[[List[Dict]], int],
-        use_google_scholar: bool = False,
     ) -> Dict[str, Any]
 
 **Parameters:**
@@ -28,8 +27,7 @@ The primary function for processing citations.
 - ``input_type`` (str): Type of input - ``"txt"`` or ``"bib"`` (required)
 - ``template_name`` (str): Template name to use (e.g., ``"journal_article_full"``) (required)
 - ``output_format`` (str): Output format - currently only ``"bibtex"`` is supported (required)
-- ``interactive_callback`` (Callable): Function to handle ambiguous matches. Takes a list of candidate dicts and returns the selected index (0-based), or -1 to skip (required)
-- ``use_google_scholar`` (bool): Enable Google Scholar as an additional data source. Requires the optional ``scholarly`` package. Default is ``False``.
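+- ``interactive_callback`` (Callable): Callback kept for API compatibility. It takes a list of candidate dicts and returns the selected index (0-based), or -1 to skip. Plain-text candidate search is handled by ``suggest_references`` instead (required)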
 
 **Returns:**
 
@@ -53,7 +51,7 @@ A dictionary with keys:
         input_type="txt",
         template_name="journal_article_full",
         output_format="bibtex",
-        interactive_callback=lambda candidates: 0  # Auto-select first match
+        interactive_callback=lambda candidates: -1
     )
 
     # Access results
@@ -216,7 +214,7 @@ For typical usage, ``process_references()`` is simpler. PipelineController expos
         input_type="txt",
         template_name="journal_article_full",
         output_format="bibtex",
-        interactive_callback=lambda candidates: 0
+        interactive_callback=lambda candidates: -1
     )
 
     print(result['results'])

diff --git a/docs/api/pipeline.rst b/docs/api/pipeline.rst
@@ -72,48 +72,43 @@ parsing fails.
 Stage 2: Identify (``IdentifierModule``)
 ----------------------------------------
 
-**Purpose:** resolve each ``RawEntry`` against academic data sources and
-produce an ``IdentifiedEntry`` with a DOI (when possible) plus basic
-metadata.
+**Purpose:** resolve each ``RawEntry`` that carries a strong identifier into
+an ``IdentifiedEntry`` with a DOI, arXiv ID, or URL plus basic metadata.
+Plain-text title searches are not resolved by the processing pipeline; use
+the suggestion workflow for candidate search.
 
-**Input:** ``List[RawEntry]`` and an ``interactive_callback`` that picks
-from candidate lists when confidence is medium.
+**Input:** ``List[RawEntry]`` and an ``interactive_callback`` kept for API
+compatibility.
 
 **Output:** ``List[IdentifiedEntry]``.
 
 **Data sources actually queried by the code:**
 
-- CrossRef (DOI-based and fuzzy search)
-- Semantic Scholar (keyword search)
+- CrossRef (DOI-based lookup; candidate search in suggest mode)
+- Semantic Scholar (candidate search in suggest mode)
 - arXiv (via feedparser)
 - PubMed (biomedical, queried when strong cues are present)
 - DataCite / Zenodo (datasets)
 - Google Books (books — triggered by ISBN or publisher cues)
 - CORE / BASE (theses)
 - GitHub (software repositories)
-- Google Scholar (optional, disabled by default; opt-in via
-  ``--google-scholar`` or ``use_google_scholar=True`` and requires the
-  ``scholarly`` package)
+- Google Scholar (optional, ``suggest``-only best-effort fallback, disabled by
+  default; opt-in via ``suggest --google-scholar`` or
+  ``suggest_references(use_google_scholar=True)`` and requires the
+  ``scholarly`` package; never used by ``process``)
 
 There is **no runtime routing based on filename** and no fixed priority
-for "medical", "CS" or "general" queries.  Signal-based heuristics
-inside ``_fuzzy_search`` decide when to *additionally* query PubMed,
-Google Books, CORE/BASE, etc., but CrossRef and Semantic Scholar are
-always consulted for text queries.
+for "medical", "CS" or "general" queries. Signal-based heuristics in
+suggestion mode decide when to *additionally* query PubMed, Google Books,
+CORE/BASE, etc. Text-only entries in process mode are
+reported as unresolved instead of being guessed.
 
 **Confidence model:**
 
-After all sources have returned candidates, ``_score_candidates`` assigns
-each candidate a ``match_score`` (0–100) based on title / author /
-year / venue similarity to the query.  The decision logic in
-``_fuzzy_search`` then chooses one of three paths:
-
-- ``match_score >= 80`` and a clear best candidate → auto-adopt
-- ``70 <= match_score < 80`` → call the ``interactive_callback`` with up
-  to 5 candidates; fall back to the top candidate if the user skips and
-  the score is still ≥ 75
-- ``match_score >= 50`` and a title is present → adopt cautiously
-- otherwise → mark the entry as ``identification_failed``
+After all suggestion sources have returned candidates, ``_score_candidates``
+assigns each candidate a ``match_score`` (0–100) based on title / author /
+year / venue similarity to the query. Scores are returned for human or
+downstream review; they are not treated as validation proof.
 
 Fallback paths never fabricate data: an entry that cannot be resolved is
 marked ``identification_failed`` rather than filled with invented
@@ -219,7 +214,7 @@ high-level ``process_references`` function:
         input_type="txt",
         template_name="journal_article_full",
         output_format="bibtex",
-        interactive_callback=lambda candidates: 0,  # auto-pick first
+        interactive_callback=lambda candidates: -1
     )
 
     print('\n\n'.join(result['results']))

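The confidence model described above returns a 0-100 ``match_score`` for review rather than as validation proof. The following toy scorer illustrates the idea under stated assumptions: it is not the project's ``_score_candidates``, the function name is hypothetical, and the 80/20 weighting of title similarity and year match is chosen only for illustration.

```python
from difflib import SequenceMatcher


def illustrative_match_score(query_title, candidate_title,
                             query_year=None, candidate_year=None):
    """Toy 0-100 score: title similarity weighted 80%, exact year match 20%."""
    # Case-insensitive character-level similarity in the range [0.0, 1.0].
    title_ratio = SequenceMatcher(
        None, query_title.lower(), candidate_title.lower()
    ).ratio()
    # The year only contributes when the query supplies one and it matches.
    year_match = 1.0 if query_year is not None and query_year == candidate_year else 0.0
    return round(100 * (0.8 * title_ratio + 0.2 * year_match))
```

A caller would typically sort candidates by this score and show them to a reviewer; a high score still says nothing about whether the candidate is the intended work.
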
diff --git a/docs/basic_usage.rst b/docs/basic_usage.rst
@@ -17,9 +17,9 @@ A text file where each reference is separated by a **blank line**::
 
     10.1038/nature14539
 
-    Vaswani et al., 2017, Attention is all you need
+    arXiv:1706.03762
 
-    Smith (2020) Neural Architecture Search
+    ISBN:9780262035613
 
 .. note::
 
@@ -115,18 +115,23 @@ line followed by result and failure events::
 
     onecite process input.txt --ndjson
 
-**Google Scholar (--google-scholar)**
+**Google Scholar (suggest only, --google-scholar)**
 
-Enable Google Scholar as an additional data source (requires the optional ``scholarly`` package)::
+A best-effort fallback for the ``suggest`` command only (requires the optional
+``scholarly`` package: ``pip install onecite[scholar]``). It is consulted only
+when CrossRef and Semantic Scholar return nothing. Because it scrapes a service
+with no public API, it is off by default, may be blocked by a CAPTCHA, and is
+not guaranteed to be reproducible. It is never used by ``process``, whose output
+is authoritative. To opt in on ``suggest``::
 
-    onecite process input.txt --google-scholar
+    onecite suggest input.txt --google-scholar
 
 **Direct String Input**
 
 Pass a reference string directly instead of a file::
 
     onecite process "10.1038/nature14539"
-    onecite process "Attention is all you need, Vaswani et al., NIPS 2017"
+    onecite suggest "Attention is all you need, Vaswani et al., NIPS 2017"
 
 **Stdin Input**
 

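The input format above separates references with blank lines, and the changelog notes that entry ids stay contiguous even when several blank lines appear in a row. A minimal sketch of that behaviour follows; ``split_entries`` is a hypothetical name, and numbering from 1 is an assumption rather than a documented detail of OneCite's parser.

```python
import re


def split_entries(text):
    """Split raw text on runs of blank lines and number the entries 1, 2, 3, ..."""
    # \n\s*\n treats any run of blank or whitespace-only lines as one
    # separator, so extra blank lines never create empty entries.
    blocks = re.split(r"\n\s*\n", text.strip())
    entries = [block.strip() for block in blocks if block.strip()]
    # Ids are assigned after empty blocks are dropped, so they stay contiguous.
    return list(enumerate(entries, start=1))
```

Assigning ids only after filtering is the design point: numbering the raw split results instead would leave gaps wherever two separators met.
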
diff --git a/docs/changelog.rst b/docs/changelog.rst
@@ -40,6 +40,15 @@ Changed
   live checks are explicitly marked with ``pytest.mark.live`` so the
   default suite is deterministic and offline.
 
+Removed
+~~~~~~~
+
+- ``onecite process`` no longer accepts ``--google-scholar``, and
+  ``process_references()`` no longer accepts the ``use_google_scholar``
+  parameter (both were no-ops on the authoritative ``process`` path).
+  Google Scholar remains an opt-in, best-effort fallback on
+  ``onecite suggest --google-scholar``.
+
 Fixed
 ~~~~~
 
@@ -60,6 +69,16 @@ Fixed
   distribution artifacts.
 - Added benchmark and doctor checks to the GitHub Actions test
   workflow.
+- DOI-backed BibTeX input keeps canonical CrossRef/DataCite fields
+  instead of letting the original entry override them; original fields
+  still fill gaps and the existing citation key is preserved.
+- A CrossRef 404 always falls back to DataCite instead of only doing so
+  for a short hardcoded prefix list.
+- ``suggest`` no longer routes queries containing words such as
+  "synthesis" or "hypothesis" to the thesis search.
+- GitHub clone URLs ending in ``.git`` resolve to the correct repository.
+- Plain-text entry ids stay contiguous across multi-blank-line gaps, and
+  a dead PLOS article-id branch was removed from the text parser.
 
 [0.1.1] - 2026-04-17
 ---------------------
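
The precedence rule for DOI-backed BibTeX input in the Fixed list above (canonical values win, original fields fill gaps, the citation key is preserved) can be sketched as a small dictionary merge. This is an illustration under assumptions: ``merge_entry`` is a hypothetical name, and storing the citation key under ``"ID"`` is a common BibTeX-parser convention that may not match OneCite's internal representation.

```python
def merge_entry(original, canonical):
    """Merge fields so canonical API values win and original values fill gaps."""
    merged = dict(original)  # start from the user's entry so gaps stay filled
    for field, value in canonical.items():
        # Skip empty API values so they cannot erase a user-supplied field.
        if value not in (None, "", []):
            merged[field] = value
    # The existing citation key is kept even if the API proposes another one.
    if "ID" in original:
        merged["ID"] = original["ID"]
    return merged
```

For example, an original entry with a misspelled title and a personal ``note`` field comes back with the corrected title from the API, the ``note`` untouched, and the same citation key.
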