HzaCode · HzaCode · Jun 14, 2026
diff --git a/.gitignore b/.gitignore
@@ -48,4 +48,3 @@ docs/.doctrees/
 # OS
 .DS_Store
 Thumbs.db
-
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -58,6 +58,26 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/), and this
   in the `README.md` Roadmap section and the `flake8 onecite tests`
   validation check.
 
+### Removed
+- `onecite process` no longer accepts `--google-scholar`, and
+  `process_references()` no longer accepts the `use_google_scholar`
+  parameter. Google Scholar was never consulted from the authoritative
+  `process` path, so the flag and parameter were no-ops there. Google
+  Scholar remains available as an opt-in, best-effort fallback on
+  `onecite suggest --google-scholar` /
+  `suggest_references(use_google_scholar=True)`.
+- Removed the non-functional `--interactive` flag from `onecite process` and the
+  dead interactive/fuzzy-adoption code (`_fuzzy_search`,
+  `_resolve_doi_via_crossref_title`). Plain-text disambiguation is handled by
+  `onecite suggest`; `process` resolves only strong identifiers. The
+  `interactive_callback` parameter remains as a no-op compatibility shim.
+- Removed best-effort metadata scraping of arbitrary HTML/PDF pages
+  (`_extract_metadata_from_url` and helpers, which also relied on an undeclared
+  PyPDF2 dependency) and the body-text DOI fallback in `_extract_doi_from_url`.
+  URL resolution now trusts only a publisher-declared `citation_doi` /
+  schema.org identifier (verified downstream), consistent with the
+  strong-identifier-only contract of `process`.
+
 ### Fixed
 - Corrected the benchmark Nature DQN DOI fixture from
   `10.1038/nature14539` to `10.1038/nature14236`, and added regression
@@ -87,6 +107,26 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/), and this
 - Clarified that `onecite benchmark --json` is the deterministic offline
   health check, while `onecite process ...` may contact upstream APIs
   unless fixtures or mocks are explicitly configured.
+- DOI-backed BibTeX input now keeps the canonical CrossRef/DataCite field
+  values instead of letting the original entry override them; original
+  fields still fill gaps the API leaves empty, and the existing citation
+  key is still preserved.
+- A CrossRef 404 now always falls back to DataCite instead of only doing so
+  for a short hardcoded prefix list, so dataset/software/thesis DOIs
+  registered under other DataCite prefixes resolve.
+- `suggest` no longer routes queries containing words such as "synthesis",
+  "hypothesis", or "parenthesis" to the thesis search (whole-word match for
+  "thesis"/"dissertation").
+- GitHub clone URLs ending in `.git` now resolve to the correct repository.
+- Plain-text entry ids stay contiguous when entries are separated by more
+  than one blank line, and a dead PLOS article-id branch was removed from
+  the text parser.
+- `suggest` candidate ranking now applies the tie-break (exact title, venue,
+  DOI, source tier) within the cluster of candidates scoring within 5 points of
+  the top, instead of letting a fractionally higher raw score always win.
+- BibTeX output now LaTeX-escapes the `abstract` and `editor` fields (not just
+  author/title/journal/...), so Unicode in those fields no longer leaks raw
+  into the `.bib` output.
 
 ## [0.1.1] - 2026-04-17
 
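Note on the `suggest` routing fix in the hunk above: the entry says queries containing "synthesis", "hypothesis", or "parenthesis" are no longer sent to the thesis search because "thesis"/"dissertation" are now matched as whole words. A minimal sketch of that idea, assuming a regex word-boundary check; the helper name `mentions_thesis` and the exact pattern are hypothetical and are not taken from the OneCite source:

```python
import re

# Hypothetical pattern: \b anchors the match at word boundaries, so "thesis"
# embedded inside a longer word (e.g. "synthesis") does not match.
_THESIS_WORD = re.compile(r"\b(?:thesis|dissertation)\b", re.IGNORECASE)


def mentions_thesis(query: str) -> bool:
    """Return True only when 'thesis' or 'dissertation' appears as a whole word."""
    return _THESIS_WORD.search(query) is not None


if __name__ == "__main__":
    for q in ("protein synthesis pathways", "PhD thesis on graph neural networks"):
        print(q, "->", mentions_thesis(q))
```

A plain substring test such as `"thesis" in query.lower()` would return True for all three words named in the changelog entry, which is the misrouting the fix describes. This sketch does not handle plurals such as "theses".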

diff --git a/README.md b/README.md
@@ -36,13 +36,13 @@
 ---
 
 <p align="center">
-  OneCite is a command-line tool and Python library for citation management. It accepts DOIs, paper titles, arXiv IDs, and mixed inputs, and outputs formatted bibliographic entries.
+  OneCite is a command-line tool and Python library for citation management. It resolves strong identifiers such as DOIs, PMIDs, arXiv IDs, ISBNs, GitHub URLs, and data DOIs into formatted bibliographic entries, while plain-text title searches are handled by the separate candidate-only suggest command.
 </p>
 
 ---
 
 
- Researchers frequently accumulate reference lists in ad-hoc formats—DOIs copied from browser tabs, arXiv IDs from paper PDFs, titles typed by hand, and BibTeX fragments from various sources. Cleaning these into consistent BibTeX output is tedious and error-prone. OneCite parses raw reference text and attempts metadata lookup against configured sources such as CrossRef, PubMed, arXiv, and Semantic Scholar. The result is a reproducible processing layer that reports unresolved entries and produces auditable BibTeX where metadata can be found.
+ Researchers frequently accumulate reference lists in ad-hoc formats—DOIs copied from browser tabs, arXiv IDs from paper PDFs, PMIDs, ISBNs, software URLs, data DOIs, and BibTeX fragments from various sources. Cleaning these into consistent BibTeX output is tedious and error-prone. OneCite parses raw reference text and resolves strong identifiers against configured sources such as CrossRef, PubMed, arXiv, DataCite, GitHub, and Google Books. Plain-text title searches are exposed through `onecite suggest` so candidates can be reviewed without being mistaken for verified BibTeX. The result is a reproducible processing layer that reports unresolved entries and produces auditable BibTeX where metadata can be found.
 
 
 
@@ -54,14 +54,13 @@
 
 | Feature                 | Description                                                                                             |
 | ----------------------- | ------------------------------------------------------------------------------------------------------- |
-| **Fuzzy Matching**          | Attempt to match incomplete references against configured academic metadata sources.                 |
+| **Candidate Suggestions**   | Search incomplete plain-text references with `onecite suggest` without resolving them to BibTeX.     |
 | **Multiple Formats**        | Input `.txt`/`.bib` → Output **BibTeX**.                                                             |
 | **4-stage Pipeline**        | A 4-stage process (clean → query → validate → format) to produce consistent output.                  |
 | **Field Completion**        | Fill available fields returned by metadata sources, such as journal, volume, pages, authors, and abstract. |
 | 🎓 **7+ Citation Types**    | Handles journal articles, conference papers, books, software, datasets, theses, and preprints.        |
 | **Multi-Source Lookup**     | Uses source-specific routes for CrossRef, arXiv, PubMed, Semantic Scholar, Google Books, and others. |
-| **Many Identifier Types**   | Accepts DOI, PMID, arXiv ID, ISBN, GitHub URL, Zenodo DOI, or plain text queries.                    |
-| 🎛️ **Interactive Mode**    | Manually select the correct entry when multiple potential matches are found.                          |
+| **Many Identifier Types**   | Resolves DOI, PMID, arXiv ID, ISBN, GitHub URL, Zenodo DOI, and DataCite DOI inputs.                 |
 | **Custom Templates**        | YAML-based presets that provide a fallback BibTeX entry type when auto-detection is inconclusive.    |
 
 
@@ -97,9 +96,9 @@ Create a file named `references.txt` with your mixed-format references:
 
 10.1038/nature14539
 
-Attention is all you need, Vaswani et al., NIPS 2017
+arXiv:1706.03762
 
-Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
+ISBN:9780262035613
 
 https://github.com/tensorflow/tensorflow
 
@@ -157,39 +156,11 @@ Your `results.bib` file now contains entries of different types.
 
 ```bash
 onecite process "10.1038/nature14539"
-onecite process "Attention is all you need, Vaswani et al., NIPS 2017"
+onecite suggest "Attention is all you need, Vaswani et al., NIPS 2017"
 echo "10.1038/nature14539" | onecite process -
 ```
 </details>
 
-<details>
-<summary><strong>Interactive Disambiguation</strong></summary>
-
-For ambiguous entries, use the `--interactive` flag to manually select the correct match and ensure accuracy.
-
-**Command**:
-```bash
-onecite process ambiguous.txt --interactive
-```
-
-**Example Interaction**:
-```
-Found multiple possible matches for "Deep learning Hinton":
-1. Deep learning
-   Authors: LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey
-   Journal: Nature, 2015
-   DOI: 10.1038/nature14539
-
-2. Deep belief networks
-   Authors: Hinton, Geoffrey E.
-   Journal: Scholarpedia, 2009
-   DOI: 10.4249/scholarpedia.5947
-
-Please select (1-2, 0=skip): 1
-Selected: Deep learning
-```
-</details>
-
 <details>
 <summary><strong>🐍 Use as a Python Library</strong></summary>
 
@@ -198,16 +169,12 @@ Use OneCite directly in your Python scripts.
 ```python
 from onecite import process_references
 
-# A callback can be used for non-interactive selection (e.g., always choose the best match)
-def auto_select_callback(candidates):
-    return 0 # Index of the best candidate
-
 result = process_references(
-    input_content="Deep learning review\nLeCun, Bengio, Hinton\nNature 2015",
+    input_content="10.1038/nature14539",
     input_type="txt",
     template_name="journal_article_full",
     output_format="bibtex",
-    interactive_callback=auto_select_callback
+    interactive_callback=lambda candidates: -1
 )
 
 print('\n\n'.join(result['results']))
@@ -229,7 +196,7 @@ onecite process <input_file> [OPTIONS]
 ```
 
 **Arguments:**
-- `input_file` - Input file path, `-` for stdin, or a reference string (e.g., DOI, title)
+- `input_file` - Input file path, `-` for stdin, or a strong identifier/reference string
 
 **Options:**
 | Option | Short | Description | Default |
@@ -238,12 +205,10 @@ onecite process <input_file> [OPTIONS]
 | `--template` | | Fallback BibTeX entry-type preset when auto-detection is inconclusive | `journal_article_full` |
 | `--output-format` | | Output format (currently only `bibtex` supported) | `bibtex` |
 | `--output` | `-o` | Output file path (default: stdout) | - |
-| `--interactive` | | Enable interactive mode for ambiguous matches | `False` |
 | `--quiet` | `-q` | Suppress verbose logging output | `False` |
 | `--json` | | Print a stable JSON envelope instead of BibTeX text | `False` |
 | `--ndjson` | | Print newline-delimited JSON events for streaming automation workflows | `False` |
 | `--fail-on-unresolved` | | Return exit code `2` when any entry cannot be resolved | `False` |
-| `--google-scholar` | | Enable Google Scholar as an additional data source (requires scholarly package) | `False` |
 
 **Examples:**
 ```bash
@@ -253,9 +218,6 @@ onecite process references.txt -o results.bib
 # Process a BibTeX file with auto-detection
 onecite process references.bib
 
-# Process with interactive mode
-onecite process ambiguous.txt --interactive
-
 # Use stdin
 echo "10.1038/nature14539" | onecite process -
 
@@ -265,9 +227,6 @@ onecite process "10.1038/nature14539"
 # Process with custom template
 onecite process references.txt --template conference_paper
 
-# Enable Google Scholar (requires scholarly package)
-onecite process references.txt --google-scholar
-
 # Quiet mode for scripts
 onecite process references.txt -o results.bib --quiet
 
@@ -278,6 +237,28 @@ onecite process references.txt --json --fail-on-unresolved
 onecite process references.txt --ndjson
 ```
 
+### `onecite suggest`
+
+Search for candidate matches without producing BibTeX or returning a
+validation `passed` status.
+
+```bash
+onecite suggest "Attention is all you need, Vaswani et al., NIPS 2017" --json
+```
+
+**Optional Google Scholar fallback.** `suggest` accepts `--google-scholar`
+(requires the optional `scholarly` package: `pip install onecite[scholar]`).
+It is consulted only as a best-effort fallback when CrossRef and Semantic
+Scholar return nothing. Because it scrapes a service with no public API, it
+is **off by default, may be rate-limited or blocked by a CAPTCHA, and is not
+guaranteed to be reproducible** — it is exposed only on `suggest` (candidates
+for human review), never on `process` (authoritative output).
+
+```bash
+pip install onecite[scholar]
+onecite suggest "some obscure title" --google-scholar
+```
+
 ### `onecite --version`
 
 Display the installed OneCite version.

diff --git a/docs/advanced_usage.rst b/docs/advanced_usage.rst
@@ -1,43 +1,19 @@
 Advanced Usage
 ==============
 
-Interactive Disambiguation
----------------------------
+Reviewing Candidates for Ambiguous References
+---------------------------------------------
 
-When OneCite finds multiple potential matches for a reference, it can enter interactive mode to let you choose the correct one.
+``onecite process`` only resolves strong identifiers (DOI, PMID, arXiv ID,
+ISBN, URLs) and never guesses from an ambiguous plain-text reference. To
+inspect candidate matches for a messy or incomplete reference, use
+``onecite suggest``::
 
-Enabling Interactive Mode
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+    onecite suggest "deep learning hinton 2015"
 
-::
-
-    onecite process ambiguous.txt --interactive
-
-Example Session
-~~~~~~~~~~~~~~~
-
-::
-
-    Processing ambiguous.txt...
-
-    Found 2 matches for "Deep learning Hinton":
-
-    1. Deep Learning
-       Authors: Yann LeCun, Yoshua Bengio, Geoffrey Hinton
-       Journal: Nature
-       Year: 2015
-       Volume: 521, Pages: 436-444
-       DOI: 10.1038/nature14539
-
-    2. Deep Belief Networks
-       Authors: Geoffrey E. Hinton, Simon Osindero, Yee-Whye Teh
-       Journal: Neural Computation
-       Year: 2006
-       Volume: 18, Pages: 1527-1554
-       DOI: 10.1162/neco.2006.18.7.1527
-
-    Please select (1-2, 0=skip): 1
-    Selected: Deep Learning (10.1038/nature14539)
+Candidates are returned for human review (with match scores and sources) and
+are not emitted as verified BibTeX. Add ``--json`` for a machine-readable
+envelope.
 
 Batch Processing Multiple Files
 --------------------------------

diff --git a/docs/api/core.rst b/docs/api/core.rst
@@ -19,7 +19,6 @@ The primary function for processing citations.
         template_name: str,
         output_format: str,
         interactive_callback: Callable[[List[Dict]], int],
-        use_google_scholar: bool = False,
     ) -> Dict[str, Any]
 
 **Parameters:**
@@ -28,8 +27,7 @@ The primary function for processing citations.
 - ``input_type`` (str): Type of input - ``"txt"`` or ``"bib"`` (required)
 - ``template_name`` (str): Template name to use (e.g., ``"journal_article_full"``) (required)
 - ``output_format`` (str): Output format - currently only ``"bibtex"`` is supported (required)
-- ``interactive_callback`` (Callable): Function to handle ambiguous matches. Takes a list of candidate dicts and returns the selected index (0-based), or -1 to skip (required)
-- ``use_google_scholar`` (bool): Enable Google Scholar as an additional data source. Requires the optional ``scholarly`` package. Default is ``False``.
+- ``interactive_callback`` (Callable): Compatibility callback; plain-text candidate search is handled by ``suggest_references`` (required)
 
 **Returns:**
 
@@ -53,7 +51,7 @@ A dictionary with keys:
         input_type="txt",
         template_name="journal_article_full",
         output_format="bibtex",
-        interactive_callback=lambda candidates: 0  # Auto-select first match
+        interactive_callback=lambda candidates: -1
     )
 
     # Access results
@@ -216,7 +214,7 @@ For typical usage, ``process_references()`` is simpler. PipelineController expos
         input_type="txt",
         template_name="journal_article_full",
         output_format="bibtex",
-        interactive_callback=lambda candidates: 0
+        interactive_callback=lambda candidates: -1
     )
 
     print(result['results'])
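
Note on the `suggest` ranking change listed in the CHANGELOG hunk: the tie-break (exact title, venue, DOI, source tier) is applied within the cluster of candidates scoring within 5 points of the top. The sketch below is one way to read that rule, not the project's implementation. The `Candidate` fields, the inclusive 5-point window, and the convention that a lower `source_tier` means a more trusted source are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

CLUSTER_WINDOW = 5.0  # "within 5 points of the top" (assumed inclusive)


@dataclass
class Candidate:
    # Hypothetical record; field names are illustrative only.
    title: str
    score: float               # raw match score, higher is better
    exact_title: bool = False
    venue_match: bool = False
    has_doi: bool = False
    source_tier: int = 0       # assumption: lower value = more trusted source


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates so tie-break signals decide inside the top cluster."""
    if not candidates:
        return []
    top = max(c.score for c in candidates)

    def key(c: Candidate):
        if c.score >= top - CLUSTER_WINDOW:
            # Inside the cluster: tie-break signals first, raw score last.
            return (0, not c.exact_title, not c.venue_match,
                    not c.has_doi, c.source_tier, -c.score)
        # Outside the cluster: plain descending score.
        return (1, -c.score)

    return sorted(candidates, key=key)


if __name__ == "__main__":
    ranked = rank([
        Candidate("near match", 91.0),
        Candidate("exact match", 90.2, exact_title=True, has_doi=True),
        Candidate("weak match", 70.0, exact_title=True),
    ])
    print([c.title for c in ranked])
```

In this example the exact-title candidate at 90.2 is ranked ahead of the 91.0 candidate because both fall inside the 5-point cluster, while the 70.0 candidate stays last even though it has an exact title.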