1 change: 0 additions & 1 deletion .gitignore
@@ -48,4 +48,3 @@ docs/.doctrees/
# OS
.DS_Store
Thumbs.db

40 changes: 40 additions & 0 deletions CHANGELOG.md
@@ -58,6 +58,26 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/), and this
in the `README.md` Roadmap section and the `flake8 onecite tests`
validation check.

### Removed
- `onecite process` no longer accepts `--google-scholar`, and
`process_references()` no longer accepts the `use_google_scholar`
parameter. Google Scholar was never consulted from the authoritative
`process` path, so the flag and parameter were no-ops there. Google
Scholar remains available as an opt-in, best-effort fallback on
`onecite suggest --google-scholar` /
`suggest_references(use_google_scholar=True)`.
- Removed the non-functional `--interactive` flag from `onecite process` and the
dead interactive/fuzzy-adoption code (`_fuzzy_search`,
`_resolve_doi_via_crossref_title`). Plain-text disambiguation is handled by
`onecite suggest`; `process` resolves only strong identifiers. The
`interactive_callback` parameter remains as a no-op compatibility shim.
- Removed best-effort metadata scraping of arbitrary HTML/PDF pages
(`_extract_metadata_from_url` and helpers, which also relied on an undeclared
PyPDF2 dependency) and the body-text DOI fallback in `_extract_doi_from_url`.
URL resolution now trusts only a publisher-declared `citation_doi` /
schema.org identifier (verified downstream), consistent with the
strong-identifier-only contract of `process`.
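
The "publisher-declared identifier only" rule in the last item can be pictured with a short sketch. This is not OneCite's implementation: `DeclaredDoiParser` and `declared_doi` are hypothetical names, only the `citation_doi` meta tag is handled, and the schema.org route and the downstream verification step are left out.

```python
from html.parser import HTMLParser
from typing import Optional


class DeclaredDoiParser(HTMLParser):
    """Collect the DOI from <meta name="citation_doi" content="...">.

    Hypothetical helper for illustration; body text is never inspected.
    """

    def __init__(self) -> None:
        super().__init__()
        self.doi: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first declared DOI is kept; everything else is ignored.
        if tag != "meta" or self.doi is not None:
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "citation_doi":
            content = attr_map.get("content", "").strip()
            if content:
                self.doi = content


def declared_doi(html: str) -> Optional[str]:
    """Return the publisher-declared DOI, or None if the page declares none."""
    parser = DeclaredDoiParser()
    parser.feed(html)
    return parser.doi
```

A DOI that appears only in the page body yields `None` here, which mirrors the removal of the body-text fallback described above.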

### Fixed
- Corrected the benchmark Nature DQN DOI fixture from
`10.1038/nature14539` to `10.1038/nature14236`, and added regression
@@ -87,6 +107,26 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/), and this
- Clarified that `onecite benchmark --json` is the deterministic offline
health check, while `onecite process ...` may contact upstream APIs
unless fixtures or mocks are explicitly configured.
- DOI-backed BibTeX input now keeps the canonical CrossRef/DataCite field
values instead of letting the original entry override them; original
fields still fill gaps the API leaves empty, and the existing citation
key is still preserved.
- A CrossRef 404 now always falls back to DataCite instead of only doing so
for a short hardcoded prefix list, so dataset/software/thesis DOIs
registered under other DataCite prefixes resolve.
- `suggest` no longer routes queries containing words such as "synthesis",
"hypothesis", or "parenthesis" to the thesis search (whole-word match for
"thesis"/"dissertation").
- GitHub clone URLs ending in `.git` now resolve to the correct repository.
- Plain-text entry ids stay contiguous when entries are separated by more
than one blank line, and a dead PLOS article-id branch was removed from
the text parser.
- `suggest` candidate ranking now applies the tie-break (exact title, venue,
DOI, source tier) within the cluster of candidates scoring within 5 points of
the top, instead of letting a fractionally higher raw score always win.
- BibTeX output now LaTeX-escapes the `abstract` and `editor` fields (not just
author/title/journal/...), so Unicode in those fields no longer leaks raw
into the `.bib` output.
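
The `suggest` ranking change above can be sketched roughly as follows. This is an illustrative sketch, not the project's code: the `Candidate` fields, the `rank` function, the inclusive 5-point boundary, and the "higher `source_tier` is more trusted" convention are all assumptions. Only the tie-break order (exact title, venue, DOI, source tier) comes from the entry itself.

```python
from dataclasses import dataclass
from typing import List

CLUSTER_WIDTH = 5.0  # "within 5 points of the top"; inclusive bound is an assumption


@dataclass
class Candidate:
    title: str
    score: float              # raw match score, higher is better
    exact_title: bool = False
    venue_match: bool = False
    has_doi: bool = False
    source_tier: int = 0      # assumed: higher means a more trusted source


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates, tie-breaking only inside the top-score cluster."""
    if not candidates:
        return []
    top = max(c.score for c in candidates)

    def key(c: Candidate):
        if c.score >= top - CLUSTER_WIDTH:
            # Inside the cluster the tie-break fields outrank the raw score.
            return (1, c.exact_title, c.venue_match, c.has_doi, c.source_tier, c.score)
        # Outside the cluster only the raw score matters.
        return (0, False, False, False, 0, c.score)

    return sorted(candidates, key=key, reverse=True)
```

With this ordering, a candidate scoring 90.8 with an exact title match ranks above one scoring 91.2 without it, while a candidate at 70.0 stays below both regardless of its tie-break fields.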

## [0.1.1] - 2026-04-17

83 changes: 32 additions & 51 deletions README.md
@@ -36,13 +36,13 @@
---

<p align="center">
OneCite is a command-line tool and Python library for citation management. It accepts DOIs, paper titles, arXiv IDs, and mixed inputs, and outputs formatted bibliographic entries.
OneCite is a command-line tool and Python library for citation management. It resolves strong identifiers such as DOIs, PMIDs, arXiv IDs, ISBNs, GitHub URLs, and data DOIs into formatted bibliographic entries, while plain-text title searches are handled by the separate candidate-only suggest command.
</p>

---


Researchers frequently accumulate reference lists in ad-hoc formats—DOIs copied from browser tabs, arXiv IDs from paper PDFs, titles typed by hand, and BibTeX fragments from various sources. Cleaning these into consistent BibTeX output is tedious and error-prone. OneCite parses raw reference text and attempts metadata lookup against configured sources such as CrossRef, PubMed, arXiv, and Semantic Scholar. The result is a reproducible processing layer that reports unresolved entries and produces auditable BibTeX where metadata can be found.
Researchers frequently accumulate reference lists in ad-hoc formats—DOIs copied from browser tabs, arXiv IDs from paper PDFs, PMIDs, ISBNs, software URLs, data DOIs, and BibTeX fragments from various sources. Cleaning these into consistent BibTeX output is tedious and error-prone. OneCite parses raw reference text and resolves strong identifiers against configured sources such as CrossRef, PubMed, arXiv, DataCite, GitHub, and Google Books. Plain-text title searches are exposed through `onecite suggest` so candidates can be reviewed without being mistaken for verified BibTeX. The result is a reproducible processing layer that reports unresolved entries and produces auditable BibTeX where metadata can be found.



@@ -54,14 +54,13 @@

| Feature | Description |
| ----------------------- | ------------------------------------------------------------------------------------------------------- |
| **Fuzzy Matching** | Attempt to match incomplete references against configured academic metadata sources. |
| **Candidate Suggestions** | Search incomplete plain-text references with `onecite suggest` without resolving them to BibTeX. |
| **Multiple Formats** | Input `.txt`/`.bib` → Output **BibTeX**. |
| **4-stage Pipeline** | A 4-stage process (clean → query → validate → format) to produce consistent output. |
| **Field Completion** | Fill available fields returned by metadata sources, such as journal, volume, pages, authors, and abstract. |
| 🎓 **7+ Citation Types** | Handles journal articles, conference papers, books, software, datasets, theses, and preprints. |
| **Multi-Source Lookup** | Uses source-specific routes for CrossRef, arXiv, PubMed, Semantic Scholar, Google Books, and others. |
| **Many Identifier Types** | Accepts DOI, PMID, arXiv ID, ISBN, GitHub URL, Zenodo DOI, or plain text queries. |
| 🎛️ **Interactive Mode** | Manually select the correct entry when multiple potential matches are found. |
| **Many Identifier Types** | Resolves DOI, PMID, arXiv ID, ISBN, GitHub URL, Zenodo DOI, and DataCite DOI inputs. |
| **Custom Templates** | YAML-based presets that provide a fallback BibTeX entry type when auto-detection is inconclusive. |


@@ -97,9 +96,9 @@ Create a file named `references.txt` with your mixed-format references:

10.1038/nature14539

Attention is all you need, Vaswani et al., NIPS 2017
arXiv:1706.03762

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
ISBN:9780262035613

https://github.com/tensorflow/tensorflow

@@ -157,39 +156,11 @@ Your `results.bib` file now contains entries of different types.

```bash
onecite process "10.1038/nature14539"
onecite process "Attention is all you need, Vaswani et al., NIPS 2017"
onecite suggest "Attention is all you need, Vaswani et al., NIPS 2017"
echo "10.1038/nature14539" | onecite process -
```
</details>

<details>
<summary><strong>Interactive Disambiguation</strong></summary>

For ambiguous entries, use the `--interactive` flag to manually select the correct match and ensure accuracy.

**Command**:
```bash
onecite process ambiguous.txt --interactive
```

**Example Interaction**:
```
Found multiple possible matches for "Deep learning Hinton":
1. Deep learning
Authors: LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey
Journal: Nature, 2015
DOI: 10.1038/nature14539

2. Deep belief networks
Authors: Hinton, Geoffrey E.
Journal: Scholarpedia, 2009
DOI: 10.4249/scholarpedia.5947

Please select (1-2, 0=skip): 1
Selected: Deep learning
```
</details>

<details>
<summary><strong>🐍 Use as a Python Library</strong></summary>

@@ -198,16 +169,12 @@ Use OneCite directly in your Python scripts.
```python
from onecite import process_references

# A callback can be used for non-interactive selection (e.g., always choose the best match)
def auto_select_callback(candidates):
    return 0  # Index of the best candidate

result = process_references(
    input_content="Deep learning review\nLeCun, Bengio, Hinton\nNature 2015",
    input_content="10.1038/nature14539",
    input_type="txt",
    template_name="journal_article_full",
    output_format="bibtex",
    interactive_callback=auto_select_callback
    interactive_callback=lambda candidates: -1
)

print('\n\n'.join(result['results']))
@@ -229,7 +196,7 @@ onecite process <input_file> [OPTIONS]
```

**Arguments:**
- `input_file` - Input file path, `-` for stdin, or a reference string (e.g., DOI, title)
- `input_file` - Input file path, `-` for stdin, or a strong identifier/reference string

**Options:**
| Option | Short | Description | Default |
@@ -238,12 +205,10 @@ onecite process <input_file> [OPTIONS]
| `--template` | | Fallback BibTeX entry-type preset when auto-detection is inconclusive | `journal_article_full` |
| `--output-format` | | Output format (currently only `bibtex` supported) | `bibtex` |
| `--output` | `-o` | Output file path (default: stdout) | - |
| `--interactive` | | Enable interactive mode for ambiguous matches | `False` |
| `--quiet` | `-q` | Suppress verbose logging output | `False` |
| `--json` | | Print a stable JSON envelope instead of BibTeX text | `False` |
| `--ndjson` | | Print newline-delimited JSON events for streaming automation workflows | `False` |
| `--fail-on-unresolved` | | Return exit code `2` when any entry cannot be resolved | `False` |
| `--google-scholar` | | Enable Google Scholar as an additional data source (requires scholarly package) | `False` |

**Examples:**
```bash
@@ -253,9 +218,6 @@ onecite process references.txt -o results.bib
# Process a BibTeX file with auto-detection
onecite process references.bib

# Process with interactive mode
onecite process ambiguous.txt --interactive

# Use stdin
echo "10.1038/nature14539" | onecite process -

@@ -265,9 +227,6 @@ onecite process "10.1038/nature14539"
# Process with custom template
onecite process references.txt --template conference_paper

# Enable Google Scholar (requires scholarly package)
onecite process references.txt --google-scholar

# Quiet mode for scripts
onecite process references.txt -o results.bib --quiet

Expand All @@ -278,6 +237,28 @@ onecite process references.txt --json --fail-on-unresolved
onecite process references.txt --ndjson
```
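
For scripted use, the `--json` and `--fail-on-unresolved` options can be combined from Python. The wrapper below is a minimal sketch under stated assumptions: `run_process` and its `runner` parameter are hypothetical names, the JSON envelope is assumed to be written to stdout, and its schema is deliberately not interpreted. Only exit code `2` is documented above, so other non-zero codes are simply passed through.

```python
import json
import subprocess
from typing import Any, Callable, Dict, List

UNRESOLVED_EXIT_CODE = 2  # documented for --fail-on-unresolved


def run_process(path: str, runner: Callable[..., Any] = subprocess.run) -> Dict[str, Any]:
    """Run `onecite process PATH --json --fail-on-unresolved` and classify the outcome.

    `runner` defaults to subprocess.run and can be swapped out in tests.
    """
    cmd: List[str] = ["onecite", "process", path, "--json", "--fail-on-unresolved"]
    completed = runner(cmd, capture_output=True, text=True)
    try:
        # Assumption: the envelope is printed to stdout; its fields are not inspected.
        envelope = json.loads(completed.stdout)
    except (TypeError, ValueError):
        envelope = None
    return {
        "unresolved": completed.returncode == UNRESOLVED_EXIT_CODE,
        "returncode": completed.returncode,
        "envelope": envelope,
    }
```

A CI step could call `run_process("references.txt")` and fail the build when `unresolved` is true, keeping the decision tied to the exit code rather than to any particular envelope field.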

### `onecite suggest`

Search for candidate matches without producing BibTeX or returning a
validation `passed` status.

```bash
onecite suggest "Attention is all you need, Vaswani et al., NIPS 2017" --json
```

**Optional Google Scholar fallback.** `suggest` accepts `--google-scholar`
(requires the optional `scholarly` package: `pip install onecite[scholar]`).
It is consulted only as a best-effort fallback when CrossRef and Semantic
Scholar return nothing. Because it scrapes a service with no public API, it
is **off by default, may be rate-limited or blocked by a CAPTCHA, and is not
guaranteed to be reproducible** — it is exposed only on `suggest` (candidates
for human review), never on `process` (authoritative output).

```bash
pip install onecite[scholar]
onecite suggest "some obscure title" --google-scholar
```
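
The fallback order described above (primary sources first, Google Scholar only when they return nothing) can be sketched as below. This is an illustration, not OneCite's code: `suggest_with_fallback`, the `Search` callable type, and the choice to swallow every exception from the fallback are assumptions.

```python
from typing import Callable, Dict, List, Optional

Candidate = Dict[str, str]
Search = Callable[[str], List[Candidate]]


def suggest_with_fallback(
    query: str,
    primary_sources: List[Search],
    scholar: Optional[Search] = None,
) -> List[Candidate]:
    """Query the primary sources; consult `scholar` only if they find nothing."""
    candidates: List[Candidate] = []
    for search in primary_sources:
        candidates.extend(search(query))
    if candidates or scholar is None:
        # Either something was found, or the fallback is off (the default).
        return candidates
    try:
        return scholar(query)
    except Exception:
        # Rate limits and CAPTCHA blocks are expected; treat them as "no candidates".
        return []
```

Passing the fallback as an optional argument keeps it off by default, and catching its failures means a blocked request degrades to an empty candidate list instead of aborting the search.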

### `onecite --version`

Display the installed OneCite version.
44 changes: 10 additions & 34 deletions docs/advanced_usage.rst
@@ -1,43 +1,19 @@
Advanced Usage
==============

Interactive Disambiguation
---------------------------
Reviewing Candidates for Ambiguous References
---------------------------------------------

When OneCite finds multiple potential matches for a reference, it can enter interactive mode to let you choose the correct one.
``onecite process`` only resolves strong identifiers (DOI, PMID, arXiv ID,
ISBN, URLs) and never guesses from an ambiguous plain-text reference. To
inspect candidate matches for a messy or incomplete reference, use
``onecite suggest``::

Enabling Interactive Mode
~~~~~~~~~~~~~~~~~~~~~~~~~~
    onecite suggest "deep learning hinton 2015"

::

onecite process ambiguous.txt --interactive

Example Session
~~~~~~~~~~~~~~~

::

Processing ambiguous.txt...

Found 2 matches for "Deep learning Hinton":

1. Deep Learning
Authors: Yann LeCun, Yoshua Bengio, Geoffrey Hinton
Journal: Nature
Year: 2015
Volume: 521, Pages: 436-444
DOI: 10.1038/nature14539

2. Deep Belief Networks
Authors: Geoffrey E. Hinton, Simon Osindero, Yee-Whye Teh
Journal: Neural Computation
Year: 2006
Volume: 18, Pages: 1527-1554
DOI: 10.1162/neco.2006.18.7.1527

Please select (1-2, 0=skip): 1
Selected: Deep Learning (10.1038/nature14539)
Candidates are returned for human review (with match scores and sources) and
are not emitted as verified BibTeX. Add ``--json`` for a machine-readable
envelope.

Batch Processing Multiple Files
--------------------------------
8 changes: 3 additions & 5 deletions docs/api/core.rst
@@ -19,7 +19,6 @@ The primary function for processing citations.
template_name: str,
output_format: str,
interactive_callback: Callable[[List[Dict]], int],
use_google_scholar: bool = False,
) -> Dict[str, Any]

**Parameters:**
Expand All @@ -28,8 +27,7 @@ The primary function for processing citations.
- ``input_type`` (str): Type of input - ``"txt"`` or ``"bib"`` (required)
- ``template_name`` (str): Template name to use (e.g., ``"journal_article_full"``) (required)
- ``output_format`` (str): Output format - currently only ``"bibtex"`` is supported (required)
- ``interactive_callback`` (Callable): Function to handle ambiguous matches. Takes a list of candidate dicts and returns the selected index (0-based), or -1 to skip (required)
- ``use_google_scholar`` (bool): Enable Google Scholar as an additional data source. Requires the optional ``scholarly`` package. Default is ``False``.
- ``interactive_callback`` (Callable): Compatibility callback; plain-text candidate search is handled by ``suggest_references`` (required)

**Returns:**

@@ -53,7 +51,7 @@ A dictionary with keys:
input_type="txt",
template_name="journal_article_full",
output_format="bibtex",
interactive_callback=lambda candidates: 0 # Auto-select first match
interactive_callback=lambda candidates: -1
)

# Access results
@@ -216,7 +214,7 @@ For typical usage, ``process_references()`` is simpler. PipelineController expos
input_type="txt",
template_name="journal_article_full",
output_format="bibtex",
interactive_callback=lambda candidates: 0
interactive_callback=lambda candidates: -1
)

print(result['results'])