Closed
3 changes: 3 additions & 0 deletions .gitignore
@@ -49,3 +49,6 @@ docs/.doctrees/
.DS_Store
Thumbs.db

+# Claude Code (personal/local config)
+.claude/settings.local.json

23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -58,6 +58,15 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/), and this
in the `README.md` Roadmap section and the `flake8 onecite tests`
validation check.

+### Removed
+- `onecite process` no longer accepts `--google-scholar`, and
+`process_references()` no longer accepts the `use_google_scholar`
+parameter. Google Scholar was never consulted from the authoritative
+`process` path, so the flag and parameter were no-ops there. Google
+Scholar remains available as an opt-in, best-effort fallback on
+`onecite suggest --google-scholar` /
+`suggest_references(use_google_scholar=True)`.

### Fixed
- Corrected the benchmark Nature DQN DOI fixture from
`10.1038/nature14539` to `10.1038/nature14236`, and added regression
@@ -87,6 +96,20 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/), and this
- Clarified that `onecite benchmark --json` is the deterministic offline
health check, while `onecite process ...` may contact upstream APIs
unless fixtures or mocks are explicitly configured.
+- DOI-backed BibTeX input now keeps the canonical CrossRef/DataCite field
+values instead of letting the original entry override them; original
+fields still fill gaps the API leaves empty, and the existing citation
+key is still preserved.
+- A CrossRef 404 now always falls back to DataCite instead of only doing so
+for a short hardcoded prefix list, so dataset/software/thesis DOIs
+registered under other DataCite prefixes resolve.
+- `suggest` no longer routes queries containing words such as "synthesis",
+"hypothesis", or "parenthesis" to the thesis search (whole-word match for
+"thesis"/"dissertation").
+- GitHub clone URLs ending in `.git` now resolve to the correct repository.
+- Plain-text entry ids stay contiguous when entries are separated by more
+than one blank line, and a dead PLOS article-id branch was removed from
+the text parser.

## [0.1.1] - 2026-04-17

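The whole-word routing fix described in the `CHANGELOG.md` entry for `suggest` can be illustrated with a short sketch. This is only an assumption about how such a check might look, not OneCite's actual code: the pattern `THESIS_WORDS` and the helper `looks_like_thesis_query` are hypothetical names.

```python
import re

# Hypothetical pattern: the \b word boundaries stop "thesis" from matching
# inside longer words such as "synthesis", "hypothesis", or "parenthesis".
THESIS_WORDS = re.compile(r"\b(?:thesis|dissertation)\b", re.IGNORECASE)


def looks_like_thesis_query(query: str) -> bool:
    """Return True only when "thesis" or "dissertation" appears as a whole word."""
    return THESIS_WORDS.search(query) is not None


if __name__ == "__main__":
    for q in ("PhD thesis on graph neural networks", "Protein synthesis hypothesis"):
        print(q, "->", looks_like_thesis_query(q))
```

A plain substring test (`"thesis" in query`) would flag both example queries, which is the behaviour the changelog entry says was removed.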
54 changes: 32 additions & 22 deletions README.md
@@ -36,13 +36,13 @@
---

<p align="center">
-OneCite is a command-line tool and Python library for citation management. It accepts DOIs, paper titles, arXiv IDs, and mixed inputs, and outputs formatted bibliographic entries.
+OneCite is a command-line tool and Python library for citation management. It resolves strong identifiers such as DOIs, PMIDs, arXiv IDs, ISBNs, GitHub URLs, and data DOIs into formatted bibliographic entries, while plain-text title searches are handled by the separate candidate-only suggest command.
</p>

---


-Researchers frequently accumulate reference lists in ad-hoc formats—DOIs copied from browser tabs, arXiv IDs from paper PDFs, titles typed by hand, and BibTeX fragments from various sources. Cleaning these into consistent BibTeX output is tedious and error-prone. OneCite parses raw reference text and attempts metadata lookup against configured sources such as CrossRef, PubMed, arXiv, and Semantic Scholar. The result is a reproducible processing layer that reports unresolved entries and produces auditable BibTeX where metadata can be found.
+Researchers frequently accumulate reference lists in ad-hoc formats—DOIs copied from browser tabs, arXiv IDs from paper PDFs, PMIDs, ISBNs, software URLs, data DOIs, and BibTeX fragments from various sources. Cleaning these into consistent BibTeX output is tedious and error-prone. OneCite parses raw reference text and resolves strong identifiers against configured sources such as CrossRef, PubMed, arXiv, DataCite, GitHub, and Google Books. Plain-text title searches are exposed through `onecite suggest` so candidates can be reviewed without being mistaken for verified BibTeX. The result is a reproducible processing layer that reports unresolved entries and produces auditable BibTeX where metadata can be found.



@@ -54,14 +54,13 @@

| Feature | Description |
| ----------------------- | ------------------------------------------------------------------------------------------------------- |
-| **Fuzzy Matching** | Attempt to match incomplete references against configured academic metadata sources. |
+| **Candidate Suggestions** | Search incomplete plain-text references with `onecite suggest` without resolving them to BibTeX. |
| **Multiple Formats** | Input `.txt`/`.bib` → Output **BibTeX**. |
| **4-stage Pipeline** | A 4-stage process (clean → query → validate → format) to produce consistent output. |
| **Field Completion** | Fill available fields returned by metadata sources, such as journal, volume, pages, authors, and abstract. |
| 🎓 **7+ Citation Types** | Handles journal articles, conference papers, books, software, datasets, theses, and preprints. |
| **Multi-Source Lookup** | Uses source-specific routes for CrossRef, arXiv, PubMed, Semantic Scholar, Google Books, and others. |
-| **Many Identifier Types** | Accepts DOI, PMID, arXiv ID, ISBN, GitHub URL, Zenodo DOI, or plain text queries. |
-| 🎛️ **Interactive Mode** | Manually select the correct entry when multiple potential matches are found. |
+| **Many Identifier Types** | Resolves DOI, PMID, arXiv ID, ISBN, GitHub URL, Zenodo DOI, and DataCite DOI inputs. |
| **Custom Templates** | YAML-based presets that provide a fallback BibTeX entry type when auto-detection is inconclusive. |


@@ -97,9 +96,9 @@ Create a file named `references.txt` with your mixed-format references:

10.1038/nature14539

-Attention is all you need, Vaswani et al., NIPS 2017
+arXiv:1706.03762

-Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
+ISBN:9780262035613

https://github.com/tensorflow/tensorflow

@@ -157,7 +156,7 @@ Your `results.bib` file now contains entries of different types.

```bash
onecite process "10.1038/nature14539"
-onecite process "Attention is all you need, Vaswani et al., NIPS 2017"
+onecite suggest "Attention is all you need, Vaswani et al., NIPS 2017"
echo "10.1038/nature14539" | onecite process -
```
</details>
Expand Down Expand Up @@ -198,16 +197,12 @@ Use OneCite directly in your Python scripts.
```python
from onecite import process_references

-# A callback can be used for non-interactive selection (e.g., always choose the best match)
-def auto_select_callback(candidates):
-    return 0  # Index of the best candidate
-
result = process_references(
-    input_content="Deep learning review\nLeCun, Bengio, Hinton\nNature 2015",
+    input_content="10.1038/nature14539",
    input_type="txt",
    template_name="journal_article_full",
    output_format="bibtex",
-    interactive_callback=auto_select_callback
+    interactive_callback=lambda candidates: -1
)

print('\n\n'.join(result['results']))
@@ -229,7 +224,7 @@ onecite process <input_file> [OPTIONS]
```

**Arguments:**
-- `input_file` - Input file path, `-` for stdin, or a reference string (e.g., DOI, title)
+- `input_file` - Input file path, `-` for stdin, or a strong identifier/reference string

**Options:**
| Option | Short | Description | Default |
@@ -243,7 +238,6 @@ onecite process <input_file> [OPTIONS]
| `--json` | | Print a stable JSON envelope instead of BibTeX text | `False` |
| `--ndjson` | | Print newline-delimited JSON events for streaming automation workflows | `False` |
| `--fail-on-unresolved` | | Return exit code `2` when any entry cannot be resolved | `False` |
-| `--google-scholar` | | Enable Google Scholar as an additional data source (requires scholarly package) | `False` |

**Examples:**
```bash
@@ -253,9 +247,6 @@ onecite process references.txt -o results.bib
# Process a BibTeX file with auto-detection
onecite process references.bib

-# Process with interactive mode
-onecite process ambiguous.txt --interactive

# Use stdin
echo "10.1038/nature14539" | onecite process -

@@ -265,9 +256,6 @@ onecite process "10.1038/nature14539"
# Process with custom template
onecite process references.txt --template conference_paper

-# Enable Google Scholar (requires scholarly package)
-onecite process references.txt --google-scholar

# Quiet mode for scripts
onecite process references.txt -o results.bib --quiet

@@ -278,6 +266,28 @@ onecite process references.txt --json --fail-on-unresolved
onecite process references.txt --ndjson
```

+### `onecite suggest`
+
+Search for candidate matches without producing BibTeX or returning a
+validation `passed` status.
+
+```bash
+onecite suggest "Attention is all you need, Vaswani et al., NIPS 2017" --json
+```
+
+**Optional Google Scholar fallback.** `suggest` accepts `--google-scholar`
+(requires the optional `scholarly` package: `pip install onecite[scholar]`).
+It is consulted only as a best-effort fallback when CrossRef and Semantic
+Scholar return nothing. Because it scrapes a service with no public API, it
+is **off by default, may be rate-limited or blocked by a CAPTCHA, and is not
+guaranteed to be reproducible** — it is exposed only on `suggest` (candidates
+for human review), never on `process` (authoritative output).
+
+```bash
+pip install onecite[scholar]
+onecite suggest "some obscure title" --google-scholar
+```

### `onecite --version`

Display the installed OneCite version.
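The README changes above split inputs between `onecite process` (strong identifiers) and `onecite suggest` (plain-text titles). A minimal sketch of that split follows, using the README's own example inputs. The regular expressions and the names `STRONG_PATTERNS`, `classify`, and `suggested_command` are hypothetical and are not taken from OneCite's code, whose real detection rules may differ.

```python
import re
from typing import Optional

# Illustrative patterns only (assumptions); they are deliberately simple.
STRONG_PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/\S+"),
    "arxiv": re.compile(r"\barxiv:\s*\d{4}\.\d{4,5}(?:v\d+)?\b", re.IGNORECASE),
    "isbn": re.compile(r"\bisbn:?\s*(?:97[89])?\d{9}[\dX]\b", re.IGNORECASE),
    "pmid": re.compile(r"\bpmid:?\s*\d+\b", re.IGNORECASE),
    "github": re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+", re.IGNORECASE),
}


def classify(reference: str) -> Optional[str]:
    """Return the first strong-identifier kind found, or None for plain text."""
    for kind, pattern in STRONG_PATTERNS.items():
        if pattern.search(reference):
            return kind
    return None


def suggested_command(reference: str) -> str:
    """Strong identifiers go to `process`; everything else goes to `suggest`."""
    return "process" if classify(reference) else "suggest"


if __name__ == "__main__":
    for ref in ("10.1038/nature14539", "arXiv:1706.03762",
                "Attention is all you need, Vaswani et al., NIPS 2017"):
        print(f"onecite {suggested_command(ref)} {ref!r}")
```

A pre-check like this could be used in a script to decide which command to run per entry, so that title-only lines are reviewed as candidates rather than reported as unresolved.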
8 changes: 3 additions & 5 deletions docs/api/core.rst
@@ -19,7 +19,6 @@ The primary function for processing citations.
template_name: str,
output_format: str,
interactive_callback: Callable[[List[Dict]], int],
-use_google_scholar: bool = False,
) -> Dict[str, Any]

**Parameters:**
@@ -28,8 +27,7 @@ The primary function for processing citations.
- ``input_type`` (str): Type of input - ``"txt"`` or ``"bib"`` (required)
- ``template_name`` (str): Template name to use (e.g., ``"journal_article_full"``) (required)
- ``output_format`` (str): Output format - currently only ``"bibtex"`` is supported (required)
-- ``interactive_callback`` (Callable): Function to handle ambiguous matches. Takes a list of candidate dicts and returns the selected index (0-based), or -1 to skip (required)
-- ``use_google_scholar`` (bool): Enable Google Scholar as an additional data source. Requires the optional ``scholarly`` package. Default is ``False``.
+- ``interactive_callback`` (Callable): Compatibility callback; plain-text candidate search is handled by ``suggest_references`` (required)

**Returns:**

@@ -53,7 +51,7 @@ A dictionary with keys:
input_type="txt",
template_name="journal_article_full",
output_format="bibtex",
-interactive_callback=lambda candidates: 0 # Auto-select first match
+interactive_callback=lambda candidates: -1
)

# Access results
@@ -216,7 +214,7 @@ For typical usage, ``process_references()`` is simpler. PipelineController expos
input_type="txt",
template_name="journal_article_full",
output_format="bibtex",
-interactive_callback=lambda candidates: 0
+interactive_callback=lambda candidates: -1
)

print(result['results'])
47 changes: 21 additions & 26 deletions docs/api/pipeline.rst
@@ -72,48 +72,43 @@ parsing fails.
Stage 2: Identify (``IdentifierModule``)
----------------------------------------

-**Purpose:** resolve each ``RawEntry`` against academic data sources and
-produce an ``IdentifiedEntry`` with a DOI (when possible) plus basic
-metadata.
+**Purpose:** resolve each ``RawEntry`` with strong identifiers into an
+``IdentifiedEntry`` with a DOI / arXiv ID / URL plus basic metadata.
+Plain-text title searches are not resolved by the processing pipeline; use
+the suggestion workflow for candidate search.

-**Input:** ``List[RawEntry]`` and an ``interactive_callback`` that picks
-from candidate lists when confidence is medium.
+**Input:** ``List[RawEntry]`` and an ``interactive_callback`` kept for API
+compatibility.

**Output:** ``List[IdentifiedEntry]``.

**Data sources actually queried by the code:**

-- CrossRef (DOI-based and fuzzy search)
-- Semantic Scholar (keyword search)
+- CrossRef (DOI-based lookup; candidate search in suggest mode)
+- Semantic Scholar (candidate search in suggest mode)
- arXiv (via feedparser)
- PubMed (biomedical, queried when strong cues are present)
- DataCite / Zenodo (datasets)
- Google Books (books — triggered by ISBN or publisher cues)
- external providerRE / BASE (theses)
- GitHub (software repositories)
-- Google Scholar (optional, disabled by default; opt-in via
-``--google-scholar`` or ``use_google_scholar=True`` and requires the
-``scholarly`` package)
+- Google Scholar (optional, ``suggest``-only best-effort fallback, disabled by
+default; opt-in via ``suggest --google-scholar`` or
+``suggest_references(use_google_scholar=True)`` and requires the
+``scholarly`` package; never used by ``process``)

There is **no runtime routing based on filename** and no fixed priority
-for "medical", "CS" or "general" queries. Signal-based heuristics
-inside ``_fuzzy_search`` decide when to *additionally* query PubMed,
-Google Books, external providerRE/BASE, etc., but CrossRef and Semantic Scholar are
-always consulted for text queries.
+for "medical", "CS" or "general" queries. Signal-based heuristics in
+suggestion mode decide when to *additionally* query PubMed, Google Books,
+external providerRE/BASE, etc. Text-only entries in process mode are
+reported as unresolved instead of being guessed.

**Confidence model:**

-After all sources have returned candidates, ``_score_candidates`` assigns
-each candidate a ``match_score`` (0–100) based on title / author /
-year / venue similarity to the query. The decision logic in
-``_fuzzy_search`` then chooses one of three paths:
-
-- ``match_score >= 80`` and a clear best candidate → auto-adopt
-- ``70 <= match_score < 80`` → call the ``interactive_callback`` with up
-to 5 candidates; fall back to the top candidate if the user skips and
-the score is still ≥ 75
-- ``match_score >= 50`` and a title is present → adopt cautiously
-- otherwise → mark the entry as ``identification_failed``
+After all suggestion sources have returned candidates, ``_score_candidates``
+assigns each candidate a ``match_score`` (0–100) based on title / author /
+year / venue similarity to the query. Scores are returned for human or
+downstream review; they are not treated as validation proof.

Fallback paths never fabricate data: an entry that cannot be resolved is
marked ``identification_failed`` rather than filled with invented
@@ -219,7 +214,7 @@ high-level ``process_references`` function:
input_type="txt",
template_name="journal_article_full",
output_format="bibtex",
-interactive_callback=lambda candidates: 0, # auto-pick first
+interactive_callback=lambda candidates: -1
)

print('\n\n'.join(result['results']))
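The confidence model in `docs/api/pipeline.rst` describes a 0 to 100 `match_score` built from title, author, and year similarity, returned for review rather than used as proof. The sketch below shows one way such a score could be computed. It is not OneCite's `_score_candidates`: the weights, the use of `difflib.SequenceMatcher`, and the field names `title`, `author`, and `year` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; empty input counts as no evidence."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def match_score(query: dict, candidate: dict) -> float:
    """Hypothetical 0-100 score; the 0.6 / 0.3 / 0.1 weights are made up."""
    title = similarity(query.get("title") or "", candidate.get("title") or "")
    author = similarity(query.get("author") or "", candidate.get("author") or "")
    same_year = query.get("year") is not None and query.get("year") == candidate.get("year")
    return round(100 * (0.6 * title + 0.3 * author + 0.1 * float(same_year)), 1)


if __name__ == "__main__":
    query = {"title": "Attention is all you need", "author": "Vaswani", "year": 2017}
    candidates = [
        {"title": "Deep learning", "author": "LeCun", "year": 2015},
        {"title": "Attention Is All You Need", "author": "Vaswani", "year": 2017},
    ]
    # Rank candidates for human review; the score is a hint, not a verdict.
    for cand in sorted(candidates, key=lambda c: match_score(query, c), reverse=True):
        print(match_score(query, cand), cand["title"])
```

Keeping the score as a ranking aid, rather than thresholding it into automatic adoption, matches the statement in the changed documentation that scores are not treated as validation proof.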
17 changes: 11 additions & 6 deletions docs/basic_usage.rst
@@ -17,9 +17,9 @@ A text file where each reference is separated by a **blank line**::

10.1038/nature14539

-Vaswani et al., 2017, Attention is all you need
+arXiv:1706.03762

-Smith (2020) Neural Architecture Search
+ISBN:9780262035613

.. note::

@@ -115,18 +115,23 @@ line followed by result and failure events::

onecite process input.txt --ndjson

-**Google Scholar (--google-scholar)**
+**Google Scholar (suggest only, --google-scholar)**

-Enable Google Scholar as an additional data source (requires the optional ``scholarly`` package)::
+A best-effort fallback for the ``suggest`` command only (requires the optional
+``scholarly`` package: ``pip install onecite[scholar]``). It is consulted only
+when CrossRef and Semantic Scholar return nothing. Because it scrapes a service
+with no public API, it is off by default, may be blocked by a CAPTCHA, and is
+not guaranteed to be reproducible. It is never used by ``process``, whose output
+is authoritative::

-onecite process input.txt --google-scholar
+onecite suggest input.txt --google-scholar

**Direct String Input**

Pass a reference string directly instead of a file::

onecite process "10.1038/nature14539"
-onecite process "Attention is all you need, Vaswani et al., NIPS 2017"
+onecite suggest "Attention is all you need, Vaswani et al., NIPS 2017"

**Stdin Input**

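`docs/basic_usage.rst` states that plain-text references are separated by a blank line, and the changelog notes that entry ids now stay contiguous when more than one blank line separates entries. A minimal sketch of that behaviour follows. The function name `split_entries` and the choice to start ids at 1 are assumptions for illustration, not OneCite's parser.

```python
import re
from typing import List, Tuple


def split_entries(text: str) -> List[Tuple[int, str]]:
    """Split on runs of blank lines and number the non-empty entries 1, 2, 3, ..."""
    # "\n\s*\n" swallows any number of consecutive blank (or whitespace-only) lines,
    # so a wide gap never produces an empty entry that would consume an id.
    blocks = re.split(r"\n\s*\n", text.strip())
    entries = [block.strip() for block in blocks if block.strip()]
    return list(enumerate(entries, start=1))


if __name__ == "__main__":
    sample = "10.1038/nature14539\n\n\n\narXiv:1706.03762\n\nISBN:9780262035613\n"
    for entry_id, entry in split_entries(sample):
        print(entry_id, entry)
```

Numbering after filtering out empty blocks is what keeps the ids free of gaps regardless of how many blank lines the input contains.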
19 changes: 19 additions & 0 deletions docs/changelog.rst
@@ -40,6 +40,15 @@ Changed
live checks are explicitly marked with ``pytest.mark.live`` so the
default suite is deterministic and offline.

+Removed
+~~~~~~~
+
+- ``onecite process`` no longer accepts ``--google-scholar``, and
+``process_references()`` no longer accepts the ``use_google_scholar``
+parameter (both were no-ops on the authoritative ``process`` path).
+Google Scholar remains an opt-in, best-effort fallback on
+``onecite suggest --google-scholar``.

Fixed
~~~~~

@@ -60,6 +69,16 @@ Fixed
distribution artifacts.
- Added benchmark and doctor checks to the GitHub Actions test
workflow.
+- DOI-backed BibTeX input keeps canonical CrossRef/DataCite fields
+instead of letting the original entry override them; original fields
+still fill gaps and the existing citation key is preserved.
+- A CrossRef 404 always falls back to DataCite instead of only doing so
+for a short hardcoded prefix list.
+- ``suggest`` no longer routes queries containing words such as
+"synthesis" or "hypothesis" to the thesis search.
+- GitHub clone URLs ending in ``.git`` resolve to the correct repository.
+- Plain-text entry ids stay contiguous across multi-blank-line gaps, and
+a dead PLOS article-id branch was removed from the text parser.

[0.1.1] - 2026-04-17
---------------------
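The first "Fixed" entry in `docs/changelog.rst` describes a merge rule for DOI-backed BibTeX input: canonical API values win, original fields fill gaps, and the existing citation key is kept. A minimal sketch of that rule is shown below. The function name `merge_fields` and the use of an `"ID"` field for the citation key are assumptions for illustration and may not match OneCite's internal data model.

```python
from typing import Any, Dict


def merge_fields(original: Dict[str, Any], canonical: Dict[str, Any]) -> Dict[str, Any]:
    """Prefer canonical values, keep original values for gaps, preserve the key."""
    merged = dict(original)  # start from the user's entry so gaps stay filled
    for field, value in canonical.items():
        if value not in (None, ""):  # an empty API value must not erase user data
            merged[field] = value
    if "ID" in original:  # "ID" as the citation-key field is an assumption
        merged["ID"] = original["ID"]
    return merged


if __name__ == "__main__":
    original = {"ID": "lecun2015", "title": "Deep learning review", "note": "to read"}
    canonical = {"ID": "LeCun_2015", "title": "Deep learning", "journal": "Nature", "pages": ""}
    print(merge_fields(original, canonical))
```

Copying the original entry first and then overwriting with non-empty canonical values gives the precedence order in the changelog entry without needing a per-field allow-list.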