feat(datasets): add context-shortening support to the SDK #622

Open
RapidPoseidon wants to merge 1 commit into main from feat(datasets)/context-shortening-sdk

RapidPoseidon (Contributor) commented Jun 16, 2026

What

Adds context-shortening support to the Python SDK. A long datapoint context (e.g. a full scene description) is often far more than a single annotator question needs — this lets the SDK tune it down to what's relevant, keeping it within the length the backend accepts.

1. ContextManager (client.context)

  • client.context.shorten_context(context, question) -> str — single pair.
  • client.context.shorten_contexts([(context, question), ...]) -> list[str] — batch.

Both call the new batch endpoint POST /datasets/shorten-context (request { items: [{context, question}, ...] } → response { items: [{shortenedContext}, ...] }). Results are cached server-side.
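As a rough illustration of the contract described above, the sketch below builds the request body and reads the response for a batch of pairs, with the single-pair call delegating to the batch one. The `post` callable is a hypothetical stand-in for the HTTP layer, and the function bodies are an assumption about the shape of the code, not the PR's actual implementation.

```python
from typing import Any, Callable

# Assumed transport: takes the JSON request body, returns the JSON response body.
Post = Callable[[dict[str, Any]], dict[str, Any]]


def shorten_contexts(post: Post, pairs: list[tuple[str, str]]) -> list[str]:
    """Send all (context, question) pairs in one request, preserving order."""
    if not pairs:
        return []  # nothing to send, so skip the round trip
    body = {"items": [{"context": c, "question": q} for c, q in pairs]}
    response = post(body)
    return [item["shortenedContext"] for item in response["items"]]


def shorten_context(post: Post, context: str, question: str) -> str:
    """Single-pair convenience wrapper around the batch call."""
    return shorten_contexts(post, [(context, question)])[0]
```

Routing the single-pair call through the batch function keeps one code path to the endpoint, so any validation added later applies to both.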

2. Context-length check in the creation flow

A new auto_shorten: bool = False argument is threaded onto the order and job creation calls and enforced centrally in _create_general_order / _create_general_job_definition (the chokepoints that have both the datapoints and the workflow instruction). For each datapoint whose context exceeds the limit:

  • default — logs a clear logger.warning that the backend would reject it;
  • auto_shorten=True — shortens the context for the order/job instruction (one batched request) and substitutes the result.

The limit (400 characters) is not hardcoded as a guess; it mirrors the backend's enforced datapoint context validation in datasets-service:
CreateDatapointCommandValidator → RuleFor(x => x.Context).MaximumLength(400) (the dataset-group context validator enforces the same).
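The two behaviours described in this section could look roughly like the sketch below. `Datapoint`, the `shorten_batch` callable, and the exact signature of `enforce_context_length` are assumptions made for illustration; only the 400-character limit and the warn-versus-shorten behaviour come from the description above.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger(__name__)

MAX_CONTEXT_LENGTH = 400  # mirrors the backend validator quoted above


@dataclass
class Datapoint:
    """Hypothetical stand-in for the SDK's datapoint type."""
    asset: str
    context: Optional[str] = None


# Assumed shape of the batch shortener: [(context, question), ...] -> [shortened, ...]
ShortenBatch = Callable[[list[tuple[str, str]]], list[str]]


def enforce_context_length(
    datapoints: list[Datapoint],
    question: Optional[str],
    auto_shorten: bool,
    shorten_batch: ShortenBatch,
) -> None:
    # Keep each over-limit context string next to its datapoint.
    over_limit = [
        (dp, dp.context)
        for dp in datapoints
        if dp.context is not None and len(dp.context) > MAX_CONTEXT_LENGTH
    ]
    if not over_limit:
        return

    if auto_shorten and question:
        # One batched request for every over-limit context.
        shortened = shorten_batch([(context, question) for _, context in over_limit])
        for (dp, _), new_context in zip(over_limit, shortened):
            dp.context = new_context
        return

    for _, context in over_limit:
        logger.warning(
            "Context of %d characters exceeds the %d-character limit; "
            "the backend would reject it.",
            len(context),
            MAX_CONTEXT_LENGTH,
        )
```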

Backend dependency ⚠️

This depends on the backend contract POST /datasets/shorten-context in rapidata-backend / datasets-service, which is being added in parallel and is not deployed yet. Until it lands, the API layer ships a hand-written wrapper (service/services/context_service.py) against the agreed contract, with a TODO to drop it and regenerate the OpenAPI client once the endpoint is published. The path is kept as a single constant for an easy one-line swap.

Please confirm the deployed route before merge: every existing datasets route in the backend is singular (/dataset/{id}/...), but the agreed contract here is plural /datasets/shorten-context. The constant _SHORTEN_CONTEXT_PATH is the single place to adjust if the deployed path differs.

Scope / known limitation

auto_shorten is wired wherever the question is cleanly reachable at datapoint-build time — i.e. order and job creation, where the instruction lives on the workflow (workflow._get_instruction()).

It is not wired into the flow path (RapidataFlow.create_new_flow_batch): there the context is a single batch-level value and the ranking instruction isn't available at item-creation time, so there's no question to tune against. Rather than invent a workaround, this PR leaves flow untouched and flags it here. If flow contexts should also be shortened, the question source needs to be decided first.

Validation

  • python -c "from rapidata import RapidataClient, ContextManager" → OK
  • pyright src/rapidata/rapidata_client → 0 errors (also checked the changed service/ files)
  • New files are black-clean. I did not run black across the whole package — it wanted to reformat 27 pre-existing files unrelated to this change, so new/changed lines just match the surrounding file style.
  • uv run --group docs mkdocs build → succeeds; docs updated in job_definition_parameters.md.

🔗 Session: node-05f2b913

Adds a ContextManager (exposed as `client.context`) that tunes a long
datapoint context down to what a specific question needs, via the new
batch endpoint POST /datasets/shorten-context. Supports both a single
(context, question) and a batch of pairs.

Wires a context-length check into the order/job creation flow. The limit
mirrors the backend's datapoint context validation (400 chars, from
datasets-service CreateDatapointCommandValidator). For each datapoint
whose context exceeds the limit:
- default: logs a warning that the backend would reject it;
- with the new `auto_shorten=True` argument on the creation call: shortens
  the context for the order/job instruction and substitutes it.

The endpoint is not deployed yet, so the API layer ships a hand-written
wrapper against the agreed contract (ContextService) with a TODO to
regenerate the OpenAPI client once it lands.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: luca <luca@rapidata.ai>

claude Bot commented Jun 16, 2026

Code Review — feat(datasets): context-shortening support

Overview

Solid feature addition with a clear, well-described design. The batching approach, ContextManager abstraction, and _context_length.py enforcement chokepoint are all good calls. The PR description is unusually thorough, which helps. The notes below are mostly about a handful of bugs and structural concerns that should be addressed before merge.


Bugs / Issues

1. 400 error responses are swallowed silently (context_service.py)

response_types_map={
    "200": "object",
    "400": "object",  # ← deserialized as a plain dict, not an error
    ...
},

When the backend returns 400, the deserializer returns the error body as a dict. result.get("items", []) then returns [] (error bodies have no "items" key), and the length check raises a confusing ValueError("returned 0 item(s) for N request item(s).") — swallowing the actual error detail. Either map "400" to the appropriate error type like other service files do, or raise explicitly after checking the HTTP status. The user should see a RapidataError, not a generic ValueError.
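One way to surface the error body instead of the length mismatch is to check the status before touching "items". The sketch below is an assumption about how that could look; `ShortenContextError` is a hypothetical stand-in for the SDK's RapidataError, and `items_from_response` is not a function from the PR.

```python
from typing import Any


class ShortenContextError(Exception):
    """Hypothetical stand-in for the SDK's RapidataError."""

    def __init__(self, status: int, detail: Any) -> None:
        super().__init__(f"shorten-context request failed with HTTP {status}: {detail}")
        self.status = status
        self.detail = detail


def items_from_response(
    status: int, body: dict[str, Any], expected: int
) -> list[dict[str, Any]]:
    # Fail on the HTTP status first so the backend's error detail is preserved.
    if not 200 <= status < 300:
        raise ShortenContextError(status, body)
    items = body.get("items", [])
    if len(items) != expected:
        raise ValueError(
            f"returned {len(items)} item(s) for {expected} request item(s)."
        )
    return items
```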

2. Three ContextManager instances per RapidataClient

RapidataClient, RapidataOrderManager, and RapidataJobManager each create their own ContextManager. Since the PR already exposes client.context, inject that instance into the managers rather than constructing new ones:

# In rapidata_client.py
self.context = ContextManager(openapi_service=self._openapi_service)
self.__order_manager = RapidataOrderManager(
    openapi_service=self._openapi_service,
    context_manager=self.context,  # pass it in
)

This also eliminates the self.__context_manager = ContextManager(openapi_service) boilerplate in both manager __init__s.

3. assert used for runtime invariants

assert datapoint.context is not None  # appears twice in _context_length.py

assert is disabled under python -O. Use a proper guard or cast for type narrowing. If it truly can't be None at that point, a cast(str, datapoint.context) + a comment is cleaner.
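A minimal sketch of the "proper guard" option, which both narrows the type and survives python -O. `Datapoint` and `require_context` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Datapoint:
    """Hypothetical stand-in for the SDK's datapoint type."""
    context: Optional[str] = None


def require_context(datapoint: Datapoint) -> str:
    # An explicit check is not stripped by -O, unlike an assert statement.
    if datapoint.context is None:
        raise ValueError("Datapoint has no context to check or shorten.")
    return datapoint.context
```

Callers then work with the returned str, so no cast or assert is needed at the call site.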


Minor Issues

4. Redundant None check in the pairs comprehension

pairs = [
    (datapoint.context, question)
    for _, datapoint in over_limit
    if datapoint.context is not None  # over_limit already filters this out
]

over_limit is already built with datapoint.context is not None and len(...) > MAX_CONTEXT_LENGTH, so this guard is dead code (pyright may require it for narrowing — if so, a cast is cleaner than a filter that changes iteration behaviour).

5. URL double-slash risk

url = f"{self._api_client.configuration.host}{_SHORTEN_CONTEXT_PATH}"

If host has a trailing /, the URL becomes .../datasets//shorten-context. Use host.rstrip("/") or urllib.parse.urljoin.
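A small sketch of the rstrip approach; `build_url` is a hypothetical helper, and the host values are made up for the example.

```python
def build_url(host: str, path: str) -> str:
    """Join host and path with exactly one slash between them."""
    return f"{host.rstrip('/')}/{path.lstrip('/')}"


print(build_url("https://api.example.com/", "/datasets/shorten-context"))
print(build_url("https://api.example.com", "/datasets/shorten-context"))
```

Note that urllib.parse.urljoin treats a path starting with / as absolute and replaces any path prefix already present on the host (for example a /v1 base path), so the rstrip form is the safer choice if the configured host can carry a prefix.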

6. Unguarded dict key access

return [item["shortenedContext"] for item in returned]

If the backend returns an unexpected schema, this raises a KeyError. Use .get("shortenedContext") with an explicit error or a fallback, so the error message is actionable.
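For illustration, a guarded version might look like the sketch below. `extract_shortened` is a hypothetical helper, and raising ValueError is an assumption; the SDK may prefer its own error type.

```python
from typing import Any


def extract_shortened(items: list[Any]) -> list[str]:
    """Pull "shortenedContext" out of each item, failing with an actionable message."""
    result: list[str] = []
    for index, item in enumerate(items):
        value = item.get("shortenedContext") if isinstance(item, dict) else None
        if not isinstance(value, str):
            keys = sorted(item) if isinstance(item, dict) else type(item).__name__
            raise ValueError(
                f"shorten-context response item {index} has no string "
                f"'shortenedContext' (got: {keys})."
            )
        result.append(value)
    return result
```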

7. MAX_CONTEXT_LENGTH not in the public API

It's exported from context/__init__.py but not from rapidata_client/__init__.py or rapidata/__init__.py. Users who want to pre-validate their own contexts need it too.


Design / Nits

  • The enforce_context_length logic uses two separate if auto_shorten checks instead of if/elif/else, and silently falls through to the warning loop when auto_shorten=True but question is falsy. An explicit if auto_shorten / elif question / else structure would make the three paths unmistakeable and avoid the fall-through.
  • ContextManager.__init__ logs "ContextManager initialized" — with the triple-instantiation issue above this fires three times per client; even after fixing it's a noisy debug line given all other managers don't log their own init.
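One possible reading of the three paths from the first nit, sketched as a pure decision function. The `Action` enum and `choose_action` are hypothetical, and the branch conditions are an interpretation of the review comment rather than the PR's code.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    SHORTEN = "shorten"                    # auto_shorten on and a question is available
    WARN_NO_QUESTION = "warn_no_question"  # auto_shorten on but nothing to tune against
    WARN_OVER_LIMIT = "warn_over_limit"    # default: only warn about the limit


def choose_action(auto_shorten: bool, question: Optional[str]) -> Action:
    if auto_shorten and question:
        return Action.SHORTEN
    elif auto_shorten:
        return Action.WARN_NO_QUESTION
    else:
        return Action.WARN_OVER_LIMIT
```

Separating the decision from the side effects also makes each path trivial to unit-test.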

Test Coverage

There are no automated tests. At minimum, unit tests for:

  • enforce_context_length with contexts under/over the limit (both auto_shorten=True/False)
  • ContextManager.shorten_contexts (mock the service layer)
  • ContextService.shorten_contexts count-mismatch validation and 400-error path

would significantly reduce the risk of the backend dependency unknowns surfacing as silent failures in production.
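A sketch of what the mocked-service tests could look like. The module-level `shorten_contexts` function below is a hypothetical stand-in for ContextManager.shorten_contexts so the example is self-contained; the real tests would import the SDK classes instead.

```python
import unittest
from unittest.mock import Mock


def shorten_contexts(service, pairs):
    """Hypothetical stand-in for ContextManager.shorten_contexts."""
    if not pairs:
        return []
    shortened = service.shorten_contexts(pairs)
    if len(shortened) != len(pairs):
        raise ValueError(
            f"returned {len(shortened)} item(s) for {len(pairs)} request item(s)."
        )
    return shortened


class ShortenContextsTest(unittest.TestCase):
    def test_returns_service_result_in_order(self):
        service = Mock()
        service.shorten_contexts.return_value = ["short a", "short b"]
        pairs = [("long a", "q"), ("long b", "q")]
        self.assertEqual(shorten_contexts(service, pairs), ["short a", "short b"])
        service.shorten_contexts.assert_called_once_with(pairs)

    def test_count_mismatch_raises(self):
        service = Mock()
        service.shorten_contexts.return_value = []
        with self.assertRaises(ValueError):
            shorten_contexts(service, [("long", "q")])

    def test_empty_input_skips_service(self):
        service = Mock()
        self.assertEqual(shorten_contexts(service, []), [])
        service.shorten_contexts.assert_not_called()
```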


Backend-Not-Deployed Risk

The PR ships working SDK code against a contract that isn't deployed yet. That's fine, but consider guarding the shorten_contexts call so it fails fast with a clear message if the endpoint returns a 404, rather than surfacing an opaque error from the length-mismatch check.
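A fail-fast guard for the not-yet-deployed case might look like this sketch. `EndpointUnavailableError` and `check_endpoint_available` are hypothetical names, and the message wording is only a suggestion.

```python
class EndpointUnavailableError(RuntimeError):
    """Hypothetical error for a backend that does not expose the endpoint yet."""


def check_endpoint_available(status: int, path: str) -> None:
    # A 404 on the route itself most likely means the endpoint is not deployed.
    if status == 404:
        raise EndpointUnavailableError(
            f"POST {path} returned 404; the shorten-context endpoint may not be "
            "deployed on this backend yet."
        )
```

Calling this before any response parsing keeps the 404 case from ever reaching the length-mismatch check.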


Summary: The overall structure is good. The three issues above (400 errors swallowed, triple-instantiation, assert in hot path) should be fixed before merge; the rest are polish.
