feat(datasets): add context-shortening support to the SDK #622
RapidPoseidon wants to merge 1 commit into
Adds a ContextManager (exposed as `client.context`) that tunes a long datapoint context down to what a specific question needs, via the new batch endpoint POST /datasets/shorten-context. Supports both a single (context, question) and a batch of pairs.

Wires a context-length check into the order/job creation flow. The limit mirrors the backend's datapoint context validation (400 chars, from datasets-service CreateDatapointCommandValidator). For each datapoint whose context exceeds the limit:
- default: logs a warning that the backend would reject it;
- with the new `auto_shorten=True` argument on the creation call: shortens the context for the order/job instruction and substitutes it.

The endpoint is not deployed yet, so the API layer ships a hand-written wrapper against the agreed contract (ContextService) with a TODO to regenerate the OpenAPI client once it lands.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: luca <luca@rapidata.ai>
Code Review: feat(datasets): context-shortening support

Overview
Solid feature addition with a clear, well-described design. The batching approach,

Bugs / Issues
1. 400 error responses are swallowed silently (
response_types_map={
"200": "object",
"400": "object", # ← deserialized as a plain dict, not an error
...
},
When the backend returns 400, the deserializer returns the error body as a dict.

2. Three
# In rapidata_client.py
self.context = ContextManager(openapi_service=self._openapi_service)
self.__order_manager = RapidataOrderManager(
openapi_service=self._openapi_service,
context_manager=self.context, # pass it in
)
This also eliminates the

3.
assert datapoint.context is not None  # appears twice in _context_length.py

Minor Issues
4. Redundant
pairs = [
(datapoint.context, question)
for _, datapoint in over_limit
if datapoint.context is not None # over_limit already filters this out
]
5. URL double-slash risk
url = f"{self._api_client.configuration.host}{_SHORTEN_CONTEXT_PATH}"
If

6. Unguarded dict key access
return [item["shortenedContext"] for item in returned]
If the backend returns an unexpected schema, this raises a

7. It's exported from

Design / Nits

Test Coverage
There are no automated tests. At minimum, unit tests for:

would significantly reduce the risk of the backend dependency unknowns surfacing as silent failures in production.

Backend-Not-Deployed Risk
The PR ships working SDK code against a contract that isn't deployed yet. That's fine, but consider guarding the

Summary: The overall structure is good. The three issues above (400 errors swallowed, triple-instantiation,
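
Review issues 1, 5 and 6 could be addressed along the following lines. This is a minimal sketch and not the PR's code: `ShortenContextError`, `join_url` and `parse_shorten_response` are hypothetical names, the response shape `{ items: [{shortenedContext}] }` is taken from the agreed contract in the PR description, and the one-result-per-input length check is an assumption about that contract.

```python
from typing import Any


class ShortenContextError(Exception):
    """Hypothetical error type for a failed shorten-context call."""


def join_url(host: str, path: str) -> str:
    """Join host and path with exactly one slash between them (issue 5)."""
    return f"{host.rstrip('/')}/{path.lstrip('/')}"


def parse_shorten_response(status: int, body: Any, expected: int) -> list[str]:
    """Turn a (status, decoded JSON body) pair into a list of shortened contexts."""
    # Issue 1: surface any non-200 status as an error instead of returning its body.
    if status != 200:
        raise ShortenContextError(f"shorten-context failed with HTTP {status}: {body!r}")

    items = body.get("items") if isinstance(body, dict) else None
    # Assumption: the backend returns exactly one item per submitted pair.
    if not isinstance(items, list) or len(items) != expected:
        raise ShortenContextError(f"expected {expected} items, got: {body!r}")

    shortened: list[str] = []
    for index, item in enumerate(items):
        # Issue 6: check the key explicitly rather than letting a KeyError escape.
        value = item.get("shortenedContext") if isinstance(item, dict) else None
        if not isinstance(value, str):
            raise ShortenContextError(f"item {index} has no 'shortenedContext': {item!r}")
        shortened.append(value)
    return shortened
```

Raising a dedicated exception keeps a rejected request or a schema drift from reaching callers as a plain dict or a bare `KeyError`; the real wrapper may prefer the SDK's existing exception types.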
What
Adds context-shortening support to the Python SDK. A long datapoint `context` (e.g. a full scene description) is often far more than a single annotator question needs. This lets the SDK tune it down to what's relevant, keeping it within the length the backend accepts.

1. `ContextManager` (`client.context`)
- `client.context.shorten_context(context, question) -> str`: single pair.
- `client.context.shorten_contexts([(context, question), ...]) -> list[str]`: batch.

Both call the new batch endpoint `POST /datasets/shorten-context` (request `{ items: [{context, question}, ...] }` → response `{ items: [{shortenedContext}, ...] }`). Results are cached server-side.

2. Context-length check in the creation flow
A new `auto_shorten: bool = False` argument is threaded onto the order and job creation calls and enforced centrally in `_create_general_order` / `_create_general_job_definition` (the chokepoints that have both the datapoints and the workflow instruction). For each datapoint whose `context` exceeds the limit:
- default: `logger.warning` that the backend would reject it;
- `auto_shorten=True`: shortens the context for the order/job `instruction` (one batched request) and substitutes the result.

The limit (400 characters) is not a hardcoded guess; it mirrors the backend's enforced datapoint context validation in `datasets-service`: `CreateDatapointCommandValidator` → `RuleFor(x => x.Context).MaximumLength(400)` (the dataset-group context validator enforces the same).

Backend dependency ⚠️
This depends on the backend contract `POST /datasets/shorten-context` in rapidata-backend / datasets-service, which is being added in parallel and is not deployed yet. Until it lands, the API layer ships a hand-written wrapper (`service/services/context_service.py`) against the agreed contract, with a `TODO` to drop it and regenerate the OpenAPI client once the endpoint is published. The path is kept as a single constant for an easy one-line swap.

Please confirm the deployed route before merge: every existing datasets route in the backend is singular (`/dataset/{id}/...`), but the agreed contract here is plural `/datasets/shorten-context`. The constant `_SHORTEN_CONTEXT_PATH` is the single place to adjust if the deployed path differs.

Scope / known limitation
`auto_shorten` is wired wherever the question is cleanly reachable at datapoint-build time, i.e. order and job creation, where the instruction lives on the workflow (`workflow._get_instruction()`).

It is not wired into the flow path (`RapidataFlow.create_new_flow_batch`): there the context is a single batch-level value and the ranking instruction isn't available at item-creation time, so there's no question to tune against. Rather than invent a workaround, this PR leaves flow untouched and flags it here. If flow contexts should also be shortened, the question source needs to be decided first.

Validation
- `python -c "from rapidata import RapidataClient, ContextManager"` → OK
- `pyright src/rapidata/rapidata_client` → 0 errors (also checked the changed `service/` files)
- `black`-clean. I did not run `black` across the whole package: it wanted to reformat 27 pre-existing files unrelated to this change, so new/changed lines just match the surrounding file style.
- `uv run --group docs mkdocs build` → succeeds; docs updated in `job_definition_parameters.md`.

🔗 Session: node-05f2b913
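
The context-length check described in section 2 can be illustrated roughly as follows. This is a sketch under stated assumptions, not the code in this PR: `Datapoint`, `Shortener` and `enforce_context_length` are hypothetical stand-ins, and the `shorten` callable represents whatever performs the single batched request (for example `client.context.shorten_contexts`). Only the 400-character limit and the warn-versus-shorten behaviour come from the description above.

```python
import logging
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)

MAX_CONTEXT_LENGTH = 400  # mirrors the backend validator limit cited in the description


@dataclass(frozen=True)
class Datapoint:
    """Hypothetical stand-in for the SDK's datapoint type."""
    asset: str
    context: Optional[str] = None


# Hypothetical batch shortener: (context, question) pairs in, one string per pair out.
Shortener = Callable[[Sequence[tuple[str, str]]], list[str]]


def enforce_context_length(
    datapoints: Sequence[Datapoint],
    question: str,
    auto_shorten: bool = False,
    shorten: Optional[Shortener] = None,
) -> list[Datapoint]:
    """Warn about over-limit contexts, or replace them when auto_shorten is set."""
    over_limit = [
        (index, dp.context)
        for index, dp in enumerate(datapoints)
        if dp.context is not None and len(dp.context) > MAX_CONTEXT_LENGTH
    ]
    result = list(datapoints)
    if not over_limit:
        return result

    if not auto_shorten:
        # Default path: only warn; the datapoints are returned unchanged.
        for index, context in over_limit:
            logger.warning(
                "Datapoint %d context is %d chars (limit %d); the backend would reject it.",
                index, len(context), MAX_CONTEXT_LENGTH,
            )
        return result

    if shorten is None:
        raise ValueError("auto_shorten=True requires a shorten callable")

    # One batched call for every over-limit context, all tuned against the same question.
    shortened = shorten([(context, question) for _, context in over_limit])
    if len(shortened) != len(over_limit):
        raise ValueError("shortener returned a different number of results than requested")
    for (index, _), new_context in zip(over_limit, shortened):
        result[index] = replace(result[index], context=new_context)
    return result
```

Collecting the over-limit pairs first and issuing one call keeps the request count independent of how many datapoints exceed the limit, which matches the "one batched request" behaviour the description mentions.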