Feature Request: Add PaddleOCR GPU support to Docker image #533

@137137137

Feature Request: Add PaddleOCR GPU support to unstructured-api Docker image

Summary

The official unstructured-api Docker image includes Tesseract but not PaddleOCR. Users who want to use PaddleOCR with GPU acceleration must build a custom image. It would be valuable to have an official GPU-enabled image with PaddleOCR pre-installed.

Current Behavior

  • The unstructured-api:latest image only includes Tesseract OCR
  • PaddleOCR must be manually installed via pip install paddlepaddle unstructured-paddleocr
  • There is no GPU-enabled variant of the image
  • The OCR_AGENT environment variable is ignored (see related issue: OCR_AGENT_BUG_ISSUE.md)

Proposed Solution

Option 1: Provide GPU-enabled image tags

Publish additional Docker image variants:

unstructured-api:latest-gpu-cu118  # CUDA 11.8
unstructured-api:latest-gpu-cu126  # CUDA 12.6

These images would include:

  • paddlepaddle-gpu from the appropriate CUDA index
  • unstructured-paddleocr
  • NVIDIA CUDA runtime

Option 2: Add build args to existing Dockerfile

Add build arguments to allow users to build GPU-enabled images:

ARG USE_GPU=false
ARG CUDA_VERSION=cu118

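# -i replaces the package index for the whole pip command, so the
# PyPI-hosted unstructured-paddleocr is installed in a separate step.
RUN if [ "$USE_GPU" = "true" ]; then \
        pip install --no-cache-dir paddlepaddle-gpu \
            -i https://www.paddlepaddle.org.cn/packages/stable/${CUDA_VERSION}/ && \
        pip install --no-cache-dir unstructured-paddleocr; \
    fi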
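To illustrate how the build arguments above might be driven from a script, here is a minimal Python sketch that assembles the matching docker build invocation. The helper name docker_build_command, the default image name unstructured-api:custom, and the tag suffix scheme are assumptions made for illustration; only the USE_GPU and CUDA_VERSION arguments and the cu118/cu126 values come from this proposal.

```python
import shlex

# CUDA tags taken from the two image variants proposed in Option 1.
SUPPORTED_CUDA_TAGS = ("cu118", "cu126")


def docker_build_command(use_gpu, cuda_version="cu118", image="unstructured-api:custom"):
    """Return the argv list for a docker build that sets the proposed build args.

    The image name and tag suffix are hypothetical; adjust them to your registry.
    """
    if use_gpu and cuda_version not in SUPPORTED_CUDA_TAGS:
        raise ValueError(f"unsupported CUDA tag: {cuda_version!r}")

    cmd = ["docker", "build", "--build-arg", f"USE_GPU={'true' if use_gpu else 'false'}"]
    if use_gpu:
        # CUDA_VERSION only matters when the GPU branch of the Dockerfile runs.
        cmd += ["--build-arg", f"CUDA_VERSION={cuda_version}"]
        image = f"{image}-gpu-{cuda_version}"
    return cmd + ["-t", image, "."]


if __name__ == "__main__":
    # Print a copy-pasteable command line for a CUDA 12.6 build.
    print(shlex.join(docker_build_command(True, "cu126")))
```

Returning an argv list rather than a single string keeps the command safe to pass to subprocess.run without shell quoting issues.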

Implementation Details

The unstructured library already supports GPU acceleration for PaddleOCR. In unstructured/partition/utils/ocr_models/paddle_ocr.py:

gpu_available = paddle.device.cuda.device_count() > 0
if gpu_available:
    logger.info(f"Loading paddle with GPU on language={language}...")

paddle_ocr = PaddleOCR(
    use_angle_cls=True,
    use_gpu=gpu_available,  # Auto-detects GPU
    lang=language,
    enable_mkldnn=True,
    show_log=False,
)

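This means the library automatically uses the GPU when one is available. Inside the image, the only change required is installing paddlepaddle-gpu instead of paddlepaddle. The container itself must still be started with GPU access (for example, docker run --gpus all with the NVIDIA Container Toolkit installed on the host).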
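A quick way to confirm that a container would take the GPU path is to run the same device-count check the library uses. The sketch below is an illustration only: the function name paddle_gpu_status and the returned dictionary layout are hypothetical, and it falls back gracefully when paddle is not installed.

```python
def paddle_gpu_status():
    """Report whether paddle is installed, built with CUDA, and sees any GPU."""
    try:
        import paddle
    except ImportError:
        # Neither paddlepaddle nor paddlepaddle-gpu is installed in this environment.
        return {"installed": False, "cuda_build": False, "gpu_count": 0}

    return {
        "installed": True,
        # False for the CPU-only paddlepaddle wheel.
        "cuda_build": bool(paddle.device.is_compiled_with_cuda()),
        # Same call the library uses to decide use_gpu.
        "gpu_count": int(paddle.device.cuda.device_count()),
    }


if __name__ == "__main__":
    print(paddle_gpu_status())
```

If gpu_count is 0 inside a GPU image, the usual suspects are a CPU-only wheel (cuda_build is False) or a container started without GPU access.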

Workaround

Users can extend the official image:

FROM downloads.unstructured.io/unstructured-io/unstructured-api:latest

USER root

ARG USE_GPU=false
RUN if [ "$USE_GPU" = "true" ]; then \
        pip install --no-cache-dir paddlepaddle-gpu \
            -i https://www.paddlepaddle.org.cn/packages/stable/cu118/; \
    else \
        pip install --no-cache-dir paddlepaddle; \
    fi && \
    pip install --no-cache-dir unstructured-paddleocr

USER notebook-user

ENV OCR_AGENT=unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle

Note: This also requires patching general.py to pass ocr_agent to partition() - see OCR_AGENT_BUG_ISSUE.md.
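For context, the OCR_AGENT value is a dotted path to a class. The sketch below shows one generic way such a path can be resolved with importlib; it is not the code used by unstructured or unstructured-api, and the helper names split_agent_path, load_agent_class, and agent_from_env are hypothetical.

```python
import importlib
import os


def split_agent_path(dotted):
    """Split 'package.module.ClassName' into (module path, class name)."""
    module_name, _, class_name = dotted.rpartition(".")
    if not module_name or not class_name:
        raise ValueError(f"expected a dotted path like 'pkg.module.Class', got {dotted!r}")
    return module_name, class_name


def load_agent_class(dotted):
    """Import the module and return the attribute named by the last path component."""
    module_name, class_name = split_agent_path(dotted)
    return getattr(importlib.import_module(module_name), class_name)


def agent_from_env(env=os.environ, var="OCR_AGENT"):
    """Return the class named by the environment variable, or None if it is unset."""
    value = env.get(var)
    return load_agent_class(value) if value else None
```

Inside the extended image, agent_from_env() would be expected to return the OCRAgentPaddle class; an ImportError at that point indicates that the PaddleOCR packages are missing from the image.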

Benefits

  1. Performance: PaddleOCR with GPU is significantly faster than Tesseract for batch processing
  2. Accuracy: PaddleOCR (especially PP-OCRv4) provides better accuracy on many document types
  3. Ease of use: Official GPU images eliminate the need for custom Dockerfiles

Environment

  • unstructured-api: latest
  • unstructured: 0.18.18+
  • PaddlePaddle: 3.2.2
  • CUDA: 11.8 / 12.6

Related Issues

  • OCR_AGENT environment variable is ignored (see OCR_AGENT_BUG_ISSUE.md)
