Add example for "How to Create a Searchable PDF document via Python" #5

Open
AnastasiaRadtsevich wants to merge 1 commit into main from
3-add-example-for-how-to-create-a-searchable-pdf-document-via-python

Conversation

@AnastasiaRadtsevich

Fixes #3


Copilot AI left a comment


Pull request overview

Adds a new “create searchable PDF” example to the Aspose.PDF for Python via .NET examples set (addressing Issue #3) by introducing an OCR-based workflow and adding the needed sample input and dependency.

Changes:

  • Added a scanned/sample PDF input for the searchable-PDF example.
  • Implemented create_searchable_document() in example_create_pdf_document.py and wired it into run_all_examples().
  • Added pytesseract to examples/requirements.txt to support OCR.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| sample_data/working_with_documents/input/create_searchable_document.pdf | Adds the input PDF used by the new searchable-PDF example. |
| examples/working_with_documents/example_create_pdf_document.py | Adds OCR-based “searchable PDF” generation and integrates it into the examples runner. |
| examples/requirements.txt | Adds the OCR Python dependency (pytesseract). |

Comment on lines +24 to +34
An example of using optical character recognition (OCR) technology to create a searchable PDF document.

Args:
infile (str): The name of the input PDF file
outfile (str): The base name for output files (index will be appended)
image_file_path (str): The name of the image file
page_number (int): The page number

Returns:
None
"""
Copilot AI Apr 15, 2026

The docstring formatting is over-indented, and it claims outfile is a base name with an appended index, but the implementation saves exactly to outfile without adding an index. Please align the docstring indentation and parameter descriptions with the actual behavior.

Suggested change
- An example of using optical character recognition (OCR) technology to create a searchable PDF document.
- Args:
- infile (str): The name of the input PDF file
- outfile (str): The base name for output files (index will be appended)
- image_file_path (str): The name of the image file
- page_number (int): The page number
- Returns:
- None
- """
+ Use optical character recognition (OCR) to create a searchable PDF document.
+ Args:
+ infile (str): The path to the input PDF file.
+ outfile (str): The path to the output searchable PDF file.
+ image_file_path (str): The path to the intermediate image file.
+ page_number (int): The page number to process.
+ Returns:
+ None
+ """

Comment on lines +35 to +42
image_stream = io.FileIO(image_file_path, 'x')
try:
document = ap.Document(infile)
resolution = ap.devices.Resolution(300)
png_device = ap.devices.PngDevice(resolution)
png_device.process(document.pages[page_number], image_stream)
pdf = pytesseract.image_to_pdf_or_hocr(image_file_path, extension='pdf')
document = ap.Document(io.BytesIO(pdf))
Copilot AI Apr 15, 2026

image_stream is opened with mode 'x' before the try block. If the file already exists (e.g., a previous run crashed before cleanup), this raises before cleanup runs. Also, the stream remains open when pytesseract reads image_file_path, which can fail on Windows due to file locking and/or unflushed writes. Open the file inside the try (or use a context manager), write/flush/close it before calling pytesseract, and consider using a tempfile-managed path to avoid collisions.
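
One way to address the points above is sketched here, using only the standard library. The helper name `run_with_temp_image` and its `render` / `consume` callbacks are hypothetical and not part of the PR; they stand in for the page-rendering and OCR calls so that the file handling can be shown on its own. The sketch assumes the rendering call can write to an `io.FileIO` stream, as the PR code already does.

```python
import io
import os
import tempfile


def run_with_temp_image(render, consume, suffix=".png"):
    """Write an image to a private temp file, close it, then pass its path to a reader.

    `render` and `consume` are hypothetical callbacks used for illustration:
    `render(stream)` writes the image bytes, `consume(path)` reads the finished file.
    """
    # mkstemp creates a uniquely named file, so a leftover from a crashed run
    # cannot collide with this one (unlike opening a fixed path with mode 'x').
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # io.FileIO takes ownership of the descriptor and closes it when the
        # block exits, so the reader never sees a file that is still open.
        with io.FileIO(fd, "w") as stream:
            render(stream)
        return consume(path)
    finally:
        # Cleanup that also works on Python 3.7 (no missing_ok argument).
        try:
            os.remove(path)
        except FileNotFoundError:
            pass
```

In the PR's function, `render` would presumably wrap `png_device.process(document.pages[page_number], stream)` and `consume` would wrap `pytesseract.image_to_pdf_or_hocr(path, extension='pdf')`; both calls are taken from the diff above, and wiring them in this way is a suggestion rather than the author's implementation.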

finally:
image_stream.close()
image_file = Path(image_file_path)
image_file.unlink(missing_ok=True)
Copilot AI Apr 15, 2026

Path.unlink(missing_ok=True) requires Python 3.8+, but the repo README states Python 3.7+ support. Replace this with a try/except FileNotFoundError (or check existence) to keep compatibility.

Suggested change
- image_file.unlink(missing_ok=True)
+ try:
+ image_file.unlink()
+ except FileNotFoundError:
+ pass

Comment thread examples/requirements.txt
Comment on lines +4 to +5
pandas
pytesseract No newline at end of file
Copilot AI Apr 15, 2026

Adding pytesseract introduces a runtime dependency on the native Tesseract binary (not installed via pip). Without documenting installation steps (or handling TesseractNotFoundError with a clear message), users will hit confusing failures at runtime. Consider adding a short note in the example (and/or README) describing how to install Tesseract and how to configure pytesseract.pytesseract.tesseract_cmd on Windows.

Suggested change
- pandas
- pytesseract
+ pandas
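
A clear failure message for a missing Tesseract binary could look roughly like the sketch below. It checks `PATH` with the standard library before any OCR call is made. The function name `require_tesseract` is hypothetical and not part of the PR; the only pytesseract detail it relies on is the `pytesseract.pytesseract.tesseract_cmd` setting mentioned in the comment above.

```python
import shutil


def require_tesseract(command="tesseract"):
    """Return the resolved path of the Tesseract executable, or raise a descriptive error.

    `command` may be a bare name looked up on PATH or a full path to the binary.
    """
    resolved = shutil.which(command)
    if resolved is None:
        raise RuntimeError(
            "Tesseract executable %r was not found. Install the Tesseract OCR "
            "engine (pip does not install it) and make sure it is on PATH, or "
            "set pytesseract.pytesseract.tesseract_cmd to its full path." % command
        )
    return resolved
```

Calling a check like this at the start of the example would turn a confusing runtime failure into one message that tells the user what to install and where to configure it.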

Comment on lines +1 to +3
import aspose.pdf as ap
import io
import pytesseract
Copilot AI Apr 15, 2026

pytesseract is imported at module load time, which makes all examples in this file fail to run if the optional OCR dependency (or its native tesseract binary) is not present, even when only create_new_document is executed. Consider moving the pytesseract import (and any related setup) inside create_searchable_document, and raise a clear exception when Tesseract is not available.
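
A deferred import along these lines might be sketched as follows. The helper `import_optional` and its `hint` parameter are hypothetical names introduced for illustration; the idea is that `create_searchable_document` would call it on entry, so the rest of the module still imports when pytesseract is absent.

```python
import importlib


def import_optional(module_name, hint):
    """Import a module on demand and raise a readable error if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            "Optional dependency %r is required for this example. %s"
            % (module_name, hint)
        ) from exc
```

Inside the OCR example this could be used as `pytesseract = import_optional("pytesseract", "Install it with: pip install pytesseract")`, which leaves `create_new_document` and the other examples in the file unaffected by the OCR dependency.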
