Add example for "How to Create a Searchable PDF document via Python" by AnastasiaRadtsevich · Pull Request #5 · aspose-pdf/Aspose.PDF-for-Python-via-.NET

AnastasiaRadtsevich · 2026-01-23T11:37:50Z

Fixes #3

Copilot

Pull request overview

Adds a new “create searchable PDF” example to the Aspose.PDF for Python via .NET examples set (addressing Issue #3) by introducing an OCR-based workflow and adding the needed sample input and dependency.

Changes:

Added a scanned/sample PDF input for the searchable-PDF example.
Implemented create_searchable_document() in example_create_pdf_document.py and wired it into run_all_examples().
Added pytesseract to examples/requirements.txt to support OCR.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `sample_data/working_with_documents/input/create_searchable_document.pdf` | Adds the input PDF used by the new searchable-PDF example. |
| `examples/working_with_documents/example_create_pdf_document.py` | Adds OCR-based “searchable PDF” generation and integrates it into the examples runner. |
| `examples/requirements.txt` | Adds the OCR Python dependency (`pytesseract`). |

Copilot · 2026-04-15T08:07:29Z

+        An example of using optical character recognition (OCR) technology to create a searchable PDF document.
+
+        Args:
+            infile (str): The name of the input PDF file
+            outfile (str): The base name for output files (index will be appended)
+            image_file_path (str): The name of the image file
+            page_number (int): The page number
+
+        Returns:
+            None
+        """


The docstring formatting is over-indented, and it claims outfile is a base name with an appended index, but the implementation saves exactly to outfile without adding an index. Please align the docstring indentation and parameter descriptions with the actual behavior.

Suggested change

-        An example of using optical character recognition (OCR) technology to create a searchable PDF document.
-
-        Args:
-            infile (str): The name of the input PDF file
-            outfile (str): The base name for output files (index will be appended)
-            image_file_path (str): The name of the image file
-            page_number (int): The page number
-
-        Returns:
-            None
-        """
+    Use optical character recognition (OCR) to create a searchable PDF document.
+
+    Args:
+        infile (str): The path to the input PDF file.
+        outfile (str): The path to the output searchable PDF file.
+        image_file_path (str): The path to the intermediate image file.
+        page_number (int): The page number to process.
+
+    Returns:
+        None
+    """

Copilot · 2026-04-15T08:07:30Z

+    image_stream = io.FileIO(image_file_path, 'x')
+    try:
+        document = ap.Document(infile)
+        resolution = ap.devices.Resolution(300)
+        png_device = ap.devices.PngDevice(resolution)
+        png_device.process(document.pages[page_number], image_stream)
+        pdf = pytesseract.image_to_pdf_or_hocr(image_file_path, extension='pdf')
+        document = ap.Document(io.BytesIO(pdf))


image_stream is opened with mode 'x' before the try block. If the file already exists (e.g., a previous run crashed before cleanup), this raises before cleanup runs. Also, the stream remains open when pytesseract reads image_file_path, which can fail on Windows due to file locking and/or unflushed writes. Open the file inside the try (or use a context manager), write/flush/close it before calling pytesseract, and consider using a tempfile-managed path to avoid collisions.
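A minimal sketch of the pattern this comment describes: a tempfile-managed path, with the stream closed before anything else reads the file. The names `render_then_ocr`, `render`, and `ocr` are hypothetical and not part of the pull request. In the example under review, `render` would wrap the `png_device.process(...)` call and `ocr` would wrap `pytesseract.image_to_pdf_or_hocr(...)`.

```python
import os
import tempfile
from typing import BinaryIO, Callable


def render_then_ocr(
    render: Callable[[BinaryIO], None],
    ocr: Callable[[str], bytes],
) -> bytes:
    """Render to a temporary image file, close it, then run OCR on its path.

    `render` and `ocr` are hypothetical callables standing in for the
    page-rendering and OCR steps of the example under review.
    """
    # mkstemp creates a uniquely named file, so a leftover file from a
    # crashed run cannot collide with this one.
    fd, image_path = tempfile.mkstemp(suffix=".png")
    try:
        # The with-block flushes and closes the stream before OCR reads
        # the file, which avoids file-locking problems on Windows.
        with os.fdopen(fd, "wb") as image_stream:
            render(image_stream)
        return ocr(image_path)
    finally:
        # Cleanup runs on success and on failure; the explicit
        # try/except keeps the code compatible with Python 3.7.
        try:
            os.remove(image_path)
        except FileNotFoundError:
            pass
```

Because the file is created inside the function and removed in `finally`, the caller no longer has to supply or clean up an intermediate image path.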

Copilot · 2026-04-15T08:07:30Z

+    finally:
+        image_stream.close()
+        image_file = Path(image_file_path)
+        image_file.unlink(missing_ok=True)


Path.unlink(missing_ok=True) requires Python 3.8+, but the repo README states Python 3.7+ support. Replace this with a try/except FileNotFoundError (or check existence) to keep compatibility.

Suggested change

-        image_file.unlink(missing_ok=True)
+        try:
+            image_file.unlink()
+        except FileNotFoundError:
+            pass

Copilot · 2026-04-15T08:07:30Z

+pandas
+pytesseract


Adding pytesseract introduces a runtime dependency on the native Tesseract binary (not installed via pip). Without documenting installation steps (or handling TesseractNotFoundError with a clear message), users will hit confusing failures at runtime. Consider adding a short note in the example (and/or README) describing how to install Tesseract and how to configure pytesseract.pytesseract.tesseract_cmd on Windows.

Suggested change

-pandas
-pytesseract
+pandas
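One way to turn the missing-binary case into a clear message is to check for the executable before any OCR call. The sketch below uses only the standard library; the helper name `find_tesseract` is hypothetical, and the check assumes the binary is expected on `PATH` under the name `tesseract`.

```python
import shutil


def find_tesseract(command: str = "tesseract") -> str:
    """Return the full path of the Tesseract executable or raise a clear error.

    Hypothetical helper: it only checks PATH, so a Tesseract installed in a
    non-standard location still has to be configured explicitly.
    """
    path = shutil.which(command)
    if path is None:
        raise RuntimeError(
            f"The '{command}' executable was not found on PATH. "
            "Install Tesseract OCR separately; 'pip install pytesseract' "
            "installs only the Python wrapper, not the native binary."
        )
    return path
```

On Windows, the returned path (or the known install location) can then be assigned to `pytesseract.pytesseract.tesseract_cmd`, as the comment above suggests.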

Copilot · 2026-04-15T08:07:31Z

+import aspose.pdf as ap
+import io
+import pytesseract


pytesseract is imported at module load time, which makes all examples in this file fail to run if the optional OCR dependency (or its native tesseract binary) is not present, even when only create_new_document is executed. Consider moving the pytesseract import (and any related setup) inside create_searchable_document, and raise a clear exception when Tesseract is not available.
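A deferred import along these lines could look like the sketch below. The helper `load_optional` and its parameters are hypothetical, not code from the pull request. Inside `create_searchable_document`, it would be called with `"pytesseract"` so that the other examples in the file keep working when the OCR dependency is absent.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, hint: str) -> ModuleType:
    """Import an optional dependency on demand, failing with a clear message.

    Hypothetical helper: `hint` is free-form install guidance appended to
    the error text.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"Optional dependency '{module_name}' is required for this "
            f"example but could not be imported. {hint}"
        ) from exc
```

Calling `load_optional("pytesseract", "Run 'pip install pytesseract' and install the Tesseract binary.")` at the top of the OCR function keeps the import cost and the failure local to that one example.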

Add example for How to create Searchable PDF

bd6089c

AnastasiaRadtsevich linked an issue Jan 23, 2026 that may be closed by this pull request

Add example for "How to Create a Searchable PDF document via Python" #3

Open

andruhovski requested a review from Copilot April 15, 2026 08:02

Copilot started reviewing on behalf of andruhovski April 15, 2026 08:02 View session

Copilot AI reviewed Apr 15, 2026

AnastasiaRadtsevich wants to merge 1 commit into main from 3-add-example-for-how-to-create-a-searchable-pdf-document-via-python


2 participants
