Add example for "How to Create a Searchable PDF document via Python" #5
Pull request overview
Adds a new “create searchable PDF” example to the Aspose.PDF for Python via .NET examples set (addressing Issue #3) by introducing an OCR-based workflow and adding the needed sample input and dependency.
Changes:
- Added a scanned/sample PDF input for the searchable-PDF example.
- Implemented `create_searchable_document()` in `example_create_pdf_document.py` and wired it into `run_all_examples()`.
- Added `pytesseract` to `examples/requirements.txt` to support OCR.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `sample_data/working_with_documents/input/create_searchable_document.pdf` | Adds the input PDF used by the new searchable-PDF example. |
| `examples/working_with_documents/example_create_pdf_document.py` | Adds OCR-based "searchable PDF" generation and integrates it into the examples runner. |
| `examples/requirements.txt` | Adds the OCR Python dependency (`pytesseract`). |

    An example of using optical character recognition (OCR) technology to create a searchable PDF document.

    Args:
        infile (str): The name of the input PDF file
        outfile (str): The base name for output files (index will be appended)
        image_file_path (str): The name of the image file
        page_number (int): The page number

    Returns:
        None
    """

The docstring formatting is over-indented, and it claims `outfile` is a base name with an appended index, but the implementation saves exactly to `outfile` without adding an index. Please align the docstring indentation and parameter descriptions with the actual behavior.

Suggested change:

    -An example of using optical character recognition (OCR) technology to create a searchable PDF document.
    -Args:
    -    infile (str): The name of the input PDF file
    -    outfile (str): The base name for output files (index will be appended)
    -    image_file_path (str): The name of the image file
    -    page_number (int): The page number
    -Returns:
    -    None
    -"""
    +Use optical character recognition (OCR) to create a searchable PDF document.
    +Args:
    +    infile (str): The path to the input PDF file.
    +    outfile (str): The path to the output searchable PDF file.
    +    image_file_path (str): The path to the intermediate image file.
    +    page_number (int): The page number to process.
    +Returns:
    +    None
    +"""


    image_stream = io.FileIO(image_file_path, 'x')
    try:
        document = ap.Document(infile)
        resolution = ap.devices.Resolution(300)
        png_device = ap.devices.PngDevice(resolution)
        png_device.process(document.pages[page_number], image_stream)
        pdf = pytesseract.image_to_pdf_or_hocr(image_file_path, extension='pdf')
        document = ap.Document(io.BytesIO(pdf))

`image_stream` is opened with mode `'x'` before the `try` block. If the file already exists (e.g., a previous run crashed before cleanup), this raises before cleanup runs. Also, the stream remains open when `pytesseract` reads `image_file_path`, which can fail on Windows due to file locking and/or unflushed writes. Open the file inside the `try` (or use a context manager), write/flush/close it before calling `pytesseract`, and consider using a `tempfile`-managed path to avoid collisions.
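
One way to address the points above is sketched below. It is a minimal illustration, not the PR's implementation: `temporary_image_path` is a hypothetical helper, and the reworked `create_searchable_document` drops the `image_file_path` parameter, which is an assumption about how the example could be restructured. The Aspose.PDF and pytesseract calls are the same ones used in the reviewed code.

```python
import contextlib
import io
import os
import tempfile


@contextlib.contextmanager
def temporary_image_path(suffix=".png"):
    """Yield a unique file path and delete the file afterwards (hypothetical helper)."""
    # mkstemp creates a uniquely named file, so a stale image left behind
    # by a crashed run cannot cause a collision.
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # release the handle; callers reopen the path themselves
    try:
        yield path
    finally:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass


def create_searchable_document(infile, outfile, page_number):
    # Imported lazily so this sketch loads without the OCR stack installed.
    import aspose.pdf as ap
    import pytesseract

    with temporary_image_path() as image_path:
        document = ap.Document(infile)
        png_device = ap.devices.PngDevice(ap.devices.Resolution(300))
        # Leaving the inner with-block flushes and closes the image
        # before pytesseract opens the same path.
        with io.FileIO(image_path, "w") as image_stream:
            png_device.process(document.pages[page_number], image_stream)
        pdf = pytesseract.image_to_pdf_or_hocr(image_path, extension="pdf")
    ap.Document(io.BytesIO(pdf)).save(outfile)
```

Because the path comes from `tempfile.mkstemp`, the exclusive-create mode `'x'` is no longer needed, and cleanup runs whether or not rendering or OCR raises.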

    finally:
        image_stream.close()
        image_file = Path(image_file_path)
        image_file.unlink(missing_ok=True)

`Path.unlink(missing_ok=True)` requires Python 3.8+, but the repo README states Python 3.7+ support. Replace this with a `try`/`except FileNotFoundError` (or check existence) to keep compatibility.

Suggested change:

    -image_file.unlink(missing_ok=True)
    +try:
    +    image_file.unlink()
    +except FileNotFoundError:
    +    pass


    pandas
    pytesseract
    \ No newline at end of file

Adding `pytesseract` introduces a runtime dependency on the native Tesseract binary (not installed via pip). Without documenting installation steps (or handling `TesseractNotFoundError` with a clear message), users will hit confusing failures at runtime. Consider adding a short note in the example (and/or README) describing how to install Tesseract and how to configure `pytesseract.pytesseract.tesseract_cmd` on Windows.

    pandas
    pytesseract
    pandas

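
The check described in the comment above could look roughly like this. `check_tesseract` is a hypothetical helper name and the error wording is illustrative. `pytesseract.get_tesseract_version()`, `pytesseract.TesseractNotFoundError`, and `pytesseract.pytesseract.tesseract_cmd` are existing pytesseract names.

```python
def check_tesseract(tesseract_cmd=None):
    """Return the Tesseract version string, or raise RuntimeError with guidance."""
    try:
        import pytesseract
    except ImportError as exc:
        raise RuntimeError(
            "pytesseract is not installed; run: pip install pytesseract"
        ) from exc

    if tesseract_cmd:
        # Needed when the tesseract executable is not on PATH,
        # which is common on Windows.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

    try:
        return str(pytesseract.get_tesseract_version())
    except pytesseract.TesseractNotFoundError as exc:
        raise RuntimeError(
            "The Tesseract OCR binary was not found. Install Tesseract "
            "separately (pip does not provide it) and either add it to PATH "
            "or pass its full path as tesseract_cmd."
        ) from exc
```

On Windows the call might be `check_tesseract(r"C:\Program Files\Tesseract-OCR\tesseract.exe")`. That path is only an example of a typical install location, so adjust it to the actual one.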

    import aspose.pdf as ap
    import io
    import pytesseract

`pytesseract` is imported at module load time, which makes all examples in this file fail to run if the optional OCR dependency (or its native `tesseract` binary) is not present, even when only `create_new_document` is executed. Consider moving the `pytesseract` import (and any related setup) inside `create_searchable_document`, and raise a clear exception when Tesseract is not available.
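
A deferred import along the lines suggested above might be sketched as follows. `require_optional` and `ocr_image_to_pdf` are hypothetical names introduced for illustration. Only `pytesseract.image_to_pdf_or_hocr` comes from the reviewed code.

```python
import importlib


def require_optional(module_name, install_hint):
    """Import an optional dependency on demand, failing with a clear message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            "This example needs the optional '{}' package. {}".format(
                module_name, install_hint
            )
        ) from exc


def ocr_image_to_pdf(image_path):
    # Resolved only when the OCR example actually runs, so the other
    # examples in the module do not depend on pytesseract being installed.
    pytesseract = require_optional(
        "pytesseract", "Install it with: pip install pytesseract"
    )
    return pytesseract.image_to_pdf_or_hocr(image_path, extension="pdf")
```

With this shape, importing the examples module and running `create_new_document` no longer touches the OCR dependency at all.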
Fixes #3