thzinc/README.md

Last active September 6, 2024 05:11


Use Tesseract to add a text layer to a PDF via OCR


(These instructions assume macOS with Homebrew.)

Dependencies:

curl -OL https://github.com/gsauthof/utility/raw/master/pdfmerge.py
pip install pdfrw
brew install imagemagick
brew install tesseract
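
Before moving on, it can help to confirm the tools are actually on the PATH. The following is a minimal sketch, not part of the original instructions; missing_tools is a hypothetical helper name, and it only checks that the commands exist, not that the pdfrw module is importable.

```shell
# Print the name of every given command that cannot be found on PATH.
# (missing_tools is a hypothetical helper, not part of any of the tools above.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Prints nothing when all three commands are available.
missing_tools convert tesseract python
```

If anything is printed, install the named tool before continuing.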

Assuming the source file is called report.pdf:

Convert the PDF into one PNG per page, rasterized at 150 DPI (on ImageMagick 7 the command is magick; convert remains as a legacy name):

convert -density 150 report.pdf +adjoin report-%03d.png

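Perform OCR on each page and produce a text-only PDF called textonly.pdf (the --dpi value should match the -density used when creating the PNGs, so the text layer lines up with the original pages):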

ls report-*.png | tesseract -c textonly_pdf=1 --dpi 150 - textonly pdf

Merge the original PDF with the text-only PDF into a PDF called searchable.pdf:

python pdfmerge.py --pdfrw textonly.pdf report.pdf searchable.pdf
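
The three steps above can be wrapped in one shell function so the file names and resolution are set in a single place. This is a sketch under a few assumptions: make_searchable is a hypothetical name, pdfmerge.py sits in the current directory, and the intermediate textonly.pdf and PNG files are left behind rather than cleaned up.

```shell
# Sketch: run the convert -> tesseract -> pdfmerge steps for one PDF.
# Usage: make_searchable SOURCE.pdf OUTPUT.pdf [DPI]
make_searchable() {
    src=$1                # source PDF, e.g. report.pdf
    out=$2                # output PDF, e.g. searchable.pdf
    dpi=${3:-150}         # one resolution shared by both tools
    base=${src%.pdf}      # report.pdf -> report

    # 1. Rasterize the PDF into one PNG per page.
    convert -density "$dpi" "$src" +adjoin "$base-%03d.png" || return 1

    # 2. OCR every page image into textonly.pdf (text layer only).
    ls "$base"-*.png | tesseract -c textonly_pdf=1 --dpi "$dpi" - textonly pdf || return 1

    # 3. Merge the text layer with the original pages.
    python pdfmerge.py --pdfrw textonly.pdf "$src" "$out"
}
```

For the example in this README, the call would be make_searchable report.pdf searchable.pdf. Passing the resolution once avoids a mismatch between -density and --dpi.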
