@thzinc
Last active September 6, 2024 05:11
Use Tesseract to add a text layer to a PDF via OCR

(Using macOS)

Dependencies:

  • ImageMagick
  • Tesseract
  • pdfmerge (a Python script that depends on pdfrw)

Install them:

curl -OL https://github.com/gsauthof/utility/raw/master/pdfmerge.py
pip install pdfrw
brew install imagemagick
brew install tesseract
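
Before running the steps, it can help to confirm that everything above is actually in place. The following is a minimal sketch, not part of the original gist: the function name missing_dependencies is hypothetical, and it assumes the ImageMagick binary is named convert and that pdfmerge.py was downloaded into the current directory.

```python
import importlib.util
import os
import shutil


def missing_dependencies(tools=("convert", "tesseract"),
                         modules=("pdfrw",),
                         files=("pdfmerge.py",)):
    """Return the names of required items that could not be found.

    Hypothetical helper: the defaults mirror the dependency list above.
    """
    # Command-line tools must be discoverable on PATH.
    missing = [tool for tool in tools if shutil.which(tool) is None]
    # Python modules must be importable by the current interpreter.
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    # Downloaded scripts are expected in the current working directory.
    missing += [path for path in files if not os.path.isfile(path)]
    return missing
```

Calling missing_dependencies() with no arguments returns an empty list when all four items were found.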

Assuming the source file is called report.pdf:

Convert the PDF into one PNG per page:

convert -density 150 report.pdf +adjoin report-%03d.png

Perform OCR on each page and produce a text-only PDF called textonly.pdf (the --dpi value matches the -density used in the previous step):

ls report-*.png | tesseract -c textonly_pdf=1 --dpi 150 - textonly pdf

Merge the original PDF with the text-only PDF into a PDF called searchable.pdf:

python pdfmerge.py --pdfrw textonly.pdf report.pdf searchable.pdf
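
The three steps can also be chained from one script. This is a sketch under stated assumptions rather than the author's method: build_commands and make_searchable are hypothetical names, the commands and flags are copied from the steps above, and the single dpi parameter exists only to keep -density and --dpi at the same value. The page file names are passed to tesseract on standard input, as the ls pipe does.

```python
import glob
import subprocess
import sys


def build_commands(source="report.pdf", dpi=150, output="searchable.pdf"):
    """Return the page-name stem and the three command lines as argv lists."""
    stem = source[:-4] if source.lower().endswith(".pdf") else source
    # Step 1: one PNG per page, rendered at the chosen density.
    rasterize = ["convert", "-density", str(dpi), source,
                 "+adjoin", f"{stem}-%03d.png"]
    # Step 2: OCR into textonly.pdf; "-" means the image list comes from stdin.
    ocr = ["tesseract", "-c", "textonly_pdf=1", "--dpi", str(dpi),
           "-", "textonly", "pdf"]
    # Step 3: merge the text layer with the original pages.
    # Assumption: pdfmerge.py sits in the current directory and pdfrw is
    # installed for the interpreter running this script.
    merge = [sys.executable, "pdfmerge.py", "--pdfrw",
             "textonly.pdf", source, output]
    return stem, rasterize, ocr, merge


def make_searchable(source="report.pdf", dpi=150, output="searchable.pdf"):
    """Run the three steps in order, stopping at the first failure."""
    stem, rasterize, ocr, merge = build_commands(source, dpi, output)
    subprocess.run(rasterize, check=True)
    # Zero-padded page numbers keep the sorted order equal to page order.
    pages = sorted(glob.glob(glob.escape(stem) + "-*.png"))
    subprocess.run(ocr, input="\n".join(pages) + "\n", text=True, check=True)
    subprocess.run(merge, check=True)
```

Calling make_searchable("report.pdf") from the directory that holds report.pdf and pdfmerge.py runs the same commands as above.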
thzinc commented Apr 18, 2019

Using report.pdf, this produced searchable.pdf