
@johnidm
Created March 31, 2024 11:52
Python Tesseract 5 - Example of Use

A simple way to use Tesseract 5 from Python with pytesseract.

For more information about Tesseract, see the official documentation.

Install dependencies:

sudo add-apt-repository ppa:alex-p/tesseract-ocr5
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-por
sudo apt install poppler-utils  # provides pdftoppm, which pdf2image calls

Check the Tesseract version:

tesseract --version
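
The same check can be scripted. The sketch below is not part of the original gist. It uses only the standard library to call `tesseract --version` and `tesseract --list-langs` and confirm that the `por` language data is installed. The helper names (`parse_version`, `parse_langs`, `check_install`) are hypothetical. The parsing assumes the usual output layout: a first line of the form `tesseract 5.x.y`, and for the language list a header line followed by one language code per line.

```python
import re
import shutil
import subprocess
from typing import List, Optional


def parse_version(output: str) -> Optional[str]:
    """Return the version number from `tesseract --version` output, if found."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", output)
    return match.group(1) if match else None


def parse_langs(output: str) -> List[str]:
    """Return language codes from `tesseract --list-langs` output.

    Assumes the first line is a header and each later line is one code.
    """
    lines = output.strip().splitlines()
    return [line.strip() for line in lines[1:] if line.strip()]


def run_tesseract(flag: str) -> str:
    """Run `tesseract <flag>` and return its text output."""
    result = subprocess.run(["tesseract", flag], capture_output=True, text=True)
    # Fall back to stderr in case a build prints this information there.
    return result.stdout or result.stderr


def check_install(lang: str = "por") -> bool:
    """Report whether the tesseract binary and the given language are available."""
    if shutil.which("tesseract") is None:
        print("tesseract not found on PATH")
        return False
    print("version:", parse_version(run_tesseract("--version")))
    langs = parse_langs(run_tesseract("--list-langs"))
    print("languages:", ", ".join(langs))
    return lang in langs


if __name__ == "__main__":
    print("ready" if check_install("por") else "not ready")
```

If `por` is missing from the list, `pytesseract.image_to_string(..., lang="por")` will fail, so running a check like this first gives a clearer error than a failure in the middle of a long conversion.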

Install the Python dependencies:

pip install pytesseract
pip install pdf2image 

Code example:

import os
from pdf2image import convert_from_path
import pytesseract


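def main(filename: str) -> None:
    # Use the PDF's file name, without directory or extension, for the output file.
    basename, _ = os.path.splitext(os.path.basename(filename))

    # Render every page of the PDF to a PIL image.
    pages = convert_from_path(filename)
    texts = []
    # Number pages from 1 so the progress message matches the PDF's page numbers.
    for page_number, page_image in enumerate(pages, start=1):
        print(f"Converting page #{page_number}")

        # Run OCR with the Portuguese language data.
        text = pytesseract.image_to_string(page_image, lang="por")

        texts.append(text)

    # Write UTF-8 explicitly so accented characters do not depend on the locale.
    with open(f"{basename}.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(texts))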


if __name__ == "__main__":
    main("3.pdf")
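
The example above renders every page into memory before OCR starts, which can use a lot of memory on long PDFs. A possible variation, sketched below and not part of the original gist, renders a few pages at a time using the `first_page` and `last_page` parameters of `convert_from_path` and the page count from `pdfinfo_from_path`. The names `page_ranges` and `ocr_pdf_in_chunks` and the default `chunk_size` of 10 are assumptions made for illustration.

```python
from typing import Iterator, Tuple


def page_ranges(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) pairs covering the document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def ocr_pdf_in_chunks(filename: str, lang: str = "por", chunk_size: int = 10) -> str:
    """OCR a PDF a few pages at a time and return the joined text."""
    # Imported here so page_ranges() stays usable without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(filename)["Pages"]
    texts = []
    for first, last in page_ranges(total_pages, chunk_size):
        # Only pages first..last are rendered, so at most chunk_size images exist at once.
        images = convert_from_path(filename, first_page=first, last_page=last)
        for image in images:
            texts.append(pytesseract.image_to_string(image, lang=lang))
    return "\n".join(texts)
```

Calling `ocr_pdf_in_chunks("3.pdf")` should return the same text that `main` writes to disk, with lower peak memory and one extra `pdfinfo` call.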