Python Tesseract 5 - Example of Use

A simple example of running Tesseract OCR from Python with pytesseract.

For more information about Tesseract, see the official documentation.

Install the system dependencies:

sudo add-apt-repository ppa:alex-p/tesseract-ocr5
sudo apt update
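sudo apt install tesseract-ocr-por
# pdf2image needs the Poppler command-line tools to render PDF pages
sudo apt install poppler-utils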

Check the Tesseract version:

tesseract --version
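
The same check can be done from Python before running any OCR. The sketch below is not part of the original example: `parse_langs` and `installed_langs` are hypothetical helper names, and it assumes the `tesseract` binary is on `PATH` and that `tesseract --list-langs` prints a header line followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the header ('List of available languages ...').
    return [line for line in lines if line and not line.startswith("List of")]


def installed_langs() -> list[str]:
    """Return installed language codes, or [] if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Assumption: the list is on stdout; fall back to stderr if stdout is empty.
    return parse_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = installed_langs()
    print("Portuguese data installed:", "por" in langs)
```

If `por` is missing from the list, the `tesseract-ocr-por` package was probably not installed correctly.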

Install the Python dependencies:

pip install pytesseract
pip install pdf2image

Code example:

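import os

import pytesseract
from pdf2image import convert_from_path


def main(filename: str) -> None:
    # Derive the output name from the input file, e.g. "3.pdf" -> "3".
    _, tail = os.path.split(filename)
    basename, _ = os.path.splitext(tail)

    # Render each PDF page as a PIL image (pdf2image relies on Poppler).
    pages = convert_from_path(filename)

    texts = []
    # Start counting at 1 so the progress message matches the PDF page number.
    for page_number, page_image in enumerate(pages, start=1):
        print(f"Converting page #{page_number}")

        # Run OCR on the page using the Portuguese language data.
        text = pytesseract.image_to_string(page_image, lang="por")
        texts.append(str(text))

    # Write UTF-8 explicitly so accented characters survive on any platform.
    with open(f"{basename}.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(texts))


if __name__ == "__main__":
    main("3.pdf")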
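
The example writes the text file to the current directory, named after the PDF. As a possible variation, the name handling could be done with `pathlib`, which also makes it easy to choose another output directory. The helper `output_path` and its `out_dir` parameter are hypothetical and not part of the original script.

```python
from pathlib import Path


def output_path(pdf_path: str, out_dir: str = ".") -> Path:
    """Map an input such as 'scans/3.pdf' to '<out_dir>/3.txt'."""
    # Path.stem drops both the directory part and the final extension.
    return Path(out_dir) / f"{Path(pdf_path).stem}.txt"


if __name__ == "__main__":
    print(output_path("scans/3.pdf", "out"))
```

Inside `main`, the result could be passed straight to `open(...)` in place of the f-string file name.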
