A simple approach to running Tesseract OCR from Python with pytesseract.
For more information about Tesseract, see the official documentation.
Install dependencies:
sudo add-apt-repository ppa:alex-p/tesseract-ocr5
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-por poppler-utils
Check the Tesseract version:
tesseract --version
Install the Python dependencies:
pip install pytesseract
pip install pdf2image
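Before running any OCR, it can help to confirm that the external tools are reachable. The sketch below is an illustration, not part of pytesseract or pdf2image: the function name check_ocr_setup and the shape of its report are hypothetical. It assumes the tesseract binary is on PATH and relies on the fact that pdf2image calls Poppler's pdftoppm under the hood.

```python
import shutil
import subprocess


def check_ocr_setup(lang: str = "por") -> dict:
    """Report which external tools the OCR pipeline can find on PATH.

    Hypothetical helper: returns a dict of booleans rather than raising.
    """
    report = {
        "tesseract": shutil.which("tesseract") is not None,
        # pdf2image shells out to Poppler, so pdftoppm must be installed too.
        "pdftoppm": shutil.which("pdftoppm") is not None,
        "language": False,
    }
    if report["tesseract"]:
        # `tesseract --list-langs` prints a header line, then one language
        # code per line. Some versions write to stderr, so read both streams.
        result = subprocess.run(
            ["tesseract", "--list-langs"],
            capture_output=True,
            text=True,
            check=False,
        )
        listed = {line.strip() for line in (result.stdout + result.stderr).splitlines()}
        report["language"] = lang in listed
    return report
```

Calling check_ocr_setup() returns something like {"tesseract": True, "pdftoppm": True, "language": True} on a complete install. A False entry points at the missing piece: the Tesseract binary, Poppler, or the language pack.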
Code example:
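import os

from pdf2image import convert_from_path
import pytesseract


def main(filename: str) -> None:
    # Name the output after the PDF: "dir/3.pdf" -> "3.txt".
    _, tail = os.path.split(filename)
    basename, _ = os.path.splitext(tail)

    # Render each PDF page to a PIL image (requires Poppler).
    docs = convert_from_path(filename)

    texts = []
    for page_number, page_data in enumerate(docs, start=1):
        print(f"Converting page #{page_number}")
        # lang="por" selects the Portuguese model installed above.
        text = pytesseract.image_to_string(page_data, lang="por")
        texts.append(str(text))

    # Portuguese text contains accented characters, so write UTF-8 explicitly.
    with open(f"{basename}.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(texts))


if __name__ == "__main__":
    main("3.pdf")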
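The example above hard-codes the language, the rendering resolution, and the output location. A possible variant that exposes them as parameters is sketched here. The names output_path_for and ocr_pdf are hypothetical, and the 300 DPI default is an assumption (a commonly suggested resolution for OCR), not something taken from the example; dpi is a real keyword argument of convert_from_path.

```python
from pathlib import Path


def output_path_for(pdf_path: str, out_dir: str = ".") -> Path:
    """Return <out_dir>/<pdf name without extension>.txt."""
    return Path(out_dir) / f"{Path(pdf_path).stem}.txt"


def ocr_pdf(pdf_path: str, lang: str = "por", dpi: int = 300, out_dir: str = ".") -> Path:
    """OCR every page of pdf_path and write the joined text to a .txt file."""
    # Imported lazily so output_path_for works without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    # Higher DPI gives Tesseract more pixels per glyph, at the cost of memory.
    pages = convert_from_path(pdf_path, dpi=dpi)
    texts = [pytesseract.image_to_string(page, lang=lang) for page in pages]

    out_path = output_path_for(pdf_path, out_dir)
    out_path.write_text("\n".join(texts), encoding="utf-8")
    return out_path
```

With this in place, ocr_pdf("3.pdf") would behave like the example, while ocr_pdf("3.pdf", lang="eng", dpi=200) would switch to the English model at a lower resolution, provided that language pack is installed.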