@neurojojo
Last active February 22, 2022 21:00
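import re

# `files`, `pdf_as_text`, `problem_pdfs`, and `image_pdf_text` are assumed to be defined earlier.
dictionary_of_texts = dict()
# Keep only the PDFs from which some text was extracted.
for filename, pdf_text in zip(files, pdf_as_text):
    if len(pdf_text) != 0:
        dictionary_of_texts[filename] = pdf_text

# Add text from the image PDFs, if those variables exist.
try:
    for filename, pdf_text in zip(problem_pdfs, image_pdf_text):
        dictionary_of_texts[filename] = pdf_text
except NameError:
    print('No image PDFs added')

# Write each document to a .txt file, one marked block per page.
for k, v in dictionary_of_texts.items():
    # Replace only the trailing extension, not every 'pdf' in the name.
    newfilename = re.sub(r'\.pdf$', '.txt', k, flags=re.IGNORECASE)
    print(newfilename)
    with open(newfilename, 'w') as f:
        for pagenum, page in enumerate(v):
            f.write(f'ARCHIVE START OF PAGE {pagenum}\n{page}\nARCHIVE END OF PAGE {pagenum}\n')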
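
The page markers written above make it possible to split a saved .txt file back into its pages later. The following is a minimal sketch of that, not part of the original gist: the names `PAGE_PATTERN` and `split_pages` are hypothetical, and it assumes the page text itself never contains a marker line.

```python
import re

# Hypothetical helper (not in the original gist): recover the list of pages
# from text written with the ARCHIVE START/END markers.
PAGE_PATTERN = re.compile(
    r'ARCHIVE START OF PAGE (\d+)\n(.*?)\nARCHIVE END OF PAGE \1\n',
    re.DOTALL,  # let '.' span newlines so multi-line pages are captured
)

def split_pages(archive_text):
    """Return the page texts ordered by page number."""
    pages = {int(num): body for num, body in PAGE_PATTERN.findall(archive_text)}
    return [pages[n] for n in sorted(pages)]

# Small in-memory sample in the same format the writer loop produces.
sample = (
    'ARCHIVE START OF PAGE 0\nfirst page\nARCHIVE END OF PAGE 0\n'
    'ARCHIVE START OF PAGE 1\nsecond\npage\nARCHIVE END OF PAGE 1\n'
)
print(split_pages(sample))
```

To use it on a file, read the whole file into a string first and pass that string to `split_pages`.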