import PyPDF2
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import StreamObject, DecodedStreamObject
from PyPDF2 import filters

# Change the phrases/text you want redacted and the file paths:
phrases = ["123-45-6789", "YOUR_SECRET_PASSWORD", "YOUR_SECRET_ID"]
read_path = "/path/to/yourpdf.pdf"
write_path = "/path/to/yourpdf-redacted.pdf"


def replace_w_stars(text, phrase):
    return text.replace(phrase, "*" * len(phrase))


def redact(text, phrases=phrases):
    # Stream data arrives as bytes; decode it so str.replace can be used
    if isinstance(text, bytes):
        text = text.decode()
    for phrase in phrases:
        text = replace_w_stars(text, phrase)
    return text


class EncodedStreamObject(StreamObject):
    def __init__(self):
        self.decodedSelf = None

    def getData(self):
        if self.decodedSelf:
            # cached version of decoded object
            return self.decodedSelf.getData()
        else:
            # create decoded object
            decoded = DecodedStreamObject()
            decoded._data = filters.decodeStreamData(self)
            # redact the decoded content before it is cached and returned
            decoded._data = redact(decoded._data)
            for key, value in list(self.items()):
                if key not in ("/Length", "/Filter", "/DecodeParms"):
                    decoded[key] = value
            self.decodedSelf = decoded
            return decoded._data


# Overload with redaction version
PyPDF2.generic.EncodedStreamObject = EncodedStreamObject

# Read in the PDF
f = open(read_path, "rb")
r = PdfFileReader(f)

# Force the redaction by merging with itself
page = r.getPage(0)
page.mergePage(r.getPage(0))

# Write out the result
f_out = open(write_path, "wb")
w = PdfFileWriter()
w.addPage(page)
w.write(f_out)
f_out.close()
f.close()
Hi there Andrea (@ndricca)! The PyPDF2 package has changed quite a bit since I wrote this. It still works! But you need to use version 1.26.0 (`pip install PyPDF2==1.26.0`). I just tried it and it works using this version. :)
To confirm you have the right version do this before running the gist:
pip show PyPDF2
Name: PyPDF2
Version: 1.26.0 <--- Should be this
.
.
.
Hope that helps!
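The same check can also be done from Python before running the gist. This is only a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `has_version` is made up for illustration and is not part of PyPDF2.

```python
from importlib.metadata import PackageNotFoundError, version


def has_version(package, expected):
    """Return True only if `package` is installed at exactly `expected`."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        # The package is not installed at all
        return False


if __name__ == "__main__":
    print("PyPDF2 1.26.0 installed:", has_version("PyPDF2", "1.26.0"))
```

If this prints `False`, the gist as written is unlikely to take effect, because newer releases renamed the methods and moved the class it patches.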
Hello,
To use the latest pypdf (formerly PyPDF2) version, you'll have to change the overriding code to:
# Overload with redaction version
PyPDF2.generic._data_structures.EncodedStreamObject = EncodedStreamObject
(use `pypdf.generic._data_structures` instead if you import the package as `pypdf`). The class was in `generic` and moved to `generic._data_structures`.
Plus some minor changes to deprecated calls.
By the way, thanks for sharing.
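To avoid hard-coding the module path, the override can be applied wherever the class happens to live. The following is a rough sketch, not an official pypdf recipe: the helper name `patch_encoded_stream` and the candidate list are assumptions for illustration, and the replacement class still has to implement the method names the installed version expects (for example `get_data` rather than `getData` in newer releases).

```python
import importlib

# Module paths where EncodedStreamObject has lived in different releases.
# This list is an assumption for illustration; adjust it to what you have installed.
CANDIDATE_MODULES = (
    "pypdf.generic._data_structures",
    "PyPDF2.generic._data_structures",
    "PyPDF2.generic",
)


def patch_encoded_stream(replacement, candidates=CANDIDATE_MODULES):
    """Rebind EncodedStreamObject in every importable candidate module.

    Returns the names of the modules that were actually patched.
    """
    patched = []
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Not installed, or this release has no such submodule
            continue
        if hasattr(module, "EncodedStreamObject"):
            setattr(module, "EncodedStreamObject", replacement)
            patched.append(name)
    return patched
```

Calling `patch_encoded_stream(EncodedStreamObject)` before opening the PDF and getting back an empty list is a quick sign that the override never took effect, which would also explain a breakpoint in `__init__()` never being hit.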
Hi, upgrading to pypdf was quite straightforward: the code runs to the end, but without any replacement.
I guess I am still not able to decode the text correctly.
By putting a breakpoint on line 21 above, on `text = replace_w_stars(text, phrase)`, I see that the `text` variable has a value that starts like this: `'0.00000 0.00000 595.00000 842.00000 re W n'`.
I tried changing how the bytes are decoded, but nothing I tried worked. Do you have any suggestions?
You're replacing line 7 with the actual phrases you want to redact, right? I would check if you can find your phrases within the `text` variable at your breakpoint (e.g. `assert phrases[0] in text`).
Also, I would see if adding `'595.0000'` as a phrase results in it being redacted (replaced with `*`) from your particular PDF, since that is a phrase which definitely does appear in the `text` variable based on your post.
Let me know the results and hopefully we can figure out why it's not working.
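The check suggested above can be wrapped in a small helper and called at the breakpoint. This is just a sketch; `find_phrases` is a hypothetical name, not something from the gist or from PyPDF2.

```python
def find_phrases(text, phrases):
    """Return a dict mapping each phrase to whether it occurs in text."""
    if isinstance(text, bytes):
        # Mirror what redact() does before searching
        text = text.decode()
    return {phrase: phrase in text for phrase in phrases}


# The sample is the start of the `text` value quoted earlier in the thread.
sample = "0.00000 0.00000 595.00000 842.00000 re W n"
print(find_phrases(sample, ["595.0000", "123-45-6789"]))
# prints {'595.0000': True, '123-45-6789': False}
```

If a phrase you can read in the rendered PDF comes back `False` here, the problem is how the text is stored in the stream rather than the replacement logic.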
Thanks for your quick reply Riley. What I am saying is that the content of the `text` variable is encoded or maybe encrypted.
Here is a more comprehensive example:
- I created a .docx file, wrote "This is a very standard text." inside and saved it as PDF.
- If I apply the `extract_text()` method to the page without overwriting the class `EncodedStreamObject`, I correctly get `'This is a very standard text. \n '`.
- If I overwrite the class and place a breakpoint on line 21 above, the `text` variable has the following content:
/Span <</MCID 0/Lang (en-GB)>> BDC q
0.000008871 0 595.32 841.92 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 56.64 760.68 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[(T)] TJ
ET
Q
q
0.000008871 0 595.32 841.92 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 62.04 760.68 Tm
0 g
0 G
[(h)3(is is a )8(v)-4(er)10(y)-3( )] TJ
ET
Q
q
0.000008871 0 595.32 841.92 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 116.06 760.68 Tm
0 g
0 G
[(s)11(tand)5(ard)5( t)-3(ex)7(t.)] TJ
ET
Q
q
0.000008871 0 595.32 841.92 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 178.46 760.68 Tm
0 g
0 G
[( )] TJ
ET
Q
EMC /Span <</MCID 1/Lang (en-GB)>> BDC q
0.000008871 0 595.32 841.92 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 56.64 738.1 Tm
0 g
0 G
[( )] TJ
ET
Q
EMC
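For what it's worth, the dump above looks like an ordinary decoded PDF content stream rather than encrypted data. The text is drawn by `TJ` operators, each taking an array of string fragments separated by kerning numbers, so "standard" is stored as `(s)11(tand)5(ard)` and a plain `str.replace` on the whole stream never sees the phrase. The sketch below joins the fragments of each `TJ` array and stars out every fragment of an array whose joined text contains a phrase. It is a rough illustration under several assumptions: the names `tj_texts` and `redact_tj` are made up, it handles only literal `(...)` strings with no `]` inside them, it ignores hex strings and custom font encodings, and it will not catch a phrase split across two `TJ` operators (such as the leading "T" of "This" above).

```python
import re

# A TJ array: everything between "[" and "]" directly before the TJ operator.
TJ_ARRAY = re.compile(r"\[([^\]]*)\]\s*TJ")
# A literal string piece such as "(tand)", allowing backslash escapes inside.
STRING_PIECE = re.compile(r"\(((?:\\.|[^\\()])*)\)")


def tj_texts(content):
    """Return the joined text of every TJ array in a content stream."""
    return ["".join(STRING_PIECE.findall(m.group(1)))
            for m in TJ_ARRAY.finditer(content)]


def redact_tj(content, phrases):
    """Star out all string pieces of any TJ array whose joined text contains a phrase."""

    def star_piece(match):
        return "(" + "*" * len(match.group(1)) + ")"

    def redact_array(match):
        body = match.group(1)
        joined = "".join(STRING_PIECE.findall(body))
        if any(phrase in joined for phrase in phrases):
            # Kerning numbers between the pieces are left untouched
            return "[" + STRING_PIECE.sub(star_piece, body) + "] TJ"
        return match.group(0)

    return TJ_ARRAY.sub(redact_array, content)


line = "[(s)11(tand)5(ard)5( t)-3(ex)7(t.)] TJ"
print(tj_texts(line))
print(redact_tj(line, ["standard"]))
```

Note that this redacts the whole array, not just the matching phrase, which errs on the side of hiding too much. Something like `redact_tj` could be called from `redact()` in place of `replace_w_stars`, but that wiring is untested here.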
Hi, I am trying to use your script but sadly I don't see any effect. I don't understand exactly where your `EncodedStreamObject` class is actually used. I guess it should produce its effect inside the `mergePage()` method.
I tried to set a breakpoint inside `__init__()` and the debugger does not stop there, so I think it is never called during the script's execution.
Could you help me with it? All other resources on the web seem to be quite unhelpful and/or outdated.
Thanks in advance,
Andrea