@convexset
Last active September 17, 2024 15:02
Remove multiple text watermarks from a PDF file. Requires xxd and qpdf to work correctly.
#!/bin/bash
# Remove multiple text watermarks from a PDF file. Requires xxd and qpdf to work correctly.
#
# Usage:
#
# remove-pdf-watermark.sh "Your Input File.pdf" "Your Output File.pdf" [WATERMARK1] [WATERMARK2] [WATERMARK3] [...]
#
# For Example:
#
# remove-pdf-watermark.sh "Your Input File.pdf" "Your Output File.pdf" "Watermark 1" "Watermark 2"
#
# This is a more general (fewer dependencies, more functionality, but slower) version of
# https://gist.github.com/elfsternberg/a96883018d783cbbad7b454ecd0a7ffe
INPUT_FILENAME=$1
OUTPUT_FILENAME=$2
echo "Processing: $INPUT_FILENAME"
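# Explicit XXXXXX templates: GNU mktemp rejects a template with fewer than three X's.
UNCOMPRESSED=$(mktemp -t 'uncompressed.XXXXXX')
UNCOMPRESSED_HEX=$(mktemp -t 'uncompressed-hex.XXXXXX')
UNMARKED_PRE=$(mktemp -t 'unmarked-pre.XXXXXX')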
qpdf --stream-data=uncompress --decode-level=all "$INPUT_FILENAME" "$UNCOMPRESSED"
echo " - Decompressing to: $UNCOMPRESSED"
xxd -ps -c 0 "$UNCOMPRESSED" > "$UNCOMPRESSED_HEX"
echo " - Hex dumping to: $UNCOMPRESSED_HEX"
rm "$UNCOMPRESSED"
COUNT=0
for ARG in "$@"; do
    COUNT=$((COUNT+1))
    # The first two arguments are the input and output file names; the rest are watermarks.
    if [[ $COUNT -gt 2 ]]
    then
        WATERMARK=$ARG
        echo " - Processing Watermark: \"$WATERMARK\""
        WATERMARKLEN=${#WATERMARK}
        # printf '%s' keeps the watermark's internal whitespace intact and avoids echo's option parsing.
        WATERMARK_HEX=$(printf '%s' "$WATERMARK" | xxd -p -c 0)
        # Replace with the same number of blanks so the stream length does not change.
        BLANKS=$(printf "%${WATERMARKLEN}s" "")
        BLANKS_HEX=$(printf '%s' "$BLANKS" | xxd -p -c 0)
        NUM_OCCURENCES_=$(grep -o "$WATERMARK_HEX" "$UNCOMPRESSED_HEX" | wc -l)
        NUM_OCCURENCES=$((0+$NUM_OCCURENCES_))
        echo " * Number of occurrences: $NUM_OCCURENCES"
        if [[ $NUM_OCCURENCES -gt 0 ]]
        then
            echo " * Replacing with $WATERMARKLEN blanks..."
            sed -i -e "s/$WATERMARK_HEX/$BLANKS_HEX/g" "$UNCOMPRESSED_HEX"
            echo " * Replacement done."
            NUM_OCCURENCES_=$(grep -o "$WATERMARK_HEX" "$UNCOMPRESSED_HEX" | wc -l)
            NUM_OCCURENCES=$((0+$NUM_OCCURENCES_))
            echo " * Final number of occurrences: $NUM_OCCURENCES"
        fi
    fi
done
xxd -r -p -c 0 "$UNCOMPRESSED_HEX" "$UNMARKED_PRE"
rm "$UNCOMPRESSED_HEX"
echo " - Reverting from hex dump to: $UNMARKED_PRE"
qpdf --stream-data=compress "$UNMARKED_PRE" "$OUTPUT_FILENAME"
rm "$UNMARKED_PRE"
echo " - Output written to: $OUTPUT_FILENAME"
echo
echo
# NO WARRANTY
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
# WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
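
One thing to watch when matching on a continuous hex string: a pattern can also match at an odd nibble offset. For example, the bytes 0x14 0x10 dump as "1410", which contains "41" (the hex for "A") even though no "A" byte is present. The following is a minimal sketch, not part of the script above, of counting only byte-aligned matches by putting a space after every byte. The helper names `to_spaced_hex` and `count_aligned` are hypothetical, and `od` is swapped in for `xxd` purely for the illustration.

```shell
#!/bin/bash
# Sketch only: to_spaced_hex and count_aligned are hypothetical helpers.

# Hex-encode stdin with one space after every byte, e.g. "WM" -> "57 4d ".
to_spaced_hex() {
    od -An -v -tx1 | tr -s ' \n' ' ' | sed -e 's/^ //'
}

# Count byte-aligned occurrences of the string $1 in the file $2.
# Because the pattern contains the separating spaces, it can only
# match where a byte really starts.
count_aligned() {
    local pattern
    pattern=$(printf '%s' "$1" | to_spaced_hex)
    to_spaced_hex < "$2" | grep -o "$pattern" | wc -l | tr -d ' '
}
```

Usage would look like `count_aligned "Watermark 1" uncompressed.pdf`. The same spaced encoding could in principle be used for the replacement step, at the cost of a dump roughly 1.5 times as large.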
@jberkenbilt

It will be interesting to see whether/how qpdf json v2 might help you with this kind of task once qpdf 11 is out. qpdf json v2 is in main now, though the interface isn't frozen until I release. If you want to take a look, you can check out current main. Documentation is at https://qpdf.readthedocs.io/en/latest/json.html

@convexset
Author

I think that sounds really really cool. Having structured data to work with allows so much more. Also, thanks for qpdf. It has saved me so much time.

@dhewg

dhewg commented Jan 18, 2023

Here's one JSON example, which works for some PDFs:

qpdf --json --json-stream-data=file in.pdf out.json
# grep -l prints each matching stream file once, even when the file is binary
for i in $(grep -li confidential out.json-*); do
  cp /dev/null "$i"
done
qpdf --json-input out.json out.pdf
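
The loop in that example can be wrapped in a small function so the keyword and file prefix are parameters. This is only a sketch: `blank_matching_streams` is a hypothetical name, and it assumes the stream files are named `<prefix>-<number>` as in the `out.json-*` glob above.

```shell
#!/bin/bash
# Hypothetical helper: empty every stream file "<prefix>-*" whose content
# contains the keyword (case-insensitive, fixed string), and print its name.
blank_matching_streams() {
    local prefix=$1 keyword=$2 f
    for f in "$prefix"-*; do
        [ -f "$f" ] || continue          # glob matched nothing
        if grep -qiF -- "$keyword" "$f"; then
            : > "$f"                     # truncate, same effect as cp /dev/null
            echo "$f"
        fi
    done
}
```

It would be called between the two qpdf commands, for example `blank_matching_streams out.json confidential`. Note that this empties the whole stream, so any other content drawn by the same stream is removed as well.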

@jberkenbilt

@dhewg I think this is the first example I've seen of user-contributed qpdf json. :-)

@dhewg

dhewg commented Jan 18, 2023

And such a small yet powerful one! ;) Some watermarks are just too much in-your-face...
Thanks for the json format, this makes it easy to manipulate with standard tools!

@savchenko

Fails with:

mktemp: too few X's in template ‘uncompressed’
mktemp: too few X's in template ‘uncompressed-hex’
mktemp: too few X's in template ‘unmarked-pre’

qpdf: an output file name is required; use - for standard output

For help:
  qpdf --help=usage       usage information
  qpdf --help=topic       help on a topic
  qpdf --help=--option    help on an option
  qpdf --help             general help and a topic list

 - Decompressing to: 
./pdf_remove_watermark.sh: line 29: $UNCOMPRESSED_HEX: ambiguous redirect
 - Hex dumping to: 
rm: missing operand
Try 'rm --help' for more information.

Running on:

qpdf v11.9.1-1
coreutils v9.4-3.1
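
The "too few X's" message is what GNU mktemp prints when a template contains fewer than three X's; the later qpdf, redirect, and rm errors follow from the temp-file variables being empty. The BSD/macOS mktemp treats the `-t` argument as a prefix, which is presumably why a bare name works there. A minimal sketch of temp-file creation that should behave the same on both, with cleanup on exit (`make_temp` is a hypothetical helper name):

```shell
#!/bin/bash
# Sketch: create temp files from a full-path template ending in XXXXXX,
# which both GNU and BSD mktemp accept.
make_temp() {
    mktemp "${TMPDIR:-/tmp}/$1.XXXXXX"
}

UNCOMPRESSED=$(make_temp uncompressed) || exit 1
UNCOMPRESSED_HEX=$(make_temp uncompressed-hex) || exit 1
UNMARKED_PRE=$(make_temp unmarked-pre) || exit 1

# Remove the temp files however the script ends.
trap 'rm -f "$UNCOMPRESSED" "$UNCOMPRESSED_HEX" "$UNMARKED_PRE"' EXIT
```

With the trap in place, the individual `rm` calls become optional, and the temp files are also cleaned up if qpdf or xxd fails partway through.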
