A friend had around 350 PDFs with annotations (highlights, comments, etc.). The annotations had to be removed.
Left - Before | Right - After
Manual deletion - would have taken tens of hours.
Existing tools - We found a website, but it wouldn't solve our problem without payment.
We quickly gave up on ready-made solutions.
Solution? Python script to the rescue. It was surprisingly quick and easy.
It took around 11 seconds to process all 350 PDFs.
python3 -m venv .venv
source .venv/bin/activate
pip install pypdf
from pathlib import Path

from pypdf import PdfReader, PdfWriter

source_dir = Path("source")
destination_dir = Path("cleaned")
source_pdf_list = list(source_dir.rglob("*.pdf"))

for pdf_path in source_pdf_list:
    with open(pdf_path, 'rb') as pdf:
        # read/parse pdf file
        pdf_in = PdfReader(pdf)

        # clean annotations page by page
        pdf_out = PdfWriter()
        for page in pdf_in.pages:
            if page.annotations:
                page.annotations.clear()
            pdf_out.add_page(page)

        # generate destination path, mirroring the directory structure of source;
        # relative_to() swaps only the leading "source" folder, so file or folder
        # names that happen to contain "source" are left alone
        destination_path = destination_dir / pdf_path.relative_to(source_dir)
        destination_path.parent.mkdir(parents=True, exist_ok=True)

        # write pdf file
        with open(destination_path, 'wb') as f:
            pdf_out.write(f)
    print(f"Processed: {destination_path}")
Output:
...
Processed: cleaned/R/GS 4/12.pdf
Processed: cleaned/R/GS 4/22a.pdf
Processed: cleaned/R/GS 4/22.pdf
Processed: cleaned/R/GS 4/19.pdf
Processed: cleaned/R/GS 4/21.pdf
Processed: cleaned/R/GS 4/20.pdf
Processed: cleaned/R/GS 4/17a.pdf
Processed: cleaned/R/GS 4/20a.pdf
Processed: cleaned/R/GS 4/18a.pdf
Processed: cleaned/R/GS 4/12a.pdf
Processed: cleaned/R/GS 4/18.pdf
...
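The file-discovery and destination-path steps can also be pulled out and tried on their own, without touching any PDFs. Below is a minimal sketch using only the standard library. The helper names `find_pdfs` and `map_to_destination` are hypothetical (they are not part of the script above or of pypdf). It assumes you want files ending in `.PDF` picked up as well, and that only the leading source folder should be swapped for the destination folder.

```python
from __future__ import annotations

from pathlib import Path


def map_to_destination(pdf_path: Path, source_dir: Path, dest_dir: Path) -> Path:
    """Return the path under dest_dir that mirrors pdf_path's place under source_dir."""
    # relative_to() strips only the leading source_dir part, so a file such as
    # source/resource/source.pdf keeps its folder and file name intact.
    return dest_dir / pdf_path.relative_to(source_dir)


def find_pdfs(source_dir: Path) -> list[Path]:
    """Recursively collect PDF files, matching the extension case-insensitively."""
    return sorted(
        p for p in source_dir.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    src, dst = Path("source"), Path("cleaned")
    # Dry run: show where each file would be written, without writing anything.
    for pdf in find_pdfs(src):
        print(pdf, "->", map_to_destination(pdf, src, dst))
```

Running this as a dry run before the real script is a cheap way to check that every file lands where you expect.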
That's it. Another everyday problem solved with some programming.