Niraj Zade

Remove annotations from PDFs in bulk

Tags: everyday engineering  

THE PROBLEM

A friend had around 350 PDFs with annotations (highlights, comments, etc.) that had to be removed.

Left - Before | Right - After

Solutions tried

Manual deletion - would have taken tens of hours.

Existing tools - We found a website, but it couldn't solve our problem without payment:

  • It only allowed uploading and cleaning one PDF at a time, so it was still a tiresome manual process that would have taken a long time.
  • The free cleanup only allowed a limited number of files. After that, we had to pay for the service.

We quickly gave up on ready-made solutions.

Final solution (python script)

Solution? Python script to the rescue. It was surprisingly quick and easy.

It took around 11 seconds to process all 350 PDFs.

python3 -m venv .venv
source .venv/bin/activate
pip install pypdf

from pathlib import Path
import os
from pypdf import PdfReader, PdfWriter

source_dir = Path("source")
source_pdf_list = list(source_dir.rglob("*.pdf"))

for pdf_path in source_pdf_list:
    with open(pdf_path, 'rb') as pdf:
        # read/parse pdf file
        pdf_in = PdfReader(pdf)

        # clean annotations page by page
        pdf_out = PdfWriter()
        for page in pdf_in.pages:
            if page.annotations:
                page.annotations.clear()
            pdf_out.add_page(page)

    # generate destination path, keeping the same directory structure as the source
    # (relative_to avoids rewriting "source" if it also appears in a folder or file name)
    destination_path = Path("cleaned") / pdf_path.relative_to(source_dir)
    os.makedirs(os.path.dirname(destination_path), exist_ok=True)
    # write pdf file
    with open(destination_path, 'wb') as f: 
        pdf_out.write(f)
        print(f"Processed: {destination_path}")

Output:

...

Processed: cleaned/R/GS 4/12.pdf
Processed: cleaned/R/GS 4/22a.pdf
Processed: cleaned/R/GS 4/22.pdf
Processed: cleaned/R/GS 4/19.pdf
Processed: cleaned/R/GS 4/21.pdf
Processed: cleaned/R/GS 4/20.pdf
Processed: cleaned/R/GS 4/17a.pdf
Processed: cleaned/R/GS 4/20a.pdf
Processed: cleaned/R/GS 4/18a.pdf
Processed: cleaned/R/GS 4/12a.pdf
Processed: cleaned/R/GS 4/18.pdf

...
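
The mirrored paths in the log above depend on how each source path is mapped to its destination. Here is a minimal sketch of that mapping, which works on path components rather than on the raw string. The helper name `destination_for` and the example paths are hypothetical. With plain string replacement, a file such as `source/notes/resource.pdf` would come out as `cleaned/notes/recleaned.pdf`, because every occurrence of "source" gets rewritten.

```python
from pathlib import Path


def destination_for(pdf_path: Path, source_dir: Path, dest_dir: Path) -> Path:
    """Map a file under source_dir to the same relative location under dest_dir."""
    # relative_to() strips only the leading source_dir components, so
    # "source" appearing later in the path (e.g. "resource.pdf") is untouched.
    return dest_dir / pdf_path.relative_to(source_dir)


# Hypothetical example paths, for illustration only.
print(destination_for(Path("source/R/GS 4/12.pdf"), Path("source"), Path("cleaned")))
print(destination_for(Path("source/notes/resource.pdf"), Path("source"), Path("cleaned")))
```

`relative_to` raises `ValueError` if the file is not under `source_dir`, so a wrong directory fails loudly instead of writing to an unexpected place.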

That's it. Another everyday problem solved with some programming.
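
As a sanity check after a bulk run like this, it may be worth confirming that every source PDF has a counterpart in the output tree. Below is a minimal sketch using only the standard library. The function name `missing_outputs` is hypothetical, and the directory layout is assumed to match the script above.

```python
from pathlib import Path


def missing_outputs(source_dir: Path, dest_dir: Path) -> list[Path]:
    """Return relative paths of PDFs under source_dir that have no
    file at the same relative path under dest_dir."""
    missing = []
    for pdf_path in sorted(source_dir.rglob("*.pdf")):
        relative = pdf_path.relative_to(source_dir)
        if not (dest_dir / relative).is_file():
            missing.append(relative)
    return missing
```

Calling `missing_outputs(Path("source"), Path("cleaned"))` should return an empty list when every file was written. It only checks that the output files exist, not that their annotations are actually gone.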