Producing a Searchable PDF

September 10, 2020

Convert the PDF to PPM images, one per page. For -r, use the DPI the document was scanned at

pdftoppm -r 200 inname.pdf outname
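The output names matter for the later steps. As far as I can tell, pdftoppm writes outname-N.ppm with N zero-padded to the number of digits in the page count; treat that padding rule as an assumption and check it against your own output. A small sketch that predicts the name (ppm_name is a made-up helper, not part of poppler):

```shell
# ppm_name ROOT PAGE TOTAL_PAGES
# Print the file name pdftoppm is expected to use for one page, ASSUMING
# the page number is zero-padded to the digit count of TOTAL_PAGES.
ppm_name() (
    root=$1
    page=$2
    total=$3
    width=${#total}    # number of digits in the page count
    printf "%s-%0${width}d.ppm\n" "$root" "$page"
)

# Example: page 7 of a 120-page document
# ppm_name outname 7 120
```

The page count itself can be read from the "Pages:" line that pdfinfo inname.pdf prints.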

mkdir GRAY

Convert the images to grayscale and make them sharper

for image in *.ppm; do
    # Overwrite each page with its grayscale version
    convert "$image" -grayscale Rec709Luminance "$image"
    echo "$image converted to grayscale"
    # Stretch the contrast, sharpen, and write the result into GRAY/
    convert "$image" -normalize -level 10%,90% -sharpen 0x1 "GRAY/processed_$image"
    echo "$image contrast increased and sharpened"
done
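If you would rather leave the original pages untouched, the two convert calls can probably be folded into one. This is a sketch, not a tested replacement: it assumes ImageMagick 6's convert and that applying the same options in the same order in a single call gives an equivalent result. enhance_page, processed_path and enhance_all are names made up for illustration.

```shell
# enhance_page IN OUT
# Grayscale, normalize, level and sharpen in one pass, leaving IN unchanged.
enhance_page() {
    convert "$1" -grayscale Rec709Luminance \
        -normalize -level 10%,90% -sharpen 0x1 "$2"
}

# processed_path FILE
# Print the output path used for FILE, following the GRAY/processed_ naming.
processed_path() {
    printf 'GRAY/processed_%s\n' "$(basename "$1")"
}

# enhance_all
# Process every .ppm in the current directory into GRAY/.
enhance_all() {
    mkdir -p GRAY
    for image in *.ppm; do
        [ -e "$image" ] || continue    # glob matched nothing
        enhance_page "$image" "$(processed_path "$image")" &&
            echo "$image processed"
    done
}

# Usage, from the directory holding the pages:
# enhance_all
```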

mkdir DESKEW

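Deskew the pages with unpaper. %03d stands for a three-digit, zero-padded page number, so replace name with the prefix of your own files and make sure the number of digits matches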

unpaper name-%03d.ppm DESKEW/post-%03d.ppm
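The input pattern only works if its digit count matches the real file names. A rough way to derive it from one of the page files, assuming names of the form prefix-NUMBER.ppm (page_pattern is a hypothetical helper, not an unpaper feature):

```shell
# page_pattern FILE
# Turn a page file name such as outname-07.ppm into the printf-style
# pattern outname-%02d.ppm. Assumes the form PREFIX-DIGITS.ppm.
page_pattern() (
    name=$1
    stem=${name%.ppm}      # outname-07
    digits=${stem##*-}     # 07
    root=${stem%-*}        # outname
    printf '%s-%%0%dd.ppm\n' "$root" "${#digits}"
)

# Usage:
# unpaper "$(page_pattern outname-07.ppm)" DESKEW/post-%03d.ppm
```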

mkdir PNG

Convert from PPM to PNG

for image in *.ppm; do
    # Writes PNG/<name>.ppm.png; the rename step below drops the .ppm part
    convert "$image" "PNG/$image.png"
    echo "$image converted to $image.png"
done

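Rename the files to drop the leftover .ppm from each name (page.ppm.png becomes page.png). Run this from inside the PNG directory, otherwise the pattern also matches the original .ppm files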

mmv '*.ppm*' '#1#2'
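If mmv is not installed, plain shell parameter expansion can do the same rename. A minimal sketch, assuming the files end in .ppm.png (strip_ppm_suffix is just an illustrative name):

```shell
# strip_ppm_suffix DIR
# Rename every NAME.ppm.png in DIR to NAME.png.
strip_ppm_suffix() (
    cd "$1" || exit 1
    for f in *.ppm.png; do
        [ -e "$f" ] || continue    # glob matched nothing
        mv -- "$f" "${f%.ppm.png}.png"
    done
)

# Usage:
# strip_ppm_suffix PNG
```

The function body runs in a subshell, so the cd does not change the directory of the calling shell.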

Collect the absolute path of each .ppm image into files.txt, one per line. The loop appends, so delete any old files.txt before rerunning

for image in *.ppm; do
    find "$PWD" -name "$image" >> files.txt
done
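The same list can be built without find, since the glob already expands in sorted order and zero-padded page names therefore come out in page order. A sketch under that assumption (list_pages is a made-up helper):

```shell
# list_pages DIR
# Print the absolute path of every .ppm file in DIR, one per line,
# in the order the shell glob returns them.
list_pages() (
    cd "$1" || exit 1
    for image in *.ppm; do
        [ -e "$image" ] || continue    # glob matched nothing
        printf '%s/%s\n' "$PWD" "$image"
    done
)

# Usage (overwrites files.txt instead of appending):
# list_pages . > files.txt
```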

Use this file as the input for the tesseract command. The pdf at the end tells tesseract to write finaloutname.pdf with a searchable text layer

tesseract files.txt finaloutname -l eng pdf
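Before starting a long OCR run it may be worth checking that every path in the list still exists. A rough sketch (check_list is a hypothetical helper, not a tesseract option):

```shell
# check_list LISTFILE
# Return non-zero if LISTFILE is empty or missing, or if it names a file
# that does not exist. Missing paths are reported on stderr.
check_list() (
    [ -s "$1" ] || { echo "$1 is empty or missing" >&2; exit 1; }
    status=0
    while IFS= read -r path; do
        [ -n "$path" ] || continue    # skip blank lines
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            status=1
        fi
    done < "$1"
    exit "$status"
)

# Usage:
# check_list files.txt && tesseract files.txt finaloutname -l eng pdf
```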