Producing a Searchable PDF

Convert the PDF to PPM images. For -r, use the DPI the document was scanned at.

pdftoppm -r 200 inname.pdf outname

mkdir GRAY

Convert the images to grayscale and make them sharper

for image in *.ppm; do
    # Overwrite each page with its grayscale version
    convert "$image" -grayscale Rec709Luminance "$image"
    echo "$image converted to grayscale"
    # Stretch the contrast and sharpen, writing the result to GRAY/
    convert "$image" -normalize -level 10%,90% -sharpen 0x1 "GRAY/processed_$image"
    echo "$image contrast increased and sharpened"
done

mkdir DESKEW

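Deskew the pages. Replace name with the prefix of your page files; %03d stands for the zero-padded page number, so adjust its width to match your file names.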

unpaper name-%03d.ppm DESKEW/post-%03d.ppm
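The %03d in the unpaper file names is a printf-style placeholder. A small sketch for listing the names a pattern stands for, so they can be compared with the files that are actually on disk (the helper name expand_pattern is made up for this example):

```shell
# expand_pattern PATTERN FIRST LAST
# Print the file names a printf-style pattern such as name-%03d.ppm
# stands for, one per line, for pages FIRST through LAST.
expand_pattern() {
    page=$2
    while [ "$page" -le "$3" ]; do
        # The pattern is deliberately used as the printf format string.
        printf "$1\n" "$page"
        page=$((page + 1))
    done
}
```

For example, expand_pattern 'name-%03d.ppm' 1 3 prints name-001.ppm, name-002.ppm and name-003.ppm.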

mkdir PNG

Convert from ppm to png

for image in *.ppm; do
    # Writes PNG/<name>.ppm.png; the extra .ppm is removed in the rename step
    convert "$image" "PNG/$image.png"
    echo "$image converted to $image.png"
done

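Rename the files to drop the leftover .ppm from names like page.ppm.png. Run this in the directory that holds those files.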

mmv '*.ppm*' '#1#2'
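If mmv is not installed, the same rename can be done with plain shell parameter expansion. A minimal sketch, assuming the files to fix are the *.ppm.png files written by the conversion step; the function name strip_ppm_infix is made up for this example:

```shell
# strip_ppm_infix DIR
# Rename every DIR/<name>.ppm.png to DIR/<name>.png.
strip_ppm_infix() {
    for f in "$1"/*.ppm.png; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || continue
        # ${f%.ppm.png} removes the double extension from the end of the path.
        mv -- "$f" "${f%.ppm.png}.png"
    done
}
```

Called as strip_ppm_infix PNG, it leaves files that do not end in .ppm.png untouched.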

Write the full path of every page image in the directory to files.txt, one per line

for image in *.ppm; do
    # Append the absolute path of this page to the list
    find "$PWD" -name "$image" >> files.txt
done
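The list can also be built without find. A sketch, assuming the page numbers are zero-padded so that the shell's sorted glob order matches page order; write_file_list is a hypothetical helper name. It empties the list first, so running it again does not append duplicate entries.

```shell
# write_file_list DIR OUTFILE
# Write the absolute path of every .ppm file in DIR to OUTFILE, one per line.
write_file_list() {
    # Resolve DIR to an absolute path without changing the caller's directory.
    abs_dir=$(cd "$1" && pwd) || return 1
    # Start from an empty list.
    : > "$2"
    for f in "$abs_dir"/*.ppm; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || continue
        printf '%s\n' "$f" >> "$2"
    done
}
```

Called as write_file_list . files.txt, it produces the same kind of list as the loop above.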

Use this file as the input for the tesseract command. The trailing pdf tells tesseract to write a searchable PDF, here finaloutname.pdf.

tesseract files.txt finaloutname -l eng pdf

