Producing a Searchable PDF
September 10, 2020
Convert the PDF to PPM images. For -r, use the DPI at which the document was scanned.
pdftoppm -r 200 inname.pdf outname
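The names of the files this step writes matter for the globs and patterns used later. As a rough sketch, assuming Poppler's pdftoppm (which, as far as I can tell, zero-pads the page number to the width of the document's page count), the hypothetical helper ppm_name below predicts the name of a given page. It is not part of pdftoppm and is only here for illustration.

```shell
# Hypothetical helper, not part of pdftoppm: predict the output name of one page.
# Assumption: page numbers are zero-padded to the width of the total page count.
ppm_name() {
  prefix=$1   # the output prefix passed to pdftoppm, e.g. outname
  page=$2     # page number, written without leading zeros
  total=$3    # total number of pages in the PDF
  # ${#total} is the number of digits in the page count, used as the pad width.
  printf "%s-%0${#total}d.ppm\n" "$prefix" "$page"
}

ppm_name outname 7 120   # prints outname-007.ppm
ppm_name outname 7 12    # prints outname-07.ppm
```

If your build of pdftoppm names files differently, check with ls after the conversion and adjust the later patterns to match.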
mkdir GRAY
Convert the images to grayscale and make them sharper
for image in *.ppm; do
  convert "$image" -grayscale Rec709Luminance "$image"
  echo "$image converted to grayscale"
  convert "$image" -normalize -level 10%,90% -sharpen 0x1 "GRAY/processed_$image"
  echo "$image contrast increased and sharpened"
done
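The two convert calls can also be chained into one, which leaves the original PPM files untouched. This is a minimal sketch, assuming ImageMagick's convert is on your PATH. The function name enhance_pages is made up for this example.

```shell
# Hypothetical helper: grayscale, stretch contrast and sharpen every PPM in the
# current directory, writing the results to GRAY/ and leaving the inputs as they are.
enhance_pages() {
  command -v convert >/dev/null 2>&1 || {
    echo "convert (ImageMagick) not found" >&2
    return 1
  }
  mkdir -p GRAY
  for image in *.ppm; do
    [ -e "$image" ] || continue   # no PPM files: the glob stays literal, skip it
    convert "$image" -grayscale Rec709Luminance -normalize \
      -level 10%,90% -sharpen 0x1 "GRAY/processed_$image" || return 1
    echo "$image converted to grayscale, contrast increased and sharpened"
  done
}
```

Run enhance_pages in the directory that holds the PPM files. Because the input is never overwritten, you can rerun it with different -level or -sharpen values and compare the results.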
mkdir DESKEW
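Deskew the pages. Replace name with the prefix your PPM files actually have, and make sure %03d matches how their page numbers are padded.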
unpaper name-%03d.ppm DESKEW/post-%03d.ppm
mkdir PNG
Convert from PPM to PNG
for image in *.ppm; do
  convert "$image" "PNG/$image.png"
  echo "$image converted to $image.png"
done
Rename the PNG files to drop the leftover .ppm from their names (run this inside the PNG directory)
mmv '*.ppm*' '#1#2'
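If mmv is not installed, the same rename can be done with plain shell parameter expansion. This is a sketch under the assumption that the files sit in PNG and are named like page.ppm.png. The function name strip_ppm_suffix is hypothetical.

```shell
# Hypothetical helper: rename page.ppm.png to page.png inside a directory.
strip_ppm_suffix() {
  dir=${1:-PNG}                        # directory to work in, PNG by default
  for f in "$dir"/*.ppm.png; do
    [ -e "$f" ] || continue            # nothing matched: skip the literal pattern
    mv -- "$f" "${f%.ppm.png}.png"     # cut the .ppm.png tail, then append .png
  done
}
```

Call it as strip_ppm_suffix PNG. Files that do not end in .ppm.png are left alone.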
Collect the full path of every PPM file in the directory into files.txt
for image in *.ppm; do
  find "$PWD" -name "$image" >> files.txt
done
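Because the loop appends with >>, running it twice lists every page twice. A small sketch that starts from an empty list each time and needs no find is shown below. The function name list_images and its arguments are made up for this example.

```shell
# Hypothetical helper: write the path of every matching image, one per line.
# Usage: list_images "$PWD" ppm files.txt
list_images() {
  dir=$1   # directory to scan, use an absolute path such as "$PWD"
  ext=$2   # file extension without the dot, e.g. ppm
  out=$3   # list file to write
  : > "$out"                           # truncate first so reruns do not pile up duplicates
  for image in "$dir"/*."$ext"; do
    [ -e "$image" ] || continue        # nothing matched: skip the literal pattern
    printf '%s\n' "$image" >> "$out"
  done
}
```

The glob expands in sorted order, so the pages come out in the right sequence as long as their numbers are zero-padded.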
Use this file as the input for the tesseract command
tesseract files.txt finaloutname -l eng pdf