Please suggest a Windows program to add OCR to a pdf document

ottavio
Level 2
Posts: 19
Joined: Thu May 17, 2018 11:21 am

Post by ottavio »

I'm desperately trying to add OCR to a long (300+ pages) pdf ebook with no text recognition. I've tried several native programs for Linux but they all failed.

I've never done it on Windows and I wonder if there is one that you could recommend and would work fine under WINE.
DarkShadow44
Level 8
Posts: 1207
Joined: Tue Nov 22, 2016 5:39 pm

Re: Please suggest a Windows program to add OCR to a pdf doc

Post by DarkShadow44 »

What exactly are you trying to achieve here? I'd assume there is text inside the PDF, not images. So you wouldn't need OCR, but a method of extracting the text from the PDF.
ottavio
Level 2
Posts: 19
Joined: Thu May 17, 2018 11:21 am

Re: Please suggest a Windows program to add OCR to a pdf doc

Post by ottavio »

DarkShadow44 wrote: What exactly are you trying to achieve here? I'd assume there is text inside the PDF, not images. So you wouldn't need OCR, but a method of extracting the text from the PDF.
Yes, the pdf is 99% text (with some odd images) but it's not searchable. I've downloaded it off the net. I want to be able to copy text from it. I've tried all possible pdf tools available for Debian Jessie.
ldkraemer
Level 1
Posts: 6
Joined: Sat Oct 20, 2018 12:19 pm

Re: Please suggest a Windows program to add OCR to a pdf doc

Post by ldkraemer »

If you install pdftk, you can use it to burst the PDF into single-page PDF files.

Code:

pdftk yourdocument.pdf burst
Now, you will have the 300 pages as pg_0001.pdf, pg_0002.pdf, pg_0003.pdf, etc.

Convert the PDFs to TIFF, or whatever format your OCR program requires.
Convert is a program installed with ImageMagick. It rasterizes PDFs at 72 DPI
by default, so you will need to raise that to 300 DPI. You may also need to
reduce the images to 1-bit black and white (-monochrome, -depth 1) to get your
OCR software to accept the TIFFs.

Code:

convert -density 300 pg_0001.pdf pg_0001.tiff
convert -density 300 pg_0002.pdf pg_0002.tiff
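With 300 pages you will not want to type one convert line per page. Here is a rough sketch of a loop over the burst output. It assumes the pg_0001.pdf, pg_0002.pdf, ... names from pdftk burst and that ImageMagick's convert is on the PATH; the function name pdf_pages_to_tiff is just made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdftk or ImageMagick): rasterize every
# burst page in the current directory to a TIFF of the same base name.
pdf_pages_to_tiff() {
    dpi="${1:-300}"                    # rasterizing resolution, 300 DPI if not given
    for page in pg_*.pdf; do
        [ -e "$page" ] || continue     # skip the literal pattern when nothing matches
        # -density is given before the input file so it applies while the PDF is rendered
        convert -density "$dpi" "$page" "${page%.pdf}.tiff"
    done
}
```

Run it in the directory that holds the burst pages, for example with pdf_pages_to_tiff 300.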
Then use tesseract to OCR the TIFFs.

Code:

tesseract pg_0001.tiff pg_0001
tesseract pg_0002.tiff pg_0002
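The same goes for the OCR step. A minimal sketch of a loop, assuming the pg_NNNN.tiff files from the conversion step and an installed language pack for tesseract (eng is only a guess at the book's language); ocr_pages is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: run tesseract on every page image in the current directory.
# tesseract appends .txt to the output base name, so pg_0001.tiff gives pg_0001.txt.
ocr_pages() {
    lang="${1:-eng}"                   # tesseract language code, assumed English
    for image in pg_*.tiff; do
        [ -e "$image" ] || continue    # skip the literal pattern when nothing matches
        tesseract "$image" "${image%.tiff}" -l "$lang"
    done
}
```

Call it as ocr_pages, or pass another language code such as ocr_pages deu if that language pack is installed.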
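If your OCR program requires BMP, TIFF, GIF, or some other format, convert to whatever it requires. Tesseract writes the recognized text to pg_0001.txt, pg_0002.txt, and so on, which you can then view: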

Code:

cat pg_0001.txt
cat pg_0002.txt
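To end up with one text file for the whole book instead of 300 small ones, something like the following should do. It is only a sketch: merge_ocr_text and the default name book.txt are invented for the example, and it relies on the zero-padded pg_NNNN.txt names sorting in page order.

```shell
#!/bin/sh
# Hypothetical helper: concatenate the per-page OCR output into one file.
merge_ocr_text() {
    out="${1:-book.txt}"               # output file name, book.txt if not given
    : > "$out"                         # start from an empty output file
    for text in pg_*.txt; do
        [ -e "$text" ] || continue     # skip the literal pattern when nothing matches
        cat "$text" >> "$out"
    done
}
```

Pick an output name that does not itself match pg_*.txt, otherwise the loop would read its own output.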
Tesseract-OCR version 3 seems to get above 90% accuracy on most documents. TextBridge 2.0
Classic also does a fine job, but requires BMP (or one of its other accepted formats).
If the PDF is terrible and neither of these packages works, install IrfanView and the
KADMOS OCR plugin. It will work, but you might have to draw a box around each section
of text, then start the OCR plugin and redraw the box for the text you want. One thing
you must do is save the OCR text first and then edit it. You cannot edit the OCR text
and then save the edited file. That is the only limitation, except that the KADMOS plugin
is only for 32-bit Linux distros. Someday it may work on 64-bit distros.

If you have a sample page we can see what works best in your case.

Thanks.

Larry