Converting an image pdf file to a searchable text pdf file in a Linux environment
Okay, so that's a really long title for a blog post, but sometimes it takes many words to explain what you are actually doing. I learned that lesson by spending a lot of time on mostly worthless forums, where people rarely manage to write a subject line that has anything to do with their issue.
At any rate, some background. I love downloading public-domain (mostly) books and documents, but often they are scanned as image files. As I am a writer and want to use quotes from the pdf, it is much easier if I convert the picture pdf to a text pdf so I can copy and paste, rather than re-typing the quoted material.
There are lots of ways to go about this conversion task, but often they require buying conversion software or paying to play in the cloud. I hate spending money on work stuff, so here's my simple, quick solution.
- Install gscan2pdf. In Ubuntu, you can do that from the Ubuntu Software Centre or GNOME Software. If you prefer installing from a terminal, have at it, but we won't describe the "how to" of terminal installation here.
- Open your file browser and find the file you want to OCR. Right-click on the file and choose Open With gscan2pdf. If a dialog reports that no scanner was detected, just close it. You aren't going to scan anything on the scanner.
- gscan2pdf will proceed to load all the pages of the image pdf.
- Click on Tools, OCR.
- Select Page Range, then click OCR (I leave the other settings at default).
- Click on the OCR tab to watch as the OCR is being performed. Don't be frightened: it looks like an awful jumble, but the end product won't be jumbled.
- After the last page is OCR'd, click File, Save. Select the format to which you want to save the OCR'd file. In this case, as I simply want a pdf file with searchable text that I can copy, highlight and annotate, I chose to save the file as PDF. (Bonus: you also can now convert the pdf to a text file that can be edited in a word processor.)
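If you want to script the bonus step (turning the finished PDF into a plain text file), here is a minimal Python sketch. It is not part of gscan2pdf. It assumes the separate `pdftotext` command-line tool from poppler-utils is installed, and the function names `pdftotext_command` and `pdf_to_text` are made up for this illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path=None, keep_layout=True):
    """Build the pdftotext command line (hypothetical helper, not a gscan2pdf feature)."""
    pdf = Path(pdf_path)
    # Default to the same file name with a .txt extension.
    txt = Path(txt_path) if txt_path else pdf.with_suffix(".txt")
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the original physical layout as best it can.
        cmd.append("-layout")
    cmd += [str(pdf), str(txt)]
    return cmd


def pdf_to_text(pdf_path, txt_path=None):
    """Run pdftotext on an already OCR'd PDF and return the path of the text file."""
    # Assumption: pdftotext (from poppler-utils) is on the PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; it is part of poppler-utils")
    cmd = pdftotext_command(pdf_path, txt_path)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Calling `pdf_to_text("handbook.pdf")` (an example file name) should write `handbook.txt` next to the PDF. If the text file comes out empty, the PDF probably still has no text layer and the OCR step needs to be run first.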
The file that I converted from an image to a text pdf was a 35-page Employee Handbook from 1940. The whole process took less than five minutes from start to finish.
gscan2pdf also works well with the scanner in my Epson MFP when I need to scan documents from scratch.