Extolling the virtues of gscan2pdf and Linux for Windows users

(We recommend you bookmark this blog. Trying to find helpful information in forums usually just ends in making you frustrated and angry. This blog does not speculate or guess. If we post it, we've tried it and it works. As with all our posts, if you don't care about the background info, you can head straight to the yellow highlighted sections for the problem and the fix.)

Over the years I have scanned and converted literally thousands of pdf files. Most of those scans and conversions have been split between on-the-job tasks that required conversions of oil and gas leases and other documents, and personal projects in which I have converted hundreds of books, magazines and articles for various reasons that I will not go into here. Suffice it to say that I have spent thousands of hours working with pdf files, and when it comes to doing OCR on imaged files my favorite program is gscan2pdf.

I am tempted to say that it is a shame there is no Windows version of gscan2pdf, but as someone who does not believe that any program works as well in Windows as it does in a Linux environment, much less better, I will forgo that lament.

Just one personal story and example here. I have rarely gone near a Windows machine since I left the corporate world to work for myself almost 10 years ago, but my wife still toils as a contractor in the O&G industry, mostly remotely these days. That means she is pretty much obliged to work in a Windows environment on a Windows machine, and as a contractor she must furnish her own computer and software. Now, if any of you have worked with oil and gas documents you will recognize what an awful mess they are from a document-handling, storage and reporting perspective. We are talking about documents that were created 50, 60, 70 years ago, or longer. Some were handwritten, most were typed, and some were typed only to have words, sentences and whole paragraphs crossed out and replaced by handwritten notes.

Of course, virtually all those hundreds of thousands of documents created long ago have now been scanned as pdf files, and therein lies the rub. Many of them were scanned when pdf technology was new and not very sophisticated, thus many were scanned as unsearchable image files, and many of the scans contain both typed and handwritten information. (For some completely unfathomable reason, many continue to be scanned as image files, but that's another story for another day.)

Now, with the advent and coming of age of electronic storage systems and programs, data must be extracted from all those pdf files, either by machine-searching machine-readable documents or by having a human being read each document and transcribe the information into the electronic storage and reporting systems. Naturally, having a machine do the work is preferable to having a person do it, but the history and nature of the O&G industry is such that the vast majority of documents must be "touched" by human minds and hands before they can be processed, stored and retrieved electronically.

But enough of that. Suffice it to say that many, many people currently working in the O&G industry are busy day after day, month after month and year after year either preparing old documents for electronic processing, or simply reading the documents and entering data manually into modern document handling, storage and retrieval systems. Ergo, the venerable pdf file either helps or hampers the entire operation (too often it does the latter), but make no mistake about it: the pdf is currently the feedstock of virtually all electronic document processing.

Of course, the O&G industry example above applies to all historical documentation in the business world, as it does to the worlds of government, literature, art, and so on. Old writings produced with a fountain pen, a typewriter or a linotype on paper are extremely difficult to capture electronically, especially when the goal is a clean, error-free scan or conversion of the document.

Here is where we return to the central point of this article. Any job or task that involves converting information from pre-1990s (roughly) paper documents into pdf documents is onerous, and it becomes even more so the older the paper document is. For that reason, you need the best OCR program you can get your hands on, and believe me, the OCR engines out there range from the ones that do a crappy job to the ones that do a much better one. Among the latter is the tesseract-based gscan2pdf.

Back to the O&G example for just a minute. Often my wife runs across a pdf version of a document that was created 50 years ago. The pdf is a scan of a typewritten document. Now, for those who remember typewriters, especially the manual typewriter, you know that the ink often is blurred on the paper, not so badly that it is illegible, but certainly compared to the clean, crisp type that comes out of a laser printer today. Typewriter type is difficult for OCR to recognize properly. Then throw in all the crossouts and handwritten notes and corrections, and accurate OCR becomes really difficult.

My wife opted to purchase from the Microsoft store a pdf program with OCR capabilities. The program claimed to do the same things as the "industry-leading" pdf program, but at a somewhat lower price. The company that produces the program is also a Microsoft Silver Partner, so it seemed like a safe bet to give it a trial. I concurred with her that it was worth a try, but neither of us bothered to ask what kind of OCR engine was behind the program. We still don't know, but suffice it to say that most of its OCR conversions are completely useless. The OCR does not recognize any word that is not crystal clear; it does not distinguish well between handwriting and typing; and it seems completely flummoxed by tables and graphics. The output is, again, a useless mish-mash of unsearchable, illegible . . . well, I'm not sure what to call it.

Enter gscan2pdf. The first time my wife had a disastrous OCR conversion experience using her rather expensive program on her Windows 10 machine, I asked her to share the pdf with me so I could give gscan2pdf a go at converting the image file into a readable, searchable pdf. Sure enough, the gscan2pdf version looked pretty much identical to the original, including the placement of images, the construction of tables, signature lines with signatures, etc. It took all of about four minutes to provide her with a searchable document, saving her a considerable amount of time she would otherwise have spent reading the entire document word for word to find keywords and phrases for entry into the electronic storage and reporting system.
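
If you prefer the terminal to the gscan2pdf GUI, the same basic conversion can be scripted. Below is a minimal sketch of roughly what such a pipeline looks like, assuming poppler-utils (for pdftoppm) and tesseract-ocr are installed; the file names scan.pdf and searchable are placeholders, and this is not the exact pipeline gscan2pdf runs internally.

```python
#!/usr/bin/env python3
# Minimal sketch: turn an image-only PDF into a searchable PDF using
# pdftoppm (poppler-utils) and tesseract. File names are placeholders.
import glob
import os
import subprocess
import tempfile

SRC = "scan.pdf"     # image-only PDF to be made searchable
OUT = "searchable"   # tesseract appends ".pdf" to this base name

with tempfile.TemporaryDirectory() as tmp:
    # 1. Rasterize each PDF page to PNG at 300 dpi (a common OCR resolution).
    subprocess.run(
        ["pdftoppm", "-r", "300", "-png", SRC, os.path.join(tmp, "page")],
        check=True,
    )

    # 2. Collect the page images for tesseract (pdftoppm zero-pads page
    #    numbers, so a sorted glob keeps them in order).
    pages = sorted(glob.glob(os.path.join(tmp, "page-*.png")))
    listfile = os.path.join(tmp, "pages.txt")
    with open(listfile, "w") as f:
        f.write("\n".join(pages))

    # 3. Run tesseract over the whole list and emit a PDF with a hidden,
    #    searchable text layer laid over the page images.
    subprocess.run(["tesseract", listfile, OUT, "pdf"], check=True)

print(f"Wrote {OUT}.pdf")
```

The resulting searchable.pdf keeps the original page images and lays an invisible text layer over them, which is what makes keyword searches possible in the converted document.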

Why You Should Always Have a Linux Machine Available

There are some things you simply cannot do well, and certainly cannot do inexpensively, in a Windows environment. Working with pdf files is one of those things. You can spend upwards of one or two hundred dollars for the desktop version of a pdf program, or $10-$20 per month in perpetuity for the cloud version, or you can use free, open-source software (you really should make at least a small contribution) to do the same things that the commercial program will do. That is why I would always have a Linux machine available, even if forced to use Windows by an employer. And that is why employers should always make sure there is a Linux machine handy.
