Converting a DRM-Protected PDF to an Editable Text PDF in the Linux Environment
(We recommend you bookmark this blog. Trying to find helpful information in forums usually just ends in making you frustrated and angry. This blog does not speculate or guess. If we post it, we've tried it and it works. As with all our posts, if you don't care about the background info, you can head straight to the yellow highlighted sections for the problem and the fix.)
Scenario and Caveat
I download and convert a LOT of public domain books. Sometimes the scanned pdf version is DRM-protected, for some reason that defies all logic for un-copyrighted works in the public domain, and probably defies all legality. I use Okular for my PDF reader/editor. I want to be able to open the pdf in Okular and make bookmarks and annotations. I also want to be able to highlight certain passages, and to be able to copy and paste certain sentences or paragraphs. Finally, often my end game is to produce a docx or odt version. I cannot do that as long as the pdf file is DRM-protected. I need a way to convert the DRM-Protected PDF to a readable, editable PDF. And I want to do the conversion using nothing but free, open-source Linux software.
Here is where the caveat comes in. In order to use the solution below, the book must also be available as a .djvu file, which is often the case. So you need to download both the pdf and the djvu files, though you could download just the djvu file if your sole purpose is to convert it to a readable pdf file. I always download both because the pdf can be useful strictly as a reference copy.
The Fix
You will need the following (recommended) software, which anyone who uses their Linux computer for actual work ought to have installed. Without going into detail as to why, I strongly recommend that you use your OS's software store to install the native versions of these packages rather than the snap or flatpak versions:
- Okular
- gscan2pdf
- Calibre
Steps:
- Download the pdf and djvu versions of the file.
- Launch the gscan2pdf program and Open the djvu version of the file. (If the file is, for example, a very long book, and you only need a chapter or section of the book, designate the page numbers you want to open).
- Select Tools, OCR.
- After the OCR process completes, select File, Save, PDF. Be sure to change any metadata on the screen to whatever makes sense for the file you are saving (Title, Subject, Keywords, etc.).
- Open the newly-saved PDF file in Okular. Try selecting some text. If you can do so, then you have succeeded in turning your un-editable PDF into one which you can edit.
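If you would rather script this than click through gscan2pdf, here is a rough command-line sketch of the same idea in Python. To be clear, this is NOT the gscan2pdf method described above, and unlike the steps above it has not been tried on this blog. It assumes three extra tools are installed: ddjvu (from DjVuLibre) to render the djvu to an image-only PDF, ocrmypdf to add the text layer, and pdftotext (from poppler-utils) to check the result. The function names and the "-raw.pdf" intermediate file name are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(djvu_path, pdf_path, pages=None, title=None):
    """Return two command lines: DjVu -> image-only PDF, then OCR."""
    out = Path(pdf_path)
    # Hypothetical name for the intermediate image-only PDF,
    # written next to the final output file.
    raw = out.with_name(out.stem + "-raw.pdf")

    render = ["ddjvu", "-format=pdf"]
    if pages:
        # Page range such as "10-25", for when only a chapter is needed.
        render.append(f"-page={pages}")
    render += [str(Path(djvu_path)), str(raw)]

    ocr = ["ocrmypdf"]
    if title:
        # Set the PDF title metadata, like the Save dialog step above.
        ocr += ["--title", title]
    ocr += [str(raw), str(out)]
    return [render, ocr]


def has_text_layer(pdf_path):
    """True if pdftotext extracts any non-whitespace text from the PDF."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())


def convert(djvu_path, pdf_path, pages=None, title=None):
    """Run both commands, then report whether the output has a text layer."""
    for tool in ("ddjvu", "ocrmypdf", "pdftotext"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")
    for cmd in build_commands(djvu_path, pdf_path, pages, title):
        subprocess.run(cmd, check=True)
    return has_text_layer(pdf_path)
```

Calling convert("book.djvu", "book.pdf", pages="10-25", title="My Book") would run the two commands in order and return True if text can be extracted from the result. That is the scripted version of the "try selecting some text in Okular" check. Building the command lists separately from running them makes it easy to print and inspect the commands before anything touches your files.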
Other Potentially Important Stuff:
Q: Instead of opening the djvu file in gscan2pdf, why don't you just open the DRM-protected pdf instead?
A: Three reasons. First, gscan2pdf won't OCR a DRM-protected file. Second, when you open a DRM-protected file in gscan2pdf, two extra pages are inserted for each original page, which means that a 300-page document is now a 900-page document and you must manually delete the 600 pages you don't want or need. Third, for some unfathomable reason the imported pages are all "negatives." While it is no big deal to "un-negate" them, it IS annoying that this happens. Opening the djvu file does not add extra pages and it does not negate the pages.
Q: I downloaded Calibre, but you did not use it in your instructions. Do I really need it?
A: Touché. Sometimes my ultimate goal is to convert a long document into docx or odt format. I use Calibre for that conversion (djvu to epub, then epub to docx). If you don't need Calibre for that purpose, you can uninstall it, though it is worth keeping if you want a great way to manage your ebook library.
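For anyone who prefers the terminal to the Calibre window, the two-step conversion can be sketched with Calibre's ebook-convert command-line tool, which picks the input and output formats from the file extensions. This is a minimal sketch, not a tested recipe: the function names are made up for illustration, and as far as I know Calibre's djvu input relies on the djvu file already containing an embedded text layer, so results will vary from file to file.

```python
import subprocess
from pathlib import Path


def calibre_commands(djvu_path, out_dir="."):
    """Return the two ebook-convert calls: djvu -> epub, then epub -> docx."""
    djvu = Path(djvu_path)
    out = Path(out_dir)
    epub = out / (djvu.stem + ".epub")
    docx = out / (djvu.stem + ".docx")
    return [
        ["ebook-convert", str(djvu), str(epub)],
        ["ebook-convert", str(epub), str(docx)],
    ]


def djvu_to_docx(djvu_path, out_dir="."):
    """Run both conversions in order and return the path of the docx file."""
    cmds = calibre_commands(djvu_path, out_dir)
    for cmd in cmds:
        # check=True stops the chain if the first conversion fails.
        subprocess.run(cmd, check=True)
    return Path(cmds[-1][-1])
```

Calling djvu_to_docx("book.djvu") would leave book.epub and book.docx in the current directory, provided ebook-convert is on your PATH.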