This week we answer a burning question we have all asked ourselves at some point: ‘How do I sort through PDFs?’ The answer is simpler than you’d think, thanks to something called Optical Character Recognition (OCR).
I have a whole stack of things that I’d love to move from the physical world to the digital world, so I can then Marie Kondo the original documents and photos into oblivion. Stacks of paper do not bring me joy.
You have a few options you can try. I’d start with an obvious one: Google. Assuming you’re creating PDFs, upload your file(s) to Google Drive. Right-click on any individual PDF, hover your mouse over “Open With,” and select “Google Docs.”
Google will then attempt to run some OCR on your PDF, and you should be able to save the resulting file as a document. You can then search through this document (and any others you convert) via Drive itself.
The more I think about it, though, that solution seems a little inelegant given how many files you have to work with. Instead, I might try a piece of software like TesseractStudio.Net — or just Tesseract OCR, if you don’t fear the command line.
You should be able to use this to create OCR data from your files, and you can then search for them directly via Windows or macOS. OCRmyPDF is another option that’s similar to Tesseract OCR, but, again, you’ll be playing with typed commands to apply OCR to your files. There’s no GUI, nor is there (direct) Windows support.
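If you go the OCRmyPDF route, a short script can save you from typing the same command for every file. Here’s a minimal Python sketch, assuming OCRmyPDF is installed and available on your PATH. The function names and the dry_run switch are my own placeholders rather than anything OCRmyPDF provides; the only parts that belong to the tool are the `ocrmypdf input.pdf output.pdf` invocation and its `--skip-text` option, which leaves pages that already have text alone.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ocr_command(src: Path, dest_dir: Path) -> List[str]:
    """Return the ocrmypdf command line for a single PDF."""
    # --skip-text tells OCRmyPDF not to OCR pages that already contain text.
    return ["ocrmypdf", "--skip-text", str(src), str(dest_dir / src.name)]


def ocr_folder(src_dir: Path, dest_dir: Path, dry_run: bool = False) -> List[List[str]]:
    """OCR every *.pdf in src_dir, writing searchable copies into dest_dir.

    With dry_run=True the commands are only built and returned, not executed.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    commands = [build_ocr_command(pdf, dest_dir) for pdf in sorted(src_dir.glob("*.pdf"))]
    if not dry_run:
        for command in commands:
            # check=True stops the batch at the first file that fails.
            subprocess.run(command, check=True)
    return commands
```

Calling `ocr_folder(Path("scans"), Path("searchable"))` would process a hypothetical "scans" folder and leave the originals untouched. Pass `dry_run=True` first if you’d like to see the commands before anything runs.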
There’s also Paperwork, an open-source document cataloging tool that comes with OCR built right in, which I would definitely consider given that it’s designed to be an all-in-one piece of software for archiving, sorting, and searching documents. That sounds like it might be just what you’re looking for.
I haven’t used PDF-XChange Viewer, but others have recommended it as an option. The free version will drop watermarks into your PDFs, but it can create PDFs from images and, if I’m correct, add OCR to these and any existing PDFs you have.
It’s worth exploring, even if it’s not the ideal (free) solution. Similarly, FreeOCR can take your images or PDFs, apply OCR, and export the results as plain text files or Word documents. If you don’t mind searching through your archives that way, it’s an option.
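If you do end up with a folder full of exported plain text files, searching them doesn’t need anything fancy. Below is a rough Python sketch of a case-insensitive search across every .txt file in a folder and its subfolders. It assumes the exports are UTF-8 text, and the function name is just an illustration, not part of FreeOCR or any other tool mentioned here.

```python
from pathlib import Path
from typing import List, Tuple


def search_text_archive(root: Path, term: str) -> List[Tuple[Path, int, str]]:
    """Return (file, line number, line) for every line containing term."""
    needle = term.casefold()  # casefold() gives a case-insensitive comparison
    hits = []
    for path in sorted(root.rglob("*.txt")):
        # errors="replace" keeps the search going if a file has odd bytes in it.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.casefold():
                hits.append((path, number, line.strip()))
    return hits
```

Something like `search_text_archive(Path("archive"), "invoice")` would list each matching line along with the file it came from, which is enough to point you back to the right scan.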
As for paid solutions, there’s always Adobe Acrobat Pro or Foxit PhantomPDF. Both will allow you to add OCR to PDFs, and you should be able to process all of your documents as a big batch (or create a script that does this with a folder’s worth of contents).
You might even be able to get this all done during the apps’ free trials, if they don’t put limitations on their OCR capabilities. I’ve also seen others with your particular problem find success using an app like PDF OCR, which could be a cheaper alternative.
That’s everything I can think of off the top of my head (and with a little research). Hopefully, one of those solutions works out for you — without costing you a small fortune.
Comments
One response to “Ask LH: How Can I Create A Searchable Archive Of PDFs?”
The Tesseract command line is a little more involved, as it doesn’t take PDF as an input, only as an output, so you really need Ghostscript or ImageMagick to convert the input PDF to images first.
TesseractStudio.NET looks interesting, but isn’t it disingenuous to put it on GitHub when the GitHub project simply points to the binary download on the proprietor’s website (not the source code)?
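On the point about Tesseract not reading PDFs directly, the two-step conversion could look roughly like the Python sketch below. It assumes the Ghostscript (`gs`) and `tesseract` executables are installed and on your PATH, and it produces one searchable PDF per page rather than a single merged file. The function names, the 300 dpi default, and the page-numbering pattern are illustrative choices, not requirements of either tool.

```python
import subprocess
from pathlib import Path
from typing import List


def rasterize_command(pdf: Path, work_dir: Path, dpi: int = 300) -> List[str]:
    """Ghostscript command that renders each PDF page to a numbered PNG."""
    pattern = work_dir / f"{pdf.stem}-%03d.png"  # %03d is filled in by Ghostscript
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={pattern}",
        str(pdf),
    ]


def tesseract_command(image: Path, lang: str = "eng") -> List[str]:
    """Tesseract command that writes a PDF with a text layer next to the image."""
    output_base = image.with_suffix("")  # Tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_pdf(pdf: Path, work_dir: Path) -> List[Path]:
    """Rasterize pdf into work_dir, OCR each page, and return the page PDFs."""
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf, work_dir), check=True)
    pages = sorted(work_dir.glob(f"{pdf.stem}-*.png"))
    for page in pages:
        subprocess.run(tesseract_command(page), check=True)
    return [page.with_suffix(".pdf") for page in pages]
```

Stitching the per-page results back into one document is left out here, which is a fair argument for a wrapper that handles the whole round trip for you.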