Grab Just the Text from Documents with Text Mining Tool
Posted by Kevin Purdy at 1:00 AM on January 19, 2008

Windows only: Free copying utility Text Mining Tool grabs just the text out of Word documents, PDFs, HTML pages, and other documents, without the hassle of opening each file, selecting everything, and hoping embedded images don't leave strange markers in the text. Once the text is extracted, you can either save it as a text file or copy it to the clipboard. That might not sound all that helpful until you've tried to select multiple pages' worth of text from a scanned PDF, or tried to grab text from around awkward Flash boxes on web sites. Text Mining Tool unzips to a folder that can be put anywhere, and it comes with a command line tool for your batch-script-writing pleasure. Text Mining Tool is a free download for Windows only. For similar copy power from the selection screen, try DragKing.
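For the batch-minded, here is a rough sketch of how a command line extractor could be driven over a whole folder. The post doesn't give the executable name or argument order for Text Mining Tool's command line tool, so the sketch takes the command as a parameter and assumes it accepts an input path followed by an output path; `batch_extract` and its parameters are hypothetical names, and the real tool's documentation should be checked before relying on this.

```python
import subprocess
from pathlib import Path


def batch_extract(src_dir, out_dir, extractor_cmd,
                  patterns=("*.doc", "*.pdf", "*.html")):
    """Run a command line text extractor over every matching file in src_dir.

    extractor_cmd is a list holding the program and any fixed arguments.
    ASSUMPTION: the extractor takes the input path and the output path as
    its last two arguments; adjust to match the real tool.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pattern in patterns:
        for doc in sorted(src.glob(pattern)):
            # Keep the original extension in the name so report.doc and
            # report.pdf don't overwrite each other's output.
            target = out / (doc.name + ".txt")
            result = subprocess.run([*extractor_cmd, str(doc), str(target)])
            # Only count files the extractor reported success on.
            if result.returncode == 0 and target.exists():
                written.append(target)
    return written
```

Calling `batch_extract("C:/docs", "C:/docs/text", ["path/to/extractor.exe"])` would then return the list of text files that were written, skipping any document the extractor failed on.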
Tags: COPY AND PASTE | documents | featured windows download

Comments
wordwhiz
Posted 7:27 AM 18/1/08
Well, it works, but in my (limited) experience, just barely. It has problems if the PDF contains any formatting (e.g., multiple columns). Headers and footers are repeated. It refused to even open one PDF I tried it on. BTW, both Foxit Reader and Adobe Acrobat Reader have "export text" or "save as text" options which work about as well.
Just my $.02...
TechTalk WRLR 98.3FM
Posted 7:23 AM 18/1/08
holy ASCII batman! this is EXACTLY what i was just looking for, as we are talking about converting about a million word documents into xml and need to grab the text out ... couple of quick swipes of the keyboard for an old-style DOS BAT program and I'll be the hero of the day it seems! Shouts out to LH and Kevin for digging this one up.
Brentis
Posted 5:33 AM 19/1/08
Have you tried deskUNPDF Professional? It has an XML output that supports the docbook standard. Also has a batch mode.
codykniffen
Posted 12:34 PM 18/1/08
Ummmm...could someone post the rest of that Mac&Cheese recipe?
mahalie
Posted 9:32 AM 19/1/08
This sounds really useful, even if it only barely works. I have a daunting amount of seriously old intranet content to convert to a new CMS.