Wednesday, January 2, 2013

Make PDFs searchable

I searched for a good way to make scanned documents searchable. Most newer scanning software already has OCR built in, but what about all the old documents? With pdfsandwich and Tesseract, we can recognize the text on each page of a PDF and place it behind the page image as an invisible layer. The PDF can then be searched with a normal PDF reader or uploaded to Google Translate to get a translated version. For a text-only version, use pdftotext.

First, install the missing packages (tested on Ubuntu 12.04):
# we use tesseract-ocr-deu for German
apt-get install tesseract-ocr tesseract-ocr-deu poppler-utils
apt-get install exactimage imagemagick ghostscript
Second, we download and install pdfsandwich:
wget http://downloads.sourceforge.net/project/pdfsandwich/pdfsandwich%200.0.7/\
pdfsandwich_0.0.7_amd64.deb
dpkg -i pdfsandwich_0.0.7_amd64.deb
Finally, we run pdfsandwich and pdftotext on a PDF:
pdfsandwich -resolution 240x240 -rgb -lang deu german_document.pdf
# creates german_document_ocr.pdf with colors and 240dpi

pdftotext german_document_ocr.pdf
# gives german_document_ocr.txt
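OCR can silently fail on poor scans, so it may be worth checking that the text file is not nearly empty. The following is a minimal sketch: the function name has_enough_text and the default threshold of 20 words are assumptions for illustration, not part of pdfsandwich or pdftotext.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the text file (e.g. produced by pdftotext)
# contains at least MIN words. The default of 20 is an arbitrary assumption.
has_enough_text() {
    txt="$1"
    min="${2:-20}"
    # tr strips the padding some wc implementations put before the count
    words=$(wc -w < "$txt" | tr -d ' ')
    [ "$words" -ge "$min" ]
}

# Usage:
# has_enough_text german_document_ocr.txt || echo "OCR result looks empty"
```

If the check fails, a higher -resolution value or a different -lang may be worth trying.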
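To process all PDFs in the current directory and its subdirectories, find can be used. Files ending in _ocr.pdf are excluded so that output from pdfsandwich is not processed again:
find . -name "*.pdf" ! -name "*_ocr.pdf" -exec pdfsandwich -resolution 240x240 -rgb -lang deu {} \;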
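When the batch is run repeatedly, it can help to skip PDFs that already have an OCR version. The sketch below assumes that pdfsandwich writes its output next to the input with the _ocr suffix, as in the example above. The function name pending_pdfs is made up for illustration, and it only looks at one directory without descending into subdirectories.

```shell
#!/bin/sh
# Hypothetical helper: print the PDFs in a directory that still need OCR.
# Skips files that are themselves OCR output (*_ocr.pdf) and files for
# which a matching *_ocr.pdf already exists.
pending_pdfs() {
    dir="${1:-.}"
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                 # no PDFs: glob stays literal
        case "$f" in *_ocr.pdf) continue ;; esac
        [ -e "${f%.pdf}_ocr.pdf" ] && continue  # already processed
        printf '%s\n' "$f"
    done
}

# Usage (assumes pdfsandwich is installed as described above):
# pending_pdfs . | while IFS= read -r f; do
#     pdfsandwich -resolution 240x240 -rgb -lang deu "$f"
# done
```

Reading the list line by line keeps file names with spaces intact, though names containing newlines would still break it.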
