PDF Text Only Extraction Primer Tutorial

Quite some time ago, web browsers offered a text-only view of web content. Perhaps they still do. It was a great idea back then, given the sluggish speed of personal modems. Sometimes, though, I feel like revisiting that text-only webpage view. The same goes, for me, for PDF files. Call it “cut to the chase” thinking, but sometimes it is good to cut through the presentation aspects of a PDF and get to the wording.

And that is what we are doing today for PDF data defined by any of …

  • PDF URL
  • PDF filename (expressed relative to https://www.rjmprogramming.com.au/PHP/, the place of residence of today’s textofpdf.php live run)
  • PDF file browsed for on your client device or computer, utilizing the wonderful HTML5 File API functionality

… at the client end of the web application, where an HTML form is presented to the user. Their data is POSTed to the server, which runs Ghostscript (an interpreter for the PostScript language and PDF) from the command line via PHP’s passthru, in PHP commands of the ilk (where variable $inpdf contains PDF data information) …


// A value starting with "http" is treated as a URL: fetch it, then feed Ghostscript via stdin.
if (strpos(strtolower(("~" . $inpdf)), "~http") !== false) {
    file_put_contents("/tmp/textofpdf.in", file_get_contents(str_replace('+',' ',$inpdf)));
    passthru("gs -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -f ps2ascii.ps -dQUIET - < /tmp/textofpdf.in ; rm -f /tmp/textofpdf.in");
} else if ($viafile != "") { // presumably set when $inpdf holds the data of a browsed-for file
    file_put_contents("/tmp/textofpdf.in", $inpdf);
    passthru("gs -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -f ps2ascii.ps -dQUIET - < /tmp/textofpdf.in ; rm -f /tmp/textofpdf.in");
} else {
    // Otherwise $inpdf is a server-side filename; escapeshellarg() quotes it safely for the shell.
    passthru("gs -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -f ps2ascii.ps " . escapeshellarg(str_replace('+',' ',$inpdf)) . " -dQUIET -c quit");
}

… the idea for which we gratefully acknowledge the online community’s useful link here, as well as this useful link’s advice about HTML5 File API specifics for PDF data.
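
For anyone wanting to adapt the three-way branch above, here is a minimal sketch of one way it could be arranged, not the code behind the live textofpdf.php. The Ghostscript flags are copied unchanged from the commands above. The function names (pdf_source_kind, gs_command_for_stdin, gs_command_for_path, text_of_pdf) are hypothetical, as is the assumption that a non-empty $viafile means the data came from a browsed-for file. The sketch swaps the fixed /tmp/textofpdf.in for a per-request temporary file from tempnam, so two simultaneous requests would not overwrite each other's input, and it quotes every shell argument with escapeshellarg.

```php
<?php
// Ghostscript flags copied from the post; everything else here is illustrative.
const GS_FLAGS = '-q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -f ps2ascii.ps';

// Classify the submitted value as 'url', 'upload' or 'path' (hypothetical helper).
function pdf_source_kind(string $inpdf, string $viafile): string
{
    if (stripos($inpdf, 'http') === 0) {
        return 'url';       // value starts with "http", in any letter case
    }
    return $viafile !== '' ? 'upload' : 'path';
}

// Command that reads the PDF bytes from a temporary file on stdin.
function gs_command_for_stdin(string $tmpFile): string
{
    return 'gs ' . GS_FLAGS . ' -dQUIET - < ' . escapeshellarg($tmpFile);
}

// Command that opens a server-side file by name.
function gs_command_for_path(string $path): string
{
    return 'gs ' . GS_FLAGS . ' ' . escapeshellarg($path) . ' -dQUIET -c quit';
}

// Send the extracted text of the PDF straight to the output.
function text_of_pdf(string $inpdf, string $viafile = ''): void
{
    $kind = pdf_source_kind($inpdf, $viafile);
    if ($kind === 'path') {
        passthru(gs_command_for_path(str_replace('+', ' ', $inpdf)));
        return;
    }
    // URL: download the bytes; upload: $inpdf already holds them.
    $bytes = $kind === 'url'
        ? file_get_contents(str_replace('+', ' ', $inpdf))
        : $inpdf;
    if ($bytes === false) {
        return;             // download failed, nothing to convert
    }
    $tmp = tempnam(sys_get_temp_dir(), 'textofpdf');
    if ($tmp === false) {
        return;             // no temporary file could be created
    }
    try {
        file_put_contents($tmp, $bytes);
        passthru(gs_command_for_stdin($tmp));
    } finally {
        unlink($tmp);       // remove the temporary file even if something throws
    }
}
```

A call such as text_of_pdf($inpdf, $viafile) would then stand in for the whole if/else chain, provided gs and ps2ascii.ps are available on the server just as the original commands require.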
