{"id":39643,"date":"2018-07-22T03:01:51","date_gmt":"2018-07-21T17:01:51","guid":{"rendered":"http:\/\/www.rjmprogramming.com.au\/ITblog\/?p=39643"},"modified":"2018-07-21T22:44:22","modified_gmt":"2018-07-21T12:44:22","slug":"pdf-text-only-extraction-primer-tutorial","status":"publish","type":"post","link":"https:\/\/www.rjmprogramming.com.au\/ITblog\/pdf-text-only-extraction-primer-tutorial\/","title":{"rendered":"PDF Text Only Extraction Primer Tutorial"},"content":{"rendered":"<div style=\"width: 230px\" class=\"wp-caption alignnone\"><a target=_blank href=\"http:\/\/www.rjmprogramming.com.au\/PHP\/textofpdf.php\"><img decoding=\"async\" style=\"float:left; border: 15px solid pink;\" alt=\"PDF Text Only Extraction Primer Tutorial\" src=\"http:\/\/www.rjmprogramming.com.au\/PHP\/textofpdf.jpg\" title=\"PDF Text Only Extraction Primer Tutorial\"  \/><\/a><p class=\"wp-caption-text\">PDF Text Only Extraction Primer Tutorial<\/p><\/div>\n<p>Quite some time ago now, the web browsers offered a text only view of the web contents.  Perhaps they still do.  It was a great idea back then for slow communication protocols involving personal modems, and their sluggish speed.  Sometimes, though, I feel like revisiting that text only webpage view.  The same can be said, for me, sometimes, regarding PDF files.  
Call it &#8220;cut to the chase&#8221; thinking, but sometimes it is good to just cut through the presentation aspects of a PDF and get to the wording.<\/p>\n<p>And that is what we are doing today for PDF data defined by any of &#8230;<\/p>\n<ul>\n<li>PDF URL<\/li>\n<li>PDF filename (that would have to be expressed relative to https:\/\/www.rjmprogramming.com.au\/PHP\/, the place of residence of today&#8217;s <a target=_blank href=\"http:\/\/www.rjmprogramming.com.au\/PHP\/textofpdf.php_GETME\">textofpdf.php<\/a>&#8217;s <a target=_blank href=\"http:\/\/www.rjmprogramming.com.au\/PHP\/textofpdf.php\" title=\"Click picture\">live run<\/a>)<\/li>\n<li>PDF file browsed for on your client device or computer, utilizing the wonderful <a target=_blank title='Useful link, thanks' href='https:\/\/www.html5rocks.com\/tutorials\/file\/dndfiles\/'>HTML5 File API<\/a> functionality<\/li>\n<\/ul>\n<p> &#8230; at the client end of the web application, where an HTML form is presented to the user. Their data is POSTed to the server, which runs <a target=_blank title=\"Ghostscript information\" href=\"https:\/\/www.ghostscript.com\/\">Ghostscript<\/a> (an interpreter for the PostScript language and PDF) from the command line via PHP&#8217;s <a target=_blank title='PHP passthru() method information' href='http:\/\/php.net\/manual\/en\/function.passthru.php'>passthru<\/a>, in PHP commands of the ilk (where variable $inpdf contains PDF data information) &#8230;<\/p>\n<p><code><br \/>\nif (strpos(strtolower((\"~\" . 
$inpdf)), \"~http\") !== false) {<br \/>\n  file_put_contents(\"\/tmp\/textofpdf.in\", file_get_contents(str_replace('+',' ',$inpdf)));<br \/>\n  passthru(\"gs -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE  -f ps2ascii.ps -dQUIET - &lt; \/tmp\/textofpdf.in ; rm -f \/tmp\/textofpdf.in\");<br \/>\n} else if ($viafile != \"\") { \/\/strlen($inpdf) &gt; 200) {<br \/>\n  file_put_contents(\"\/tmp\/textofpdf.in\", $inpdf);<br \/>\n  passthru(\"gs -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE  -f ps2ascii.ps -dQUIET - &lt; \/tmp\/textofpdf.in ; rm -f \/tmp\/textofpdf.in\");<br \/>\n} else {<br \/>\n  \/\/ escapeshellarg() quotes the filename so it cannot break out of the shell command<br \/>\n  passthru(\"gs -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE  -f ps2ascii.ps  \" . escapeshellarg(str_replace('+',' ',$inpdf)) . \" -dQUIET -c quit\");<br \/>\n}<br \/>\n<\/code><\/p>\n<p> &#8230; the idea for which we gratefully acknowledge the online community&#8217;s <a target=_blank title='Useful link, thanks' href='https:\/\/stackoverflow.com\/questions\/3650957\/how-to-extract-text-from-a-pdf'>useful link here<\/a>, as well as <a target=_blank title='Useful link, thanks' href='https:\/\/stackoverflow.com\/questions\/11876175\/how-to-get-a-file-or-blob-from-an-object-url'>this useful link<\/a>&#8217;s advice about HTML5 File API specifics for PDF data.<\/p>\n<p>If this was interesting you may be interested in <a title='Click here to see topics in which you might be interested' href='#d39643' onclick='var dv=document.getElementById(\"d39643\"); dv.innerHTML = \"<iframe width=670 height=600 src=\" + \"\/\/www.rjmprogramming.com.au\/ITblog\/tag\/passthru\" + \"&gt;&lt;\/iframe&gt;\"; dv.style.display = \"block\";'>this<\/a> too.<\/p>\n<div id='d39643' style='display: none; border-left: 2px solid green; border-top: 2px solid green;'><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Quite some time ago now, the web browsers offered a text only view of the web contents. Perhaps they still do. 
It was a great idea back then for slow communication protocols involving personal modems, and their sluggish speed. Sometimes, &hellip; <a href=\"https:\/\/www.rjmprogramming.com.au\/ITblog\/pdf-text-only-extraction-primer-tutorial\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12,14,37],"tags":[2051,2618,419,452,2619,576,578,652,2427,913,932,970,997,1254,1319],"class_list":["post-39643","post","type-post","status-publish","format-standard","hentry","category-elearning","category-event-driven-programming","category-tutorials","tag-blob","tag-extract","tag-file-api","tag-form","tag-ghostscript","tag-html","tag-html5","tag-javascript","tag-passthru","tag-pdf","tag-php","tag-post","tag-programming","tag-text","tag-tutorial"],"_links":{"self":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/39643"}],"collection":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/comments?post=39643"}],"version-history":[{"count":4,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/39643\/revisions"}],"predecessor-version":[{"id":39647,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/39643\/revisions\/39647"}],"wp:attachment":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/media?parent=39643"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/categories?post=39643"},{"taxonomy
":"post_tag","embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/tags?post=39643"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}