{"id":66610,"date":"2025-01-21T03:01:00","date_gmt":"2025-01-20T17:01:00","guid":{"rendered":"https:\/\/www.rjmprogramming.com.au\/ITblog\/?p=66610"},"modified":"2025-01-21T08:14:05","modified_gmt":"2025-01-20T22:14:05","slug":"pandoc-on-almalinux-conversions-primer-tutorial","status":"publish","type":"post","link":"https:\/\/www.rjmprogramming.com.au\/ITblog\/pandoc-on-almalinux-conversions-primer-tutorial\/","title":{"rendered":"Pandoc on AlmaLinux Conversions Primer Tutorial"},"content":{"rendered":"<div style=\"width: 230px\" class=\"wp-caption alignnone\"><a target=\"_blank\" href=\"https:\/\/www.rjmprogramming.com.au\/macos_textutil_convert.php\" rel=\"noopener\"><img decoding=\"async\" style=\"border: 15px solid pink;\" alt=\"Pandoc on AlmaLinux Conversions Primer Tutorial\" src=\"http:\/\/www.rjmprogramming.com.au\/macos_textutil_convert_more.gif\" title=\"Pandoc on AlmaLinux Conversions Primer Tutorial\"  style=\"float:left;\"  \/><\/a><p class=\"wp-caption-text\">Pandoc on AlmaLinux Conversions Primer Tutorial<\/p><\/div>\n<p>Yesterday&#8217;s <a title='Word to HTML to CSV Delimitation Primer Tutorial' href='#whtmlcsvdpt'>Word to HTML to CSV Delimitation Primer Tutorial<\/a> offered a timely reminder that not only &#8230;<\/p>\n<ul>\n<li>LibreOffice and Microsoft Office software applications offer exports of document formats to HTML &#8230; but, also, open source gives us &#8230;<\/li>\n<li><font size=1>(what is now possible to offer in a public sense because of the recent AlmaLinux installation (you can read more about it at <a target=\"_blank\" href='https:\/\/www.rjmprogramming.com.au\/ITblog\/pandoc-install-and-public-face-tutorial\/' title='Pandoc Install and Public Face Tutorial' rel=\"noopener\">Pandoc Install and Public Face Tutorial<\/a>) of)<\/font> the <a target=\"_blank\" title='Pandoc open source document conversions' href='https:\/\/pandoc.org\/' rel=\"noopener\">Pandoc<\/a> command line application we can use to convert input document formats such 
as *.doc* and *.html and *.txt to others &#8230; and down the track &#8230;<\/li>\n<li>tomorrow&#8217;s job can involve the interfacing of another inhouse &#8220;open source using&#8221; web application so that input *.pdf is possible here too<\/li>\n<\/ul>\n<p>For security purposes we restrict where output files end up to &#8230;<\/p>\n<p><code><br \/>\n\/tmp\/<br \/>\n<\/code><\/p>\n<p> &#8230; as you might surmise would be a wise move.  The user ends up relying on <a target=\"_blank\" href=\"https:\/\/www.rjmprogramming.com.au\/PHP\/Geographicals\/diff.php?one=https:\/\/www.rjmprogramming.com.au\/macos_textutil_convert.php------------GETME\" rel=\"noopener\">the changed<\/a> <a target=\"_blank\" href=\"https:\/\/www.rjmprogramming.com.au\/macos_textutil_convert.php------------GETME\" rel=\"noopener\">macos_textutil_convert.php<\/a> PHP <a target=\"_blank\" href=\"https:\/\/www.rjmprogramming.com.au\/macos_textutil_convert.php\" rel=\"noopener\">web application<\/a> itself, that way, to display the outputted data <font size=1>(created via command line Pandoc commands performed on the RJM Programming AlmaLinux web server via PHP <a target=\"_blank\" title='PHP exec() method information' href='http:\/\/php.net\/manual\/en\/function.exec.php' rel=\"noopener\">exec<\/a> calls)<\/font> for them.<\/p>\n<p><iframe style=\"width:100%;height:900px;\" src=\"\/\/www.rjmprogramming.com.au\/macos_textutil_convert.php\" rel=\"noopener\"><\/iframe><\/p>\n<p>In making this happen, exporting to PDF, we found that we additionally had to install to the AlmaLinux web server &#8230;<\/p>\n<p><code><br \/>\ndnf install texlive<br \/>\n<\/code><\/p>\n<p> &#8230; the best &#8220;heads up&#8221; for this after reading the Pandoc error message being <a target=\"_blank\" 
href='https:\/\/www.google.com\/search?q=install+pdflatex+on+almalinux&#038;oq=install+pdflatex+on+almalinux&#038;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIHCAEQIRigATIHCAIQIRigATIHCAMQIRifBdIBCTg3OTdqMGoxNagCCLACAQ&#038;sourceid=chrome&#038;ie=UTF-8\n' rel=\"noopener\">this useful webpage<\/a>, thanks.<\/p>\n<p><!--p>You can also see this play out at WordPress 4.1.1's <a target=\"_blank\" href='\/\/www.rjmprogramming.com.au\/ITblog\/pandoc-on-almalinux-conversions-primer-tutorial\/' rel=\"noopener\">Pandoc on AlmaLinux Conversions Primer Tutorial<\/a>.<\/p-->\n<hr>\n<p id='whtmlcsvdpt'>Previous relevant <a target=\"_blank\" title='Word to HTML to CSV Delimitation Primer Tutorial' href='\/\/www.rjmprogramming.com.au\/ITblog\/word-to-html-to-csv-delimitation-primer-tutorial\/' rel=\"noopener\">Word to HTML to CSV Delimitation Primer Tutorial<\/a> is shown below.<\/p>\n<div style=\"width: 230px\" class=\"wp-caption alignnone\"><a target=\"_blank\" href=\"http:\/\/www.rjmprogramming.com.au\/PHP\/dostuff.jpg\" rel=\"noopener\"><img decoding=\"async\" style=\"border: 15px solid pink;\" alt=\"Word to HTML to CSV Delimitation Primer Tutorial\" src=\"http:\/\/www.rjmprogramming.com.au\/PHP\/dostuff.jpg\" title=\"Word to HTML to CSV Delimitation Primer Tutorial\"  style=\"float:left;\"  \/><\/a><p class=\"wp-caption-text\">Word to HTML to CSV Delimitation Primer Tutorial<\/p><\/div>\n<p>The modern document applications allow conversion to HTML.  <i>What happens<\/i> during that process, exactly?  Well, that&#8217;s &#8220;under the hood&#8221; stuff.  A little background, though, and context &#8230;<\/p>\n<ul>\n<li>Why would you want to convert, say a Word file, to HTML (using, perhaps, LibreOffice, in our case, or Microsoft Word)? 
&#8230; well, as a mere mortal programmer &#8230;<\/li>\n<li><font size=1>(any form of)<\/font> text is easier to deal with for &#8220;mere mortal programmer&#8221; languages we might want to use like &#8230;<\/li>\n<li>PHP &#8230; is very good at the <strong>delimiter<\/strong> processing bits that allow the programmer to be useful &#8230;<\/li>\n<li>converting &#8230; the data into other guises, the one that interested us being &#8230;<\/li>\n<li>CSV (comma separated value) data &#8230; to be fed into spreadsheet applications like Excel or LibreOffice&#8217;s one &#8230; and then create charts<\/li>\n<\/ul>\n<p> &#8230; and to do useful <strong>delimiter<\/strong> work in PHP you need to know, or suss out, <i>&#8220;what happens&#8221;<\/i>, or evidence of that &#8230; think hex dumps (where $dr is a PHP variable containing an HTML file record) &#8230;<\/p>\n<p>&lt;?php<br \/>\n<code><br \/>\necho <a target=\"_blank\" title='PHP bin2hex information' href='https:\/\/www.php.net\/manual\/en\/function.bin2hex.php' rel=\"noopener\">bin2hex<\/a>($dr) . \"\\n\";<br \/>\n\/\/ ... gave, in our case, output such as ...<br \/>\n\/\/ c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020546f74616c207c20c2a020c2a02036302c30333220c2a020c2a020c2a03130302e3030<br \/>\n<\/code><br \/>\n?&gt;<\/p>\n<p>And so we line up all the useful contributors &#8230;<\/p>\n<ol>\n<li>C3PO<\/li>\n<li>C2A0<\/li>\n<li>R2D2<\/li>\n<li>&#8230; &#8230; &#8230; <\/li>\n<\/ol>\n<blockquote><p>\nHang on?!  What&#8217;s with C2A0?  
And for that matter, the pitiful &#8220;am typing&#8221; simulation &#8220;&#8230; &#8230; &#8230; &#8221;?!\n<\/p><\/blockquote>\n<p>Well, we asked around, and got to <a target=\"_blank\" title='Useful link' href='https:\/\/community.notepad-plus-plus.org\/topic\/19862\/replace-non-breaking-space-utf-8-c2-a0\/7' rel=\"noopener\">this useful link<\/a> telling us these are non-ASCII characters describing a &#8230;<\/p>\n<p><code><br \/>\nNon-breaking space<br \/>\n<\/code><\/p>\n<p> &#8230; scenario programmers of HTML will know can be those &#8230;<\/p>\n<p><code><br \/>\n&amp;nbsp;<br \/>\n<\/code><\/p>\n<p> &#8230; HTML entities in your webpage content.  Well, now, at least to us, that all makes sense.  But, for our job, that could be the tip of the &#8220;UTF-8 headache&#8221; situation!  We know we&#8217;re only interested in ASCII data characters for the conversion job we are trying to do.  Is there a way to simplify this &#8220;middleperson&#8221; HTML data content?  Well, <a target=\"_blank\" title='Other useful link' href='https:\/\/stackoverflow.com\/questions\/25236761\/how-to-replace-non-ascii-characters-in-a-string-in-php' rel=\"noopener\">this other useful link<\/a> &#8230; got us to use &#8230;<\/p>\n<p>&lt;?php<br \/>\n<code><br \/>\n$dr=preg_replace('\/[\\x7F-\\xFF]\/ui', '', $dr);<br \/>\n<\/code><br \/>\n?&gt;<\/p>\n<p> &#8230; which helped us with &#8230;<\/p>\n<ol>\n<li>sanity<\/li>\n<li>simplification<\/li>\n<\/ol>\n<p> &#8230; as far as the PHP delimitation logic went.  
This was an inhouse job, but we&#8217;ll show you a skeleton of how we used &#8230;<\/p>\n<ul>\n<li>input Word report &#8230; we are calling from_word_to_html.html &#8230; say &#8230;<\/li>\n<li>containing spreadsheet<sub>able<\/sub> data &#8230;<\/li>\n<li>we wanted to extract into &#8230;<\/li>\n<li>individual CSV files &#8230; ready to &#8230;<\/li>\n<li>open as useful spreadsheets &#8230; and perhaps onto some chart production &#8230;<\/li>\n<li>processing via command line command &#8230;<br \/>\n<code><br \/>\nphp dostuff.php<br \/>\n<\/code><br \/>\n &#8230; where that PHP is (very informally) &#8230;<\/li>\n<li><a target=\"_blank\" rel=\"noopener\" href=\"http:\/\/www.rjmprogramming.com.au\/PHP\/dostuff.php-GETME\">dostuff.php<\/a><\/li>\n<\/ul>\n<p> &#8230; in case these ideas interest you?!<\/p>\n<p>If this was interesting you may be interested in <a title='Click here to see topics in which you might be interested' href='#d66591' onclick='var dv=document.getElementById(\"d66591\"); dv.innerHTML = \"&lt;iframe width=670 height=600 src=\" + \"https:\/\/www.rjmprogramming.com.au\/ITblog\/tag\/word\" + \"&gt;&lt;\/iframe&gt;\"; dv.style.display = \"block\";'>this<\/a> too.<\/p>\n<div id='d66591' style='display: none; border-left: 2px solid green; border-top: 2px solid green;'><\/div>\n<hr>\n<p>If this was interesting you may be interested in <a title='Click here to see topics in which you might be interested' href='#d66610' onclick='var dv=document.getElementById(\"d66610\"); dv.innerHTML = \"&lt;iframe width=670 height=600 src=\" + \"https:\/\/www.rjmprogramming.com.au\/ITblog\/tag\/almalinux\" + \"&gt;&lt;\/iframe&gt;\"; dv.style.display = \"block\";'>this<\/a> too.<\/p>\n<div id='d66610' style='display: none; border-left: 2px solid green; border-top: 2px solid green;'><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Yesterday&#8217;s Word to HTML to CSV Delimitation Primer Tutorial offered a timely reminder that not only &#8230; LibreOffice and 
Microsoft Office software applications offer exports of document formats to HTML &#8230; but, also, open source gives us &#8230; (what is &hellip; <a href=\"https:\/\/www.rjmprogramming.com.au\/ITblog\/pandoc-on-almalinux-conversions-primer-tutorial\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12,29,37],"tags":[4855,96,233,234,263,5080,405,407,576,885,3365,932,997,5081,1319,1411,1452],"class_list":["post-66610","post","type-post","status-publish","format-standard","hentry","category-elearning","category-operating-system","category-tutorials","tag-almalinux","tag-application","tag-command","tag-command-line","tag-conversion","tag-document-open-source","tag-exec","tag-export","tag-html","tag-operating-system-2","tag-pandoc","tag-php","tag-programming","tag-public","tag-tutorial","tag-web-server","tag-word"],"_links":{"self":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/66610"}],"collection":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/comments?post=66610"}],"version-history":[{"count":7,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/66610\/revisions"}],"predecessor-version":[{"id":66620,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/66610\/revisions\/66620"}],"wp:attachment":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/media?parent=66610"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\
/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/categories?post=66610"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/tags?post=66610"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}