{"id":66591,"date":"2025-01-20T03:01:00","date_gmt":"2025-01-19T17:01:00","guid":{"rendered":"https:\/\/www.rjmprogramming.com.au\/ITblog\/?p=66591"},"modified":"2025-01-20T09:52:20","modified_gmt":"2025-01-19T23:52:20","slug":"word-to-html-to-csv-delimitation-primer-tutorial","status":"publish","type":"post","link":"https:\/\/www.rjmprogramming.com.au\/ITblog\/word-to-html-to-csv-delimitation-primer-tutorial\/","title":{"rendered":"Word to HTML to CSV Delimitation Primer Tutorial"},"content":{"rendered":"<div style=\"width: 230px\" class=\"wp-caption alignnone\"><a target=\"_blank\" href=\"http:\/\/www.rjmprogramming.com.au\/PHP\/dostuff.jpg\" rel=\"noopener\"><img decoding=\"async\" style=\"border: 15px solid pink;\" alt=\"Word to HTML to CSV Delimitation Primer Tutorial\" src=\"http:\/\/www.rjmprogramming.com.au\/PHP\/dostuff.jpg\" title=\"Word to HTML to CSV Delimitation Primer Tutorial\"  style=\"float:left;\"  \/><\/a><p class=\"wp-caption-text\">Word to HTML to CSV Delimitation Primer Tutorial<\/p><\/div>\n<p>The modern document applications allow conversion to HTML.  <i>What happens<\/i> during that process, exactly?  Well, that&#8217;s &#8220;under the hood&#8221; stuff.  A little background, though, and context &#8230;<\/p>\n<ul>\n<li>Why would you want to convert, say a Word file, to HTML (using, perhaps, LibreOffice, in our case, or Microsoft Word)? 
&#8230; well, as a mere mortal programmer &#8230;<\/li>\n<li><font size=1>(any form of)<\/font> text is easier to deal with for &#8220;mere mortal programmer&#8221; languages we might want to use like &#8230;<\/li>\n<li>PHP &#8230; is very good at the <strong>delimiter<\/strong> processing bits that allow the programmer to be useful &#8230;<\/li>\n<li>converting &#8230; the data into other guises, the one that interested us being &#8230;<\/li>\n<li>CSV (comma-separated values) data &#8230; to be fed into spreadsheet applications like Excel or LibreOffice&#8217;s one &#8230; and then create charts<\/li>\n<\/ul>\n<p> &#8230; and to do useful <strong>delimiter<\/strong> work in PHP you need to know, or suss out, <i>&#8220;what happens&#8221;<\/i>, or evidence of that &#8230; think hex dumps (where $dr is a PHP variable containing an HTML file record) &#8230;<\/p>\n<p>&lt;?php<br \/>\n<code><br \/>\necho <a target=\"_blank\" title='PHP bin2hex information' href='https:\/\/www.php.net\/manual\/en\/function.bin2hex.php' rel=\"noopener\">bin2hex<\/a>($dr) . \"\\n\";<br \/>\n\/\/ ... gave, in our case, output such as ...<br \/>\n\/\/ c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020c2a020546f74616c207c20c2a020c2a02036302c30333220c2a020c2a020c2a03130302e3030<br \/>\n<\/code><br \/>\n?&gt;<\/p>\n<p>And so we line up all the useful contributors &#8230;<\/p>\n<ol>\n<li>C3PO<\/li>\n<li>C2A0<\/li>\n<li>R2D2<\/li>\n<li>&#8230; &#8230; &#8230; <\/li>\n<\/ol>\n<blockquote><p>\nHang on?!  What&#8217;s with C2A0?  
And for that matter, the pitiful &#8220;am typing&#8221; simulation &#8220;&#8230; &#8230; &#8230; &#8221;?!\n<\/p><\/blockquote>\n<p>Well, we asked around, and got to <a target=\"_blank\" title='Useful link' href='https:\/\/community.notepad-plus-plus.org\/topic\/19862\/replace-non-breaking-space-utf-8-c2-a0\/7' rel=\"noopener\">this useful link<\/a> telling us these are the two non-ASCII bytes (hex C2 A0) that UTF-8 uses to encode a &#8230;<\/p>\n<p><code><br \/>\nNon-breaking space<br \/>\n<\/code><\/p>\n<p> &#8230; a character programmers of HTML will know can be one of those &#8230;<\/p>\n<p><code><br \/>\n&amp;nbsp;<br \/>\n<\/code><\/p>\n<p> &#8230; HTML entities in your webpage content.  Well, now, at least to us, that all makes sense.  But, for our job, that could be the tip of the &#8220;UTF-8 headache&#8221; situation!  We know we&#8217;re only interested in ASCII data characters for the conversion job we are trying to do.  Is there a way to simplify this &#8220;middleperson&#8221; HTML data content?  Well, <a target=\"_blank\" title='Other useful link' href='https:\/\/stackoverflow.com\/questions\/25236761\/how-to-replace-non-ascii-characters-in-a-string-in-php' rel=\"noopener\">this other useful link<\/a> &#8230; got us to use &#8230;<\/p>\n<p>&lt;?php<br \/>\n<code><br \/>\n$dr=preg_replace('\/[\\x7F-\\xFF]\/ui', '', $dr);<br \/>\n<\/code><br \/>\n?&gt;<\/p>\n<p> &#8230; which helped us with &#8230;<\/p>\n<ol>\n<li>sanity<\/li>\n<li>simplification<\/li>\n<\/ol>\n<p> &#8230; as far as the PHP delimitation logic went.  
This was an in-house job, but we&#8217;ll show you a skeleton of how we used &#8230;<\/p>\n<ul>\n<li>input Word report &#8230; we are calling from_word_to_html.html &#8230; say &#8230;<\/li>\n<li>containing spreadsheet<sub>able<\/sub> data &#8230;<\/li>\n<li>we wanted to extract into &#8230;<\/li>\n<li>individual CSV files &#8230; ready to &#8230;<\/li>\n<li>open as useful spreadsheets &#8230; and perhaps on to some chart production &#8230;<\/li>\n<li>processing via the command line &#8230;<br \/>\n<code><br \/>\nphp dostuff.php<br \/>\n<\/code><br \/>\n &#8230; where that PHP is (very informally) &#8230;<\/li>\n<li><a target=\"_blank\" rel=\"noopener\" href=\"http:\/\/www.rjmprogramming.com.au\/PHP\/dostuff.php-GETME\">dostuff.php<\/a><\/li>\n<\/ul>\n<p> &#8230; in case these ideas interest you?!<\/p>\n<p>If this was interesting you may be interested in <a title='Click here to see topics in which you might be interested' href='#d66591' onclick='var dv=document.getElementById(\"d66591\"); dv.innerHTML = \"&lt;iframe width=670 height=600 src=\" + \"https:\/\/www.rjmprogramming.com.au\/ITblog\/tag\/word\" + \"&gt;&lt;\/iframe&gt;\"; dv.style.display = \"block\";'>this<\/a> too.<\/p>\n<div id='d66591' style='display: none; border-left: 2px solid green; border-top: 2px solid green;'><\/div>\n","protected":false},"excerpt":{"rendered":"<p>The modern document applications allow conversion to HTML. What happens during that process, exactly? Well, that&#8217;s &#8220;under the hood&#8221; stuff. 
A little background, though, and context &#8230; Why would you want to convert, say a Word file, to HTML (using, &hellip; <a href=\"https:\/\/www.rjmprogramming.com.au\/ITblog\/word-to-html-to-csv-delimitation-primer-tutorial\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12,29,37],"tags":[105,5079,199,5077,5078,234,263,283,290,2276,2147,349,403,407,5075,5076,576,694,782,789,932,4116,1185,1254,1352,1452],"class_list":["post-66591","post","type-post","status-publish","format-standard","hentry","category-elearning","category-operating-system","category-tutorials","tag-ascii","tag-bin2hex","tag-chart","tag-cli","tag-comma-separated-value","tag-command-line","tag-conversion","tag-csv","tag-data","tag-delimitation","tag-delimiter","tag-document","tag-excel","tag-export","tag-hex","tag-hex-dump","tag-html","tag-libreoffice","tag-microsoft","tag-microsoft-word","tag-php","tag-preg_replace","tag-spreadsheet","tag-text","tag-utf-8","tag-word"],"_links":{"self":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/66591"}],"collection":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/comments?post=66591"}],"version-history":[{"count":13,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/66591\/revisions"}],"predecessor-version":[{"id":66609,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/posts\/66591\/revisions\/66609"}],"wp:attachment":[{"href":"https:\/\/www.rjmprogr
amming.com.au\/ITblog\/wp-json\/wp\/v2\/media?parent=66591"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/categories?post=66591"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.rjmprogramming.com.au\/ITblog\/wp-json\/wp\/v2\/tags?post=66591"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}