Tonight I finally converted all the Glossary pages in my mirror of the Jargon File into Unicode (utf-8 encoding) so that they will transmit and display properly from GitHub Pages (or any other modern web server). It was a fairly trivial thing to do in the end, but I am likely to need to repeat this for other things at work, so I'm blogging it.
The Jargon File was converted into XML-DocBook and Unicode for version 4.4.0, but ESR only converted the front- and back-matter, not the Glossary entries (i.e. the actual lexicon). Those are still latin-1 (ISO-8859-1), and the HTML rendition begins with the correct header declaring this:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
UTF-8 does not equal Latin-1
It's a widely held misconception that utf-8 is a superset of ISO-8859-1. It's not. While Unicode itself does contain a Latin-1 Supplement, all of the upper 128 byte values of ISO-8859-1 have a different meaning in utf-8, and many legitimate ISO-8859-1 byte sequences are illegal in utf-8. This often confuses me because I'm used to 8-bit characters where the encoding is the same as the code-point: 0x41 means A, and 0xFF (in latin-1) means ÿ. With Unicode, the code-points and their encodings are not the same: there are multiple byte-stream encodings for Unicode, and utf-8 is just one of them.
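To make the code-point versus encoding distinction concrete, here is a small sketch using only Python's built-in `ord` and `str.encode`. The helper name `describe` is mine, purely for illustration.

```python
def describe(ch):
    """Return (code point, latin-1 bytes, utf-8 bytes) for one character."""
    return ord(ch), ch.encode("iso-8859-1"), ch.encode("utf-8")


for ch in ("A", "ÿ"):
    code_point, latin1, utf8 = describe(ch)
    # In latin-1 the single byte equals the code point; in utf-8 that only
    # holds for the ASCII range.
    print(f"{ch}: U+{code_point:04X} latin-1={latin1.hex()} utf-8={utf8.hex()}")
```

For A both encodings give the single byte 41, but ÿ is ff in latin-1 and the two bytes c3 bf in utf-8.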
Despite the 8 in its name, utf-8 is not 8-bit. It's a potentially multi-byte encoding for Unicode, in which each character takes at least 8 bits. It just happens to share half of the 8-bit space with ISO-8859-1 (which shares the same half with ASCII), and so for English the same characters have the same 8-bit encodings. Most Western-European languages can be encoded in utf-8 with between 8 and 16 bits per character as well; the full Unicode range needs up to 21 bits per code-point (so 3 or 4 bytes in utf-8's encoding scheme, but only for characters that are rare in these languages).
This is why utf-8 is so popular online:
- most web sites are in Western-European languages
- most of these Latin-based characters can be encoded with 8 bits per character, many with up to 16-bits
- the first 128 characters are the same bit-patterns as 7-bit ASCII, the original byte stream for the ARPANET and the Internet
Japanese or Chinese text is a different story, because utf-16 is more compact for those Unicode blocks: the most common Chinese characters fit in a 2-byte encoding in utf-16, whereas the same code-points in utf-8 typically need 3 bytes. (In practice most web sites in those languages still serve utf-8, but the size difference is real.)
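The size trade-off is easy to measure. This is just a sketch: the sample strings are my own, and I encode as `utf-16-le` so that Python does not add a byte-order mark to the count.

```python
def encoded_sizes(text):
    """Return (utf-8 byte count, utf-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for sample in ("hacker", "naïve", "中文", "日本語"):
    utf8_len, utf16_len = encoded_sizes(sample)
    print(f"{sample}: utf-8={utf8_len} bytes, utf-16={utf16_len} bytes")
```

The Latin-script samples come out smaller in utf-8, and the CJK samples come out smaller in utf-16.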
Why ISO-8859-1 “breaks” in utf-8
Because utf-8 is a multi-byte encoding, it reserves some bit patterns to signal that more than one byte is involved in the current code-point for a character. The last code-point that fits in a single utf-8 byte is actually 0x7F: the most-significant bit is reserved to indicate multi-byte. So the ISO-8859-1 bytes above 0x7F either decode to different Unicode characters or form illegal utf-8 byte sequences.
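Printing the bits shows the reserved patterns directly. In utf-8, a single-byte character starts with a 0 bit, the first byte of a multi-byte sequence starts with 110, 1110 or 11110, and every continuation byte starts with 10. The helper name `utf8_bits` below is only for illustration.

```python
def utf8_bits(ch):
    """Return the utf-8 encoding of one character as a list of 8-bit strings."""
    return [f"{byte:08b}" for byte in ch.encode("utf-8")]


for ch in ("A", "ÿ", "€"):
    # The leading bits of each byte show its role in the sequence.
    print(ch, " ".join(utf8_bits(ch)))
```

Note that ÿ, a single byte 11111111 in ISO-8859-1, becomes a 110 lead byte plus one 10 continuation byte.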
That's why you can't just take an ISO-8859-1 byte stream (or any other ISO-8859 code page, or the Windows one) and interpret it as utf-8. It only works if the bytes from the stream are in the 7-bit (ASCII) range.
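Here is a minimal sketch of what goes wrong, using a made-up sample word rather than anything from the Jargon File. The helper `try_utf8` is hypothetical; it just wraps `bytes.decode` so that a failure is visible as `None`.

```python
def try_utf8(raw):
    """Decode raw bytes as strict utf-8, or return None if they are invalid."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


latin1_bytes = "café".encode("iso-8859-1")   # b'caf\xe9'
print(try_utf8(latin1_bytes))                # the lone 0xE9 is not valid utf-8
print(latin1_bytes.decode("iso-8859-1"))     # decoding with the right codec works
print(try_utf8(b"cafe"))                     # pure ASCII decodes fine either way

# The reverse mistake gives the familiar mojibake instead of an error.
print("café".encode("utf-8").decode("iso-8859-1"))
```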
Fixing the Jargon File
When I mirrored the Jargon File back in October 2015, I asked HTTrack to encode as UTF-8, thinking this would fix it. It didn't really succeed; all it did was add its own encoding header:
<!-- Mirrored from www.catb.org/~esr/jargon/html/A/ABEND.html by HTTrack Website Copier/3.x [XR&CO'2014], Mon, 26 Oct 2015 13:15:16 GMT --> <!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=utf-8" /><!-- /Added by HTTrack -->
The bytes within the page were still ISO-8859-1. So I was having the same issue as catb.org: while the page declares ISO, the web server actually sends utf-8; and now the HTTrack insertion adds its own content to the confusion.
Fixing the Jargon File properly would involve going into the DocBook sources and ESR's makefiles and correcting it there. I'm not going to do that: it's too much effort to recover software that still understands DocBook XML 1.62 and can do a lossless conversion. That'll probably be a "someday" project: it's been 15 years since ESR updated the Jargon File himself, and it's dated and bordering on bogus for lack of currency, so it's low priority. Instead, I've just run a filter over all the HTML output of the Jargon File, since that is what is actually being served most of the time.
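Here's the Python code:

```python
#!/usr/bin/env python3
import sys
import os
import codecs

# The file to convert is the first command-line argument.
sourceFileName = sys.argv[1]
targetFileName = sys.argv[1] + '.tmp'
BLOCKSIZE = 1048576  # or some other, desired size in bytes

# Decode as ISO-8859-1 and re-encode as utf-8, one block at a time.
with codecs.open(sourceFileName, "r", "ISO-8859-1") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            # Update the declared encoding to match the new content.
            targetFile.write(contents.replace('ISO-8859-1', 'utf-8'))

# Clobber the original with the converted copy.
os.rename(targetFileName, sourceFileName)
```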
What the script does:
- take the sourceFileName from the script argument (without checking if there is an argument)
- set a temporary targetFileName based on the source file
- hardcode the source encoding
- replace "ISO-8859-1" in the output file with "utf-8" so that the headers match the content (very naïve: it just looks for the exact string match, so any mention of "ISO-8859-1" within the body will also be replaced)
- finally, replace the original source file with the temporary target file, by intentionally clobbering the original with os.rename
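If I needed this for something less throwaway, the shortcuts listed above could be tightened up along these lines. This is only a sketch under my own assumptions, not what I actually ran: the names `transcode`, `main` and `PROLOG_ENCODING` are made up, it assumes each page is small enough to read in one go, and it assumes the encoding is declared in an XML prolog like the one shown earlier.

```python
#!/usr/bin/env python3
import os
import re
import sys

# Assumption: the encoding is declared in an <?xml ... encoding="..."?> prolog.
# Only that declaration is rewritten; mentions in the page body are left alone.
PROLOG_ENCODING = re.compile(r'(<\?xml[^>]*encoding=")ISO-8859-1(")', re.IGNORECASE)


def transcode(path):
    """Rewrite the file at path from ISO-8859-1 to utf-8, fixing the prolog."""
    # newline="" keeps the original line endings untouched.
    with open(path, "r", encoding="iso-8859-1", newline="") as source:
        text = source.read()
    text = PROLOG_ENCODING.sub(r"\1utf-8\2", text, count=1)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8", newline="") as target:
        target.write(text)
    # os.replace overwrites the destination on every platform.
    os.replace(tmp_path, path)


def main(argv):
    """Check the arguments, convert one file, and return an exit status."""
    if len(argv) != 2:
        print(f"usage: {argv[0]} FILE", file=sys.stderr)
        return 2
    transcode(argv[1])
    return 0
```

To use it as a script, call `sys.exit(main(sys.argv))` from the usual `__main__` guard. Like the original, it must only be run once per file: running it on a page that is already utf-8 would double-encode the non-ASCII characters.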
I then ran this code in a shell loop like so:
```
[mjl@milo:~/hax/blog/milosophical.me/files/jargon/html]
[22:56](nikola)β for Y in 0 [A-Z]; do
    echo $Y
    for X in $Y/*.html; do
        echo $X
        ~/hax/transcode.py $X
    done
done
```
And there we have it: all the Jargon lexicon is now encoded as utf-8 and declared as such in the headers. Here's /dev/null:
<?xml version="1.0" encoding="utf-8" standalone="no"?> <html xmlns="http://www.w3.org/1999/xhtml"> <!-- Mirrored from www.catb.org/~esr/jargon/html/0/dev-null.html by HTTrack Website Copier/3.x [XR&CO'2014], Mon, 26 Oct 2015 13:16:51 GMT --> <!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=utf-8" /><!-- /Added by HTTrack --> <head><title>/dev/null</title><link rel="stylesheet" href="../../jargon.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.61.0"/><link rel="home" href="../index.html" title="The Jargon File"/><link rel="up" href="../0.html" title="0"/><link rel="previous" href="TM.html" title="(TM)"/><link rel="next" href="me.html" title="/me"/></head><body><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">/dev/null</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="TM.html">Prev</a> </td><th width="60%" align="center">0</th><td width="20%" align="right"> <a accesskey="n" href="me.html">Next</a></td></tr></table><hr/></div><dt><a id="dev-null"/><dt xmlns="" id="dev-null"><b>/dev/null</b>: <span xmlns="http://www.w3.org/1999/xhtml" class="pronunciation">/dev·nuhl/</span>, <span xmlns="http://www.w3.org/1999/xhtml" class="grammar">n.</span></dt></dt><dd><p> [from the Unix null device, used as a data sink] A notional ‘black hole’ in any information space being discussed, used, or referred to. A controversial posting, for example, might end “<span class="quote">Kudos to email@example.com, flames to /dev/null</span>”. 
See <a href="../B/bit-bucket.html"><i class="glossterm">bit bucket</i></a>.</p></dd><div class="navfooter"><hr/><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="TM.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="../0.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="me.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">(TM) </td><td width="20%" align="center"><a accesskey="h" href="../index.html">Home</a></td><td width="40%" align="right" valign="top"> /me</td></tr></table></div></body> <!-- Mirrored from www.catb.org/~esr/jargon/html/0/dev-null.html by HTTrack Website Copier/3.x [XR&CO'2014], Mon, 26 Oct 2015 13:16:51 GMT --> </html>
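A quick sanity check after a conversion like this is to confirm that every page now decodes as strict utf-8. The sketch below is my own addition (the function name `find_non_utf8` is hypothetical) and it is not proof of a correct conversion: a pure-ASCII page passes whether or not it was converted, and two adjacent high latin-1 bytes can occasionally happen to form a valid utf-8 sequence.

```python
from pathlib import Path


def find_non_utf8(root):
    """Return the .html files under root that do not decode as strict utf-8."""
    bad = []
    for page in sorted(Path(root).rglob("*.html")):
        try:
            page.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(page)
    return bad
```

Calling `find_non_utf8(".")` from the html directory should return an empty list once every page has been transcoded.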