flexibeast.space - gemlog - 2022-08-30

Creating minimal EPUBs with epub-create

i've recently had something of a Knuth moment, resulting in me writing a small POSIX shell script for generating an EPUB.

When putting together his epic “Art of Computer Programming” (‘TAOCP’) series[a], computing scientist Don Knuth[b] took a slight detour to develop the TeX typesetting system[c]:

“In 1976, Knuth prepared a second edition of Volume 2, requiring it to be typeset again, but the style of type used in the first edition (called hot type) was no longer available. In 1977, he decided to spend some time creating something more suitable. Eight years later, he returned with TeX”

on top of which computing scientist Leslie Lamport created LaTeX[d], which today is still the de facto standard for typesetting mathematics.

In my own case, last week i went looking for an EPUB version of a physical book i own, so that i could have it on my Kobo. It turned out the author made the contents available years ago, but as a number of distinct Web pages, rather than a single file, with many of them containing bad HTML. By ‘bad’ i don't mean “a bit inelegant”, i mean “CSS is a faraway land, let's use a morass of HTML elements with formatting attributes”.

This is a problem, because the EPUB format is based on XHTML files, and XHTML, as an XML application, has to be well-formed markup; it doesn't allow the sort of tag soup that a number of HTML parsers are forgiving of. (Which has been to the short-term benefit of some but the long-term cost of many.)

i know that EPUB is based on XHTML files because i asked myself, “Okay, how can i turn this collection of crappy HTML files into an EPUB?” There doesn't seem to be an obvious way to use the Pandoc document-conversion tool[e] for this - please correct me if i'm wrong! - and it seemed excessive to have to install the Sigil EPUB authoring software[f], based as it is on QtWebEngine, which is a huge dependency (particularly on Gentoo, where it has to be compiled; a ‘-bin’ package for it is not currently available).

As a result, i found myself spelunking through the various EPUB specs to try to work out the minimum that i needed to do to create a usable EPUB. Now that i've done that, you don't have to. i've added another item to my ‘guides’ collection:

“Creating a minimal EPUB”

and publicly released the small POSIX shell script i mentioned in my opening paragraph:

sourcehut: ‘epub-create’

The latter is intended to be as portable as possible, having only cat(1p), date(1p) and zip(1) as external dependencies. i've run it through checkbashisms (though i use zsh myself); it flagged ‘read -p’ as a possible bashism, though POSIX doesn't seem to specify a ‘read’ built-in[g][h], and i'm using ‘-p’ to specify a prompt, as supported by dash(1). (EDIT: The ‘-p’ option for ‘read’ in OpenBSD 7.1's ksh(1) is about co-processes, and this isn't affected by ‘set -o posix’, so i've removed the ‘-p’ option from ‘read’ calls, relying instead on ‘echo -n’.)

i've also run it through shellcheck, which, in addition to the ‘read -p’ issue (SC2039), flagged various other issues. i've addressed a number of them, but one that remains is ‘echo -n’: not specified by POSIX, but also supported by dash(1). (EDIT: u9000-Nine submitted, and i've merged, a PR which replaces ‘echo -n’ with a call to ‘printf’, which for some reason i keep forgetting is a POSIX utility as well as a function ...)

Please let me know if there are any ways the script's portability could be improved. :-)
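To give a rough idea of what ‘minimal’ means in practice, here is an illustrative sketch. It is not the actual epub-create script, and it is not taken from the guide. The directory layout (‘epub-sketch/’, ‘EPUB/package.opf’), the title, the identifier and the output name are all placeholders invented for this example. It assumes an EPUB 3 layout: a ‘mimetype’ file, a ‘META-INF/container.xml’ pointing at a package document, a navigation document and one content document. Besides cat, date and zip, it also leans on mkdir(1p) and rm(1p). Note the use of ‘printf’ rather than ‘echo -n’ to write the ‘mimetype’ file without a trailing newline.

```shell
#!/bin/sh
# Illustrative sketch only; NOT the epub-create script.
# All names below (directories, title, identifier, output file) are placeholders.
set -e

title='A Minimal Book'        # placeholder; must not contain &, < or >
out='minimal.epub'            # placeholder output name
work='epub-sketch'            # placeholder working directory
modified=$(date -u +%Y-%m-%dT%H:%M:%SZ)

mkdir -p "$work/META-INF" "$work/EPUB"

# The mimetype file must hold exactly this string, with no trailing newline.
printf '%s' 'application/epub+zip' > "$work/mimetype"

# container.xml tells the reading system where the package document lives.
cat > "$work/META-INF/container.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
EOF

# Package document: metadata, manifest (every file) and spine (reading order).
cat > "$work/EPUB/package.opf" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:00000000-0000-4000-8000-000000000000</dc:identifier>
    <dc:title>$title</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">$modified</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
  </spine>
</package>
EOF

# Navigation document (the table of contents).
cat > "$work/EPUB/nav.xhtml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <head><title>$title</title></head>
  <body>
    <nav epub:type="toc">
      <ol><li><a href="chapter1.xhtml">Chapter 1</a></li></ol>
    </nav>
  </body>
</html>
EOF

# A single well-formed XHTML content document.
cat > "$work/EPUB/chapter1.xhtml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter 1</title></head>
  <body><h1>Chapter 1</h1><p>Hello, EPUB.</p></body>
</html>
EOF

# Package it: mimetype goes in first and uncompressed, the rest compressed.
if command -v zip >/dev/null 2>&1; then
    rm -f "$out"              # zip would otherwise update an existing archive
    out_abs=$PWD/$out
    (
        cd "$work"
        zip -X0 "$out_abs" mimetype
        zip -Xr9D "$out_abs" META-INF EPUB
    ) >/dev/null
else
    echo 'zip(1) not found; generated files left in place' >&2
fi
```

Run from an empty directory, this should leave a ‘minimal.epub’ alongside the ‘epub-sketch/’ working directory. A real book would need one manifest ‘item’ and one spine ‘itemref’ per XHTML file, and a title containing XML special characters would have to be escaped first.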

🏷 documentation,ict



[a] Wikipedia: ‘The Art of Computer Programming’

[b] Wikipedia: ‘Donald Knuth’

[c] Wikipedia: ‘TeX’

[d] Wikipedia: ‘LaTeX’

[e] Pandoc

[f] Sigil

[g] “Shell Command Language”

[h] It turns out ‘read’ is in POSIX, just as a standard utility, not a shell builtin:

“read - read from standard input into shell variables”