For the first six years the Yearbook of Finnish Game Studies was published as a PDF file, but a few years ago I promised that we could publish it in three formats: as a webpage (HTML), as an electronic book (EPUB) and as a PDF.
I’d read about how to create a workflow that makes all this possible, but I was more ambitious: I thought that I could more or less automate the creation of all three formats. We’re not quite there, but most of the conversion steps are automatic (see the postscript about PDF conversion for some of the problems). I figured all of this out by trial and error, and by following the excellent advice from the Institute of Network Cultures, especially their “From Print to Ebooks: a Hybrid Publishing Toolkit for the Arts”.
From docx to markdown
Most scholars in the humanities and social sciences work with Microsoft Word. All of the manuscripts we’ve received in these three years have been in Microsoft Word’s doc(x) format, even after I updated the submission guidelines to state that we accept papers in a variety of formats. I think it’s a bad format for writing academic papers, but that’s beside the point. It’s definitely a bad format for publishing academic papers, so the first thing I do after receiving a manuscript is to convert it to a more useful format.
Publishing an article is a multi-step process: original manuscripts always have errors in them and they need to be corrected. The corrections need to be approved by the authors, new versions need to be created and so on. We are working in three different formats, so it would make no sense to make three complete versions and then redo all of them after each round of corrections.
Instead of three different formats of text, there is one baseline text that works as a basis for all the others. I create this file by converting the original manuscripts from docx files to markdown-formatted files with Pandoc. Pandoc is a “Swiss-army knife” of document conversion and is the central tool for making sure all of the formats correspond to each other.
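As a rough sketch of this first step, assuming Pandoc is installed and on the PATH, the conversion could be scripted in Python as follows. The function names and the "markdown" output folder are made up for illustration and are not taken from the Yearbook’s build files.

```python
import subprocess
from pathlib import Path


def docx_to_markdown_command(docx_path, out_dir="markdown"):
    """Build the Pandoc command that turns one manuscript into a markdown file."""
    source = Path(docx_path)
    target = Path(out_dir) / (source.stem + ".md")
    return [
        "pandoc",
        str(source),
        "--from=docx",
        "--to=markdown",
        "--wrap=none",  # keep each paragraph on one line, which makes diffs easier
        f"--output={target}",
    ]


def convert(docx_path, out_dir="markdown"):
    """Run the conversion; needs the pandoc executable to be available."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(docx_to_markdown_command(docx_path, out_dir), check=True)
```

Calling convert("incoming/smith.docx") would write markdown/smith.md, which then becomes the baseline text for that article.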
The first thing that needs to be done after the conversion is to clean up the text. Some authors don’t use styles for formatting their texts, so I guess what level of heading they mean when they bold some lines and write some in italics, remove extra line breaks and indents, and do all kinds of small formatting changes (that we explicitly ask authors not to do, but they insist on doing anyway). I also go over the list of references, since it’s usually not consistent with the guidelines. This is one of the most laborious things in preparing the files and can’t be easily automated.
I wrote some templates that help the conversion from docx to markdown. Markdown files support adding metadata in a separate YAML section, which I use for things like author info, abstracts and keywords. The idea is pretty simple: these things need to be in all of the formats, but they are written in slightly different ways in the different formats so they’re easier to handle as metadata.
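To make the metadata idea concrete, here is a minimal sketch that writes such a YAML section for one article. The field names title, author, abstract and keywords are ones that Pandoc’s default templates understand; the function name and the sample values are invented for this example, and the real Yearbook templates may well use other fields.

```python
import json


def yaml_front_matter(title, authors, abstract, keywords):
    """Return a YAML metadata block that can be placed at the top of a markdown file."""

    def quoted(text):
        # JSON string syntax is also valid YAML, so this avoids hand-written escaping.
        return json.dumps(text, ensure_ascii=False)

    lines = ["---", f"title: {quoted(title)}", "author:"]
    lines += [f"  - {quoted(name)}" for name in authors]
    lines.append(f"abstract: {quoted(abstract)}")
    lines.append("keywords:")
    lines += [f"  - {quoted(word)}" for word in keywords]
    lines.append("---")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample values only; they do not describe a real article.
    print(yaml_front_matter("Play", ["A. Author"], "Short abstract.", ["games", "play"]))
```

Because the values live in one place, each output template can decide for itself how to display them, for example as a byline on the webpage and as book metadata in the EPUB.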
I also add any final comments from the editors to appropriate locations in the text.
After the cleaning process, any further changes to the text are made to the markdown file. This way all of the versions of the text stay consistent and any changes need to be done only once.
I then convert the markdown file to an HTML file with Pandoc. I take the HTML file and copy-paste the contents onto our WordPress-based webpage. I have to take any included image files and transfer them separately onto the server and add them to the article. This can be very easy if the files were given to me as separate, appropriately sized image files. It can also take more work if the images were created with Microsoft Office’s tools, which create them in their own format that needs to be extracted from the docx file.
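A docx file is a ZIP archive, and embedded pictures are normally stored under word/media/ inside it. The sketch below copies those files out with Python’s standard library. The function name is made up for this example, and graphics drawn inside Office itself may be stored in other forms that this does not recover. Pandoc’s --extract-media option does a similar job during conversion.

```python
import zipfile
from pathlib import Path


def extract_docx_media(docx_path, out_dir):
    """Copy every file under word/media/ in a docx archive into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and everything outside the media folder.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target.name)
    return sorted(saved)
```

The returned list of file names is a quick way to see what the manuscript actually contained before uploading anything to the server.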
After the images are transferred, I send a link to the page for the author(s) to review. They usually notice some room for improvement at this point and there might be some minor changes that the editors requested. The author(s) send me their corrections either in new docx files or by email. I find the bits that need to be changed in the markdown file and convert it to HTML again. I copy-paste the result into WordPress again, and the first version of the text is finished.
Creating the EPUB file is mostly very easy. The EPUB format is quite simple, and in many ways resembles the HTML version. Pandoc mostly handles the conversion automatically, but in some cases has problems with tables. I clean these up manually and add some metadata with Sigil, but otherwise the process is automatic.
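Since the HTML and EPUB versions both come from the same markdown file, the two conversions can be driven from one small script. This is only a sketch under the assumption that Pandoc is on the PATH; the "build" folder and the function names are invented here and do not describe the actual build files.

```python
import subprocess
from pathlib import Path

# Output formats mapped to file extensions; both are standard Pandoc writers.
TARGETS = {"html": ".html", "epub": ".epub"}


def build_commands(markdown_path, out_dir="build"):
    """Return one Pandoc command per output format for a single markdown source."""
    source = Path(markdown_path)
    commands = {}
    for fmt, extension in TARGETS.items():
        target = Path(out_dir) / (source.stem + extension)
        commands[fmt] = [
            "pandoc",
            str(source),
            "--from=markdown",
            f"--to={fmt}",
            f"--output={target}",
        ]
    return commands


def build_all(markdown_path, out_dir="build"):
    """Run every conversion; a failing Pandoc call raises CalledProcessError."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in build_commands(markdown_path, out_dir).values():
        subprocess.run(command, check=True)
```

Without the --standalone option Pandoc writes an HTML fragment rather than a full page, which suits pasting into WordPress. After a round of corrections to the markdown file, running build_all again regenerates both outputs.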
Creating the PDF is probably the hardest and most time-consuming part of the process. I use Scribus to make the PDF. It’s far from perfect, and I’ve had many frustrating moments learning how to do things the way the developers decided would be the best way. I could have used InDesign for making the PDF, but eventually decided against it. Since there is no guarantee that I will make the PDF each year, I wanted to create a template that anyone could use if the layout is done by someone else.
Since I already have a text that has been corrected by the author(s), I import the HTML to Scribus. I change the styles to the correct ones, place images, adjust typography, and do a ton of manual fixes. This is my least favourite part of making the book, since it involves so much manual, very precise labour. It’s very easy to make small mistakes and they are very difficult to catch. Because I always do this in a hurry, there are probably small mistakes in all of the books I’ve made.
The whole process usually takes a few weeks, with corrections, comments and conversions taking a lot of my time, but after the PDF is done, the book is finished. The image below shows the relationship between the formats and how the conversions happen.
The build files are available on GitHub.
Postscript: Automating PDF creation
Originally I tried to automate creating the PDF files, converting the markdown files to PDF with Pandoc. Pandoc uses LaTeX to make PDF files so I used the basic templates given and wrote a very complex template to convert the files from markdown to LaTeX to PDF. Testing these was pretty difficult, since there were two steps in between the end result and the files I used as input.
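One way to make the two hidden steps easier to inspect is to ask Pandoc for the intermediate LaTeX file as well as the finished PDF. The sketch below builds both commands; the template path, the "build" folder and the function name are placeholders invented for this example, and a LaTeX installation is assumed for the PDF step.

```python
from pathlib import Path


def pdf_pipeline_commands(markdown_path, template_path, out_dir="build"):
    """Return Pandoc commands for the intermediate LaTeX file and the final PDF."""
    source = Path(markdown_path)
    tex = Path(out_dir) / (source.stem + ".tex")
    pdf = Path(out_dir) / (source.stem + ".pdf")
    # Passing --template makes Pandoc produce a complete document, not a fragment.
    common = ["pandoc", str(source), "--from=markdown", f"--template={template_path}"]
    return {
        # First step only: write the generated LaTeX so it can be read and diffed.
        "latex": common + ["--to=latex", f"--output={tex}"],
        # Both steps: Pandoc sees the .pdf extension and runs LaTeX itself.
        "pdf": common + [f"--output={pdf}"],
    }
```

Each command can be passed to subprocess.run(command, check=True). Comparing the .tex file with the markdown source shows whether a problem comes from the template or from LaTeX.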
The conversion wasn’t always correct, so I filed new bug reports with Pandoc and scoured the web for advice on how LaTeX interprets all kinds of weird edge cases. I almost had the conversion happening automatically in 2016, but every time I converted the Yearbook text, two paragraphs of one article would just disappear. Everything else looked right, but I can’t really publish a book that is missing two paragraphs. There was no apparent reason and no logic I could find, so I gave up and moved on to using Scribus. If this had been the only problem, I probably would have tried to solve it, but this was after I had already tackled a dozen weird problems.
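A missing paragraph like this is easy to overlook by eye, so a crude automatic check can help. The sketch below is not something I used at the time. It assumes you already have the source and the converted output as plain text (for instance the markdown converted with Pandoc’s plain writer, and the PDF text extracted with some other tool), and it only compares the opening words of each paragraph. Hyphenation and footnotes in the PDF text can cause false alarms.

```python
import re


def normalise(text):
    """Lowercase the text and keep only word characters, separated by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def missing_paragraphs(source_text, output_text, probe_words=8):
    """Return source paragraphs whose opening words do not occur in output_text."""
    haystack = f" {normalise(output_text)} "
    missing = []
    for paragraph in re.split(r"\n\s*\n", source_text):
        words = normalise(paragraph).split()
        if not words:
            continue
        # Pad with spaces so the probe only matches on whole-word boundaries.
        probe = " " + " ".join(words[:probe_words]) + " "
        if probe not in haystack:
            missing.append(paragraph.strip())
    return missing
```

An empty list means every paragraph’s opening words were found somewhere in the output; anything else is worth checking by hand.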
If you’re really good with LaTeX and want to solve the problem for me, the templates are available in the repository.