2020-08-01

A guide to getting your feet wet with pandoc, for writers

Pandoc is a magical piece of software. It'll handle anything you give it, and with just a few bits of tweaking, you can get its output to look like anything you want. It can turn your work into a book-formatted PDF with page numbers, anchors, a table of contents, and anything else you might need. And by changing a few arguments, you can have it generate an epub in whatever style you like, complete with autogenerated metadata.

As a writer, it's always best to work in whatever environment you're most comfortable with, and pandoc won't interfere with that in any way. Be it Google Docs, Microsoft Word, LibreOffice Writer, or just a plain text editor, pandoc will take what you've written and turn it into the format you want.

The basic way to make a PDF or an epub is the following, with book.docx being used as an example. Anything else will work, too.

pandoc book.docx -o book.epub

Well, that was easy. There's a problem, though: if you're using a WYSIWYG editor (one where you see your formatting changes instantly), there are likely to be a few invisible formatting quirks that you'd want to fix before publishing. Luckily, pandoc helps with that too.

The best option is to use an intermediary markup format. You could go with HTML, but it's awfully heavy if you just want to worry about formatting, so I'd recommend Markdown as your intermediary. Depending on your tastes, Textile is also an option, and of course, pandoc can do both.

So instead of the above, let's do the following:

pandoc book.docx -o book.md

Now your precious manuscript is one giant markdown file, which you can freely edit in a text editor to examine any formatting mistakes!

Common examples are three periods instead of an ellipsis, minus signs instead of en- or em-dashes, and incorrect quotation marks.

(Image: example LibreOffice Writer document)

Putting the above document through pandoc produces the following markdown:

This isn't an ellipses, it's three periods\... This one is though...

Well, this sentence is fine\--wait, that's not an em-dash! But---this is!

**This isn't a title!**

But this is!
============

As you can see, pandoc has inserted a backslash in front of anything it doesn't want automatically converted into its proper typographic equivalent. Without the markdown step, you'd have to hunt down every last fake ellipsis or broken em-dash by eye, but with markdown you can simply ctrl+F all the backslashes! Sometimes they're needed, though, for when a character means something to markdown, like asterisks, which are used for emphasis and strong text. If you wanted a literal *, you'd write \*. And if you wanted a literal \*, well, here's the source for this sentence.

If you wanted a literal \*, you'd write \\\*. And if you wanted a literal \\\*, well, here's the source for this sentence.
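If ctrl+F isn't your thing, the same search can be done from a terminal. Here's a minimal sketch using grep; sample.md is a made-up file that the snippet creates for demonstration, so point grep at your own markdown file instead.

```shell
# Build a tiny sample so the search has something to find
# (sample.md is a hypothetical name; use your own manuscript in practice).
cat > sample.md <<'EOF'
This isn't an ellipsis, it's three periods\...
Well, this sentence is fine\--wait, that's not an em-dash!
This line has nothing escaped.
EOF

# -n prints line numbers; -F treats the pattern as a plain string,
# so '\' matches a single literal backslash.
grep -nF '\' sample.md
```

Only the first two lines of the sample should show up, each prefixed with its line number, which gives you a checklist of spots to look at.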

But yes, it's much easier to make sure a markdown file is all cleaned up than in a WYSIWYG editor! Once every title is a title, a table of contents automatically gets generated for you!

If you've already run the epub-generating commands above, you'll notice that it complains about not having a specified title. The title of your book is a part of the metadata, which exists as frontmatter to the book itself. Frontmatter is just a chunk of data that doesn't end up in the final product, and is only used by pandoc. An example from one of my books is this:

---
title: "Anthology of Lewd Vol 3"
author: "Anna Harren"
description: "Third yearly anthology of smut from Anna Harren"
date: "2018-12-01"
cover-image: "cover-small.png"
...

You can throw this right at the top of your markdown file, or you can have a separate metadata.yaml that you include in the command line like the following:

pandoc chapter1.md chapter2.md metadata.yaml -o book.epub
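Pandoc also has a dedicated --metadata-file option, which some people find tidier than listing the YAML file among the inputs. The sketch below assumes a reasonably recent pandoc (the option arrived in the 2.x series); the field values are placeholders, and the chapter file names are simply the ones from the command above. Since the file is read as plain YAML here, I've left the --- and ... delimiters out.

```shell
# Write the metadata to its own file (placeholder values, swap in your own).
cat > metadata.yaml <<'EOF'
title: "My Book"
author: "Your Name"
date: "2020-08-01"
EOF

# Only attempt the build if pandoc and the chapter files are actually present.
if command -v pandoc >/dev/null 2>&1 && [ -f chapter1.md ] && [ -f chapter2.md ]; then
  pandoc chapter1.md chapter2.md --metadata-file=metadata.yaml -o book.epub
fi
```

Either approach should make the missing-title complaint go away; use whichever you find easier to keep track of.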

Generating PDFs

There are a few ways to get pandoc to spit out a PDF, and which one you pick depends on what you'd rather deal with. If you have enough experience with LaTeX to make a good book-style template for it, it's a decent option, but for everyone else, I'd recommend either WeasyPrint or wkhtmltopdf. They both have strengths and weaknesses, and they're both supported by pandoc.

Here's the example document put through wkhtmltopdf:

pandoc example_doc.odt --pdf-engine wkhtmltopdf -o test.pdf

(Image: example PDF)

As you can see, it's nothing fancy, but that's far from the end of what pandoc offers. The look of the resulting document, be it PDF or epub, comes from pandoc's template files. On Linux these are typically found in /usr/share/pandoc/data/templates/. PDFs use default.html5 to format the document into something for wkhtmltopdf or WeasyPrint to render, while epubs use default.epub3.

In the end they are both HTML documents with pandoc's template syntax thrown in. With a little CSS and HTML knowledge you can get most things styled the way you want, and then you can point pandoc to your custom template by adding a command-line option like --template custom.epub.
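If you'd rather not go digging for the template files, pandoc can print its built-in default template with -D, which gives you a starting point to edit. This is just a sketch: custom.epub3 is an arbitrary file name I picked, and book.md stands in for your manuscript.

```shell
# custom.epub3 is an arbitrary file name chosen for this sketch.
if command -v pandoc >/dev/null 2>&1; then
  # -D prints pandoc's built-in default template for the given output format.
  pandoc -D epub3 > custom.epub3

  # Edit custom.epub3 (tweak the HTML, add CSS), then build with it.
  # book.md is a placeholder for your own manuscript.
  if [ -f book.md ]; then
    pandoc book.md --template custom.epub3 -o book.epub
  fi
fi
```

The same idea should work for PDFs by dumping the html5 template instead (pandoc -D html5).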

A few last notes: Most devices can take epubs directly, and for those that can't (Kindles), epubs can easily be converted to mobi using something like Calibre. If you're uploading to the Kindle store, it'll do the conversion for you. Also, there's no need to write your own table of contents! There's a simple option for it, --toc. Note that epubs don't really need it, as they carry the table of contents as part of the metadata, which any ebook reader can open up by itself. It's definitely useful for PDFs, though.
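For a PDF, adding the table of contents might look something like this. It's a sketch that assumes wkhtmltopdf is installed and that book.md is your cleaned-up markdown; --toc-depth is optional and only controls how many heading levels get listed.

```shell
# book.md stands in for your cleaned-up manuscript.
# Only run if both pandoc and wkhtmltopdf are installed and the file exists.
if command -v pandoc >/dev/null 2>&1 && command -v wkhtmltopdf >/dev/null 2>&1 && [ -f book.md ]; then
  # --toc adds the table of contents; --toc-depth=2 limits it to
  # the top two heading levels.
  pandoc book.md --toc --toc-depth=2 --pdf-engine=wkhtmltopdf -o book.pdf
fi
```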

One last note: If you want to avoid all of this, and just give me your manuscript and get a PDF and an epub, I offer that service! Feel free to contact me.