Thursday, August 7

Clean Up Your Ebook Files With HTML

A geek love letter [for YOU]

By Jordan McCollum, @JordanMcCollum

Part of the Indie Authors Series

I know. After you've gone to all the trouble to get your book looking good in your word processor, why would you want to go to all the trouble of formatting it in HTML—by hand? Especially when your word processor will do it for you?

A few reasons why HTML comes in handy:
  • You can post a professional-looking preview of your book on your website—without crazy characters, random backgrounds, and errors.
  • Converting your book to HTML can produce a cleaner result for ebook conversion.
  • Doing HTML yourself eliminates the needless clutter most word processors insert in their code, making it easier to produce a result that passes a validation test. 
  • If your word processor's HTML doesn't validate, your ebook won't either, which will get it rejected by some stores like Smashwords & Apple.
  • It's a lot easier to format something in clean HTML than to clean up the output of most word processors.

For the purposes of comparison, I had Microsoft Word (2007) save the beginning of one of my books as HTML (Web page, filtered option). Some of the problems (and yes, some of these are pretty technical, just smile and nod):
  • It failed to indicate its doctype, which made it fail the validation test.
  • The charset (the encoding used on the characters) is windows-1252. Standards for valid ePubs require the UTF-8 charset.
  • None of the special characters ("smart" or curly quotes and apostrophes, and an ê in this excerpt) were converted to HTML entities, which means that many e-readers will show them as a ? or a [].
  • Some other attributes of the code produced a number of other errors in the validation test (9 errors in 7 paragraphs).
  • The code is just a mess: tons of unnecessary clutter that can greatly affect the behavior of your ebook.
So what can you do? Fortunately, it's pretty easy to format in clean HTML, even right in your word processor.

First things first, save this as a new copy of your document. You don't want to overwrite your only copy with one full of code!

Next, we want to indicate where the paragraphs are. To do this, you can use Find & Replace to replace a return character with </p><p> which closes one paragraph and opens the next. If you want to keep the line breaks in there, you could also try </p>[paragraph break code]<p>. (Make sure to add a <p> at the beginning and eliminate the extra <p> at the end!) In Word, the paragraph break code for Find & Replace is ^p. (Look out for any tricky line breaks that might have snuck in! You can find them by searching for ^l.)
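If you'd rather do this step outside Word, here is a minimal sketch in Python of the same idea. It assumes you've saved the book as plain text with one paragraph per line; the function name `wrap_paragraphs` is made up for illustration.

```python
def wrap_paragraphs(text: str) -> str:
    """Wrap every non-empty line of plain text in <p>...</p>."""
    # One line of input is treated as one paragraph (an assumption
    # about how the manuscript was exported).
    paragraphs = [line.strip() for line in text.splitlines()]
    # Blank lines are skipped, so there is no stray empty <p></p> to clean up.
    return "\n".join(f"<p>{p}</p>" for p in paragraphs if p)


sample = "First paragraph.\n\nSecond paragraph.\n"
print(wrap_paragraphs(sample))
```

Because each paragraph gets its own opening and closing tag, there is no extra <p> to add at the start or remove at the end.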

I like to convert italics next. Here, we have Find & Replace search out any italic font, and then insert <i> before and </i> after. The ^& code tells Word to leave the same text intact inside the <i> elements.

To search for italic font in Word, click on the More button in the Find & Replace dialogue. This will open up more options.

Click on the Format button to open up this list.

And select Font... to open up the Font dialogue.

Now you can select Italic and click okay.

You can do the same for bold text, too, using <b> and </b>.

Once you're done, you'll want to turn off the Italic/Bold font in the Find What box, so select that box and click the No Formatting button at the bottom.

One issue here is that if you have italics across more than one paragraph, your HTML will not be nested correctly (you have to close tags in the opposite order you opened them). To find any spots where this might be a problem, search for any italicized paragraph returns, and add the appropriate </i> and <i> tags, making sure they're between the <p> and </p> tags.
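This cleanup can also be scripted. The sketch below is one possible approach, not the macro from this post: it assumes every paragraph is already wrapped in plain <p>...</p> tags with no attributes, and the name `fix_italic_nesting` is hypothetical. It closes any <i> left open at the end of a paragraph and reopens it at the start of the next one.

```python
import re

PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)
ITALIC_TAG = re.compile(r"</?i>")


def fix_italic_nesting(html: str) -> str:
    """Close italics at each paragraph end and reopen them in the next."""
    carried = False  # True while an <i> from an earlier paragraph is still open

    def repair(match: re.Match) -> str:
        nonlocal carried
        inner = match.group(1)
        if carried:
            inner = "<i>" + inner
        # Walk the italic tags in order; the last one tells us the final state.
        is_open = False
        for tag in ITALIC_TAG.findall(inner):
            is_open = tag == "<i>"
        if is_open:
            inner += "</i>"
        carried = is_open
        return f"<p>{inner}</p>"

    return PARAGRAPH.sub(repair, html)


broken = "<p>He thought, <i>Not again.</p>\n<p>Not today.</i> He ran.</p>"
print(fix_italic_nesting(broken))
```

Paragraphs whose italics are already balanced pass through unchanged, so it should be safe to run over the whole book.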

Now for special characters. We can start with one of my favorites, em dashes. Type or cut and paste an em dash into the find box. Replace it with the code (actually an HTML entity) &mdash; . Personally, because I first used a Kindle keyboard and it had a tendency to treat words with an em dash between them as one big unit when I made notes, I like to put spaces around my em dashes. But I also don't want to start a line with an em dash, so I make the first space nonbreaking (kind of "gluing" it to what comes before): &nbsp;&mdash;[there's a regular space here]. (I make it a bit more complicated by not doing this at the end of quotations. I'm a complex person ;) .
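As a rough illustration, the spaced em dash replacement could look like this in Python. The function name is invented, and the sketch does not handle the end-of-quotation exception mentioned above, so those spots would still need a manual pass.

```python
import re

# An em dash plus any spaces or tabs already sitting on either side of it.
EM_DASH = re.compile(r"[ \t]*\u2014[ \t]*")


def encode_em_dashes(text: str) -> str:
    """Replace each em dash with: nonbreaking space, &mdash;, regular space."""
    return EM_DASH.sub("&nbsp;&mdash; ", text)


print(encode_em_dashes("wait\u2014what"))
```

Swallowing the existing spaces first means a dash that was already typed with spaces around it does not end up double-spaced.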

I do something similar for ellipses. You could use the single-character ellipsis (three periods joined into one character), but I don't see many trade-published books using that character. So I replace " . . ." with &nbsp;.&nbsp;.&nbsp;.
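Here is a small Python sketch of the ellipsis replacement. Treating the single-character ellipsis and three unspaced periods the same way as " . . ." is my own assumption, as is the function name.

```python
import re

# Optional leading space, then either the one-character ellipsis
# or three periods with at most one space between them.
ELLIPSIS = re.compile(r"[ \t]*(?:\u2026|\.[ \t]?\.[ \t]?\.)")


def encode_ellipses(text: str) -> str:
    """Replace each ellipsis with three periods joined by nonbreaking spaces."""
    return ELLIPSIS.sub("&nbsp;.&nbsp;.&nbsp;.", text)


print(encode_ellipses("Well . . . maybe"))
```

The nonbreaking spaces keep the three periods together and attached to the word before them, so a line never starts or breaks in the middle of an ellipsis.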

Next, smart quotes. Personally, I think they make your ebook/excerpt look that much more professional, so I like to code them in to keep them. In Word, it's best to copy and paste each type of individual curly quote (left and right, double and single) to replace only those, or you'll end up with lots of backwards quotes.

For the opening quotation mark, replace it with &ldquo; (left double quote). The closing quotation mark is &rdquo; . The opening single quote mark (used rarely) is &lsquo; (left single quote). The closing single quotation mark/apostrophe is &rsquo; .
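The four quote replacements map one character to one entity, so they can be sketched as a simple lookup table in Python (the names here are made up for illustration):

```python
# Each curly quote character and the HTML entity that replaces it.
QUOTE_ENTITIES = {
    "\u201c": "&ldquo;",  # opening double quote
    "\u201d": "&rdquo;",  # closing double quote
    "\u2018": "&lsquo;",  # opening single quote
    "\u2019": "&rsquo;",  # closing single quote / apostrophe
}


def encode_smart_quotes(text: str) -> str:
    """Swap each curly quote for its entity, leaving straight quotes alone."""
    return text.translate(str.maketrans(QUOTE_ENTITIES))


print(encode_smart_quotes("\u201cDon\u2019t,\u201d she said."))
```

Since each curly character is matched exactly, there is no risk of turning a left quote into a right one.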

Finally, you'll want to replace any other special characters—accents, degree signs, etc.—with their respective HTML entities. (Accents are &agrave; where a = the letter with the accent [case sensitive] and grave = the direction/type of accent. Degrees are &deg;. Everything else I look up ;) .
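If looking up every character gets tedious, Python's standard library ships a table of named entities (`html.entities.codepoint2name`). The sketch below is one way to use it; the function name and the choice to fall back to numeric references such as &#337; for characters without a name are my assumptions.

```python
from html.entities import codepoint2name


def encode_special_characters(text: str) -> str:
    """Replace every non-ASCII character with an HTML entity."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # Plain ASCII, including existing tags and entities, is kept as-is.
            out.append(ch)
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # named, e.g. &ecirc;
        else:
            out.append(f"&#{code};")  # numeric fallback
    return "".join(out)


print(encode_special_characters("f\u00eate at 20\u00b0"))
```

Because it leaves ASCII untouched, it can run after the other replacements without disturbing the tags and entities they inserted.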

The only other things you'll need to do are convert your chapter headings to a special type of paragraph (which largely depends on how you proceed, so it might be best to wait until you've got it in your program to do this) and add the header information. (I use this guide to create my ebooks by hand, and it includes more details on these more technical aspects, such as this header information template.)
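For a web preview, the header can be very small. The sketch below is NOT the template from the guide mentioned above; it is a generic HTML5 skeleton I'm assuming for illustration, with a made-up function name. It declares a doctype and the UTF-8 charset, the two things the Word output was missing. EPUB content files are XHTML, so an ebook will likely need a stricter, XML-style header than this.

```python
from html import escape


def build_page(title: str, body_html: str, lang: str = "en") -> str:
    """Wrap already-coded paragraphs in a minimal HTML5 document."""
    return (
        "<!DOCTYPE html>\n"                    # the doctype Word left out
        f'<html lang="{escape(lang)}">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'             # declare UTF-8, not windows-1252
        f"<title>{escape(title)}</title>\n"    # escape & and < in the title
        "</head>\n"
        "<body>\n"
        f"{body_html}\n"
        "</body>\n"
        "</html>\n"
    )


print(build_page("Sample Chapter", "<p>First paragraph.</p>"))
```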

This probably sounds like a lot of work, and I said this was easy, didn't I? Well, it is—especially when you use a macro.

A macro is a piece of code, often used to automate a repetitive task. I use them to help me freshen up my writing and avoid overused gesture crutches, and I've also programmed all my HTML conversion find & replaces into a macro so it takes just three clicks to do all this (and a little clean up to fix those stray </i> tags).

You can make a macro by recording all these find & replace searches (check out Abby Annis's guide to recording macros), or you can enter the whole thing in code, such as this code.

I always cut and paste my coded books into a plain text editor such as Notepad to make sure nothing wonky happens when I save them (and so I can pick the right encoding: UTF-8).
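If a file has already been saved in the wrong encoding, it can be converted rather than re-pasted. A minimal Python sketch, assuming the source file is in windows-1252 (which Python calls cp1252); the function name and file names are hypothetical:

```python
from pathlib import Path


def reencode_to_utf8(source: Path, target: Path,
                     source_encoding: str = "cp1252") -> None:
    """Read a file in its original encoding and save a UTF-8 copy."""
    text = source.read_text(encoding=source_encoding)
    target.write_text(text, encoding="utf-8")


# Example call (file names are placeholders):
# reencode_to_utf8(Path("chapter1-word.html"), Path("chapter1-utf8.html"))
```

Writing to a separate target file keeps the original intact in case the guess about the source encoding turns out to be wrong.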

And voila! You can cut and paste this into the HTML tab of a new post or page on your site or start building your own ebook now!

Jordan McCollum is the (indie!) author of the romantic suspense series Spy Another Day, which begins with I, Spy. She enjoys teaching writing craft through her writing craft blog, as the Education Director of Authors Incognito (an online writers' support group with over four hundred members), and through her books CHARACTER ARCS (with a foreword by Janice Hardy) and CHARACTER SYMPATHY.


  1. Whew, this intimidates me! I use Apple Pages, which converts to ebook. Thank goodness. I'll share these tips.

  2. Great help - I know basic HTML, but this step by step process specific to converting books is great!