Standardizing Standards 2: XML Workflow Choices


In this session, we’re going to be talking about workflow choices and when and where in your workflow you can introduce XML.

My name is Bruce Rosenblum. I’m the CEO of Inera Incorporated and Co-Chair of the NISO STS Working Group.

Transformative technologies such as the iPhone, the Kindle and the iPad have changed the way that we interact with content.

Instead of just working with paper, or a PDF on a large screen, we’re now reading our content on many many different devices and all kinds of different screen sizes.

These transformative technologies require new product features to take advantage of the capabilities of the technology: responsive design so that, for example, we can read text efficiently on a small screen such as a smartphone; automatic reflowable text; richly hyperlinked content; content that’s dynamically updated; and, now, accessibility for the visually impaired.

Reading a static PDF is just no longer good enough when we have all these new technological capabilities. Users expect a much more dynamic experience from our information.

Let me show you a brief example of what I mean by this. This is the ISO Online Browsing Platform, where ISO now hosts all of their standards, and they’ve created a completely new user experience.

If you go to the platform, which is a freely available website, you can try this out for yourself. In this example, we’re searching for standards about pasta. And very quickly after the search we discover that there are standards that have the word ‘pasta’ in them, including one that has the word ‘pasta’ in the title, and there are thirteen additional standards that mention the word ‘pasta’.

If we click through on that first standard you can see that we now have a table of contents for the standard, and we can see the foreword—all of this freely available.

Within that table of contents, though, you’ll see that some items are black—the scope, normative references, and terms and definitions—but everything after that is in grey.

This is actually by design, because what ISO is doing is giving you more information for free than they could previously do—the information that will help you make an informed buying decision. But the heart of the standard, they’re keeping until you actually pay them for it. So, when we get to the end of Section 3, everything else isn’t available until we get to the bibliography.

This is a wonderful freemium model, and a perfect example of the kind of thing that you can do with high-quality XML, because with that XML you can choose to expose more information without exposing all of it. Imagine trying to do this with a PDF.
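To make that idea concrete, here is a minimal sketch in Python of how a preview could be cut from one XML file. Everything specific in it is an assumption for illustration: the simplified `sec`/`title`/`p` markup, the `FREE_IDS` business rule, and the `access` attribute are hypothetical, and nothing here describes how ISO actually builds its platform.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified standard; real markup is much richer.
SAMPLE = """<standard>
  <sec id="sec_1"><title>Scope</title><p>Free text.</p></sec>
  <sec id="sec_2"><title>Normative references</title><p>Free text.</p></sec>
  <sec id="sec_3"><title>Terms and definitions</title><p>Free text.</p></sec>
  <sec id="sec_4"><title>Sampling</title><p>Paid text.</p></sec>
  <sec id="sec_bibl"><title>Bibliography</title><p>Free text.</p></sec>
</standard>"""

# Assumed business rule: which sections are shown before purchase.
FREE_IDS = {"sec_1", "sec_2", "sec_3", "sec_bibl"}


def build_preview(xml_text, free_ids):
    """Return a tree in which paid sections keep only their title."""
    root = ET.fromstring(xml_text)
    for sec in list(root.iter("sec")):
        if sec.get("id") not in free_ids:
            for child in list(sec):
                if child.tag != "title":
                    sec.remove(child)
            # Hypothetical flag a viewer could use to grey out the entry.
            sec.set("access", "locked")
    return root


preview = build_preview(SAMPLE, FREE_IDS)
for sec in preview.iter("sec"):
    print(sec.findtext("title"), "-", sec.get("access", "free"))
```

The table of contents stays complete because every title survives; only the body of the paid sections is withheld.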

Within the standard itself, because we’re in an HTML view, we can have richly hyperlinked content, so each one of these red circles is actually a clickable link, both within the standard for things like 6.11 or Annex A, and also external. So, we can link from this standard, ISO 7304, to other standards such as ISO 24333 with just a simple click of the mouse.

So, you can see this has now opened a new tab in their search interface so that you can see this other ISO standard. This is an incredibly effective use of XML. You can’t get this in a workflow where you’re only generating PDFs.

So, the foundation of all of this—not just the online browsing platform, but premium content distribution, responsive design, automatic reflowable text, rich hyperlinks, dynamic updating, and accessibility—the foundation for all of this is XML.

How do you get to XML? That requires thoughtful choices.

XML doesn’t just happen. It’s not something that you can wave your magic wand and the next day you’ll have it. What XML does require is re-engineering your publication workflow, new software tools to go with that re-engineered workflow, and additional production training. So, to get to XML it requires deliberate and thoughtful choices.

But the good news is that there’s already a lot of history of how to do this well, and how to do it efficiently, so that you don’t have to invent a whole lot of new wheels in order to make this happen for your organization.

If you want to add XML into your workflow, there are four key steps where you can add it.

The first is at the authoring or drafting stage of a document.

The second is after the authoring and drafting is done, but before you do any copyediting.

The third is after you’ve done your editing of the material, but before you do your page layout.

And the last is post-publication, meaning you take your final PDF file and convert it to XML.

Each of these points has pros and cons. There is no perfect solution, but there are solutions that have more advantages than others.

So, let’s start with the original XML dream. In this model, committees would create XML documents natively, editors would edit those documents in XML, and then you’d have XML that you could use for what’s called “single-source publication”.

You could very easily take that same XML file, you could make a print or PDF version, an HTML version, you could make your e-books, you could do metadata feeds, and you could create derivative products.
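As a rough illustration of single-source publication, the sketch below renders one small XML fragment three ways. The fragment, the function names, and the output shapes are all assumptions made up for this example; a production pipeline would normally use dedicated transformation and typesetting tools rather than hand-written Python.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single source: one section of a standard.
SOURCE = "<sec><title>Scope</title><p>This document specifies requirements.</p></sec>"


def to_html(sec):
    """Render the section as an HTML fragment."""
    title = escape(sec.findtext("title", ""))
    paras = "".join(f"<p>{escape(p.text or '')}</p>" for p in sec.findall("p"))
    return f"<section><h2>{title}</h2>{paras}</section>"


def to_text(sec):
    """Render the section as plain text with an upper-case heading."""
    lines = [sec.findtext("title", "").upper()]
    lines += [p.text or "" for p in sec.findall("p")]
    return "\n".join(lines)


def to_metadata(sec):
    """Derive a small metadata record from the same source."""
    return {"title": sec.findtext("title", ""), "paragraphs": len(sec.findall("p"))}


sec = ET.fromstring(SOURCE)
print(to_html(sec))
print(to_text(sec))
print(to_metadata(sec))
```

The point is only that all three outputs are derived from the same file, so a correction made once in the XML flows to every product.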

All this, of course, requires you to work in an XML environment.

But the reality is that authors don’t work in an XML environment. Yes, some of you may say, “Well isn’t DOCX from Microsoft an XML format for Microsoft Word?” And the answer is, “Yes, under the hood it’s XML, but it’s XML about preserving the format and layout of the document, not about preserving the semantics and the structure of the document.”

And so most people are using Microsoft Word in a method that only is focused on format and not on the structure, and that’s not good enough for the kind of XML that can drive something like the ISO Online Browsing Platform.
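Here is a small sketch of what “format, not structure” looks like under the hood. The fragment is a simplified, hand-written example of the kind of markup found inside a DOCX file, showing a heading the author made by applying bold by hand; the function name `describe_paragraphs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside a .docx file's main document part.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified fragment: a "heading" made by applying bold, then body text.
DOCX_FRAGMENT = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:rPr><w:b/></w:rPr><w:t>5 Sampling</w:t></w:r></w:p>
  <w:p><w:r><w:t>Samples shall be taken at random.</w:t></w:r></w:p>
</w:body>"""


def describe_paragraphs(xml_text):
    """Return (text, is_bold, paragraph_style) for each paragraph."""
    ns = {"w": W}
    root = ET.fromstring(xml_text)
    result = []
    for p in root.findall("w:p", ns):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        bold = p.find("w:r/w:rPr/w:b", ns) is not None
        style = p.find("w:pPr/w:pStyle", ns)
        style_name = style.get(f"{{{W}}}val") if style is not None else None
        result.append((text, bold, style_name))
    return result


for row in describe_paragraphs(DOCX_FRAGMENT):
    print(row)
```

The file tells us the first paragraph is bold, but nothing in it says “this is the title of Section 5”. A person can see that; a conversion program has to guess.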

Most people today do their authoring in Microsoft Word. A few still use WordPerfect—there are those lingering souls who are having trouble giving it up. We do know some standards that are created with FrameMaker, and finally some people are actually starting to move to Google Docs, because it’s an environment where they can easily work in a collaborative fashion.

But whichever of these tools you use, what you’re going to run into is an ‘author reality’. Which is, first of all, most authors don’t ‘think’ structure. They think about the information in which they are the world’s experts.

Furthermore, most authors don’t like production tasks. They just want to write up the text and maybe add a little formatting, but then getting it published? “That’s not my problem.”

What these authors are is brilliant subject matter experts, often the smartest people in the world in their particular subject matter or their specialty. But that also makes them hard to train and support, and even harder to control. Because it turns out that the more brilliant an author is, the less sophisticated they often are about how to use Microsoft Word, or actually, the better way to say it is the more creative they are about how to use Microsoft Word.

Let me give you a quick example. We saw an author one day who knew that he needed to put a minus sign in front of negative numbers, and knew that the hyphen wasn’t the same as a minus sign, but he couldn’t figure out how to insert a minus sign. So, he finally took an underscore character on his keyboard and made it superscript, and that was his minus sign.

Did it look right? Absolutely! Was it the correct thing from a semantic perspective? Absolutely not.
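A short sketch shows why the look-alike fails semantically. The three characters involved are different Unicode characters, and software sees the code point, not the shape on the page. The `find_fake_minus` helper and its input format, a list of text runs with a superscript flag, are assumptions invented for this illustration.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def find_fake_minus(runs):
    """Return indexes of runs that are a lone superscript underscore.

    `runs` is a hypothetical list of (text, is_superscript) pairs, the kind
    of thing a converter might extract from a word-processor file.
    """
    return [i for i, (text, sup) in enumerate(runs) if sup and text.strip() == "_"]


for ch in ("-", "\u2212", "_"):
    print(describe(ch))

# The author's "negative five": a superscript underscore followed by a 5.
print(find_fake_minus([("_", True), ("5", False)]))
```

A search, a screen reader, or a calculation that consumes the text will treat the author’s version as an underscore, whatever it looks like in print.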

So, these kinds of creative authors will, in fact, get in your way when it comes to trying to put them into an XML environment.

Some organizations such as ISO have actually tried to help the authors or the committees by creating Word macros. They’re really helpful, so long as they’re used properly. But they’re actually really hard to write, in part because smart authors will always try to outsmart idiot-proof macros.

So, what happens is you have an arms race where you try and make the macros more and more sophisticated, to make them easier and easier to use. And the more sophisticated you make them, the more complex they are, and we find that authors just find more and more ways to break them.

In addition, you’ve got to support multiple versions of Word—not just Word for Windows in five or six different versions, but also the Macintosh version of Microsoft Word.

And finally, macros—as active code—are getting harder and harder to install on users’ systems because of IT security requirements.

Some people have taken a different approach by trying to sidestep Microsoft Word altogether, and looked at online tools for authoring directly in XML. Is this the wave of the future? We’ve certainly seen some progress in tools that provide an HTML-like experience that have XML under the hood.

But what we’ve also seen in those attempts is that authors can break the structure they’re trying to create in those environments just as easily as if they were using Microsoft Word.

Also, these tools require continual online access—getting them to work equally well online and offline is a real challenge. And the math editors, for those standards that have any kind of display math in them, the math editors are still somewhat immature.

But finally, we come back to the same problem we had with Word: authors may give you the words and some formatting, but is it structurally correct? The same thing can happen in online authoring. They might give you XML, but it may not be structurally correct, no matter how much guidance you give them.

So, at least for now, we think that online XML authoring will continue to have some fairly large challenges.

Let me move to the back end, and look at the concept of post-publication workflows for XML. In this, you actually keep pretty much the same workflow you have today.

The committee submits a draft Word document, it’s edited in Word, it’s typeset—most standards organizations are doing their typesetting in Word; a few might use InDesign or FrameMaker because you can get a better-looking result. But a lot of organizations have stayed in Word because ultimately, once the standard is published, they have to give a Word file back to the committee to work on the next revision, with the exact same text as the published version.

If you keep everything in Word and just make a PDF from Word, it becomes easy to give the file back to the committee, though it doesn’t look as good because Word’s not the kind of page layout program that InDesign or FrameMaker is.

You then—whichever model you use—you proof and typeset your corrections, you publish the print version if you’re still doing print, and the PDF, and then you can make XML from that PDF file or sometimes from the Word file itself.

If you used InDesign or FrameMaker, at the end you have to convert the final typeset file back to Word, which is a fairly daunting prospect with both FrameMaker and InDesign; neither of them has good export capabilities back to Word. And finally, you return the Word file to the committee.

What works in this workflow? Well the biggest advantage is you have no workflow changes. But what are the disadvantages? There are many.

The biggest and first one is that the quality of the XML is unchecked. What do I mean by that? You may say, “I’m checking my XML quality.”

Well, if you are making the XML from your PDF, if you don’t sit there and check character-for-character—every single character in that XML—against the PDF, then your XML is unchecked, meaning you have no guarantee that it absolutely matches the PDF. That can be a huge liability.

Second, this workflow adds extra production time and cost, because it’s something you’re bolting on after the fact.

Third, errors can be discovered in creating the XML. For example, if you have a cross-reference to Section 5.9, but only at the point that you’re creating the XML do you discover that you only have Sections 5.1 through 5.8, you now have a content error and it may be too late to fix it.

XML production is actually great at catching those kinds of errors that might have been overlooked, even by the best of copy editors.
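The Section 5.9 example can be sketched as a simple automated check. This is a minimal illustration, assuming simplified markup in which each section carries an `id` attribute and each cross-reference is an `xref` element pointing at it with `rid`; the sample document and the function name `dangling_xrefs` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a reference to 5.9, but the sections stop at 5.8.
SAMPLE = """<standard>
  <sec id="sec_5.1"><title>General</title>
    <p>See <xref rid="sec_5.8">5.8</xref> and <xref rid="sec_5.9">5.9</xref>.</p>
  </sec>
  <sec id="sec_5.8"><title>Reporting</title><p>Text.</p></sec>
</standard>"""


def dangling_xrefs(xml_text):
    """Return the targets of cross-references that point at nothing."""
    root = ET.fromstring(xml_text)
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    return [x.get("rid") for x in root.iter("xref") if x.get("rid") not in ids]


print(dangling_xrefs(SAMPLE))
```

In a printed page, “see 5.9” is just text and nobody is forced to look. Once the reference is a link, the missing target shows up the first time a check like this runs.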

Ultimately, it’s not an integrated workflow, it’s almost essential to outsource, and what you’re left with is the choice of poor-quality typesetting from Microsoft Word, or doing some sort of a post-composition conversion back to Microsoft Word.

We tend to think of this workflow as putting a dinosaur behind the wheel of a car. Because what you’re not doing is looking at the best of what modern technology can do to solve a new technological problem.

The other point we need to make is that in this kind of a workflow it’s almost essential to outsource the XML conversion to a vendor. But vendors, like employees, need to be managed. If you don’t tell them exactly what you want them to do, and then check their work, the result may differ from your expectations.

We’ve seen many many cases over the years where firms outsourced an XML conversion to a vendor, and then what came back was never actually further reviewed by the publisher that outsourced the work.

And then they eventually look at the work four or five years later or, in some cases, they were doing it to futureproof themselves, to protect themselves for the future, and they discover when they finally open it up that all of the work is unusable, because they didn’t provide enough guidance upfront and they didn’t do any quality checking.

And we’ve unfortunately seen that story happen multiple times through the years.

So how do you manage these vendors for a successful project if you want to use this workflow?

First, you need to develop XML markup standards. You can’t just assume that the vendor knows what’s best for your content.

Second, you need to test several vendors. Have them all tag the same standards, so give them all the same documents, compare the results that come back, and select your final vendor based on quality, not on cost.

You then need to develop quality assurance tools and provide those to your vendors, but you also need to constantly recheck the quality of what the vendors are doing for you.

Finally, for those of you who have a back-content conversion for which you’d like to use an outside vendor, I strongly recommend reading this paper, ‘Beware of the laughing horse’, which is available at this URL.

It was presented last spring at the annual JATS-Con meeting, and it’s an excellent story about how to make a successful offshore conversion project.

The third of these four workflows is an XML-first workflow. In this, the committee drafts the standard in Microsoft Word, the standard is converted to XML immediately on entry to the editorial workflow, and then the editors do all of the editing in XML, typeset from that XML, and keep everything in XML to the very end.

And only once the standard has been published, you convert that XML back to Microsoft Word, and give it back to the committee for the next revision.

What are the advantages of this workflow? Well the biggest is that the file is continually validated to what’s called the DTD or Document Type Definition, in this case, the STS standard.

What are the disadvantages? First, it requires XML editing software for all editors. Training for that can be expensive, as all your editors now have to learn to work in XML. Freelance editors may not be practical, because you have to provide them the same XML software and, in some countries, providing software to a freelance copy editor can actually jeopardize their freelance status.

Editors have to work amidst the XML tags or, in some cases, you can customize the editing environment to minimize how much the editors are seeing of the tags, but that customization can be very expensive.

And finally, you have to do an extra conversion at the end to convert the XML back to Microsoft Word.

The fourth and last of these workflows is what we call the XML-middle workflow. In this workflow, the committee submits the standard as a Word document, it’s cleaned up in Microsoft Word and paragraph styling is applied if the committee hasn’t used a template that you may have provided to them. Then you edit in Microsoft Word, and you only convert the Word document to XML just before you’re going to do typesetting.

Here’s the beautiful part—you now typeset from the XML. So instead of creating a PDF from Word, you create the PDF from that XML. Then you proof that PDF that was created from the XML.

If the PDF is right, you know that the XML was right because the PDF has been created automatically from that XML.

If you need to make corrections you do them in Word, you regenerate new XML, and finally once you’ve proofed the PDF and blessed it, you can create your EPUB and your other derivative formats.

But ultimately, because you have a workflow that goes from Word to XML to PDF, and you know that the PDF is right because it was made from XML which is made from Word, you have a Word file you can give back to the committee and you know it has exactly the right content. So, you don’t have to worry about converting anything back to Microsoft Word at the end.

What are the advantages and disadvantages of this workflow?

In this workflow, you keep the editors working in Microsoft Word, which is an environment that virtually every editor knows and is comfortable with. You lower your training costs because you’re not introducing a lot of XML technology to most of the editors, and freelance editors are practical because they can still work on the document in Microsoft Word.

Structure is enforced prior to the final pages—what I mean by that is you’re creating XML before the content is final, so you’ll catch something like that error where you’re trying to cross-reference Section 5.9 and it doesn’t exist.

And ultimately the final content is ready in Word format for the next update by the committee.

What’s the disadvantage of this workflow? It typically requires some sort of an application in-house to create the XML.

So those are your four workflow choices. Authoring in XML, post-publication conversion, conversion to XML as soon as the document arrives in the editorial process, or conversion just in time for typeset—the XML-middle workflow.

As I said, all of these have their advantages and disadvantages, and hopefully the pros and cons of each are quite clear now.

I want to talk for a few minutes before closing about XML quality.

XML doesn’t come for free. As you’ve seen with all of these workflows, you have to have some sort of investment—whether it’s in software, whether it’s in an outsourced vendor to do your conversion. So, the XML isn’t free, but the quality doesn’t come for free either.

With a PDF-only output for your workflow, it’s simple. You create the PDF, you proof it, and publish it. XML is a little more complicated, because you create the XML, you proof it, and then you publish from it; or in this case, what you do is publish the PDF from the XML and you proof that.

The XML-first and XML-middle workflows definitely facilitate XML quality, and any workflow where you have the PDF created from the XML is a much more robust workflow.

This really is a huge liability with the post-publication conversion to XML, in that you just have no guarantee that that XML matches the final published content in the PDF.

But you may want to up the game a little bit more, and take a look at what we call XML quality plus. Because the content that’s in between the XML tags is important, but what’s often even more important is the metadata about the document, which may not be visible in the document but may be part of the XML for that document.

So, checking this kind of information requires additional quality checks. There are a variety of methods, and I won’t discuss them in detail, but the two most common are these. The first is false color proofing, where you create a sort of HTML view that puts different kinds of text in different colors, so that each kind stands out and you can make sure the tagging is correct in each case.

The second is a technology called Schematron, a rule-based validation language in which you write rules to proof your XML.
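To give a feel for false color proofing, here is a minimal sketch that turns a tagged paragraph into an HTML fragment with a background color per kind of tag. The element names, the palette in `COLORS`, and the function name `false_color` are assumptions for illustration; real proofing views are usually more elaborate.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical palette: one background color per kind of tagged text.
COLORS = {"std-ref": "#ffd54f", "xref": "#81d4fa", "term": "#a5d6a7"}


def false_color(el):
    """Render an element as HTML, wrapping selected tags in colored spans."""
    inner = escape(el.text or "")
    for child in el:
        inner += false_color(child) + escape(child.tail or "")
    color = COLORS.get(el.tag)
    if color:
        return f'<span style="background:{color}" title="{el.tag}">{inner}</span>'
    return inner


para = ET.fromstring(
    '<p>See <xref rid="sec_6.11">6.11</xref> and <std-ref>ISO 24333</std-ref>.</p>'
)
print(false_color(para))
```

When a proofreader opens the result in a browser, a standard reference that was left untagged, or tagged as the wrong thing, stands out because it has no color or the wrong color.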

Ultimately though, you should run your quality tools on every single XML file, and if you do use an outside vendor for any of your XML work, you need to make sure that you provide your quality tools to those vendors, that you require your vendors to run those tools, but then, finally, you rerun the tools when they submit the XML to you, using the old Russian motto: “Trust but verify.”

You can assume they’re going to do a good job, but it’s best to verify they actually have.
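The “verify” half can be as simple as rerunning the same rules on every file a vendor delivers. The sketch below is written in plain Python in the spirit of rule-based checking; it is not Schematron itself, and the element names, the rules, and the sample files are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Each rule is (message, test); the test returns True when the file passes.
# The element names are assumptions made up for this illustration.
RULES = [
    ("metadata must include a title",
     lambda r: bool((r.findtext("front/title") or "").strip())),
    ("metadata must include a release date",
     lambda r: r.find("front/release-date") is not None),
    ("every section needs a title",
     lambda r: all(s.find("title") is not None for s in r.iter("sec"))),
]

# Hypothetical deliveries: one acceptable, one with three problems.
GOOD = ("<standard><front><title>Pasta</title>"
        "<release-date>2000-01-01</release-date></front>"
        "<sec><title>Scope</title></sec></standard>")
BAD = "<standard><front><title> </title></front><sec><p>Text.</p></sec></standard>"


def check(xml_text):
    """Return the message of every rule the file fails."""
    root = ET.fromstring(xml_text)
    return [message for message, passes in RULES if not passes(root)]


for name, doc in (("good.xml", GOOD), ("bad.xml", BAD)):
    print(name, check(doc) or "OK")
```

Notice that two of the three rules look at metadata that never appears on the printed page, which is exactly the kind of content a visual proof of the PDF will not catch.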

So, at this point, if you’re wondering where to go next, we have a few recommendations.

If you’re going to move ahead with an XML project, the first is to evaluate and set business goals. Make sure you understand, from a business perspective, why you want to bring XML into your organization, and what the business value is that you’re going to derive from it.

We certainly believe that there are many business values and you can build a strong business case for bringing XML in, but each organization is different, and so we recommend that you build your own business case.

Once you’ve built that business case, then you can start driving technology decisions, but never never let technology decisions drive your business requirements. It should always come from your business requirements.

Second, learn more about XML. There are a number of places you can learn about XML, and more about the STS standard for markup of standards. Mulberry Technologies in Rockville, Maryland, has all kinds of really great courses where you can learn more about XML, and they were part of the authoring group for the STS standard so they know STS inside out. If you’re going to get outside help to learn about XML, there’s absolutely no better place to do it.

Also, there are two key sites to learn more about STS itself. First is at the NISO site, where there’s a workroom for the STS project, and you can go there and see information about the standard, including all of the committee minutes.

But, more importantly, the NISO STS documentation is available online, and that’s where, if you’re implementing an STS project, you can learn about every last bit of detail that you need to know on this standard, including common markup practices and best practices.

Third, talk to XML-savvy standards publishers. Lots of our publishers have now been down this road, and they can give you a lot of insight into what their experiences were, what their challenges were, and they can help you out. Most of the ones that we know are more than happy to share information with you so you don’t learn the same lessons they did the hard way.

Finally, if you don’t have an XML expert in-house, we recommend that you either hire one or hire a consultant who can help guide you through the process. There’s a lot of history already on how to do this well. Don’t try to reinvent the wheel without getting outside advice.

So, to get an XML project started once you’ve gone through those steps, select which XML workflow you’re going to use, and do this based on your business goals. Develop and document what your XML markup standards will be, and again, for this you’ll either need an in-house XML expert, or you’ll want to talk to an outside consultant.

Build some XML quality assurance tools that will work for your environment, and you may be able to use some tools that are starting to show up, including some that are available for free from ISO.

Start a pilot project. Don’t just step back and say, “We’re going to do the whole thing and, on Tuesday, we’re going to switch to doing everything in XML.” Start with a small pilot project, evaluate the results, fine-tune, re-evaluate and fine-tune.

We typically recommend that you allow at least a few months for fine-tuning, or for a series of fine-tuning steps on a pilot project, to make sure that you’ve got the XML exactly right for your organization’s requirements.

And finally, when you do have everything fine-tuned, then you can start your XML workflow. But you should always review that workflow once or twice a year so that you can continue to refine and improve, especially as you want to introduce new products based on that XML.

So, standards publishers! The good news is that we now can have standard XML. And what this will do is bring new production efficiencies, it’ll help you bring new products to your customers, it will bring you new business opportunities, and the solutions for all of this are improving daily.

So now is the time for you to move to XML.

If you have any questions please feel free to contact us at Inera Incorporated, and our information is on the slide. Thank you very much for listening.