Standardizing Standards 3: Getting Your Content from Word to STS XML


In this session we’re going to talk about getting your content from Microsoft Word to NISO STS XML.

My name is Bruce Rosenblum, I’m the CEO of Inera Incorporated and Co-Chair of the NISO STS Working Group.

Inera Incorporated has been around since 1992. Our focus is to develop systems for editorial and XML to automate the process of getting from Microsoft Word to XML.

We have a combination of expertise in software, technology, publishing, and workflow, and we apply that to help publishers develop single-source workflows to publish their content, usually using XML as a key part of that single-source workflow.

Ultimately what we’re trying to do is help our customers gain efficiencies through the publishing process, and leverage industry standards such as JATS, BITS, STS, and MathML.

eXtyles is a suite of editorial and XML tools for Microsoft Word. It facilitates not just converting from Word to XML, but a lot of the editorial preparation that you have to do in your document whether or not you’re creating XML.

eXtyles helps to clean up and mould Word documents into a standard visual and editorial style, helps to automatically edit references to external content, integrate metadata into the file, check and link normative references and their designations to other databases, correct content in the document through looking up information in databases, validate URLs, check internal cross-references, and finally, after all that’s been done as an aid to helping the editor, all of that work also helps to convert the document from Word to valid XML according to NISO STS and other schemas.

And of course, all of this can be completely customized for any publisher requirements. So if you have unique requirements, eXtyles is an open framework that can be customized to meet those needs.

We have customers in 26 different countries on six different continents. This is a small sampling; you can see that we work with quite a number of standards organizations, including organizations like ISO and IEEE.

We also work with very high-profile publishers of content in other realms such as journal publishing, where we work with many of the top journals around the world, and also governments such as the US Government Printing Office.

Fundamentally we have three key design principles behind eXtyles.

The single most important one is that the content authors are subject matter experts, and that’s all. They aren’t editorial experts, they aren’t production experts, they aren’t going to do your production and publication work for you.

They know Microsoft Word, they’re used to working in Word—actually they’re often used to working badly in Word, so often the documents you get in from them are not exactly the best-constructed documents in Microsoft Word—but they’re not going to do the editorial and production work for you.

So, you’re going to have to do that yourself and, most importantly, they’re not going to give you XML.

But interestingly, the next one of our key design principles is that most editors actually prefer working in Microsoft Word over specialized XML tools. Again, just like the authors, they know Word, they’re used to it, and they can focus on the content rather than focusing on the underlying technology that you’ll be using in the publication process.

And our final key design principle is that what we’re trying to do is automate any repetitive process in the editorial and production stage, but to do so safely. Anything that risks making an incorrect change in the content is well beyond the scope of what we’re trying to do.

But what we are trying to do is help the editors get through a lot of the technical editing of the content so they can focus on making sure that the content is readable and correct. And ultimately, that’s what’s most important to the consumers of your content.

What our proven approach has shown is that it increases productivity, improves quality of the published output, and lowers overall costs. Many of our customers have seen their time to publication drop by more than half once they’ve implemented an XML workflow with eXtyles.

So now let’s move on to the fun part, which is a demonstration of eXtyles.

We have a Microsoft Word document that was published by CEN a couple of years ago. CEN is the European Committee for Standardization, a European standards organization.

This one is about cereals and, like most standards documents, it has a foreword, scope, normative references, terms and definitions. Then we get into the body of the standard, and then in the back it has a bibliography, it has four annexes, A through D, so it’s a very typical standards document.

When eXtyles is installed in the environment, you have an extra ribbon within Microsoft Word that gives you the eXtyles options. And initially all of them are greyed out, because the main thing we want you to focus on first is the Activate and Normalization step. This is the first step in preparing the document.

And for CEN what we’ve actually done is automated much of their metadata collection. So what I can do is actually fill in, in this dialogue, the work item number for this document, and then I can click the Get Metadata button and what this is actually doing is looking up this metadata on a server located in Brussels at CEN and automatically adding it to the document.

So I click OK, and this information is now being stored with the document, and in fact there’s additional information that we didn’t even show onscreen. But this avoids rekeying of information—this avoids errors that can come with rekeying information—and provides a much easier mechanism for updating much of the front matter metadata in the document.

Once this is done, we can then move on to the next step of preparing the document, which is document cleanup.

Oh, and you’ll notice even before I move on, that all of a sudden the font has changed in this document because CEN changed styles a few years ago from using Arial for their documents to using Cambria, and automatically eXtyles has loaded and applied the new Cambria-based template.

So this is just a small example of how eXtyles can automate much of the formatting of your documents.

The next step is a Cleanup step where we have a large variety of options, including white space cleanup. We can control whether or not we’re removing white space from sections of the document that might have computer code, where clearly we wouldn’t want to do that.

And we can remove some of Word’s typographic controls because InDesign has much better typographic controls. We can do a whole bunch of other cleanup operations that help prepare this document.

I’m just going to click OK with this group of settings, and this will take just a moment to run on the document. In addition to all of that, there’s one item that’s not a checkbox item, and that’s Recovery of Special Characters.

One of the biggest problems we’ve seen publishers have over the years, particularly if they move content from Word to InDesign, is loss of special characters because of oddball font issues.

eXtyles works to make sure that every character is automatically recognized and reinserted into the document with a Unicode value, although visually it appears as a special character, so you’ll see where we have any accented letters such as these e-acutes, they are visually correct.

But more importantly they are structurally correct so that you’ll always have the correct special character when you go to XML or when you go to InDesign, or to any other environment.
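As a rough illustration of what recovering a special character can mean, here is a minimal Python sketch. It assumes the problem is a run of text set in the Symbol font, where a Latin letter is displayed as a Greek one; the function name, the (font, text) input shape, and the four-entry mapping are hypothetical and are not eXtyles internals.

```python
# Partial, illustrative mapping from Symbol-font keystrokes to Unicode.
SYMBOL_TO_UNICODE = {
    "a": "\u03b1",  # alpha
    "b": "\u03b2",  # beta
    "m": "\u03bc",  # mu
    "p": "\u03c0",  # pi
}


def recover_run(font_name: str, text: str) -> str:
    """Return the text of a run with Symbol-font letters as real Unicode."""
    if font_name != "Symbol":
        return text  # ordinary fonts already carry the right characters
    return "".join(SYMBOL_TO_UNICODE.get(ch, ch) for ch in text)


print(recover_run("Symbol", "m"))   # the Greek letter mu
print(recover_run("Cambria", "m"))  # a plain Latin m, left unchanged
```

Once the character is stored as Unicode, it no longer depends on the font being available downstream.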

Having done that Cleanup step, the next step is the single most critical one for preparing the document for XML.

If we change to Draft view in Microsoft Word, you can see that we have some styles here, but some paragraphs in fact aren’t styled correctly. And if we were to start with the foreword we could, if we wanted to, go to Microsoft’s Home ribbon and use Microsoft style controls to actually set up styles for this document.

So, we can go looking through here for a Foreword Title, and there’s my Foreword Title, but that’s actually a challenging interface to use.

And so in fact what we do instead is we provide the end users with a palette organized by logical sections of the document or logical clustering of styles such as title styles, ten point body styles and so on. And each organization can have their own organization styles.

Each time you click a button it applies a style and automatically highlights the next paragraph. And so this palette makes it much much easier to accurately and quickly apply styles.

So here we have a Heading 1, here we have Body Text, another Body Text paragraph, four List Continue paragraphs, Body Text again, another Heading 1, more Body Text. So you can see the body paragraphs have actually gone from Normal to Body Text. This paragraph is actually a normative reference, and so we have a special style for that.

And what we’re doing by adding these styles is providing all of the critical structural information that we ultimately need to create the XML.

Here we have terms and definitions, within this we have notes, and by the way there are hotkeys for all of these styles, so by pressing N here I also get a Note style applied.

I’ve actually pre-styled the rest of this document but you can see that this is a very easy and intuitive process, and in fact many of your organizations may have already been doing this kind of styling to prepare a document in Microsoft Word to make PDF anyhow.

So this is actually a natural extension of it, but it’s better because the styles are organized in a more logical fashion, number one. And number two, you may have occasionally through the years seen a case where you had a bit of italic or bold in the middle of a paragraph and applying a style with Microsoft Word tools actually obliterated that font change.

That’s actually a bug in Microsoft Word; Microsoft refused to fix that bug. But using the eXtyles palette you will never experience that bug because we’ve managed to program around it.

So this provides you a much better model for templating a document.

As I said, the rest of the document I’ve pre-styled so you don’t need to watch me do the entire document, but you can imagine that most documents would actually go through this styling process fairly quickly.

Now, once we have those styles we actually have the foundation to create XML, but there’s much much more that eXtyles can do before we create the XML.

The first thing, as we move from left to right across our menu or our ribbon, is we have a feature called Auto-Redact. And the idea behind Auto-Redact is that it’s a large Find and Replace with thousands of rules pre-programmed to your editorial style.

It will go through and not replace the copy editor, because a copyediting task is a human task which can’t be replaced by a computer even with artificial intelligence, but it will do a lot of technical cleanup on the document.

I won’t dive into all of these rules, but I’ll just click OK and let this run and describe, for example, that we have rules that can convert American English spellings to British English if that’s your style. Or the other way around, if you’re working with an international working group but you use standardized spellings.

We have rules that can clean up callouts to figures and tables. We have rules that can clean up units of measure so, for example, if your standard is to always abbreviate ‘Hour’ as ‘H’ we can automatically go through and do that.

But here’s the cool thing. We can do it in a context-sensitive fashion, so we would only do it if there’s a number preceding the unit of measure. So the expression ‘the experiment took 3 hours’ or ‘the test will take 3 hours’ can be converted to ‘the test will take 3H’. But a sentence that has ‘the test should take several hours’ will not be converted because that would be a nonsensical change.
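The context-sensitive rule just described can be approximated with a regular expression. This is only a sketch of the idea in Python, not the Auto-Redact rule syntax; the pattern and function name are assumptions made for illustration.

```python
import re

# Match "hour" or "hours" only when a number comes immediately before it.
HOURS_AFTER_NUMBER = re.compile(r"\b(\d+(?:[.,]\d+)?)\s+hours?\b")


def abbreviate_hours(text: str) -> str:
    """Rewrite '<number> hour(s)' as '<number>H'; leave other uses alone."""
    return HOURS_AFTER_NUMBER.sub(r"\1H", text)


print(abbreviate_hours("the test will take 3 hours"))
# the test will take 3H
print(abbreviate_hours("the test should take several hours"))
# the test should take several hours
```

Because the number is part of the match, a sentence with no number in front of the unit is never touched.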

So eXtyles can be very very context-sensitive in these rules. It can also be sensitive to specific paragraph styles when applying these rules. So it’s almost like having regular expressions that you can apply to the document.

Now for those of you who worry about what’s going on with automated changes, we actually make a backup copy of the document immediately before we run Auto-Redact, and then we can use Word’s Compare feature and we can see exactly what’s changed in the document.
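The backup-and-compare idea can be sketched outside Word as well. The Python below assumes the document is available as a list of paragraph strings before and after the automated pass; it uses the standard difflib module rather than Word’s Compare feature, and the function name is hypothetical.

```python
import difflib


def changed_paragraphs(before, after):
    """List (operation, old paragraphs, new paragraphs) for every difference."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, before[i1:i2], after[j1:j2]))
    return changes


backup = ["Scope", "See Fig. 1."]
edited = ["Scope", "See Figure 1."]
for change in changed_paragraphs(backup, edited):
    print(change)
```

The point is the same as in the demo: the unchanged backup makes every automated edit reviewable.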

So here, for example, you can see a non-breaking space has been added in the middle of the designation. And in fact you’ll see that throughout this document, that wherever we have a designation, a non-breaking space has been added just to make sure that you don’t get a bad break across lines.

We’ve actually added non-breaking spaces in lots of places. Per European style or Continental style, non-breaking spaces have been added between numbers and percent signs wherever those appear.
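A minimal Python sketch of this kind of non-breaking space rule follows. The two patterns are assumptions: one for a number followed by a percent sign, and one for a simplified designation shape (a prefix such as EN, FprEN, ISO, or IEC followed by a number). Real designations are more varied than this.

```python
import re

NBSP = "\u00a0"  # Unicode NO-BREAK SPACE

# A digit, optional ordinary spaces, then a percent sign.
NUMBER_PERCENT = re.compile(r"(\d) *%")
# Simplified designation: known prefix, one space, then the first digit.
DESIGNATION = re.compile(r"\b((?:Fpr)?EN|ISO|IEC) (\d)")


def add_nonbreaking_spaces(text: str) -> str:
    """Insert non-breaking spaces before percent signs and inside designations."""
    text = NUMBER_PERCENT.sub(lambda m: m.group(1) + NBSP + "%", text)
    return DESIGNATION.sub(lambda m: m.group(1) + NBSP + m.group(2), text)


result = add_nonbreaking_spaces("moisture of 15 % per EN 16378")
print(repr(result))
```

Printing with repr makes the inserted character visible as \xa0, since it otherwise looks like an ordinary space.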

Here’s more of an editorial type of change. Fig abbreviated has been changed to Figure spelled out.

This is an interesting one that CEN asked us to make. Any time they have words like ‘recommend’, ‘must’, ‘shall’, ‘may’, those are all highlighted in yellow so that the editor’s attention is drawn to those expressions and make sure that they’re used correctly.

That’s absolutely critical with standards, that it’s clear what’s a recommendation versus what’s a requirement, and by doing this automatically throughout the document we guarantee that the copy editor will actually see all of those cases.

We have lots more cases of non-breaking spaces being added; that seems to be a lot of the work that’s been done here. We have spaces added around equals signs, so we’re cleaning up the content around mathematical operators.

And that looks like it’s about it, so not a huge number of changes, but you can see a lot of bulk cleanup being done. And if you can imagine with all of these non-breaking spaces having been put in, it automatically saves time during the typesetting phase of the operation because you don’t have to go in and manually start putting in non-breaking spaces everywhere you need them in order to avoid having bad line breaks.

Once we’ve reviewed the results from Auto-Redact, we can actually turn our attention to some other parts of the document, for example, the normative references and the bibliography.

What we have in eXtyles is actually an open hook for advanced processes, and you can do all kinds of really cool things with the editorial content.

The first one is Advanced Processing for Bibliographic and Normative References, and what this is going to do is go through and identify the normative references and bibliographic entries primarily by the paragraph style that was applied.

And then it’s going to both do some content cleanup—although in this case this reference was already formatted correctly—and also indicate what type the reference is, and add these character styles which are later used for creating XML that can be used for linking through to external sources.

So you can see in the bibliography as well that we’ve not only taken care of marking up and cleaning up these references to other standards, but also we’ve handled this reference to a book completely correctly, even recognizing the organizational author in it.

Once we’ve gone through and dealt with the normative references and the bibliography, the next thing that we can do is check all of the in-text citations.

So, for example, FprEN, that’s an in-text reference to another standard, and we actually will go through and mark up all of those so that we know where we need to potentially have external links when this content might be posted online.

In addition we do some cross-checks of all of these items, and we’ll warn using Word comments if there are problems.

The first cross-check we do is we check to make sure that every item in the normative reference list is, in fact, cited at least once from within the body of the document. In this standard every item was, because we didn’t receive a warning.
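The logic of that first cross-check can be sketched in a few lines of Python. This assumes the normative references are available as designation strings and the body as plain text, and it treats dated and undated forms of a designation as the same standard; the function names are hypothetical.

```python
import re


def base_designation(designation: str) -> str:
    """Drop a trailing ':YYYY' so dated and undated forms compare equal."""
    return re.sub(r":\d{4}$", "", designation.strip())


def uncited_references(normative_refs, body_text):
    """Return the listed references that are never cited in the body."""
    uncited = []
    for ref in normative_refs:
        # The lookahead stops 'ISO 712' from matching inside 'ISO 7120'.
        pattern = re.escape(base_designation(ref)) + r"(?!\d)"
        if not re.search(pattern, body_text):
            uncited.append(ref)
    return uncited


refs = ["ISO 712:2009", "EN 16378:2013", "ISO 5223"]
body = "Determine moisture according to ISO 712 and sieve as in ISO 5223:2008, Table 4."
print(uncited_references(refs, body))
# ['EN 16378:2013']
```

An empty result corresponds to the case in the demo where no warning is raised.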

However, we did receive a different kind of warning. Let me just quickly close the Word Style palette so it’s easier to read the warning. And what we have here is that inline citation checking or matching detected that we have a reference to an object within a standard, but it’s an undated reference.

This is the kind of thing an editor might overlook but it’s actually quite important. If you’re referring to an object within a standard it should always be a dated reference so, for example, ISO 5223:2008 or whatever the year is. Because what Table 4 is may change over time if a new table is inserted or a table is deleted, and you want to make sure you’re referring to the right Table 4.
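A simplified Python sketch of this undated-reference warning follows. The citation shape it looks for (designation, optional year, then Table, Figure, Clause, or Annex and a number) is an assumption for illustration, and the pattern covers only a few designation prefixes.

```python
import re

# Assumed shape: "<designation>[:year], <Table|Figure|Clause|Annex> <number>"
OBJECT_CITATION = re.compile(
    r"\b(?P<std>(?:EN|ISO|IEC) \d+(?:-\d+)?)"
    r"(?P<year>:\d{4})?,? "
    r"(?P<kind>Table|Figure|Clause|Annex) "
    r"(?P<num>[A-Z](?:\.\d+)*|\d+(?:\.\d+)*)"
)


def undated_object_citations(text: str):
    """Return (standard, kind, number) for object citations that lack a year."""
    return [
        (m.group("std"), m.group("kind"), m.group("num"))
        for m in OBJECT_CITATION.finditer(text)
        if m.group("year") is None
    ]


sample = "Use sieves as in ISO 5223, Table 4 and as in ISO 5223:2008, Table 4."
print(undated_object_citations(sample))
# [('ISO 5223', 'Table', '4')]
```

Only the first citation is reported, because the second one pins Table 4 to the 2008 edition.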

So these are the kinds of warnings that eXtyles can give during the editorial process that actually help make sure that the content is as accurate as possible.

We can actually go a step further with all of this, and for all of the ISO and CEN standards we can actually now take these references to them and check them against databases located at ISO and CEN to make sure that, in fact, these are valid references and current references to these standards.

So we’re actually going through right now and taking every reference to an external standard in this document, and doing a web service query against an ISO server in Geneva and a CEN server in Brussels to make sure that, in fact, these are correct and valid references to standards.

And it’s done, it says it’s checked thirteen standards, and it’s found problems with five. Let’s see, we have on this first one an invalid reference. If you look at the date this is 2013, and the Fpr means that this is a final proof, not the final standard, so in fact this should be modified, probably to be EN 16378. And the same right here.

So immediately we’ve been able to detect some problems with this and, of course, I could double check against the database; I do happen to know that that is final. And then we can delete the comment very quickly and easily once we’ve resolved the problem.

But again we have other kinds of problems. This is a reference to a withdrawn standard. In this case there’s a newer version, the 2011 version, and it’s warning us ‘gee, maybe the newer version should be cited’. So this is a case where as an editor you might actually want to go back to the working group and make sure that they specifically meant to cite the older version rather than the newer version.

So again, this is a great way that you can work with your working group with these automated tools to make sure that you’ve got the most current version of the citation available.

Another example here is that the reference came back with a different title. The year was missing in the original title, which we preserve in the comment, and the 2009 was added here, so this is a way in which the published document that’s in front of us is being made more accurate before it’s published.

And again a warning on this last one that we saw earlier in-text, that this is a reference to an out-of-date standard.

So this is where adding this kind of automation can really help with the process, to save editorial time and make sure your content’s as accurate as possible.

We do have a feature for going through the document and validating that all of the URLs are pointing to current and valid websites, that you’re not going to get any 404 errors. I’m not going to bother running it on this document because we don’t have any URLs.
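For readers curious what a URL check involves, here is a minimal Python sketch using only the standard library. It is an assumption about the general technique, not a description of the eXtyles feature: it sends a HEAD request and treats any status of 400 or above, or any connection failure, as a broken link. Some servers reject HEAD requests, so a production check would likely need a fallback.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0):
    """Return the HTTP status code for a URL, or None if it cannot be reached."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, for example with 404
    except (OSError, ValueError):
        return None  # DNS failure, timeout, refused connection, or malformed URL


def broken_urls(urls, fetch_status=http_status):
    """Return the URLs whose status is missing or is 400 or above."""
    broken = []
    for url in urls:
        status = fetch_status(url)
        if status is None or status >= 400:
            broken.append(url)
    return broken
```

The fetch_status parameter exists so the check can be exercised with canned status codes instead of live network calls.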

The last thing I’m going to do before I create the XML, is a Citation Matching check, where we check all of the internal cross-references. So, for example, if we have Tables 1, 2, 3, and 4, is each table cited at least once and does any citation to a table resolve correctly?

So if we have four tables, but there’s a citation for Table 5, that will give us a warning that there’s a problem that we have a citation for Table 5 but in fact we haven’t found the right matching point for it.
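The two-way check for tables can be sketched in Python as a pair of set differences. This assumes the table labels are already known and that the body text excludes the table captions themselves; it also handles only singular citations such as "Table 5", not lists such as "Tables 1, 2, 3, and 4". The function name is hypothetical.

```python
import re

# Matches "Table 5" or "Table B.1" and captures the label.
TABLE_CITATION = re.compile(r"\bTable ([A-Z]\.\d+|\d+)")


def check_table_citations(table_labels, body_text):
    """Compare the tables that exist with the tables that are cited."""
    cited = set(TABLE_CITATION.findall(body_text))
    present = set(table_labels)
    return {
        "uncited": sorted(present - cited),     # table exists, never cited
        "unresolved": sorted(cited - present),  # cited, but no such table
    }


body = "See Table 1 and Table 2. Table 3 lists limits; Table 5 gives results."
print(check_table_citations(["1", "2", "3", "4"], body))
# {'uncited': ['4'], 'unresolved': ['5']}
```

The same pattern extends to figures, sections, and equations by changing the keyword in the regular expression.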

This cross-checks not only figures and tables but also bibliographic references, and sections and equations. And it has found a few problems so we’ll go looping through our comments again here.

And we find a warning: ‘No section matches the in-text citation 5.9. Please supply the missing section or delete the citation.’

So if we go backwards here we can see we’re in Section 7, but in fact Section 5 only had sections up through 5.8.

So we very quickly caught a problem again that, in this case, you would have to go back to the committee and say, “Gee, which section should this be pointing to, or do we need to revise the text?”

And you can see all of the other section citations have been correctly marked up, and this markup will in fact be used when creating the XML.

Let’s see what other warnings we have here. Annex B hasn’t been cited—that may be OK but again you might want to check with the working group. Figure B.1 hasn’t been cited.

By the way, you’ll notice that in CEN’s workflow we have just the names of the external figures here, we don’t in fact have the figures in the Word document, because in an XML workflow those figures ultimately have to be separate files so that they can be called in as images when you go to make PDF, when you go to the web, and so on.

But we also do support workflows where the images are embedded in Microsoft Word.

Again we have a warning that Figure D.1 hasn’t been cited, and D.2 hasn’t been cited, and D.3 and D.4. Again, these may be benign warnings, but better to have the warnings and be able to cross-check than to not have these automated warnings at all.

So having done all of the processes on the Advanced Processing menu, we’ve actually now helped the copy editor quite a bit, and at this point the document could be sent out for further copy editing, or at this point we’ve done enough that we can actually make XML.

So in order to make XML we come over to the Export menu and we choose which XML to make. In this case we’re making XML specific to CEN, because they have some specific metadata requirements that we’ve customized for.

And this’ll take just a few moments to convert the Word document over to XML using not just the paragraph styles that we added earlier, but also the character styles that have been added through the automatic processing.

And what’s really cool with all this, is that we haven’t had to go in and turn our copy editors into XML taggers. With the exception of applying paragraph styles, which you may already be doing in either Microsoft Word or InDesign to format your document today, we haven’t done anything more than that but we’ve added a lot of extra granularity.

And we have a message, congratulations, our file is valid according to the Document Type Definition or DTD—that means the document is valid according to the XML rules that you’ve laid out for it.

And what we have here in the XML is, first of all, all of the metadata. So we have the title in multiple languages, you can see that accented letters are represented as numeric Unicode entities. We could also do UTF-8, we can do ISO entities, so we have a lot of flexibility in how we set all of this up.
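The difference between numeric entities and UTF-8 is easy to see in a short Python sketch. The French title used here is a made-up example, not the title from the demonstration document.

```python
import html

title = "Céréales et produits céréaliers"  # illustrative title only

# ASCII output in which each non-ASCII character becomes a numeric reference.
as_entities = title.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(as_entities)
# C&#233;r&#233;ales et produits c&#233;r&#233;aliers

# UTF-8 output stores the same character as two bytes instead.
print(title.encode("utf-8")[:3])
# b'C\xc3\xa9'

# Both forms describe the same text once parsed.
print(html.unescape(as_entities) == title)
# True
```

Either serialization is valid XML; the choice mostly depends on what downstream tools handle most comfortably.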

We have a lot more of that metadata that came, if you remember, from the beginning of this session from the Document Information dialog where we loaded this metadata from the server located in Brussels.

All of this metadata came in through that, including various meta dates that are included in this documentation, the permission statement, the entire title page. This actually was never even visible, but when we did that web services call, this title page information also came in.

And now we finally have information from the Word document itself. We have the foreword which, if I go back to the Word document, you’ll see the foreword here. And where we have the reference to the external standard, we’ve actually marked that up as a standard reference.

That’s really cool because then CEN can take this and make it a hyperlink when creating HTML or any other organization can.

And then we get into the body of the document, and we have sections. And this is where things also start to be really neat, because if you remember in the Word document, we don’t have anything special here in terms of the number 1 in this heading, or the bullets at the beginning of the list.

But yet, in the XML, we’re able to automatically make this a label and separate it from the title, which gives tremendous flexibility when doing formatting for online or for PDF, the fact that these are in separate elements.
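Here is a minimal Python sketch of splitting a heading into a label and a title. The sec, label, and title element names follow NISO STS; the regular expression and the function name are assumptions for illustration, and real export adds attributes and content that are omitted here.

```python
import re
import xml.etree.ElementTree as ET

# A numeric label such as "1" or "5.9", then whitespace, then the title text.
HEADING = re.compile(r"^(?P<label>\d+(?:\.\d+)*)\s+(?P<title>.+)$")


def heading_to_sec(heading_text: str) -> ET.Element:
    """Build a <sec> whose label and title are separate child elements."""
    sec = ET.Element("sec")
    match = HEADING.match(heading_text)
    if match:
        ET.SubElement(sec, "label").text = match.group("label")
        ET.SubElement(sec, "title").text = match.group("title")
    else:
        ET.SubElement(sec, "title").text = heading_text  # unnumbered heading
    return sec


print(ET.tostring(heading_to_sec("1 Scope"), encoding="unicode"))
# <sec><label>1</label><title>Scope</title></sec>
```

With the number in its own element, a stylesheet can restyle, renumber, or drop the label without touching the title text.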

We preserve the italic markup from Word. Again with the lists, we’ve automatically separated out the bullets of the label from the rest of the paragraph, and you can see our non-breaking spaces in here.

And then where we have a cross-reference to Annex D, we’ve marked that up with an internal cross-reference, which means that when making PDF or making HTML, these can all be hyperlinks.

So what that means is now, instead of doing a lot of manual work to get internal cross-references as hyperlinks in your PDF, whether you’re setting up in Word or InDesign, all of this will happen automatically when you use Typefi to make your PDF.

Here’s our normative reference list, we have a special attribute on this, and if you were to look at the ISO Online Browsing Platform we’d see how they take advantage of these attributes.

We have the terms and definitions section marked up in TBX markup which is highly granular, highly structured, so that again you can do all kinds of flexible things in terms of pulling out the definitions, perhaps creating databases of them.

We also will cherry pick a few other things. We have the tables in the document. We preserve the column width information as well as the table width information, which can help with automatic layout.

More importantly, we preserve things such as column spans and row spans, so the entire structure of the table can be faithfully re-rendered online or in PDF, so no manual intervention is necessary.

You’ll also notice this scope attribute sitting in the table. This is for Section 508 accessibility. So by going to XML and using eXtyles to create that XML, you can start making it much easier to meet any accessibility requirements, whether you’re just concerned with, for example, in the United States, Section 508 accessibility requirements, or perhaps the Marrakesh Treaty which does require accessible content in many countries around the world.
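To show roughly what those table attributes look like, here is a small Python sketch that builds a header row with scope and colspan attributes. The XHTML-style element names (table, thead, tr, th) are what NISO STS permits for tables; the helper function and the column headings are hypothetical.

```python
import xml.etree.ElementTree as ET


def header_cell(row: ET.Element, text: str, colspan: int = 1) -> ET.Element:
    """Add a column header cell, marked with scope="col" for assistive tools."""
    attributes = {"scope": "col"}
    if colspan > 1:
        attributes["colspan"] = str(colspan)  # keep the span from the Word table
    cell = ET.SubElement(row, "th", attributes)
    cell.text = text
    return cell


table = ET.Element("table")
head_row = ET.SubElement(ET.SubElement(table, "thead"), "tr")
header_cell(head_row, "Sample")
header_cell(head_row, "Moisture content", colspan=2)

print(ET.tostring(table, encoding="unicode"))
```

The scope attribute tells a screen reader which cells a header applies to, which is what makes the table navigable without seeing its layout.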

So we have tables, we have figures in the document where again we parsed out the label at the beginning of the figure from the rest of the title. We’ve actually kept any author queries such as ‘gee, this figure hasn’t been cited’ as processing instructions. Those could actually be put onto a PDF proof if you wanted to.

And finally we have MathML, where we have converted all of the equation objects into MathML, and that can faithfully be rendered as well.

So we have very rich XML that nobody has to then go and edit by hand. So what that means is you can keep your team working very effectively in Microsoft Word without having much, if any, knowledge of XML.

They can very easily produce this kind of high quality XML, very granular highly-structured XML, according to the NISO STS DTD, and that can then be flowed with Typefi into InDesign to make PDFs automatically, it can be flowed onto a web page, it can be used to make EPUB, and it truly becomes the core of a single-source workflow.

So I hope in these last few minutes, I’ve shown you how easy it can be to take the Word content that comes from your committees or your working groups, and bring it through a process that gets you not only to NISO STS XML, but in addition can actually help your editorial team by making things faster and smoother through the whole process.

And ultimately get you to the point that you can produce wonderful documents in multiple formats, and actually do it faster, more easily, and more accurately, and at lower cost, than when you’ve been creating just a PDF.

Thank you for listening, and we hope you’ll join us for the rest of the webinars in this series.