Bring on the data citations!

In a recent blog post, we talked about the process of building preprint citation handling into eXtyles Reference Processing. But that’s not all we’ve been working on!

Starting with Build 4613, you’ll see two new behaviors in eXtyles Reference Processing and Crossref Linking: first, eXtyles can now parse data citations; and, second, we’ve updated and improved how eXtyles verifies DOIs.

What does eXtyles do with data citations?

eXtyles handles data citations much as it handles citations to preprints and conference proceedings: these types of entries are parsed but not restructured.

Here’s an example:

Input:

Screenshot of a reference entry: Bandy, Altaf; Almaeen, Abdulrahman Hamdan (2020), Antibiotic resistance in Gram-negative Bacteria Causing Blood stream infections, v13, Dryad, Dataset, doi:10.5061/dryad.nvx0k6dp9

Output:

Screenshot of a reference entry processed through eXtyles and identified as a data set: Bandy, Altaf; Almaeen, Abdulrahman Hamdan (2020), Antibiotic resistance in Gram-negative Bacteria Causing Blood stream infections, v13, Dryad, Dataset, doi:10.5061/dryad.nvx0k6dp9. Different elements such as author names and date are highlighted in different colors, and the entry has "data" tags around it.

As the screenshot indicates, eXtyles has added our new <data> tag to the entry and has applied appropriate character styles to elements such as author names, title, repository name, date, and DOI, but has not added, changed, or rearranged any of these elements.

How does it work?

In deciding whether an entry should be identified as a data citation, eXtyles looks for the names of known data servers (e.g., Dryad Digital Repository, HEPData, Harvard Dataverse) and for DOI prefixes that we know to be associated with specific data servers.

If one or more of these elements is found in a reference entry, eXtyles adds the <data> tag.
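To make that check concrete, here is a minimal Python sketch of the idea. It is not eXtyles code: the function name and lookup tables are hypothetical, the server names are the three mentioned above, and the only DOI prefix listed is the one that appears in the Dryad example earlier in this post.

```python
import re

# Hypothetical lookup tables for illustration; the real eXtyles lists are not
# published in this post.
DATA_ONLY_SERVERS = ("Dryad Digital Repository", "HEPData", "Harvard Dataverse")
DATA_DOI_PREFIXES = {"10.5061"}  # prefix taken from the Dryad example above

# Captures the registrant prefix (e.g. "10.5061") of the first DOI in the entry.
DOI_PATTERN = re.compile(r"\b(10\.\d{4,9})/\S+")


def looks_like_data_citation(entry: str) -> bool:
    """Return True if the entry names a known data server or uses a known data DOI prefix."""
    lowered = entry.lower()
    if any(name.lower() in lowered for name in DATA_ONLY_SERVERS):
        return True
    match = DOI_PATTERN.search(entry)
    return bool(match and match.group(1) in DATA_DOI_PREFIXES)


entry = (
    "Bandy, Altaf; Almaeen, Abdulrahman Hamdan (2020), Antibiotic resistance in "
    "Gram-negative Bacteria Causing Blood stream infections, v13, Dryad, Dataset, "
    "doi:10.5061/dryad.nvx0k6dp9"
)
print(looks_like_data_citation(entry))  # True, via the DOI prefix
```

Either signal on its own is enough in this sketch, which mirrors the "one or more of these elements" rule described above.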

💡 Did You Know? A data citation that isn’t tagged as <data> will usually be tagged as <eref>. This is because, unlike most preprint citations we’ve seen, author-supplied data citations almost always include a DOI or URL.

When eXtyles parses a data citation, it reuses some of the character styles you’ll typically see in entries for journal articles, such as bib_journal and bib_article, but exports them differently: whereas a bib_article element inside <jrn> tags is exported as <article-title>, a bib_article element inside <data> tags is exported as <data-title>.
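One way to picture this context-dependent export is a lookup keyed on both the enclosing reference tag and the character style. The Python sketch below is only an illustration under that assumption: the table and function names are hypothetical, and it lists just the two pairs named in this post.

```python
from xml.sax.saxutils import escape

# Hypothetical table: (enclosing reference tag, character style) -> export element.
# Only the two pairs named in this post are listed.
EXPORT_ELEMENT = {
    ("jrn", "bib_article"): "article-title",
    ("data", "bib_article"): "data-title",
}


def export_styled_text(ref_tag: str, char_style: str, text: str) -> str:
    """Wrap styled text in the export element chosen by its enclosing reference tag."""
    element = EXPORT_ELEMENT[(ref_tag, char_style)]
    return f"<{element}>{escape(text)}</{element}>"


print(export_styled_text("jrn", "bib_article", "A journal article"))
print(export_styled_text("data", "bib_article", "A data set"))
```

Keying on the pair rather than on the style alone is what allows the same character style to serve both reference types without introducing new styles.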

We also have the ability to add new character styles, but to ensure complete backwards compatibility, we’ve chosen not to do so at this time.

When is a data citation not (necessarily) a data citation?

Some data servers, such as those we mentioned above, host data sets exclusively; if a reference entry includes the name of one of these servers, we can be confident that the entry should be tagged as <data>.

Other servers, including the widely used Zenodo and FigShare, host data sets and also a variety of other content types. This means that in order to tag a reference to a source hosted on one of these servers as <data>, eXtyles needs more information: for example, a Zenodo entry that includes a URL or DOI + “data set” can confidently be tagged as <data>:

Screenshot of a reference entry processed by eXtyles: <data>Hanson, B., Wooden, P., & Lerback, J. (2019). Datasets for Age, Gender, and International Author Networks in the Earth and Space Sciences: Implications for Addressing Implicit Bias [Data set]. doi:10.5281/ZENODO.3591871</data>

but one that includes only “Zenodo” + a URL or DOI, with no explicit “data set” label, will be tagged as <eref>:

Screenshot of a reference entry processed by eXtyles: <eref>Hanson, B., Wooden, P., & Lerback, J. (2019). Datasets for age, gender, and international author networks in the Earth and Space Sciences: Implications for addressing implicit bias. Zenodo. doi:10.5281/zenodo.3591871</eref>

As always, eXtyles errs on the side of caution in making the <data> vs <eref> decision. In the second example above, the presence of the word “Datasets” in the title of the cited work isn’t enough to reliably identify this as a data citation, since without the explicit identification “[Data set]”, this could be an article about data sets!
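The cautious rule for mixed-content servers can be sketched in Python as follows. This is an assumption-laden illustration, not the eXtyles implementation: the function name and patterns are hypothetical, and "explicit identification" is approximated here as a bracketed label such as “[Data set]”, so a word like “Datasets” in the title does not count.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only.
MIXED_SERVERS = ("zenodo", "figshare")
LINK = re.compile(r"https?://\S+|\b10\.\d{4,9}/\S+", re.IGNORECASE)
EXPLICIT_LABEL = re.compile(r"\[\s*data\s?set\s*\]", re.IGNORECASE)


def tag_mixed_server_entry(entry: str) -> Optional[str]:
    """Return 'data' or 'eref' for an entry on a mixed-content server, else None."""
    lowered = entry.lower()
    if not any(server in lowered for server in MIXED_SERVERS):
        return None  # not a mixed-content server entry; other rules would apply
    if not LINK.search(entry):
        return None  # no URL or DOI, so this sketch makes no decision
    # Only an explicit bracketed label upgrades the entry from <eref> to <data>.
    return "data" if EXPLICIT_LABEL.search(entry) else "eref"


labeled = (
    "Hanson, B., Wooden, P., & Lerback, J. (2019). Datasets for Age, Gender, and "
    "International Author Networks in the Earth and Space Sciences: Implications "
    "for Addressing Implicit Bias [Data set]. doi:10.5281/ZENODO.3591871"
)
unlabeled = (
    "Hanson, B., Wooden, P., & Lerback, J. (2019). Datasets for age, gender, and "
    "international author networks in the Earth and Space Sciences: Implications "
    "for addressing implicit bias. Zenodo. doi:10.5281/zenodo.3591871"
)
print(tag_mixed_server_entry(labeled))    # data
print(tag_mixed_server_entry(unlabeled))  # eref
```

Defaulting to <eref> when the label is missing reflects the conservative choice described above: a wrongly applied <data> tag is treated as worse than a missed one.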

→ Note: Like everything in eXtyles reference processing, recognizing and parsing data-set references is an ongoing project as we learn about new data archives and encounter new (and sometimes creative!) ways of citing data. If you see a data citation that is not handled correctly in the latest version of eXtyles, please email it to [email protected], and we’ll be happy to try to add support for it.