Data Format

Overview

The database snapshot, Simple Query Tool, REST API, and Data Feed products all return JSON-formatted data. For simplicity, that data is organized under the same schema in all cases; that schema is informally described on this page.
Regardless of the source, each record returned consists of one DOI Object, containing resource metadata. Each DOI Object in turn contains a list of zero or more OA Location Objects.
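As a rough illustration of that nesting, here is a minimal sketch that loads one record and walks its OA Location Objects. The record below is hypothetical and heavily abbreviated (the DOI and URLs are made up), and `location_urls` is an illustrative helper, not part of any product.

```python
import json

# Hypothetical, abbreviated record: real records carry many more fields.
raw = """
{
  "doi": "10.1234/example",
  "is_oa": true,
  "best_oa_location": {"url": "https://example.org/paper.pdf", "is_best": true},
  "oa_locations": [
    {"url": "https://example.org/paper.pdf", "is_best": true},
    {"url": "https://repo.example.org/123", "is_best": false}
  ]
}
"""

record = json.loads(raw)


def location_urls(doi_object):
    """Return the url of every OA Location in a DOI Object (possibly none)."""
    return [loc["url"] for loc in doi_object.get("oa_locations", [])]


urls = location_urls(record)
```

The same function works unchanged on a record with an empty oa_locations list, since the list may hold zero locations.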

DOI object

The DOI object is more or less a row in our main database: it's everything we know about a given DOI-assigned resource, including metadata about the resource itself and information about its OA status. It includes a list of zero or more OA Location Objects, as well as a best_oa_location property that's probably the OA Location you'll want to use.


best_oa_location
Object|Null
The best OA Location Object we could find for this DOI.

The "best" location is determined using an algorithm that prioritizes publisher-hosted content first (e.g. Hybrid or Gold), then versions closer to the version of record (publishedVersion over acceptedVersion), then more authoritative repositories (PubMed Central over CiteSeerX).
Returns null if we couldn't find any OA Locations.
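The ordering described above can be approximated with a sort key. This is only a sketch of the idea, not the actual selection algorithm: the rank tables, the `pick_best` name, and the caller-supplied `repo_authority` function are all assumptions made for illustration.

```python
# Lower numbers sort first. Both tables are illustrative assumptions.
HOST_RANK = {"publisher": 0, "repository": 1}
VERSION_RANK = {"publishedVersion": 0, "acceptedVersion": 1, "submittedVersion": 2}


def pick_best(locations, repo_authority=lambda loc: 0):
    """Return the preferred location dict, or None for an empty list.

    repo_authority is a hypothetical hook: it should return a smaller
    number for repositories the caller considers more authoritative.
    """
    if not locations:
        return None

    def sort_key(loc):
        return (
            HOST_RANK.get(loc.get("host_type"), len(HOST_RANK)),
            VERSION_RANK.get(loc.get("version"), len(VERSION_RANK)),
            repo_authority(loc),
        )

    return min(locations, key=sort_key)


candidates = [
    {"host_type": "repository", "version": "publishedVersion"},
    {"host_type": "publisher", "version": "acceptedVersion"},
]
best = pick_best(candidates)
```

Because host type is compared first, the publisher-hosted accepted version wins here over the repository-hosted published version.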


data_standard
Integer
Indicates the data collection approaches used for this resource.

Possible values

  • 1 First-generation hybrid detection. Uses only data from the Crossref API to determine hybrid status. Does a good job for Elsevier articles and a few other publishers, but most publishers are not checked for hybrid.
  • 2 Second-generation hybrid detection. Uses additional sources and checks all publishers for hybrid. Detects about 10x as much hybrid content. data_standard==2 is the version used in the paper we wrote about the dataset.



doi
String
The DOI of this resource.
This is always lowercase.

doi_url
String
The DOI in hyperlink form.
This field simply contains "https://doi.org/" prepended to the doi field. It expresses the DOI in its correct format according to the Crossref DOI display guidelines.
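If you need to match your own identifiers against these two fields, a small normalizer can help. This is a sketch under the assumption that your input is either a bare DOI or a doi.org link; `normalize_doi` is a hypothetical helper and the DOI in the example is made up.

```python
DOI_URL_PREFIX = "https://doi.org/"


def normalize_doi(value):
    """Return (doi, doi_url) for a bare DOI or a https://doi.org/ link."""
    doi = value.strip()
    if doi.lower().startswith(DOI_URL_PREFIX):
        doi = doi[len(DOI_URL_PREFIX):]
    doi = doi.lower()  # the doi field is always lowercase
    return doi, DOI_URL_PREFIX + doi


doi, doi_url = normalize_doi("https://doi.org/10.1234/ABC.Def")
```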

genre
String
The type of resource.
Currently the genre is identical to the Crossref-reported type of a given resource. The "journal-article" type is most common, but there are many others.

is_paratext
Boolean
Is the item an ancillary part of a journal, like a table of contents?
See here for more information on how we determine whether an article is paratext.

is_oa
Boolean
Is there an OA copy of this resource?
Convenience attribute; returns true when best_oa_location is not null.

journal_is_in_doaj
Boolean
Is this resource published in a DOAJ-indexed journal?
Useful for defining whether a resource is Gold OA (depending on your definition, see also journal_is_oa).

journal_is_oa
Boolean
Is this resource published in a completely OA journal?
Useful for defining whether a resource is Gold OA. Includes any fully-OA journal, regardless of inclusion in DOAJ. This includes journals by all-OA publishers and journals that would otherwise be all Hybrid or Bronze OA. See here for more information on OA journals.
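Since the definition of Gold OA is left to you, one way to keep that choice explicit is a small predicate with a switch. This is a sketch only: `is_gold` and its `require_doaj` parameter are hypothetical names, and the two definitions shown are just the two that these fields make possible.

```python
def is_gold(record, require_doaj=False):
    """Apply one of two Gold OA definitions to a DOI Object.

    require_doaj=False: any fully-OA journal counts (journal_is_oa).
    require_doaj=True:  only DOAJ-indexed journals count (journal_is_in_doaj).
    """
    flag = "journal_is_in_doaj" if require_doaj else "journal_is_oa"
    return bool(record.get(flag))


record = {"journal_is_oa": True, "journal_is_in_doaj": False}
broad = is_gold(record)
strict = is_gold(record, require_doaj=True)
```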

journal_issns
String
Any ISSNs assigned to the journal publishing this resource.
Separate ISSNs are sometimes assigned to print and electronic versions of the same journal. If there are multiple ISSNs, they are separated by commas. Example: 1232-1203,1532-6203
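Because the field is a single comma-separated string rather than a list, you will usually want to split it before use. A minimal sketch, where `split_issns` is a hypothetical helper and treating a missing value as an empty list is an assumption:

```python
def split_issns(journal_issns):
    """Split the comma-separated journal_issns string into a list of ISSNs."""
    if not journal_issns:
        return []
    return [part.strip() for part in journal_issns.split(",") if part.strip()]


issns = split_issns("1232-1203,1532-6203")
```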

journal_issn_l
String
A single ISSN for the journal publishing this resource.
An ISSN-L can be used as a primary key for a journal when more than one ISSN is assigned to it. Resources' journal_issns are mapped to ISSN-Ls using the issn.org table, with some manual corrections.

journal_name
String
The name of the journal publishing this resource.
The same journal may have multiple name strings (e.g. "J. Foo", "Journal of Foo", "JOURNAL OF FOO", etc.). These have not been fully normalized within our database, so use with care.

oa_locations
List
List of all the OA Location objects associated with this resource.
This list is unnecessary for the vast majority of use-cases, since you probably just want the best_oa_location. It's included primarily for research purposes.

oa_status
String
The OA status, or color, of this resource.
Classifies OA resources by location and license terms as one of: gold, hybrid, bronze, green or closed. See here for more information on how we assign an oa_status.

published_date
String|Null
The date this resource was published.
As reported by the publishers, who unfortunately have inconsistent definitions of what counts as officially "published." Returned as an ISO8601-formatted timestamp, generally with only year-month-day.
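A sketch of parsing this field defensively, assuming the value is either null or starts with a year-month-day date. `parse_published_date` is a hypothetical helper; it keeps only the date part and ignores any time component that may follow.

```python
from datetime import date


def parse_published_date(value):
    """Return a datetime.date for published_date, or None when it is null."""
    if value is None:
        return None
    # Keep only the leading YYYY-MM-DD portion of the timestamp.
    return date.fromisoformat(value[:10])


published = parse_published_date("2017-08-17")
```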

publisher
String
The name of this resource's publisher.
Keep in mind that publisher name strings change over time, particularly as publishers are acquired or split up.

title
String
The title of this resource.
It's the title. Pretty straightforward.

updated
String
Time when the data for this resource was last updated.
Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663
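The example timestamp above can be parsed with the Python standard library. A minimal sketch; `parse_updated` is a hypothetical helper, and note that the example value carries no timezone offset, so the result is a naive datetime.

```python
from datetime import datetime


def parse_updated(value):
    """Parse the ISO8601 updated timestamp into a datetime."""
    return datetime.fromisoformat(value)


updated = parse_updated("2017-08-17T23:43:27.753663")
```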

year
Integer|Null
The year this resource was published.
Just the year part of the published_date.

z_authors
List of Crossref Contributor objects
The authors of this resource.
These are formatted as a list of Crossref Contributor objects, which are described in the Crossref API docs here.

OA Location object

The OA Location object describes a particular place where we found a given OA article. The same article is often available from multiple locations, and there may be differences in format, version, and license depending on the location; the OA Location object describes these key attributes. An OA Location Object is always a child of a DOI Object.


evidence
String
How we found this OA location.

Used for debugging. Don’t depend on the exact contents of this for anything, because values are subject to change without warning. Example values:

  • oa journal (via journal title in doaj) We found the name of the journal that publishes this article in the DOAJ database.
  • oa repository (via pmcid lookup) We found this article in an index of PubMed Central articles.



host_type
String
The type of host that serves this OA location.

There are two possible values:

  • publisher means this location is served by the article’s publisher (in practice, this usually means it is hosted on the same domain the DOI resolves to).
  • repository means this location is served by an Open Access repository. Preprint servers are considered repositories even if the DOI resolves there.



is_best
Boolean
Is this location the best_oa_location for its resource?
See the DOI object's best_oa_location description for more on how we select which location is "best."

license
String|Null
The license under which this copy is published.

We return several types of licenses:

  • Creative Commons licenses are uniformly abbreviated and lowercased. Example: cc-by-nc
  • Publisher-specific licenses are normalized using this format: acs-specific: authorchoice/editors choice usage agreement
  • When we have evidence that an OA license of some kind was used, but it’s not reported directly on the webpage at this location, this field returns implied-oa
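To bucket the license values described above, something like the following could work. It is a sketch built only from the examples on this page: the `license_kind` name, the bucket labels, and the string tests (a "cc" prefix, a "-specific" marker) are assumptions, not a documented grammar.

```python
def license_kind(license_value):
    """Sort a license string into a coarse, illustrative category."""
    if license_value is None:
        return "unknown"
    if license_value == "implied-oa":
        return "implied-oa"
    if license_value.startswith("cc"):
        return "creative-commons"
    if "-specific" in license_value:
        return "publisher-specific"
    return "other"


kinds = [
    license_kind("cc-by-nc"),
    license_kind("acs-specific: authorchoice/editors choice usage agreement"),
    license_kind("implied-oa"),
    license_kind(None),
]
```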



pmh_id
String|Null
OAI-PMH endpoint where we found this location.
This is primarily for internal debugging. It's Null for locations that weren't found using OAI-PMH.

updated
String
Time when the data for this location was last updated.
Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663

url
String
The url_for_pdf if there is one; otherwise the landing page URL.

When we can't find a url_for_pdf (or there isn't one), this field uses the url_for_landing_page, which is a useful fallback for some use cases.
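The fallback described above amounts to the following. This is a sketch of the stated behavior for readers who want to reproduce it on their side; `preferred_url` is a hypothetical helper and the URLs are made up.

```python
def preferred_url(location):
    """Return url_for_pdf when present, otherwise url_for_landing_page."""
    return location.get("url_for_pdf") or location.get("url_for_landing_page")


with_pdf = preferred_url({
    "url_for_pdf": "https://example.org/paper.pdf",
    "url_for_landing_page": "https://example.org/paper",
})
without_pdf = preferred_url({
    "url_for_pdf": None,
    "url_for_landing_page": "https://example.org/paper",
})
```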


url_for_landing_page
String
The URL for a landing page describing this OA copy.

When the host_type is "publisher" the landing page usually includes HTML fulltext.


url_for_pdf
String|Null
The URL with a PDF version of this OA copy.

Pretty much what it says.


version
String
The content version accessible at this location.

We use the DRIVER Guidelines v2.0 VERSION standard to define versions of a given article; see those docs for complete definitions of terms. Here's the basic idea, though, for the three version types we support:

  • submittedVersion is not yet peer-reviewed.
  • acceptedVersion is peer-reviewed, but lacks publisher-specific formatting.
  • publishedVersion is the version of record.
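If you only want peer-reviewed copies, the three values above make that a simple filter. A minimal sketch; `peer_reviewed_locations` is a hypothetical helper, and locations with any other or missing version value are excluded by assumption.

```python
# Per the list above, only these two versions have been peer-reviewed.
PEER_REVIEWED_VERSIONS = {"acceptedVersion", "publishedVersion"}


def peer_reviewed_locations(locations):
    """Keep only the OA Locations whose version has been peer-reviewed."""
    return [loc for loc in locations if loc.get("version") in PEER_REVIEWED_VERSIONS]


locations = [
    {"version": "submittedVersion"},
    {"version": "acceptedVersion"},
    {"version": "publishedVersion"},
]
reviewed = peer_reviewed_locations(locations)
```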