8  FAIRness

Purpose:

“enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals” (Wilkinson et al., 2016, p. 1)

See also go-fair.org

FAIRness vs. openness

“does not necessarily mean that data has to be “open” […] even highly protected data can be FAIR data”
(Kraft, 2023)


The problem:
Just because we provide data online, doesn’t mean that others will find it.

We could have the greatest data set to answer further research questions - if our colleagues don’t know it exists or can’t locate the data, openness will be of little value.

The solutions:

  • Get a persistent identifier (e.g., DOI), where you provided your data
  • Mention DOI in publication that builds on this data (e.g., in the “data accessibility statement”)
  • Describe your data as richly as possible (metadata). Research data centers offer form fields tailored to the discipline or data type. With repositories use alternative possibilities, such as keyword fields.
    • e.g., which variables does the quantitative data set contain?
    • e.g., which topics does your data cover?
    • e.g., which population did you draw your sample from?

The problem:
Just because others find our data doesn’t mean the access barriers are as low as possible and doesn’t mean they know in which way they are allowed to access it. Examples:

  • Providing a link to the data in the text of a paywalled journal article
  • Unclear licensing / use conditions when providing data (e.g., are non-researchers allowed to access the data or is it only open for qualified researchers?)

The solutions:

  • Make sure access is free of charge (or as cheap as possible)
    • e.g., by providing link to data in publicly accessible sections of journal articles that are not open access
    • e.g., by using repositories or research data centers that allow access free of charge
  • Make sure users know if they can access and under which conditions
    • e.g., research data centers ensure that terms of use are clear (who may access under what conditions) and offer different levels of access restriction
    • e.g., on repositories provide a readme-file and an open license (e.g., CC0, CC-BY, CC-BY-SA) with data sets for access cases

The problem:
Just because others downloaded our data doesn’t mean they can open and manipulate it.

The solutions:

  • Use file formats with open licenses
    • e.g., tabular data: CSV (with additional labelling script), RData
    • e.g., text data: PDF, HTML, ODT, RTF
  • Make sure users know how different files are related to one another
    • e.g., define which file contains student data and which teacher data
    • e.g., define which file contains data from cohort 1 and which cohort 2, …

The problem:
Just because others opened our data doesn’t mean they understand the data and its use-conditions. Examples:

  • Others can’t understand what the column names of the tabular data set mean: Which columns in the data set relate to which variables in the journal article?
  • Can someone from sociology use the data set from psychology they found on osf.io?
  • Does someone reusing a data set have to cite the authors?

The solutions:

  • Adhere to standards in folder organization
    • e.g., PSYCH-DS (see technical specification draft)
  • Rich description/explanation of what user will find in the data set (≠ meta descriptions about the data set as a whole, as for accessibility)
    • e.g., provide a codebook. How to semi-automatically create a codebook, see the R package codebook
  • Provide a license for the use-cases
    • again, research data centers ensure that terms of use are clear (who may use under what conditions)
    • again, on repositories provide a readme-file and an open license (e.g., CC0, CC-BY, CC-BY-SA) with data sets for the use-cases


(FAIR principles and the role of scientists: Kraft, 2023)

Questions to be answered at the end?
Please put them here!