Data Publication Guide
How to publish your research data in Yoda
This guide outlines the basic steps that are necessary for creating an Earth science data publication using Yoda. An explanation of the terminology used in this guide can be found at the bottom of this page.
Yoda (short for ‘your data’) is a research data management service developed by Utrecht University that enables researchers to deposit, publish, and preserve their research data. It offers researchers a collaborative environment to store their research data. This research data can be shared with collaborators or members of the research group, if needed. The steps to go from storing research data to a formal data publication that can be accessed and cited by others are outlined in the guide below. Once data is published in the Yoda repository, the associated metadata can be harvested by research data catalogs, such as the EPOS Multi-scale laboratories data catalog, making your data publication Findable, Accessible, Interoperable, and Re-usable (FAIR).
The process to publish your data in Yoda is outlined in the nine steps below:
The Geosciences data manager (Vincent Brunst – firstname.lastname@example.org) will
1) give you access to the Yoda data repository to store your research data, and
2) answer any questions you may have after reading through this guide.
The research data from which you want to create a data publication can:
- i) Already be in the Yoda environment. In this case you can ignore step 5 from this guide (since your data is already in Yoda).
- ii) Be stored outside of the Yoda environment. In this case you need to upload your dataset to Yoda in step 5 of this guide.
In both cases:
- Create a folder in which to place the dataset that you want to publish.
- Decide on the data that you want to include in the dataset for publication. This can be based on what you want to publish, what your journal wants you to publish and/or what your funder wants you to publish.
- Give the files and (sub)folders appropriate names, making them easy to interpret for other researchers.
- Assign a logical folder structure.
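As an illustration, the preparation steps above could result in a layout like the following. This is only a sketch: all folder and file names here are hypothetical examples, not prescribed Yoda conventions.

```shell
# Hypothetical dataset layout; every name below is a placeholder example.
mkdir -p my_dataset/raw_data my_dataset/processed_data my_dataset/scripts
touch my_dataset/raw_data/sample01_measurement.csv      # unmodified instrument output
touch my_dataset/processed_data/sample01_calibrated.csv # derived data product
```

Descriptive file names (e.g. including a sample identifier and the measurement type) make the contents interpretable for other researchers without opening each file.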
Create any necessary documentation (called unstructured metadata) that provides essential information for future use of the dataset, notably how the data was collected, how it should be interpreted, and how it can be reused.
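Such documentation often takes the form of a README file in the top-level folder of the dataset. A minimal sketch is shown below; every field and value is a hypothetical example, not a prescribed Yoda template.

```shell
# Write a minimal README; all content below is an invented example.
cat > README.txt <<'EOF'
Title:       Example friction experiments on quartz gouge
Creator:     A. Researcher, Utrecht University
Description: How the data were collected (instrument, settings), how they
             should be interpreted, and how they can be reused.
Contents:    raw_data/       unmodified instrument output
             processed_data/ derived products, with processing steps noted
Codebook:    per file, the variable names, units, and explanations
EOF
```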
Consult your funding agency's and/or publisher's requirements for both the type of license and the embargo period. In most cases we advise using a CC BY 4.0 license. This license allows anyone to share (copy and redistribute the data in any medium or format) and adapt (remix, transform, and build upon the data) the dataset, on the conditions that i) the dataset is correctly attributed (you receive credit through citation), ii) any adaptations of the dataset are duly indicated, and iii) a link to the license information is provided. Depending on your licensing requirements, you can also use this tool to select the appropriate license, or contact the data manager when in doubt.
The embargo period (usually between 0 and 3 years) should be chosen based on your preference and the requirements laid out by the funder or journal. During the embargo period, only the metadata is published (so that others are aware the dataset exists); the actual content is published only once the embargo period has elapsed.
Adding structured metadata to your dataset creates a so-called data package. Metadata provide qualitative descriptions of your data, e.g. how and where the data were obtained, on what samples, etc. The steps to add metadata are described here.
The data manager will review the file types, folder structure, file formats, and metadata. Apply the corrections that the data manager proposes. After you have implemented the suggested changes, the process can be repeated if necessary, or the dataset continues to the next publication step.
The dataset can now be transferred as a bundled data package for sustainable storage in the Yoda archive (also known as the “Vault”). Your data package is retained unchanged for the duration of its retention period in the Vault. The steps to archive your data are found here.
After archiving the data package in the vault, it is ready to be published. Your data package will obtain a DOI and the structured metadata will be published so it can be harvested by data catalogs, making your data publication findable, accessible, interoperable, and re-usable! The process of publishing your data package is described here.
Terminology

Data: Facts or information that are examined and used to find out things or to make decisions.
Data is generally classified into two different types:
- Raw data: data collected from a source that has not been subject to any other manipulation by software or a human researcher
- Processed data: data that has been manipulated by software or a human researcher
We make a distinction between the terms data, dataset and data package. The term data refers to research data at any level, agnostic of its form. Data can be raw data derived from a particular tool, or a data product, such as a model.
Data documentation: A human-readable text that provides detailed information about the data. Three levels of documentation are usually distinguished:
- a general description of the study aims and methodology
- a list of files and folders and how they relate to each other
- a codebook: a list of variables with names and explanations
Depending on the data level, the content of the data documentation will vary. For example, the explanation for raw data will probably include information on how the data was obtained and selected. If the data set includes processed or analyzed data products, the documentation has to explain how these products were derived from the raw data.
Data manager: A person with expertise in general data management, responsible for the organisation and maintenance of data in the repository. The data manager performs a qualitative assessment of the dataset's form. This role requires general knowledge of research data management and of the domain-specific metadata structure.
Data package: A publication-ready dataset. A data package houses data files in a folder structure; alongside the data files, it always includes the corresponding data documentation and the structured metadata. When a data package is published, a DOI is added to its metadata. A dataset becomes a data package in the process of data deposition.
Dataset: An organized collection of data files. Creating a dataset necessarily involves a selection process. A dataset may include data documentation.
DOI: A Digital Object Identifier (DOI) is a persistent, interoperable identifier that uniquely identifies an object, making it findable, discoverable, and citable.
Repository: A facility for sustainable data storage and publication.
Metadata: Standardised, structured information describing the characteristics of the data. Metadata provides information for discovery and contextualisation of the dataset. Different EPOS communities use different metadata standards, depending on their domain of research. The EPOS Multi-scale Laboratories community, in which UU is highly involved, uses a selection of the ISO 19115 metadata standard as its baseline. The community has extended this standard with additional keywords in order to provide a more precise description of the experimental setups underlying the data collection process.
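To give a rough impression of what such structured metadata looks like, the fragment below sketches a few ISO 19115 elements in their common XML encoding (ISO 19139, gmd/gco namespaces). This is illustrative only: it is not schema-complete or valid, and the title, abstract, and keyword values are invented examples, not taken from the EPOS profile.

```xml
<!-- Illustrative sketch only; not schema-valid ISO 19139, values are invented. -->
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:citation>
        <gmd:CI_Citation>
          <gmd:title>
            <gco:CharacterString>Example dataset title</gco:CharacterString>
          </gmd:title>
        </gmd:CI_Citation>
      </gmd:citation>
      <gmd:abstract>
        <gco:CharacterString>Short description of the dataset.</gco:CharacterString>
      </gmd:abstract>
      <gmd:descriptiveKeywords>
        <gmd:MD_Keywords>
          <gmd:keyword>
            <!-- Community-specific keyword, e.g. describing the experimental setup -->
            <gco:CharacterString>rotary shear apparatus</gco:CharacterString>
          </gmd:keyword>
        </gmd:MD_Keywords>
      </gmd:descriptiveKeywords>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
</gmd:MD_Metadata>
```

In Yoda, such structured metadata is entered through a form rather than written by hand; the XML is shown only to clarify what data catalogs harvest.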