Terms and Concepts

Tabular data file

A data file that is organized in a table with rows and columns.

Study files/resources

All data and non-data supporting files/documents/resources generated by or for or otherwise associated with your study, regardless of whether you are planning to share them. Also known as: study resources, study files, study documents, study artefacts.

NOTE: For most studies, study resources will all be files. However for some studies, this may also include (for example) bio samples or other non-file items.

Associated Files/Dependencies

The term Associated Files/Dependencies may be used with respect to a Resource (i.e. study file/resource) or with respect to a Result (i.e. results).

Resource

  • When adding a study file/resource to your study's Resource Tracker, you will be asked to supply Associated Files/Dependencies for that study file/resource.
  • An Associated File/Dependency is any file that the current study file/resource depends on to be interpreted, replicated, or used.
  • For example, consider a processed tabular data file (e.g. a final analytic dataset):

    • The raw data file(s), plus the code file that merges and cleans them, may be required to replicate the processed tabular data file, and
    • A data dictionary for the processed tabular data file is likely to be required to interpret/use the processed tabular data file.
  • When documenting associated files/dependencies of a resource, only list the files directly underlying or associated with the resource (a sketch follows this list).
    • You will then document each of these associated files/dependencies as a resource in its own right, again listing only the files directly underlying or associated with it as associated files/dependencies.
      • Continue this "stepping backward" process until the files you are listing as associated files/dependencies have no direct associated files/dependencies themselves.
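
Purely as an illustration of this chain for the analytic dataset example above, the rows might look something like the following (written as Python dictionaries for readability; the file paths and field names are hypothetical, simplified stand-ins for the actual Resource Tracker csv template fields):

```python
# Illustrative sketch only -- field names are simplified stand-ins for the
# actual Resource Tracker csv template fields (see the csv templates/schemas).
resource_tracker_rows = [
    {
        "resource": "data/analytic_dataset.csv",
        "description": "Final analytic dataset (processed tabular data)",
        # direct dependencies only: the raw data and the cleaning/merging code
        "associated_files_dependencies": ["data/raw/visit1_raw.csv", "code/merge_clean.R"],
        # data dictionaries go in the more specific field, not the general one
        "associated_data_dictionary": "dsc-pkg/data-dictionary-analytic_dataset.csv",
    },
    {
        "resource": "code/merge_clean.R",
        "description": "Merges and cleans the raw data files into the analytic dataset",
        "associated_files_dependencies": ["data/raw/visit1_raw.csv"],
        "associated_data_dictionary": None,
    },
    {
        "resource": "data/raw/visit1_raw.csv",
        "description": "Raw data as collected",
        # no direct file dependencies remain -- the "stepping backward" stops here
        "associated_files_dependencies": [],
        "associated_data_dictionary": "dsc-pkg/data-dictionary-visit1_raw.csv",
    },
]
```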

Note

  • Data dictionaries and protocols should be shared in the more specific Associated Data Dictionary and Associated Protocol fields in the Resource Tracker wherever possible.
  • When adding a publication or report to the Resource Tracker, the only dependency should be the Results Tracker for that publication, and that Results Tracker should be added in the more specific Associated Results Tracker field in the Resource Tracker wherever possible.

Result

  • When adding a result to your study's Results Tracker, you will be asked to supply Associated Files/Dependencies for that result.
  • An Associated File/Dependency is any file that the current result depends on to be interpreted, replicated, or used.
  • For example, consider a figure that has been created as a potential candidate for inclusion as Figure 1A (in the context of a Figure 1 with panels A, B, and C) in a draft manuscript:

    • A png or jpg image file exported from Corel Draw or Illustrator may directly underlie the whole of Figure 1 (where image files for panels A, B, and C were formatted into a single Figure 1 in the context of a Corel Draw or Illustrator file);
    • A png or jpg image file produced and written to file by R or python code may directly underlie Figure 1A specifically;
    • This image file in turn depends upon the code file that produced it, as well as the data file(s) the code file read in and operated upon to create the final image file;
    • The code file may have been created to implement a specific statistical analysis plan that is laid out in a larger study protocol file, or in a statistical analysis plan file;
    • The data in the data file may have been collected using a protocol laid out in a specific protocol file, and if the data file(s) underlying the result in Figure 1A is/are tabular data file(s), a data dictionary is likely to be required to interpret/use the data file(s).
  • When listing Associated Files/Dependencies for a result in the Results Tracker, include only the Associated Files/Dependencies directly underlying the result (a sketch follows this list).
    • In the above example, that would mean documenting only the png or jpg image file underlying Figure 1.
    • You would then document the png or jpg image file as a resource in the Resource Tracker, including only the associated files/dependencies directly underlying the resource (i.e., png or jpg image file underlying Figure 1A, specifically).
      • Continue this "stepping backwards" process documenting resources in the Resource Tracker until the files you are listing as associated files/dependencies have no direct associated files/dependencies themselves.
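
The same "stepping backward" idea for the Figure 1A example can be sketched as a simple dependency map; everything below is hypothetical and simplified (for instance, the assembled figure is shown depending only on the panel A image), and the short loop just walks backward from the result until nothing is left to document:

```python
# Hypothetical file names standing in for the Figure 1A example above;
# each entry maps an item to the files it directly depends on.
direct_dependencies = {
    "result: Figure 1A (draft manuscript)": ["figures/figure1_assembled.png"],
    "figures/figure1_assembled.png": ["figures/fig1A_panel.png"],  # simplified to panel A only
    "figures/fig1A_panel.png": ["code/make_fig1A.py", "data/analytic_dataset.csv"],
    "code/make_fig1A.py": ["protocols/statistical_analysis_plan.docx"],
    # the data dictionary and collection protocol for the data file would go in the
    # more specific Resource Tracker fields, so the chain ends here
    "data/analytic_dataset.csv": [],
    "protocols/statistical_analysis_plan.docx": [],
}

# "Step backward" from the result until nothing remains to document.
to_document, documented = ["result: Figure 1A (draft manuscript)"], []
while to_document:
    item = to_document.pop()
    if item not in documented:
        documented.append(item)
        to_document.extend(direct_dependencies.get(item, []))
print(documented)
```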

Associated Data Dictionary

The Data Dictionary that inventories and provides detailed information regarding all variables in the specific tabular data file for which it is the Associated Data Dictionary.

Associated Protocol

The protocol that provides detailed information regarding the procedure followed to produce the data contained in the specific tabular data file for which it is the Associated Protocol.

Associated Results Tracker

The Results Tracker that inventories and provides detailed information regarding all results (e.g. figures, figure panels, tables, text statements) shared in the specific publication or report for which it is the Associated Results Tracker.

Results

A figure, figure panel, table, or text statement that communicates a result and is, will be, or may be shared in the context of a publication, report, or presentation.

dsc-pkg Folder

A file folder/directory that will hold all of the Standard Data Package Metadata Files for your Data Package. It is generally recommended that you create a single overarching study file folder/directory to hold all study files and folders and that you create your dsc-pkg folder as a direct sub-directory of your study folder and name it "dsc-pkg".

Standard data package metadata files

Standard metadata file types that, altogether, provide essential usability and context information about the study as a whole and about the data files your study has produced/collected. These metadata files should be included in all data packages. They should be stored together in a single file directory, preferably as a sub-directory within your study file directory called "dsc-pkg". See below for an example directory structure.

NOTE: All standard data package metadata files have a standardized csv format in which they should be completed and provided. See here for csv templates and schemas/field definitions to aid you in completing the templates.

Example Directory Structure
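
Purely as an illustration (every folder and file name here except "dsc-pkg" is hypothetical, and the metadata file names are simplified stand-ins for the official csv templates), a study directory following the recommendation above might look like:

```
my-study/                                  <- overarching study folder
├── data/
│   ├── raw/
│   │   └── visit1_raw.csv
│   └── analytic_dataset.csv
├── code/
├── protocols/
├── results/
└── dsc-pkg/                               <- all standard data package metadata files
    ├── experiment-tracker.csv
    ├── resource-tracker.csv
    ├── data-dictionary-analytic_dataset.csv
    └── results-tracker-manuscript1.csv
```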

Standard data package metadata files - Study-level

  • Experiment Tracker

    One per study; An inventory of experiments or activities included in the study; For a clinical trial, this may be simply one experiment equal to the registered clinical trial activity; For a basic biology study, this may be a listing of several orthogonal experiments that together address and advance the study aims - See here for more detail

  • Resource Tracker

    One per study; An inventory of all data and non-data supporting files produced during the course of the study (or, in some cases, only those which will be shared in a public data repository), including a description of what is in the file or what the file represents, file relationships and dependencies, and whether/how each file is shareable in a public repository or not - See here for more detail

Standard data package metadata files - File-level

  • Data Dictionary

    One per tabular data file; An inventory of variables included in a tabular data file - See here for more detail

  • Results Tracker

    One per publication or report; An inventory of figure, table, and text statement results included in a publication or report - See here for more detail
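
As a rough sketch of how a study group might lay down empty skeletons for these files in one place, assuming hypothetical, simplified column names (the official csv templates and schemas define the actual fields):

```python
import csv
from pathlib import Path

# Hypothetical, simplified column sets -- use the official csv templates/schemas
# for the real field definitions; this only illustrates "one csv per tracker,
# stored together in the dsc-pkg folder".
skeletons = {
    "experiment-tracker.csv": ["experiment_id", "title", "research_question", "approach", "hypothesis"],
    "resource-tracker.csv": ["resource_path", "description", "access_level",
                             "associated_files_dependencies", "associated_data_dictionary",
                             "associated_protocol", "associated_results_tracker"],
    "data-dictionary-analytic_dataset.csv": ["variable_name", "description", "type", "units"],
    "results-tracker-manuscript1.csv": ["result_id", "result_type", "description",
                                        "associated_files_dependencies"],
}

pkg_dir = Path("my-study/dsc-pkg")
pkg_dir.mkdir(parents=True, exist_ok=True)
for filename, columns in skeletons.items():
    with open(pkg_dir / filename, "w", newline="") as f:
        csv.writer(f).writerow(columns)  # header row only; rows are filled in as you annotate
```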

Experiment Tracker - Overview

The Experiment Tracker is an inventory and annotated list of all component experiments or activities that are part of the larger study. Each row of the experiment tracker corresponds to one component experiment or activity. Information in the tracker about each experiment includes the research question(s), approach, and hypotheses.

The Experiment Tracker is one of the standard data package metadata files which should always be included in a data package to provide essential usability and context information about the study as a whole and about the data files your study has produced/collected. There are study-level and file-level standard data package metadata files. The Experiment Tracker is a study-level standard data package metadata file (you should create and complete one Experiment Tracker per study).

Resource Tracker - Overview

The Resource Tracker is an inventory and annotated list of data and non-data supporting files/resources for the study. Each row of the resource tracker corresponds to one data or non-data resource. Information in the tracker about each resource includes file path, description, access restrictions, and dependencies (i.e. files necessary to interpret, replicate, or use the resource).

The Resource Tracker is one of the standard data package metadata files which should always be included in a data package to provide essential usability and context information about the study as a whole and about the data files your study has produced/collected. There are study-level and file-level standard data package metadata files. The Resource Tracker is a study-level standard data package metadata file (you should create and complete one Resource Tracker per study).

Data Dictionary - Overview

The Data Dictionary is an inventory and annotated list of variables within a single tabular data file (e.g. subject ID, blood pressure, zip code, protein activity, etc.). Each row of the data dictionary corresponds to one variable within a tabular data file. Information in the data dictionary about each variable includes the name of the variable, a description of the variable, type of variable (string, numeric, integer), etc.
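
For illustration only, a few Data Dictionary rows for a hypothetical tabular data file might look like the following; the variable names and the simplified column set are made up, and the Data Dictionary csv template/schema defines the real fields:

```python
# Illustrative rows only: variable names and the (simplified) column set are
# hypothetical; the Data Dictionary csv template/schema defines the real fields.
data_dictionary_rows = [
    {"variable_name": "subject_id", "description": "Unique participant identifier", "type": "string"},
    {"variable_name": "sbp_mmhg",   "description": "Systolic blood pressure, mmHg", "type": "numeric"},
    {"variable_name": "zip_code",   "description": "5-digit residential zip code",  "type": "string"},
]
```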

The Data Dictionary is one of the standard data package metadata files which should always be included in a data package to provide essential usability and context information about the study as a whole and about the data files your study has produced/collected. There are study-level and file-level standard data package metadata files. The Data Dictionary is a file-level standard data package metadata file (you should create and complete one Data Dictionary per tabular data file in your data package).

Results Tracker - Overview

The Results Tracker is an inventory and annotated list of results within a single publication or report. Each row of the results tracker corresponds to one result (e.g. a figure, table, or textual statement) within a publication or report. Information in the tracker about each result includes the type of result (figure, table, text), description, and dependencies (i.e. files necessary to interpret, replicate, or use the result).
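
For illustration only, a couple of Results Tracker rows for a hypothetical publication might look like the following; the result IDs, descriptions, file paths, and simplified columns are made up, and the Results Tracker csv template/schema defines the real fields:

```python
# Illustrative rows only: result IDs, descriptions, and the simplified columns
# are hypothetical; the Results Tracker csv template/schema defines the real fields.
results_tracker_rows = [
    {"result_id": "Figure 1A", "result_type": "figure panel",
     "description": "Primary outcome plotted by study arm",
     "associated_files_dependencies": ["figures/figure1_assembled.png"]},
    {"result_id": "Table 1", "result_type": "table",
     "description": "Baseline characteristics of participants",
     "associated_files_dependencies": ["code/make_table1.R", "data/analytic_dataset.csv"]},
]
```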

The Results Tracker is one of the standard data package metadata files which should always be included in a data package to provide essential usability and context information about the study as a whole and about the data files your study has produced/collected. There are study-level and file-level standard data package metadata files. The Results Tracker is a file-level standard data package metadata file (you should create and complete one Results Tracker per publication or report in your data package).

Minimal annotation

When completing the Resource Tracker for your study:

  • Minimal annotation implies that you will list and annotate relevant study resources in the resource tracker ONLY if you will share those files in a public data repository
  • When completing the Resource Tracker you will NOT list and annotate relevant study files if they will not be shared in a public data repository
  • See the alternative: Wholistic Annotation
Advantages
  • You only catalog the data and non-data supporting files that you will share/submit to a repository.
  • Especially if you are late in your study, Minimal Annotation may be less time consuming than the alternative (Wholistic Annotation), because you are listing and annotating fewer files (i.e. only the files that will be shared, versus all files that will be shared AND all files that will NOT be shared) in the Resource Tracker for your study.
  • This approach still provides a lot of value to researchers who may find your study - It will help them to parse what the study was trying to do, how the study was designed, what has been made available, whether or not the data that has been made available may be useful for their purposes (e.g. secondary data analysis, comparison to their own results, etc.), and even whether it may be useful to reach out to the study group of origin to request data that has not been provided or to set up a formal collaboration.
  • This approach allows you to fulfill minimal data sharing requirements.
Caveats
  • As compared to the alternative (Wholistic Annotation), you don’t get the full local annotation benefit that would come with fully cataloguing all data and non-data/supporting files relevant to a study (including files you will not share/submit to a repository), and how they relate to each other and to published results – these benefits include facilitating continuity and passed-down knowledge within study groups, and discovery, sharing, and re-use of the data and knowledge produced by the study outside of the original study group.
  • As compared to the alternative (Wholistic Annotation), you may not get the full benefit of added study discoverability and transparency for potential secondary data users and collaborators that the Resource Tracker can provide.

Wholistic annotation

When completing the Resource Tracker for your study:

  • Wholistic annotation implies that you will list and annotate relevant study resources without regard for whether (or not) you will share those files in a public data repository
  • When completing the Resource Tracker you will list and annotate relevant study files that will be shared AND those that will NOT be shared
  • When listing/annotating a file that will NOT be shared, access level should be set to "permanent-private" to indicate that the file will not be shared
  • See the alternative: Minimal Annotation
Advantages
  • Maximizes transparency and allows other researchers interested in the data to understand the full scope of the project and the data when accessing study documentation.
  • Allows for documentation of the existence and disposition of files that are too sensitive to share but are important for reproducibility and can perhaps be requested directly from the study team by another researcher.
  • You get the benefit of full local annotation, which not only maximizes the usefulness of your data for other investigators but also can be helpful internally, especially in preserving knowledge about the data even as team members may change over the course of the study.
  • Documenting and sharing all metadata associated with your study can increase the discoverability of your study.
Caveats
  • Especially if you are late in your study, Wholistic Annotation may be more time consuming than the alternative (Minimal Annotation), because you are listing and annotating more files (i.e. files that will be shared AND files that will NOT be shared, versus ONLY files that will be shared) in the Resource Tracker for your study.

Finding your study's best fit annotation approach

For guidance on determining the best fit annotation approach for your study group, see the best fit questions.

As-you-go annotation

As-you-go annotation implies that you will begin the data packaging and annotation process right away; you will audit and annotate all study resources already created, and keep up with annotation as you move through the remainder of your study timeline; you will generally audit and annotate all study resources regardless of whether these resources will ultimately be shared at a public data repository ("wholistic" annotation)

See the alternative: Top-down Annotation

Advantages
  • Spreads out annotation and data packaging work across the course of the study so that burden at the end of the study is minimal.
  • You get the benefit of full local annotation, which not only maximizes the usefulness of your data for other investigators but also can be helpful internally, especially in preserving knowledge about the data even as team members may change over the course of the study.
  • Maximizes transparency and allows other researchers interested in the data to understand the full scope of the project and the data when accessing study documentation.
  • Allows for documentation of the existence and disposition of files that are too sensitive to share but are important for reproducibility and can perhaps be requested directly from the study team by another researcher.
  • Documenting and sharing all metadata associated with your study can increase the discoverability of your study.
Caveats
  • The "as-you-go" annotation approach, **when applied broadly as outlined above** is strongly recommended for study groups that are early on in their study as the burden of starting up is relatively light when few study files/resources have so far been collected or produced by or for the study. However, the start up burden of this approach may be quite substantial for studies groups that are late or even well into their study and have already accumulated many study files/resources, and we generally recommend these groups consider the alternative, more goal-focused and narrow annotation approach: Top-down annotation.
  • **The "as-you-go" annotation approach may also be applied in a narrower sense**, especially by study groups that are later in their study and who will not apply the "as-you-go" annotation approach in the broadest sense. This implies that studies will consider the whole packaging overview process and complete items as they can, as opposed to waiting until the very end (for example, when they are about to submit a study manuscript for peer review) to start the process. Some examples include, 1) auditing study files for tabular data files, creating data dictionaries for existing tabular data files right away, and creating data dictionaries right away for new tabular data files as the study collects or produces them, 2) annotating results the study group knows will or likely will be included in a final manuscript as they are produced, and creating a results tracker for final manuscript documents as drafts begin to be forumulated by the study group, 3) annotating component experiments and other activities that are part of the study right away if already designed, or as soon as they are designed (especially if it is clear that the experiment or activity will or likely will produce data that will be used to support/produce results that will be included in a final manuscript).

Finding your study's best fit annotation approach

For guidance on determining the best fit annotation approach for your study group, see the best fit questions.

Top-down annotation

Top-down annotation implies that you will generally implement the data packaging and annotation process in a somewhat narrower, more goal-oriented manner as compared to the alternative As-you-go annotation approach. You will generally determine the data-sharing orientation(s) or goal(s) of your study right away (results-support orientation and/or dataset-sharing orientation), then wait until your study has produced the goal sharing product (i.e. respectively, the publication containing a set of results for which your data sharing will provide support, or the final dataset your study is interested in sharing/disseminating). You will then audit and annotate the subset of study resources required to interpret, use, and/or reproduce the results or dataset already created, and annotate this subset of study resources right away. When you take the "top-down" annotation approach, you will generally audit the full subset of study resources required to support your result(s) and/or dataset regardless of whether these resources will ultimately be shared at a public data repository; however, you may choose to either 1) annotate all resources in this subset regardless of whether they will be shared in a public repository ("wholistic" annotation), or 2) annotate only the resources in this subset that will be shared in a public repository ("minimal" annotation).

See the alternative: As-you-go Annotation

Advantages
  • "Top-down" annotation allows for a narrower and more goal-focused approach to annotation and data-sharing as compared to the alternative "As-you-go" annotation; This approach may be very appealing to study groups that are well into their study as it reduces annotation and data sharing burden by focusing on producing the essential annotation needed to ensure that the results or datasets they share are discoverable, interpretable, reusable, and replicable by researchers and other potential collaborators.
  • Concentrates annotation and data packaging work at the end of the study which allows singular focus on the task and may speed the process.
  • If you choose to use "wholistic" annotation with the "top-down" annotation approach, you get the benefit of full local annotation of the files underlying your study's goal data sharing product (manuscript results or shared dataset), which not only maximizes the usefulness of your data for other investigators but also can be helpful internally, especially in preserving knowledge about the data even as team members may change over the course of the study. Even if you choose to use "minimal" annotation with the "top down" approach you still get the benefit of some local annotation to support preserving institutional knowledge within your study group.
  • If you choose to use "wholistic" annotation with the "top-down" annotation approach, you maximize transparency and allow other researchers interested in the data to understand the full scope of the project and the data that explicitly underlies your study's goal data sharing product (manuscript results or shared dataset) when accessing study documentation. Even if you choose to use "minimal" annotation with the "top down" approach you still provide a lot of transparency about what data and non-data supporting files you are sharing to support your study's goal data sharing product, how these files relate to each other and how to use and interpret these files; this presents a huge value to potential secondary data users or collaborators.
  • If you choose to use "wholistic" annotation with the "top-down" annotation approach, this still allows for documentation of the existence and disposition of files that are too sensitive to share but are important for reproducibility, interpretability and use of the study's goal data sharing product (manuscript results or shared dataset) and can perhaps be requested directly from the study team by another researcher.
  • If you choose to use "wholistic" annotation with the "top-down" annotation approach, this allows for documenting and sharing all metadata associated with the study's goal data sharing product (manuscript results or shared dataset) and this can increase the discoverability of your study.
Caveats
  • With "top-down" annotation, your study team will not have the benefits of full local annotation. If there are any data/files that were not included in a manuscript's results, then they will not be documented for your study team's reference. Future publications on this data, with similar sharing requirements, will not be able to draw on an existing fully documented inventory of study files and resources.
  • "Top-down" annotation is necessarily more narrow and goal-focused than "As-you-go," which means that someone accessing your study on a repository will get a focused view of a specific set of results or a key dataset associated with your study. This focused view is helpful for understanding your published results, but does not give researchers insight into the full picture of the study. For example, it may be very useful for researchers conducting similar studies to be able to review and learn from negative data, where an experiment did not acheive the desired result.

Finding your study's best fit annotation approach

For guidance on determining the best fit annotation approach for your study group, see the best fit questions.

Data Package

A data package is made up of your study files and the Standard Data Package Metadata, a set of metadata files that describe your study, your study files and results, and the relationships among your study files and results.

Shareable Data Package

A shareable data package is a zip archive produced locally that is ready to be submitted to a repository. It includes only the files to be shared with that repository. There are multiple "flavors" of shareable data package, depending on how and when files are designated to be shared:

  • open-access now
  • open-access-by-date (i.e., embargoed through a specific date)
  • managed-access now
  • managed-access-by-date (i.e., embargoed through a specific date)
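
As a minimal sketch (not official tooling), assembling one of these flavors locally might look like the following; the file paths are hypothetical and the access labels simply echo the access levels described in this documentation:

```python
import zipfile
from pathlib import Path

# Minimal sketch only: bundle the files designated for one sharing "flavor"
# into a local zip archive. Paths and access labels are hypothetical.
designated_access = {
    "dsc-pkg/resource-tracker.csv": "open-access",
    "dsc-pkg/data-dictionary-analytic_dataset.csv": "open-access",
    "data/analytic_dataset.csv": "open-access-by-date",  # embargoed; belongs in a different package
    "data/raw/visit1_raw.csv": "permanent-private",      # never shared; excluded from every package
}

flavor = "open-access"
with zipfile.ZipFile(f"shareable-data-package-{flavor}.zip", "w") as archive:
    for path, access in designated_access.items():
        if access == flavor and Path(path).exists():
            archive.write(path)  # include only files designated for this flavor
```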

Data-sharing Orientation

A study group's goal(s) for data-sharing.

People will have different goals in sharing data and non-data supporting files/resources from their study. These goals, or 'orientations', may be dictated by a number of factors, such as the nature of the data, study staff resources and time constraints, investigator preference, and requirements imposed by a funder, journal, or other entity somehow governing the study.

Determining what your goals are for data sharing will help guide you in determining what you will include in your data package and how and when you will annotate the data and non-data supporting files/resources in your data package.

There are two main data-sharing "orientations." These data sharing goals or "orientations" are not mutually exclusive. You may choose both Results-support and Dataset-sharing as goals for your study's data sharing.

Results-support

  • The study group wants to share data and non-data supporting files specifically to support results shared in a published manuscript or other venue (e.g. presentation, poster, report, etc.)
  • They will share items required to interpret, replicate, or use these results

Dataset-sharing

  • The study group wants to share a specific dataset(s);
    • perhaps the group has collected/created a very rich dataset and used this dataset to ask and publish results related to specific scientific questions;
    • they believe other study groups may be able to leverage this dataset to ask and publish results related to other scientific questions that may be related or unrelated to the questions the dataset was originally collected to help investigate
  • The study group wants to share this specific dataset(s), as well as the data and non-data supporting files specifically to support use of this dataset(s)
  • They will share items required to interpret, replicate, or use this dataset(s)