
More Information about Data Packaging

A data package is a collection of research data together with metadata, supporting files and documentation needed to permit someone unconnected to the original study to discover, understand, replicate, and/or (re)use the data or other study materials (e.g. code, protocol, survey instrument) for a specific purpose.

This guide aims to provide clear, step-by-step instructions for how to:

  1. Create a data package
  2. Prepare a data package for submission to a public data repository

Creating a data package in the first place makes the originating study group's later job of preparing it for submission to a public data repository easier.

Benefit for Secondary Users

The focus here is on creating data packages suitable for submission to a public data repository to permit replication of a previous analysis and/or secondary analyses.

Such packages include the information necessary to make the data findable, accessible, interoperable and reusable (FAIR). In particular, they (1) include metadata that may be indexed and searched to permit researchers to find the data; (2) organize and store data and supporting files using common, open standards and formats that make them easy to use; and (3) include sufficient documentation, including information on provenance and data use requirements and restrictions, to permit researchers to understand and reuse the data.
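As a rough illustration of the three elements above, the following Python sketch sets up a minimal package skeleton: a searchable metadata file, folders for data and supporting files, and a plain-text README. The function name `create_data_package`, the folder layout, and the metadata field names are all assumptions made for this example; they are not a schema required by any particular repository.

```python
import json
from pathlib import Path


def create_data_package(root, title, description, keywords, license_name):
    """Create a minimal data package skeleton under `root`.

    The layout and field names are illustrative assumptions, not a standard.
    """
    root = Path(root)

    # (2) Separate folders for data, documentation, and supporting files.
    for sub in ("data", "docs", "code"):
        (root / sub).mkdir(parents=True, exist_ok=True)

    # (1) Metadata stored as JSON, an open format that can be indexed and searched.
    metadata = {
        "title": title,
        "description": description,
        "keywords": list(keywords),
        "license": license_name,
        "provenance": "",  # to be completed by the study group
    }
    (root / "metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )

    # (3) Plain-text documentation, including data use requirements and restrictions.
    readme = (
        f"{title}\n\n"
        f"{description}\n\n"
        "Provenance: (describe where the data came from)\n"
        "Data use requirements and restrictions: (describe here)\n"
    )
    (root / "README.txt").write_text(readme, encoding="utf-8")
    return root
```

Calling `create_data_package("my_study", "Example study", "Survey data", ["survey"], "CC-BY-4.0")` would create the folders and the two files under `my_study`; the argument values here are placeholders.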

Benefit for Originating Study Group & Data Packages without Data

Data packages are the best way to share primary research data with potential secondary users, but they are also useful to the originating study group itself. This holds even when the group is generating data that cannot be shared or is working with secondary data that cannot be redistributed.

A data package provides an ideal way to organize your work locally, increasing the efficiency and reproducibility of your own work and facilitating collaboration within your team. In addition, a data package may easily be modified to exclude all or some of the data themselves, possibly replacing them with synthetic data or, in the case of secondary data, with links to their sources. The result is a product that can be shared to document your work and permit other researchers to replicate your analysis.
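One way to derive such a shareable copy is sketched below in Python. It copies a package folder to a new location and, for each file that cannot be redistributed, writes a small text stub pointing to the data source instead of the file itself. The function name `make_shareable_copy`, the `source_links` mapping, and the `.source.txt` stub naming are hypothetical choices for this example, not part of any established tool.

```python
import shutil
from pathlib import Path


def make_shareable_copy(package_dir, output_dir, source_links):
    """Copy a package, replacing listed data files with pointers to their sources.

    `source_links` maps a file path relative to the package root (written with
    forward slashes) to the URL or citation where the data can be obtained.
    `output_dir` should be outside `package_dir`.
    """
    package_dir = Path(package_dir)
    output_dir = Path(output_dir)
    replaced = []

    for path in sorted(package_dir.rglob("*")):
        if path.is_dir():
            continue
        relative = path.relative_to(package_dir)
        target = output_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)

        key = relative.as_posix()
        if key in source_links:
            # Leave a short stub in place of the data file that cannot be shared.
            stub = target.with_name(target.name + ".source.txt")
            stub.write_text(
                f"Data not included. Source: {source_links[key]}\n",
                encoding="utf-8",
            )
            replaced.append(key)
        else:
            # Metadata, documentation, and code are copied unchanged.
            shutil.copy2(path, target)

    return replaced
```

The returned list names the files that were replaced by stubs, which can be checked against the list of restricted files before the copy is shared.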