Deidentification Guidance for Genomic Data

When considering how to deidentify or anonymize data before it is submitted to a repository, genomic data is a special case. Genomic data is complex and highly identifying, especially in combination with other information (e.g., other genomic data types or phenotypic information), which makes it very difficult to adequately anonymize.

There is not currently consensus on best practice for deidentification of genomic data, although some methods have shown promise. Deidentification methods for this type of data will also likely continue to rapidly advance and shift over time as the field of genomic research expands and the number of researchers interested in sharing this type of data increases.

Researchers have used a number of different strategies to attempt to adequately anonymize this type of data, with varying success. Strategies and recommendations range from sharing only summary data, to sharing individual-level data with phenotype-only deidentification and strong access controls, to using complex statistical methods, such as differential privacy, to additionally deidentify the genotype data directly (still with strong access controls).

Given the difficulty of anonymizing this data adequately, the recommended approach at this time is to apply current deidentification methods for this type of data, in consultation with a statistician or other individual familiar with genomic privacy considerations, and to supplement those methods with privacy and governance protections. Privacy and governance protections can include applying access controls, data use agreements, and physical and electronic security protocols.

Note about NIH-funded Genomic Research

If your study is funded by NIH and will generate genomic or sequencing data, you must adhere to the NIH Genomic Data Sharing Policy.

The Genomic Data Sharing Policy provides a framework for genomic data sharing, but some individual NIH Institutes and Centers (ICs) may have additional requirements for genomic data sharing. You should review the requirements of your specific funding IC to help determine what you are required to submit, when you must submit it in relation to data collection, and which of the deidentification methods below you should pursue.

Although there is not consensus on best practices for deidentifying this type of data, investigators can generally choose from a few different options when considering how to share their genomic data:

Option 1: Share summary data only
- If you would like to use this approach and your study is funded by NIH, review the NIH Genomic Data Sharing Policy guidelines and those of your specific funding IC to ensure that sharing summary data only is sufficient to fulfill data sharing requirements for your grant.
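
As a rough illustration of what "summary data" can mean in practice, the sketch below reduces individual-level genotype dosages to per-variant allele frequencies and withholds frequencies based on very small minor-allele counts. The function name `summarize_variants`, the input layout (one list of 0/1/2 dosages per participant), and the `min_count` threshold are all assumptions made for this example. They are not requirements of NIH or of any repository, and this is not a vetted disclosure-control procedure.

```python
def summarize_variants(genotypes, variant_ids, min_count=5):
    """Reduce individual-level dosages to per-variant summary rows.

    genotypes:   one list per participant, holding the alternate-allele
                 dosage (0, 1, or 2) for each variant (hypothetical layout).
    variant_ids: labels for the variants, in the same column order.
    min_count:   illustrative threshold; frequencies are withheld when the
                 minor-allele count falls below it.
    """
    n_people = len(genotypes)
    total_alleles = 2 * n_people
    rows = []
    for j, variant in enumerate(variant_ids):
        alt_count = sum(person[j] for person in genotypes)
        minor_count = min(alt_count, total_alleles - alt_count)
        if minor_count < min_count:
            # Too few minor alleles: report the variant but not its frequency.
            alt_freq = None
        else:
            alt_freq = alt_count / total_alleles
        rows.append({"variant": variant, "n": n_people, "alt_freq": alt_freq})
    return rows


if __name__ == "__main__":
    # Four hypothetical participants, two hypothetical variants.
    cohort = [[0, 2], [1, 2], [1, 2], [0, 1]]
    for row in summarize_variants(cohort, ["var_A", "var_B"], min_count=2):
        print(row)
```

Only the aggregated rows would be shared under this approach; the individual-level `cohort` data stays with the study team. Whether aggregation of this kind is adequate for a given dataset is a question for a deidentification expert.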

Option 2: Share individual-level data
- When using this approach, you should share your data in a repository that allows for sharing with strict access controls.
  - Generally, a repository with strict access controls in place will require anyone who wants to access managed-access data to submit a formal request for access, and may require them to provide an IRB-approved research proposal describing how they will use the data.
  - Depending on the level of access controls applied, a repository may also require researchers requesting to access the data to sign a data use agreement (DUA) and may only permit users to work with the data in a secure environment.
- Additionally, if you pursue this option, some amount of deidentification will be required
  - Generally, deidentification is the responsibility of the investigator or study group. However, some repositories may provide some support or resources for deidentification. See the HEAL Data Stewards guidance on repositories and repository selection for more information on repository selection and support.
  - Best practice: Work with an expert in deidentification.
  - There are two options for deidentifying when sharing individual-level data with strict access controls:
    - Option A: Phenotype-only deidentification
      - This involves deidentifying only the clinical/phenotype data that accompanies genotype data but not deidentifying the genotype data
      - Note: This is the approach dbGaP suggests for data submitted to their repository; dbGaP additionally implements very strong access controls to protect the data
    - Option B: Phenotype and genotype deidentification
      - This approach involves the use of complex statistical methods that have been used with some success in deidentification of genomic data, such as differential privacy tools/algorithms.
      - Although the resulting dataset will have deidentified phenotypic and genotypic data, this data will likely still require strict access controls due to its sensitivity. Consult with a deidentification expert for guidance on appropriate access controls for your specific dataset.
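
To give a sense of what a differential privacy method looks like, the sketch below applies the Laplace mechanism, one standard differential privacy technique, to per-variant alternate-allele counts. It is a minimal illustration under stated assumptions, not a complete genotype deidentification pipeline: the function names, the input layout (one list of 0/1/2 dosages per participant), and the choice of `epsilon` are hypothetical. It assumes each participant can change any single variant's count by at most 2, so the L1 sensitivity of the full count vector is taken to be 2 times the number of variants.

```python
import random


def allele_counts(genotypes):
    """Sum alternate-allele dosages (0, 1, or 2) per variant across participants."""
    if not genotypes:
        raise ValueError("genotypes must contain at least one participant")
    n_variants = len(genotypes[0])
    return [sum(person[j] for person in genotypes) for j in range(n_variants)]


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_allele_counts(genotypes, epsilon, seed=None):
    """Release noisy allele counts using the Laplace mechanism.

    epsilon is the privacy budget for the whole vector of counts
    (smaller epsilon means more noise). The seed parameter exists only
    to make this example reproducible.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    counts = allele_counts(genotypes)
    # Assumption: one participant shifts each count by at most 2,
    # so the L1 sensitivity over all variants is 2 * number of variants.
    sensitivity = 2 * len(counts)
    scale = sensitivity / epsilon
    return [count + laplace_noise(scale, rng) for count in counts]


if __name__ == "__main__":
    # Two hypothetical participants, three hypothetical variants.
    cohort = [[0, 1, 2], [1, 1, 0]]
    print("exact counts:", allele_counts(cohort))
    print("noisy counts:", dp_allele_counts(cohort, epsilon=1.0, seed=0))
```

Because the noise scale grows with the number of variants, a real dataset with many variants would need either a larger privacy budget or a more refined method to remain useful. Choosing `epsilon` and validating the sensitivity assumption are the kinds of decisions that call for a statistician or deidentification expert.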

References

NIH Genomic Data Sharing Policy
dbGaP Study Submission Guide: Submission guide for the database of Genotypes and Phenotypes (dbGaP), the repository where all large-scale human genomic studies funded by NIH must register. This guide also provides information on subject de-identification for submission to the repository.
A practical path toward genetic privacy (2020): Reviews the deidentification landscape as it relates to genomic data, including reviewing regulatory requirements, the challenge of deidentifying this type of data, and some technical solutions. Brings together findings from multiple leading organizations to recommend that deidentification methods be combined with privacy and governance protections.
Methods for de-identification of electronic health records for genomic data (2011): Discusses issues related to deidentification and some good practices for managing reidentification risk related to genomic data.
Privacy considerations for sharing genomic data (2021): Reviews multiple genomic data types (sequence, transcriptomic, epigenetic) and their risk of reidentification, and reviews some established data sharing techniques.
Membership privacy in microRNA-based studies (2016): Discusses the threat of reidentification related to microRNA data and potential solutions for sharing this type of data.