Protecting Genetic Data with Synthetic Cohorts from Deep Generative Models (PRO-GENE-GEN)

Genetic data is highly privacy-sensitive and is therefore protected under stringent legal regulations, which makes sharing it burdensome. Yet genetic information holds great potential for the diagnosis and treatment of diseases and is essential for personalized medicine to become a reality. Privacy-preserving mechanisms have been introduced, but they either impose significant overhead or fail to fully protect sensitive patient data. This limits the ability to share data with the research community, which hinders scientific discovery as well as the reproducibility of results. We therefore propose a different approach: synthetic data sets that share the properties of patient data sets while respecting patient privacy. We achieve this by leveraging recent advances in generative modeling to synthesize virtual cohorts. Such synthetic data can be analyzed with established tool chains, can be accessed repeatedly without affecting the privacy budget, and can even be shared openly with the research community. Generative modeling of high-dimensional data such as genetic data has so far been prohibitively difficult, but recent developments in deep generative models have produced a series of success stories across a wide range of domains. The project will provide tools for generative modeling of genetic data as well as insights into the long-term potential of this technology to address open problems in the domain. The approaches will be validated against existing analyses that are not privacy-preserving. We will collaborate closely with the scientific community and propose guidelines on how to deploy and experiment with approaches that are practical in the overall process of scientific discovery. This project will be the first to allow the generation of synthetic high-dimensional genomic information to boost privacy-compliant data sharing in the medical community.

Participating Institutions


The CISPA Helmholtz Center for Information Security (CISPA) in Saarbrücken is one of the world’s leading research institutions in information security and privacy. It is dedicated to addressing the grand research challenges in security and privacy in a comprehensive and holistic manner, in particular at the intersection of security and privacy with AI/machine learning. It strives for cutting-edge, often disruptive foundational research, complemented by innovative application-oriented research, corresponding technology transfer, and societal outreach. Medical security and privacy, as well as foundational research in AI/machine learning, have been topics of central importance for CISPA ever since its inauguration.

The German Research Center for Neurodegenerative Diseases (DZNE) is dedicated to neurodegenerative diseases in all their facets, including Alzheimer’s, Parkinson’s, Huntington’s, amyotrophic lateral sclerosis (ALS), and frontotemporal dementia (FTD). To cover this diversity, the DZNE pursues an interdisciplinary research strategy that comprises four interconnected areas: fundamental research, clinical research, health care research, and population research. In line with this agenda, DZNE experts cooperate across sites and disciplines to promote the translation of research findings into practical application. Efficient and trustworthy processing of large datasets containing sensitive biomedical information is central to DZNE’s research, and the HMSP originated as a bilateral endeavor between DZNE and CISPA.