Challenges & Approach
While there is a long-standing tradition of generative modeling in AI and machine learning, only the most recent deep learning techniques have made these models work on high-dimensional, complex real-world data. Success stories range from images, video, and speech to patient records, covering sequence, array, and volumetric data. Our goal is to bring these techniques to complex medical domains, with a focus on genetic data. This has the potential to generate synthetic cohorts that allow privacy-preserving analysis with existing toolchains, as well as the distribution of sanitized data to accelerate research and make it reproducible.
However, genomic data is high-dimensional and highly variable, and its structure is only partially understood, since we lack mechanistic models that explain genomic features. For these reasons, we need to develop novel techniques that break the problem down. We will tackle these challenges with models specific to a task or problem class, with the incorporation of domain knowledge into learning, and with transfer learning. Beyond the methodology, we will investigate different processes that use such synthetic data while maintaining high accuracy in the scientific findings. These include relying fully on synthetic data, but also verifying findings obtained on synthetic data in a second stage with real data.
Genetic data, and in particular the full genome, is high-dimensional and contains a large number of features. The modification of a single base can result in disease, and modeling such changes in a privacy-preserving manner is not possible. The human transcriptome covers less than 5% of the genome, is more variable than the genome, and differs between tissues. The transcriptome and its corresponding regulatory networks are more robust to the fuzziness that modeling will introduce.
Therefore, we will first study the transcriptome for a selection of key genes, and then extend to the whole transcriptome and beyond. Nonetheless, we use the term genomic data throughout this document.