PoSeNoGap

Portable Scalable Concurrency for Genomic Data Processing

PoSeNoGap aims to design a new framework for managing big genomic data. Sequencing a human genome produces around 300 GB of raw data. Compressing this data is necessary and requires computing resources, and analyzing it is also very time-consuming. The project team will therefore develop new approaches for both the compression and the analysis of genomic data. Within this context, REDS is developing an efficient solution for clustering unmapped sequences using a high-end FPGA platform.

Sequencing the genetic information of the human genome has become affordable thanks to high-throughput sequencing technology. This opens new perspectives for the diagnosis and successful treatment of cancer and other genetic illnesses. However, challenges remain, scientific as well as computational, that must be addressed before this technology finds its way into everyday practice in healthcare and medicine. The first challenge is to cope with the flood of sequencing data: several terabytes of data per week are currently collected by Vital-IT (the SIB's High Performance and High Throughput Bioinformatics Competency Center). For instance, a database covering the inhabitants of a small country like Switzerland would need to store about 2'335'740 terabytes of data. The second challenge is the ability to process such an enormous deluge of data in order to 1) increase the scientific knowledge of genome sequence information and 2) search genome databases for diagnosis and therapy purposes. Significant compression of genomic data is required to reduce the storage size, to increase the transmission speed and to reduce the cost of I/O bandwidth connecting the database and the processing facilities. In order to process the data in a timely and effective manner, algorithms need to exploit the significant data parallelism and they need to b