COMPSCI_C267

About Me

My name is Devin Petersohn. I am a first year Computer Science PhD student working with Prof. Anthony Joseph in the RISE Lab. I work in the Big Data Genomics group there. I am very interested in new Biology applications for parallel and distributed computing. I received my BS in Computer Science from the University of Missouri – Columbia, and have interned at Sandia National Labs in Livermore, CA in the Bioinformatics group. In this class, I would like to improve my understanding of parallel systems and apply the knowledge to some of the projects I am involved with.

Parallel Application: Joins on Genomic Region

Problem: Genomic data is extremely large, with multiple datasets exceeding 10’s of Petabytes in size. Processing this massive amount of data is extremely difficult, primarily because the legacy tools that exist are built for specific systems or don’t scale well. Joins are a very common operation in Genomic analyses. However, the size of these datasets create new opportunities in the parallel computing arena, because data locality is an absolutely essential component to any approach that attempts good performance.

Solutions: There are many solutions for parallel joins with large datasets. Database researchers have been interested in this problem since the early 1990’s [1]. Genomic Region based joins are a bit more complicated than traditional hash-based database joins because Genomic Regions are ranges of positions and joins need to be computed based on the overlap of ranges rather than the identity of a key. This creates a much more difficult problem in data locality because some regions join to many other regions.

Currently ADAM, a large-scale genomic data processing engine, has an implementation of Genomic Region joins. These joins are done with a “sweep” where positions are sorted and binned using a fixed length bin, the joined locally after the bins are filled. This creates problems when a bin contains significantly more data than other bins. The data locality problem is fixed, but a new problem is introduced regarding the amount of processing assigned to each node.

[1] DeWitt, D. J., Naughton, J. F., Schneider, D. A., & Seshadri, S. (1992). Practical skew handling in parallel joins (pp. 27-40). University of Wisconsin-Madison. Computer Sciences Department.