SC15のBioinformaticsセッションの論文紹介 - それなりにマジメなメモ

11/１5-11/20にスパコンのランキングが発表されることで有名なSCがあった。今年もbioinformatics関係のセッションがあり、3つ論文があったのでせっかくなので読んでみたので簡単な論文紹介と感想の記事。

今回、発表のあった論文は以下の３つ。

Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Steven Hofmeyr, Chaitanya Aluru, Rob Egan, Leonid Oliker, Daniel Rokhsar, and Katherine Yelick. 2015. HipMer: an extreme-scale de novo genome assembler. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, , Article 14 , 11 pages. DOI=http://dx.doi.org/10.1145/2807591.2807664
Patrick Flick, Chirag Jain, Tony Pan, and Srinivas Aluru. 2015. A parallel connectivity algorithm for de Bruijn graphs in metagenomic applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, , Article 15 , 11 pages. DOI=http://dx.doi.org/10.1145/2807591.2807619
Patrick Flick and Srinivas Aluru. 2015. Parallel distributed memory construction of suffix and longest common prefix arrays. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, , Article 16 , 10 pages. DOI=http://dx.doi.org/10.1145/2807591.2807609

HipMer: an extreme-scale de novo genome assemble

3本の中で一番bioinformaticsっぽい論文。HipMerというMeraculous[1]をマルチノードに拡張したツールの話。
昨年のSC14でも同じようにMeraculousの分散並列版の話をしていたが、contigの生成のところに新しい最適化を入れたのとscaffoldingの部分も分散並列させたという部分が新しい。
これによってMeraculousの全部の処理を分散環境で実行できるようになった。
実験として、ヒトのデータ（101 bpのread、2.9 billion本）のアセンブルを12コアのマシン1280ノード（合計15360コア）使って8.4分で処理させることができたとのこと。

スケーリング自体は悪いものの他のマルチノードのアセンブラであるRayやABySSと比べても速度が10倍以上でているらしいのでうまくタスクを分散できている模様。
Scaffoldingの部分の並列化はデータ並列である程度並列化がでるようなのであまり面白くなかったが、contig生成の部分で新しく追加された最適化である頻度の高いk-merの判定(Section 3.1)とde Bruijn Graphの探索時の通信回数削減方法(Section 3.2)が面白かった。

A parallel connectivity algorithm for de Bruijn graphs in metagenomic applications

AssemblerにGraph500でおなじみBreadth First Search (BFS)を適応するという話だったのでde Bruijn Graphの探索部分に使うのか？と思ったが違った。Metagenomeのデータだとhigh species level（？）でde Bruijn graphsがつながっていないことが多いらいし、ということが[2]で示されているとか。このため、最初にde Bruijn graphsの連結を調べて、グラフを分割してしまえばあとは個別にアセンブルできる。（[2]を読んでないので私自身どれくらい分割できるかとかわかっていない）

この最初の分割をBFSでやってしまうというのがこの論文で紹介されていること。なのでメインの話はBFSを如何に高速に処理するかの部分で、BFSを使ってどのようにde Bruijn graphsを分割するかがおまけ程度に書かれている。
BFSの応用にこんなのがあったのかという発見があったので、その部分は面白かった。

Parallel distributed memory construction of suffix and longest common prefix arrays

３つの論文でバイオっぽい単語が一番出てこなかった論文。タイトルの通り、suffix arrayとlcpの構築を分散環境で高速に行う話。
分散環境においてのsuffix arrayの構築はすごい通信が多くなりそうで並列度でないのでは？と思ったがそうでもないようで128コアから1024コアの部分でみるとstrong scalingで0.5でているのでそこそこでている。

大本のアルゴリズムはprefix doublingで並列用に少し手を加えたものを利用している。
並列化方法はall-to-all、all-reduce,、scan、exscanなどの並列化するときによく用いられている処理を駆使して行っている。
一番の重要な部分はsortのときの通信回数を減らすためにメモリの局所性を意識して実装されているところ。これによって速度がかなり上がっている模様。

個人的に驚いたのは1024コアでヒトゲノムのsuffix arrayがたった5秒でできてしまうところ。
コア数が多い気もするがこれだけ速ければsuffix arrayの構築で時間がかかるから使うのをあきらめてた部分とかに使えないかな？

全体の感想

昨年よりもバイオ感がかなり減っていたが、面白い部分がいくつかあった。
ただ、問題設定がde novo assemblerに偏ってたのでもう少しいろいろbioinformaticsでHPCを活用しなきゃならない問題がないのかな？と思った。

Reference

Meraculous: De Novo Genome Assembly with Short Paired-End Reads Chapman JA, Ho I, Sunkara S, Luo S, Schroth GP, et al. (2011) Meraculous: De Novo Genome Assembly with Short Paired-End Reads. PLoS ONE 6(8): e23501. doi: 10.1371/journal.pone.0023501
A. C. Howe, J. K. Jansson, S. A. Malfatti, S. G. Tringe, J. M. Tiedje, and C. T. Brown. Tackling soil diversity with the assembly of large, complex metagenomes. Proceedings of the National Academy of Sciences, 111(13):4904--4909, 2014.