Program


  
8:15 AM - 8:30 AM
Welcome Address and Opening Remarks

8:30 AM - 10:00 AM

Technical Session 1: Architectures

Toshihiro Hanawa, Norihisa Fujita, Tetsuya Odajima, Kazuya Matsumoto and Taisuke Boku: Evaluation of FFT for GPU Cluster Using Tightly Coupled Accelerators Architecture

Inter-node communication between accelerators in heterogeneous clusters incurs extra latency due to data copies between the host and the accelerator, and this communication latency causes severe performance degradation in applications.

To address this problem, we proposed the Tightly Coupled Accelerators (TCA) architecture and designed an interconnection router chip named PEACH2. Accelerators in the TCA architecture communicate directly via the PCIe protocol, which is currently the fundamental interface for all accelerators and host CPUs, eliminating both the protocol overhead and the data-copy overhead.

In this paper, we apply the TCA architecture to the FFT (Fast Fourier Transform), which is commonly used in scientific computation. First, we implemented all-to-all communication for the TCA, and then applied it to FFTE, one of the available FFT implementations. In an evaluation on the HA-PACS/TCA system, TCA achieved a 2.7 times speedup over MPI on 16 nodes for the medium problem size.
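The all-to-all exchange mentioned in the abstract is the transpose phase of a distributed multidimensional FFT. The sketch below illustrates that data flow serially with NumPy; in a multi-node run each block of rows lives on a different node and the transpose becomes an all-to-all exchange, which is the communication phase the TCA architecture accelerates. The decomposition shown is a generic illustration, not the internals of FFTE or PEACH2.

```python
import numpy as np

def fft2_via_transpose(a):
    # 1-D FFTs along the locally contiguous dimension
    step1 = np.fft.fft(a, axis=1)
    # global transpose: in the parallel case this is the all-to-all
    # exchange among nodes
    step2 = step1.T
    # 1-D FFTs along the formerly distributed dimension
    step3 = np.fft.fft(step2, axis=1)
    # transpose back to the original data layout
    return step3.T

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
# the transpose-based decomposition matches a direct 2-D FFT
assert np.allclose(fft2_via_transpose(x), np.fft.fft2(x))
```

Because the transpose is the only inter-node step, lowering its latency (as TCA does for small and medium transforms) directly shortens the critical path of the whole FFT.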

Tetsuya Odajima, Taisuke Boku, Toshihiro Hanawa, Hitoshi Murai, Masahiro Nakao, Akihiro Tabuchi and Mitsuhisa Sato: Hybrid Communication with TCA and InfiniBand on A Parallel Programming Language XcalableACC for GPU Clusters

In recent years, GPU-equipped clusters have come into wide use in HPC. However, the large communication latency between GPUs on different nodes is a serious obstacle to strong scalability. To reduce this latency, we proposed the "Tightly Coupled Accelerators (TCA)" architecture and developed the "PEACH2 board" as its proof-of-concept interconnection system. The current PEACH2 uses the PCI Express interface directly for interconnection; although this provides very low communication latency, it imposes some hardware limitations. One of them is the limited number of nodes, which restricts the scalability of the system: currently, a PEACH2 network is limited to 16 nodes, a unit we call a "sub-cluster". A larger system should therefore be configured as a collection of sub-clusters joined by a conventional interconnect such as InfiniBand, combining the features of both communication systems: the low latency of TCA and the high scalability of InfiniBand. For ease of programming, it is desirable to hide such a complicated communication system at the library or language level.

In this paper, we develop a hybrid interconnection network system combining PEACH2 for TCA with InfiniBand, and implement this feature in XcalableACC (XACC), a high-level parallel programming language for accelerated clusters. A preliminary performance evaluation confirms that the hybrid network improves the performance of the Himeno stencil-computation benchmark by up to 40% over MVAPICH2 with GDR on InfiniBand. Allgather collective communication with the hybrid network also improves performance by up to 50% on both 8 and 16 nodes. Combining local communication, served by the low latency of TCA, with global communication, served by the high bandwidth and scalability of InfiniBand, improves total performance.

Santiago Mislata and Federico Silla: On the Execution of Computationally Intensive CPU-based Libraries on Remote Accelerators for Increasing Performance: Early Experience with the OpenBLAS and FFTW Libraries

Virtualization techniques have been shown to benefit data centers and other computing facilities. Virtual machines not only reduce the size of the computing infrastructure while increasing overall resource utilization; virtualizing individual components of computers can also provide significant benefits. This is the case, for example, for the remote GPU virtualization technique, implemented in several frameworks in recent years.

In this paper we present an initial implementation of a new middleware for the remote virtualization of another component of computers: the CPU itself. Our proposal uses remote accelerators to perform computations that were initially intended to be carried out on the local CPUs, doing so transparently to the application and without modifying its source code. Using the OpenBLAS and FFTW libraries as case studies, we carry out a performance evaluation on several system configurations comprising Xeon processors, Ethernet and InfiniBand QDR, FDR, and EDR network adapters, and NVIDIA Tesla K40 GPUs. The results not only demonstrate that the new middleware is feasible, but also show that mathematical libraries can experience a significant speedup, despite having to move data back and forth to and from remote servers.
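The key idea in this abstract is interposition: the application keeps calling a familiar library entry point, and the middleware redirects the heavy computation elsewhere without source changes. A minimal conceptual sketch of that redirection, with the "remote" side collapsed to a local call, might look as follows; all names here are illustrative, not the paper's middleware API, and the real system would serialize the arguments and ship them over Ethernet or InfiniBand to a GPU server.

```python
import numpy as np

def remote_execute(op, *args):
    # stand-in for: serialize args, send to the accelerator server,
    # execute there, and receive the result back
    return op(*args)

def dgemm(a, b):
    # the application calls a BLAS-like routine as usual; the
    # interposed implementation forwards the work transparently
    return remote_execute(np.matmul, a, b)

a = np.arange(6.0).reshape(2, 3)
b = np.arange(12.0).reshape(3, 4)
# the caller cannot tell the computation was redirected
assert np.allclose(dgemm(a, b), a @ b)
```

In practice such interposition is done at the shared-library level (e.g. via the dynamic linker), which is what lets OpenBLAS and FFTW callers run unmodified.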


10:00 AM - 10:30 AM

Coffee break

10:30 AM - 11:30 AM

Technical Session 2: Applications

Thomas C. Carroll, Jude-Thaddeus Ojiaku and Prudence W.H. Wong: Pairwise Sequence Alignment with Gaps with GPU

In this paper we consider the pairwise sequence alignment problem with gaps, which is motivated by the re-sequencing problem of assembling short-read sequences into a genome sequence by referring to a reference sequence. The problem has previously been studied for a single gap and for a bounded number of gaps; for the single-gap case, a GPU-based algorithm has been proposed. In this work we propose a GPU-based algorithm for the bounded-number-of-gaps case. We implemented the algorithm and compared its performance with the CPU-based algorithm; the results are promising, with the GPU version achieving a speedup of more than 30 times.
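Pairwise alignment of this kind is computed by a dynamic-programming recurrence over a score table. The serial sketch below shows a basic Needleman-Wunsch-style version of that recurrence; the paper's GPU algorithm additionally bounds the number of gaps and parallelizes the table across anti-diagonals, and the scoring values here are illustrative choices, not the paper's.

```python
def align_score(s, t, match=1, mismatch=-1, gap=-2):
    n, m = len(s), len(t)
    # dp[i][j] = best score aligning s[:i] with t[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # s aligned against leading gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # t aligned against leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # match/mismatch
                           dp[i - 1][j] + gap,      # gap in t
                           dp[i][j - 1] + gap)      # gap in s
    return dp[n][m]

# identical 7-character strings score 7 matches
assert align_score("GATTACA", "GATTACA") == 7
```

Cells on the same anti-diagonal (constant i + j) depend only on earlier anti-diagonals, which is what makes one-GPU-thread-per-cell execution possible.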

Forrest Wolfgang Glines, Matthew Anderson and David Neilsen: Scalable Relativistic High-Resolution Shock-Capturing for Heterogeneous Computing 

A shift is underway in high performance computing (HPC) towards heterogeneous parallel architectures that emphasize medium- and fine-grain thread parallelism. Many scientific computing algorithms, including simple finite-differencing methods, have already been mapped to heterogeneous architectures with order-of-magnitude gains in performance as a result. Recent case studies examining high-resolution shock-capturing (HRSC) algorithms suggest that these finite-volume methods are good candidates for emerging heterogeneous architectures. HRSC methods form a key scientific kernel in the compressible inviscid solvers that appear in astrophysics and engineering applications, and they tend to require enormous memory and computing resources. This work presents a case study of an HRSC method executed on a heterogeneous parallel architecture utilizing hundreds of GPU-enabled nodes with remote direct memory access to the GPUs, for a non-trivial shock application using the relativistic magnetohydrodynamics model.


11:30 AM - 12:30 PM

Keynote address

