Technical Session 1: Architectures
Toshihiro Hanawa, Norihisa Fujita, Tetsuya Odajima, Kazuya Matsumoto and Taisuke Boku: Evaluation of FFT for GPU Cluster Using Tightly Coupled Accelerators Architecture

Inter-node communication between accelerators in heterogeneous clusters incurs extra latency due to the data copy between host and accelerator, and this communication latency causes severe performance degradation in applications. To address this problem, we proposed the Tightly Coupled Accelerators (TCA) architecture and designed an interconnection router chip named PEACH2. Accelerators in the TCA architecture communicate directly via the PCIe protocol, which is the current fundamental interface for all accelerators and the host CPU, to eliminate the protocol overhead as well as the data copy overhead. In this paper, we apply the TCA architecture to the FFT (Fast Fourier Transform), which is commonly used in scientific computation. First, we implemented all-to-all communication for the TCA, and then applied it to FFTE, one of the FFT implementations. In an evaluation on the HA-PACS/TCA system, we achieved a 2.7 times speedup with TCA compared with MPI using 16 nodes on the medium problem size.
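As a rough illustration of where the TCA path plugs in, the sketch below shows the global exchange step of a generic distributed 1-D FFT expressed with plain MPI. This is an assumption about the usual transpose-based structure, not code from the paper or from FFTE, and the TCA-specific transfer calls are deliberately omitted.

```c
/* Generic sketch (not the paper's code): the global exchange step of a
 * distributed 1-D FFT.  Each rank transforms its local rows, then all ranks
 * exchange equal-sized blocks; the TCA path would perform this same exchange
 * directly between accelerators over PCIe instead of through the MPI stack. */
#include <mpi.h>
#include <complex.h>

void fft_global_exchange(double complex *send, double complex *recv,
                         int block_elems, MPI_Comm comm)
{
    /* Conventional path: all-to-all over the host network via MPI. */
    MPI_Alltoall(send, block_elems, MPI_C_DOUBLE_COMPLEX,
                 recv, block_elems, MPI_C_DOUBLE_COMPLEX, comm);
}
```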
Tetsuya Odajima, Taisuke Boku, Toshihiro Hanawa, Hitoshi Murai, Masahiro Nakao, Akihiro Tabuchi and Mitsuhisa Sato: Hybrid Communication with TCA and InfiniBand on A Parallel Programming Language XcalableACC for GPU Clusters

In recent years, GPU-equipped clusters have become widely used in HPC. However, the large communication latency between GPUs on different nodes is a serious obstacle to strong scalability. To reduce this latency, we proposed the Tightly Coupled Accelerators (TCA) architecture and developed the PEACH2 board as a proof-of-concept interconnection system for it. The current PEACH2 uses the PCI Express interface directly for the interconnect, and although it provides very low communication latency, it has some hardware limitations. One of them is the limited number of nodes, which restricts the scalability of the system: a PEACH2 network is currently limited to 16 nodes, which we call a "sub-cluster". A larger system must therefore be configured as a collection of sub-clusters joined by a conventional interconnect such as InfiniBand, combining the features of both communication systems: the low latency of TCA and the high scalability of InfiniBand. For ease of programming, it is desirable to hide such a complicated communication system at the library or language level. In this paper, we develop a hybrid interconnection network system with PEACH2 for TCA and InfiniBand, and implement this feature in XcalableACC (XACC), a high-level parallel programming language for accelerated clusters. A preliminary performance evaluation confirms that the hybrid network improves the performance of the Himeno stencil benchmark by up to 40% over MVAPICH2 with GDR on InfiniBand. Allgather collective communication with the hybrid network also improves performance by up to 50% on both 8 and 16 nodes. The combination of local communication supported by the low latency of TCA and global communication supported by the high bandwidth and scalability of InfiniBand improves total performance.
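The sub-cluster idea can be sketched in plain MPI terms as below. This is an illustrative assumption, not the XcalableACC or PEACH2 implementation: the grouping of ranks into sub-clusters by rank number and the size constant are placeholders, and no real TCA transfer call is shown.

```c
/* Generic sketch of the hybrid-network idea (not the XACC implementation):
 * ranks are grouped into sub-clusters of at most 16 nodes.  An exchange with
 * a peer in the same sub-cluster would take the low-latency TCA/PEACH2 path;
 * an exchange crossing sub-clusters goes over InfiniBand (here, plain MPI). */
#include <mpi.h>
#include <stdbool.h>

#define SUBCLUSTER_SIZE 16   /* current PEACH2 node limit per sub-cluster */

static bool same_subcluster(int a, int b)
{
    return a / SUBCLUSTER_SIZE == b / SUBCLUSTER_SIZE;
}

void exchange(void *sbuf, void *rbuf, int bytes, int peer, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (same_subcluster(rank, peer)) {
        /* Same sub-cluster: a low-latency TCA/PEACH2 put would be issued here
         * in the real system; this sketch has no TCA call, so it falls
         * through to the MPI path below to stay compilable. */
    }
    /* Cross-sub-cluster traffic (and the fallback above) uses InfiniBand via MPI. */
    MPI_Sendrecv(sbuf, bytes, MPI_BYTE, peer, 0,
                 rbuf, bytes, MPI_BYTE, peer, 0,
                 comm, MPI_STATUS_IGNORE);
}
```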
Santiago Mislata and Federico Silla: On the Execution of Computationally Intensive CPU-based Libraries on Remote Accelerators for Increasing Performance: Early Experience with the OpenBLAS and FFTW Libraries

Virtualization techniques have been shown to bring benefits to data centers and other computing facilities. Virtual machines not only allow reducing the size of the computing infrastructure while increasing overall resource utilization, but virtualizing individual components of computers may also provide significant benefits. This is the case, for example, of the remote GPU virtualization technique, implemented in several frameworks in recent years. In this paper we present an initial implementation of a new middleware for the remote virtualization of another component of computers: the CPU itself. Our proposal uses remote accelerators to perform computations that were initially intended to be carried out on the local CPUs, doing so transparently to the application and without modifying its source code. Using the OpenBLAS and FFTW libraries as case studies, we carry out a performance evaluation targeting several system configurations comprising Xeon processors, Ethernet and InfiniBand QDR, FDR, and EDR network adapters, and NVIDIA Tesla K40 GPUs. The results not only demonstrate that the new middleware is feasible, but also show that mathematical libraries may experience a significant speedup despite having to move data back and forth to and from remote servers.
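One plausible way to intercept library calls transparently is a dynamic-linker shim injected with LD_PRELOAD, sketched below. The abstract does not state that this is the mechanism the middleware uses, so treat it purely as an illustration of the interception concept; the remote forwarding is left as a comment and the call simply passes through to the next library in the link order.

```c
/* Minimal LD_PRELOAD-style sketch of the interception idea (an assumption
 * about the mechanism, not the authors' middleware): the shim defines
 * cblas_dgemm so that a dynamically linked application calls it instead of
 * OpenBLAS.  A remote-offload middleware would marshal the arguments and
 * buffers and ship them to a remote accelerator server at the marked point. */
#define _GNU_SOURCE
#include <dlfcn.h>

typedef void (*dgemm_fn)(int order, int transa, int transb,
                         int m, int n, int k,
                         double alpha, const double *a, int lda,
                         const double *b, int ldb,
                         double beta, double *c, int ldc);

void cblas_dgemm(int order, int transa, int transb,
                 int m, int n, int k,
                 double alpha, const double *a, int lda,
                 const double *b, int ldb,
                 double beta, double *c, int ldc)
{
    /* Remote offload would serialize the call and buffers here. */
    dgemm_fn real = (dgemm_fn)dlsym(RTLD_NEXT, "cblas_dgemm");
    real(order, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}
```

Built as a shared object and injected with LD_PRELOAD, such a shim redirects library calls without recompiling the application, which is consistent with the abstract's requirement of leaving the application source code unmodified.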