Program

9:00 AM - 9:10 AM

Welcome Address and Opening Remarks

9:10 AM - 10:00 AM

Keynote by Mark Hummel, NVIDIA: The challenges of creating computing nodes tailored for heterogeneous clusters

Much of the focus to date on heterogeneous compute clusters has centered on networking and the energy required to move data. However, there is another set of factors that has inhibited the promise offered by heterogeneous computing: issues related to the construction of effective heterogeneous nodes. This talk will outline some of these factors and the challenges they present.

Bio: Mark Hummel attended Clarkson College of Technology, where he received an undergraduate degree in Electrical Engineering (1981). He later earned a master's degree in Electrical Engineering from Worcester Polytechnic Institute. He has more than 30 years of experience in CPU, system, I/O, coherency and interconnect design, and has worked at various companies including Data General, AMD and multiple startups. During his career he was a key contributor to the development of the CLARiiON disk array, the HyperTransport bus protocol, the PCI Express bus protocol and AMD's Heterogeneous System Architecture (HSA). He is currently a Principal Architect at NVIDIA.

Presentation

10:00 AM - 10:30 AM

Technical Session 1: Specialized Accelerators

Oliver Knodel, Andy Georgi, Patrick Lehmann, Wolfgang E. Nagel and Rainer G. Spallek: Integration of a Highly Scalable, Multi-FPGA-Based Hardware Accelerator in Common Cluster Infrastructures

Heterogeneous systems consisting of general-purpose processors and different types of hardware accelerators are becoming more and more common in HPC systems. Field Programmable Gate Arrays (FPGAs) in particular provide an energy-efficient way to achieve high performance. Numerous application areas, including bio- and neuroinformatics, require enormous processing capability and employ simple computation cores, elementary data structures and algorithms highly suitable for FPGAs. To work efficiently with distributed FPGAs, it is necessary to integrate them into a common cluster architecture in a simple and scalable way and to permit easy access to these resources. Our approach enables system-wide dynamic partitioning, batch-based administration and monitoring of FPGA resources. The system can easily be reconfigured to user-specific requirements and provides a high degree of flexibility and performance.

Presentation

10:30 AM - 11:00 AM

Coffee break

11:00 AM - 1:00 PM

Technical Session 2: Mainstream Accelerators

Sebastian Rinke, Suraj Prabhakaran and Felix Wolf: Efficient Offloading of Parallel Kernels Using MPI_Comm_spawn

The integration of accelerators into cluster systems is currently one of the architectural trends in high performance computing. Usually, those accelerators are many-core compute devices directly connected to individual cluster nodes via PCI Express. Recent advances, however, no longer require a host CPU and even enable accelerators to be integrated as self-contained nodes that communicate via MPI over their own network interface. This approach offers new opportunities for application developers, as compute kernels can now span multiple communicating accelerators to better accommodate larger MPI-based code regions with the potential for massive node-level parallelism. However, it also raises the question of how to program such an environment. An instance of this novel cluster architecture is the DEEP cluster system, currently under development. Based on this hardware concept, we investigate the MPI_Comm_spawn process-creation mechanism for offloading MPI-based distributed-memory compute kernels onto multiple network-attached accelerators. We identify limitations of MPI_Comm_spawn and present an offloading mechanism that incurs only a fraction of the overhead of a pure MPI_Comm_spawn solution.
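For readers unfamiliar with the mechanism discussed above, the sketch below shows how a host-side MPI program can spawn a kernel onto additional processes with MPI_Comm_spawn and exchange data over the resulting inter-communicator. It is a minimal illustration only: the executable name, process count and message sizes are placeholders, and the paper's own offloading mechanism is designed precisely to avoid much of the overhead of this naive pattern.

```cpp
// Minimal sketch: the host program spawns MPI kernel processes and talks to
// them over an inter-communicator. "kernel.exe" and the process count are
// placeholders, not taken from the paper.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Collectively spawn 16 kernel processes; rank 0 supplies the command.
    char command[] = "kernel.exe";
    MPI_Comm kernel_comm;                       // inter-communicator to the kernel
    MPI_Comm_spawn(command, MPI_ARGV_NULL, 16, MPI_INFO_NULL,
                   /*root=*/0, MPI_COMM_WORLD, &kernel_comm,
                   MPI_ERRCODES_IGNORE);

    // Rank 0 of the parent side sends input to kernel rank 0 and waits for
    // the result; a real code would scatter work across all kernel ranks.
    std::vector<double> input(1024, 1.0), result(1024);
    if (world_rank == 0) {
        MPI_Send(input.data(), static_cast<int>(input.size()), MPI_DOUBLE,
                 0, 0, kernel_comm);
        MPI_Recv(result.data(), static_cast<int>(result.size()), MPI_DOUBLE,
                 0, 1, kernel_comm, MPI_STATUS_IGNORE);
        std::printf("kernel returned %f\n", result[0]);
    }

    MPI_Comm_disconnect(&kernel_comm);
    MPI_Finalize();
    return 0;
}
```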

Presentation

Norbert Eicker, Thomas Lippert, Thomas Moschny and Estela Suarez: The DEEP project: Pursuing cluster-computing in the many-core era

Facing Exascale by the end of the decade, the homogeneous cluster architectures that dominate high-performance computing (HPC) today will be challenged by heterogeneous approaches utilizing accelerator elements. The DEEP (Dynamical Exascale Entry Platform) project aims to implement a novel architecture for high-performance computing consisting of two components: a standard HPC Cluster and a cluster of many-core processors called Booster. To enable application developers to adapt their codes to this Cluster-Booster architecture as seamlessly as possible, DEEP provides a programming environment that integrates the offloading functionality given by the MPI standard with an abstraction layer based on the task-based OmpSs programming paradigm. This paper presents the DEEP project with an emphasis on the DEEP programming environment.
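As a rough illustration of the task-based style referred to above, the sketch below uses standard OpenMP tasks with declared data dependences. It is only an approximation: OmpSs (and the DEEP extensions that offload tasks onto Booster nodes) has its own pragma syntax and runtime, which are not reproduced here.

```cpp
// Illustrative only: plain OpenMP tasks ordered by data dependences, to give
// a flavour of the task-based paradigm that OmpSs builds on. The "kernel"
// below would be a candidate for offloading in the DEEP setting.
#include <vector>
#include <cstdio>

static void scale(std::vector<double>& v, double a) {
    for (double& x : v) x *= a;   // stand-in compute kernel
}

int main() {
    std::vector<double> a(1 << 20, 1.0), b(1 << 20, 2.0);

    #pragma omp parallel
    #pragma omp single
    {
        // Each task declares its inputs/outputs; the runtime orders the tasks
        // by these dependences instead of explicit synchronization.
        #pragma omp task depend(inout: a)
        scale(a, 3.0);

        #pragma omp task depend(inout: b)
        scale(b, 0.5);

        #pragma omp task depend(in: a, b)
        std::printf("a[0]=%f b[0]=%f\n", a[0], b[0]);
    }
    return 0;
}
```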

Presentation

A. K. Bahl, O. Baltzer, A. Rau-Chaplin, B. Varghese and A. Whiteway: Achieving Speedup in Real-time Aggregate Risk Analysis using Multiple GPUs

Stochastic simulation techniques employed for the analysis of portfolios of insurance/reinsurance risk, often referred to as ‘Aggregate Risk Analysis’, can benefit from exploiting state-of-the-art high-performance computing platforms. In this paper, we propose parallel methods to speed up aggregate risk analysis for supporting real-time pricing. To achieve this, an algorithm for analysing aggregate risk is proposed and implemented in C++ and OpenMP for multi-core CPUs and in C++ and CUDA for many-core GPUs. An evaluation of the performance of the algorithm indicates that GPUs offer a feasible alternative to traditional high-performance computing systems. An aggregate simulation of 1 million trials with 1000 catastrophic events per trial on a typical exposure set and contract structure is performed in less than 5 seconds on a multiple-GPU platform. The key result is that the multiple-GPU implementation of the algorithm presented in this paper can be used in real-time pricing scenarios, as it is approximately 77x faster than the sequential counterpart implemented on a CPU.
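A minimal sketch of the multicore (C++/OpenMP) flavour of such a simulation is given below: trials are independent, so they parallelize trivially over a parallel for loop. The loss distribution, seeds and trial count are placeholders chosen for illustration, not the paper's exposure data or algorithm.

```cpp
// Sketch of an aggregate risk simulation: each trial draws a fixed number of
// catastrophic events and sums their losses; trials run in parallel.
#include <omp.h>
#include <random>
#include <vector>
#include <cstdio>

int main() {
    const int trials = 100000;          // the paper runs 1 million trials
    const int events_per_trial = 1000;  // catastrophic events per trial
    std::vector<double> trial_loss(trials);

    #pragma omp parallel
    {
        // One generator per thread, seeded differently, to avoid shared state.
        std::mt19937_64 rng(12345 + omp_get_thread_num());
        std::lognormal_distribution<double> event_loss(10.0, 1.5);  // placeholder model

        #pragma omp for schedule(static)
        for (int t = 0; t < trials; ++t) {
            double loss = 0.0;
            for (int e = 0; e < events_per_trial; ++e)
                loss += event_loss(rng);
            trial_loss[t] = loss;       // per-trial aggregate loss
        }
    }

    double total = 0.0;
    for (double l : trial_loss) total += l;
    std::printf("mean aggregate loss per trial: %g\n", total / trials);
    return 0;
}
```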

Presentation

Raúl Pardo, Fernando L. Pelayo, Pedro Valero-Lara: GPU powered ROSA Analyzer

In this work we present the first GPU-based version of ROSAA, the ROSA Analyzer. ROSA is a Markovian process algebra able to capture pure non-determinism, probabilities and timed actions. On top of it, a tool has been developed to move closer to a fully automatic analysis of the behaviour of a system specified as a ROSA process: ROSAA automatically generates the part of the Labeled Transition System (LTS in the sequel), and occasionally the whole one, in which we are interested. Since this is a very computationally expensive task, a GPU-powered version of ROSAA with parallel processing capabilities has been created to better handle the generation process. As conventional GPU workloads mainly focus on data parallelism over quite similar types of data, this work represents a rather novel use of this kind of architecture; moreover, the authors are not aware of any other formal-modelling tool running on GPUs. ROSAA starts with a syntactic analysis that generates a layered structure suitable for subsequently applying the transition rules of the operational semantics in the easiest way. Since more than one rule may apply to each specification/state, this is the key point at which the GPU should provide its benefits: all the new states reachable from a given state in a single semantics step can be generated at the same time by launching a set of threads on the GPU platform. Although this is a step forward for the practical usefulness of such tools, the state-explosion problem still arises, so reducing the size of the LTS will sooner or later be required; along these lines, the authors are working on heuristics to prune enough branches of the LTS to make generating it more tractable.
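The parallel one-step expansion described above can be conveyed with the toy sketch below, in which every (state, rule) pair is expanded by its own parallel iteration. The "states" and "rules" are plain integers and arithmetic successors standing in for ROSA terms and its operational-semantics rules, and the sketch uses OpenMP on the CPU rather than a GPU.

```cpp
// Toy illustration: generate all states reachable in one semantics step from
// the current frontier, one (state, rule) combination per parallel iteration.
#include <vector>
#include <cstdio>

using State = long;

// Placeholder "transition rules": each returns the successor it produces.
static State rule_a(State s) { return 2 * s; }
static State rule_b(State s) { return 2 * s + 1; }
static State (*const rules[])(State) = { rule_a, rule_b };
constexpr int num_rules = 2;

int main() {
    std::vector<State> frontier = { 1 };              // initial state

    for (int step = 0; step < 3; ++step) {
        std::vector<State> next(frontier.size() * num_rules);

        // All successors of the current frontier are produced at once.
        #pragma omp parallel for collapse(2)
        for (long i = 0; i < (long)frontier.size(); ++i)
            for (int r = 0; r < num_rules; ++r)
                next[i * num_rules + r] = rules[r](frontier[i]);

        frontier.swap(next);
        std::printf("step %d: %zu states\n", step + 1, frontier.size());
    }
    return 0;
}
```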

Presentation

1:00 PM - 1:10 PM

Concluding remarks

