Program

1:30 PM - 1:45 PM

Welcome Address and Opening Remarks

1:45 PM - 3:15 PM

Technical Session 1: Specialized Applications

Abhijeet Lawande, Hanchao Yang, Alan George and Herman Lam:
Simulative Analysis of a Multidimensional Torus-based Reconfigurable Cluster for Molecular Dynamics

Molecular dynamics (MD) is a large-scale, communication-intensive problem that has been the subject of high-performance computing research and acceleration for years. Not surprisingly, the greatest success in accelerating MD has come from specialized systems such as the Anton machine. Our goal is to design a reconfigurable system that can accelerate MD while also being amenable to other communication-intensive applications. In this paper, we present a performance model for the 3D FFT kernel that forms the core of MD simulation on Anton. We validate the model against published Anton performance data and use it to design and evaluate a similar interconnect for our existing Novo-G reconfigurable supercomputer. Through simulation studies, we predict that the upgraded machine will achieve nearly double the performance of Anton and fifty times that of established clusters like BlueGene/L for the 3D FFT kernel.
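
For readers unfamiliar with this style of analysis, the sketch below is a deliberately simplified, hypothetical estimate of distributed 3D FFT time on a 3D torus. The cost formula, the two-transpose communication pattern, and all parameters (link bandwidth, latency, per-node GFLOP/s) are illustrative assumptions, not the validated model presented in the paper.

# Illustrative (hypothetical) performance model for a distributed 3D FFT on a
# 3D torus. All parameters and the cost formula are assumptions made for this
# example; they are NOT the model from the paper.
import math

def fft3d_time_estimate(n=64, torus=(8, 8, 8),
                        link_bw=2e9,        # bytes/s per link (assumed)
                        link_latency=1e-6,  # seconds per hop (assumed)
                        node_gflops=50.0,   # sustained GFLOP/s per node (assumed)
                        bytes_per_point=8): # single-precision complex (assumed)
    """Rough estimate of one n^3 FFT: compute plus two transpose phases."""
    p = torus[0] * torus[1] * torus[2]          # total number of nodes
    points_per_node = n ** 3 / p                # local grid points
    # Compute: ~5 * N * log2(N) flops for the full transform, split across nodes.
    flops = 5 * n ** 3 * math.log2(n ** 3)
    t_compute = flops / (p * node_gflops * 1e9)
    # Communication: two transpose (all-to-all) phases along torus dimensions;
    # each node exchanges essentially all of its local data per phase.
    bytes_per_phase = points_per_node * bytes_per_point
    t_comm = 2 * (link_latency * max(torus) + bytes_per_phase / link_bw)
    return t_compute + t_comm

if __name__ == "__main__":
    print(f"estimated 3D FFT time: {fft3d_time_estimate() * 1e6:.1f} us")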

Presentation

Leiming Yu, Yash Ukidave and David Kaeli:
GPU-accelerated HMM for Speech Recognition

Speech recognition is used in a wide range of applications and devices such as mobile phones, in-car entertainment systems and web-based services. Hidden Markov Models (HMMs) are one of the most popular algorithmic approaches applied in speech recognition. Training and testing an HMM is computationally intensive and time-consuming. Running multiple applications concurrently with speech recognition could overwhelm the compute resources and introduce unwanted delays in the speech processing, eventually dropping words due to buffer overruns. Graphics processing units (GPUs) have become widely accepted as accelerators that offer massive amounts of parallelism. The host processor (the CPU) can offload compute-intensive portions of an application to the GPU, leaving the CPU to focus on serial tasks and scheduling operations. In this paper, we provide a parallelized Hidden Markov Model to accelerate isolated-word speech recognition. We experiment with different optimization schemes and make use of optimized GPU computing libraries to speed up the computation on GPUs. We also explore the performance benefits of using advanced GPU features for concurrent execution of multiple compute kernels. The algorithms are evaluated on multiple Nvidia GPUs using CUDA as the programming framework. Our GPU implementation achieves better performance than traditional serial and multithreaded implementations. When considering the end-to-end performance of the application, which includes both data transfer and computation, we achieve a 9x speedup for training with the use of a GPU over a multi-threaded version optimized for a multi-core CPU.
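
As background only, the sketch below shows a vectorized HMM forward pass (the likelihood computation at the heart of HMM-based recognition). The per-state matrix-vector operations are the kind of data parallelism that maps naturally onto a GPU; this NumPy version and its toy model are stand-ins, not the authors' CUDA implementation.

# Illustrative sketch: vectorized HMM forward algorithm (not the paper's code).
import numpy as np

def hmm_forward(pi, A, B, obs):
    """pi: (N,) initial probs, A: (N,N) transitions, B: (N,M) emission probs,
    obs: sequence of observation symbol indices. Returns P(obs | model)."""
    alpha = pi * B[:, obs[0]]            # initialize with the first observation
    for o in obs[1:]:
        # One time step: all states updated at once via a matrix-vector product.
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Tiny usage example with a hypothetical 2-state, 3-symbol model.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(hmm_forward(pi, A, B, [0, 1, 2]))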

Presentation

3:15 PM - 3:45 PM

Coffee break

3:45 PM - 4:45 PM

Technical Session 2: Communications

Benjamin Klenk, Lena Oden and Holger Fröning:
Analyzing Put/Get APIs for Thread-collaborative Processors 

In High-Performance Computing (HPC), GPU-based accelerators are pervasive for two reasons: first, GPUs provide much higher raw computational power than traditional CPUs; second, power consumption increases sub-linearly with the performance increase, making GPUs much more energy-efficient in terms of GFLOPS/Watt than CPUs. Although these advantages are limited to a select set of workloads, most HPC applications can benefit substantially from GPUs. The top 11 entries of the current Green500 list (November 2013) are all GPU-accelerated systems, which supports these statements.
For system architects, however, the use of GPUs is challenging, as their architecture is based on thread-collaborative execution and differs significantly from that of CPUs, which are mainly optimized for single-thread performance. The interfaces to other devices in a system, in particular the network device, are still solely optimized for CPUs. This makes GPU-controlled IO a challenge, although it is desirable for savings in both energy and time. This is especially true for network devices, which are a key component in HPC systems.
In previous work we have shown that GPUs can directly source and sink network traffic for InfiniBand devices without any involvement of the host CPUs, but this approach does not provide any performance benefits. Here we explore another API for Put/Get operations that can overcome some of these limitations. In particular, we provide detailed reasoning about the issues that prevent performance advantages when IO is controlled directly from the GPU domain.
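
To convey the programming model under discussion, the sketch below mocks up one-sided Put/Get semantics in plain Python for readability. The class names, method signatures, and completion behavior are assumptions chosen for illustration; they are not the API evaluated in the paper.

# Hypothetical sketch of a one-sided Put/Get interface (illustration only).
import numpy as np

class Window:
    """A registered memory region that remote ranks may access one-sidedly."""
    def __init__(self, size):
        self.buf = np.zeros(size, dtype=np.float32)

class Endpoint:
    def __init__(self, windows):
        self.windows = windows   # rank -> Window (simulated remote memory)
        self.pending = []        # outstanding one-sided operations

    def put(self, target_rank, offset, data):
        """Write 'data' into the target's window without involving its CPU."""
        self.pending.append(("put", target_rank, offset, np.asarray(data, np.float32)))

    def get(self, target_rank, offset, length, out):
        """Read 'length' elements from the target's window into 'out'."""
        self.pending.append(("get", target_rank, offset, length, out))

    def flush(self):
        """Complete all outstanding operations (stand-in for a fence/quiet call)."""
        for op in self.pending:
            if op[0] == "put":
                _, rank, off, data = op
                self.windows[rank].buf[off:off + data.size] = data
            else:
                _, rank, off, length, out = op
                out[:length] = self.windows[rank].buf[off:off + length]
        self.pending.clear()

# Usage: rank 0 puts data into rank 1's window, then reads it back.
wins = {0: Window(16), 1: Window(16)}
ep = Endpoint(wins)
ep.put(1, 0, [1.0, 2.0, 3.0])
ep.flush()
result = np.empty(3, dtype=np.float32)
ep.get(1, 0, 3, result)
ep.flush()
print(result)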

Presentation

Sébastien Varrette, Valentin Plugaru, Mateusz Guzek, Xavier Besseron and Pascal Bouvry:
HPC Performance and Energy-Efficiency of the OpenStack Cloud Middleware

Since its advent in the mid-2000s, the Cloud Computing (CC) paradigm has been increasingly advertised as THE solution to most IT problems. While High Performance Computing (HPC) centers continuously evolve to provide more computing power to their users, several voices (most probably commercial ones) express the wish that CC platforms could also serve HPC needs and eventually replace in-house HPC platforms. However, it is still unclear whether the overhead induced by the virtualization layer at the heart of every Cloud middleware is acceptable in an environment as demanding as an HPC platform. In parallel, with growing concern over the considerable energy consumed by HPC platforms and data centers, research efforts are targeting green approaches with higher energy efficiency. Here, virtualization is also emerging as the prominent approach to reduce energy consumption by consolidating multiple running VM instances on a single server, thus lending support to a Cloud-based approach. In this paper, we analyze from an HPC perspective the performance and energy efficiency of the leading open-source Cloud middleware, OpenStack, compared to a bare-metal (i.e. native) configuration. The experiments were performed on top of the Grid'5000 platform with benchmarking tools that reflect "regular" HPC workloads, i.e. HPCC (which includes the reference HPL benchmark) and Graph500. Power measurements were also performed in order to quantify the potential energy efficiency of the tested configurations, using the approaches proposed in the Green500 and GreenGraph500 projects. To abstract from the specifics of a single architecture, the benchmarks were run on two different hardware configurations, based on Intel and AMD processors. This work extends previous studies dedicated to the evaluation of hypervisors against HPC workloads. The results of this study plead for in-house HPC platforms running without any virtualization framework, showing that the current implementation of Cloud middleware is not well adapted to the execution of HPC applications.
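
As a pointer to how Green500-style figures are obtained, the short example below derives an efficiency number (GFLOPS per Watt) from a benchmark result and an average power measurement. The input numbers are invented for the example and are not results from the paper.

# Illustrative only: computing a Green500-style efficiency figure.
def energy_efficiency(gflops, avg_power_watts):
    """Return efficiency in GFLOPS per Watt."""
    return gflops / avg_power_watts

# Hypothetical HPL results on bare metal vs. under a Cloud middleware.
bare_metal = energy_efficiency(gflops=950.0, avg_power_watts=610.0)
virtualized = energy_efficiency(gflops=820.0, avg_power_watts=600.0)
print(f"bare metal:  {bare_metal:.2f} GFLOPS/W")
print(f"virtualized: {virtualized:.2f} GFLOPS/W")
print(f"relative overhead: {(1 - virtualized / bare_metal) * 100:.1f}%")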

Presentation

4:45 PM - 5:00 PM

Concluding Remarks

