Program

Monday, Aug 14, 1:45 PM - 5:00 PM
Room 3.33

 
1:45 PM - 2:00 PM

Welcome Address and Opening Remarks 

2:00 PM - 3:00 PM

Technical Session 1: Performance

2:00 PM - 2:30 PM Thomas Carroll and Prudence W.H. Wong. An Improved Abstract GPU Model with Data Transfer

GPUs are commonly used as coprocessors to accelerate compute-intensive tasks, thanks to their massively parallel architecture. Abstract parallel models have been studied extensively, as they allow researchers to design and analyse parallel algorithms. However, most work on analysing GPU algorithms has relied on software-based tools for profiling GPU programs. Recently, some abstract GPU models have been proposed, yet they do not capture all elements of a GPU; in particular, they miss the data transfer between CPU and GPU, which in practice can cause a bottleneck and reduce performance dramatically. We propose a comprehensive model called the Abstract Transferring GPU, which, to our knowledge, is the first abstract GPU model to capture data transfer between CPU and GPU. We show via experiments that existing models cannot sufficiently model the actual running time in all cases, as they do not capture data transfer. We show that by capturing the data transfer with our model, we obtain more accurate predictions of the actual running time. We expect our model to improve the design and analysis of heterogeneous systems consisting of CPUs and GPUs, and to allow researchers to make better-informed implementation decisions, as they will be aware of how data transfer affects their programs.
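
The abstract does not give the model's cost formula, so as a minimal illustration of why an abstract GPU model must account for transfers, the hedged CUDA sketch below times the host-to-device copy separately from the kernel; on transfer-bound workloads the copy dominates the end-to-end time that a compute-only model would predict. All names and sizes are illustrative assumptions, not the paper's Abstract Transferring GPU model.

    // Illustrative CUDA sketch (not the paper's model): time the
    // host-to-device transfer and the kernel separately to show how
    // the transfer can dominate end-to-end running time.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void scale(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;               // trivial compute per element
    }

    int main() {
        const int n = 1 << 26;                 // ~64M floats, ~256 MB
        float *h = (float *)malloc(n * sizeof(float));
        float *d;
        cudaMalloc(&d, n * sizeof(float));

        cudaEvent_t t0, t1, t2;
        cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

        cudaEventRecord(t0);
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        scale<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(t2);
        cudaEventSynchronize(t2);

        float copyMs, kernMs;
        cudaEventElapsedTime(&copyMs, t0, t1);  // transfer time
        cudaEventElapsedTime(&kernMs, t1, t2);  // kernel time
        printf("transfer %.2f ms, kernel %.2f ms\n", copyMs, kernMs);

        cudaFree(d); free(h);
        return 0;
    }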

2:30 PM - 3:00 PM Carlos Reaño and Federico Silla. A Comparative Performance Analysis of Remote GPU Virtualization over Three Generations of GPUs

The use of Graphics Processing Units (GPUs) has become a very popular way to accelerate the execution of many applications. However, GPUs are not exempt from side effects. For instance, GPUs are expensive devices that additionally consume a non-negligible amount of energy even when they are not performing any computation. Furthermore, most applications present low GPU utilization. To address these concerns, the use of GPU virtualization has been proposed. In particular, remote GPU virtualization is a promising technology that allows applications to transparently leverage GPUs installed in any node of the cluster. In this paper, the remote GPU virtualization mechanism is comparatively analyzed across three different generations of GPUs. The goal of this study is to analyze how the performance of the remote GPU virtualization technique is impacted by the underlying hardware. To that end, the Tesla K20, Tesla K40 and Tesla P100 GPUs, along with FDR and EDR InfiniBand fabrics, are used in the study. The analysis is performed in the context of the rCUDA middleware. It is clearly shown that the GPU virtualization middleware requires a comprehensive design of its communication layer, which should be perfectly adapted to every hardware generation in order to avoid a reduction in performance.
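
A sketch of the kind of microbenchmark such a study relies on (not the authors' actual benchmark suite): because rCUDA intercepts the standard CUDA API, an unmodified program like the one below exercises the middleware's communication layer when run with the rCUDA client libraries, and sweeping the message size exposes how the underlying fabric (e.g., FDR vs. EDR InfiniBand) shapes effective bandwidth. Pinned host memory is used, as is common in bandwidth tests.

    // Illustrative host-to-device bandwidth probe. Under rCUDA the
    // same cudaMemcpy calls travel over the network fabric, so the
    // measured GB/s reflects the middleware's communication layer.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        float *h, *d;
        const size_t maxBytes = 64UL << 20;     // up to 64 MB messages
        cudaMallocHost(&h, maxBytes);           // pinned host memory
        cudaMalloc(&d, maxBytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);

        for (size_t bytes = 1 << 10; bytes <= maxBytes; bytes <<= 1) {
            cudaEventRecord(start);
            for (int i = 0; i < 10; ++i)        // average over 10 copies
                cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            double gbps = (10.0 * bytes / 1e9) / (ms / 1e3);
            printf("%10zu bytes: %6.2f GB/s\n", bytes, gbps);
        }

        cudaFreeHost(h); cudaFree(d);
        return 0;
    }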

3:00 PM - 3:30 PM

Coffee break

3:30 PM - 5:00 PM

Technical Session 2: Programming and Resource Management

3:30 PM - 4:00 PM Manjunath Gorentla Venkata, Ferrol Aderholdt and Zachary Parchman. SharP: Towards Programming Extreme-Scale Systems with Hierarchical Heterogeneous Memory

Pre-exascale systems are expected to have a significant amount of hierarchical and heterogeneous on-node memory, and this architectural trend is expected to continue into the exascale era. Along with hierarchical-heterogeneous memory, these systems typically have a high-performing network and a compute accelerator. This system architecture is effective not only for running traditional High-Performance Computing (HPC) applications (Big-Compute), but also for running data-intensive HPC applications and Big-Data applications. As a consequence, there is a growing desire to have a single system serve the needs of both Big-Compute and Big-Data applications.
Though the system architecture supports the convergence of Big-Compute and Big-Data, programming models have yet to evolve to support either hierarchical-heterogeneous memory systems or this convergence. In this work, we propose and develop SharP (SHARed data-structure centric Programming abstraction) to address both of these goals, i.e., to provide (1) a simple, usable, and portable abstraction for hierarchical-heterogeneous memory and (2) a unified programming abstraction for Big-Compute and Big-Data applications.
To evaluate SharP, we implement a Stencil benchmark using SharP, port QMCPack, a petascale-capable application, and adapt the Memcached ecosystem, a popular Big-Data framework, to use SharP, and we quantify the performance and productivity advantages. Additionally, we demonstrate the simplicity of using SharP on different memories, including DRAM, High-Bandwidth Memory (HBM), and non-volatile random access memory (NVRAM).
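
The abstract describes SharP's goal but not its interface, so the hypothetical C++ sketch below only illustrates what a data-structure-centric abstraction over memory tiers might look like; every identifier here (Tier, TieredArray, tiered_alloc) is invented for illustration and is not SharP's actual API.

    // Hypothetical sketch of a data-structure-centric abstraction over
    // memory tiers; NOT SharP's actual API -- all names are invented.
    #include <cstdio>
    #include <cstdlib>

    enum class Tier { DRAM, HBM, NVRAM };

    template <typename T>
    struct TieredArray {
        T *data;
        Tier tier;
    };

    // Placement policy hidden behind one allocation call: the caller
    // names the data structure and a tier hint, not raw pointers.
    template <typename T>
    TieredArray<T> tiered_alloc(size_t n, Tier hint) {
        // A real implementation would dispatch to HBM- or NVRAM-backed
        // allocators based on the hint; plain malloc stands in here.
        return TieredArray<T>{ static_cast<T *>(std::malloc(n * sizeof(T))), hint };
    }

    int main() {
        // Same code shape regardless of which memory backs the array.
        TieredArray<double> grid = tiered_alloc<double>(1 << 20, Tier::HBM);
        grid.data[0] = 3.14;
        std::printf("allocated with tier hint %d\n", static_cast<int>(grid.tier));
        std::free(grid.data);
        return 0;
    }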

4:00 PM - 4:30 PM Huanhuan Xiong and John Morrison. Towards a Scalable and Adaptable Resource Allocation Framework in Cloud Environments

Finding an appropriate resource to host the next application to be deployed in a Cloud environment can be a non-trivial task. To deliver the appropriate level of service, the functional requirements of the application must be met. Ideally, this process involves filtering the best resource from a number of possible candidates, whilst simultaneously satisfying multiple objectives. If timely responses to resource requests are to be maintained, the sophistication of the filtering mechanism and the size of the search space have to be carefully balanced. The quality of the solution will thus not readily scale with growth in cloud resources and filtering complexity. This limitation is becoming more evident with the emergence of hyper-scale clouds and with the increased complexity needed to accommodate the growing heterogeneity in resources. Moreover, meeting non-functional requirements, reflecting the Cloud Service Provider's business objectives, is also becoming increasingly critical, as service utilization and energy efficiency in a typical cloud deployment are extremely low.
This paper reexamines the resource allocation problem by proposing a framework that supports distributed resource allocation decisions and that can be dynamically populated with strategies to reflect the ever-growing number of diverse objectives as they become evident in the evolving cloud infrastructure.
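
As a concrete illustration of the filtering step discussed above, the hypothetical C++ sketch below scores candidate hosts against several weighted objectives and selects the best; all host names, fields, and weights are invented, and a framework like the one proposed would let providers plug in new scoring strategies as objectives evolve.

    // Hypothetical multi-objective filtering of candidate hosts; all
    // names and weights are invented for illustration.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Host {
        const char *name;
        double cpuFree;      // fraction of CPU capacity free
        double memFree;      // fraction of memory free
        double energyCost;   // normalized energy cost, lower is better
    };

    // One pluggable strategy: a weighted sum over the objectives.
    double score(const Host &h) {
        return 0.4 * h.cpuFree + 0.4 * h.memFree - 0.2 * h.energyCost;
    }

    int main() {
        std::vector<Host> candidates = {
            {"node-a", 0.7, 0.5, 0.3},
            {"node-b", 0.4, 0.9, 0.1},
            {"node-c", 0.9, 0.2, 0.8},
        };
        // Filter the best resource from the candidate set.
        auto best = std::max_element(candidates.begin(), candidates.end(),
            [](const Host &x, const Host &y) { return score(x) < score(y); });
        std::printf("placing on %s (score %.2f)\n", best->name, score(*best));
        return 0;
    }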

4:30 PM - 5:00 PM Javier Prades and Federico Silla. Turning GPUs into Floating Devices over The Cluster: The Beauty of GPU Migration

Virtualization techniques have been shown to bring benefits to data centers and other computing facilities. In this regard, not only do virtual machines allow reducing the size of the computing infrastructure while increasing overall resource utilization, but virtualizing individual components of computers may also provide significant benefits. This is the case, for example, of the remote GPU virtualization technique, which has been implemented in several frameworks in recent years. The large degree of flexibility provided by remote GPU virtualization can, however, be further increased by applying the migration mechanism to it, so that the GPU part of an application can be live-migrated to another GPU elsewhere in the cluster during execution, transparently to the application. In this paper we discuss how the migration mechanism has been applied to different GPU virtualization frameworks. We also provide a big-picture view of the possibilities that migrating the GPU part of applications opens up for data centers and other computing facilities. Finally, we present the first results of ongoing work on applying the migration mechanism to the rCUDA remote GPU virtualization framework.
