High Performance Runtime Systems



GPGPUs and Optimization Techniques

Exploring optimizations such as kernel fusion, coarsening, tiling etc. for HPC applications like object detection, speech recognition, DNN training etc.

LEARN MORE

Scheduling in Heterogeneous Systems

Scheduling mechanisms for heterogeneous large scale clusters and embedded GPGPU platforms comprising multiple devices such OCPUs, iGPUs,Exploring optimizations FPGAs etc.

LEARN MORE

GPGPU Resources

A repository of online learning articles, books, libraries pertaining to CUDA/OpenCL and the latest HPC news facilitating individuals to explore a career in GPGPU Programming and Research

LEARN MORE

Ongoing Projects

Accelerating deep learning

The landscape of deep learning has witnessed a paradigm shift in recent years with the advent of compiler tool chains with dedicated graphical IRs, custom made accelerators tuned for tensor computation and algorithmic optimizations for both training and inference workloads on both heterogeneous compute clusters and embedded platforms. In this context, we are also exploring optimizations for deriving novel scheduling solutions for mapping deep learning workloads on heterogeneous architectures. For significantly large training workloads, the primary bottleneck is the on-chip memory of accelerators. In this regard we are investigating scheduling techniques of asynchronously prefetching and offloading memory in parallel with the computation for accelerating the training process. For inference pipelines, the primary objective would be to ensure that thermal degradation of the platform does not occur. In this regard, we are investigating novel GPGPU optimizations such as kernel fusion and coarsening for mapping the pipeline efficiently on an embedded platform. For both styles of optimizations, predictive modelling and control theoretic solutions are being envisaged for facilitating dynamic and adaptive mapping of computation kernels pertaining to deep learning workloads.

DL optimization


DAG scheduling in CPU-GPU clusters

Recent efforts towards application scheduling on heterogeneous multiprocessor architectures typically consider applications to be represented as directed acyclic graphs (DAGs). Each application DAG is annotated at the task level with relevant execution time information that is leveraged by scheduling algorithms (List or Clustering based) to ascertain near optimal schedules at runtime. However, given an online setting, where applications are submitted by multiple users and the types of applications are not restrictive, the chances of knowing execution time information for every program is highly unlikely. In this context, we propose a class of intelligent algorithms for heterogeneous CPU-GPU platforms that leverages static analysis methodologies coupled with machine learning and ascertains how based on static program structure of tasks, device assignments should be made at runtime. This methodology helps in bypassing the requirement for profiling based offline execution time analysis thus making it applicable in an online scheduling scenario. We model applications as data parallel task graphs using the OpenCL programming model leveraging its device independent portability feature. The scheduling algorithms (list and clustering) consider both coarse-grained (where operations of kernels are mapped in its entirety to a device) and fine-grained mapping decisions (where read, write and execute operations are interleaved in parallel).

DAG scheduling


Intelligent ADAS task management on heterogeneous automotive architecture

An ADAS system constitutes multiple object detection pipelines that process sensor data periodically leveraging state of the art Deep Neural Networks (DNNs) and Convolutional Neural Networks(CNNs). The pipelines are used to detect objects in the vicinity and accordingly dispatch commands to other vehicular subsystems such as park assist, anti-lock braking systems etc. for taking relevant actions. Therefore, there exists a natural requirement for real time guarantees for executing these object detection pipelines. Furthermore, such pipelines possess time varying dynamic FPS requirements dictated by different driving contexts. Given the set of ADAS detection workloads, designing automotive computing solutions with the ability to sustain the maximum FPS requirements of all detection pipelines simultaneously may lead to over-provisioning of resources on a restricted memory architecture, and also increased power consumption and thermal aging of the heterogeneous platform.

The present work proposes an intelligent runtime system which can manage the mapping and scheduling of ADAS detection pipelines in next-generation automotive embedded platforms in a self-learning fashion so that the time varying dynamic detection requirements of existing pipelines are efficiently managed while maintaining real time guarantees. Since investigating every possible mapping decision imposes a considerably large searchspace of decisions to be evaluated for finding an optimal solution,we design Reinforcement Learning based schemes for predicting task-device mapping decisions for each pipeline subject to dynamic FPS requirements over time.



Thermal aware scheduling in heterogeneous embedded platform

Modern mobile platforms are equipped with heterogeneous multiprocessor system-on-chip (MPSoC) where the architecture is becoming increasingly complex and heterogeneous having multiple compute elements with varying compute capabilities. They can include multicore big-LITTLE CPUs, accelerators capable of general-purpose computing like graphic processing units (GPUs), and field programmable gate arrays (FPGAs) or digital signal processor (DSP). Increasingly smaller chip size and functional complexity leads to increase in the power density which in turn results in chip overheating which degrades its reliability, and may risk physical safety. Due to the absence of active cooling measures in such mobile platforms, thermal constraints are much more stringent . Thus, in recent times temperature-aware resource scheduling that can maintain on-chip thermal constraints as well as application timing constraints has become a key system design objective in heterogeneous embeded platform with varying compute elements. We are actively researching on efficent scheduling mechanisms that can determine optimal or near-optimal assignment of tasks to devices with multiple control values of the thermal features in heterogeneous embedded processors having multiple constraints such as meeting deadline, power enveloping etc.

thermalOverview

Tool Developed

PySchedCL: Python based Scheduling framework for OpenCL applications

In the past few years, OpenCL has emerged as a widely used parallel programming standard for high perfor- mance computing on heterogeneous platforms comprising processors of widely varying architectures. The programming standard offers program portability across an array of different processor architectures e.g., general purpose (CPU), data parallel (GPU), task parallel (CELL/B.E.) etc. The OpenCL API provides the programmer with a vast array of options to write data-parallel programs efficiently for heterogeneous architectures. However there exists a steep learning curve in learning the low level directives offered by the OpenCL API completely. In addition to getting acquainted with the API, the programmer is burdened with the task of ascertaining target device characteristics and writing efficient code for utilizing the varied processing elements in the heterogeneous system to their complete potential. Our framework is based on the PyOpenCL python package and provides user friendly abstractions over the underlying API for designing efficient solutions much faster.

The primary goal of PySchedCL is to provide a platform for rapid prototyping of high performance applications as well as a research tool for experimenting various scheduling and mapping policies of multiple OpenCL applications for different heterogeneous platforms. The framework leverages a kernel specification scheme that specifies information regarding the host side management for the execution of an OpenCL kernel. Given the corresponding specification file for a kernel, the framework is capable of distributing the computation across the devices of the target heterogeneous architecture.

The salient features of the framework is represented in the following figure and discussed below.

  1. The framework is capable of scheduling a set of independent OpenCL kernels (with their specification files) provided as input.
  2. The framework comprises a static analysis frontend capable of extracting program features that will distribute the computation across available devices in the heterogeneous platform.
  3. serve in designing efficient heuristic functions for optimal mapping of kernels to devices.
  4. In addition to the static analysis frontend, the framework provides a robust scheduling backend that is capable of automatically managing OpenCL runtime parameters while intelligently scheduling data parallel workloads across the different processors of the heterogeneous architecture.

The current version of PySchedCL provides a scheduling engine that may be leveraged for the following tasks.

  1. Partitioning an OpenCL kernel across multiple CPU and GPU device(s).
  2. Scheduling a set of independent OpenCL kernels across multiple devices
  3. Designing runtime scheduling heuristics that leverage optimal partitioning information of individual OpenCL kernels.
  4. Utilities for carrying out extensive experimentation and evaluation of scheduling policies .

In addition to the above tasks, the tool provides a rich API and an extensive documentation for extending the current capabilities of the framework with ease. about img

Our Testbeds

Hardware

  • Cluster
    • 1 Master Node and 3 Compute nodes connected via Infiniband
    • Master node: Intel Xeon E5-2650V4 CPU and AMD FirePro S7150 GPU
    • Compute node(s): Intel Xeon E5-2650V4 CPU and a NVIDIA Tesla K40 GPU
  • Odroid XU4
    • Quad-core ARM Cortex-A7 CPU
    • Quad-core ARM Cortex-A15 CPU
    • ARM Mali-T628 GPU
  • Snapdragon 820
    • Quad-core Qualcomm Kryo CPU
    • Qualcomm Adreno 530 GPU
    • Qualcomm Hexagon 680 DSP
  • NXP S32V234 Vision Processor
    • Quad Cortex-A53 CPU
    • 3D GPU (GC3000)
  • NVIDIA Jetson Nano Developer Kit
    • Quad-core ARM A57 CPU
    • 128-core NVIDIA Maxwell GPU

cluster
embedded boards