Publications

Mosaic flows: A transferable deep learning framework for solving PDEs on unseen domains

SURFNet: Super-resolution of Turbulent Flows with Transfer Learning using Small Datasets

Demystifying asynchronous I/O Interference in HPC applications

Train Once and Use Forever: Solving Boundary Value Problems in Unseen Domains with Pre-trained Deep Learning Models

Logically Parallel Communication for Fast MPI+Threads Applications

adPerf: Characterizing the Performance of Third-party Ads

Optimizing the Hypre solver for manycore and GPU architectures

Pencil: A pipelined algorithm for distributed stencils

Only Relative Speed Matters: Virtual Causal Profiling

Artificial Intelligence and High-Performance Computing: The Drivers of Tomorrow's Science

Brief Announcement: On the Limits of Parallelizing Convolutional Neural Networks on GPUs

How I Learned to Stop Worrying about User-Visible Endpoints and Love MPI

CFDNet: A deep learning-based accelerator for fluid simulations

Towards Portable Online Prediction of Network Utilization using MPI-level Monitoring

Breaking Band: A Breakdown of High-performance Communication

What-If Analysis of Page Load Time in Web Browsers Using Causal Profiling

Multi-criteria partitioning of multi-block structured grids

Portal: A High-Performance Language and Compiler for Parallel N-body Problems

Scalable Communication Endpoints for MPI+Threads Applications

Roofline Guided Design and Analysis of a Multi-stencil CFD Solver for Multicore Performance

Sugar: Secure GPU Acceleration in Web Browsers

cudaCR: An in-kernel application-level checkpoint/restart scheme for CUDA-enabled GPUs

PASCAL: A Parallel Algorithmic SCALable Framework for N-body Problems

We propose PASCAL, a parallel unified algorithmic framework for generalized N-body problems. PASCAL utilizes tree data structures and …

Unsteady Navier-Stokes Flow on GPU Architectures

Parallel Performance-Energy Predictive Modeling of Browsers: Case Study of Servo

A SystemC model for N-body problems and its Parallel Design Space Exploration

A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method

Brief Announcement: Towards a Communication Optimal Fast Multipole Method and its Implications at Exascale

Courses in High-Performance Computing for Scientists and Engineers

Communication-Optimal Parallel N-body Solvers

A massively parallel adaptive Fast Multipole Method on heterogeneous architectures

Balance principles for algorithm-architecture co-design

Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures

Diagnosis, Tuning, and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method

On the limits of GPU acceleration

Performance evaluation of Concurrent Collections on high-performance multicore computing systems

Optimizing and tuning the Fast Multipole Method for state-of-the-art multicore architectures

Performance evaluation of Concurrent Collections on high-performance multicore computing systems

Applying the Concurrent Collections Programming Model to Asynchronous Parallel Dense Linear Algebra

A massively parallel adaptive Fast Multipole Method on heterogeneous architectures

Multi-core implementations of the Concurrent Collections programming model

Declarative Aspects of Memory Management in the Concurrent Collections Parallel Programming Model

On the Design of Fast Pseudo-Random Number Generators for the Cell Broadband Engine and an Application to Risk Analysis

Numerical algorithms with tunable parallelism