Haichen Shen

haichen at scroll dot io


I'm the co-founder of Scroll, where I lead the engineering team. I'm also an open-source enthusiast and have contributed to several open-source projects, including zkEVM, Apache TVM, and RAF.

Previously, I worked at Amazon Web Services as a senior applied scientist on AI compilers. I received my Ph.D. in computer science from the University of Washington, advised by Professor Arvind Krishnamurthy and Matthai Philipose. I received my bachelor's degree in computer science from the Institute for Theoretical Computer Science at Tsinghua University.


  • Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference
  • Haichen Shen*, Jared Roesch*, Zhi Chen, Wei Chen, Yong Wu, Mu Li, Vin Sharma, Zachary Tatlock, Yida Wang
  • MLSys, Apr 2021
    Modern deep neural networks increasingly make use of features such as control flow, dynamic data structures, and dynamic tensor shapes. Existing deep learning systems focus on optimizing and executing static neural networks which assume a pre-determined model architecture and input data shapes—assumptions that are violated by dynamic neural networks. Therefore, executing dynamic models with deep learning systems is currently both inflexible and sub-optimal, if not impossible. Optimizing dynamic neural networks is more challenging than static neural networks; optimizations must consider all possible execution paths and tensor shapes. This paper proposes Nimble, a high-performance and flexible system to optimize, compile, and execute dynamic neural networks on multiple platforms. Nimble handles model dynamism by introducing a dynamic type system, a set of dynamism-oriented optimizations, and a light-weight virtual machine runtime. Our evaluation demonstrates that Nimble outperforms existing solutions for dynamic neural networks by up to 20x on hardware platforms including Intel CPUs, ARM CPUs, and Nvidia GPUs.

  • Nexus: a GPU cluster engine for accelerating DNN-based video analysis
  • Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, Ravi Sundaram
  • SOSP, Oct 2019
    We address the problem of serving Deep Neural Networks (DNNs) efficiently from a cluster of GPUs. In order to realize the promise of very low-cost processing made by accelerators such as GPUs, it is essential to run them at sustained high utilization. Doing so requires cluster-scale resource management that performs detailed scheduling of GPUs, reasoning about groups of DNN invocations that need to be coscheduled, and moving from the conventional whole-DNN execution model to executing fragments of DNNs. Nexus is a fully implemented system that includes these innovations. In large-scale case studies on 16 GPUs, when required to stay within latency constraints at least 99% of the time, Nexus can process requests at rates 1.8-12.7× higher than state of the art systems can. A long-running multi-application deployment stays within 84% of optimal utilization and, on a 100-GPU cluster, violates latency SLOs on 0.27% of requests.

  • TVM: An automated end-to-end optimizing compiler for deep learning
  • Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy
  • OSDI, Sept 2018
    There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms - such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) - requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.

  • Fast Video Classification via Adaptive Cascading of Deep Models
  • Haichen Shen, Seungyeop Han, Matthai Philipose, Arvind Krishnamurthy
  • CVPR, July 2017 (spotlight)
    Recent advances have enabled "oracle" classifiers that can classify across many classes and input distributions with high accuracy without retraining. However, these classifiers are relatively heavyweight, so that applying them to classify video is costly. We show that day-to-day video exhibits highly skewed class distributions over the short term, and that these distributions can be classified by much simpler models. We formulate the problem of detecting the short-term skews online and exploiting models based on it as a new sequential decision making problem dubbed the Online Bandit Problem, and present a new algorithm to solve it. When applied to recognizing faces in TV shows and movies, we realize end-to-end classification speedups of 2.4-7.8×/2.6-11.2× (on GPU/CPU) relative to a state-of-the-art convolutional neural network, at competitive accuracy.

  • MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constraints
  • Seungyeop Han*, Haichen Shen*, Matthai Philipose, Sharad Agarwal, Alec Wolman, Arvind Krishnamurthy
  • MobiSys, Jun. 2016 (*equally contributed)
    We consider applying computer vision to video on cloud-backed mobile devices using Deep Neural Networks (DNNs). The computational demands of DNNs are high enough that, without careful resource management, such applications strain device battery, wireless data, and cloud cost budgets. We pose the corresponding resource management problem, which we call Approximate Model Scheduling, as one of serving a stream of heterogeneous (i.e., solving multiple classification problems) requests under resource constraints. We present the design and implementation of an optimizing compiler and runtime scheduler to address this problem. Going beyond traditional resource allocators, we allow each request to be served approximately, by systematically trading off DNN classification accuracy for resource use, and remotely, by reasoning about on-device/cloud execution trade-offs. To inform the resource allocator, we characterize how several common DNNs, when subjected to state-of-the-art optimizations, trade off accuracy for resource use such as memory, computation, and energy. The heterogeneous streaming setting is a novel one for DNN execution, and we introduce two new and powerful DNN optimizations that exploit it. Using the challenging continuous mobile vision domain as a case study, we show that our techniques yield significant reductions in resource usage and perform effectively over a broad range of operating conditions.

  • Enhancing Mobile Apps To Use Sensor Hubs Without Programmer Effort
  • Haichen Shen, Aruna Balasubramanian, Anthony LaMarca, David Wetherall
  • Ubicomp, Sept. 2015 (Best paper award, Gaetano Borriello best student paper award)
    Always-on continuous sensing apps drain the battery quickly because they prevent the main processor from sleeping. Instead, sensor hub hardware, available in many smartphones today, can run continuous sensing at lower power while keeping the main processor idle. However, developers have to divide functionality between the main processor and the sensor hub. We implement MobileHub, a system that automatically rewrites applications to leverage the sensor hub without additional programming effort. MobileHub uses a combination of dynamic taint tracking and machine learning to learn when it is safe to leverage the sensor hub without affecting application semantics. We implement MobileHub in Android and prototype a sensor hub on an 8-bit AVR microcontroller. We experiment with 20 applications from Google Play. Our evaluation shows that MobileHub significantly reduces power consumption for continuous sensing apps.

  • MetaSync: File Synchronization Across Multiple Untrusted Storage Services
  • Seungyeop Han, Haichen Shen, Taesoo Kim, Arvind Krishnamurthy, Thomas Anderson, David Wetherall
  • USENIX ATC, July 2015
    Cloud-based file synchronization services, such as Dropbox, are a worldwide resource for many millions of users. However, individual services often have tight resource limits, suffer from temporary outages or even shutdowns, and sometimes silently corrupt or leak user data.
    We design, implement, and evaluate MetaSync, a secure and reliable file synchronization service that uses multiple cloud synchronization services as untrusted storage providers. To make MetaSync work correctly, we devise a novel variant of Paxos that provides efficient and consistent updates on top of the unmodified APIs exported by existing services. Our system automatically redistributes files upon reconfiguration of providers.
    Our evaluation shows that MetaSync provides low update latency and high update throughput while being more trustworthy and available. MetaSync outperforms its underlying cloud services by 1.2-10× on three realistic workloads.

  • Enable Flexible Spectrum Access with Spectrum Virtualization
  • Kun Tan, Haichen Shen, Jiansong Zhang, Yongguang Zhang
  • DySPAN, Oct. 2012
    Enabling flexible spectrum access (FSA) in existing wireless networks is challenging due to the limited spectrum programmability - the ability to change spectrum properties of a signal to match an arbitrary frequency allocation. This paper argues that spectrum programmability can be separated from general wireless physical layer (PHY) modulation. Therefore, we can support flexible spectrum programmability by inserting a new spectrum virtualization layer (SVL) directly below traditional wireless PHY, and enable FSA for wireless networks without changing their PHY designs.
    SVL provides a virtual baseband abstraction to wireless PHY, which is static, contiguous, with a desirable width defined by the PHY. At the sender side, SVL reshapes the modulated baseband signals into a waveform that matches the dynamically allocated physical frequency bands - which can be of different widths, or non-contiguous - while keeping the modulated information unchanged. At the receiver side, SVL performs the inverse reshaping operation that collects the waveform from each physical band, and reconstructs the original modulated signals for PHY. All these reshaping operations are performed at the signal level and therefore SVL is agnostic and transparent to upper PHY. We have implemented a prototype of SVL on a software radio platform, and tested it with various wireless PHYs. Our experiments show SVL is flexible and effective in supporting FSA in existing wireless networks.

  • Frame Retransmissions Considered Harmful: Improving Spectrum Efficiency Using Micro-ACKs
  • Jiansong Zhang, Haichen Shen, Kun Tan, Ranveer Chandra, Yongguang Zhang, Qian Zhang
  • MobiCom, Aug. 2012
    Retransmissions reduce the efficiency of data communication in wireless networks because of: (i) per-retransmission packet headers, (ii) contention overhead on every retransmission, and (iii) redundant bits in every retransmission. In fact, every retransmission nearly doubles the time to successfully deliver the packet. To improve spectrum efficiency in a lossy environment, we propose a new in-frame retransmission scheme using µACKs. Instead of waiting for the entire transmission to end before sending the ACK, the receiver sends smaller µACKs for every few symbols, on a separate narrow feedback channel. Based on these µACKs, the sender only retransmits the lost symbols after the last data symbol in the frame, thereby adaptively changing the frame size to ensure it is successfully delivered. We have implemented µACK on the Sora platform. Experiments with our prototype validate the feasibility of symbol-level µACK. By significantly reducing the retransmission overhead, the sender is able to aggressively use a higher data rate for a lossy link. Both improve the overall network efficiency. Our experimental results from a controlled environment and a 9-node software radio testbed show that µACK can have up to 140% throughput gain over 802.11g and up to 60% gain over the best known retransmission scheme.

Contact Me