Title: Intrepydd: A Tensor Operator Compiler for Python

 

Date: Tuesday, May 2nd

 

Time: 2:00pm - 4:00pm ET

 

Location: 1116E Klaus, Virtual: https://gatech.zoom.us/my/tzhou80?pwd=b3EzUGRiWGY5Y2VkamZtdVJ2WTh5UT09

 

Tong Zhou

 

Ph.D. Computer Science

 

School of Computer Science

 

Georgia Institute of Technology

 

Committee:

 

Dr. Vivek Sarkar (Advisor) - School of Computer Science, Georgia Institute of Technology

 

Dr. Jun Shirako - School of Computer Science, Georgia Institute of Technology

 

Dr. Santosh Pande - School of Computer Science, Georgia Institute of Technology

 

Dr. Richard Vuduc - School of Computational Science and Engineering, Georgia Institute of Technology

 

Abstract:

 

Kernels in machine learning, data science, and scientific computing applications are often expressed as a sequence of matrix operations. These programs typically consist of a combination of element-wise (point-wise) operators, such as addition and multiplication; reduction operators, such as summation; and matrix multiplications (matmuls). Moreover, dimension broadcasting is often required, e.g., when a 2D matrix is added to a 1D vector. A common approach to implementing such kernels is to create a separate hand-optimized library implementation for each operator. Such hand-coded implementations are often backed by vendor libraries, such as cuBLAS and Intel MKL, to achieve high performance on the targeted hardware. One downside of this approach is that when a library function is called for each operator, intermediate results must be stored and then loaded again by the next operator. This overhead becomes increasingly significant for operators with low arithmetic intensity, such as point-wise and reduction operators, and for matmuls on tall-and-skinny matrices, which also exhibit low arithmetic intensity.
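As an illustration (a minimal NumPy sketch of our own, not code from the proposal; the kernel, its function name, and the shapes are hypothetical), each operator below maps to a separate library call, and every call materializes a full intermediate array that the next call must read back from memory:

    import numpy as np

    def kernel_unfused(A, B, b):
        # Each library call writes a full-size intermediate to memory.
        C = A @ B             # matmul (tall-and-skinny result)
        D = C + b             # element-wise add, broadcasting a 1D vector over a 2D matrix
        E = np.exp(D)         # element-wise exponential
        return E.sum(axis=1)  # reduction over each row

    # Hypothetical shapes: a tall-and-skinny operand keeps arithmetic intensity low.
    A = np.random.rand(1_000_000, 16)
    B = np.random.rand(16, 16)
    b = np.random.rand(16)
    out = kernel_unfused(A, B, b)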

 

To enhance overall performance in these cases, a series of operators can be computed in a fused fashion (known as operator fusion, a concept akin to loop fusion) so as to reduce or eliminate loads and stores of intermediate results. However, such fused multi-operator kernels must either be created manually by performance experts or generated automatically by a compiler. While performance experts can manually create fused kernels for common patterns on specific hardware platforms, it is impractical for them to anticipate and cover all desired combinations of operator sequences and target platforms. A compiler-based approach that automatically generates fused kernels is therefore more viable and cost-effective. However, past compiler-based efforts have either focused on a few hand-picked, application-specific combinations of operators or were too general to generate performance-competitive fused kernels.
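Continuing the hypothetical example above, a hand-fused version of the same kernel (again only a sketch, not Intrepydd output) evaluates the matmul, broadcasted add, exponential, and row reduction tile by tile, so only a small tile-sized intermediate ever exists:

    import numpy as np

    def kernel_fused(A, B, b, tile=1024):
        # Fused, tiled evaluation: intermediates exist only at tile granularity,
        # replacing the full-size temporaries of the unfused version.
        m = A.shape[0]
        out = np.empty(m)
        for i in range(0, m, tile):
            blk = A[i:i + tile] @ B                        # small matmul tile
            out[i:i + tile] = np.exp(blk + b).sum(axis=1)  # fused add + exp + row-sum on the tile
        return out

Writing such variants by hand for every operator sequence, tile choice, and target platform is exactly the effort a compiler-based approach aims to remove.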

 

In this work, we advance the state of the art in compiler-based operator fusion in several ways. 1) We introduce an intermediate representation, the tiled index tree, on which transformations such as fusion, loop interchange, and tiling can be easily applied. 2) We broaden the fusion scope to fuse arbitrary sequences of element-wise, reduction, and matmul operators through automatic tiled index tree transformations. 3) We leverage operator-level information and propose dependence-breaking transformations that fuse otherwise unfusable operator patterns, enabling single-pass, incremental fused computation of these patterns. 4) We extend the fusion scope to sparse operators, identify four common types of redundancy that can appear when fusing sparse and dense operators, and introduce techniques to eliminate them. 5) We handle the presence of explicit loops and propose techniques such as loop-invariant code motion and memory reuse to further improve performance. 6) We implement these techniques in a prototype Python compiler framework, Intrepydd, and evaluate them on both CPU and GPU using a set of representative tensor operator sequences/programs with representative input shapes. We currently have preliminary results, with many performance measurements still to collect; in the proposed work, we plan to complete the implementation and evaluation.
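As one concrete illustration of the dependence-breaking idea in item 3 (a well-known single-pass formulation written here for illustration, not necessarily the transformation Intrepydd applies), the maximum and the sum of exponentials needed by softmax, which naively require two dependent reduction passes, can be computed incrementally in one pass by rescaling the running sum whenever a larger maximum is encountered:

    import numpy as np

    def softmax_online(x):
        # Running maximum m and running sum s of exp(x_j - m) are maintained together;
        # s is rescaled whenever a new maximum appears, breaking the dependence on a
        # separate max-reduction pass.
        m, s = -np.inf, 0.0
        for v in x:
            m_new = max(m, v)
            s = s * np.exp(m - m_new) + np.exp(v - m_new)
            m = m_new
        return np.exp(x - m) / s

    print(softmax_online(np.array([1.0, 2.0, 3.0])))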