Publications

Scaling Polyhedral Neural Network Verification on GPUs

François Serre, Christoph Müller, Gagandeep Singh, Markus Püschel and Martin Vechev
MLSYS'21 | PDF

Certifying the robustness of neural networks against adversarial attacks is critical to their reliable adoption in real-world systems including autonomous driving and medical diagnosis. Unfortunately, state-of-the-art verifiers either do not scale to larger networks or are too imprecise to prove robustness, which limits their practical adoption. In this work, we introduce GPUPoly, a scalable verifier that can prove the robustness of significantly larger deep neural networks than possible with prior work. The key insight behind GPUPoly is the design of custom, sound polyhedra algorithms for neural network verification on a GPU. Our algorithms leverage the available GPU parallelism and the inherent sparsity of the underlying neural network verification task. GPUPoly scales to very large networks: for example, it can prove the robustness of a 1M neuron, 34-layer deep residual network in about 1 minute. We believe GPUPoly is a promising step towards the practical verification of large real-world networks.

Optimal Streaming Permutations and Transforms: Theory and Implementation

François Serre
PhD Thesis | PDF

Many algorithms used in hardware applications across disciplines, such as the fast Fourier transform, the Walsh-Hadamard transform or sorting networks share a common structure. They consist of stages of parallel and identical processing elements that each operates on two inputs with different data permutations in between. The symmetry in this structure enables folding to an area-efficient in a streaming architecture that accepts the input over several cycles. However, necessary for folding are streaming permutation circuits that require memory and routing components. In this dissertation, we focus on the optimal implementation of these algorithms. We provide lower bounds for the implementation of streaming permutations, and propose algorithms that produce solutions that match it (under certain assumptions). We apply these algorithms in the context of a generator capable of implementing the hardware applications previously mentioned. Finally, for a given problem, we address the question of finding an optimal algorithm, i.e., an algorithm which streaming implementations are the cheapest to implement.

DSL-Based Hardware Generation with Scala: Example Fast Fourier Transforms and Sorting Networks

François Serre and Markus Püschel
TRETS'19 | PDF

We present a hardware generator for computations with regular structure including the fast Fourier transform (FFT), sorting networks, and others. The input of the generator is a high-level description of the algorithm; the output is a token-based, synchronized design in the form of RTL-Verilog. Building on prior work, the generator uses several layers of domain-specific languages (DSLs) to represent and optimize at different levels of abstraction to produce a RAM- and area-efficient hardware implementation. Two of these layers and DSLs are novel. The first one allows the use and domain-specific optimization of state-of-the-art streaming permutations. The second DSL enables the automatic pipelining of a streaming hardware dataflow and the synchronization of its data-independent control signals. The generator including the DSLs are implemented in Scala, leveraging its type system, and uses concepts from lightweight modular staging (LMS) to handle the constraints of streaming hardware. Particularly, these concepts offer genericity over hardware number representation, including seamless switching between fixed-point arithmetic and FloPoCo generated IEEE floating-point operators, while ensuring type-safety. We show benchmarks of generated FFTs, sorting networks and Walsh-Hadamard transforms that outperform prior generators.

DSL-Based Modular IP Core Generators: Example FFT and Related Structures

François Serre and Markus Püschel
ARITH'19 | PDF

We present a hardware generator for signal processing algorithms that consist of a network of small processing elements, including the fast Fourier transform and sorting networks. The generator is implemented in Scala and uses a principled design that leverages modern language features to generate an entire design space of hardware implementations. Examples include the use of embedded domain-specific languages and staging to represent and optimize the designs at different levels of abstraction, and the use of Scala's type system to efficiently encode different degrees of hardware reuse and arithmetic formats.

In Search of the Optimal Walsh-hadamard Transform for Streamed Parallel Processing

François Serre and Markus Püschel
ICASSP'19 | PDF

The Walsh-Hadamard transform (WHT) is computed using a network of butterflies, similar to the fast Fourier transform. The network is not unique but can be modified in exponentially many ways by properly changing the permutations between butterfly stages. Our first contribution is the exact char-acterization of all possible WHT networks. Then we aim to find the optimal networks for streaming implementations. In such an implementation the input is fed in chunks over several cycles and the hardware cost is thus reduced in proportion. To find the optimal network we smartly search through all possibilities for small sizes and discover novel networks that are thus proven optimal. The results can be used to extrapolate the optimal hardware cost for all sizes but the associated algorithms still remain elusive.

A DSL-Based FFT Hardware Generator in Scala

François Serre and Markus Püschel
FPL'18 | PDF

We present a generator for fast Fourier transforms (FFTs) on hardware. The input of the generator is a high-level description of an FFT algorithm; the output is a token-based, synchronized design in the form of RTL-Verilog. Building on prior work, the generator uses several layers of domain-specific languages (DSLs) to represent and optimize at different levels of abstraction to produce a RAM-and area-efficient hardware implementation. Two of these layers and DSLs are novel. The first one allows the use and domain-specific optimization of state-of-the-art streaming permutations. The second DSL enables the automatic pipelining of a streaming hardware dataflow and the synchronization of its data-independent control signals. The generator including the DSLs are implemented in Scala, leveraging its type system, and uses concepts from lightweight modular staging (LMS) to handle the constraints of streaming hardware. Particularly, these concepts offer genericity over hardware number representation, including seamlessly switching between fixed-point arithmetic and FloPoCo generated IEEE floating-point operators, while ensuring type-safety. We show benchmarks of generated FFTs that outperform prior FFT generators.

Memory-Efficient Fast Fourier Transform on Streaming Data by Fusing Permutations

François Serre and Markus Püschel
FPGA'18 | PDF

We propose a novel FFT datapath that reduces the memory requirement compared to state-of-the-art RAM-based implementations by up to a factor of two. The novelty is in a technique to fuse the datapaths for the required perfect shuffle and bit reversal and is applicable to an entire design space of FFT implementations with varying degrees of reuse and number of input ports. We implemented a tool to generate this FFT design space for a given input size and to benchmark against prior work. The results show a reduction of half the RAM banks and/or half the logic complexity used for the permutations. The technique for fusing permutations is more generally applicable beyond the FFT.

Characterizing and Enumerating Walsh-Hadamard Transform Algorithms

François Serre and Markus Püschel
CoRR'17 | PDF

We propose a way of characterizing the algorithms computing a Walsh-Hadamard transform that consist of a sequence of arrays of butterflies ($\I_{2^{n-1}} \otimes \text{DFT}_2$) interleaved by linear permutations. Linear permutations are those that map linearly the binary representation of its element indices. We also propose a method to enumerate these algorithms.

Optimal Streamed Linear Permutations

François Serre and Markus Püschel
ARITH'17 | PDF

We give an overview on optimal circuits to implement linear permutations on FPGAs using only RAM banks and switches. Linear means that the permutation maps linearly the bit representation of the indices, as it is the case with most permutations arising in digital signal processing algorithms including those in fast Fourier transforms, Viterbi decoders, and sorting networks. Additionally, we assume that the data to be permuted is streamed, i.e., input in chunks over several cycles. The circuits are obtained from a suitable factorization of the bit matrix representing the permutation and achieve the minimal number of switches possible.

Generalizing Block LU factorization: A lower–upper–lower block triangular decomposition with minimal off-diagonal ranks

François Serre and Markus Püschel
LAA'16 | PDF

We propose a novel factorization of a non-singular matrix P, viewed as a 2×2-blocked matrix. The factorization decomposes P into a product of three matrices that are lower block-unitriangular, upper block-triangular, and lower block-unitriangular, respectively. Our goal is to make this factorization “as block-diagonal as possible” by minimizing the ranks of the off-diagonal blocks. We give lower bounds on these ranks and show that they are sharp by providing an algorithm that computes an optimal solution. The proposed decomposition can be viewed as a generalization of the well-known Block LU factorization using the Schur complement. Finally, we briefly explain one application of this factorization: the design of optimal circuits for a certain class of streaming permutations.

Optimal Circuits for Streamed Linear Permutations using RAM

François Serre, Thomas Holenstein and Markus Püschel
FPGA'16 | PDF

We propose a method to automatically derive hardware structures that perform a fixed linear permutation on streaming data. Linear permutations are permutations that map linearly the bit representation of the elements addresses. This set contains many of the most important permutations in media processing, communication, and other applications and includes perfect shuffles, stride permutations, and the bit reversal. Streaming means that the data to be permuted arrive as a sequence of chunks over several cycles. We solve this problem by mathematically decomposing a given permutation into a sequence of three permutations that are either temporal or spatial. The former are implemented as banks of RAM, the latter as switching networks. We prove optimality of our solution in terms of the number of switches in these networks.

Automatic Generation of Hardware Designs for Matrix-Matrix Multiplication

François Serre
Master Thesis | PDF

Matrix-Matrix Multiplication (MMM) is a key computational kernel in scientific and engineering applications. Therefore, different implementations of this operation have been designed for FPGAs. However, it is hard to find, given a particular set of constraints (maximal number of slices, minimum frequency), the most appropriate design. This thesis proposes the Operator Language for Schedules (OLS) to describe hardware designs that can perform MMM on arbitrarily-sized matrices. Algorithmic and implementation strategies such as blocking and reuse are represented as a set of rewriting rules that are recursively applied to OLS expressions. Finally, a compiler was built to translate a final OLS expression into Verilog code. The different rewriting-rules that can be recursively applied, given the size requirements for MMM, give rise to large design space. Every design alternative, for a subset of matrix sizes, was synthesized and routed for our target FPGA platform and precise cost and performance metrics were stored in a database. Designers can later visualize the alternatives in our database and choose the design that suits the goals and constraints of the target system.