# Wenda Zhou

I am a researcher at OpenAI.

## Research

I am interested in various topics in high-dimensional statistics, compressed sensing and deep learning. More recently, I have also started exploring applications of machine learning in different contexts, such as data compression and chemistry.

### Papers

#### A code superoptimizer through neural Monte-Carlo tree search NeurIPS 2022 MLSys Workshop (spotlight)

##### W. Zhou, O. Solodova, R. P. Adams

There are many ways to turn a high-level program into a sequence of instructions consistent with that computation. Selecting the most performant such instruction sequence for a given piece of hardware - optimized compilation - is a central challenge of computer science. Optimizing compilers perform this task through a series of reductions and local transformations (e.g. register allocation, instruction scheduling, peephole optimization) driven by heuristics. A natural and well-explored avenue of research is to replace current hand-written heuristics by data-driven, automatically-designed heuristics which may be obtained from machine learning. We propose a radically different approach, in which we view compilation as a combinatorial optimization problem which consists of finding the optimal (e.g. fastest executing or shortest) sequence of instructions subject to the constraint that it has the semantics of the specified program. We show how this problem can be practically framed as a finite Markov decision process, unlocking a rich space of potential algorithms from reinforcement learning. We implement one such algorithm in particular, an AlphaGo-like distributed neural Monte-Carlo tree search procedure, and demonstrate that it is able to directly generate optimized assembly. Unlike a traditional optimizing compiler, this approach does not rely on an existing library of optimizations to transform the code, but rather directly attempts to generate the most optimal program instruction-by-instruction, taking into account effects including register allocation, instruction scheduling and operation fusion.
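As a rough illustration of the framing described above, the sketch below treats a partial instruction sequence as the state, appending one instruction as the action, and agreement with the target program as the terminal condition. Everything in it is an assumption made for illustration: the two-register machine, the instruction names, the use of test inputs as a stand-in for a real semantic equivalence check, and the exhaustive search, which replaces the neural Monte-Carlo tree search used in the paper.

```python
from itertools import product

# Hypothetical two-register toy machine; not the instruction set used in the paper.
REGS = ("r0", "r1")
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}
# Each action appends one instruction (op, dst, src), meaning dst = dst op src.
ACTIONS = [(op, d, s) for op in OPS for d in REGS for s in REGS]


def run(program, x):
    """Execute a program with r0 = x and r1 = 0, returning the final r0."""
    regs = {"r0": x, "r1": 0}
    for op, dst, src in program:
        regs[dst] = OPS[op](regs[dst], regs[src])
    return regs["r0"]


def is_goal(program, target, tests):
    """Terminal check: the program agrees with the target on every test input.

    Agreement on test inputs only approximates semantic equivalence.
    """
    return all(run(program, x) == target(x) for x in tests)


def shortest_program(target, tests, max_len=4):
    """Exhaustively search states (instruction sequences) in order of length."""
    for length in range(max_len + 1):
        for program in product(ACTIONS, repeat=length):
            if is_goal(program, target, tests):
                return list(program)
    return None


# Find a shortest program computing 4 * x on the toy machine.
prog = shortest_program(lambda x: 4 * x, tests=range(-3, 4))
print(prog)
```

Exhaustive search is only feasible here because the toy action set is tiny; the number of candidate sequences grows exponentially with length, which is the reason a learned search procedure is attractive.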

#### Compressed Sensing in the Presence of Speckle Noise IEEE Trans. I.T. (2022)

##### W. Zhou, S. Jalali, A. Maleki

Speckle or multiplicative noise is a critical issue in coherence-based imaging systems, such as synthetic aperture radar and optical coherence tomography. Existence of speckle noise considerably limits the applicability of such systems by degrading their performance. On the other hand, the sophistications that arise in the study of multiplicative noise have so far impeded theoretical analysis of such imaging systems. As a result, the current acquisition technology relies on heuristic solutions, such as oversampling the signal and converting the problem into a denoising problem with multiplicative noise. This paper attempts to bridge the gap between theory and practice by providing the first theoretical analysis of such systems. To achieve this goal the log-likelihood function corresponding to measurement systems with speckle noise is characterized. Then employing compression codes to model the source structure, for the case of under-sampled measurements, a compression-based maximum likelihood recovery method is proposed. The mean squared error (MSE) performance of the proposed method is characterized and is shown to scale as $O\left(\sqrt{\frac{k \log n}{m}}\right)$, where $k$, $m$ and $n$ denote the intrinsic dimension of the signal class according to the compression code, the number of observations, and the ambient dimension of the signal, respectively. This result, while in contrast to imaging systems with additive noise in which MSE scales as $O\left(\frac{k \log n}{m}\right)$, suggests that if the signal class is structured (i.e., $k \ll n$), accurate recovery of a signal from under-determined measurements is still feasible, even in the presence of speckle noise. Simulation results are presented that suggest image recovery under multiplicative noise is inherently more challenging than additive noise, and that the derived theoretical results are sharp.
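To make the gap between the two rates concrete, here is a short worked comparison. It ignores constants and reads both bounds as if they were tight, which is an assumption made only for illustration; $\varepsilon$ is a hypothetical target MSE that does not appear in the abstract.

$$
\text{additive: } \varepsilon \approx \frac{k \log n}{m} \;\Rightarrow\; m \approx \frac{k \log n}{\varepsilon},
\qquad
\text{speckle: } \varepsilon \approx \sqrt{\frac{k \log n}{m}} \;\Rightarrow\; m \approx \frac{k \log n}{\varepsilon^{2}}.
$$

Under this reading, halving the target error roughly doubles the number of measurements needed with additive noise, but quadruples it with speckle noise.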

#### Vitruvion: A Generative Model of Parametric CAD Sketches ICLR 2022

##### A. Seff, W. Zhou, N. Richardson, R. P. Adams

Parametric computer-aided design (CAD) tools are the predominant way that engineers specify physical structures, from bicycle pedals to airplanes to printed circuit boards. The key characteristic of parametric CAD is that design intent is encoded not only via geometric primitives, but also by parameterized constraints between the elements. This relational specification can be viewed as the construction of a constraint program, allowing edits to coherently propagate to other parts of the design. Machine learning offers the intriguing possibility of accelerating the design process via generative modeling of these structures, enabling new tools such as autocompletion, constraint inference, and conditional synthesis. In this work, we present such an approach to generative modeling of parametric CAD sketches, which constitute the basic computational building blocks of modern mechanical design. Our model, trained on real-world designs from the SketchGraphs dataset, autoregressively synthesizes sketches as sequences of primitives, with initial coordinates, and constraints that reference back to the sampled primitives. As samples from the model match the constraint graph representation used in standard CAD software, they may be directly imported, solved, and edited according to downstream design tasks. In addition, we condition the model on various contexts, including partial sketches (primers) and images of hand-drawn sketches. Evaluation of the proposed approach demonstrates its ability to synthesize realistic CAD sketches and its potential to aid the mechanical design workflow.

#### Autobahn: Automorphism-based Graph Neural Nets NeurIPS 2021

##### E. H. Thiede, W. Zhou, R. Kondor

We introduce Automorphism-based graph neural networks (Autobahn), a new family of graph neural networks. In an Autobahn, we decompose the graph into a collection of subgraphs and apply local convolutions that are equivariant to each subgraph's automorphism group. Specific choices of local neighborhoods and subgraphs recover existing architectures such as message passing neural networks. Our formalism also encompasses novel architectures: as an example, we introduce a graph neural network that decomposes the graph into paths and cycles. The resulting convolutions reflect the natural way that parts of the graph can transform, preserving the intuitive meaning of convolution without sacrificing global permutation equivariance. We validate our approach by applying Autobahn to molecular graphs, where it achieves state-of-the-art results.

#### SketchGraphs: a large-scale dataset for modeling relational geometry in computer-aided design arXiv pre-print

##### A. Seff, Y. Ovadia, W. Zhou, R. P. Adams

Parametric computer-aided design (CAD) is the dominant paradigm in mechanical engineering for physical design. Distinguished by relational geometry, parametric CAD models begin as two-dimensional sketches consisting of geometric primitives (e.g., line segments, arcs) and explicit constraints between them (e.g., coincidence, perpendicularity) that form the basis for three-dimensional construction operations. Training machine learning models to reason about and synthesize parametric CAD designs has the potential to reduce design time and enable new design workflows. Additionally, parametric CAD designs can be viewed as instances of constraint programming and they offer a well-scoped test bed for exploring ideas in program synthesis and induction. To facilitate this research, we introduce SketchGraphs, a collection of 15 million sketches extracted from real-world CAD models coupled with an open-source data processing pipeline. Each sketch is represented as a geometric constraint graph where edges denote designer-imposed geometric relationships between primitives, the nodes of the graph. We demonstrate and establish benchmarks for two use cases of the dataset: generative modeling of sketches and conditional generation of likely constraints given unconstrained geometry.

#### Asymptotics of Cross-Validation arXiv pre-print

##### M. Austern, W. Zhou

Cross validation is a central tool in evaluating the performance of machine learning and statistical models. However, despite its ubiquitous role, its theoretical properties are still not well understood. We study the asymptotic properties of the cross-validated risk for a large class of models. Under stability conditions, we establish a central limit theorem and Berry-Esseen bounds, which enable us to compute asymptotically accurate confidence intervals. Using our results, we paint a big picture for the statistical speed-up of cross validation compared to a train-test split procedure. A corollary of our results is that parametric M-estimators (or empirical risk minimizers) benefit from the "full" speed-up when performing cross-validation under the training loss. In other common cases, such as when the training is done using a surrogate loss or a regularizer, we show that the behavior of the cross-validated risk is complex with a variance reduction which may be smaller or larger than the "full" speed-up, depending on the model and the underlying distribution. We allow the number of folds to grow with the number of observations at any rate.

#### Error Bounds in Estimating the Out-of-sample Prediction Error Using Leave-one-out Cross-Validation in High Dimensions AISTATS 2020

##### K.R. Rad, W. Zhou, A. Maleki

We study the problem of out-of-sample risk estimation in the high dimensional regime where both the sample size $n$ and number of features $p$ are large, and $n/p$ can be less than one. Extensive empirical evidence confirms the accuracy of leave-one-out cross validation (LO) for out-of-sample risk estimation. Yet, a unifying theoretical evaluation of the accuracy of LO in high-dimensional problems has remained an open problem. This paper aims to fill this gap for penalized regression in the generalized linear family. With minor assumptions about the data generating process, and without any sparsity assumptions on the regression coefficients, our theoretical analysis obtains finite sample upper bounds on the expected squared error of LO in estimating the out-of-sample error. Our bounds show that the error goes to zero as $n,p\to \infty $, even when the dimension $p$ of the feature vectors is comparable with or greater than the sample size $n$. One technical advantage of the theory is that it can be used to clarify and connect some results from the recent literature on scalable approximate LO.

#### Discrete Object Generation with Reversible Inductive Construction NeurIPS 2019

##### A. Seff, W. Zhou, F. Damani, A. Doyle, R. P. Adams

The success of generative modeling in continuous domains has led to a surge of interest in generating discrete data such as molecules, source code, and graphs. However, construction histories for these discrete objects are typically not unique and so generative models must reason about intractably large spaces in order to learn. Additionally, structured discrete domains are often characterized by strict constraints on what constitutes a valid object and generative models must respect these requirements in order to produce useful novel samples. Here, we present a generative model for discrete objects employing a Markov chain where transitions are restricted to a set of local operations that preserve validity. Building off of generative interpretations of denoising autoencoders, the Markov chain alternates between producing 1) a sequence of corrupted objects that are valid but not from the data distribution, and 2) a learned reconstruction distribution that attempts to fix the corruptions while also preserving validity. This approach constrains the generative model to only produce valid objects, requires the learner to only discover local modifications to the objects, and avoids marginalization over an unknown and potentially large space of construction histories. We evaluate the proposed approach on two highly structured discrete domains, molecules and Laman graphs, and find that it compares favorably to alternative methods at capturing distributional statistics for a host of semantically relevant metrics.

#### Denoising Structured Random Processes arXiv e-print

##### W. Zhou, S. Jalali

Denoising a stationary process ${\left({X}_{i}\right)}_{i\in \mathbb{Z}}$ corrupted by additive white Gaussian noise is a classic and fundamental problem in information theory and statistical signal processing. Despite considerable progress in designing efficient denoising algorithms, for general analog sources, theoretically-founded computationally-efficient methods are yet to be found. For instance in denoising ${X}_{n}$ corrupted by noise ${Z}_{n}$ as ${Y}_{n}={X}_{n}+{Z}_{n}$, given the full distribution of ${X}_{n}$, a minimum mean square error (MMSE) denoiser needs to compute $\mathbb{E}[{X}_{n}\mid {Y}_{n}]$. However, for general sources, computing $\mathbb{E}[{X}_{n}\mid {Y}_{n}]$ is computationally very challenging, if not infeasible. In this paper, starting by a Bayesian setup, where the source distribution is fully known, a novel denoising method, namely, quantized maximum a posteriori (Q-MAP) denoiser, is proposed and its asymptotic performance in the high signal to noise ratio regime is analyzed. Both for memoryless sources, and for structured first-order Markov sources, it is shown that, asymptotically, as σ converges to zero, $\frac{1}{{\sigma}^{2}}\mathbb{E}\left[{({X}_{i}-{X}^{\text{Q-MAP}})}^{2}\right]$ achieved by Q-MAP denoiser converges to the information dimension of the source. For the studied memoryless sources, this limit is known to be optimal. A key advantage of the Q-MAP denoiser is that, unlike an MMSE denoiser, it highlights the key properties of the source distribution that are to be used in its denoising. This property dramatically reduces the computational complexity of approximating the solution of the Q-MAP denoiser. Additionally, it naturally leads to a learning-based denoiser. Using ImageNet database for training, initial simulation results exploring the performance of such a learning-based denoiser in image denoising are presented.

#### Non-Vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach ICLR 2019

##### W. Zhou, V. Veitch, M. Austern, R. P. Adams, P. Orbanz

Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be compressed to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size that, combined with off-the-shelf compression algorithms, leads to state-of-the-art generalization guarantees. In particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. Additionally, we show that compressibility of models that tend to overfit is limited. Empirical results show that an increase in overfitting increases the number of bits required to describe a trained network.

#### Empirical Risk Minimization and Stochastic Gradient Descent for Relational Data AISTATS 2019

##### V. Veitch, M. Austern, W. Zhou, P. Orbanz, D. Blei

Empirical risk minimization is the principal tool for prediction problems, but its extension to relational data remains unsolved. We solve this problem using recent advances in graph sampling theory. We (i) define an empirical risk for relational data and (ii) obtain stochastic gradients for this risk that are automatically unbiased. The key ingredient is to consider the method by which data is sampled from a graph as an explicit component of model design. Theoretical results establish that the choice of sampling scheme is critical. By integrating fast implementations of graph sampling schemes with standard automatic differentiation tools, we are able to solve the risk minimization in a plug-and-play fashion even on large datasets. We demonstrate empirically that relational ERM models achieve state-of-the-art results on semi-supervised node classification tasks. The experiments also confirm the importance of the choice of sampling scheme.

#### Approximate Leave-One-Out for Fast Parameter Tuning in High Dimensions ICML 2018

##### S. Wang$^{*}$, W. Zhou$^{*}$, H. Lu, A. Maleki, V. Mirrokni

Consider the following class of learning schemes:

$$\widehat{\beta} = \arg\min_{\beta \in \mathcal{C}} \sum_{j=1}^{n} \ell(x_j^\top \beta; y_j) + \lambda R(\beta)$$

where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$ denote the $i$th feature and response variable respectively. Let ℓ and R be the convex loss function and regularizer, β denote the unknown weights, and λ be a regularization parameter. $\mathcal{C} \subset \mathbb{R}^p$ is a closed convex set. Finding the optimal choice of λ is a challenging problem in high-dimensional regimes where both n and p are large. We propose three frameworks to obtain a computationally efficient approximation of the leave-one-out cross validation (LOOCV) risk for nonsmooth losses and regularizers. Our three frameworks are based on the primal, dual, and proximal formulations of the above problem. Each framework shows its strength in certain types of problems. We prove the equivalence of the three approaches under smoothness conditions. This equivalence enables us to justify the accuracy of the three methods under such conditions. We use our approaches to obtain a risk estimate for several standard problems, including generalized LASSO, nuclear norm regularization, and support vector machines. We empirically demonstrate the effectiveness of our results for non-differentiable cases.
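For intuition about why leave-one-out risk can be obtained without refitting $n$ times, the sketch below uses the classical ridge regression special case of the scheme above (squared loss, $R(\beta) = \|\beta\|_2^2$, no constraint set), where the leave-one-out residual is known in closed form from a single fit. This is not one of the three frameworks from the paper, which target nonsmooth losses and regularizers; the function names and the simulated data are assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Solve min_beta ||y - X beta||^2 + lam * ||beta||^2 in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def loo_brute_force(X, y, lam):
    """Refit n times, each time leaving one observation out."""
    n = X.shape[0]
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = ridge_fit(X[keep], y[keep], lam)
        errors[i] = (y[i] - X[i] @ beta_i) ** 2
    return errors.mean()


def loo_shortcut(X, y, lam):
    """Single fit: leave-one-out residual = residual / (1 - H_ii)."""
    p = X.shape[1]
    # H is the ridge "hat" matrix mapping y to the fitted values.
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    residuals = y - H @ y
    return np.mean((residuals / (1.0 - np.diag(H))) ** 2)


# Simulated data with p > n, i.e. the high-dimensional regime.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 60))
y = X @ rng.standard_normal(60) + rng.standard_normal(40)

brute = loo_brute_force(X, y, lam=1.0)
fast = loo_shortcut(X, y, lam=1.0)
print(brute, fast)
```

For ridge regression the two estimates agree up to floating-point error, while the shortcut costs one fit instead of $n$. For nonsmooth problems no such exact identity is available in general, which is where approximations become necessary.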

A version of this paper was the runner-up for the best paper award at the DMDA Workshop, INFORMS 2019.

#### Analysis of Genotype by Methylation Interactions through Sparsity-Inducing Regularized Regression BMC Proceedings 2018 (GAW 20)

##### W. Zhou, S. Lo

We consider the use of the least absolute shrinkage and selection operator (LASSO)-type regression techniques to detect important genetic or epigenetic loci in genome-wide association studies (GWAS) and epigenome-wide association studies (EWAS). We demonstrate how these techniques can be adapted to provide quantifiable uncertainty using stability selection, including explicit control of the family-wise error rate. We also consider variants of the LASSO, such as the group LASSO, to study genetic and epigenetic interactions. We use these techniques to reproduce some existing results on the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) data set, which collects from 991 individuals blood triglyceride and differential methylation at 464,000 cytosine-phosphate-guanine (CpG) sites and 761,000 single-nucleotide polymorphisms (SNPs), and to identify new research directions. Epigenome-wide and genome-wide models based on the LASSO are considered, as well as an interaction model limited to chromosome 11. The analyses replicate findings concerning 2 CpGs in carnitine palmitoyltransferase 1A (CPT1A). Some suggestions are made regarding potentially interesting directions for the analysis of genetic and epigenetic interactions.

### Service

I have served as a reviewer for NeurIPS, ICML, JMLR, ICLR, ISIT, AAAI, Annals of Statistics, IEEE IT.

## Education

### Columbia University

### Cambridge University

### Lycée International de St Germain en Laye

*très bien*.

## Teaching

### Introduction to Statistics with Calculus Summer 2017

#### S1201

This is Columbia's introductory statistics class. All the material for this class (including full notes and R notebook demonstrations) is available on GitHub.