# Wenda Zhou

I am a post-doctoral researcher at the NYU CDS and the Flatiron Institute CCM. I am interested in high-dimensional statistics, compressed sensing, and deep learning.

## Research

I am interested in various topics in high-dimensional statistics, compressed sensing and deep learning. More recently, I have also started exploring applications of machine learning in different contexts, such as data compression and chemistry.

### Papers

#### SketchGraphs: a large-scale dataset for modeling relational geometry in computer-aided design Arxiv pre-print

##### A. Seff, Y. Ovadia, W. Zhou, R. P. Adams

Parametric computer-aided design (CAD) is the dominant paradigm in mechanical engineering for physical design. Distinguished by relational geometry, parametric CAD models begin as two-dimensional sketches consisting of geometric primitives (e.g., line segments, arcs) and explicit constraints between them (e.g., coincidence, perpendicularity) that form the basis for three-dimensional construction operations. Training machine learning models to reason about and synthesize parametric CAD designs has the potential to reduce design time and enable new design workflows. Additionally, parametric CAD designs can be viewed as instances of constraint programming and they offer a well-scoped test bed for exploring ideas in program synthesis and induction. To facilitate this research, we introduce SketchGraphs, a collection of 15 million sketches extracted from real-world CAD models coupled with an open-source data processing pipeline. Each sketch is represented as a geometric constraint graph where edges denote designer-imposed geometric relationships between primitives, the nodes of the graph. We demonstrate and establish benchmarks for two use cases of the dataset: generative modeling of sketches and conditional generation of likely constraints given unconstrained geometry.

#### Asymptotics of Cross-Validation Arxiv pre-print

##### M. Austern, W. Zhou

Cross validation is a central tool in evaluating the performance of machine learning and statistical models. However, despite its ubiquitous role, its theoretical properties are still not well understood. We study the asymptotic properties of the cross validated-risk for a large class of models. Under stability conditions, we establish a central limit theorem and Berry-Esseen bounds, which enable us to compute asymptotically accurate confidence intervals. Using our results, we paint a big picture for the statistical speed-up of cross validation compared to a train-test split procedure. A corollary of our results is that parametric M-estimators (or empirical risk minimizers) benefit from the "full" speed-up when performing cross-validation under the training loss. In other common cases, such as when the training is done using a surrogate loss or a regularizer, we show that the behavior of the cross-validated risk is complex with a variance reduction which may be smaller or larger than the "full" speed-up, depending on the model and the underlying distribution. We allow the number of folds to grow with the number of observations at any rate.

#### Error Bounds in Estimating the Out-of-sample Prediction Error Using Leave-one-out Cross-Validation in High Dimensions AISTATS 2020

##### K.R. Rad, W. Zhou, A. Maleki

We study the problem of out-of-sample risk estimation in the high dimensional regime where both the sample size $n$ and number of features $p$ are large, and $n/p$ can be less than one. Extensive empirical evidence confirms the accuracy of leave-one-out cross validation (LO) for out-of-sample risk estimation. Yet, a unifying theoretical evaluation of the accuracy of LO in high-dimensional problems has remained an open problem. This paper aims to fill this gap for penalized regression in the generalized linear family. With minor assumptions about the data generating process, and without any sparsity assumptions on the regression coefficients, our theoretical analysis obtains finite sample upper bounds on the expected squared error of LO in estimating the out-of-sample error. Our bounds show that the error goes to zero as $n,p\to \infty $, even when the dimension $p$ of the feature vectors is comparable with or greater than the sample size $n$. One technical advantage of the theory is that it can be used to clarify and connect some results from the recent literature on scalable approximate LO.

#### Discrete Object Generation with Reversible Inductive Construction NeurIPS 2019

##### A. Seff, W. Zhou, F. Damani, A. Doyle, R. P. Adams

The success of generative modeling in continuous domains has led to a surge of interest in generating discrete data such as molecules, source code, and graphs. However, construction histories for these discrete objects are typically not unique and so generative models must reason about intractably large spaces in order to learn. Additionally, structured discrete domains are often characterized by strict constraints on what constitutes a valid object and generative models must respect these requirements in order to produce useful novel samples. Here, we present a generative model for discrete objects employing a Markov chain where transitions are restricted to a set of local operations that preserve validity. Building off of generative interpretations of denoising autoencoders, the Markov chain alternates between producing 1) a sequence of corrupted objects that are valid but not from the data distribution, and 2) a learned reconstruction distribution that attempts to fix the corruptions while also preserving validity. This approach constrains the generative model to only produce valid objects, requires the learner to only discover local modifications to the objects, and avoids marginalization over an unknown and potentially large space of construction histories. We evaluate the proposed approach on two highly structured discrete domains, molecules and Laman graphs, and find that it compares favorably to alternative methods at capturing distributional statistics for a host of semantically relevant metrics.

#### Denoising Structured Random Processes ArXiv e-print

##### W. Zhou, S. Jalali

Denoising a stationary process ${\left({X}_{i}\right)}_{i\in \mathbb{Z}}$ corrupted by additive white Gaussian noise is a classic and fundamental problem in information theory and statistical signal processing. Despite considerable progress in designing efficient denoising algorithms, for general analog sources, theoretically-founded computationally-efficient methods are yet to be found. For instance in denoising ${X}_{n}$ corrupted by noise ${Z}_{n}$ as ${Y}_{n}={X}_{n}+{Z}_{n}$, given the full distribution of ${X}_{n}$, a minimum mean square error (MMSE) denoiser needs to compute $\mathbb{E}[{X}_{n}\mid {Y}_{n}]$. However, for general sources, computing $\mathbb{E}[{X}_{n}\mid {Y}_{n}]$ is computationally very challenging, if not infeasible. In this paper, starting by a Bayesian setup, where the source distribution is fully known, a novel denoising method, namely, quantized maximum a posteriori (Q-MAP) denoiser, is proposed and its asymptotic performance in the high signal to noise ratio regime is analyzed. Both for memoryless sources, and for structured first-order Markov sources, it is shown that, asymptotically, as σ converges to zero, $\frac{1}{{\sigma}^{2}}\mathbb{E}\left[{({X}_{i}-{X}^{\text{Q-MAP}})}^{2}\right]$ achieved by Q-MAP denoiser converges to the information dimension of the source. For the studied memoryless sources, this limit is known to be optimal. A key advantage of the Q-MAP denoiser is that, unlike an MMSE denoiser, it highlights the key properties of the source distribution that are to be used in its denoising. This property dramatically reduces the computational complexity of approximating the solution of the Q-MAP denoiser. Additionally, it naturally leads to a learning-based denoiser. Using ImageNet database for training, initial simulation results exploring the performance of such a learning-based denoiser in image denoising are presented.

#### Non-Vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach ICLR 2019

##### W. Zhou, V. Veitch, M. Austern, R. P. Adams, P. Orbanz

Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be

compressed

to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size that, combined with off-the-shelf compression algorithms, leads to state-of-the-art generalization guarantees. In particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. Additionally, we show that compressibility of models that tend to overfit is limited. Empirical results show that an increase in overfitting increases the number of bits required to describe a trained network.#### Empirical Risk Minimization and Stochastic Gradient Descent for Relational Data AISTATS 2019

##### V. Veitch, M. Austern, W. Zhou, P. Orbanz, D. Blei

Empirical risk minimization is the principal tool for prediction problems, but its extension to relational data remains unsolved. We solve this problem using recent advances in graph sampling theory. We (i) define an empirical risk for relational data and (ii) obtain stochastic gradients for this risk that are automatically unbiased. The key ingredient is to consider the method by which data is sampled from a graph as an explicit component of model design. Theoretical results establish that the choice of sampling scheme is critical. By integrating fast implementations of graph sampling schemes with standard automatic differentiation tools, we are able to solve the risk minimization in a plug-and-play fashion even on large datasets. We demonstrate empirically that relational ERM models achieve state-of-the-art results on semi-supervised node classification tasks. The experiments also confirm the importance of the choice of sampling scheme.

#### Approximate Leave-One-Out for Fast Parameter Tuning in High Dimensions ICML 2018

##### S. Wang

^{*}, W. Zhou^{*}, H. Lu, A. Maleki, V. MirrokniConsider the following class of learning schemes: $$\widehat{\beta}={\displaystyle {\mathrm{arg}\phantom{\rule{0.167em}{0ex}}\mathrm{min}}_{\beta \in \mathcal{C}}}{\displaystyle \sum _{j=1}^{n}}\ell ({x}_{j}^{\top}\beta ;{y}_{j})+\lambda R\left(\beta \right)$$ where ${x}_{i}\in {\mathbb{R}}^{p}$ and ${y}_{i}\in \mathbb{R}$ denote the

`i`th feature and response variable respectively. Let ℓ and R be the convex loss function and regularizer, β denote the unknown weights, and λ be a regularization parameter. $\mathcal{C}\subset {\mathbb{R}}^{p}$ is a closed convex set. Finding the optimal choice of λ is a challenging problem in high-dimensional regimes where both n and p are large. We propose three frameworks to obtain a computationally efficient approximation of the leave-one-out cross validation (LOOCV) risk for nonsmooth losses and regularizers. Our three frameworks are based on the primal, dual, and proximal formulations of the above problem. Each framework shows its strength in certain types of problems. We prove the equivalence of the three approaches under smoothness conditions. This equivalence enables us to justify the accuracy of the three methods under such conditions. We use our approaches to obtain a risk estimate for several standard problems, including generalized LASSO, nuclear norm regularization, and support vector machines. We empirically demonstrate the effectiveness of our results for non-differentiable cases.

A version of this paper received the runner up for best paper award at the DMDA Workshop INFORMS 2019.#### Analysis of Genotype by Methylation Interactions through Sparsity-Inducing Regularized Regression BMC Proceedings 2018 (GAW 20)

##### W. Zhou, S. Lo

We consider the use of the least absolute shrinkage and selection operator (LASSO)-type regression techniques to detect important genetic or epigenetic loci in genome-wide association studies (GWAS) and epigenome-wide association studies (EWAS). We demonstrate how these techniques can be adapted to provide quantifiable uncertainty using stability selection, including explicit control of the family-wise error rate. We also consider variants of the LASSO, such as the group LASSO, to study genetic and epigenetic interactions. We use these techniques to reproduce some existing results on the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) data set, which collects from 991 individuals blood triglyceride and differential methylation at 464,000 cytosine-phosphate-guanine (CpG) sites and 761,000 single-nucleotide polymorphisms (SNPs), and to identify new research directions. Epigenome-wide and genome-wide models based on the LASSO are considered, as well as an interaction model limited to chromosome 11. The analyses replicate findings concerning 2 CpGs in carnitine palmitoyltransferase 1A (CPT1A). Some suggestions are made regarding potentially interesting directions for the analysis of genetic and epigenetic interactions.

### Service

I have served as a reviewer for NeurIPS, ICML, JMLR, ICLR, ISIT, AAAI, Annals of Statistics.## Education

### Columbia University

### Cambridge University

### Lycée International de St Germain en Laye

*très bien*.

## Teaching

### Introduction to Statistics with Calculus Summer 2017

#### S1201

This is Columbia's introductory statistics class. All the material for this class (including full notes and R notebook demonstrations) are available on the github.