Department of Statistics · Columbia University · New York · [email protected]

I am a Ph.D. student in statistics at Columbia University. I am interested in high-dimensional statistics, compressed sensing, and deep learning.

I am interested in various topics in high-dimensional statistics, compressed sensing and deep learning. More recently, I have also started exploring applications of machine learning in different contexts, such as data compression and chemistry.

#### Discrete Object Generation with Reversible Inductive Construction ArXiv e-print

##### A. Seff, W. Zhou, F. Damani, A. Doyle, R. P. Adams

The success of generative modeling in continuous domains has led to a surge of interest in generating discrete data such as molecules, source code, and graphs. However, construction histories for these discrete objects are typically not unique and so generative models must reason about intractably large spaces in order to learn. Additionally, structured discrete domains are often characterized by strict constraints on what constitutes a valid object and generative models must respect these requirements in order to produce useful novel samples. Here, we present a generative model for discrete objects employing a Markov chain where transitions are restricted to a set of local operations that preserve validity. Building off of generative interpretations of denoising autoencoders, the Markov chain alternates between producing 1) a sequence of corrupted objects that are valid but not from the data distribution, and 2) a learned reconstruction distribution that attempts to fix the corruptions while also preserving validity. This approach constrains the generative model to only produce valid objects, requires the learner to only discover local modifications to the objects, and avoids marginalization over an unknown and potentially large space of construction histories. We evaluate the proposed approach on two highly structured discrete domains, molecules and Laman graphs, and find that it compares favorably to alternative methods at capturing distributional statistics for a host of semantically relevant metrics.

#### Denoising Structured Random Processes ArXiv e-print

##### W. Zhou, S. Jalali

Denoising a stationary process ${\left({X}_{i}\right)}_{i\in \mathbb{Z}}$ corrupted by additive white Gaussian noise is a classic and fundamental problem in information theory and statistical signal processing. Despite considerable progress in designing efficient denoising algorithms, for general analog sources, theoretically-founded computationally-efficient methods are yet to be found. For instance in denoising ${X}_{n}$ corrupted by noise ${Z}_{n}$ as ${Y}_{n}={X}_{n}+{Z}_{n}$, given the full distribution of ${X}_{n}$, a minimum mean square error (MMSE) denoiser needs to compute $\mathbb{E}[{X}_{n}\mid {Y}_{n}]$. However, for general sources, computing $\mathbb{E}[{X}_{n}\mid {Y}_{n}]$ is computationally very challenging, if not infeasible. In this paper, starting by a Bayesian setup, where the source distribution is fully known, a novel denoising method, namely, quantized maximum a posteriori (Q-MAP) denoiser, is proposed and its asymptotic performance in the high signal to noise ratio regime is analyzed. Both for memoryless sources, and for structured first-order Markov sources, it is shown that, asymptotically, as σ converges to zero, $\genfrac{}{}{0.1ex}{}{1}{{\sigma}^{2}}\mathbb{E}\left[{({X}_{i}-{X}^{\text{Q-MAP}})}^{2}\right]$ achieved by Q-MAP denoiser converges to the information dimension of the source. For the studied memoryless sources, this limit is known to be optimal. A key advantage of the Q-MAP denoiser is that, unlike an MMSE denoiser, it highlights the key properties of the source distribution that are to be used in its denoising. This property dramatically reduces the computational complexity of approximating the solution of the Q-MAP denoiser. Additionally, it naturally leads to a learning-based denoiser. Using ImageNet database for training, initial simulation results exploring the performance of such a learning-based denoiser in image denoising are presented.

#### Non-Vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach ICLR 2019

##### W. Zhou, V. Veitch, M. Austern, R. P. Adams, P. Orbanz

Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be

compressed

to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size that, combined with off-the-shelf compression algorithms, leads to state-of-the-art generalization guarantees. In particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. Additionally, we show that compressibility of models that tend to overfit is limited. Empirical results show that an increase in overfitting increases the number of bits required to describe a trained network.#### Empirical Risk Minimization and Stochastic Gradient Descent for Relational Data AISTATS 2019

##### V. Veitch, M. Austern, W. Zhou, P. Orbanz, D. Blei

Empirical risk minimization is the principal tool for prediction problems, but its extension to relational data remains unsolved. We solve this problem using recent advances in graph sampling theory. We (i) define an empirical risk for relational data and (ii) obtain stochastic gradients for this risk that are automatically unbiased. The key ingredient is to consider the method by which data is sampled from a graph as an explicit component of model design. Theoretical results establish that the choice of sampling scheme is critical. By integrating fast implementations of graph sampling schemes with standard automatic differentiation tools, we are able to solve the risk minimization in a plug-and-play fashion even on large datasets. We demonstrate empirically that relational ERM models achieve state-of-the-art results on semi-supervised node classification tasks. The experiments also confirm the importance of the choice of sampling scheme.

#### Approximate Leave-One-Out for Fast Parameter Tuning in High Dimensions ICML 2018

##### S. Wang

^{*}, W. Zhou^{*}, H. Lu, A. Maleki, V. MirrokniConsider the following class of learning schemes: $$\widehat{\beta}={\displaystyle {\mathrm{arg}\phantom{\rule{0.167em}{0ex}}\mathrm{min}}_{\beta \in \mathcal{C}}}{\displaystyle \sum _{j=1}^{n}}\ell ({x}_{j}^{\top}\beta ;{y}_{j})+\lambda R\left(\beta \right)$$ where ${x}_{i}\in {\mathbb{R}}^{p}$ and ${y}_{i}\in \mathbb{R}$ denote the

`i`th feature and response variable respectively. Let ℓ and R be the convex loss function and regularizer, β denote the unknown weights, and λ be a regularization parameter. $\mathcal{C}\subset {\mathbb{R}}^{p}$ is a closed convex set. Finding the optimal choice of λ is a challenging problem in high-dimensional regimes where both n and p are large. We propose three frameworks to obtain a computationally efficient approximation of the leave-one-out cross validation (LOOCV) risk for nonsmooth losses and regularizers. Our three frameworks are based on the primal, dual, and proximal formulations of the above problem. Each framework shows its strength in certain types of problems. We prove the equivalence of the three approaches under smoothness conditions. This equivalence enables us to justify the accuracy of the three methods under such conditions. We use our approaches to obtain a risk estimate for several standard problems, including generalized LASSO, nuclear norm regularization, and support vector machines. We empirically demonstrate the effectiveness of our results for non-differentiable cases.#### Analysis of Genotype by Methylation Interactions through Sparsity-Inducing Regularized Regression BMC Proceedings 2018 (GAW 20)

##### W. Zhou, S. Lo

We consider the use of the least absolute shrinkage and selection operator (LASSO)-type regression techniques to detect important genetic or epigenetic loci in genome-wide association studies (GWAS) and epigenome-wide association studies (EWAS). We demonstrate how these techniques can be adapted to provide quantifiable uncertainty using stability selection, including explicit control of the family-wise error rate. We also consider variants of the LASSO, such as the group LASSO, to study genetic and epigenetic interactions. We use these techniques to reproduce some existing results on the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) data set, which collects from 991 individuals blood triglyceride and differential methylation at 464,000 cytosine-phosphate-guanine (CpG) sites and 761,000 single-nucleotide polymorphisms (SNPs), and to identify new research directions. Epigenome-wide and genome-wide models based on the LASSO are considered, as well as an interaction model limited to chromosome 11. The analyses replicate findings concerning 2 CpGs in carnitine palmitoyltransferase 1A (CPT1A). Some suggestions are made regarding potentially interesting directions for the analysis of genetic and epigenetic interactions.

Ph.D. Statistics

August 2015 - Current

B.A. with MMaths (Part III)

October 2011 - May 2015

Bac Scientifique Mention *très bien*.

October 2011

### Introduction to Statistics with Calculus Summer 2017

#### S1201

This is Columbia's introductory statistics class. All the material for this class (including full notes and R notebook demonstrations) are available on the github.