Nonparametric Statistical Inference

I have worked on developing nonparametric inference techniques for both, the fixed sample-size or batch setting, and the sequential setting.

Permutation-free tests in the batch setting

Many practical tasks require us to decide whether the given data justify a particular hypothesis or not. Formally, given data X_1, \ldots, X_n, with X_i \in \mathbb{R}^d, the goal is to decide whether the null H_0 (e.g., \mathbb{E}[X_i]=0) or alternative H_1 (e.g., \mathbb{E}[X]\neq 0) is true. A standard approach to this problem is to (i) define a statistic (i.e., a function of the data) T_n, that is expected to be small under H_0, and (ii) reject H_0 if T_n exceeds a threshold t_\alpha. Here \alpha \in (0, 1) denotes the allowed false-positive rate, and t_{\alpha} must be chosen to maximize the power while limiting the false-postive rate below \alpha.

Two common ways of selecting t_\alpha are based on the permutation distribution, and the asymptotic distribution of T_n under H_0. Permutation tests have finite-sample validity, but lead to a large increase in complexity as it requires recomputing the statistic many (usually 100-1000) times. On the other hand, for a large class of statistics (degenerate U-statistics, such as kernel-MMD), the limiting distribution is different in the low-dimensional (fixed d; n \to \infty) and high-dimensional (e.g., \lim_{n\to \infty} d/n = 0.5) regimes. This leads to the following : given data with d=5000 and n=10000, should we use the low or high-dimensional asymptotics for selecting t_\alpha?

We S. Shekhar, I. Kim, and A. Ramdas, “A permutation-free kernel two-sample test,” NeurIPS, 2022.

address both these issues by designing statistics that are dimension-agnostic: that is, they have the same asymptotic distribution under both, low and high dimensional settings. The design relies on two key ideas: sample splitting, and studentization, and we instantiate this idea to develop nonparametric, minimax optimal, kernel based two-sample tests, and independence tests. In practice S. Shekhar, I. Kim, and A. Ramdas, “A permutation-free kernel test of independence,” in preparation.

, our tests trade-off a small loss in power (w.r.t. permutation tests) due to sample splitting, for a significant (often 100x) reduction in computation.

Sequential nonparametric tests and confidence sequences

In many applications S. Shekhar, and A. Ramdas, “Nonparametric Two-Sample Testing by Betting,” under revision, IEEE Transactions on Information Theory.

, such as online advertising, the data-points are observed sequentially, and the sample-size is decided on-the-fly (i.e., n itself is random). Unlike the above batch setting, there is no common principle for designing nonparametric (likelihood-free) tests in the sequential setting. To address this, my work develops a unified framework for designing sequential nonparametric tests for a large class of problems including two-sample testing, symmetry testing and independence testing. Our tests have strong theoretical guarantees, along with good empirical performance. A major advantage of our tests is their adaptivity: they collect just the “right” number of observations needed for a particular problem, without any prior information of its hardness. S. Shekhar, and A. Ramdas, “Sequential changepoint detection using confidence sequences,” in preparation.

Overall, the above results significantly advance the methodology for practical testing and inference in both batch and sequential settings S. Shekhar, N. Xu, and A. Ramdas, “Confidence sequence for weighted sampling without replacement,” in preparation.

. Building upon these ideas, I am currently working on the problems of sequential changepoint detection, and uncertainty quantification for sampling without replacement.