Clustering is a widely deployed unsupervised learning tool. Model-based clustering is a flexible framework to tackle data heterogeneity when the clusters have different shapes. Likelihood-based inference for mixture distributions often involves non-convex and high-dimensional objective functions, imposing difficult computational and statistical challenges. The classic expectation-maximization (EM) algorithm is a computationally thrifty iterative method that maximizes a surrogate function minorizing the log-likelihood of observed data in each iteration, which however suffers from bad local maxima even in the special case of the standard Gaussian mixture model with common isotropic covariance matrices. On the other hand, recent studies reveal that the unique global solution of a semidefinite programming (SDP) relaxed K-means achieves the information-theoretically sharp threshold for perfectly recovering the cluster labels under the standard Gaussian mixture model. In this paper, we extend the SDP approach to a general setting by integrating cluster labels as model parameters and propose an iterative likelihood adjusted SDP (iLA-SDP) method that directly maximizes the \emphexact observed likelihood in the presence of data heterogeneity. By lifting the cluster assignment to group-specific membership matrices, iLA-SDP avoids centroids estimation – a key feature that allows exact recovery under well-separateness of centroids without being trapped by their adversarial configurations. Thus iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data. Our numeric experiments demonstrate that iLA-SDP can achieve lower mis-clustering errors over several widely used clustering methods including K-means, SDP and EM algorithms.

Optimal transport (OT) offers a versatile framework to compare complex data distributions in a geometrically meaningful way. Traditional methods for computing the Wasserstein distance and geodesic between probability measures require mesh-dependent domain discretization and suffer from the curse-of-dimensionality. We present GeONet, a mesh-invariant deep neural operator network that learns the non-linear mapping from the input pair of initial and terminal distributions to the Wasserstein geodesic connecting the two endpoint distributions. In the offline training stage, GeONet learns the saddle point optimality conditions for the dynamic formulation of the OT problem in the primal and dual spaces that are characterized by a coupled PDE system. The subsequent inference stage is instantaneous and can be deployed for real-time predictions in the online learning setting. We demonstrate that GeONet achieves comparable testing accuracy to the standard OT solvers on a simulation example and the CIFAR-10 dataset with considerably reduced inference-stage computational cost by orders of magnitude.

Clustering is an important exploratory data analysis technique to group objects based on their similarity. The widely used K-means clustering method relies on some notion of distance to partition data into a fewer number of groups. In the Euclidean space, centroid-based and distance-based formulations of the K-means are equivalent. In modern machine learning applications, data often arise as probability distributions and a natural generalization to handle measure-valued data is to use the optimal transport metric. Due to non-negative Alexandrov curvature of the Wasserstein space, barycenters suffer from regularity and non-robustness issues. The peculiar behaviors of Wasserstein barycenters may make the centroid-based formulation fail to represent the within-cluster data points, while the more direct distance-based K-means approach and its semidefinite program (SDP) relaxation are capable of recovering the true cluster labels. In the special case of clustering Gaussian distributions, we show that the SDP relaxed Wasserstein K-means can achieve exact recovery given the clusters are well-separated under the 2-Wasserstein metric. Our simulation and real data examples also demonstrate that distance-based K-means can achieve better classification performance over the standard centroid-based K-means for clustering probability distributions and images.

This paper concerns the nonparametric estimation problem of the distribution-state dependent drift vector field in an interacting N-particle system. Observing single-trajectory data for each particle, we derive the mean-field rate of convergence for the maximum likelihood estimator (MLE), which depends on both Gaussian complexity and Rademacher complexity of the function class. In particular, when the function class contains α-smooth Hölder functions, our rate of convergence is minimax optimal on the order of N^-\fracαd+2α. Combining with a Fourier analytical deconvolution estimator, we derive the consistency of MLE for the external force and interaction kernel in the McKean-Vlasov equation.

Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed K-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the K-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original K-means SDP with substantially reduced runtime.

Irregular functional data in which densely sampled curves are observed over different ranges pose a challenge for modeling and inference, and sensitivity to outlier curves is a concern in applications. Motivated by applications in quantitative ultrasound signal analysis, this paper investigates a class of robust M-estimators for partially observed functional data including functional location and quantile estimators. Consistency of the estimators is established under general conditions on the partial observation process. Under smoothness conditions on the class of M-estimators, asymptotic Gaussian process approximations are established and used for large sample inference. The large sample approximations justify a bootstrap approximation for robust inferences about the functional response process. The performance is demonstrated in simulations and in the analysis of irregular functional data from quantitative ultrasound analysis.

Principled nonparametric tests for regression curvature in \bR^d are often statistically and computationally challenging. This paper introduces the stratified incomplete local simplex (SILS) tests for joint concavity of nonparametric multiple regression. The SILS tests with suitable bootstrap calibration are shown to achieve simultaneous guarantees on dimension-free computational complexity, polynomial decay of the uniform error-in-size, and power consistency for general (global and local) alternatives. To establish these results, we develop a general theory for incomplete U-processes with stratified random sparse weights. Novel technical ingredients include maximal inequalities for the supremum of multiple incomplete U-processes.

We consider the problem of change point detection for high-dimensional distributions in a location family when the dimension can be much larger than the sample size. In change point analysis, the widely used cumulative sum (CUSUM) statistics are sensitive to outliers and heavy-tailed distributions. In this paper, we propose a robust, tuning-free (i.e., fully data-dependent), and easy-to-implement change point test that enjoys strong theoretical guarantees. To achieve the robust purpose in a nonparametric setting, we formulate the change point detection in the multivariate U-statistics framework with anti-symmetric and nonlinear kernels. Specifically, the within-sample noise is canceled out by anti-symmetry of the kernel, while the signal distortion under certain nonlinear kernels can be controlled such that the between-sample change point signal is magnitude preserving. A (half) jackknife multiplier bootstrap (JMB) tailored to the change point detection setting is proposed to calibrate the distribution of our ℓ∞-norm aggregated test statistic. Subject to mild moment conditions on kernels, we derive the uniform rates of convergence for the JMB to approximate the sampling distribution of the test statistic, and analyze its size and power properties. Extensions to multiple change point testing and estimation are discussed with illustration from numerical studies.

We determine the information-theoretic cutoff value on separation of cluster centers for exact recovery of cluster labels in a K-component Gaussian mixture model with equal cluster sizes. Moreover, we show that a semidefinite programming (SDP) relaxation of the K-means clustering method achieves such sharp threshold for exact recovery without assuming the symmetry of cluster centers.

We introduce the diffusion K-means clustering method on Riemannian submanifolds, which maximizes the within-cluster connectedness based on the diffusion distance. The diffusion K-means constructs a random walk on the similarity graph with vertices as data points randomly sampled on the manifolds and edges as similarities given by a kernel that captures the local geometry of manifolds. The diffusion K-means is a multi-scale clustering tool that is suitable for data with non-linear and non-Euclidean geometric features in mixed dimensions. Given the number of clusters, we propose a polynomial-time convex relaxation algorithm via the semidefinite programming (SDP) to solve the diffusion K-means. In addition, we also propose a nuclear norm regularized SDP that is adaptive to the number of clusters. In both cases, we show that exact recovery of the SDPs for diffusion K-means can be achieved under suitable between-cluster separability and within-cluster connectedness of the submanifolds, which together quantify the hardness of the manifold clustering problem. We further propose the localized diffusion K-means by using the local adaptive bandwidth estimated from the nearest neighbors. We show that exact recovery of the localized diffusion K-means is fully adaptive to the local probability density and geometric structures of the underlying submanifolds.

Abstract Cumulative sum (CUSUM) statistics are widely used in the change point inference and identification. For the problem of testing for existence of a change point in an independent sample generated from the mean-shift model, we introduce a Gaussian multiplier bootstrap to calibrate critical values of the CUSUM test statistics in high dimensions. The proposed bootstrap CUSUM test is fully data dependent and it has strong theoretical guarantees under arbitrary dependence structures and mild moment conditions. Specifically, we show that with a boundary removal parameter the bootstrap CUSUM test enjoys the uniform validity in size under the null and it achieves the minimax separation rate under the sparse alternatives when the dimension p can be larger than the sample size n. Once a change point is detected, we estimate the change point location by maximising the ℓ∞-norm of the generalised CUSUM statistics at two different weighting scales corresponding to covariance stationary and non-stationary CUSUM statistics. For both estimators, we derive their rates of convergence and show that dimension impacts the rates only through logarithmic factors, which implies that consistency of the CUSUM estimators is possible when p is much larger than n. In the presence of multiple change points, we propose a principled bootstrap-assisted binary segmentation (BABS) algorithm to dynamically adjust the change point detection rule and recursively estimate their locations. We derive its rate of convergence under suitable signal separation and strength conditions. The results derived in this paper are non-asymptotic and we provide extensive simulation studies to assess the finite sample performance. The empirical evidence shows an encouraging agreement with our theoretical results.

We derive a dimension-free Hanson–Wright inequality for quadratic forms of independent sub-gaussian random variables in a separable Hilbert space. Our inequality is an infinite-dimensional generalization of the classical Hanson–Wright inequality for finite-dimensional Euclidean random vectors. We illustrate an application to the generalized K-means clustering problem for non-Euclidean data. Specifically, we establish the exponential rate of convergence for a semidefinite relaxation of the generalized K-means, which together with a simple rounding algorithm imply the exact recovery of the true clustering structure.

Motivated by statistical inference problems in high-dimensional time series data analysis, we first derive non-asymptotic error bounds for Gaussian approximations of sums of high-dimensional dependent random vectors on hyper-rectangles, simple convex sets and sparsely convex sets. We investigate the quantitative effect of temporal dependence on the rates of convergence to a Gaussian random vector over three different dependency frameworks (α-mixing, m-dependent, and physical dependence measure). In particular, we establish new error bounds under the α-mixing framework and derive faster rate over existing results under the physical dependence measure. To implement the proposed results in practical statistical inference problems, we also derive a data-driven parametric bootstrap procedure based on a kernel estimator for the long-run covariance matrices. We apply the unified Gaussian and bootstrap approximation results to test mean vectors with combined \ell^2 and \ell^∞type statistics, change point detection, and construction of confidence regions for covariance and precision matrices, all for time series data.

This paper concerns the parameter estimation problem for the quadratic potential energy in interacting particle systems from continuous-time and single-trajectory data. Even though such dynamical systems are high-dimensional, we show that the vanilla maximum likelihood estimator (without regularization) is able to estimate the interaction potential parameter with optimal rate of convergence simultaneously in mean-field limit and in long-time dynamics. This to some extend avoids the curse-of-dimensionality for estimating large dynamical systems under symmetry of the particle interaction.

This paper is concerned with finite sample approximations to the supremum of a non-degenerate U-process of a general order indexed by a function class. We are primarily interested in situations where the function class as well as the underlying distribution change with the sample size, and the U-process itself is not weakly convergent as a process. Such situations arise in a variety of modern statistical problems. We first consider Gaussian approximations, namely, approximate the U-process supremum by the supremum of a Gaussian process, and derive coupling and Kolmogorov distance bounds. Such Gaussian approximations are, however, not often directly applicable in statistical problems since the covariance function of the approximating Gaussian process is unknown. This motivates us to study bootstrap-type approximations to the U-process supremum. We propose a novel jackknife multiplier bootstrap (JMB) tailored to the U-process, and derive coupling and Kolmogorov distance bounds for the proposed JMB method. All these results are non-asymptotic, and established under fairly general conditions on function classes and underlying distributions. Key technical tools in the proofs are new local maximal inequalities for U-processes, which may be useful in other problems. We also discuss applications of the general approximation results to testing for qualitative features of nonparametric functions based on generalized local U-processes.

This paper introduces a dynamic panel SIR (DP-SIR) model to investigate the impact of non-pharmaceutical interventions (NPIs) on the COVID-19 transmission dynamics with panel data from 9 countries across the globe. By constructing scenarios with different combinations of NPIs, our empirical findings suggest that countries may avoid the lockdown policy with imposing school closure, mask wearing and centralized quarantine to reach similar outcomes on controlling the COVID-19 infection. Our results also suggest that, as of April 4th, 2020, certain countries such as the U.S. and Singapore may require additional measures of NPIs in order to control disease transmissions more effectively, while other countries may cautiously consider to gradually lift some NPIs to mitigate the costs to the overall economy.

This paper is concerned with the estimation of time-varying networks for high-dimensional nonstationary time series. Two types of dynamic behaviors are considered: structural breaks (i.e., abrupt change points) and smooth changes. To simultaneously handle these two types of time-varying features, a two-step approach is proposed: multiple change point locations are first identified on the basis of comparing the difference between the localized averages on sample covariance matrices, and then graph supports are recovered on the basis of a kernelized time-varying constrained L 1 -minimization for inverse matrix estimation (CLIME) estimator on each segment. We derive the rates of convergence for estimating the change points and precision matrices under mild moment and dependence conditions. In particular, we show that this two-step approach is consistent in estimating the change points and the piecewise smooth precision matrix function, under a certain high-dimensional scaling limit. The method is applied to the analysis of network structure of the S&P 500 index between 2003 and 2008.

Abstract A U-statistic, calculated from a random sample of size n, is an average of a symmetric function calculated for all m-tuples in the sample. Examples include the sample variance, the Cramér-von Mises and energy statistics of goodness-of-fit, and the Kaplan–Meier and Nelson–Aalen estimators in survival analysis. Asymptotic properties are described.

We study the problem of distributional approximations to high-dimensional non-degenerate U-statistics with random kernels of diverging orders. Infinite-order U-statistics (IOUS) are a useful tool for constructing simultaneous prediction intervals that quantify the uncertainty of ensemble methods such as subbagging and random forests. A major obstacle in using the IOUS is their computational intractability when the sample size and/or order are large. In this article, we derive non-asymptotic Gaussian approximation error bounds for an incomplete version of the IOUS with a random kernel. We also study data-driven inferential methods for the incomplete IOUS via bootstraps and develop their statistical and computational guarantees.

This paper studies inference for the mean vector of a high-dimensional U-statistic. In the era of big data, the dimension d of the U-statistic and the sample size n of the observations tend to be both large, and the computation of the U-statistic is prohibitively demanding. Data-dependent inferential procedures such as the empirical bootstrap for U-statistics is even more computationally expensive. To overcome such a computational bottleneck, incomplete U-statistics obtained by sampling fewer terms of the U-statistic are attractive alternatives. In this paper, we introduce randomized incomplete U-statistics with sparse weights whose computational cost can be made independent of the order of the U-statistic. We derive nonasymptotic Gaussian approximation error bounds for the randomized incomplete U-statistics in high dimensions, namely in cases where the dimension d is possibly much larger than the sample size n, for both nondegenerate and degenerate kernels. In addition, we propose generic bootstrap methods for the incomplete U-statistics that are computationally much less demanding than existing bootstrap methods, and establish finite sample validity of the proposed bootstrap methods. Our methods are illustrated on the application to nonparametric testing for the pairwise independence of a high-dimensional random vector under weaker assumptions than those appearing in the literature.

We propose a pointwise inference algorithm for high-dimensional linear models with time-varying coefficients. The method is based on a novel combination of the nonparametric kernel smoothing technique and a Lasso bias-corrected ridge regression estimator. Due to the non-stationarity feature of the model, dynamic bias-variance decomposition of the estimator is obtained. With a bias-correction procedure, the local null distribution of the estimator of the time-varying coefficient vector is characterized for iid Gaussian and heavy-tailed errors. The limiting null distribution is also established for Gaussian process errors, and we show that the asymptotic properties differ between short-range and long-range dependent errors. Here, p-values are adjusted by a Bonferroni-type correction procedure to control the familywise error rate (FWER) in the asymptotic sense at each time point. The finite sample size performance of the proposed inference algorithm is illustrated with synthetic data and an application to learn brain connectivity by using the resting-state fMRI data for Parkinson’s disease.

This paper studies the Gaussian and bootstrap approximations for the probabilities of a nondegenerate U-statistic belonging to the hyperrectangles in \mathbbR^d when the dimension d is large. A two-step Gaussian approximation procedure that does not impose structural assumptions on the data distribution is proposed. Subject to mild moment conditions on the kernel, we establish the explicit rate of convergence uniformly in the class of all hyperrectangles in \mathbbR^d that decays polynomially in sample size for a high-dimensional scaling limit, where the dimension can be much larger than the sample size. We also provide computable approximation methods for the quantiles of the maxima of centered U-statistics. Specifically, we provide a unified perspective for the empirical bootstrap, the randomly reweighted bootstrap and the Gaussian multiplier bootstrap with the jackknife estimator of covariance matrix as randomly reweighted quadratic forms and we establish their validity. We show that all three methods are inferentially first-order equivalent for high-dimensional U-statistics in the sense that they achieve the same uniform rate of convergence over all d-dimensional hyperrectangles. In particular, they are asymptotically valid when the dimension d can be as large as O(e^n^c) for some constant c∈(0,1/7). The bootstrap methods are applied to statistical applications for high-dimensional non-Gaussian data including: (i) principled and data-dependent tuning parameter selection for regularized estimation of the covariance matrix and its related functionals; (ii) simultaneous inference for the covariance and rank correlation matrices. In particular, for the thresholded covariance matrix estimator with the bootstrap selected tuning parameter, we show that for a class of sub-Gaussian data, error bounds of the bootstrapped thresholded covariance matrix estimator can be much tighter than those of the minimax estimator with a universal threshold. In addition, we also show that the Gaussian-like convergence rates can be achieved for heavy-tailed data, which are less conservative than those obtained by the Bonferroni technique that ignores the dependency in the underlying data distribution.

We consider the estimation of the transition matrix in the high-dimensional time-varying vector autoregression (TV-VAR) models. Our model builds on a general class of locally stationary VAR processes that evolve smoothly in time. We propose a hybridized kernel smoothing and \ell^1-regularized method to directly estimate the sequence of time-varying transition matrices. Under the sparsity assumption on the transition matrix, we establish the rate of convergence of the proposed estimator and show that the convergence rate depends on the smoothness of the locally stationary VAR processes only through the smoothness of the transition matrix function. In addition, for our estimator followed by thresholding, we prove that the false positive rate (type I error) and false negative rate (type II error) in the pattern recovery can asymptotically vanish in the presence of weak signals without assuming the minimum nonzero signal strength condition. Favorable finite sample performances over the \ell^2-penalized least-squares estimator and the unstructured maximum likelihood estimator are shown on simulated data. We also provide two real examples on estimating the dependence structures on financial stock prices and economic exchange rates datasets.

This paper studies a Dantzig-selector type regularized estimator for linear functionals of high-dimensional linear processes. Explicit rates of convergence of the proposed estimator are obtained and they cover the broad regime from independent identically distributed samples to long-range dependent time series and from sub-Gaussian innovations to those with mild polynomial moments. It is shown that the convergence rates depend on the degree of temporal dependence and the moment conditions of the underlying linear processes. The Dantzig-selector estimator is applied to the sparse Markowitz portfolio allocation and the optimal linear prediction for time series, in which the ratio consistency when compared with an oracle estimator is established. The effect of dependence and innovation moment conditions is further illustrated in the simulation study. Finally, the regularized estimator is applied to classify the cognitive states on a real functional magnetic resonance imaging dataset and to portfolio optimization on a financial dataset.

While neuroimaging data can provide valuable phenotypic information to inform genetic studies, the opposite is also true: known genotypes can be used to inform brain connectivity patterns from fMRI data. Here, we propose a framework for genetically informed group brain connectivity modeling. Subjects are first stratified according to their genotypes, and then a group regularized regression model is employed for brain connectivity modeling utilizing the time courses from a priori specified regions of interest (ROIs). With such an approach, each ROI time course is in turn predicted from all other ROI time courses at zero lag using a group regression framework which also incorporates a penalty based on genotypic similarity. Simulations supported such an approach when, as previously studies have indicated to be the case, genetic influences impart connectivity differences across subjects. The proposed method was applied to resting state fMRI data from Schizophrenia and normal control subjects. Genotypes were based on D-amino acid oxidase activator (DAOA) single-nucleotide polymorphisms (SNPs) information. With DAOA SNPs information integrated, the proposed approach was able to more accurately model the diversity in connectivity patterns. Specifically, connectivity with the left putamen, right posterior cingulate, and left middle frontal gyri were found to be jointly modulated by DAOA genotypes and the presence of Schizophrenia. We conclude that the proposed framework represents a multimodal analysis approach for incorporating genotypic variability into brain connectivity analysis directly.

Moment inequality for quadratic forms of random vectors is of particular interest in covariance matrix testing and estimation problems. In this paper, we prove a Rosenthal-type inequality, which exhibits new features and certain improvement beyond the unstructured Rosenthal inequality of quadratic forms when dimension of the vectors increases without bound. Applications to test the block diagonal structures and detect the sparsity in the high-dimensional covariance matrix are presented.

We consider estimation of covariance matrices and their inverses (a.k.a. precision matrices) for high-dimensional stationary and locally stationary time series. In the latter case the covariance matrices evolve smoothly in time, thus forming a covariance matrix function. Using the functional dependence measure of Wu [Proc. Natl. Acad. Sci. USA 102 (2005) 14150–14154 (electronic)], we obtain the rate of convergence for the thresholded estimate and illustrate how the dependence affects the rate of convergence. Asymptotic properties are also obtained for the precision matrix estimate which is based on the graphical Lasso principle. Our theory substantially generalizes earlier ones by allowing dependence, by allowing nonstationarity and by relaxing the associated moment conditions.

In this paper, we propose a shrinkage-to-tapering oracle (STO) estimator for estimation of large covariance matrix when the number of samples is substantially fewer than the number of variables, by combining the strength from both Steinian-type shrinkage and tapering estimators. Our contributions include: (i) Deriving the Frobenius risk and a lower bound for the spectral risk of an MMSE shrinkage estimator; (ii) Deriving a closed-form expression for the optimal coefficient of the proposed STO estimator. Simulations on auto-regression (e.g. a sparse case) and fraction Brownian motion (e.g. a non-sparse case) covariance structures are used to demonstrate the superiority of the proposed estimator.

In this paper, we introduce a shrinkage-to-tapering approach for estimating large covariance matrices when the number of samples is substantially fewer than the number of variables (i.e., n,p→∞ and p/n→∞). The proposed estimator improves upon both shrinkage and tapering estimators by shrinking the sample covariance matrix to its tapered version. We first show that, under both normalized Frobenius and spectral risks, the minimum mean-squared error (MMSE) shrinkage-to-identity estimator is inconsistent and outperformed by a minimax tapering estimator for a class of high-dimensional and diagonally dominant covariance matrices. Motivated by this observation, we propose a shrinkage-to-tapering oracle (STO) estimator for efficient estimation of general, large covariance matrices. A closed-form formula of the optimal coefficient ρ of the proposed STO estimator is derived under the minimum Frobenius risk. Since the true covariance matrix is to be estimated, we further propose a STO approximating (STOA) algorithm with a data-driven bandwidth selection procedure to iteratively estimate the coefficient ρ and the covariance matrix. We study the finite sample performances of different estimators and our simulation results clearly show the improved performances of the proposed STO estimators. Finally, the proposed STOA method is applied to a real breast cancer gene expression data set.

Estimation of the covariance matrix and its inverse, the precision matrix, in high-dimensional situations is of great interest in many applications. In this paper, we focus on the estimation of a class of sparse precision matrices which are assumed to be approximately inversely closed for the case that the dimensionality p can be much larger than the sample size n, which is fundamentally different from the classical case that p <; n. Different in nature from state-of-the-art methods that are based on penalized likelihood maximization or constrained error minimization, based on the truncated Neumann series representation, we propose a computationally efficient precision matrix estimator that has a computational complexity of O (p3). We prove that the proposed estimator is consistent in probability and in L2 under the spectral norm. Moreover, its convergence is shown to be rate-optimal in the sense of minimax risk. We further prove that the proposed estimator is model selection consistent by establishing a convergence result under the entry-wise ∞-norm. Simulations demonstrate the encouraging finite sample size performance and computational advantage of the proposed estimator. The proposed estimator is also applied to a real breast cancer data and shown to outperform existing precision matrix estimators.

Learning a small number of genetic variants associated with multiple complex genetic traits is of practical importance and remains challenging due to the high dimensional nature of data. In this paper, we proposed a two-graph guided multi-task Lasso to address this issue with an emphasis on estimating subnetwork-to-subnetwork associations in expression quantitative trait loci (eQTL) mapping. The proposed model can learn such subnetwork-to-subnetwork associations and therefore can be seen as a generalization of several state-of-the-art multi-task feature selection methods. Additionally, this model has a nice property of allowing flexible structured sparsity on both feature and label domains. Simulation study shows the improved performance of our model and a human eQTL data set is analyzed to further demonstrate the applications of the model.

Variable selection is a topic of great importance in high-dimensional statistical modeling and has a wide range of real-world applications. Many variable selection techniques have been proposed in the context of linear regression, and the Lasso model is probably one of the most popular penalized regression techniques. In this paper, we propose a new, fully hierarchical, Bayesian version of the Lasso model by employing flexible sparsity promoting priors. To obtain the Bayesian Lasso estimate, a reversible-jump MCMC algorithm is developed for joint posterior inference over both discrete and continuous parameter spaces. Simulations demonstrate that the proposed RJ-MCMC-based Bayesian Lasso yields smaller estimation errors and more accurate sparsity pattern detection when compared with state-of-the-art optimization-based Lasso-type methods, a standard Gibbs sampler-based Bayesian Lasso and the Binomial–Gaussian prior model. To demonstrate the applicability and estimation stability of the proposed Bayesian Lasso, we examine a benchmark diabetes data set and real functional Magnetic Resonance Imaging data. As an extension of the proposed RJ-MCMC framework, we also develop an MCMC-based algorithm for the Binomial–Gaussian prior model and illustrate its improved performance over the non-Bayesian estimate via simulations.

The Huberized LASSO model, a robust version of the popular LASSO, yields robust model selection in sparse linear regression. Though its superior performance was empirically demonstrated for large variance noise, currently no theoretical asymptotic analysis has been derived for the Huberized LASSO estimator. Here we prove that the Huberized LASSO estimator is consistent and asymptotically normal distributed under a proper shrinkage rate. Our derivation shows that, unlike the LASSO estimator, its asymptotic variance is stabilized in the presence of noise with large variance. We also propose the adaptive Huberized LASSO estimator by allowing unequal penalty weights for the regression coefficients, and prove its model selection consistency. Simulations confirm our theoretical results.

Inferring effective brain connectivity from neuroimaging data such as functional Magnetic Resonance Imaging (fMRI) has been attracting increasing interest due to its critical role in understanding brain functioning. Incorporating sparsity into connectivity modeling to make models more biologically realistic and performing group analysis to deal with inter-subject variability are still challenges associated with fMRI brain connectivity modeling. To address the above two crucial challenges, the attractive computational and theoretical properties of the least absolute shrinkage and selection operator (LASSO) in sparse linear regression provide a suitable starting point. We propose a group robust LASSO (grpRLASSO) model by combining advantages of the popular group-LASSO and our recently developed robust-LASSO. Here group analysis is formulated as a grouped variable selection procedure. Superior performance of the proposed grpRLASSO in terms of group selection and robustness is demonstrated by simulations with large noise variance. The grpRLASSO is also applied to a real fMRI data set for brain connectivity study in Parkinson’s disease, resulting in biologically plausible networks.

In the context of linear regression, the least absolute shrinkage and selection operator (LASSO) is probably the most popular supervised-learning technique proposed to recover sparse signals from high-dimensional measurements. Prior literature has mainly concerned itself with independent, identically distributed noise with moderate variance. In many real applications, however, the measurement errors may have heavy-tailed distributions or suffer from severe outliers, making the LASSO poorly estimate the coefficients due to its sensitivity to large error variance. To address this concern, a robust version of the LASSO is proposed, and the limiting distribution of its estimator is derived. Model selection consistency is established for the proposed robust LASSO under an adaptation procedure of the penalty weight. A parallel asymptotic analysis is derived for the Huberized LASSO, a previously proposed robust LASSO, and it is shown that the Huberized LASSO estimator preserves similar asymptotics even with a Cauchy error distribution. We show that asymptotic variances of the two robust LASSO estimators are stabilized in the presence of large variance noise, compared with the unbounded asymptotic variance of the ordinary LASSO estimator. The asymptotic analysis from the nonstochastic design is extended to the case of random design. Simulations further confirm our theoretical results.

Sparsity is crucial for high-dimensional statistical modeling. On one hand, dimensionality reduction can reduce the variability of estimation and thus provide reliable predictive power. On the other hand, the selected sub-model can discover and emphasize the underlying dependencies, which is useful for objective interpretation. Many variable selection methods have been proposed in literatures. For a prominent example, Least Absolute Shrinkage and Selection Operator (lasso) in linear regression context has been extensively explored. This paper discusses a class of scaled mixture of Gaussian models from both a penalized likelihood and a Bayesian regression point of view. We propose an Majorize-Minimize (MM) algorithm to find the Maximum A Posteriori (MAP) estimator, where the EM algorithm can be stuck at local optimum for some members in this class. Simulation studies show the outperformance of proposed algorithm in nonstochastic design variable selection scenario. The proposed algorithm is applied to a real large-scale E.coli data set with known bona fide interactions for constructing sparse gene regulatory networks. We show that our regression networks with a properly chosen prior can perform comparably to state-of-the-art regulatory network construction algorithms.

Summary:BNArray is a systemized tool developed in R. It facilitates the construction of gene regulatory networks from DNA microarray data by using Bayesian network. Significant sub-modules of regulatory networks with high confidence are reconstructed by using our extended sub-network mining algorithm of directed graphs. BNArray can handle microarray datasets with missing data. To evaluate the statistical features of generated Bayesian networks, re-sampling procedures are utilized to yield collections of candidate 1st-order network sets for mining dense coherent sub-networks.Availability: The R package and the supplementary documentation are available at .Contact:mchen@zju.edu.cn