
Atlas of Lasers and Lights in Dermatology

Cannarozzo, Giovanni. author.
9783030312329





African edible insects as alternative source of food, oil, protein and bioactive components

9783030329525 (electronic bk.)





Advances in virus research.

9780123850348 (electronic bk.)





A treatise on topical corticosteroids in dermatology : use, misuse and abuse

9789811046094





100 cases in clinical pharmacology, therapeutics and prescribing

Layne, Kerry, author.
9780429624537 electronic book





InBios receives Emergency Use Authorization for its Smart Detect...

InBios International, Inc. announces the U.S. Food and Drug Administration (FDA) issued an emergency use authorization (EUA) for its diagnostic test that can be used immediately by CLIA...

(PRWeb April 08, 2020)

Read the full story at https://www.prweb.com/releases/inbios_receives_emergency_use_authorization_for_its_smart_detect_sars_cov_2_rrt_pcr_kit_for_detection_of_the_virus_causing_covid_19/prweb17036897.htm








Asymptotic genealogies of interacting particle systems with an application to sequential Monte Carlo

Jere Koskela, Paul A. Jenkins, Adam M. Johansen, Dario Spanò.

Source: The Annals of Statistics, Volume 48, Number 1, 560--583.

Abstract:
We study weighted particle systems in which new generations are resampled from current particles with probabilities proportional to their weights. This covers a broad class of sequential Monte Carlo (SMC) methods, widely-used in applied statistics and cognate disciplines. We consider the genealogical tree embedded into such particle systems, and identify conditions, as well as an appropriate time-scaling, under which they converge to the Kingman $n$-coalescent in the infinite system size limit in the sense of finite-dimensional distributions. Thus, the tractable $n$-coalescent can be used to predict the shape and size of SMC genealogies, as we illustrate by characterising the limiting mean and variance of the tree height. SMC genealogies are known to be connected to algorithm performance, so that our results are likely to have applications in the design of new methods as well. Our conditions for convergence are strong, but we show by simulation that they do not appear to be necessary.
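
To make the resampling mechanism concrete, here is a minimal numpy sketch of a weighted particle system of the kind studied above: particles are propagated, weighted, and resampled with probabilities proportional to their weights, and the recorded ancestry is traced backwards to read off the height of the genealogical tree. The state-space model, weights and sizes are illustrative placeholders, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 50                            # particles, generations

# Placeholder state-space model: latent random-walk signal observed in noise.
x_true = np.cumsum(rng.normal(size=T))
y = x_true + rng.normal(size=T)

particles = rng.normal(size=N)
ancestors = np.zeros((T, N), dtype=int)   # ancestors[t, i] = parent index of particle i at step t

for t in range(T):
    # Propagate and weight (bootstrap-filter style, Gaussian observation density).
    particles = particles + rng.normal(size=N)
    logw = -0.5 * (y[t] - particles) ** 2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Resample the new generation with probabilities proportional to the weights.
    parents = rng.choice(N, size=N, p=w)
    ancestors[t] = parents
    particles = particles[parents]

# Trace the genealogy of the final generation backwards; the tree height is the
# number of generations until all surviving lineages coalesce.
lineages = np.arange(N)
height = 0
for t in range(T - 1, -1, -1):
    lineages = np.unique(ancestors[t][lineages])
    height += 1
    if lineages.size == 1:
        break
print("tree height (generations back to the most recent common ancestor):", height)
```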





Uniformly valid confidence intervals post-model-selection

François Bachoc, David Preinerstorfer, Lukas Steinberger.

Source: The Annals of Statistics, Volume 48, Number 1, 440--463.

Abstract:
We suggest general methods to construct asymptotically uniformly valid confidence intervals post-model-selection. The constructions are based on principles recently proposed by Berk et al. ( Ann. Statist. 41 (2013) 802–837). In particular, the candidate models used can be misspecified, the target of inference is model-specific, and coverage is guaranteed for any data-driven model selection procedure. After developing a general theory, we apply our methods to practically important situations where the candidate set of models, from which a working model is selected, consists of fixed design homoskedastic or heteroskedastic linear models, or of binary regression models with general link functions. In an extensive simulation study, we find that the proposed confidence intervals perform remarkably well, even when compared to existing methods that are tailored only for specific model selection procedures.





Consistent selection of the number of change-points via sample-splitting

Changliang Zou, Guanghui Wang, Runze Li.

Source: The Annals of Statistics, Volume 48, Number 1, 413--439.

Abstract:
In multiple change-point analysis, one of the major challenges is to estimate the number of change-points. Most existing approaches attempt to minimize a Schwarz information criterion which balances a term quantifying model fit with a penalization term accounting for model complexity that increases with the number of change-points and limits overfitting. However, different penalization terms are required to adapt to different contexts of multiple change-point problems and the optimal penalization magnitude usually varies from the model and error distribution. We propose a data-driven selection criterion that is applicable to most kinds of popular change-point detection methods, including binary segmentation and optimal partitioning algorithms. The key idea is to select the number of change-points that minimizes the squared prediction error, which measures the fit of a specified model for a new sample. We develop a cross-validation estimation scheme based on an order-preserved sample-splitting strategy, and establish its asymptotic selection consistency under some mild conditions. Effectiveness of the proposed selection criterion is demonstrated on a variety of numerical experiments and real-data examples.
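
As a rough illustration of the sample-splitting idea (with a simple least-squares dynamic program standing in for whichever change-point detection method is actually used, and details that differ from the paper's procedure), one can locate candidate change-points on the odd-indexed observations and score each candidate number of change-points by its squared prediction error on the even-indexed observations:

```python
import numpy as np

rng = np.random.default_rng(1)
signal = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100), rng.normal(1, 1, 100)])

def best_segmentation(x, n_cp):
    """Least-squares optimal partition of x into n_cp + 1 segments (simple O(K n^2) DP)."""
    n = len(x)
    csum = np.concatenate(([0.0], np.cumsum(x)))
    csum2 = np.concatenate(([0.0], np.cumsum(x ** 2)))
    def sse(i, j):                                  # within-segment sum of squares for x[i:j]
        s, s2, m = csum[j] - csum[i], csum2[j] - csum2[i], j - i
        return s2 - s * s / m
    cost = np.full((n_cp + 1, n + 1), np.inf)
    arg = np.zeros((n_cp + 1, n + 1), dtype=int)
    cost[0, 1:] = [sse(0, j) for j in range(1, n + 1)]
    for k in range(1, n_cp + 1):
        for j in range(k + 1, n + 1):
            cands = [cost[k - 1, i] + sse(i, j) for i in range(k, j)]
            best = int(np.argmin(cands))
            cost[k, j], arg[k, j] = cands[best], best + k
    cps, j = [], n                                  # backtrack the change-point locations
    for k in range(n_cp, 0, -1):
        j = arg[k, j]
        cps.append(j)
    return sorted(cps)

train, test = signal[0::2], signal[1::2]            # order-preserved split
for k in range(0, 5):
    cps = best_segmentation(train, k)
    bounds = [0] + cps + [len(train)]
    # Piecewise means fitted on the training half, evaluated on the held-out half.
    pred = np.concatenate([np.full(b - a, train[a:b].mean()) for a, b in zip(bounds[:-1], bounds[1:])])
    m = min(len(pred), len(test))
    print(k, "change-points, CV prediction error:", np.mean((test[:m] - pred[:m]) ** 2).round(3))
```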





Sparse high-dimensional regression: Exact scalable algorithms and phase transitions

Dimitris Bertsimas, Bart Van Parys.

Source: The Annals of Statistics, Volume 48, Number 1, 300--323.

Abstract:
We present a novel binary convex reformulation of the sparse regression problem that constitutes a new duality perspective. We devise a new cutting plane method and provide evidence that it can solve to provable optimality the sparse regression problem for sample sizes $n$ and number of regressors $p$ in the 100,000s, that is, two orders of magnitude better than the current state of the art, in seconds. The ability to solve the problem for very high dimensions allows us to observe new phase transition phenomena. Contrary to traditional complexity theory which suggests that the difficulty of a problem increases with problem size, the sparse regression problem has the property that as the number of samples $n$ increases the problem becomes easier in that the solution recovers 100% of the true signal, and our approach solves the problem extremely fast (in fact faster than Lasso), while for small number of samples $n$, our approach takes a larger amount of time to solve the problem, but importantly the optimal solution provides a statistically more relevant regressor. We argue that our exact sparse regression approach presents a superior alternative over heuristic methods available at present.





Bootstrap confidence regions based on M-estimators under nonstandard conditions

Stephen M. S. Lee, Puyudi Yang.

Source: The Annals of Statistics, Volume 48, Number 1, 274--299.

Abstract:
Suppose that a confidence region is desired for a subvector $\theta$ of a multidimensional parameter $\xi=(\theta,\psi)$, based on an M-estimator $\hat{\xi}_{n}=(\hat{\theta}_{n},\hat{\psi}_{n})$ calculated from a random sample of size $n$. Under nonstandard conditions $\hat{\xi}_{n}$ often converges at a nonregular rate $r_{n}$, in which case consistent estimation of the distribution of $r_{n}(\hat{\theta}_{n}-\theta)$, a pivot commonly chosen for confidence region construction, is most conveniently effected by the $m$ out of $n$ bootstrap. The above choice of pivot has three drawbacks: (i) the shape of the region is either subjectively prescribed or controlled by a computationally intensive depth function; (ii) the region is not transformation equivariant; (iii) $\hat{\xi}_{n}$ may not be uniquely defined. To resolve the above difficulties, we propose a one-dimensional pivot derived from the criterion function, and prove that its distribution can be consistently estimated by the $m$ out of $n$ bootstrap, or by a modified version of the perturbation bootstrap. This leads to a new method for constructing confidence regions which are transformation equivariant and have shapes driven solely by the criterion function. A subsampling procedure is proposed for selecting $m$ in practice. Empirical performance of the new method is illustrated with examples drawn from different nonstandard M-estimation settings. Extension of our theory to row-wise independent triangular arrays is also explored.
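
A minimal sketch of the $m$ out of $n$ bootstrap mechanics referred to above, using the sample median (an M-estimator for the $L_1$ criterion) and the ordinary $\sqrt{m}$ scaling as placeholders; in genuinely nonstandard problems the estimator, the nonregular rate and the pivot would of course differ.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.laplace(size=n)                    # sample; the median is an M-estimator (L1 criterion)
theta_hat = np.median(x)

# m out of n bootstrap: resample m << n observations with replacement and recompute
# the pivot r_m * (theta*_m - theta_hat); sqrt(m) is a placeholder rate here,
# nonstandard problems would use the appropriate nonregular rate.
m, B = 200, 2000
pivots = np.empty(B)
for b in range(B):
    xb = rng.choice(x, size=m, replace=True)
    pivots[b] = np.sqrt(m) * (np.median(xb) - theta_hat)

# Symmetric bootstrap confidence interval for the population median.
q = np.quantile(np.abs(pivots), 0.95)
ci = (theta_hat - q / np.sqrt(n), theta_hat + q / np.sqrt(n))
print("95% m-out-of-n bootstrap CI for the median:", ci)
```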





Envelope-based sparse partial least squares

Guangyu Zhu, Zhihua Su.

Source: The Annals of Statistics, Volume 48, Number 1, 161--182.

Abstract:
Sparse partial least squares (SPLS) is widely used in applied sciences as a method that performs dimension reduction and variable selection simultaneously in linear regression. Several implementations of SPLS have been derived, among which the SPLS proposed in Chun and Keleş ( J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 (2010) 3–25) is very popular and highly cited. However, for all of these implementations, the theoretical properties of SPLS are largely unknown. In this paper, we propose a new version of SPLS, called the envelope-based SPLS, using a connection between envelope models and partial least squares (PLS). We establish the consistency, oracle property and asymptotic normality of the envelope-based SPLS estimator. The large-sample scenario and high-dimensional scenario are both considered. We also develop the envelope-based SPLS estimators under the context of generalized linear models, and discuss its theoretical properties including consistency, oracle property and asymptotic distribution. Numerical experiments and examples show that the envelope-based SPLS estimator has better variable selection and prediction performance over the SPLS estimator ( J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 (2010) 3–25).





New $G$-formula for the sequential causal effect and blip effect of treatment in sequential causal inference

Xiaoqin Wang, Li Yin.

Source: The Annals of Statistics, Volume 48, Number 1, 138--160.

Abstract:
In sequential causal inference, two types of causal effects are of practical interest, namely, the causal effect of the treatment regime (called the sequential causal effect) and the blip effect of treatment on the potential outcome after the last treatment. The well-known $G$-formula expresses these causal effects in terms of the standard parameters. In this article, we obtain a new $G$-formula that expresses these causal effects in terms of the point observable effects of treatments similar to treatment in the framework of single-point causal inference. Based on the new $G$-formula, we estimate these causal effects by maximum likelihood via point observable effects with methods extended from single-point causal inference. We are able to increase precision of the estimation without introducing biases by an unsaturated model imposing constraints on the point observable effects. We are also able to reduce the number of point observable effects in the estimation by treatment assignment conditions.





Robust sparse covariance estimation by thresholding Tyler’s M-estimator

John Goes, Gilad Lerman, Boaz Nadler.

Source: The Annals of Statistics, Volume 48, Number 1, 86--110.

Abstract:
Estimating a high-dimensional sparse covariance matrix from a limited number of samples is a fundamental task in contemporary data analysis. Most proposals to date, however, are not robust to outliers or heavy tails. Toward bridging this gap, in this work we consider estimating a sparse shape matrix from $n$ samples following a possibly heavy-tailed elliptical distribution. We propose estimators based on thresholding either Tyler’s M-estimator or its regularized variant. We prove that in the joint limit as the dimension $p$ and the sample size $n$ tend to infinity with $p/n\to\gamma>0$, our estimators are minimax rate optimal. Results on simulated data support our theoretical analysis.
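
The following sketch shows the two ingredients named in the abstract, Tyler's fixed-point iteration for the shape matrix followed by hard-thresholding of its off-diagonal entries, on simulated heavy-tailed data; the threshold level and the simulation design are illustrative choices, not the paper's tuning.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 40
# Heavy-tailed elliptical sample (multivariate t_3) with a sparse shape matrix.
shape = np.eye(p); shape[0, 1] = shape[1, 0] = 0.5
z = rng.multivariate_normal(np.zeros(p), shape, size=n)
x = z / np.sqrt(rng.chisquare(3, size=(n, 1)) / 3)

# Tyler's M-estimator: iterate Sigma <- (p/n) * sum_i x_i x_i' / (x_i' Sigma^{-1} x_i).
sigma = np.eye(p)
for _ in range(100):
    inv = np.linalg.inv(sigma)
    d = np.einsum('ij,jk,ik->i', x, inv, x)        # x_i' Sigma^{-1} x_i for each sample
    new = (p / n) * (x / d[:, None]).T @ x
    new *= p / np.trace(new)                       # fix the scale: shape matrices are scale-free
    if np.linalg.norm(new - sigma) < 1e-8:
        sigma = new
        break
    sigma = new

# Hard-threshold small off-diagonal entries (illustrative level of order sqrt(log p / n)).
tau = 2 * np.sqrt(np.log(p) / n)
sparse_sigma = np.where(np.abs(sigma) >= tau, sigma, 0.0)
np.fill_diagonal(sparse_sigma, np.diag(sigma))
print("nonzero off-diagonal entries kept:", int((np.abs(sparse_sigma) > 0).sum() - p))
```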





Sparse SIR: Optimal rates and adaptive estimation

Kai Tan, Lei Shi, Zhou Yu.

Source: The Annals of Statistics, Volume 48, Number 1, 64--85.

Abstract:
Sliced inverse regression (SIR) is an innovative and effective method for sufficient dimension reduction and data visualization. Recently, an impressive range of penalized SIR methods has been proposed to estimate the central subspace in a sparse fashion. Nonetheless, few of them considered the sparse sufficient dimension reduction from a decision-theoretic point of view. To address this issue, we in this paper establish the minimax rates of convergence for estimating the sparse SIR directions under various commonly used loss functions in the literature of sufficient dimension reduction. We also discover the possible trade-off between statistical guarantee and computational performance for sparse SIR. We finally propose an adaptive estimation scheme for sparse SIR which is computationally tractable and rate optimal. Numerical studies are carried out to confirm the theoretical properties of our proposed methods.
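
For readers unfamiliar with SIR itself, here is a small sketch of plain (non-sparse) sliced inverse regression, the object whose sparse, rate-optimal estimation the paper studies: slice the response, average the standardized predictors within slices, and take leading eigenvectors of the between-slice covariance. The model and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, H = 1000, 10, 10                     # sample size, dimension, number of slices
x = rng.normal(size=(n, p))
beta = np.zeros(p); beta[0] = 1.0          # true single-index direction
y = np.exp(x @ beta) + 0.1 * rng.normal(size=n)

# Standardize (whiten) the predictors.
xc = x - x.mean(axis=0)
cov = np.cov(xc, rowvar=False)
whiten = np.linalg.inv(np.linalg.cholesky(cov)).T
z = xc @ whiten

# Slice y into H slices and average the whitened predictors within each slice.
order = np.argsort(y)
slices = np.array_split(order, H)
means = np.array([z[idx].mean(axis=0) for idx in slices])
weights = np.array([len(idx) / n for idx in slices])

# Eigen-decomposition of the between-slice covariance gives the SIR directions.
m = (means * weights[:, None]).T @ means
eigval, eigvec = np.linalg.eigh(m)
direction = whiten @ eigvec[:, -1]         # map the leading direction back to the x scale
direction /= np.linalg.norm(direction)
print("estimated direction (should load mainly on the first coordinate):", direction.round(2))
```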





The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression

Emmanuel J. Candès, Pragya Sur.

Source: The Annals of Statistics, Volume 48, Number 1, 27--42.

Abstract:
This paper rigorously establishes that the existence of the maximum likelihood estimate (MLE) in high-dimensional logistic regression models with Gaussian covariates undergoes a sharp “phase transition.” We introduce an explicit boundary curve $h_{\mathrm{MLE}}$, parameterized by two scalars measuring the overall magnitude of the unknown sequence of regression coefficients, with the following property: in the limit of large sample sizes $n$ and number of features $p$ proportioned in such a way that $p/n\rightarrow\kappa$, we show that if the problem is sufficiently high dimensional in the sense that $\kappa>h_{\mathrm{MLE}}$, then the MLE does not exist with probability one. Conversely, if $\kappa<h_{\mathrm{MLE}}$, the MLE asymptotically exists with probability one.
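
The phenomenon is easy to reproduce by simulation. With continuous Gaussian covariates, the MLE fails to exist essentially when the two classes can be linearly separated, which can be checked with a small linear program; for pure-noise labels the transition is known to occur at $\kappa=1/2$ (the value of the boundary curve at zero signal strength), which the sketch below recovers. This illustrates the phase transition only; it is not the paper's construction of the full boundary curve.

```python
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Check complete linear separation: does some b satisfy y_i * x_i'b >= 1 for all i?"""
    signed = -X * np.where(y == 1, 1.0, -1.0)[:, None]        # rows encode -y_i x_i'b <= -1
    res = linprog(c=np.zeros(X.shape[1]), A_ub=signed, b_ub=-np.ones(len(y)),
                  bounds=[(None, None)] * X.shape[1], method="highs")
    return res.status == 0                                     # feasible => separable => no MLE

rng = np.random.default_rng(5)
n = 400
for kappa in (0.1, 0.3, 0.5, 0.7):
    p = int(kappa * n)
    exists = 0
    for _ in range(20):
        X = rng.normal(size=(n, p)) / np.sqrt(p)               # Gaussian covariates, null signal
        y = rng.integers(0, 2, size=n)
        exists += not separable(X, y)
    print(f"kappa = {kappa:.1f}: MLE existed in {exists}/20 simulated data sets")
```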





Two-step semiparametric empirical likelihood inference

Francesco Bravo, Juan Carlos Escanciano, Ingrid Van Keilegom.

Source: The Annals of Statistics, Volume 48, Number 1, 1--26.

Abstract:
In both parametric and certain nonparametric statistical models, the empirical likelihood ratio satisfies a nonparametric version of Wilks’ theorem. For many semiparametric models, however, the commonly used two-step (plug-in) empirical likelihood ratio is not asymptotically distribution-free, that is, its asymptotic distribution contains unknown quantities, and hence Wilks’ theorem breaks down. This article suggests a general approach to restore Wilks’ phenomenon in two-step semiparametric empirical likelihood inferences. The main insight consists in using as the moment function in the estimating equation the influence function of the plug-in sample moment. The proposed method is general; it leads to a chi-squared limiting distribution with known degrees of freedom; it is efficient; it does not require undersmoothing; and it is less sensitive to the first-step than alternative methods, which is particularly appealing for high-dimensional settings. Several examples and simulation studies illustrate the general applicability of the procedure and its excellent finite sample performance relative to competing methods.





Detecting relevant changes in the mean of nonstationary processes—A mass excess approach

Holger Dette, Weichi Wu.

Source: The Annals of Statistics, Volume 47, Number 6, 3578--3608.

Abstract:
This paper considers the problem of testing if a sequence of means $(\mu_{t})_{t=1,\ldots,n}$ of a nonstationary time series $(X_{t})_{t=1,\ldots,n}$ is stable in the sense that the difference of the means $\mu_{1}$ and $\mu_{t}$ between the initial time $t=1$ and any other time is smaller than a given threshold, that is, $|\mu_{1}-\mu_{t}|\leq c$ for all $t=1,\ldots,n$. A test for hypotheses of this type is developed using a bias corrected monotone rearranged local linear estimator and asymptotic normality of the corresponding test statistic is established. As the asymptotic variance depends on the location of the roots of the equation $|\mu_{1}-\mu_{t}|=c$ a new bootstrap procedure is proposed to obtain critical values and its consistency is established. As a consequence we are able to quantitatively describe relevant deviations of a nonstationary sequence from its initial value. The results are illustrated by means of a simulation study and by analyzing data examples.





Minimax posterior convergence rates and model selection consistency in high-dimensional DAG models based on sparse Cholesky factors

Kyoungjae Lee, Jaeyong Lee, Lizhen Lin.

Source: The Annals of Statistics, Volume 47, Number 6, 3413--3437.

Abstract:
In this paper we study the high-dimensional sparse directed acyclic graph (DAG) models under the empirical sparse Cholesky prior. Among our results, strong model selection consistency or graph selection consistency is obtained under more general conditions than those in the existing literature. Compared to Cao, Khare and Ghosh [ Ann. Statist. (2019) 47 319–348], the required conditions are weakened in terms of the dimensionality, sparsity and lower bound of the nonzero elements in the Cholesky factor. Furthermore, our result does not require the irrepresentable condition, which is necessary for Lasso-type methods. We also derive the posterior convergence rates for precision matrices and Cholesky factors with respect to various matrix norms. The obtained posterior convergence rates are the fastest among those of the existing Bayesian approaches. In particular, we prove that our posterior convergence rates for Cholesky factors are the minimax or at least nearly minimax depending on the relative size of true sparseness for the entire dimension. The simulation study confirms that the proposed method outperforms the competing methods.





On testing for high-dimensional white noise

Zeng Li, Clifford Lam, Jianfeng Yao, Qiwei Yao.

Source: The Annals of Statistics, Volume 47, Number 6, 3382--3412.

Abstract:
Testing for white noise is a classical yet important problem in statistics, especially for diagnostic checks in time series modeling and linear regression. For high-dimensional time series in the sense that the dimension $p$ is large in relation to the sample size $T$, the popular omnibus tests including the multivariate Hosking and Li–McLeod tests are extremely conservative, leading to substantial power loss. To develop more relevant tests for high-dimensional cases, we propose a portmanteau-type test statistic which is the sum of squared singular values of the first $q$ lagged sample autocovariance matrices. It, therefore, encapsulates all the serial correlations (up to the time lag $q$) within and across all component series. Using the tools from random matrix theory and assuming both $p$ and $T$ diverge to infinity, we derive the asymptotic normality of the test statistic under both the null and a specific VMA(1) alternative hypothesis. As the actual implementation of the test requires the knowledge of three characteristic constants of the population cross-sectional covariance matrix and the value of the fourth moment of the standardized innovations, nontrivial estimations are proposed for these parameters and their integration leads to a practically usable test. Extensive simulation confirms the excellent finite-sample performance of the new test with accurate size and satisfactory power for a large range of finite $(p,T)$ combinations, therefore, ensuring wide applicability in practice. In particular, the new tests are consistently superior to the traditional Hosking and Li–McLeod tests.
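
A minimal sketch of the raw statistic described above (the sum over lags $1,\ldots,q$ of the squared singular values, i.e. squared Frobenius norms, of the lagged sample autocovariance matrices); the centering, scaling and estimated population constants that turn it into a usable test are omitted.

```python
import numpy as np

def portmanteau_stat(x, q):
    """Sum over lags 1..q of the squared singular values of the sample autocovariance matrices.

    x : array of shape (T, p), one row per time point.
    """
    T = x.shape[0]
    x = x - x.mean(axis=0)
    total = 0.0
    for k in range(1, q + 1):
        gamma_k = x[k:].T @ x[:-k] / T           # lag-k sample autocovariance matrix (p x p)
        total += np.sum(gamma_k ** 2)            # sum of squared singular values = ||Gamma_k||_F^2
    return total

rng = np.random.default_rng(6)
T, p, q = 300, 50, 3
white = rng.normal(size=(T, p))
# VMA(1)-type alternative: x_t = e_t + 0.3 * e_{t-1}.
e = rng.normal(size=(T + 1, p))
vma = e[1:] + 0.3 * e[:-1]
print("statistic under white noise:", portmanteau_stat(white, q).round(2))
print("statistic under VMA(1)     :", portmanteau_stat(vma, q).round(2))
```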





Sampling and estimation for (sparse) exchangeable graphs

Victor Veitch, Daniel M. Roy.

Source: The Annals of Statistics, Volume 47, Number 6, 3274--3299.

Abstract:
Sparse exchangeable graphs on $\mathbb{R}_{+}$, and the associated graphex framework for sparse graphs, generalize exchangeable graphs on $\mathbb{N}$, and the associated graphon framework for dense graphs. We develop the graphex framework as a tool for statistical network analysis by identifying the sampling scheme that is naturally associated with the models of the framework, formalizing two natural notions of consistent estimation of the parameter (the graphex) underlying these models, and identifying general consistent estimators in each case. The sampling scheme is a modification of independent vertex sampling that throws away vertices that are isolated in the sampled subgraph. The estimators are variants of the empirical graphon estimator, which is known to be a consistent estimator for the distribution of dense exchangeable graphs; both can be understood as graph analogues to the empirical distribution in the i.i.d. sequence setting. Our results may be viewed as a generalization of consistent estimation via the empirical graphon from the dense graph regime to also include sparse graphs.





On partial-sum processes of ARMAX residuals

Steffen Grønneberg, Benjamin Holcblat.

Source: The Annals of Statistics, Volume 47, Number 6, 3216--3243.

Abstract:
We establish general and versatile results regarding the limit behavior of the partial-sum process of ARMAX residuals. Illustrations include ARMA with seasonal dummies, misspecified ARMAX models with autocorrelated errors, nonlinear ARMAX models, ARMA with a structural break, a wide range of ARMAX models with infinite-variance errors, weak GARCH models and the consistency of kernel estimation of the density of ARMAX errors. Our results identify the limit distributions, and provide a general algorithm to obtain pivot statistics for CUSUM tests.





Adaptive estimation of the rank of the coefficient matrix in high-dimensional multivariate response regression models

Xin Bing, Marten H. Wegkamp.

Source: The Annals of Statistics, Volume 47, Number 6, 3157--3184.

Abstract:
We consider the multivariate response regression problem with a regression coefficient matrix of low, unknown rank. In this setting, we analyze a new criterion for selecting the optimal reduced rank. This criterion differs notably from the one proposed in Bunea, She and Wegkamp ( Ann. Statist. 39 (2011) 1282–1309) in that it does not require estimation of the unknown variance of the noise, nor does it depend on a delicate choice of a tuning parameter. We develop an iterative, fully data-driven procedure, that adapts to the optimal signal-to-noise ratio. This procedure finds the true rank in a few steps with overwhelming probability. At each step, our estimate increases, while at the same time it does not exceed the true rank. Our finite sample results hold for any sample size and any dimension, even when the number of responses and of covariates grow much faster than the number of observations. We perform an extensive simulation study that confirms our theoretical findings. The new method performs better and is more stable than the procedure of Bunea, She and Wegkamp ( Ann. Statist. 39 (2011) 1282–1309) in both low- and high-dimensional settings.





Active ranking from pairwise comparisons and when parametric assumptions do not help

Reinhard Heckel, Nihar B. Shah, Kannan Ramchandran, Martin J. Wainwright.

Source: The Annals of Statistics, Volume 47, Number 6, 3099--3126.

Abstract:
We consider sequential or active ranking of a set of $n$ items based on noisy pairwise comparisons. Items are ranked according to the probability that a given item beats a randomly chosen item, and ranking refers to partitioning the items into sets of prespecified sizes according to their scores. This notion of ranking includes as special cases the identification of the top-$k$ items and the total ordering of the items. We first analyze a sequential ranking algorithm that counts the number of comparisons won, and uses these counts to decide whether to stop, or to compare another pair of items, chosen based on confidence intervals specified by the data collected up to that point. We prove that this algorithm succeeds in recovering the ranking using a number of comparisons that is optimal up to logarithmic factors. This guarantee does not depend on whether or not the underlying pairwise probability matrix satisfies a particular structural property, unlike a significant body of past work on pairwise ranking based on parametric models such as the Thurstone or Bradley–Terry–Luce models. It has been a long-standing open question as to whether or not imposing these parametric assumptions allows for improved ranking algorithms. For stochastic comparison models, in which the pairwise probabilities are bounded away from zero, our second contribution is to resolve this issue by proving a lower bound for parametric models. This shows, perhaps surprisingly, that these popular parametric modeling choices offer at most logarithmic gains for stochastic comparisons.





Phase transition in the spiked random tensor with Rademacher prior

Wei-Kuo Chen.

Source: The Annals of Statistics, Volume 47, Number 5, 2734--2756.

Abstract:
We consider the problem of detecting a deformation from a symmetric Gaussian random $p$-tensor $(p\geq 3)$ with a rank-one spike sampled from the Rademacher prior. Recently, in Lesieur et al. (Barbier, Krzakala, Macris, Miolane and Zdeborová (2017)), it was proved that there exists a critical threshold $\beta_{p}$ so that when the signal-to-noise ratio exceeds $\beta_{p}$, one can distinguish the spiked and unspiked tensors and weakly recover the prior via the minimal mean-square-error method. On the other side, Perry, Wein and Bandeira (Perry, Wein and Bandeira (2017)) proved that there exists a $\beta_{p}'<\beta_{p}$ such that any statistical hypothesis test cannot distinguish these two tensors, in the sense that their total variation distance asymptotically vanishes, when the signal-to-noise ratio is less than $\beta_{p}'$. In this work, we show that $\beta_{p}$ is indeed the critical threshold that strictly separates the distinguishability and indistinguishability between the two tensors under the total variation distance. Our approach is based on a subtle analysis of the high temperature behavior of the pure $p$-spin model with Ising spin, arising initially from the field of spin glasses. In particular, we identify the signal-to-noise criticality $\beta_{p}$ as the critical temperature, distinguishing the high and low temperature behavior, of the Ising pure $p$-spin mean-field spin glass model.





Semiparametrically point-optimal hybrid rank tests for unit roots

Bo Zhou, Ramon van den Akker, Bas J. M. Werker.

Source: The Annals of Statistics, Volume 47, Number 5, 2601--2638.

Abstract:
We propose a new class of unit root tests that exploits invariance properties in the Locally Asymptotically Brownian Functional limit experiment associated to the unit root model. The invariance structures naturally suggest tests that are based on the ranks of the increments of the observations, their average and an assumed reference density for the innovations. The tests are semiparametric in the sense that they are valid, that is, have the correct (asymptotic) size, irrespective of the true innovation density. For a correctly specified reference density, our test is point-optimal and nearly efficient. For arbitrary reference densities, we establish a Chernoff–Savage-type result, that is, our test performs as well as commonly used tests under Gaussian innovations but has improved power under other, for example, fat-tailed or skewed, innovation distributions. To avoid nonparametric estimation, we propose a simplified version of our test that exhibits the same asymptotic properties, except for the Chernoff–Savage result that we are only able to demonstrate by means of simulations.





Semi-supervised inference: General theory and estimation of means

Anru Zhang, Lawrence D. Brown, T. Tony Cai.

Source: The Annals of Statistics, Volume 47, Number 5, 2538--2566.

Abstract:
We propose a general semi-supervised inference framework focused on the estimation of the population mean. As usual in semi-supervised settings, there exists an unlabeled sample of covariate vectors and a labeled sample consisting of covariate vectors along with real-valued responses (“labels”). Otherwise, the formulation is “assumption-lean” in that no major conditions are imposed on the statistical or functional form of the data. We consider both the ideal semi-supervised setting where infinitely many unlabeled samples are available, as well as the ordinary semi-supervised setting in which only a finite number of unlabeled samples is available. Estimators are proposed along with corresponding confidence intervals for the population mean. Theoretical analysis on both the asymptotic distribution and $\ell_{2}$-risk for the proposed procedures are given. Surprisingly, the proposed estimators, based on a simple form of the least squares method, outperform the ordinary sample mean. The simple, transparent form of the estimator lends confidence to the perception that its asymptotic improvement over the ordinary sample mean also nearly holds even for moderate size samples. The method is further extended to a nonparametric setting, in which the oracle rate can be achieved asymptotically. The proposed estimators are further illustrated by simulation studies and a real data example involving estimation of the homeless population.
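
The flavor of a least-squares-based semi-supervised mean estimator can be sketched as follows: regress the labels on the covariates in the labeled sample and use the unlabeled covariate means to correct the ordinary sample mean. This is a simplified stand-in; the paper's estimators and their refinements may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(7)
n, N, p = 200, 20000, 5                       # labeled size, unlabeled size, covariate dimension
beta = rng.normal(size=p)

x_lab = rng.normal(size=(n, p))
y = x_lab @ beta + rng.normal(size=n)         # labeled responses
x_unl = rng.normal(size=(N, p))               # unlabeled covariates

# Least-squares fit on the labeled data only.
design = np.column_stack([np.ones(n), x_lab])
coef = np.linalg.lstsq(design, y, rcond=None)[0]

# Semi-supervised estimator: adjust the sample mean by the covariate-mean shift,
# weighted by the fitted slopes.
naive = y.mean()
adjusted = naive + (x_unl.mean(axis=0) - x_lab.mean(axis=0)) @ coef[1:]
print("ordinary sample mean :", round(naive, 4))
print("semi-supervised mean :", round(adjusted, 4))   # typically closer to E[Y] = 0 here
```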





A knockoff filter for high-dimensional selective inference

Rina Foygel Barber, Emmanuel J. Candès.

Source: The Annals of Statistics, Volume 47, Number 5, 2504--2537.

Abstract:
This paper develops a framework for testing for associations in a possibly high-dimensional linear model where the number of features/variables may far exceed the number of observational units. In this framework, the observations are split into two groups, where the first group is used to screen for a set of potentially relevant variables, whereas the second is used for inference over this reduced set of variables; we also develop strategies for leveraging information from the first part of the data at the inference step for greater power. In our work, the inferential step is carried out by applying the recently introduced knockoff filter, which creates a knockoff copy—a fake variable serving as a control—for each screened variable. We prove that this procedure controls the directional false discovery rate (FDR) in the reduced model controlling for all screened variables; this says that our high-dimensional knockoff procedure “discovers” important variables as well as the directions (signs) of their effects, in such a way that the expected proportion of wrongly chosen signs is below the user-specified level (thereby controlling a notion of Type S error averaged over the selected set). This result is nonasymptotic, and holds for any distribution of the original features and any values of the unknown regression coefficients, so that inference is not calibrated under hypothesized values of the effect sizes. We demonstrate the performance of our general and flexible approach through numerical studies, showing more power than existing alternatives. Finally, we apply our method to a genome-wide association study to find locations on the genome that are possibly associated with a continuous phenotype.





Cross validation for locally stationary processes

Stefan Richter, Rainer Dahlhaus.

Source: The Annals of Statistics, Volume 47, Number 4, 2145--2173.

Abstract:
We propose an adaptive bandwidth selector via cross validation for local M-estimators in locally stationary processes. We prove asymptotic optimality of the procedure under mild conditions on the underlying parameter curves. The results are applicable to a wide range of locally stationary processes such as linear and nonlinear processes. A simulation study shows that the method works fairly well also in misspecified situations.





On deep learning as a remedy for the curse of dimensionality in nonparametric regression

Benedikt Bauer, Michael Kohler.

Source: The Annals of Statistics, Volume 47, Number 4, 2261--2285.

Abstract:
Assuming that a smoothness condition and a suitable restriction on the structure of the regression function hold, it is shown that least squares estimates based on multilayer feedforward neural networks are able to circumvent the curse of dimensionality in nonparametric regression. The proof is based on new approximation results concerning multilayer feedforward neural networks with bounded weights and a bounded number of hidden neurons. The estimates are compared with various other approaches by using simulated data.





Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem

James G. Scott, James O. Berger.

Source: Ann. Statist., Volume 38, Number 5, 2587--2619.

Abstract:
This paper studies the multiplicity-correction effect of standard Bayesian variable-selection priors in linear regression. Our first goal is to clarify when, and how, multiplicity correction happens automatically in Bayesian analysis, and to distinguish this correction from the Bayesian Ockham’s-razor effect. Our second goal is to contrast empirical-Bayes and fully Bayesian approaches to variable selection through examples, theoretical results and simulations. Considerable differences between the two approaches are found. In particular, we prove a theorem that characterizes a surprising asymptotic discrepancy between fully Bayes and empirical Bayes. This discrepancy arises from a different source than the failure to account for hyperparameter uncertainty in the empirical-Bayes estimate. Indeed, even at the extreme, when the empirical-Bayes estimate converges asymptotically to the true variable-inclusion probability, the potential for a serious difference remains.
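
For orientation, the standard fully Bayes device in this literature (the paper's own prior choices may differ in detail) puts a common inclusion probability $w$ on each of the $p$ candidate variables and integrates it out, yielding a marginal prior over models $\gamma$ (with $k_\gamma$ included variables) that penalizes larger model spaces automatically:

```latex
p(\gamma \mid w) \;=\; w^{k_\gamma}\,(1-w)^{\,p-k_\gamma},
\qquad w \sim \mathrm{Beta}(a,b),
\qquad
p(\gamma) \;=\; \int_0^1 w^{k_\gamma}(1-w)^{p-k_\gamma}\,
\frac{w^{a-1}(1-w)^{b-1}}{B(a,b)}\,dw
\;=\; \frac{B(a+k_\gamma,\; b+p-k_\gamma)}{B(a,b)}.
```

With $a=b=1$ this reduces to $p(\gamma)=1/\big((p+1)\binom{p}{k_\gamma}\big)$, so adding noise variables deflates every larger model's prior mass; an empirical-Bayes plug-in $\hat{w}$ removes the integration and, as the abstract notes, can behave quite differently.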





data warehouse

A large store of data for analysis. Organizations use data warehouses (and smaller 'data marts') to help them analyze historic transaction data to detect useful patterns and trends. First of all, the data is transferred into the data warehouse using a process called extracting, transforming and loading (ETL). Then it is organized and stored in the data warehouse in ways that optimize it for high-performance analysis. The transfer to a separate data warehouse system, which is usually performed as a regular batch job every night or at some other interval, insulates the live transaction systems from any side-effects of the analysis, but at the cost of not having the very latest data included in the analysis.
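
A minimal, hypothetical ETL sketch of the batch job described above (the table names, columns and in-memory databases are all invented for illustration): extract transactions from the operational store, transform them into an analysis-friendly shape, and load them into a warehouse fact table.

```python
import sqlite3
import pandas as pd

# Hypothetical systems, kept in memory here: a live transaction store and a separate warehouse.
oltp = sqlite3.connect(":memory:")
dwh = sqlite3.connect(":memory:")

# Stand-in for the operational system's order table.
pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": ["c1", "c1", "c2"],
    "amount": [19.99, 5.00, 42.50],
    "order_ts": ["2020-04-07 09:12:00", "2020-04-07 17:40:00", "2020-04-08 08:05:00"],
}).to_sql("orders", oltp, index=False)

# Extract: pull the raw transactions from the live system (in practice, only the latest batch).
orders = pd.read_sql_query("SELECT order_id, customer_id, amount, order_ts FROM orders", oltp)

# Transform: derive the analysis-friendly shape the warehouse is organized around.
orders["order_date"] = pd.to_datetime(orders["order_ts"]).dt.strftime("%Y-%m-%d")
daily = orders.groupby(["order_date", "customer_id"], as_index=False)["amount"].sum()

# Load: append into a warehouse fact table used for reporting and trend analysis.
daily.to_sql("fact_daily_sales", dwh, if_exists="append", index=False)
print(pd.read_sql_query("SELECT * FROM fact_daily_sales", dwh))
```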





semantics

Intended meaning. In computing, semantics is the assumed or explicit set of understandings used in a system to give meaning to data. One of the biggest challenges when integrating separate computer systems and applications is to correctly match up the intended meanings within each system. Simple metadata classifications such as 'price' or 'location' may have wildly different meanings in each system, while apparently different terms, such as 'client' and 'patient', may turn out to be effectively equivalent.
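
A toy, entirely hypothetical illustration of that matching problem: two systems expose records whose field names differ while their meanings coincide, and one identically named field whose meanings differ, so the integration code has to make the semantic mapping explicit.

```python
# Records from two hypothetical systems to be integrated.
crm_record = {"client": "A. Smith", "location": "ward 3", "price": 120.0}     # billing system
ehr_record = {"patient": "A. Smith", "location": "Boston", "price": 120.0}    # clinical system

# An explicit semantic mapping: which source fields mean the same thing,
# and which identically named fields must not be merged blindly.
field_map = {
    ("crm", "client"): "person_name",
    ("ehr", "patient"): "person_name",       # different terms, same meaning
    ("crm", "location"): "ward",             # same term, different meanings
    ("ehr", "location"): "city",
    ("crm", "price"): "billed_amount_usd",
    ("ehr", "price"): "billed_amount_usd",
}

def integrate(system, record):
    return {field_map[(system, key)]: value for key, value in record.items()}

print(integrate("crm", crm_record))
print(integrate("ehr", ehr_record))
```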





Correction: Sensitivity analysis for an unobserved moderator in RCT-to-target-population generalization of treatment effects

Trang Quynh Nguyen, Elizabeth A. Stuart.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 518--520.





Modeling wildfire ignition origins in southern California using linear network point processes

Medha Uppala, Mark S. Handcock.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 339--356.

Abstract:
This paper focuses on spatial and temporal modeling of point processes on linear networks. Point processes on linear networks can simply be defined as point events occurring on or near line segment network structures embedded in a certain space. A separable modeling framework is introduced that posits separate formation and dissolution models of point processes on linear networks over time. While the model was inspired by spider web building activity in brick mortar lines, the focus is on modeling wildfire ignition origins near road networks over a span of 14 years. As most wildfires in California have human-related origins, modeling the origin locations with respect to the road network provides insight into how human, vehicular and structural densities affect ignition occurrence. Model results show that roads that traverse different types of regions such as residential, interface and wildland regions have higher ignition intensities compared to roads that only exist in each of the mentioned region types.





Optimal asset allocation with multivariate Bayesian dynamic linear models

Jared D. Fisher, Davide Pettenuzzo, Carlos M. Carvalho.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 299--338.

Abstract:
We introduce a fast, closed-form, simulation-free method to model and forecast multiple asset returns and employ it to investigate the optimal ensemble of features to include when jointly predicting monthly stock and bond excess returns. Our approach builds on the Bayesian dynamic linear models of West and Harrison ( Bayesian Forecasting and Dynamic Models (1997) Springer), and it can objectively determine, through a fully automated procedure, both the optimal set of regressors to include in the predictive system and the degree to which the model coefficients, volatilities and covariances should vary over time. When applied to a portfolio of five stock and bond returns, we find that our method leads to large forecast gains, both in statistical and economic terms. In particular, we find that relative to a standard no-predictability benchmark, the optimal combination of predictors, stochastic volatility and time-varying covariances increases the annualized certainty equivalent returns of a leverage-constrained power utility investor by more than 500 basis points.
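
As a point of reference, the sketch below filters a univariate, known-variance special case of the West and Harrison dynamic linear model recursions (a single time-varying regression coefficient); the paper's multivariate system, stochastic volatilities and automated search over predictors are not reproduced here, and all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(8)
T = 120
x = rng.normal(size=T)                         # a single predictor (placeholder for, e.g., a valuation ratio)
theta = np.cumsum(0.05 * rng.normal(size=T))   # slowly time-varying coefficient
y = theta * x + 0.5 * rng.normal(size=T)       # simulated excess return series

# Univariate DLM:  y_t = x_t * theta_t + v_t,  theta_t = theta_{t-1} + w_t,
# filtered with known variances V, W (West & Harrison-style recursions).
V, W = 0.5 ** 2, 0.05 ** 2
m, C = 0.0, 1.0                                # prior mean / variance for theta_0
one_step_forecasts = np.empty(T)
for t in range(T):
    a, R = m, C + W                            # prior for theta_t
    f, Q = x[t] * a, x[t] ** 2 * R + V         # one-step forecast of y_t and its variance
    one_step_forecasts[t] = f
    A = R * x[t] / Q                           # adaptive (Kalman) gain
    m, C = a + A * (y[t] - f), R - A ** 2 * Q  # posterior for theta_t

print("one-step forecast RMSE:", np.sqrt(np.mean((y - one_step_forecasts) ** 2)).round(3))
```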





Feature selection for generalized varying coefficient mixed-effect models with application to obesity GWAS

Wanghuan Chu, Runze Li, Jingyuan Liu, Matthew Reimherr.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 276--298.

Abstract:
Motivated by an empirical analysis of data from a genome-wide association study on obesity, measured by the body mass index (BMI), we propose a two-step gene-detection procedure for generalized varying coefficient mixed-effects models with ultrahigh dimensional covariates. The proposed procedure selects significant single nucleotide polymorphisms (SNPs) impacting the mean BMI trend, some of which have already been biologically proven to be “fat genes.” The method also discovers SNPs that significantly influence the age-dependent variability of BMI. The proposed procedure takes into account individual variations of genetic effects and can also be directly applied to longitudinal data with continuous, binary or count responses. We employ Monte Carlo simulation studies to assess the performance of the proposed method and further carry out causal inference for the selected SNPs.





Estimating the health effects of environmental mixtures using Bayesian semiparametric regression and sparsity inducing priors

Joseph Antonelli, Maitreyi Mazumdar, David Bellinger, David Christiani, Robert Wright, Brent Coull.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 257--275.

Abstract:
Humans are routinely exposed to mixtures of chemical and other environmental factors, making the quantification of health effects associated with environmental mixtures a critical goal for establishing environmental policy sufficiently protective of human health. The quantification of the effects of exposure to an environmental mixture poses several statistical challenges. It is often the case that exposure to multiple pollutants interact with each other to affect an outcome. Further, the exposure-response relationship between an outcome and some exposures, such as some metals, can exhibit complex, nonlinear forms, since some exposures can be beneficial and detrimental at different ranges of exposure. To estimate the health effects of complex mixtures, we propose a flexible Bayesian approach that allows exposures to interact with each other and have nonlinear relationships with the outcome. We induce sparsity using multivariate spike and slab priors to determine which exposures are associated with the outcome and which exposures interact with each other. The proposed approach is interpretable, as we can use the posterior probabilities of inclusion into the model to identify pollutants that interact with each other. We utilize our approach to study the impact of exposure to metals on child neurodevelopment in Bangladesh and find a nonlinear, interactive relationship between arsenic and manganese.





Bayesian factor models for probabilistic cause of death assessment with verbal autopsies

Tsuyoshi Kunihama, Zehang Richard Li, Samuel J. Clark, Tyler H. McCormick.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 241--256.

Abstract:
The distribution of deaths by cause provides crucial information for public health planning, response and evaluation. About 60% of deaths globally are not registered or given a cause, limiting our ability to understand disease epidemiology. Verbal autopsy (VA) surveys are increasingly used in such settings to collect information on the signs, symptoms and medical history of people who have recently died. This article develops a novel Bayesian method for estimation of population distributions of deaths by cause using verbal autopsy data. The proposed approach is based on a multivariate probit model where associations among items in questionnaires are flexibly induced by latent factors. Using the Population Health Metrics Research Consortium labeled data that include both VA and medically certified causes of death, we assess performance of the proposed method. Further, we estimate important questionnaire items that are highly associated with causes of death. This framework provides insights that will simplify future data collection.





Assessing wage status transition and stagnation using quantile transition regression

Chih-Yuan Hsu, Yi-Hau Chen, Ruoh-Rong Yu, Tsung-Wei Hung.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 160--177.

Abstract:
Workers in Taiwan overall have been suffering from long-lasting wage stagnation since the mid-1990s. In particular, there seems to be little mobility for the wages of Taiwanese workers to transit across wage quantile groups. It is of interest to see if certain groups of workers, such as female, lower educated and younger generation workers, suffer from the problem more seriously than the others. This work tries to apply a systematic statistical approach to study this issue, based on the longitudinal data from the Panel Study of Family Dynamics (PSFD) survey conducted in Taiwan since 1999. We propose the quantile transition regression model, generalizing recent methodology for quantile association, to assess the wage status transition with respect to the marginal wage quantiles over time as well as the effects of certain demographic and job factors on the wage status transition. Estimation of the model can be based on the composite likelihoods utilizing the binary, or ordinal-data information regarding the quantile transition, with the associated asymptotic theory established. A goodness-of-fit procedure for the proposed model is developed. The performances of the estimation and the goodness-of-fit procedures for the quantile transition model are illustrated through simulations. The application of the proposed methodology to the PSFD survey data suggests that female, private-sector workers with higher age and education below postgraduate level suffer from more severe wage status stagnation than the others.





Bayesian indicator variable selection to incorporate hierarchical overlapping group structure in multi-omics applications

Li Zhu, Zhiguang Huo, Tianzhou Ma, Steffi Oesterreich, George C. Tseng.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2611--2636.

Abstract:
Variable selection is a pervasive problem in modern high-dimensional data analysis where the number of features often exceeds the sample size (a.k.a. small-n-large-p problem). Incorporation of group structure knowledge to improve variable selection has been widely studied. Here, we consider prior knowledge of a hierarchical overlapping group structure to improve variable selection in regression setting. In genomics applications, for instance, a biological pathway contains tens to hundreds of genes and a gene can be mapped to multiple experimentally measured features (such as its mRNA expression, copy number variation and methylation levels of possibly multiple sites). In addition to the hierarchical structure, the groups at the same level may overlap (e.g., two pathways can share common genes). Incorporating such hierarchical overlapping groups in traditional penalized regression setting remains a difficult optimization problem. Alternatively, we propose a Bayesian indicator model that can elegantly serve the purpose. We evaluate the model in simulations and two breast cancer examples, and demonstrate its superior performance over existing models. The result not only enhances prediction accuracy but also improves variable selection and model interpretation that lead to deeper biological insight of the disease.





Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: A winning solution to the NIJ “Real-Time Crime Forecasting Challenge”

Seth Flaxman, Michael Chirico, Pau Pereira, Charles Loeffler.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2564--2585.

Abstract:
We propose a generic spatiotemporal event forecasting method which we developed for the National Institute of Justice’s (NIJ) Real-Time Crime Forecasting Challenge (National Institute of Justice (2017)). Our method is a spatiotemporal forecasting model combining scalable randomized Reproducing Kernel Hilbert Space (RKHS) methods for approximating Gaussian processes with autoregressive smoothing kernels in a regularized supervised learning framework. While the smoothing kernels capture the two main approaches in current use in the field of crime forecasting, kernel density estimation (KDE) and self-exciting point process (SEPP) models, the RKHS component of the model can be understood as an approximation to the popular log-Gaussian Cox Process model. For inference, we discretize the spatiotemporal point pattern and learn a log-intensity function using the Poisson likelihood and highly efficient gradient-based optimization methods. Model hyperparameters including quality of RKHS approximation, spatial and temporal kernel lengthscales, number of autoregressive lags and bandwidths for smoothing kernels as well as cell shape, size and rotation, were learned using cross validation. Resulting predictions significantly exceeded baseline KDE estimates and SEPP models for sparse events.
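
A sketch of two of the ingredients named above, random Fourier features approximating a squared-exponential RKHS and a penalized Poisson log-likelihood fit on discretized counts, using scikit-learn's RBFSampler and PoissonRegressor on synthetic data; the crime-specific smoothing kernels, autoregressive lags and cross-validated hyperparameters are not reproduced here.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(9)

# Discretized spatiotemporal grid: cell centers (i, j) and a week index t.
grid = np.array([(i, j, t) for t in range(20) for i in range(15) for j in range(15)], dtype=float)
# Synthetic event intensity: a smooth bump that drifts slowly over time.
rate = 0.2 + 2.0 * np.exp(-((grid[:, 0] - 7 - 0.1 * grid[:, 2]) ** 2 + (grid[:, 1] - 7) ** 2) / 8.0)
counts = rng.poisson(rate)

train = grid[:, 2] < 15                        # first 15 weeks for fitting, the rest held out

# Random Fourier features approximate a squared-exponential RKHS / GP prior;
# a penalized Poisson log-likelihood is then fit on the feature map.
features = RBFSampler(gamma=0.05, n_components=500, random_state=0)
Z = features.fit_transform(grid)
model = PoissonRegressor(alpha=1e-3, max_iter=500).fit(Z[train], counts[train])

pred = model.predict(Z[~train])
print("correlation between predicted and true intensity on held-out weeks:",
      np.corrcoef(pred, rate[~train])[0, 1].round(3))
```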





A hierarchical curve-based approach to the analysis of manifold data

Liberty Vittert, Adrian W. Bowman, Stanislav Katina.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2539--2563.

Abstract:
One of the data structures generated by medical imaging technology is high resolution point clouds representing anatomical surfaces. Stereophotogrammetry and laser scanning are two widely available sources of this kind of data. A standardised surface representation is required to provide a meaningful correspondence across different images as a basis for statistical analysis. Point locations with anatomical definitions, referred to as landmarks, have been the traditional approach. Landmarks can also be taken as the starting point for more general surface representations, often using templates which are warped on to an observed surface by matching landmark positions and subsequent local adjustment of the surface. The aim of the present paper is to provide a new approach which places anatomical curves at the heart of the surface representation and its analysis. Curves provide intermediate structures which capture the principal features of the manifold (surface) of interest through its ridges and valleys. As landmarks are often available these are used as anchoring points, but surface curvature information is the principal guide in estimating the curve locations. The surface patches between these curves are relatively flat and can be represented in a standardised manner by appropriate surface transects to give a complete surface model. This new approach does not require the use of a template, reference sample or any external information to guide the method and, when compared with a surface based approach, the estimation of curves is shown to have improved performance. In addition, examples involving applications to mussel shells and human faces show that the analysis of curve information can deliver more targeted and effective insight than the use of full surface information.





Empirical Bayes analysis of RNA sequencing experiments with auxiliary information

Kun Liang.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2452--2482.

Abstract:
Finding differentially expressed genes is a common task in high-throughput transcriptome studies. While traditional statistical methods rank the genes by their test statistics alone, we analyze an RNA sequencing dataset using the auxiliary information of gene length and the test statistics from a related microarray study. Given the auxiliary information, we propose a novel nonparametric empirical Bayes procedure to estimate the posterior probability of differential expression for each gene. We demonstrate the advantage of our procedure in extensive simulation studies and a psoriasis RNA sequencing study. The companion R package calm is available at Bioconductor.





Outline analyses of the called strike zone in Major League Baseball

Dale L. Zimmerman, Jun Tang, Rui Huang.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2416--2451.

Abstract:
We extend statistical shape analytic methods known as outline analysis for application to the strike zone, a central feature of the game of baseball. Although the strike zone is rigorously defined by Major League Baseball’s official rules, umpires make mistakes in calling pitches as strikes (and balls) and may even adhere to a strike zone somewhat different than that prescribed by the rule book. Our methods yield inference on geometric attributes (centroid, dimensions, orientation and shape) of this “called strike zone” (CSZ) and on the effects that years, umpires, player attributes, game situation factors and their interactions have on those attributes. The methodology consists of first using kernel discriminant analysis to determine a noisy outline representing the CSZ corresponding to each factor combination, then fitting existing elliptic Fourier and new generalized superelliptic models for closed curves to that outline and finally analyzing the fitted model coefficients using standard methods of regression analysis, factorial analysis of variance and variance component estimation. We apply these methods to PITCHf/x data comprising more than three million called pitches from the 2008–2016 Major League Baseball seasons to address numerous questions about the CSZ. We find that all geometric attributes of the CSZ, except its size, became significantly more like those of the rule-book strike zone from 2008–2016 and that several player attribute/game situation factors had statistically and practically significant effects on many of them. We also establish that the variation in the horizontal center, width and area of an individual umpire’s CSZ from pitch to pitch is smaller than their variation among CSZs from different umpires.
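
To illustrate outline analysis in miniature, the sketch below represents a closed curve by a truncated Fourier series of its coordinates, a simplified relative of the elliptic Fourier and superelliptic models fitted in the paper; the rounded, strike-zone-like outline is synthetic.

```python
import numpy as np

# A synthetic closed outline: a superellipse-style rounded rectangle, vaguely strike-zone shaped.
s = np.linspace(0, 2 * np.pi, 400, endpoint=False)
x = 0.85 * np.sign(np.cos(s)) * np.abs(np.cos(s)) ** 0.4
y = 1.10 * np.sign(np.sin(s)) * np.abs(np.sin(s)) ** 0.4

def fourier_outline(x, y, n_harmonics):
    """Truncated Fourier representation of a closed curve (x(t), y(t))."""
    z = x + 1j * y
    coef = np.fft.fft(z) / len(z)
    keep = np.zeros_like(coef)
    keep[0] = coef[0]                                   # centroid term
    for h in range(1, n_harmonics + 1):                 # low-frequency harmonics only
        keep[h], keep[-h] = coef[h], coef[-h]
    recon = np.fft.ifft(keep * len(z))
    return recon.real, recon.imag

for H in (1, 3, 8):
    xr, yr = fourier_outline(x, y, H)
    err = np.mean(np.hypot(xr - x, yr - y))
    print(f"{H:2d} harmonics: mean reconstruction error {err:.4f}")
```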





Predicting paleoclimate from compositional data using multivariate Gaussian process inverse prediction

John R. Tipton, Mevin B. Hooten, Connor Nolan, Robert K. Booth, Jason McLachlan.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2363--2388.

Abstract:
Multivariate compositional count data arise in many applications including ecology, microbiology, genetics and paleoclimate. A frequent question in the analysis of multivariate compositional count data is what underlying values of a covariate(s) give rise to the observed composition. Learning the relationship between covariates and the compositional count allows for inverse prediction of unobserved covariates given compositional count observations. Gaussian processes provide a flexible framework for modeling functional responses with respect to a covariate without assuming a functional form. Many scientific disciplines use Gaussian process approximations to improve prediction and make inference on latent processes and parameters. When prediction is desired on unobserved covariates given realizations of the response variable, this is called inverse prediction. Because inverse prediction is often mathematically and computationally challenging, predicting unobserved covariates often requires fitting models that are different from the hypothesized generative model. We present a novel computational framework that allows for efficient inverse prediction using a Gaussian process approximation to generative models. Our framework enables scientific learning about how the latent processes co-vary with respect to covariates while simultaneously providing predictions of missing covariates. The proposed framework is capable of efficiently exploring the high dimensional, multi-modal latent spaces that arise in the inverse problem. To demonstrate flexibility, we apply our method in a generalized linear model framework to predict latent climate states given multivariate count data. Based on cross-validation, our model has predictive skill competitive with current methods while simultaneously providing formal, statistical inference on the underlying community dynamics of the biological system previously not available.





A latent discrete Markov random field approach to identifying and classifying historical forest communities based on spatial multivariate tree species counts

Stephen Berg, Jun Zhu, Murray K. Clayton, Monika E. Shea, David J. Mladenoff.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2312--2340.

Abstract:
The Wisconsin Public Land Survey database describes historical forest composition at high spatial resolution and is of interest in ecological studies of forest composition in Wisconsin just prior to significant Euro-American settlement. For such studies it is useful to identify recurring subpopulations of tree species known as communities, but standard clustering approaches for subpopulation identification do not account for dependence between spatially nearby observations. Here, we develop and fit a latent discrete Markov random field model for the purpose of identifying and classifying historical forest communities based on spatially referenced multivariate tree species counts across Wisconsin. We show empirically for the actual dataset and through simulation that our latent Markov random field modeling approach improves prediction and parameter estimation performance. For model fitting we introduce a new stochastic approximation algorithm which enables computationally efficient estimation and classification of large amounts of spatial multivariate count data.