l Maximum Independent Component Analysis with Application to EEG Data By projecteuclid.org Published On :: Tue, 03 Mar 2020 04:00 EST Ruosi Guo, Chunming Zhang, Zhengjun Zhang. Source: Statistical Science, Volume 35, Number 1, 145--157.Abstract: In many scientific disciplines, finding hidden influential factors behind observational data is essential but challenging. The majority of existing approaches, such as the independent component analysis (${\mathrm{ICA}}$), rely on linear transformation, that is, true signals are linear combinations of hidden components. Motivated by the analysis of nonlinear temporal signals in neuroscience, genetics, and finance, this paper proposes the “maximum independent component analysis” (${\mathrm{MaxICA}}$), based on max-linear combinations of components. In contrast to existing methods, ${\mathrm{MaxICA}}$ benefits from focusing on significant major components while filtering out ignorable components. A major tool for parameter learning of ${\mathrm{MaxICA}}$ is an augmented genetic algorithm, consisting of three schemes for the elite weighted sum selection, randomly combined crossover, and dynamic mutation. Extensive empirical evaluations demonstrate the effectiveness of ${\mathrm{MaxICA}}$ in either extracting max-linearly combined essential sources in many applications or supplying a better approximation for nonlinearly combined source signals, such as $\mathrm{EEG}$ recordings analyzed in this paper. Full Article
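As a rough illustration of the distinction the abstract above draws, the following sketch contrasts a linear mixture with a max-linear mixture of the same sources. It is not the authors' implementation: the function names, the toy weights and the toy sources are all assumptions made for illustration, and the genetic-algorithm learning step is not shown.

```python
def linear_mix(weights, sources):
    """Classical ICA-style mixing: X[i][t] = sum_j weights[i][j] * sources[j][t]."""
    n_times = len(sources[0])
    return [
        [sum(w * s[t] for w, s in zip(row, sources)) for t in range(n_times)]
        for row in weights
    ]


def max_linear_mix(weights, sources):
    """Max-linear mixing: X[i][t] = max_j weights[i][j] * sources[j][t].

    Only the dominant weighted component contributes at each time point,
    so minor components are filtered out rather than summed in.
    """
    n_times = len(sources[0])
    return [
        [max(w * s[t] for w, s in zip(row, sources)) for t in range(n_times)]
        for row in weights
    ]


# Hypothetical toy data: two hidden sources observed at three time points.
sources = [[1.0, 0.2, 3.0], [0.5, 2.0, 0.1]]
weights = [[1.0, 0.5], [0.2, 1.0]]

lin_mixed = linear_mix(weights, sources)
max_mixed = max_linear_mix(weights, sources)
```

In the toy data, each observed value of `max_mixed` equals exactly one weighted source value, whereas every value of `lin_mixed` blends both sources.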
l Statistical Inference for the Evolutionary History of Cancer Genomes By projecteuclid.org Published On :: Tue, 03 Mar 2020 04:00 EST Khanh N. Dinh, Roman Jaksik, Marek Kimmel, Amaury Lambert, Simon Tavaré. Source: Statistical Science, Volume 35, Number 1, 129--144.Abstract: Recent years have seen considerable work on inference about cancer evolution from mutations identified in cancer samples. Much of the modeling work has been based on classical models of population genetics, generalized to accommodate time-varying cell population size. Reverse-time, genealogical views of such models, commonly known as coalescents, have been used to infer aspects of the past of growing populations. Another approach is to use branching processes, the simplest scenario being the classical linear birth-death process. Inference from evolutionary models of DNA often exploits summary statistics of the sequence data, a common one being the so-called Site Frequency Spectrum (SFS). In a bulk tumor sequencing experiment, we can estimate for each site at which a novel somatic point mutation has arisen, the proportion of cells that carry that mutation. These numbers are then grouped into collections of sites which have similar mutant fractions. We examine how the SFS based on birth-death processes differs from those based on the coalescent model. This may stem from the different sampling mechanisms in the two approaches. However, we also show that despite this, they are quantitatively comparable for the range of parameters typical for tumor cell populations. We also present a model of tumor evolution with selective sweeps, and demonstrate how it may help in understanding the history of a tumor as well as the influence of data pre-processing. We illustrate the theory with applications to several examples from The Cancer Genome Atlas tumors. Full Article
l Data Denoising and Post-Denoising Corrections in Single Cell RNA Sequencing By projecteuclid.org Published On :: Tue, 03 Mar 2020 04:00 EST Divyansh Agarwal, Jingshu Wang, Nancy R. Zhang. Source: Statistical Science, Volume 35, Number 1, 112--128.Abstract: Single cell sequencing technologies are transforming biomedical research. However, due to the inherent nature of the data, single cell RNA sequencing analysis poses new computational and statistical challenges. We begin with a survey of a selection of topics in this field, with a gentle introduction to the biology and a more detailed exploration of the technical noise. We consider in detail the problem of single cell data denoising, sometimes referred to as “imputation” in the relevant literature. We discuss why this is not a typical statistical imputation problem, and review current approaches to this problem. We then explore why the use of denoised values in downstream analyses invites novel statistical insights, and how denoising uncertainty should be accounted for to yield valid statistical inference. The utilization of denoised or imputed matrices in statistical inference is not unique to single cell genomics, and arises in many other fields. We describe the challenges in this type of analysis, discuss some preliminary solutions, and highlight unresolved issues. Full Article
l Statistical Molecule Counting in Super-Resolution Fluorescence Microscopy: Towards Quantitative Nanoscopy By projecteuclid.org Published On :: Tue, 03 Mar 2020 04:00 EST Thomas Staudt, Timo Aspelmeier, Oskar Laitenberger, Claudia Geisler, Alexander Egner, Axel Munk. Source: Statistical Science, Volume 35, Number 1, 92--111.Abstract: Super-resolution microscopy is rapidly gaining importance as an analytical tool in the life sciences. A compelling feature is the ability to label biological units of interest with fluorescent markers in (living) cells and to observe them with considerably higher resolution than conventional microscopy permits. The images obtained this way, however, lack an absolute intensity scale in terms of numbers of fluorophores observed. In this article, we discuss state of the art methods to count such fluorophores and statistical challenges that come along with it. In particular, we suggest a modeling scheme for time series generated by single-marker-switching (SMS) microscopy that makes it possible to quantify the number of markers in a statistically meaningful manner from the raw data. To this end, we model the entire process of photon generation in the fluorophore, their passage through the microscope, detection and photoelectron amplification in the camera, and extraction of time series from the microscopic images. At the heart of these modeling steps is a careful description of the fluorophore dynamics by a novel hidden Markov model that operates on two timescales (HTMM). Besides the fluorophore number, information about the kinetic transition rates of the fluorophore’s internal states is also inferred during estimation. We comment on computational issues that arise when applying our model to simulated or measured fluorescence traces and illustrate our methodology on simulated data. Full Article
l Statistical Methodology in Single-Molecule Experiments By projecteuclid.org Published On :: Tue, 03 Mar 2020 04:00 EST Chao Du, S. C. Kou. Source: Statistical Science, Volume 35, Number 1, 75--91.Abstract: Toward the last quarter of the 20th century, the emergence of single-molecule experiments enabled scientists to track and study individual molecules’ dynamic properties in real time. Unlike macroscopic systems’ dynamics, those of single molecules can only be properly described by stochastic models even in the absence of external noise. Consequently, statistical methods have played a key role in extracting hidden information about molecular dynamics from data obtained through single-molecule experiments. In this article, we survey the major statistical methodologies used to analyze single-molecule experimental data. Our discussion is organized according to the types of stochastic models used to describe single-molecule systems as well as major experimental data collection techniques. We also highlight challenges and future directions in the application of statistical methodologies to single-molecule experiments. Full Article
l Quantum Science and Quantum Technology By projecteuclid.org Published On :: Tue, 03 Mar 2020 04:00 EST Yazhen Wang, Xinyu Song. Source: Statistical Science, Volume 35, Number 1, 51--74.Abstract: Quantum science and quantum technology are of great current interest in multiple frontiers of many scientific fields ranging from computer science to physics and chemistry, and from engineering to mathematics and statistics. Their developments will likely lead to a new wave of scientific revolutions and technological innovations in a wide range of scientific studies and applications. This paper provides a brief review on quantum communication, quantum information, quantum computation, quantum simulation, and quantum metrology. We present essential quantum properties, illustrate relevant concepts of quantum science and quantum technology, and discuss their scientific developments. We point out the need for statistical analysis in their developments, as well as their potential applications to and impacts on statistics and data science. Full Article
l A Tale of Two Parasites: Statistical Modelling to Support Disease Control Programmes in Africa By projecteuclid.org Published On :: Tue, 03 Mar 2020 04:00 EST Peter J. Diggle, Emanuele Giorgi, Julienne Atsame, Sylvie Ntsame Ella, Kisito Ogoussan, Katherine Gass. Source: Statistical Science, Volume 35, Number 1, 42--50.Abstract: Vector-borne diseases have long presented major challenges to the health of rural communities in the wet tropical regions of the world, but especially in sub-Saharan Africa. In this paper, we describe the contribution that statistical modelling has made to the global elimination programme for one vector-borne disease, onchocerciasis. We explain why information on the spatial distribution of a second vector-borne disease, Loa loa, is needed before communities at high risk of onchocerciasis can be treated safely with mass distribution of ivermectin, an antifilarial medication. We show how a model-based geostatistical analysis of Loa loa prevalence survey data can be used to map the predictive probability that each location in the region of interest meets a WHO policy guideline for safe mass distribution of ivermectin and describe two applications: one is to data from Cameroon that assesses prevalence using traditional blood-smear microscopy; the other is to Africa-wide data that uses a low-cost questionnaire-based method. We describe how a recent technological development in image-based microscopy has resulted in a change of emphasis from prevalence alone to the bivariate spatial distribution of prevalence and the intensity of infection among infected individuals. We discuss how statistical modelling of the kind described here can contribute to health policy guidelines and decision-making in two ways. One is to ensure that, in a resource-limited setting, prevalence surveys are designed, and the resulting data analysed, as efficiently as possible. The other is to provide an honest quantification of the uncertainty attached to any binary decision by reporting predictive probabilities that a policy-defined condition for action is or is not met. Full Article
l Some Statistical Issues in Climate Science By projecteuclid.org Published On :: Tue, 03 Mar 2020 04:00 EST Michael L. Stein. Source: Statistical Science, Volume 35, Number 1, 31--41.Abstract: Climate science is a field that is arguably both data-rich and data-poor. Data rich in that huge and quickly increasing amounts of data about the state of the climate are collected every day. Data poor in that important aspects of the climate are still undersampled, such as the deep oceans and some characteristics of the upper atmosphere. Data rich in that modern climate models can produce climatological quantities over long time periods with global coverage, including quantities that are difficult to measure and under conditions for which there are presently no data. Data poor in that the correspondence between climate model output and the actual climate, especially for future climate change due to human activities, is difficult to assess. The scope for fruitful interactions between climate scientists and statisticians is great, but requires serious commitments from researchers in both disciplines to understand the scientific and statistical nuances arising from the complex relationships between the data and the real-world problems. This paper describes a small fraction of the intellectual challenges that occur at the interface between climate science and statistics, including inferences for extremes for processes with seasonality and long-term trends, the use of climate model ensembles for studying extremes, the scope for using new data sources for studying space-time characteristics of environmental processes and a discussion of non-Gaussian space-time process models for climate variables. The paper concludes with a call to the statistical community to become more engaged in one of the great scientific and policy issues of our time, anthropogenic climate change and its impacts. Full Article
l Risk Models for Breast Cancer and Their Validation By projecteuclid.org Published On :: Tue, 03 Mar 2020 04:00 EST Adam R. Brentnall, Jack Cuzick. Source: Statistical Science, Volume 35, Number 1, 14--30.Abstract: Strategies to prevent cancer and diagnose it early when it is most treatable are needed to reduce the public health burden from rising disease incidence. Risk assessment is playing an increasingly important role in targeting individuals in need of such interventions. For breast cancer many individual risk factors have been well understood for a long time, but the development of a fully comprehensive risk model has not been straightforward, in part because there have been limited data where joint effects of an extensive set of risk factors may be estimated with precision. In this article we first review the approach taken to develop the IBIS (Tyrer–Cuzick) model, and describe recent updates. We then review and develop methods to assess calibration of models such as this one, where the risk of disease allowing for competing mortality over a long follow-up time or lifetime is estimated. The breast cancer risk model and calibration assessment methods are demonstrated using a cohort of 132,139 women attending mammography screening in the State of Washington, USA. Full Article
l Model-Based Approach to the Joint Analysis of Single-Cell Data on Chromatin Accessibility and Gene Expression By projecteuclid.org Published On :: Tue, 03 Mar 2020 04:00 EST Zhixiang Lin, Mahdi Zamanighomi, Timothy Daley, Shining Ma, Wing Hung Wong. Source: Statistical Science, Volume 35, Number 1, 2--13.Abstract: Unsupervised methods, including clustering methods, are essential to the analysis of single-cell genomic data. Model-based clustering methods are under-explored in the area of single-cell genomics, and have the advantage of quantifying the uncertainty of the clustering result. Here we develop a model-based approach for the integrative analysis of single-cell chromatin accessibility and gene expression data. We show that combining these two types of data, we can achieve a better separation of the underlying cell types. An efficient Markov chain Monte Carlo algorithm is also developed. Full Article
l Introduction to the Special Issue By projecteuclid.org Published On :: Tue, 03 Mar 2020 04:00 EST Source: Statistical Science, Volume 35, Number 1, 1--1. Full Article
l Statistical Theory Powering Data Science By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Junhui Cai, Avishai Mandelbaum, Chaitra H. Nagaraja, Haipeng Shen, Linda Zhao. Source: Statistical Science, Volume 34, Number 4, 669--691.Abstract: Statisticians are finding their place in the emerging field of data science. However, many issues considered “new” in data science have long histories in statistics. Examples of using statistical thinking are illustrated, which range from exploratory data analysis to measuring uncertainty to accommodating nonrandom samples. These examples are then applied to service networks, baseball predictions and official statistics. Full Article
l Larry Brown’s Work on Admissibility By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Iain M. Johnstone. Source: Statistical Science, Volume 34, Number 4, 657--668.Abstract: Many papers in the early part of Brown’s career focused on the admissibility or otherwise of estimators of a vector parameter. He established that inadmissibility of invariant estimators in three and higher dimensions is a general phenomenon, and found deep and beautiful connections between admissibility and other areas of mathematics. This review touches on several of his major contributions, with a focus on his celebrated 1971 paper connecting admissibility, recurrence and elliptic partial differential equations. Full Article
l Gaussianization Machines for Non-Gaussian Function Estimation Models By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST T. Tony Cai. Source: Statistical Science, Volume 34, Number 4, 635--656.Abstract: A wide range of nonparametric function estimation models have been studied individually in the literature. Among them the homoscedastic nonparametric Gaussian regression is arguably the best known and understood. Inspired by the asymptotic equivalence theory, Brown, Cai and Zhou (Ann. Statist. 36 (2008) 2055–2084; Ann. Statist. 38 (2010) 2005–2046) and Brown et al. (Probab. Theory Related Fields 146 (2010) 401–433) developed a unified approach to turn a collection of non-Gaussian function estimation models into a standard Gaussian regression and any good Gaussian nonparametric regression method can then be used. These Gaussianization Machines have two key components, binning and transformation. When combined with BlockJS, a wavelet thresholding procedure for Gaussian regression, the procedures are computationally efficient with strong theoretical guarantees. Technical analysis given in Brown, Cai and Zhou (Ann. Statist. 36 (2008) 2055–2084; Ann. Statist. 38 (2010) 2005–2046) and Brown et al. (Probab. Theory Related Fields 146 (2010) 401–433) shows that the estimators attain the optimal rate of convergence adaptively over a large set of Besov spaces and across a collection of non-Gaussian function estimation models, including robust nonparametric regression, density estimation, and nonparametric regression in exponential families. The estimators are also spatially adaptive. The Gaussianization Machines significantly extend the flexibility and scope of the theories and methodologies originally developed for the conventional nonparametric Gaussian regression. This article aims to provide a concise account of the Gaussianization Machines developed in Brown, Cai and Zhou (Ann. Statist. 36 (2008) 2055–2084; Ann. Statist. 38 (2010) 2005–2046) and Brown et al. (Probab. Theory Related Fields 146 (2010) 401–433). Full Article
l Larry Brown’s Contributions to Parametric Inference, Decision Theory and Foundations: A Survey By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST James O. Berger, Anirban DasGupta. Source: Statistical Science, Volume 34, Number 4, 621--634.Abstract: This article gives a panoramic survey of the general area of parametric statistical inference, decision theory and foundations of statistics for the period 1965–2010 through the lens of Larry Brown’s contributions to varied aspects of this massive area. The article goes over sufficiency, shrinkage estimation, admissibility, minimaxity, complete class theorems, estimated confidence, conditional confidence procedures, Edgeworth and higher order asymptotic expansions, variational Bayes, Stein’s SURE, differential inequalities, geometrization of convergence rates, asymptotic equivalence, aspects of empirical process theory, inference after model selection, unified frequentist and Bayesian testing, and Wald’s sequential theory. A reasonably comprehensive bibliography is provided. Full Article
l Models as Approximations—Rejoinder By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Andreas Buja, Arun Kumar Kuchibhotla, Richard Berk, Edward George, Eric Tchetgen Tchetgen, Linda Zhao. Source: Statistical Science, Volume 34, Number 4, 606--620.Abstract: We respond to the discussants of our articles emphasizing the importance of inference under misspecification in the context of the reproducibility/replicability crisis. Along the way, we discuss the roles of diagnostics and model building in regression as well as connections between our well-specification framework and semiparametric theory. Full Article
l Discussion: Models as Approximations By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Dalia Ghanem, Todd A. Kuffner. Source: Statistical Science, Volume 34, Number 4, 604--605. Full Article
l Comment: Statistical Inference from a Predictive Perspective By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Alessandro Rinaldo, Ryan J. Tibshirani, Larry Wasserman. Source: Statistical Science, Volume 34, Number 4, 599--603.Abstract: What is the meaning of a regression parameter? Why is this the de facto standard object of interest for statistical inference? These are delicate issues, especially when the model is misspecified. We argue that focusing on predictive quantities may be a desirable alternative. Full Article
l Comment: Models as (Deliberate) Approximations By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST David Whitney, Ali Shojaie, Marco Carone. Source: Statistical Science, Volume 34, Number 4, 591--598. Full Article
l Comment: Models Are Approximations! By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Anthony C. Davison, Erwan Koch, Jonathan Koh. Source: Statistical Science, Volume 34, Number 4, 584--590.Abstract: This discussion focuses on areas of disagreement with the papers, particularly the target of inference and the case for using the robust ‘sandwich’ variance estimator in the presence of moderate mis-specification. We also suggest that existing procedures may be appreciably more powerful for detecting mis-specification than the authors’ RAV statistic, and comment on the use of the pairs bootstrap in balanced situations. Full Article
l Comment: “Models as Approximations I: Consequences Illustrated with Linear Regression” by A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, L. Zhan and K. Zhang By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Roderick J. Little. Source: Statistical Science, Volume 34, Number 4, 580--583. Full Article
l Discussion of Models as Approximations I & II By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Dag Tjøstheim. Source: Statistical Science, Volume 34, Number 4, 575--579. Full Article
l Comment: Models as Approximations By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Nikki L. B. Freeman, Xiaotong Jiang, Owen E. Leete, Daniel J. Luckett, Teeranan Pokaprakarn, Michael R. Kosorok. Source: Statistical Science, Volume 34, Number 4, 572--574. Full Article
l Comment on Models as Approximations, Parts I and II, by Buja et al. By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Jerald F. Lawless. Source: Statistical Science, Volume 34, Number 4, 569--571.Abstract: I comment on the papers Models as Approximations I and II, by A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, M. Traskin, L. Zhao and K. Zhang. Full Article
l Discussion of Models as Approximations I & II By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Sara van de Geer. Source: Statistical Science, Volume 34, Number 4, 566--568.Abstract: We discuss the papers “Models as Approximations” I & II, by A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, M. Traskin, L. Zhao and K. Zhang (Part I) and A. Buja, L. Brown, A. K. Kuchibhotla, R. Berk, E. George and L. Zhao (Part II). We present a summary with some details for the generalized linear model. Full Article
l Models as Approximations II: A Model-Free Theory of Parametric Regression By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Andreas Buja, Lawrence Brown, Arun Kumar Kuchibhotla, Richard Berk, Edward George, Linda Zhao. Source: Statistical Science, Volume 34, Number 4, 545--565.Abstract: We develop a model-free theory of general types of parametric regression for i.i.d. observations. The theory replaces the parameters of parametric models with statistical functionals, to be called “regression functionals,” defined on large nonparametric classes of joint ${x\textrm{-}y}$ distributions, without assuming a correct model. Parametric models are reduced to heuristics to suggest plausible objective functions. An example of a regression functional is the vector of slopes of linear equations fitted by OLS to largely arbitrary ${x\textrm{-}y}$ distributions, without assuming a linear model (see Part I). More generally, regression functionals can be defined by minimizing objective functions, solving estimating equations, or with ad hoc constructions. In this framework, it is possible to achieve the following: (1) define a notion of “well-specification” for regression functionals that replaces the notion of correct specification of models, (2) propose a well-specification diagnostic for regression functionals based on reweighting distributions and data, (3) decompose sampling variability of regression functionals into two sources, one due to the conditional response distribution and another due to the regressor distribution interacting with misspecification, both of order $N^{-1/2}$, (4) exhibit plug-in/sandwich estimators of standard error as limit cases of ${x\textrm{-}y}$ bootstrap estimators, and (5) provide theoretical heuristics to indicate that ${x\textrm{-}y}$ bootstrap standard errors may generally be preferred over sandwich estimators. Full Article
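The x-y bootstrap named in the abstract above resamples whole (x, y) pairs, so the regressors are treated as random rather than fixed. A minimal sketch for a single OLS slope might look as follows; the helper names, the toy data, the number of resamples and the seed are assumptions for illustration, not anything taken from the paper.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def xy_bootstrap_se(x, y, n_boot=500, seed=0):
    """Standard error of the OLS slope from resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if max(xs) == min(xs):
            continue  # skip degenerate resamples with no spread in x
        slopes.append(ols_slope(xs, ys))
    return statistics.stdev(slopes)


# Hypothetical toy data: a quadratic relationship fitted by a straight line,
# so the linear model is deliberately misspecified.
x = [i / 10 for i in range(30)]
y = [xi ** 2 for xi in x]
se_boot = xy_bootstrap_se(x, y)
```

Here the response is a deterministic function of x, so all of the variability in the bootstrap slopes comes from the regressor distribution interacting with the nonlinearity.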
l Models as Approximations I: Consequences Illustrated with Linear Regression By projecteuclid.org Published On :: Wed, 08 Jan 2020 04:00 EST Andreas Buja, Lawrence Brown, Richard Berk, Edward George, Emil Pitkin, Mikhail Traskin, Kai Zhang, Linda Zhao. Source: Statistical Science, Volume 34, Number 4, 523--544.Abstract: In the early 1980s, Halbert White inaugurated a “model-robust” form of statistical inference based on the “sandwich estimator” of standard error. This estimator is known to be “heteroskedasticity-consistent,” but it is less well known to be “nonlinearity-consistent” as well. Nonlinearity, however, raises fundamental issues because in its presence regressors are not ancillary, hence cannot be treated as fixed. The consequences are deep: (1) population slopes need to be reinterpreted as statistical functionals obtained from OLS fits to largely arbitrary joint ${x\textrm{-}y}$ distributions; (2) the meaning of slope parameters needs to be rethought; (3) the regressor distribution affects the slope parameters; (4) randomness of the regressors becomes a source of sampling variability in slope estimates of order $1/\sqrt{N}$; (5) inference needs to be based on model-robust standard errors, including sandwich estimators or the ${x\textrm{-}y}$ bootstrap. In theory, model-robust and model-trusting standard errors can deviate by arbitrary magnitudes either way. In practice, significant deviations between them can be detected with a diagnostic test. Full Article
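To make the contrast between model-trusting and model-robust standard errors concrete, the sketch below computes both for the slope of a simple linear regression. It uses the basic (HC0) form of White's sandwich estimator with no small-sample correction; the function name and toy data are assumptions for illustration, and the paper's diagnostic test is not reproduced.

```python
import math


def ols_slope_standard_errors(x, y):
    """Return (slope, model-trusting SE, sandwich SE) for y regressed on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Model-trusting variance: one common error variance divided by Sxx.
    classical_var = sum(e * e for e in resid) / (n - 2) / sxx

    # Sandwich (HC0) variance: squared residuals weighted by squared
    # centered regressors, sandwiched between two factors of 1 / Sxx.
    meat = sum((xi - x_bar) ** 2 * e * e for xi, e in zip(x, resid))
    sandwich_var = meat / sxx ** 2

    return slope, math.sqrt(classical_var), math.sqrt(sandwich_var)


# Hypothetical toy data with a nonlinear (quadratic) mean function.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi ** 2 for xi in x]
slope, se_classical, se_sandwich = ols_slope_standard_errors(x, y)
```

The two standard errors generally differ when the residual size varies with the regressor, which is what both heteroskedasticity and nonlinearity produce.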
l A Conversation with Peter Diggle By projecteuclid.org Published On :: Fri, 11 Oct 2019 04:03 EDT Peter M. Atkinson, Jorge Mateu. Source: Statistical Science, Volume 34, Number 3, 504--521.Abstract: Peter John Diggle was born on February 24, 1950, in Lancashire, England. Peter went to school in Scotland, and it was at the end of his school years that he found that he was good at maths and actually enjoyed it. Peter went to Edinburgh to do a maths degree, but transferred halfway through to Liverpool where he completed his degree. Peter studied for a year at Oxford and was then appointed in 1974 as a lecturer in statistics at the University of Newcastle-upon-Tyne where he gained his PhD, and was promoted to Reader in 1983. A sabbatical at the Swedish Royal College of Forestry gave him his first exposure to real scientific data and problems, prompting a move to CSIRO, Australia. After five years with CSIRO where he was Senior, then Principal, then Chief Research Scientist and Chief of the Division of Mathematics and Statistics, he returned to the UK in 1988, to a Chair at Lancaster University. Since 2011 Peter has held appointments at Lancaster and Liverpool, together with honorary appointments at Johns Hopkins, Columbia and Yale. At Lancaster, Peter was the founder and Director of the Medical Statistics Unit (1995–2001), University Dean for Research (1998–2001), EPSRC Senior Fellow (2004–2008), Associate Dean for Research at the School of Health and Medicine (2007–2011), Distinguished University Professor, and leader of the CHICAS Research Group (2007–2017). A Fellow of the Royal Statistical Society since 1974, he was a Member of Council (1983–1985), Joint Editor of JRSSB (1984–1987), Honorary Secretary (1990–1996), awarded the Guy Medal in Silver (1997) and the Barnett Award (2018), Associate Editor of Applied Statistics (1998–2000), Chair of the Research Section Committee (1998–2000), and President (2014–2016). 
Away from work, Peter enjoys music, playing folk-blues guitar and tenor recorder, and listening to jazz. His running days are behind him, but he can just about hold his own in mixed-doubles badminton with his family. His boyhood hero was Stirling Moss, and he retains an enthusiasm for classic cars, not least his 1988 Porsche 924S. His favorite authors are George Orwell, Primo Levi and Nigel Slater. This interview was done prior to the fourth Spatial Statistics conference held in Lancaster, July 2017 where a session was dedicated to Peter celebrating his contributions to statistics. Full Article
l Assessing the Causal Effect of Binary Interventions from Observational Panel Data with Few Treated Units By projecteuclid.org Published On :: Fri, 11 Oct 2019 04:03 EDT Pantelis Samartsidis, Shaun R. Seaman, Anne M. Presanis, Matthew Hickman, Daniela De Angelis. Source: Statistical Science, Volume 34, Number 3, 486--503.Abstract: Researchers are often challenged with assessing the impact of an intervention on an outcome of interest in situations where the intervention is nonrandomised, the intervention is only applied to one or few units, the intervention is binary, and outcome measurements are available at multiple time points. In this paper, we review existing methods for causal inference in these situations. We detail the assumptions underlying each method, emphasize connections between the different approaches and provide guidelines regarding their practical implementation. Several open problems are identified thus highlighting the need for future research. Full Article
l Conditionally Conjugate Mean-Field Variational Bayes for Logistic Models By projecteuclid.org Published On :: Fri, 11 Oct 2019 04:03 EDT Daniele Durante, Tommaso Rigon. Source: Statistical Science, Volume 34, Number 3, 472--485.Abstract: Variational Bayes (VB) is a common strategy for approximate Bayesian inference, but simple methods are only available for specific classes of models including, in particular, representations having conditionally conjugate constructions within an exponential family. Models with logit components are an apparently notable exception to this class, due to the absence of conjugacy between the logistic likelihood and the Gaussian priors for the coefficients in the linear predictor. To facilitate approximate inference within this widely used class of models, Jaakkola and Jordan (Stat. Comput. 10 (2000) 25–37) proposed a simple variational approach which relies on a family of tangent quadratic lower bounds of the logistic log-likelihood, thus restoring conjugacy between these approximate bounds and the Gaussian priors. This strategy is still implemented successfully, but few attempts have been made to formally understand the reasons underlying its excellent performance. Following a review of VB for logistic models, we fill this gap by providing a formal connection between the above bound and a recent Pólya-gamma data augmentation for logistic regression. Such a result places the computational methods associated with the aforementioned bounds within the framework of variational inference for conditionally conjugate exponential family models, thereby allowing recent advances for this class to be inherited also by the methods relying on Jaakkola and Jordan (Stat. Comput. 10 (2000) 25–37). Full Article
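The tangent quadratic lower bound mentioned in the abstract above can be checked numerically. The sketch below uses the standard form of the Jaakkola and Jordan bound, log σ(x) ≥ log σ(ξ) + (x − ξ)/2 − λ(ξ)(x² − ξ²) with λ(ξ) = tanh(ξ/2)/(4ξ); the function names are assumptions for illustration, and the connection to Pólya-gamma augmentation is not shown.

```python
import math


def log_sigmoid(x):
    """Numerically stable log of the logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def jj_lambda(xi):
    """Curvature term lambda(xi) = tanh(xi / 2) / (4 * xi), for xi != 0."""
    return math.tanh(xi / 2.0) / (4.0 * xi)


def jj_lower_bound(x, xi):
    """Quadratic-in-x lower bound on log_sigmoid(x), tight at x = +xi and -xi."""
    return log_sigmoid(xi) + (x - xi) / 2.0 - jj_lambda(xi) * (x * x - xi * xi)


# Compare the bound with the exact value on a small grid, for xi = 2.
xi = 2.0
grid = [k / 2.0 for k in range(-10, 11)]
gaps = [log_sigmoid(x) - jj_lower_bound(x, xi) for x in grid]
```

Because the bound is quadratic in x, it combines with a Gaussian prior on the coefficients to give a Gaussian approximate posterior, which is the conjugacy the abstract refers to.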
l User-Friendly Covariance Estimation for Heavy-Tailed Distributions By projecteuclid.org Published On :: Fri, 11 Oct 2019 04:03 EDT Yuan Ke, Stanislav Minsker, Zhao Ren, Qiang Sun, Wen-Xin Zhou. Source: Statistical Science, Volume 34, Number 3, 454--471.Abstract: We provide a survey of recent results on covariance estimation for heavy-tailed distributions. By unifying ideas scattered in the literature, we propose user-friendly methods that facilitate practical implementation. Specifically, we introduce elementwise and spectrumwise truncation operators, as well as their $M$-estimator counterparts, to robustify the sample covariance matrix. Different from the classical notion of robustness that is characterized by the breakdown property, we focus on the tail robustness which is evidenced by the connection between nonasymptotic deviation and confidence level. The key insight is that estimators should adapt to the sample size, dimensionality and noise level to achieve optimal tradeoff between bias and robustness. Furthermore, to facilitate practical implementation, we propose data-driven procedures that automatically calibrate the tuning parameters. We demonstrate their applications to a series of structured models in high dimensions, including the bandable and low-rank covariance matrices and sparse precision matrices. Numerical studies lend strong support to the proposed methods. Full Article
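As a rough illustration of what an elementwise truncation operator can look like, the sketch below clips each entry of the per-observation outer products before averaging. It is a simplified stand-in, not the estimator proposed in the paper: it assumes the data have mean zero, and the function name `truncated_covariance` and the hand-picked threshold `tau` are hypothetical (the abstract describes data-driven calibration of tuning parameters, which is not attempted here).

```python
import numpy as np

def truncated_covariance(X, tau):
    # X: (n, d) array of observations, assumed here to have mean zero.
    # tau: truncation level (a tuning parameter, fixed by hand in this sketch).
    prods = X[:, :, None] * X[:, None, :]   # (n, d, d) per-observation outer products
    clipped = np.clip(prods, -tau, tau)     # elementwise truncation at +/- tau
    return clipped.mean(axis=0)             # average the truncated products

# Hypothetical heavy-tailed sample: Student t with 3 degrees of freedom.
rng = np.random.default_rng(0)
X = rng.standard_t(df=3, size=(200, 4))
S = truncated_covariance(X, tau=5.0)
```

A small `tau` reduces the influence of extreme observations at the cost of bias; with `tau = np.inf` the function reduces to the usual mean-zero sample covariance.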
l The Geometry of Continuous Latent Space Models for Network Data By projecteuclid.org Published On :: Fri, 11 Oct 2019 04:03 EDT Anna L. Smith, Dena M. Asta, Catherine A. Calder. Source: Statistical Science, Volume 34, Number 3, 428--453.Abstract: We review the class of continuous latent space (statistical) models for network data, paying particular attention to the role of the geometry of the latent space. In these models, the presence/absence of network dyadic ties are assumed to be conditionally independent given the dyads’ unobserved positions in a latent space. In this way, these models provide a probabilistic framework for embedding network nodes in a continuous space equipped with a geometry that facilitates the description of dependence between random dyadic ties. Specifically, these models naturally capture homophilous tendencies and triadic clustering, among other common properties of observed networks. In addition to reviewing the literature on continuous latent space models from a geometric perspective, we highlight the important role the geometry of the latent space plays on properties of networks arising from these models via intuition and simulation. Finally, we discuss results from spectral graph theory that allow us to explore the role of the geometry of the latent space, independent of network size. We conclude with conjectures about how these results might be used to infer the appropriate latent space geometry from observed networks. Full Article
l Lasso Meets Horseshoe: A Survey By projecteuclid.org Published On :: Fri, 11 Oct 2019 04:03 EDT Anindya Bhadra, Jyotishka Datta, Nicholas G. Polson, Brandon Willard. Source: Statistical Science, Volume 34, Number 3, 405--427.Abstract: The goal of this paper is to contrast and survey the major advances in two of the most commonly used high-dimensional techniques, namely, the Lasso and horseshoe regularization. Lasso is a gold standard for predictor selection while horseshoe is a state-of-the-art Bayesian estimator for sparse signals. Lasso is fast and scalable and uses convex optimization whilst the horseshoe is nonconvex. Our novel perspective focuses on three aspects: (i) theoretical optimality in high-dimensional inference for the Gaussian sparse model and beyond, (ii) efficiency and scalability of computation and (iii) methodological development and performance. Full Article
l An Overview of Semiparametric Extensions of Finite Mixture Models By projecteuclid.org Published On :: Fri, 11 Oct 2019 04:03 EDT Sijia Xiang, Weixin Yao, Guangren Yang. Source: Statistical Science, Volume 34, Number 3, 391--404.Abstract: Finite mixture models have offered a very important tool for exploring complex data structures in many scientific areas, such as economics, epidemiology and finance. Semiparametric mixture models, which were introduced into traditional finite mixture models in the past decade, have brought forth exciting developments in their methodologies, theories, and applications. In this article, we not only provide a selective overview of the newly-developed semiparametric mixture models, but also discuss their estimation methodologies, theoretical properties if applicable, and some open questions. Recent developments are also discussed. Full Article
l ROS Regression: Integrating Regularization with Optimal Scaling Regression By projecteuclid.org Published On :: Fri, 11 Oct 2019 04:03 EDT Jacqueline J. Meulman, Anita J. van der Kooij, Kevin L. W. Duisters. Source: Statistical Science, Volume 34, Number 3, 361--390.Abstract: We present a methodology for multiple regression analysis that deals with categorical variables (possibly mixed with continuous ones), in combination with regularization, variable selection and high-dimensional data ($P\gg N$). Regularization and optimal scaling (OS) are two important extensions of ordinary least squares regression (OLS) that will be combined in this paper. There are two data analytic situations for which optimal scaling was developed. One is the analysis of categorical data, and the other the need for transformations because of nonlinear relationships between predictors and outcome. Optimal scaling of categorical data finds quantifications for the categories, both for the predictors and for the outcome variables, that are optimal for the regression model in the sense that they maximize the multiple correlation. When nonlinear relationships exist, nonlinear transformation of predictors and outcome maximize the multiple correlation in the same way. We will consider a variety of transformation types; typically we use step functions for categorical variables, and smooth (spline) functions for continuous variables. Both types of functions can be restricted to be monotonic, preserving the ordinal information in the data. In combination with optimal scaling, three popular regularization methods will be considered: Ridge regression, the Lasso and the Elastic Net. The resulting method will be called ROS Regression (Regularized Optimal Scaling Regression). 
The OS algorithm provides straightforward and efficient estimation of the regularized regression coefficients, automatically gives the Group Lasso and Blockwise Sparse Regression, and extends them by the possibility to maintain ordinal properties in the data. Extended examples are provided. Full Article
l A Conversation with Noel Cressie By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Christopher K. Wikle, Jay M. Ver Hoef. Source: Statistical Science, Volume 34, Number 2, 349--359.Abstract: Noel Cressie, FAA is Director of the Centre for Environmental Informatics in the National Institute for Applied Statistics Research Australia (NIASRA) and Distinguished Professor in the School of Mathematics and Applied Statistics at the University of Wollongong, Australia. He is also Adjunct Professor at the University of Missouri (USA), Affiliate of Org 398, Science Data Understanding, at NASA’s Jet Propulsion Laboratory (USA), and a member of the Science Team for NASA’s Orbiting Carbon Observatory-2 (OCO-2) satellite. Cressie was awarded a B.Sc. with First Class Honours in Mathematics in 1972 from the University of Western Australia, and an M.A. and Ph.D. in Statistics in 1973 and 1975, respectively, from Princeton University (USA). Two brief postdoctoral periods followed, at the Centre de Morphologie Mathématique, ENSMP, in Fontainebleau (France) from April 1975–September 1975, and at Imperial College, London (UK) from September 1975–January 1976. His past appointments have been at The Flinders University of South Australia from 1976–1983, at Iowa State University (USA) from 1983–1998, and at The Ohio State University (USA) from 1998–2012. He has authored or co-authored four books and more than 280 papers in peer-reviewed outlets, covering areas that include spatial and spatio-temporal statistics, environmental statistics, empirical-Bayesian and Bayesian methods including sequential design, goodness-of-fit, and remote sensing of the environment. Many of his papers also address important questions in the sciences. 
Cressie is a Fellow of the Australian Academy of Science, the American Statistical Association, the Institute of Mathematical Statistics, and the Spatial Econometrics Association, and he is an Elected Member of the International Statistical Institute. Noel Cressie’s refereed, unrefereed, and other publications are available at: https://niasra.uow.edu.au/cei/people/UOW232444.html. Full Article
l Two-Sample Instrumental Variable Analyses Using Heterogeneous Samples By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Qingyuan Zhao, Jingshu Wang, Wes Spiller, Jack Bowden, Dylan S. Small. Source: Statistical Science, Volume 34, Number 2, 317--333.Abstract: Instrumental variable analysis is a widely used method to estimate causal effects in the presence of unmeasured confounding. When the instruments, exposure and outcome are not measured in the same sample, Angrist and Krueger (J. Amer. Statist. Assoc. 87 (1992) 328–336) suggested using two-sample instrumental variable (TSIV) estimators that use sample moments from an instrument-exposure sample and an instrument-outcome sample. However, this method is biased if the two samples are from heterogeneous populations so that the distributions of the instruments are different. In linear structural equation models, we derive a new class of TSIV estimators that are robust to heterogeneous samples under the key assumption that the structural relations in the two samples are the same. The widely used two-sample two-stage least squares estimator belongs to this class. It is generally not asymptotically efficient, although we find that it performs similarly to the optimal TSIV estimator in most practical situations. We then attempt to relax the linearity assumption. We find that, unlike one-sample analyses, the TSIV estimator is not robust to a misspecified exposure model. Additionally, to nonparametrically identify the magnitude of the causal effect, the noise in the exposure must have the same distribution in the two samples. However, this assumption is in general untestable because the exposure is not observed in one sample. Nonetheless, we may still identify the sign of the causal effect in the absence of homogeneity of the noise. Full Article
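The two-sample two-stage least squares estimator named in the abstract above can be sketched in a few lines. This is a textbook-style illustration under a linear model, not the paper's implementation: the function name `ts2sls`, the simulated data, and all parameter values (true effect 2.0, instrument coefficients, sample size) are assumptions made for the example.

```python
import numpy as np

def ts2sls(z_a, x_a, z_b, y_b):
    # Stage 1: regress the exposure on the instruments in sample a.
    Za = np.column_stack([np.ones(len(z_a)), z_a])
    gamma, *_ = np.linalg.lstsq(Za, x_a, rcond=None)
    # Predict the (unobserved) exposure in sample b from its instruments.
    Zb = np.column_stack([np.ones(len(z_b)), z_b])
    x_hat = Zb @ gamma
    # Stage 2: regress the outcome in sample b on the predicted exposure.
    Xb = np.column_stack([np.ones(len(x_hat)), x_hat])
    coef, *_ = np.linalg.lstsq(Xb, y_b, rcond=None)
    return coef[1]  # slope = estimated causal effect

# Simulated example with an unmeasured confounder u (all values hypothetical).
rng = np.random.default_rng(0)
n, beta = 20000, 2.0
g = np.array([1.0, 0.5, -0.5])                 # instrument-exposure coefficients

z_a = rng.normal(size=(n, 3)); u_a = rng.normal(size=n)
x_a = z_a @ g + u_a + rng.normal(size=n)       # exposure observed in sample a only

z_b = rng.normal(size=(n, 3)); u_b = rng.normal(size=n)
x_b = z_b @ g + u_b + rng.normal(size=n)       # not passed to the estimator
y_b = beta * x_b + u_b + rng.normal(size=n)    # outcome observed in sample b only

beta_hat = ts2sls(z_a, x_a, z_b, y_b)
```

In this simulation both samples share the same instrument distribution and the same structural relations, so the sketch does not exercise the heterogeneous-sample setting the paper studies.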
l Producing Official County-Level Agricultural Estimates in the United States: Needs and Challenges By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Nathan B. Cruze, Andreea L. Erciulescu, Balgobin Nandram, Wendy J. Barboza, Linda J. Young. Source: Statistical Science, Volume 34, Number 2, 301--316.Abstract: In the United States, county-level estimates of crop yield, production, and acreage published by the United States Department of Agriculture’s National Agricultural Statistics Service (USDA NASS) play an important role in determining the value of payments allotted to farmers and ranchers enrolled in several federal programs. Given the importance of these official county-level crop estimates, NASS continually strives to improve its crops county estimates program in terms of accuracy, reliability and coverage. In 2015, NASS engaged a panel of experts convened under the auspices of the National Academies of Sciences, Engineering, and Medicine Committee on National Statistics (CNSTAT) for guidance on implementing models that may synthesize multiple sources of information into a single estimate, provide defensible measures of uncertainty, and potentially increase the number of publishable county estimates. The final report titled Improving Crop Estimates by Integrating Multiple Data Sources was released in 2017. This paper discusses several needs and requirements for NASS county-level crop estimates that were illuminated during the activities of the CNSTAT panel. A motivating example of planted acreage estimation in Illinois illustrates several challenges that NASS faces as it considers adopting any explicit model for official crops county estimates. Full Article
l The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015 By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Laura Anderlucci, Angela Montanari, Cinzia Viroli. Source: Statistical Science, Volume 34, Number 2, 280--300.Abstract: In this paper, we retrace the recent history of statistics by analyzing all the papers published in five prestigious statistical journals since 1970, namely: The Annals of Statistics, Biometrika, Journal of the American Statistical Association, Journal of the Royal Statistical Society, Series B and Statistical Science. The aim is to construct a kind of “taxonomy” of the statistical papers by organizing and clustering them in main themes. In this sense, being identified in a cluster means being important enough to be uncluttered in the vast and interconnected world of statistical research. Since the main statistical research topics are naturally born, evolve or die over time, we will also develop a dynamic clustering strategy, where a group in a time period is allowed to migrate or to merge into different groups in the following one. Results show that statistics is a very dynamic and evolving science, stimulated by the rise of new research questions and types of data. Full Article
l Statistical Analysis of Zero-Inflated Nonnegative Continuous Data: A Review By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Lei Liu, Ya-Chen Tina Shih, Robert L. Strawderman, Daowen Zhang, Bankole A. Johnson, Haitao Chai. Source: Statistical Science, Volume 34, Number 2, 253--279.Abstract: Zero-inflated nonnegative continuous (or semicontinuous) data arise frequently in biomedical, economical, and ecological studies. Examples include substance abuse, medical costs, medical care utilization, biomarkers (e.g., CD4 cell counts, coronary artery calcium scores), single cell gene expression rates, and (relative) abundance of microbiome. Such data are often characterized by the presence of a large portion of zero values and positive continuous values that are skewed to the right and heteroscedastic. Both of these features suggest that no simple parametric distribution may be suitable for modeling such type of outcomes. In this paper, we review statistical methods for analyzing zero-inflated nonnegative outcome data. We will start with the cross-sectional setting, discussing ways to separate zero and positive values and introducing flexible models to characterize right skewness and heteroscedasticity in the positive values. We will then present models of correlated zero-inflated nonnegative continuous data, using random effects to tackle the correlation on repeated measures from the same subject and that across different parts of the model. We will also discuss expansion to related topics, for example, zero-inflated count and survival data, nonlinear covariate effects, and joint models of longitudinal zero-inflated nonnegative continuous data and survival. Finally, we will present applications to three real datasets (i.e., microbiome, medical costs, and alcohol drinking) to illustrate these methods. Example code will be provided to facilitate applications of these methods. Full Article
l A Kernel Regression Procedure in the 3D Shape Space with an Application to Online Sales of Children’s Wear By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Gregorio Quintana-Ortí, Amelia Simó. Source: Statistical Science, Volume 34, Number 2, 236--252.Abstract: This paper is focused on kernel regression when the response variable is the shape of a 3D object represented by a configuration matrix of landmarks. Regression methods on this shape space are not trivial because this space has a complex finite-dimensional Riemannian manifold structure (non-Euclidean). Papers about it are scarce in the literature, the majority of them are restricted to the case of a single explanatory variable, and many of them are based on the approximated tangent space. In this paper, there are several methodological innovations. The first one is the adaptation of the general method for kernel regression analysis in manifold-valued data to the three-dimensional case of Kendall’s shape space. The second one is its generalization to the multivariate case and the addressing of the curse-of-dimensionality problem. Finally, we propose bootstrap confidence intervals for prediction. A simulation study is carried out to check the goodness of the procedure, and a comparison with a current approach is performed. Then, it is applied to a 3D database obtained from an anthropometric survey of the Spanish child population with a potential application to online sales of children’s wear. Full Article
l Rejoinder: Bayes, Oracle Bayes, and Empirical Bayes By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Bradley Efron. Source: Statistical Science, Volume 34, Number 2, 234--235. Full Article
l Comment: Variational Autoencoders as Empirical Bayes By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Yixin Wang, Andrew C. Miller, David M. Blei. Source: Statistical Science, Volume 34, Number 2, 229--233. Full Article
l Comment: Empirical Bayes, Compound Decisions and Exchangeability By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Eitan Greenshtein, Ya’acov Ritov. Source: Statistical Science, Volume 34, Number 2, 224--228.Abstract: We present some personal reflections on empirical Bayes/compound decision (EB/CD) theory following Efron (2019). In particular, we consider the role of exchangeability in the EB/CD theory and how it can be achieved when there are covariates. We also discuss the interpretation of EB/CD confidence interval, the theoretical efficiency of the CD procedure, and the impact of sparsity assumptions. Full Article
l Comment: Empirical Bayes Interval Estimation By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Wenhua Jiang. Source: Statistical Science, Volume 34, Number 2, 219--223.Abstract: This is a contribution to the discussion of the enlightening paper by Professor Efron. We focus on empirical Bayes interval estimation. We discuss the oracle interval estimation rules, the empirical Bayes estimation of the oracle rule and the computation. Some numerical results are reported. Full Article
l Comment: Bayes, Oracle Bayes and Empirical Bayes By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Aad van der Vaart. Source: Statistical Science, Volume 34, Number 2, 214--218. Full Article
l Comment: Minimalist $g$-Modeling By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Roger Koenker, Jiaying Gu. Source: Statistical Science, Volume 34, Number 2, 209--213.Abstract: Efron’s elegant approach to $g$-modeling for empirical Bayes problems is contrasted with an implementation of the Kiefer–Wolfowitz nonparametric maximum likelihood estimator for mixture models for several examples. The latter approach has the advantage that it is free of tuning parameters and consequently provides a relatively simple complementary method. Full Article
l Comment: Bayes, Oracle Bayes, and Empirical Bayes By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Nan Laird. Source: Statistical Science, Volume 34, Number 2, 206--208. Full Article
l Comment: Bayes, Oracle Bayes, and Empirical Bayes By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Thomas A. Louis. Source: Statistical Science, Volume 34, Number 2, 202--205. Full Article
l Bayes, Oracle Bayes and Empirical Bayes By projecteuclid.org Published On :: Thu, 18 Jul 2019 22:01 EDT Bradley Efron. Source: Statistical Science, Volume 34, Number 2, 177--201.Abstract: This article concerns the Bayes and frequentist aspects of empirical Bayes inference. Some of the ideas explored go back to Robbins in the 1950s, while others are current. Several examples are discussed, real and artificial, illustrating the two faces of empirical Bayes methodology: “oracle Bayes” shows empirical Bayes in its most frequentist mode, while “finite Bayes inference” is a fundamentally Bayesian application. In either case, modern theory and computation allow us to present a sharp finite-sample picture of what is at stake in an empirical Bayes analysis. Full Article