Modifying the Chi-square and the CMH test for population genetic inference: Adapting to overdispersion
Kerstin Spitzer, Marta Pelizzola, Andreas Futschik. Source: The Annals of Applied Statistics, Volume 14, Number 1, 202--220. Published 15 Apr 2020 (projecteuclid.org).
Abstract: Evolve and resequence studies provide a popular approach to simulating evolution in the lab and exploring its genetic basis. In this context, Pearson’s chi-square test, Fisher’s exact test and the Cochran–Mantel–Haenszel test are commonly used to infer genomic positions affected by selection from temporal changes in allele frequency. However, the null model associated with these tests does not match the null hypothesis of actual interest. Due to genetic drift and possibly additional noise components such as pool sequencing, the null variance in the data can be substantially larger than these common test statistics account for. This leads to $p$-values that are systematically too small and, therefore, a large number of false positives. Even if the ranking rather than the actual $p$-values is of interest, a naive application of these tests will give misleading results, because the amount of overdispersion varies from locus to locus. We therefore propose adjusted statistics that take the overdispersion into account while keeping the formulas simple. This is particularly useful in genome-wide applications, where millions of SNPs can be handled with little computational effort. We then apply the adapted test statistics to real data from Drosophila and investigate how information from intermediate generations can be included when available. We also discuss further applications such as genome-wide association studies based on pool sequencing data and tests for local adaptation.
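The central idea of the adjusted statistics, deflating the ordinary chi-square statistic by a locus-specific overdispersion factor so that the null distribution is restored, can be illustrated with a minimal sketch. The factor here is a placeholder supplied by the caller; the paper derives it from drift and pool-sequencing variance, which this toy calculation does not attempt.

```python
from scipy.stats import chi2

def adjusted_chisq_pvalue(n11, n12, n21, n22, overdispersion):
    """2x2 allele-count table at one locus, compared across two time points.

    `overdispersion` >= 1 is an assumed, externally estimated variance
    inflation factor (e.g. due to genetic drift); the adjusted statistic
    is the ordinary Pearson chi-square divided by this factor.
    """
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [n11, n12, n21, n22]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    adj_stat = stat / overdispersion  # key adjustment: deflate the statistic
    return adj_stat, chi2.sf(adj_stat, df=1)
```

With `overdispersion = 1` this reduces to the usual Pearson test; larger factors shrink the statistic and enlarge the $p$-value, which is the direction of the correction the paper argues for.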
Surface temperature monitoring in liver procurement via functional variance change-point analysis
Zhenguo Gao, Pang Du, Ran Jin, John L. Robertson. Source: The Annals of Applied Statistics, Volume 14, Number 1, 143--159. Published 15 Apr 2020 (projecteuclid.org).
Abstract: Liver procurement experiments with surface-temperature monitoring motivated Gao et al. (J. Amer. Statist. Assoc. 114 (2019) 773–781) to develop a variance change-point detection method under a smoothly changing mean trend. However, the spotwise change points yielded by their method do not offer immediate information to surgeons, since an organ is often transplanted as a whole or in part. We develop a new practical method that can analyze a defined portion of the organ surface at a time. It also provides a novel addition to the developing field of functional data monitoring. Furthermore, a numerical challenge emerges in simultaneously modeling the variance functions of 2D locations and the mean function of location and time. The respective sample sizes, on the order of 10,000 and 1,000,000 for these two modeling tasks, make standard spline estimation too costly to be useful. We introduce a multistage subsampling strategy with steps guided by quickly computable preliminary statistical measures. Extensive simulations show that the new method can efficiently reduce the computational cost and provide reasonable parameter estimates. Application of the new method to our liver surface temperature monitoring data shows its effectiveness in providing accurate status change information for a selected portion of the organ in the experiment.
A statistical analysis of noisy crowdsourced weather data
Arnab Chakraborty, Soumendra Nath Lahiri, Alyson Wilson. Source: The Annals of Applied Statistics, Volume 14, Number 1, 116--142. Published 15 Apr 2020 (projecteuclid.org).
Abstract: Spatial prediction of weather elements like temperature, precipitation, and barometric pressure is generally based on satellite imagery or data collected at ground stations. None of these data sources provides information at a more granular or “hyperlocal” resolution. On the other hand, crowdsourced weather data, which are captured by sensors installed on mobile devices and gathered by weather-related mobile apps like WeatherSignal and AccuWeather, can serve as potential data sources for analyzing environmental processes at a hyperlocal resolution. However, due to the low quality of the sensors and the nonlaboratory environment, the quality of the observations in crowdsourced data is compromised. This paper describes methods to improve hyperlocal spatial prediction using this varying-quality, noisy crowdsourced information. We introduce a reliability metric, the Veracity Score (VS), to assess the quality of the crowdsourced observations using coarser but high-quality reference data. A VS-based methodology to analyze noisy spatial data is proposed and evaluated through extensive simulations. The merits of the proposed approach are illustrated through case studies analyzing crowdsourced daily average ambient temperature readings for one day in the contiguous United States.
Modeling microbial abundances and dysbiosis with beta-binomial regression
Bryan D. Martin, Daniela Witten, Amy D. Willis. Source: The Annals of Applied Statistics, Volume 14, Number 1, 94--115. Published 15 Apr 2020 (projecteuclid.org).
Abstract: Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon’s relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon’s relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon’s counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.
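The distinctive feature of the model, letting both the mean relative abundance and the overdispersion depend on covariates, can be sketched as a beta-binomial negative log-likelihood. The link functions (logit for the mean, log for the overdispersion scale) and the coefficient names are illustrative assumptions, not the paper's exact parameterization.

```python
import math

def betabinom_logpmf(k, n, alpha, beta):
    """Log-density of the beta-binomial distribution via log-gamma:
    C(n,k) * B(k+alpha, n-k+beta) / B(alpha, beta)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + math.lgamma(k + alpha) + math.lgamma(n - k + beta)
            - math.lgamma(n + alpha + beta)
            + math.lgamma(alpha + beta)
            - math.lgamma(alpha) - math.lgamma(beta))

def neg_loglik(params, counts, totals, x):
    """Negative log-likelihood where both the mean mu and the overdispersion
    phi depend on a scalar covariate x (assumed links: logit for mu, log for
    phi; b0, b1, g0, g1 are illustrative coefficient names)."""
    b0, b1, g0, g1 = params
    nll = 0.0
    for k, n, xi in zip(counts, totals, x):
        mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))   # mean relative abundance
        phi = math.exp(g0 + g1 * xi)                   # overdispersion scale
        alpha, beta = mu * phi, (1.0 - mu) * phi       # standard reparameterization
        nll -= betabinom_logpmf(k, n, alpha, beta)
    return nll
```

Minimizing `neg_loglik` over `params` (e.g. with `scipy.optimize.minimize`) would fit the sketch; a test of differential variability then amounts to testing `g1 = 0`.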
Integrative survival analysis with uncertain event times in application to a suicide risk study
Wenjie Wang, Robert Aseltine, Kun Chen, Jun Yan. Source: The Annals of Applied Statistics, Volume 14, Number 1, 51--73. Published 15 Apr 2020 (projecteuclid.org).
Abstract: The concept of integrating data from disparate sources to accelerate scientific discovery has generated tremendous excitement in many fields. The potential benefits of data integration, however, may be compromised by the uncertainty due to incomplete or imperfect record linkage. Motivated by a suicide risk study, we propose an approach for analyzing survival data with uncertain event times arising from data integration. Specifically, in our problem, deaths identified from hospital discharge records together with reported suicidal deaths determined by the Office of Medical Examiner may still not include all the death events of patients, and the missing deaths can be recovered from a complete database of death records. Since the hospital discharge data can only be linked to the death record data by matching basic patient characteristics, a patient with a censored death time in the first dataset could be linked to multiple potential event records in the second dataset. We develop an integrative Cox proportional hazards regression in which the uncertainty in the matched event times is modeled probabilistically. The estimation procedure combines the ideas of profile likelihood and the expectation conditional maximization (ECM) algorithm. Simulation studies demonstrate that, under realistic settings of imperfect data linkage, the proposed method outperforms several competing approaches, including multiple imputation. A marginal screening analysis using the proposed integrative Cox model is performed to identify risk factors associated with death following suicide-related hospitalization in Connecticut. The identified diagnostic codes are consistent with the existing literature and provide several new insights into suicide risk, prediction and prevention.
BART with targeted smoothing: An analysis of patient-specific stillbirth risk
Jennifer E. Starling, Jared S. Murray, Carlos M. Carvalho, Radek K. Bukowski, James G. Scott. Source: The Annals of Applied Statistics, Volume 14, Number 1, 28--50. Published 15 Apr 2020 (projecteuclid.org).
Abstract: This article introduces BART with Targeted Smoothing, or tsBART, a new Bayesian tree-based model for nonparametric regression. The goal of tsBART is to introduce smoothness over a single target covariate $t$ while not necessarily requiring smoothness over other covariates $x$. tsBART is based on the Bayesian Additive Regression Trees (BART) model, an ensemble of regression trees. tsBART extends BART by parameterizing each tree’s terminal nodes with smooth functions of $t$ rather than independent scalars. Like BART, tsBART captures complex nonlinear relationships and interactions among the predictors. But unlike BART, tsBART guarantees that the response surface will be smooth in the target covariate. This improves interpretability and helps to regularize the estimate. After introducing and benchmarking the tsBART model, we apply it to our motivating example: pregnancy outcomes data from the National Center for Health Statistics. Our aim is to provide patient-specific estimates of stillbirth risk across gestational age $(t)$ and based on maternal and fetal risk factors $(x)$. Obstetricians expect stillbirth risk to vary smoothly over gestational age but not necessarily over other covariates, and tsBART has been designed precisely to reflect this structural knowledge. The results of our analysis show the clear superiority of the tsBART model for quantifying stillbirth risk, thereby providing patients and doctors with better information for managing the risk of fetal mortality. All methods described here are implemented in the R package tsbart.
A general theory for preferential sampling in environmental networks
Joe Watson, James V. Zidek, Gavin Shaddick. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2662--2700. Published 27 Nov 2019 (projecteuclid.org).
Abstract: This paper presents a general model framework for detecting the preferential sampling of environmental monitors recording an environmental process across space and/or time. This is achieved by considering the joint distribution of an environmental process with a site-selection process that considers where and when sites are placed to measure the process. The environmental process may be spatial, temporal or spatio-temporal in nature. By sharing random effects between the two processes, the joint model is able to establish whether site placement was stochastically dependent on the environmental process under study. Furthermore, if stochastic dependence is identified between the two processes, then inferences about the probability distribution of the spatio-temporal process will change, as will predictions made of the process across space and time. The embedding into a spatio-temporal framework also allows for the modelling of the dynamic site-selection process itself. Real-world factors affecting both the size and location of the network can be easily modelled and quantified. Depending upon the choice of the population of locations considered for selection across space and time under the site-selection process, different insights about the precise nature of preferential sampling can be obtained. The general framework developed in the paper is designed to be easily and quickly fit using the R-INLA package. We apply this framework to a case study involving particulate air pollution over the UK, where a major reduction in the size of a monitoring network occurred through time. It is demonstrated that a significant response-biased reduction in the air quality monitoring network occurred, namely the relocation of monitoring sites to locations with the highest pollution levels and the routine removal of sites at locations with the lowest. We also show that the network was consistently unrepresentative of the levels of particulate matter seen across much of GB throughout the operating life of the network. Finally, we show that this may have led to a severe overreporting of the population-average exposure levels experienced across GB. This could have great impacts on estimates of the health effects of black smoke levels.
Bayesian indicator variable selection to incorporate hierarchical overlapping group structure in multi-omics applications
Li Zhu, Zhiguang Huo, Tianzhou Ma, Steffi Oesterreich, George C. Tseng. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2611--2636. Published 27 Nov 2019 (projecteuclid.org).
Abstract: Variable selection is a pervasive problem in modern high-dimensional data analysis, where the number of features often exceeds the sample size (the small-n-large-p problem). Incorporating group structure knowledge to improve variable selection has been widely studied. Here, we consider prior knowledge of a hierarchical overlapping group structure to improve variable selection in a regression setting. In genomics applications, for instance, a biological pathway contains tens to hundreds of genes, and a gene can be mapped to multiple experimentally measured features (such as its mRNA expression, copy number variation and methylation levels at possibly multiple sites). In addition to the hierarchical structure, the groups at the same level may overlap (e.g., two pathways can share common genes). Incorporating such hierarchical overlapping groups in the traditional penalized regression setting remains a difficult optimization problem. Alternatively, we propose a Bayesian indicator model that can elegantly serve the purpose. We evaluate the model in simulations and two breast cancer examples, and demonstrate its superior performance over existing models. The result not only enhances prediction accuracy but also improves variable selection and model interpretation, leading to deeper biological insight into the disease.
On Bayesian new edge prediction and anomaly detection in computer networks
Silvia Metelli, Nicholas Heard. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2586--2610. Published 27 Nov 2019 (projecteuclid.org).
Abstract: Monitoring computer network traffic for anomalous behaviour presents an important security challenge. Arrivals of new edges in a network graph represent connections between a client and server pair not previously observed, and in rare cases these might suggest the presence of intruders or malicious implants. We propose a Bayesian model and anomaly detection method for simultaneously characterising existing network structure and modelling likely new edge formation. The method is demonstrated on real computer network authentication data and successfully identifies some machines which are known to be compromised.
A hierarchical curve-based approach to the analysis of manifold data
Liberty Vittert, Adrian W. Bowman, Stanislav Katina. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2539--2563. Published 27 Nov 2019 (projecteuclid.org).
Abstract: One of the data structures generated by medical imaging technology is high resolution point clouds representing anatomical surfaces. Stereophotogrammetry and laser scanning are two widely available sources of this kind of data. A standardised surface representation is required to provide a meaningful correspondence across different images as a basis for statistical analysis. Point locations with anatomical definitions, referred to as landmarks, have been the traditional approach. Landmarks can also be taken as the starting point for more general surface representations, often using templates which are warped on to an observed surface by matching landmark positions and subsequent local adjustment of the surface. The aim of the present paper is to provide a new approach which places anatomical curves at the heart of the surface representation and its analysis. Curves provide intermediate structures which capture the principal features of the manifold (surface) of interest through its ridges and valleys. As landmarks are often available these are used as anchoring points, but surface curvature information is the principal guide in estimating the curve locations. The surface patches between these curves are relatively flat and can be represented in a standardised manner by appropriate surface transects to give a complete surface model. This new approach does not require the use of a template, reference sample or any external information to guide the method and, when compared with a surface based approach, the estimation of curves is shown to have improved performance. In addition, examples involving applications to mussel shells and human faces show that the analysis of curve information can deliver more targeted and effective insight than the use of full surface information.
A simple, consistent estimator of SNP heritability from genome-wide association studies
Armin Schwartzman, Andrew J. Schork, Rong Zablocki, Wesley K. Thompson. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2509--2538. Published 27 Nov 2019 (projecteuclid.org).
Abstract: Analysis of genome-wide association studies (GWAS) is characterized by a large number of univariate regressions where a quantitative trait is regressed on hundreds of thousands to millions of single-nucleotide polymorphism (SNP) allele counts, one at a time. This article proposes an estimator of the SNP heritability of the trait, defined here as the fraction of the variance of the trait explained by the SNPs in the study. The proposed GWAS heritability (GWASH) estimator is easy to compute, highly interpretable and is consistent as the number of SNPs and the sample size increase. More importantly, it can be computed from summary statistics typically reported in GWAS, not requiring access to the original data. The estimator takes full account of the linkage disequilibrium (LD) or correlation between the SNPs in the study through moments of the LD matrix, estimable from auxiliary datasets. Unlike other proposed estimators in the literature, we establish the theoretical properties of the GWASH estimator and obtain analytical estimates of the precision, allowing for power and sample size calculations for SNP heritability estimates and forming a firm foundation for future methodological development.
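To convey the flavor of a summary-statistic heritability estimator, here is a deliberately simplified sketch under the drastic assumption of independent SNPs, so that the LD-matrix moments the GWASH estimator relies on drop out. This is not the GWASH formula itself, only a caricature of the moment-based idea: under a simple polygenic model with independent SNPs, the expected squared z-score is inflated above 1 in proportion to heritability.

```python
def heritability_sketch(z_scores, n_samples):
    """Toy moment estimator of SNP heritability from GWAS z-scores.

    Assumes independent SNPs, so that E[z^2] = 1 + n * h2 / p under a
    simple polygenic model (p SNPs, n samples). The actual GWASH
    estimator additionally corrects for LD through estimated moments
    of the LD matrix; that correction is omitted here.
    """
    p = len(z_scores)
    mean_chisq = sum(z * z for z in z_scores) / p
    return p * (mean_chisq - 1.0) / n_samples
```

The sketch is consistent with the abstract's premise: only per-SNP summary statistics are needed, never the raw genotype data.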
New formulation of the logistic-Gaussian process to analyze trajectory tracking data
Gianluca Mastrantonio, Clara Grazian, Sara Mancinelli, Enrico Bibbona. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2483--2508. Published 27 Nov 2019 (projecteuclid.org).
Abstract: Improved communication systems, shrinking battery sizes and the price drop of tracking devices have led to an increasing availability of trajectory tracking data. These data are often analyzed to understand animal behavior. In this work, we propose a new model that represents the animal movement as a mixture of characteristic patterns, which we interpret as different behaviors. The probability that the animal is behaving according to a specific pattern, at each time instant, is nonparametrically estimated using the logistic-Gaussian process. Owing to a new formalization and the way we specify the coregionalization matrix of the associated multivariate Gaussian process, our model is invariant with respect to the choice of the reference element and of the ordering of the probability vector components. We fit the model under a Bayesian framework and show that the Markov chain Monte Carlo algorithm we propose is straightforward to implement. We perform a simulation study with the aim of showing the ability of the estimation procedure to retrieve the model parameters. We also test the performance of the information criterion we used to select the number of behaviors. The model is then applied to a real dataset in which a wolf was observed before and after procreation. The results are easy to interpret, and clear differences emerge between the two phases.
Empirical Bayes analysis of RNA sequencing experiments with auxiliary information
Kun Liang. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2452--2482. Published 27 Nov 2019 (projecteuclid.org).
Abstract: Finding differentially expressed genes is a common task in high-throughput transcriptome studies. While traditional statistical methods rank the genes by their test statistics alone, we analyze an RNA sequencing dataset using the auxiliary information of gene length and the test statistics from a related microarray study. Given the auxiliary information, we propose a novel nonparametric empirical Bayes procedure to estimate the posterior probability of differential expression for each gene. We demonstrate the advantage of our procedure in extensive simulation studies and a psoriasis RNA sequencing study. The companion R package calm is available on Bioconductor.
Outline analyses of the called strike zone in Major League Baseball
Dale L. Zimmerman, Jun Tang, Rui Huang. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2416--2451. Published 27 Nov 2019 (projecteuclid.org).
Abstract: We extend statistical shape analytic methods known as outline analysis for application to the strike zone, a central feature of the game of baseball. Although the strike zone is rigorously defined by Major League Baseball’s official rules, umpires make mistakes in calling pitches as strikes (and balls) and may even adhere to a strike zone somewhat different from that prescribed by the rule book. Our methods yield inference on geometric attributes (centroid, dimensions, orientation and shape) of this “called strike zone” (CSZ) and on the effects that years, umpires, player attributes, game situation factors and their interactions have on those attributes. The methodology consists of first using kernel discriminant analysis to determine a noisy outline representing the CSZ corresponding to each factor combination, then fitting existing elliptic Fourier and new generalized superelliptic models for closed curves to that outline, and finally analyzing the fitted model coefficients using standard methods of regression analysis, factorial analysis of variance and variance component estimation. We apply these methods to PITCHf/x data comprising more than three million called pitches from the 2008–2016 Major League Baseball seasons to address numerous questions about the CSZ. We find that all geometric attributes of the CSZ, except its size, became significantly more like those of the rule-book strike zone over 2008–2016, and that several player attribute and game situation factors had statistically and practically significant effects on many of them. We also establish that the variation in the horizontal center, width and area of an individual umpire’s CSZ from pitch to pitch is smaller than the variation among CSZs from different umpires.
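The outline-fitting step can be illustrated in miniature: a closed curve sampled at points around its boundary can be summarized by a truncated Fourier series of its complex coordinates, keeping only a few harmonics. This sketch is a plain Fourier descriptor, simpler than the elliptic Fourier and generalized superelliptic models used in the paper, but it shows how a noisy outline becomes a small coefficient vector amenable to regression and ANOVA.

```python
import cmath
import math

def fourier_outline(points, n_harmonics):
    """Fit a truncated Fourier series to a closed outline.

    points: list of (x, y) vertices sampled around the curve.
    Returns a function t -> (x, y) reconstructing the smoothed outline
    from harmonics -n_harmonics..n_harmonics (a plain Fourier
    descriptor, not the elliptic Fourier model of the paper).
    """
    z = [complex(x, y) for x, y in points]
    n = len(z)

    # Discrete Fourier coefficient c_k = (1/n) * sum_j z_j e^{-2 pi i k j / n}
    def coef(k):
        return sum(zj * cmath.exp(-2j * math.pi * k * j / n)
                   for j, zj in enumerate(z)) / n

    ks = range(-n_harmonics, n_harmonics + 1)
    cs = {k: coef(k) for k in ks}

    def curve(t):  # t in [0, 1)
        w = sum(cs[k] * cmath.exp(2j * math.pi * k * t) for k in ks)
        return (w.real, w.imag)

    return curve
```

A circle sampled at 64 points is reproduced exactly by a single harmonic; real called-strike-zone outlines would need a handful of harmonics, whose fitted coefficients become the response in the downstream analyses.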
Propensity score weighting for causal inference with multiple treatments
Fan Li, Fan Li. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2389--2415. Published 27 Nov 2019 (projecteuclid.org).
Abstract: Causal or unconfounded descriptive comparisons between multiple groups are common in observational studies. Motivated by a racial disparity study in health services research, we propose a unified propensity score weighting framework, the balancing weights, for estimating causal effects with multiple treatments. These weights incorporate the generalized propensity scores to balance the weighted covariate distribution of each treatment group, all weighted toward a common prespecified target population. The class of balancing weights includes several existing approaches, such as the inverse probability weights and trimming weights, as special cases. Within this framework, we propose a set of target estimands based on linear contrasts. We further develop the generalized overlap weights, constructed as the product of the inverse probability weights and the harmonic mean of the generalized propensity scores. The generalized overlap weighting scheme corresponds to the target population with the most overlap in covariates across the multiple treatments. These weights are bounded and thus bypass the problem of extreme propensities. We show that the generalized overlap weights minimize the total asymptotic variance of the moment weighting estimators for the pairwise contrasts within the class of balancing weights. We consider two balance check criteria and propose a new sandwich variance estimator for estimating the causal effects with generalized overlap weights. We apply these methods to study the racial disparities in medical expenditure between several racial groups using the 2009 Medical Expenditure Panel Survey (MEPS) data. Simulations were carried out to compare with existing methods.
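The construction described in the abstract, inverse probability weights tilted by the harmonic mean of the generalized propensity scores, has a direct computational form. A minimal sketch follows; the propensity matrix is assumed given (e.g. from a fitted multinomial model), and weights are shown unnormalized.

```python
def generalized_overlap_weights(propensities, treatment):
    """Generalized overlap weights for multiple treatments.

    propensities: list of rows, each row the generalized propensity
    scores e_1(x), ..., e_K(x) for one unit (rows sum to 1).
    treatment: index of the treatment each unit actually received.

    Each weight is the inverse probability weight 1/e_{Z}(x) times the
    harmonic mean of the K propensity scores, K / sum_k (1/e_k(x)),
    so extreme propensities cannot blow the weight up (it stays
    bounded above by K).
    """
    weights = []
    for row, z in zip(propensities, treatment):
        harmonic_mean = len(row) / sum(1.0 / e for e in row)
        weights.append(harmonic_mean / row[z])
    return weights
```

A unit with equal propensities across all treatments gets weight 1, while a unit with a near-zero propensity for its received treatment is down-weighted rather than exploding, which is the boundedness property the abstract highlights.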
A nonparametric spatial test to identify factors that shape a microbiome
Susheela P. Singh, Ana-Maria Staicu, Robert R. Dunn, Noah Fierer, Brian J. Reich. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2341--2362. Published 27 Nov 2019 (projecteuclid.org).
Abstract: The advent of high-throughput sequencing technologies has made data from DNA material readily available, leading to a surge of microbiome-related research establishing links between markers of microbiome health and specific outcomes. However, to harness the power of microbial communities we must understand not only how they affect us, but also how they can be influenced to improve outcomes. This area has been dominated by methods that reduce community composition to summary metrics, which can fail to fully exploit the complexity of community data. Recently, methods have been developed to model the abundance of taxa in a community, but they can be computationally intensive and do not account for spatial effects underlying microbial settlement. These spatial effects are particularly relevant in the microbiome setting because we expect communities that are close together to be more similar than those that are far apart. In this paper, we propose a flexible Bayesian spike-and-slab variable selection model for presence-absence indicators that accounts for spatial dependence and cross-dependence between taxa while reducing dimensionality in both directions. We show by simulation that in the presence of spatial dependence, popular distance-based hypothesis testing methods fail to preserve their advertised size, and the proposed method improves variable selection. Finally, we present an application of our method to an indoor fungal community found within homes across the contiguous United States.
A latent discrete Markov random field approach to identifying and classifying historical forest communities based on spatial multivariate tree species counts
Stephen Berg, Jun Zhu, Murray K. Clayton, Monika E. Shea, David J. Mladenoff. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2312--2340. Published 27 Nov 2019 (projecteuclid.org).
Abstract: The Wisconsin Public Land Survey database describes historical forest composition at high spatial resolution and is of interest in ecological studies of forest composition in Wisconsin just prior to significant Euro-American settlement. For such studies it is useful to identify recurring subpopulations of tree species known as communities, but standard clustering approaches for subpopulation identification do not account for dependence between spatially nearby observations. Here, we develop and fit a latent discrete Markov random field model for the purpose of identifying and classifying historical forest communities based on spatially referenced multivariate tree species counts across Wisconsin. We show empirically for the actual dataset and through simulation that our latent Markov random field modeling approach improves prediction and parameter estimation performance. For model fitting we introduce a new stochastic approximation algorithm which enables computationally efficient estimation and classification of large amounts of spatial multivariate count data.
Objective Bayes model selection of Gaussian interventional essential graphs for the identification of signaling pathways
Federico Castelletti, Guido Consonni. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2289--2311. Published 27 Nov 2019 (projecteuclid.org).
Abstract: A signalling pathway is a sequence of chemical reactions initiated by a stimulus which in turn affects a receptor, and then through some intermediate steps cascades down to the final cell response. Based on the technique of flow cytometry, samples of cell-by-cell measurements are collected under each experimental condition, resulting in a collection of interventional data (assuming no latent variables are involved). Usually several external interventions are applied at different points of the pathway, the ultimate aim being the structural recovery of the underlying signalling network, which we model as a causal Directed Acyclic Graph (DAG) using intervention calculus. The advantage of using interventional data, rather than purely observational data, is that identifiability of the true data-generating DAG is enhanced. More technically, a Markov equivalence class of DAGs, whose members are statistically indistinguishable based on observational data alone, can be further decomposed, using additional interventional data, into smaller distinct Interventional Markov equivalence classes. We present a Bayesian methodology for structural learning of Interventional Markov equivalence classes based on observational and interventional samples of multivariate Gaussian observations. Our approach is objective, meaning that it is based on default parameter priors requiring no personal elicitation; some flexibility is however allowed through a tuning parameter which regulates sparsity in the prior on model space. Based on an analytical expression for the marginal likelihood of a given Interventional Essential Graph, and a suitable MCMC scheme, our analysis produces an approximate posterior distribution on the space of Interventional Markov equivalence classes, which can be used to provide uncertainty quantification for features of substantive scientific interest, such as the posterior probability of inclusion of selected edges or paths.
Fitting a deeply nested hierarchical model to a large book review dataset using a moment-based estimator
Ningshan Zhang, Kyle Schmaus, Patrick O. Perry. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2260--2288. Published 27 Nov 2019 (projecteuclid.org).
Abstract: We consider a particular instance of a common problem in recommender systems: using a database of book reviews to inform user-targeted recommendations. In our dataset, books are categorized into genres and subgenres. To exploit this nested taxonomy, we use a hierarchical model that enables information pooling across similar items at many levels within the genre hierarchy. The main challenge in deploying this model is computational. The data sizes are large, and fitting the model at scale using off-the-shelf maximum likelihood procedures is prohibitive. To get around this computational bottleneck, we extend a moment-based fitting procedure proposed for fitting single-level hierarchical models to the general case of arbitrarily deep hierarchies. This extension is an order of magnitude faster than standard maximum likelihood procedures. The fitting method can be deployed beyond recommender systems to general contexts with deeply nested hierarchical generalized linear mixed models.
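The single-level version of such a moment-based procedure is the classical one-way random-effects ANOVA estimator: variance components are read directly off between-group and within-group mean squares, with no iterative likelihood maximization. A minimal sketch for a balanced design follows; the paper's contribution is extending this kind of closed-form moment matching to arbitrarily deep hierarchies, which this sketch does not attempt.

```python
def moment_variance_components(groups):
    """Balanced one-way random effects model y_ij = mu + a_i + e_ij.

    Estimates (sigma2_error, sigma2_group) by the method of moments:
    E[MS_within] = sigma2_e and E[MS_between] = sigma2_e + m * sigma2_a,
    where m is the common group size.
    """
    k = len(groups)          # number of groups
    m = len(groups[0])       # observations per group (balanced design)
    group_means = [sum(g) / m for g in groups]
    grand_mean = sum(group_means) / k
    ms_within = sum((y - gm) ** 2
                    for g, gm in zip(groups, group_means)
                    for y in g) / (k * (m - 1))
    ms_between = m * sum((gm - grand_mean) ** 2 for gm in group_means) / (k - 1)
    sigma2_e = ms_within
    sigma2_a = max((ms_between - ms_within) / m, 0.0)  # truncate at zero
    return sigma2_e, sigma2_a
```

Because every quantity is a simple sum over the data, the estimator is a single pass, which is the source of the speed advantage over iterative maximum likelihood that the abstract reports.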
y Principal nested shape space analysis of molecular dynamics data By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Ian L. Dryden, Kwang-Rae Kim, Charles A. Laughton, Huiling Le. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2213--2234.Abstract: Molecular dynamics simulations produce huge datasets of temporal sequences of molecules. It is of interest to summarize the shape evolution of the molecules in a succinct, low-dimensional representation. However, Euclidean techniques such as principal components analysis (PCA) can be problematic as the data may lie far from a flat manifold. Principal nested spheres gives a fundamentally different decomposition of data from the usual Euclidean subspace based PCA [Biometrika 99 (2012) 551–568]. Subspaces of successively lower dimension are fitted to the data in a backwards manner with the aim of retaining signal and dispensing with noise at each stage. We adapt the methodology to 3D subshape spaces and provide some practical fitting algorithms. The methodology is applied to cluster analysis of peptides, where different states of the molecules can be identified. Also, the temporal transitions between cluster states are explored. Full Article
y Microsimulation model calibration using incremental mixture approximate Bayesian computation By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Carolyn M. Rutter, Jonathan Ozik, Maria DeYoreo, Nicholson Collier. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2189--2212.Abstract: Microsimulation models (MSMs) are used to inform policy by predicting population-level outcomes under different scenarios. MSMs simulate individual-level event histories that mark the disease process (such as the development of cancer) and the effect of policy actions (such as screening) on these events. MSMs often have many unknown parameters; calibration is the process of searching the parameter space to select parameters that result in accurate MSM prediction of a wide range of targets. We develop Incremental Mixture Approximate Bayesian Computation (IMABC) for MSM calibration which results in a simulated sample from the posterior distribution of model parameters given calibration targets. IMABC begins with a rejection-based ABC step, drawing a sample of points from the prior distribution of model parameters and accepting points that result in simulated targets that are near observed targets. Next, the sample is iteratively updated by drawing additional points from a mixture of multivariate normal distributions and accepting points that result in accurate predictions. Posterior estimates are obtained by weighting the final set of accepted points to account for the adaptive sampling scheme. We demonstrate IMABC by calibrating CRC-SPIN 2.0, an updated version of an MSM for colorectal cancer (CRC) that has been used to inform national CRC screening guidelines. Full Article
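The rejection-based first step of IMABC can be illustrated with a toy sketch. The `prior_sample` and `simulate` callables below are hypothetical stand-ins for a real microsimulation model with a scalar target; the mixture-refinement and weighting stages of IMABC are omitted.

```python
import random

def abc_rejection(prior_sample, simulate, observed, tol, n_draws, rng):
    """Rejection-ABC step: draw parameters from the prior, simulate the
    calibration target, and accept draws whose simulated target lands
    within `tol` of the observed target. (Toy scalar-target sketch.)"""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        sim = simulate(theta, rng)
        if abs(sim - observed) <= tol:
            accepted.append(theta)
    return accepted

# Toy model: the target is the parameter observed with a little noise.
rng = random.Random(42)
kept = abc_rejection(
    prior_sample=lambda r: r.uniform(0.0, 5.0),
    simulate=lambda th, r: th + r.gauss(0.0, 0.1),
    observed=2.0, tol=0.3, n_draws=2000, rng=rng,
)
```

The accepted draws concentrate near the parameter value consistent with the observed target; IMABC then refines such a sample by proposing from a mixture of multivariate normals centered at accepted points.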
y Fire seasonality identification with multimodality tests By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Jose Ameijeiras-Alonso, Akli Benali, Rosa M. Crujeiras, Alberto Rodríguez-Casal, José M. C. Pereira. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2120--2139.Abstract: Understanding the role of vegetation fires in the Earth system is an important environmental problem. Although fire occurrence is influenced by natural factors, human activity related to land use and management has altered the temporal patterns of fire in several regions of the world. Hence, for a better insight into fire regimes it is of special interest to analyze where human activity has altered fire seasonality. For doing so, multimodality tests are a useful tool for determining the number of annual fire peaks. The periodicity of fires and their complex distributional features motivate the use of nonparametric circular statistics. The unsatisfactory performance of previous circular nonparametric proposals for testing multimodality justifies the introduction of a new approach, considering an adapted version of the excess mass statistic, jointly with a bootstrap calibration algorithm. A systematic application of the test on the Russia–Kazakhstan area is presented in order to determine how many fire peaks can be identified in this region. A False Discovery Rate correction, accounting for the spatial dependence of the data, is also required. Full Article
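As a crude illustration of the question the multimodality test formalizes (how many annual fire peaks?), one can count circular local maxima in monthly fire counts. This naive peak count is noise-sensitive; the paper instead uses an adapted excess mass statistic with bootstrap calibration. The monthly data below are invented for illustration.

```python
def count_circular_peaks(counts):
    """Count local maxima in a circular sequence (e.g. monthly fire counts),
    wrapping December back around to January. A naive stand-in for a
    formal multimodality test."""
    n = len(counts)
    return sum(
        1 for i in range(n)
        if counts[i] > counts[i - 1] and counts[i] > counts[(i + 1) % n]
    )

# A hypothetical bimodal fire season: spring and late-summer peaks.
monthly = [3, 8, 21, 14, 5, 4, 9, 25, 18, 6, 3, 2]
```

Here `count_circular_peaks(monthly)` finds two peaks (March and August), the kind of bimodal pattern that human land management can introduce into an otherwise unimodal fire season.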
y Statistical inference for partially observed branching processes with application to cell lineage tracking of in vivo hematopoiesis By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Jason Xu, Samson Koelle, Peter Guttorp, Chuanfeng Wu, Cynthia Dunbar, Janis L. Abkowitz, Vladimir N. Minin. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2091--2119.Abstract: Single-cell lineage tracking strategies enabled by recent experimental technologies have produced significant insights into cell fate decisions, but lack the quantitative framework necessary for rigorous statistical analysis of mechanistic models describing cell division and differentiation. In this paper, we develop such a framework with corresponding moment-based parameter estimation techniques for continuous-time, multi-type branching processes. Such processes provide a probabilistic model of how cells divide and differentiate, and we apply our method to study hematopoiesis, the mechanism of blood cell production. We derive closed-form expressions for higher moments in a general class of such models. These analytical results allow us to efficiently estimate parameters of much richer statistical models of hematopoiesis than those used in previous statistical studies. To our knowledge, the method provides the first rate-inference procedure for fitting such models to time series data generated from cellular barcoding experiments. After validating the methodology in simulation studies, we apply our estimator to hematopoietic lineage tracking data from rhesus macaques. Our analysis provides a more complete understanding of cell fate decisions during hematopoiesis in nonhuman primates, which may be more relevant to human biology and clinical strategies than previous findings from murine studies.
For example, in addition to the previously estimated hematopoietic stem cell self-renewal rate, we are able to estimate fate decision probabilities and to compare structurally distinct models of hematopoiesis using cross validation. These estimates of fate decision probabilities and our model selection results should help biologists compare competing hypotheses about how progenitor cells differentiate. The methodology is transferable to a large class of stochastic compartmental and multi-type branching models, commonly used in studies of cancer progression, epidemiology and many other fields. Full Article
y Estimating the rate constant from biosensor data via an adaptive variational Bayesian approach By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Ye Zhang, Zhigang Yao, Patrik Forssén, Torgny Fornstedt. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2011--2042.Abstract: The means to obtain the rate constants of a chemical reaction is a fundamental open problem in both science and industry. Traditional techniques for finding rate constants require either chemical modifications of the reactants or indirect measurements. The rate constant map method is a modern technique to study binding equilibrium and kinetics in chemical reactions. Finding a rate constant map from biosensor data is an ill-posed inverse problem that is usually solved by regularization. In this work, rather than finding a deterministic regularized rate constant map that does not provide uncertainty quantification of the solution, we develop an adaptive variational Bayesian approach to estimate the distribution of the rate constant map, from which some intrinsic properties of a chemical reaction can be explored, including information about rate constants. Our new approach is more realistic than the existing approaches used for biosensors and allows us to estimate the dynamics of the interactions, which are usually hidden in a deterministic approximate solution. We verify the performance of the new proposed method by numerical simulations, and compare it with the Markov chain Monte Carlo algorithm. The results illustrate that the variational method can reliably capture the posterior distribution in a computationally efficient way. Finally, the developed method is also tested on the real biosensor data (parathyroid hormone), where we provide two novel analysis tools—the thresholding contour map and the high order moment map—to estimate the number of interactions as well as their rate constants. Full Article
y A semiparametric modeling approach using Bayesian Additive Regression Trees with an application to evaluate heterogeneous treatment effects By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Bret Zeldow, Vincent Lo Re III, Jason Roy. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1989--2010.Abstract: Bayesian Additive Regression Trees (BART) is a flexible machine learning algorithm capable of capturing nonlinearities between an outcome and covariates and interactions among covariates. We extend BART to a semiparametric regression framework in which the conditional expectation of an outcome is a function of treatment, its effect modifiers, and confounders. The confounders are allowed to have unspecified functional form, while treatment and effect modifiers that are directly related to the research question are given a linear form. The result is a Bayesian semiparametric linear regression model where the posterior distribution of the parameters of the linear part can be interpreted as in parametric Bayesian regression. This is useful in situations where a subset of the variables are of substantive interest and the others are nuisance variables that we would like to control for. An example of this occurs in causal modeling with the structural mean model (SMM). Under certain causal assumptions, our method can be used as a Bayesian SMM. Our methods are demonstrated with simulation studies and an application to a dataset involving adults with HIV/Hepatitis C coinfection who newly initiate antiretroviral therapy. The methods are available in an R package called semibart. Full Article
y Radio-iBAG: Radiomics-based integrative Bayesian analysis of multiplatform genomic data By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Youyi Zhang, Jeffrey S. Morris, Shivali Narang Aerry, Arvind U. K. Rao, Veerabhadran Baladandayuthapani. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1957--1988.Abstract: Technological innovations have produced large multi-modal datasets that include imaging and multi-platform genomics data. Integrative analyses of such data have the potential to reveal important biological and clinical insights into complex diseases like cancer. In this paper, we present Bayesian approaches for integrative analysis of radiological imaging and multi-platform genomic data, wherein our goals are to simultaneously identify genomic and radiomic, that is, radiology-based imaging markers, along with the latent associations between these two modalities, and to detect the overall prognostic relevance of the combined markers. For this task, we propose Radio-iBAG: Radiomics-based Integrative Bayesian Analysis of Multiplatform Genomic Data, a multi-scale Bayesian hierarchical model that involves several innovative strategies: it incorporates integrative analysis of multi-platform genomic data sets to capture fundamental biological relationships; explores the associations between radiomic markers accompanying genomic information with clinical outcomes; and detects genomic and radiomic markers associated with clinical prognosis. We also introduce the use of sparse Principal Component Analysis (sPCA) to extract a sparse set of approximately orthogonal meta-features each containing information from a set of related individual radiomic features, reducing dimensionality and combining like features. Our methods are motivated by and applied to The Cancer Genome Atlas glioblastoma multiforme data set, wherein we integrate magnetic resonance imaging-based biomarkers along with genomic, epigenomic and transcriptomic data.
Our model identifies important magnetic resonance imaging features and the associated genomic platforms that are related to patient survival times. Full Article
y Bayesian methods for multiple mediators: Relating principal stratification and causal mediation in the analysis of power plant emission controls By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Chanmin Kim, Michael J. Daniels, Joseph W. Hogan, Christine Choirat, Corwin M. Zigler. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1927--1956.Abstract: Emission control technologies installed on power plants are a key feature of many air pollution regulations in the US. While such regulations are predicated on the presumed relationships between emissions, ambient air pollution and human health, many of these relationships have never been empirically verified. The goal of this paper is to develop new statistical methods to quantify these relationships. We frame this problem as one of mediation analysis to evaluate the extent to which the effect of a particular control technology on ambient pollution is mediated through causal effects on power plant emissions. Since power plants emit various compounds that contribute to ambient pollution, we develop new methods for multiple intermediate variables that are measured contemporaneously, may interact with one another, and may exhibit joint mediating effects. Specifically, we propose new methods leveraging two related frameworks for causal inference in the presence of mediating variables: principal stratification and causal mediation analysis. We define principal effects based on multiple mediators, and also introduce a new decomposition of the total effect of an intervention on ambient pollution into the natural direct effect and natural indirect effects for all combinations of mediators. Both approaches are anchored to the same observed-data models, which we specify with Bayesian nonparametric techniques. We provide assumptions for estimating principal causal effects, then augment these with an additional assumption required for causal mediation analysis. 
The two analyses, interpreted in tandem, provide the first empirical investigation of the presumed causal pathways that motivate important air quality regulatory policies. Full Article
y Wavelet spectral testing: Application to nonstationary circadian rhythms By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Jessica K. Hargreaves, Marina I. Knight, Jon W. Pitchford, Rachael J. Oakenfull, Sangeeta Chawla, Jack Munns, Seth J. Davis. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1817--1846.Abstract: Rhythmic data are ubiquitous in the life sciences. Biologists need reliable statistical tests to identify whether a particular experimental treatment has caused a significant change in a rhythmic signal. When these signals display nonstationary behaviour, as is common in many biological systems, the established methodologies may be misleading. Therefore, there is a real need for new methodology that enables the formal comparison of nonstationary processes. As circadian behaviour is best understood in the spectral domain, here we develop novel hypothesis testing procedures in the (wavelet) spectral domain, embedding replicate information when available. The data are modelled as realisations of locally stationary wavelet processes, allowing us to define and rigorously estimate their evolutionary wavelet spectra. Motivated by three complementary applications in circadian biology, our new methodology allows the identification of three specific types of spectral difference. We demonstrate the advantages of our methodology over alternative approaches, by means of a comprehensive simulation study and real data applications, using both published and newly generated circadian datasets. In contrast to the current standard methodologies, our method successfully identifies differences within the motivating circadian datasets, and facilitates wider ranging analyses of rhythmic biological data in general. Full Article
y Bayesian modeling of the structural connectome for studying Alzheimer’s disease By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Arkaprava Roy, Subhashis Ghosal, Jeffrey Prescott, Kingshuk Roy Choudhury. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1791--1816.Abstract: We study possible relations between Alzheimer’s disease progression and the structure of the connectome, which is white matter connecting different regions of the brain. Regression models in covariates including age, gender and disease status for the extent of white matter connecting each pair of regions of the brain are proposed. Subject inhomogeneity is also incorporated in the model through random effects with an unknown distribution. As there is a large number of pairs of regions, we also adopt a dimension reduction technique through graphon (J. Combin. Theory Ser. B 96 (2006) 933–957) functions which reduces the functions of pairs of regions to functions of regions. The connecting graphon functions are considered unknown but the assumed smoothness allows putting priors of low complexity on these functions. We pursue a nonparametric Bayesian approach by assigning a Dirichlet process scale mixture of zero-mean normal prior on the distributions of the random effects and finite random series of tensor products of B-spline priors on the underlying graphon functions. We develop efficient Markov chain Monte Carlo techniques for drawing samples for the posterior distributions using Hamiltonian Monte Carlo (HMC). The proposed Bayesian method overwhelmingly outperforms a competing method based on ANCOVA models in the simulation setup. The proposed Bayesian approach is applied on a dataset of 100 subjects and 83 brain regions and key regions implicated in the changing connectome are identified. Full Article
y A hierarchical Bayesian model for single-cell clustering using RNA-sequencing data By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Yiyi Liu, Joshua L. Warren, Hongyu Zhao. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1733--1752.Abstract: Understanding the heterogeneity of cells is an important biological question. The development of single-cell RNA-sequencing (scRNA-seq) technology provides high resolution data for such inquiry. A key challenge in scRNA-seq analysis is the high variability of measured RNA expression levels and frequent dropouts (missing values) due to limited input RNA compared to bulk RNA-seq measurement. Existing clustering methods do not perform well for these noisy and zero-inflated scRNA-seq data. In this manuscript we propose a Bayesian hierarchical model, called BasClu, to appropriately characterize important features of scRNA-seq data in order to more accurately cluster cells. We demonstrate the effectiveness of our method with extensive simulation studies and applications to three real scRNA-seq datasets. Full Article
y A Bayesian mark interaction model for analysis of tumor pathology images By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Qiwei Li, Xinlei Wang, Faming Liang, Guanghua Xiao. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1708--1732.Abstract: With the advance of imaging technology, digital pathology imaging of tumor tissue slides is becoming a routine clinical procedure for cancer diagnosis. This process produces massive imaging data that capture histological details in high resolution. Recent developments in deep-learning methods have enabled us to identify and classify individual cells from digital pathology images at large scale. Reliable statistical approaches to model the spatial pattern of cells can provide new insight into tumor progression and shed light on the biological mechanisms of cancer. We consider the problem of modeling spatial correlations among three commonly seen cells observed in tumor pathology images. A novel geostatistical marking model with interpretable underlying parameters is proposed in a Bayesian framework. We use auxiliary variable MCMC algorithms to sample from the posterior distribution with an intractable normalizing constant. We demonstrate how this model-based analysis can lead to sharper inferences than ordinary exploratory analyses, by means of application to three benchmark datasets and a case study on the pathology images of $188$ lung cancer patients. The case study shows that the spatial correlation between tumor and stromal cells predicts patient prognosis. This statistical methodology not only presents a new model for characterizing spatial correlations in a multitype spatial point pattern conditioning on the locations of the points, but also provides a new perspective for understanding the role of cell–cell interactions in cancer progression. Full Article
y Sequential decision model for inference and prediction on nonuniform hypergraphs with application to knot matching from computational forestry By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Seong-Hwan Jun, Samuel W. K. Wong, James V. Zidek, Alexandre Bouchard-Côté. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1678--1707.Abstract: In this paper, we consider the knot-matching problem arising in computational forestry. The knot-matching problem is an important problem that needs to be solved to advance the state of the art in automatic strength prediction of lumber. We show that this problem can be formulated as a quadripartite matching problem and develop a sequential decision model that admits efficient parameter estimation, along with a sequential Monte Carlo sampler that can be utilized for rapid sampling of graph matchings. We demonstrate the effectiveness of our methods on 30 manually annotated boards and present findings from various simulation studies that provide further evidence of their efficacy. Full Article
y RCRnorm: An integrated system of random-coefficient hierarchical regression models for normalizing NanoString nCounter data By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Gaoxiang Jia, Xinlei Wang, Qiwei Li, Wei Lu, Ximing Tang, Ignacio Wistuba, Yang Xie. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1617--1647.Abstract: Formalin-fixed paraffin-embedded (FFPE) samples have great potential for biomarker discovery, retrospective studies and diagnosis or prognosis of diseases. Their application, however, is hindered by the unsatisfactory performance of traditional gene expression profiling techniques on damaged RNAs. The NanoString nCounter platform is well suited for profiling of FFPE samples and measures gene expression with high sensitivity, which may greatly facilitate realization of scientific and clinical values of FFPE samples. However, methodological development for normalization, a critical step when analyzing this type of data, is far behind. Existing methods designed for the platform use information from different types of internal controls separately and rely on an overly-simplified assumption that expression of housekeeping genes is constant across samples for global scaling. Thus, these methods are not optimized for the nCounter system, not to mention that they were not developed for FFPE samples. We construct an integrated system of random-coefficient hierarchical regression models to capture main patterns and characteristics observed from NanoString data of FFPE samples and develop a Bayesian approach to estimate parameters and normalize gene expression across samples. Our method, labeled RCRnorm, incorporates information from all aspects of the experimental design and simultaneously removes biases from various sources. It eliminates the unrealistic assumption on housekeeping genes and offers great interpretability.
Furthermore, it is applicable to freshly frozen or similar samples that can be generally viewed as a reduced case of FFPE samples. Simulations and applications showed the superior performance of RCRnorm. Full Article
y Modeling seasonality and serial dependence of electricity price curves with warping functional autoregressive dynamics By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Ying Chen, J. S. Marron, Jiejie Zhang. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1590--1616.Abstract: Electricity prices are high dimensional, serially dependent and have seasonal variations. We propose a Warping Functional AutoRegressive (WFAR) model that simultaneously accounts for the cross time-dependence and seasonal variations of the large dimensional data. In particular, electricity price curves are obtained by smoothing over the $24$ discrete hourly prices on each day. In the functional domain, seasonal phase variations are separated from level amplitude changes in a warping process with the Fisher–Rao distance metric, and the aligned (season-adjusted) electricity price curves are modeled in the functional autoregression framework. In a real application, the WFAR model provides superior out-of-sample forecast accuracy in both a normal functioning market, Nord Pool, and an extreme situation, the California market. The forecast performance as well as the relative accuracy improvement are stable for different markets and different time periods. Full Article
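Once the warping step has aligned the daily price curves, the autoregressive part of WFAR can be caricatured with a scalar-coefficient FAR(1) forecast. This is a toy stand-in: the paper's autoregressive operator acts on whole functions, whereas here a single made-up coefficient `rho` shrinks the last day's deviation toward the pointwise mean, and raw hourly grids replace smoothed curves.

```python
def far1_forecast(curves, rho):
    """One-step forecast under a toy FAR(1): next curve = pointwise mean
    plus rho times the last curve's deviation from that mean. `curves`
    is a list of equal-length price grids (e.g. 24 hourly prices per day)."""
    n_hours = len(curves[0])
    mean = [sum(c[h] for c in curves) / len(curves) for h in range(n_hours)]
    return [mean[h] + rho * (curves[-1][h] - mean[h]) for h in range(n_hours)]

history = [[30.0, 42.0], [34.0, 46.0]]   # two days, two "hours" each
forecast = far1_forecast(history, rho=0.5)
```

With two flat example days, the forecast lands halfway between the last curve and the historical mean at every hour, which is exactly the mean-reverting behavior a stationary functional autoregression encodes.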
y Fast dynamic nonparametric distribution tracking in electron microscopic data By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Yanjun Qian, Jianhua Z. Huang, Chiwoo Park, Yu Ding. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1537--1563.Abstract: In situ transmission electron microscope (TEM) adds a promising instrument to the exploration of the nanoscale world, allowing motion pictures to be taken while nano objects are initiating, crystallizing and morphing into different sizes and shapes. To enable in-process control of nanocrystal production, this technology innovation hinges upon a solution addressing a statistical problem, which is the capability of online tracking a dynamic, time-varying probability distribution reflecting the nanocrystal growth. Because no known parametric density functions can adequately describe the evolving distribution, a nonparametric approach is inevitable. Towards this objective, we propose to incorporate the dynamic evolution of the normalized particle size distribution into a state space model, in which the density function is represented by a linear combination of B-splines and the spline coefficients are treated as states. The closed-form algorithm runs online updates faster than the frame rate of the in situ TEM video, making it suitable for in-process control purposes. Imposing the constraints of curve smoothness and temporal continuity improves the accuracy and robustness while tracking the probability distribution. We test our method on three published TEM videos. For all of them, the proposed method is able to outperform several alternative approaches. Full Article
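The B-spline representation of the tracked density can be sketched with the Cox–de Boor recursion. The knot vector and coefficients below are illustrative only; the state-space updating of the coefficients and the smoothness constraints described in the abstract are omitted.

```python
def bspline_basis(i, k, t, knots):
    """Cox–de Boor recursion: value at t of the i-th B-spline basis
    function of order k (degree k-1) over a nondecreasing knot vector."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    out = 0.0
    d = knots[i + k - 1] - knots[i]
    if d > 0.0:
        out += (t - knots[i]) / d * bspline_basis(i, k - 1, t, knots)
    d = knots[i + k] - knots[i + 1]
    if d > 0.0:
        out += (knots[i + k] - t) / d * bspline_basis(i + 1, k - 1, t, knots)
    return out

def density_value(coeffs, t, knots, k=4):
    """Density as a linear combination of B-spline bases; in the tracked
    model the coefficients play the role of the state vector, updated
    frame by frame."""
    return sum(c * bspline_basis(i, k, t, knots) for i, c in enumerate(coeffs))

knots = [0.0] * 4 + [1.0, 2.0, 3.0] + [4.0] * 4   # clamped cubic knots on [0, 4]
val = density_value([1.0] * 7, 2.5, knots)         # all-ones coefficients
```

Setting every coefficient to one recovers the partition-of-unity property of the basis (the value is exactly 1 in the interior), a convenient sanity check before wiring the coefficients into a state-space filter.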
y Identifying multiple changes for a functional data sequence with application to freeway traffic segmentation By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Jeng-Min Chiou, Yu-Ting Chen, Tailen Hsing. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1430--1463.Abstract: Motivated by the study of road segmentation partitioned by shifts in traffic conditions along a freeway, we introduce a two-stage procedure, Dynamic Segmentation and Backward Elimination (DSBE), for identifying multiple changes in the mean functions for a sequence of functional data. The Dynamic Segmentation procedure searches for all possible changepoints using the derived global optimality criterion coupled with the local strategy of at-most-one-changepoint by dividing the entire sequence into individual subsequences that are recursively adjusted until convergence. Then, the Backward Elimination procedure verifies these changepoints by iteratively testing the unlikely changes to ensure their significance until no more changepoints can be removed. By combining the local strategy with the global optimal changepoint criterion, the DSBE algorithm is conceptually simple and easy to implement and performs better than the binary segmentation-based approach at detecting small multiple changes. The consistency property of the changepoint estimators and the convergence of the algorithm are proved. We apply DSBE to detect changes in traffic streams through real freeway traffic data. The practical performance of DSBE is also investigated through intensive simulation studies for various scenarios. Full Article
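The at-most-one-changepoint building block that DSBE applies within each subsequence can be illustrated for a scalar series; for functional data, the squared deviations below would be replaced by squared $L^2$ distances between curves and their segment mean functions, and the split would be kept only if it passes a significance test.

```python
def best_single_changepoint(xs):
    """Return the split index minimizing the total within-segment sum of
    squared deviations, or None if no split improves on the unsplit fit.
    Scalar toy version of an at-most-one-changepoint search."""
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)

    best_k, best_cost = None, sse(xs)
    for k in range(1, len(xs)):
        cost = sse(xs[:k]) + sse(xs[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k
```

A level shift is found exactly at the jump, and a constant series returns no changepoint; DSBE's dynamic-segmentation stage recursively applies such a step to subsequences, and its backward-elimination stage then re-tests each candidate.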
y Introduction to papers on the modeling and analysis of network data—II By projecteuclid.org Published On :: Thu, 05 Aug 2010 15:41 EDT Stephen E. Fienberg. Source: Ann. Appl. Stat., Volume 4, Number 2, 533--534. Full Article
y Stratonovich type integration with respect to fractional Brownian motion with Hurst parameter less than $1/2$ By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Jorge A. León. Source: Bernoulli, Volume 26, Number 3, 2436--2462.Abstract: Let $B^{H}$ be a fractional Brownian motion with Hurst parameter $H\in(0,1/2)$ and $p:\mathbb{R}\rightarrow\mathbb{R}$ a polynomial function. The main purpose of this paper is to introduce a Stratonovich type stochastic integral with respect to $B^{H}$, whose domain includes the process $p(B^{H})$. That is, an integral that allows us to integrate $p(B^{H})$ with respect to $B^{H}$, which does not happen with the symmetric integral given by Russo and Vallois (Probab. Theory Related Fields 97 (1993) 403–421) in general. Towards this end, we combine the approaches utilized by León and Nualart (Stochastic Process. Appl. 115 (2005) 481–492), and Russo and Vallois (Probab. Theory Related Fields 97 (1993) 403–421), whose aims are to extend the domain of the divergence operator for Gaussian processes and to define some stochastic integrals, respectively. Then, we study the relation between this Stratonovich integral and the extension of the divergence operator (see León and Nualart (Stochastic Process. Appl. 115 (2005) 481–492)), an Itô formula and the existence of a unique solution of some Stratonovich stochastic differential equations. These last results have been analyzed by Alòs, León and Nualart (Taiwanese J. Math. 5 (2001) 609–632), where the Hurst parameter $H$ belongs to the interval $(1/4,1/2)$. Full Article
y Local law and Tracy–Widom limit for sparse stochastic block models By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Jong Yun Hwang, Ji Oon Lee, Wooseok Yang. Source: Bernoulli, Volume 26, Number 3, 2400--2435.Abstract: We consider the spectral properties of sparse stochastic block models, where $N$ vertices are partitioned into $K$ balanced communities. Under an assumption that the intra-community probability and inter-community probability are of similar order, we prove a local semicircle law up to the spectral edges, with an explicit formula on the deterministic shift of the spectral edge. We also prove that the fluctuation of the extremal eigenvalues is given by the GOE Tracy–Widom law after rescaling and centering the entries of sparse stochastic block models. Applying the result to sparse stochastic block models, we rigorously prove that there is a large gap between the outliers and the spectral edge without centering. Full Article
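The outlier-versus-bulk picture described in the abstract can be seen numerically: for a two-balanced-community SBM, the top adjacency eigenvalue concentrates near $N(p_{in}+p_{out})/2$, well above the bulk spectral edge. The sketch below uses dense toy parameters rather than the sparse regime the paper analyzes, and power iteration in place of a full eigensolver.

```python
import math
import random

def sbm_adjacency(n, p_in, p_out, rng):
    """Symmetric adjacency matrix of a two-balanced-community stochastic
    block model: edge probability p_in within a community, p_out across."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if (i < n // 2) == (j < n // 2) else p_out
            if rng.random() < p:
                A[i][j] = A[j][i] = 1.0
    return A

def top_eigenvalue(A, iters=200):
    """Power iteration for the largest eigenvalue of a nonnegative
    symmetric matrix (Perron-Frobenius guarantees it is the dominant one)."""
    n = len(A)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient at the converged unit vector.
    return sum(v[i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n))

rng = random.Random(7)
lam = top_eigenvalue(sbm_adjacency(60, 0.8, 0.2, rng))  # roughly 60 * 0.5 = 30
```

With `n = 60`, `p_in = 0.8`, `p_out = 0.2`, the top eigenvalue sits near 30 while the semicircle bulk is an order of magnitude smaller, which is the "large gap between the outliers and the spectral edge" the theorem makes rigorous without centering.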
y Frequency domain theory for functional time series: Variance decomposition and an invariance principle By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Piotr Kokoszka, Neda Mohammadi Jouzdani. Source: Bernoulli, Volume 26, Number 3, 2383--2399.Abstract: This paper is concerned with frequency domain theory for functional time series, which are temporally dependent sequences of functions in a Hilbert space. We consider a variance decomposition, which is more suitable for such a data structure than the variance decomposition based on the Karhunen–Loève expansion. The decomposition we study uses eigenvalues of spectral density operators, which are functional analogs of the spectral density of a stationary scalar time series. We propose estimators of the variance components and derive convergence rates for their mean square error as well as their asymptotic normality. The latter is derived from a frequency domain invariance principle for the estimators of the spectral density operators. This principle is established for a broad class of linear time series models. It is a main contribution of the paper. Full Article
y Bayesian linear regression for multivariate responses under group sparsity By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Bo Ning, Seonghyun Jeong, Subhashis Ghosal. Source: Bernoulli, Volume 26, Number 3, 2353--2382.Abstract: We study frequentist properties of a Bayesian high-dimensional multivariate linear regression model with correlated responses. The predictors are separated into many groups and the group structure is pre-determined. Two features of the model are unique: (i) group sparsity is imposed on the predictors; (ii) the covariance matrix is unknown and its dimensions can also be high. We choose a product of independent spike-and-slab priors on the regression coefficients and a new prior on the covariance matrix based on its eigendecomposition. Each spike-and-slab prior is a mixture of a point mass at zero and a multivariate density involving the $\ell_{2,1}$-norm. We first obtain the posterior contraction rate and bounds on the effective dimension of the model holding with high posterior probability. We then show that the multivariate regression coefficients can be recovered under certain compatibility conditions. Finally, we quantify the uncertainty for the regression coefficients with frequentist validity through a Bernstein–von Mises type theorem. The result leads to selection consistency for the Bayesian method. We derive the posterior contraction rate using the general theory by constructing a suitable test from first principles using moment bounds for certain likelihood ratios. This leads to posterior concentration around the truth with respect to the average Rényi divergence of order $1/2$. This technique of obtaining the required tests for posterior contraction rate could be useful in many other problems. Full Article
y A refined Cramér-type moderate deviation for sums of local statistics By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Xiao Fang, Li Luo, Qi-Man Shao. Source: Bernoulli, Volume 26, Number 3, 2319--2352.Abstract: We prove a refined Cramér-type moderate deviation result by taking into account the skewness in the normal approximation for sums of local statistics of independent random variables. We apply the main result to $k$-runs, U-statistics and subgraph counts in the Erdős–Rényi random graph. To prove our main result, we develop exponential concentration inequalities and higher-order tail probability expansions via Stein’s method. Full Article
y Weighted Lépingle inequality By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Pavel Zorin-Kranich. Source: Bernoulli, Volume 26, Number 3, 2311--2318.Abstract: We prove an estimate for weighted $p$th moments of the pathwise $r$-variation of a martingale in terms of the $A_{p}$ characteristic of the weight. The novelty of the proof is that we avoid real interpolation techniques. Full Article
y Concentration of the spectral norm of Erdős–Rényi random graphs By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Gábor Lugosi, Shahar Mendelson, Nikita Zhivotovskiy. Source: Bernoulli, Volume 26, Number 3, 2253--2274.Abstract: We present results on the concentration properties of the spectral norm $\|A_{p}\|$ of the adjacency matrix $A_{p}$ of an Erdős–Rényi random graph $G(n,p)$. First, we consider the Erdős–Rényi random graph process and prove that $\|A_{p}\|$ is uniformly concentrated over the range $p\in[C\log n/n,1]$. The analysis is based on delocalization arguments, uniform laws of large numbers, together with the entropy method to prove concentration inequalities. As an application of our techniques, we prove sharp sub-Gaussian moment inequalities for $\|A_{p}\|$ for all $p\in[c\log^{3}n/n,1]$ that improve the general bounds of Alon, Krivelevich, and Vu ( Israel J. Math. 131 (2002) 259–267) and some of the more recent results of Erdős et al. ( Ann. Probab. 41 (2013) 2279–2375). Both results are consistent with the asymptotic result of Füredi and Komlós ( Combinatorica 1 (1981) 233–241) that holds for fixed $p$ as $n\to\infty$. Full Article
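The concentration phenomenon in this abstract is easy to observe empirically. The sketch below is a simulation for intuition only, not the authors' entropy-method argument; the choices $n = 400$, $p = 5\log n/n$, and 20 repetitions are assumptions for the demo. It draws several independent $G(n,p)$ graphs in the range $p \ge C\log n/n$ and checks that the spectral norm fluctuates on a scale much smaller than its typical value.

```python
import numpy as np

def spectral_norm_gnp(n, p, rng):
    """Spectral norm (largest singular value) of one G(n, p) adjacency matrix."""
    upper = np.triu(rng.random((n, n)) < p, 1)  # independent edges above the diagonal
    A = (upper + upper.T).astype(float)
    return np.linalg.norm(A, 2)

rng = np.random.default_rng(2)
n = 400
p = 5 * np.log(n) / n  # inside the range [C log n / n, 1] from the abstract
norms = [spectral_norm_gnp(n, p, rng) for _ in range(20)]
print("mean:", np.mean(norms), "std:", np.std(norms))
```

The standard deviation across draws is an order of magnitude smaller than the mean, in line with the sub-Gaussian moment bounds stated above.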
y On Sobolev tests of uniformity on the circle with an extension to the sphere By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Sreenivasa Rao Jammalamadaka, Simos Meintanis, Thomas Verdebout. Source: Bernoulli, Volume 26, Number 3, 2226--2252.Abstract: Circular and spherical data arise in many applications, especially in biology, Earth sciences and astronomy. In dealing with such data, one of the preliminary steps, before any further inference, is to test whether the data are isotropic, that is, uniformly distributed around the circle or the sphere. In view of its importance, there is a considerable literature on the topic. In the present work, we provide new tests of uniformity on the circle based on original asymptotic results. Our tests are motivated by the shape of locally and asymptotically maximin tests of uniformity against generalized von Mises distributions. We show that they are uniformly consistent. Empirical power comparisons with several competing procedures are presented via simulations. The new tests detect particularly well multimodal alternatives such as mixtures of von Mises distributions. A practically oriented combination of the new tests with already existing Sobolev tests is proposed. An extension to testing uniformity on the sphere, along with some simulations, is included. The procedures are illustrated on a real dataset. Full Article
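To make the testing problem concrete: the classical Rayleigh test is the simplest Sobolev-type test of circular uniformity and serves as a baseline for the new procedures in this abstract. The sketch below implements it (this is the standard textbook test, not the authors' new statistics; sample sizes and the von Mises concentration $\kappa = 2$ are assumptions for the demo). Under uniformity the statistic is asymptotically $\chi^{2}_{2}$, whose survival function is exactly $e^{-x/2}$.

```python
import numpy as np

def rayleigh_test(theta):
    """Rayleigh test of uniformity on the circle for angles theta (radians).

    Returns (statistic, asymptotic p-value). Powerful against unimodal
    von Mises-type alternatives, but blind to some multimodal ones."""
    n = len(theta)
    C, S = np.cos(theta).sum(), np.sin(theta).sum()
    stat = 2.0 * (C**2 + S**2) / n   # asymptotically chi-squared with 2 df
    pval = np.exp(-stat / 2.0)       # chi-squared(2) survival function
    return stat, pval

rng = np.random.default_rng(3)
_, p_unif = rayleigh_test(rng.uniform(0, 2 * np.pi, 300))  # uniform data: no rejection
_, p_vm = rayleigh_test(rng.vonmises(0.0, 2.0, 300))       # unimodal alternative: rejection
print("uniform p-value:", p_unif, "von Mises p-value:", p_vm)
```

The Rayleigh test's weakness against multimodal alternatives (e.g. von Mises mixtures with antipodal modes) is precisely the gap the new tests in the paper are designed to fill.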
y Exponential integrability and exit times of diffusions on sub-Riemannian and metric measure spaces By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Anton Thalmaier, James Thompson. Source: Bernoulli, Volume 26, Number 3, 2202--2225.Abstract: In this article, we derive moment estimates, exponential integrability, concentration inequalities and exit times estimates for canonical diffusions firstly on sub-Riemannian limits of Riemannian foliations and secondly in the nonsmooth setting of $\operatorname{RCD}^{*}(K,N)$ spaces. In each case, the necessary ingredients are Itô’s formula and a comparison theorem for the Laplacian, for which we refer to the recent literature. As an application, we derive pointwise Carmona-type estimates on eigenfunctions of Schrödinger operators. Full Article
y Directional differentiability for supremum-type functionals: Statistical applications By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Javier Cárcamo, Antonio Cuevas, Luis-Alberto Rodríguez. Source: Bernoulli, Volume 26, Number 3, 2143--2175.Abstract: We show that various functionals related to the supremum of a real function defined on an arbitrary set or a measure space are Hadamard directionally differentiable. We specifically consider the supremum norm, the supremum, the infimum, and the amplitude of a function. The (usually non-linear) derivatives of these maps adopt simple expressions under suitable assumptions on the underlying space. As an application, we improve and extend to the multidimensional case the results in Raghavachari ( Ann. Statist. 1 (1973) 67–73) regarding the limiting distributions of Kolmogorov–Smirnov type statistics under the alternative hypothesis. Similar results are obtained for analogous statistics associated with copulas. We additionally solve an open problem about the Berk–Jones statistic proposed by Jager and Wellner (In A Festschrift for Herman Rubin (2004) 319–331 IMS). Finally, the asymptotic distribution of maximum mean discrepancies over Donsker classes of functions is derived. Full Article
y Noncommutative Lebesgue decomposition and contiguity with applications in quantum statistics By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Akio Fujiwara, Koichi Yamagata. Source: Bernoulli, Volume 26, Number 3, 2105--2142.Abstract: We herein develop a theory of contiguity in the quantum domain based upon a novel quantum analogue of the Lebesgue decomposition. The theory thus formulated is pertinent to the weak quantum local asymptotic normality introduced in the previous paper [Yamagata, Fujiwara, and Gill, Ann. Statist. 41 (2013) 2197–2217], yielding substantial enlargement of the scope of quantum statistics. Full Article
y On sampling from a log-concave density using kinetic Langevin diffusions By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Arnak S. Dalalyan, Lionel Riou-Durand. Source: Bernoulli, Volume 26, Number 3, 1956--1988.Abstract: Langevin diffusion processes and their discretizations are often used for sampling from a target density. The most convenient framework for assessing the quality of such a sampling scheme corresponds to smooth and strongly log-concave densities defined on $\mathbb{R}^{p}$. The present work focuses on this framework and studies the behavior of the Monte Carlo algorithm based on discretizations of the kinetic Langevin diffusion. We first prove the geometric mixing property of the kinetic Langevin diffusion with a mixing rate that is optimal in terms of its dependence on the condition number. We then use this result for obtaining improved guarantees of sampling using the kinetic Langevin Monte Carlo method, when the quality of sampling is measured by the Wasserstein distance. We also consider the situation where the Hessian of the log-density of the target distribution is Lipschitz-continuous. In this case, we introduce a new discretization of the kinetic Langevin diffusion and prove that this leads to a substantial improvement of the upper bound on the sampling error measured in Wasserstein distance. Full Article
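The sampler studied in this abstract can be sketched with a crude Euler–Maruyama discretization of the kinetic (underdamped) Langevin diffusion $dX_t = V_t\,dt$, $dV_t = -\gamma V_t\,dt - \nabla U(X_t)\,dt + \sqrt{2\gamma}\,dW_t$. Note this is a simplification: the paper analyzes a sharper discretization (and a new one under Lipschitz Hessians), not plain Euler; the step size, friction $\gamma$, and burn-in below are assumptions for the demo. The target is a standard Gaussian, a smooth and strongly log-concave density with $U(x) = \|x\|^2/2$.

```python
import numpy as np

def klmc_step(x, v, grad_U, h, gamma, rng):
    """One Euler-Maruyama step of the kinetic Langevin diffusion
    (position x, velocity v, potential gradient grad_U, step h, friction gamma)."""
    noise = rng.standard_normal(x.shape)
    v_new = v - h * (gamma * v + grad_U(x)) + np.sqrt(2.0 * gamma * h) * noise
    x_new = x + h * v  # position advances with the previous velocity
    return x_new, v_new

grad_U = lambda x: x  # standard Gaussian target: U(x) = |x|^2 / 2
rng = np.random.default_rng(1)
x, v = np.zeros(2), np.zeros(2)
samples = []
for t in range(20000):
    x, v = klmc_step(x, v, grad_U, h=0.05, gamma=2.0, rng=rng)
    if t >= 2000:  # discard burn-in
        samples.append(x.copy())
samples = np.asarray(samples)
print("empirical mean:", samples.mean(axis=0), "empirical var:", samples.var(axis=0))
```

The empirical mean and variance of the retained positions approach those of the target (0 and 1 per coordinate), up to the $O(h)$ discretization bias that the paper's refined schemes are designed to reduce.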
y Busemann functions and semi-infinite O’Connell–Yor polymers By projecteuclid.org Published On :: Mon, 27 Apr 2020 04:02 EDT Tom Alberts, Firas Rassoul-Agha, Mackenzie Simper. Source: Bernoulli, Volume 26, Number 3, 1927--1955.Abstract: We prove that given any fixed asymptotic velocity, the finite length O’Connell–Yor polymer has an infinite length limit satisfying the law of large numbers with this velocity. By a Markovian property of the quenched polymer this reduces to showing the existence of Busemann functions: almost sure limits of ratios of random point-to-point partition functions. The key ingredients are the Burke property of the O’Connell–Yor polymer and a comparison lemma for the ratios of partition functions. We also show the existence of infinite length limits in the Brownian last passage percolation model. Full Article