Oriented first passage percolation in the mean field limit

Nicola Kistler, Adrien Schertzer, Marius A. Schmidt.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 414--425.

Abstract:
The Poisson clumping heuristic has led Aldous to conjecture the value of the oriented first passage percolation on the hypercube in the limit of large dimensions. Aldous’ conjecture has been rigorously confirmed by Fill and Pemantle (Ann. Appl. Probab. 3 (1993) 593–629) by means of a variance reduction trick. We present here a streamlined and, we believe, more natural proof based on ideas that emerged in the study of Derrida’s random energy models.




Measuring symmetry and asymmetry of multiplicative distortion measurement errors data

Jun Zhang, Yujie Gai, Xia Cui, Gaorong Li.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 370--393.

Abstract:
This paper studies the measure of symmetry or asymmetry of a continuous variable under the multiplicative distortion measurement errors setting. The unobservable variable is distorted in a multiplicative fashion by an observed confounding variable. First, two direct plug-in estimation procedures are proposed, and the empirical likelihood based confidence intervals are constructed to measure the symmetry or asymmetry of the unobserved variable. Next, we propose four test statistics for testing whether the unobserved variable is symmetric or not. The asymptotic properties of the proposed estimators and test statistics are examined. We conduct Monte Carlo simulation experiments to examine the performance of the proposed estimators and test statistics. These methods are applied to analyze a real dataset for an illustration.




Bayesian modeling and prior sensitivity analysis for zero–one augmented beta regression models with an application to psychometric data

Danilo Covaes Nogarotto, Caio Lucidius Naberezny Azevedo, Jorge Luis Bazán.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 304--322.

Abstract:
The interest in the analysis of the zero–one augmented beta regression (ZOABR) model has been increasing over the last few years. In this work, we developed a Bayesian inference for the ZOABR model, providing some contributions, namely: we explored the use of Jeffreys-rule and independence Jeffreys prior for some of the parameters, performing a sensitivity study of prior choice, comparing the Bayesian estimates with the maximum likelihood ones and measuring the accuracy of the estimates under several scenarios of interest. The results indicate, in a general way, that the Bayesian approach, under the Jeffreys-rule prior, was as accurate as the ML one. Also, different from other approaches, we use the predictive distribution of the response to implement Bayesian residuals. To further illustrate the advantages of our approach, we conduct an analysis of a real psychometric data set including a Bayesian residual analysis, where it is shown that misleading inference can be obtained when the data is transformed, that is, when the zeros and ones are transformed to suitable values and the usual beta regression model is considered instead of the ZOABR model. Finally, future developments are discussed.




Adaptive two-treatment three-period crossover design for normal responses

Uttam Bandyopadhyay, Shirsendu Mukherjee, Atanu Biswas.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 291--303.

Abstract:
In adaptive crossover design, our goal is to allocate more patients to a promising treatment sequence. The present work contains a very simple three period crossover design for two competing treatments where the allocation in period 3 is done on the basis of the data obtained from the first two periods. Assuming normality of response variables we use a reliability functional for the choice between two treatments. We calculate the allocation proportions and their standard errors corresponding to the possible treatment combinations. We also derive some asymptotic results and provide solutions on related inferential problems. Moreover, the proposed procedure is compared with a possible competitor. Finally, we use a data set to illustrate the applicability of the proposed design.




Symmetrical and asymmetrical mixture autoregressive processes

Mohsen Maleki, Arezo Hajrajabi, Reinaldo B. Arellano-Valle.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 273--290.

Abstract:
In this paper, we study the finite mixtures of autoregressive processes assuming that the distribution of innovations (errors) belongs to the class of scale mixture of skew-normal (SMSN) distributions. The SMSN distributions allow a simultaneous modeling of the existence of outliers, heavy tails and asymmetries in the distribution of innovations. Therefore, a statistical methodology based on the SMSN family allows us to use a robust modeling on some non-linear time series with great flexibility, to accommodate skewness, heavy tails and heterogeneity simultaneously. The existence of convenient hierarchical representations of the SMSN distributions facilitates also the implementation of an ECME-type of algorithm to perform the likelihood inference in the considered model. Simulation studies and the application to a real data set are finally presented to illustrate the usefulness of the proposed model.




Random environment binomial thinning integer-valued autoregressive process with Poisson or geometric marginal

Zhengwei Liu, Qi Li, Fukang Zhu.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 251--272.

Abstract:
To predict time series of counts with small values and remarkable fluctuations, an available model is the $r$ states random environment process based on the negative binomial thinning operator and the geometric marginal. However, we argue that the aforementioned model may suffer from the following two drawbacks. First, under the condition of no prior information, the overdispersed property of the geometric distribution may cause the predictions to fluctuate greatly. Second, because of the constraints on the model parameters, some estimated parameters are close to zero in real-data examples, which may not objectively reveal the correlation relationship. For the first drawback, an $r$ states random environment process based on the binomial thinning operator and the Poisson marginal is introduced. For the second drawback, we propose a generalized $r$ states random environment integer-valued autoregressive model based on the binomial thinning operator to model fluctuations of data. Yule–Walker and conditional maximum likelihood estimates are considered and their performances are assessed via simulation studies. Two real data sets are analyzed to illustrate the better performances of the proposed models compared with some existing models.
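As a point of orientation for the binomial thinning operator mentioned above, the following is a minimal sketch of a plain, single-state INAR(1) process with Poisson marginal, not the $r$ states random environment model of the paper. It assumes the standard recursion $X_t = \alpha \circ X_{t-1} + \varepsilon_t$, where $\alpha \circ X$ is a sum of $X$ independent Bernoulli($\alpha$) variables and $\varepsilon_t \sim \text{Poisson}(\lambda(1-\alpha))$, which keeps the marginal Poisson($\lambda$). All function names and parameter values are hypothetical.

```python
import math
import random


def binomial_thinning(alpha, x, rng):
    """Return alpha o x: the number of successes among x Bernoulli(alpha) trials."""
    return sum(1 for _ in range(x) if rng.random() < alpha)


def poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_inar1(n, alpha, lam, seed=0):
    """Simulate n steps of a stationary INAR(1) process with Poisson(lam) marginal.

    Illustrative assumption: a single environment state, unlike the
    random environment models described in the abstract.
    """
    rng = random.Random(seed)
    x = poisson(lam, rng)  # start from the stationary marginal
    path = [x]
    for _ in range(n - 1):
        # survivors of the previous count plus Poisson innovations
        x = binomial_thinning(alpha, x, rng) + poisson(lam * (1.0 - alpha), rng)
        path.append(x)
    return path
```

Calling `simulate_inar1(200, 0.4, 2.0)` returns a list of 200 non-negative counts whose long-run average should sit near `lam`.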




Recent developments in complex and spatially correlated functional data

Israel Martínez-Hernández, Marc G. Genton.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 204--229.

Abstract:
As high-dimensional and high-frequency data are being collected on a large scale, the development of new statistical models is being pushed forward. Functional data analysis provides the required statistical methods to deal with large-scale and complex data by assuming that data are continuous functions, for example, realizations of a continuous process (curves) or continuous random field (surfaces), and that each curve or surface is considered as a single observation. Here, we provide an overview of functional data analysis when data are complex and spatially correlated. We provide definitions and estimators of the first and second moments of the corresponding functional random variable. We present two main approaches: The first assumes that data are realizations of a functional random field, that is, each observation is a curve with a spatial component. We call them spatial functional data. The second approach assumes that data are continuous deterministic fields observed over time. In this case, one observation is a surface or manifold, and we call them surface time series. For these two approaches, we describe software available for the statistical analysis. We also present a data illustration, using a high-resolution wind speed simulated dataset, as an example of the two approaches. The functional data approach offers a new paradigm of data analysis, where the continuous processes or random fields are considered as a single entity. We consider this approach to be very valuable in the context of big data.




A message from the editorial board

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 203--203.




On estimating the location parameter of the selected exponential population under the LINEX loss function

Mohd Arshad, Omer Abdalghani.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 167--182.

Abstract:
Let $\pi_{1},\pi_{2},\ldots,\pi_{k}$ be $k(\geq 2)$ independent exponential populations having unknown location parameters $\mu_{1},\mu_{2},\ldots,\mu_{k}$ and known scale parameters $\sigma_{1},\ldots,\sigma_{k}$. Let $\mu_{[k]}=\max\{\mu_{1},\ldots,\mu_{k}\}$. For selecting the population associated with $\mu_{[k]}$, a class of selection rules (proposed by Arshad and Misra [Statistical Papers 57 (2016) 605–621]) is considered. We consider the problem of estimating the location parameter $\mu_{S}$ of the selected population under the criterion of the LINEX loss function. We consider three natural estimators $\delta_{N,1},\delta_{N,2}$ and $\delta_{N,3}$ of $\mu_{S}$, based on the maximum likelihood estimators, uniformly minimum variance unbiased estimator (UMVUE) and minimum risk equivariant estimator (MREE) of $\mu_{i}$’s, respectively. The uniformly minimum risk unbiased estimator (UMRUE) and the generalized Bayes estimator of $\mu_{S}$ are derived. Under the LINEX loss function, a general result for improving a location-equivariant estimator of $\mu_{S}$ is derived. Using this result, an estimator better than the natural estimator $\delta_{N,1}$ is obtained. We also show that the estimator $\delta_{N,1}$ is dominated by the natural estimator $\delta_{N,3}$. Finally, we perform a simulation study to evaluate and compare risk functions among various competing estimators of $\mu_{S}$.




Multivariate normal approximation of the maximum likelihood estimator via the delta method

Andreas Anastasiou, Robert E. Gaunt.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 136--149.

Abstract:
We use the delta method and Stein’s method to derive, under regularity conditions, explicit upper bounds for the distributional distance between the distribution of the maximum likelihood estimator (MLE) of a $d$-dimensional parameter and its asymptotic multivariate normal distribution. Our bounds apply in situations in which the MLE can be written as a function of a sum of i.i.d. $t$-dimensional random vectors. We apply our general bound to establish a bound for the multivariate normal approximation of the MLE of the normal distribution with unknown mean and variance.




A primer on the characterization of the exchangeable Marshall–Olkin copula via monotone sequences

Natalia Shenkman.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 127--135.

Abstract:
While derivations of the characterization of the $d$-variate exchangeable Marshall–Olkin copula via $d$-monotone sequences relying on basic knowledge in probability theory exist in the literature, they contain a myriad of unnecessary relatively complicated computations. We revisit this issue and provide proofs where all undesired artefacts are removed, thereby exposing the simplicity of the characterization. In particular, we give an insightful analytical derivation of the monotonicity conditions based on the monotonicity properties of the survival probabilities.




Nonparametric discrimination of areal functional data

Ahmad Younso.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 112--126.

Abstract:
We consider a new nonparametric rule of classification, inspired by the classical moving window rule, that allows for the classification of spatially dependent functional data containing some completely missing curves. We investigate the consistency of this classifier under mild conditions. The practical use of the classifier is illustrated through simulation studies.




Effects of gene–environment and gene–gene interactions in case-control studies: A novel Bayesian semiparametric approach

Durba Bhattacharya, Sourabh Bhattacharya.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 71--89.

Abstract:
Present day bio-medical research is pointing towards the fact that cognizance of gene–environment interactions along with genetic interactions may help prevent or delay the onset of many complex diseases like cardiovascular disease, cancer, type 2 diabetes, autism or asthma by adjustments to lifestyle. In this regard, we propose a Bayesian semiparametric model to detect not only the roles of genes and their interactions, but also the possible influence of environmental variables on the genes in case-control studies. Our model also accounts for the unknown number of genetic sub-populations via finite mixtures composed of Dirichlet processes. An effective parallel computing methodology developed by us harnesses the power of parallel processing technology to increase the efficiencies of our conditionally independent Gibbs sampling and Transformation based MCMC (TMCMC) methods. Applications of our model and methods to simulation studies with biologically realistic genotype datasets and a real, case-control based genotype dataset on early onset of myocardial infarction (MI) have yielded quite interesting results besides providing some insights into the differential effect of gender on MI.




A joint mean-correlation modeling approach for longitudinal zero-inflated count data

Weiping Zhang, Jiangli Wang, Fang Qian, Yu Chen.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 35--50.

Abstract:
Longitudinal zero-inflated count data are widely encountered in many fields, while modeling the correlation between measurements for the same subject is more challenging due to the lack of suitable multivariate joint distributions. This paper studies a novel mean-correlation modeling approach for longitudinal zero-inflated regression models, solving both problems of specifying the joint distribution and parsimoniously modeling correlations with no constraint. The joint distribution of zero-inflated discrete longitudinal responses is modeled by a copula model whose correlation parameters are innovatively represented in hyper-spherical coordinates. To overcome the computational intractability in maximizing the full likelihood function of the model, we further propose a computationally efficient pairwise likelihood approach. We then propose separated mean and correlation regression models to model these key quantities; such a modeling approach can also handle irregular and possibly subject-specific time points. The resulting estimators are shown to be consistent and asymptotically normal. A data example and simulations support the effectiveness of the proposed approach.




A message from the editorial board

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 1--1.




Time series of count data: A review, empirical comparisons and data analysis

Glaura C. Franco, Helio S. Migon, Marcos O. Prates.

Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 4, 756--781.

Abstract:
Observation and parameter driven models are commonly used in the literature to analyse time series of counts. In this paper, we study the characteristics of a variety of models and point out the main differences and similarities among these procedures, concerning parameter estimation, model fitting and forecasting. In contrast to the literature, all inference was performed under the Bayesian paradigm. The models are fitted with a latent AR($p$) process in the mean, which accounts for autocorrelation in the data. An extensive simulation study shows that the estimates for the covariate parameters are remarkably similar across the different models. However, estimates for autoregressive coefficients and forecasts of future values depend heavily on the underlying process which generates the data. A real data set of bankruptcy in the United States is also analysed.




Estimation of parameters in the $\operatorname{DDRCINAR}(p)$ model

Xiufang Liu, Dehui Wang.

Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 3, 638--673.

Abstract:
This paper discusses a $p$th-order dependence-driven random coefficient integer-valued autoregressive time series model ($\operatorname{DDRCINAR}(p)$). Stationarity and ergodicity properties are proved. Conditional least squares, weighted least squares and maximum quasi-likelihood are used to estimate the model parameters. Asymptotic properties of the estimators are presented. The performances of these estimators are investigated and compared via simulations. In certain regions of the parameter space, simulation analysis shows that maximum quasi-likelihood estimators perform better than the estimators of conditional least squares and weighted least squares in terms of the proportion of within-$\Omega$ estimates. Finally, the model is applied to two real data sets.




A Jackson network under general regime

Yair Y. Shaki.

Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 3, 532--548.

Abstract:
We consider a Jackson network in a general heavy traffic diffusion regime with the $\alpha$-parametrization. We also assume that each customer may abandon the system while waiting. We show that in this regime the queue-length process converges to a multi-dimensional regulated Ornstein–Uhlenbeck process.




Influence measures for the Waring regression model

Luisa Rivas, Manuel Galea.

Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 2, 402--424.

Abstract:
In this paper, we present a regression model where the response variable is a count variable that follows a Waring distribution. The Waring regression model allows for analysis of phenomena where the Geometric regression model is inadequate, because the probability of success on each trial, $p$, is different for each individual and $p$ has an associated distribution. Estimation is performed by maximum likelihood, through the maximization of the $Q$-function using the EM algorithm. Diagnostic measures are calculated for this model. To illustrate the results, an application to real data is presented. Some specific details are given in the Appendix of the paper.




A temporal perspective on the rate of convergence in first-passage percolation under a moment condition

Daniel Ahlberg.

Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 2, 397--401.

Abstract:
We study the rate of convergence in the celebrated Shape Theorem in first-passage percolation, obtaining the precise asymptotic rate of decay for the probability of linear order deviations under a moment condition. Our results are presented from a temporal perspective and complement previous work by the same author, in which the rate of convergence was studied from the standard spatial perspective.




Hierarchical modelling of power law processes for the analysis of repairable systems with different truncation times: An empirical Bayes approach

Rodrigo Citton P. dos Reis, Enrico A. Colosimo, Gustavo L. Gilardoni.

Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 2, 374--396.

Abstract:
In the analysis of data from multiple repairable systems, it is usual to observe both different truncation times and heterogeneity among the systems. Among other reasons, the latter is caused by different manufacturing lines and maintenance teams of the systems. In this paper, a hierarchical model is proposed for the statistical analysis of multiple repairable systems under different truncation times. A reparameterization of the power law process is proposed in order to obtain a quasi-conjugate Bayesian analysis. An empirical Bayes approach is used to estimate model hyperparameters. The uncertainty in the estimates of these quantities is corrected by using a parametric bootstrap approach. The results are illustrated in a real data set of failure times of power transformers from an electric company in Brazil.




Necessary and sufficient conditions for the convergence of the consistent maximal displacement of the branching random walk

Bastien Mallein.

Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 2, 356--373.

Abstract:
Consider a supercritical branching random walk on the real line. The consistent maximal displacement is the smallest of the distances between the trajectories followed by individuals at the $n$th generation and the boundary of the process. Fang and Zeitouni, and Faraud, Hu and Shi proved that under some integrability conditions, the consistent maximal displacement grows almost surely at rate $\lambda^{*}n^{1/3}$ for some explicit constant $\lambda^{*}$. We obtain here a necessary and sufficient condition for this asymptotic behaviour to hold.




An estimation method for latent traits and population parameters in Nominal Response Model

Caio L. N. Azevedo, Dalton F. Andrade

Source: Braz. J. Probab. Stat., Volume 24, Number 3, 415--433.

Abstract:
The nominal response model (NRM) was proposed by Bock [Psychometrika 37 (1972) 29–51] in order to improve the latent trait (ability) estimation in multiple choice tests with nominal items. When the item parameters are known, expectation a posteriori or maximum a posteriori methods are commonly employed to estimate the latent traits, considering a standard symmetric normal distribution as the latent traits prior density. However, when this item set is presented to a new group of examinees, it is not only necessary to estimate their latent traits but also the population parameters of this group. This article has two main purposes: first, to develop a Markov chain Monte Carlo algorithm to estimate both latent traits and population parameters concurrently. This algorithm comprises the Metropolis–Hastings within Gibbs sampling algorithm (MHWGS) proposed by Patz and Junker [Journal of Educational and Behavioral Statistics 24 (1999b) 346–366]. Second, to compare, in terms of latent trait recovery, the performance of this method with three other methods: maximum likelihood, expectation a posteriori and maximum a posteriori. The comparisons were performed by varying the total number of items (NI), the number of categories and the values of the mean and the variance of the latent trait distribution. The results showed that MHWGS outperforms the other methods in latent trait estimation and that it properly recovers the population parameters. Furthermore, we found that NI accounts for the highest percentage of the variability in the accuracy of latent trait estimation.




NDN coping mechanisms : notes from the field

Belcourt, Billy-Ray, author.
9781487005771 (softcover)




Heavy metalloid music : the story of Simply Saucer

Locke, Jesse, 1983- author.
9781771613682 (Paper)




Documenting rebellions : a study of four lesbian and gay archives in queer times

Sheffield, Rebecka Taves, author.
9781634000918 paperback




Figuring racism in medieval Christianity

Kaplan, M. Lindsay, author.
9780190678241 hardcover alkaline paper




Can $p$-values be meaningfully interpreted without random sampling?

Norbert Hirschauer, Sven Grüner, Oliver Mußhoff, Claudia Becker, Antje Jantsch.

Source: Statistics Surveys, Volume 14, 71--91.

Abstract:
Besides the inferential errors that abound in the interpretation of $p$-values, the probabilistic pre-conditions (i.e. random sampling or equivalent) for using them at all are not often met by observational studies in the social sciences. This paper systematizes different sampling designs and discusses the restrictive requirements of data collection that are the indispensable prerequisite for using $p$-values.




Flexible, boundary adapted, nonparametric methods for the estimation of univariate piecewise-smooth functions

Umberto Amato, Anestis Antoniadis, Italia De Feis.

Source: Statistics Surveys, Volume 14, 32--70.

Abstract:
We present and compare some nonparametric estimation methods (wavelet and/or spline-based) designed to recover a one-dimensional piecewise-smooth regression function in both a fixed design regression model (equidistant or not) and a random design model. Wavelet methods are known to be very competitive in terms of denoising and compression, due to the simultaneous localization property of a function in time and frequency. However, boundary assumptions, such as periodicity or symmetry, generate bias and artificial wiggles which degrade overall accuracy. Simple methods have been proposed in the literature for reducing the bias at the boundaries. We introduce new ones based on adaptive combinations of two estimators. The underlying idea is to combine a highly accurate method for non-regular functions, e.g., wavelets, with one well behaved at boundaries, e.g., splines or local polynomials. We provide some asymptotic optimality results supporting our approach. All the methods can handle data with a random design. We also sketch some generalization to the multidimensional setting. To study the performance of the proposed approaches we have conducted an extensive set of simulations on synthetic data. An interesting regression analysis of two real data applications using these procedures unambiguously demonstrates their effectiveness.




Scalar-on-function regression for predicting distal outcomes from intensively gathered longitudinal data: Interpretability for applied scientists

John J. Dziak, Donna L. Coffman, Matthew Reimherr, Justin Petrovich, Runze Li, Saul Shiffman, Mariya P. Shiyko.

Source: Statistics Surveys, Volume 13, 150--180.

Abstract:
Researchers are sometimes interested in predicting a distal or external outcome (such as smoking cessation at follow-up) from the trajectory of an intensively recorded longitudinal variable (such as urge to smoke). This can be done in a semiparametric way via scalar-on-function regression. However, the resulting fitted coefficient regression function requires special care for correct interpretation, as it represents the joint relationship of time points to the outcome, rather than a marginal or cross-sectional relationship. We provide practical guidelines, based on experience with scientific applications, for helping practitioners interpret their results and illustrate these ideas using data from a smoking cessation study.




Additive monotone regression in high and lower dimensions

Solveig Engebretsen, Ingrid K. Glad.

Source: Statistics Surveys, Volume 13, 1--51.

Abstract:
In numerous problems where the aim is to estimate the effect of a predictor variable on a response, one can assume a monotone relationship. For example, dose-effect models in medicine are of this type. In a multiple regression setting, additive monotone regression models assume that each predictor has a monotone effect on the response. In this paper, we present an overview and comparison of very recent frequentist methods for fitting additive monotone regression models. Three of the methods we present can be used both in the high dimensional setting, where the number of parameters $p$ exceeds the number of observations $n$, and in the classical multiple setting where $1<p\leq n$. However, many of the most recent methods only apply to the classical setting. The methods are compared through simulation experiments in terms of efficiency, prediction error and variable selection properties in both settings, and they are applied to the Boston housing data. We conclude with some recommendations on when the various methods perform best.




Pitfalls of significance testing and $p$-value variability: An econometrics perspective

Norbert Hirschauer, Sven Grüner, Oliver Mußhoff, Claudia Becker.

Source: Statistics Surveys, Volume 12, 136--172.

Abstract:
Data on how many scientific findings are reproducible are generally bleak and a wealth of papers have warned against misuses of the $p$-value and resulting false findings in recent years. This paper discusses the question of what we can(not) learn from the $p$-value, which is still widely considered as the gold standard of statistical validity. We aim to provide a non-technical and easily accessible resource for statistical practitioners who wish to spot and avoid misinterpretations and misuses of statistical significance tests. For this purpose, we first classify and describe the most widely discussed (“classical”) pitfalls of significance testing, and review published work on these misuses with a focus on regression-based “confirmatory” study. This includes a description of the single-study bias and a simulation-based illustration of how proper meta-analysis compares to misleading significance counts (“vote counting”). Going beyond the classical pitfalls, we also use simulation to provide intuition that relying on the statistical estimate “$p$-value” as a measure of evidence without considering its sample-to-sample variability falls short of the mark even within an otherwise appropriate interpretation. We conclude with a discussion of the exigencies of informed approaches to statistical inference and corresponding institutional reforms.
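The sample-to-sample variability of the $p$-value described above can be illustrated with a small simulation. The sketch below is not the authors' simulation design: it assumes normally distributed data with known unit variance, a two-sided $z$-test of a zero mean, and hypothetical values for the effect size, sample size and number of replications.

```python
import math
import random


def two_sided_z_pvalue(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0 when the standard deviation is known."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate_pvalues(effect, n, replications, seed=0):
    """Repeat the same study many times and collect the resulting p-values.

    Every replication draws n observations from Normal(effect, 1), so the
    true effect is identical across replications (illustrative assumption).
    """
    rng = random.Random(seed)
    pvalues = []
    for _ in range(replications):
        sample = [rng.gauss(effect, 1.0) for _ in range(n)]
        pvalues.append(two_sided_z_pvalue(sample))
    return pvalues
```

With, say, `simulate_pvalues(0.3, 50, 200)`, the same true effect typically produces $p$-values on both sides of the conventional 0.05 threshold, which is the variability the abstract warns about.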




An approximate likelihood perspective on ABC methods

George Karabatsos, Fabrizio Leisen.

Source: Statistics Surveys, Volume 12, 66--104.

Abstract:
We are living in the big data era, as current technologies and networks allow for the easy and routine collection of data sets in different disciplines. Bayesian Statistics offers a flexible modeling approach which is attractive for describing the complexity of these datasets. These models often exhibit a likelihood function which is intractable due to the large sample size, high number of parameters, or functional complexity. Approximate Bayesian Computational (ABC) methods provide likelihood-free methods for performing statistical inferences with Bayesian models defined by intractable likelihood functions. The vastness of the literature on ABC methods created a need to review and relate all ABC approaches so that scientists can more readily understand and apply them for their own work. This article provides a unifying review, general representation, and classification of all ABC methods from the view of approximate likelihood theory. This clarifies how ABC methods can be characterized, related, combined, improved, and applied for future research. Possible future research in ABC is then outlined.
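To make the likelihood-free idea concrete, here is a minimal sketch of the basic ABC rejection sampler, one of many ABC variants and not a representation of the unifying framework in the article. The toy model is an assumption chosen for illustration: data are Normal($\theta$, 1), the prior on $\theta$ is Uniform($-5$, 5), the summary statistic is the sample mean, and a draw is accepted when the simulated summary falls within a hypothetical tolerance `eps` of the observed one.

```python
import random


def abc_rejection(obs_mean, n_obs, n_draws, eps, seed=0):
    """Basic ABC rejection sampling for the mean of a Normal(theta, 1) model.

    The likelihood is never evaluated: each prior draw is kept only if data
    simulated under it reproduce the observed summary closely enough.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # draw from the (assumed) uniform prior
        simulated = [rng.gauss(theta, 1.0) for _ in range(n_obs)]
        sim_mean = sum(simulated) / n_obs  # summary statistic of the fake data
        if abs(sim_mean - obs_mean) < eps:
            accepted.append(theta)
    return accepted
```

The accepted values approximate the posterior of $\theta$; a smaller `eps` gives a closer approximation at the cost of fewer accepted draws.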




Variable selection methods for model-based clustering

Michael Fop, Thomas Brendan Murphy.

Source: Statistics Surveys, Volume 12, 18--65.

Abstract:
Model-based clustering is a popular approach for clustering multivariate data which has seen applications in numerous fields. Nowadays, high-dimensional data are more and more common and the model-based clustering approach has adapted to deal with the increasing dimensionality. In particular, the development of variable selection techniques has received a lot of attention and research effort in recent years. Even for small size problems, variable selection has been advocated to facilitate the interpretation of the clustering results. This review provides a summary of the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples.





Measuring multivariate association and beyond

Julie Josse, Susan Holmes.

Source: Statistics Surveys, Volume 10, 132--167.

Abstract:
Simple correlation coefficients between two variables have been generalized to measure association between two matrices in many ways. Coefficients such as the RV coefficient, the distance covariance (dCov) coefficient and kernel based coefficients are being used by different research communities. Scientists use these coefficients to test whether two random vectors are linked. Once it has been ascertained through testing that there is such an association, a next step, often ignored, is to explore and uncover the association’s underlying patterns. This article provides a survey of various measures of dependence between random vectors and tests of independence and emphasizes the connections and differences between the various approaches. After providing definitions of the coefficients and associated tests, we present the recent improvements that enhance their statistical properties and ease of interpretation. We summarize multi-table approaches and provide scenarios where the indices can provide useful summaries of heterogeneous multi-block data. We illustrate these different strategies on several examples of real data and suggest directions for future research.





Fundamentals of cone regression

Mariella Dimiccoli.

Source: Statistics Surveys, Volume 10, 53--99.

Abstract:
Cone regression is a particular case of quadratic programming that minimizes a weighted sum of squared residuals under a set of linear inequality constraints. Several important statistical problems, such as isotonic regression, concave regression, or ANOVA under partial orderings, to name a few, can be considered as particular instances of the cone regression problem. Given its relevance in Statistics, this paper aims to address the fundamentals of cone regression from a theoretical and practical point of view. Several formulations of the cone regression problem are considered and, focusing on the particular case of concave regression as an example, several algorithms are analyzed and compared both qualitatively and quantitatively through numerical simulations. Several improvements to enhance numerical stability and bound the computational cost are proposed. For each analyzed algorithm, the pseudo-code and its corresponding code in Matlab are provided. The results from this study demonstrate that the choice of the optimization approach strongly impacts the numerical performance. It is also shown that no methods are currently available to efficiently solve cone regression problems of large dimension (many thousands of points or more). We suggest further research to fill this gap by exploiting and adapting classical multi-scale strategies to compute an approximate solution.
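
To make the problem concrete, consider isotonic regression, one of the special cases named in the abstract: minimize the weighted sum of squares sum_i w_i (y_i - theta_i)^2 subject to theta_1 <= theta_2 <= ... <= theta_n. The sketch below solves it with the pool-adjacent-violators algorithm. This is a standard textbook method chosen here for illustration and is not necessarily among the algorithms the paper analyzes. The function name is hypothetical, and Python is used rather than the paper's Matlab.

```python
def isotonic_regression(y, w=None):
    """Weighted least-squares fit under theta[0] <= theta[1] <= ... (PAVA).

    Assumes all weights are strictly positive.
    """
    if w is None:
        w = [1.0] * len(y)
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    # Expand the blocks back to one fitted value per observation.
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit

fit = isotonic_regression([1.0, 3.0, 2.0, 4.0, 3.5, 5.0])
# fit == [1.0, 2.5, 2.5, 3.75, 3.75, 5.0]
```

Each pair of adjacent values that violates the ordering is replaced by its weighted mean, which is the projection of the data onto the cone of non-decreasing sequences.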





A survey of bootstrap methods in finite population sampling

Zeinab Mashreghi, David Haziza, Christian Léger.

Source: Statistics Surveys, Volume 10, 1--52.

Abstract:
We review bootstrap methods in the context of survey data where the effect of the sampling design on the variability of estimators has to be taken into account. We present the methods in a unified way by classifying them into three classes: pseudo-population, direct, and survey weights methods. We cover variance estimation and the construction of confidence intervals for stratified simple random sampling as well as some unequal probability sampling designs. We also address the problem of variance estimation in the presence of imputation to compensate for item non-response.





A unified treatment for non-asymptotic and asymptotic approaches to minimax signal detection

Clément Marteau, Theofanis Sapatinas.

Source: Statistics Surveys, Volume 9, 253--297.

Abstract:
We are concerned with minimax signal detection. In this setting, we discuss non-asymptotic and asymptotic approaches through a unified treatment. In particular, we consider a Gaussian sequence model that contains classical models as special cases, such as direct, well-posed inverse and ill-posed inverse problems. Working with certain ellipsoids in the space of square-summable sequences of real numbers, with a ball of positive radius removed, we compare the construction of lower and upper bounds for the minimax separation radius (non-asymptotic approach) and the minimax separation rate (asymptotic approach) that have been proposed in the literature. Some additional contributions, bringing to light links between non-asymptotic and asymptotic approaches to minimax signal detection, are also presented. An example of a mildly ill-posed inverse problem is used for illustrative purposes. In particular, it is shown that tools used to derive ‘asymptotic’ results can be exploited to draw ‘non-asymptotic’ conclusions, and vice-versa. In order to enhance our understanding of these two minimax signal detection paradigms, we bring to light hitherto unknown similarities and links between non-asymptotic and asymptotic approaches.





Some models and methods for the analysis of observational data

José A. Ferreira.

Source: Statistics Surveys, Volume 9, 106--208.

Abstract:
This article provides a concise and essentially self-contained exposition of some of the most important models and non-parametric methods for the analysis of observational data, and a substantial number of illustrations of their application. Although for the most part our presentation follows P. Rosenbaum’s book, “Observational Studies”, and naturally draws on related literature, it contains original elements and simplifies and generalizes some basic results. The illustrations, based on simulated data, show the methods at work in some detail, highlighting pitfalls and emphasizing certain subjective aspects of the statistical analyses.





Semi-parametric estimation for conditional independence multivariate finite mixture models

Didier Chauveau, David R. Hunter, Michael Levine.

Source: Statistics Surveys, Volume 9, 1--31.

Abstract:
The conditional independence assumption for nonparametric multivariate finite mixture models, a weaker form of the well-known conditional independence assumption for random effects models for longitudinal data, is the subject of an increasing number of theoretical and algorithmic developments in the statistical literature. After presenting a survey of this literature, including an in-depth discussion of the all-important identifiability results, this article describes and extends an algorithm for estimation of the parameters in these models. The algorithm works for any number of components in three or more dimensions. It possesses a descent property and can be easily adapted to situations where the data are grouped in blocks of conditionally independent variables. We discuss how to adapt this algorithm to various location-scale models that link component densities, and we even adapt it to a particular class of univariate mixture problems in which the components are assumed symmetric. We give a bandwidth selection procedure for our algorithm. Finally, we demonstrate the effectiveness of our algorithm using a simulation study and two psychometric datasets.





Errata: A survey of Bayesian predictive methods for model assessment, selection and comparison

Aki Vehtari, Janne Ojanen.

Source: Statistics Surveys, Volume 8, 1--1.

Abstract:
Errata for “A survey of Bayesian predictive methods for model assessment, selection and comparison” by A. Vehtari and J. Ojanen, Statistics Surveys, 6 (2012), 142–228. doi:10.1214/12-SS102.





A survey of Bayesian predictive methods for model assessment, selection and comparison

Aki Vehtari, Janne Ojanen

Source: Statist. Surv., Volume 6, 142--228.

Abstract:
To date, several methods exist in the statistical literature for model assessment that are presented specifically as Bayesian predictive methods. However, the decision theoretic assumptions on which these methods are based are not always clearly stated in the original articles. The aim of this survey is to provide a unified review of Bayesian predictive model assessment and selection methods, and of methods closely related to them. We review the various assumptions that are made in this context and discuss the connections between different approaches, with an emphasis on how each method approximates the expected utility of using a Bayesian model for the purpose of predicting future data.





The theory and application of penalized methods or Reproducing Kernel Hilbert Spaces made easy

Nancy Heckman

Source: Statist. Surv., Volume 6, 113--141.

Abstract:
The popular cubic smoothing spline estimate of a regression function arises as the minimizer of the penalized sum of squares $\sum_{j}(Y_{j}-\mu(t_{j}))^{2}+\lambda \int_{a}^{b}[\mu''(t)]^{2}\,dt$, where the data are $t_{j},Y_{j}$, $j=1,\ldots,n$. The minimization is taken over an infinite-dimensional function space, the space of all functions with square integrable second derivatives. But the calculations can be carried out in a finite-dimensional space. The reduction from minimizing over an infinite dimensional space to minimizing over a finite dimensional space occurs for more general objective functions: the data may be related to the function $\mu$ in another way, the sum of squares may be replaced by a more suitable expression, or the penalty, $\int_{a}^{b}[\mu''(t)]^{2}\,dt$, might take a different form. This paper reviews the Reproducing Kernel Hilbert Space structure that provides a finite-dimensional solution for a general minimization problem. Particular attention is paid to the construction and study of the Reproducing Kernel Hilbert Space corresponding to a penalty based on a linear differential operator. In this case, one can often calculate the minimizer explicitly, using Green’s functions.





Curse of dimensionality and related issues in nonparametric functional regression

Gery Geenens

Source: Statist. Surv., Volume 5, 30--43.

Abstract:
Recently, some nonparametric regression ideas have been extended to the case of functional regression. Within that framework, the main concern arises from the infinite dimensional nature of the explanatory objects. Specifically, in the classical multivariate regression context, it is well-known that any nonparametric method is affected by the so-called “curse of dimensionality”, caused by the sparsity of data in high-dimensional spaces, resulting in a decrease in the fastest achievable rates of convergence of regression function estimators toward their target curve as the dimension of the regressor vector increases. Therefore, it is not surprising to find dramatically bad theoretical properties for the nonparametric functional regression estimators, leading many authors to condemn the methodology. Nevertheless, a closer look at the meaning of the functional data under study and at the conclusions that the statistician would like to draw from it makes it possible to consider the problem from another point of view, and to justify the use of slightly modified estimators. In most cases, it can be entirely legitimate to measure the proximity between two elements of the infinite dimensional functional space via a semi-metric, which could prevent those estimators from suffering from what we will call the “curse of infinite dimensionality”.
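
The role of the semi-metric can be sketched with a kernel (Nadaraya–Watson type) estimator for a scalar response and a curve-valued predictor. Everything below is an illustrative assumption rather than the paper's estimator: curves are sampled on a common grid, proximity is the L2 distance between first differences (so vertical shifts are ignored), and the kernel is Gaussian with a hand-picked bandwidth `h`.

```python
import math

def deriv_semimetric(x1, x2):
    """L2 distance between the first differences of two curves on a common grid.

    Curves that differ only by a vertical shift are at distance 0, so this is
    a semi-metric rather than a metric.
    """
    d1 = [b - a for a, b in zip(x1, x1[1:])]
    d2 = [b - a for a, b in zip(x2, x2[1:])]
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(d1, d2)))

def functional_nw(x_new, curves, responses, h, dist=deriv_semimetric):
    """Kernel-weighted average of responses, weighted by curve proximity."""
    weights = [math.exp(-0.5 * (dist(x_new, x) / h) ** 2) for x in curves]
    total = sum(weights)
    return sum(wt * y for wt, y in zip(weights, responses)) / total

# Toy data (hypothetical): straight lines whose response depends on the slope only.
grid = [i / 20 for i in range(21)]
slopes = [0.5, 1.0, 1.5, 2.0, 2.5]
curves = [[s * t for t in grid] for s in slopes]
responses = [s ** 2 for s in slopes]

# A new curve with slope 1.5, shifted upwards by 10.
shifted = [1.5 * t + 10.0 for t in grid]
pred = functional_nw(shifted, curves, responses, h=0.05)
```

Because the semi-metric discards the vertical shift, the shifted curve is treated as identical to the training curve with slope 1.5, and the prediction stays close to that curve's response of 2.25. Under the plain L2 metric the shifted curve would be far from every training curve.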






Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy

Gregory J. Matthews, Ofer Harel

Source: Statist. Surv., Volume 5, 1--29.

Abstract:
There is an ever increasing demand from researchers for access to useful microdata files. However, there are also growing concerns regarding the privacy of the individuals contained in the microdata. Ideally, microdata could be released in such a way that a balance between usefulness of the data and privacy is struck. This paper presents a review of proposed methods of statistical disclosure control and techniques for assessing the privacy of such methods under different definitions of disclosure.






Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview

A. Philip Dawid, Vanessa Didelez

Source: Statist. Surv., Volume 4, 184--231.

Abstract:
We consider the problem of learning about and comparing the consequences of dynamic treatment strategies on the basis of observational data. We formulate this within a probabilistic decision-theoretic framework. Our approach is compared with related work by Robins and others: in particular, we show how Robins’s ‘G-computation’ algorithm arises naturally from this decision-theoretic perspective. Careful attention is paid to the mathematical and substantive conditions required to justify the use of this formula. These conditions revolve around a property we term stability, which relates the probabilistic behaviours of observational and interventional regimes. We show how an assumption of ‘sequential randomization’ (or ‘no unmeasured confounders’), or an alternative assumption of ‘sequential irrelevance’, can be used to infer stability. Probabilistic influence diagrams are used to simplify manipulations, and their power and limitations are discussed. We compare our approach with alternative formulations based on causal DAGs or potential response models. We aim to show that formulating the problem of assessing dynamic treatment strategies as a problem of decision analysis brings clarity, simplicity and generality.

References:
Arjas, E. and Parner, J. (2004). Causal reasoning from longitudinal data. Scandinavian Journal of Statistics 31 171–187.

Arjas, E. and Saarela, O. (2010). Optimal dynamic regimes: Presenting a case for predictive inference. The International Journal of Biostatistics 6. http://tinyurl.com/33dfssf

Cowell, R. G., Dawid, A. P., Lauritzen, S. L. and Spiegelhalter, D. J. (1999). Probabilistic Networks and Expert Systems. Springer, New York.

Dawid, A. P. (1979). Conditional independence in statistical theory (with Discussion). Journal of the Royal Statistical Society, Series B 41 1–31.

Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing 2 25–36.

Dawid, A. P. (1998). Conditional independence. In Encyclopedia of Statistical Science (Update Volume 2) (S. Kotz, C. B. Read and D. L. Banks, eds.) 146–155. Wiley-Interscience, New York.

Dawid, A. P. (2000). Causal inference without counterfactuals (with Discussion). Journal of the American Statistical Association 95 407–448.

Dawid, A. P. (2001). Separoids: A mathematical framework for conditional independence and irrelevance. Annals of Mathematics and Artificial Intelligence 32 335–372.

Dawid, A. P. (2002). Influence diagrams for causal modelling and inference. International Statistical Review 70 161–189. Corrigenda, ibid., 437.

Dawid, A. P. (2003). Causal inference using influence diagrams: The problem of partial compliance (with Discussion). In Highly Structured Stochastic Systems (P. J. Green, N. L. Hjort and S. Richardson, eds.) 45–81. Oxford University Press.

Dawid, A. P. (2010). Beware of the DAG! In Proceedings of the NIPS 2008 Workshop on Causality. Journal of Machine Learning Research Workshop and Conference Proceedings (D. Janzing, I. Guyon and B. Schölkopf, eds.) 6 59–86. http://tinyurl.com/33va7tm

Dawid, A. P. and Didelez, V. (2008). Identifying optimal sequential decisions. In Proceedings of the Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence (UAI-08) (D. McAllester and A. Nicholson, eds.) 113–120. AUAI Press, Corvallis, Oregon. http://tinyurl.com/3899qpp

Dechter, R. (2003). Constraint Processing. Morgan Kaufmann Publishers.

Didelez, V., Dawid, A. P. and Geneletti, S. G. (2006). Direct and indirect effects of sequential treatments. In Proceedings of the Twenty-Second Annual Conference on Uncertainty in Artificial Intelligence (UAI-06) (R. Dechter and T. Richardson, eds.) 138–146. AUAI Press, Arlington, Virginia. http://tinyurl.com/32w3f4e

Didelez, V., Kreiner, S. and Keiding, N. (2010). Graphical models for inference under outcome dependent sampling. Statistical Science (to appear).

Didelez, V. and Sheehan, N. S. (2007). Mendelian randomisation: Why epidemiology needs a formal language for causality. In Causality and Probability in the Sciences (F. Russo and J. Williamson, eds.). Texts in Philosophy Series 5 263–292. College Publications, London.

Eichler, M. and Didelez, V. (2010). Granger-causality and the effect of interventions in time series. Lifetime Data Analysis 16 3–32.

Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York, London.

Geneletti, S. G. (2007). Identifying direct and indirect effects in a non-counterfactual framework. Journal of the Royal Statistical Society: Series B 69 199–215.

Geneletti, S. G. and Dawid, A. P. (2010). Defining and identifying the effect of treatment on the treated. In Causality in the Sciences (P. M. Illari, F. Russo and J. Williamson, eds.) Oxford University Press (to appear).

Gill, R. D. and Robins, J. M. (2001). Causal inference for complex longitudinal data: The continuous case. Annals of Statistics 29 1785–1811.

Guo, H. and Dawid, A. P. (2010). Sufficient covariates and linear propensity analysis. In Proceedings of the Thirteenth International Workshop on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna, Sardinia, Italy, May 13-15, 2010. Journal of Machine Learning Research Workshop and Conference Proceedings (Y. W. Teh and D. M. Titterington, eds.) 9 281–288. http://tinyurl.com/33lmuj7

Henderson, R., Ansel, P. and Alshibani, D. (2010). Regret-regression for optimal dynamic treatment regimes. Biometrics (to appear). doi:10.1111/j.1541-0420.2009.01368.x

Hernán, M. A. and Taubman, S. L. (2008). Does obesity shorten life? The importance of well defined interventions to answer causal questions. International Journal of Obesity 32 S8–S14.

Holland, P. W. (1986). Statistics and causal inference (with Discussion). Journal of the American Statistical Association 81 945–970.

Huang, Y. and Valtorta, M. (2006). Identifiability in causal Bayesian networks: A sound and complete algorithm. In AAAI’06: Proceedings of the 21st National Conference on Artificial Intelligence 1149–1154. AAAI Press.

Kang, J. D. Y. and Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22 523–539.

Lauritzen, S. L., Dawid, A. P., Larsen, B. N. and Leimer, H. G. (1990). Independence properties of directed Markov fields. Networks 20 491–505.

Lok, J., Gill, R., van der Vaart, A. and Robins, J. (2004). Estimating the causal effect of a time-varying treatment on time-to-event using structural nested failure time models. Statistica Neerlandica 58 271–295.

Moodie, E. M., Richardson, T. S. and Stephens, D. A. (2007). Demystifying optimal dynamic treatment regimes. Biometrics 63 447–455.

Murphy, S. A. (2003). Optimal dynamic treatment regimes (with Discussion). Journal of the Royal Statistical Society, Series B 65 331–366.

Oliver, R. M. and Smith, J. Q., eds. (1990). Influence Diagrams, Belief Nets and Decision Analysis. John Wiley and Sons, Chichester, United Kingdom.

Pearl, J. (1995). Causal diagrams for empirical research (with Discussion). Biometrika 82 669–710.

Pearl, J. (2009). Causality: Models, Reasoning and Inference, Second ed. Cambridge University Press, Cambridge.

Pearl, J. and Paz, A. (1987). Graphoids: A graph-based logic for reasoning about relevance relations. In Advances in Artificial Intelligence (D. Hogg and L. Steels, eds.) II 357–363. North-Holland, Amsterdam.

Pearl, J. and Robins, J. (1995). Probabilistic evaluation of sequential plans from causal models with hidden variables. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (P. Besnard and S. Hanks, eds.) 444–453. Morgan Kaufmann Publishers, San Francisco.

Raiffa, H. (1968). Decision Analysis. Addison-Wesley, Reading, Massachusetts.

Robins, J. M. (1986). A new approach to causal inference in mortality studies with sustained exposure periods—Application to control of the healthy worker survivor effect. Mathematical Modelling 7 1393–1512.

Robins, J. M. (1987). Addendum to “A new approach to causal inference in mortality studies with sustained exposure periods—Application to control of the healthy worker survivor effect”. Computers & Mathematics with Applications 14 923–945.

Robins, J. M. (1989). The analysis of randomized and nonrandomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In Health Service Research Methodology: A Focus on AIDS (L. Sechrest, H. Freeman and A. Mulley, eds.) 113–159. NCSHR, U.S. Public Health Service.

Robins, J. M. (1992). Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika 79 321–324.

Robins, J. M. (1997). Causal inference from complex longitudinal data. In Latent Variable Modeling and Applications to Causality (M. Berkane, ed.). Lecture Notes in Statistics 120 69–117. Springer-Verlag, New York.

Robins, J. M. (1998). Structural nested failure time models. In Survival Analysis (P. K. Andersen and N. Keiding, eds.). Encyclopedia of Biostatistics 6 4372–4389. John Wiley and Sons, Chichester, UK.

Robins, J. M. (2000). Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association Section on Bayesian Statistical Science 1999 6–10.

Robins, J. M. (2004). Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium on Biostatistics (D. Y. Lin and P. Heagerty, eds.) 189–326. Springer, New York.

Robins, J. M., Greenland, S. and Hu, F. C. (1999). Estimation of the causal effect of a time-varying exposure on the marginal mean of a repeated binary outcome. Journal of the American Statistical Association 94 687–700.

Robins, J. M., Hernán, M. A. and Brumback, B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11 550–560.

Robins, J. M. and Wasserman, L. A. (1997). Estimation of effects of sequential treatments by reparameterizing directed acyclic graphs. In Proceedings of the 13th Annual Conference on Uncertainty in Artificial Intelligence (D. Geiger and P. Shenoy, eds.) 409–420. Morgan Kaufmann Publishers, San Francisco. http://tinyurl.com/33ghsas

Rosthøj, S., Fullwood, C., Henderson, R. and Stewart, S. (2006). Estimation of optimal dynamic anticoagulation regimes from observational data: A regret-based approach. Statistics in Medicine 25 4197–4215.

Shpitser, I. and Pearl, J. (2006a). Identification of conditional interventional distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06) (R. Dechter and T. Richardson, eds.) 437–444. AUAI Press, Corvallis, Oregon. http://tinyurl.com/2um8w47

Shpitser, I. and Pearl, J. (2006b). Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence 1219–1226. AAAI Press, Menlo Park, California.

Spirtes, P., Glymour, C. and Scheines, R. (2000). Causation, Prediction and Search, Second ed. Springer-Verlag, New York.

Sterne, J. A. C., May, M., Costagliola, D., de Wolf, F., Phillips, A. N., Harris, R., Funk, M. J., Geskus, R. B., Gill, J., Dabis, F., Miro, J. M., Justice, A. C., Ledergerber, B., Fatkenheuer, G., Hogg, R. S., D’Arminio-Monforte, A., Saag, M., Smith, C., Staszewski, S., Egger, M., Cole, S. R. and When To Start Consortium (2009). Timing of initiation of antiretroviral therapy in AIDS-Free HIV-1-infected patients: A collaborative analysis of 18 HIV cohort studies. Lancet 373 1352–1363.

Taubman, S. L., Robins, J. M., Mittleman, M. A. and Hernán, M. A. (2009). Intervening on risk factors for coronary heart disease: An application of the parametric g-formula. International Journal of Epidemiology 38 1599–1611.

Tian, J. (2008). Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence (UAI-08) (D. McAllester and A. Nicholson, eds.) 554–561. AUAI Press, Corvallis, Oregon. http://tinyurl.com/36ufx2h

Verma, T. and Pearl, J. (1990). Causal networks: Semantics and expressiveness. In Uncertainty in Artificial Intelligence 4 (R. D. Shachter, T. S. Levitt, L. N. Kanal and J. F. Lemmer, eds.) 69–76. North-Holland, Amsterdam.





Statistical errors in Monte Carlo-based inference for random elements. (arXiv:2005.02532v2 [math.ST] UPDATED)

Monte Carlo simulation is useful for computing or estimating expected functionals of random elements when samples of those elements can be generated from the true distribution. However, when the distribution has unknown parameters, the samples must be generated from an estimated distribution in which the parameters are replaced by estimators, which introduces a statistical error into the Monte Carlo estimate. This paper considers this statistical error and investigates the asymptotic distributions of Monte Carlo-based estimators when the random elements are not only real-valued but also function-valued random variables. We also investigate expected functionals for semimartingales in detail. The analysis indicates that Monte Carlo estimation can deteriorate when a semimartingale has a jump part with unknown parameters that cannot be removed.
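
The plug-in effect described in this abstract can be illustrated with a minimal sketch. The exponential model, the functional E[X^2], the sample sizes and all variable names below are assumptions chosen for illustration; they are not taken from the paper. The point is that the Monte Carlo average converges to the expectation under the estimated distribution, so the error from parameter estimation remains however many samples are simulated.

```python
import numpy as np


def plug_in_mc(theta_hat, n_mc, rng):
    """Monte Carlo estimate of E[X^2] for X ~ Exponential(rate=theta_hat)."""
    x = rng.exponential(scale=1.0 / theta_hat, size=n_mc)
    return float(np.mean(x ** 2))


rng = np.random.default_rng(0)

theta_true = 2.0                # hypothetical true rate, unknown in practice
target = 2.0 / theta_true ** 2  # E[X^2] = 2 / theta^2 for the exponential law

# Step 1: estimate the parameter from a small observed sample.
n_obs = 50
data = rng.exponential(scale=1.0 / theta_true, size=n_obs)
theta_hat = 1.0 / data.mean()   # maximum likelihood estimate of the rate

# Step 2: simulate from the estimated distribution, not the true one.
estimate = plug_in_mc(theta_hat, n_mc=200_000, rng=rng)

# The Monte Carlo average approaches this plug-in value, not the target.
plug_in_limit = 2.0 / theta_hat ** 2

print("target:", target)
print("plug-in limit:", plug_in_limit)
print("Monte Carlo estimate:", estimate)
```

Increasing `n_mc` shrinks only the gap between the estimate and the plug-in limit. The gap between the plug-in limit and the target is governed by `n_obs`.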





Generating Thermal Image Data Samples using 3D Facial Modelling Techniques and Deep Learning Methodologies. (arXiv:2005.01923v2 [cs.CV] UPDATED)

Methods for generating synthetic data have become increasingly important for building the large datasets required by Convolutional Neural Network (CNN) based deep learning techniques in a wide range of computer vision applications. In this work, we extend existing methodologies to show how 2D thermal facial data can be mapped to provide 3D facial models. For the proposed research work we have used the Tufts datasets to generate varying 3D face poses from a single frontal face pose. The system works by refining the existing image quality through fusion-based image preprocessing operations. The refined outputs have better contrast adjustment, lower noise levels and higher exposedness of the dark regions. This makes the facial landmarks and temperature patterns on the human face more discernible than in the original raw data. Different image quality metrics are used to compare the refined images with the original images. In the next phase of the proposed study, the refined images are used to create 3D facial geometry structures using Convolutional Neural Networks (CNN). The generated outputs are then imported into Blender software to extract the final 3D thermal facial outputs of both males and females. The same technique is also applied to our own thermal face data, acquired with a prototype thermal camera (developed under the Heliaus EU project) in an indoor lab environment. This data is then used to generate synthetic 3D face data with varying yaw face angles, and lastly a facial depth map is generated.





Data-Space Inversion Using a Recurrent Autoencoder for Time-Series Parameterization. (arXiv:2005.00061v2 [stat.ML] UPDATED)

Data-space inversion (DSI) and related procedures represent a family of methods applicable for data assimilation in subsurface flow settings. These methods differ from model-based techniques in that they provide only posterior predictions for quantities (time series) of interest, not posterior models with calibrated parameters. DSI methods require a large number of flow simulations to first be performed on prior geological realizations. Given observed data, posterior predictions can then be generated directly. DSI operates in a Bayesian setting and provides posterior samples of the data vector. In this work we develop and evaluate a new approach for data parameterization in DSI. Parameterization reduces the number of variables to determine in the inversion, and it maintains the physical character of the data variables. The new parameterization uses a recurrent autoencoder (RAE) for dimension reduction, and a long short-term memory (LSTM) network to represent flow-rate time series. The RAE-based parameterization is combined with an ensemble smoother with multiple data assimilation (ESMDA) for posterior generation. Results are presented for two- and three-phase flow in a 2D channelized system and a 3D multi-Gaussian model. The RAE procedure, along with existing DSI treatments, is assessed through comparison to reference rejection sampling (RS) results. The new DSI methodology is shown to consistently outperform existing approaches in terms of statistical agreement with RS results. The method is also shown to accurately capture derived quantities, which are computed from variables considered directly in DSI. This requires correlation and covariance between variables to be properly captured, and accuracy in these relationships is demonstrated. The RAE-based parameterization developed here is clearly useful in DSI, and it may also find application in other subsurface flow problems.





A bimodal gamma distribution: Properties, regression model and applications. (arXiv:2004.12491v2 [stat.ME] UPDATED)

In this paper we propose a bimodal gamma distribution using a quadratic transformation based on the alpha-skew-normal model. We discuss several properties of this distribution, such as the mean, variance, moments, hazard rate and entropy measures. Further, we propose a new regression model with censored data based on the bimodal gamma distribution. This regression model can be very useful for the analysis of real data and could give more realistic fits than other special regression models. Monte Carlo simulations were performed to check the bias in the maximum likelihood estimation. The proposed models are applied to two real data sets found in the literature.