
Modal clustering asymptotics with applications to bandwidth selection

Alessandro Casa, José E. Chacón, Giovanna Menardi.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 835--856.

Abstract:
Density-based clustering relies on the idea of linking groups to some specific features of the probability distribution underlying the data. The reference to a true, yet unknown, population structure allows framing the clustering problem in a standard inferential setting, where the concept of ideal population clustering is defined as the partition induced by the true density function. The nonparametric formulation of this approach, known as modal clustering, draws a correspondence between the groups and the domains of attraction of the density modes. Operationally, a nonparametric density estimate is required and a proper selection of the amount of smoothing, governing the shape of the density and hence possibly the modal structure, is crucial to identify the final partition. In this work, we address the issue of density estimation for modal clustering from an asymptotic perspective. A natural and easy to interpret metric to measure the distance between density-based partitions is discussed, its asymptotic approximation explored, and employed to study the problem of bandwidth selection for nonparametric modal clustering.
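As a rough illustration of how the amount of smoothing governs the modal structure, the sketch below runs a one-dimensional mean-shift iteration on a Gaussian kernel density estimate and assigns each point to the mode it converges to. Mean shift is a common way to compute modal clusters and is an assumption here, not necessarily the procedure studied in the paper; the function names, the toy data, and the two bandwidth values are hypothetical.

```python
import math


def mean_shift_mode(x, data, h, tol=1e-6, max_iter=500):
    """Follow the mean-shift iteration from x towards a mode of the Gaussian KDE."""
    for _ in range(max_iter):
        weights = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data]
        new_x = sum(w * xi for w, xi in zip(weights, data)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


def modal_clusters(data, h, merge_tol=1e-2):
    """Label each point by the mode its mean-shift path converges to."""
    modes, labels = [], []
    for x in data:
        m = mean_shift_mode(x, data, h)
        for j, known in enumerate(modes):
            if abs(m - known) < merge_tol:  # same domain of attraction
                labels.append(j)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels


# Toy data with two well-separated groups (illustrative only).
data = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
modes_small, labels_small = modal_clusters(data, h=0.5)  # small bandwidth
modes_large, labels_large = modal_clusters(data, h=5.0)  # heavy smoothing
print(len(modes_small), labels_small)
print(len(modes_large), labels_large)
```

With the small bandwidth the estimate keeps two modes and hence two clusters, while the large bandwidth merges everything into a single cluster, which is why bandwidth selection matters for the final partition.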





Estimation of a semiparametric transformation model: A novel approach based on least squares minimization

Benjamin Colling, Ingrid Van Keilegom.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 769--800.

Abstract:
Consider the following semiparametric transformation model $\Lambda_{\theta}(Y)=m(X)+\varepsilon$, where $X$ is a $d$-dimensional covariate, $Y$ is a univariate response variable and $\varepsilon$ is an error term with zero mean and independent of $X$. We assume that $m$ is an unknown regression function and that $\{\Lambda_{\theta}:\theta\in\Theta\}$ is a parametric family of strictly increasing functions. Our goal is to develop two new estimators of the transformation parameter $\theta$. The main idea of these two estimators is to minimize, with respect to $\theta$, the $L_{2}$-distance between the transformation $\Lambda_{\theta}$ and one of its fully nonparametric estimators. We consider in particular the nonparametric estimator based on the least-absolute deviation loss constructed in Colling and Van Keilegom (2019). We establish the consistency and the asymptotic normality of the two proposed estimators of $\theta$. We also carry out a simulation study to illustrate and compare the performance of our new parametric estimators to that of the profile likelihood estimator constructed in Linton et al. (2008).





Profile likelihood biclustering

Cheryl Flynn, Patrick Perry.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 731--768.

Abstract:
Biclustering, the process of simultaneously clustering the rows and columns of a data matrix, is a popular and effective tool for finding structure in a high-dimensional dataset. Many biclustering procedures appear to work well in practice, but most do not have associated consistency guarantees. To address this shortcoming, we propose a new biclustering procedure based on profile likelihood. The procedure applies to a broad range of data modalities, including binary, count, and continuous observations. We prove that the procedure recovers the true row and column classes when the dimensions of the data matrix tend to infinity, even if the functional form of the data distribution is misspecified. The procedure requires computing a combinatorial search, which can be expensive in practice. Rather than performing this search directly, we propose a new heuristic optimization procedure based on the Kernighan-Lin heuristic, which has nice computational properties and performs well in simulations. We demonstrate our procedure with applications to congressional voting records, and microarray analysis.





A Model of Fake Data in Data-driven Analysis

Data-driven analysis has been increasingly used in various decision making processes. Now that more sources, including reviews, news, and pictures, can be used for data analysis, the authenticity of data sources is in doubt. While previous literature attempted to detect fake data piece by piece, in the current work, we try to capture the fake data sender's strategic behavior to detect the fake data source. Specifically, we model the tension between a data receiver who makes data-driven decisions and a fake data sender who benefits from misleading the receiver. We propose a potentially infinite horizon continuous time game-theoretic model with asymmetric information to capture the fact that the receiver does not initially know the existence of fake data and learns about it during the course of the game. We use point processes to model the data traffic, where each piece of data can occur at any discrete moment in a continuous time flow. We fully solve the model and employ numerical examples to illustrate the players' strategies and payoffs for insights. Specifically, our results show that maintaining some suspicion about the data sources and understanding that the sender can be strategic are very helpful to the data receiver. In addition, based on our model, we propose a methodology of detecting fake data that is complementary to the previous studies on this topic, which suggested various approaches to analyzing the data piece by piece. We show that after analyzing each piece of data, understanding a source by looking at its whole history of pushing data can be helpful.





Universal Latent Space Model Fitting for Large Networks with Edge Covariates

Latent space models are effective tools for statistical modeling and visualization of network data. Due to their close connection to generalized linear models, it is also natural to incorporate covariate information in them. The current paper presents two universal fitting algorithms for networks with edge covariates: one based on nuclear norm penalization and the other based on projected gradient descent. Both algorithms are motivated by maximizing the likelihood function for an existing class of inner-product models, and we establish their statistical rates of convergence for these models. In addition, the theory informs us that both methods work simultaneously for a wide range of different latent space models that allow latent positions to affect edge formation in flexible ways, such as distance models. Furthermore, the effectiveness of the methods is demonstrated on a number of real world network data sets for different statistical tasks, including community detection with and without edge covariates, and network assisted learning.





Lower Bounds for Parallel and Randomized Convex Optimization

We study the question of whether parallelization in the exploration of the feasible set can be used to speed up convex optimization, in the local oracle model of computation and in the high-dimensional regime. We show that the answer is negative for both deterministic and randomized algorithms applied to essentially any of the interesting geometries and nonsmooth, weakly-smooth, or smooth objective functions. In particular, we show that it is not possible to obtain a polylogarithmic (in the sequential complexity of the problem) number of parallel rounds with a polynomial (in the dimension) number of queries per round. In the majority of these settings and when the dimension of the space is polynomial in the inverse target accuracy, our lower bounds match the oracle complexity of sequential convex optimization, up to at most a logarithmic factor in the dimension, which makes them (nearly) tight. Another conceptual contribution of our work is in providing a general and streamlined framework for proving lower bounds in the setting of parallel convex optimization. Prior to our work, lower bounds for parallel convex optimization algorithms were only known in a small fraction of the settings considered in this paper, mainly applying to Euclidean ($\ell_2$) and $\ell_\infty$ spaces.





DESlib: A Dynamic ensemble selection library in Python

DESlib is an open-source python library providing the implementation of several dynamic selection techniques. The library is divided into three modules: (i) dcs, containing the implementation of dynamic classifier selection methods (DCS); (ii) des, containing the implementation of dynamic ensemble selection methods (DES); (iii) static, with the implementation of static ensemble techniques. The library is fully documented (documentation available online on Read the Docs), has a high test coverage (codecov.io) and is part of the scikit-learn-contrib supported projects. Documentation, code and examples can be found on its GitHub page: https://github.com/scikit-learn-contrib/DESlib.





Weighted Message Passing and Minimum Energy Flow for Heterogeneous Stochastic Block Models with Side Information

We study the misclassification error for community detection in general heterogeneous stochastic block models (SBM) with noisy or partial label information. We establish a connection between the misclassification rate and the notion of minimum energy on the local neighborhood of the SBM. We develop an optimally weighted message passing algorithm to reconstruct labels for SBM based on the minimum energy flow and the eigenvectors of a certain Markov transition matrix. The general SBM considered in this paper allows for unequal-size communities, degree heterogeneity, and different connection probabilities among blocks. We focus on how to optimally weigh the message passing to improve misclassification.





Generalized probabilistic principal component analysis of correlated data

Principal component analysis (PCA) is a well-established tool in machine learning and data processing. The principal axes in PCA were shown to be equivalent to the maximum marginal likelihood estimator of the factor loading matrix in a latent factor model for the observed data, assuming that the latent factors are independently distributed as standard normal distributions. However, the independence assumption may be unrealistic for many scenarios such as modeling multiple time series, spatial processes, and functional data, where the outcomes are correlated. In this paper, we introduce the generalized probabilistic principal component analysis (GPPCA) to study the latent factor model for multiple correlated outcomes, where each factor is modeled by a Gaussian process. Our method generalizes the previous probabilistic formulation of PCA (PPCA) by providing the closed-form maximum marginal likelihood estimator of the factor loadings and other parameters. Based on the explicit expression of the precision matrix in the marginal likelihood that we derived, the number of the computational operations is linear to the number of output variables. Furthermore, we also provide the closed-form expression of the marginal likelihood when other covariates are included in the mean structure. We highlight the advantage of GPPCA in terms of the practical relevance, estimation accuracy and computational convenience. Numerical studies of simulated and real data confirm the excellent finite-sample performance of the proposed approach.





On lp-Support Vector Machines and Multidimensional Kernels

In this paper, we extend the methodology developed for Support Vector Machines (SVM) using the $\ell_2$-norm ($\ell_2$-SVM) to the more general case of $\ell_p$-norms with $p>1$ ($\ell_p$-SVM). We derive second order cone formulations for the resulting dual and primal problems. The concept of kernel function, widely applied in $\ell_2$-SVM, is extended to the more general case of $\ell_p$-norms with $p>1$ by defining a new operator called multidimensional kernel. This object gives rise to reformulations of dual problems, in a transformed space of the original data, where the dependence on the original data always appears as homogeneous polynomials. We adapt known solution algorithms to efficiently solve the resulting primal and dual problems, and some computational experiments on real-world datasets are presented showing rather good behavior in terms of the accuracy of $\ell_p$-SVM with $p>1$.





Connecting Spectral Clustering to Maximum Margins and Level Sets

We study the connections between spectral clustering and the problems of maximum margin clustering, and estimation of the components of level sets of a density function. Specifically, we obtain bounds on the eigenvectors of graph Laplacian matrices in terms of the between cluster separation, and within cluster connectivity. These bounds ensure that the spectral clustering solution converges to the maximum margin clustering solution as the scaling parameter is reduced towards zero. The sensitivity of maximum margin clustering solutions to outlying points is well known, but can be mitigated by first removing such outliers, and applying maximum margin clustering to the remaining points. If outliers are identified using an estimate of the underlying probability density, then the remaining points may be seen as an estimate of a level set of this density function. We show that such an approach can be used to consistently estimate the components of the level sets of a density function under very mild assumptions.





Lower Bounds for Testing Graphical Models: Colorings and Antiferromagnetic Ising Models

We study the identity testing problem in the context of spin systems or undirected graphical models, where it takes the following form: given the parameter specification of the model $M$ and a sampling oracle for the distribution $\mu_{M^*}$ of an unknown model $M^*$, can we efficiently determine if the two models $M$ and $M^*$ are the same? We consider identity testing for both soft-constraint and hard-constraint systems. In particular, we prove hardness results in two prototypical cases, the Ising model and proper colorings, and explore whether identity testing is any easier than structure learning. For the ferromagnetic (attractive) Ising model, Daskalakis et al. (2018) presented a polynomial-time algorithm for identity testing. We prove hardness results in the antiferromagnetic (repulsive) setting in the same regime of parameters where structure learning is known to require a super-polynomial number of samples. Specifically, for $n$-vertex graphs of maximum degree $d$, we prove that if $|\beta| d = \omega(\log{n})$ (where $\beta$ is the inverse temperature parameter), then there is no polynomial running time identity testing algorithm unless $RP=NP$. In the hard-constraint setting, we present hardness results for identity testing for proper colorings. Our results are based on the presumed hardness of #BIS, the problem of (approximately) counting independent sets in bipartite graphs.





A New Class of Time Dependent Latent Factor Models with Applications

In many applications, observed data are influenced by some combination of latent causes. For example, suppose sensors are placed inside a building to record responses such as temperature, humidity, power consumption and noise levels. These random, observed responses are typically affected by many unobserved, latent factors (or features) within the building such as the number of individuals, the turning on and off of electrical devices, power surges, etc. These latent factors are usually present for a contiguous period of time before disappearing; further, multiple factors could be present at a time. This paper develops new probabilistic methodology and inference methods for random object generation influenced by latent features exhibiting temporal persistence. Every datum is associated with subsets of a potentially infinite number of hidden, persistent features that account for temporal dynamics in an observation. The ensuing class of dynamic models constructed by adapting the Indian Buffet Process — a probability measure on the space of random, unbounded binary matrices — finds use in a variety of applications arising in operations, signal processing, biomedicine, marketing, image analysis, etc. Illustrations using synthetic and real data are provided.





On the Complexity Analysis of the Primal Solutions for the Accelerated Randomized Dual Coordinate Ascent

Dual first-order methods are essential techniques for large-scale constrained convex optimization. However, when recovering the primal solutions, we need $T(\epsilon^{-2})$ iterations to achieve an $\epsilon$-optimal primal solution when we apply an algorithm to the non-strongly convex dual problem with $T(\epsilon^{-1})$ iterations to achieve an $\epsilon$-optimal dual solution, where $T(x)$ can be $x$ or $\sqrt{x}$. In this paper, we prove that the iteration complexity of the primal solutions and dual solutions have the same $O\left(\frac{1}{\sqrt{\epsilon}}\right)$ order of magnitude for the accelerated randomized dual coordinate ascent. When the dual function further satisfies the quadratic functional growth condition, by restarting the algorithm at any period, we establish the linear iteration complexity for both the primal solutions and dual solutions even if the condition number is unknown. When applied to the regularized empirical risk minimization problem, we prove the iteration complexity of $O\left(n\log n+\sqrt{\frac{n}{\epsilon}}\right)$ in both primal space and dual space, where $n$ is the number of samples. Our result takes out the $\left(\log \frac{1}{\epsilon}\right)$ factor compared with the methods based on smoothing/regularization or Catalyst reduction. As far as we know, this is the first time that the optimal $O\left(\sqrt{\frac{n}{\epsilon}}\right)$ iteration complexity in the primal space is established for the dual coordinate ascent based stochastic algorithms. We also establish the accelerated linear complexity for some problems with nonsmooth loss, e.g., the least absolute deviation and SVM.





Learning with Fenchel-Young losses

Over the past decades, numerous loss functions have been proposed for a variety of supervised learning tasks, including regression, classification, ranking, and more generally structured prediction. Understanding the core principles and theoretical properties underpinning these losses is key to choosing the right loss for the right problem, as well as to creating new losses which combine their strengths. In this paper, we introduce Fenchel-Young losses, a generic way to construct a convex loss function for a regularized prediction function. We provide an in-depth study of their properties in a very broad setting, covering all the aforementioned supervised learning tasks, and revealing new connections between sparsity, generalized entropies, and separation margins. We show that Fenchel-Young losses unify many well-known loss functions and make it easy to create useful new ones. Finally, we derive efficient predictive and training algorithms, making Fenchel-Young losses appealing both in theory and practice.
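To make the construction concrete, the sketch below evaluates a Fenchel-Young loss of the form Omega*(theta) + Omega(y) - <theta, y> for one particular regularizer. The choice of Omega as the negative Shannon entropy on the probability simplex (whose convex conjugate is the log-sum-exp function) is an assumption made for illustration, and all function names are hypothetical. Under that choice the loss should coincide with the usual cross-entropy of a softmax prediction when y is one-hot.

```python
import math


def logsumexp(theta):
    """Conjugate of the negative Shannon entropy restricted to the simplex."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))


def neg_entropy(p):
    """Omega(p) = sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    return sum(pi * math.log(pi) for pi in p if pi > 0.0)


def softmax(theta):
    lse = logsumexp(theta)
    return [math.exp(t - lse) for t in theta]


def fenchel_young_loss(theta, y):
    """Omega*(theta) + Omega(y) - <theta, y> for the entropy regularizer."""
    inner = sum(t * yi for t, yi in zip(theta, y))
    return logsumexp(theta) + neg_entropy(y) - inner


theta = [2.0, 0.5, -1.0]   # scores (illustrative values)
y = [1.0, 0.0, 0.0]        # one-hot target
loss = fenchel_young_loss(theta, y)
cross_entropy = -math.log(softmax(theta)[0])
print(loss, cross_entropy)
```

The loss is non-negative and vanishes when the target equals the regularized prediction, here softmax(theta), which is the property that makes the construction usable as a training objective.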





Causal Discovery Toolbox: Uncovering causal relationships in Python

This paper presents a new open source Python framework for causal discovery from observational data and domain background knowledge, aimed at causal graph and causal mechanism modeling. The cdt package implements an end-to-end approach, recovering the direct dependencies (the skeleton of the causal graph) and the causal relationships between variables. It includes algorithms from the 'Bnlearn' and 'Pcalg' packages, together with algorithms for pairwise causal discovery such as ANM.





Latent Simplex Position Model: High Dimensional Multi-view Clustering with Uncertainty Quantification

High dimensional data often contain multiple facets, and several clustering patterns can co-exist under different variable subspaces, also known as the views. While multi-view clustering algorithms were proposed, the uncertainty quantification remains difficult: a particular challenge is in the high complexity of estimating the cluster assignment probability under each view, and sharing information among views. In this article, we propose an approximate Bayes approach, treating the similarity matrices generated over the views as rough first-stage estimates for the co-assignment probabilities; in its Kullback-Leibler neighborhood, we obtain a refined low-rank matrix, formed by the pairwise product of simplex coordinates. Interestingly, each simplex coordinate directly encodes the cluster assignment uncertainty. For multi-view clustering, we let each view draw a parameterization from a few candidates, leading to dimension reduction. With high model flexibility, the estimation can be efficiently carried out as a continuous optimization problem, hence enjoys gradient-based computation. The theory establishes the connection of this model to a random partition distribution under multiple views. Compared to single-view clustering approaches, substantially more interpretable results are obtained when clustering brains from a human traumatic brain injury study, using high-dimensional gene expression data.





Learning Linear Non-Gaussian Causal Models in the Presence of Latent Variables

We consider the problem of learning causal models from observational data generated by linear non-Gaussian acyclic causal models with latent variables. Without considering the effect of latent variables, the inferred causal relationships among the observed variables are often wrong. Under faithfulness assumption, we propose a method to check whether there exists a causal path between any two observed variables. From this information, we can obtain the causal order among the observed variables. The next question is whether the causal effects can be uniquely identified as well. We show that causal effects among observed variables cannot be identified uniquely under mere assumptions of faithfulness and non-Gaussianity of exogenous noises. However, we are able to propose an efficient method that identifies the set of all possible causal effects that are compatible with the observational data. We present additional structural conditions on the causal graph under which causal effects among observed variables can be determined uniquely. Furthermore, we provide necessary and sufficient graphical conditions for unique identification of the number of variables in the system. Experiments on synthetic data and real-world data show the effectiveness of our proposed algorithm for learning causal models.





Switching Regression Models and Causal Inference in the Presence of Discrete Latent Variables

Given a response $Y$ and a vector $X = (X^1, \dots, X^d)$ of $d$ predictors, we investigate the problem of inferring direct causes of $Y$ among the vector $X$. Models for $Y$ that use all of its causal covariates as predictors enjoy the property of being invariant across different environments or interventional settings. Given data from such environments, this property has been exploited for causal discovery. Here, we extend this inference principle to situations in which some (discrete-valued) direct causes of $Y$ are unobserved. Such cases naturally give rise to switching regression models. We provide sufficient conditions for the existence, consistency and asymptotic normality of the MLE in linear switching regression models with Gaussian noise, and construct a test for the equality of such models. These results allow us to prove that the proposed causal discovery method obtains asymptotic false discovery control under mild conditions. We provide an algorithm, make available code, and test our method on simulated data. It is robust against model violations and outperforms state-of-the-art approaches. We further apply our method to a real data set, where we show that it does not only output causal predictors, but also a process-based clustering of data points, which could be of additional interest to practitioners.





Greedy Attack and Gumbel Attack: Generating Adversarial Examples for Discrete Data

We present a probabilistic framework for studying adversarial attacks on discrete data. Based on this framework, we derive a perturbation-based method, Greedy Attack, and a scalable learning-based method, Gumbel Attack, that illustrate various tradeoffs in the design of attacks. We demonstrate the effectiveness of these methods using both quantitative metrics and human evaluation on various state-of-the-art models for text classification, including a word-based CNN, a character-based CNN and an LSTM. As an example of our results, we show that the accuracy of character-based convolutional networks drops to the level of random selection by modifying only five characters through Greedy Attack.





A Convex Parametrization of a New Class of Universal Kernel Functions

The accuracy and complexity of kernel learning algorithms are determined by the set of kernels over which they are able to optimize. An ideal set of kernels should: admit a linear parameterization (tractability); be dense in the set of all kernels (accuracy); and every member should be universal so that the hypothesis space is infinite-dimensional (scalability). Currently, there is no class of kernels that meets all three criteria - e.g. Gaussians are not tractable or accurate; polynomials are not scalable. We propose a new class that meets all three criteria - the Tessellated Kernel (TK) class. Specifically, the TK class: admits a linear parameterization using positive matrices; is dense in all kernels; and every element in the class is universal. This implies that the use of TK kernels for learning the kernel can obviate the need for selecting candidate kernels in algorithms such as SimpleMKL and parameters such as the bandwidth. Numerical testing on soft margin Support Vector Machine (SVM) problems shows that algorithms using TK kernels outperform other kernel learning algorithms and neural networks. Furthermore, our results show that when the ratio of the number of training data to features is high, the improvement of TK over MKL increases significantly.





Ancestral Gumbel-Top-k Sampling for Sampling Without Replacement

We develop ancestral Gumbel-Top-$k$ sampling: a generic and efficient method for sampling without replacement from discrete-valued Bayesian networks, which includes multivariate discrete distributions, Markov chains and sequence models. The method uses an extension of the Gumbel-Max trick to sample without replacement by finding the top $k$ of perturbed log-probabilities among all possible configurations of a Bayesian network. Despite the exponentially large domain, the algorithm has a complexity linear in the number of variables and sample size $k$. Our algorithm allows setting the number of parallel processors $m$, to trade off the number of iterations versus the total cost (iterations times $m$) of running the algorithm. For $m = 1$ the algorithm has minimum total cost, whereas for $m = k$ the number of iterations is minimized, and the resulting algorithm is known as Stochastic Beam Search. We provide extensions of the algorithm and discuss a number of related algorithms. We analyze the properties of ancestral Gumbel-Top-$k$ sampling and compare against alternatives on randomly generated Bayesian networks with different levels of connectivity. In the context of (deep) sequence models, we show its use as a method to generate diverse but high-quality translations and statistical estimates of translation quality and entropy.
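The underlying idea of taking the top $k$ of Gumbel-perturbed log-probabilities can be sketched for a single categorical distribution with an explicitly enumerated domain. This is only the basic Gumbel-Top-$k$ trick, not the ancestral algorithm for Bayesian networks described in the abstract, which avoids enumerating the domain; the function names and the example probabilities are hypothetical.

```python
import math
import random


def standard_gumbel(rng):
    """Draw one Gumbel(0, 1) variate as -log(-log(U)) with U in (0, 1)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() never returns 1.0
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_top_k(log_probs, k, rng):
    """Return k distinct indices: the top k of the perturbed log-probabilities."""
    if not 0 <= k <= len(log_probs):
        raise ValueError("k must be between 0 and the number of categories")
    keys = [lp + standard_gumbel(rng) for lp in log_probs]
    order = sorted(range(len(log_probs)), key=lambda i: keys[i], reverse=True)
    return order[:k]


probs = [0.7, 0.2, 0.1]  # illustrative distribution
log_probs = [math.log(p) for p in probs]
rng = random.Random(0)

sample = gumbel_top_k(log_probs, k=2, rng=rng)
print(sample)  # two distinct category indices, sampled without replacement

# With k = 1 this reduces to the Gumbel-Max trick, so index 0 should be
# drawn with frequency close to 0.7.
draws = [gumbel_top_k(log_probs, 1, rng)[0] for _ in range(2000)]
freq0 = draws.count(0) / len(draws)
print(freq0)
```

Because each index receives exactly one perturbed key, the returned indices are distinct by construction, which is what makes this a without-replacement sample.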





Ensemble Learning for Relational Data

We present a theoretical analysis framework for relational ensemble models. We show that ensembles of collective classifiers can improve predictions for graph data by reducing errors due to variance in both learning and inference. In addition, we propose a relational ensemble framework that combines a relational ensemble learning approach with a relational ensemble inference approach for collective classification. The proposed ensemble techniques are applicable for both single and multiple graph settings. Experiments on both synthetic and real-world data demonstrate the effectiveness of the proposed framework. Finally, our experimental results support the theoretical analysis and confirm that ensemble algorithms that explicitly focus on both learning and inference processes and aim at reducing errors associated with both, are the best performers.





High-Dimensional Inference for Cluster-Based Graphical Models

Motivated by modern applications in which one constructs graphical models based on a very large number of features, this paper introduces a new class of cluster-based graphical models, in which variable clustering is applied as an initial step for reducing the dimension of the feature space. We employ model assisted clustering, in which the clusters contain features that are similar to the same unobserved latent variable. Two different cluster-based Gaussian graphical models are considered: the latent variable graph, corresponding to the graphical model associated with the unobserved latent variables, and the cluster-average graph, corresponding to the vector of features averaged over clusters. Our study reveals that likelihood based inference for the latent graph, not analyzed previously, is analytically intractable. Our main contribution is the development and analysis of alternative estimation and inference strategies, for the precision matrix of an unobservable latent vector Z. We replace the likelihood of the data by an appropriate class of empirical risk functions, that can be specialized to the latent graphical model and to the simpler, but under-analyzed, cluster-average graphical model. The estimators thus derived can be used for inference on the graph structure, for instance on edge strength or pattern recovery. Inference is based on the asymptotic limits of the entry-wise estimates of the precision matrices associated with the conditional independence graphs under consideration. While taking the uncertainty induced by the clustering step into account, we establish Berry-Esseen central limit theorems for the proposed estimators. It is noteworthy that, although the clusters are estimated adaptively from the data, the central limit theorems regarding the entries of the estimated graphs are proved under the same conditions one would use if the clusters were known in advance. 
As an illustration of the usage of these newly developed inferential tools, we show that they can be reliably used for recovery of the sparsity pattern of the graphs we study, under FDR control, which is verified via simulation studies and an fMRI data analysis. These experimental results confirm the theoretically established difference between the two graph structures. Furthermore, the data analysis suggests that the latent variable graph, corresponding to the unobserved cluster centers, can help provide more insight into the understanding of the brain connectivity networks relative to the simpler, average-based, graph.





GraKeL: A Graph Kernel Library in Python

The problem of accurately measuring the similarity between graphs is at the core of many applications in a variety of disciplines. Graph kernels have recently emerged as a promising approach to this problem. There are now many kernels, each focusing on different structural aspects of graphs. Here, we present GraKeL, a library that unifies several graph kernels into a common framework. The library is written in Python and adheres to the scikit-learn interface. It is simple to use and can be naturally combined with scikit-learn's modules to build a complete machine learning pipeline for tasks such as graph classification and clustering. The code is BSD licensed and is available at: https://github.com/ysig/GraKeL.





Conjugate Gradients for Kernel Machines

Regularized least-squares (kernel-ridge / Gaussian process) regression is a fundamental algorithm of statistics and machine learning. Because generic algorithms for the exact solution have cubic complexity in the number of datapoints, large datasets require resorting to approximations. In this work, the computation of the least-squares prediction is itself treated as a probabilistic inference problem. We propose a structured Gaussian regression model on the kernel function that uses projections of the kernel matrix to obtain a low-rank approximation of the kernel and the matrix. A central result is an enhanced way to use the method of conjugate gradients for the specific setting of least-squares regression as encountered in machine learning.
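For orientation, the sketch below applies the textbook conjugate gradient method to the regularized least-squares system (K + lambda I) alpha = y. It is the classical algorithm, not the probabilistic variant proposed in the paper; the RBF kernel, the regularization value, and the toy data are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math


def rbf_kernel_matrix(xs, lengthscale=1.0):
    """Gram matrix of a squared-exponential kernel on scalar inputs."""
    return [[math.exp(-0.5 * ((a - b) / lengthscale) ** 2) for b in xs] for a in xs]


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradients(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive definite matrix A."""
    n = len(b)
    max_iter = max_iter or 10 * n
    x = [0.0] * n
    r = list(b)          # residual b - A x for the starting point x = 0
    p = list(r)          # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs) < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


xs = [0.0, 0.5, 1.0, 1.5, 2.0]          # toy inputs
ys = [math.sin(x) for x in xs]          # toy targets
lam = 0.1                               # regularization strength (assumed)
K = rbf_kernel_matrix(xs)
A = [[K[i][j] + (lam if i == j else 0.0) for j in range(len(xs))] for i in range(len(xs))]

alpha = conjugate_gradients(A, ys)
residual = max(abs(a - y) for a, y in zip(matvec(A, alpha), ys))
print(residual)
```

Each iteration needs only one matrix-vector product, so stopping early gives an approximate solution at a cost below that of a cubic-time direct solve.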




el

Self-paced Multi-view Co-training

Co-training is a well-known semi-supervised learning approach which trains classifiers on two or more different views and exchanges pseudo labels of unlabeled instances in an iterative way. During the co-training process, pseudo labels of unlabeled instances are very likely to be false, especially in the initial training, while the standard co-training algorithm adopts a 'draw without replacement' strategy and does not remove these wrongly labeled instances from training stages. Besides, most of the traditional co-training approaches are implemented for two-view cases, and their extensions in multi-view scenarios are not intuitive. These issues not only degrade their performance and limit their range of application but also hamper their fundamental theory. Moreover, there is no optimization model to explain the objective a co-training process manages to optimize. To address these issues, in this study we design a unified self-paced multi-view co-training (SPamCo) framework which draws unlabeled instances with replacement. Two specified co-regularization terms are formulated to develop different strategies for selecting pseudo-labeled instances during training. Both forms share the same optimization strategy which is consistent with the iteration process in co-training and can be naturally extended to multi-view scenarios. A distributed optimization strategy is also introduced to train the classifier of each view in parallel to further improve the efficiency of the algorithm. Furthermore, the SPamCo algorithm is proved to be PAC learnable, supporting its theoretical soundness. Experiments conducted on synthetic, text categorization, person re-identification, image recognition and object detection data sets substantiate the superiority of the proposed method.




el

The weight function in the subtree kernel is decisive

Tree data are ubiquitous because they model a large variety of situations, e.g., the architecture of plants, the secondary structure of RNA, or the hierarchy of XML files. Nevertheless, the analysis of these non-Euclidean data is difficult per se. In this paper, we focus on the subtree kernel, a convolution kernel for tree data introduced by Vishwanathan and Smola in the early 2000s. More precisely, we investigate the influence of the weight function from a theoretical perspective and in real data applications. We establish on a two-class stochastic model that the performance of the subtree kernel is improved when the weight of leaves vanishes, which motivates the definition of a new weight function, learned from the data and not fixed by the user as is usually done. To this end, we define a unified framework for computing the subtree kernel from ordered or unordered trees that is particularly suitable for tuning parameters. We show through eight real data classification problems the great efficiency of our approach, in particular for small data sets, which also underlines the importance of the weight function. Finally, a visualization tool of the significant features is derived.




el

Estimation of a Low-rank Topic-Based Model for Information Cascades

We consider the problem of estimating the latent structure of a social network based on the observed information diffusion events, or cascades, where the observations for a given cascade consist of only the timestamps of infection for infected nodes but not the source of the infection. Most of the existing work on this problem has focused on estimating a diffusion matrix without any structural assumptions on it. In this paper, we propose a novel model based on the intuition that a piece of information is more likely to propagate between two nodes if they are interested in similar topics which are also prominent in the information content. In particular, our model endows each node with an influence vector (which measures how authoritative the node is on each topic) and a receptivity vector (which measures how susceptible the node is to each topic). We show how this node-topic structure can be estimated from the observed cascades, and prove the consistency of the estimator. Experiments on synthetic and real data demonstrate the improved performance and better interpretability of our model compared to existing state-of-the-art methods.




el

High-dimensional Gaussian graphical models on network-linked data

Graphical models are commonly used to represent conditional dependence relationships between variables. There are multiple methods available for exploring them from high-dimensional data, but almost all of them rely on the assumption that the observations are independent and identically distributed. At the same time, observations connected by a network are becoming increasingly common, and tend to violate these assumptions. Here we develop a Gaussian graphical model for observations connected by a network with potentially different mean vectors, varying smoothly over the network. We propose an efficient estimation algorithm and demonstrate its effectiveness on both simulated and real data, obtaining meaningful and interpretable results on a statistics coauthorship network. We also prove that our method estimates both the inverse covariance matrix and the corresponding graph structure correctly under the assumption of network “cohesion”, which refers to the empirically observed phenomenon of network neighbors sharing similar traits.




el

Identifiability of Additive Noise Models Using Conditional Variances

This paper considers a new identifiability condition for additive noise models (ANMs) in which each variable is determined by an arbitrary Borel measurable function of its parents plus an independent error. It has been shown that ANMs are fully recoverable under some identifiability conditions, such as when all error variances are equal. However, this identifiability condition could be restrictive, and hence, this paper focuses on a relaxed identifiability condition that involves not only error variances, but also the influence of parents. This new class of identifiable ANMs does not put any constraints on the form of dependencies, or distributions of errors, and allows different error variances. It further provides a statistically consistent and computationally feasible structure learning algorithm for the identifiable ANMs based on the new identifiability condition. The proposed algorithm assumes that all relevant variables are observed, while it does not assume faithfulness or a sparse graph. Extensive experiments on simulated and real multivariate data demonstrate that the proposed algorithm successfully recovers directed acyclic graphs.




el

TIGER: using artificial intelligence to discover our collections

The State Library of NSW has almost 4 million digital files in its collection.




el

Oriented first passage percolation in the mean field limit

Nicola Kistler, Adrien Schertzer, Marius A. Schmidt.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 414--425.

Abstract:
The Poisson clumping heuristic has led Aldous to conjecture the value of the oriented first passage percolation on the hypercube in the limit of large dimensions. Aldous’ conjecture has been rigorously confirmed by Fill and Pemantle (Ann. Appl. Probab. 3 (1993) 593–629) by means of a variance reduction trick. We present here a streamlined and, we believe, more natural proof based on ideas that emerged in the study of Derrida’s random energy models.




el

Reliability estimation in a multicomponent stress-strength model for Burr XII distribution under progressive censoring

Raj Kamal Maurya, Yogesh Mani Tripathi.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 345--369.

Abstract:
We consider estimation of the multicomponent stress-strength reliability under progressive Type II censoring under the assumption that stress and strength variables follow Burr XII distributions with a common shape parameter. Maximum likelihood estimates of the reliability are obtained along with asymptotic intervals, both when the common shape parameter is known and when it is unknown. Bayes estimates are also derived under the squared error loss function using different approximation methods. Further, we obtain exact Bayes and uniformly minimum variance unbiased estimates of the reliability for the case in which the common shape parameter is known. The highest posterior density intervals are also obtained. We perform Monte Carlo simulations to compare the performance of the proposed estimates and present a discussion based on this study. Finally, two real data sets are analyzed for illustration purposes.




el

A Bayesian sparse finite mixture model for clustering data from a heterogeneous population

Erlandson F. Saraiva, Adriano K. Suzuki, Luís A. Milan.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 323--344.

Abstract:
In this paper, we introduce a Bayesian approach for clustering data using a sparse finite mixture model (SFMM). The SFMM is a finite mixture model with a large, previously fixed number of components $k$, many of which can be empty. In this model, the number of components $k$ can be interpreted as the maximum number of distinct mixture components. Then, we explore the use of a prior distribution for the weights of the mixture model that takes into account the possibility that the number of clusters $k_{\mathbf{c}}$ (i.e., nonempty components) can be random and smaller than the number of components $k$ of the finite mixture model. In order to determine clusters, we develop an MCMC algorithm called the Split-Merge allocation sampler. In this algorithm, the split-merge strategy is data-driven and was inserted within the algorithm in order to increase the mixing of the Markov chain in relation to the number of clusters. The performance of the method is verified using simulated datasets and three real datasets. The first real data set is the benchmark galaxy data, while the second and third are the publicly available Enzyme and Acidity data sets, respectively.




el

Bayesian modeling and prior sensitivity analysis for zero–one augmented beta regression models with an application to psychometric data

Danilo Covaes Nogarotto, Caio Lucidius Naberezny Azevedo, Jorge Luis Bazán.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 304--322.

Abstract:
Interest in the analysis of the zero–one augmented beta regression (ZOABR) model has been increasing over the last few years. In this work, we developed a Bayesian inference for the ZOABR model, providing some contributions, namely: we explored the use of Jeffreys-rule and independence Jeffreys priors for some of the parameters, performing a sensitivity study of prior choice, comparing the Bayesian estimates with the maximum likelihood ones and measuring the accuracy of the estimates under several scenarios of interest. The results indicate that, in general, the Bayesian approach under the Jeffreys-rule prior was as accurate as the ML one. Also, unlike other approaches, we use the predictive distribution of the response to implement Bayesian residuals. To further illustrate the advantages of our approach, we conduct an analysis of a real psychometric data set including a Bayesian residual analysis, where it is shown that misleading inference can be obtained when the data are transformed, that is, when the zeros and ones are transformed to suitable values and the usual beta regression model is considered instead of the ZOABR model. Finally, future developments are discussed.




el

Recent developments in complex and spatially correlated functional data

Israel Martínez-Hernández, Marc G. Genton.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 204--229.

Abstract:
As high-dimensional and high-frequency data are being collected on a large scale, the development of new statistical models is being pushed forward. Functional data analysis provides the required statistical methods to deal with large-scale and complex data by assuming that data are continuous functions, for example, realizations of a continuous process (curves) or continuous random field (surfaces), and that each curve or surface is considered as a single observation. Here, we provide an overview of functional data analysis when data are complex and spatially correlated. We provide definitions and estimators of the first and second moments of the corresponding functional random variable. We present two main approaches. The first assumes that data are realizations of a functional random field, that is, each observation is a curve with a spatial component. We call them spatial functional data. The second approach assumes that data are continuous deterministic fields observed over time. In this case, one observation is a surface or manifold, and we call them surface time series. For these two approaches, we describe software available for the statistical analysis. We also present a data illustration, using a high-resolution wind speed simulated dataset, as an example of the two approaches. The functional data approach offers a new paradigm of data analysis, where the continuous processes or random fields are considered as a single entity. We consider this approach to be very valuable in the context of big data.




el

A note on the “L-logistic regression models: Prior sensitivity analysis, robustness to outliers and applications”

Saralees Nadarajah, Yuancheng Si.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 183--187.

Abstract:
Da Paz, Balakrishnan and Bazan [Braz. J. Probab. Stat. 33 (2019), 455–479] introduced the L-logistic distribution, studied its properties including estimation issues and illustrated a data application. This note derives a closed form expression for moment properties of the distribution. Some computational issues are discussed.




el

On estimating the location parameter of the selected exponential population under the LINEX loss function

Mohd Arshad, Omer Abdalghani.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 167--182.

Abstract:
Let $\pi_{1},\pi_{2},\ldots,\pi_{k}$ be $k\ (\geq 2)$ independent exponential populations having unknown location parameters $\mu_{1},\mu_{2},\ldots,\mu_{k}$ and known scale parameters $\sigma_{1},\ldots,\sigma_{k}$. Let $\mu_{[k]}=\max\{\mu_{1},\ldots,\mu_{k}\}$. For selecting the population associated with $\mu_{[k]}$, a class of selection rules (proposed by Arshad and Misra [Statistical Papers 57 (2016) 605–621]) is considered. We consider the problem of estimating the location parameter $\mu_{S}$ of the selected population under the criterion of the LINEX loss function. We consider three natural estimators $\delta_{N,1},\delta_{N,2}$ and $\delta_{N,3}$ of $\mu_{S}$, based on the maximum likelihood estimators, uniformly minimum variance unbiased estimator (UMVUE) and minimum risk equivariant estimator (MREE) of the $\mu_{i}$’s, respectively. The uniformly minimum risk unbiased estimator (UMRUE) and the generalized Bayes estimator of $\mu_{S}$ are derived. Under the LINEX loss function, a general result for improving a location-equivariant estimator of $\mu_{S}$ is derived. Using this result, an estimator better than the natural estimator $\delta_{N,1}$ is obtained. We also show that the estimator $\delta_{N,1}$ is dominated by the natural estimator $\delta_{N,3}$. Finally, we perform a simulation study to evaluate and compare risk functions among various competing estimators of $\mu_{S}$.
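
The abstract does not spell out the loss, so as a reminder the sketch below implements the LINEX loss in its commonly used form, exp(a * d) - a * d - 1 with d the estimation error. The function name, the shape parameter `a`, and the use of the unscaled error are assumptions of this illustration; the paper may scale the error differently.

```python
import math


def linex_loss(estimate, target, a=1.0):
    """LINEX loss exp(a*d) - a*d - 1 with d = estimate - target.

    The parameter `a` (assumed nonzero) controls the asymmetry:
    a > 0 penalizes overestimation more heavily, a < 0 underestimation.
    """
    if a == 0:
        raise ValueError("a must be nonzero")
    d = estimate - target
    return math.exp(a * d) - a * d - 1.0


if __name__ == "__main__":
    # Same absolute error on either side of the target, different loss.
    print(linex_loss(3.0, 2.0, a=1.0))  # overestimate by 1
    print(linex_loss(1.0, 2.0, a=1.0))  # underestimate by 1
```

The loss is zero only when the estimate equals the target, and for small errors it behaves approximately like a scaled squared error.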




el

Application of weighted and unordered majorization orders in comparisons of parallel systems with exponentiated generalized gamma components

Abedin Haidari, Amir T. Payandeh Najafabadi, Narayanaswamy Balakrishnan.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 150--166.

Abstract:
Consider two parallel systems, say $A$ and $B$, with respective lifetimes $T_{1}$ and $T_{2}$, wherein independent component lifetimes of each system follow exponentiated generalized gamma distributions with possibly different exponential shape and scale parameters. We show here that $T_{2}$ is smaller than $T_{1}$ with respect to the usual stochastic order (reversed hazard rate order) if the vector of logarithm (the main vector) of scale parameters of System $B$ is weakly weighted majorized by that of System $A$, and if the vector of exponential shape parameters of System $A$ is unordered majorized by that of System $B$. By means of some examples, we show that the above results cannot be extended to the hazard rate and likelihood ratio orders. However, when the scale parameters of each system divide into two homogeneous groups, we verify that the usual stochastic and reversed hazard rate orders can be extended, respectively, to the hazard rate and likelihood ratio orders. The established results complete and strengthen some of the known results in the literature.




el

Multivariate normal approximation of the maximum likelihood estimator via the delta method

Andreas Anastasiou, Robert E. Gaunt.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 136--149.

Abstract:
We use the delta method and Stein’s method to derive, under regularity conditions, explicit upper bounds for the distributional distance between the distribution of the maximum likelihood estimator (MLE) of a $d$-dimensional parameter and its asymptotic multivariate normal distribution. Our bounds apply in situations in which the MLE can be written as a function of a sum of i.i.d. $t$-dimensional random vectors. We apply our general bound to establish a bound for the multivariate normal approximation of the MLE of the normal distribution with unknown mean and variance.




el

On the Nielsen distribution

Fredy Castellares, Artur J. Lemonte, Marcos A. C. Santos.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 90--111.

Abstract:
We introduce a two-parameter discrete distribution that may have a zero vertex and can be useful for modeling overdispersion. The discrete Nielsen distribution generalizes the Fisher logarithmic (i.e., logarithmic series) and Stirling type I distributions in the sense that both can be considered displacements of the Nielsen distribution. We provide a comprehensive account of the structural properties of the new discrete distribution. We also show that the Nielsen distribution is infinitely divisible. We discuss maximum likelihood estimation of the model parameters and provide a simple method to find them numerically. The usefulness of the proposed distribution is illustrated by means of three real data sets to prove its versatility in practical applications.




el

Effects of gene–environment and gene–gene interactions in case-control studies: A novel Bayesian semiparametric approach

Durba Bhattacharya, Sourabh Bhattacharya.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 71--89.

Abstract:
Present-day biomedical research is pointing towards the fact that cognizance of gene–environment interactions along with genetic interactions may help prevent or delay the onset of many complex diseases like cardiovascular disease, cancer, type 2 diabetes, autism or asthma by adjustments to lifestyle. In this regard, we propose a Bayesian semiparametric model to detect not only the roles of genes and their interactions, but also the possible influence of environmental variables on the genes in case-control studies. Our model also accounts for the unknown number of genetic sub-populations via finite mixtures composed of Dirichlet processes. An effective parallel computing methodology, developed by us, harnesses the power of parallel processing technology to increase the efficiencies of our conditionally independent Gibbs sampling and Transformation based MCMC (TMCMC) methods. Applications of our model and methods to simulation studies with biologically realistic genotype datasets and a real, case-control based genotype dataset on early onset of myocardial infarction (MI) have yielded quite interesting results, besides providing some insights into the differential effect of gender on MI.




el

Robust Bayesian model selection for heavy-tailed linear regression using finite mixtures

Flávio B. Gonçalves, Marcos O. Prates, Victor Hugo Lachos.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 51--70.

Abstract:
In this paper, we present a novel methodology to perform Bayesian model selection in linear models with heavy-tailed distributions. We consider a finite mixture of distributions to model a latent variable where each component of the mixture corresponds to one possible model within the symmetrical class of normal independent distributions. Naturally, the Gaussian model is one of the possibilities. This allows for a simultaneous analysis based on the posterior probability of each model. Inference is performed via Markov chain Monte Carlo—a Gibbs sampler with Metropolis–Hastings steps for a class of parameters. Simulated examples highlight the advantages of this approach compared to a segregated analysis based on arbitrarily chosen model selection criteria. Examples with real data are presented and an extension to censored linear regression is introduced and discussed.




el

A joint mean-correlation modeling approach for longitudinal zero-inflated count data

Weiping Zhang, Jiangli Wang, Fang Qian, Yu Chen.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 35--50.

Abstract:
Longitudinal zero-inflated count data are widely encountered in many fields, but modeling the correlation between measurements for the same subject is challenging due to the lack of suitable multivariate joint distributions. This paper studies a novel mean-correlation modeling approach for the longitudinal zero-inflated regression model, solving both problems of specifying the joint distribution and parsimoniously modeling correlations with no constraint. The joint distribution of zero-inflated discrete longitudinal responses is modeled by a copula model whose correlation parameters are innovatively represented in hyper-spherical coordinates. To overcome the computational intractability in maximizing the full likelihood function of the model, we further propose a computationally efficient pairwise likelihood approach. We then propose separate mean and correlation regression models to model these key quantities; this modeling approach can also handle irregular and possibly subject-specific time points. The resulting estimators are shown to be consistent and asymptotically normal. A data example and simulations support the effectiveness of the proposed approach.




el

Simple step-stress models with a cure fraction

Nandini Kannan, Debasis Kundu.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 2--17.

Abstract:
In this article, we consider models for time-to-event data obtained from experiments in which stress levels are altered at intermediate stages during the observation period. These experiments, known as step-stress tests, belong to the larger class of accelerated tests used extensively in the reliability literature. The analysis of data from step-stress tests largely relies on the popular cumulative exposure model. However, despite its simple form, the utility of the model is limited, as it is assumed that the hazard function of the underlying distribution is discontinuous at the points at which the stress levels are changed, which may not be very reasonable. Due to this deficiency, Kannan et al. (Journal of Applied Statistics 37 (2010b) 1625–1636) introduced the cumulative risk model, where the hazard function is continuous. In this paper, we propose a class of parametric models based on the cumulative risk model, assuming the underlying population contains long-term survivors, or a ‘cured’ fraction. An EM algorithm to compute the maximum likelihood estimators of the unknown parameters is proposed. This research is motivated by a study on altitude decompression sickness. The performance of different parametric models will be evaluated using data from this study.




el

Bayesian approach for the zero-modified Poisson–Lindley regression model

Wesley Bertoli, Katiane S. Conceição, Marinho G. Andrade, Francisco Louzada.

Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 4, 826--860.

Abstract:
The primary goal of this paper is to introduce the zero-modified Poisson–Lindley regression model as an alternative for modeling overdispersed count data exhibiting inflation or deflation of zeros in the presence of covariates. The zero-modification is incorporated by considering that a zero-truncated process produces positive observations and, consequently, the proposed model can be fitted without any previous information about the zero-modification present in a given dataset. A fully Bayesian approach based on the g-prior method has been considered for inference. An intensive Monte Carlo simulation study has been conducted to evaluate the performance of the developed methodology and the maximum likelihood estimators. The proposed model was considered for the analysis of a real dataset on the number of bids received by $126$ U.S. firms between 1978 and 1985, and the impact of choosing different prior distributions for the regression coefficients has been studied. A sensitivity analysis to detect influential points has been performed based on the Kullback–Leibler divergence. A general comparison with some well-known regression models for discrete data has been presented.




el

Option pricing with bivariate risk-neutral density via copula and heteroscedastic model: A Bayesian approach

Lucas Pereira Lopes, Vicente Garibay Cancho, Francisco Louzada.

Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 4, 801--825.

Abstract:
Multivariate options are adequate tools for multi-asset risk management. The pricing models derived from the pioneering Black and Scholes method under the multivariate case consider that the underlying asset prices follow a geometric Brownian motion. However, the construction of such methods imposes some unrealistic constraints on the process of fair option calculation, such as constant volatility over the maturity time and linear correlation between the assets. Therefore, this paper aims to price and analyze the fair price behavior of the call-on-max (bivariate) option considering marginal heteroscedastic models with dependence structure modeled via copulas. Concerning inference, we adopt a Bayesian perspective and computationally intensive methods based on Markov chain Monte Carlo (MCMC) simulations. A simulation study examines the bias and the root mean squared errors of the posterior means for the parameters. Real stock prices of Brazilian banks illustrate the approach. For the proposed method, we verify the effects of the strike and the dependence structure on the fair price of the option. The results show that the prices obtained by our heteroscedastic model approach and copulas differ substantially from the prices obtained by the model derived from Black and Scholes. Empirical results are presented to argue the advantages of our strategy.




el

Bayesian modelling of the abilities in dichotomous IRT models via regression with missing values in the covariates

Flávio B. Gonçalves, Bárbara C. C. Dias.

Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 4, 782--800.

Abstract:
Educational assessment usually considers a contextual questionnaire to extract relevant information from the applicants. This may include items related to socio-economic profile as well as items to extract other characteristics potentially related to the applicant’s performance in the test. A careful analysis of the questionnaires jointly with the test’s results may reveal important relations between profiles and test performance. The most coherent way to perform this task in a statistical context is to use the information from the questionnaire to help explain the variability of the abilities in a joint model-based approach. Nevertheless, the responses to the questionnaire typically present missing values which, in some cases, may be missing not at random. This paper proposes a statistical methodology to model the abilities in dichotomous IRT models using the information of the contextual questionnaires via linear regression. The proposed methodology models the missing data jointly with all the observed data, which allows for the estimation of the former. The missing data modelling is flexible enough to allow the specification of missing not at random structures. Furthermore, even if those structures are not assumed a priori, they can be estimated from the posterior results when assuming missing (completely) at random structures a priori. Statistical inference is performed under the Bayesian paradigm via an efficient MCMC algorithm. Simulated and real examples are presented to investigate the efficiency and applicability of the proposed methodology.




el

The limiting distribution of the Gibbs sampler for the intrinsic conditional autoregressive model

Marco A. R. Ferreira.

Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 4, 734--744.

Abstract:
We study the limiting behavior of the one-at-a-time Gibbs sampler for the intrinsic conditional autoregressive model with centering on the fly. The intrinsic conditional autoregressive model is widely used as a prior for random effects in hierarchical models for spatial modeling. This model is defined by full conditional distributions that imply an improper joint “density” with a multivariate Gaussian kernel and a singular precision matrix. To guarantee propriety of the posterior distribution, usually at the end of each iteration of the Gibbs sampler the random effects are centered to sum to zero in what is widely known as centering on the fly. While this works well in practice, this informal computational way to recenter the random effects obscures their implied prior distribution and prevents the development of formal Bayesian procedures. Here we show that the implied prior distribution, that is, the limiting distribution of the one-at-a-time Gibbs sampler for the intrinsic conditional autoregressive model with centering on the fly is a singular Gaussian distribution with a covariance matrix that is the Moore–Penrose inverse of the precision matrix. This result has important implications for the development of formal Bayesian procedures such as reference priors and Bayes-factor-based model selection for spatial models.
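
As a reading aid, the relation between a singular precision matrix and its Moore–Penrose inverse can be written out as follows. The notation is assumed here and is not taken from the abstract: $W$ is the symmetric adjacency matrix of a connected neighborhood graph on $n$ regions, $D=\mathrm{diag}(W\mathbf{1})$, and $\tau>0$ is a precision parameter; this is a sketch of the standard linear algebra, not the paper's derivation.

```latex
% Assumed notation: Q = \tau (D - W) for a connected graph, so rank(Q) = n - 1.
\begin{align*}
Q &= \tau\,(D - W), & Q\,\mathbf{1} &= \mathbf{0}, \\
Q &= \sum_{i=1}^{n-1} \lambda_i\, v_i v_i^{\top}, \quad \lambda_i > 0, &
Q^{+} &= \sum_{i=1}^{n-1} \lambda_i^{-1}\, v_i v_i^{\top}, \\
Q^{+}\mathbf{1} &= \mathbf{0}, &
\operatorname{Var}\!\left(\mathbf{1}^{\top} x\right) &= \mathbf{1}^{\top} Q^{+}\,\mathbf{1} = 0
\quad \text{for } x \sim N\!\left(\mathbf{0},\, Q^{+}\right).
\end{align*}
```

Under these assumptions a singular Gaussian with covariance $Q^{+}$ places all its mass on vectors that sum to zero, which is consistent with the sum-to-zero centering described in the abstract.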