science and technology

SLA

(Service Level Agreement) Contractual service commitment. An SLA is a document that describes the minimum performance criteria a provider promises to meet while delivering a service. It typically also sets out the remedial action and any penalties that will take effect if performance falls below the promised standard. It is an essential component of the legal contract between a service consumer and the provider.




science and technology

QoS

(Quality of Service) Consistent performance. Some network services need to be delivered at a minimum performance level to be usable -- for example, a video or audio clip will stutter and break up if the bandwidth is inadequate. QoS refers to a network system's ability to sustain a given service at or above its required minimum performance level.




science and technology

Liberty Alliance

Digital identity standards group. Set up at the instigation of Sun Microsystems in 2001, the Liberty Alliance Project is a consortium of technology vendors and consumer-facing enterprises formed "to establish an open standard for federated network identity." It aims to make it easier for consumers to access networked services from multiple suppliers while safeguarding security and privacy. Its specifications have been published in three phases: the Identity Federation Framework (ID-FF) came first; the Identity Web Services Framework (ID-WSF) followed in November 2003; and work is in progress on the Identity Services Interface Specifications (ID-SIS). Liberty Alliance specifications are closely linked to the SAML single sign-on standard, and overlap with elements of WS-Security.




science and technology

CORBA

(Common Object Request Broker Architecture) Pioneering integration architecture. Developed during the 1990s by the Object Management Group (OMG), CORBA was the first major attempt to define a platform-neutral architecture for combining heterogeneous software resources across a network. A forerunner of today's service-oriented architectures, CORBA was designed for high-end, transaction-heavy enterprise deployments, and thus it works best for tight coupling of software resources written in traditional programming languages such as C, C++, Java, Smalltalk and COBOL. Although the addition of IIOP (Internet Inter-ORB Protocol) extended CORBA to run over the Internet, it is less flexible than today's more loosely coupled SOAs, which are based on the exchange of XML documents using web services.




science and technology

COBOL

(COmmon Business Oriented Language) World's favorite mainframe programming language. Despite its venerable roots as one of the earliest high-level compiled languages, COBOL today still underpins some of the world's most important commercial and government operations, as it remains the most widely used programming language on mainframe computers. Created in 1959 by a cross-industry group of computer manufacturers under the auspices of the US Department of Defense, COBOL was designed as a machine-independent, industry-standard programming language for business data processing -- although in practice there were various incompatibilities between individual makers' versions. It has continued to evolve under the management of US and international standards bodies. The latest revision is COBOL 2002, with the next planned for 2008.




science and technology

WSRF

(Web Services Resource Framework) Web services for grid computing. WSRF defines conventions for managing 'state' so that applications can reliably share changing information. In combination with WS-Notification and other WS-* standards, the result is to make grid resources accessible within a web services architecture. Coupled with WS-Notification, the specification is a response to, and supersedes, the grid community's own first effort to converge grid and web services, the Open Grid Service Infrastructure (OGSI), which the Global Grid Forum (GGF) and others released in 2003. Announced by the Globus Alliance and IBM (with contributions from HP, SAP, Akamai, Tibco and Sonic) in January 2004, WSRF is due to be implemented in version 4.0 of the open source Globus Toolkit for grid computing, as well as several commercial packages. It consists of several component specifications, including WS-Resource Properties, WS-ResourceLifetime, WS-ServiceGroup and WS-BaseFaults.




science and technology

grid computing

Pooled computer resources. Grid computing, or simply grid, is the generic term given to techniques and technologies designed to make pools of distributed computer resources available on-demand. Grid computing was originally conceived by research scientists as a way of combining computers across a network to form a distributed supercomputer to tackle complex computations. In the commercial world, grid aims to maximize the utilization of an organization's computing resources by making them shareable across applications (sometimes called virtualization) and, potentially, to provide computing on demand to third parties as a utility service. When used with specifications such as WSRF and WS-Notification, grid resources can appear as web services within a service-oriented architecture.




science and technology

componentization

Breaking down into interchangeable pieces. For many years, software innovators have been trying to make software more like computer hardware, which is assembled from cheap, mass-produced components that connect together using standard interfaces. Component-based development (CBD) uses this approach to assemble software from reusable components within frameworks such as CORBA, Sun's Enterprise Java Beans (EJBs) and Microsoft COM. Today's service oriented architectures, based on web services, go a step further by encapsulating components in a standards-based service interface, which allows components to be reused outside their native framework. Componentization is not limited to software; through the use of subcontracting and outsourcing, it can also apply to business organizations and processes.




science and technology

granularity

How small the pieces are. When a system is split into components, it's important to get the right degree of componentization. Small, fine-grained components give much greater flexibility in assembling precisely the right combination of functionality, but they are more difficult to co-ordinate. Much larger, coarse-grained components are easier to manage but may become too unwieldy. Performance and management considerations tend to favor the use of more coarsely grained messages in a service oriented architecture, whereas earlier generations of distributed computing have preferred a much finer level of granularity.
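The trade-off above can be sketched in code. This is an illustrative toy (the customer data, function names and round-trip counter are all invented): the same lookup exposed fine-grained (one call per field) versus coarse-grained (one call returns the whole record).

```python
# Hypothetical back-end store and a crude round-trip counter.
CUSTOMERS = {42: {"name": "Acme Ltd", "city": "Leeds", "status": "active"}}
calls = {"count": 0}

# Fine-grained interface: one network call per field -- flexible, but chatty.
def get_field(customer_id, field):
    calls["count"] += 1
    return CUSTOMERS[customer_id][field]

# Coarse-grained interface: one call returns the whole document.
def get_customer(customer_id):
    calls["count"] += 1
    return dict(CUSTOMERS[customer_id])

# Fetching three fields costs three round trips fine-grained...
fine = [get_field(42, f) for f in ("name", "city", "status")]
fine_trips = calls["count"]

calls["count"] = 0
# ...but only one round trip coarse-grained.
coarse = get_customer(42)
coarse_trips = calls["count"]
```

The counter makes the performance argument concrete: the coarse-grained message moves more data per exchange, which is why SOAs tend to prefer it.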




science and technology

endpoint

Where a service connects to the network. In a service oriented architecture, any single network interaction involves two endpoints: one to provide a service, and the other to consume it. In web services, an endpoint is specified by a URI.
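Since a web services endpoint is just a URI, the standard library can pick out its parts; a minimal sketch (the address itself is hypothetical):

```python
from urllib.parse import urlparse

endpoint = "https://api.example.com:8443/orders/v1"
parts = urlparse(endpoint)

scheme = parts.scheme   # transport scheme the consumer must speak
host = parts.hostname   # where the provider listens on the network
port = parts.port       # which port at that host
path = parts.path       # which service behind that host and port
```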




science and technology

MOM

(Message Oriented Middleware) Software plumbing. Message-oriented middleware is the term for software that connects separate systems in a network by carrying and distributing messages between them. The messages may contain data, software instructions, or both together. MOM infrastructure is typically built around a queuing system that stores messages pending delivery, and keeps track of whether and when each message has been delivered. Most MOM systems also support asynchronous publish-subscribe messaging. MOM products frequently use proprietary messaging technologies -- well-known examples include IBM MQSeries, MSMQ from Microsoft and Tibco Rendezvous -- but emerging standards specifications such as JMS and WS-ReliableMessaging are now enabling standards-based MOM infrastructures.
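The two patterns described above -- store-and-forward queuing and publish-subscribe fan-out -- can be sketched with an in-memory toy broker. All names are invented for illustration; real MOM products add persistence, acknowledgements and delivery tracking.

```python
from collections import deque, defaultdict

class Broker:
    def __init__(self):
        self.queues = defaultdict(deque)      # point-to-point queues
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    # Queuing: a message waits in the queue until a consumer asks for it.
    def send(self, queue, message):
        self.queues[queue].append(message)

    def receive(self, queue):
        return self.queues[queue].popleft() if self.queues[queue] else None

    # Publish-subscribe: every current subscriber gets a copy immediately.
    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
broker.send("orders", {"id": 1})       # stored pending delivery
received = broker.receive("orders")    # delivered on request

heard = []
broker.subscribe("alerts", heard.append)
broker.publish("alerts", "disk full")  # pushed to all subscribers
```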




science and technology

middleware

Integration software. Middleware is the term coined to describe software that connects other software together. In the early days of computing, each software system in an organization was a separate 'stovepipe' or 'silo' that stood alone and was dedicated to automating a specific part of the business or its IT operations. Middleware aims to connect those individual islands of automation, both within an enterprise and out to external systems (for example at customers and suppliers). For a long while, middleware has either been custom coded for individual projects or has come in the form of proprietary products or suites, most notably as enterprise application integration (EAI) software. The emergence of industry-agreed web services specifications is now enabling convergence on standards-based distributed middleware, which in theory should allow all systems to automatically connect together on demand.




science and technology

cache

Short-term storage. A cache is used to speed up certain computer operations by temporarily placing data, or a copy of it, in a location where it can be accessed more rapidly than normal. For example, data from a storage disk may be cached temporarily in high-speed memory so that it can be read and written more quickly than if it had to come directly from the disk itself; or a microprocessor may use an on-board memory cache to store temporary data for use during operations. 'Cache' is derived from the French word for a hiding place, and so is pronounced like 'cash'.
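A minimal sketch of the idea: results of a slow lookup are kept in fast storage (here just a dict) so repeated requests never touch the slow store again. The functions are illustrative only; the counter stands in for actual disk traffic.

```python
slow_calls = {"count": 0}

def read_from_disk(key):
    slow_calls["count"] += 1   # stands in for a slow disk read
    return key.upper()

cache = {}

def cached_read(key):
    if key not in cache:               # cache miss: go to the slow store
        cache[key] = read_from_disk(key)
    return cache[key]                  # cache hit: served from memory

first = cached_read("report")    # miss -- hits the disk
second = cached_read("report")   # hit -- disk untouched
```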




science and technology

registry

Recognized service directory. A registry stores information about services in an SOA. At a minimum, the registry includes information that other participants can look up to find out the location of the service and what it does (the UDDI specification defines a web services standard for this functionality). A registry may also include information about policies that are applied to the service, such as security requirements, quality of service commitments and billing. Some registries are extended with document repositories, providing more detailed information about the operation and constraints of the service that may be useful to developers, administrators or users.
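The registry's minimum role -- letting participants look up where a service lives, what it does and which policies apply -- can be sketched as a simple in-memory table. The entries are hypothetical; UDDI defines the real web-services equivalent of this interface.

```python
registry = {}

def register(name, endpoint, description, policies=None):
    registry[name] = {
        "endpoint": endpoint,        # where to find the service
        "description": description,  # what it does
        "policies": policies or {},  # e.g. security, QoS, billing
    }

def lookup(name):
    return registry.get(name)  # None if no such service is registered

register(
    "quote-service",
    "https://example.com/quotes",
    "returns insurance quotes",
    policies={"auth": "required"},
)
entry = lookup("quote-service")
```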




science and technology

metadata

Data about data. In common usage as a generic term, metadata describes the structure, context and meaning of raw data, and computers use it to help organize and interpret data, turning it into meaningful information. The World Wide Web has driven usage of metadata to new levels, as the tags used in HTML and XML are a form of metadata, although the meaning they convey is often limited because the metadata means different things to different people.




science and technology

object-oriented

(OO) Structured around functional units. Object-oriented programming languages such as C++, Smalltalk and Java are designed to build software made up of objects: discrete bundles of functionality that can act on data only in certain pre-defined ways. This modular building-block approach makes complex software development tasks more flexible and easier to manage within a given programming environment. The emergence of object-oriented programming was a stepping stone to the development of componentization and subsequently of service-oriented architectures.
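The definition above -- data that can only be acted on in pre-defined ways -- is easy to show in code. The `Account` class is invented for illustration: the balance is hidden behind the object's interface, and the methods are the only permitted operations on it.

```python
class Account:
    def __init__(self, balance):
        self._balance = balance  # data hidden behind the interface

    # The pre-defined ways the data may be acted on:
    def deposit(self, amount):
        self._balance += amount

    def withdraw(self, amount):
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount

    def balance(self):
        return self._balance

acct = Account(100)
acct.deposit(50)
acct.withdraw(30)
```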




science and technology

data warehouse

A large store of data for analysis. Organizations use data warehouses (and smaller 'data marts') to help them analyze historic transaction data to detect useful patterns and trends. First of all, the data is transferred into the data warehouse using a process called extracting, transforming and loading (ETL). Then it is organized and stored in the data warehouse in ways that optimize it for high-performance analysis. The transfer to a separate data warehouse system, which is usually performed as a regular batch job every night or at some other interval, insulates the live transaction systems from any side-effects of the analysis, but at the cost of not having the very latest data included in the analysis.
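The ETL batch transfer described above can be sketched as three small steps. All data and field names are made up; real pipelines add incremental loads, cleansing rules and scheduling.

```python
# The live transaction system -- the extract source.
transactions = [
    {"date": "2024-01-01", "product": "widget", "amount": "10.0"},
    {"date": "2024-01-01", "product": "widget", "amount": "5.5"},
    {"date": "2024-01-02", "product": "gadget", "amount": "7.0"},
]

def extract():
    # copy the rows so analysis never touches the live system
    return list(transactions)

def transform(rows):
    # clean the data and convert types for analysis
    return [{**r, "amount": float(r["amount"])} for r in rows]

def load(rows, warehouse):
    # organize for high-performance analysis: pre-aggregate sales per product
    for r in rows:
        warehouse[r["product"]] = warehouse.get(r["product"], 0.0) + r["amount"]

warehouse = {}
load(transform(extract()), warehouse)  # the nightly batch job
```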




science and technology

governance

How an organization controls its actions. Governance describes the mechanisms an organization uses to ensure that its constituents follow its established processes and policies. It is the primary means of maintaining oversight and accountability in a loosely coupled organizational structure. A proper governance strategy implements systems to monitor and record what is going on, takes steps to ensure compliance with agreed policies, and provides for corrective action in cases where the rules have been ignored or misconstrued.




science and technology

EII

(Enterprise Information Integration) Linking information within an enterprise. In most enterprises, information is stored in separate databases, data warehouses and applications. EII products make it possible to combine information from these different data sources on demand. They do this by establishing an intermediate data services layer that makes it possible to access the data in a standardized way, instead of having to interact directly with each separate back-end data source. Although EII is named after EAI, a class of technologies for linking applications, it follows a much more service-oriented model than traditional EAI products.
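The intermediate data-services layer described above can be sketched as a single query function that combines two back-end sources with different shapes on demand. The sources and fields are hypothetical; a real EII product would sit in front of live databases and warehouses.

```python
# Two back-end sources with different native shapes.
crm = {"C1": {"customer": "Acme", "region": "EMEA"}}   # keyed records
billing = [{"cust_id": "C1", "balance": 250.0}]        # row-oriented table

def get_customer_view(cust_id):
    # One standardized access point: callers never touch crm or billing
    # directly, and the combination happens on demand.
    profile = crm.get(cust_id, {})
    owed = sum(r["balance"] for r in billing if r["cust_id"] == cust_id)
    return {"id": cust_id, **profile, "balance": owed}

view = get_customer_view("C1")
```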




science and technology

BI

(Business Intelligence) Analysis of business data. BI is the name given to a class of software tools specifically designed to aid analysis of business data. BI tools have traditionally been associated with in-depth analysis of historical transaction data, supplied by either a data warehouse or an online analytical processing (OLAP) server linked to a database system. BI has a wide range of commercial and non-commercial applications, with the most common being the analysis of patterns such as sales and stock trends, pricing and customer behavior to inform business decision-making. For this reason it is sometimes referred to as decision support software.




science and technology

XBRL

(eXtensible Business Reporting Language) Standard format for reporting financial data. XBRL is an internationally agreed, open specification that uses XML to structure financial information for automated electronic processing. It is being adopted by major accounting standards bodies, regulators, tax authorities, banks and credit organizations around the world to streamline the reporting and analysis of statutory financial statements and other business financial information.




science and technology

AON

(Application-Oriented Networking) Using network devices to help with integration. Application-oriented networking has arisen in response to increasing use of XML messaging (combined with related standards such as XSLT, XPath and XQuery) to link miscellaneous applications, data sources and other computing assets. Many of the operations required to mediate between these different participants, or to monitor their exchanges, can be built into network devices that are optimized for the purpose. The rules and policies for performing these operations, also expressed in XML, are specified separately and downloaded as required. Network equipment vendor Cisco has adopted the AON acronym as the name of a family of products that function in this way.




science and technology

AJAX

(Asynchronous JavaScript and XML) Technique for dynamically updating web pages. AJAX is the term coined in February 2005 to describe a collection of technologies used to automatically update and manipulate the information on a web page while it is being viewed in a browser (i.e. without the user having to manually refresh the page). This allows developers to create more sophisticated web pages and applications without having to add to the native capabilities of the browser. A key component is the use of XMLHttpRequest, a function originally added to browsers by Microsoft, to exchange data in the background with one or more web servers.




science and technology

EJB

(Enterprise JavaBeans) Software components for networked Java applications. Defined by the Enterprise JavaBeans specification, EJBs are the basic building blocks of software applications on the J2EE platform, which has been the preferred choice for many enterprises when building large-scale, web-accessed applications. Recently, however, many developers have been turning away from the complexity of EJBs in favor of simpler alternatives. The new EJB 3.0 specification attempts to answer these criticisms by simplifying EJB development.




science and technology

semantics

Intended meaning. In computing, semantics is the assumed or explicit set of understandings used in a system to give meaning to data. One of the biggest challenges when integrating separate computer systems and applications is to correctly match up the intended meanings within each system. Simple metadata classifications such as 'price' or 'location' may have wildly different meanings in each system, while apparently different terms, such as 'client' and 'patient', may turn out to be effectively equivalent.




science and technology

RIA

(Rich Internet Application) Fully featured software package that runs in a browser. Early generations of Internet-hosted, browser-based applications were notoriously basic compared to equivalent software that ran on a Windows or Mac desktop. This led to the evolution of RIA platforms (also known as rich client platforms), which boost the core functionality of the basic browser by temporarily downloading extra software to the client. This makes it possible to develop applications with the look and feel of a full-fledged Windows or Mac application, making them faster and more convenient to use. RIAs are distinct from 'smart clients', which require extra software pre-installed on the client machine. The leading RIA platforms today are AJAX, based on JavaScript and XML messaging, and Adobe Flex, based on Macromedia's Flash technology.




science and technology

MVC

(Model View Controller) A design pattern used in services architectures. MVC expresses the separation of a software architecture into three distinct elements. The 'Model' is how the underlying data is structured. The 'View' is what is presented to the user or consumer. The 'Controller' is the element that performs the processing. Separating these three elements makes it easier to achieve loose coupling, because it makes it possible for the controller to work with multiple different Model and View components.
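The three elements above can be shown in a minimal sketch; the task (a to-do list) is invented for illustration.

```python
class Model:  # how the underlying data is structured
    def __init__(self):
        self.items = []

def text_view(items):  # what is presented to the user or consumer
    return "; ".join(items)

class Controller:  # performs the processing, tying Model to View
    def __init__(self, model, view):
        self.model, self.view = model, view

    def add(self, item):
        self.model.items.append(item)

    def render(self):
        return self.view(self.model.items)

app = Controller(Model(), text_view)
app.add("write docs")
app.add("ship release")
```

Because the Controller only calls the View through its interface, swapping in a different View (say, one that renders HTML) needs no change to the Model or Controller -- the loose coupling the definition describes.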




science and technology

Correction: Sensitivity analysis for an unobserved moderator in RCT-to-target-population generalization of treatment effects

Trang Quynh Nguyen, Elizabeth A. Stuart.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 518--520.




science and technology

Bayesian mixed effects models for zero-inflated compositions in microbiome data analysis

Boyu Ren, Sergio Bacallado, Stefano Favaro, Tommi Vatanen, Curtis Huttenhower, Lorenzo Trippa.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 494--517.

Abstract:
Detecting associations between microbial compositions and sample characteristics is one of the most important tasks in microbiome studies. Most of the existing methods apply univariate models to single microbial species separately, with adjustments for multiple hypothesis testing. We propose a Bayesian analysis for a generalized mixed effects linear model tailored to this application. The marginal prior on each microbial composition is a Dirichlet process, and dependence across compositions is induced through a linear combination of individual covariates, such as disease biomarkers or the subject’s age, and latent factors. The latent factors capture residual variability and their dimensionality is learned from the data in a fully Bayesian procedure. The proposed model is tested in data analyses and simulation studies with zero-inflated compositions. In these settings and within each sample, a large proportion of counts per microbial species are equal to zero. In our Bayesian model, the prior probability of compositions with absent microbial species is strictly positive. We propose an efficient algorithm to sample from the posterior and visualizations of model parameters which reveal associations between covariates and microbial compositions. We evaluate the proposed method in simulation studies, and then analyze a microbiome dataset for infants with type 1 diabetes which contains a large proportion of zeros in the sample-specific microbial compositions.




science and technology

A hierarchical dependent Dirichlet process prior for modelling bird migration patterns in the UK

Alex Diana, Eleni Matechou, Jim Griffin, Alison Johnston.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 473--493.

Abstract:
Environmental changes in recent years have been linked to phenological shifts which in turn are linked to the survival of species. The work in this paper is motivated by capture-recapture data on blackcaps collected by the British Trust for Ornithology as part of the Constant Effort Sites monitoring scheme. Blackcaps overwinter abroad and migrate to the UK annually for breeding purposes. We propose a novel Bayesian nonparametric approach for expressing the bivariate density of individual arrival and departure times at different sites across a number of years as a mixture model. The new model combines the ideas of the hierarchical and the dependent Dirichlet process, allowing the estimation of site-specific weights and year-specific mixture locations, which are modelled as functions of environmental covariates using a multivariate extension of the Gaussian process. The proposed modelling framework is extremely general and can be used in any context where multivariate density estimation is performed jointly across different groups and in the presence of a continuous covariate.




science and technology

Estimating causal effects in studies of human brain function: New models, methods and estimands

Michael E. Sobel, Martin A. Lindquist.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 452--472.

Abstract:
Neuroscientists often use functional magnetic resonance imaging (fMRI) to infer effects of treatments on neural activity in brain regions. In a typical fMRI experiment, each subject is observed at several hundred time points. At each point, the blood oxygenation level dependent (BOLD) response is measured at 100,000 or more locations (voxels). Typically, these responses are modeled treating each voxel separately, and no rationale for interpreting associations as effects is given. First, building on Sobel and Lindquist ( J. Amer. Statist. Assoc. 109 (2014) 967–976), who used potential outcomes to define unit and average effects at each voxel and time point, we define and estimate both “point” and “cumulated” effects for brain regions. Second, we construct a multisubject, multivoxel, multirun whole brain causal model with explicit parameters for regions. We justify estimation using BOLD responses averaged over voxels within regions, making feasible estimation for all regions simultaneously, thereby also facilitating inferences about association between effects in different regions. We apply the model to a study of pain, finding effects in standard pain regions. We also observe more cerebellar activity than observed in previous studies using prevailing methods.




science and technology

A comparison of principal component methods between multiple phenotype regression and multiple SNP regression in genetic association studies

Zhonghua Liu, Ian Barnett, Xihong Lin.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 433--451.

Abstract:
Principal component analysis (PCA) is a popular method for dimension reduction in unsupervised multivariate analysis. However, existing ad hoc uses of PCA in both multivariate regression (multiple outcomes) and multiple regression (multiple predictors) lack theoretical justification. The differences in the statistical properties of PCAs in these two regression settings are not well understood. In this paper we provide theoretical results on the power of PCA in genetic association testings in both multiple phenotype and SNP-set settings. The multiple phenotype setting refers to the case when one is interested in studying the association between a single SNP and multiple phenotypes as outcomes. The SNP-set setting refers to the case when one is interested in studying the association between multiple SNPs in a SNP set and a single phenotype as the outcome. We demonstrate analytically that the properties of the PC-based analysis in these two regression settings are substantially different. We show that the lower order PCs, that is, PCs with large eigenvalues, are generally preferred and lead to a higher power in the SNP-set setting, while the higher-order PCs, that is, PCs with small eigenvalues, are generally preferred in the multiple phenotype setting. We also investigate the power of three other popular statistical methods, the Wald test, the variance component test and the minimum $p$-value test, in both multiple phenotype and SNP-set settings. We use theoretical power, simulation studies, and two real data analyses to validate our findings.




science and technology

Measuring human activity spaces from GPS data with density ranking and summary curves

Yen-Chi Chen, Adrian Dobra.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 409--432.

Abstract:
Activity spaces are fundamental to the assessment of individuals’ dynamic exposure to social and environmental risk factors associated with multiple spatial contexts that are visited during activities of daily living. In this paper we survey existing approaches for measuring the geometry, size and structure of activity spaces, based on GPS data, and explain their limitations. We propose addressing these shortcomings through a nonparametric approach called density ranking and also through three summary curves: the mass-volume curve, the Betti number curve and the persistence curve. We introduce a novel mixture model for human activity spaces and study its asymptotic properties. We prove that the kernel density estimator, which, at the present time, is one of the most widespread methods for measuring activity spaces, is not a stable estimator of their structure. We illustrate the practical value of our methods with a simulation study and with a recently collected GPS dataset that comprises the locations visited by 10 individuals over a six-month period.




science and technology

Estimating and forecasting the smoking-attributable mortality fraction for both genders jointly in over 60 countries

Yicheng Li, Adrian E. Raftery.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 381--408.

Abstract:
Smoking is one of the leading preventable threats to human health and a major risk factor for lung cancer, upper aerodigestive cancer and chronic obstructive pulmonary disease. Estimating and forecasting the smoking attributable fraction (SAF) of mortality can yield insights into smoking epidemics and also provide a basis for more accurate mortality and life expectancy projection. Peto et al. ( Lancet 339 (1992) 1268–1278) proposed a method to estimate the SAF using the lung cancer mortality rate as an indicator of exposure to smoking in the population of interest. Here, we use the same method to estimate the all-age SAF (ASAF) for both genders for over 60 countries. We document a strong and cross-nationally consistent pattern of the evolution of the SAF over time. We use this as the basis for a new Bayesian hierarchical model to project future male and female ASAF from over 60 countries simultaneously. This gives forecasts as well as predictive distributions that can be used to find uncertainty intervals for any quantity of interest. We assess the model using out-of-sample predictive validation and find that it provides good forecasts and well-calibrated forecast intervals, comparing favorably with other methods.




science and technology

Regression for copula-linked compound distributions with applications in modeling aggregate insurance claims

Peng Shi, Zifeng Zhao.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 357--380.

Abstract:
In actuarial research a task of particular interest and importance is to predict the loss cost for individual risks so that informative decisions are made in various insurance operations such as underwriting, ratemaking and capital management. The loss cost is typically viewed to follow a compound distribution where the summation of the severity variables is stopped by the frequency variable. A challenging issue in modeling such outcomes is to accommodate the potential dependence between the number of claims and the size of each individual claim. In this article we introduce a novel regression framework for compound distributions that uses a copula to accommodate the association between the frequency and the severity variables and, thus, allows for arbitrary dependence between the two components. We further show that the new model is very flexible and is easily modified to account for incomplete data due to censoring or truncation. The flexibility of the proposed model is illustrated using both simulated and real data sets. In the analysis of granular claims data from property insurance, we find a substantive negative relationship between the number and the size of insurance claims. In addition, we demonstrate that ignoring the frequency-severity association could lead to biased decision-making in insurance operations.
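The compound-distribution structure the abstract describes -- a sum of severity draws stopped by a frequency draw -- can be simulated in a few lines. This is a generic illustration with independent frequency and severity, not the authors' copula model; the parameter values are arbitrary.

```python
import math
import random

def poisson(rng, lam):
    # Knuth's method: multiply uniforms until the product drops below exp(-lam).
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def compound_loss(rng, lam=2.0, mean_severity=100.0):
    n = poisson(rng, lam)  # frequency: number of claims
    # severity: exponential size of each claim, summed over the n claims
    return sum(rng.expovariate(1.0 / mean_severity) for _ in range(n))

rng = random.Random(0)
losses = [compound_loss(rng) for _ in range(20000)]
avg = sum(losses) / len(losses)  # should be near lam * mean_severity = 200
```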




science and technology

Modeling wildfire ignition origins in southern California using linear network point processes

Medha Uppala, Mark S. Handcock.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 339--356.

Abstract:
This paper focuses on spatial and temporal modeling of point processes on linear networks. Point processes on linear networks can simply be defined as point events occurring on or near line segment network structures embedded in a certain space. A separable modeling framework is introduced that posits separate formation and dissolution models of point processes on linear networks over time. While the model was inspired by spider web building activity in brick mortar lines, the focus is on modeling wildfire ignition origins near road networks over a span of 14 years. As most wildfires in California have human-related origins, modeling the origin locations with respect to the road network provides insight into how human, vehicular and structural densities affect ignition occurrence. Model results show that roads that traverse different types of regions such as residential, interface and wildland regions have higher ignition intensities compared to roads that only exist in each of the mentioned region types.




science and technology

Optimal asset allocation with multivariate Bayesian dynamic linear models

Jared D. Fisher, Davide Pettenuzzo, Carlos M. Carvalho.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 299--338.

Abstract:
We introduce a fast, closed-form, simulation-free method to model and forecast multiple asset returns and employ it to investigate the optimal ensemble of features to include when jointly predicting monthly stock and bond excess returns. Our approach builds on the Bayesian dynamic linear models of West and Harrison ( Bayesian Forecasting and Dynamic Models (1997) Springer), and it can objectively determine, through a fully automated procedure, both the optimal set of regressors to include in the predictive system and the degree to which the model coefficients, volatilities and covariances should vary over time. When applied to a portfolio of five stock and bond returns, we find that our method leads to large forecast gains, both in statistical and economic terms. In particular, we find that relative to a standard no-predictability benchmark, the optimal combination of predictors, stochastic volatility and time-varying covariances increases the annualized certainty equivalent returns of a leverage-constrained power utility investor by more than 500 basis points.




science and technology

Feature selection for generalized varying coefficient mixed-effect models with application to obesity GWAS

Wanghuan Chu, Runze Li, Jingyuan Liu, Matthew Reimherr.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 276--298.

Abstract:
Motivated by an empirical analysis of data from a genome-wide association study on obesity, measured by the body mass index (BMI), we propose a two-step gene-detection procedure for generalized varying coefficient mixed-effects models with ultrahigh dimensional covariates. The proposed procedure selects significant single nucleotide polymorphisms (SNPs) impacting the mean BMI trend, some of which have already been biologically proven to be “fat genes.” The method also discovers SNPs that significantly influence the age-dependent variability of BMI. The proposed procedure takes into account individual variations of genetic effects and can also be directly applied to longitudinal data with continuous, binary or count responses. We employ Monte Carlo simulation studies to assess the performance of the proposed method and further carry out causal inference for the selected SNPs.
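The first stage of procedures like this one typically reduces the ultrahigh-dimensional SNP set by a cheap marginal measure before fitting the full model. The sketch below is an illustrative marginal correlation screen, not the paper's varying coefficient mixed-effects procedure:

```python
import numpy as np

def marginal_screen(X, y, n_keep):
    """Rank covariates (e.g., SNPs) by absolute marginal correlation with
    the response and keep the top n_keep for refined modeling in a second
    step. Assumes columns of X are non-constant."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.abs(Xc.T @ yc) / denom
    return np.argsort(corr)[::-1][:n_keep]
```

With a single strong signal column, the screen should recover it among the top candidates.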




science and technology

Estimating the health effects of environmental mixtures using Bayesian semiparametric regression and sparsity inducing priors

Joseph Antonelli, Maitreyi Mazumdar, David Bellinger, David Christiani, Robert Wright, Brent Coull.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 257--275.

Abstract:
Humans are routinely exposed to mixtures of chemical and other environmental factors, making the quantification of health effects associated with environmental mixtures a critical goal for establishing environmental policy sufficiently protective of human health. The quantification of the effects of exposure to an environmental mixture poses several statistical challenges. Exposures to multiple pollutants often interact with each other to affect an outcome. Further, the exposure-response relationship between an outcome and some exposures, such as some metals, can exhibit complex, nonlinear forms, since some exposures can be beneficial and detrimental at different ranges of exposure. To estimate the health effects of complex mixtures, we propose a flexible Bayesian approach that allows exposures to interact with each other and have nonlinear relationships with the outcome. We induce sparsity using multivariate spike and slab priors to determine which exposures are associated with the outcome and which exposures interact with each other. The proposed approach is interpretable, as we can use the posterior probabilities of inclusion into the model to identify pollutants that interact with each other. We utilize our approach to study the impact of exposure to metals on child neurodevelopment in Bangladesh and find a nonlinear, interactive relationship between arsenic and manganese.
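The posterior inclusion probabilities that make spike-and-slab approaches interpretable can be illustrated in the simplest one-coefficient normal-means setting (this is a toy illustration, not the paper's multivariate construction):

```python
import numpy as np
from scipy.stats import norm

def inclusion_prob(y, sigma=1.0, slab_sd=2.0, prior_pi=0.5):
    """Posterior probability that beta != 0 for a single observation
    y ~ N(beta, sigma^2) under the spike-and-slab prior
    beta ~ pi * N(0, slab_sd^2) + (1 - pi) * delta_0."""
    f_slab = norm.pdf(y, 0.0, np.sqrt(slab_sd ** 2 + sigma ** 2))  # marginal if included
    f_spike = norm.pdf(y, 0.0, sigma)                              # marginal if excluded
    return prior_pi * f_slab / (prior_pi * f_slab + (1 - prior_pi) * f_spike)
```

Observations near zero favor the spike (exclusion), while large observations push the inclusion probability toward one.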




science and technology

Bayesian factor models for probabilistic cause of death assessment with verbal autopsies

Tsuyoshi Kunihama, Zehang Richard Li, Samuel J. Clark, Tyler H. McCormick.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 241--256.

Abstract:
The distribution of deaths by cause provides crucial information for public health planning, response and evaluation. About 60% of deaths globally are not registered or given a cause, limiting our ability to understand disease epidemiology. Verbal autopsy (VA) surveys are increasingly used in such settings to collect information on the signs, symptoms and medical history of people who have recently died. This article develops a novel Bayesian method for estimation of population distributions of deaths by cause using verbal autopsy data. The proposed approach is based on a multivariate probit model where associations among items in questionnaires are flexibly induced by latent factors. Using the Population Health Metrics Research Consortium labeled data that include both VA and medically certified causes of death, we assess performance of the proposed method. Further, we estimate important questionnaire items that are highly associated with causes of death. This framework provides insights that will simplify future data




science and technology

A hierarchical Bayesian model for predicting ecological interactions using scaled evolutionary relationships

Mohamad Elmasri, Maxwell J. Farrell, T. Jonathan Davies, David A. Stephens.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 221--240.

Abstract:
Identifying undocumented or potential future interactions among species is a challenge facing modern ecologists. Recent link prediction methods rely on trait data; however, large species interaction databases are typically sparse and covariates are limited to only a fraction of species. On the other hand, evolutionary relationships, encoded as phylogenetic trees, can act as proxies for underlying traits and historical patterns of parasite sharing among hosts. We show that, using a network-based conditional model, phylogenetic information provides strong predictive power in a recently published global database of host-parasite interactions. By scaling the phylogeny using an evolutionary model, our method allows for biological interpretation often missing from latent variable models. To further improve on the phylogeny-only model, we combine a hierarchical Bayesian latent score framework for bipartite graphs that accounts for the number of interactions per species with host dependence informed by phylogeny. Combining the two information sources yields significant improvement in predictive accuracy over each of the submodels alone. As many interaction networks are constructed from presence-only data, we extend the model by integrating a correction mechanism for missing interactions which proves valuable in reducing uncertainty in unobserved interactions.




science and technology

Modifying the Chi-square and the CMH test for population genetic inference: Adapting to overdispersion

Kerstin Spitzer, Marta Pelizzola, Andreas Futschik.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 202--220.

Abstract:
Evolve and resequence studies provide a popular approach to simulate evolution in the lab and explore its genetic basis. In this context, Pearson’s chi-square test, Fisher’s exact test as well as the Cochran–Mantel–Haenszel test are commonly used to infer genomic positions affected by selection from temporal changes in allele frequency. However, the null model associated with these tests does not match the null hypothesis of actual interest. Indeed, due to genetic drift and possibly other additional noise components such as pool sequencing, the null variance in the data can be substantially larger than accounted for by these common test statistics. This leads to $p$-values that are systematically too small and, therefore, a huge number of false positive results. Even if the ranking rather than the actual $p$-values is of interest, a naive application of the mentioned tests will give misleading results, as the amount of overdispersion varies from locus to locus. We therefore propose adjusted statistics that take the overdispersion into account while keeping the formulas simple. This is particularly useful in genome-wide applications, where millions of SNPs can be handled with little computational effort. We then apply the adapted test statistics to real data from Drosophila and investigate how information from intermediate generations can be included when available. We also discuss further applications such as genome-wide association studies based on pool sequencing data and tests for local adaptation.
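The basic adjustment idea, deflating the test statistic by an overdispersion factor before computing the $p$-value, can be sketched as follows (the factor rho is taken as given here; the paper derives locus-specific estimates of it from drift and sequencing noise):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

def adjusted_chisq_pvalue(table, rho):
    """Pearson chi-square p-value for an allele-count contingency table,
    with the statistic deflated by an overdispersion factor rho >= 1
    (rho = 1 recovers the standard test)."""
    stat, _, dof, _ = chi2_contingency(np.asarray(table), correction=False)
    return chi2.sf(stat / rho, dof)
```

With overdispersion acknowledged (rho > 1), the same allele-frequency change yields a larger, more honest $p$-value than the naive test.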




science and technology

TFisher: A powerful truncation and weighting procedure for combining $p$-values

Hong Zhang, Tiejun Tong, John Landers, Zheyang Wu.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 178--201.

Abstract:
The $p$-value combination approach is an important statistical strategy for testing global hypotheses with broad applications in signal detection, meta-analysis, data integration, etc. In this paper we extend the classic Fisher’s combination method to a unified family of statistics, called TFisher, which allows a general truncation-and-weighting scheme of input $p$-values. TFisher can significantly improve statistical power over the Fisher and related truncation-only methods for detecting both rare and dense “signals.” To address wide applications, analytical calculations for TFisher’s size and power are deduced under any two continuous distributions in the null and the alternative hypotheses. The corresponding omnibus test (oTFisher) and its size calculation are also provided for data-adaptive analysis. We study the asymptotic optimal parameters of truncation and weighting based on Bahadur efficiency (BE). A new asymptotic measure, called the asymptotic power efficiency (APE), is also proposed for better reflecting the statistics’ performance in real data analysis. Interestingly, under the Gaussian mixture model in the signal detection problem, both BE and APE indicate that the soft-thresholding scheme is the best; that is, the truncation and weighting parameters should be equal. By simulations of various signal patterns, we systematically compare the power of statistics within the TFisher family as well as some rare-signal-optimal tests. We illustrate the use of TFisher in an exome-sequencing analysis for detecting novel genes of amyotrophic lateral sclerosis. Relevant computation has been implemented into an R package TFisher published on the Comprehensive R Archive Network to support applications.
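A sketch of the truncation-and-weighting statistic and the classic Fisher special case (the chi-square null calibration below is valid only for the untruncated Fisher case; calibrating a general TFisher statistic requires the paper's analytical size calculations):

```python
import numpy as np
from scipy.stats import chi2

def tfisher_stat(pvals, tau1=1.0, tau2=1.0):
    """TFisher statistic: truncate p-values at tau1 and weight by 1/tau2,
    i.e., sum of -2*log(p_i / tau2) over p_i <= tau1.
    tau1 = tau2 = 1 gives the classic Fisher combination statistic;
    tau1 = tau2 < 1 is the soft-thresholding scheme."""
    p = np.asarray(pvals)
    keep = p <= tau1
    return np.sum(-2.0 * np.log(p[keep] / tau2))

def fisher_pvalue(pvals):
    """Classic Fisher combination: -2*sum(log p) ~ chi2 with 2k df under H0."""
    p = np.asarray(pvals)
    return chi2.sf(-2.0 * np.log(p).sum(), df=2 * len(p))
```

Soft thresholding discards $p$-values above the cutoff and rescales the remainder, which is what drives its power gain for rare signals.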




science and technology

Assessing wage status transition and stagnation using quantile transition regression

Chih-Yuan Hsu, Yi-Hau Chen, Ruoh-Rong Yu, Tsung-Wei Hung.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 160--177.

Abstract:
Workers in Taiwan overall have been suffering from long-lasting wage stagnation since the mid-1990s. In particular, there seems to be little mobility for the wages of Taiwanese workers to transit across wage quantile groups. It is of interest to see if certain groups of workers, such as female, lower educated and younger generation workers, suffer from the problem more seriously than others. This work applies a systematic statistical approach to study this issue, based on the longitudinal data from the Panel Study of Family Dynamics (PSFD) survey conducted in Taiwan since 1999. We propose the quantile transition regression model, generalizing recent methodology for quantile association, to assess the wage status transition with respect to the marginal wage quantiles over time as well as the effects of certain demographic and job factors on the wage status transition. Estimation of the model can be based on the composite likelihoods utilizing the binary- or ordinal-data information regarding the quantile transition, with the associated asymptotic theory established. A goodness-of-fit procedure for the proposed model is developed. The performances of the estimation and the goodness-of-fit procedures for the quantile transition model are illustrated through simulations. The application of the proposed methodology to the PSFD survey data suggests that female, private-sector workers with higher age and education below postgraduate level suffer from more severe wage status stagnation than the others.
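The raw quantity the model regresses on, movement between wage quantile groups across survey waves, can be tabulated directly. This is a descriptive preliminary, not the proposed regression model:

```python
import numpy as np

def quantile_transition_matrix(w1, w2, n_groups=4):
    """Empirical transition matrix between wage quantile groups at two
    survey waves. Entry [i, j] estimates P(group j at wave 2 | group i
    at wave 1); a near-identity matrix indicates wage status stagnation."""
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    qs = np.linspace(0, 1, n_groups + 1)[1:-1]
    g1 = np.searchsorted(np.quantile(w1, qs), w1)   # group index at wave 1
    g2 = np.searchsorted(np.quantile(w2, qs), w2)   # group index at wave 2
    counts = np.zeros((n_groups, n_groups))
    np.add.at(counts, (g1, g2), 1)
    return counts / counts.sum(axis=1, keepdims=True)
```

Complete stagnation (everyone keeps their rank) yields the identity matrix; mobility shows up as off-diagonal mass.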




science and technology

Surface temperature monitoring in liver procurement via functional variance change-point analysis

Zhenguo Gao, Pang Du, Ran Jin, John L. Robertson.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 143--159.

Abstract:
Liver procurement experiments with surface-temperature monitoring motivated Gao et al. (J. Amer. Statist. Assoc. 114 (2019) 773–781) to develop a variance change-point detection method under a smoothly-changing mean trend. However, the spotwise change points yielded by their method do not offer immediate information to surgeons since an organ is often transplanted as a whole or in part. We develop a new practical method that can analyze a defined portion of the organ surface at a time. It also provides a novel addition to the developing field of functional data monitoring. Furthermore, a numerical challenge emerges for simultaneously modeling the variance functions of 2D locations and the mean function of location and time. The respective sample sizes on the scales of 10,000 and 1,000,000 for modeling these functions make standard spline estimation too costly to be useful. We introduce a multistage subsampling strategy with steps educated by quickly-computable preliminary statistical measures. Extensive simulations show that the new method can efficiently reduce the computational cost and provide reasonable parameter estimates. Application of the new method to our liver surface temperature monitoring data shows its effectiveness in providing accurate status change information for a selected portion of the organ in the experiment.
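A rough sketch of subsampling guided by a quickly-computable preliminary measure, here a rolling variance used to keep more points where the signal is active. This is only a stand-in for the paper's multistage strategy, whose actual preliminary measures and stages differ:

```python
import numpy as np

def score_guided_subsample(y, n_keep, window=50, rng=None):
    """Subsample indices of a long 1D signal, preferentially keeping
    regions whose rolling variance (computed cheaply via cumulative
    sums) is high."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, float)
    n = len(y)
    c1 = np.cumsum(np.insert(y, 0, 0.0))
    c2 = np.cumsum(np.insert(y ** 2, 0, 0.0))
    lo = np.maximum(np.arange(n) - window // 2, 0)
    hi = np.minimum(np.arange(n) + window // 2, n - 1) + 1
    cnt = hi - lo
    mean = (c1[hi] - c1[lo]) / cnt
    var = np.maximum((c2[hi] - c2[lo]) / cnt - mean ** 2, 0.0)
    probs = (var + 1e-12) / (var + 1e-12).sum()
    return rng.choice(n, size=n_keep, replace=False, p=probs)
```

On a signal that is flat in one half and noisy in the other, nearly all retained points fall in the noisy half, which is the intended budget allocation.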




science and technology

A statistical analysis of noisy crowdsourced weather data

Arnab Chakraborty, Soumendra Nath Lahiri, Alyson Wilson.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 116--142.

Abstract:
Spatial prediction of weather elements like temperature, precipitation, and barometric pressure is generally based on satellite imagery or data collected at ground stations. None of these data provide information at a more granular or “hyperlocal” resolution. On the other hand, crowdsourced weather data, which are captured by sensors installed on mobile devices and gathered by weather-related mobile apps like WeatherSignal and AccuWeather, can serve as potential data sources for analyzing environmental processes at a hyperlocal resolution. However, due to the low quality of the sensors and the nonlaboratory environment, the quality of the observations in crowdsourced data is compromised. This paper describes methods to improve hyperlocal spatial prediction using this varying-quality, noisy crowdsourced information. We introduce a reliability metric, namely Veracity Score (VS), to assess the quality of the crowdsourced observations using coarser, but high-quality, reference data. A VS-based methodology to analyze noisy spatial data is proposed and evaluated through extensive simulations. The merits of the proposed approach are illustrated through case studies analyzing crowdsourced daily average ambient temperature readings for one day in the contiguous United States.
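The general idea, scoring each crowdsourced reading against a coarse trusted reference and down-weighting deviant sensors, can be sketched with a simple Gaussian weight. This is an illustrative reliability weighting, not the paper's actual Veracity Score definition:

```python
import numpy as np

def reliability_weights(obs, reference, scale):
    """Down-weight crowdsourced readings that deviate strongly from a
    coarse, trusted reference value at the same location. `scale` sets
    how much deviation is tolerated before a reading loses influence."""
    obs = np.asarray(obs, float)
    return np.exp(-0.5 * ((obs - reference) / scale) ** 2)

def weighted_estimate(obs, reference, scale):
    """Reliability-weighted mean of noisy crowdsourced readings."""
    obs = np.asarray(obs, float)
    w = reliability_weights(obs, reference, scale)
    return np.sum(w * obs) / np.sum(w)
```

A single faulty sensor reading far from the reference is effectively ignored, while small honest deviations still contribute hyperlocal information.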




science and technology

Modeling microbial abundances and dysbiosis with beta-binomial regression

Bryan D. Martin, Daniela Witten, Amy D. Willis.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 94--115.

Abstract:
Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon’s relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon’s relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon’s counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.
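One common mean/overdispersion parameterization of the beta-binomial, which makes the "covariates act on both mean and overdispersion" idea concrete (the regression link functions are the paper's; the mapping below is just the distributional core):

```python
import numpy as np
from scipy.stats import betabinom, binom

def betabinom_params(mu, phi):
    """Map mean mu in (0,1) (expected relative abundance) and
    overdispersion phi in (0,1) to beta-binomial shape parameters (a, b).
    As phi -> 0 the distribution approaches the binomial."""
    theta = (1.0 - phi) / phi
    return mu * theta, (1.0 - mu) * theta
```

For a given read depth, the beta-binomial keeps the binomial mean but inflates the variance, which is exactly the extra-binomial variability the differential-variability test targets.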




science and technology

Efficient real-time monitoring of an emerging influenza pandemic: How feasible?

Paul J. Birrell, Lorenz Wernisch, Brian D. M. Tom, Leonhard Held, Gareth O. Roberts, Richard G. Pebody, Daniela De Angelis.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 74--93.

Abstract:
A prompt public health response to a new epidemic relies on the ability to monitor and predict its evolution in real time as data accumulate. The 2009 A/H1N1 outbreak in the UK revealed pandemic data as noisy, contaminated, potentially biased and originating from multiple sources. This seriously challenges the capacity for real-time monitoring. Here, we assess the feasibility of real-time inference based on such data by constructing an analytic tool combining an age-stratified SEIR transmission model with various observation models describing the data generation mechanisms. As batches of data become available, a sequential Monte Carlo (SMC) algorithm is developed to synthesise multiple imperfect data streams, iterate epidemic inferences and assess model adequacy amidst a rapidly evolving epidemic environment, substantially reducing computation time in comparison to standard MCMC, to ensure timely delivery of real-time epidemic assessments. In application to simulated data designed to mimic the 2009 A/H1N1 epidemic, SMC is shown to have additional benefits in terms of assessing predictive performance and coping with parameter nonidentifiability.
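The SMC machinery can be illustrated with a bootstrap particle filter on a toy count model, a random-walk log-intensity with Poisson-observed case counts, rather than the paper's age-stratified SEIR model with multiple data streams:

```python
import numpy as np

def bootstrap_smc(counts, n_particles=2000, step_sd=0.15, seed=0):
    """Bootstrap particle filter for a toy epidemic count model:
    x_t = x_{t-1} + N(0, step_sd^2),  y_t ~ Poisson(exp(x_t)).
    Returns the filtered posterior mean of the intensity exp(x_t)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(np.log(max(counts[0], 1)), 1.0, n_particles)  # initialise near first count
    means = []
    for y in counts:
        x = x + rng.normal(0.0, step_sd, n_particles)   # propagate particles
        logw = y * x - np.exp(x)                        # Poisson log-likelihood (up to a constant)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(np.sum(w * np.exp(x)))             # filtered mean intensity
        x = x[rng.choice(n_particles, n_particles, p=w)]  # multinomial resampling
    return np.array(means)
```

Because each new batch of data only requires propagating and reweighting the existing particle cloud, the filter updates sequentially instead of refitting from scratch, which is the computational advantage over standard MCMC the abstract describes.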




science and technology

Integrative survival analysis with uncertain event times in application to a suicide risk study

Wenjie Wang, Robert Aseltine, Kun Chen, Jun Yan.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 51--73.

Abstract:
The concept of integrating data from disparate sources to accelerate scientific discovery has generated tremendous excitement in many fields. The potential benefits from data integration, however, may be compromised by the uncertainty due to incomplete/imperfect record linkage. Motivated by a suicide risk study, we propose an approach for analyzing survival data with uncertain event times arising from data integration. Specifically, in our problem deaths identified from the hospital discharge records together with reported suicidal deaths determined by the Office of Medical Examiner may still not include all the death events of patients, and the missing deaths can be recovered from a complete database of death records. Since the hospital discharge data can only be linked to the death record data by matching basic patient characteristics, a patient with a censored death time from the first dataset could be linked to multiple potential event records in the second dataset. We develop an integrative Cox proportional hazards regression in which the uncertainty in the matched event times is modeled probabilistically. The estimation procedure combines the ideas of profile likelihood and the expectation conditional maximization algorithm (ECM). Simulation studies demonstrate that under realistic settings of imperfect data linkage the proposed method outperforms several competing approaches including multiple imputation. A marginal screening analysis using the proposed integrative Cox model is performed to identify risk factors associated with death following suicide-related hospitalization in Connecticut. The identified diagnostics codes are consistent with existing literature and provide several new insights on suicide risk, prediction and prevention.




science and technology

BART with targeted smoothing: An analysis of patient-specific stillbirth risk

Jennifer E. Starling, Jared S. Murray, Carlos M. Carvalho, Radek K. Bukowski, James G. Scott.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 28--50.

Abstract:
This article introduces BART with Targeted Smoothing, or tsBART, a new Bayesian tree-based model for nonparametric regression. The goal of tsBART is to introduce smoothness over a single target covariate $t$ while not necessarily requiring smoothness over other covariates $x$. tsBART is based on the Bayesian Additive Regression Trees (BART) model, an ensemble of regression trees. tsBART extends BART by parameterizing each tree’s terminal nodes with smooth functions of $t$ rather than independent scalars. Like BART, tsBART captures complex nonlinear relationships and interactions among the predictors. But unlike BART, tsBART guarantees that the response surface will be smooth in the target covariate. This improves interpretability and helps to regularize the estimate. After introducing and benchmarking the tsBART model, we apply it to our motivating example—pregnancy outcomes data from the National Center for Health Statistics. Our aim is to provide patient-specific estimates of stillbirth risk across gestational age $(t)$ and based on maternal and fetal risk factors $(x)$. Obstetricians expect stillbirth risk to vary smoothly over gestational age but not necessarily over other covariates, and tsBART has been designed precisely to reflect this structural knowledge. The results of our analysis show the clear superiority of the tsBART model for quantifying stillbirth risk, thereby providing patients and doctors with better information for managing the risk of fetal mortality. All methods described here are implemented in the R package tsbart .