# Statistics

## New submissions

[ total of 13 entries: 1-13 ]

### New submissions for Wed, 23 Apr 14

[1]
Title: A randomized trial in a massive online open course shows people don't know what a statistically significant relationship looks like, but they can learn
Comments: 7 pages, including 2 figures and 1 table
Subjects: Applications (stat.AP)

Scatterplots are the most common way for statisticians, scientists, and the public to visually detect relationships between measured variables. At the same time, and despite widely publicized controversy, p-values remain the most commonly used measure to statistically justify relationships identified between variables. Here we measure the ability to detect statistically significant relationships from scatterplots in a randomized trial of 2,039 students in a statistics massive open online course (MOOC). Each subject was shown a random set of scatterplots and asked to visually determine if the underlying relationships were statistically significant at the P < 0.05 level. Subjects correctly classified only 47.4% (95% CI: 45.1%-49.7%) of statistically significant relationships, and 74.6% (95% CI: 72.5%-76.6%) of non-significant relationships. Adding visual aids such as a best-fit line or scatterplot smooth increased the probability that a relationship was called significant, regardless of whether the relationship was actually significant. Classification of statistically significant relationships improved on repeat attempts of the survey, although classification of non-significant relationships did not. Our results suggest (1) that evidence-based data analysis can be used to identify weaknesses in theoretical procedures in the hands of average users, (2) that data analysts can be trained to improve detection of statistically significant results with practice, but (3) that data analysts have incorrect intuition about what statistically significant relationships look like, particularly for small effects. We have built a web tool that lets people compare scatterplots with their corresponding p-values, available here: this http URL
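As a hedged illustration of the abstract's premise (not the authors' tool): the snippet below generates a linear relationship that is visually weak yet statistically significant at P < 0.05. The slope, sample size, and seed are arbitrary choices for the sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A weak but statistically significant relationship: small slope, large n.
n = 2000
x = rng.normal(size=n)
y = 0.15 * x + rng.normal(size=n)   # effect explains only a few percent of variance

# Pearson correlation and its two-sided p-value.
r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.2e}")  # small r, yet p is well below 0.05
```

With a large enough sample, even a correlation too faint to spot by eye in a scatterplot clears the conventional significance threshold, which is the mismatch the trial measures.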

[2]
Title: Approximate Inference for Nonstationary Heteroscedastic Gaussian process Regression
Subjects: Machine Learning (stat.ML)

This paper presents a novel approach for approximate integration over the uncertainty of noise and signal variances in Gaussian process (GP) regression. Our efficient and straightforward approach can also be applied to integration over input dependent noise variance (heteroscedasticity) and input dependent signal variance (nonstationarity) by setting independent GP priors for the noise and signal variances. We use expectation propagation (EP) for inference and compare results to Markov chain Monte Carlo in two simulated data sets and three empirical examples. The results show that EP produces comparable results with less computational burden.
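Expectation propagation itself is beyond a short sketch, but the heteroscedastic setting the abstract describes can be illustrated with an exact GP posterior in which an input-dependent noise variance enters the diagonal of the kernel matrix. This is a minimal sketch with an assumed squared-exponential kernel and an assumed noise profile, not the paper's approximate-integration method.

```python
import numpy as np

def rbf(a, b, ls=1.0, sf=1.0):
    # Squared-exponential covariance between 1-D input arrays a and b.
    d2 = (a[:, None] - b[None, :]) ** 2
    return sf**2 * np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 40)
noise_var = 0.05 + 0.3 * x / 5.0          # input-dependent (heteroscedastic) noise
y = np.sin(x) + rng.normal(size=40) * np.sqrt(noise_var)

xs = np.linspace(0.0, 5.0, 100)
K = rbf(x, x) + np.diag(noise_var)        # per-point noise variance on the diagonal
Ks = rbf(xs, x)
alpha = np.linalg.solve(K, y)
mean = Ks @ alpha                         # posterior predictive mean of the latent f
var = rbf(xs, xs).diagonal() - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
```

The paper's contribution is to treat the (here fixed) `noise_var` as unknown, give it its own GP prior, and integrate over it approximately with EP.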

[3]
Title: Best prediction under a nested error model with log transformation
Subjects: Statistics Theory (math.ST)

In regression models involving economic variables such as income, a log transformation is typically applied to achieve approximate normality and stabilize the variance. However, the quantities of interest are often individual values or means of the variable on the original scale. Back-transformation of predicted values introduces a non-negligible bias. Moreover, assessing the uncertainty of the actual predictor is not straightforward. In this paper, a nested error model for the log transformation of the target variable is considered. Nested error models are widely used for estimation of means in subpopulations with small sample sizes (small areas), by linking all the areas through common parameters. These common parameters are estimated using the overall set of sample data, which leads to much more efficient small area estimators. Analytical expressions for the best predictors of individual values of the original variable and of small area means are obtained under the nested error model with log transformation of the target variable. Empirical best predictors are defined by estimating the unknown model parameters in the best predictors. Exact mean squared errors of the best predictors and second order approximations to the mean squared errors of the empirical best predictors are derived. Mean squared error estimators that are second order correct are also obtained. An example with Spanish data on living conditions illustrates the procedures.
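The back-transformation bias referred to in the abstract is easy to see numerically: if log Y is normal with mean mu and variance sigma^2, then E[Y] = exp(mu + sigma^2/2), so exponentiating the log-scale mean systematically underestimates the original-scale mean. This is a generic lognormal illustration, not the paper's nested error predictor.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 2.0, 0.8
logy = rng.normal(mu, sigma, size=200_000)
y = np.exp(logy)

naive = np.exp(logy.mean())                       # back-transform of the log-scale mean
corrected = np.exp(logy.mean() + logy.var() / 2)  # lognormal mean correction

print(y.mean(), naive, corrected)  # naive falls well short of the sample mean of Y
```

The correction term sigma^2/2 is exactly the kind of model-based adjustment the paper's best predictors build in, there under the more structured nested error model.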

[4]
Title: The Degrees of Freedom of Partly Smooth Regularizers
Authors: Samuel Vaiter (CEREMADE), Charles-Alban Deledalle (IMB), Gabriel Peyré (CEREMADE), Jalal M. Fadili (GREYC), Charles Dossal (IMB)
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT)

In this paper, we are concerned with regularized regression problems where the prior penalty is a piecewise regular/partly smooth gauge whose active manifold is linear. This encompasses as special cases the Lasso ($\ell_1$ regularizer), the group Lasso ($\ell_1$-$\ell_2$ regularizer) and the $\ell_\infty$-norm regularizer. This also includes so-called analysis-type priors, i.e. compositions of the previously mentioned functionals with linear operators, a typical example being the total variation prior. We study the sensitivity of *any* regularized minimizer to perturbations of the observations and provide its precise local parameterization. Our main result shows that, when the observations are outside a set of zero Lebesgue measure, the predictor moves locally stably along the same linear space as the observations undergo small perturbations. This local stability is a consequence of the piecewise regularity of the gauge, which in turn plays a pivotal role in obtaining a closed-form expression for the variations of the predictor w.r.t. the observations which holds almost everywhere. When the perturbation is random (with an appropriate continuous distribution), this allows us to derive an unbiased estimator of the degrees of freedom and of the risk of the estimator prediction. Our results hold true without placing any assumption on the design matrix, whether or not it has full column rank. They generalize those already known in the literature, such as for the Lasso problem, the general Lasso problem (analysis $\ell_1$ penalty), or the group Lasso, where existing results for the latter assume that the design is full column rank.
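For the Lasso special case, the degrees-of-freedom result reduces to the classical fact that the size of the active set is an unbiased estimate of the degrees of freedom. A minimal sketch under an orthonormal design (an assumption for illustration, where the Lasso solution is componentwise soft thresholding):

```python
import numpy as np

def soft_threshold(z, lam):
    # Lasso solution under an orthonormal design: componentwise soft thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(3)
n = 50
beta = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(n - 3)])
y = beta + rng.normal(size=n)       # take X = I for simplicity
lam = 1.0

beta_hat = soft_threshold(y, lam)
# Size of the active set: an unbiased estimate of the Lasso's degrees of freedom.
df_estimate = int(np.count_nonzero(beta_hat))
```

The paper extends this kind of almost-everywhere divergence formula to general partly smooth gauges (group Lasso, $\ell_\infty$, analysis priors) without requiring a full-column-rank design.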

[5]
Title: Descriptive examples of the limitations of Artificial Neural Networks applied to the analysis of independent stochastic data
Subjects: Applications (stat.AP)

We show with a few descriptive examples the limitations of Artificial Neural Networks when they are applied to the analysis of independent stochastic data.
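A hedged illustration of the abstract's point: when the targets are independent of the inputs, any fitted predictor carries no out-of-sample signal. Here a linear least-squares model stands in for a network (an assumption for brevity; the qualitative outcome is the same for any estimator).

```python
import numpy as np

rng = np.random.default_rng(5)
X_train, X_test = rng.normal(size=(500, 5)), rng.normal(size=(500, 5))
y_train, y_test = rng.normal(size=500), rng.normal(size=500)  # independent of X

# Fit a linear model with intercept by least squares.
w, *_ = np.linalg.lstsq(np.c_[X_train, np.ones(500)], y_train, rcond=None)
pred = np.c_[X_test, np.ones(500)] @ w

# Out-of-sample R^2: near zero (slightly negative), i.e. no learnable structure.
r2 = 1 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print(round(r2, 3))
```

In-sample fit can look nonzero (the model absorbs noise), which is precisely the kind of artifact the paper's examples warn about for independent stochastic data.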

[6]
Title: Controlling the False Discovery Rate via Knockoffs
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)

In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR)---the expected fraction of false discoveries among all discoveries---is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. This paper introduces the knockoff filter, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables. This method achieves exact FDR control in finite sample settings no matter the design or covariates, the number of variables in the model, and the amplitudes of the unknown regression coefficients, and does not require any knowledge of the noise level. As the name suggests, the method operates by manufacturing knockoff variables that are cheap---their construction does not require any new data---and are designed to mimic the correlation structure found within the existing variables, in a way that allows for accurate FDR control, beyond what is possible with permutation-based methods. The method of knockoffs is very general and flexible, and can work with a broad class of test statistics. We test the method in combination with statistics from the Lasso for sparse regression, and obtain empirical results showing that the resulting method has far more power than existing selection rules when the proportion of null variables is high. We also apply the knockoff filter to HIV data with the goal of identifying those mutations associated with a form of resistance to treatment plans.
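The data-dependent threshold at the heart of the knockoff filter can be sketched directly from the statistics W_j. The W values below are simulated (null W_j symmetric about zero, the property the knockoff construction guarantees), not computed from real knockoff variables; the signal strength and counts are arbitrary.

```python
import numpy as np

def knockoff_threshold(W, q=0.2):
    """Knockoff+-style data-dependent threshold: the smallest t with
    (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

rng = np.random.default_rng(4)
# 30 signal statistics (shifted positive) and 70 null statistics (symmetric about 0).
W = np.concatenate([rng.normal(3.5, 1.0, 30), rng.normal(0.0, 1.0, 70)])

t = knockoff_threshold(W, q=0.2)
selected = np.flatnonzero(W >= t)   # variables declared discoveries
```

Negative W_j act as in-sample stand-ins for false positives, which is how the procedure estimates, and hence controls, the false discovery proportion without permutations.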

### Cross-lists for Wed, 23 Apr 14

[7]  arXiv:1404.5406 (cross-list from cs.PF) [pdf, ps, other]
Title: Degradation Analysis of Probabilistic Parallel Choice Systems
Journal-ref: International Journal of Reliability, Quality and Safety Engineering, vol. 21(3), June 2014
Subjects: Performance (cs.PF); Statistics Theory (math.ST)

Degradation analysis is used to analyze the useful lifetimes of systems, their failure rates, and various other system parameters like mean time to failure (MTTF), mean time between failures (MTBF), and the system failure rate (SFR). In many systems, certain possible parallel paths of execution that have greater chances of success are preferred over others. Thus we introduce here the concept of probabilistic parallel choice. We use binary and $n$-ary probabilistic choice operators in describing the selections of parallel paths. These binary and $n$-ary probabilistic choice operators are considered so as to represent the complete system (described as a series-parallel system) in terms of the probabilities of selection of parallel paths and their relevant parameters. Our approach allows us to derive new and generalized formulae for system parameters like MTTF, MTBF, and SFR. We use a generalized exponential distribution, allowing distinct installation times for individual components, and use this model to derive expressions for such system parameters.
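As a toy special case (an illustrative assumption, not the paper's general formulae): for a binary probabilistic choice between two exponential branches with failure rates lam1 and lam2, selected with probabilities p and 1-p, the expected time to failure is the probability-weighted mix of the branch MTTFs 1/lam.

```python
# Hypothetical selection probability and failure rates for the two branches.
p, lam1, lam2 = 0.7, 0.01, 0.02

# MTTF of a binary probabilistic choice between exponential components:
# weight each branch MTTF (1/lam) by its selection probability.
mttf = p * (1 / lam1) + (1 - p) * (1 / lam2)
print(mttf)   # 0.7*100 + 0.3*50 = 85.0
```

The paper's $n$-ary choice operator generalizes this weighting across many parallel paths and to distributions allowing distinct installation times.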

### Replacements for Wed, 23 Apr 14

[8]  arXiv:0808.3495 (replaced) [pdf, other]
Title: Tail asymptotics for a random sign Lindley recursion
Journal-ref: Journal of Applied Probability, 47(1), 72-83, 2010
Subjects: Probability (math.PR); Statistics Theory (math.ST)
[9]  arXiv:1209.3394 (replaced) [pdf, ps, other]
Title: Distribution of the largest eigenvalue for real Wishart and Gaussian random matrices and a simple approximation for the Tracy-Widom distribution
Authors: Marco Chiani
Comments: Journal of Multivariate Analysis (2014)
Subjects: Information Theory (cs.IT); Statistics Theory (math.ST)
[10]  arXiv:1301.3529 (replaced) [pdf, other]
Title: Discrete Restricted Boltzmann Machines
Subjects: Machine Learning (stat.ML); Algebraic Geometry (math.AG); Probability (math.PR)
[11]  arXiv:1312.7366 (replaced) [pdf, ps, other]
Title: Monte Carlo non local means: Random sampling for large-scale image filtering
Comments: 23 pages, 14 figures; submitted for publication
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO)
[12]  arXiv:1402.0947 (replaced) [pdf, ps, other]
Title: On Renyi entropy convergence of the max domain of attraction
Authors: Ali Saeb
Subjects: Other Statistics (stat.OT)
[13]  arXiv:1404.5165 (replaced) [pdf, other]
Title: GP-Localize: Persistent Mobile Robot Localization using Online Sparse Gaussian Process Observation Model
Comments: 28th AAAI Conference on Artificial Intelligence (AAAI 2014), Extended version with proofs, 10 pages
Subjects: Robotics (cs.RO); Learning (cs.LG); Machine Learning (stat.ML)