A fast algorithm for the minimum covariance determinant estimator

Published on Aug 1, 1999in Technometrics2.089
· DOI :10.2307/1270566
Peter J. Rousseeuw57
Estimated H-index: 57
Katrien Van Driessen5
Estimated H-index: 5
The minimum covariance determinant (MCD) method of Rousseeuw is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations (out of n) whose covariance matrix has the lowest determinant. Until now, applications of the MCD were hampered by the computation time of existing algorithms, which were limited to a few hundred objects in a few dimensions. We discuss two important applications of larger size, one about a production process at Philips with n = 677 objects and p = 9 variables, and a dataset from astronomy with n = 137,256 objects and p = 27 variables. To deal with such problems we have developed a new algorithm for the MCD, called FAST-MCD. The basic ideas are an inequality involving order statistics and determinants, and techniques which we call “selective iteration” and “nested extensions.” For small datasets, FAST-MCD typically finds the exact MCD, whereas for larger datasets it gives more accurate results than existing algorithms and is faster by orders...
  • References (27)
  • Citations (1359)
📖 Papers frequently viewed together
1 Author (David Donoho)
369 Citations
1,294 Citations
697 Citations
78% of Scinapse members use related papers. After signing in, all features are FREE.
Abstract High breakdown estimation allows one to get reasonable estimates of the parameters from a sample of data even if that sample is contaminated by large numbers of awkwardly placed outliers. Two particular application areas in which this is of interest are multiple linear regression, and estimation of the location vector and scatter matrix of multivariate data. Standard high breakdown criteria for the regression problem are the least median of squares (LMS) and least trimmed squares (LTS);...
72 CitationsSource
#1Douglas M. Hawkins (UMN: University of Minnesota)H-Index: 54
#2Geoffrey J. McLachlan (UQ: University of Queensland)H-Index: 47
Abstract The classification rules of linear discriminant analysis are defined by the true mean vectors and the common covariance matrix of the populations from which the data come. Because these true parameters are generally unknown, they are commonly estimated by the sample mean vector and covariance matrix of the data in a training sample randomly drawn from each population. However, these sample statistics are notoriously susceptible to contamination by outliers, a problem compounded by the f...
92 CitationsSource
Publisher Summary This chapter discusses positive-breakdown methods. It focusses on the linear regression model. The goal of positive-breakdown methods is to be robust against the possibility of one or several unannounced outliers that may occur anywhere in the data. High-breakdown regression does not throw away 50% of the data, but instead it finds a majority fit, which can then be used to detect the actual outliers. The purpose of positive-breakdown method is not to delete and forget the point...
37 CitationsSource
#1David M. Rocke (UC Davis: University of California, Davis)H-Index: 49
#2David L. Woodruff (UC Davis: University of California, Davis)H-Index: 32
Abstract New insights are given into why the problem of detecting multivariate outliers can be difficult and why the difficulty increases with the dimension of the data. Significant improvements in methods for detecting outliers are described, and extensive simulation experiments demonstrate that a hybrid method extends the practical boundaries of outlier detection capabilities. Based on simulation results and examples from the literature, the question of what levels of contamination can be dete...
250 CitationsSource
#1José Agulló Candela (University of Alicante)H-Index: 1
In this paper we develop an exact iterative algorithm for the computation of the minimum volume ellipsoid (MVE) estimator that is more efficient than the algorithm of Cook, Hawkins and Weisberg (1993). Our algorithm is based on a branch and bound (BAB) technique and it is computationally feasible for small and moderate-sized samples.
25 CitationsSource
#1David L. Woodruff (UC Davis: University of California, Davis)H-Index: 32
#2David M. Rocke (UC Davis: University of California, Davis)H-Index: 49
Abstract Estimation of multivariate shape and location in a fashion that is robust with respect to outliers and is affine equivariant represents a significant challenge. The use of compound estimators that use a combinatorial estimator such as Rousseeuw's minimum volume ellipsoid (MVE) or minimum covariance determinant (MCD) to find good starting points for high-efficiency robust estimators such as S estimators has been proposed. In this article we indicate why this scheme will fail in high dime...
114 CitationsSource
Abstract Outliers in multivariate data present a particularly intractable problem. There are various criteria for their identification, and for the related problem of high breakdown robust estimation of multivariate location and scale, but to date no reliable algorithm has been proposed for fitting by these criteria. The most widely used multivariate high breakdown method is the minimum volume ellipsoid (MVE) approach which seeks the ellipsoid of minimum volume containing at least half of the da...
65 CitationsSource
#1R.W. ButlerH-Index: 1
#2P.L. DaviesH-Index: 1
Last. Myoungshic JhunH-Index: 10
view all 3 authors...
Consistency is shown for the minimum covariance determinant (MCD) estimators of multivariate location and scale and asymptotic normality is shown for the former. The proofs are made possible by showing a separating ellipsoid property for the MCD subset of observations. An analogous property is shown for the MCD subset computed from the population distribution.
134 CitationsSource
Abstract We consider the multiple linear regression model y i = x′ i β + e i , i = 1, 2, …, n, with random carriers and focus on the estimation of β. This article's main contribution is to present an estimator that is affine, regression, and scale equivariant; has both a high breakdown point and a bounded influence function; and has an asymptotic efficiency greater than .95 versus least squares under Gaussian errors. We give conditions under which the estimator—a one-step general M estimator tha...
130 CitationsSource
#1David L. Woodruff (UC Davis: University of California, Davis)H-Index: 32
#2David M. Rocke (UC Davis: University of California, Davis)H-Index: 49
Abstract A method of robust estimation of multivariate location and shape that has attracted a lot of attention recently is Rousseeuw's minimum volume ellipsoid estimator (MVE). This estimator has a high breakdown point but is difficult to compute successfully. In this article, we apply methods of heuristic search to this problem, including simulated annealing, genetic algorithms, and tabu search, and compare the results to the undirected random search algorithm that is often cited. Heuristic se...
53 CitationsSource
Cited By1359
#1R. Fuentes (University of Sheffield)H-Index: 2
#2Paul Gardner (University of Sheffield)H-Index: 2
Last. Elizabeth Cross (University of Sheffield)H-Index: 10
view all 8 authors...
Abstract The use of robotics is beginning to play a key role in automating the data collection process in Non Destructive Testing (NDT). Increasing the use of automation quickly leads to the gathering of large quantities of data, which makes it inefficient, perhaps even infeasible, for a human to parse the information contained in them. This paper presents a solution to this problem by making the process of NDT data acquisition an autonomous one as opposed to an automatic one. In order to achiev...
#1Jordan W. Bishop (UAF: University of Alaska Fairbanks)
#2David Fee (UAF: University of Alaska Fairbanks)H-Index: 23
Last. Curt A. L. Szuberla (UAF: University of Alaska Fairbanks)H-Index: 10
view all 3 authors...
Abstract This paper presents a generic methodology for fault forecasting or prognosis in industrial equipment. Particularly, this technique regards training some unsupervised machine learning model using high amount of historical process data from such equipment as input and stop data as reference. The goal is to correlate as strongly as possible the anomalies found by the model in the process data with upcoming faults, according to a forecasting horizon. In this way, the outlier detection model...
#1Muhammad Ahsan (UTM: Universiti Teknologi Malaysia)
#2Muhammad Mashuri (ITS: Sepuluh Nopember Institute of Technology)H-Index: 5
Last. Dedi Dwi Prasetyo (ITS: Sepuluh Nopember Institute of Technology)H-Index: 4
view all 5 authors...
Abstract The utilization of conventional multivariate control chart in network intrusion detection will deal with two main problems. First, the high false alarm occurs due to the distribution of network traffic data that is not following the theory. Second, the inability of the control chart to detect outliers caused by the masking effect. To overcome these problems, the multivariate control chart based on the fast minimum covariance determinant (MCD) algorithm and kernel density estimation (KDE...
#1Daniel PérezH-Index: 3
#1Daniel PerezH-Index: 5
Last. Manuel DomínguezH-Index: 14
view all 6 authors...
#1Abhik Ghosh (ISI: Indian Statistical Institute)H-Index: 39
view all 4 authors...
Quadratic discriminant analysis (QDA) is a widely used statistical tool to classify observations from different multivariate Normal populations. The generalized quadratic discriminant analysis (GQDA) classification rule/classifier, which generalizes the QDA and the minimum Mahalanobis distance (MMD) classifiers to discriminate between populations with underlying elliptically symmetric distributions competes quite favorably with the QDA classifier when it is optimal and performs much better when ...
Signal recognition is one of significant and challenging tasks in the signal processing and communications field. It is often a common situation that there's no training data accessible for some signal classes to perform a recognition task. Hence, as widely-used in image processing field, zero-shot learning (ZSL) is also very important for signal recognition. Unfortunately, ZSL regarding this field has hardly been studied due to inexplicable signal semantics. This paper proposes a ZSL framework,...
#1José Valero (University of Murcia)H-Index: 22
#2Pedro Miguel Sánchez Sánchez (University of Murcia)H-Index: 1
Last. Gregorio Martínez Pérez (University of Murcia)H-Index: 18
view all 4 authors...
#1Benyamin GhojoghH-Index: 4
#2Fakhri KarrayH-Index: 30
Last. Mark Crowley (UW: University of Waterloo)H-Index: 6
view all 3 authors...
We propose a novel approach to anomaly detection called Curvature Anomaly Detection (CAD) and Kernel CAD based on the idea of polyhedron curvature. Using the nearest neighbors for a point, we consider every data point as the vertex of a polyhedron where the more anomalous point has more curvature. We also propose inverse CAD (iCAD) and Kernel iCAD for instance ranking and prototype selection by looking at CAD from an opposite perspective. We define the concept of anomaly landscape and anomaly pa...