A Simple and Effective Model-Based Variable Importance Measure.

Published on Jan 1, 2018 in arXiv: Machine Learning
Brandon M. Greenwell (estimated H-index: 4)
Bradley C. Boehmke (estimated H-index: 4)
Andrew J. McCarthy (estimated H-index: 1)
In the era of "big data", it is becoming more of a challenge to not only build state-of-the-art predictive models, but also gain an understanding of what's really going on in the data. For example, it is often of interest to know which, if any, of the predictors in a fitted model are relatively influential on the predicted outcome. Some modern algorithms, like random forests and gradient boosted decision trees, have a natural way of quantifying the importance or relative influence of each feature. Other algorithms, like naive Bayes classifiers and support vector machines, are not capable of doing so, and model-free approaches are generally used to measure each predictor's importance. In this paper, we propose a standardized, model-based approach to measuring predictor importance across the growing spectrum of supervised learning algorithms. Our proposed method is illustrated through both simulated and real data examples. The R code to reproduce all of the figures in this paper is available in the supplementary materials.
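The abstract does not spell out how the proposed measure is computed. As a rough illustration only, the sketch below assumes one plausible model-based score: the spread of a predictor's partial dependence curve, so that a flat curve means low importance. The names `partial_dependence`, `pd_importance`, and `toy_model` are hypothetical, the code is plain Python rather than the R used in the paper's supplementary materials, and nothing here should be read as the authors' implementation.

```python
from statistics import mean, stdev


def partial_dependence(predict, rows, j, grid):
    """Average prediction when feature j is forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite column j in every row, leaving the other features as observed.
        modified = [row[:j] + [value] + row[j + 1:] for row in rows]
        curve.append(mean(predict(r) for r in modified))
    return curve


def pd_importance(predict, rows, j):
    """Assumed score: sample standard deviation of feature j's partial dependence curve."""
    grid = sorted({row[j] for row in rows})  # observed unique values of feature j
    if len(grid) < 2:
        return 0.0  # a constant feature cannot move the curve
    return stdev(partial_dependence(predict, rows, j, grid))


def toy_model(row):
    # Hypothetical fitted model: depends on feature 0 only, ignores feature 1.
    return 3.0 * row[0] + 0.0 * row[1]


rows = [[0.0, 5.0], [1.0, 2.0], [2.0, 9.0], [3.0, 4.0]]
scores = [pd_importance(toy_model, rows, j) for j in range(2)]
print(scores)
```

Because the score only needs a prediction function, the same code applies to any fitted supervised model, which is the kind of standardization across algorithms the abstract describes. In this toy case the ignored feature scores zero and the influential one scores above zero.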