PMLB: a large benchmark suite for machine learning evaluation and comparison

Published on Dec 1, 2017 in BioData Mining
DOI: 10.1186/s13040-017-0154-4
Randal S. Olson (UPenn: University of Pennsylvania), estimated H-index: 15
William G. La Cava (UPenn: University of Pennsylvania), estimated H-index: 10
+ 2 authors
Jason H. Moore (UPenn: University of Pennsylvania), estimated H-index: 69
Background The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists.
References (24)
#1 Mario A. Muñoz (Monash University), H-Index: 10
#2 Laura Villanova (Monash University), H-Index: 5
Last: Kate Smith-Miles (Monash University), H-Index: 22
4 authors in total
This paper tackles the issue of objective performance evaluation of machine learning classifiers, and the impact of the choice of test instances. Given that statistical properties or features of a dataset affect the difficulty of an instance for particular classification algorithms, we examine the diversity and quality of the UCI repository of test instances used by most machine learning researchers. We show how an instance space can be visualized, with each classification dataset represented as...
18 Citations
#1 William G. La Cava (UMass: University of Massachusetts Amherst), H-Index: 10
#2 Kourosh Danai (UMass: University of Massachusetts Amherst), H-Index: 16
Last: Lee Spector (Hampshire College), H-Index: 33
3 authors in total
We introduce a method to enhance the inference of meaningful dynamic models from observational data by genetic programming (GP). This method incorporates an inheritable epigenetic layer that specifies active and inactive genes for a more effective local search of the model structure space. We define several GP implementations using different features of epigenetics, such as passive structure, phenotypic plasticity, and inheritable gene regulation. To test these implementations, we use hundreds o...
14 Citations
#1 Jing Li (Dartmouth College), H-Index: 4
#2 James D. Malley (CIT: Center for Information Technology), H-Index: 26
Last: Jason H. Moore (UPenn: University of Pennsylvania), H-Index: 69
5 authors in total
Background Identifying gene-gene interactions is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Here, we aimed at developing a permutation-based methodology relying on a machine learning method, random forest (RF), to detect gene-gene interactions. Our approach, called permuted random forest (pRF), identified the top interacting single nucleotide polymorphism (SNP) pairs by estimating how much the power of a random forest class...
23 Citations
#1 Ryan J. Urbanowicz (UPenn: University of Pennsylvania), H-Index: 16
#2 Jason H. Moore (UPenn: University of Pennsylvania), H-Index: 69
Algorithmic scalability is a major concern for any machine learning strategy in this age of ‘big data’. A large number of potentially predictive attributes is emblematic of problems in bioinformatics, genetic epidemiology, and many other fields. Previously, ExSTraCS was introduced as an extended Michigan-style supervised learning classifier system that combined a set of powerful heuristics to successfully tackle the challenges of classification, prediction, and knowledge discovery in complex, no...
37 Citations
#1 Joaquin Vanschoren (TU/e: Eindhoven University of Technology), H-Index: 15
#2 Jan N. van Rijn (LEI: Leiden University), H-Index: 9
Last: Luís Torgo (University of Porto), H-Index: 23
4 authors in total
230 Citations
#1 Núria Macià (La Salle University), H-Index: 2
#2 Ester Bernadó-Mansilla (La Salle University), H-Index: 16
Public repositories have contributed to the maturation of experimental methodology in machine learning. Publicly available data sets have allowed researchers to empirically assess their learners and, jointly with open source machine learning software, they have favoured the emergence of comparative analyses of learners' performance over a common framework. These studies have brought standard procedures to evaluate machine learning techniques. However, current claims, such as the superiority of en...
22 Citations
#1 David White (Glas.: University of Glasgow), H-Index: 27
#2 James McDermott (UCD: University College Dublin), H-Index: 17
Last: Sean Luke (GMU: George Mason University), H-Index: 35
9 authors in total
We present the results of a community survey regarding genetic programming benchmark practices. Analysis shows broad consensus that improvement is needed in problem selection and experimental rigor. While views expressed in the survey dissuade us from proposing a large-scale benchmark suite, we find community support for creating a "blacklist" of problems which are in common use but have important flaws, and whose use should therefore be discouraged. We propose a set of possible replacement prob...
115 Citations
#1 Ryan J. Urbanowicz (Dartmouth College), H-Index: 16
#2 Jeff Kiralis (Dartmouth College), H-Index: 8
Last: Jason H. Moore (Dartmouth College), H-Index: 69
4 authors in total
Background Algorithms designed to detect complex genetic disease associations are initially evaluated using simulated datasets. Typical evaluations vary constraints that influence the correct detection of underlying models (i.e. number of loci, heritability, and minor allele frequency). Such studies neglect to account for model architecture (i.e. the unique specification and arrangement of penetrance values comprising the genetic model), which alone can influence the detectability of a model. In...
24 Citations
#1 Ryan J. Urbanowicz (Dartmouth College), H-Index: 16
#2 Jeff Kiralis (Dartmouth College), H-Index: 8
Last: Jason H. Moore (Dartmouth College), H-Index: 69
6 authors in total
Background Geneticists who look beyond single locus disease associations require additional strategies for the detection of complex multi-locus effects. Epistasis, a multi-locus masking effect, presents a particular challenge, and has been the target of bioinformatic development. Thorough evaluation of new algorithms calls for simulation studies in which known disease models are sought. To date, the best methods for generating simulated multi-locus epistatic models rely on genetic algorithms. Ho...
101 Citations
#1 Johannes Stallkamp (RUB: Ruhr University Bochum), H-Index: 7
#2 Marc Schlipsing (RUB: Ruhr University Bochum), H-Index: 11
Last: Christian Igel (UCPH: University of Copenhagen), H-Index: 36
4 authors in total
Traffic signs are characterized by a wide variability in their visual appearance in real-world environments. For example, changes of illumination, varying weather conditions and partial occlusions impact the perception of road signs. In practice, a large number of different sign classes needs to be recognized with very high accuracy. Traffic signs have been designed to be easily readable for humans, who perform very well at this task. For computer systems, however, classifying traffic signs stil...
412 Citations
Cited By (40)
#1 Raaz Dwivedi, H-Index: 4
#2 Chandan Singh, H-Index: 21
Last: Martin J. Wainwright, H-Index: 71
4 authors in total
The recent success of high-dimensional models, such as deep neural networks (DNNs), has led many to question the validity of the bias-variance tradeoff principle in high dimensions. We reexamine it with respect to two key choices: the model class and the complexity measure. We argue that failing to suitably specify either one can falsely suggest that the tradeoff does not hold. This observation motivates us to seek a valid complexity measure, defined with respect to a reasonably good class of mo...
Binary classification is widely used in ML production systems. Monitoring classifiers in a constrained event space is well known. However, real world production systems often lack the ground truth these methods require. Privacy concerns may also require that the ground truth needed to evaluate the classifiers cannot be made available. In these autonomous settings, non-parametric estimators of performance are an attractive solution. They do not require theoretical models about how the classifiers...
#2 Steven K. Kauwe, H-Index: 2
Last: Taylor D. Sparks, H-Index: 13
3 authors in total
#1 Qingyun Wu (UVA: University of Virginia), H-Index: 5
#2 Chi Wang, H-Index: 1
3 authors in total
The increasing demand for democratizing machine learning algorithms for general software developers calls for hyperparameter optimization (HPO) solutions at low cost. Many machine learning algorithms have hyperparameters, which can cause a large variation in the training cost. But this effect is largely ignored in existing HPO methods, which are incapable of properly controlling cost during the optimization process. To address this problem, we develop a cost-effective HPO solution. The core of our s...
#1 Angela Lee, H-Index: 1
Last: Aditya Parameswaran, H-Index: 25
4 authors in total
It is well known that the process of developing machine learning (ML) workflows is a dark art; even experts struggle to find an optimal workflow leading to a high accuracy model. Users currently rely on empirical trial-and-error to obtain their own set of battle-tested guidelines to inform their modeling decisions. In this study, we aim to demystify this dark art by understanding how people iterate on ML workflows in practice. We analyze over 475k user-generated workflows on OpenML, an open-sour...
#1 Jonas Fausing Olesen (University of Southern Denmark)
#2 Hamid Reza (University of Southern Denmark), H-Index: 14
Thermal power plants are an important asset in the current energy infrastructure, delivering ancillary services, power, and heat to their respective consumers. Faults on critical components, such as large pumping systems, can lead to material damage and opportunity losses. Pumps play an essential role in various industries, and as such clever maintenance can ensure cost reductions and high availability. Prognostics and Health Management, PHM, is the study utilizing data to estimate the current a...
#1 Patryk Orzechowski, H-Index: 6
Last: Jason H. Moore, H-Index: 69
3 authors in total
Manifold learning, a non-linear approach of dimensionality reduction, assumes that the dimensionality of multiple datasets is artificially high and a reduced number of dimensions is sufficient to maintain the information about the data. In this paper, a large scale comparison of manifold learning techniques is performed for the task of classification. We show the current standing of genetic programming (GP) for the task of classification by comparing the classification results of two GP-based ma...
#1 Stephen R. Piccolo (BYU: Brigham Young University), H-Index: 14
#2 Terry J Lee (BYU: Brigham Young University)
Last: Kimball Hill (BYU: Brigham Young University), H-Index: 1
4 authors in total
BACKGROUND: Classification algorithms assign observations to groups based on patterns in data. The machine-learning community have developed myriad classification algorithms, which are used in diverse life science research domains. Algorithm choice can affect classification accuracy dramatically, so it is crucial that researchers optimize the choice of which algorithm(s) to apply in a given research domain on the basis of empirical evidence. In benchmark studies, multiple algorithms are applied ...
#1 Henry Wilde (Cardiff University)
#2 Vincent Anthony Knight (Cardiff University), H-Index: 11
Last: Jonathan William Gillard (Cardiff University), H-Index: 11
3 authors in total
In this paper we propose a novel method for learning how algorithms perform. Classically, algorithms are compared on a finite number of existing (or newly simulated) benchmark datasets based on some fixed metrics. The algorithm(s) with the smallest value of this metric are chosen to be the ‘best performing’. We offer a new approach to flip this paradigm. We instead aim to gain a richer picture of the performance of an algorithm by generating artificial data through genetic evolution, the purpose...
2 Citations
#1 Leonardo Trujillo, H-Index: 19
Last: Antonin Ponsich (UAM: Universidad Autónoma Metropolitana), H-Index: 5
5 authors in total
Evolutionary algorithms (EAs) have been with us for several decades and are highly popular given that they have proved competitive in the face of challenging problems’ features such as deceptiveness, multiple local optima, among other characteristics. However, it is necessary to define multiple hyper-parameter values to have a working EA, which is a drawback for many practitioners. In the case of genetic programming (GP), an EA for the evolution of models and programs, hyper-parameter optimizati...