Package ‘mlbench’
April 17, 2009
Version 1.1-6
Title Machine Learning Benchmark Problems
Date 2009-02-05
Author Friedrich Leisch and Evgenia Dimitriadou. Original data sets from various sources.
Maintainer Friedrich Leisch <[email protected]>
Description A collection of artificial and real-world machine learning benchmark problems,
including, e.g., several data sets from the UCI repository.
License file LICENSE
Suggests e1071, scatterplot3d
ZipData No
Repository CRAN
Date/Publication 2009-02-05 14:13:08
R topics documented:
as.data.frame.mlbench
bayesclass
BostonHousing
BreastCancer
DNA
Glass
HouseVotes84
Ionosphere
LetterRecognition
mlbench.2dnormals
mlbench.cassini
mlbench.circle
mlbench.corners
mlbench.cuboids

mlbench.friedman1
mlbench.friedman2
mlbench.friedman3
mlbench.peak
mlbench.ringnorm
mlbench.shapes
mlbench.smiley
mlbench.spirals
mlbench.threenorm
mlbench.twonorm
mlbench.waveform
mlbench.xor
Ozone
PimaIndiansDiabetes
plot.mlbench
Satellite
Servo
Shuttle
Sonar
Soybean
Vehicle
Vowel
Zoo
Index

as.data.frame.mlbench Convert an mlbench object to a dataframe

Description

Converts x (which is basically a list) to a dataframe.

Usage

S3 method for class 'mlbench':

as.data.frame(x, row.names=NULL, optional=FALSE, ...)

Arguments

x                        Object of class "mlbench".
row.names,optional,...   currently ignored.

BostonHousing

Format

The original data are 506 observations on 14 variables, medv being the target variable:

crim     per capita crime rate by town
zn       proportion of residential land zoned for lots over 25,000 sq.ft
indus    proportion of non-retail business acres per town
chas     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox      nitric oxides concentration (parts per 10 million)
rm       average number of rooms per dwelling
age      proportion of owner-occupied units built prior to 1940
dis      weighted distances to five Boston employment centres
rad      index of accessibility to radial highways
tax      full-value property-tax rate per USD 10,
ptratio  pupil-teacher ratio by town
b        1000(B - 0.63)^2 where B is the proportion of blacks by town
lstat    percentage of lower status of the population
medv     median value of owner-occupied homes in USD 1000’s

The corrected data set has the following additional columns:

cmedv    corrected median value of owner-occupied homes in USD 1000’s
town     name of town
tract    census tract
lon      longitude of census tract
lat      latitude of census tract

Source

The original data have been taken from the UCI Repository Of Machine Learning Databases at

  • http://www.ics.uci.edu/~mlearn/MLRepository.html,

the corrected data have been taken from Statlib at

  • http://lib.stat.cmu.edu/datasets/

See Statlib and references there for details on the corrections. Both were converted to R format by Friedrich Leisch.

References

Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81–102.

Gilley, O.W., and R. Kelley Pace (1996). On the Harrison and Rubinfeld Data. Journal of Environmental Economics and Management, 31, 403–405. [Provided corrections and examined censoring.]

Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

Pace, R. Kelley, and O.W. Gilley (1997). Using the Spatial Configuration of the Data to Improve Estimation. Journal of the Real Estate Finance and Economics, 14, 333–340. [Added georeferencing and spatial estimation.]

References

  1. Wolberg, W.H., & Mangasarian, O.L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. In Proceedings of the National Academy of Sciences, 87, 9193-9196.
    • Size of data set: only 369 instances (at that point in time)
    • Collected classification results: 1 trial only
    • Two pairs of parallel hyperplanes were found to be consistent with 50% of the data
    • Accuracy on remaining 50% of dataset: 93.5%
    • Three pairs of parallel hyperplanes were found to be consistent with 67% of data
    • Accuracy on remaining 33% of dataset: 95.9%
  2. Zhang, J. (1992). Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Conference (pp. 470-479). Aberdeen, Scotland: Morgan Kaufmann.
    • Size of data set: only 369 instances (at that point in time)
    • Applied 4 instance-based learning algorithms
    • Collected classification results averaged over 10 trials
    • Best accuracy result:
      • 1-nearest neighbor: 93.7%
      • trained on 200 instances, tested on the other 169
    • Also of interest:
      • Using only typical instances: 92.2% (storing only 23.1 instances)
      • trained on 200 instances, tested on the other 169

Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

DNA Primate splice-junction gene sequences (DNA)

Description

It consists of 3,186 data points (splice junctions). The data points are described by 180 binary indicator variables and the problem is to recognize the 3 classes (ei, ie, neither), i.e., the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out).

The StatLog dna dataset is a processed version of the Irvine database described below. The main difference is that each symbolic variable representing a nucleotide (only A, G, T, C) was replaced by 3 binary indicator variables. Thus the original 60 symbolic attributes were changed into 180 binary attributes. The names of the examples were removed. The examples with ambiguities were removed (there were very few of them, 4). The StatLog version of this dataset was produced by Ross King at Strathclyde University. For original details see the Irvine database documentation.

The nucleotides A, C, G, T were given indicator values as follows:

A -> 1 0 0
C -> 0 1 0
G -> 0 0 1
T -> 0 0 0
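The indicator mapping above can be sketched in base R. This is only an illustration of the encoding; `encode_nucleotides` is a hypothetical helper name and is not part of mlbench.

```r
# Hypothetical helper (not part of mlbench): encode a nucleotide string
# with the 3-indicator scheme A = 100, C = 010, G = 001, T = 000.
encode_nucleotides <- function(seq) {
  codes <- rbind(A = c(1, 0, 0),
                 C = c(0, 1, 0),
                 G = c(0, 0, 1),
                 T = c(0, 0, 0))
  bases <- strsplit(toupper(seq), "")[[1]]
  # Ambiguous symbols are rejected, mirroring their removal in the StatLog version
  stopifnot(all(bases %in% rownames(codes)))
  # One row per base, flattened base by base into a single 0/1 vector
  as.vector(t(codes[bases, , drop = FALSE]))
}

encode_nucleotides("ACGT")  # 1 0 0  0 1 0  0 0 1  0 0 0
```

A sequence of 60 nucleotides encoded this way yields the 180 binary attributes described above.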

Hint. Much better performance is generally observed if attributes closest to the junction are used. In the StatLog version, this means using attributes A61 to A120 only.

Usage

data(DNA)

Format

A data frame with 3,186 observations on 180 variables, all nominal and a target class.

Source

  • Source:
    • all examples taken from Genbank 64.1 (ftp site: genbank.bio.net)
    • categories "ei" and "ie" include every "split-gene" for primates in Genbank 64.
    • non-splice examples taken from sequences known not to include a splicing site
  • Donor: G. Towell, M. Noordewier, and J. Shavlik, towell,[email protected], [email protected]

These data have been taken from:

  • ftp.stams.strath.ac.uk/pub/Statlog

and were converted to R format by [email protected].

References

machine learning:

  • M. O. Noordewier and G. G. Towell and J. W. Shavlik, 1991; "Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences". Advances in Neural Information Processing Systems, volume 3, Morgan Kaufmann.
  • G. G. Towell and J. W. Shavlik and M. W. Craven, 1991; "Constructive Induction in Knowledge-Based Neural Networks", In Proceedings of the Eighth International Machine Learning Workshop, Morgan Kaufmann.
  • G. G. Towell, 1991; "Symbolic Knowledge and Neural Networks: Insertion, Refinement, and Extraction", PhD Thesis, University of Wisconsin - Madison.
  • G. G. Towell and J. W. Shavlik, 1992; "Interpretation of Artificial Neural Networks: Mapping Knowledge-based Neural Networks into Rules", In Advances in Neural Information Processing Systems, volume 4, Morgan Kaufmann.

HouseVotes84 United States Congressional Voting Records 1984

Description

This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).

Usage

data(HouseVotes84)

Format

A data frame with 435 observations on 17 variables:

1   Class Name: 2 (democrat, republican)
2   handicapped-infants: 2 (y,n)
3   water-project-cost-sharing: 2 (y,n)
4   adoption-of-the-budget-resolution: 2 (y,n)
5   physician-fee-freeze: 2 (y,n)
6   el-salvador-aid: 2 (y,n)
7   religious-groups-in-schools: 2 (y,n)
8   anti-satellite-test-ban: 2 (y,n)
9   aid-to-nicaraguan-contras: 2 (y,n)
10  mx-missile: 2 (y,n)
11  immigration: 2 (y,n)
12  synfuels-corporation-cutback: 2 (y,n)
13  education-spending: 2 (y,n)
14  superfund-right-to-sue: 2 (y,n)
15  crime: 2 (y,n)
16  duty-free-exports: 2 (y,n)
17  export-administration-act-south-africa: 2 (y,n)

Source

  • Source: Congressional Quarterly Almanac, 98th Congress, 2nd session 1984, Volume XL: Congressional Quarterly Inc., Washington, D.C., 1985
  • Donor: Jeff Schlimmer ([email protected])

These data have been taken from the UCI Repository Of Machine Learning Databases at

  • ftp://ftp.ics.uci.edu/pub/machine-learning-databases
  • http://www.ics.uci.edu/~mlearn/MLRepository.html

and were converted to R format by Friedrich Leisch.

References

Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

Ionosphere Johns Hopkins University Ionosphere database

Description

This radar data was collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. See the paper for more details. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere. "Bad" returns are those that do not; their signals pass through the ionosphere.

Received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay system. Instances in this database are described by 2 attributes per pulse number, corresponding to the complex values returned by the function resulting from the complex electromagnetic signal. See the paper cited in the references for more details.

Usage

data(Ionosphere)

Format

A data frame with 351 observations on 35 independent variables, some numerical and 2 nominal, and one last defining the class.

Source

  • Source: Space Physics Group; Applied Physics Laboratory; Johns Hopkins University; Johns Hopkins Road; Laurel; MD 20723
  • Donor: Vince Sigillito ([email protected])

These data have been taken from the UCI Repository Of Machine Learning Databases at
  • ftp://ftp.ics.uci.edu/pub/machine-learning-databases
  • http://www.ics.uci.edu/~mlearn/MLRepository.html

and were converted to R format by [email protected].

[,5]   high    height of box
[,6]   onpix   total number of on pixels
[,7]   x.bar   mean x of on pixels in box
[,8]   y.bar   mean y of on pixels in box
[,9]   x2bar   mean x variance
[,10]  y2bar   mean y variance
[,11]  xybar   mean x y correlation
[,12]  x2ybr   mean of x^2 y
[,13]  xy2br   mean of xy^2
[,14]  x.ege   mean edge count left to right
[,15]  xegvy   correlation of x.ege with y
[,16]  y.ege   mean edge count bottom to top
[,17]  yegvx   correlation of y.ege with x

Source

  • Creator: David J. Slate
  • Odesta Corporation; 1890 Maple Ave; Suite 115; Evanston, IL 60201
  • Donor: David J. Slate ([email protected]) (708) 491-

These data have been taken from the UCI Repository Of Machine Learning Databases at
  • ftp://ftp.ics.uci.edu/pub/machine-learning-databases
  • http://www.ics.uci.edu/~mlearn/MLRepository.html

and were converted to R format by Friedrich Leisch.

References

P. W. Frey and D. J. Slate (Machine Learning Vol 6/2 March 91): "Letter Recognition Using Holland-style Adaptive Classifiers". The research for this article investigated the ability of several variations of Holland-style adaptive classifier systems to learn to correctly guess the letter categories associated with vectors of 16 integer attributes extracted from raster scan images of the letters. The best accuracy obtained was a little over 80%. It would be interesting to see how well other methods do with the same data.

Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

mlbench.2dnormals 2-dimensional Gaussian Problem

Description

Each of the cl classes consists of a 2-dimensional Gaussian. The centers are equally spaced on a circle around the origin with radius r.
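The construction described above can be sketched in base R as follows. This is not the package's implementation: `sketch_2dnormals` is a hypothetical name, and both the angle of the first center (taken as 0) and the random assignment of points to classes are assumptions.

```r
# Hypothetical re-implementation for illustration only.
sketch_2dnormals <- function(n, cl = 2, r = sqrt(cl), sd = 1) {
  # Centers equally spaced on a circle of radius r around the origin
  # (assumption: the first center sits at angle 0)
  angles  <- 2 * pi * (seq_len(cl) - 1) / cl
  centers <- cbind(r * cos(angles), r * sin(angles))
  # Assumption: each point is assigned to a class uniformly at random
  classes <- sample(cl, n, replace = TRUE)
  # Add isotropic Gaussian noise with standard deviation sd around each center
  x <- centers[classes, , drop = FALSE] + matrix(rnorm(2 * n, sd = sd), ncol = 2)
  list(x = x, classes = factor(classes, levels = seq_len(cl)), centers = centers)
}

set.seed(1)
p <- sketch_2dnormals(500, cl = 6)
plot(p$x, col = as.integer(p$classes))
```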

Usage

mlbench.2dnormals(n, cl=2, r=sqrt(cl), sd=1)

Arguments

n    number of patterns to create
cl   number of classes
r    radius at which the centers of the classes are located
sd   standard deviation of the Gaussians

Value

Returns an object of class "bayes.2dnormals" with components

x         input values
classes   factor vector of length n with target classes

Examples

# 2 classes
p <- mlbench.2dnormals(500,2)
plot(p)
# 6 classes
p <- mlbench.2dnormals(500,6)
plot(p)

mlbench.cassini Cassini: A 2 Dimensional Problem

Description

The inputs of the cassini problem are uniformly distributed on a 2-dimensional space within 3 structures. The 2 external structures (classes) are banana-shaped structures and in between them, the middle structure (class) is a circle.

Usage

mlbench.cassini(n, relsize=c(2,2,1))

Arguments

n         number of patterns to create
relsize   relative size of the classes (vector of length 3)

mlbench.corners Corners of Hypercube

Description

The created data are d-dimensional spherical Gaussians with standard deviation sd and means at the corners of a d-dimensional hypercube. The number of classes is 2^d.
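A minimal base-R sketch of this construction, for illustration only: `sketch_corners` is a hypothetical name, and placing one corner of the hypercube at the origin as well as assigning points to corners at random are assumptions that may differ from what mlbench does.

```r
# Hypothetical re-implementation for illustration only.
sketch_corners <- function(n = 800, d = 3, sides = rep(1, d), sd = 0.1) {
  # Enumerate all 2^d corners: each coordinate is either 0 or the side length
  # (assumption: the hypercube has one corner at the origin)
  corners <- as.matrix(expand.grid(lapply(sides, function(s) c(0, s))))
  # Assumption: each point is assigned to a corner uniformly at random
  classes <- sample(nrow(corners), n, replace = TRUE)
  # Spherical Gaussian noise with standard deviation sd around the chosen corner
  x <- corners[classes, , drop = FALSE] + matrix(rnorm(n * d, sd = sd), ncol = d)
  list(x = unname(x), classes = factor(classes, levels = seq_len(nrow(corners))))
}

set.seed(1)
p <- sketch_corners()
nlevels(p$classes)  # 2^3 = 8 classes for d = 3
```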

Usage

mlbench.corners(n=800, d=3, sides=rep(1,d), sd=0.1)

Arguments

n       number of patterns to create
d       dimensionality of hypercube, default is 3
sides   lengths of the sides of the hypercube, default is to create a unit hypercube
sd      standard deviation

Value

Returns an object of class "mlbench.corners" with components

x         input values
classes   factor of length n with target classes

Examples

p <- mlbench.corners()
plot(p)
library("scatterplot3d")
scatterplot3d(p$x, color=as.numeric(p$classes))

mlbench.cuboids Cuboids: A 3 Dimensional Problem

Description

The inputs of the cuboids problem are uniformly distributed on a 3-dimensional space within 3 cuboids and a small cube in the middle of them.

Usage

mlbench.cuboids(n, relsize=c(2,2,2,1))

Arguments

n         number of patterns to create
relsize   relative size of the classes (vector of length 4)

Value

Returns an object of class "mlbench.cuboids" with components

x         input values
classes   vector of length n with target classes

Author(s)

Evgenia Dimitriadou and Andreas Weingessel

Examples

p <- mlbench.cuboids(7000)
plot(p)
## Not run:
library(Rggobi)
g <- ggobi(p$x)
g$setColors(p$class)
g$setMode("2D Tour")
## End(Not run)

mlbench.friedman1 Benchmark Problem Friedman 1

Description

The regression problem Friedman 1 as described in Friedman (1991) and Breiman (1996). Inputs are 10 independent variables uniformly distributed on the interval [0, 1]; only 5 of these 10 are actually used. Outputs are created according to the formula

\[ y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + e \]

where e is N(0,sd).
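The formula can be written out directly in base R. The sketch below is for illustration and is not the package's code; `sketch_friedman1` is a hypothetical name, and sd is taken to be the standard deviation of the noise, as in the Arguments section.

```r
# Hypothetical re-implementation for illustration only.
sketch_friedman1 <- function(n, sd = 1) {
  # 10 independent inputs, uniform on [0, 1]; columns 6 to 10 never enter y
  x <- matrix(runif(10 * n), ncol = 10)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +
       20 * (x[, 3] - 0.5)^2 +
       10 * x[, 4] +
       5 * x[, 5] +
       rnorm(n, sd = sd)  # Gaussian noise e with standard deviation sd
  list(x = x, y = y)
}

set.seed(1)
d <- sketch_friedman1(200)
str(d)
```

Because columns 6 to 10 are pure noise inputs, the problem is commonly used to check whether a regression method can ignore irrelevant variables.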

Usage

mlbench.friedman1(n, sd=1)

Arguments

n    number of patterns to create
sd   Standard deviation of noise

References

Breiman, Leo (1996) Bagging predictors. Machine Learning 24, pages 123-140.

Friedman, Jerome H. (1991) Multivariate adaptive regression splines. The Annals of Statistics 19 (1), pages 1-67.

mlbench.friedman3 Benchmark Problem Friedman 3

Description

The regression problem Friedman 3 as described in Friedman (1991) and Breiman (1996). Inputs are 4 independent variables uniformly distributed over the ranges

\[ 0 \le x_1 \le 100 \]
\[ 40\pi \le x_2 \le 560\pi \]
\[ 0 \le x_3 \le 1 \]
\[ 1 \le x_4 \le 11 \]

The outputs are created according to the formula

\[ y = \arctan\left( \frac{x_2 x_3 - \frac{1}{x_2 x_4}}{x_1} \right) + e \]

where e is N(0,sd).
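As a rough illustration, the input ranges and the output formula translate to base R as shown below. `sketch_friedman3` is a hypothetical name, not the package function, and sd is taken to be the standard deviation of the noise.

```r
# Hypothetical re-implementation for illustration only.
sketch_friedman3 <- function(n, sd = 0.1) {
  # Four independent inputs, each uniform over its own range
  x <- cbind(runif(n, 0, 100),
             runif(n, 40 * pi, 560 * pi),
             runif(n, 0, 1),
             runif(n, 1, 11))
  # y = atan((x2 * x3 - 1 / (x2 * x4)) / x1) + e
  y <- atan((x[, 2] * x[, 3] - 1 / (x[, 2] * x[, 4])) / x[, 1]) +
       rnorm(n, sd = sd)
  list(x = x, y = y)
}

set.seed(1)
d <- sketch_friedman3(200)
summary(d$y)
```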

Usage

mlbench.friedman3(n, sd=0.1)

Arguments

n    number of patterns to create
sd   Standard deviation of noise. The default value of 0.1 gives a signal to noise ratio (i.e., the ratio of the standard deviations) of 3:1. Thus, the variance of the function itself (without noise) accounts for 90% of the total variance.
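The step from a 3:1 ratio of standard deviations to 90% of the variance can be spelled out as follows. The symbols \sigma_f (standard deviation of the noise-free function) and \sigma_e (standard deviation of the noise) are notation introduced here, and the noise is assumed to be independent of the function values so that the variances add.

```latex
\[
\frac{\sigma_f}{\sigma_e} = 3
\quad\Longrightarrow\quad
\frac{\sigma_f^2}{\sigma_f^2 + \sigma_e^2}
  = \frac{9\,\sigma_e^2}{9\,\sigma_e^2 + \sigma_e^2}
  = \frac{9}{10}
  = 90\%.
\]
```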

Value

Returns a list with components

x   input values (independent variables)
y   output values (dependent variable)

References

Breiman, Leo (1996) Bagging predictors. Machine Learning 24, pages 123-140.

Friedman, Jerome H. (1991) Multivariate adaptive regression splines. The Annals of Statistics 19 (1), pages 1-67.

mlbench.peak Peak Benchmark Problem

Description

Let r = 3u where u is uniform on [0,1]. Take x to be uniformly distributed on the d-dimensional sphere of radius r. Let y = 25 exp(-0.5 r^2). This data set is not a classification problem but a regression problem where y is the dependent variable.
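One way to sketch this construction in base R is shown below. `sketch_peak` is a hypothetical name and not the package function; drawing a uniform direction by normalising a standard Gaussian vector is a standard technique chosen here, not necessarily the method mlbench uses.

```r
# Hypothetical re-implementation for illustration only.
sketch_peak <- function(n, d = 20) {
  r <- 3 * runif(n)                    # r = 3u with u uniform on [0, 1]
  # Uniform direction on the sphere: normalise a standard Gaussian vector
  z <- matrix(rnorm(n * d), ncol = d)
  x <- z / sqrt(rowSums(z^2)) * r      # row i is rescaled to length r[i]
  y <- 25 * exp(-0.5 * r^2)            # output depends on the radius only
  list(x = x, y = y)
}

set.seed(1)
d <- sketch_peak(200)
range(d$y)
```

Since y is a function of the radius alone, the target has its maximum of 25 at the origin and decays in every direction.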

Usage

mlbench.peak(n, d=20)

Arguments

n   number of patterns to create
d   dimension of the problem

Value

Returns a list with components

x   input values (independent variables)
y   output values (dependent variable)

mlbench.ringnorm Ringnorm Benchmark Problem

Description

The inputs of the ringnorm problem are points from two Gaussian distributions. Class 1 is multivariate normal with mean 0 and covariance 4 times the identity matrix. Class 2 has unit covariance and mean (a, a, ..., a), a = d^{-0.5}.
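The two class distributions can be sketched in base R as follows. `sketch_ringnorm` is a hypothetical name, and assigning each point to a class with equal probability is an assumption; the package may split the n points differently.

```r
# Hypothetical re-implementation for illustration only.
sketch_ringnorm <- function(n, d = 20) {
  a <- d^(-0.5)
  # Assumption: each point belongs to class 1 or 2 with equal probability
  classes <- sample(2, n, replace = TRUE)
  # Class 1: mean 0, covariance 4 * I (standard deviation 2 per coordinate)
  # Class 2: mean (a, ..., a), covariance I
  sds   <- ifelse(classes == 1, 2, 1)
  means <- ifelse(classes == 1, 0, a)
  # Per-row scaling and shifting: row i uses sds[i] and means[i]
  x <- matrix(rnorm(n * d), ncol = d) * sds + means
  list(x = x, classes = factor(classes, levels = 1:2))
}

set.seed(1)
p <- sketch_ringnorm(1000, d = 2)
plot(p$x, col = as.integer(p$classes))
```

In the plot, class 2 forms a tight cluster near the origin while class 1 spreads around it, which is where the name "ringnorm" comes from.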

Usage

mlbench.ringnorm(n, d=20)

Arguments

n   number of patterns to create
d   dimension of the ringnorm problem