Monday, October 3, 2011

Model Exploration using K-sample Plot in Big Data

In general, the error rate of a model predicting a binary variable plateaus as the sample size increases. On the training data, the error rate rises from 0 toward the true error. On the test data, the error rate falls from 1 toward the true error. The same phenomenon occurs when predicting a continuous variable, with the error rate replaced by the proportion of explained variance. This figure clearly shows the concept (taken from the textbook "Data Mining and Statistics for Decision Making").

When we have to deal with Big Data, most machine learning methods may not be able to estimate their parameters because the computation takes a huge amount of time. So, when dealing with Big Data, I propose checking the performance of a model using the "K-sample Plot (K's plot)" explained below.

K's plot is a simple algorithm that plots 1 - AUC (or the proportion of explained variance) against sample size. I use 1 - AUC instead of the error rate because AUC takes sensitivity and specificity into account simultaneously.
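As a small illustration of the quantity on the y-axis, here is a minimal sketch of computing 1 - AUC from predicted scores and 0/1 labels in base R, using the rank-sum (Mann-Whitney) identity. The function name one_minus_auc and the toy data are my own assumptions for illustration; this is not code from KsPlot.

```r
# 1 - AUC from predicted scores and 0/1 labels (base R only).
# one_minus_auc is a hypothetical helper, not part of KsPlot.
one_minus_auc <- function(score, label) {
  pos   <- label == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  stopifnot(n_pos > 0, n_neg > 0)
  r   <- rank(score)   # average ranks, so tied scores count as 1/2
  auc <- (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  1 - auc
}

# Toy data, made up for illustration
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4)
label <- c(1,   1,   0,   1,   0,    0)
one_minus_auc(score, label)
```

Because AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, it does not depend on one fixed classification threshold the way the error rate does.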

Step 1. Sample K observations from the Big Data.
Step 2. Estimate 1 - AUC for the training samples and the test samples by cross-validation.
Step 3. Increase the sample size K from small to sufficiently large, repeating Steps 1 and 2 for each K.
Step 4. Plot K (x-axis) vs. 1 - AUC (y-axis).
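The four steps can be sketched in base R as follows. This is only a rough illustration under my own assumptions, not the KsPlot implementation: the function name ks_curve and its arguments are hypothetical, the model is fixed to lm, the outcome is continuous, and the y-axis is the unexplained proportion of variance (1 - R^2) so that the curve reads like an error rate.

```r
# Sketch of Steps 1-4 for a continuous outcome; NOT the KsPlot code.
# ks_curve, sizes and folds are hypothetical names for illustration.
ks_curve <- function(X, y, sizes, folds = 5) {
  res <- data.frame(K = sizes, train = NA_real_, test = NA_real_)
  for (i in seq_along(sizes)) {            # Step 3: vary K
    # Step 1: sample K observations from the full data
    idx <- sample(nrow(X), sizes[i])
    d   <- cbind(X[idx, , drop = FALSE], y = y[idx])
    # Step 2: cross-validated 1 - R^2 on training and test folds
    fold <- sample(rep(seq_len(folds), length.out = nrow(d)))
    tr <- te <- numeric(folds)
    for (f in seq_len(folds)) {
      fit <- lm(y ~ ., data = d[fold != f, ])
      unexplained <- function(part) {
        sum((part$y - predict(fit, newdata = part))^2) /
          sum((part$y - mean(part$y))^2)
      }
      tr[f] <- unexplained(d[fold != f, ])
      te[f] <- unexplained(d[fold == f, ])
    }
    res$train[i] <- mean(tr)
    res$test[i]  <- mean(te)
  }
  res
}

# Small simulated data (20,000 rows, so the sketch runs quickly)
set.seed(1)
n  <- 20000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 + x2^2 + rnorm(n)
X  <- data.frame(x1 = x1, x2 = x2)
curve_lm <- ks_curve(X, y, sizes = c(50, 100, 500, 2000, 5000))

# Step 4: plot K (x-axis) vs. 1 - R^2 (y-axis)
matplot(curve_lm$K, curve_lm[, c("train", "test")], type = "b", log = "x",
        pch = 1:2, lty = 1:2, col = 1:2,
        xlab = "K (sample size)", ylab = "1 - R^2")
legend("topright", legend = c("training", "test"),
       pch = 1:2, lty = 1:2, col = 1:2)
```

For a binary target, the same loop would apply with 1 - AUC in place of 1 - R^2 and a classifier in place of lm.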

I implemented K's plot in an R package named "KsPlot". Here is a partial example of KsPlot code.

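library(KsPlot)   # the package named above

# Simulate 1 million observations: y is linear in x1 and quadratic in x2
n    <- 1000000
x1   <- rnorm(n)
x2   <- rnorm(n)
y    <- 2*x1 + x2^2 + rnorm(n)

# X1: raw explanatory variables; X2: also includes the quadratic term
X1   <- data.frame(x1 = x1, x2 = x2)
X2   <- data.frame(x1 = x1, x2 = x2, x3 = x2^2)

# K's plot for the linear model and for SVM
KsResult1 <- KsamplePlot(X1, y)
KsResult2 <- KsamplePlot(X1, y, Method="svm")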

This example data has a continuous outcome y and two continuous explanatory variables, x1 and x2. It contains 1 million observations! The relationship between y and x1 is linear, and the relationship between y and x2 is quadratic. If we fit a linear model and an SVM (support vector machine) to the full data, the linear model takes a few seconds but the SVM takes over 1 day! So it is better to check the performance of the linear model and the SVM using K's plot.

Result of Linear Model.

Result of SVM.

Comparing the two results, we see that SVM is better than the linear model in this case. At present, the following models are implemented for a continuous target variable (specified with the "Method=" option): linear model (lm), Support Vector Machine (svm), Neural Network (nn), Random Forest (rf), Multivariate Adaptive Regression Splines (mars), Classification and Regression Tree (cart), and LASSO (lasso). For a binary target, lm and svm are implemented.

This method is part of my doctoral thesis, which is written in Japanese. We are now translating the thesis into English and will try to publish it in a statistical journal.