When we have to deal with Big Data, many machine learning methods cannot estimate their parameters because the computation takes too much time. So I propose checking the performance of a model with the

"K-sample Plot (K's plot)" explained below when dealing with Big Data.

K's plot is a simple algorithm that draws a plot of 1 - AUC (or the proportion of explained variance) against sample size. The reason for using 1 - AUC instead of the error rate is that AUC takes sensitivity and specificity into account simultaneously.
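As a side note, AUC itself is cheap to evaluate at each sample size: it equals the normalized Mann-Whitney (rank-sum) statistic, so it can be computed in a few lines of base R. The function below is an illustrative sketch, not part of the KsPlot package.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
# score: numeric predictions; label: 0/1 outcome.
auc <- function(score, label) {
  r  <- rank(score)                      # average ranks handle ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
lab <- rbinom(200, 1, 0.5)
sc  <- lab + rnorm(200)                  # an informative score
auc(sc, lab)                             # well above 0.5 for a good classifier
```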

Step 1. Sample K observations from the Big Data.

Step 2. Estimate 1 - AUC on the training samples and the test samples by cross-validation.

Step 3. Increase the sampling number K from small to a sufficient size and repeat Step 2.

Step 4. Plot K (x-axis) vs. 1 - AUC (y-axis).
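The steps above can be sketched in base R for a continuous outcome, where 1 - proportion of explained variance plays the role of 1 - AUC. This is an illustration only: it uses a single train/test split rather than the cross-validation the package performs, and a fixed linear model.

```r
# Minimal K's plot sketch (base R, no KsPlot dependency).
set.seed(1)
n   <- 100000
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- 2 * x1 + x2^2 + rnorm(n)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

Ks  <- c(100, 200, 500, 1000, 2000, 5000)  # sample sizes to try
err <- sapply(Ks, function(K) {
  idx   <- sample(n, K)                    # Step 1: draw K observations
  half  <- sample(K, K %/% 2)              # 50/50 split instead of CV
  fit   <- lm(y ~ x1 + x2, data = dat[idx[half], ])   # Step 2: fit
  pred  <- predict(fit, newdata = dat[idx[-half], ])
  1 - cor(pred, dat$y[idx[-half]])^2       # 1 - explained variance
})                                         # Step 3: repeat over K

plot(Ks, err, type = "b", log = "x",       # Step 4: K vs. error plot
     xlab = "sample size K", ylab = "1 - explained variance")
```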

I have implemented K's plot in an R package named "KsPlot". A partial example of using KsPlot follows.

library(KsPlot)
set.seed(1)
x1 <- rnorm(1000000)
set.seed(2)
x2 <- rnorm(1000000)
set.seed(3)
y <- 2*x1 + x2**2 + rnorm(1000000)
X1 <- data.frame(x1 = x1, x2 = x2)
X2 <- data.frame(x1 = x1, x2 = x2, x3 = x2**2)
set.seed(1)
KsResult1 <- KsamplePlot(X1, y)
set.seed(1)
KsResult2 <- KsamplePlot(X1, y, Method = "svm")

This example data set has a continuous outcome y and two continuous explanatory variables x1 and x2. The number of observations is 1 million!! The relationship between y and x1 is linear, and the relationship between y and x2 is quadratic. If someone tries to fit a linear model and an SVM (support vector machine), the linear model takes a few seconds to compute but the SVM takes over a day!! So it is better to check the performance of the linear model and the SVM with K's plot first.
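Before committing to a full fit, one cheap sanity check is to time the model on increasing subsamples and extrapolate. The base-R sketch below does this for the linear model; the SVM timing claimed above would be checked the same way with an SVM implementation such as e1071's (an assumption here, since the text does not name the SVM backend).

```r
# Time lm on growing subsamples to gauge how cost scales with K.
set.seed(1)
n  <- 1000000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 + x2^2 + rnorm(n)

times <- sapply(c(10000, 100000, 1000000), function(K) {
  idx <- sample(n, K)
  system.time(lm(y ~ x1 + x2, subset = idx))["elapsed"]
})
times   # elapsed seconds per subsample size
```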

Result of Linear Model.

Result of SVM.

Comparing the two results, we can see that the SVM is better than the linear model in this case. At present, the following models are implemented (specified with the "Method=" option) for a continuous target variable: linear model (lm), Support Vector Machine (svm), Neural Network (nn), Random Forest (rf), Multivariate Adaptive Regression Splines (mars), Classification and Regression Tree (cart), and LASSO (lasso). For a binary target, lm and svm are implemented.

This method is part of my doctoral thesis, written in Japanese. We are now translating the thesis into English and will try to publish it in a statistics journal.

Hi! Thanks for the good article.

I think Step 2 can be executed in parallel on multiple cores or nodes, which would reduce the running time.

Thank you for your comment!

It's true that the computation time becomes shorter with parallel or distributed processing.

When the maximum K is 1,000, the total computation time is about a few seconds (lm) or 10 seconds (svm). So if we use parallel processing, the time may become less than one second.
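Since each K in Step 3 is evaluated independently, the loop is embarrassingly parallel. A base-R sketch with the `parallel` package (not the KsPlot implementation, and using a simple split rather than cross-validation):

```r
# Spread the per-K evaluations across cores with mclapply.
library(parallel)

set.seed(1)
n   <- 100000
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- 2 * x1 + x2^2 + rnorm(n)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

one_K <- function(K) {
  idx  <- sample(n, K)
  half <- sample(K, K %/% 2)
  fit  <- lm(y ~ x1 + x2, data = dat[idx[half], ])
  pred <- predict(fit, newdata = dat[idx[-half], ])
  1 - cor(pred, dat$y[idx[-half]])^2
}

Ks    <- c(100, 500, 1000, 5000)
# mclapply forking is unavailable on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1 else 2
err   <- unlist(mclapply(Ks, one_K, mc.cores = cores))
```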