Monday, October 3, 2011

Model Exploration using K-sample Plot in Big Data

Generally, the error rate of a model predicting a binary variable plateaus as the sample size increases. On the training data, the error rate rises from 0 toward the true error. On the test data, the error rate falls from 1 toward the true error. The same phenomenon occurs when predicting a continuous variable, with the error rate replaced by the proportion of explained variance. This figure clearly shows the concept (taken from the textbook "Data Mining and Statistics for Decision Making").



When we have to deal with Big Data, most machine learning methods may be unable to estimate their parameters because the computation takes too long. So I propose checking the performance of a model with the "K-sample Plot (K's plot)", explained below, when dealing with Big Data.

K's plot is a simple algorithm that plots 1 - AUC (or the proportion of explained variance) against the sample size. 1 - AUC is used instead of the error rate because AUC takes sensitivity and specificity into account simultaneously.
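
As an illustration of the quantity on the y-axis, 1 - AUC can be computed from ranks through the Mann-Whitney identity. The function name one_minus_auc and the toy scores below are made up for this sketch. It is not the code used inside KsPlot.

```r
# 1 - AUC from the rank-sum (Mann-Whitney) identity, base R only.
# score: numeric predictions, higher means "more likely 1"
# label: vector of 0/1 outcomes
one_minus_auc <- function(score, label) {
  r   <- rank(score)               # mid-ranks, so ties count as half
  n1  <- sum(label == 1)
  n0  <- sum(label == 0)
  auc <- (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  1 - auc
}

# Toy data (hypothetical): two negatives, two positives
one_minus_auc(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))
```

For these four toy observations the result is 0.25: three of the four positive/negative pairs are ordered correctly, so AUC = 0.75.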

Step 1. Sample K observations from the Big Data.
Step 2. Estimate 1 - AUC for the training samples and the test samples by cross-validation.
Step 3. Vary the sample size K from small to a sufficiently large size, repeating Step 2 each time.
Step 4. Plot K (x-axis) vs. 1 - AUC (y-axis).
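
The four steps can be sketched in base R for a continuous outcome, using a linear model and 1 - explained variance on the y-axis. This is only a rough sketch under my own assumptions (100,000 simulated rows as a stand-in for Big Data, 5 folds, the grid of K values, and the helper names r2 and k_sample). It is not the KsPlot implementation.

```r
set.seed(1)
n   <- 100000                      # stand-in for the big data (assumption)
x1  <- rnorm(n); x2 <- rnorm(n)
big <- data.frame(x1 = x1, x2 = x2, y = 2 * x1 + x2^2 + rnorm(n))

# Proportion of explained variance for observed vs. predicted values
r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Steps 1-2: draw K rows, then cross-validate a linear model on them
k_sample <- function(data, K, folds = 5) {
  d    <- data[sample(nrow(data), K), ]
  fold <- sample(rep(seq_len(folds), length.out = K))
  res  <- sapply(seq_len(folds), function(f) {
    fit <- lm(y ~ x1 + x2, data = d[fold != f, ])
    c(train = r2(d$y[fold != f], fitted(fit)),
      test  = r2(d$y[fold == f], predict(fit, d[fold == f, ])))
  })
  rowMeans(res)                    # average over the folds
}

# Step 3: repeat over a grid of sample sizes
Ks  <- c(20, 50, 100, 200, 500, 1000, 2000, 5000)
out <- t(sapply(Ks, function(K) k_sample(big, K)))

# Step 4: plot K against the unexplained share of variance
matplot(Ks, 1 - out, type = "b", log = "x", pch = 1:2, lty = 1:2, col = 1:2,
        xlab = "K (sample size)", ylab = "1 - explained variance")
legend("topright", legend = c("training", "test"),
       pch = 1:2, lty = 1:2, col = 1:2)
```

If the two curves have flattened and met well before the largest K, the model's performance on the full data can be judged from the subsample alone.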


I implemented K's plot in an R package named "KsPlot". A partial example of its use is shown below.

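library(KsPlot)

# Simulate 1 million observations: y is linear in x1 and quadratic in x2
set.seed(1)
x1 <- rnorm(1000000)
set.seed(2)
x2 <- rnorm(1000000)
set.seed(3)
y  <- 2*x1 + x2^2 + rnorm(1000000)

X1 <- data.frame(x1 = x1, x2 = x2)               # raw explanatory variables
X2 <- data.frame(x1 = x1, x2 = x2, x3 = x2^2)    # with the squared term added

# K's plot without a Method argument, then with Method="svm"
set.seed(1)
KsResult1 <- KsamplePlot(X1, y)
set.seed(1)
KsResult2 <- KsamplePlot(X1, y, Method="svm")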

This example data set has a continuous outcome y and two continuous explanatory variables, x1 and x2. It has 1 million observations! The relationship between y and x1 is linear, and the relationship between y and x2 is quadratic. If we fit a linear model and an SVM (support vector machine) to the full data, the linear model takes a few seconds but the SVM takes over a day. So it is better to check the performance of the linear model and the SVM using K's plot.
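
The quadratic dependence on x2 is something a straight-line fit on x1 and x2 cannot capture. A minimal base-R sketch on a 10,000-row sample (the size and the seed are chosen for illustration, not taken from the example above) compares the explained variance with and without the squared term.

```r
set.seed(1)
n  <- 10000                              # illustrative sample size (assumption)
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 * x1 + x2^2 + rnorm(n)

fit_lin  <- lm(y ~ x1 + x2)              # misses the curvature in x2
fit_quad <- lm(y ~ x1 + x2 + I(x2^2))    # includes the squared term

r2_lin  <- summary(fit_lin)$r.squared
r2_quad <- summary(fit_quad)$r.squared
c(linear = r2_lin, quadratic = r2_quad)
```

The variance of y splits into 4 (from 2*x1), 2 (from x2^2) and 1 (noise), so the linear fit can explain at most about 4/7 of it, while a model that captures the quadratic part can reach about 6/7.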

Result of Linear Model.


Result of SVM.

Comparing the two results, we see that the SVM is better than the linear model in this case. At present the following models are implemented (specified with the "Method=" option) for a continuous target variable: linear model (lm), Support Vector Machine (svm), Neural Network (nn), Random Forest (rf), Multivariate Adaptive Regression Splines (mars), Classification and Regression Tree (cart) and LASSO (lasso). For a binary target, lm and svm are implemented.

This method is part of my doctoral thesis, which is written in Japanese. We are now translating the thesis into English and will try to publish it in a statistical journal.

Thursday, August 11, 2011

Analysis of Japanese Earthquakes Data

Latest version is here.

Earthquakes have drawn much more attention since the Touhoku earthquake occurred on March 11, 2011. Earthquake data are released on the Japan Weather Association's Tenki.jp. The following variables are available from the site.


  • Date, Time
  • Area
  • Lat/Lon
  • Depth
  • Magnitude


The information is currently available from August 23, 2008. There were 7,392 earthquakes in the roughly three years up to August 5, 2011. Almost all of them happened in Japan. The frequencies of earthquakes in each year are summarized below.


Year                  Freq
August 23, 2008-       639
2009                 1,400
2010                 1,265
-August 5, 2011      4,088

The annual frequency is normally about 1,200 to 1,400, but this year more than 4,000 earthquakes have already happened.
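
Because the first and last periods cover only part of a year, a per-day rate makes the comparison fairer. The sketch below uses only the counts and dates quoted in this post. The data frame and its column names are made up for illustration.

```r
# Counts and period boundaries as quoted in the post
freq <- data.frame(
  period = c("2008 (from Aug 23)", "2009", "2010", "2011 (to Aug 5)"),
  start  = as.Date(c("2008-08-23", "2009-01-01", "2010-01-01", "2011-01-01")),
  end    = as.Date(c("2008-12-31", "2009-12-31", "2010-12-31", "2011-08-05")),
  n      = c(639, 1400, 1265, 4088)
)

freq$days    <- as.numeric(freq$end - freq$start) + 1   # inclusive day count
freq$per_day <- round(freq$n / freq$days, 2)            # earthquakes per day
freq[, c("period", "n", "days", "per_day")]

sum(freq$n)   # 7392, the total quoted above
```

On this basis 2009 and 2010 run at fewer than 4 earthquakes per day, while 2011 up to August 5 runs at nearly 19 per day.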

Next, I show the daily frequencies of earthquakes over time, broken down by magnitude.


Touhoku Area (including Fukushima)




The x-axis is the day, the y-axis is the frequency of earthquakes, and the color shows the magnitude. The browner the color, the smaller the earthquake. "1-11" on the x-axis means January 11.

The following features can be seen.


  • ~March 8: not many earthquakes
  • March 9, 10: large earthquakes observed (one M7, five M6s)
  • March 11: Touhoku earthquake
  • March 12~: the number of earthquakes gradually decreases
  • April 11~: an M7 earthquake occurred
  • end of April to beginning of June: about 20 earthquakes per day on average
  • June to beginning of August: about 10~20 earthquakes per day on average


Earthquakes seemed to calm down after the Touhoku earthquake, but many earthquakes were observed again after April 11. The number of earthquakes seems to be a little lower since June, but it does not seem to have returned to the usual level yet.


Kanto Area (including Tokyo)



This shows a trend similar to the results for Touhoku. About 5~10 earthquakes per day on average have occurred since June.


Japan



Frequencies of earthquakes increased in the Kanto, Touhoku and Chubu (including Nagoya) areas.

Many earthquakes still seem to be occurring.


Example R code is below.

library(reshape)
library(ggplot2)

# Read the earthquake data and count events by date, integer magnitude and area
eq      <- read.csv("http://ianalysis.jp/eq_en.csv", as.is=TRUE)
eqFreq1 <- table(eq$date2, trunc(eq$M), eq$area2)
eqFreq2 <- melt(eqFreq1)
names(eqFreq2) <- c("date", "M", "area", "freq")
eqFreq2$date   <- as.POSIXct(as.character(eqFreq2$date), format="%Y-%m-%d")
eqFreq2$M      <- factor(eqFreq2$M)

# Daily stacked bars for the Touhoku area, colored by magnitude
ggplot(eqFreq2[eqFreq2$area=="Touhoku",], aes(date, weight=freq, fill=M)) +
 geom_histogram(binwidth=60*60*24) + ggtitle("Touhoku Area") +
 scale_fill_brewer(type="div") + xlab("Date") + ylab("Frequency")