Last active
July 11, 2019 07:27
```{r first separate data into folds before choosing features}
# Train the model on the training folds and evaluate it on the left-out fold
accuracy.rates <- c()
for (itr in seq_len(folds)){
  # Note: since the input features are already random, there is no need to
  # shuffle the data before creating folds. Ideally, though, examples should
  # be shuffled first to remove any recording/data-collection bias
  test_ind <- seq(from = fold.size*(itr-1)+1, to = fold.size*itr, by = 1)
  # Named train.data so the data frame does not shadow the train() function
  train.data <- data[-test_ind, ]
  # After setting aside the left-out fold, select the best subset of features
  selected.data.train <- best.subset(train.data)
  # Pick the same features in the test set so that the model can predict the output
  test <- data[test_ind, colnames(selected.data.train)]
  model <- train(y ~ ., data = selected.data.train, method = 'naive_bayes')
  # Note: no cross-validation while training. Cross-validation is performed
  # by the outer for loop.
  # drop = FALSE keeps a data frame even if only one feature was selected
  test$yhat <- predict(model, newdata = test[, names(test) != "y", drop = FALSE])
  accuracy <- mean(test$y == test$yhat)
  accuracy.rates <- c(accuracy.rates, accuracy)
}
sprintf('Classification accuracy when CV is performed before subset selection = %0.0f %%',
        100*mean(accuracy.rates))
```
[1] "Classification accuracy when CV is performed before subset selection = 46 %"
This accuracy is much closer to what we would expect, which is
roughly chance level. After all, our chosen response variable
(a coin flip) was a random process completely unrelated to the
predictor variables.
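
The effect can be reproduced without any of the objects used above. The following is only a minimal sketch in base R, not the original analysis: `top.features()` and `nearest.centroid()` are hypothetical stand-ins for `best.subset()` and the naive Bayes model, and the sample sizes are arbitrary assumptions. It compares selecting features once on all the data against selecting them inside each fold, on a response that is pure noise.

```r
set.seed(1)
n <- 100; p <- 1000; k <- 10; folds <- 5   # arbitrary sizes chosen for illustration
x <- matrix(rnorm(n * p), nrow = n)        # random predictors
y <- rbinom(n, 1, 0.5)                     # coin-flip response, unrelated to x

# Hypothetical stand-in for best.subset(): indices of the k columns
# with the largest absolute correlation with y
top.features <- function(x, y, k) {
  order(abs(cor(x, y)), decreasing = TRUE)[seq_len(k)]
}

# Hypothetical stand-in for the classifier: predict the class whose
# training centroid is closer in squared Euclidean distance
nearest.centroid <- function(x.train, y.train, x.test) {
  mu0 <- colMeans(x.train[y.train == 0, , drop = FALSE])
  mu1 <- colMeans(x.train[y.train == 1, , drop = FALSE])
  d0 <- rowSums(sweep(x.test, 2, mu0)^2)
  d1 <- rowSums(sweep(x.test, 2, mu1)^2)
  as.integer(d1 < d0)
}

# Shuffled fold labels, so folds do not depend on row order
fold.id <- sample(rep(seq_len(folds), length.out = n))

cv.accuracy <- function(select.inside) {
  global.cols <- top.features(x, y, k)     # selection that has seen every row
  acc <- numeric(folds)
  for (f in seq_len(folds)) {
    test.ind <- which(fold.id == f)
    cols <- if (select.inside) {
      top.features(x[-test.ind, ], y[-test.ind], k)  # training rows only
    } else {
      global.cols
    }
    pred <- nearest.centroid(x[-test.ind, cols, drop = FALSE], y[-test.ind],
                             x[test.ind, cols, drop = FALSE])
    acc[f] <- mean(pred == y[test.ind])
  }
  mean(acc)
}

acc.inside  <- cv.accuracy(select.inside = TRUE)   # selection inside each fold
acc.outside <- cv.accuracy(select.inside = FALSE)  # selection before folding (leaky)
print(c(inside = acc.inside, outside = acc.outside))
```

Under these assumptions, the estimate with selection inside the folds should hover near chance, while the leaky variant should look noticeably better even though there is nothing to learn, because the selected columns were chosen using the labels of the rows later used for testing.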