The data provided includes dates from November 2007 to late June 2017. The dataset includes 28,003 observations of 20 variables, covering factors such as wind speed, humidity, temperature, and cloud cover. Most variables are missing a number of values, so we’ll need to sanitize the data before proceeding. Phase one of this project will be preparing the data for analysis by eliminating or filling those missing values, then identifying potentially strong predictors of the “RainTomorrow” variable.

With over 28,000 observations, it’s not too surprising that most variables are missing at least a few values. The total missing ranges widely, from 64 missing “MaxTemp” values to over 11,341 missing cloud cover values. Cloud9am and Cloud3pm suffer the most from missing data, with roughly 40% of values missing. If we were to remove all observations with a missing value, our dataset would shrink by half to 13,887 observations.

The dataset contains both categorical and quantitative variables, both of which have missing values. For simplicity, we’ll opt to eliminate any observations with missing categorical values. The following code checks for any missing, or “NA”, values in our categorical variables and drops the entire observation when it finds one.

    rain = rain %>% drop_na(WindGustDir, WindDir9am, WindDir3pm, RainToday, RainTomorrow)

Doing so reduces our dataset to 24,351 observations.

The next section of code imputes missing values using predictive mean matching, which fits a regression model to the known data, then fills each missing value with an observed value from a case with a similar predicted value.

    rain_imp = mice(rain, m = 1, method = "pmm", seed = 12345)

With this code executed, all missing values in the quantitative variables have been replaced with an estimate based on the observed data. In the accompanying plot, the imputed data (red) fits well with the observed values (blue). We’ll merge the imputed values with the original data to create a completed dataset.

To predict RainTomorrow, we first used a pruned classification tree. Predicting RainTomorrow on the training set resulted in an accuracy of about 0.828. Compared to the naïve model accuracy of 0.78, and considering a p-value of nearly 0, this appeared to be a good-quality model. We then made predictions on the testing dataset, which resulted in an accuracy of about 0.824. With both predictions showing greater accuracy than the naïve model, the classification tree may be viable for predicting RainTomorrow.

Improving on the classification tree method, we then built a random forest with 10-fold cross-validation and 50 trees. Predictions on the training set with our random forest resulted in an accuracy of 0.9991, almost 100%. While the p-value is ideal at virtually 0, the fact that our model had near-perfect predictions raises concerns. In the real world, perfectly accurate models don’t exist. We then tested predictions on the testing set, which resulted in an accuracy of only 0.84 with a similarly significant p-value. This gap indicated that our random forest was overfit to the training set and would not be a strong model for future predictions.
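The missing-value audit described in this post (counting missing values per variable, then comparing "drop every incomplete row" against "drop only rows with missing categorical values") can be sketched in base R. The data frame below is a made-up toy stand-in that borrows a few column names from the post; the values and the name rain_demo are assumptions for illustration, not the real data.

```r
# Toy stand-in for the rain data: column names follow the post, values are invented.
rain_demo <- data.frame(
  MaxTemp      = c(21.5, NA, 18.2, 25.1, 19.9, 23.4),
  Cloud9am     = c(NA, 7, NA, 1, NA, 8),
  WindGustDir  = c("N", "SW", NA, "E", "W", "N"),
  RainTomorrow = c("No", "Yes", "No", NA, "No", "Yes"),
  stringsAsFactors = FALSE
)

# Missing values per variable, as a count and as a share of all rows.
missing_count <- colSums(is.na(rain_demo))
missing_share <- round(missing_count / nrow(rain_demo), 2)
print(data.frame(missing_count, missing_share))

# Rows that would survive dropping every observation with any missing value.
complete_rows <- sum(complete.cases(rain_demo))

# Rows that survive dropping only observations with a missing categorical value
# (a base R counterpart to calling drop_na on the categorical columns).
categorical <- c("WindGustDir", "RainTomorrow")
rain_kept <- rain_demo[complete.cases(rain_demo[, categorical]), ]

cat("fully complete rows:", complete_rows, "\n")
cat("rows kept after categorical filter:", nrow(rain_kept), "\n")
```

Running the same three summaries on the real data is what motivates the post's choice: dropping on categorical columns alone keeps far more rows than dropping every incomplete observation.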
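The post mentions merging the imputed values back into the original data but does not show the code for that step. One way to do it, assuming the mice package is installed, is the package's complete() function, which returns the original data frame with the missing cells filled in from the imputation object. The sketch below uses nhanes, a small example dataset that ships with mice, as a stand-in for the rain data; the names imp and completed are illustrative.

```r
library(mice)

# nhanes ships with mice and has missing values in bmi, hyp, and chl.
# It stands in for the post's rain data, which is not reproduced here.
imp <- mice(nhanes, m = 1, method = "pmm", seed = 12345, printFlag = FALSE)

# complete() fills the missing cells of the original data with the imputed values.
completed <- complete(imp)

# Compare missing counts before and after imputation.
print(colSums(is.na(nhanes)))
print(colSums(is.na(completed)))
```

With m = 1 there is only a single imputed dataset, so complete(imp) returns that one; with more imputations an index would select which one to extract.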
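The post compares each model's accuracy against a naïve model and quotes a p-value, without saying how either was computed. A common approach, and an assumption here, is to take the naïve accuracy as the share of the majority class and to run a one-sided binomial test of whether the model's accuracy exceeds it. The confusion counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the post's results.

```r
# Hypothetical confusion counts for illustration only.
# Rows are predictions, columns are the true RainTomorrow labels.
conf <- matrix(c(730, 120,
                  50, 100),
               nrow = 2, byrow = TRUE,
               dimnames = list(predicted = c("No", "Yes"),
                               actual    = c("No", "Yes")))

n        <- sum(conf)
correct  <- sum(diag(conf))          # predictions that match the true label
accuracy <- correct / n

# Naive model: always predict the most common actual class.
naive_accuracy <- max(colSums(conf)) / n

# One-sided binomial test: is the model's accuracy greater than the naive rate?
p_value <- binom.test(correct, n, p = naive_accuracy,
                      alternative = "greater")$p.value

cat("accuracy:      ", accuracy, "\n")
cat("naive accuracy:", naive_accuracy, "\n")
cat("p-value:       ", p_value, "\n")
```

A small p-value here says only that the model beats always guessing the majority class; it says nothing about whether the model generalizes, which is why the training and testing accuracies still need to be compared.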
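The overfitting check applied to the random forest (near-perfect training accuracy against a noticeably lower testing accuracy) can be reproduced in outline. The post does not name its modelling package, so the sketch below assumes caret with the randomForest backend, and it uses the built-in iris data as a stand-in, so the numbers it prints will not match the post's. The 70/30 split and the names train_set, test_set, and fit are also assumptions.

```r
library(caret)  # assumption: the post does not say which package it used

set.seed(12345)

# iris is a stand-in dataset; the post's rain data is not reproduced here.
in_train  <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_set <- iris[in_train, ]
test_set  <- iris[-in_train, ]

# Random forest tuned with 10-fold cross-validation, 50 trees per forest.
fit <- train(
  Species ~ ., data = train_set,
  method    = "rf",
  trControl = trainControl(method = "cv", number = 10),
  ntree     = 50
)

# Accuracy on the data the model was fit to versus data it has never seen.
train_acc <- mean(predict(fit, newdata = train_set) == train_set$Species)
test_acc  <- mean(predict(fit, newdata = test_set) == test_set$Species)

cat("train accuracy:", train_acc, "\n")
cat("test accuracy: ", test_acc, "\n")
```

A large gap between the two printed accuracies is the overfitting signal discussed in the post. The cross-validated accuracy stored in fit$results is usually a less optimistic estimate than the training accuracy and is worth reading alongside both.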