You are working as a Business Data Analyst for the government of Washington, D.C. The city currently has a bike sharing system in which people can rent a bike at one location and return it at another. You are given historical usage patterns with weather data in the CSV file bike.csv. You are asked to forecast bike rental demand in the Capital Bikeshare program.
The data comes from Kaggle: https://www.kaggle.com/c/bike-sharing-demand
This dataset contains the following data fields:
datetime – hourly date + timestamp
season – 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday – whether the day is considered a holiday
workingday – whether the day is neither a weekend nor holiday
weather – weather condition, coded as:
  1: Clear, Few clouds, Partly cloudy
  2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp – temperature in Celsius
atemp – “feels like” temperature in Celsius
humidity – relative humidity
windspeed – wind speed
casual – number of non-registered user rentals initiated
registered – number of registered user rentals initiated
count – number of total rentals
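Because season and weather are stored as integer codes, it can help to recode them as labelled factors before modelling. The following is a minimal sketch in R, not a required step: the helper name prepare_bike and the short weather labels are hypothetical, and it assumes the column names in bike.csv match the field list above.

```r
# Hypothetical helper: recode the integer-coded fields as labelled factors.
# Assumes columns named 'season' and 'weather' with codes 1 to 4.
prepare_bike <- function(df) {
  df$season <- factor(df$season, levels = 1:4,
                      labels = c("spring", "summer", "fall", "winter"))
  # Short labels are illustrative stand-ins for the four weather categories.
  df$weather <- factor(df$weather, levels = 1:4,
                       labels = c("clear", "mist", "light_precip", "heavy_precip"))
  df
}

# Only runs when the file is present in the working directory (an assumption).
if (file.exists("bike.csv")) {
  bike <- prepare_bike(read.csv("bike.csv"))
  str(bike)
}
```

Leaving the codes as integers also works with tree() and randomForest(); factors simply make the splits and plots easier to read.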
You are asked to perform the following tasks by writing a script in R. Submit both the R code and a Word document.
- Load the dataset bike.csv into memory. Then split the data into a training set containing 2/3 of the original data and a test set containing the remaining 1/3.
- Build a tree model using the function tree().
  a) The response is count and the predictors are season, holiday, workingday, temp, atemp, humidity, windspeed, casual, and registered.
  b) Perform cross-validation to choose the best tree by calling cv.tree().
  c) Plot the cross-validation results of b) and determine the best tree size.
  d) Prune the tree by calling the prune.tree() function with the best size found in c).
  e) Plot the best tree model.
  f) Compute the test error using the test data set.
- Build a random forest model using the function randomForest().
  a) The response is count and the predictors are season, holiday, workingday, temp, atemp, humidity, windspeed, casual, and registered.
  b) Compute the test error using the test data set.
  c) Extract the variable importance measures using the importance() function.
  d) Plot the variable importance using the function varImpPlot(). Which are the two most important predictors in this model?
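The tasks above can be sketched in R roughly as follows. This is one possible outline under stated assumptions, not a model answer: the helpers split_train_test and mse are hypothetical names, the seed is arbitrary, the test error is taken to be mean squared error (the tasks do not name a metric), and the tree and randomForest packages are assumed to be installed.

```r
# Sketch only. Assumes the 'tree' and 'randomForest' packages are installed
# and that bike.csv is in the working directory.

# Hypothetical helper: random split, 2/3 training and 1/3 test by default.
split_train_test <- function(df, frac = 2 / 3, seed = 1) {
  set.seed(seed)  # arbitrary seed, only for reproducibility
  idx <- sample(nrow(df), size = round(frac * nrow(df)))
  list(train = df[idx, , drop = FALSE], test = df[-idx, , drop = FALSE])
}

# Hypothetical helper: test error as mean squared error (assumed metric).
mse <- function(actual, predicted) mean((actual - predicted)^2)

if (file.exists("bike.csv")) {
  library(tree)
  library(randomForest)

  bike  <- read.csv("bike.csv")
  parts <- split_train_test(bike)
  train <- parts$train
  test  <- parts$test

  # Response and predictors as listed in the tasks.
  bike_formula <- count ~ season + holiday + workingday + temp + atemp +
    humidity + windspeed + casual + registered

  # Tree: fit, cross-validate, prune to the size with the lowest CV deviance.
  tree_fit <- tree(bike_formula, data = train)
  cv_fit   <- cv.tree(tree_fit)
  plot(cv_fit$size, cv_fit$dev, type = "b",
       xlab = "Tree size (terminal nodes)", ylab = "CV deviance")
  best_size <- cv_fit$size[which.min(cv_fit$dev)]
  pruned    <- prune.tree(tree_fit, best = best_size)
  plot(pruned)
  text(pruned, pretty = 0)
  tree_mse <- mse(test$count, predict(pruned, newdata = test))

  # Random forest with default settings, keeping the importance measures.
  rf_fit <- randomForest(bike_formula, data = train, importance = TRUE)
  rf_mse <- mse(test$count, predict(rf_fit, newdata = test))
  print(c(tree = tree_mse, random_forest = rf_mse))
  print(importance(rf_fit))
  varImpPlot(rf_fit)
}
```

Picking the size with which.min is just one simple rule; reading the cross-validation plot by eye and choosing a smaller tree with nearly the same deviance is an equally reasonable choice.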