working as a Business Data Analyst

You are working as a Business Data Analyst for the government of Washington, D.C. The city currently has a bike-sharing system: people can rent a bike at one location and return it at a different one. You are given historical usage patterns with weather data in the CSV file bike.csv. You are asked to forecast bike rental demand in the Capital Bikeshare program.

The data source is from Kaggle at https://www.kaggle.com/c/bike-sharing-demand.

This dataset contains the following data fields:

datetime – hourly date + timestamp

season – 1 = spring, 2 = summer, 3 = fall, 4 = winter

holiday – whether the day is considered a holiday

workingday – whether the day is neither a weekend nor holiday

weather – weather condition code:
  1: Clear, Few clouds, Partly cloudy
  2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

temp – temperature in Celsius

atemp – “feels like” temperature in Celsius

humidity – relative humidity

windspeed – wind speed

casual – number of non-registered user rentals initiated

registered – number of registered user rentals initiated

count – number of total rentals
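
A quick check that a loaded file matches the field list above might look like the following sketch. The helper names missing_fields and parse_datetime are hypothetical, and the timestamp format is an assumption about how bike.csv stores the datetime column.

```r
# Column names as given in the data-field list above.
expected_fields <- c("datetime", "season", "holiday", "workingday", "weather",
                     "temp", "atemp", "humidity", "windspeed",
                     "casual", "registered", "count")

# Hypothetical helper: which expected fields are absent from a data frame?
missing_fields <- function(df) setdiff(expected_fields, names(df))

# Hypothetical helper: convert the datetime column to POSIXct.
# ASSUMPTION: timestamps look like "2011-01-01 00:00:00"; adjust the
# format string if the file uses a different layout.
parse_datetime <- function(df) {
  df$datetime <- as.POSIXct(as.character(df$datetime),
                            format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  df
}

# Run the check only when the file is present in the working directory.
if (file.exists("bike.csv")) {
  bike <- parse_datetime(read.csv("bike.csv"))
  print(missing_fields(bike))  # character(0) means every field was found
  str(bike)
}
```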

You are asked to perform the following tasks by writing a script in R. Submit both the R code and a Word document.

1. Load the dataset bike.csv into memory. Then split the data into a training set containing 2/3 of the original data and a test set containing the remaining 1/3.
2. Build a tree model using the function tree().
   a) The response is count and the predictors are season, holiday, workingday, temp, atemp, humidity, windspeed, casual, and registered.
   b) Perform cross-validation to choose the best tree by calling cv.tree().
   c) Plot the results of b) and determine the best size for the tree.
   d) Prune the tree by calling prune.tree() with the best size found in c).
   e) Plot the best tree model.
   f) Compute the test error using the test data set.
3. Build a random forest model using the function randomForest().
   a) The response is count and the predictors are season, holiday, workingday, temp, atemp, humidity, windspeed, casual, and registered.
   b) Compute the test error using the test data set.
   c) Extract the variable importance measures using the importance() function.
   d) Plot the variable importance using the function varImpPlot(). Which are the top two most important predictors in this model?
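
The tasks above could be organized along the lines of the following sketch. It is one possible layout, not a reference solution: the helper names split_data and mse are hypothetical, the seed is arbitrary, mean squared error is assumed as the measure of test error, and the modelling steps run only if bike.csv and the tree and randomForest packages are available.

```r
# Response and predictors as listed in the tree and random forest tasks.
bike_formula <- count ~ season + holiday + workingday + temp + atemp +
  humidity + windspeed + casual + registered

# Hypothetical helper: random 2/3 training, 1/3 test split.
# The seed is arbitrary; it only makes the split reproducible.
split_data <- function(df, seed = 1) {
  set.seed(seed)
  n <- nrow(df)
  train_idx <- sample(n, size = floor(2 * n / 3))
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Hypothetical helper: test error as mean squared error (an assumption;
# the assignment does not name a specific error measure).
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Run the modelling steps only when the data file and packages exist.
if (file.exists("bike.csv") &&
    requireNamespace("tree", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  library(tree)
  library(randomForest)

  parts <- split_data(read.csv("bike.csv"))
  train <- parts$train
  test  <- parts$test

  # Tree model: fit, cross-validate, choose a size, prune, evaluate.
  fit    <- tree(bike_formula, data = train)
  cv_fit <- cv.tree(fit)  # uses prune.tree and 10 folds by default
  plot(cv_fit$size, cv_fit$dev, type = "b",
       xlab = "Tree size (terminal nodes)", ylab = "CV deviance")
  best_size <- cv_fit$size[which.min(cv_fit$dev)]
  pruned <- prune.tree(fit, best = best_size)  # assumes best_size > 1
  plot(pruned)
  text(pruned, pretty = 0)
  tree_mse <- mse(test$count, predict(pruned, newdata = test))

  # Random forest: fit with importance measures, evaluate, inspect.
  rf     <- randomForest(bike_formula, data = train, importance = TRUE)
  rf_mse <- mse(test$count, predict(rf, newdata = test))
  print(c(tree = tree_mse, forest = rf_mse))
  print(importance(rf))
  varImpPlot(rf)
}
```

Because the split is random, the selected tree size and both test errors will vary with the seed, so the Word document should state which seed was used.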