Excel Data Analysis Capstone Project
Overview
This project requires you to use the tools learned throughout this portion of the course to build a model for a real-world situation: predicting the success of NBA teams.
A tragedy has happened and the world has lost all copies of the Win-Loss data for the 2018-2019 NBA season. All other stats remain, but we can no longer trust the Win and Loss counts. The only solution the world has is to use your advanced analysis skills to predict what those win totals should be. These win predictions will be written into the history books, so we need to be accurate!
Note: You do not need to have an advanced, or even a basic, knowledge of basketball or the NBA. This entire project can be completed without learning what all these statistics mean in relation to an actual game, just as some of the class examples (wine quality, insurance) have inputs that we don’t fully understand. DO NOT spend your time trying to learn about basketball, or trying to apply basketball knowledge you do have to the assignment. It will not help; use the techniques we’ve talked about in class instead.
Part 1: Collect and Prepare Data
Note: Most of this should already be complete from the chapter 5 assignment. Start with that sheet and make the edits indicated below.
For this model you will need to prepare two sets of data. The easiest way is to put each on its own sheet in the same workbook:
- Source Data – 5 (or more) seasons of data, from 2017-2018 and back. This will have all statistics from the table (including wins and playoff status), but with the win-proxy stats removed.
- Note: This model must not contain any win-like statistics such as losses, winning percentage, Pythagorean wins/losses, margin of victory, or SRS. Once you’ve created the sheet you must remove the win-proxy stats; some of those are L, PW, PL, MOV, and SRS. There may be others if you’ve added optional data to your model. Double-check this, as including any win/loss statistics will ruin the accuracy of your model.
- Subject Data – The most recent season (2018-2019). This will have the same set of statistics as the source data, except that the wins and playoff columns should be blank; these are what we are predicting. The classification model will use this data to predict the playoff status, and the prediction model will use it to predict the number of wins.
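If you want to sanity-check your two sheets outside Excel, the cleanup above can be sketched in a few lines of Python. This is only an illustration, not a required step: the column headers ("Season", "Team", "W", "Playoffs", "FG%") are assumed names, and the rows are made-up placeholders rather than real statistics.

```python
# Assumed column names: the win-proxy abbreviations listed in the note above,
# plus hypothetical "Season", "Team", "W" and "Playoffs" headers.
WIN_PROXY_COLUMNS = {"L", "PW", "PL", "MOV", "SRS"}
TARGET_COLUMNS = {"W", "Playoffs"}
SUBJECT_SEASON = "2018-2019"


def drop_win_proxies(rows):
    """Return copies of the rows with every win-proxy column removed."""
    return [
        {name: value for name, value in row.items() if name not in WIN_PROXY_COLUMNS}
        for row in rows
    ]


def split_source_subject(rows):
    """Split cleaned rows into source data and subject data with blank targets."""
    cleaned = drop_win_proxies(rows)
    source = [row for row in cleaned if row["Season"] != SUBJECT_SEASON]
    subject = [
        {name: (None if name in TARGET_COLUMNS else value) for name, value in row.items()}
        for row in cleaned
        if row["Season"] == SUBJECT_SEASON
    ]
    return source, subject


# Made-up placeholder rows, not real NBA statistics.
rows = [
    {"Season": "2017-2018", "Team": "Team A", "W": 50, "L": 32, "MOV": 3.1, "FG%": 0.47, "Playoffs": "Y"},
    {"Season": "2018-2019", "Team": "Team A", "W": 48, "L": 34, "MOV": 2.0, "FG%": 0.46, "Playoffs": "Y"},
]
source, subject = split_source_subject(rows)
print(source)
print(subject)
```

If you added optional data to your sheet, extend the set of win-proxy columns to cover any extra win-like statistics it brought in.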
Creating a Predictive Model
Once your data is prepared you can begin creating your predictive model. Any and all of the tools we looked at in the course are available to you. As you build your model, you may find that data needs to be added to or removed from your initial worksheet. You may also choose to use other techniques such as normalization and partitioning to create a more accurate model.
As you go through this process you must take note of what method you are using, what changes you make to the model data, and why you are making those decisions. You will need to present both your model and your reasoning: why you built it as you did, and why it is superior to the alternatives that proved to be less accurate. The process of developing your model is the most important part of this project, so ensure you are making logical improvements and documenting the reasoning and impact of each one.
Note: Use the source data to build a predictive model targeted at predicting number of wins, then use that model to predict the number of wins on the subject data.
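The workflow described above (partition the source data, normalize, fit, validate, then predict on the subject data) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method you are required to use: it fits a one-input least-squares line, the data pairs are made up, and every function name is hypothetical. Your Excel model will normally use many inputs.

```python
import math
import random


def zscore_params(values):
    """Return the mean and population standard deviation of the values."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, sd


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up (stat, wins) pairs standing in for the source data.
source = [(44.1, 30), (44.8, 35), (45.2, 37), (45.9, 42),
          (46.3, 44), (46.8, 49), (47.4, 52), (48.0, 57)]

# Partition: hold back part of the source data to validate the model.
shuffled = source[:]
random.Random(0).shuffle(shuffled)
train, valid = shuffled[:6], shuffled[6:]

# Normalize using statistics from the training partition only.
mean, sd = zscore_params([x for x, _ in train])


def normalize(x):
    return (x - mean) / sd


intercept, slope = fit_line([normalize(x) for x, _ in train], [w for _, w in train])


def predict(x):
    """Predict wins for a raw (un-normalized) stat value."""
    return intercept + slope * normalize(x)


valid_rmse = rmse([w for _, w in valid], [predict(x) for x, _ in valid])
print(f"validation RMSE: {valid_rmse:.2f} wins")

# Apply the fitted model to a subject-data row (hypothetical stat value).
print(f"predicted wins: {predict(46.0):.1f}")
```

Comparing the validation error before and after each change to the model is one way to document whether that change was an improvement.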
Creating a Classification Model
In this step you must create a classification model that uses your source data to predict if teams will be in the playoffs or not. Follow the same process as the prediction model, using classification tools to split the teams into the two groups.
Note: Use the source data to build a classification model targeted on the playoff status, then use that model to predict the playoff status on the subject data.
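As a rough illustration of what a classification model does with the source data, here is a small Python sketch. It uses k-nearest neighbours purely as a stand-in; the brief does not prescribe a particular classification tool, so use whichever one from class you can justify. The feature values (two already-normalized stats per team) and the "Y"/"N" labels are made-up assumptions.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label the query by majority vote among its k nearest training rows.

    train is a list of (features, label) pairs; query is a feature tuple.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up source rows: (two normalized stats, playoff status).
train = [
    ((1.0, 0.9), "Y"), ((0.8, 1.1), "Y"), ((1.2, 0.7), "Y"),
    ((-1.0, -0.8), "N"), ((-0.9, -1.2), "N"), ((-1.1, -0.6), "N"),
]

# Hypothetical subject rows whose playoff status is blank.
subject = {"Team A": (0.9, 0.9), "Team B": (-1.0, -1.0)}

for team, features in subject.items():
    print(team, knn_classify(train, features))
```

With an odd k and two classes the vote cannot tie, which is why k=3 is used as the default here.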
Comparing the Results
Use the template to insert a copy of your final predictions: the predicted number of wins and the predicted playoff status. You must insert the data into the template exactly as shown, as this will be used to calculate your accuracy. Sort the teams alphabetically, then insert the wins and playoff status in their respective columns.
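If your predictions end up outside the spreadsheet, a short Python sketch like the one below can put them in alphabetical order before you paste them in. The team names and values are placeholders, and the three-column layout is an assumption; always follow the actual template.

```python
# Hypothetical predictions keyed by team name: (predicted wins, playoff status).
predictions = {
    "Team C": (38.2, "N"),
    "Team A": (51.7, "Y"),
    "Team B": (44.9, "Y"),
}


def template_rows(predictions):
    """Return (team, wins, playoff) rows sorted alphabetically by team name."""
    ordered = sorted(predictions.items(), key=lambda item: item[0].casefold())
    return [(team, wins, playoff) for team, (wins, playoff) in ordered]


rows = template_rows(predictions)
for team, wins, playoff in rows:
    # Tab-separated so each value lands in its own column when pasted.
    print(f"{team}\t{wins}\t{playoff}")
```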
Grading
Deliverables
Note: If you’re unsure or unclear on what you’re being asked to complete, please ask early. There will be substantial time devoted to this project and it is important that you progress down the correct path to excel.
The deliverables for this project are:
- A paper that contains the results of your modeling exercise, the explanation of what you did, and the reasoning behind why your model is the best.
- Your spreadsheet.
- Your results pasted into the accuracy template.
Rubric
Item | Weighting | Notes |
Spreadsheet Construction | 5% | |
Predictive Model | 5% | |
Classification Model | 5% | |
Report | 60% | |
Explanation of Process | 20% | |
Justification of Model Choice | 25% | |
Presentation of Results | 10% | |
Possible Improvements | 5% | |
Accuracy of Model | 10% | Scaled mark based on relative accuracy. |
Quality of Presentation | 15% | Subjective judgement on how well your methods are presented. |