CSCE822 Homework1

The attached melb_data.csv file is a snapshot of Tony Pino’s Melbourne Housing Dataset. Perform the
following data preprocessing steps, then apply the KNN and RandomForest algorithms to classify the
property prices.
1. Fill in the missing values in the dataset using the imputation approaches we discussed in class.
You can use scikit-learn’s SimpleImputer class:
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)
By default, the imputer fills missing values with the column mean. You can try other imputation
methods as well.
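As a minimal sketch of trying other imputation methods, the snippet below runs SimpleImputer with three of its built-in strategies on a small made-up array. The toy values are an assumption for illustration only; in the homework, original_data would be the numeric columns of melb_data.csv.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric data with missing entries (NaN). These values are made up
# and only stand in for the numeric columns of melb_data.csv.
original_data = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [9.0, 50.0],
])

# Fit one imputer per strategy and keep the results for comparison.
imputed = {}
for strategy in ("mean", "median", "most_frequent"):
    my_imputer = SimpleImputer(strategy=strategy)
    imputed[strategy] = my_imputer.fit_transform(original_data)

# The missing entry in the first column is filled differently by each strategy.
for strategy, values in imputed.items():
    print(strategy, values[2, 0])
```

Each strategy fills the same gap with a different value, which is why the choice of imputation method can change the downstream classification accuracy.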
2. Replace the categorical/nominal attributes with a one-hot encoding.
You can use the Category Encoders package, which is designed for use with scikit-learn in Python.
Read this blog for more approaches to data encoding.
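One possible way to do the one-hot step is sketched below. Note that it uses scikit-learn's own OneHotEncoder rather than the Category Encoders package named above, and the column values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single nominal column with illustrative values (an assumption, not
# read from melb_data.csv). The encoder expects a 2-D array.
types = np.array([["h"], ["u"], ["t"], ["h"]])

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(types).toarray()  # sparse result -> dense array
categories = encoder.categories_[0]               # column order of the encoding

print(list(categories))
print(one_hot)
```

Each row of one_hot contains exactly one 1, in the column of that sample's category.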
3. Install the Weka system on your computer
Sort all the property samples by price and divide them equally into 5
categories/classes: top value, high value, medium value, low value, bottom value.
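The sort-and-divide step can be sketched as follows. The helper price_classes and the sample prices are hypothetical; the sketch assumes that "equally" means class sizes that differ by at most one sample.

```python
import numpy as np

CLASS_NAMES = ["bottom value", "low value", "medium value", "high value", "top value"]

def price_classes(prices, n_classes=5):
    """Return one integer class per sample: 0 = cheapest group, n_classes - 1 = most expensive."""
    prices = np.asarray(prices, dtype=float)
    order = np.argsort(prices, kind="stable")  # sample indices in ascending price order
    labels = np.empty(len(prices), dtype=int)
    # array_split gives n_classes chunks whose sizes differ by at most one
    for class_id, chunk in enumerate(np.array_split(order, n_classes)):
        labels[chunk] = class_id
    return labels

# Made-up prices for illustration
prices = [300, 950, 120, 700, 410, 1500, 560, 880, 230, 640]
labels = price_classes(prices)
print([CLASS_NAMES[i] for i in labels])
```

The resulting label column can replace the raw price as the class attribute before the data is exported for Weka.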
Apply the KNN algorithm of Weka with K = 5 to 10 to classify the property instances into the 5
classes. Calculate the accuracy for each value of K.
Apply the RandomForest algorithm of Weka and report its performance.
You need to split the whole dataset into training (66% of samples) and testing (34% of
samples) datasets. Repeat the random split 10 times and calculate the average accuracy.
from sklearn.model_selection import train_test_split
# test_size = 0.34 gives the 66%/34% split; use a different random_state for each of the 10 runs
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.34, random_state = 0)
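The assignment asks for Weka, but the repeated-split procedure can also be cross-checked in scikit-learn. The sketch below is one possible version under stated assumptions: the data is synthetic (generated with make_classification as a stand-in for the preprocessed housing features and price classes), average_accuracy is a hypothetical helper, and scikit-learn's KNeighborsClassifier and RandomForestClassifier will not match Weka's implementations exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 5-class data; replace x and y with the real features and price classes.
x, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

def average_accuracy(make_model, x, y, n_repeats=10, test_size=0.34):
    """Average test accuracy over n_repeats random 66%/34% splits."""
    scores = []
    for seed in range(n_repeats):
        # A different random_state per repeat gives a different random split.
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=test_size, random_state=seed)
        model = make_model().fit(x_train, y_train)
        scores.append(model.score(x_test, y_test))  # accuracy on the held-out 34%
    return float(np.mean(scores))

# One averaged accuracy per K from 5 to 10
knn_average = {
    k: average_accuracy(lambda: KNeighborsClassifier(n_neighbors=k), x, y)
    for k in range(5, 11)
}
rf_average = average_accuracy(
    lambda: RandomForestClassifier(n_estimators=100, random_state=0), x, y)

for k, acc in knn_average.items():
    print(f"K={k}: {acc:.3f}")
print(f"RandomForest: {rf_average:.3f}")
```

Using the same ten seeds for every model means each classifier is evaluated on identical splits, which makes the averages directly comparable.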

              K=5    K=6    K=7    K=8    K=9    K=10
KNN Average

RandomForest Average

Write a report discussing the performance of KNN and RandomForest. You are encouraged to
compare the performance of different missing-value imputation methods or categorical
encoding methods.
Zip your code and the report and upload to
