Optimizing KNN Classifier: Finding Best Neighbors & Visualizing Misclassification, Thesis of Information Technology

This Python script uses the scikit-learn library to perform k-nearest neighbors (KNN) classification on a dataset stored in a CSV file. The script splits the dataset into training and testing sets (80/20), runs 10-fold cross-validation on the training set for each odd number of neighbors from 1 to 53, and plots the misclassification error (1 minus the mean cross-validated accuracy) against the number of neighbors. Finally, it trains the KNN classifier with the optimal number of neighbors, the one with the lowest cross-validated error, and evaluates its accuracy.
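
To make the selection procedure concrete, here is a minimal, dependency-free sketch of what such a cross-validation loop computes. It is not the script's code: the one-dimensional toy data, the helper names `knn_predict`, `k_fold_error` and `pick_k`, and the choice of 5 interleaved folds are all illustrative assumptions. scikit-learn's `cross_val_score` also differs in that it stratifies the folds by class when given a classifier.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k training points nearest to `point`."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], point))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def k_fold_error(X, y, k, folds=5):
    """Mean misclassification error of k-NN over `folds` interleaved folds."""
    errors = []
    for f in range(folds):
        test_idx = [i for i in range(len(X)) if i % folds == f]
        train_idx = [i for i in range(len(X)) if i % folds != f]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        wrong = sum(knn_predict(train_X, train_y, X[i], k) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / folds


def pick_k(X, y, candidates, folds=5):
    """Return (best_k, errors); ties go to the earliest (smallest) candidate."""
    errors = [k_fold_error(X, y, k, folds) for k in candidates]
    return candidates[errors.index(min(errors))], errors


# Hypothetical toy data: two well-separated 1-D clusters labelled 0 and 1.
X = [(float(i),) for i in range(10)] + [(20.0 + i,) for i in range(10)]
y = [0] * 10 + [1] * 10
X[4], y[4] = (4.4,), 1  # one deliberately mislabelled point inside cluster 0

best_k, errors = pick_k(X, y, candidates=[1, 3, 5])
print(best_k, errors)
```

On this toy data the mislabelled point misleads the single-neighbor vote for one of its neighbors, so k = 1 has a higher error than k = 3 or k = 5, and the sketch selects k = 3.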

Typology: Thesis

2019/2020

Uploaded on 02/21/2020

anirut-sriwichian 🇹🇭

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# define column names
names = ['ax', 'ay', 'az', 'gx', 'gy', 'gz', 'anglex', 'angley', 'anglez', 'class']
# the file has no header row, so the names are supplied explicitly
df = pd.read_csv('C:/Users/AMCS2/Desktop/ML-PAPER/A9-SFU-SFB-SFR-SFL-S-SOC-R-WS-F-MT5-8-9-11000R.csv', header=None, names=names)
print(df.head())  # preview the first five rows

# create design matrix X and target vector y
X = np.array(df.iloc[:, 1:9])  # columns 1..8 ('ay' to 'anglez'); end index is exclusive, and column 0 ('ax') is not selected
y = np.array(df['class'])  # another way of indexing a pandas df

# split into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# creating odd list of K for KNN: 1, 3, ..., 53
neighbors = list(range(1, 55, 2))

# empty list that will hold cv scores
cv_scores = []
from sklearn.model_selection import cross_val_score

# perform 10-fold cross validation on the training set for each candidate k
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# changing to misclassification error (1 - mean accuracy); 'mse' here is not a mean squared error
mse = [1 - x for x in cv_scores]

# determining best k (the smallest k wins if several share the minimum error)
optimal_k = neighbors[mse.index(min(mse))]
print("The optimal number of neighbors is {}".format(optimal_k))

# plot misclassification error vs k
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10, 5)
plt.plot(neighbors, mse)
plt.xlabel("Number of Neighbors K")
plt.ylabel("Misclassification Error")
plt.tick_params(colors='black', labelsize=12, bottom=True, width=2, length=2)
plt.grid(True)
plt.show()

# instantiate learning model (k = 5)
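
The preview ends at the line above. As a hedged sketch of the step it names, fitting a final classifier and scoring it on the held-out split might look like the following. The data here is synthetic, generated with `make_classification` as a stand-in for the thesis CSV (8 features and 3 classes are assumptions). `optimal_k = 5` is only the value mentioned in that last line, not a result reported by the source.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the sensor data (assumption: 8 features, 3 classes).
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

optimal_k = 5  # placeholder; in the script this would come from cross-validation

# fit on the training split only, then score on the untouched test split
knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
test_accuracy = accuracy_score(y_test, pred)
print("Test accuracy with k={}: {:.3f}".format(optimal_k, test_accuracy))
```

Keeping the test split out of the cross-validation loop means this final accuracy is measured on data that played no part in choosing k.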