
Predict diabetes using k-NN from scratch and scikit-learn


Preface #

In the past few months, I’ve been learning machine learning via the IBM Data Science specialization on Coursera. It’s been an exciting experience, having learned about many topics I’ve never dabbled in before. I could glide through most of the courses in the specialization because I am a software engineer with over a decade of experience. This meant that, for me, the real learning started in the last few courses. Learning some rudimentary machine learning has been exciting, so I decided to document what I know and how I understand things.

Therefore, a fair warning: I am an absolute machine learning newbie. I am writing this primarily for myself and someone in a similar boat. There might be some stuff that’s wrong here. If you see any, please let me know, so I can fix my understanding.

One of the first classification algorithms I learned was k-Nearest Neighbors (k-NN). When I discovered it, I found it really cool because it can do some legit classification while being an algorithm that fits within 100 lines of code. As an ML newbie, I find this fascinating. With the newfound excitement from learning k-NN, I decided to look deeper into it, find a dataset, do some feature engineering, apply k-NN, and assess its performance.

Before we dive into the dataset, I want to discuss the theory of the k-Nearest Neighbors (k-NN) algorithm and its inner workings.

k-Nearest Neighbors #

Let’s take an imaginary dataset of cars with three columns: two features (top speed and horsepower) and one target (class):

import pandas as pd

data = {
    'Top Speed (mph)': [205, 217, 211, 186, 202, 187, 183, 174, 181, 155, 155, 150, 149, 147, 155, 126, 130, 122, 90, 95, 110, 112, 109, 118, 121, 130],
    'Horsepower (hp)': [600, 710, 650, 460, 500, 493, 420, 365, 414, 227, 255, 240, 335, 290, 330, 240, 260, 210, 150, 170, 200, 180, 190, 170, 220, 190],
    'Class': ['Hypercar', 'Hypercar', 'Hypercar', 'Supercar', 'Supercar', 'Supercar', 'Supercar', 'Supercar', 'Supercar', 'Luxury car', 'Luxury car', 'Luxury car', 'Luxury car', 'Luxury car', 'Luxury car', 'Economy car', 'Economy car', 'Economy car', 'Economy car', 'Economy car', 'Economy car', 'Economy car', 'Economy car', 'Economy car', 'Economy car', 'Economy car']
}

cars_df = pd.DataFrame(data)
cars_df

Top Speed (mph) Horsepower (hp) Class
0 205 600 Hypercar
1 217 710 Hypercar
2 211 650 Hypercar
3 186 460 Supercar
4 202 500 Supercar
5 187 493 Supercar
6 183 420 Supercar
7 174 365 Supercar
8 181 414 Supercar
9 155 227 Luxury car
10 155 255 Luxury car
11 150 240 Luxury car
12 149 335 Luxury car
13 147 290 Luxury car
14 155 330 Luxury car
15 126 240 Economy car
16 130 260 Economy car
17 122 210 Economy car
18 90 150 Economy car
19 95 170 Economy car
20 110 200 Economy car
21 112 180 Economy car
22 109 190 Economy car
23 118 170 Economy car
24 121 220 Economy car
25 130 190 Economy car

We can observe that the top speed and horsepower affect the class of the car. If we create a scatterplot of the data, we’ll see how the cars are grouped on the plot:

import seaborn as sns

sns.scatterplot(x=cars_df['Top Speed (mph)'], y=cars_df['Horsepower (hp)'], hue=cars_df['Class']);

There’s a correlation between the top speed, the horsepower, and the class of the car. Now, if we get a car whose top speed is 140mph, with 300hp, would that be an economy or luxury car? This is where k-NN can help.

The k-NN algorithm is a voting system: the majority class label among a new data point’s ‘k’ nearest neighbors in the feature space (where k is an integer) determines the class label of that new point. Using the above example of a car with a top speed of 140mph and 300hp, k-NN would measure the distance from the newly added point to every existing point. Once the algorithm finds the k closest points, it takes the prevalent category amongst those nearest points and assigns it to the newly added point. When setting k, a good starting point is usually the square root of the number of samples in the dataset. Generally, it’s best to pick an odd number for k, so the algorithm doesn’t run into a tie.
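
To make this concrete, here’s a rough sketch of how that vote plays out for our hypothetical car with a top speed of 140mph and 300hp, using the cars_df from above and k = 5 (we’ll implement this properly, step by step, later on):

import numpy as np
from collections import Counter

# The new, unclassified car: [top speed, horsepower]
new_car = np.array([140, 300])
features = cars_df[['Top Speed (mph)', 'Horsepower (hp)']].to_numpy()

# Distance from the new car to every car in the dataset
distances = np.sqrt(((features - new_car) ** 2).sum(axis=1))

# Take the classes of the 5 closest cars and let them vote
nearest = np.argsort(distances)[:5]
votes = Counter(cars_df['Class'].iloc[nearest].tolist())
votes.most_common(1)[0][0]  # should come out as 'Luxury car' for this toy data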

Now, you will see that I refer to k-NN as an algorithm and not an ML model. That’s because k-NN is a non-parametric and lazy learning algorithm. In machine learning, lazy learning is a method in which the generalization of the training data is delayed until a query is made to the system. As we will see soon, there’s no model that we will train with k-NN - it’s merely an algorithm that will run on the training data and apply a local function (distance between points) to determine the target class.

The math behind k-NN #

Distance is the next important aspect: what exactly is the distance between two points (or vectors)? To calculate it, we use what’s called the Euclidean distance. If a and b are points on a two-dimensional plane, their Euclidean distance is:

d(a,b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}

For those who do not like mathematical notation, this basically means the square root of the sum of the squared differences between the two points’ coordinates. In Python, this would be:

import math

def euclidean_distance(a, b):
    # a and b are two points, each given as [x, y] coordinates
    # Calculate the squared distance between them
    squared_distance = (a[0] - b[0])**2 + (a[1] - b[1])**2

    # Return the square root of the squared distance
    return math.sqrt(squared_distance)

When I read this the first time, I thought “why are we squaring the differences if we’re applying the square root later?”. The reason is that squaring the differences and then taking the square root acts like taking the absolute value of each difference - in other words, we gracefully handle negative differences, so the distance is always a non-negative number (a negative distance is not really a thing).
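
A quick sanity check of that claim, using the function above - swapping the points doesn’t change the result, and the result is never negative:

euclidean_distance([1, 2], [3, 4])  # 2.8284271247461903
euclidean_distance([3, 4], [1, 2])  # 2.8284271247461903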

And that’s all the math that k-NN requires. Let’s go back to a real dataset and apply k-NN to it.

Patients’ data & diabetes outcome #

The dataset I found is the Diabetes Dataset from Kaggle. It contains features regarding the patients’ health status, with the diabetes outcome as the dependent variable. The goal is to implement our own k-NN classifier and then use it on the dataset to predict diabetes outcomes in patients. Afterwards, we will evaluate the classifier’s performance relative to the one that’s shipped with scikit-learn.

First, we will load the data, explore it, do a bit of feature engineering, and then we’ll move on to building the solution.

df = pd.read_csv('diabetes.csv')

With the dataset loaded, let’s check its first few rows just to get a feel for it:

df.head()

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

The dataset has nine columns, each one of them containing numerical values. Let’s see their types and null counts:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

The fact that every column has no null values is great. We can also see that each column contains numerical values, where two are float64 and the rest are int64.

Let’s look at the column statistics:

df.describe()

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

Cleaning zero values #

There are a few interesting bits in the table above - several measurements contain zero values that make no physical sense (we’ll verify them with a quick count right after this list):

  • patients whose Glucose measurement is zero
  • patients whose BloodPressure measurement is zero
  • patients whose SkinThickness measurement is zero
  • patients whose Insulin measurement is zero
  • patients whose BMI is zero, which is plainly impossible
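
Here’s that quick count of the zero values in the affected columns (using the column names from the table above):

zero_columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Count how many rows have a zero in each of these columns
(df[zero_columns] == 0).sum()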

Let’s clean up these nonsensical values - we can replace the zeroes with the mean of each column:

means = df.iloc[:, 1:6].mean()
means
Glucose          120.894531
BloodPressure     69.105469
SkinThickness     20.536458
Insulin           79.799479
BMI               31.992578
dtype: float64
nonzeros = list(df.columns[1:6])
for column in nonzeros:
    df[column] = df[column].replace(0, means[column])

We can see how the min values now changed:

df.min()
Pregnancies                  0.000
Glucose                     44.000
BloodPressure               24.000
SkinThickness                7.000
Insulin                     14.000
BMI                         18.200
DiabetesPedigreeFunction     0.078
Age                         21.000
Outcome                      0.000
dtype: float64

Now that we’ve cleaned up the zero values, let’s look at splitting our data into training and test sets.

Data splitting #

Splitting the data into training and test sets is an essential part of model building. The training data allows us to train our model, while the test data allows us to check its accuracy. In other words, the test data lets us see how our model will perform on out-of-sample data.

Out-of-sample data is a fancy way of saying data that is not part of the training dataset.

To split our data into training and test data, we can use scikit-learn’s train_test_split method.

X = df.iloc[:, 0:7].to_numpy()
y = df.iloc[:, 8].to_numpy()
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

The train_test_split method takes three important arguments:

  • the independent features, stored in the variable X
  • the target values, stored in the variable y
  • the test size as a float, which tells the method how big the test size should be
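
One more argument worth knowing about, even though we didn’t pass it above, is random_state - setting it to any fixed integer makes the split reproducible between runs:

# Same split parameters, but reproducible (the value 42 is arbitrary)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)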

Let’s see how train_test_split has split our data here. Remember, the original dataset had 768 rows and 9 columns:

df.shape
(768, 9)

Let’s see the shape of the train and test data:

X_train.shape
(460, 7)
X_test.shape
(308, 7)

We can see that the data has been split into 60% training data and 40% test data, which is exactly what the test_size argument instructed train_test_split to do (with its value 0.4).

Now that our data is split, let’s look into the problem of data scaling.

Scaling our data #

As we already established in the short theory section, k-NN is an algorithm that calculates distances between points. Because it calculates distances, it is sensitive to features whose values span a larger range than others.

Let’s look at the different ranges of each feature in the dataset:

import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8')

fig, axes = plt.subplots(2, 4, figsize=[14, 6])

df.drop('Outcome', axis=1).hist(ax=axes);

For example, let’s take two 2D planes, where the first one has Pregnancies as one axis and SkinThickness as the other, and the second plane has Insulin and Glucose as axes. Because the ranges of the axes in the first plane are smaller than those in the second, distances measured in the second plane will tend to have larger absolute values, which biases k-NN towards the features in the second plane.
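
Here’s a tiny numerical sketch of this effect, with made-up values just for illustration - a difference of one pregnancy gets drowned out by a difference of 100 insulin units:

import numpy as np

a = np.array([1, 100])   # [Pregnancies, Insulin]
b = np.array([2, 200])

# The distance is ~100.005 - almost entirely driven by the Insulin axis
np.sqrt(np.sum((a - b) ** 2))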

To remove this kind of bias, we need to scale our features.

Intro to scaling #

There are different ways to scale features, but I’d say exploring them is outside of the scope of this writeup. One has got to draw the line somewhere! To scale our features we will use the preprocessing module from sklearn. From the preprocessing module we will use the StandardScaler class - a utility class that provides a quick and easy way to perform standardization on an array-like dataset.
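
Under the hood, standardization just subtracts each column’s mean and divides by its standard deviation. Here’s a minimal sketch on the Glucose column showing that StandardScaler does exactly that (the manual computation is my own illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

glucose = df[['Glucose']].to_numpy()

# z = (x - mean) / std, computed by hand
manual = (glucose - glucose.mean()) / glucose.std()

# The same thing via StandardScaler
scaled = StandardScaler().fit_transform(glucose)

np.allclose(manual, scaled)  # True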

Let’s standardize all of our features and compare their distributions before and after standardization:

from sklearn.preprocessing import StandardScaler

fig, ax = plt.subplots(2, 4, figsize=[14,6])

scaler = StandardScaler()
features = df.iloc[:, 0:8]
scaled_features = scaler.fit_transform(features)
scaled_features_df = pd.DataFrame(scaled_features, columns=df.columns[0:8])
scaled_features_df.hist(ax=ax);

What’s interesting is that the histograms of the features before and after scaling look the same. This is exactly what we want - each scaled feature keeps the same shape of distribution it had before scaling, but all values now roughly fall between -4 and 4, with some outliers. By standardizing the features we make sure that no feature has a greater effect on the model than others simply because its absolute values are larger.

Scaling the features #

To actually scale our training and test sets, we need to take similar steps as above. First, let’s initialize a scaler and fit and transform the X_train data:

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

Now, with the scaler fitted to the X_train data, we only transform the X_test data - this way the test set is scaled with the statistics learned from the training set, and no information from the test set leaks into the preprocessing:

X_test = scaler.transform(X_test)

Now that the features are scaled, let’s use them in our k-Nearest Neighbors algorithm - which we’ll first have to write from scratch before we try it out.

k-NN from scratch #

Now that we have that out of the way, let’s implement our own k-NN. To implement the algorithm, we need two methods:

  • fit, which will set the features (X) and labels (y) to the model’s memory
  • predict, which will take a vector of out-of-sample data and return a prediction

Let’s start with the simpler ones:

class KNN:
  def __init__(self, k=5):
    self.k = k

  def fit(self, X, y):
    self.X_train = X
    self.y_train = y

In this case, fitting the X and y just means setting them as the object’s attributes.

Let’s look at the predict method:

def predict(self, X):
    predictions = [self._predict(x) for x in X]
    return predictions

def _predict(self, x):
    distances = [euclidean_distance(x, x_train) for x_train in self.X_train]

    k_indices = np.argsort(distances)[:self.k]
    k_labels = [self.y_train[i] for i in k_indices]

    most_common = Counter(k_labels).most_common()
    return most_common[0][0]

The predict method essentially loops through each of the observations in the X array and calls _predict for each one. Within _predict, we calculate the Euclidean distance between the argument (x) and every row of the training data (X_train). Once we have the distances, we use np.argsort to get the indices of the sorted distances. np.argsort basically sorts the distances, but returns the indices of the elements in sorted order instead of the sorted values themselves.
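
Here’s a tiny illustration of what np.argsort gives us, using some made-up distances:

import numpy as np

distances = [4.2, 0.5, 3.1, 1.7]

# Indices that would sort the list: the smallest distance is at index 1,
# the next smallest at index 3, and so on
np.argsort(distances)       # array([1, 3, 2, 0])
np.argsort(distances)[:3]   # array([1, 3, 2])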

Using those indices, we take the matching target labels from y_train - in other words, the indices of the closest points in X_train tell us which labels to pick from y_train. This is how a neighbor’s input features map back to a target class - by fetching the corresponding y_train label.

Once we have the k_labels, we just find the most common label of the bunch. This is the voting component of the k-NN algorithm. By finding the most common label we basically say: from the K nearest neighbors, the most common is Z, so we return Z as the predicted class.
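
As a small illustration of that voting step, with made-up labels for k = 5:

from collections import Counter

k_labels = [1, 0, 1, 1, 0]

# most_common returns (label, count) pairs, sorted by count
Counter(k_labels).most_common()         # [(1, 3), (0, 2)]
Counter(k_labels).most_common()[0][0]   # 1 - the winning label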

Here’s the whole k-NN algorithm:

import numpy as np
from collections import Counter

def euclidean_distance(a, b):
  return np.sqrt(np.sum((a-b)**2))

class KNN:
  def __init__(self, k=5):
    self.k = k

  def fit(self, X, y):
    self.X_train = X
    self.y_train = y
    
  def predict(self, X):
    predictions = [self._predict(x) for x in X]
    return np.array(predictions)

  def _predict(self, x):
    distances = [euclidean_distance(x, x_train) for x_train in self.X_train]

    k_indices = np.argsort(distances)[:self.k]
    k_labels = [self.y_train[i] for i in k_indices]

    most_common = Counter(k_labels).most_common()
    return most_common[0][0]

Let’s put it to use on the scaled X_train and X_test:

clf = KNN(k=5)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

Now we have our predictions from the k-NN algorithm, but how do we know if they are accurate? This is where we need to evaluate the algorithm’s performance.

Accuracy, confusion, precision & recall #

Model evaluation is a really interesting topic - it’s basically measuring how well the model performs against the test data. We will look at a few techniques here that I’d like to explore, but the list is not exhaustive - there are other evaluation methods as well.

To evaluate the performance of the algorithm we will look at the accuracy metric and the confusion matrix of the algorithm.

Accuracy #

Accuracy is the simplest approach of all. It counts the number of predictions that match their targets and divides it by the total number of targets. In other words, accuracy is the ratio of correctly predicted labels to all labels.

In Python, the accuracy method would look like:

def calculate_accuracy(y_true, y_pred):
    accuracy = np.sum(y_true == y_pred) / len(y_true)
    return accuracy

If we take our predictions from above and run them against our test labels (y_test), we will get the accuracy of the k-NN algorithm against the test data:

calculate_accuracy(y_test, predictions)
0.7045454545454546

Accuracy is quite a blunt measurement, though, as it doesn’t tell us anything about the kinds of misclassification we make. In the diabetes example, an accuracy of roughly 0.70 means that we will end up misdiagnosing almost a third of the patients.

Confusion matrix #

While the accuracy metric tells us a lot in a single number, for us humans a visualization often goes a longer way. A confusion matrix is a visualization that can give us a better idea of the algorithm’s performance. Let’s see the confusion matrix of our k-NN algorithm:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, predictions)
import seaborn as sns

# Change figure size for better readability
plt.figure(figsize=(6, 4))

ax = sns.heatmap(cm, annot=True, fmt='d')

# set x-axis label and ticks. 
ax.set_xlabel("Predicted Diagnosis", fontsize=14, labelpad=20)
ax.xaxis.set_ticklabels(['Negative', 'Positive'])

# set y-axis label and ticks
ax.set_ylabel("Actual Diagnosis", fontsize=14, labelpad=20)
ax.yaxis.set_ticklabels(['Negative', 'Positive'])

ax.set_title("Confusion Matrix for the Diabetes k-NN", fontsize=14, pad=20);

We can see in the confusion matrix that the algorithm predicts the Negative outcomes pretty well, but there is a significant misclassification happening in the lower left quadrant. In other words, the algorithm classified 50 patients as ‘Negative’ (no diabetes) whose actual diagnosis was ‘Positive’ (they did have diabetes). These 50 classifications are also called false negatives.

What’s tricky about this stark number of false negatives is that if a doctor followed the algorithm’s prediction, these patients would receive the correct diagnosis much later, which can have long-term implications for their health. So what can we do about this? Is this a problem with our algorithm or our setup? Or a bit of both?

Let’s look at another few metrics to measure the algorithm’s performance: precision, recall & F1 score.

Precision & recall #

These two are exciting metrics because they measure the positive predictive value and the algorithm’s sensitivity, respectively. Another way to explain them:

  • Precision is the true positive predictions over the sum of the true positive and false positive predictions. It basically returns the ratio of true positive predictions over all positive predictions. Hence the name ‘positive predictive value.’
  • Recall, on the other hand, is the true positive predictions over the sum of the true positive and false negative predictions. The true positives and false negatives together make up all the samples that are actually positive. So recall returns the ratio of correctly identified positives against all actual positives. Hence the name ‘sensitivity’ - it measures how many of the actual positive cases the algorithm under evaluation will pick up (we’ll see both formulas in action right after this list).
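
Since we already have the confusion matrix cm from above, here is that quick check - both formulas expressed directly in terms of the matrix cells (for a binary problem, cm.ravel() unpacks the matrix into tn, fp, fn, tp in that order):

# Precision and recall straight from the confusion matrix computed earlier
tn, fp, fn, tp = cm.ravel()

precision_from_cm = tp / (tp + fp)
recall_from_cm = tp / (tp + fn)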

We can implement precision and recall using Python. But, first, we need to implement four helper functions:

  • true_positive
  • true_negative
  • false_positive
  • false_negative

From there, we can easily implement recall and precision. So let’s give it a shot. First, the helpers:

def true_positive(y_val, y_pred):
    tp = 0
    for pred, val in zip(y_pred, y_val):
        if pred == 1 and val == 1:
            tp += 1
    return tp

def true_negative(y_val, y_pred):
    tn = 0
    for pred, val in zip(y_pred, y_val):
        if pred == 0 and val == 0:
            tn += 1
    return tn

def false_positive(y_val, y_pred):
    fp = 0
    for pred, val in zip(y_pred, y_val):
        if pred == 1 and val == 0:
            fp += 1
    return fp

def false_negative(y_val, y_pred):
    fn = 0
    for pred, val in zip(y_pred, y_val):
        if pred == 0 and val == 1:
            fn += 1
    return fn

Next, the methods themselves:

def calculate_precision(y_val, y_pred):
    tp = true_positive(y_val, y_pred)
    fp = false_positive(y_val, y_pred)
    prec = tp / (tp+fp)
    return prec

def calculate_recall(y_val, y_pred):
    tp = true_positive(y_val, y_pred)
    fn = false_negative(y_val, y_pred)
    rec = tp / (tp + fn)
    return rec

Let’s measure the precision and recall of the predictions:

precision_scr = calculate_precision(y_test, predictions)
recall_scr = calculate_recall(y_test, predictions)
print(f"Precision: {precision_scr}")
print(f"Recall: {recall_scr}")
Precision: 0.5980392156862745
Recall: 0.5495495495495496

Let’s compare our implementation of precision and recall against the one from scikit-learn:

from sklearn.metrics import precision_score, recall_score

print(f"scikit-learn's Precision: {precision_score(y_test, predictions)}")
print(f"scikit-learn's Recall: {recall_score(y_test, predictions)}")
scikit-learn's Precision: 0.5980392156862745
scikit-learn's Recall: 0.5495495495495496

Nice, we got the same scores. You can see that our precision and recall scores are pretty low (0 is worst, 1 is best). This means that with our feature setup, k-Nearest Neighbors is not the best choice for this problem: its ability to avoid labeling negative samples as positive (precision) and its ability to find all the positive samples (recall) are both quite low.

Now that we have the precision and recall, to wrap up our assessment of the algorithm we can also apply the F1 score.

F1 Score #

The F1 score measures the model’s balanced ability to capture positive cases (recall) and be accurate with the cases it does capture (precision). Mathematically, it takes the precision and recall metrics and calculates their harmonic mean:

F1 = 2 * (precision * recall) / (precision + recall)

Compared to the accuracy score, the F1 score is useful because it captures both precision and recall. Accuracy, in contrast, treats every prediction the same and says nothing about how the errors are split between false positives and false negatives, which can be misleading on imbalanced datasets like ours. This combination of the two metrics makes the F1 score a popular choice of evaluation metric in the ML community.

As you can see above, the formula is quite simple, but let’s implement it ourselves and apply it to our precision and recall metrics:

def calculate_f1(precision, recall):
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1
f1_scr = calculate_f1(precision_scr, recall_scr)
f1_scr
0.5727699530516432
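
As with precision and recall, we can sanity-check the result against scikit-learn’s own f1_score:

from sklearn.metrics import f1_score

f1_score(y_test, predictions)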

Now, a score of roughly 0.57 doesn’t tell us much on its own - we need a good idea of how to interpret it. From what I was able to find online, a common way to interpret F1 is this table:

F1 score Interpretation
> 0.9 Very good
0.8 - 0.9 Good
0.5 - 0.8 OK
< 0.5 Not good

If we follow the above table, our k-NN algorithm is doing barely OK, which means we should probably look into other algorithms or models to predict diabetes in patients using the available dataset.

Comparing to scikit-learn #

Out of curiosity, the last thing I wanted to do is compare our implementation of k-NN with the one from scikit-learn. The comparison will confirm whether our implementation is correct, assuming we get the same metrics. Of course, the scikit-learn implementation is better engineered and will perform better on larger datasets. Still, for our 768-observation dataset both will do the trick.

Back to the comparison:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)

Now that we have the predictions of the KNeighborsClassifier, let’s see the precision, recall and F1 scores:

sk_precision = calculate_precision(y_test, predictions)
sk_recall = calculate_recall(y_test, predictions)
sk_f1_score = calculate_f1(sk_precision, sk_recall)

print(f"Precision: {sk_precision}")
print(f"Recall: {sk_recall}")
print(f"F1: {sk_f1_score}")
Precision: 0.5980392156862745
Recall: 0.5495495495495496
F1: 0.5727699530516432

Let’s compare them to our scores:

print(f"Precision same? {sk_precision == precision_scr}")
print(f"Recall same? {sk_recall == recall_scr}")
print(f"F1 same? {sk_f1_score == f1_scr}")
Precision same? True
Recall same? True
F1 same? True

Voilà! Our k-NN implementation has the same recall and precision metrics as the one from sklearn. That’s pretty cool to see, as it confirms that we have implemented our k-NN well!


Some references I used while writing this post: