This project demonstrates building a machine learning model that predicts whether a breast tumor is benign (non-cancerous) or malignant (cancerous), with a target of at least 0.95 accuracy. The workflow covers minimal data cleaning, correlation-based feature selection, model training, and evaluation with ROC curve analysis.
Terminology note: The dataset uses the word “cancer”, but the labels actually represent tumor diagnosis: malignant vs benign. In medical terminology, cancer typically refers to malignant neoplasms, while benign tumors are not considered cancer.
Disclaimer (educational use only): This repository is an educational machine learning project based on a public, historical dataset. It is not a medical device and must not be used to diagnose, treat, or make clinical decisions without appropriate clinical validation, regulatory compliance, and oversight by qualified healthcare professionals. Any performance metrics reported here reflect this specific dataset and setup and may not generalize to real-world clinical populations.
The dataset is available in this repository (file: Cancer_Data.csv) and on Kaggle: https://www.kaggle.com/datasets/erdemtaha/cancer-data
```python
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, roc_curve, roc_auc_score

# Move to the repository root so the relative data path below resolves
os.chdir('../')
```
```python
df = pd.read_csv('data/Cancer_Data.csv')
```

Preview the first rows:
```python
df.head(30)
```

List all variables (column names), non-null counts, and data types:
```python
df.info()
```

Check whether any missing values exist per variable (as a quick safety check):
```python
df.isna().any()
```

Review the distribution of the diagnosis target labels:
```python
df['diagnosis'].value_counts()
```

Compute basic descriptive statistics for numerical variables:
```python
df.drop(columns=["id"]).describe()
```
The dataset contains 569 records and the following variables:
- `id` (dtype: `int64`)
- `diagnosis` (dtype: `object`) with values: `M` = malignant, `B` = benign
- 30 numeric, continuous variables (dtype: `float64`): `radius_mean`, `texture_mean`, `perimeter_mean`, `area_mean`, `smoothness_mean`, `compactness_mean`, `concavity_mean`, `concave points_mean`, `symmetry_mean`, `fractal_dimension_mean`, `radius_se`, `texture_se`, `perimeter_se`, `area_se`, `smoothness_se`, `compactness_se`, `concavity_se`, `concave points_se`, `symmetry_se`, `fractal_dimension_se`, `radius_worst`, `texture_worst`, `perimeter_worst`, `area_worst`, `smoothness_worst`, `compactness_worst`, `concavity_worst`, `concave points_worst`, `symmetry_worst`, `fractal_dimension_worst`
- a likely accidental column `Unnamed: 32` containing only missing values (NaN)
The data quality is reasonably high, so preparation includes only two steps:
- Remove the last column (`Unnamed: 32`):

```python
del df[df.columns[-1]]
```

- Encode the target variable from the categorical `diagnosis` to a numeric `target`. Since the objective is to detect malignant tumors, malignant cases are encoded as `1`:

```python
df['target'] = (df['diagnosis'] == 'M').astype(int)
```
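As a quick sanity check (an optional step, not in the original notebook), the encoding can be verified by cross-tabulating the two columns:

```python
# Each diagnosis label should map to exactly one target value (B -> 0, M -> 1)
pd.crosstab(df['diagnosis'], df['target'])
```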
Display the full correlation matrix (all variables against all variables), using Spearman correlation:

```python
plt.figure(figsize=(17, 15))
sns.heatmap(round(df[df.columns[2:]].corr('spearman').sort_values(by='target'), 2), annot=True, linewidths=0.1)
plt.show()
```
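For a quick non-graphical view of the same information, one could also list each feature's Spearman correlation with the target directly (a small optional sketch):

```python
# Spearman correlation of every feature with the target, strongest first
df[df.columns[2:]].corr('spearman')['target'].drop('target').sort_values(ascending=False).round(2)
```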
The variable most strongly correlated with the target is perimeter_worst. It is included in the model first. Next, correlations of the remaining variables are evaluated both against the target and against perimeter_worst:
```python
plt.figure(figsize=(6, 8))
sns.heatmap(round(df[df.columns[2:]].corr('spearman').sort_values(by='target'), 2)[['target', 'perimeter_worst']],
            annot=True, linewidths=0.1)
plt.show()
```
Selection rule used (see the code sketch after this list):
- include variables with absolute correlation with the target ≥ 0.45, and
- ensure correlation with already selected variables is < 0.7 (to limit redundancy / multicollinearity).
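Purely as an illustration, this rule can also be expressed in code. The sketch below is not part of the original walkthrough (which applies the rule manually via the heatmaps); it greedily scans candidates in order of absolute target correlation and is not guaranteed to reproduce the manual picks exactly:

```python
# Illustrative sketch of the greedy rule above (assumes `df` as prepared earlier)
corr = df[df.columns[2:]].corr('spearman')

selected = ['perimeter_worst']  # strongest correlation with the target
candidates = [c for c in corr.columns if c not in selected + ['target']]

# Visit candidates from strongest to weakest absolute target correlation
for col in sorted(candidates, key=lambda c: abs(corr.loc[c, 'target']), reverse=True):
    if abs(corr.loc[col, 'target']) < 0.45:
        break  # candidates are sorted, so everything after this is below the cutoff
    if all(abs(corr.loc[col, s]) < 0.7 for s in selected):
        selected.append(col)  # sufficiently non-redundant: keep it

print(selected)
```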
Following this rule, the next selected variable is perimeter_se:
```python
plt.figure(figsize=(6, 8))
sns.heatmap(round(df[df.columns[2:]].corr('spearman').sort_values(by='target'), 2)[['target', 'perimeter_worst', 'perimeter_se']],
            annot=True, linewidths=0.1)
plt.show()
```
Next, `compactness_mean` and `compactness_worst` show the same correlation with the target. Since `compactness_worst` is markedly less correlated with the previously selected features, it is chosen for the model.
```python
plt.figure(figsize=(6, 8))
sns.heatmap(round(df[df.columns[2:]].corr('spearman').sort_values(by='target'), 2)[['target', 'perimeter_worst', 'perimeter_se', 'compactness_worst']],
            annot=True, linewidths=0.1)
plt.show()
```
The next variable meeting the criteria is concave points_se, which is then added:
```python
plt.figure(figsize=(6, 8))
sns.heatmap(round(df[df.columns[2:]].corr('spearman').sort_values(by='target'), 2)[['target', 'perimeter_worst', 'perimeter_se', 'compactness_worst', 'concave points_se']],
            annot=True, linewidths=0.1)
plt.show()
```
The next variable meeting the criteria is texture_worst, which is then added:
```python
plt.figure(figsize=(6, 8))
sns.heatmap(round(df[df.columns[2:]].corr('spearman').sort_values(by='target'), 2)[['target', 'perimeter_worst', 'perimeter_se', 'compactness_worst', 'concave points_se', 'texture_worst']],
            annot=True, linewidths=0.1)
plt.show()
```
There are no more variables meeting the criteria.
```python
x_names = ['perimeter_worst', 'perimeter_se', 'compactness_worst', 'concave points_se', 'texture_worst']
```

Outliers are identified using the interquartile range (IQR) method. To keep the modeling pipeline simple and robust, an observation is excluded if it is flagged as an outlier for any of the selected model features. This is a deliberate trade-off: it reduces the training sample size, but may improve model stability by limiting the influence of extreme values.
Define a helper function that flags outliers:
```python
def find_outliers(x, a=1.5):
    """Flag values outside [Q1 - a*IQR, Q3 + a*IQR] (Tukey's rule)."""
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1
    x_min = q1 - a * iqr
    x_max = q3 + a * iqr
    return (x < x_min) | (x > x_max)
```
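A quick illustration on hypothetical values (not from the dataset): the fences here are [-1.5, 8.5], so only the extreme point is flagged:

```python
# Only 100 falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
find_outliers(pd.Series([1, 2, 3, 4, 5, 100]))
# -> False, False, False, False, False, True
```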
Create additional columns indicating whether each observation is an outlier for each selected feature, and store those column names for later use:

```python
outlier_column_names = []
for i in x_names:
    df[f'{i}_outlier'] = find_outliers(df[i])
    outlier_column_names.append(f'{i}_outlier')
```

As a result, the following outlier indicator columns are created (and stored in `outlier_column_names`): `perimeter_worst_outlier`, `perimeter_se_outlier`, `compactness_worst_outlier`, `concave points_se_outlier`, `texture_worst_outlier`.
Create a single column indicating whether an observation is an outlier for any of the selected model features:

```python
df['outlier_total'] = df[outlier_column_names].any(axis=1)
```

Inspect how many observations are flagged as outliers (these will be excluded from model training and testing):

```python
df['outlier_total'].value_counts()
```
Import the train/test split function:
```python
from sklearn.model_selection import train_test_split
```

Define X (input features) and y (target), excluding observations flagged as outliers:
```python
X = df.loc[~(df.outlier_total), x_names]
y = df.loc[~(df.outlier_total), 'target']
```

Split into training and test sets:
```python
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=123)
```
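A variant one might consider here (an assumption on my part, not used in this project) is a stratified split, which preserves the benign/malignant proportions in both subsets:

```python
# Hypothetical alternative: stratify on y so class balance matches across splits
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)
```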
Create the model object:

```python
model_1 = LogisticRegression()
```

Fit the model:

```python
model_1.fit(train_x, train_y)
```

Generate predictions for both the training and test sets:

```python
train_pred = model_1.predict(train_x)
test_pred = model_1.predict(test_x)
```
Confusion matrices (training and test):

```python
confusion_matrix(train_y, train_pred)
confusion_matrix(test_y, test_pred)
```
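For reference (added here for clarity), each metric used below can be read directly off a confusion matrix with counts TP, FN, TN, FP, where the positive class is malignant:

- Accuracy = (TP + TN) / (TP + FN + TN + FP)
- TPR (sensitivity/recall) = TP / (TP + FN)
- FNR = FN / (TP + FN) = 1 − TPR
- TNR (specificity) = TN / (TN + FP)
- FPR = FP / (TN + FP) = 1 − TNR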
The confusion matrices look strong. Next, compute accuracy on both sets, plus TPR, FNR, TNR, and FPR on the test set:
```python
accuracy_score(train_y, train_pred)
```

Training accuracy is approximately 0.943.

```python
accuracy_score(test_y, test_pred)
```

Test accuracy is approximately 0.961.

```python
recall_score(test_y, test_pred)
```

Test TPR (sensitivity/recall) is approximately 0.946.

```python
1 - recall_score(test_y, test_pred)
```

Test FNR is approximately 0.054.

```python
recall_score(test_y, test_pred, pos_label=0)
```

Test TNR (specificity) is approximately 0.969.

```python
1 - recall_score(test_y, test_pred, pos_label=0)
```

Test FPR is approximately 0.031.
The model quality is high and appears stable (there is no sign of overfitting, as test performance does not degrade compared to training). Among malignant cases, approximately 95% would be correctly predicted as malignant. Among benign cases, approximately 97% would be correctly predicted as benign.
A key next step is aligning the model threshold with the “business” (clinical) objective:
- prioritize detecting all malignant tumors (minimize false negatives), even if it increases false positives, or
- prioritize minimizing false positives, even if it increases false negatives.
Compute predicted probabilities (for both training and test sets) of belonging to class 1 (malignant):
```python
train_pred_p = model_1.predict_proba(train_x)[:, 1]
test_pred_p = model_1.predict_proba(test_x)[:, 1]
```

Compute FPR, TPR, and thresholds for the ROC curve:
```python
fpr_train, tpr_train, threshold_train = roc_curve(train_y, train_pred_p)
fpr_test, tpr_test, threshold_test = roc_curve(test_y, test_pred_p)
```

Compute the AUC scores:
```python
auc_train = round(roc_auc_score(train_y, train_pred_p), 3)
auc_test = round(roc_auc_score(test_y, test_pred_p), 3)
```

Plot the ROC curve:
```python
plt.plot(fpr_train, tpr_train, label="train")
plt.plot(fpr_test, tpr_test, label="test")
plt.plot([0, 1], [0, 1], '--')  # diagonal reference line (random classifier)
plt.legend()
plt.annotate(f'AUC train: {auc_train}', xy=[0.2, 0.8])
plt.annotate(f'AUC test: {auc_test}', xy=[0.2, 0.75])
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title('ROC Curve')
plt.show()
```
The ROC curve shape and AUC values confirm strong model performance.
If the priority is to detect all malignant tumors (i.e., ensure TPR = 1), we must select a threshold that achieves TPR = 1 with the smallest possible FPR.
The following code finds the first index at which tpr_test equals exactly 1.0, and then retrieves the corresponding threshold and FPR:
```python
idxs = np.where(tpr_test == 1.0)[0]
if idxs.size == 0:
    raise ValueError("TPR never reaches 1.0 at any threshold.")
idx_first_1 = idxs[0]
threshold_at_first_1 = threshold_test[idx_first_1]
fpr_at_first_1 = fpr_test[idx_first_1]
idx_first_1, threshold_at_first_1, fpr_at_first_1
```

A threshold that ensures all malignant tumors are detected (i.e., TPR = 1) is approximately 0.2096. However, at this threshold the FPR is 0.09375, meaning that approximately 9% of benign tumors would be incorrectly classified as malignant. Under the default threshold of 0.5, only about 3% of benign tumors are incorrectly classified as malignant.
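To actually use this operating point, a minimal sketch (not part of the original notebook) would re-threshold the test-set probabilities and re-inspect the confusion matrix:

```python
# Apply the chosen threshold instead of the default 0.5 used by predict();
# roc_curve treats a sample as positive when its score is >= the threshold
test_pred_custom = (test_pred_p >= threshold_at_first_1).astype(int)
confusion_matrix(test_y, test_pred_custom)
```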
LinkedIn: Julita Wawreszuk-Chylińska
Thank you for your interest in this project!