We are going to predict whether a patient is likely to have breast cancer based on the selected features. We apply different algorithms with different feature sets and arrive at a solution by choosing the combination of features and algorithm that gives the best prediction.
- ID number.
- Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
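As a minimal sketch of working with this data (assuming the copy of the Wisconsin Diagnostic Breast Cancer dataset bundled with scikit-learn; a CSV export with `id` and `diagnosis` columns would work similarly), the 30 features and the label can be loaded like this:

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the Wisconsin Diagnostic Breast Cancer data bundled with scikit-learn.
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)  # the 30 features above
y = pd.Series(data.target, name="diagnosis")             # 0/1 encoded diagnosis

print(X.shape)  # (569, 30): mean, standard error, and "worst" of the 10 base features
```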
The target was converted into 0 and 1, and skewness was removed from the selected features using a log transform, as sketched below.
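A hedged sketch of these two preprocessing steps (the CSV path, the column names, and the 0/1 mapping are illustrative assumptions, not the project's exact code):

```python
import numpy as np
import pandas as pd

# Hypothetical: the raw CSV has a 'diagnosis' column with 'M'/'B' labels.
df = pd.read_csv("data.csv")  # path is an assumption

# Encode the target: benign -> 0, malignant -> 1 (the mapping is an assumption).
df["diagnosis"] = df["diagnosis"].map({"B": 0, "M": 1})

# Reduce right-skew with log1p, i.e. log(1 + x), which is safe for zero values.
skewed = ["radius_mean", "perimeter_mean", "area_mean"]  # illustrative subset
df[skewed] = np.log1p(df[skewed])
```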
We have tried different feature-reduction techniques, such as selecting the k best features and PCA, to see which of them offers better accuracy, as sketched below.
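A minimal sketch of both techniques with scikit-learn (the value of k, the number of components, and the split parameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Keep the k features with the strongest univariate F-score against the target.
selector = SelectKBest(score_func=f_classif, k=10)
X_kbest = selector.fit_transform(X_train, y_train)

# PCA benefits from standardized inputs; project onto the leading components.
scaler = StandardScaler().fit(X_train)
X_pca = PCA(n_components=10).fit_transform(scaler.transform(X_train))
```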
Different models have been tried: Logistic Regression, Random Forest Classifier, Decision Tree Classifier, K Neighbors Classifier, Gaussian Naive Bayes, and SVM Classifier.
The SVM Classifier with a linear kernel gives us the best accuracy.
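A hedged sketch of this model comparison (the train/test split and the mostly-default hyperparameters are assumptions, not the exact configuration used):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Gaussian NB": GaussianNB(),
    "SVM (linear)": SVC(kernel="linear"),
}

# Fit each model on the same split and compare held-out accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```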
- We did initial exploration and used Logistic Regression as a base model, where we achieved an accuracy of 95%.
- Later we checked whether accuracy increases with other models, and it did, with 96% being the highest.
- Next we tried SelectKBest, where we experimented with different numbers of features and achieved the highest accuracy of 97% with the SVM linear kernel.
- After that we tried PCA, where we found the best number of components by plotting the explained variance against the number of components, and also experimented with different explained-variance thresholds (95%-99%), achieving an accuracy of 98% (see the sketch after this list).
- We also removed the skewness and tried using the top 10 correlated features, which also reached an accuracy of 98% with SVM (linear kernel).
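A minimal sketch of choosing the number of PCA components by explained variance (the 95% threshold mirrors the range above; the plotting details are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA on all components and plot the cumulative explained variance.
pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumvar) + 1), cumvar, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()

# A float n_components asks PCA to keep just enough components for that variance.
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print(pca_95.n_components_)
```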
- Bharath Jagini (015260232) - Front-End API, Random Forest Classifier and Decision Tree Classifier
- Premchand Jayachandran (015326428) - PCA, SVM and Gaussian NB
- Tanay Ganeriwal (015278042) - Data Visualization (finding correlation bar graph and plotting histograms) and KNN
- Tejas Mahajan (015319421) - Initial data exploration, removing skewness, and Logistic Regression