VIF analysis using Python, Exercises of Data Analysis & Statistical Methods

To perform data analysis using Python on qualitative data

2019/2020

Uploaded on 09/02/2020

vikram_jaswal

ASSIGNMENT NO: 10
Submitted To: Dr. RAMPRRASADH GOARTHY
Group 2
PGDM 2019-2021
Name              Roll No.
Rahul Patibandla  A032
Vikram Jaswal     B054
Parag Jamdade     C030

Python Code:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from statsmodels.sandbox.regression.predstd import wls_prediction_std
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics

    # Load the heart disease dataset and inspect its shape, rows and summary statistics
    data = pd.read_csv('HeartDisease.csv')
    print(data.shape)
    print(data)
    print(data.describe())
    data_top = data.head()

Output:

From the graph below, the diagnosis of heart disease (angiographic disease status) shows that around 140 people have less than 50% diameter narrowing; the rest have more than 50%.

From the graph below, the chest pain types are distributed as follows:
Value 0: typical angina, 140 people
Value 1: atypical angina, 50 people
Value 2: non-anginal pain, 80 people
Value 3: asymptomatic, 20 people

From the graph below, 150 people have fasting blood sugar > 120 mg/dl.

From the graph below, 150 people have exercise-induced angina.

Code: To check multicollinearity

    # trestbps is the response; every other variable is a predictor
    X = data[['age', 'sex', 'cp', 'fbs', 'chol', 'target', 'thalach', 'exang', 'thal', 'restecg', 'oldpeak', 'slope', 'ca']]
    y = data['trestbps']
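The X and y above set up a regression of trestbps on the remaining variables. One common way to turn such a regression into a multicollinearity measure is the variance inflation factor, VIF = 1 / (1 - R^2), where R^2 comes from regressing one variable on all the others. The following is a minimal sketch, not the assignment's own code: the `vif` helper and the synthetic columns a, b and c are assumptions made for illustration and stand in for the HeartDisease.csv columns.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing
    column j on all the other columns (with an intercept).
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        # Ordinary least squares fit of column j on the other columns
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Synthetic stand-in data (hypothetical, NOT the HeartDisease.csv values)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)               # independent of a
c = a + 0.1 * rng.normal(size=200)     # nearly a copy of a
vifs = vif(np.column_stack([a, b, c]))
print(dict(zip(['a', 'b', 'c'], vifs.round(2))))
```

A VIF near 1 suggests little collinearity, while values above roughly 5 to 10 are often treated as a warning sign. For real data, statsmodels offers `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, which performs the same kind of calculation.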

Output:

Code: Training and Test Sets: Splitting Data | Normalization of the Dataset

    X = np.asarray(data[['age', 'sex', 'cp', 'fbs', 'trestbps', 'chol', 'exang', 'thal', 'restecg', 'thalach', 'oldpeak', 'slope', 'ca']])
    y = np.asarray(data['target'])

    from sklearn.model_selection import train_test_split
    # Hold out 30% of the rows as the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
    print('Train set:', X_train.shape, y_train.shape)
    print('Test set:', X_test.shape, y_test.shape)

Output:

Code: Modeling of the Dataset | Evaluation and Accuracy

    from sklearn.linear_model import LogisticRegression
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    y_pred = logreg.predict(X_test)

    # jaccard_similarity_score was removed in scikit-learn 0.23; for binary labels
    # it returned the same value as accuracy_score, which is used here instead
    from sklearn.metrics import accuracy_score
    print('Accuracy of the model is = ', accuracy_score(y_test, y_pred))

Output:

Code: Details of the confusion matrix along with the plot

    from sklearn.metrics import confusion_matrix, classification_report
    cm = confusion_matrix(y_test, y_pred)
    conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0', 'Predicted:1'], index=['Actual:0', 'Actual:1'])
    plt.figure(figsize=(8, 5))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="Greens")
    plt.show()
    print('The details of the confusion matrix are =')
    print(classification_report(y_test, y_pred))

Output:

Below is the confusion matrix of the test data (30% of the data).

Below are the figures before removing the trestbps variable:

After removing trestbps:
True Positive: 49
True Negative: 27
False Positive: 11
False Negative: 4

Code: Calculating Sensitivity and Specificity

    # confusion_matrix rows are actual classes and columns are predicted classes,
    # so cm[0, 0] = TN, cm[0, 1] = FP, cm[1, 0] = FN, cm[1, 1] = TP
    total = cm.sum()
    accuracy = (cm[0, 0] + cm[1, 1]) / total
    print('Accuracy : ', accuracy)
    sensitivity = cm[1, 1] / (cm[1, 0] + cm[1, 1])  # TP / (TP + FN)
    print('Sensitivity : ', sensitivity)
    specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])  # TN / (TN + FP)
    print('Specificity : ', specificity)

Output:

After removing trestbps:

In this case we consider the specificity of the data, as this is a screening test. We do not consider the sensitivity, as high prevalence will automatically increase the number of true positives. Specificity keeps a check on whether a patient without the disease is wrongly diagnosed. In the above case, the 0.924 value is TP / (TP + FN) = 49 / 53, which is the sensitivity (the share of actual positives that the model detects); the specificity is TN / (TN + FP) = 27 / 38 ≈ 0.711. Neither equals the NPV, which is TN / (TN + FN) = 27 / 31 ≈ 0.871, so about 87% of the negative test results are genuinely negative.
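As a worked check of the arithmetic, the standard definitions can be applied directly to the four counts reported above for the model without the blood pressure variable. This is a small sketch in plain Python; the variable names are chosen here for illustration, and only the four counts come from the assignment.

```python
# Counts reported in the text (after removing the resting blood pressure variable)
tp, tn, fp, fn = 49, 27, 11, 4

total = tp + tn + fp + fn
accuracy = (tp + tn) / total     # share of all predictions that are correct
sensitivity = tp / (tp + fn)     # share of actual positives that are detected
specificity = tn / (tn + fp)     # share of actual negatives that are cleared
ppv = tp / (tp + fp)             # share of positive predictions that are correct
npv = tn / (tn + fn)             # share of negative predictions that are correct

for name, value in [
    ('Accuracy', accuracy),
    ('Sensitivity', sensitivity),
    ('Specificity', specificity),
    ('PPV', ppv),
    ('NPV', npv),
]:
    print(f'{name:<12}{value:.3f}')
```

Under these definitions, 49 / 53 is the sensitivity, 27 / 38 is the specificity, and the NPV is a separate quantity computed from the negative predictions only.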