






Performing data analysis on qualitative data using Python







Python Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Load the data set and print its shape, contents and summary statistics
data = pd.read_csv('HeartDisease.csv')
print(data.shape)
print(data)
print(data.describe())
data_top = data.head()
Output:
The graph below shows the diagnosis of heart disease (angiographic disease status): around 140 people have less than 50% diameter narrowing, and the rest have more than 50%.
The graph below shows the chest pain type (cp):
Value 0: typical angina, 140 people
Value 1: atypical angina, 50 people
Value 2: non-anginal pain, 80 people
Value 3: asymptomatic, 20 people
The graph below shows that 150 people have fasting blood sugar > 120 mg/dl.
The graph below shows that 150 people have exercise-induced angina.
Code: To check multicollinearity
# Regress trestbps on all remaining columns
X = data[['age', 'sex', 'cp', 'fbs', 'chol', 'target', 'thalach', 'exang', 'thal', 'restecg', 'oldpeak', 'slope', 'ca']]
y = data['trestbps']
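The setup above, with trestbps as the response and every other column as a predictor, looks like an auxiliary regression, which is one common way to check multicollinearity. Assuming that is the intent, the sketch below shows how the R² of such a regression translates into a variance inflation factor, VIF = 1 / (1 - R²). It uses NumPy only and synthetic stand-in data, not HeartDisease.csv; the helper names `r_squared` and `vif` and the array `X_demo` are hypothetical and do not come from the original exercise.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least-squares coefficients
    residuals = y - X1 @ beta
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def vif(X, j):
    """Variance inflation factor of column j: regress it on the other columns."""
    others = np.delete(X, j, axis=1)
    return 1.0 / (1.0 - r_squared(others, X[:, j]))

# Synthetic stand-in data (assumption: NOT the values from HeartDisease.csv)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)                 # independent of a
c = a + 0.1 * rng.normal(size=200)       # nearly collinear with a
X_demo = np.column_stack([a, b, c])

vifs = [vif(X_demo, j) for j in range(X_demo.shape[1])]
for name, value in zip(['a', 'b', 'c'], vifs):
    print(f'VIF({name}) = {value:.2f}')
```

A VIF close to 1 means the column is barely explained by the others; a common rule of thumb treats values well above 5 to 10 as a sign of multicollinearity. On the real data, the same idea would apply to `data[...]` converted with `np.asarray`.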
Output:
Code: Training and Test Sets: Splitting Data | Normalization of the Dataset
X = np.asarray(data[['age', 'sex', 'cp', 'fbs', 'trestbps', 'chol', 'exang', 'thal', 'restecg', 'thalach', 'oldpeak', 'slope', 'ca']])
y = np.asarray(data['target'])
from sklearn.model_selection import train_test_split
# Hold out 30% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
Output:
Code: Modeling of the Dataset | Evaluation and Accuracy
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
# jaccard_similarity_score was removed in scikit-learn 0.23; jaccard_score replaces it.
# For binary labels the removed function returned plain accuracy, whereas
# jaccard_score returns the Jaccard index of the positive class.
from sklearn.metrics import jaccard_score
print('Jaccard similarity score of the model is = ', jaccard_score(y_test, y_pred))
Output:
Code: Details of the confusion matrix along with the plot
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_test, y_pred)
# Rows are the actual classes, columns are the predicted classes
conf_matrix = pd.DataFrame(data = cm, columns = ['Predicted:0', 'Predicted:1'], index = ['Actual:0', 'Actual:1'])
plt.figure(figsize = (8, 5))
sns.heatmap(conf_matrix, annot = True, fmt = 'd', cmap = "Greens")
plt.show()
print('The details for confusion matrix is =')
print(classification_report(y_test, y_pred))
Output:
Below is the confusion matrix of the test data (30% of the data).
The figures below are the output before removing the trestbps variable:
After removing trestbps:
True Positive: 49
True Negative: 27
False Positive: 11
False Negative: 4
Code: Calculating Sensitivity and Specificity
# confusion_matrix puts the actual classes in rows and the predicted classes in columns,
# so cm[0,0] = TN, cm[0,1] = FP, cm[1,0] = FN, cm[1,1] = TP
total = cm.sum()
accuracy = (cm[0,0] + cm[1,1]) / total
print('Accuracy : ', accuracy)
sensitivity = cm[1,1] / (cm[1,0] + cm[1,1])
print('Sensitivity : ', sensitivity)
specificity1 = cm[0,0] / (cm[0,0] + cm[0,1])
print('Specificity : ', specificity1)
Output:
After removing trestbps:
In this case we focus on the specificity of the model, as this is a screening test. We do not rely on the sensitivity, because a high prevalence automatically increases the number of true positives. Specificity checks how often a patient without the disease is wrongly diagnosed as having it. From the counts above, the specificity is TN / (TN + FP) = 27 / 38 ≈ 0.711 and the sensitivity is TP / (TP + FN) = 49 / 53 ≈ 0.925. The negative predictive value (NPV) is a separate quantity, TN / (TN + FN) = 27 / 31 ≈ 0.871, so about 87% of the negative test results are genuinely negative.
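As a cross-check, the metrics can be recomputed directly from the four counts reported above (TP = 49, TN = 27, FP = 11, FN = 4). This is a minimal sketch, not the original notebook output: it assumes that target = 1 is the positive class and that the counts are arranged in scikit-learn's `confusion_matrix` layout (rows are actual classes, columns are predicted classes). The hard-coded array stands in for the `cm` produced by the model.

```python
import numpy as np

# Confusion matrix rebuilt from the stated counts (assumption: scikit-learn
# layout, rows = actual class, columns = predicted class, labels ordered 0, 1).
cm = np.array([[27, 11],    # actual 0: TN, FP
               [4, 49]])    # actual 1: FN, TP

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # share of all predictions that are correct
sensitivity = tp / (tp + fn)        # share of actual positives that are detected
specificity = tn / (tn + fp)        # share of actual negatives that are cleared
ppv = tp / (tp + fp)                # share of positive predictions that are right
npv = tn / (tn + fn)                # share of negative predictions that are right

for name, value in [('Accuracy', accuracy), ('Sensitivity', sensitivity),
                    ('Specificity', specificity), ('PPV', ppv), ('NPV', npv)]:
    print(f'{name:12s}: {value:.3f}')
```

Unpacking the matrix with `cm.ravel()` into named counts keeps each formula readable and makes it harder to mix up the row and column indices.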