[사이킷런]모델의 평가-피마 인디언 당뇨병 예제 데이터셋

28 Feb 2020

Reading time ~6 minutes

이번에는 앞선 (링크) 모델의 평가 포스팅에 이어 캐글에서 제공하는 (링크) 피마 인디언 당뇨병 데이터 셋으로 당뇨병 여부를 예측하는 모델을 만들고 해당 모델의 성능을 평가해보겠습니다.

피마 인디언 당뇨병 데이터 셋은 아래와 같은 피처로 구성되어 있습니다.

Pregnancies : 임신횟수
Glucose : 포도당 부하 검사 수치
BloodPressure : 혈압
SkinThickness : 팔 삼두근 뒤쪽의 피하지방 측정값
Insulin : 혈청 인슐린
BMI : 체질량 지수
DiabetesPedigreeFunction : 당뇨 내력 가중치 값
Age : 나이
Outcome : 당뇨여부(0 또는 1)

라이브러리 불러오기

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve
from sklearn.preprocessing import StandardScaler, Binarizer
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

데이터 불러오기

diabetes_data = pd.read_csv('pima indian/diabetes.csv')
print(diabetes_data['Outcome'].value_counts())
diabetes_data.head(3)

# 출력: 
0    500
1    268
Name: Outcome, dtype: int64

전체 768개의 데이터 중 Positive 값(1)이 268개, Negative 값(0)이 500개로 구성되어 있습니다.

이번에는 info( ) 함수를 사용하여 각 피처의 데이터 유형과 Null 값 여부를 살펴보겠습니다.

# diabetes 데이터 갼략히 보기(feature type 및 Null 값 개수 보기)
diabetes_data.info()

# 출력:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

확인해본 결과 Null 값은 존재하지 않으며, 모두 숫자형 데이터입니다.
때문에 Null 값과 문자열 처리를 위한 별도의 작업은 필요하지 않습니다.

로지스틱 회귀를 이용한 예측모델 생성

이번 데이터셋에서는 분류를 위한 알고리즘인 로지스틱 회귀를 이용하여 예측 모델을 만들어 보겠습니다.

# 모델 평가를 위한 함수 설정
def get_clf_eval(y_test, y_pred):
  confusion = confusion_matrix(y_test, y_pred)
  accuracy = accuracy_score(y_test, y_pred)
  precision = precision_score(y_test, y_pred)
  recall = recall_score(y_test, y_pred)
  F1 = F1_score(y_test, y_pred)
  AUC = roc_auc_curve(y_test, y_pred)
  # 평가지표 출력
  print('오차행렬:\n', confusion)
  print('\n정확도: {:.4f}'.format(accuracy))
  print('정밀도: {:.4f}'.format(precision))
  print('재현율: {:.4f}'.format(recall))
  print('F1: {:.4f}'.format(F1))
  print('AUC: {:.4f}'.format(AUC))

# Precision-Recall Curve Plot 그리기
def precision_recall_curve_plot(y_test, pred_proba):
  # threshold의 ndarray와 threshold 값별 정밀도, 재현율에 대한 ndarray 추출
  precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba)
  
  # x축을 threshold, y축을 정밀도, 재현율로 그래프 그리기
  plt.figure(figsize = (8, 6))
  thresholds_boundary = thresholds.shape[0]
  plt.plot(thresholds, precisions[:thresholds_boundary], linestyle = "--", label = "precision")
  plt.plot(thresholds, recalls[:thresholds_boundary], linestyle = ":", label = 'recall')
  
  # thresholds의 값 X축 scale을 0.1 단위로 변경
  start, end = plt.xlim()
  plt.xticks(np.round(np.arange(stard, end, 0.1), 2))
  
  plt.xlim()
  plt.xlabel('thresholds')
  plt.ylabel('precision & recall value')
  plt.legend()
  plt.grid()

# 피처 데이터 세트 X, 레이블 데이터 세트 y 를 추출
X = diabetes_data.iloc[:, :-1]
y = diabetes_data['Outcome']

# 데이터를 훈련과 테스트 데이터로 분리
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2, 
                                                    random_state = 156,
                                                    stratify = y)

# 로지스틱 회귀로 학습, 예측 및 평가를 수행
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
get_clf_eval(y_test, pred)

# 출력:
오차행렬:
 [[87 13]
 [22 32]]

정확도: 0.7727
정밀도: 0.7111
재현율: 0.5926
F1: 0.6465
AUC: 0.7313

전체 데이터를 모두 음성(0)으로 판정할 경우 약 65.10%(=500/768)의 정확도를 나타내는데 로지스틱 회귀의 경우, 별도의 데이터나 모델의 변경없이도 77.27%의 성능을 나타냅니다.

# 임계값별로 정밀도-재현율 시각화
pred_proba = lr_clf.predict_proba(X_test)[:, 1]
precision_recall_curve_plot(y_test, pred_proba)

위의 정밀도-재현율 그래프를 보면 임계값을 약 0.42 정도 수준일 때, 정밀도와 재현율이 균형을 이루는 것을 볼 수 있습니다. 이때 정밀도, 재현율은 0.7에 못 미치는 수준으로 그리 높지 않아 보입니다.

보다 모델의 성능을 높이기 위해 데이터를 다시 확인해보겠습니다.

# 데이터의 기초 통계값들
diabetes_data.describe()

위의 결과를 보면 최소값이 0으로 되어 있는 데이터들이 다수 존재하는 것을 볼 수 있습니다.

Glocose(당 수치), BloodPressure(혈압), SkinThickness(피하지망), 인슐린(Insulin), BMI(체질량지수) 같은 값이 실제로 0일 수는 없으므로 더 상세한 데이터 확인이 필요합니다.

f, ax = plt.subplots(2, 3, figsize = (15, 8))
diabetes_data['Glucose'].plot(kind = 'hist', bins = 20, ax = ax[0, 0])
ax[0,0].set_title('Histogram of Glocose')
diabetes_data['BloodPressure'].plot(kind = 'hist', bins = 20, ax = ax[0, 1])
ax[0,1].set_title('Histogram of BloodPressure')
diabetes_data['SkinThickness'].plot(kind = 'hist', bins = 20, ax = ax[0, 2])
ax[0,2].set_title('Histogram of SkinThickness')
diabetes_data['Insulin'].plot(kind = 'hist', bins = 20, ax = ax[1, 0])
ax[1,0].set_title('Histogram of Insulin')
diabetes_data['BMI'].plot(kind = 'hist', bins = 20, ax = ax[1, 1])
ax[1,1].set_title('Histogram of BMI')

plt.show()

# 위 컬럼들에 대한 0 값의 비율 확인
feature_list = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
zero_count = []
zero_percent = []
for col in feature_list:
  zero_num = diabetes_data[diabetes_data[col]==0].shape[0]
  zero_count.append(zero_num)
  zero_percent.append(np.round(zero_num/diabetes_data.shape[0]*100, 2))

zero = pd.DataFrame([zero_count, zero_percent], 
                    columns = feature_list,
                    index = ['count', 'percent'])

위의 값들이 실제로 0일 가능성은 희박해보이므로 0 값을 평균으로 대체해보겠습니다.
SkinThickness와 Insulin의 경우 0 값의 비율이 29.56%, 48.70%로 상당히 높아 변경 영향도가 상대적으로 더 클 것 같습니다.

# 0 값을 우선 np.nan으로 교체
diabetes_data[feature_list] = diabetes_data[feature_list].replace(0, np.nan)

# 위 5개 feature에 대해 0 값을 평균 값으로 대체
mean_features = diabetes_data[feature_list].mean()
diabetes_data[feature_list] = diabetes_data[feature_list].replace(0, mean_features)

그리고 로지스틱회귀의 경우, 숫자데이터에 스케일링을 적용하면 성능이 더 좋아지는 경우가 많으므로 StandardScaler를 적용해보겠습니다.

# 데이터 셋에 StandardScaler를 적용하여 변환하기
X = diabetes_data.iloc[:, :-1]
y = diabetes_data.iloc[:, -1]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 데이터를 훈련과 테스트 데이터로 분리
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, 
                                                    test_size = 0.2,
                                                    random_state = 156,
                                                    stratify = y)

# 로지스틱 회귀로 학습, 예측, 평가 수행
lr_clf = LogistricRegression()
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
get_clf_eval(y_test, pred)

# 출력:
오차행렬:
 [[89 11]
 [21 33]]

정확도: 0.7922
정밀도: 0.7500
재현율: 0.6111
F1: 0.6735
AUC: 0.7506

0 값을 평균으로 처리한 후 스케일링을 하고 나니 앞선 예측보다는 소폭 개선되었습니다.

이번에는 현재 모델의 임계값 변화에 따른 예측 성능을 확인해보겠습니다.

# 평가지표를 조사하기 위한 새로운 함수 생성
def get_eval_by_threshold(y_test, pred_proba_c1, thresholds):
  # thresholds list 객체 내의 값을 iteration하면서 평가 수행
  for custom_threshold in thresholds:
    binarizer = Binarizer(threshold = custom_threshold).fit(pred_proba_c1)
    custom_predict = binarizer.transform(pred_proba_c1)
    print('임계값: ', custom_thresholds)
    get_clf_eval(y_test, custom_predict)
    print('')

# 임계값 변화에 대한 예측 성능 확인
thresholds = [0.3, 0.33, 0.36, 0.39, 0.42, 0.45, 0.48, 0.50]
pred_proba = lr_clf.predict_proba(X_test)
get_eval_by_threshold(y_test, pred_proba[:, 1].reshape(-1, 1), thresholds)

# 출력:
임계값:  0.3
오차행렬:
 [[68 32]
 [ 9 45]]

정확도: 0.7338
정밀도: 0.5844
재현율: 0.8333
F1: 0.6870
AUC: 0.7567

임계값:  0.33
오차행렬:
 [[73 27]
 [11 43]]

정확도: 0.7532
정밀도: 0.6143
재현율: 0.7963
F1: 0.6935
AUC: 0.7631

임계값:  0.36
오차행렬:
 [[75 25]
 [13 41]]

정확도: 0.7532
정밀도: 0.6212
재현율: 0.7593
F1: 0.6833
AUC: 0.7546

임계값:  0.39
오차행렬:
 [[82 18]
 [15 39]]

정확도: 0.7857
정밀도: 0.6842
재현율: 0.7222
F1: 0.7027
AUC: 0.7711

임계값:  0.42
오차행렬:
 [[85 15]
 [18 36]]

정확도: 0.7857
정밀도: 0.7059
재현율: 0.6667
F1: 0.6857
AUC: 0.7583

임계값:  0.45
오차행렬:
 [[86 14]
 [19 35]]

정확도: 0.7857
정밀도: 0.7143
재현율: 0.6481
F1: 0.6796
AUC: 0.7541

임계값:  0.48
오차행렬:
 [[88 12]
 [19 35]]

정확도: 0.7987
정밀도: 0.7447
재현율: 0.6481
F1: 0.6931
AUC: 0.7641

임계값:  0.5
오차행렬:
 [[89 11]
 [21 33]]

정확도: 0.7922
정밀도: 0.7500
재현율: 0.6111
F1: 0.6735
AUC: 0.7506

위와 같이 정확도, 정밀도, 재현율, F1, AUC 등의 평가지표를 보고 적절히 판단하여 임계값을 선택한 후, 예측을 수행할 수 있습니다.

# 임계값을 0.48로 설정하여 예측 수행
binarizer = Binarizer(thresholds = 0.48)

# Binarizer를 이용하여 예측값 반환
pred_th_048 = binarizer.fit_transform(pred_proba[:, 1].reshape(-1, 1)) 

get_clf_eval(y_test, pred_th_048)

# 출력:
오차행렬:
 [[88 12]
 [19 35]]

정확도: 0.7987
정밀도: 0.7447
재현율: 0.6481
F1: 0.6931
AUC: 0.7641

Reference

파이썬 머신러닝 완벽가이드