Intelligence Authentic + Healthcare

Predicting Hospital Readmission Rates for Diabetic Patients - A Deep Learning Approach

 

Challenge

We provide a sample, initial exploration here for clients in the medical field. The objective is to build a deep learning model with strong initial results for predicting hospital readmission: specifically, whether a patient will not be readmitted, will be readmitted within 30 days of discharge, or will be readmitted more than 30 days after discharge. Using an open-source dataset, we find that the deep learning model outperformed classical machine learning algorithms by over 300% on out-of-sample F1-score for the target classes.
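
For concreteness, the three outcomes correspond to the three values of the readmitted column in the public diabetes dataset loaded in the notebook below. A minimal sketch of how the target is framed, assuming the same diabetic_data.csv file used later:

import pandas

data = pandas.read_csv("./diabetic_data.csv")
# The target takes three values:
#   'NO'  - never readmitted
#   '>30' - readmitted more than 30 days after discharge
#   '<30' - readmitted within 30 days of discharge
print(data['readmitted'].value_counts())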

KEY PRELIMINARY FACTS

  1. The dataset is heavily imbalanced: most patients are never readmitted or are readmitted more than 30 days after discharge, while readmission within 30 days is the minority class. We mitigate this with class weighting and by evaluating with the F1 measure (see the sketch after this list).

  2. Additional bias is present in that certain patients appear numerous times in the dataset. To normalize, only one encounter per patient was examined. More subtly, there are no date values for any admission/readmission, meaning that the values recorded for a patient who occurs more than once in the dataset may be influenced by prior admissions/readmissions.

  3. Many diagnoses and other variables did not have sufficient positive examples (99%+ negative) to warrant inclusion as predictor variables. These were removed from the analysis.
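
On point 1, macro-averaged F1 treats each class equally, so a model that ignores the minority <30 class is penalized even when its raw accuracy looks acceptable. A minimal sketch with hypothetical labels, assuming only scikit-learn (the normalizations in points 2 and 3 are implemented directly in the notebook cells below):

import sklearn.metrics

# Hypothetical labels for the three readmission classes
y_true = ['NO', 'NO', 'NO', 'NO', '>30', '>30', '<30', '<30']
y_pred = ['NO', 'NO', 'NO', 'NO', '>30', '>30', 'NO', 'NO']

# Accuracy looks fine even though '<30' is never predicted
print(sklearn.metrics.accuracy_score(y_true, y_pred))             # 0.75
# Macro F1 penalizes the completely missed minority class
print(sklearn.metrics.f1_score(y_true, y_pred, average='macro'))  # ~0.6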

KEY PRELIMINARY FINDINGS

Fast ICA, feature scaling, and other regularization of the inputs were needed to help the neural network converge (a condensed sketch of this preprocessing appears after the list below).

  1. Out-of-sample performance for the Keras neural network plateaued around a 40% F1-score, while the random forest / classical machine learning baseline plateaued around 12%.

  2. In-sample performance of both algorithms was similar, but regularizing the random forest did not improve its out-of-sample F1-score, while regularizing the Keras neural network (e.g., via dropout) did.

  3. Without the sampling- and feature-based normalizations, both algorithms achieved far worse results in sample and out of sample. In many cases, especially for the neural network, the algorithm failed to converge or learn over numerous iterations.
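
The full preprocessing is shown in the cells below; as a condensed sketch of the pattern referenced above (placeholder matrices and component counts, not the exact ones used later), features are min-max scaled on the training split and then reduced with FastICA before the network sees them:

import numpy
import sklearn.preprocessing
import sklearn.decomposition

# Placeholder feature matrices standing in for the train/test splits built below
x_train = numpy.random.rand(1000, 200)
x_test = numpy.random.rand(100, 200)

# Scale features to [0, 1], fitting only on the training data
scaler = sklearn.preprocessing.MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Reduce to a smaller set of independent components
ica = sklearn.decomposition.FastICA(n_components=50)
x_train_ica = ica.fit_transform(x_train_scaled)
x_test_ica = ica.transform(x_test_scaled)

print(x_train_ica.shape)   # (1000, 50)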

This short study highlights the importance of preparing datasets, both statistically and logically, in order to make progress in data science engagements. The code is shared below; feel free to contact us at info@intelauthentic.com with any comments, thoughts, or suggestions.

 

In [1]:

%matplotlib inline

import os
import sys
import json
import pandas
import sklearn
import scipy
import numpy
import itertools
import sklearn.ensemble
import sklearn.feature_selection
import sklearn.metrics
import sklearn.model_selection
from sklearn.model_selection import StratifiedKFold,train_test_split
import sklearn.preprocessing

In [2]:

data = pandas.read_csv("./diabetic_data.csv")

In [3]:

patient_data = [list(x) for x in data.groupby("patient_nbr")]
patient_data = [(x[0],len(x[1]),x[1]) for x in patient_data]

In [4]:

balanced_admissions = [ (x[0],x[2].iloc[0]) for x in patient_data]

In [5]:

# Plot the distribution of the number of admissions per patient

number_records = pandas.Series([x[1] for x in patient_data],index=[x[0] for x in patient_data])
number_records.hist()
number_records.describe()

Out[5]:

count    71518.000000
mean         1.422942
std          1.090740
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         40.000000
dtype: float64
[Figure: histogram of the number of admissions per patient]
 

In [6]:

# create the first admission df
index = [x[0] for x in balanced_admissions]
balanced_admission_df = pandas.concat([x[1] for x in balanced_admissions],axis=1).transpose()

In [7]:

print(balanced_admission_df.patient_nbr.value_counts().max())
balanced_admission_df = balanced_admission_df.set_index('patient_nbr')
1

In [8]:

print(balanced_admission_df.shape,number_records.shape)
(71518, 49) (71518,)

In [9]:

balanced_admission_df = balanced_admission_df.drop('encounter_id',axis=1)

stringify_columns = ['admission_type_id','discharge_disposition_id','admission_source_id']
for colname in stringify_columns:
    balanced_admission_df[colname] = balanced_admission_df[colname].map(str)

    
# TODO: Add New Features from paper

cols_to_drop = ['weight','payer_code','medical_specialty',
                'num_lab_procedures','num_procedures','num_medications','number_outpatient','number_emergency',
               'number_inpatient','number_diagnoses',
               'glimepiride','glipizide-metformin']

balanced_admission_df = balanced_admission_df.drop(cols_to_drop,axis=1)
diag_cols = [x for x in balanced_admission_df.columns if x.startswith('diag')]

In [10]:

readmitted = balanced_admission_df['readmitted']
balanced_admission_df = balanced_admission_df.drop('readmitted',axis=1)
balanced_admission_df = balanced_admission_df.replace('?',numpy.NAN)


# Dummy Cols
dummy_transform_cols = ['race', 'gender', 'age', 'admission_type_id',
       'discharge_disposition_id', 'admission_source_id',
        'max_glu_serum', 'A1Cresult', 'metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed']
dummy_transform_cols = [x for x in dummy_transform_cols if x in balanced_admission_df.columns]

In [11]:

inbalanced_cols = []

for colname in dummy_transform_cols:
    max_val = balanced_admission_df[colname].value_counts().max()*1. / len(balanced_admission_df)
    if max_val > 0.99:
        inbalanced_cols.append(colname)

dummy_transform_cols = [x for x in dummy_transform_cols if x not in inbalanced_cols]
#balanced_admission_df = balanced_admission_df.drop(inbalanced_cols,axis=1)
balanced_admission_df_dummies = pandas.get_dummies(balanced_admission_df[dummy_transform_cols])

In [12]:

def create_diag_type(balanced_admission_df):
    diag_cols = [x for x in balanced_admission_df.columns if 'diag' in x]
    diag_data = balanced_admission_df[diag_cols]
    
    values = [list(balanced_admission_df[colname].value_counts().index) for colname in diag_cols]
    values = list(set(list(itertools.chain.from_iterable(values))))
    values = sorted(values)
    new_cols = ['DIAG_' + str(value) for value in values]
    
    dict_val = {values[i]:i for i in range(len(values))}
    
    def apply_func(row):
        diagnoses = row.tolist()
        new_row = [0 for x in range(len(dict_val))]
        for d in diagnoses:
            if pandas.notnull(d):
                new_row[dict_val[d]] = 1
        return pandas.Series(new_row,index=new_cols)
    
    # create_dummies 
    diag_dummies = diag_data.apply(apply_func,axis=1)
    #diag_dummies.columns = new_cols
    return diag_dummies

diag_data = create_diag_type(balanced_admission_df)
    

In [13]:

# remove diagnoses with very few instances
bad_diag_cols = []
for colname in diag_data:
    max_val = diag_data[colname].value_counts().max() * 1. / len(diag_data)
    if max_val>0.99:
        bad_diag_cols.append(colname)
print(len(bad_diag_cols))
filtered_diag_data = diag_data.drop(bad_diag_cols,axis=1)
841

In [14]:

features = balanced_admission_df_dummies.join(filtered_diag_data,how='inner')

In [15]:

readmitted.value_counts()

Out[15]:

NO     42985
>30    22240
<30     6293
Name: readmitted, dtype: int64

In [16]:

# DL Experiment

import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, LeakyReLU
import keras.optimizers
import sklearn.decomposition

def f1_metric(y_true,y_pred):
    return sklearn.metrics.f1_score(y_true,y_pred,average='macro')


old_features = features.copy()
# Remove DIAG columns with very few positive examples (< 2% of rows)
diag_cols = [x for x in features.columns if "DIAG" in x]
cols_to_remove = []
for colname in diag_cols:
    if features[colname].sum()*1. / len(features) < 0.02:
        cols_to_remove.append(colname)
        
features = features.drop(cols_to_remove,axis=1)
# Generate pairwise interaction terms as additional inputs to the Keras model

dummy_targets = pandas.get_dummies(readmitted)
#keras_model = make_model(features,dummy_targets)

x_train,x_test,y_train,y_test = train_test_split(features,readmitted,test_size=0.1)

scaler = sklearn.preprocessing.MinMaxScaler()

poly_feature_maker = sklearn.preprocessing.PolynomialFeatures(degree=2,interaction_only=True)
poly_features = poly_feature_maker.fit_transform(x_train)
poly_feature_names = poly_feature_maker.get_feature_names()

poly_features = pandas.DataFrame(poly_features,columns=poly_feature_names,index=x_train.index)


# Drop interaction columns that are almost entirely zero (> 98% zeros)
poly_cols_to_drop = []
for colname in poly_features.columns:
    if sum(poly_features[colname].fillna(0.0) ==0.0) *1. / len(poly_features) > 0.98:
        poly_cols_to_drop.append(colname)

poly_train_features = poly_features.drop(poly_cols_to_drop,axis=1)
# Apply the same polynomial expansion and column filter to the test set
poly_test_features = poly_feature_maker.transform(x_test)
poly_test_features = pandas.DataFrame(poly_test_features,columns=poly_feature_names,index=x_test.index)
poly_test_features = poly_test_features.drop(poly_cols_to_drop,axis=1)

print("Poly Feature Shape",poly_features.shape)

x_train = pandas.concat([pandas.DataFrame(x_train),pandas.DataFrame(poly_train_features)],axis=1)
x_test = pandas.concat([pandas.DataFrame(x_test),pandas.DataFrame(poly_test_features)],axis=1)
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Add in Fast ICA Transform for feature reduction

print("Training Data Shape",x_train.shape)
ica = sklearn.decomposition.FastICA(n_components=500)
x_train = ica.fit_transform(x_train)
x_test = ica.transform(x_test)


# Determine Class Weights: inverse class frequency plus an offset,
# then log-dampened so the minority class is not over-weighted

class_weights = float(len(y_train)) / y_train.value_counts() + 3
class_weights = json.loads(class_weights.to_json())

print(class_weights)

for key in class_weights:
    class_weights[key] = numpy.log(class_weights[key]) + 1

print(class_weights)
# Keras expects integer class indices as keys, ordered to match the dummy columns
cols = sorted(y_train.value_counts().index.tolist())
class_weights = {i:class_weights[cols[i]] for i in range(len(cols))}

y_train,y_test = pandas.get_dummies(y_train),pandas.get_dummies(y_test)


print(y_train.head())
print(x_train.shape)
# make the final model
/Users/andrewgabriel/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Poly Feature Shape (64366, 10154)
Training Data Shape (64366, 1442)
{'NO': 4.663634014, '>30': 6.2183, '<30': 14.3400281889}
{'NO': 2.5397949755575637, '>30': 2.8274965574974784, '<30': 3.6630548009226804}
             <30  >30  NO
patient_nbr              
84218751       0    0   1
34242201       0    0   1
32799276       0    0   1
77140440       0    0   1
98284500       0    0   1
(64366, 500)

In [17]:

def make_model(x,y,use_dropout=True,dropout_rate=0.25):
    use_dropout = use_dropout and isinstance(dropout_rate,float)
    model = Sequential()
    model.add(Dense(128,input_dim = x.shape[1],activation='relu'))
    if use_dropout:
        model.add(Dropout(dropout_rate))
    model.add(Dense(64,activation='relu'))
    model.add(Dense(32))
    model.add(LeakyReLU(alpha=0.5))
    if use_dropout:
        model.add(Dropout(dropout_rate))
    model.add(Dense(16))
    model.add(LeakyReLU(alpha=0.5))
    model.add(Dense(8))
    model.add(LeakyReLU(alpha=0.5))
    model.add(Dense(3))
    model.add(Activation('softmax'))
    
    
    opt = keras.optimizers.Adagrad(lr=0.005, epsilon=None, decay=10**-6)
    
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['categorical_accuracy'])
    return model

keras_model = make_model(x_train,y_train)
keras_model.fit(x_train,y_train,epochs=100,class_weight=class_weights)
WARNING:tensorflow:Variable *= will be deprecated. Use variable.assign_mul if you want assignment to the variable value or 'x = x * y' if you want a new python Tensor object.
Epoch 1/100
64366/64366 [==============================] - 17s 271us/step - loss: 2.5668 - categorical_accuracy: 0.6010
Epoch 2/100
64366/64366 [==============================] - 19s 288us/step - loss: 2.5142 - categorical_accuracy: 0.6018
Epoch 3/100
64366/64366 [==============================] - 16s 256us/step - loss: 2.4836 - categorical_accuracy: 0.6030
Epoch 4/100
64366/64366 [==============================] - 19s 291us/step - loss: 2.4755 - categorical_accuracy: 0.6033
Epoch 5/100
64366/64366 [==============================] - 17s 259us/step - loss: 2.4701 - categorical_accuracy: 0.6050
Epoch 6/100
64366/64366 [==============================] - 18s 286us/step - loss: 2.4651 - categorical_accuracy: 0.6054
Epoch 7/100
64366/64366 [==============================] - 18s 276us/step - loss: 2.4618 - categorical_accuracy: 0.6060
Epoch 8/100
64366/64366 [==============================] - 16s 247us/step - loss: 2.4600 - categorical_accuracy: 0.6065
Epoch 9/100
64366/64366 [==============================] - 16s 246us/step - loss: 2.4574 - categorical_accuracy: 0.6076
Epoch 10/100
64366/64366 [==============================] - 17s 265us/step - loss: 2.4540 - categorical_accuracy: 0.6068
Epoch 11/100
64366/64366 [==============================] - 17s 257us/step - loss: 2.4514 - categorical_accuracy: 0.6088
Epoch 12/100
64366/64366 [==============================] - 16s 247us/step - loss: 2.4478 - categorical_accuracy: 0.6088
Epoch 13/100
64366/64366 [==============================] - 17s 266us/step - loss: 2.4431 - categorical_accuracy: 0.6108
Epoch 14/100
64366/64366 [==============================] - 17s 268us/step - loss: 2.4399 - categorical_accuracy: 0.6105
Epoch 15/100
64366/64366 [==============================] - 17s 264us/step - loss: 2.4349 - categorical_accuracy: 0.6125
Epoch 16/100
64366/64366 [==============================] - 17s 262us/step - loss: 2.4307 - categorical_accuracy: 0.6140
Epoch 17/100
64366/64366 [==============================] - 19s 299us/step - loss: 2.4281 - categorical_accuracy: 0.6148
Epoch 18/100
64366/64366 [==============================] - 18s 276us/step - loss: 2.4216 - categorical_accuracy: 0.6153
Epoch 19/100
64366/64366 [==============================] - 19s 288us/step - loss: 2.4142 - categorical_accuracy: 0.6164
Epoch 20/100
64366/64366 [==============================] - 18s 272us/step - loss: 2.4105 - categorical_accuracy: 0.6186
Epoch 21/100
64366/64366 [==============================] - 17s 265us/step - loss: 2.4023 - categorical_accuracy: 0.6195
Epoch 22/100
64366/64366 [==============================] - 17s 258us/step - loss: 2.3972 - categorical_accuracy: 0.6215
Epoch 23/100
64366/64366 [==============================] - 17s 257us/step - loss: 2.3914 - categorical_accuracy: 0.6217
Epoch 24/100
64366/64366 [==============================] - 17s 257us/step - loss: 2.3850 - categorical_accuracy: 0.6241
Epoch 25/100
64366/64366 [==============================] - 17s 268us/step - loss: 2.3748 - categorical_accuracy: 0.6248
Epoch 26/100
64366/64366 [==============================] - 16s 253us/step - loss: 2.3668 - categorical_accuracy: 0.6255
Epoch 27/100
64366/64366 [==============================] - 16s 250us/step - loss: 2.3566 - categorical_accuracy: 0.6276
Epoch 28/100
64366/64366 [==============================] - 16s 248us/step - loss: 2.3506 - categorical_accuracy: 0.6297
Epoch 29/100
64366/64366 [==============================] - 16s 245us/step - loss: 2.3401 - categorical_accuracy: 0.6332
Epoch 30/100
64366/64366 [==============================] - 16s 254us/step - loss: 2.3342 - categorical_accuracy: 0.6350
Epoch 31/100
64366/64366 [==============================] - 16s 249us/step - loss: 2.3290 - categorical_accuracy: 0.6339
Epoch 32/100
64366/64366 [==============================] - 16s 256us/step - loss: 2.3180 - categorical_accuracy: 0.6365
Epoch 33/100
64366/64366 [==============================] - 16s 249us/step - loss: 2.3121 - categorical_accuracy: 0.6388
Epoch 34/100
64366/64366 [==============================] - 16s 254us/step - loss: 2.3015 - categorical_accuracy: 0.6402
Epoch 35/100
64366/64366 [==============================] - 16s 252us/step - loss: 2.2974 - categorical_accuracy: 0.6412
Epoch 36/100
64366/64366 [==============================] - 18s 282us/step - loss: 2.2924 - categorical_accuracy: 0.6422
Epoch 37/100
64366/64366 [==============================] - 20s 305us/step - loss: 2.2818 - categorical_accuracy: 0.6428
Epoch 38/100
64366/64366 [==============================] - 17s 267us/step - loss: 2.2860 - categorical_accuracy: 0.6445
Epoch 39/100
64366/64366 [==============================] - 17s 260us/step - loss: 2.2747 - categorical_accuracy: 0.6460
Epoch 40/100
64366/64366 [==============================] - 17s 261us/step - loss: 2.2688 - categorical_accuracy: 0.6462
Epoch 41/100
64366/64366 [==============================] - 17s 257us/step - loss: 2.2640 - categorical_accuracy: 0.6478
Epoch 42/100
64366/64366 [==============================] - 20s 308us/step - loss: 2.2541 - categorical_accuracy: 0.6501
Epoch 43/100
64366/64366 [==============================] - 18s 282us/step - loss: 2.2524 - categorical_accuracy: 0.6511
Epoch 44/100
64366/64366 [==============================] - 17s 271us/step - loss: 2.2478 - categorical_accuracy: 0.6505
Epoch 45/100
64366/64366 [==============================] - 18s 287us/step - loss: 2.2397 - categorical_accuracy: 0.6527
Epoch 46/100
64366/64366 [==============================] - 23s 363us/step - loss: 2.2380 - categorical_accuracy: 0.6535
Epoch 47/100
64366/64366 [==============================] - 21s 319us/step - loss: 2.2361 - categorical_accuracy: 0.6532
Epoch 48/100
64366/64366 [==============================] - 18s 287us/step - loss: 2.2263 - categorical_accuracy: 0.6559
Epoch 49/100
64366/64366 [==============================] - 22s 338us/step - loss: 2.2245 - categorical_accuracy: 0.6548
Epoch 50/100
64366/64366 [==============================] - 17s 272us/step - loss: 2.2187 - categorical_accuracy: 0.6579
Epoch 51/100
64366/64366 [==============================] - 16s 252us/step - loss: 2.2098 - categorical_accuracy: 0.6598
Epoch 52/100
64366/64366 [==============================] - 16s 245us/step - loss: 2.2073 - categorical_accuracy: 0.6588
Epoch 53/100
64366/64366 [==============================] - 17s 263us/step - loss: 2.2067 - categorical_accuracy: 0.6586
Epoch 54/100
64366/64366 [==============================] - 15s 238us/step - loss: 2.1961 - categorical_accuracy: 0.6625
Epoch 55/100
64366/64366 [==============================] - 16s 252us/step - loss: 2.1951 - categorical_accuracy: 0.6612
Epoch 56/100
64366/64366 [==============================] - 19s 302us/step - loss: 2.1914 - categorical_accuracy: 0.6635
Epoch 57/100
64366/64366 [==============================] - 19s 298us/step - loss: 2.1862 - categorical_accuracy: 0.6609
Epoch 58/100
64366/64366 [==============================] - 17s 266us/step - loss: 2.1856 - categorical_accuracy: 0.6617
Epoch 59/100
64366/64366 [==============================] - 20s 306us/step - loss: 2.1757 - categorical_accuracy: 0.6632
Epoch 60/100
64366/64366 [==============================] - 18s 275us/step - loss: 2.1741 - categorical_accuracy: 0.6660
Epoch 61/100
64366/64366 [==============================] - 19s 293us/step - loss: 2.1681 - categorical_accuracy: 0.6654
Epoch 62/100
64366/64366 [==============================] - 18s 275us/step - loss: 2.1729 - categorical_accuracy: 0.6648
Epoch 63/100
64366/64366 [==============================] - 16s 243us/step - loss: 2.1688 - categorical_accuracy: 0.6657
Epoch 64/100
64366/64366 [==============================] - 19s 292us/step - loss: 2.1639 - categorical_accuracy: 0.6656
Epoch 65/100
64366/64366 [==============================] - 17s 267us/step - loss: 2.1598 - categorical_accuracy: 0.6661
Epoch 66/100
64366/64366 [==============================] - 16s 241us/step - loss: 2.1549 - categorical_accuracy: 0.6688
Epoch 67/100
64366/64366 [==============================] - 17s 267us/step - loss: 2.1498 - categorical_accuracy: 0.6683
Epoch 68/100
64366/64366 [==============================] - 19s 298us/step - loss: 2.1469 - categorical_accuracy: 0.6706
Epoch 69/100
64366/64366 [==============================] - 18s 282us/step - loss: 2.1455 - categorical_accuracy: 0.6706
Epoch 70/100
64366/64366 [==============================] - 18s 278us/step - loss: 2.1455 - categorical_accuracy: 0.6695
Epoch 71/100
64366/64366 [==============================] - 18s 277us/step - loss: 2.1398 - categorical_accuracy: 0.6717
Epoch 72/100
64366/64366 [==============================] - 21s 329us/step - loss: 2.1305 - categorical_accuracy: 0.6706
Epoch 73/100
64366/64366 [==============================] - 22s 347us/step - loss: 2.1282 - categorical_accuracy: 0.6729
Epoch 74/100
64366/64366 [==============================] - 19s 303us/step - loss: 2.1232 - categorical_accuracy: 0.6729
Epoch 75/100
64366/64366 [==============================] - 19s 292us/step - loss: 2.1231 - categorical_accuracy: 0.6736
Epoch 76/100
64366/64366 [==============================] - 20s 303us/step - loss: 2.1185 - categorical_accuracy: 0.6748
Epoch 77/100
64366/64366 [==============================] - 16s 249us/step - loss: 2.1153 - categorical_accuracy: 0.6751
Epoch 78/100
64366/64366 [==============================] - 15s 233us/step - loss: 2.1137 - categorical_accuracy: 0.6736
Epoch 79/100
64366/64366 [==============================] - 15s 235us/step - loss: 2.1104 - categorical_accuracy: 0.6738
Epoch 80/100
64366/64366 [==============================] - 16s 244us/step - loss: 2.1103 - categorical_accuracy: 0.6764
Epoch 81/100
64366/64366 [==============================] - 16s 241us/step - loss: 2.1030 - categorical_accuracy: 0.6752
Epoch 82/100
64366/64366 [==============================] - 15s 228us/step - loss: 2.1007 - categorical_accuracy: 0.6773
Epoch 83/100
64366/64366 [==============================] - 15s 231us/step - loss: 2.1013 - categorical_accuracy: 0.6753
Epoch 84/100
64366/64366 [==============================] - 17s 264us/step - loss: 2.0949 - categorical_accuracy: 0.6771
Epoch 85/100
64366/64366 [==============================] - 15s 235us/step - loss: 2.0939 - categorical_accuracy: 0.6780
Epoch 86/100
64366/64366 [==============================] - 14s 225us/step - loss: 2.0857 - categorical_accuracy: 0.6781
Epoch 87/100
64366/64366 [==============================] - 15s 234us/step - loss: 2.0857 - categorical_accuracy: 0.6788
Epoch 88/100
64366/64366 [==============================] - 18s 272us/step - loss: 2.0800 - categorical_accuracy: 0.6793
Epoch 89/100
64366/64366 [==============================] - 15s 233us/step - loss: 2.0847 - categorical_accuracy: 0.6794
Epoch 90/100
64366/64366 [==============================] - 15s 231us/step - loss: 2.0721 - categorical_accuracy: 0.6807
Epoch 91/100
64366/64366 [==============================] - 17s 257us/step - loss: 2.0766 - categorical_accuracy: 0.6811
Epoch 92/100
64366/64366 [==============================] - 17s 265us/step - loss: 2.0708 - categorical_accuracy: 0.6822
Epoch 93/100
64366/64366 [==============================] - 17s 259us/step - loss: 2.0692 - categorical_accuracy: 0.6825
Epoch 94/100
64366/64366 [==============================] - 17s 260us/step - loss: 2.0620 - categorical_accuracy: 0.6822
Epoch 95/100
64366/64366 [==============================] - 17s 259us/step - loss: 2.0642 - categorical_accuracy: 0.6839
Epoch 96/100
64366/64366 [==============================] - 17s 261us/step - loss: 2.0678 - categorical_accuracy: 0.6815
Epoch 97/100
64366/64366 [==============================] - 17s 263us/step - loss: 2.0654 - categorical_accuracy: 0.6808
Epoch 98/100
64366/64366 [==============================] - 17s 263us/step - loss: 2.0525 - categorical_accuracy: 0.6850
Epoch 99/100
64366/64366 [==============================] - 17s 264us/step - loss: 2.0567 - categorical_accuracy: 0.6831
Epoch 100/100
64366/64366 [==============================] - 17s 266us/step - loss: 2.0499 - categorical_accuracy: 0.6827

Out[17]:

<keras.callbacks.History at 0x10f8967f0>

In [18]:

is_pred,oos_pred = keras_model.predict(x_train),keras_model.predict(x_test)
is_pred = pandas.DataFrame(is_pred)
oos_pred = pandas.DataFrame(oos_pred)
is_pred.columns = cols
oos_pred.columns = cols

In [19]:

is_pred_single = is_pred.apply(numpy.argmax,axis=1)
oos_pred_single = oos_pred.apply(numpy.argmax,axis=1)

print("Insample F1 DL: ",f1_metric(y_train.idxmax(axis=1),is_pred_single))
print("Out of Sample F1 DL: ",f1_metric(y_test.idxmax(axis=1),oos_pred_single))
/Users/andrewgabriel/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py:52: FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
  return getattr(obj, method)(*args, **kwds)
Insample F1 DL:  0.5754620551510934
Out of Sample F1 DL:  0.38473039735879744

In [20]:

print(is_pred_single.value_counts())
print(y_train.idxmax(axis=1).value_counts())
print(oos_pred_single.value_counts())
print(y_test.idxmax(axis=1).value_counts())
NO     46221
>30    16133
<30     2012
dtype: int64
NO     38690
>30    20000
<30     5676
dtype: int64
NO     5235
>30    1767
<30     150
dtype: int64
NO     4295
>30    2240
<30     617
dtype: int64

In [21]:

# Classical baseline: random forest - doesn't work too well out of sample
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=100,criterion='entropy',class_weight='balanced',
                                                min_samples_leaf=20)
rf.fit(x_train,y_train)    
is_preds,oos_preds = rf.predict(x_train),rf.predict(x_test)
f1_is,f1_oos = sklearn.metrics.f1_score(y_train,is_preds,average='macro'),sklearn.metrics.f1_score(y_test,oos_preds,average='macro')
print(f1_is,f1_oos)
0.6069455214069673 0.11687895906991708