
ML using scikit-learn

Predicting absenteeism

A large problem within organisations is how to motivate their employees. This is a continuation of the previous blog, where we went through various feature engineering methods to come up with a comprehensive dataset. In this blog, we will use that data to predict employee absenteeism. The goal is to identify which employees are likely to be absent in the near future. This blog doesn't go through machine learning concepts or business logic, but rather the implementation of machine learning using the scikit-learn package in Python.

As a first step, let us load and look at the data.

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline 
pd.set_option('display.max_columns', None)

df = pd.read_csv("data_after_feature_engg.csv")
df['date'] = pd.to_datetime(df.date)
df.head()
Unnamed: 0 employee date last_likes last_dislikes feedbackType likes_till_date dislikes_till_date last_2_likes last_2_dislikes days_since_last_comment last_vote timezone stillExists no_of_days_since_first_vote no_of_votes_till_date perc_days_voted avg_vote_till_date avg_vote last_2_votes_avg prev_vote days_since_last_vote employee_joined_after_jun17 countdown_to_last_day reason on_leave no_leaves_till_date previous_day_leave last_2_days_leaves weekday month week
0 23729 17r 2018-05-29 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 1 1.0 1.0 3.0 2.121212 0.0 0.0 0 1.0 999 NaN 0.0 0.0 0.0 0.0 Tuesday May 1
1 23730 17r 2018-05-30 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 1 1.0 1.0 3.0 2.121212 0.0 0.0 0 1.0 999 NaN 0.0 0.0 0.0 0.0 Wednesday May 2
2 23731 17r 2018-05-31 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 2 1.0 1.0 3.0 2.121212 0.0 0.0 1 1.0 999 NaN 0.0 0.0 0.0 0.0 Thursday May 3
3 23732 17r 2018-06-01 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 3 2.0 0.5 3.0 2.121212 0.0 3.0 0 1.0 999 NaN 0.0 0.0 0.0 0.0 Friday Jun 1
4 23733 17r 2018-06-02 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 4 2.0 0.5 3.0 2.121212 0.0 0.0 0 1.0 999 NaN 0.0 0.0 0.0 0.0 Saturday Jun 2

Base model

This data is at an employee-day level, and we can predict if any employee will take leave on any particular date. As very few employees take leave on any particular day, we will have a highly imbalanced dataset.

1-df.on_leave.mean()
0.9514513766404592

This indicates that only about 5% of the rows correspond to employees taking leave. If we predicted that no employee takes leave on any day (class 0), we would be about 95% accurate, but such a prediction is not useful. This is the base model, and any model we build needs an accuracy greater than 95% to be worthwhile.
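As a quick sketch of this majority-class baseline (assuming on_leave is coded 0/1 with no missing values, as the data above suggests):

# Base model: always predict class 0 (not on leave)
base_accuracy = (df.on_leave == 0).mean()
print('Base model accuracy:', round(base_accuracy, 4))  # ~0.95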

Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score, roc_curve, roc_auc_score

From the different features that we created, we select the ones we think will be relevant for predicting which employees might be absent.

indep_vars = ['last_likes', 'last_dislikes', 'feedbackType', 
     'likes_till_date', 'dislikes_till_date', 'last_2_likes', 'last_2_dislikes', 
     'days_since_last_comment', 'last_vote', 'timezone', 'stillExists',
       'no_of_days_since_first_vote', 'no_of_votes_till_date',
       'perc_days_voted', 'avg_vote_till_date', 'avg_vote', 'last_2_votes_avg',
       'days_since_last_vote', 'employee_joined_after_jun17', 'countdown_to_last_day',
        'no_leaves_till_date', 'weekday', 'month']

The dependent variable is on_leave. We are also creating dummy variables that represent categorical data.

data_targets = df['on_leave'].astype('int')
data_features = pd.get_dummies(df[indep_vars], prefix = "_", drop_first= True)
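For example, get_dummies expands a categorical column such as weekday into one 0/1 indicator column per level (drop_first=True drops one level to avoid redundancy); a quick sketch of the resulting column names:

print(pd.get_dummies(df[['weekday']], prefix="_", drop_first=True).columns.tolist())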

To guard against overfitting, we split the data into a train set and a test set. The test set holds 30% of the data (selected randomly), while the train set holds the remaining 70% on which we train the model. The model is then evaluated against the test set to validate that it is not overfit.

x_train, x_test, y_train, y_test = train_test_split(data_features, data_targets, test_size=.30, random_state=35, \
                                                    stratify=data_targets)
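Because stratify=data_targets is used, both splits should preserve the roughly 5% leave rate; a quick check (a sketch):

print(round(y_train.mean(), 3), round(y_test.mean(), 3))  # both should be ~0.05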

Before building an ensemble of models, we can build a decision tree model and understand if the features we have selected perform the classification reasonably.

tree_clf = DecisionTreeClassifier(random_state=35, max_depth=3, class_weight="balanced").fit(x_train,y_train)
def plot_feature_importance(model):
    fs, ax = plt.subplots(1, figsize=(20,10))
    number_of_features = x_train.shape[1]
    plt.barh(range(number_of_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(number_of_features),x_train.columns)
    plt.xlabel("Feature Importance")
    plt.ylabel("Feature")
plot_feature_importance(tree_clf)

[Figure: feature importance bar chart for the decision tree]

plt.figure(figsize=(20,20))
plot_tree(tree_clf,  fontsize=10, feature_names = x_train.columns)
plt.show()

[Figure: plot of the fitted decision tree (max_depth=3)]

We can observe that the important features make reasonable sense:

  1. no_leaves_till_date: the number of leaves an employee has already taken affects the future leaves that person is likely to take.
  2. no_of_days_since_first_vote (a proxy for employee tenure) and employee_joined_after_jun17 (a proxy for newer employees) are significant, and we saw these trends in the visualisations in the feature engineering section.
  3. As seen in the visualisations, employees took leaves during months like June and October, which have come up as significant in the analysis.

We can see that the model is not overfit, with similar accuracy across the train and test datasets. The confusion matrix is plotted below:

def plot_confusion_matrix(model, x_test, y_test):
    plt.rcParams["figure.figsize"] = (5, 5)
    disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, model.predict(x_test), labels=model.classes_),
                           display_labels=model.classes_)
    disp.plot()
    plt.show();
print('Accuracy on train data is', round(tree_clf.score(x_train, y_train),2))
print('Accuracy on test data is', round(tree_clf.score(x_test, y_test),2))
plot_confusion_matrix(tree_clf, x_test, y_test)
Accuracy on train data is 0.81
Accuracy on test data is 0.81

[Figure: confusion matrix of the decision tree on the test data]

y_test_pred = tree_clf.predict(x_test)
print(classification_report(y_test, y_test_pred))
              precision    recall  f1-score   support

           0       1.00      0.80      0.89      9549
           1       0.20      0.98      0.34       487

    accuracy                           0.81     10036
   macro avg       0.60      0.89      0.61     10036
weighted avg       0.96      0.81      0.86     10036

We can observe that the precision on class 1 (when an employee takes a leave) is just 20%. This is due to the imbalance of classes and can be rectified in many ways, one of which is under-sampling. For other methods, refer to the blog on handling imbalanced classes. We also use GridSearchCV with n-fold cross-validation to find the best tuning parameters.
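As a small illustration of what under-sampling does (a sketch, separate from the pipeline built in the next section), RandomUnderSampler drops rows from the majority class until both classes are the same size:

from imblearn.under_sampling import RandomUnderSampler
x_res, y_res = RandomUnderSampler(random_state=35).fit_resample(x_train, y_train)
print(y_train.value_counts().to_dict(), '->', y_res.value_counts().to_dict())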

Downsampling

from imblearn.pipeline import Pipeline, make_pipeline
from imblearn.under_sampling import RandomUnderSampler
param_grid_dt = {
    'classification__criterion': ['gini', 'entropy'],
    'classification__max_depth': [2,5,10],
    'classification__max_features':[0.2, 0.5, 0.8]
}
model_dt_p = Pipeline([
    ('sampling', RandomUnderSampler()), # While training, downsample the majority class randomly
    ('classification', DecisionTreeClassifier()) # Train and predict using a decision tree classifier
    ])
model_dt = GridSearchCV(model_dt_p, param_grid_dt, cv=5)
model_dt.fit(x_train, y_train)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('sampling', RandomUnderSampler()),
                                       ('classification',
                                        DecisionTreeClassifier())]),
             param_grid={'classification__criterion': ['gini', 'entropy'],
                         'classification__max_depth': [2, 5, 10],
                         'classification__max_features': [0.2, 0.5, 0.8]})
y_test_pred = model_dt.predict(x_test)
print(classification_report(y_test, y_test_pred))
print('Accuracy on train data is', round(model_dt.score(x_train, y_train),2))
print('Accuracy on test data is', round(model_dt.score(x_test, y_test),2))
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      9549
           1       0.63      0.97      0.76       487

    accuracy                           0.97     10036
   macro avg       0.81      0.97      0.87     10036
weighted avg       0.98      0.97      0.97     10036

Accuracy on train data is 0.97
Accuracy on test data is 0.97

Here we observe that with a simple decision tree classifier, we already reach an accuracy above the 95% base accuracy, with good precision and recall values. This model is better than our base model. We could further improve it by using ensemble methods, which we will look into later in the blog.

def print_stats(model):
    print('Best params for the model', model.best_params_)
    print('Best score on the cross validation data', round(model.best_score_, 2))
    print('Accuracy on train data', round(model.score(x_train, y_train),2))
    print('Accuracy on test data', round(model.score(x_test, y_test),2))
print_stats(model_dt)

plot_confusion_matrix(model_dt, x_test, y_test)
Best params for the model {'classification__criterion': 'entropy', 'classification__max_depth': 10, 'classification__max_features': 0.5}
Best score on the cross validation data 0.96
Accuracy on train data 0.97
Accuracy on test data 0.97

[Figure: confusion matrix of the under-sampled decision tree on the test data]


Multiple models

After manually applying a few models, we can scale up and use the grid search method to find the best classifier and parameters quickly and efficiently. We can use Logistic Regression, Multinomial Naive Bayes, KNN classifiers, ensemble models, or any other model in the sklearn library.

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.pipeline import Pipeline
# setting up models

clf1 = LogisticRegression(random_state=35)
clf2 = MultinomialNB()
clf3 = KNeighborsClassifier()
clf4 = DecisionTreeClassifier(random_state=35)
clf5 = RandomForestClassifier(random_state = 35)
clf6 = GradientBoostingClassifier(random_state = 35)
# setting up hyperparameters 

# Hyperparameters for Logistic regression https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
hyperparam1 = {}
hyperparam1['classifier__C'] = [10**-4, 10**-1, 10**0] # The hyper-parameter is C and its inside 'classifier' part of pipeline
hyperparam1['classifier__penalty'] = ['l1', 'l2'] # The hyper-parameter is penalty and its inside 'classifier' part of pipeline
hyperparam1['classifier__class_weight'] = [None, 'balanced']
hyperparam1['classifier'] = [clf1]

# Hyperparameters for Naive Bayes
hyperparam2 = {}
hyperparam2['classifier__alpha'] = [10**0,10**4]
hyperparam2['classifier'] = [clf2]

# Hyperparameters for KNN
hyperparam3 = {}
hyperparam3['classifier__n_neighbors'] = [2, 5, 10]
hyperparam3['classifier'] = [clf3]

# Hyperparameters for Decision Trees
hyperparam4 = {}
hyperparam4['classifier__max_depth'] = [2,5,10, None]
hyperparam4['classifier__min_samples_split'] = [2, 5, 10]
hyperparam4['classifier__class_weight'] = [None, 'balanced']
hyperparam4['classifier'] = [clf4]

# Hyperparameters for Random Forest
hyperparam5 = {}
hyperparam5['classifier__n_estimators'] = [100, 250]
hyperparam5['classifier__max_depth'] = [2,5,10]
hyperparam5['classifier__class_weight'] = [None, 'balanced']
hyperparam5['classifier'] = [clf5]

# Hyperparameters for Gradient Boosting
hyperparam6 = {}
hyperparam6['classifier__n_estimators'] = [100, 250]
hyperparam6['classifier__max_depth'] = [2, 5, 10]
hyperparam6['classifier__min_samples_split'] = [2, 5, 10]
hyperparam6['classifier'] = [clf6]

pipe = Pipeline([('classifier',clf1)])
hyperparam_total = [hyperparam1, hyperparam2, hyperparam3, hyperparam4, hyperparam5, hyperparam6]
%%time
# Record how long the search takes

# Train the grid search model
model_grid_search = GridSearchCV(pipe, hyperparam_total, cv=3, n_jobs=-1, scoring='roc_auc').fit(x_train, y_train)
CPU times: total: 48.2 s
Wall time: 3min 23s
# Output the best estimator, which turns out to be a gradient boosting classifier
model_grid_search.best_estimator_
Pipeline(steps=[('classifier',
                 GradientBoostingClassifier(max_depth=5, min_samples_split=10,
                                            n_estimators=250,
                                            random_state=35))])

After running grid search to find the best parameters and models, we find the best model to be Gradient Boosting with the best parameters below.

# Output the parameters in the best estimator
model_grid_search.best_params_
{'classifier': GradientBoostingClassifier(random_state=35),
 'classifier__max_depth': 5,
 'classifier__min_samples_split': 10,
 'classifier__n_estimators': 250}

The score on the cross-validated data (ROC AUC, since we set scoring='roc_auc') is close to 99%.

round(model_grid_search.best_score_, 4)
0.9981
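To see how the other model families fared, the cross-validation results can also be inspected; a quick sketch:

results = pd.DataFrame(model_grid_search.cv_results_)
print(results[['param_classifier', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False).head())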

On the test data too, we have near-perfect precision, recall, and accuracy.

y_test_pred = model_grid_search.predict(x_test)
print(classification_report(y_test, y_test_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      9549
           1       0.99      0.95      0.97       487

    accuracy                           1.00     10036
   macro avg       0.99      0.98      0.99     10036
weighted avg       1.00      1.00      1.00     10036
plot_confusion_matrix(model_grid_search, x_test, y_test)
plt.show();

[Figure: confusion matrix of the best grid search model on the test data]

The features used in the model and their importances are:

def print_most_imp_features(model, data_features):
    feat_imp_df = pd.DataFrame({'columns' : data_features.columns, 'importance':(model.best_estimator_.named_steps["classifier"].feature_importances_*100).astype(int)})
    return feat_imp_df[feat_imp_df.importance>0].sort_values('importance', ascending=False)
print_most_imp_features(model_grid_search, data_features)
columns importance
18 no_leaves_till_date 40
9 no_of_days_since_first_vote 22
32 __Aug 7
41 __Oct 6
37 __Jun 5
10 no_of_votes_till_date 2
12 avg_vote_till_date 2
13 avg_vote 2
2 likes_till_date 1
6 days_since_last_comment 1
11 perc_days_voted 1
21 __INFORMATION 1
def plot_roc_curve(fpr, tpr, auc):
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr)
    ax.set(xlabel='False Positive Rate', ylabel='True Positive Rate')
    ax.grid()
    ax.text(0.6, 0.3, 'ROC AUC Score: {:.3f}'.format(auc),
            bbox=dict(boxstyle='square,pad=0.3', fc='white', ec='k'))
    lims = [np.min([ax.get_xlim(), ax.get_ylim()]), np.max([ax.get_xlim(), ax.get_ylim()])]
    ax.plot(lims, lims, 'k--')
    ax.set_xlim(lims)
    ax.set_ylim(lims)
    plt.title('ROC curve')
auc = roc_auc_score(y_test, y_test_pred)
fpr, tpr, _ = roc_curve(y_test, y_test_pred)
plot_roc_curve(fpr, tpr, auc)

[Figure: ROC curve of the best grid search model]
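Note that the curve above is built from the hard 0/1 predictions. A variant using predicted probabilities (a sketch, reusing the same helper) usually gives a smoother curve and a more informative AUC:

y_test_prob = model_grid_search.predict_proba(x_test)[:, 1]  # probability of class 1
fpr, tpr, _ = roc_curve(y_test, y_test_prob)
plot_roc_curve(fpr, tpr, roc_auc_score(y_test, y_test_prob))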

Hyperopt

Grid search works by trying every possible combination of the parameters you want to try in your model, which means the entire search can take a lot of time and become very computationally expensive. Hyperopt uses a form of Bayesian optimization for parameter tuning that allows you to get the best parameters for a given model. It can optimize a model with hundreds of parameters on a large scale.
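As a toy illustration of the hyperopt workflow (separate from the model code below): fmin minimizes an objective over a search space built from hp expressions, and space_eval translates the result back into actual parameter values.

from hyperopt import fmin, tpe, hp, space_eval
toy_space = hp.choice('max_depth', [2, 5, 10])
# fmin minimizes the objective, so returning -depth favours the largest depth
toy_best = fmin(fn=lambda depth: -depth, space=toy_space, algo=tpe.suggest, max_evals=10)
print(space_eval(toy_space, toy_best))  # -> 10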

from hyperopt import tpe, hp, fmin, STATUS_OK, space_eval, Trials
def get_best_hyperparameters(x_train, y_train):
    def objective(params):
        kfold = KFold(n_splits=3)
        classification_type = params['type']
        del params['type']
        if classification_type == 'rf':
            clf = RandomForestClassifier(**params)
        elif classification_type == 'xgboost':
            clf = GradientBoostingClassifier(**params)
        elif classification_type == 'logreg':
            clf = LogisticRegression(**params)
        elif classification_type == 'decision_tree':
            clf = DecisionTreeClassifier(**params)
        else:
            return {'loss': 0, 'status': STATUS_OK}
        # error_score=0 lets incompatible parameter combinations score poorly
        # instead of aborting the whole search
        score = cross_val_score(estimator=clf,
                                X=x_train, y=y_train, cv=kfold,
                                scoring='accuracy', error_score=0
                                ).mean()
        return {'loss': -score, 'status': STATUS_OK}

    search_space = hp.choice('classification_type', [
        {
          'type': 'rf',
          "n_estimators": hp.choice("n_estimators_rf", [20, 50, 100, 150]),
          "max_depth": hp.choice('max_depth_rf', range(3,30,1)),
          "max_features": hp.choice("max_features_rf", ['sqrt', 'log2']),
          "min_samples_split": hp.choice("min_samples_split_rf", [2,5,10]),
          "min_samples_leaf": hp.choice("min_samples_leaf_rf", [1, 2, 4, 10]),
          "bootstrap": hp.choice("bootstrap", [True, False]),
          'class_weight':hp.choice('class_weight_rf', ['balanced', None]),
        },
        {
          'type': 'xgboost',
          'learning_rate': hp.choice('learning_rate_xgb', [0.0001,0.001, 0.01, 0.1, 1]),
          'n_estimators' : hp.choice('n_estimators_xgb', range(100, 1000, 100)),
          'max_depth' : hp.choice('max_depth_xgb', range(3,10,3)),
          'min_weight_fraction_leaf' : hp.choice('min_weight_fraction_leaf_xgb', [i/20.0 for i in range(3,10)])
        },
        {
          'type': 'logreg',
          'penalty' : hp.choice('penalty_lr', ['l1', 'l2', 'elasticnet', None]),
          'class_weight':hp.choice('class_weight_lr', ['balanced', None]),
          'C':hp.choice('C_lr', [10**-4, 10**-1, 10**0])
        },
        {
          'type': 'decision_tree',
          "max_depth": hp.choice('max_depth_dt', range(3,10,3)),
          "min_samples_split": hp.choice("min_samples_split_dt", [2,5,10]),
          "min_samples_leaf": hp.choice("min_samples_leaf_dt", [4, 10]),
          'class_weight':hp.choice('class_weight_dt', ['balanced', None]),
        },
      ])
    trials = Trials()
    best_result = fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=32, trials=trials)

    # Print the values of the best parameters
    print(space_eval(search_space, best_result))
    return space_eval(search_space, best_result)

def create_model(best_hyperparameter):
    if(best_hyperparameter['type'] == 'rf'):
        return RandomForestClassifier(
          bootstrap = best_hyperparameter['bootstrap'],
          max_depth=int(best_hyperparameter['max_depth']),
          max_features = best_hyperparameter['max_features'],
          min_samples_leaf = int(best_hyperparameter['min_samples_leaf']),
          min_samples_split=int(best_hyperparameter['min_samples_split']),
          n_estimators = int(best_hyperparameter['n_estimators']),
          class_weight = best_hyperparameter['class_weight']
        )
    elif(best_hyperparameter['type'] == 'xgboost'):
        return GradientBoostingClassifier(
          learning_rate=best_hyperparameter['learning_rate'],
          n_estimators=int(best_hyperparameter['n_estimators']),
          max_depth=int(best_hyperparameter['max_depth']),
          min_weight_fraction_leaf=best_hyperparameter['min_weight_fraction_leaf']
        )
    elif(best_hyperparameter['type'] == 'logreg'):
        return LogisticRegression(
          penalty = best_hyperparameter['penalty'],
          class_weight = best_hyperparameter['class_weight'],
          C = best_hyperparameter['C']
        )
    elif(best_hyperparameter['type'] == 'decision_tree'):
        return DecisionTreeClassifier(
          max_depth=int(best_hyperparameter['max_depth']),
          min_samples_leaf = int(best_hyperparameter['min_samples_leaf']),
          min_samples_split=int(best_hyperparameter['min_samples_split']),
          class_weight = best_hyperparameter['class_weight'],
        )
best_hyperparameter = get_best_hyperparameters(x_train, y_train)
hyperopt_model = create_model(best_hyperparameter)
hyperopt_model.fit(x_train, y_train)
100%|███████████████████████████████████████████████| 32/32 [05:34<00:00, 10.46s/trial, best loss: -0.9956011103993166]
{'bootstrap': False, 'class_weight': None, 'max_depth': 28, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100, 'type': 'rf'}
y_test_pred = hyperopt_model.predict(x_test)
print(classification_report(y_test, y_test_pred))
plot_confusion_matrix(hyperopt_model, x_test, y_test)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      9549
           1       0.99      0.96      0.98       487

    accuracy                           1.00     10036
   macro avg       1.00      0.98      0.99     10036
weighted avg       1.00      1.00      1.00     10036

[Figure: confusion matrix of the hyperopt-tuned model on the test data]

Predictions

Using the model from the grid search, we can predict who is going to be absent on the next day.

next_day_df = df[df.date == max(df.date)].reset_index()
next_day_df.no_of_days_since_first_vote += 1
next_day_df.days_since_last_vote += 1

day_name= ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday','Sunday']
month_name = [None, 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
next_day_df.date = next_day_df.date + pd.DateOffset(1)
next_day_df['weekday'] = next_day_df.date.dt.weekday.apply(lambda x:day_name[x])
next_day_df['month'] = next_day_df.date.dt.month.apply(lambda x:month_name[x])
next_day_df['week'] = next_day_df.date.dt.day

# Handle categorical variables
next_day_df_predict = pd.get_dummies(next_day_df[indep_vars], prefix = "_")
# Align dummy columns with the training features; categories not present on this day are filled with 0
next_day_df_predict = next_day_df_predict.reindex(columns=x_train.columns, fill_value=0)

prob_current = model_grid_search.predict_proba(next_day_df_predict)
next_day_df['leave_prob'] = prob_current[:,1]

The top 5 employees with the highest probability of being absent on the next day are:

next_day_df.sort_values('leave_prob', ascending = False).head(5)
index Unnamed: 0 employee date last_likes last_dislikes feedbackType likes_till_date dislikes_till_date last_2_likes last_2_dislikes days_since_last_comment last_vote timezone stillExists no_of_days_since_first_vote no_of_votes_till_date perc_days_voted avg_vote_till_date avg_vote last_2_votes_avg prev_vote days_since_last_vote employee_joined_after_jun17 countdown_to_last_day reason on_leave no_leaves_till_date previous_day_leave last_2_days_leaves weekday month week leave_prob
8 5445 33450 3WW 2019-03-12 22.0 1.0 CONGRATULATION 22.0 1.0 22.0 1.0 331 4.0 Europe/Madrid 1.0 531 376.0 0.709434 2.151596 2.151596 4.0 4.0 1 1.0 999 NaN 0.0 1.0 0.0 0.0 Tuesday Mar 12 0.000477
36 23024 20195 aQJ 2019-03-12 8.0 8.0 INFORMATION 237.0 71.0 56.0 10.0 190 3.0 Europe/Madrid 1.0 489 139.0 0.446945 3.071942 3.071942 3.0 0.0 178 1.0 999 NaN 0.0 14.0 0.0 0.0 Tuesday Mar 12 0.000441
20 13726 10313 DNY 2019-03-12 0.0 0.0 OTHER 0.0 0.0 0.0 0.0 13 3.0 Europe/Madrid 1.0 678 236.0 0.351190 2.936441 2.936441 3.0 0.0 5 0.0 999 NaN 0.0 2.0 0.0 0.0 Tuesday Mar 12 0.000422
12 8421 9018 6lL 2019-03-12 1.0 0.0 OTHER 152.0 68.0 5.0 1.0 436 3.0 Europe/Madrid 1.0 676 194.0 0.294833 3.242268 3.242268 3.0 0.0 17 0.0 999 NaN 0.0 53.0 0.0 0.0 Tuesday Mar 12 0.000414
32 20505 19706 YDm 2019-03-12 19.0 0.0 CONGRATULATION 523.0 250.0 34.0 9.0 69 3.0 Europe/Madrid 1.0 678 550.0 0.812408 2.943636 2.943636 3.0 3.0 1 0.0 999 NaN 0.0 1.0 0.0 0.0 Tuesday Mar 12 0.000395

Save Models

The last step is to save the model for future deployment or runs. Scikit-learn models are usually saved as pickle files.

import pickle
# Dump the trained model with Pickle
model_pkl_filename = 'classifier_employee_absenteeism.pkl'
# Open the file to save as pkl file
model_pkl = open(model_pkl_filename, 'wb')
pickle.dump(model_grid_search, model_pkl)
# Close the pickle instances
model_pkl.close()
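To reuse it later, the saved model can be loaded back and used for inference; a minimal sketch:

# Load the trained model back from the pickle file and score a few rows
with open('classifier_employee_absenteeism.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
print(loaded_model.predict(x_test[:5]))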

References

  1. Scikit-learn documentation: link
  2. Satyam Kumar, How to tune multiple ML models with GridSearchCV at once?: link
  3. Notes and lectures, Machine Learning module, MSc Business analytics, Imperial College London, Class 2020-22
  4. Harsha A, Shaked A, Artem G, Tebogo M, Gokhan M: The workforce of the future Workforce Analytics
  5. Harsha A, Shaked A, Artem G, Tebogo M, Gokhan M: Predicting absenteeism Workforce Analytics
  6. Hyperopt: The Alternative Hyperparameter Optimization Technique You Need to Know link