
Feature engineering for Machine Learning

This is a blog post on some common feature engineering techniques for continuous time-series data. Different features are created across four different datasets for the same use case.

Predicting absenteeism

A major problem within organisations is how to motivate their employees. In this blog, we will use HappyForce data to predict employee absenteeism. The goal is to identify employees who are likely to be absent in the near future, and to find the reasons for absenteeism.

This blog is the first part and contains the feature engineering and EDA. Part 2 contains the machine learning part, where we build models and compare the results. Part 3 contains modelling using neural networks. This post shows various ways in which feature engineering can be carried out on time-series datasets.

We focused on data from 408 employees of one company in Spain; 62 have a record of absence. The chart below summarizes the main reasons for leave; in our case, common sickness and non-job-related accidents are the main reasons. Given that we are analysing employee absence as it relates to motivation, it is reasonable to assume that a portion of these employees simply call in sick because they lack motivation or commitment to their workplace.

Employee absenteeism dataset

The dataset contains 7 files. First, let us look at the employeeAbsenteeism dataset.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date
pd.options.display.max_columns = None
df = pd.read_csv('employeeAbsenteeism.csv')
df.head()
employee companyAlias from to reason
0 19Q C1 10/1/18 0:10 10/26/18 0:10 Workplace accident
1 NY3 C1 10/1/18 0:10 10/31/18 0:10 Common sickness or accident not related to th...
2 qKO C1 10/1/18 0:10 10/5/18 0:10 Common sickness or accident not related to th...
3 qKO C1 10/10/18 0:10 10/31/18 0:10 Common sickness or accident not related to th...
4 2wx C1 10/1/18 0:10 10/31/18 0:10 Common sickness or accident not related to th...

Convert strings in different formats to datetime

The columns 'from' and 'to' are datetime values stored as strings in an unspecified format.

df['to'] = pd.to_datetime(df['to'])     # converts string to datetime
df['from'] = pd.to_datetime(df['from'])
df.head()
employee companyAlias from to reason
0 19Q C1 2018-10-01 00:10:00 2018-10-26 00:10:00 Workplace accident
1 NY3 C1 2018-10-01 00:10:00 2018-10-31 00:10:00 Common sickness or accident not related to th...
2 qKO C1 2018-10-01 00:10:00 2018-10-05 00:10:00 Common sickness or accident not related to th...
3 qKO C1 2018-10-10 00:10:00 2018-10-31 00:10:00 Common sickness or accident not related to th...
4 2wx C1 2018-10-01 00:10:00 2018-10-31 00:10:00 Common sickness or accident not related to th...
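Strings like 10/1/18 are ambiguous between day-first and month-first (here the value 10/26/18 confirms month-first). When the format is known, passing it explicitly is both safer and faster; a sketch equivalent to the conversion above:

# Explicit format avoids silent day/month swaps (assumes '%m/%d/%y %H:%M')
df['from'] = pd.to_datetime(df['from'], format='%m/%d/%y %H:%M')
df['to'] = pd.to_datetime(df['to'], format='%m/%d/%y %H:%M')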

We can see that the dataset contains details of the leaves taken by employees: the from-date, the to-date, and the reason for leave given by the employee.

df.describe(include='all', datetime_is_numeric=True)
employee companyAlias from to reason
count 106 106 106 106 106
unique 62 1 NaN NaN 4
top yKX C1 NaN NaN Common sickness or accident not related to th...
freq 4 106 NaN NaN 96
mean NaN NaN 2018-08-11 04:12:33.962264320 2018-08-26 23:13:41.886792448 NaN
min NaN NaN 2018-06-01 00:06:00 2018-06-06 00:06:00 NaN
25% NaN NaN 2018-06-18 06:06:00 2018-06-30 00:06:00 NaN
50% NaN NaN 2018-08-01 00:08:00 2018-08-31 00:08:00 NaN
75% NaN NaN 2018-10-01 00:10:00 2018-10-23 12:10:00 NaN
max NaN NaN 2018-10-31 00:10:00 2018-10-31 00:10:00 NaN

Further, we can see that the data covers only one company, with 62 employees taking a total of 106 leaves (leave records, not days of leave) for 4 unique reasons.

Difference in days

Creating a feature 'leave_time', which is the leave period in days.

df['leave_time'] = ((df['to'] - df['from'])/np.timedelta64(1, 'D')).astype(int) # takes the difference in days
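Equivalently, the timedelta accessor yields whole days directly:

df['leave_time'] = (df['to'] - df['from']).dt.days # same result via the .dt accessor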

f, ax = plt.subplots(1, figsize = (20,10))
df['leave_time'].plot.hist()
plt.title('Histogram of the leave period')
plt.show()

png

The maximum leave period is 30 days, and the frequency of each leave period is shown below. The majority of employees take either a short 1-3 day leave or a much longer 27-30 day leave (likely due to a sickness, accident, or similar issue).

df_for_eda = df.copy()

# Splitting the continuous variable into classes
df_for_eda['leave_time_class'] = pd.cut(df_for_eda.leave_time, 
                                    [0, 1, 5, 10, 15, 20, 25, 29, 30], 
                                    labels = ['1 day', '1-5 days', '5-10 days', '10-15 days', '15-20 days', '20-25 days', '25-29 days', '30 days'])
grouped_df_for_eda = df_for_eda.groupby('leave_time_class').aggregate(
    number_of_instances = ('leave_time_class', 'count'),
    unique_employees = ('employee', 'nunique')
).reset_index().assign(instance_percentage = lambda a:round(a.number_of_instances*100/(sum(a.number_of_instances)),2)).\
    assign(cumulative_percentage = lambda x:x.instance_percentage.cumsum())

fs, axs = plt.subplots(1, figsize=(20,10))
grouped_df_for_eda.index = grouped_df_for_eda['leave_time_class']
grouped_df_for_eda.number_of_instances.plot(kind='bar', ax=axs)
axs.set_ylabel('No of instances')
axs.set_xlabel('leave period (days)')
axs.set_title('Histogram of leave period')
axs.bar_label(axs.containers[0], 
              labels = round(grouped_df_for_eda.instance_percentage, 1).astype(str) + "% - " + grouped_df_for_eda.number_of_instances.astype(str), label_type='edge')

plt.show();
grouped_df_for_eda

png

leave_time_class number_of_instances unique_employees instance_percentage cumulative_percentage
1 day 14 13 14.58 14.58
1-5 days 15 14 15.62 30.20
5-10 days 10 9 10.42 40.62
10-15 days 4 4 4.17 44.79
15-20 days 3 3 3.12 47.91
20-25 days 5 5 5.21 53.12
25-29 days 19 17 19.79 72.91
30 days 26 17 27.08 99.99

There are four reasons given for leaves, with "Common sickness or accident not related to the job" as the most common. The pie chart below summarizes the proportion of leave days taken for each reason; common sickness and non-job-related accidents account for most of the absence.

df.groupby('reason').\
    aggregate({'leave_time':'sum'}).\
    plot.pie(y='leave_time', figsize=(10, 10), 
             autopct='%1.1f%%', # to add the percentages text
             wedgeprops=dict(width=.5)
            )
plt.ylabel("")
plt.title('Proportion of the number of days leaves were taken due to different reasons')
plt.show()

png

grouped_df_for_eda = df_for_eda.groupby(['reason', 'leave_time_class']).aggregate(
    number_of_instances = ('leave_time_class', 'count')
).reset_index().pivot(index = 'leave_time_class', columns = 'reason',
          values = 'number_of_instances')
fs, axs = plt.subplots(1, figsize=(20,10))
grouped_df_for_eda.plot(kind='bar', ax=axs)
axs.set_ylabel('No of instances')
axs.set_xlabel('leave period (days)')
axs.set_title('Histogram of leave period')
plt.show();

png

# Creating class for the categorical variable
df_for_eda['reason_class'] = np.where(df_for_eda['reason']=='Common sickness  or accident not related to the job', 'sick', 'others')

grouped_df_for_eda = df_for_eda.groupby(['reason_class', 'leave_time_class']).aggregate(
    number_of_instances = ('leave_time_class', 'count')
).reset_index().pivot(index = 'leave_time_class', columns=['reason_class'],
          values = ['number_of_instances']).reset_index()
# Naming columns
grouped_df_for_eda.columns = ['leave_time_class', 'others_number_of_instances', 'sick_number_of_instances']
# Adding columns for percentages
grouped_df_for_eda = grouped_df_for_eda.\
    assign(others_percentage = lambda x:round(x.others_number_of_instances*100/(sum(x.others_number_of_instances)),2)).\
    assign(others_cumulative_perc = lambda a:round(a.others_percentage.cumsum(), 2)).\
    assign(sick_percentage = lambda x:round(x.sick_number_of_instances*100/(sum(x.sick_number_of_instances)),2)).\
    assign(sick_cumulative_perc = lambda a:round(a.sick_percentage.cumsum(), 2))

# Plotting
fs, axs = plt.subplots(1, 2, sharex='col', sharey='row', figsize = (15,7))
grouped_df_for_eda.index = grouped_df_for_eda['leave_time_class']
grouped_df_for_eda.sick_number_of_instances.plot(kind='bar', ax=axs[0])
grouped_df_for_eda.others_number_of_instances.plot(kind='bar', ax=axs[1])

axs[0].bar_label(axs[0].containers[0], labels = round(grouped_df_for_eda.sick_percentage, 1).astype(str)+str("%"), label_type='edge')
axs[1].bar_label(axs[1].containers[0], labels = round(grouped_df_for_eda.others_percentage, 1).astype(str)+str("%"), label_type='edge')

axs[0].set_ylabel('No of leaves')
axs[0].set_xlabel('(leave period)')
axs[0].set_title('Common sickness  or accident not related to the job')
axs[1].set_xlabel('(leave period)')
axs[1].set_title('Other type of leaves')

plt.show();
grouped_df_for_eda

png

leave_time_class others_number_of_instances sick_number_of_instances others_percentage others_cumulative_perc sick_percentage sick_cumulative_perc
1 day 0 14 0.0 0.0 16.28 16.28
1-5 days 0 15 0.0 0.0 17.44 33.72
5-10 days 2 8 20.0 20.0 9.30 43.02
10-15 days 0 4 0.0 20.0 4.65 47.67
15-20 days 0 3 0.0 20.0 3.49 51.16
20-25 days 1 4 10.0 30.0 4.65 55.81
25-29 days 3 16 30.0 60.0 18.60 74.41
30 days 4 22 40.0 100.0 25.58 99.99

Create variable for date and day-of-month

Creating a new feature 'date', which captures the date the leave started, and 'from_day' and 'to_day', the day of the month on which the leave started and ended.

df['date'] = df['from'].dt.date.astype('datetime64[ns]') # gets only the date part from the timeframe
df['from_day'] = df['from'].dt.day
df['to_day'] = df['to'].dt.day

Additionally, we can observe that most of the leaves started on the first of the month and ended at the end of the month.

f, ax = plt.subplots(1, 2, figsize = (20,10), sharey = True)
df.groupby(df['from'].dt.day)['employee'].count().\
    plot(kind='bar', ax=ax[0], title = 'Number of leaves with "from" date')
df.groupby(df['to'].dt.day)['employee'].count().\
    plot(kind='bar', ax=ax[1], title = 'Number of leaves with "to" date')
plt.show();

png

The scatterplot below shows the number of employees on leave over time.

# Feature engineering on absentism dataset
# creating a dataframe with leaves as per employee and dates
leave_df = pd.DataFrame({'date':[], 'employee':[], 'reason':[]})
for index, row in df.iterrows():
    dates = pd.date_range(row['from'].date(), row['to'].date())
    for date in dates:
        leave_df = pd.concat([leave_df, pd.DataFrame({'date':[date], 'employee':[row['employee']], 'reason':[row['reason']]})])
leave_df['on_leave'] = 1

# Creating a cumulative sum to get the number of leaves taken by the employee till date
leave_df['no_leaves_till_date'] = leave_df.groupby('employee')['on_leave'].transform(lambda x: x.cumsum().shift()).fillna(0)
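Growing a DataFrame with pd.concat inside a loop is quadratic in the number of rows. As a sketch, the same expansion can be done in one pass with explode (pandas 0.25 or newer); leave_df_alt is a hypothetical name:

# Vectorised alternative: build one date range per leave record, then explode it into rows
leave_df_alt = df.assign(
    date = df.apply(lambda r: pd.date_range(r['from'].date(), r['to'].date()), axis=1)
).explode('date')[['date', 'employee', 'reason']]
leave_df_alt['on_leave'] = 1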

f, ax = plt.subplots(1, figsize = (20,10))
leave_df.groupby(['date'])['on_leave'].sum().reset_index().plot(x = 'date', y = 'on_leave', kind = 'scatter', ax=ax)
plt.title('Number of employees on leave')
plt.show()

png

We can see that most of the leaves fall within three months of 2018, indicating that these records could be a subset of the full data.

Employee details dataset

The next dataset to look at is lastParticipationExists, which has the last participation date along with the details of each employee. This data also has the number of votes each employee has cast till date (a variable we can use as a proxy for engagement).

employee_details = pd.read_csv("lastParticipationExists.csv")

# Convert to datetime
employee_details.lastParticipationDate = pd.to_datetime(employee_details.lastParticipationDate)
employee_details.deletedOn = pd.to_datetime(employee_details.deletedOn)

# Convert to integer
employee_details.stillExists = employee_details.stillExists.astype(int)
employee_details
employee companyAlias numVotes lastParticipationDate timezone stillExists deletedOn
0 l8 C1 285 2019-03-08 01:03:00 Europe/Madrid 1 NaT
1 Xv C1 143 2018-04-21 02:04:00 Europe/Berlin 1 NaT
2 w7 C1 381 2019-03-11 01:03:00 Europe/Madrid 1 NaT
3 jE C1 173 2019-03-01 01:03:00 Europe/Madrid 1 NaT
4 QP C1 312 2019-03-08 01:03:00 Europe/Berlin 1 NaT
... ... ... ... ... ... ... ...
475 D7J C1 29 2018-11-19 01:11:00 Europe/Madrid 0 2018-11-20 13:11:00
476 9KA C1 50 2018-11-09 01:11:00 Europe/Madrid 0 2018-12-13 16:12:00
477 zR7 C1 42 2018-10-26 02:10:00 Europe/Madrid 0 2018-11-20 13:11:00
478 B7E C1 16 2019-01-21 01:01:00 Europe/Madrid 0 2019-02-11 18:02:00
479 QJg C1 1 2018-11-28 01:11:00 Europe/Madrid 0 2019-01-28 10:01:00

480 rows × 7 columns

employee_details.describe(include='all', datetime_is_numeric=True)
employee companyAlias numVotes lastParticipationDate timezone stillExists deletedOn
count 480 480 480.000000 470 480 480.00000 105
unique 480 1 NaN NaN 6 NaN NaN
top l8 C1 NaN NaN Europe/Madrid NaN NaN
freq 1 480 NaN NaN 411 NaN NaN
mean NaN NaN 222.570833 2018-12-05 02:49:16.212765696 NaN 0.78125 2018-04-20 20:56:21.142856960
min NaN NaN 0.000000 2017-05-06 02:05:00 NaN 0.00000 2017-05-12 10:05:00
25% NaN NaN 44.000000 2018-12-29 07:09:15 NaN 1.00000 2017-11-09 17:11:00
50% NaN NaN 175.000000 2019-03-08 01:03:00 NaN 1.00000 2018-05-15 17:05:00
75% NaN NaN 374.500000 2019-03-11 01:03:00 NaN 1.00000 2018-08-29 09:08:00
max NaN NaN 671.000000 2019-03-11 01:03:00 NaN 1.00000 2019-03-07 14:03:00
std NaN NaN 193.047398 NaN NaN 0.41383 NaN

We can see that this dataset contains more employees than the previous one; the previous dataset covers only a subset of all the leaves that different people have taken. We can also see that most employees are in the Europe/Madrid timezone, while a considerable number are in Europe/Berlin.

employee_details.groupby('timezone').\
    aggregate({'lastParticipationDate':'count'}).\
    plot.pie(y='lastParticipationDate', figsize=(10, 10), 
             autopct='%1.1f%%', # to add the percentages text
             wedgeprops=dict(width=.5)
            )
plt.ylabel("")
plt.title('Employee Locations')
plt.show()

png

f, ax = plt.subplots(1, figsize = (20,10))

employee_details.loc[employee_details.lastParticipationDate.notna() &
                     (employee_details.stillExists == 0)].\
    groupby('lastParticipationDate').aggregate({'stillExists':'count'}).reset_index().\
    plot(x = 'lastParticipationDate', y = 'stillExists', kind = 'scatter', ax=ax)

plt.title('Number of employees who have left over time')
plt.xlabel('Date')
plt.ylabel('Number of employees')
plt.show()

png

Votes dataset

The next dataset of interest is the votes dataset: a listing of all votes registered on Happyforce in answer to the question "How are you today?" by the employees in the dataset. Not all employees participate in this survey, but quite a lot of them do so regularly.

votes = pd.read_csv("votes.csv")

# Convert to datetime
votes['voteDate'] = pd.to_datetime(votes['voteDate'])

# Sort the dataset based on time
votes = votes.sort_values(['employee', 'voteDate'])
votes
employee companyAlias voteDate vote
17862 13L C1 2017-09-26 02:09:00 2
18093 13L C1 2017-09-27 02:09:00 3
18406 13L C1 2017-09-29 02:09:00 3
18562 13L C1 2017-09-30 02:09:00 4
18635 13L C1 2017-10-01 02:10:00 3
... ... ... ... ...
90055 zyx C1 2018-12-15 01:12:00 3
90797 zyx C1 2018-12-19 01:12:00 2
91235 zyx C1 2018-12-21 01:12:00 2
91983 zyx C1 2018-12-26 01:12:00 3
92186 zyx C1 2018-12-28 01:12:00 2

106834 rows × 4 columns

votes.describe(include='all', datetime_is_numeric=True)
employee companyAlias voteDate vote
count 106834 106834 106834 106834.000000
unique 472 1 NaN NaN
top xE4 C1 NaN NaN
freq 671 106834 NaN NaN
mean NaN NaN 2018-05-06 22:41:28.348840192 2.849580
min NaN NaN 2017-05-03 02:05:00 1.000000
25% NaN NaN 2017-11-27 01:11:00 2.000000
50% NaN NaN 2018-05-21 02:05:00 3.000000
75% NaN NaN 2018-10-19 02:10:00 4.000000
max NaN NaN 2019-03-11 01:03:00 4.000000
std NaN NaN NaN 0.980259

We can see that the votes vary between 1 and 4, with the average being 2.85. The survey has data from 2017-05-03 to 2019-03-11 and was taken by 472 employees. While the data starts on 2017-05-03, some employees joined after the survey had already started; identifying them will help us look for patterns over time for new employees. We treat any employee whose first vote falls on or after 2017-06-13, several weeks after the survey's inception, as a new employee.

f, ax = plt.subplots(1, figsize = (20,10))
votes.vote.hist(grid = False, ax = ax)
plt.title('Histogram of the votes')
plt.xlabel('Vote')
plt.ylabel('Frequency')
plt.show()

png

We can observe that the default vote is 3, with happier employees using 4 and less happy employees using 1 or 2. For the modelling, we want to create features that take only the data before the observation into account: running means capture the baseline characteristics of the employee, and the latest two votes capture whether there has been any recent change from the norm. The columns after feature engineering are:
1. (existing column) vote: what the employee voted (1 to 4, representing being very Bad, Bad, Good, Very Good)
2. new_employee: We add a new flag called "new employee" - we label someone as a newer employee if their first vote came on or after 2017-06-13, well after the software began collecting data
3. min_date: the date of the first recorded vote by an employee
4. no_of_days_since_first_vote: we use this as proxy for tenure in the company
5. no_of_votes_till_date: use this as proxy for engagement with the company
6. perc_days_voted: another way to look at engagement with the company (adjusted for tenure)
7. avg_vote_till_date: proxy for job satisfaction (the higher, the better)
8. avg_vote: same as above, but taking all votes (not just to date)
9. last_2_votes_avg: proxy for most recent sentiment at their job

# Feature engineering on the votes dataset
votes['min_date'] = votes.groupby('employee')['voteDate'].transform(min)

# new employee: first vote on or after 2017-06-13, well after the inception of the software
votes['new_employee'] = (votes.min_date >= '2017-06-13').astype(int) 

# Getting the difference in days since the first vote
votes['no_of_days_since_first_vote'] = ((votes.voteDate - votes.min_date)/np.timedelta64(1, 'D')).astype(int)+1

# Calculating the cumulative number of votes till date (only considering the past; this relies on the sort above)
votes['no_of_votes_till_date'] = votes.groupby('employee')['voteDate'].transform('cumcount')+1

# calculating the percentage of days that the employee has voted till date
votes['perc_days_voted'] = votes['no_of_votes_till_date']/votes['no_of_days_since_first_vote']

# average of the vote till date by the employee
votes['avg_vote_till_date'] = votes.groupby('employee')['vote'].transform('cumsum')/(votes['no_of_votes_till_date'])

# Average vote in general of the employee
votes['avg_vote'] = votes.groupby('employee')['vote'].transform(np.mean)

# Converting to datetime
votes['date'] = votes.voteDate.dt.date.astype('datetime64[ns]')
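Hard-coding the cutoff date is brittle; as a sketch, the same threshold can be derived from the data (41 days after the first vote on 2017-05-03 lands on 2017-06-13):

# Derive the new-employee cutoff from the data instead of hard-coding a date
inception = votes.voteDate.min() # 2017-05-03 in this data
votes['new_employee'] = (votes.min_date >= inception + pd.Timedelta(days=41)).astype(int)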

Last n-average

The variable 'last_2_votes_avg' is created to hold the running average of the previous two votes.

# The average of the last two votes (the last two votes might predict the probability of taking a leave now)
votes = votes.sort_values(['employee', 'voteDate'])
votes['last_2_votes_avg'] = votes.groupby('employee').vote.\
    transform(lambda s: s.rolling(2, closed='left').mean())
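The closed='left' argument excludes the current row, so the average uses only votes cast strictly before the current one. The same pattern generalises to any window length; for example (a hypothetical feature, not used below):

# Mean of the previous five votes, again excluding the current one
votes['last_5_votes_avg'] = votes.groupby('employee').vote.\
    transform(lambda s: s.rolling(5, closed='left').mean())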

Previous vote

The previous vote for the employee

# The previous vote
votes['prev_vote'] = votes.groupby('employee').vote.transform(lambda s: s.shift(periods = 1))

In the scatter plot below we can see an interesting relationship between votes (1 to 4) and employee tenure. The newer employees (green dots) are overall more satisfied than older employees (red dots). We can also see that, over time, more tenured employees show a concerning downward trend in satisfaction. This could be one of the drivers of churn or increased absences.

fs, axs = plt.subplots(1, figsize=(20,10))
votes.groupby(['new_employee', 'no_of_days_since_first_vote']).agg({'vote':'mean'}).reset_index().\
    plot.scatter(x='no_of_days_since_first_vote', y = 'vote', c='new_employee', alpha = 0.4, 
                cmap = sns.diverging_palette(10, 133,as_cmap=True), ax = axs)#, vmin = 2.0, vmax=3.5)
axs.set_ylim([2, 3.5])
plt.title('Average vote across new and old employees')
plt.xlabel('Number of days since first vote')
plt.show()

png

We can see there is some seasonality in votes. This could be related to specific busy periods at the company, or relate to specific incidents that have occurred. It can also be a matter of seasons (employees tend to be happier before or around holiday seasons, for example).

fs, axs = plt.subplots(1, figsize=(20,10))
votes.groupby(['voteDate']).agg({'vote':'mean'}).reset_index().\
    plot.scatter(x='voteDate', y = 'vote', alpha = 0.4, ax = axs)
plt.title('Seasonality in the average vote')
plt.xlabel('Date of vote')
plt.show()

png

# joining the votes table with employee details to add the details of the employee
votes = votes.merge(employee_details, how = 'left', on = 'employee')

Does the average vote decrease just before a person quits the organisation? We can check this by filtering for the employees who have already quit and looking at their average votes relative to their last day. We find that the votes are fairly consistent, with a mean around 2.8, before the notice period, while they decrease during the notice period (of presumably 60 days).

people_who_left = votes[votes.stillExists == 0].copy()
people_who_left.loc[:,'no_of_days_since_exit'] = ((people_who_left.voteDate - people_who_left.deletedOn)/np.timedelta64(1, 'D')).astype(int)-1

fs, axs = plt.subplots(1, figsize=(20,10))
people_who_left.groupby(['new_employee', 'no_of_days_since_exit']).agg({'vote':'mean'}).reset_index().\
    plot.scatter(x='no_of_days_since_exit', y = 'vote', alpha = 0.4, ax = axs)
axs.set_xlim([-200, 0])
axs.set_ylim([2, 3.5])
plt.title('Average vote for employees who have left the organisation')
plt.xlabel('Number of days since exit date')
plt.show()

png

As we have employees across six time zones, the average votes across time zones are also shown. We can see distinct seasonality patterns across the different geographies. Some geographies show low variation as they have very few employees.

fs, axs = plt.subplots(1, figsize=(20,10))
sns.scatterplot(data = votes.groupby(['timezone', 'voteDate']).agg({'vote':'mean'}).reset_index(), 
                x='voteDate', y = 'vote', hue = 'timezone', alpha = 0.4, ax = axs)
plt.title('Average vote for employees in different geographies')
plt.xlabel('Vote date')
plt.show()

png

Feedback dataset

The next dataset of interest is the comments (feedback) dataset. It contains all the comments given by employees, and the number of likes and dislikes on each comment. Likes, dislikes and comments are various proxies for engagement.

feedback = pd.read_csv("comments_by_employees_in_anonymous_forum.csv")
feedback
employee companyAlias commentId commentLength likes dislikes commentDate feedbackType
0 aKP C1 5909b33da2ede4000473da6f 17 9 0 2017-05-03 12:05 OTHER
1 dNx C1 5909b6aca2ede4000473da72 25 12 0 2017-05-03 12:05 OTHER
2 ONv C1 5909c2dea2ede4000473db8c 58 33 5 2017-05-03 13:05 OTHER
3 e9M C1 5909d32ea2ede4000473db97 56 11 4 2017-05-03 14:05 OTHER
4 RWM C1 5909f227a2ede4000473dcbe 105 18 0 2017-05-03 17:05 OTHER
... ... ... ... ... ... ... ... ...
5067 7o1 C1 5c7108e8434c4500041722b0 28 0 0 2019-02-23 09:02 OTHER
5068 N3 C1 5c71519ca9f66e00042896f6 14 0 0 2019-02-23 14:02 OTHER
5069 DNY C1 5c73b11e50b72e0004cab283 63 0 0 2019-02-25 10:02 OTHER
5070 72j C1 5c744971e29c7b0004391da3 44 0 0 2019-02-25 21:02 OTHER
5071 qKO C1 5c781339efad100004ebb886 39 0 0 2019-02-28 17:02 OTHER

5072 rows × 8 columns

feedback.describe(include='all', datetime_is_numeric=True)
employee companyAlias commentId commentLength likes dislikes commentDate feedbackType
count 5072 5072 5072 5072.000000 5072.000000 5072.00000 5072 5072
unique 305 1 5072 NaN NaN NaN 3580 5
top 4ov C1 5909b33da2ede4000473da6f NaN NaN NaN 2018-09-10 09:09 OTHER
freq 373 5072 1 NaN NaN NaN 26 3188
mean NaN NaN NaN 168.518336 13.605875 4.89097 NaN NaN
std NaN NaN NaN 193.802568 15.280530 6.62993 NaN NaN
min NaN NaN NaN 1.000000 0.000000 0.00000 NaN NaN
25% NaN NaN NaN 48.000000 2.000000 0.00000 NaN NaN
50% NaN NaN NaN 111.000000 9.000000 3.00000 NaN NaN
75% NaN NaN NaN 223.000000 20.000000 7.00000 NaN NaN
max NaN NaN NaN 2509.000000 135.000000 65.00000 NaN NaN

We can observe that as the number of interactions grows, the overall trend moves towards either likes or dislikes. We can also observe that comments with more than 80 interactions are usually heavily liked.

feedback['log_comment_length'] = np.log10(feedback.commentLength) # lengths are heavily skewed, so use a log scale
feedback['total_interactions'] = feedback.likes+feedback.dislikes
# Net sentiment per interaction, in [-1, 1]; NaN when a comment has no interactions
feedback['mean_feedback'] = (feedback.likes-feedback.dislikes)/feedback.total_interactions

fs, axs = plt.subplots(1, figsize=(20,10))
sns.scatterplot(data = feedback, 
                x='total_interactions', y = 'mean_feedback', hue = 'log_comment_length', alpha = 0.8, ax = axs)
plt.title('Mean interaction vs response level for employee')
plt.xlabel('Number of interactions')
plt.ylabel('Mean of feedback')
plt.show()

png

We can see that there are five types of comments:

feedback.groupby('feedbackType').\
    aggregate({'commentId':'count'}).\
    plot.pie(y='commentId', figsize=(10, 10), 
             autopct='%1.1f%%', # to add the percentages text
             wedgeprops=dict(width=.5)
            )
plt.ylabel("")
plt.title('Frequency of comments')
plt.show()

png

Similarly, we can see a trend between the different comment types: congratulations receive more likes than dislikes, while suggestions and information have similar distributions. We can also see how skewed the number of likes is.

fs, axs = plt.subplots(1, figsize=(20,10))
sns.scatterplot(data = feedback, 
                x='likes', y = 'dislikes', hue = 'feedbackType', alpha = 0.4, ax = axs)
plt.title('Total likes and dislikes for comments')
plt.show()

png

feedback.groupby('feedbackType').aggregate({'likes':'mean', 'dislikes':'mean'})
likes dislikes
feedbackType
CONGRATULATION 30.941538 3.824615
CRITICISM 11.125000 1.562500
INFORMATION 18.751152 7.657450
OTHER 9.325282 3.705144
SUGGESTION 19.219917 7.290456

f, ax = plt.subplots(2, figsize = (20,20), sharex = True)

sns.boxplot(data = feedback.groupby(['feedbackType', 'employee'])['commentId'].count().reset_index().assign(
        log_count_comments = lambda df: np.log10(df.commentId)
    ), 
            x = 'feedbackType', y = 'log_count_comments', ax = ax[0])

sns.boxplot(data = feedback, x = 'feedbackType', y = 'log_comment_length', ax = ax[1])

ax[0].set_ylabel('Log of number of comments')
ax[1].set_ylabel('Log of comment length')
ax[0].set_title('Number of comments and comment length across different feedback types')
plt.show()

png

# feature engineering on the feedback dataset
feedback['commentDate'] = pd.to_datetime(feedback['commentDate'])

# sort by date in ascending to do cumulative sum and rolling calculations
feedback = feedback.sort_values(['employee', 'commentDate'])

# Rolling mean of the likes and dislikes on the previous two comments
feedback['last_2_likes'] = feedback.groupby('employee').likes.transform(lambda s: s.rolling(2, closed='left').mean())
feedback['last_2_dislikes'] = feedback.groupby('employee').dislikes.transform(lambda s: s.rolling(2, closed='left').mean())
feedback['date'] = feedback.commentDate.dt.date.astype('datetime64[ns]')
feedback
employee companyAlias commentId commentLength likes dislikes commentDate feedbackType log_comment_length total_interactions mean_feedback last_2_likes last_2_dislikes date
4373 19Q C1 5959ec954040610004272a21 11 0 0 2017-07-03 09:07:00 OTHER 1.041393 0 NaN 0.0 0.0 2017-07-03
970 19Q C1 5970e1483da0e10004b17a27 64 3 4 2017-07-20 18:07:00 OTHER 1.806180 7 -0.142857 3.0 4.0 2017-07-20
1903 19Q C1 5a36f82b26c0110004c55d90 57 17 6 2017-12-18 00:12:00 OTHER 1.755875 23 0.478261 20.0 10.0 2017-12-18
1949 19Q C1 5a40e7d3de51cb00042dfda4 31 5 0 2017-12-25 12:12:00 OTHER 1.491362 5 1.000000 22.0 6.0 2017-12-25
1972 19Q C1 5a4a8275eb84e0000492659f 11 8 0 2018-01-01 19:01:00 OTHER 1.041393 8 1.000000 13.0 0.0 2018-01-01
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4740 zRx C1 5be95746ed7ae70004b83417 11 0 0 2018-11-12 11:11:00 OTHER 1.041393 0 NaN 0.0 0.0 2018-11-12
4746 zRx C1 5be95746ed7ae70004b83415 26 0 0 2018-11-12 11:11:00 OTHER 1.414973 0 NaN 0.0 0.0 2018-11-12
4758 zRx C1 5be95747ed7ae70004b83419 33 0 0 2018-11-12 11:11:00 OTHER 1.518514 0 NaN 0.0 0.0 2018-11-12
4764 zRx C1 5be95746ed7ae70004b83418 11 0 0 2018-11-12 11:11:00 OTHER 1.041393 0 NaN 0.0 0.0 2018-11-12
4781 zRx C1 5be95746ed7ae70004b83416 8 0 0 2018-11-12 11:11:00 OTHER 0.903090 0 NaN 0.0 0.0 2018-11-12

5072 rows × 14 columns

Create till-date variables

Create variables that count the number of instances till date; these can be used to derive averages till date and similar features, as illustrated after the code below.

# Count the likes, dislikes and comments till the given date
feedback['likes_till_date']= feedback.groupby('employee')['likes'].transform('cumsum')
feedback['dislikes_till_date']= feedback.groupby('employee')['dislikes'].transform('cumsum')
# Note: cumcount is zero-based, so this counts previous comments, excluding the current one
feedback['comments_till_date']= feedback.groupby('employee')['commentId'].transform('cumcount')
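For example, the counters make it easy to derive a ratio feature (hypothetical, not used later in this post):

# Average likes per comment so far; comments_till_date is a cumcount (0 for the first
# comment), so add 1 for an inclusive count that matches the cumulative sums above
feedback['avg_likes_till_date'] = feedback['likes_till_date'] / (feedback['comments_till_date'] + 1)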

Merging all the datasets

Now we combine the employee absenteeism data with the votes and comments datasets. We have the following issues:
1. The absenteeism dataset is a subset of the data and does not cover all employees and months
2. Not all employees have voted, and even those who have voted did so infrequently
3. Not all employees have commented, and even those who have commented posted infrequently

So we make the following assumptions:
1. We work only with the employees that appear in the absenteeism dataset
2. We assume that the employee's last vote reflects how the employee feels today
3. We also assume that the last comment and its likes and dislikes have an effect on the employee

With these assumptions, we combine the datasets.

Handling missing dates in the dataframe

Adding rows for the dates that are missing for each employee, so that every employee has one row per day.

# Creating a dataframe containing the time from the start to the end of voting period
time_dataframe = pd.DataFrame({'date':pd.date_range(min(votes['voteDate']).date(), max(votes['voteDate']).date())})
time_dataframe['tmp'] = 1

# Creating a dataframe with all employees in the absenteeism dataset
complete_df = pd.DataFrame({'employee':df.employee.unique()})
complete_df['tmp'] = 1

# Creating a dataset that contains the combinations of all days for all employees
complete_df = pd.merge(complete_df, time_dataframe, on=['tmp']).drop('tmp', axis = 1)
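On pandas 1.2 or newer, the dummy 'tmp' key can be replaced by a cross merge; a sketch building the same employee-day grid (grid is a hypothetical name):

# Equivalent cross join without the dummy key
grid = pd.merge(pd.DataFrame({'employee': df.employee.unique()}),
                time_dataframe[['date']], how='cross')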

Joining the feedback and votes datasets

# Joining the feedback (comments) given by the employees
complete_df = pd.merge(complete_df, feedback, how = 'left', on=['date', 'employee'])
complete_df = complete_df[['employee', 'date', 'likes', 'dislikes', 'feedbackType', 'likes_till_date', 'dislikes_till_date', 'last_2_likes', 'last_2_dislikes', 'commentDate']]

# Forward-filling the last available feedback data for the days when there was no feedback from an employee.
complete_df[['date', 'likes', 'dislikes', 'feedbackType', 'likes_till_date', 'dislikes_till_date', 'last_2_likes', 'last_2_dislikes', 'commentDate']] = \
    complete_df.groupby('employee').fillna(method='ffill')

# Creating new features
complete_df['days_since_last_comment'] = (complete_df.date-complete_df.commentDate)
complete_df['days_since_last_comment'] = (complete_df['days_since_last_comment'].fillna(pd.Timedelta(seconds=0))/np.timedelta64(1, 'D')).astype(int)

# Assuming that for the employees that have not commented (yet) the feedback is 0.
complete_df = \
    complete_df[['employee', 'date', 'likes', 'dislikes', 'feedbackType', 'likes_till_date', 'dislikes_till_date', 'last_2_likes', 'last_2_dislikes', 'days_since_last_comment']].\
    fillna(0)

# Joining the votes dataset for every employee-date
complete_df = pd.merge(complete_df, votes, how = 'left', on=['date', 'employee'])

# Forward-filling the last available vote data for the days when there was no vote from an employee.
complete_df[[ 'no_of_votes_till_date','perc_days_voted', 'deletedOn', 'new_employee', 'min_date', 'stillExists', 'vote',
             'avg_vote_till_date', 'avg_vote','last_2_votes_avg', 'voteDate', 'timezone']] = \
    complete_df.groupby('employee')[['no_of_votes_till_date','perc_days_voted', 'deletedOn', 'new_employee', 'min_date', 'stillExists', 'vote',
                                     'avg_vote_till_date', 'avg_vote','last_2_votes_avg', 'voteDate', 'timezone']].\
    fillna(method='ffill')

# Remove the data before the employee joined
complete_df = complete_df[complete_df.avg_vote >= 0]

# Remove data after employee left
complete_df = complete_df[(complete_df.stillExists == 1) | ((complete_df.stillExists == 0) & (complete_df.date <= complete_df.deletedOn))]

# Recomputing no_of_days_since_first_vote
complete_df.no_of_days_since_first_vote = ((complete_df.date - complete_df.min_date)/np.timedelta64(1, 'D')).astype(int)+1

# Adding new features
# Countdown to the employee's last recorded day; the code below keeps a real
# countdown only for the final 15 days before exit and uses 999 as a sentinel earlier
complete_df.deletedOn = complete_df.deletedOn.fillna(pd.to_datetime(date.today()))
complete_df['countdown_to_last_day'] = ((complete_df.date - complete_df.deletedOn)/np.timedelta64(1, 'D')).astype(int)+1
complete_df.loc[complete_df.countdown_to_last_day < -15, 'countdown_to_last_day'] = 999

# computing days since last vote
complete_df['days_since_last_vote'] = (complete_df.date-complete_df.voteDate)
complete_df['days_since_last_vote'] = (complete_df['days_since_last_vote'].fillna(pd.Timedelta(seconds=0))/np.timedelta64(1, 'D')).astype(int)

# Back-filling the timezone and stillExists columns
complete_df[['timezone', 'stillExists']] = complete_df.groupby('employee')[['timezone', 'stillExists']].fillna(method = 'bfill')

# Selecting the features used in the model
complete_df = complete_df[
    ['employee', 'date', 'likes', 'dislikes', 'feedbackType', 
     'likes_till_date', 'dislikes_till_date', 'last_2_likes', 'last_2_dislikes', 
     'days_since_last_comment', 'vote', 'timezone', 'stillExists',
       'no_of_days_since_first_vote', 'no_of_votes_till_date',
       'perc_days_voted', 'avg_vote_till_date', 'avg_vote', 'last_2_votes_avg', 'prev_vote',
       'days_since_last_vote', 'new_employee', 'countdown_to_last_day']].fillna(0)
complete_df
employee date likes dislikes feedbackType likes_till_date dislikes_till_date last_2_likes last_2_dislikes days_since_last_comment vote timezone stillExists no_of_days_since_first_vote no_of_votes_till_date perc_days_voted avg_vote_till_date avg_vote last_2_votes_avg prev_vote days_since_last_vote new_employee countdown_to_last_day
41 19Q 2017-06-13 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 1 1.0 1.000000 3.000000 3.539171 0.0 0.0 0 1.0 999
42 19Q 2017-06-14 0.0 0.0 0 0.0 0.0 0.0 0.0 0 4.0 Europe/Madrid 1.0 1 2.0 1.000000 3.500000 3.539171 0.0 3.0 0 1.0 999
43 19Q 2017-06-15 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 2 3.0 1.000000 3.333333 3.539171 3.5 4.0 0 1.0 999
44 19Q 2017-06-16 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 3 3.0 1.000000 3.333333 3.539171 3.5 0.0 0 1.0 999
45 19Q 2017-06-17 0.0 0.0 0 0.0 0.0 0.0 0.0 0 4.0 Europe/Madrid 1.0 4 4.0 0.800000 3.500000 3.539171 3.5 3.0 0 1.0 999
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
42142 3WW 2019-03-07 22.0 1.0 CONGRATULATION 22.0 1.0 22.0 1.0 327 4.0 Europe/Madrid 1.0 526 372.0 0.707224 2.131720 2.151596 4.0 4.0 0 1.0 999
42143 3WW 2019-03-08 22.0 1.0 CONGRATULATION 22.0 1.0 22.0 1.0 328 4.0 Europe/Madrid 1.0 527 373.0 0.707780 2.136729 2.151596 4.0 4.0 0 1.0 999
42144 3WW 2019-03-09 22.0 1.0 CONGRATULATION 22.0 1.0 22.0 1.0 329 4.0 Europe/Madrid 1.0 528 374.0 0.708333 2.141711 2.151596 4.0 4.0 0 1.0 999
42145 3WW 2019-03-10 22.0 1.0 CONGRATULATION 22.0 1.0 22.0 1.0 330 4.0 Europe/Madrid 1.0 529 375.0 0.708885 2.146667 2.151596 4.0 4.0 0 1.0 999
42146 3WW 2019-03-11 22.0 1.0 CONGRATULATION 22.0 1.0 22.0 1.0 331 4.0 Europe/Madrid 1.0 530 376.0 0.709434 2.151596 2.151596 4.0 4.0 0 1.0 999

33451 rows × 23 columns
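Because an employee can post several comments on the same day, the feedback merge can duplicate employee-day rows; it is worth checking the spine before modelling:

# Count duplicate employee-day rows introduced by the merges
print(complete_df.duplicated(subset=['employee', 'date']).sum(), 'duplicate employee-day rows')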

# Combining the leaves dataset
data = pd.merge(complete_df, leave_df, how = 'left', on=['date', 'employee'])
data.on_leave = data.on_leave.fillna(0)

# Get the leave status of the previous day
data['previous_day_leave'] = data.groupby('employee')['on_leave'].shift().fillna(0)

# Renaming columns to more descriptive names
data.rename(columns={"likes": "last_likes", "dislikes": "last_dislikes", "feedbackType": "last_feedback_type",
                    "vote": "last_vote", "new_employee": "employee_joined_after_jun17"}, inplace = True)

Back fill and forward fill for null values

# Imputing missing values in no_leaves_till_date with the nearest known value
data.no_leaves_till_date = data.no_leaves_till_date.fillna(method = 'bfill').fillna(method = 'ffill')

Create rolling n-days sum

# Get the number of days the person was on leave in the last two days
def get_rolling_sum(grp, freq, col):
    return(grp.rolling(freq, on='date')[col].sum())
data = data.sort_values(['employee', 'date'])
data['last_2_days_leaves']  = data.groupby('employee', as_index = False, group_keys=False).\
    apply(get_rolling_sum, '2D', 'on_leave') # '2D' = a two-calendar-day window
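The window is time-based and anchored on the date column, spanning (date - 2 days, date], i.e. today and yesterday. The same helper extends to longer windows; for example (a hypothetical feature):

# Leaves taken in the past week, reusing the helper above
data['last_7_days_leaves'] = data.groupby('employee', as_index = False, group_keys=False).\
    apply(get_rolling_sum, '7D', 'on_leave')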

The correlation matrix for all the numeric columns is shown below.

red_green_cmap = sns.diverging_palette(10, 133,as_cmap=True)

f = plt.figure(figsize=(20, 20))
plt.matshow(data.corr(), fignum = f, cmap = red_green_cmap, vmin=-1, vmax=1)
plt.xticks(range(data.select_dtypes(['number']).shape[1]), 
           data.select_dtypes(['number']).columns, rotation=90, fontsize = 15)
plt.yticks(range(data.select_dtypes(['number']).shape[1]), 
           data.select_dtypes(['number']).columns, rotation=0, fontsize = 15)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title("Correlation between different parameters", fontsize=20,loc='left')
plt.show()

png

The distributions across all the columns are shown below.

_ = data.hist(figsize = (20,20))

png

Date as a predictor variable

We have previously seen that most leaves start on the first of the month or end on the last day of the month. This suggests that the day of the week, the month, and other date characteristics may also be informative.

Create weekday, month and week numbers

day_name= ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday','Sunday']
month_name = [None, 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
data['weekday'] = data.date.dt.weekday.apply(lambda x:day_name[x])
data['month'] = data.date.dt.month.apply(lambda x:month_name[x])
data['week'] = data.date.dt.day % 7 # day of month modulo 7 (a rough within-month position, not the true week of the month)
data
employee date last_likes last_dislikes last_feedback_type likes_till_date dislikes_till_date last_2_likes last_2_dislikes days_since_last_comment last_vote timezone stillExists no_of_days_since_first_vote no_of_votes_till_date perc_days_voted avg_vote_till_date avg_vote last_2_votes_avg prev_vote days_since_last_vote employee_joined_after_jun17 countdown_to_last_day reason on_leave no_leaves_till_date previous_day_leave last_2_days_leaves weekday month week
23729 17r 2018-05-29 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 1 1.0 1.000000 3.000000 2.121212 0.0 0.0 0 1.0 999 NaN 0.0 0.0 0.0 0.0 Tuesday May 1
23730 17r 2018-05-30 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 1 1.0 1.000000 3.000000 2.121212 0.0 0.0 0 1.0 999 NaN 0.0 0.0 0.0 0.0 Wednesday May 2
23731 17r 2018-05-31 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 2 1.0 1.000000 3.000000 2.121212 0.0 0.0 1 1.0 999 NaN 0.0 0.0 0.0 0.0 Thursday May 3
23732 17r 2018-06-01 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 3 2.0 0.500000 3.000000 2.121212 0.0 3.0 0 1.0 999 NaN 0.0 0.0 0.0 0.0 Friday Jun 1
23733 17r 2018-06-02 0.0 0.0 0 0.0 0.0 0.0 0.0 0 3.0 Europe/Madrid 1.0 4 2.0 0.500000 3.000000 2.121212 0.0 0.0 0 1.0 999 NaN 0.0 0.0 0.0 0.0 Saturday Jun 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32232 zGB 2019-03-07 24.0 8.0 OTHER 24.0 8.0 24.0 8.0 299 3.0 Europe/Madrid 1.0 639 66.0 0.109272 3.015152 3.015152 3.0 0.0 34 0.0 999 NaN 0.0 0.0 0.0 0.0 Thursday Mar 0
32233 zGB 2019-03-08 24.0 8.0 OTHER 24.0 8.0 24.0 8.0 300 3.0 Europe/Madrid 1.0 640 66.0 0.109272 3.015152 3.015152 3.0 0.0 35 0.0 999 NaN 0.0 0.0 0.0 0.0 Friday Mar 1
32234 zGB 2019-03-09 24.0 8.0 OTHER 24.0 8.0 24.0 8.0 301 3.0 Europe/Madrid 1.0 641 66.0 0.109272 3.015152 3.015152 3.0 0.0 36 0.0 999 NaN 0.0 0.0 0.0 0.0 Saturday Mar 2
32235 zGB 2019-03-10 24.0 8.0 OTHER 24.0 8.0 24.0 8.0 302 3.0 Europe/Madrid 1.0 642 66.0 0.109272 3.015152 3.015152 3.0 0.0 37 0.0 999 NaN 0.0 0.0 0.0 0.0 Sunday Mar 3
32236 zGB 2019-03-11 24.0 8.0 OTHER 24.0 8.0 24.0 8.0 303 3.0 Europe/Madrid 1.0 643 66.0 0.109272 3.015152 3.015152 3.0 0.0 38 0.0 999 NaN 0.0 0.0 0.0 0.0 Monday Mar 4

33451 rows × 31 columns
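Before modelling, the categorical features created here can be one-hot encoded. A minimal sketch (the actual encoding choices are made in Part 2; data_encoded is a hypothetical name):

# One-hot encode the categorical columns
data_encoded = pd.get_dummies(data, columns=['weekday', 'month', 'timezone', 'last_feedback_type'])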

# Save the engineered dataset for use in the next part
data.to_csv('data_after_feature_engg.csv')

The second part of the blog is here.
