Feature engineering for Machine Learning¶
This is a blog post on some common feature engineering techniques for continuous time series data. Different features are created across four different datasets for the same use case.
Predicting absenteeism¶
A major challenge within organisations is how to keep employees motivated. In this blog, we will use HappyForce data to predict employee absenteeism. The goal is to identify employees who are likely to be absent in the near future, and to find the reasons for absenteeism.
This blog is the first part and contains Feature engineering and EDA for machine learning. Part 2 contains the machine learning part where we build models and compare the results. Part 3 contains modelling using neural networks. This blog shows various ways in which feature engineering can be carried out on time series datasets.
We focus on data from 408 employees of one company in Spain; 62 have a record of absence. The chart below summarizes the main reasons for leave; in our case, common sickness and non-job-related accidents are the main reasons. Since we are analysing employee absence as it relates to motivation, it is plausible that a portion of these employees simply call in sick because they lack motivation or commitment to their workplace.
Employee absenteeism dataset¶
The dataset contains 7 files. First, let us look at the employeeAbsenteeism dataset.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date
pd.options.display.max_columns = None
df = pd.read_csv('employeeAbsenteeism.csv')
df.head()
 | employee | companyAlias | from | to | reason |
---|---|---|---|---|---|
0 | 19Q | C1 | 10/1/18 0:10 | 10/26/18 0:10 | Workplace accident |
1 | NY3 | C1 | 10/1/18 0:10 | 10/31/18 0:10 | Common sickness or accident not related to th... |
2 | qKO | C1 | 10/1/18 0:10 | 10/5/18 0:10 | Common sickness or accident not related to th... |
3 | qKO | C1 | 10/10/18 0:10 | 10/31/18 0:10 | Common sickness or accident not related to th... |
4 | 2wx | C1 | 10/1/18 0:10 | 10/31/18 0:10 | Common sickness or accident not related to th... |
Convert strings in different formats to datetime¶
The columns 'from' and 'to' are date-time values stored as strings, so we convert them to datetime.
df.to = pd.to_datetime(df.to) # converts string to type datetime
df['from'] = pd.to_datetime(df['from'])
df.head()
 | employee | companyAlias | from | to | reason |
---|---|---|---|---|---|
0 | 19Q | C1 | 2018-10-01 00:10:00 | 2018-10-26 00:10:00 | Workplace accident |
1 | NY3 | C1 | 2018-10-01 00:10:00 | 2018-10-31 00:10:00 | Common sickness or accident not related to th... |
2 | qKO | C1 | 2018-10-01 00:10:00 | 2018-10-05 00:10:00 | Common sickness or accident not related to th... |
3 | qKO | C1 | 2018-10-10 00:10:00 | 2018-10-31 00:10:00 | Common sickness or accident not related to th... |
4 | 2wx | C1 | 2018-10-01 00:10:00 | 2018-10-31 00:10:00 | Common sickness or accident not related to th... |
We can see that the dataset contains details of the leaves taken by employees: the from date, the to date, and the reason for leave given by the employee.
df.describe(include='all', datetime_is_numeric=True)
 | employee | companyAlias | from | to | reason |
---|---|---|---|---|---|
count | 106 | 106 | 106 | 106 | 106 |
unique | 62 | 1 | NaN | NaN | 4 |
top | yKX | C1 | NaN | NaN | Common sickness or accident not related to th... |
freq | 4 | 106 | NaN | NaN | 96 |
mean | NaN | NaN | 2018-08-11 04:12:33.962264320 | 2018-08-26 23:13:41.886792448 | NaN |
min | NaN | NaN | 2018-06-01 00:06:00 | 2018-06-06 00:06:00 | NaN |
25% | NaN | NaN | 2018-06-18 06:06:00 | 2018-06-30 00:06:00 | NaN |
50% | NaN | NaN | 2018-08-01 00:08:00 | 2018-08-31 00:08:00 | NaN |
75% | NaN | NaN | 2018-10-01 00:10:00 | 2018-10-23 12:10:00 | NaN |
max | NaN | NaN | 2018-10-31 00:10:00 | 2018-10-31 00:10:00 | NaN |
Further, we can see that the data covers only one company, with 62 employees taking a total of 106 leaves (leave records, not days of leave) for 4 unique reasons.
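These headline numbers can be verified directly from the dataframe; a quick check using only the columns shown above:
# Quick check of the summary above
print(df['employee'].nunique())  # distinct employees with a leave record
print(len(df))                   # number of leave records (not days of leave)
print(df['reason'].nunique())    # distinct reasons for leave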
Difference in days¶
Creating a feature 'leave_time', which is the leave period in days.
df['leave_time'] = ((df['to'] - df['from'])/np.timedelta64(1, 'D')).astype(int) # takes the difference in days
f, ax = plt.subplots(1, figsize = (20,10))
df['leave_time'].plot.hist()
plt.title('Histogram of the leave period')
plt.show()
The maximum leave period is 30 days, and the frequency of each leave period is shown below. The majority of employees take either a short 1-3 day leave or a much longer 27-30 day leave (likely due to a sickness, accident, or similar issue).
df_for_eda = df.copy()
# Creating class for the continuous variable
df_for_eda['leave_time_class'] = np.where(df_for_eda['leave_time'].isna(), 'None', 'Not None')
# Splitting the continuous variable into classes
df_for_eda['leave_time_class'] = pd.cut(df_for_eda.leave_time,
[0, 1, 5, 10, 15, 20, 25, 29, 30],
labels = ['1 day', '1-5days', '5-10 days','10-15 days', '15-20 days', '20-25 days', '25-29 days', '30 days'])
grouped_df_for_eda = df_for_eda.groupby('leave_time_class').aggregate(
number_of_instances = ('leave_time_class', 'count'),
unique_employees = ('employee', 'nunique')
).reset_index().assign(instance_percentage = lambda a:round(a.number_of_instances*100/(sum(a.number_of_instances)),2)).\
assign(cumulative_percentage = lambda x:x.instance_percentage.cumsum())
fs, axs = plt.subplots(1, figsize=(20,10))
grouped_df_for_eda.index = grouped_df_for_eda[['leave_time_class']]
grouped_df_for_eda.number_of_instances.plot(kind='bar', ax=axs)
axs.set_ylabel('No of instances')
axs.set_xlabel('leave period (days)')
axs.set_title('Histogram of leave period')
axs.bar_label(axs.containers[0],
labels = round(grouped_df_for_eda.instance_percentage, 1).astype(str)+str("% - ")+grouped_df_for_eda.number_of_instances.astype(str) , label_type='edge')
plt.show();
grouped_df_for_eda
 | leave_time_class | number_of_instances | unique_employees | instance_percentage | cumulative_percentage |
---|---|---|---|---|---|
(1 day,) | 1 day | 14 | 13 | 14.58 | 14.58 |
(1-5days,) | 1-5days | 15 | 14 | 15.62 | 30.20 |
(5-10 days,) | 5-10 days | 10 | 9 | 10.42 | 40.62 |
(10-15 days,) | 10-15 days | 4 | 4 | 4.17 | 44.79 |
(15-20 days,) | 15-20 days | 3 | 3 | 3.12 | 47.91 |
(20-25 days,) | 20-25 days | 5 | 5 | 5.21 | 53.12 |
(25-29 days,) | 25-29 days | 19 | 17 | 19.79 | 72.91 |
(30 days,) | 30 days | 26 | 17 | 27.08 | 99.99 |
There are four reasons given for leaves, with "Common sickness or accident not related to the job" as the most common. The pie chart below summarizes the main reasons for leave; in our case, common sickness and non-job-related accidents account for most of the absence.
df.groupby('reason').\
aggregate({'leave_time':'sum'}).\
plot.pie(y='leave_time', figsize=(10, 10),
autopct='%1.1f%%', # to add the percentages text
wedgeprops=dict(width=.5)
)
plt.ylabel("")
plt.title('Proportion of the number of days leaves were taken due to different reasons')
plt.show()
grouped_df_for_eda = df_for_eda.groupby(['reason', 'leave_time_class']).aggregate(
number_of_instances = ('leave_time_class', 'count')
).reset_index().pivot(index = 'leave_time_class', columns=['reason'],
values = ['number_of_instances']).reset_index()
fs, axs = plt.subplots(1, figsize=(20,10))
grouped_df_for_eda.index = grouped_df_for_eda[['leave_time_class']]
grouped_df_for_eda.number_of_instances.plot(kind='bar', ax=axs)
axs.set_ylabel('No of instances')
axs.set_xlabel('leave period (days)')
axs.set_title('Histogram of leave period')
plt.show();
# Creating class for the categorical variable
df_for_eda['reason_class'] = np.where(df_for_eda['reason']=='Common sickness or accident not related to the job', 'sick', 'others')
grouped_df_for_eda = df_for_eda.groupby(['reason_class', 'leave_time_class']).aggregate(
number_of_instances = ('leave_time_class', 'count')
).reset_index().pivot(index = 'leave_time_class', columns=['reason_class'],
values = ['number_of_instances']).reset_index()
# Naming columns
grouped_df_for_eda.columns = ['leave_time_class', 'others_number_of_instances', 'sick_number_of_instances']
# Adding columns for percentages
grouped_df_for_eda = grouped_df_for_eda.\
assign(others_percentage = lambda x:round(x.others_number_of_instances*100/(sum(x.others_number_of_instances)),2)).\
assign(others_cumulative_perc = lambda a:round(a.others_percentage.cumsum(), 2)).\
assign(sick_percentage = lambda x:round(x.sick_number_of_instances*100/(sum(x.sick_number_of_instances)),2)).\
assign(sick_cumulative_perc = lambda a:round(a.sick_percentage.cumsum(), 2))
# Plotting
fs, axs = plt.subplots(1, 2, sharex='col', sharey='row', figsize = (15,7))
grouped_df_for_eda.index = grouped_df_for_eda[['leave_time_class']]
grouped_df_for_eda.sick_number_of_instances.plot(kind='bar', ax=axs[0])
grouped_df_for_eda.others_number_of_instances.plot(kind='bar', ax=axs[1])
axs[0].bar_label(axs[0].containers[0], labels = round(grouped_df_for_eda.sick_percentage, 1).astype(str)+str("%"), label_type='edge')
axs[1].bar_label(axs[1].containers[0], labels = round(grouped_df_for_eda.others_percentage, 1).astype(str)+str("%"), label_type='edge')
axs[0].set_ylabel('No of leaves')
axs[0].set_xlabel('(leave period)')
axs[0].set_title('Common sickness or accident not related to the job')
axs[1].set_xlabel('(leave period)')
axs[1].set_title('Other type of leaves')
plt.show();
grouped_df_for_eda
 | leave_time_class | others_number_of_instances | sick_number_of_instances | others_percentage | others_cumulative_perc | sick_percentage | sick_cumulative_perc |
---|---|---|---|---|---|---|---|
(1 day,) | 1 day | 0 | 14 | 0.0 | 0.0 | 16.28 | 16.28 |
(1-5days,) | 1-5days | 0 | 15 | 0.0 | 0.0 | 17.44 | 33.72 |
(5-10 days,) | 5-10 days | 2 | 8 | 20.0 | 20.0 | 9.30 | 43.02 |
(10-15 days,) | 10-15 days | 0 | 4 | 0.0 | 20.0 | 4.65 | 47.67 |
(15-20 days,) | 15-20 days | 0 | 3 | 0.0 | 20.0 | 3.49 | 51.16 |
(20-25 days,) | 20-25 days | 1 | 4 | 10.0 | 30.0 | 4.65 | 55.81 |
(25-29 days,) | 25-29 days | 3 | 16 | 30.0 | 60.0 | 18.60 | 74.41 |
(30 days,) | 30 days | 4 | 22 | 40.0 | 100.0 | 25.58 | 99.99 |
Create variable for date and day-of-month¶
Creating a new feature 'date', which captures the date on which the leave started, plus 'from_day' and 'to_day', which capture the day of the month on which the leave started and ended.
df['date'] = df['from'].dt.date.astype('datetime64[ns]') # gets only the date part from the timeframe
df['from_day'] = df['from'].dt.day
df['to_day'] = df['to'].dt.day
Additionally, we can observe that most of the leaves started on the first of the month and ended at the end of the month.
f, ax = plt.subplots(1, 2, figsize = (20,10), sharey = True)
df.groupby(df['from'].dt.day)['employee'].count().\
plot(kind='bar', ax=ax[0], title = 'Number of leaves with "from" date')
df.groupby(df['to'].dt.day)['employee'].count().\
plot(kind='bar', ax=ax[1], title = 'Number of leaves with "to" date')
plt.show();
The below scatterplot shows the number of employees that took leaves across time.
# Feature engineering on absentism dataset
# creating a dataframe with leaves as per employee and dates
leave_df = pd.DataFrame({'date':[], 'employee':[], 'reason':[]})
for index, row in df.iterrows():
dates = pd.date_range(row['from'].date(), row['to'].date())
for date in dates:
leave_df = pd.concat([leave_df, pd.DataFrame({'date':[date], 'employee':[row['employee']], 'reason':[row['reason']]})])
leave_df['on_leave'] = 1
# Creating a cumulative sum to get the number of leave days taken by the employee till date (excluding the current day)
leave_df['no_leaves_till_date'] = leave_df.groupby('employee')['on_leave'].transform(lambda x: x.cumsum().shift()).fillna(0)
f, ax = plt.subplots(1, figsize = (20,10))
leave_df.groupby(['date'])['on_leave'].sum().reset_index().plot(x = 'date', y = 'on_leave', kind = 'scatter', ax=ax)
plt.title('Number of employees on leave')
plt.show()
We can see that most of the leaves fall within three months of 2018, indicating that this could be a subset of the full leave data.
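A quick tabulation of the calendar months present supports this (a small check on the 'from' column):
# Number of leave records starting in each calendar month
print(df['from'].dt.to_period('M').value_counts().sort_index())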
Employee details dataset¶
The next dataset to look at is lastParticipationExists, which has the last participation date along with the details of each employee. This data also has the number of votes each employee has cast till date (this variable can be used as a proxy for engagement).
employee_details = pd.read_csv("lastParticipationExists.csv")
# Convert to datetime
employee_details.lastParticipationDate = pd.to_datetime(employee_details.lastParticipationDate)
employee_details.deletedOn = pd.to_datetime(employee_details.deletedOn)
# Convert to integer
employee_details.stillExists = employee_details.stillExists.astype(int)
employee_details
 | employee | companyAlias | numVotes | lastParticipationDate | timezone | stillExists | deletedOn |
---|---|---|---|---|---|---|---|
0 | l8 | C1 | 285 | 2019-03-08 01:03:00 | Europe/Madrid | 1 | NaT |
1 | Xv | C1 | 143 | 2018-04-21 02:04:00 | Europe/Berlin | 1 | NaT |
2 | w7 | C1 | 381 | 2019-03-11 01:03:00 | Europe/Madrid | 1 | NaT |
3 | jE | C1 | 173 | 2019-03-01 01:03:00 | Europe/Madrid | 1 | NaT |
4 | QP | C1 | 312 | 2019-03-08 01:03:00 | Europe/Berlin | 1 | NaT |
... | ... | ... | ... | ... | ... | ... | ... |
475 | D7J | C1 | 29 | 2018-11-19 01:11:00 | Europe/Madrid | 0 | 2018-11-20 13:11:00 |
476 | 9KA | C1 | 50 | 2018-11-09 01:11:00 | Europe/Madrid | 0 | 2018-12-13 16:12:00 |
477 | zR7 | C1 | 42 | 2018-10-26 02:10:00 | Europe/Madrid | 0 | 2018-11-20 13:11:00 |
478 | B7E | C1 | 16 | 2019-01-21 01:01:00 | Europe/Madrid | 0 | 2019-02-11 18:02:00 |
479 | QJg | C1 | 1 | 2018-11-28 01:11:00 | Europe/Madrid | 0 | 2019-01-28 10:01:00 |
480 rows × 7 columns
employee_details.describe(include='all', datetime_is_numeric=True)
 | employee | companyAlias | numVotes | lastParticipationDate | timezone | stillExists | deletedOn |
---|---|---|---|---|---|---|---|
count | 480 | 480 | 480.000000 | 470 | 480 | 480.00000 | 105 |
unique | 480 | 1 | NaN | NaN | 6 | NaN | NaN |
top | l8 | C1 | NaN | NaN | Europe/Madrid | NaN | NaN |
freq | 1 | 480 | NaN | NaN | 411 | NaN | NaN |
mean | NaN | NaN | 222.570833 | 2018-12-05 02:49:16.212765696 | NaN | 0.78125 | 2018-04-20 20:56:21.142856960 |
min | NaN | NaN | 0.000000 | 2017-05-06 02:05:00 | NaN | 0.00000 | 2017-05-12 10:05:00 |
25% | NaN | NaN | 44.000000 | 2018-12-29 07:09:15 | NaN | 1.00000 | 2017-11-09 17:11:00 |
50% | NaN | NaN | 175.000000 | 2019-03-08 01:03:00 | NaN | 1.00000 | 2018-05-15 17:05:00 |
75% | NaN | NaN | 374.500000 | 2019-03-11 01:03:00 | NaN | 1.00000 | 2018-08-29 09:08:00 |
max | NaN | NaN | 671.000000 | 2019-03-11 01:03:00 | NaN | 1.00000 | 2019-03-07 14:03:00 |
std | NaN | NaN | 193.047398 | NaN | NaN | 0.41383 | NaN |
We can see that this dataset contains more employees than the previous one, so the previous dataset may cover only a subset of all the leaves people have taken. We can also see that most of the people are in the Europe/Madrid timezone, while a considerable number are in Europe/Berlin.
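We can also confirm that every employee with a leave record appears in this larger listing (a quick set comparison; nothing is assumed beyond what the two dataframes contain):
# Check that the absenteeism employees are a subset of the employee details
absent_ids = set(df['employee'])
all_ids = set(employee_details['employee'])
print(len(all_ids), len(absent_ids), absent_ids.issubset(all_ids))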
employee_details.groupby('timezone').\
aggregate({'lastParticipationDate':'count'}).\
plot.pie(y='lastParticipationDate', figsize=(10, 10),
autopct='%1.1f%%', # to add the percentages text
wedgeprops=dict(width=.5)
)
plt.ylabel("")
plt.title('Employee Locations')
plt.show()
f, ax = plt.subplots(1, figsize = (20,10))
employee_details.loc[employee_details.lastParticipationDate.notna()].\
loc[employee_details.stillExists == 0].\
groupby('lastParticipationDate').aggregate({'stillExists':'count'}).reset_index().\
plot(x = 'lastParticipationDate', y = 'stillExists', kind = 'scatter', ax=ax)
plt.title('Number of employees who have left over time')
plt.xlabel('Date')
plt.ylabel('Number of employees')
plt.show()
Votes dataset¶
The next dataset of interest is the votes dataset: a listing of all votes registered on Happyforce in answer to the question "How are you today?". Not all employees participate in this survey, but quite a lot of them do so regularly.
votes = pd.read_csv("votes.csv")
# Convert to datetime
votes['voteDate'] = pd.to_datetime(votes['voteDate'])
# Sort the dataset based on time
votes = votes.sort_values(['employee', 'voteDate'])
votes
 | employee | companyAlias | voteDate | vote |
---|---|---|---|---|
17862 | 13L | C1 | 2017-09-26 02:09:00 | 2 |
18093 | 13L | C1 | 2017-09-27 02:09:00 | 3 |
18406 | 13L | C1 | 2017-09-29 02:09:00 | 3 |
18562 | 13L | C1 | 2017-09-30 02:09:00 | 4 |
18635 | 13L | C1 | 2017-10-01 02:10:00 | 3 |
... | ... | ... | ... | ... |
90055 | zyx | C1 | 2018-12-15 01:12:00 | 3 |
90797 | zyx | C1 | 2018-12-19 01:12:00 | 2 |
91235 | zyx | C1 | 2018-12-21 01:12:00 | 2 |
91983 | zyx | C1 | 2018-12-26 01:12:00 | 3 |
92186 | zyx | C1 | 2018-12-28 01:12:00 | 2 |
106834 rows × 4 columns
votes.describe(include='all', datetime_is_numeric=True)
 | employee | companyAlias | voteDate | vote |
---|---|---|---|---|
count | 106834 | 106834 | 106834 | 106834.000000 |
unique | 472 | 1 | NaN | NaN |
top | xE4 | C1 | NaN | NaN |
freq | 671 | 106834 | NaN | NaN |
mean | NaN | NaN | 2018-05-06 22:41:28.348840192 | 2.849580 |
min | NaN | NaN | 2017-05-03 02:05:00 | 1.000000 |
25% | NaN | NaN | 2017-11-27 01:11:00 | 2.000000 |
50% | NaN | NaN | 2018-05-21 02:05:00 | 3.000000 |
75% | NaN | NaN | 2018-10-19 02:10:00 | 4.000000 |
max | NaN | NaN | 2019-03-11 01:03:00 | 4.000000 |
std | NaN | NaN | NaN | 0.980259 |
We can see that the votes vary between 1 and 4, with the average being 2.85. The survey has data from 3 May 2017 to 11 March 2019 and was taken by 472 employees. Since the data starts on 3 May 2017, there will be employees who joined after the survey had already started; identifying them will help us look for patterns over time for new employees. We consider any employee whose first vote comes more than 100 days after inception (3 May 2017) to be a new employee.
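One way to compute such a cutoff from the data itself is sketched below (the feature-engineering code later in this post uses a hard-coded date for the same flag):
# Derive a 100-day 'new employee' cutoff from the first recorded vote
inception = votes['voteDate'].min()
new_employee_cutoff = inception + pd.Timedelta(days=100)
print(inception, new_employee_cutoff)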
f, ax = plt.subplots(1, figsize = (20,10))
votes.vote.hist(grid = False, ax = ax)
plt.title('Histogram of the votes')
plt.xlabel('Vote')
plt.ylabel('Frequency')
plt.show()
We can observe that the default vote is 3, with happier employees using 4 and less happy employees using 1 or 2. For the modelling, we want to create features that take only the data from before the observation into account. Running means are useful to capture an employee's baseline, while the average of the latest 2 votes captures whether there has been any recent change from that baseline. The columns after feature engineering are:
1. (existing column) vote: what the employee voted (1 to 4, representing being very Bad, Bad, Good, Very Good)
2. new_employee: we label someone as a newer employee if their first vote came more than 100 days (~3 months) after the software began collecting data
3. min_date: the date of the first recorded vote by an employee
4. no_of_days_since_first_vote: we use this as proxy for tenure in the company
5. no_of_votes_till_date: use this as proxy for engagement with the company
6. perc_days_voted: another way to look at engagement with the company (adjusted for tenure)
7. avg_vote_till_date: proxy for job satisfaction (the higher, the better)
8. avg_vote: same as above, but taking all votes (not just to date)
9. last_2_votes_avg: proxy for most recent sentiment at their job
# Feature engineering on the votes dataset
votes['min_date'] = votes.groupby('employee')['voteDate'].transform(min)
# new employee is 100 days since inception of the software
votes['new_employee'] = (votes.min_date >= '2017-06-13 00:00:00').astype(int)
# Getting the difference in days since first vote in days
votes['no_of_days_since_first_vote'] = ((votes.voteDate - votes.min_date)/np.timedelta64(1, 'D')).astype(int)+1
# Calculating the cumulative number of votes till date (only considering the past: this step should be done after order by)
votes['no_of_votes_till_date'] = votes.groupby('employee')['voteDate'].transform('cumcount')+1
# calculating the percentage of days that the employee has voted till date
votes['perc_days_voted'] = votes['no_of_votes_till_date']/votes['no_of_days_since_first_vote']
# average of the vote till date by the employee
votes['avg_vote_till_date'] = votes.groupby('employee')['vote'].transform('cumsum')/(votes['no_of_votes_till_date'])
# Average vote in general of the employee
votes['avg_vote'] = votes.groupby('employee')['vote'].transform(np.mean)
# Converting to datetime
votes['date'] = votes.voteDate.dt.date.astype('datetime64[ns]')
Last n-average¶
The variable 'last_2_votes_avg' is created to get the running average of the previous two votes.
# The average of the last two votes (the last two votes might predict the probability of taking a leave now); closed='left' excludes the current vote so only past information is used
votes = votes.sort_values(['employee', 'voteDate'])
votes['last_2_votes_avg'] = votes.groupby('employee').vote.\
transform(lambda s: s.rolling(2, closed='left').mean())
Previous vote¶
The previous vote for the employee
# The previous vote
votes['prev_vote'] = votes.groupby('employee').vote.transform(lambda s: s.shift(periods = 1))
In the scatter plot below we can see an interesting trend / relationship between votes (1 to 4) and employee tenure. We can see the newer employees (green dots) are overall more satisfied than older employees (red dots). We can also see that over time, more tenured employees show a concerning downward trend in satisfaction. This could be one of the drivers for churn or increased absences.
fs, axs = plt.subplots(1, figsize=(20,10))
votes.groupby(['new_employee', 'no_of_days_since_first_vote']).agg({'vote':'mean'}).reset_index().\
plot.scatter(x='no_of_days_since_first_vote', y = 'vote', c='new_employee', alpha = 0.4,
cmap = sns.diverging_palette(10, 133,as_cmap=True), ax = axs)#, vmin = 2.0, vmax=3.5)
axs.set_ylim([2, 3.5])
plt.title('Average vote across new and old employees')
plt.xlabel('Number of days since first vote')
plt.show()
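To put a rough number on the gap visible in the scatter plot, we can compare the average vote per group (a quick check using the new_employee flag created above):
# Average vote for newer (1) vs older (0) employees
print(votes.groupby('new_employee')['vote'].mean())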
We can see there is some seasonality in votes. This could be related to specific busy periods at the company, or relate to specific incidents that have occurred. It can also be a matter of seasons (employees tend to be happier before or around holiday seasons, for example).
fs, axs = plt.subplots(1, figsize=(20,10))
votes.groupby(['voteDate']).agg({'vote':'mean'}).reset_index().\
plot.scatter(x='voteDate', y = 'vote', alpha = 0.4, ax = axs)
plt.title('Seasonality in the average vote')
plt.xlabel('Date of vote')
plt.show()
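A complementary way to surface this seasonality is to average the votes per calendar month; a sketch (the monthly granularity is an illustrative choice):
# Monthly average vote, smoothing out day-to-day noise
monthly_vote = votes.groupby(pd.Grouper(key='voteDate', freq='M'))['vote'].mean()
print(monthly_vote)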
# joining the votes table with employee details to add the details of the employee
votes = votes.merge(employee_details, how = 'left', on = 'employee')
Does the average vote decrease just before a person quits the organisation? We can check this by filtering for the employees who have already quit and looking at their average votes relative to their last day. We find that the votes are fairly consistent, with a mean of around 2.8, before the notice period, while they decrease during the notice period (presumably around 60 days).
people_who_left = votes[votes.stillExists == 0].copy()
people_who_left.loc[:,'no_of_days_since_exit'] = ((people_who_left.voteDate - people_who_left.deletedOn)/np.timedelta64(1, 'D')).astype(int)-1
fs, axs = plt.subplots(1, figsize=(20,10))
people_who_left.groupby(['new_employee', 'no_of_days_since_exit']).agg({'vote':'mean'}).reset_index().\
plot.scatter(x='no_of_days_since_exit', y = 'vote', alpha = 0.4, ax = axs)
axs.set_xlim([-200, 0])
axs.set_ylim([2, 3.5])
plt.title('Average vote for employees who have left the organisation')
plt.xlabel('Number of days since exit date')
plt.show()
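A rough numeric check of this pattern, using the presumed 60-day notice period (the 60 days is an assumption of this analysis, not something recorded in the data):
# Mean vote before vs during the assumed 60-day notice period
in_notice = people_who_left['no_of_days_since_exit'] >= -60
print(people_who_left.loc[~in_notice, 'vote'].mean())  # before the assumed notice period
print(people_who_left.loc[in_notice, 'vote'].mean())   # during the assumed notice period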
As we have employees across 5 time zones, the average votes across time zones are also shown. We can see distinct seasonality patterns across different geographies. Some geographies have low variation as they have very few employees.
fs, axs = plt.subplots(1, figsize=(20,10))
sns.scatterplot(data = votes.groupby(['timezone', 'voteDate']).agg({'vote':'mean'}).reset_index(),
x='voteDate', y = 'vote', hue = 'timezone', alpha = 0.4, ax = axs)
plt.title('Average vote for employees in different geographies')
plt.xlabel('Vote date')
plt.show()
Feedback dataset¶
The next dataset of interest is the comments (feedback) dataset. This contains all the comments given by employees, and the number of likes and dislikes on each comment. Likes, dislikes, and comments are all proxies for engagement.
feedback = pd.read_csv("comments_by_employees_in_anonymous_forum.csv")
feedback
 | employee | companyAlias | commentId | commentLength | likes | dislikes | commentDate | feedbackType |
---|---|---|---|---|---|---|---|---|
0 | aKP | C1 | 5909b33da2ede4000473da6f | 17 | 9 | 0 | 2017-05-03 12:05 | OTHER |
1 | dNx | C1 | 5909b6aca2ede4000473da72 | 25 | 12 | 0 | 2017-05-03 12:05 | OTHER |
2 | ONv | C1 | 5909c2dea2ede4000473db8c | 58 | 33 | 5 | 2017-05-03 13:05 | OTHER |
3 | e9M | C1 | 5909d32ea2ede4000473db97 | 56 | 11 | 4 | 2017-05-03 14:05 | OTHER |
4 | RWM | C1 | 5909f227a2ede4000473dcbe | 105 | 18 | 0 | 2017-05-03 17:05 | OTHER |
... | ... | ... | ... | ... | ... | ... | ... | ... |
5067 | 7o1 | C1 | 5c7108e8434c4500041722b0 | 28 | 0 | 0 | 2019-02-23 09:02 | OTHER |
5068 | N3 | C1 | 5c71519ca9f66e00042896f6 | 14 | 0 | 0 | 2019-02-23 14:02 | OTHER |
5069 | DNY | C1 | 5c73b11e50b72e0004cab283 | 63 | 0 | 0 | 2019-02-25 10:02 | OTHER |
5070 | 72j | C1 | 5c744971e29c7b0004391da3 | 44 | 0 | 0 | 2019-02-25 21:02 | OTHER |
5071 | qKO | C1 | 5c781339efad100004ebb886 | 39 | 0 | 0 | 2019-02-28 17:02 | OTHER |
5072 rows × 8 columns
feedback.describe(include='all', datetime_is_numeric=True)
 | employee | companyAlias | commentId | commentLength | likes | dislikes | commentDate | feedbackType |
---|---|---|---|---|---|---|---|---|
count | 5072 | 5072 | 5072 | 5072.000000 | 5072.000000 | 5072.00000 | 5072 | 5072 |
unique | 305 | 1 | 5072 | NaN | NaN | NaN | 3580 | 5 |
top | 4ov | C1 | 5909b33da2ede4000473da6f | NaN | NaN | NaN | 2018-09-10 09:09 | OTHER |
freq | 373 | 5072 | 1 | NaN | NaN | NaN | 26 | 3188 |
mean | NaN | NaN | NaN | 168.518336 | 13.605875 | 4.89097 | NaN | NaN |
std | NaN | NaN | NaN | 193.802568 | 15.280530 | 6.62993 | NaN | NaN |
min | NaN | NaN | NaN | 1.000000 | 0.000000 | 0.00000 | NaN | NaN |
25% | NaN | NaN | NaN | 48.000000 | 2.000000 | 0.00000 | NaN | NaN |
50% | NaN | NaN | NaN | 111.000000 | 9.000000 | 3.00000 | NaN | NaN |
75% | NaN | NaN | NaN | 223.000000 | 20.000000 | 7.00000 | NaN | NaN |
max | NaN | NaN | NaN | 2509.000000 | 135.000000 | 65.00000 | NaN | NaN |
We can observe that as the number of interactions grows, the overall sentiment tends towards either likes or dislikes. We can also observe that comments with more than 80 interactions are usually highly liked.
feedback['log_comment_length'] = np.log10(feedback.commentLength)
feedback['total_interactions'] = feedback.likes+feedback.dislikes
feedback['mean_feedback'] = (feedback.likes-feedback.dislikes)/feedback.total_interactions
fs, axs = plt.subplots(1, figsize=(20,10))
sns.scatterplot(data = feedback,
x='total_interactions', y = 'mean_feedback', hue = 'log_comment_length', alpha = 0.8, ax = axs)
plt.title('Mean interaction vs response level for employee')
plt.xlabel('Number of interactions')
plt.ylabel('Mean of feedback')
plt.show()
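The claim about heavily interacted comments can also be checked numerically (the 80-interaction threshold is taken from the observation above):
# Feedback balance for comments with more than 80 interactions
print(feedback.loc[feedback['total_interactions'] > 80, 'mean_feedback'].describe())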
We can see that there are five types of comments:
feedback.groupby('feedbackType').\
aggregate({'commentId':'count'}).\
plot.pie(y='commentId', figsize=(10, 10),
autopct='%1.1f%%', # to add the percentages text
wedgeprops=dict(width=.5)
)
plt.ylabel("")
plt.title('Frequency of comments')
plt.show()
Similarly, we can see a trend across the different comment types, with congratulations receiving more likes than dislikes, while suggestions and information have similar distributions. We can also see that the number of likes is skewed.
fs, axs = plt.subplots(1, figsize=(20,10))
sns.scatterplot(data = feedback,
x='likes', y = 'dislikes', hue = 'feedbackType', alpha = 0.4, ax = axs)
plt.title('Total likes and dislikes for comments')
plt.show()
feedback.groupby('feedbackType').aggregate({'likes':'mean', 'dislikes':'mean'})
feedbackType | likes | dislikes |
---|---|---|
CONGRATULATION | 30.941538 | 3.824615 |
CRITICISM | 11.125000 | 1.562500 |
INFORMATION | 18.751152 | 7.657450 |
OTHER | 9.325282 | 3.705144 |
SUGGESTION | 19.219917 | 7.290456 |
f, ax = plt.subplots(2, figsize = (20,20), sharex = True)
sns.boxplot(data = feedback.groupby(['feedbackType', 'employee'])['commentId'].count().reset_index().assign(
log_count_comments = lambda df: np.log10(df.commentId)
),
x = 'feedbackType', y = 'log_count_comments', ax = ax[0])
sns.boxplot(data = feedback, x = 'feedbackType', y = 'log_comment_length', ax = ax[1])
ax[0].set_ylabel('Log of number of comments')
ax[1].set_ylabel('Log of comment length')
ax[0].set_title('Number of comments and comment length across different feedback types')
plt.show()
# feature engineering on the feedback dataset
feedback['commentDate'] = pd.to_datetime(feedback['commentDate'])
# sort by date in ascending to do cumulative sum and rolling calculations
feedback = feedback.sort_values(['employee', 'commentDate'])
# Rolling two likes and dislikes, indicating the last two likes and dislikes
feedback['last_2_likes'] = feedback.groupby('employee').likes.transform(lambda s: s.rolling(2, closed='left').mean())
feedback['last_2_dislikes'] = feedback.groupby('employee').dislikes.transform(lambda s: s.rolling(2, closed='left').mean())
feedback['date'] = feedback.commentDate.dt.date.astype('datetime64[ns]')
feedback
 | employee | companyAlias | commentId | commentLength | likes | dislikes | commentDate | feedbackType | log_comment_length | total_interactions | mean_feedback | last_2_likes | last_2_dislikes | date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4373 | 19Q | C1 | 5959ec954040610004272a21 | 11 | 0 | 0 | 2017-07-03 09:07:00 | OTHER | 1.041393 | 0 | NaN | 0.0 | 0.0 | 2017-07-03 |
970 | 19Q | C1 | 5970e1483da0e10004b17a27 | 64 | 3 | 4 | 2017-07-20 18:07:00 | OTHER | 1.806180 | 7 | -0.142857 | 3.0 | 4.0 | 2017-07-20 |
1903 | 19Q | C1 | 5a36f82b26c0110004c55d90 | 57 | 17 | 6 | 2017-12-18 00:12:00 | OTHER | 1.755875 | 23 | 0.478261 | 20.0 | 10.0 | 2017-12-18 |
1949 | 19Q | C1 | 5a40e7d3de51cb00042dfda4 | 31 | 5 | 0 | 2017-12-25 12:12:00 | OTHER | 1.491362 | 5 | 1.000000 | 22.0 | 6.0 | 2017-12-25 |
1972 | 19Q | C1 | 5a4a8275eb84e0000492659f | 11 | 8 | 0 | 2018-01-01 19:01:00 | OTHER | 1.041393 | 8 | 1.000000 | 13.0 | 0.0 | 2018-01-01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4740 | zRx | C1 | 5be95746ed7ae70004b83417 | 11 | 0 | 0 | 2018-11-12 11:11:00 | OTHER | 1.041393 | 0 | NaN | 0.0 | 0.0 | 2018-11-12 |
4746 | zRx | C1 | 5be95746ed7ae70004b83415 | 26 | 0 | 0 | 2018-11-12 11:11:00 | OTHER | 1.414973 | 0 | NaN | 0.0 | 0.0 | 2018-11-12 |
4758 | zRx | C1 | 5be95747ed7ae70004b83419 | 33 | 0 | 0 | 2018-11-12 11:11:00 | OTHER | 1.518514 | 0 | NaN | 0.0 | 0.0 | 2018-11-12 |
4764 | zRx | C1 | 5be95746ed7ae70004b83418 | 11 | 0 | 0 | 2018-11-12 11:11:00 | OTHER | 1.041393 | 0 | NaN | 0.0 | 0.0 | 2018-11-12 |
4781 | zRx | C1 | 5be95746ed7ae70004b83416 | 8 | 0 | 0 | 2018-11-12 11:11:00 | OTHER | 0.903090 | 0 | NaN | 0.0 | 0.0 | 2018-11-12 |
5072 rows × 14 columns
Create till-date variables¶
Create variables that count the number of likes, dislikes, and comments till date. These can be used to derive averages till date and other variables, as sketched after the code below.
# Count the likes and dislikes till the date given
feedback['likes_till_date']= feedback.groupby('employee')['likes'].transform('cumsum')
feedback['dislikes_till_date']= feedback.groupby('employee')['dislikes'].transform('cumsum')
feedback['comments_till_date']= feedback.groupby('employee')['commentId'].transform('cumcount')
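As an example of an 'average till date' feature that could be derived from these counts, a sketch is shown below (the column name is illustrative and this particular feature is not used further in this post):
# Average likes per comment up to and including the current comment
# (comments_till_date is a zero-based cumulative count, hence the +1)
feedback['avg_likes_per_comment_till_date'] = feedback['likes_till_date'] / (feedback['comments_till_date'] + 1)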
Merging all the datasets¶
Now we combine the employee absenteeism data with the votes and comments datasets. We have the following issues:
1. The absenteeism dataset is a subset of the data and does not cover all employees and months
2. Not all employees have voted, and even those who have, voted infrequently
3. Not all employees have commented, and even those who have, commented infrequently
So we have the following assumptions:
1. We are working with the data only for the employees that exist in the absenteeism dataset
2. We assume that the employee's most recent vote reflects how the employee feels today
3. We also assume that the most recent comment, together with its likes and dislikes, has an effect on the employee
With these assumptions, we combine the datasets.
Handling missing dates in the dataframe¶
Adding rows for the dates that are missing from the dataset, so that every employee has one row per day.
# Creating a dataframe containing the time from the start to the end of voting period
time_dataframe = pd.DataFrame({'date':pd.date_range(min(votes['voteDate']).date(), max(votes['voteDate']).date())})
time_dataframe['tmp'] = 1
# Creating a dataframe with all employees in the absenteeism dataset
complete_df = pd.DataFrame({'employee':df.employee.unique()})
complete_df['tmp'] = 1
# Creating a dataset that contains the combinations of all days for all employees
complete_df = pd.merge(complete_df, time_dataframe, on=['tmp']).drop('tmp', axis = 1)
Merging all the datasets
# Joining the feedback (comments) given by the employees
complete_df = pd.merge(complete_df, feedback, how = 'left', on=['date', 'employee'])
complete_df = complete_df[['employee', 'date', 'likes', 'dislikes', 'feedbackType', 'likes_till_date', 'dislikes_till_date', 'last_2_likes', 'last_2_dislikes', 'commentDate']]
# Forward-filling the last available feedback data for days when there was no new feedback from an employee.
complete_df[['date', 'likes', 'dislikes', 'feedbackType', 'likes_till_date', 'dislikes_till_date', 'last_2_likes', 'last_2_dislikes', 'commentDate']] = \
complete_df.groupby('employee').fillna(method='ffill')
# Creating new features
complete_df['days_since_last_comment'] = (complete_df.date-complete_df.commentDate)
complete_df['days_since_last_comment'] = (complete_df['days_since_last_comment'].fillna(pd.Timedelta(seconds=0))/np.timedelta64(1, 'D')).astype(int)
# Assuming that for the employees that have not commented (yet) the feedback is 0.
complete_df = \
complete_df[['employee', 'date', 'likes', 'dislikes', 'feedbackType', 'likes_till_date', 'dislikes_till_date', 'last_2_likes', 'last_2_dislikes', 'days_since_last_comment']].\
fillna(0)
# Joining the votes dataset for every employee-date
complete_df = pd.merge(complete_df, votes, how = 'left', on=['date', 'employee'])
# Forward-filling the last available vote data for days when an employee did not vote.
complete_df[[ 'no_of_votes_till_date','perc_days_voted', 'deletedOn', 'new_employee', 'min_date', 'stillExists', 'vote',
'avg_vote_till_date', 'avg_vote','last_2_votes_avg', 'voteDate', 'timezone']] = \
complete_df.groupby('employee')[['no_of_votes_till_date','perc_days_voted', 'deletedOn', 'new_employee', 'min_date', 'stillExists', 'vote',
'avg_vote_till_date', 'avg_vote','last_2_votes_avg', 'voteDate', 'timezone']].\
fillna(method='ffill')
# Remove the data from before the employee's first vote (avg_vote is still NaN there)
complete_df = complete_df[complete_df.avg_vote >= 0]
# Remove data after employee left
complete_df = complete_df[(complete_df.stillExists == 1) | ((complete_df.stillExists == 0) & (complete_df.date <= complete_df.deletedOn))]
# Recomputing no_of_days_since_first_vote
complete_df.no_of_days_since_first_vote = ((complete_df.date - complete_df.min_date)/np.timedelta64(1, 'D')).astype(int)+1
# Adding new features
# assuming a ~60-day notice period; the days leading up to an employee's exit are recorded as a countdown
complete_df.deletedOn = complete_df.deletedOn.fillna(pd.to_datetime(date.today()))
complete_df['countdown_to_last_day'] = ((complete_df.date - complete_df.deletedOn)/np.timedelta64(1, 'D')).astype(int)+1
complete_df.loc[complete_df.countdown_to_last_day < -15, 'countdown_to_last_day'] = 999 # dates more than 15 days before the last day get a sentinel value of 999
# computing days since last vote
complete_df['days_since_last_vote'] = (complete_df.date-complete_df.voteDate)
complete_df['days_since_last_vote'] = (complete_df['days_since_last_vote'].fillna(pd.Timedelta(seconds=0))/np.timedelta64(1, 'D')).astype(int)
# Imputing still exists column
complete_df[['timezone', 'stillExists']] = complete_df.groupby('employee')[['timezone', 'stillExists']].fillna(method = 'bfill')
# Selecting the features used in the model
complete_df = complete_df[
['employee', 'date', 'likes', 'dislikes', 'feedbackType',
'likes_till_date', 'dislikes_till_date', 'last_2_likes', 'last_2_dislikes',
'days_since_last_comment', 'vote', 'timezone', 'stillExists',
'no_of_days_since_first_vote', 'no_of_votes_till_date',
'perc_days_voted', 'avg_vote_till_date', 'avg_vote', 'last_2_votes_avg', 'prev_vote',
'days_since_last_vote', 'new_employee', 'countdown_to_last_day']].fillna(0)
complete_df
 | employee | date | likes | dislikes | feedbackType | likes_till_date | dislikes_till_date | last_2_likes | last_2_dislikes | days_since_last_comment | vote | timezone | stillExists | no_of_days_since_first_vote | no_of_votes_till_date | perc_days_voted | avg_vote_till_date | avg_vote | last_2_votes_avg | prev_vote | days_since_last_vote | new_employee | countdown_to_last_day |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | 19Q | 2017-06-13 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 3.0 | Europe/Madrid | 1.0 | 1 | 1.0 | 1.000000 | 3.000000 | 3.539171 | 0.0 | 0.0 | 0 | 1.0 | 999 |
42 | 19Q | 2017-06-14 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 4.0 | Europe/Madrid | 1.0 | 1 | 2.0 | 1.000000 | 3.500000 | 3.539171 | 0.0 | 3.0 | 0 | 1.0 | 999 |
43 | 19Q | 2017-06-15 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 3.0 | Europe/Madrid | 1.0 | 2 | 3.0 | 1.000000 | 3.333333 | 3.539171 | 3.5 | 4.0 | 0 | 1.0 | 999 |
44 | 19Q | 2017-06-16 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 3.0 | Europe/Madrid | 1.0 | 3 | 3.0 | 1.000000 | 3.333333 | 3.539171 | 3.5 | 0.0 | 0 | 1.0 | 999 |
45 | 19Q | 2017-06-17 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 4.0 | Europe/Madrid | 1.0 | 4 | 4.0 | 0.800000 | 3.500000 | 3.539171 | 3.5 | 3.0 | 0 | 1.0 | 999 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
42142 | 3WW | 2019-03-07 | 22.0 | 1.0 | CONGRATULATION | 22.0 | 1.0 | 22.0 | 1.0 | 327 | 4.0 | Europe/Madrid | 1.0 | 526 | 372.0 | 0.707224 | 2.131720 | 2.151596 | 4.0 | 4.0 | 0 | 1.0 | 999 |
42143 | 3WW | 2019-03-08 | 22.0 | 1.0 | CONGRATULATION | 22.0 | 1.0 | 22.0 | 1.0 | 328 | 4.0 | Europe/Madrid | 1.0 | 527 | 373.0 | 0.707780 | 2.136729 | 2.151596 | 4.0 | 4.0 | 0 | 1.0 | 999 |
42144 | 3WW | 2019-03-09 | 22.0 | 1.0 | CONGRATULATION | 22.0 | 1.0 | 22.0 | 1.0 | 329 | 4.0 | Europe/Madrid | 1.0 | 528 | 374.0 | 0.708333 | 2.141711 | 2.151596 | 4.0 | 4.0 | 0 | 1.0 | 999 |
42145 | 3WW | 2019-03-10 | 22.0 | 1.0 | CONGRATULATION | 22.0 | 1.0 | 22.0 | 1.0 | 330 | 4.0 | Europe/Madrid | 1.0 | 529 | 375.0 | 0.708885 | 2.146667 | 2.151596 | 4.0 | 4.0 | 0 | 1.0 | 999 |
42146 | 3WW | 2019-03-11 | 22.0 | 1.0 | CONGRATULATION | 22.0 | 1.0 | 22.0 | 1.0 | 331 | 4.0 | Europe/Madrid | 1.0 | 530 | 376.0 | 0.709434 | 2.151596 | 2.151596 | 4.0 | 4.0 | 0 | 1.0 | 999 |
33451 rows × 23 columns
# Combining the leaves dataset
data = pd.merge(complete_df, leave_df, how = 'left', on=['date', 'employee'])
data.on_leave = data.on_leave.fillna(0)
# Get the leave status of the previous day
data['previous_day_leave'] = data.groupby('employee')['on_leave'].shift().fillna(0)
# Renaming columns to more suitable ones
data.rename(columns={"likes": "last_likes", "dislikes": "last_dislikes", "feedback_type":"last_feedback_type",
"vote":"last_vote", "new_employee":"employee_joined_after_jun17"}, inplace = True)
Back fill and forward fill for null values¶
# Back fill, then forward fill, missing values in no_leaves_till_date
data.no_leaves_till_date = data.no_leaves_till_date.fillna(method = 'bfill').fillna(method = 'ffill')
Create rolling n-days sum¶
# Get the number of days the person was on leave in the last two days
def get_rolling_sum(grp, freq, col):
return(grp.rolling(freq, on='date')[col].sum())
data = data.sort_values(['employee', 'date'])
data['last_2_days_leaves'] = data.groupby('employee', as_index = False, group_keys=False).\
apply(get_rolling_sum, '2D', 'on_leave') # 2d for 2 days
The correlation matrix for all the numeric columns is:
red_green_cmap = sns.diverging_palette(10, 133,as_cmap=True)
f = plt.figure(figsize=(20, 20))
plt.matshow(data.corr(), fignum = f, cmap = red_green_cmap, vmin=-1, vmax=1)
plt.xticks(range(data.select_dtypes(['number']).shape[1]),
data.select_dtypes(['number']).columns, rotation=90, fontsize = 15)
plt.yticks(range(data.select_dtypes(['number']).shape[1]),
data.select_dtypes(['number']).columns, rotation=0, fontsize = 15)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title("Correlation between different parameters", fontsize=20,loc='left')
plt.show()
The distributions of all the numeric columns are:
_ = data.hist(figsize = (20,20))
Date as a predictor variable¶
We have previously seen that most leaves start on the first of the month or end on the last day of the month. This suggests that the day of the week, the month, and other date characteristics may also be useful predictors.
Create weekday, month and week numbers¶
day_name= ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday','Sunday']
month_name = [None, 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
data['weekday'] = data.date.dt.weekday.apply(lambda x:day_name[x])
data['month'] = data.date.dt.month.apply(lambda x:month_name[x])
data['week'] = data.date.dt.day % 7 # day of the month modulo 7, a rough week-of-month indicator
data
 | employee | date | last_likes | last_dislikes | feedbackType | likes_till_date | dislikes_till_date | last_2_likes | last_2_dislikes | days_since_last_comment | last_vote | timezone | stillExists | no_of_days_since_first_vote | no_of_votes_till_date | perc_days_voted | avg_vote_till_date | avg_vote | last_2_votes_avg | prev_vote | days_since_last_vote | employee_joined_after_jun17 | countdown_to_last_day | reason | on_leave | no_leaves_till_date | previous_day_leave | last_2_days_leaves | weekday | month | week |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
23729 | 17r | 2018-05-29 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 3.0 | Europe/Madrid | 1.0 | 1 | 1.0 | 1.000000 | 3.000000 | 2.121212 | 0.0 | 0.0 | 0 | 1.0 | 999 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | Tuesday | May | 1 |
23730 | 17r | 2018-05-30 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 3.0 | Europe/Madrid | 1.0 | 1 | 1.0 | 1.000000 | 3.000000 | 2.121212 | 0.0 | 0.0 | 0 | 1.0 | 999 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | Wednesday | May | 2 |
23731 | 17r | 2018-05-31 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 3.0 | Europe/Madrid | 1.0 | 2 | 1.0 | 1.000000 | 3.000000 | 2.121212 | 0.0 | 0.0 | 1 | 1.0 | 999 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | Thursday | May | 3 |
23732 | 17r | 2018-06-01 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 3.0 | Europe/Madrid | 1.0 | 3 | 2.0 | 0.500000 | 3.000000 | 2.121212 | 0.0 | 3.0 | 0 | 1.0 | 999 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | Friday | Jun | 1 |
23733 | 17r | 2018-06-02 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 3.0 | Europe/Madrid | 1.0 | 4 | 2.0 | 0.500000 | 3.000000 | 2.121212 | 0.0 | 0.0 | 0 | 1.0 | 999 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | Saturday | Jun | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32232 | zGB | 2019-03-07 | 24.0 | 8.0 | OTHER | 24.0 | 8.0 | 24.0 | 8.0 | 299 | 3.0 | Europe/Madrid | 1.0 | 639 | 66.0 | 0.109272 | 3.015152 | 3.015152 | 3.0 | 0.0 | 34 | 0.0 | 999 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | Thursday | Mar | 0 |
32233 | zGB | 2019-03-08 | 24.0 | 8.0 | OTHER | 24.0 | 8.0 | 24.0 | 8.0 | 300 | 3.0 | Europe/Madrid | 1.0 | 640 | 66.0 | 0.109272 | 3.015152 | 3.015152 | 3.0 | 0.0 | 35 | 0.0 | 999 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | Friday | Mar | 1 |
32234 | zGB | 2019-03-09 | 24.0 | 8.0 | OTHER | 24.0 | 8.0 | 24.0 | 8.0 | 301 | 3.0 | Europe/Madrid | 1.0 | 641 | 66.0 | 0.109272 | 3.015152 | 3.015152 | 3.0 | 0.0 | 36 | 0.0 | 999 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | Saturday | Mar | 2 |
32235 | zGB | 2019-03-10 | 24.0 | 8.0 | OTHER | 24.0 | 8.0 | 24.0 | 8.0 | 302 | 3.0 | Europe/Madrid | 1.0 | 642 | 66.0 | 0.109272 | 3.015152 | 3.015152 | 3.0 | 0.0 | 37 | 0.0 | 999 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | Sunday | Mar | 3 |
32236 | zGB | 2019-03-11 | 24.0 | 8.0 | OTHER | 24.0 | 8.0 | 24.0 | 8.0 | 303 | 3.0 | Europe/Madrid | 1.0 | 643 | 66.0 | 0.109272 | 3.015152 | 3.015152 | 3.0 | 0.0 | 38 | 0.0 | 999 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | Monday | Mar | 4 |
33451 rows × 31 columns
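As a quick check that these date features carry some signal, we can look at the average on-leave rate by weekday and by month (a rough comparison, not a significance test):
# Average on-leave rate by weekday and by month
print(data.groupby('weekday')['on_leave'].mean())
print(data.groupby('month')['on_leave'].mean())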
The second part of the blog is here.
References¶
- Notes and lectures, Workforce Analytics module, MSc Business analytics, Imperial College London, Class 2020-22
- Kyburz J, Morelli D, Schaaf A, Villani F, Wheatley D: Predicting absenteeism
- Harsha A, Shaked A, Artem G, Tebogo M, Gokhan M: Predicting absenteeism Workforce Analytics module
- Bebenroth, R., & Berengueres, J. O. (2020). New hires' job satisfaction time trajectory. International Journal of Human Resources Development and Management, 20(1), 61-74.
# Save the engineered dataset for the modelling in Part 2
data.to_csv('data_after_feature_engg.csv')