Hands-on Tutorials

Human Resource analytics — Can we predict Employee Turnover with caret in R?

Test different machine learning algorithms on small-to-medium-enterprises (SMEs) while keeping an eye on algorithmic biases

Hannah Roos
Towards Data Science
35 min read · Jul 18, 2021


Photo by Charles Forerunner on Unsplash

People are the key factor for success in every organization — nothing produces as much value as skilled minds in the right place at the right time. This is why organizations all over the world make tremendous efforts to find and — maybe even more importantly — to retain valuable talent. In a world of data, HR managers no longer rely only on their gut feelings when designing strategies to develop their own workforce of high-calibre minds: they make use of analytics to improve their HR practices and to make business success as well as employee satisfaction truly measurable.

In the case of employee turnover, predictive analytics is not only thought to benefit the people but also to protect the company’s finances: when a skilled team member leaves voluntarily, a lot of time and money has to be spent on finding and onboarding a suitable substitute. In addition, it can affect the firm’s overall productivity, customer loyalty and timely delivery of products (Hammermann & Thiele, 2019; Sexton et al., 2005). This is one of the reasons why a whole field emerged from the idea of using data to support human resources: HR analytics (also called people analytics) is about changing the way of recruiting and retaining talent based on data-driven insights (Isson & Harriott, 2016). Data analytics is used to predict behavioural patterns (e.g., attrition rates, training costs, productivity) which are inherently informative to the respective management because they can guide its decision-making process. Based on the successful implementation of machine learning algorithms, some of the big players already apply predictive analytics to decrease attrition and increase retention of their profitable employees. For example, the (former) Senior Vice President of HR at Google argued that statistics were used to fully automate their job interview questions based on candidates’ profiles, and that employee data were actually used to predict turnover (Bock, 2016). Now, as data enthusiasts, it is our task to support HR managers for the sake of their planning continuity while helping them to reduce the costs related to frequent turnover and to facilitate successful growth on the market.

“HR analytics (also called people analytics) is about changing the way of recruiting and retaining talent based on data-driven insights.” — Isson & Harriot, 2016

A quick search on Google Scholar reveals that there are a bunch of research articles out there that demonstrate how different ML algorithms can predict employee turnover. Nevertheless, the focus is usually put on technical characteristics (e.g., model performance, feature selection etc.) while the practical context of these applications is more or less left to the reader’s interpretation. For example, Zhao and colleagues evaluated different supervised machine learning techniques to predict employee turnover on simulated and real HR datasets of small-, medium- and large-sized organizations (Zhao et al., 2019). The machine learning algorithms used to predict employee turnover range from decision trees, random forests, gradient boosting trees and extreme gradient boosting to logistic regression, support vector machines, neural networks, linear discriminant analysis, Naïve Bayes methods and K-nearest neighbours.

Even if attempts to predict employee turnover with modern analytics seem to have a huge potential, there are some limitations which could make it challenging to transfer these scientific findings to real-world cases from the industry:

1. Predicting and explaining behaviour are two different things.

When working with data, we can pursue either of two strategies: if our goal is to predict relevant outcomes, we do not have to fully understand the mechanisms that are at play (Yarkoni & Westfall, 2017). Our strategy would then be prediction-focused. In the case of our employee turnover problem, we may not want to lose too much time racking our brains over the “why” when we can already forecast which employees are at risk of leaving soon — after all, our chance to change something lies only in the future. A good machine learning model does not have to be based on theory to make accurate predictions because it inherently learns from data: the algorithm mimics the outputs of the data-generating process when fed with new observations (e.g., new employees) without explicitly “knowing” anything about the reasons.

But if you have a strong academic background and ask a lot of “why”-questions, you would probably argue that we would also like to know why employees leave the organization in the first place. If we have no clue about the underlying mechanism that causes employees to churn, it will be even harder to design targeted interventions. Fortunately, there are studies highlighting the importance of regular pay raises, the role of business travel and job satisfaction to employee turnover. This makes it easier for us to pinpoint the actual “pain points” from inside of the organization and truly understand what drives the intention to go.

2. Statistics are not sufficient to deal with individuals.

In the complex world we live in, data are not a magic key to a world full of perfectly computed, valid decisions that make everybody’s lives easier. Still, it sounds cool and advanced when people talk about using data to make evidence-based decisions. But if a machine learning algorithm is later applied to strongly inform decisions on SINGLE employees (e.g., when applied to rank candidates from a set of applicants), this procedure can easily take on an unethical flavour. When applied incorrectly, applicants are no longer reviewed individually as a person but as a score that estimates their probability to perform well on the job. Using data mining techniques, historical and labelled employee data can be used to detect features associated with high job performance and later predict a new hire’s likelihood to perform well on the job (Mahmoud et al., 2019). Consequently, other people’s performance data along with some key measures (e.g., IQ, personality tests, structured interview results) or their curriculum vitae serve as a basis to forecast a new employee’s performance (Kluemper, Rosen & Mossholder, 2012; Apatean, Szakacs & Tilca, 2017; Li, Lai & Kao, 2008). Hence, the algorithm is trained with data from the past to predict the future. Especially in such a high-stakes context like job applications, this seems pretty deterministic to me and should be viewed with caution: the model is only capable of capturing associations from a specific moment in time, even though they will change dynamically from person to person as well as across the organization’s development over time.

Moreover, the algorithms typically also replicate discriminatory biases that are inherently represented in the data (e.g., being female could be predictive of having difficulties in gaining a leadership position), which makes the model’s actual deployment within an organization hard to justify. To avoid any backfiring from employees, organizations thus need to take the issue of adverse impact against certain groups or individuals (e.g., parents, black people, pregnant women etc.) very seriously. A first step is to measure these biases statistically and correct them to create a fair AI that benefits employees as well as the organization as a whole. You can find more ideas for fighting discriminatory biases in an article by Andrew Burt published in 2020 in Harvard Business Review. This problem is strongly tied to the impression the organization makes on external candidates: candidates should have a reasonable chance to convince the team by means of their actual skills and knowledge, without any biases or expectations. If the application of machine learning algorithms for recruiting purposes were made transparent, it might feel strange to get a job thanks to the mere combination of features that have proven to be success factors for some predecessors.

A similar thought applies to the prediction of employee turnover: even if there is a set of features considered to be key drivers of employee churn (last pay raise, business travel, person-job fit, distance from home etc.), it is pretty obvious that the intention to quit a job is highly personal. By means of data analytics tools, we can only describe general patterns from a pool of individuals. We can even try to predict behaviour based upon these general tendencies. But we can never really know for sure if they apply to everybody.
If data-driven insights can affect people’s lives (e.g., hiring decisions, retention efforts…), we should stay very careful and constantly question the sanity of our procedures.

3. The use of analytics does not justify unethical practices.

AI dystopia often involves machines making ethically sensitive decisions, turning computers into decision-makers. Even if the following examples are far from reaching this kind of scenario, we should be aware that even if data can inform decision makers with reasonable insights, they are not the decision makers themselves and should be used under proper data protection and privacy guidelines. When it comes to the use of HR analytics for recruiting purposes, it has been suggested to give candidates the opportunity to opt-in and have control over their data by deciding whether or not potential employers and recruiters can assess their digital footprint to address any ethical and legal concerns (Chamorro-Premuzic et al. 2013). Another suggestion touches upon more autonomy for the people affected by HR-analytics tools: Employees should not become passive recipients of algorithmic governance but have the chance to actually understand how the model makes predictions and to give critical feedback if needed.

Another creepy application I have stumbled upon in the literature is the use of social media profiles like LinkedIn or Xing to predict the character of candidates based on sentiment analysis, in order to assess their fit for a job (Faliagka et al., 2012). All of the above procedures certainly uncover interesting insights for researchers and psychologists, but should not be applied when the individuals whose data are being processed have not given any consent.

4. The employee data sets available in industry are often noisy and sparse.

If prediction is based on historical data, we always need to ask ourselves whether these data can really generalize to new, yet unknown observations. Even if simulated HR data seem like a gift to any passionate data scientist, real HR data is often confidential, small, inconsistent and riddled with missing information. Not all medium-sized firms can actually afford large-scale data storage, making it more difficult to store employee data in a consistent way. Moreover, such data typically include just a small proportion of employees who have actually left the company, making the classes (stayed/left) imbalanced — a characteristic that needs special attention when evaluating machine learning models (but more on that later).

Another data-related change in perspective refers to the quality vs. quantity of data: Yahia, Hlel and Colomo-Palacios (2021) argue for a shift from big data to what they call “deep data” — qualitative data that contains all the necessary features to practically predict turnover. Indeed, massive sets of employee data are neither available to small-to-medium-sized firms nor necessary if we can identify the key drivers of turnover. Other voices from the literature go a step further and propose that large unstructured data sets (often referred to as “big data”) are not always better because they can be so noisy that the noise “overwhelms” the model’s predictive capacity (Chamorro-Premuzic et al., 2013).

This is just a gentle reminder to raise your awareness about the power of predictive analytics and its impact on the people — I highly recommend this paper from Dr. Michele Loi who has summarized ethical guidelines for the deployment of people analytics tools above and beyond the GDPR. Keeping these political issues in mind, we will now walk through a little case study to find out if the prediction of employee churn can be reasonably applied to a small fictive sample derived from the famous IBM employee dataset. I am curious to find out if it could potentially work for real-world cases, too!

Data overview — this is the famous IBM HR analytics dataset

The dataset we will use for our case study is a simulated dataset created by IBM Watson Analytics which can be found on Kaggle. It contains 1470 employee entries and 38 common features (monthly income, job satisfaction, gender etc.) — one of which is our target variable, (employee) Attrition (YES/NO). Let’s take a look at our raw data.
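As a minimal sketch of this step (assuming the Kaggle CSV has been downloaded locally; the file name below is the one Kaggle uses, so adjust it to your own copy):

library(data.table)
library(skimr)

# Read the raw IBM Watson data and get a quick structural overview
hr_data <- fread("WA_Fn-UseC_-HR-Employee-Attrition.csv")
skim(hr_data)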

Disclaimer: All graphics are made by the author unless specified differently.

We seem to have 26 numeric variables and 9 character variables which differ in their distinct levels. None of the observations are missing, and the summary the skim function gives us shows some descriptive statistics including the mean, standard deviation and percentiles as well as a histogram. I am a big fan of the skim function — look how practical it is to get such a concise and yet detailed overview of the data!

It is pretty obvious that payment perceived as unfair can influence a person’s intention to leave the job to look for better pay (Harden, Boakye & Ryan, 2018; Sarkar, 2018; Bryant & Allen, 2013). This is why we would like to create another variable that represents the competitiveness of each employee’s monthly income — the reasoning behind this is that employees may compare their income against that of peers who share the same job level. Somebody who perceives his or her payment as fair should be less likely to leave the company than a person who gets considerably less for a similar position. To get there, we will use the data.table syntax to first calculate the median compensation by job level and store the appropriate value for each observation. Then we will divide each employee’s monthly income by this median income to get his or her compensation ratio: a measure that directly represents the person’s payment relative to what would be expected for the job level. Thus, a score of 1 means that the employee exactly matches the average payment for this position, a score of 1.2 means that the employee is paid 20% above the average pay, and a score of 0.8 means that the person is paid 20% less than what would be expected for the usual payment per job level. To represent this at a factor level, we will assign values (a code sketch follows the list below)

  • to “average” for values that lie between 0.75 and 1.25,
  • to “below” for values that lie between 0 and 0.75, and
  • to “above” for values that lie between 1.25 and 2 of the CompensationRatio range.
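Here is a minimal data.table sketch of this feature-engineering step; the column name CompensationLevel for the factor version is my own placeholder, since the article does not spell it out:

# Median pay per job level, the compensation ratio and its factor version
hr_data[, MedianCompensation := median(MonthlyIncome), by = JobLevel]
hr_data[, CompensationRatio := MonthlyIncome / MedianCompensation]
hr_data[, CompensationLevel := fcase(
  CompensationRatio < 0.75, "below",
  CompensationRatio <= 1.25, "average",
  default = "above"
)]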

This is what our newly generated features look like for the first 10 observations:

But how many employees actually left? Let’s calculate the turnover rate to learn something about the distribution of classes.
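A quick way to do this with data.table, consistent with the objects defined above:

# Absolute and relative class frequencies of the target variable
hr_data[, .(n = .N, share = round(.N / nrow(hr_data), 3)), by = Attrition]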

So, it appears that 237 employees (16%) left the company in a given timeframe while the majority (almost 84%) stayed. As argued above, many HR managers do not have access to huge datasets containing thousands of employees with complete records. Now what if we wanted to advise a small-to-medium firm with 50 to 250 employees? Can we still train ML algorithms to predict turnover?

To create an extra bit of a challenge and mimic a real-life sample from a small-to-medium-sized company, we will randomly draw 126 observations from our full IBM Watson dataset. I will set a seed to make it replicable for you.
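A sketch of the sampling step (the seed value 123 is an arbitrary assumption; any fixed seed serves reproducibility):

# Draw a reproducible subsample of 126 employees
set.seed(123)
hr_small <- hr_data[sample(.N, 126)]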

Employee attrition — what could possibly drive people to leave?

Let’s now take a closer look at the interplay between two key factors for employee churn: job satisfaction and compensation. A typical hypothesis derived from the literature proposes that higher job satisfaction is associated with a lower likelihood of employee turnover — unhappy employees usually have more reason to leave because they expect to be happier somewhere else and do not feel as emotionally committed to their current organization, making it more desirable and easier to leave as soon as attractive alternatives are found (Zimmermann, Swider & Boswell, 2018).
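One way to visualize this with ggplot2 (a sketch; the article’s actual plot styling may differ):

library(ggplot2)

# Distribution of job satisfaction (1 = low, 4 = very high) by attrition status
ggplot(hr_small, aes(x = JobSatisfaction, fill = Attrition)) +
  geom_density(alpha = 0.5) +
  labs(x = "Job satisfaction", y = "Density")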

It looks like the data generally support this hypothesis: even if the tails of the distribution demonstrate that a few employees who stayed are actually unsatisfied while some leavers are happy, the general tendency suggests that employees who leave are indeed on average less satisfied than the remaining employees in our sample.

Okay — but how does the combination of monthly income and job satisfaction differ in relation to employee churn? More specifically, could a lower income be the reason to leave for employees who are actually satisfied with their job?

Of course, we cannot really tell whether there is a cause-effect relationship between income or job satisfaction and employee attrition — but it is still interesting to see if there is any hint of an association. Surprisingly, for very unsatisfied employees, monthly income is actually higher for leavers than for the remaining employees. This suggests that monthly income alone cannot really account for employee turnover in cases in which employees are not satisfied with their jobs — money is not everything! This is consistent with the observation that satisfaction with compensation is just one side of the coin: to be truly happy with their jobs, employees do not only expect appropriate compensation for their hard work but rather a range of factors that contribute to their overall satisfaction with their work (e.g., non-monetary support from their supervisors, strong relationships with great colleagues, a sense of fulfilment from the work itself etc.) (Zimmermann et al., 2018).

Intriguingly, the relationship seems to be reversed for more satisfied employees: as could be expected, people who stayed are paid considerably better than leavers. This pattern even seems to be more pronounced the higher we climb the happiness ladder: the payment gap seems to increase with each level of job satisfaction. We could speculate that the more satisfied employees have more resources to invest a lot of energy into their work, making them more resilient towards high-pressure job demands (Bakker & Demerouti, 2007). As compensation is often tied to performance, this could result in two scenarios: the boost in energy could lead to an appropriate promotion for some employees, giving them even more incentive to stay with their current employer. But if they are not given any better compensation in return, this could be perceived as inequitable and yet another reason to leave the organization (Birtch, Chiang & Van Esch, 2016). Still, testing this assumption is a bit beyond the scope of this article and difficult to do with a simulated dataset that does not include longitudinal information to begin with.

But we could argue that payment is certainly related to the employee’s job level, as managers naturally earn more than junior consultants. Do the relationships hold if we swap the y-axis for the CompensationRatio variable we created earlier? For example, are medium-to-highly satisfied employees who left paid less than what would be expected for their job level?
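A possible sketch of such a plot, reusing the objects assumed above:

# Pay competitiveness per job-satisfaction level, split by attrition status
ggplot(hr_small, aes(x = factor(JobSatisfaction), y = CompensationRatio, fill = Attrition)) +
  geom_boxplot() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  labs(x = "Job satisfaction", y = "Compensation ratio (1 = typical pay for the job level)")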

Not quite. On average, leavers are often paid the average pay or even better compared to their peers. It does appear that the range of pay competitiveness is a bit larger in the middle of the job satisfaction scale, suggesting that the pay competitiveness of not-so-happy employees who have left can vary widely. But we should be careful here and avoid any premature interpretation: there could simply be more employees who show a medium degree of job satisfaction than employees who lie at the more extreme levels (very happy/unhappy). If we have more employees in the middle of the satisfaction distribution, chances would be higher that each degree of pay competitiveness would be somewhat covered, right? On the other hand, larger samples often cause the distribution to appear less flat and more normally distributed, with the density peaking around the average value, as the central limit theorem suggests. Let’s run a quick sanity check to find out how this applies to our sample:
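For example, a quick count in data.table style:

# How many employees fall into each job-satisfaction level?
hr_small[, .(n = .N, share = round(.N / nrow(hr_small), 2)), by = JobSatisfaction][order(JobSatisfaction)]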

Wow — there are more satisfied than unsatisfied employees in our sample as they amount to about 65% of the observations. Thus, our second interpretation is more likely in this case and at first sight, there are no obvious interaction effects between job satisfaction and pay competitiveness that could contribute to employee attrition.

Being a trained psychologist, I am particularly interested in how the psychological climate may affect employees’ intention to leave. For the modelling part in particular, we will use the caret package, short for Classification And REgression Training, which was developed by Max Kuhn and various other smart contributors. Caret has a nice built-in function to quickly get an impression of the features we are interested in.
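That function is featurePlot; here is a sketch for the climate-related variables (the exact variable selection is my assumption based on the text):

library(caret)

climate_vars <- c("JobSatisfaction", "EnvironmentSatisfaction",
                  "RelationshipSatisfaction", "WorkLifeBalance")

# Density feature plot: one panel per variable, split by attrition status
featurePlot(x = as.data.frame(hr_small[, ..climate_vars]),
            y = factor(hr_small$Attrition),
            plot = "density",
            auto.key = list(columns = 2))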

Because the scale on which each of these variables are rated is ordinal in nature, a density plot looks cool but might not be the ideal choice here: The kurtosis of the curves is directly influenced by our class imbalance (a few who leave and a lot who stay) which could be misleading and we do not want to hallucinate any patterns here where there are none. Therefore, let’s try another technique: mosaic plots.

Now what the mosaic function from the vcd package does is test whether the frequencies in our sample could have been generated by simple chance. This is done by calculating a chi-squared test behind the scenes. We can then analyse the results by looking at both the size of the rectangles and the colours: the area of a rectangle represents the proportion of cases for any given combination of levels, and the colour of the tiles indicates the degree of relationship among the variables — the more the colour deviates from grey, the more we must question statistical independence between the different factor combinations (as represented by the Pearson-residual scale on the right). Usually, dark blue represents more cases than expected under random occurrence while dark red represents fewer cases than expected if they were generated by chance alone.
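As a sketch for one of the climate variables (environment satisfaction); the same pattern can be repeated for the other features:

library(vcd)

# Contingency table of attrition and environment satisfaction, shaded by Pearson residuals
tab <- xtabs(~ Attrition + EnvironmentSatisfaction, data = hr_small)
mosaic(tab, shade = TRUE, legend = TRUE)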

Okay — in our case, every tile is coloured grey, which suggests that there are no strong deviations from statistical independence. Only in the case of environment satisfaction could we wonder whether leavers are overly unsatisfied with their work environment compared to the remaining employees, as indicated by the close-to-significant p-value and the shift between both distributions we can see in the feature plot above. A limitation could be that the mosaic plot function is probably not sensitive enough to capture slight deviations from random frequency distributions — we still have a small sample size that gets further broken down by the combination of categorical levels we try to investigate.

Feature Pre-Processing — make the data ready for analysis and remove any redundancies

We now make our dataset ready for the actual modelling part. As a first step, we will remove all variables that are very unlikely to have any predictive power. For example, employee-ID won’t explain any meaningful variation in employee turnover, therefore it should be deleted for now among some other variables. Other examples include variables that share a lot with other features and therefore could lead to multicollinearity issues (e.g., hourly rate and monthly income). We will save the reduced dataset by properly converting all string variables (e.g., Department) to factors at the same time.
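A sketch of this clean-up; the exact drop list is my reading of the text (ID-like columns plus redundant pay variables) and may differ from the article’s own code:

drop_vars <- c("EmployeeNumber", "EmployeeCount", "StandardHours", "Over18",
               "HourlyRate", "DailyRate", "MonthlyRate", "MedianCompensation")

hr_model <- hr_small[, setdiff(names(hr_small), drop_vars), with = FALSE]

# Convert all remaining character columns (e.g., Department) to factors
char_cols <- names(hr_model)[sapply(hr_model, is.character)]
hr_model[, (char_cols) := lapply(.SD, as.factor), .SDcols = char_cols]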

To be pretty sure that we have not overlooked any highly intercorrelated variables, we will automatically detect and remove them. For this purpose, we first identify any numeric variables, compute a correlation matrix and find correlations that exceed 0.5.
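With caret, this can look roughly like this:

# Flag numeric predictors with pairwise correlations above 0.5
num_cols <- names(hr_model)[sapply(hr_model, is.numeric)]
cor_matrix <- cor(hr_model[, ..num_cols])
findCorrelation(cor_matrix, cutoff = 0.5, names = TRUE)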

So, there are indeed some variables that were flagged by our code — we should take a closer look at: “YearsAtCompany”, “JobLevel”, “MonthlyIncome”,”YearsInCurrentRole” and “PercentSalaryHike”.

  • It appears that YearsAtCompany is highly correlated with YearsInCurrentRole, YearsSinceLastPromotion and YearsWithCurrManager. Thus, just one of these time-related variables can stay: I would suggest keeping “Years since last Promotion” because this could explain some additional variance the others cannot: it relates not only to the time that has passed since the employee entered the company, but also to the years that went by without promotion. As the literature suggests, regular pay raises that are often a consequence of a promotion play a crucial protective role against turnover (Das & Baruah, 2013).
  • JobLevel is highly correlated with Age and with MonthlyIncome: it is plausible that the older employees get, the higher the chances that they have already climbed the career ladder and earn significantly more than in previous years. Because we have discussed the impact of monthly income, I would rather drop JobLevel and Age than our income variable.
  • PercentSalaryHike is highly correlated with PerformanceRating as a strong performance is rewarded with money.

After analysing the interrelations of these variables, let us alter the list of variables which should be removed and create a new dataframe containing the selected variables:
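For example (again a sketch mirroring the reasoning in the bullets above, keeping MonthlyIncome and YearsSinceLastPromotion):

drop_vars_2 <- c("YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager",
                 "JobLevel", "Age", "PercentSalaryHike")

hr_selected <- hr_model[, setdiff(names(hr_model), drop_vars_2), with = FALSE]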

To make our machine learning algorithms work, we need to transform these factor variables into dummy variables. For each factor level, we will have a separate variable indicating whether or not the respective participant falls into this category (e.g. a person who travels rarely would get a 1 instead of a 0). Firstly, we will detect all categorical variables except for our target (attrition). Then, we will make use of caret’s dummyVars function, apply it to our dataset and create a brand new dataframe that contains our selected set of numeric variables, dummy variables and attrition (yes/no). Please note that the caret-function dummyVars transforms the variables into a complete set of dummy variables which means that all factor levels will be covered and none will be left out — a procedure that does not work for linear models for which the output is always compared against a reference level (e.g., if we want to compare the effect of being female against the intercept which represents male employees). Thus, our variables are one-hot encoded.
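A sketch with caret’s dummyVars (object names are my own):

# One-hot encode every factor except the target variable
dummy_model <- dummyVars(~ ., data = hr_selected[, !"Attrition"], fullRank = FALSE)
dummies <- predict(dummy_model, newdata = hr_selected[, !"Attrition"])

hr_encoded <- data.frame(dummies, Attrition = hr_selected$Attrition)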

Next, we will remove variables that do not provide any predictive value using caret’s nearZeroVar function. It applies to predictors that only have a single unique value (i.e. a “zero-variance predictor”). This would be the case if all our employees were frequent travellers, leaving all other options (rare or no travel) blank, which would create a constant in our statistical model. It also applies to predictors that have only a few unique values which occur with very low frequencies (e.g., if 1 out of 100 employees were divorced). For many models (except tree-based models), this may cause the model to crash or the fit to be unstable. As a final pre-processing step, we will make sure that we have ordered the target’s factor levels correctly: I have noticed in the past that caret appears to just take the first level as the positive class (e.g., yes vs. no, win vs. lose etc.), which can sometimes “confuse” the confusion matrix later — for example, specificity and sensitivity can easily be mixed up. Therefore, we want to make sure that an actual employee turnover is considered the positive class by explicitly assigning “yes” and then “no” as the factor levels of our attrition variable.
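Both steps in code (a sketch; the IBM data codes attrition as “Yes”/“No”, which is what the relevel step below assumes):

# Drop (near) zero-variance predictors
nzv <- nearZeroVar(hr_encoded)
if (length(nzv) > 0) hr_encoded <- hr_encoded[, -nzv]

# Put the positive class ("Yes") first so caret treats it as such
hr_encoded$Attrition <- factor(hr_encoded$Attrition, levels = c("Yes", "No"))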

Smart Modelling — deal with overfitting in small samples via cross-validation

Our final dataset is now cleaned up and ready for modelling. Because we have a small sample, we could run into the problem of fitting our model so closely to our sample-specific data that we cannot apply it to new employee data later. This problem is often referred to as overfitting — a phenomenon that could explain why we sometimes cannot replicate previously found effects. For machine learning models, we often split our data into training and validation/test sets to overcome this issue. The training set is used to train the model and the validation/test set is used to validate it on data it has never seen before. If we applied a traditional 80/20 split to our use case, model performance would largely depend on chance because it would differ each time the algorithm randomly selected 25 individuals for testing purposes. This problem becomes even more extreme in our case because we have an imbalance of classes: remember that only a minority of employees (roughly 16% in the full dataset) left the company — this means that our model would probably be tested on about 5 leavers and 20 remaining employees, leaving us with the question whether the algorithm would perform similarly on different cases. Also, if we assessed model performance by looking at prediction accuracy, the result would easily overestimate the actual performance because there are so many negative cases (employees who did not leave) as a reference. You can find more on the question of how to split a small imbalanced dataset in this interesting Stackoverflow discussion.

Fortunately, we have something in our statistical toolbox: cross-validation with various splits. We will apply our trained model(s) to a new set of observations and repeatedly adjust the parameters to reduce prediction error. For these “new observations”, we do not even need a new sample: we will recycle the dataset by training the model on one part of the observations and using another part of the data to test model performance. We will repeat this 5 times and average the test performance to receive a final estimate of our model’s performance — a technique called 5-fold cross-validation. Thus, we can make use of all of our data while the model is still tested on “new” cases.

Different ML models for the same purpose

Now we will set up a reusable trainControl object to build our machine learning models with the same settings: repeated cross-validation makes sure that we run our 5-fold cross-validation process 5 times. Moreover, we ask caret to provide class probabilities in our model output as well as the final predictions, and we want to see the progress of our modelling process (verboseIter = TRUE). We will use caret’s built-in hyperparameter search with its standard settings.
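A sketch of such a control object (the custom F1 summary function it will later use is defined further below):

train_ctrl <- trainControl(method = "repeatedcv",
                           number = 5,          # 5 folds
                           repeats = 5,         # repeated 5 times
                           classProbs = TRUE,   # keep class probabilities
                           savePredictions = "final",
                           verboseIter = TRUE)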

The models we will test against each other are the following:

Logistic Regression: this is a widely used, traditional classification algorithm that is based on the linear regression you know from your statistics course and was originally proposed in 1958 by Cox. The primary prediction output is the observation’s estimated probability to belong to a certain class. Based on the value of the probability, the model creates a linear boundary separating the input space into two regions (e.g., more likely yes or more likely no).

Random Forests: basic decision trees are interpretable models that are built in a tree-like fashion: branches are combinations of features and leaves are the class labels of interest (e.g., yes or no). Random forests give us an edge over basic decision trees by combining the power of multiple weak learners to come to a collective prediction. This makes them more robust than simple decision trees because the final prediction is not dominated by a few influential predictors.

Extreme Gradient Boosting (XGB): this modelling technique is another tree-based method, introduced by Chen and Guestrin (2016), which builds on gradient boosting trees, an ensemble machine learning method proposed in 2001 by Friedman for regression and classification purposes. A key characteristic is that the trees learn sequentially — each tree attempts to correct the mistakes of the previous tree until no further enhancement can be achieved. XGB is often described as a faster, more scalable and more memory-efficient version of gradient boosting trees.

GLMnet: this is a very flexible, efficient extension of glm models that is nicely implemented in R. It fits generalized linear models using penalized maximum likelihood estimation and thus reduces overfitting known from common regression models (e.g., basic logistic or linear regression) by using a lasso or elastic net penalty term. It is known for its ability to deal well with small samples, prefer simple over overly complex models and its built-in variable selection.

Naïve Bayes: this model uses the famous Bayes theorem, as it estimates the occurrence probability of an event based on prior knowledge of related features. The classifier first learns the joint probability distribution of its inputs and produces an output (e.g., yes or no) based on the maximum posterior probability given each respective feature combination.

The role of class imbalance — select an appropriate accuracy metric for optimization

Because we have an imbalanced sample (more stayed than left), we won’t assess the models’ performance with accuracy later. As accuracy is the proportion of correctly classified cases out of all cases, it would not be a big deal for an algorithm to give us a high score even if it simply classified ALL cases as the majority class (e.g., no). Accuracy is a more appropriate metric when classes are more equally distributed and similarly important to us. But in this case, we actually care about the positive cases: you could argue that it is more detrimental NOT to correctly identify leavers (sensitivity or true positive rate) than to accidentally predict that an employee will leave when the person actually stays (false positive rate or 1 — specificity). Therefore, I would like to use the F1-score as the accuracy metric for training optimization, as it assigns more importance to correctly classifying positive cases (e.g., employee churn) and makes more sense for heavily imbalanced datasets.

The F1-score is the harmonic mean of precision and recall:
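In symbols, with P for precision and R for recall: F1 = 2 · (P · R) / (P + R).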

Precision is the number of correctly classified positive cases divided by the number of all positive predictions (including the false positives, e.g. employees who were identified as leavers but did not go). It is also called the positive predictive value.

Recall, on the other hand, is the number of true positive cases divided by the number of all samples that should have been identified as positive (e.g., all actual leavers, even if not all of them were correctly identified). It is also known as sensitivity in binary classification use cases. If you did not get it yet, no worries, it is not as intuitive as the simple accuracy metric. I hope that my visualizations help you to wrap your head around it. In sum, the F1-score reflects the algorithm’s ability to detect positive cases correctly.

Because caret does not directly provide the F1 metric as an option for our train function, we will use a DIY snippet found on Stackoverflow. You can find more on accuracy metrics here. By the way, it is not easy to interpret whether or not our achieved F1-score is “good enough” because that heavily depends on the number of truly positive cases in our sample. Therefore, we will later look for the model with the highest F1-score achieved on the same data.
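A sketch in the spirit of that snippet (the function name f1 and its details are my own; caret passes the held-out predictions to the summary function, and lev[1] is the positive class):

f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = lev[1])
  recall <- sensitivity(data$pred, data$obs, positive = lev[1])
  f1_val <- 2 * (precision * recall) / (precision + recall)
  c(F1 = f1_val)
}

# Tell the shared trainControl object to report and optimize this metric
train_ctrl$summaryFunction <- f1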

Final model contest — let our machine learning algorithms compete!

To have a good baseline model to compare the others with, we will create a logistic regression model: based on the visualizations we created earlier and theories from the psychological literature, we would hypothesize that higher job satisfaction is protective against employee attrition. Also, we would assume that the lower the monthly income, the higher the likelihood for employees to leave the company to get better payment. Moreover, we would expect that the effect of monthly income on employee turnover gets amplified with each level of job satisfaction. For all other models, we will throw in all the variables we selected before and make no further theoretical predictions. This way, we can see whether the additional predictors give us a predictive advantage.
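Roughly, the training calls could look like this (a sketch: the method names are caret’s, hyperparameters are left to caret’s default tuning, and the baseline formula reflects the interaction hypothesis above):

set.seed(123)
model_baseline <- train(Attrition ~ MonthlyIncome * JobSatisfaction,
                        data = hr_encoded, method = "glm", family = "binomial",
                        metric = "F1", trControl = train_ctrl)

model_rf     <- train(Attrition ~ ., data = hr_encoded, method = "rf",
                      metric = "F1", trControl = train_ctrl)
model_xgb    <- train(Attrition ~ ., data = hr_encoded, method = "xgbTree",
                      metric = "F1", trControl = train_ctrl)
model_glmnet <- train(Attrition ~ ., data = hr_encoded, method = "glmnet",
                      metric = "F1", trControl = train_ctrl)
model_nb     <- train(Attrition ~ ., data = hr_encoded, method = "naive_bayes",
                      metric = "F1", trControl = train_ctrl)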

But before we dive into the actual model comparison, let’s see if our baseline model can actually explain employee attrition at a rudimentary stage.

model_baseline
summary(model_baseline)

As predicted by our hypothesis, we can see that the estimates of the model suggest that the likelihood of an employee leaving the company decreases slightly with every additional dollar of monthly income and every additional level of job satisfaction. Note that the estimates cannot be interpreted directly because they are scaled as log odds, which fits our logistic regression formula. The interaction term (combination of monthly income and job satisfaction) also became statistically significant (p < .001).

Can the other models with a larger set of predictors do a better job predicting employee turnover in our small sample? Let’s find out. We will first make a list of all our model objects (random forest, glmnet etc.) and name them for future reference. Then we will use the resamples function from caret to plot the models’ performance against each other. It will give us the range of F1 values across all 5 folds, making it possible to select the model with the highest average performance.
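For example (object names follow the training sketch above):

model_list <- list(baseline = model_baseline,
                   random_forest = model_rf,
                   xgboost = model_xgb,
                   glmnet = model_glmnet,
                   naive_bayes = model_nb)

# Collect the resampled F1 values and compare them visually
resamps <- resamples(model_list)
summary(resamps)
bwplot(resamps, metric = "F1")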

… and the winner is: XGBoost!

It appears that XGBoost outperformed all other models when it comes to its ability to correctly classify leavers and showed a pretty robust performance across all folds. Our baseline model showed quite variable performance depending on the folds that were used for testing, making it seem a bit unstable. However, we made a somewhat unfair comparison here by comparing apples with bananas: for our basic model, we used a theoretically plausible formula while we threw all of the candidate variables into the other models. In those cases, we tried to predict employee churn with EVERYTHING. This makes it hard to judge whether the model performance of, let’s say, the random forest algorithm is due to an overly complex formula or due to the kind of model itself. I could even imagine that a simpler model might be beneficial in our case because we do not have enough observations to justify such a large set of predictors in our model. As Yarkoni and Westfall (2017) nicely pointed out, when a small set of predictors is applied to many observations (i.e., a high ratio of sample size to number of predictors, n to p), the likelihood of overfitting is small. If we, however, have a small dataset and many parameters that all make small contributions to an outcome, as in our first modelling round, we are more likely to get large prediction errors and the performance gap between training and test set will be substantial.

Therefore, we will make a fairer comparison by demonstrating what happens if you tell caret to predict employee churn with job satisfaction, monthly income and the combination of these for all of the models:

Random forest now no longer shows the worst performance, but again XGB seems to be our winner. Interestingly, the F1-scores are very similar to those of our complex models, suggesting that a more parsimonious model is preferable to an overly complex one. Let’s see if the confusion matrix can tell us a bit more about the performance of the XGB model that uses our simple model formula compared to our baseline model. As a reminder, this is how such a confusion matrix translates to our problem.
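One way to obtain such tables is to feed the held-out predictions that caret saved (savePredictions = "final") into confusionMatrix; a sketch:

# Cross-validated confusion matrices, with "Yes" (leaving) as the positive class
confusionMatrix(model_baseline$pred$pred, model_baseline$pred$obs,
                positive = "Yes", mode = "everything")
confusionMatrix(model_xgb$pred$pred, model_xgb$pred$obs,
                positive = "Yes", mode = "everything")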

Accuracy metrics of our baseline model

Accuracy metrics of our XGBoost model

Wow — the direct comparison shows that XGBoost is much better at predicting employee churn than our baseline model! By looking at the raw confusion matrix we can see that XGBoost correctly identified 17 out of 22 leavers whereas the baseline model only identified 3 of them. Our winner’s precision is very good (0.94), which means that it did not mix up true leavers with false leavers (false positives, i.e. employees who did not actually leave the company but stayed). The baseline model, on the other hand, predicted just 4 positive cases of which 3 were correct, resulting in a rather poor precision of 0.75. The gap in the capacity to detect employee churn correctly becomes even more pronounced for the other metrics: because the XGBoost algorithm missed 5 true positive cases and incorrectly flagged them as negative, recall is not exceedingly high but still acceptable (0.77). As a reminder, recall is the number of true positive cases divided by the number of all samples that should have been identified as positive (e.g., all actual leavers, even if not all of them were correctly identified) and is also known as sensitivity. In contrast, the baseline model was not sensitive enough to pick up the true leavers and accidentally flagged the majority of them as remaining employees.

The gap in performance is also captured by the balanced accuracy score on the bottom which represents the balance between specificity and sensitivity of the respective model, suggesting that our baseline model underperformed when it comes to correctly identifying loyal employees as well. Apart from the F1-score, balanced accuracy has been suggested to be a better proxy of the model’s accuracy in imbalanced samples.

All in all, XGBoost seems to give us a predictive edge over a simple generalized linear model even if we keep our predictors constant.

Risk of turnover? Generate a data-driven retention strategy

As a next step, we would like to actually use the model to increase retention in our small company. For this purpose, we will first get the indices of employees that are still active and predict the likelihood for these employees to leave according to our model. Then we will save these probabilities as well as the actual employee data. Lastly, we will find the top 5 employees with the highest risk of leaving the company. In order to give the company a chance to intervene, these are probably the people that should be approached first to find out what they need to be happier and how they would like to develop in the future. This way, we can hopefully address any voluntary turnover. In the end, we will give the managers a full list of employees to talk to, ranked by their risk of leaving — after all, it is good to give employees the chance to share constructive feedback that may enhance the working climate.
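A sketch of this scoring step, reusing the objects assumed above (the column selection in the last line is purely illustrative):

# Predict the probability of leaving for employees who are still with the company
active <- hr_encoded[hr_encoded$Attrition == "No", ]
leave_prob <- predict(model_xgb, newdata = active, type = "prob")[, "Yes"]

risk_list <- data.frame(active, RiskToLeave = leave_prob)
risk_list <- risk_list[order(-risk_list$RiskToLeave), ]

# Top 5 employees at risk, shown with a few descriptive columns
head(risk_list[, c("MonthlyIncome", "JobSatisfaction", "RiskToLeave")], 5)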

Learn from my mistakes — What I have not told you yet…

  • Do not use tibbles for modelling

Before I discovered the beauty of the data.table syntax, I worked with tibbles because I think the %>% operator is such an intuitive tool. Unfortunately, caret only takes in dataframes (or data.tables), but not tibbles, and it took me a long time to work out why my code would not run through, until I discovered this comment on GitHub and changed the whole data wrangling part to a DT-like structure. Make sure to transform tibbles into classic dataframes before modelling with caret, or do not use dplyr syntax in the first place.
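The fix itself is a one-liner (hr_tibble is a placeholder name for your tibble):

# Convert a tibble to a plain data.frame before handing it to caret
hr_df <- as.data.frame(hr_tibble)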

  • Appropriate accuracy metrics make such a huge difference

Moreover, I first used the ROC AUC score as the metric for optimization before I discovered that you can also make it work with other metrics (e.g., the F1-score). This resulted in a horrible model performance, with sensitivity scores under 0.1 — the algorithms did not really detect positive cases (e.g., actual employee churn), which might be due to our heavily imbalanced dataset. As predicted above, the model summary still showed acceptable accuracy scores because it was not a big deal for the models to identify those who stay, as the majority of employees did not churn anyway. Using a more appropriate metric for optimization was a real game-changer and massively improved model performance.

  • Our results are highly sample-specific

Before I set a seed for replication purposes when drawing random samples from our original IBM HR analytics dataset, I realized that results would differ widely from sample to sample. This was especially true for the psychological variables like environment or relationship satisfaction. Even if this may sound a bit obvious to you, think about what this means for cases in which you do not have a larger dataset available from which to draw a subset: your interpretation of the data-generating processes (e.g., which factors would explain employee turnover) would be heavily biased, but you would not have any chance to validate it yet due to a lack of data. This should make us careful about drawing preliminary conclusions about general patterns before we have the chance to test them on a larger population. Also, please keep in mind that the dataset is fictitious, as it does not contain real HR employee data, so it should not be used as the sole source to answer truly scientific questions.

Time to think deep — can we predict employee turnover in small samples?

Photo by Craig Ren on Unsplash

From a technical perspective, I would argue that we can indeed make use of machine learning to predict employee turnover, even in small samples, if you have a complete, high-quality dataset. Also, we have seen that when it comes to the number of predictors, a simpler, more parsimonious model can perform at least as well as a complex model. Nevertheless, the type of machine learning model can make a huge difference to the accuracy of our prediction. From a broader perspective, we need to be aware that it seems inevitable that we capture some degree of sample-specific error that does not generalize to new employees. Therefore, it would not be recommended to let the algorithm alone decide over people’s jobs and lives — as human beings, be it data scientists, managers or HR experts, we need to take responsibility for the decisions made under our supervision.

There are several ways to increase the transparency of our data-driven approach towards external parties: if we did not have strong hypotheses before evaluating the data, it may be helpful to explicitly state that our work is exploratory in nature because our explanations are somewhat post-hoc and therefore could be biased. This limitation also applies to our approach, since we inspected the potential psychological reasons driving employee turnover visually before specifying our model formula. At the end of the day, it might be most fruitful to make use of both worlds: investigating the roots and causes behind the data, widening our view towards patterns within the data that we had not hypothesized in the first place, and always testing a model’s ability to predict out-of-sample behaviour to minimize overfitting. Thus, a mix of some theoretical flexibility and mindful interpretation of the data could enable us to make good and reasonable predictions.

When you really think about deploying HR analytics tools in your organization, the end-users of the model need to be educated about its benefits as well as its limitations: even if we understand the common roots that increase the likelihood of turnover, people as well as organizations are still pretty unique — and so are the reasons to leave. Consequently, the mixture of personal and practical reasons that drives frequent turnover may vary from company to company and even change over time. Thus, when a predictive model is applied, it must be constantly re-evaluated before it becomes outdated and makes profound mistakes. I am a big fan of data-driven technologies, and if we use them thoughtfully, machine learning techniques can be a powerful tool to improve the workplace for good. But if we use them mechanically, they can easily become the dangerous black box that AI critics fear and which our society should not aim for. So, let’s stay mindful data enthusiasts.

References

[1] A. Hammermann & C. Thiele, People Analytics: Evidenzbasiert Entscheidungsfindung im Personalmanagement (2019), (№35/2019), IW-Report.

[2] R. S. Sexton, S. McMurtrey, J. O. Michalopoulos & A. M., Employee turnover: a neural network solution (2005), Computers & Operations Research, 32(10), 2635–2651.

[3] J. P. Isson & J. S. Harriott, People analytics in the era of big data: Changing the way you attract, acquire, develop, and retain talent (2016), John Wiley & Sons.

[4] L. Bock, Work Rules!: Wie Google die Art und Weise, wie wir leben und arbeiten, verändert. (2016), Vahlen.

[5] Y. Zhao, M. K. Hryniewicki, F. Cheng, B. Fu & X. Zhu, Employee turnover prediction with machine learning: A reliable approach (2018), In Proceedings of SAI intelligent systems conference (pp. 737–758). Springer, Cham.

[6] T. Yarkoni & J. Westfall, Choosing prediction over explanation in psychology: Lessons from machine learning (2017), Perspectives on Psychological Science, 12(6), 1100–1122.

[7] A. A. Mahmoud, T. A. Shawabkeh, W. A. Salameh & I. Al Amro, Performance predicting in hiring process and performance appraisals using machine learning (2019), In 2019 10th International Conference on Information and Communication Systems (ICICS) (pp. 110–115). IEEE.

[8] D. H. Kluemper, P. A. Rosen & K. W. Mossholder, Social networking websites, personality ratings, and the organizational context: More than meets the eye? (2012),1. Journal of Applied Social Psychology, 42(5), 1143–1172.

[9] A. Apatean, E. Szakacs & M. Tilca, Machine-learning based application for staff recruiting (2017), Acta Technica Napocensis, 58(4), 16–21.

[10] Y. M. Li, C. Y. Lai, & C. P. Kao, Incorporate personality trait with support vector machine to acquire quality matching of personnel recruitment (2008), In 4th international conference on business and information (pp. 1–11).

[11] T. Chamorro-Premuzic, D. Winsborough, R. A. Sherman & R. Hogan, New talent signals: shiny new objects or a brave new world (2013), Ind. Organ. Psychol. Perspect. Sci. Pract., 53:1689–1699.

[12] E. Faliagka, K. Ramantas, A. Tsakalidis & G. Tzimas, Application of machine learning algorithms to an online recruitment system (2012), In Proc. International Conference on Internet and Web Applications and Services (pp. 215–220).

[13] M. Loi, People Analytics must benefit the people. An ethical analysis of data-driven algorithmic systems in human resources management (2020), Algorithmwatch.

[14] N. B. Yahia, J. Hlel & R. Colomo-Palacios, From Big Data to Deep Data to Support People Analytics for Employee Attrition Prediction (2021), IEEE Access, 9, 60447–60458.

[15] G. Harden, K. G. Boakye & S. Ryan, Turnover intention of technology professionals: A social exchange theory perspective (2018), Journal of Computer Information Systems, 58(4), 291–300.

[16] J. Sarkar, Linking Compensation and Turnover: Retrospection and Future Directions (2018), IUP Journal of Organizational Behavior, 17(1).

[17] P.C. Bryant & D. G. Allen, Compensation, benefits and employee turnover: HR strategies for retaining top talent (2013), Compensation & Benefits Review, 45(3), 171–175.

[18] R. D. Zimmerman, B. W. Swider & W. R. Boswell, Synthesizing content models of employee turnover (2019), Human Resource Management, 58(1), 99–114.

[19] A. B. Bakker & E. Demerouti, The job demands‐resources model: State of the art (2007), Journal of managerial psychology.

[20] T. A, Birtch, F. F. Chiang & E. Van Esch, A social exchange theory framework for understanding the job characteristics–job outcomes relationship: The mediating role of psychological contract fulfillment. (2016), The international journal of human resource management, 27(11), 1217–1236.

[21] M. Kuhn, J. Wing, S. Weston, A. Williams, C. Keefer, A. Engelhardt., … & M. Benesty, Package ‘caret’ (2020), The R Journal, 223.

[22] B. L. Das & M. Baruah, Employee retention: A review of literature (2013), Journal of business and management, 14(2), 8–16.

[23] T. Chen & C. Guestrin, Xgboost: A scalable tree boosting system (2016), In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 785–794).

