Automating Work Allocation — Part 1

5 min read · Feb 23, 2020

A Machine Learning approach to Task Allocation.

Being an Engineering Manager is awesome. You don’t have to slog like a developer at their wits’ end, trying to tune that last bit of code to perfection. Nor are you under the stress that senior management above you faces, answering to the Business that pays them in hard dollars.

However, this role comes with its fair share of chores. It’s not only the Process Documentation, Audit Support or typing up Release Notes, but also the en masse Task Allocation that takes some sheen off this otherwise great job.

The Situation

Imagine having to allocate 100+ items to your team before Monday, when the intake system allows requestors to add items to the work queue all the way into Friday night.

For the visually inclined, here is a snippet of the xls that needs to be uploaded for a JIRA/Developer combo, one line for each of the 100+ items (there’s a reason fellow Leads call it a ‘crap-job’!):

The Solution

The old-school way would be to spend a day of the weekend on it (wallowing in self-pity while patting yourself on the back for your commitment to your employer). The new-school way? Automate, using Machine Learning! (And because there is a developer in you who is not fully dead yet.)

Without further ado, let’s dive in and get our hands dirty!

Get the Data

Above is a snippet from our work intake system, a custom-built UI on top of Atlassian JIRA, so I am able to export the data as follows, directly from the JIRA system:

Sample Export Functionality. Your Intake system may or may not look similar

Alternatively, you may fetch data using the JIRA API, if you have the required access.
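For the API route, here is a minimal sketch, assuming a Jira instance that exposes the standard REST search endpoint (/rest/api/2/search) and accepts basic authentication. The function names, the base URL, the JQL string and the chosen fields are all placeholders for illustration; custom fields such as POD or SOR would have instance-specific ids that are not shown here.

```python
import base64
import json
import urllib.parse
import urllib.request

import pandas as pd


def fetch_issues(base_url, jql, user, token, max_results=200):
    """Call the Jira search endpoint and return the decoded JSON payload.

    base_url, jql, user and token are placeholders supplied by the caller.
    """
    query = urllib.parse.urlencode({
        "jql": jql,
        "maxResults": max_results,
        "fields": "summary,description,issuetype",
    })
    request = urllib.request.Request(f"{base_url}/rest/api/2/search?{query}")
    credentials = base64.b64encode(f"{user}:{token}".encode()).decode()
    request.add_header("Authorization", f"Basic {credentials}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def issues_to_frame(payload):
    """Flatten the 'issues' list of a search response into a DataFrame."""
    rows = []
    for issue in payload.get("issues", []):
        fields = issue.get("fields", {})
        rows.append({
            "Issue_key": issue.get("key"),
            "Summary": fields.get("summary"),
            "Description": fields.get("description"),
            "Issue_Type": (fields.get("issuetype") or {}).get("name"),
        })
    return pd.DataFrame(rows)


# Canned payload in the shape the search endpoint returns (values are made up)
sample_payload = {
    "issues": [
        {"key": "ABC-1", "fields": {"summary": "Add column",
                                    "description": "Map new column",
                                    "issuetype": {"name": "Enhancement"}}},
        {"key": "ABC-2", "fields": {"summary": "Fix join",
                                    "description": None,
                                    "issuetype": {"name": "Defect"}}},
    ]
}
jira_frame = issues_to_frame(sample_payload)
print(jira_frame)
```

Against a real instance you would call issues_to_frame(fetch_issues(...)); the server may cap maxResults, so a large queue would likely need paging with the startAt parameter.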

Load the data

Using the friendly pandas

import pandas as pd
jira_data_all = pd.read_csv("Data/102-108/JIRA Export All fields.csv", low_memory=False)

Run Some Analysis

jira_data.columns

Index(['Summary', 'Issue_key', 'Multiple_Devs', 'Issue_Type', 'POD', 'SOR', 'Iteration', 'Description'],
      dtype='object')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74 entries, 0 to 73
Data columns (total 2 columns):
Dev_uid    74 non-null object
counts     74 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.2+ KB
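The second block of output above describes a two-column table of developer ids and counts, but the step that builds it is not shown. One way such a table could be produced is sketched below on toy data, under the assumption that the developer id lives in the Multiple_Devs column; the toy values are invented.

```python
import pandas as pd

# Toy stand-in for the export: one row per JIRA item, developer id per row
toy_jira = pd.DataFrame({
    "Issue_key": ["ABC-1", "ABC-2", "ABC-3", "ABC-4", "ABC-5", "ABC-6"],
    "Multiple_Devs": ["u1", "u2", "u1", "u3", "u1", "u2"],
})

# Count items per developer and turn the result into a two-column frame
toy_counts_by_uid = (
    toy_jira["Multiple_Devs"]
    .value_counts()
    .rename_axis("Dev_uid")
    .reset_index(name="counts")
)
print(toy_counts_by_uid)
toy_counts_by_uid.info()
```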

Outliers

Ah, what would we do without our extraordinary performers!

import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of items per developer; outliers show up as points beyond the whiskers
sns.boxplot(data=jira_counts_by_uid, x=jira_counts_by_uid['counts'])
plt.show()

Actually, we get rid of them (not ‘them’ of course, but the items assigned to them). In our case these are individuals/resources who don’t necessarily work on these items but are the POCs for the effort.

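import numpy as np
from scipy import stats

# Flag developers whose item count is more than 3 standard deviations from the mean
z_score = np.abs(stats.zscore(jira_counts_by_uid['counts']))
outliers = np.where(z_score > 3)
outlier_devs = []
for t in outliers:
    outlier_devs.append(jira_counts_by_uid['Dev_uid'][t])
# Look up the first names of the outlier developers
uid_name[uid_name["LOGIN_ID"].isin(outlier_devs[0])]["FIRST_NM"]
# Drop the items assigned to them
jira_data = jira_data[~jira_data["Multiple_Devs"].isin(outlier_devs[0])]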
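To see the z-score rule at work, here is a small worked example on invented counts: nineteen developers with 4 to 6 items each and one point of contact with 100. It uses the same scipy.stats.zscore call, just with a boolean mask instead of the loop.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented data: 19 ordinary developers and one point of contact with 100 items
counts = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6, 5, 4, 6, 5, 5, 4, 6, 5, 5, 100]
toy_counts = pd.DataFrame({
    "Dev_uid": [f"u{i}" for i in range(len(counts))],
    "counts": counts,
})

# Absolute z-score of each count (population standard deviation, scipy's default)
z = np.abs(stats.zscore(toy_counts["counts"].to_numpy()))

# Keep the ids whose count is more than 3 standard deviations from the mean
toy_outliers = toy_counts.loc[z > 3, "Dev_uid"].tolist()
print(toy_outliers)  # ['u19']
```

The mean here is 9.75 and the standard deviation is about 20.7, so the count of 100 sits roughly 4.4 standard deviations out while everyone else stays below 0.3.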

Get Training Data Ready for the Model

1 — Scrub Scrub

Standardize the data in your features

# Format data
jiras_non_sourcing['Name'] = jiras_non_sourcing['Name'].str.title()
jiras_non_sourcing['SOR'] = jiras_non_sourcing['SOR'].str.upper()
jiras_non_sourcing['Summary'] = jiras_non_sourcing['Summary'].str.lower()
jiras_non_sourcing['Description'] = jiras_non_sourcing['Description'].str.lower()

# Drop rows for developers with a low number of allocations (7 or fewer)
jira_filtered = jira_filtered.groupby("Name").filter(lambda x: len(x) > 7)

jira_filtered.head()

2 — Encode

Encode the non-numeric features. In this case I am using one-hot encoding.

from sklearn.preprocessing import LabelBinarizer

encoder1 = LabelBinarizer()
SOR_1hot = encoder1.fit_transform(jira_filtered['SOR'])
print(encoder1.classes_)
encoder2 = LabelBinarizer()
POD_1hot = encoder2.fit_transform(jira_filtered['POD'])
print(encoder2.classes_)
encoder3 = LabelBinarizer()
IssueType_1hot = encoder3.fit_transform(jira_filtered['Issue_Type'])
print(encoder3.classes_)

['AFS' 'AIMS' 'BROADRIDGE' 'CCDMS' 'CDD TABLES' 'EXIMBILLS HK'
 'EXIMBILLS HK INSOURCING' 'GBF' 'IIS' 'INFOLEASE' 'LEASECONNECT (IKNX)'
 'LIQ' 'LUCAS' 'RAILS' 'SHAW/WFDS' 'STRATEGY LOAN SERVICES'
 'SYNTHETIC LEASE' 'TRIP' 'WFA FIRST CLEARING' 'YARDI NMTC']
['Commercial Cap / Leasing' 'Commercial Lending' 'Financial Crimes'
 'Securities/Fees/Common']
['Defect' 'Enhancement' 'Sub-Task - Bug' 'Sub-task' 'Tech Task']
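As a quick illustration of what LabelBinarizer returns, here is a toy run on four invented SOR values. Each row gets a 1 in the column of its class, and the columns follow the sorted order of classes_.

```python
from sklearn.preprocessing import LabelBinarizer

toy_sor = ["AFS", "LIQ", "AFS", "GBF"]

toy_encoder = LabelBinarizer()
toy_1hot = toy_encoder.fit_transform(toy_sor)

print(toy_encoder.classes_)  # ['AFS' 'GBF' 'LIQ']
print(toy_1hot)
# [[1 0 0]
#  [0 0 1]
#  [1 0 0]
#  [0 1 0]]

# The encoding is reversible, which helps when inspecting features later
decoded = toy_encoder.inverse_transform(toy_1hot)
```

One thing to watch for: when a feature has only two distinct values, LabelBinarizer returns a single 0/1 column rather than two columns.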

3 — Scale

Notice the Iteration values (an iteration is nothing but a numeric identifier for a two-week Scrum sprint). They are in the 100s, on a scale vastly different from that of the other features.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
jira_filtered["Scaled_Iteration"] = scaler.fit_transform(jira_filtered["Iteration"].values.reshape(-1, 1))
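A small worked example of what the scaler does, using three invented iteration numbers. MinMaxScaler maps each value x to (x - min) / (max - min), so the smallest iteration becomes 0 and the largest becomes 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Three invented iteration ids, shaped as a single-feature column
toy_iterations = np.array([102.0, 105.0, 108.0]).reshape(-1, 1)

toy_scaled = MinMaxScaler().fit_transform(toy_iterations)
print(toy_scaled.ravel())  # [0.  0.5 1. ]

# The same result computed by hand from the formula
manual = (toy_iterations - toy_iterations.min()) / (toy_iterations.max() - toy_iterations.min())
```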

4 — Add a hint of NLP (just because)

The Description of the JIRAs contains a lot of stop words, which would cause the Classifier to associate these stop words with the resources. So we extend the standard English stop-word list with our own filler words.

from sklearn.feature_extraction import text
my_stop_words = text.ENGLISH_STOP_WORDS.union([
    "please", "missed", "olmd", "column", "columns", "new", "changes", "mapping", "refer", "color",
    "issue", "logic", "join", "condition", "following", "update", "processing", "following",  # ………………………….. (elided in the original)
    "needs", "removed", "replaced", "file", "sor s", "direct", "match", "types", "included", "need", "trim", "forward"])
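The post does not show how the stop words are handed to the vectorizer, so here is a minimal sketch, assuming scikit-learn's TfidfVectorizer is the tfidf object used in the next step. The toy word list and sentences are invented; any other vectorizer settings the author used are unknown.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# A short custom list on top of scikit-learn's built-in English stop words
toy_stop_words = text.ENGLISH_STOP_WORDS.union(["please", "update", "column"])

# stop_words is passed as a list, since newer releases reject a frozenset here
toy_tfidf = TfidfVectorizer(stop_words=list(toy_stop_words))

toy_docs = [
    "please update the mapping for the new column",
    "the join condition drops rows from the ledger",
]
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Terms that survived stop-word removal
kept_terms = sorted(toy_tfidf.vocabulary_)
print(kept_terms)
```

Words such as "please" and "the" never reach the vocabulary, so the classifier cannot latch onto them.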

5 — Load up the Features

This is the final stage where you load up all of the transformed features.

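import numpy as np

# tfidf is presumably a TfidfVectorizer built with my_stop_words (its definition is not shown in this post)
features = tfidf.fit_transform(jira_filtered.Summary + jira_filtered.Description).toarray()

# Append the one-hot encoded columns to the text features
features = np.append(features, SOR_1hot, axis=1)
features = np.append(features, POD_1hot, axis=1)
features = np.append(features, IssueType_1hot, axis=1)
features.shape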

(121, 52)

Select a Model

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

model = LogisticRegression(random_state=0)

Train the Model

X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, jira_filtered.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)

Check the Confusion Matrix
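The matrix itself is not reproduced in this text, so here is a toy sketch of how one could be computed with scikit-learn. The labels and predictions below are invented; in the post's setting the inputs would presumably be y_test and model.predict(X_test).

```python
from sklearn.metrics import confusion_matrix

# Invented true and predicted developers for five items
toy_true = ["A", "A", "B", "B", "C"]
toy_pred = ["A", "B", "B", "B", "C"]
toy_labels = ["A", "B", "C"]

# Rows are the true developer, columns the predicted developer
toy_cm = confusion_matrix(toy_true, toy_pred, labels=toy_labels)
print(toy_cm)
# [[1 1 0]
#  [0 2 0]
#  [0 0 1]]
```

The off-diagonal 1 in the first row is the single item that belonged to A but was predicted for B.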

Evaluate the Scores

                precision    recall  f1-score   support

   Prasanna       0.83      1.00      0.91         5
   Michelle       0.67      0.67      0.67         3
     Chirag       0.50      0.33      0.40         3
     Vamshi       0.33      1.00      0.50         2
   Ravindra       1.00      1.00      1.00         5
     Aniket       1.00      0.83      0.91         6
   Sudhakar       1.00      0.67      0.80         3
     Andrew       0.00      0.00      0.00         0
       Brad       1.00      0.33      0.50         3
     Satish       1.00      0.20      0.33         5
    Vamshee       1.00      1.00      1.00         4
     Hafiza       0.33      1.00      0.50         1

avg / total       0.87      0.72      0.73        40

87% weighted precision: pretty neat, eh! (Weighted recall, which is the same as overall accuracy here, is 72%.)

OK, I agree, it’s not in the high nineties, but we are not trying to detect a morbid condition either.
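To make the averaged scores concrete, here is a toy calculation of support-weighted precision and recall on invented labels, using scikit-learn's metric functions. It also shows why the two can differ: weighted recall always equals plain accuracy for single-label problems, while weighted precision does not.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented true and predicted developers for five items
toy_true = ["A", "A", "B", "B", "C"]
toy_pred = ["A", "B", "B", "B", "C"]

# Per-class precision is A: 1/1, B: 2/3, C: 1/1, weighted by support (2, 2, 1)
weighted_precision = precision_score(toy_true, toy_pred, average="weighted", zero_division=0)

# Per-class recall is A: 1/2, B: 2/2, C: 1/1, weighted by the same support
weighted_recall = recall_score(toy_true, toy_pred, average="weighted", zero_division=0)

accuracy = accuracy_score(toy_true, toy_pred)

print(round(weighted_precision, 4), round(weighted_recall, 4), round(accuracy, 4))
# 0.8667 0.8 0.8
```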

Procure the data to Predict for

jira_data_next_iteration = pd.read_csv("Data/109/JIRA Export All Fields.csv")

Run the Prediction

y_pred_next_iteration = model.predict(features_pred)
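How features_pred gets built is not shown. The key point is that the new data has to pass through the vectorizer and encoders that were already fitted on the training data, using transform rather than fit_transform, so that the columns line up. A self-contained toy sketch of that idea follows; all data and variable names in it are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer

# Invented training data: item text, source system, and assigned developer
train_text = [
    "fix join condition in ledger load",
    "add lease report for auditors",
    "ledger load fails on join",
    "lease report layout change",
]
train_sor = ["AFS", "LIQ", "GBF", "LIQ"]
train_dev = ["Dev A", "Dev B", "Dev A", "Dev B"]

# Fit the transformers on the training data only
toy_tfidf = TfidfVectorizer()
toy_sor_encoder = LabelBinarizer()
toy_features = np.hstack([
    toy_tfidf.fit_transform(train_text).toarray(),
    toy_sor_encoder.fit_transform(train_sor),
])
toy_model = LogisticRegression(random_state=0).fit(toy_features, train_dev)

# New items: reuse the fitted transformers with transform(), not fit_transform()
new_text = ["join fails in ledger"]
new_sor = ["AFS"]
toy_features_pred = np.hstack([
    toy_tfidf.transform(new_text).toarray(),
    toy_sor_encoder.transform(new_sor),
])
toy_prediction = toy_model.predict(toy_features_pred)
print(toy_prediction)
```

Refitting on the new data would produce a different vocabulary and a different number of columns, and the model would reject or misread the input.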

Publish the Results!

jira_predictions.head()
# jira_predictions.to_excel("jira_predictions.xlsx", sheet_name='jira_predicted_dev', index=False)

The last column, Predicted_Developer, is our label, i.e. what we are solving for.
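The step that assembles jira_predictions is not shown, so here is one possible sketch on invented data: attach the predictions to the new items as a trailing Predicted_Developer column and serialize the result. The sketch writes CSV text to stay dependency-free; to_excel, as in the commented line above, additionally needs an Excel writer such as openpyxl installed.

```python
import pandas as pd

# Invented next-iteration items and the predictions made for them
toy_next_iteration = pd.DataFrame({
    "Issue_key": ["ABC-10", "ABC-11"],
    "Summary": ["ledger load fails on join", "lease report layout change"],
})
toy_y_pred = ["Dev A", "Dev B"]

# assign() appends the predictions as the last column
toy_predictions = toy_next_iteration.assign(Predicted_Developer=toy_y_pred)

# Serialize without the index, ready for upload or review
csv_text = toy_predictions.to_csv(index=False)
print(csv_text)
```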

Conclusion

In this post we have seen how to build a multi-class classifier that predicts the resource best aligned to each work requirement.

GIT Notebook Link

Continue to Part 2 ….