A Machine Learning approach to Task Allocation.
Being an Engineering Manager is pretty awesome. You don't have to slog like a developer at their wits' end, trying to tune that last bit of code to perfection. Nor are you under the stress that senior management above you faces, answering to the Business that pays them in hard dollars.
However, this role comes with its fair share of chores. It’s not only the Process Documentation, Audit Support or typing Release Notes, but also the en masse Task Allocation that takes some sheen off of this otherwise great job.
Imagine having to allocate 100+ items to your team before Monday, when the intake system allows requestors to add items to the work queue all the way into Friday night.
For the visually inclined, the following is a snippet of the xls that needs to be uploaded for a JIRA/Developer combo, with one line for each of the 100+ items (there's a reason fellow Leads call it a 'crap-job'!):
An old-school way would be to spend a day of the weekend on it (wallowing in self-pity while patting yourself on the back for your commitment to your employer). The new-school way? Automate, using Machine Learning! (After all, there is a developer in you who is not fully dead yet.)
Without further ado, let's dive in and get our hands dirty!
Get the Data
Above is a snippet from our work intake system, a custom-built UI on top of Atlassian JIRA, so I am able to fetch the data as follows, directly from the JIRA system:
Alternatively, you may fetch data using the JIRA API, if you have the required access.
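If you go the API route, here is a rough sketch of what that could look like. It assumes a Jira Server/Data Center style `rest/api/2/search` endpoint and basic-auth style credentials passed in as a ready-made header; the function names, the chosen fields and the example host are all hypothetical, and your instance (especially Jira Cloud) may expose a different search endpoint.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(base_url, jql, start_at=0, max_results=50):
    """Build a search URL for the Jira REST API (v2 `search` resource)."""
    query = urllib.parse.urlencode({
        "jql": jql,
        "startAt": start_at,
        "maxResults": max_results,
        "fields": "summary,description,issuetype,assignee",
    })
    return f"{base_url.rstrip('/')}/rest/api/2/search?{query}"


def flatten_issue(issue):
    """Turn one issue from the search response into a flat dict (one CSV row)."""
    fields = issue.get("fields", {})
    assignee = fields.get("assignee") or {}      # unassigned issues carry null
    issue_type = fields.get("issuetype") or {}
    return {
        "Issue_key": issue.get("key"),
        "Summary": fields.get("summary"),
        "Description": fields.get("description"),
        "Issue_Type": issue_type.get("name"),
        "Assignee": assignee.get("displayName"),
    }


def fetch_issues(base_url, jql, auth_header):
    """Page through the search results. Needs network access and valid credentials."""
    rows, start_at = [], 0
    while True:
        request = urllib.request.Request(
            build_search_url(base_url, jql, start_at),
            headers={"Authorization": auth_header, "Accept": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        issues = payload.get("issues", [])
        rows.extend(flatten_issue(issue) for issue in issues)
        start_at += len(issues)
        # Stop when a page comes back empty or everything has been read
        if not issues or start_at >= payload.get("total", 0):
            return rows
```

The list of dicts returned by `fetch_issues` can be handed straight to `pd.DataFrame(rows)`, which gets you to the same place as the CSV export.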
Load the data
Using the friendly pandas
import pandas as pd
jira_data_all = pd.read_csv("Data/102–108/JIRA Export All fields.csv", low_memory=False)
Run Some Analysis
Index(['Summary', 'Issue_key', 'Multiple_Devs', 'Issue_Type', 'POD', 'SOR','Iteration', 'Description'],
RangeIndex: 74 entries, 0 to 73
Data columns (total 2 columns):
Dev_uid 74 non-null object
counts 74 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.2+ KB
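The summary above describes a two-column frame of developer ids and item counts. The code that built it is not shown in this post, so here is one plausible way to get there, assuming the export carries a `Dev_uid` column. The sample rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; the real export has one row per JIRA item.
jira_sample = pd.DataFrame({
    "Issue_key": ["A-1", "A-2", "A-3", "A-4", "A-5"],
    "Dev_uid": ["u100", "u100", "u200", "u100", "u300"],
})

# Count items per developer and keep the result as a regular two-column frame
jira_counts_by_uid = (
    jira_sample.groupby("Dev_uid")
    .size()
    .reset_index(name="counts")
    .sort_values("counts", ascending=False)
)

jira_counts_by_uid.info()
```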
Ah, what would we do without our extraordinary performers!
import seaborn as sns
import matplotlib.pyplot as plt
# Box plot of item counts per developer; isolated points far from the box are the outliers
sns.boxplot(data=jira_counts_by_uid, x='counts')
plt.show()
Actually, we get rid of them (not ‘them’ of course, but the items assigned to them. In our case these are individuals/resources who don’t necessarily work on these items but are POCs for the effort)
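import numpy as np
from scipy import stats
# Absolute z-score of each developer's item count; anything above 3 is treated as an outlier
z_score = np.abs(stats.zscore(jira_counts_by_uid['counts']))
outliers = np.where(z_score > 3)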
for t in outliers:
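The body of that loop is not shown here. As a stand-in, below is a minimal sketch of the same idea using a boolean mask instead of a loop. The counts are invented, and the z-score is computed by hand with the population standard deviation, which is what `scipy.stats.zscore` uses by default.

```python
import numpy as np
import pandas as pd

# Invented counts: fifteen ordinary developers and one point of contact with 80 items
counts = [5, 6, 7, 5, 6, 7, 6, 5, 7, 6, 6, 5, 7, 6, 5, 80]
jira_counts_by_uid = pd.DataFrame({
    "Dev_uid": [f"u{i}" for i in range(len(counts))],
    "counts": counts,
})

values = jira_counts_by_uid["counts"].to_numpy(dtype=float)
# Population standard deviation (ddof=0), same default as scipy.stats.zscore
z_score = np.abs((values - values.mean()) / values.std())

# Split developers into outliers and everyone else
outlier_uids = jira_counts_by_uid.loc[z_score > 3, "Dev_uid"]
counts_without_outliers = jira_counts_by_uid.loc[z_score <= 3]
```

With the outlier ids in hand, the matching rows can be dropped from the item-level data with something like `jira_data_all[~jira_data_all["Dev_uid"].isin(outlier_uids)]`, again assuming that column name.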
Get Training Data Ready for the Model
1 — Scrub Scrub
Standardize the data in your features
# Format data
jiras_non_sourcing['Name'] = jiras_non_sourcing['Name'].str.title()
jiras_non_sourcing['SOR'] = jiras_non_sourcing['SOR'].str.upper()
jiras_non_sourcing['Summary'] = jiras_non_sourcing['Summary'].str.lower()
jiras_non_sourcing['Description'] = jiras_non_sourcing['Description'].str.lower()
# Drop rows for developers with low number of allocations
jira_filtered = jira_filtered.groupby("Name").filter(lambda x: len(x) > 7)
2 — Encode
Encode the non-numeric features. In this case I am using one-hot encoding.
from sklearn.preprocessing import LabelBinarizer
# One encoder per categorical column, so each can be reused later on new data
encoder1 = LabelBinarizer()
SOR_1hot = encoder1.fit_transform(jira_filtered['SOR'])
encoder2 = LabelBinarizer()
POD_1hot = encoder2.fit_transform(jira_filtered['POD'])
encoder3 = LabelBinarizer()
IssueType_1hot = encoder3.fit_transform(jira_filtered['Issue_Type'])
['AFS' 'AIMS' 'BROADRIDGE' 'CCDMS' 'CDD TABLES' 'EXIMBILLS HK'
'EXIMBILLS HK INSOURCING' 'GBF' 'IIS' 'INFOLEASE' 'LEASECONNECT (IKNX)'
'LIQ' 'LUCAS' 'RAILS' 'SHAW/WFDS' 'STRATEGY LOAN SERVICES'
'SYNTHETIC LEASE' 'TRIP' 'WFA FIRST CLEARING' 'YARDI NMTC']
['Commercial Cap / Leasing' 'Commercial Lending' 'Financial Crimes'
['Defect' 'Enhancement' 'Sub-Task - Bug' 'Sub-task' 'Tech Task']
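The lists above look like the `classes_` attribute of each fitted encoder. To make the mapping concrete, here is a tiny sketch on three of the SOR values, showing how each class becomes one column. The sample list is made up.

```python
from sklearn.preprocessing import LabelBinarizer

# A handful of SOR values taken from the list above, in made-up order
sor_values = ["AFS", "LIQ", "GBF", "AFS"]

encoder = LabelBinarizer()
sor_1hot = encoder.fit_transform(sor_values)

print(encoder.classes_)  # classes are stored in sorted order, one column each
print(sor_1hot)          # one row per item, a single 1 in the column of its class

# The encoding is reversible, which is handy for debugging
decoded = encoder.inverse_transform(sor_1hot)
```

One thing to watch for: when a column has exactly two distinct values, `LabelBinarizer` returns a single 0/1 column rather than two columns.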
3 — Scale
Notice the Iteration values (an iteration is nothing but a numeric identifier for a two-week Scrum sprint). They are in the 100s, on a scale vastly different from that of the other features.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
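The call that applies the scaler is not shown, so here is a minimal sketch. It assumes the training iterations run from 102 to 108 (a guess based on the export folder name) and shows what happens to the next iteration, 109, when it is pushed through the same fitted scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumed training iterations; MinMaxScaler expects a 2D array (rows x features)
iterations = np.array([102, 103, 104, 105, 106, 107, 108], dtype=float).reshape(-1, 1)

scaler = MinMaxScaler()
iteration_scaled = scaler.fit_transform(iterations)  # 102 -> 0.0, 108 -> 1.0

# Reuse the fitted scaler for new data; do not refit on it
next_scaled = scaler.transform(np.array([[109.0]]))
print(iteration_scaled.ravel(), next_scaled.ravel())
```

Because 109 lies beyond the training range, it lands slightly above 1. That is expected with min-max scaling and worth keeping in mind if Iteration ends up as a model feature.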
4 — Add a hint of NLP (just because)
The Description of the JIRAs contains a lot of stop words, which could cause the classifier to associate these words with particular resources, so we filter them out.
from sklearn.feature_extraction import text
my_stop_words = text.ENGLISH_STOP_WORDS.union(["please", "missed", "olmd", "column", "columns", "new", "changes", "mapping", "refer", "color",
    "issue", "logic", "join", "condition", "following", "update", "processing", "following", ………………………….. "needs", "removed", "replaced", "file", "sor s", "direct", "match", "types", "included", "need", "trim", "forward"])
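The `tfidf` object used in the next step is not defined anywhere in this post. A reasonable guess is a `TfidfVectorizer` built with the custom stop-word list, along these lines. The shortened word list and the three sample descriptions are made up.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Shortened custom list for illustration; the real one is much longer
my_stop_words = text.ENGLISH_STOP_WORDS.union(["please", "column", "update"])

# Recent scikit-learn versions expect a list here, not a frozenset
tfidf = TfidfVectorizer(stop_words=list(my_stop_words))

docs = [
    "please update the mapping for the loan column",
    "defect in lease balance calculation",
    "please add new lease type",
]
features = tfidf.fit_transform(docs).toarray()

# Words that survived the stop-word filter, one feature column each
vocabulary = sorted(tfidf.vocabulary_)
print(vocabulary)
```

Printing the vocabulary after fitting is a quick way to spot more filler words that deserve a place in the stop list.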
5 — Load up the Features
This is the final stage where you load up all of the transformed features.
features = tfidf.fit_transform(jira_filtered.Summary + jira_filtered.Description).toarray()
features = np.append(features,SOR_1hot, axis = 1)
features = np.append(features,POD_1hot , axis = 1)
features = np.append(features,IssueType_1hot , axis = 1)
Select a Model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
model = LogisticRegression(random_state=0)
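Four classifiers are imported but only one is used. One common way to choose between them is cross-validation. This is a sketch on synthetic, non-negative data (`MultinomialNB` needs non-negative features), not the comparison that was actually run for this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real feature matrix: 150 items, 12 features, 3 "developers"
rng = np.random.default_rng(0)
X = rng.random((150, 12))
y = X[:, :3].argmax(axis=1)  # the label depends on the first three features

models = {
    "LogisticRegression": LogisticRegression(random_state=0, max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(random_state=0),
}

# Mean accuracy over 5 folds for each candidate
mean_accuracy = {
    name: cross_val_score(candidate, X, y, cv=5).mean()
    for name, candidate in models.items()
}
for name, score in mean_accuracy.items():
    print(f"{name}: {score:.2f}")
```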
Train the Model
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, jira_filtered.index, test_size=0.33, random_state=0)
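The split is shown above but the fit itself is not. Here is a minimal sketch of the remaining steps on synthetic data. The `dev_a`, `dev_b` and `dev_c` labels are placeholders, not real team members.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for `features` and `labels`
rng = np.random.default_rng(0)
features = rng.random((90, 6))
labels = np.array(["dev_a", "dev_b", "dev_c"])[features[:, :3].argmax(axis=1)]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.33, random_state=0
)

model = LogisticRegression(random_state=0, max_iter=1000)
model.fit(X_train, y_train)      # learn from the training portion only
y_pred = model.predict(X_test)   # predicted developer for each held-out item

accuracy = (y_pred == y_test).mean()
print(f"accuracy on held-out items: {accuracy:.2f}")
```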
Check the Confusion Matrix
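For reference, a confusion matrix can be computed with `sklearn.metrics.confusion_matrix`. The example below uses five made-up items and placeholder developer names, purely to show how to read it: rows are the actual developer, columns the predicted one.

```python
from sklearn.metrics import confusion_matrix

# Placeholder names and made-up outcomes, for illustration only
labels = ["dev_a", "dev_b", "dev_c"]
y_test = ["dev_a", "dev_a", "dev_b", "dev_b", "dev_c"]
y_pred = ["dev_a", "dev_a", "dev_b", "dev_a", "dev_b"]

# Fixing `labels` pins the row/column order
conf_mat = confusion_matrix(y_test, y_pred, labels=labels)
print(conf_mat)
```

The diagonal holds the correct allocations. Off-diagonal cells show who gets confused with whom, which is often more actionable than a single score.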
Evaluate the Scores
              precision    recall  f1-score   support

    Prasanna       0.83      1.00      0.91         5
    Michelle       0.67      0.67      0.67         3
      Chirag       0.50      0.33      0.40         3
      Vamshi       0.33      1.00      0.50         2
    Ravindra       1.00      1.00      1.00         5
      Aniket       1.00      0.83      0.91         6
    Sudhakar       1.00      0.67      0.80         3
      Andrew       0.00      0.00      0.00         0
        Brad       1.00      0.33      0.50         3
      Satish       1.00      0.20      0.33         5
     Vamshee       1.00      1.00      1.00         4
      Hafiza       0.33      1.00      0.50         1

 avg / total       0.87      0.72      0.73        40
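A table in this layout is what `sklearn.metrics.classification_report` prints. The toy example below uses made-up labels and also shows how a row with a support of 0 can come about: the label was predicted at least once but never appears among the true labels of the test set. That may be what happened with the Andrew row, though I am only inferring that from the numbers.

```python
from sklearn.metrics import classification_report

# Made-up labels: "dev_c" is predicted once but never occurs in y_test
y_test = ["dev_a", "dev_a", "dev_b"]
y_pred = ["dev_a", "dev_c", "dev_b"]

# zero_division=0 reports undefined ratios as 0 instead of raising warnings
print(classification_report(y_test, y_pred, zero_division=0))

# The same numbers as a dict, convenient for programmatic checks
report = classification_report(y_test, y_pred, zero_division=0, output_dict=True)
```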
87% average precision (with recall averaging 72%). Pretty neat, eh!
OK, I agree, it's not in the high nineties, but we are not trying to detect a morbid condition either.
Procure the data to Predict for
jira_data_next_iteration = pd.read_csv("Data/109/JIRA Export All Fields.csv")
Run the Prediction
y_pred_next_iteration = model.predict(features_pred)
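How `features_pred` gets built is not shown. The important detail is that the new items must go through the vectorizer and encoders that were already fitted on the training data, using `transform` rather than `fit_transform`, so the columns line up with what the model saw. A minimal sketch with made-up items and placeholder developer names:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer

# Made-up training items: text, SOR, and the developer who handled each one
train_text = ["lease balance defect", "loan mapping change", "lease type added",
              "trade feed failure", "loan rate defect", "trade booking change"]
train_sor = ["AFS", "LIQ", "AFS", "GBF", "LIQ", "GBF"]
train_names = ["dev_a", "dev_b", "dev_a", "dev_c", "dev_b", "dev_c"]

tfidf = TfidfVectorizer()
sor_encoder = LabelBinarizer()
features = np.hstack([
    tfidf.fit_transform(train_text).toarray(),   # fit on training data only
    sor_encoder.fit_transform(train_sor),
])
model = LogisticRegression(random_state=0, max_iter=1000).fit(features, train_names)

# Next iteration: reuse the fitted transformers, do not refit them
new_text = ["lease defect in balance", "trade feed change"]
new_sor = ["AFS", "GBF"]
features_pred = np.hstack([
    tfidf.transform(new_text).toarray(),
    sor_encoder.transform(new_sor),
])
y_pred_next_iteration = model.predict(features_pred)
print(y_pred_next_iteration)
```

Refitting on the new data would change the vocabulary and the column order, and the model's learned weights would no longer match the features.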
Publish the Results !!
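To close the loop, the predictions need to be paired back with their issue keys in a form the upload step accepts. The sketch below writes a two-column CSV into an in-memory buffer. The column names and sample values are assumptions, and an actual xls upload would need `to_excel` with a suitable engine installed.

```python
import io

import pandas as pd

# Made-up keys and predictions standing in for the real next-iteration data
next_items = pd.DataFrame({"Issue_key": ["A-201", "A-202"]})
y_pred_next_iteration = ["dev_a", "dev_b"]

# One line per item: the issue key and the developer predicted for it
allocation = next_items.assign(Name=y_pred_next_iteration)

buffer = io.StringIO()
allocation.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```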
In this post we have seen how to build a multi-class classifier to predict the resource best aligned to a work requirement.
GIT Notebook Link
Continue to Part 2 ….