Want your data scientists to freak out? Tell them to start following the agile methodology.
Want your product teams to freak out? Tell them that agile doesn’t work for data science projects.
You’re going to run into extreme opinions on this topic. Scrum, Kanban, and the many variants have become so standard that today’s developers and product teams can’t imagine working differently. The “agile is best” zealotry has become so ingrained that proponents will reject any contrary opinions.
Unfortunately, blindly following any agile software methodology won’t work for data science. At the same time, your product teams rely on agile for accountability, and they will correctly assert that perpetual data science experimentation is an unacceptable risk.
In this issue of FeedForward I’ll explore this topic and help you get your data science and product teams working together in harmony. Fortunately it isn’t that hard once you recognize the risks they are trying to mitigate.
The challenge of applying Agile to data science projects
Agile methodologies were primarily created to handle requirements uncertainty. If you can perfectly document what a software application is supposed to do—and nothing changes—then you don’t need agile. But in the real world this never happens. Requirements change, new opportunities emerge, and unforeseen technical challenges spring up. Agile methodologies were created to give product teams more flexibility in dealing with this uncertainty.
The root problem is that agile was not developed with data science in mind.
In data science the biggest risk is solution viability—NOT requirements uncertainty.
I’ll illustrate this point with a thought experiment. I would like you to build a model that can automatically generate this article. My requirements are quite clear and unlikely to change. So what’s the problem? This solution isn’t viable because text-generation techniques are not yet good enough. Optimizing methodology won’t get us anywhere. The only way forward is to redefine the problem, such as building a model that can automatically generate summaries of sporting events.
Isolating viability risks requires a different approach
Many data scientists believe that Scrum is awful for data science because it doesn’t address viability risk. Addressing viability requires … well … science!
Data scientists are trained to systematically isolate viability risks through experimentation. Unfortunately data science experiments are highly unpredictable, particularly on machine learning applications that leverage state-of-the-art techniques.
The biggest risks in these projects are (1) insufficient data and (2) unacceptable model performance, and experienced data scientists have techniques for rapidly isolating them, such as exploratory data analysis and deliberate model overfitting. Unfortunately, the tasks involved and the time they will take are nearly impossible to predict at the outset.
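To make the overfitting idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a recipe from any particular project: the helper name `memorization_accuracy` and the toy data are hypothetical, and a plain lookup table stands in for a real high-capacity model. The logic is that if even a model that memorizes every training example cannot fit a small sample, the features themselves are the limiting factor.

```python
from collections import Counter, defaultdict


def memorization_accuracy(rows, labels):
    """Best training accuracy any model could reach on this sample.

    A lookup table is the most extreme overfit possible: it memorizes the
    majority label for every distinct input. If even this cannot fit the
    sample, identical inputs carry conflicting labels, so the features
    (not the model) are the limiting factor.
    """
    label_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        label_counts[tuple(row)][label] += 1
    # For each distinct input, the memorizer gets the majority label right.
    correct = sum(counts.most_common(1)[0][1] for counts in label_counts.values())
    return correct / len(labels)


# Hypothetical toy sample: (word_count, has_image) -> "shared" or "ignored".
rows = [(300, 1), (300, 1), (1200, 0), (1200, 0), (800, 1), (800, 1)]
labels = ["shared", "shared", "ignored", "shared", "shared", "ignored"]

ceiling = memorization_accuracy(rows, labels)
print(f"Training accuracy ceiling: {ceiling:.2f}")
```

A ceiling well below 1.0 on a small sample suggests that more or better features are needed before any modeling effort is worthwhile. The check is only informative when inputs repeat; with continuous features every row is distinct, and you would overfit a real model instead.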
Many product teams do not appreciate this technical risk in data science
Most software engineers have never confronted this level of technical risk. The biggest risk in software is building something nobody wants, particularly for new software products. I know because I am also guilty of making this error. Prior to Prolego I attempted to launch a machine learning startup for content marketers. I nailed the customer needs but data issues killed our product’s viability.
Unfortunately most product teams don’t realize their own ignorance about data science and machine learning.
Many product leaders view machine learning models like any other software library, and assume the data scientist’s only job is to make this library.
Building a machine learning model isn’t like building a data-driven web application. A machine learning project cannot be easily broken down into a series of steps.
So what happens at the first sprint planning meeting? The Scrum master goes ballistic when the data scientists create ambiguously worded Jira tickets like “data gathering” with undefined story points.
Many data scientists don’t appreciate the need for Agile
Our hypothetical Scrum master’s reaction is understandable. Many data scientists have no software product experience, and they don’t understand how hard it is to maintain and scale software in a modern data center. The brilliant astrophysicist on your team may have spent 10 years writing experimental Python code which nobody else had to understand or maintain.
They may not appreciate or understand the many execution challenges that processes are designed to prevent. I’ve watched data science teams create “solutions” in thousands of lines of code in Jupyter notebooks. Reality hits when they ask IT to deploy it.
You simply cannot build an application that solves real world problems without having a process.
The solution? Modify agile for data science
So here is your dilemma: your team needs a process to successfully build and deploy your ML models, but the agile methodology they have been refining for the past decade won’t work. Here is how I’ve learned to solve this problem.
Introduce agile at the right time
I start by thinking of any new ML project in distinct phases:
Initial viability
Before ramping up any ML project the data scientist should evaluate whether or not the problem is even solvable. This usually requires (1) assessing the quality and quantity of the underlying data, and (2) evaluating whether a methodology exists. The data scientists will perform exploratory data analysis and possibly a literature review.
Many machine learning projects never make it past this point. The data science team should be working independently. Since the project isn’t yet a software development effort Agile plays no role.
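A first pass at assessing data quality and quantity can be very small. The sketch below is one possible starting point, assuming the data arrives as a list of dictionaries with `None` marking missing values; the function name `summarize_dataset`, the field names, and the records are all hypothetical.

```python
from collections import Counter


def summarize_dataset(records, label_key):
    """Report row count, missing-value rate per field, and label balance."""
    if not records:
        raise ValueError("no records to summarize")
    n_rows = len(records)
    fields = sorted({key for record in records for key in record})
    # Share of rows where each field is absent or None.
    missing_rate = {
        field: sum(1 for r in records if r.get(field) is None) / n_rows
        for field in fields
    }
    # How many examples exist for each label value.
    label_counts = Counter(
        r[label_key] for r in records if r.get(label_key) is not None
    )
    return {
        "rows": n_rows,
        "missing_rate": missing_rate,
        "label_counts": dict(label_counts),
    }


# Hypothetical records; None marks a missing value.
records = [
    {"tenure_months": 12, "plan": "basic", "churned": "no"},
    {"tenure_months": None, "plan": "pro", "churned": "no"},
    {"tenure_months": 3, "plan": None, "churned": "yes"},
    {"tenure_months": 40, "plan": "pro", "churned": "no"},
]

summary = summarize_dataset(records, label_key="churned")
print(summary)
```

Too few rows, heavily missing fields, or a label with only a handful of examples are all early signs that the project may not be viable as scoped.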
Prototype
If the solution is potentially viable the data science team can begin assessing how well it solves the business problem. This step usually requires iteratively training a model and evaluating the results against the minimum acceptable accuracy.
Prototyping can take several months. The schedule is usually driven by the complexity of the solution, accessibility of the data, and the availability of business customers for feedback. Again, the data scientist should do most of this independently.
When the model is “good enough”—a threshold that obviously varies depending on the problem—other team members may join the project. At this point the data scientist’s tasks are more predictable and you can begin introducing Agile.
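The prototype loop can be sketched as follows. This is a simplified illustration, not a prescribed workflow: `run_prototype_iterations`, the 0.85 threshold, and the toy churn data are assumptions made for the example, and hand-written rules stand in for models that would really be trained.

```python
def accuracy(predict, examples):
    """Fraction of (features, label) pairs that `predict` gets right."""
    return sum(1 for x, y in examples if predict(x) == y) / len(examples)


def run_prototype_iterations(candidates, holdout, min_accuracy):
    """Evaluate candidates in order; stop at the first one that is good enough.

    Returns (name, score) pairs for every iteration that was run, so the
    history of experiments stays visible to the rest of the team.
    """
    history = []
    for name, predict in candidates:
        score = accuracy(predict, holdout)
        history.append((name, score))
        if score >= min_accuracy:
            break
    return history


# Hypothetical holdout set: (monthly_logins, churned?)
holdout = [(0, True), (1, True), (2, True), (3, False), (8, False),
           (12, False), (5, True), (20, False)]

# Each "iteration" here is a hand-written rule standing in for a trained model.
candidates = [
    ("always predict no churn", lambda logins: False),
    ("churn if logins < 2", lambda logins: logins < 2),
    ("churn if logins < 6", lambda logins: logins < 6),
    ("churn if logins < 10", lambda logins: logins < 10),
]

history = run_prototype_iterations(candidates, holdout, min_accuracy=0.85)
for name, score in history:
    print(f"{name}: {score:.2f}")
```

Agreeing on the minimum acceptable accuracy with the business before the loop starts gives the product team a concrete exit criterion, even though the number of iterations needed to reach it cannot be known in advance.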
Production
Finally the entire team needs to deploy the model and the project starts to look like a software engineering effort. Agile is critical at this juncture because the entire engineering team needs to work collaboratively to get the model into production.
The data scientist will continue running experiments, but the time and scope will be more predictable. These data science tasks, however, will never be as predictable as tasks in a traditional software engineering effort.
Provide opportunities for cross-functional training
Many of your problems will be solved by simply introducing Agile at the right time. However, your entire team will be more effective if everyone appreciates each other’s challenges.
Product teams
Encourage your product managers and developers to get some basic data science experience. At Prolego we ask our data engineers (the people putting models into production) to take online courses or do Kaggle competitions. Your product team will be much more effective if everyone understands the basics of data cleansing, feature development, and model training.
Data scientists
Data scientists will be more effective team members if they understand how enterprise software development works. Have them participate in activities like sprint planning sessions, code reviews, and customer meetings. They may also benefit from courses in object-oriented programming, Kanban, or Scrum.
If nothing else, cross-functional training will get everyone using the same language.
Another great opportunity for AI leaders to elevate the whole company
I have yet to see a single ML project that didn’t struggle with this challenge, and it will be several years until the methodologies catch up. In the meantime you have another opportunity to be the AI translator your company needs.
When you see a team arguing about the ‘best’ way to run an ML project, just remember that most of the disagreement comes from misunderstanding. Data scientists are not deliberately trying to be vague, and Scrum masters are not trying to create busy work with Jira tickets.
Be the AI translator who helps the product team understand why ML is different. Be the process champion who helps the data science team realize the value of following a methodology. By doing so you will further establish the critical role you play in leading the company to the inevitable AI-driven future.