My model has higher BLEU, can I ship it? The Joel Test for machine learning systems

ACML-AIMLP Workshop

Hi, I’m Lucy Park


  • A machine learning scientist at Naver Papago
  • Currently working on machine translation and user log analysis

Machine translation (MT)

  • The use of software to translate text from one language to another
  • Related services: Google Translate, GenieTalk, Systran, … and Papago!


Neural machine translation (NMT)

  • Many translation services now use NMT in production, including Papago


Phase 1 Collect a bunch of parallel corpora
Phase 2 Train an NMT algorithm (Usually, sequence-to-sequence)
Phase 3 Profit!

…and recent advances in NMT

  • Multilingual? [1]
  • Multi-head self-attention? [2]
  • Unsupervised? [3]

What does this mean?

  • Fast development cycles!
  • Frequent master branch updates
  • Frequent model releases
  • Not only in MT, but practically in every area that employs machine learning

Let’s put the “science part” aside today, and talk about the “engineering part” of ML systems

Caveats

  1. This talk is NOT about machine translation algorithms
  2. This talk is NOT about BLEU, or related metrics
  3. This talk is NOT about Papago or Naver
  4. This talk aims to pinpoint the subtle areas in our machine learning systems

“Fast development cycles incur technical debt”

  • “Technical debt” [4]
    • Long-term costs incurred by moving quickly in software engineering
    • It becomes more and more painful to add a feature
  • Payback methods [5][6]
    • Not adding new functionality
    • Enabling future improvements and maintainability
    • Examples:
      • refactoring code
      • improving unit tests
      • deleting dead code
      • reducing dependencies
      • tightening APIs
      • improving documentation

There are many prophets regarding technical debt


The Joel test

So I’ve come up with my own, highly irresponsible, sloppy test to rate the quality of a software team. The great part about it is that it takes about 3 minutes. With all the time you save, you can go to medical school. – Joel Spolsky

  1. Do you use source control?
  2. Can you make a build in one step?
  3. Do you make daily builds?
  4. Do you have a bug database?
  5. Do you fix bugs before writing new code?
  6. Do you have an up-to-date schedule?
  7. Do you have a spec?
  8. Do programmers have quiet working conditions?
  9. Do you use the best tools money can buy?
  10. Do you have testers?
  11. Do new candidates write code during their interview?
  12. Do you do hallway usability testing?

But this list is insufficient for machine learning

“Instead of writing a program that solves the problem, we (machine learning scientists) write a program that learns to solve the problem, from examples.” [7] – Chris Olah

“Software 2.0 is written in neural network weights. No human is involved in writing this code.” [8] – Andrej Karpathy

The biggest difference in machine learning: Data is involved!

“Hidden debt in machine learning systems”

Sculley, 2015
  • Data acquisition
  • Model training (Involves training dataset)
  • Model deployment (Involves test dataset)
  • Monitoring

And other prophets regarding technical debt in machine learning

Have you read the papers below? If not, they’re highly recommended!

  • Zinkevich, Rules of machine learning: Best practices for ML engineering, 2016.
  • Yli-Huumo et al., How do software development teams manage technical debt? – An empirical study, Journal of Systems and Software, 2016.
  • Breck et al., What’s your ML Test Score? A rubric for ML production systems, 2016.
  • Sculley et al., Machine learning: The high-interest credit card of technical debt, 2014.

However if you don’t have much time to spare… let’s try the following list. 😜

The Joel test for better ML systems

So I’ve come up with my own, highly irresponsible, sloppy test to rate the quality of a machine learning team. The great part about it is that it takes about 3 minutes. With all the time you save, you can take a MOOC. – Lucy Park

  1. Do you keep your data versioned as well as your code?
  2. Do you have an experiment database?
  3. Do you have specified evaluation metrics?
  4. Do the evaluation datasets match the needs of your users?
  5. Can you reproduce your experiments in one step?
  6. Do you have up-to-date documents?
  7. Do you have the best computational resources money can buy?
  8. Do you have tools to test model training?
  9. Do you have tools to interpret your models?
  10. Can you easily replace a component of your algorithm?
  11. Does your team have a clear vision?

Do you keep your data versioned as well as your code?

  • There are tools for code version control (ex: Git)

    [Before/After figure: code version control, from PHD Comics [9]]
    “My name is Linus, and I am your God.”
    – Linus Torvalds

Do you keep your data versioned as well as your code?

  • But are there tools for data version control?

    Before: enko-corpus-2017-10-11/, enko-corpus-2017-10-31/, enko-corpus-2017-11-15/
    After: ?
  • Maybe
  • But at the very least, datasets should be versioned or named using an agreed-upon rule (one possible rule is sketched below)
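
A minimal sketch of one such agreed-upon rule (an illustration, not Papago's actual convention): tag each dataset snapshot with its creation date plus a short content hash, so a name like enko-corpus-2017-11-15-a3f9c21e is unambiguous across machines. The helper name and paths are hypothetical.

```python
# Hypothetical naming rule: date + short content hash of the corpus.
import hashlib
import datetime
from pathlib import Path

def dataset_version_tag(corpus_dir: str) -> str:
    """Return a tag like '2017-11-15-a3f9c21e' for a corpus directory."""
    digest = hashlib.sha1()
    for path in sorted(Path(corpus_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())   # file names contribute to the hash
            digest.update(path.read_bytes())    # ...and so do the file contents
    date = datetime.date.today().isoformat()
    return f"{date}-{digest.hexdigest()[:8]}"

# Usage (hypothetical path):
# print("enko-corpus-" + dataset_version_tag("data/enko-corpus"))
```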

Do you have specified evaluation metrics?

  • Does not have to be a single metric
  • Important question: Does your whole team agree on and trust the results?
    • Ex: BLEU is the de facto standard metric for MT
    • BLEU, however, does not necessarily reflect user satisfaction
    • So also conduct human evaluation, or build quality estimation models (a minimal reporting sketch follows below)
  • Better yet, run small-sized A/B tests!
    • This will let you deploy the model as soon as sanity checks are completed
    • Plus, you can elastically scale each model according to user satisfaction measures


[Figure: A/B testing illustration [10]]
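
As a small illustration of reporting more than a single number per model, the sketch below computes corpus BLEU (assuming the sacrebleu package is installed) and averages hypothetical human adequacy ratings alongside it; the rating data are an assumption, not a prescribed workflow.

```python
# Assumes the `sacrebleu` package (pip install sacrebleu).
import sacrebleu

hypotheses = ["this is a test .", "another sentence ."]      # model output
references = ["this is a test .", "yet another sentence ."]  # gold translations

bleu = sacrebleu.corpus_bleu(hypotheses, [references])

# Hypothetical 1-5 adequacy ratings collected from human evaluators.
human_ratings = [4.5, 3.0]
mean_adequacy = sum(human_ratings) / len(human_ratings)

print(f"BLEU = {bleu.score:.2f}, mean human adequacy = {mean_adequacy:.2f}")
```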

Do the evaluation datasets match the needs of your users?

  • Higher BLEU doesn’t guarantee higher user satisfaction
  • One way: Use multiple evaluation sets (a per-set evaluation sketch follows below)
    • Including various text lengths (ex: words, sentences, paragraphs)
    • Including multiple domains (ex: IT, politics, instant messaging)
    • Including adversarial examples (ex: gender/ethnic biased text)
  • Better yet, let the product manager, instead of the engineer, create the test set

    Andrew Ng, on “AI is the new electricity” [11]
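
A minimal sketch of evaluating one model on several named evaluation sets (different domains and text lengths) rather than a single aggregate set; `translate` and `score` are hypothetical stand-ins for the MT system and the agreed metric.

```python
# `translate` and `score` are hypothetical stand-ins.
from typing import Callable, Dict, List, Sequence, Tuple

TestSet = List[Tuple[str, str]]  # (source, reference) pairs

def evaluate_per_set(test_sets: Dict[str, TestSet],
                     translate: Callable[[str], str],
                     score: Callable[[Sequence[str], Sequence[str]], float]
                     ) -> Dict[str, float]:
    """Score the same model separately on each named evaluation set."""
    results = {}
    for name, pairs in test_sets.items():
        sources = [src for src, _ in pairs]
        refs = [ref for _, ref in pairs]
        hyps = [translate(src) for src in sources]
        results[name] = score(hyps, refs)
    return results

# Usage with toy stand-ins (real sets would cover domains, lengths, bias probes):
sets = {"it-docs": [("hello", "hello")], "messenger": [("hi", "hi")]}
print(evaluate_per_set(sets,
                       translate=lambda s: s,  # identity "model"
                       score=lambda h, r: sum(x == y for x, y in zip(h, r)) / len(r)))
```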

Do you have an experiment database?

  • A minimal experiment database should include the following (a logging sketch follows after this list):
    • Dataset version
    • Code version (preferably, git hashes)
    • Hyperparameters passed to the code at runtime
    • Running environment (ex: hostname, GPU model, …)
    • Results of the experiment, according to the predefined evaluation measure
    • The time consumed to train a model and infer on test sets
  • Better yet, make a dashboard that marks your chronological progress
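
A minimal sketch of such an append-only experiment log, one JSON record per run, covering the fields above; the field names and log path are assumptions rather than a fixed schema.

```python
# Field names and log path are assumptions, not a fixed schema.
import json
import socket
import subprocess
from datetime import datetime

def log_experiment(dataset_version, hyperparams, results, train_seconds,
                   log_path="experiments.jsonl"):
    """Append one experiment record to a JSON-lines file."""
    record = {
        "timestamp": datetime.now().isoformat(),
        "dataset_version": dataset_version,        # e.g. "enko-corpus-2017-11-15"
        "code_version": subprocess.check_output(   # git hash of the current checkout
            ["git", "rev-parse", "HEAD"]).decode().strip(),
        "hyperparams": hyperparams,                 # everything passed at runtime
        "host": socket.gethostname(),               # running environment
        "results": results,                         # e.g. {"bleu": 31.2}
        "train_seconds": train_seconds,             # training + inference time
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# log_experiment("enko-corpus-2017-11-15", {"lr": 3e-4, "layers": 6},
#                {"bleu": 31.2}, train_seconds=86400)
```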

Can you reproduce your experiments in one step?

  • Important, so that teammates can incrementally build upon your experiments
  • One way: Put all settings for each experiment in a file, runnable with a single script (a sketch follows below)
  • Better yet, dockerize them? 🤔 (ex: http://codalab.org)
CodaLab 👍
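
A minimal sketch of the “all settings in one file, one command to run” idea: a single entry point that consumes a JSON config; the config keys and the body of run() are hypothetical.

```python
# Hypothetical entry point: python run.py experiments/2017-11-15-baseline.json
import argparse
import json

def run(config: dict) -> None:
    """Drive the whole experiment from the config file alone."""
    print("dataset :", config["dataset_version"])
    print("model   :", config["model"])
    print("hparams :", config["hyperparams"])
    # ... train, infer on the test set, evaluate, log results ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Reproduce one experiment.")
    parser.add_argument("config", help="Path to the experiment's JSON config")
    args = parser.parse_args()
    with open(args.config) as f:
        run(json.load(f))
```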

Do you have up-to-date documents?

  • “Documents” here means organized usage instructions for the code or platform
  • Important for new teammates, and for our future selves
  • Try your best to keep them up to date!
  • Sub-considerations:
    • Are the documents in sync? (ex: docstrings, wikis, READMEs) — one way to keep docstrings honest is sketched below
    • Are the documents kept in a place that all teammates know about?
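
One possible way (an assumption on my part, not the talk’s prescription) to keep docstrings from silently drifting: make their examples executable with Python’s doctest, so stale documentation fails in CI.

```python
# Docstring examples are executable, so stale docs fail the test run.
def detokenize(tokens):
    """Join subword tokens back into a whitespace-separated sentence.

    >>> detokenize(["Hello", ",", "world", "!"])
    'Hello , world !'
    """
    return " ".join(tokens)

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # exits quietly if docstrings and code still agree
```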

Do you have the best computational resources money can buy?

Do we need explanations here…?
Inside Gak, Naver’s data center

Do you have tools to test model training?

  • ML model debugging is not straightforward
  • Many possible causes:
    • The dataset can be faulty (ex: misaligned labels, extremely small dataset)
    • The code raises no errors and the loss still goes down, yet the results are bad
    • Sometimes it’s just the hyperparameters’ fault
  • Standard unit tests might not work as intended
  • Sub-considerations:
    • Model stability: Does your model replicate similar results for multiple runs?
    • Ablation testing: Which part of your model is responsible for the results?
  • The naive way: Establish a baseline and test against it [12] (a sketch follows below)
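
A minimal sketch of the baseline approach: on a tiny synthetic dataset, training must beat a trivial predict-the-mean baseline, otherwise the test fails. The toy linear model here is a stand-in for real training code.

```python
# Sanity test: trained toy model must beat a predict-the-mean baseline.
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def train_toy_model(xs, ys, steps=200, lr=0.01):
    """Plain gradient descent on a 1-D linear model (stand-in for real training)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = sum(2 * ((w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum(2 * ((w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def test_training_beats_baseline():
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]          # y = 2x + 1, clearly learnable
    baseline = mse([sum(ys) / len(ys)] * len(ys), ys)
    w, b = train_toy_model(xs, ys)
    trained = mse([w * x + b for x in xs], ys)
    assert trained < baseline, "model should beat the predict-the-mean baseline"

test_training_beats_baseline()
```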

Do you have tools to interpret your models?

  • Important because it gives us hints for debugging or improving our algorithms

    [Figure: interactive beam search visualization for NMT [13]]

Can you easily replace a component of your algorithm?

  • Encapsulation and modularization are important concepts in software engineering
  • But they are extremely important in machine learning too, due to the fast dev cycles
  • Example in MT:

    Tokenization → Train MT model → Infer on test set → Evaluate
    • These “components” should all be easily replaceable (a minimal interface sketch follows below)
    • There can be different levels of granularity of components
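
A minimal sketch of keeping the MT pipeline components swappable behind small interfaces, so e.g. the tokenizer can be replaced without touching the rest; the type aliases and toy components are illustrative, not a fixed design.

```python
# Each pipeline stage is just a callable that can be swapped out.
from typing import Callable, List, Sequence

Tokenizer = Callable[[str], List[str]]
Translator = Callable[[List[str]], List[str]]
Evaluator = Callable[[Sequence[str], Sequence[str]], float]

def run_pipeline(sources: Sequence[str], references: Sequence[str],
                 tokenize: Tokenizer, translate: Translator,
                 evaluate: Evaluator) -> float:
    """Tokenize → translate → evaluate, with every stage replaceable."""
    hypotheses = [" ".join(translate(tokenize(s))) for s in sources]
    return evaluate(hypotheses, references)

# Swapping a component is a one-argument change:
whitespace_tokenizer = lambda s: s.split()
identity_translator = lambda tokens: tokens            # stand-in "model"
exact_match = lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs)) / len(refs)

score = run_pipeline(["hello world"], ["hello world"],
                     whitespace_tokenizer, identity_translator, exact_match)
print(score)  # 1.0 with these toy components
```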

Does your team have a clear vision?

  • Does your team have explicit priorities that everyone can agree upon?
  • Does your team make consolidated decisions about your ML system?
  • This is not an ML-specific item, but it is also very important!
  • Keep in mind that a vision is USELESS if no one is aware of it

The Joel test for better ML systems

What a sloppy test! But it’s short and simple, and just might give us ideas to improve our system.

  1. Do you keep your data versioned as well as your code?
  2. Do you have an experiment database?
  3. Do you have specified evaluation metrics?
  4. Do the evaluation datasets match the needs of your users?
  5. Can you reproduce your experiments in one step?
  6. Do you have up-to-date documents?
  7. Do you have the best computational resources money can buy?
  8. Do you have tools to test model training?
  9. Do you have tools to interpret your models?
  10. Can you easily replace a component of your algorithm?
  11. Does your team have a clear vision?

Thank you for listening!

And also special thanks to the following colleagues for reviewing and providing valuable comments:

  • Sungjoo Ha
  • Sung Kim
  • Hyunjoong Kim
  • Donghyun Kwak
  • Chanju Jung
  • Jaesong Lee
  • Zaemyung Kim
  • Hyunchang Cho
  • Junseok Kim
  • Joongwhi Shin

Appendix

The “dark side” of reproducibility in ML

[Screenshot: the rlntm repository on GitHub [14]]

Yes, it’s definitely not an easy thing…

Some discussions here → ICML 2017 reproducibility workshop

References

  1. Johnson et al., Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation, Nov 2016.

  2. Vaswani et al., Attention Is All You Need, Jun 2017.

  3. Artetxe et al., Unsupervised neural machine translation, Oct 2017.

  4. Cunningham, The WyCash Portfolio Management System, 1992.

  5. Sculley et al., Hidden Technical Debt in Machine Learning Systems, 2015.

  6. Fowler, Refactoring: Improving the Design of Existing Code, 1999.

  7. How Does Your Phone Know This Is A Dog?, Sep 2015.

  8. Karpathy, Software 2.0, 2017.

  9. http://phdcomics.com/comics/archive.php?comicid=1531

  10. What is A/B Testing?

  11. Andrew Ng, Artificial Intelligence is the New Electricity (video), Feb 2017.

  12. http://blog.mpacula.com/2011/02/17/unit-testing-statistical-software/, Feb 2011.

  13. Lee et al., Interactive Beam Search for Visualizing Neural Machine Translation, EMNLP, 2017.

  14. https://github.com/ilyasu123/rlntm