My model has higher BLEU, can I ship it? The Joel Test for machine learning systems

ACML-AIMLP Workshop

Hi, I’m Lucy Park


  • A machine learning scientist at Naver Papago
  • Currently working on machine translation and user log analysis

Machine translation (MT)

  • The use of software to translate text from one language to another
  • Related services: Google Translate, GenieTalk, Systran, … and Papago!


Neural machine translation (NMT)

  • Many translation services now use NMT in production, including Papago


Phase 1 Collect a bunch of parallel corpora
Phase 2 Train an NMT algorithm (Usually, sequence-to-sequence)
Phase 3 Profit!

…and recent advances in NMT

  • Multilingual? [1]
  • Multi-head self-attention? [2]
  • Unsupervised? [3]

What does this mean?

  • Fast development cycles!
  • Frequent master branch updates
  • Frequent model releases
  • Not only in MT, but practically in every area that employs machine learning

Let’s put the “science part” aside today, and talk about the “engineering part” of ML systems

Caveats

  1. This talk is NOT about machine translation algorithms
  2. This talk is NOT about BLEU, or related metrics
  3. This talk is NOT about Papago or Naver
  4. This talk aims to pinpoint the subtle areas in our machine learning systems

“Fast development cycles incur technical debt”

  • “Technical debt” [4]
    • Long-term costs incurred by moving quickly in software engineering
    • It becomes more and more painful to add a feature
  • Payback methods [5][6]
    • Not adding new functionality
    • Enabling future improvements and maintainability
    • Examples:
      • refactoring code
      • improving unit tests
      • deleting dead code
      • reducing dependencies
      • tightening APIs
      • improving documentation

There are many prophets regarding technical debt


The Joel test

So I’ve come up with my own, highly irresponsible, sloppy test to rate the quality of a software team. The great part about it is that it takes about 3 minutes. With all the time you save, you can go to medical school. – Joel Spolsky

  1. Do you use source control?
  2. Can you make a build in one step?
  3. Do you make daily builds?
  4. Do you have a bug database?
  5. Do you fix bugs before writing new code?
  6. Do you have an up-to-date schedule?
  7. Do you have a spec?
  8. Do programmers have quiet working conditions?
  9. Do you use the best tools money can buy?
  10. Do you have testers?
  11. Do new candidates write code during their interview?
  12. Do you do hallway usability testing?

But this list is insufficient for machine learning

“Instead of writing a program that solves the problem, we (machine learning scientists) write a program that learns to solve the problem, from examples.” [7] – Chris Olah

“Software 2.0 is written in neural network weights. No human is involved in writing this code.” [8] – Andrej Karpathy

The biggest difference in machine learning: Data is involved!

“Hidden debt in machine learning systems”

Sculley, 2015
  • Data acquisition
  • Model training (Involves training dataset)
  • Model deployment (Involves test dataset)
  • Monitoring

And other prophets regarding technical debt in machine learning

Have you read the papers below? If not, they’re highly recommended!

  • Zinkevich, Rules of machine learning: Best practices for ML engineering, 2016.
  • Yli-Huumo et al., How do software development teams manage technical debt? – An empirical study, Journal of Systems and Software, 2016.
  • Breck et al., What’s your ML Test Score? A rubric for ML production systems, 2016.
  • Sculley et al., Machine learning: The high-interest credit card of technical debt, 2014.

However if you don’t have much time to spare… let’s try the following list. 😜

The Joel test for better ML systems

So I’ve come up with my own, highly irresponsible, sloppy test to rate the quality of a machine learning team. The great part about it is that it takes about 3 minutes. With all the time you save, you can take a MOOC. – Lucy Park

  1. Do you keep your data versioned as well as your code?
  2. Do you have an experiment database?
  3. Do you have specified evaluation metrics?
  4. Do the evaluation datasets match the needs of your users?
  5. Can you reproduce your experiments in one step?
  6. Do you have up-to-date documents?
  7. Do you have the best computational resources money can buy?
  8. Do you have tools to test model training?
  9. Do you have tools to interpret your models?
  10. Can you easily replace a component of your algorithm?
  11. Does your team have a clear vision?

Do you keep your data versioned as well as your code?

  • There are tools for code version control (ex: Git)

    [Before/After figure: code version control, from PHD Comics [9]]
    “My name is Linus, and I am your God.”
    – Linus Torvalds

Do you keep your data versioned as well as your code?

  • But are there tools for data version control?

    Before: enko-corpus-2017-10-11/, enko-corpus-2017-10-31/, enko-corpus-2017-11-15/
    After: ?
  • Maybe
  • But at the very least, datasets should be versioned or named using an agreed-upon rule (one possible rule is sketched below)
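
A minimal sketch of one such agreed-upon rule (an illustration, not Papago's actual convention): tag each dataset snapshot with its creation date plus a short content hash, so a name like enko-corpus-2017-11-15-a3f9c21e is unambiguous across machines. The helper name and paths are hypothetical.

```python
# Hypothetical naming rule: date + short content hash of the corpus.
import hashlib
import datetime
from pathlib import Path

def dataset_version_tag(corpus_dir: str) -> str:
    """Return a tag like '2017-11-15-a3f9c21e' for a corpus directory."""
    digest = hashlib.sha1()
    for path in sorted(Path(corpus_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())   # file names contribute to the hash
            digest.update(path.read_bytes())    # ...and so do the file contents
    date = datetime.date.today().isoformat()
    return f"{date}-{digest.hexdigest()[:8]}"

# Usage (hypothetical path):
# print("enko-corpus-" + dataset_version_tag("data/enko-corpus"))
```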

Do you have specified evaluation metrics?

  • Does not have to be a single metric
  • Important question: Does your whole team agree on and trust the results?
    • Ex: BLEU is the de facto standard metric for MT
    • BLEU, however, does not necessarily reflect user satisfaction
    • So also conduct human evaluation, or build quality estimation models (a minimal reporting sketch follows below)
  • Better yet, run small-sized A/B tests!
    • This will let you deploy the model as soon as sanity checks are completed
    • Plus, you can elastically scale each model according to user satisfaction measures


[Figure: A/B testing illustration [10]]
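
As a small illustration of reporting more than a single number per model, the sketch below computes corpus BLEU (assuming the sacrebleu package is installed) and averages hypothetical human adequacy ratings alongside it; the rating data are an assumption, not a prescribed workflow.

```python
# Assumes the `sacrebleu` package (pip install sacrebleu).
import sacrebleu

hypotheses = ["this is a test .", "another sentence ."]      # model output
references = ["this is a test .", "yet another sentence ."]  # gold translations

bleu = sacrebleu.corpus_bleu(hypotheses, [references])

# Hypothetical 1-5 adequacy ratings collected from human evaluators.
human_ratings = [4.5, 3.0]
mean_adequacy = sum(human_ratings) / len(human_ratings)

print(f"BLEU = {bleu.score:.2f}, mean human adequacy = {mean_adequacy:.2f}")
```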

Do the evaluation datasets match the needs of your users?

  • Higher BLEU doesn’t guarantee higher user satisfaction
  • One way: Use multiple evaluation sets (a per-set evaluation sketch follows below)
    • Including various text lengths (ex: words, sentences, paragraphs)
    • Including multiple domains (ex: IT, politics, instant messaging)
    • Including adversarial examples (ex: gender/ethnic biased text)
  • Better yet, let the product manager, instead of the engineer, create the test set

    Andrew Ng, on “AI is the new electricity” [11]
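
A minimal sketch of evaluating one model on several named evaluation sets (different domains and text lengths) rather than a single aggregate set; `translate` and `score` are hypothetical stand-ins for the MT system and the agreed metric.

```python
# `translate` and `score` are hypothetical stand-ins.
from typing import Callable, Dict, List, Sequence, Tuple

TestSet = List[Tuple[str, str]]  # (source, reference) pairs

def evaluate_per_set(test_sets: Dict[str, TestSet],
                     translate: Callable[[str], str],
                     score: Callable[[Sequence[str], Sequence[str]], float]
                     ) -> Dict[str, float]:
    """Score the same model separately on each named evaluation set."""
    results = {}
    for name, pairs in test_sets.items():
        sources = [src for src, _ in pairs]
        refs = [ref for _, ref in pairs]
        hyps = [translate(src) for src in sources]
        results[name] = score(hyps, refs)
    return results

# Usage with toy stand-ins (real sets would cover domains, lengths, bias probes):
sets = {"it-docs": [("hello", "hello")], "messenger": [("hi", "hi")]}
print(evaluate_per_set(sets,
                       translate=lambda s: s,  # identity "model"
                       score=lambda h, r: sum(x == y for x, y in zip(h, r)) / len(r)))
```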

Do you have an experiment database?

  • A minimal experiment database should include the following (a logging sketch follows after this list):
    • Dataset version
    • Code version (preferably, git hashes)
    • Hyperparameters passed to the code at runtime
    • Running environment (ex: hostname, GPU model, …)
    • Results of the experiment, according to the predefined evaluation measure
    • The time consumed to train a model and infer on test sets
  • Better yet, make a dashboard that marks your chronological progress
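
A minimal sketch of such an append-only experiment log, one JSON record per run, covering the fields above; the field names and log path are assumptions rather than a fixed schema.

```python
# Field names and log path are assumptions, not a fixed schema.
import json
import socket
import subprocess
from datetime import datetime

def log_experiment(dataset_version, hyperparams, results, train_seconds,
                   log_path="experiments.jsonl"):
    """Append one experiment record to a JSON-lines file."""
    record = {
        "timestamp": datetime.now().isoformat(),
        "dataset_version": dataset_version,        # e.g. "enko-corpus-2017-11-15"
        "code_version": subprocess.check_output(   # git hash of the current checkout
            ["git", "rev-parse", "HEAD"]).decode().strip(),
        "hyperparams": hyperparams,                 # everything passed at runtime
        "host": socket.gethostname(),               # running environment
        "results": results,                         # e.g. {"bleu": 31.2}
        "train_seconds": train_seconds,             # training + inference time
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# log_experiment("enko-corpus-2017-11-15", {"lr": 3e-4, "layers": 6},
#                {"bleu": 31.2}, train_seconds=86400)
```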

Can you reproduce your experiments in one step?

  • Important, so that teammates can incrementally build upon your experiments
  • One way: Put all settings for each experiment in a file, runnable with a single script (a sketch follows below)
  • Better yet, dockerize them? 🤔 (ex: http://codalab.org)
CodaLab 👍
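
A minimal sketch of the “all settings in one file, one command to run” idea: a single entry point that consumes a JSON config; the config keys and the body of run() are hypothetical.

```python
# Hypothetical entry point: python run.py experiments/2017-11-15-baseline.json
import argparse
import json

def run(config: dict) -> None:
    """Drive the whole experiment from the config file alone."""
    print("dataset :", config["dataset_version"])
    print("model   :", config["model"])
    print("hparams :", config["hyperparams"])
    # ... train, infer on the test set, evaluate, log results ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Reproduce one experiment.")
    parser.add_argument("config", help="Path to the experiment's JSON config")
    args = parser.parse_args()
    with open(args.config) as f:
        run(json.load(f))
```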

Do you have up-to-date documents?

  • “Documents” here means organized usage instructions for the code or platform
  • Important for new teammates, and for our future selves
  • Try your best to keep them up to date!
  • Sub-considerations:
    • Are the documents in sync? (ex: docstrings, wikis, READMEs) — one way to keep docstrings honest is sketched below
    • Are the documents kept in a place that all teammates know about?
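
One possible way (an assumption on my part, not the talk’s prescription) to keep docstrings from silently drifting: make their examples executable with Python’s doctest, so stale documentation fails in CI.

```python
# Docstring examples are executable, so stale docs fail the test run.
def detokenize(tokens):
    """Join subword tokens back into a whitespace-separated sentence.

    >>> detokenize(["Hello", ",", "world", "!"])
    'Hello , world !'
    """
    return " ".join(tokens)

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # exits quietly if docstrings and code still agree
```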

Do you have the best computational resources money can buy?

Do we need explanations here…?
Inside Gak, Naver’s data center

Do you have tools to test model training?

  • ML model debugging is not straightforward
  • Many possible causes:
    • The dataset can be faulty (ex: misaligned labels, extremely small dataset)
    • The code raises no errors and the loss still goes down, yet the results are bad
    • Sometimes it’s just the hyperparameters’ fault
  • Standard unit tests might not work as intended
  • Sub-considerations:
    • Model stability: Does your model replicate similar results for multiple runs?
    • Ablation testing: Which part of your model is responsible for the results?
  • The naive way: Establish a baseline and test against it [12] (a sketch follows below)
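
A minimal sketch of the baseline approach: on a tiny synthetic dataset, training must beat a trivial predict-the-mean baseline, otherwise the test fails. The toy linear model here is a stand-in for real training code.

```python
# Sanity test: trained toy model must beat a predict-the-mean baseline.
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def train_toy_model(xs, ys, steps=200, lr=0.01):
    """Plain gradient descent on a 1-D linear model (stand-in for real training)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = sum(2 * ((w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum(2 * ((w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def test_training_beats_baseline():
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]          # y = 2x + 1, clearly learnable
    baseline = mse([sum(ys) / len(ys)] * len(ys), ys)
    w, b = train_toy_model(xs, ys)
    trained = mse([w * x + b for x in xs], ys)
    assert trained < baseline, "model should beat the predict-the-mean baseline"

test_training_beats_baseline()
```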

Do you have tools to interpret your models?

  • Important because it gives us hints for debugging or improving our algorithms

    [Figure: interactive beam search visualization for NMT [13]]

Can you easily replace a component of your algorithm?

  • Encapsulation and modularization are important concepts in software engineering
  • But they are extremely important in machine learning too, due to the fast dev cycles
  • Example in MT:

    Tokenization → Train MT model → Infer on test set → Evaluate
    • These “components” should all be easily replaceable (a minimal interface sketch follows below)
    • There can be different levels of granularity of components
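
A minimal sketch of keeping the MT pipeline components swappable behind small interfaces, so e.g. the tokenizer can be replaced without touching the rest; the type aliases and toy components are illustrative, not a fixed design.

```python
# Each pipeline stage is just a callable that can be swapped out.
from typing import Callable, List, Sequence

Tokenizer = Callable[[str], List[str]]
Translator = Callable[[List[str]], List[str]]
Evaluator = Callable[[Sequence[str], Sequence[str]], float]

def run_pipeline(sources: Sequence[str], references: Sequence[str],
                 tokenize: Tokenizer, translate: Translator,
                 evaluate: Evaluator) -> float:
    """Tokenize → translate → evaluate, with every stage replaceable."""
    hypotheses = [" ".join(translate(tokenize(s))) for s in sources]
    return evaluate(hypotheses, references)

# Swapping a component is a one-argument change:
whitespace_tokenizer = lambda s: s.split()
identity_translator = lambda tokens: tokens            # stand-in "model"
exact_match = lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs)) / len(refs)

score = run_pipeline(["hello world"], ["hello world"],
                     whitespace_tokenizer, identity_translator, exact_match)
print(score)  # 1.0 with these toy components
```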

Does your team have a clear vision?

  • Does your team have explicit priorities that everyone can agree upon?
  • Does your team make consolidated decisions about your ML system?
  • This is not an ML-specific item, but it is also very important!
  • Keep in mind that a vision is USELESS if no one is aware of it

The Joel test for better ML systems

What a sloppy test! But it’s short and simple, and just might give us ideas to improve our system.

  1. Do you keep your data versioned as well as your code?
  2. Do you have an experiment database?
  3. Do you have specified evaluation metrics?
  4. Do the evaluation datasets match the needs of your users?
  5. Can you reproduce your experiments in one step?
  6. Do you have up-to-date documents?
  7. Do you have the best computational resources money can buy?
  8. Do you have tools to test model training?
  9. Do you have tools to interpret your models?
  10. Can you easily replace a component of your algorithm?
  11. Does your team have a clear vision?

Thank you for listening!

And also special thanks to the following colleagues for reviewing and providing valuable comments:

  • Sungjoo Ha
  • Sung Kim
  • Hyunjoong Kim
  • Donghyun Kwak
  • Chanju Jung
  • Jaesong Lee
  • Zaemyung Kim
  • Hyunchang Cho
  • Junseok Kim
  • Joongwhi Shin

Appendix

The “dark side” of reproducibility in ML

[Screenshot: the rlntm repository on GitHub [14]]

Yes, it’s definitely not an easy thing…

Some discussions here → ICML 2017 reproducibility workshop

References

  1. Johnson et al., Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation, Nov 2016.

  2. Vaswani et al., Attention Is All You Need, Jun 2017.

  3. Artetxe et al., Unsupervised neural machine translation, Oct 2017.

  4. Cunningham, The WyCash Portfolio Management System, 1992.

  5. Sculley et al., Hidden Technical Debt in Machine Learning Systems, 2015.

  6. Fowler, Refactoring: Improving the Design of Existing Code, 1999.

  7. How Does Your Phone Know This Is A Dog?, Sep 2015.

  8. Karpathy, Software 2.0, 2017.

  9. http://phdcomics.com/comics/archive.php?comicid=1531

  10. What is A/B Testing?

  11. Andrew Ng, Artificial Intelligence is the New Electricity (video), Feb 2017.

  12. http://blog.mpacula.com/2011/02/17/unit-testing-statistical-software/, Feb 2011.

  13. Lee et al., Interactive Beam Search for Visualizing Neural Machine Translation, EMNLP, 2017.

  14. https://github.com/ilyasu123/rlntm