What’s up at NIPS 2017: From the perspective of a machine translation researcher

  • Lucy Park
  • Naver Papago
  • 2018-01-25 / 2018-02-02

Hi, I’m Lucy Park


  • A machine learning scientist at Naver Papago
  • Currently working on machine translation and user log analysis

Machine translation (MT)

  • The use of software to translate from one language to another, given an input sentence and a target language
  • Related services: Google Translate, GenieTalk, Systran, … and Papago!


Neural machine translation (NMT)

  • Many translation services now use NMT for production, including Papago


Phase 1: Collect a bunch of parallel corpora
ex: Orlando Bloom and Miranda Kerr still love each other <-> Orlando Bloom und Miranda Kerr lieben sich noch immer
Phase 2: Train an NMT algorithm (usually sequence-to-sequence; a toy training sketch follows below)
Phase 3: Profit!
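To make Phase 2 concrete, here is a minimal, hedged sketch of teacher-forced sequence-to-sequence training on one toy sentence pair, using a plain GRU encoder-decoder in PyTorch. The vocabulary sizes, dimensions, and the "id 0 = BOS" convention are illustrative assumptions; real systems add attention, subword vocabularies, and far more data.

    import torch
    import torch.nn as nn

    class ToySeq2Seq(nn.Module):
        """A bare-bones GRU encoder-decoder (no attention), for illustration only."""
        def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=64):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.proj = nn.Linear(dim, tgt_vocab)

        def forward(self, src, tgt_in):
            _, h = self.encoder(self.src_emb(src))          # encode the source sentence
            out, _ = self.decoder(self.tgt_emb(tgt_in), h)  # teacher-forced decoding
            return self.proj(out)                           # per-position logits

    model = ToySeq2Seq()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # One toy "parallel pair", already mapped to integer token ids (id 0 = BOS, an assumption).
    src = torch.randint(1, 1000, (1, 8))
    tgt = torch.randint(1, 1000, (1, 6))
    bos = torch.zeros(1, 1, dtype=torch.long)
    tgt_in, tgt_out = torch.cat([bos, tgt[:, :-1]], dim=1), tgt

    logits = model(src, tgt_in)                             # (1, 6, tgt_vocab)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    loss.backward()
    opt.step()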

Neural machine translation (NMT)

  • RNN-based: Sutskever et al. (NIPS 2014), Bahdanau et al. (ICLR 2015)
  • CNN-based: Gehring et al. (ACL 2017), Lee et al. (TACL 2017)
  • Attention-based: Vaswani et al. (NIPS 2017)
  • Unsupervised: Lample et al. (ICLR 2018), Artetxe et al. (ICLR 2018)

NMT challenges

  • Vocabulary:
    • What is the smallest unit that is not tied to a particular language, can cover as many concepts as possible, and still lets us generate sentences quickly?
    • ex: Models accept tokens of various forms, such as words/morphemes, BPE subwords, or characters (a small BPE sketch follows this list).
  • Architecture:
    • Faster
    • Stronger
    • Smaller
  • Optimization
  • Evaluation
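As a concrete example of the vocabulary question above, here is a tiny sketch of learning BPE merges on a toy corpus, following the well-known algorithm from Sennrich et al. (2016). The toy word counts are made up; production systems run tools such as subword-nmt or SentencePiece over millions of sentences.

    import re, collections

    def get_stats(vocab):
        """Count the frequency of each adjacent symbol pair."""
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[symbols[i], symbols[i + 1]] += freq
        return pairs

    def merge_vocab(pair, vocab):
        """Merge the most frequent pair into a single symbol everywhere."""
        bigram = re.escape(' '.join(pair))
        pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

    # Toy word-frequency vocabulary: characters separated by spaces, '</w>' marks word end.
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
    for _ in range(5):
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        print('merged:', best)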

MT papers at NIPS 2017

New architectures: The Transformer

  • Motivation: Can we sidestep the sequential nature of training and compute in a more time- and space-efficient way?
  • Method: Multi-head self-attention, residual connections, sinusoidal positional encoding, Adam with warm-up, label-smoothed cross-entropy regularization, …
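Two of the ingredients above fit in a few lines of numpy: sinusoidal positional encodings and single-head scaled dot-product self-attention. Omitting the learned Q/K/V projections and using toy dimensions are simplifying assumptions for illustration.

    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sine
        pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cosine
        return pe

    def self_attention(x):
        # In the real model, Q, K, V come from learned projections of x.
        q, k, v = x, x, x
        scores = q @ k.T / np.sqrt(x.shape[-1])          # scaled dot products
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
        return weights @ v

    x = np.random.randn(6, 16) + positional_encoding(6, 16)  # 6 tokens, d_model = 16
    print(self_attention(x).shape)                            # (6, 16)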

New architectures: The Transformer

  • Training time: 12 hrs (base model) / 3.5 days (big model) with 8 x NVIDIA P100 (en-de)
  • SOTA on WMT translation task (en-de, en-fr)

New decoders: Deliberation Networks

  • Motivation: Isn't one-pass decoding a bit lacking? When writing, reading, or translating, people go over the same sentence several times.
  • Method: Use two levels of decoders (a toy sketch follows below)
    1. generate a raw draft sequence
    2. polish and refine it through deliberation
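A hedged toy sketch of the two-pass idea in PyTorch: a first decoder drafts, a second decoder refines while also seeing the draft. The GRU stand-ins, the shared toy embedding, and mean-pooling the draft states (instead of the attention over the first-pass output used in the paper) are all simplifying assumptions.

    import torch
    import torch.nn as nn

    class ToyDeliberationNMT(nn.Module):
        def __init__(self, vocab=1000, dim=64):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder1 = nn.GRU(dim, dim, batch_first=True)       # pass 1: raw draft
            self.decoder2 = nn.GRU(2 * dim, dim, batch_first=True)   # pass 2: polish
            self.out1 = nn.Linear(dim, vocab)
            self.out2 = nn.Linear(dim, vocab)

        def forward(self, src, tgt_in):
            tgt_emb = self.emb(tgt_in)
            _, h = self.encoder(self.emb(src))             # encode the source
            d1_states, _ = self.decoder1(tgt_emb, h)       # pass 1: generate raw draft states
            draft_logits = self.out1(d1_states)
            # Pass 2: refine, conditioning on both the source and the draft.
            # (Mean-pooling the draft is a stand-in for attending over it.)
            draft_ctx = d1_states.mean(dim=1, keepdim=True).expand_as(tgt_emb)
            d2_states, _ = self.decoder2(torch.cat([tgt_emb, draft_ctx], dim=-1), h)
            return draft_logits, self.out2(d2_states)

    model = ToyDeliberationNMT()
    src = torch.randint(0, 1000, (2, 7))      # a batch of 2 source sentences
    tgt_in = torch.randint(0, 1000, (2, 9))   # shifted target inputs
    draft_logits, refined_logits = model(src, tgt_in)   # both are (2, 9, 1000)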

New decoders: Deliberation Networks

  • Real SOTA…?

New decoders: Decoding with Value Networks

  • Objective: Overcome the myopic bias of beam search and thereby improve decoding
  • Model: Use a recurrent BLEU prediction network (a value network)
    • The value function is estimated with a pairwise ranking loss (instead of an MSE loss)
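A minimal sketch of the pairwise ranking idea behind that training signal: the value network should score the candidate that eventually reaches higher BLEU above the lower-BLEU one. The margin form below is an assumption for illustration; the paper's exact formulation may differ in details.

    import torch
    import torch.nn.functional as F

    def pairwise_ranking_loss(v_better, v_worse, margin=0.1):
        """v_better / v_worse: predicted values for the higher- / lower-BLEU candidate."""
        return F.relu(margin - (v_better - v_worse)).mean()

    # Toy predicted values for 4 candidate pairs drawn from beam-search prefixes.
    v_better = torch.tensor([0.8, 0.6, 0.7, 0.9])
    v_worse = torch.tensor([0.5, 0.7, 0.2, 0.4])
    print(pairwise_ranking_loss(v_better, v_worse))

At decoding time the predicted value is combined with the model's log-probability to rank beam candidates, which is what counteracts the myopic (greedy) behavior of plain beam search.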

Efficient computation: SVD-Softmax

  • Requires only about 20% of the arithmetic operations of the full softmax (for an 800K-word vocabulary)
  • More than a three-fold speedup on a GPU
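A numpy sketch of where those savings come from: factor the output projection with an SVD, compute cheap low-rank "preview" logits for every word, then redo the full dot product only for a shortlist of top candidates. The dimensions, preview width, and shortlist size below are toy choices, not the paper's settings.

    import numpy as np

    V, D = 8000, 256            # vocabulary size, hidden size (toy values)
    W = np.random.randn(V, D)   # output projection
    h = np.random.randn(D)      # decoder hidden state at one step

    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U * S                   # (V, D): rows of W expressed in the SVD basis
    h_tilde = Vt @ h            # rotate the hidden state once per step

    W_width, N_top = 64, 500    # preview width and number of refined candidates
    preview = B[:, :W_width] @ h_tilde[:W_width]      # cheap approximate logits for all words
    top = np.argpartition(-preview, N_top)[:N_top]    # shortlist of likely words
    refined = preview.copy()
    refined[top] = B[top] @ h_tilde                   # exact logits only for the shortlist

    probs = np.exp(refined - refined.max())
    probs /= probs.sum()                              # approximate softmax distribution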

External memory: Unbounded cache model

  • A large-scale non-parametric memory component (storing millions of items) that keeps all hidden activations/representations seen in the past
  • Equipped with efficient search
    • approximate nearest neighbor search
    • quantization
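A small numpy sketch of the cache idea: store (hidden state, next word) pairs seen so far, retrieve the k nearest stored states for the current hidden state, and turn them into a word distribution interpolated with the model's own prediction. The brute-force search, dot-product similarity, and interpolation weight below are illustrative assumptions; the real system uses approximate nearest neighbors and quantization to stay efficient.

    import numpy as np

    def cache_distribution(h, cache_h, cache_w, vocab_size, k=4, temp=1.0):
        sims = cache_h @ h                              # similarity to all stored states
        nearest = np.argsort(-sims)[:k]                 # brute-force k nearest neighbors
        weights = np.exp(sims[nearest] / temp)
        weights /= weights.sum()
        p = np.zeros(vocab_size)
        for w, wt in zip(cache_w[nearest], weights):    # kernel-density style estimate
            p[w] += wt
        return p

    vocab_size, dim = 1000, 32
    cache_h = np.random.randn(5000, dim)                # past hidden states
    cache_w = np.random.randint(0, vocab_size, 5000)    # the words that followed them
    h = np.random.randn(dim)

    p_model = np.full(vocab_size, 1.0 / vocab_size)     # stand-in for the model's distribution
    lam = 0.2                                           # interpolation weight (assumption)
    p = (1 - lam) * p_model + lam * cache_distribution(h, cache_h, cache_w, vocab_size)
    print(p.sum())                                      # ~1.0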

Evaluation: Babble Labble (demo)

And more…

Most surprising session

“Factorized deep retrieval” by Ed Chi (Google)

Most amusing session

“Improvised comedy” @ Machine Learning for Creativity and Design Workshop

Personal moment #1

Mentoring sessions at @WiML

With Joelle Pineau

cf. Black in AI was great, too!

Personal moment #2

Morning run with @hardmaru at Long Beach

Personal moment #3

Papago product placement (PPL) w/ The “Unbreakable” Cho

Impressions of NIPS 2017 in short

  • HUGE…… and intense (gotta work out)
  • Diverse participants, variety of topics
  • Networking is fun!

Opportunities at Papago

…are open at all times!

Interested in foreign languages?
Interested in lowering language barriers?

Send your CV to [email protected]!