What’s up at NIPS 2017: From the perspective of a machine translation researcher

  • Lucy Park
  • Naver Papago
  • 2018-01-25 / 2018-02-02

Hi, I’m Lucy Park


  • A machine learning scientist at Naver Papago
  • Currently working on machine translation and user log analysis

Machine translation (MT)

  • The use of software to translate from one language to another, given an input sentence and a target language
  • Related services: Google Translate, GenieTalk, Systran, … and Papago!


Neural machine translation (NMT)

  • Many translation services now use NMT for production, including Papago


Phase 1: Collect a bunch of parallel corpora
ex: Orlando Bloom and Miranda Kerr still love each other <-> Orlando Bloom und Miranda Kerr lieben sich noch immer
Phase 2: Train an NMT algorithm (usually sequence-to-sequence; a toy training sketch follows below)
Phase 3: Profit!
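To make Phase 2 concrete, here is a minimal, hedged sketch of teacher-forced sequence-to-sequence training on one toy sentence pair, using a plain GRU encoder-decoder in PyTorch. The vocabulary sizes, dimensions, and the "id 0 = BOS" convention are illustrative assumptions; real systems add attention, subword vocabularies, and far more data.

    import torch
    import torch.nn as nn

    class ToySeq2Seq(nn.Module):
        """A bare-bones GRU encoder-decoder (no attention), for illustration only."""
        def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=64):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.proj = nn.Linear(dim, tgt_vocab)

        def forward(self, src, tgt_in):
            _, h = self.encoder(self.src_emb(src))          # encode the source sentence
            out, _ = self.decoder(self.tgt_emb(tgt_in), h)  # teacher-forced decoding
            return self.proj(out)                           # per-position logits

    model = ToySeq2Seq()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # One toy "parallel pair", already mapped to integer token ids (id 0 = BOS, an assumption).
    src = torch.randint(1, 1000, (1, 8))
    tgt = torch.randint(1, 1000, (1, 6))
    bos = torch.zeros(1, 1, dtype=torch.long)
    tgt_in, tgt_out = torch.cat([bos, tgt[:, :-1]], dim=1), tgt

    logits = model(src, tgt_in)                             # (1, 6, tgt_vocab)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    loss.backward()
    opt.step()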

Neural machine translation (NMT)

  • RNN-based: Sutskever et al. (NIPS 2014), Bahdanau et al. (ICLR 2015)
  • CNN-based: Gehring et al. (ACL 2017), Lee et al. (TACL 2017)
  • Attention-based: Vaswani et al. (NIPS 2017)
  • Unsupervised: Lample et al. (ICLR 2018), Artetxe et al. (ICLR 2018)

NMT challenges

  • Vocabulary:
    • What is the smallest unit that is not tied to a particular language, can cover as many concepts as possible, and still lets us generate sentences quickly?
    • ex: Models accept tokens of various forms, such as words/morphemes, BPE subwords, or characters (a small BPE sketch follows this list).
  • Architecture:
    • Faster
    • Stronger
    • Smaller
  • Optimization
  • Evaluation
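As a concrete example of the vocabulary question above, here is a tiny sketch of learning BPE merges on a toy corpus, following the well-known algorithm from Sennrich et al. (2016). The toy word counts are made up; production systems run tools such as subword-nmt or SentencePiece over millions of sentences.

    import re, collections

    def get_stats(vocab):
        """Count the frequency of each adjacent symbol pair."""
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[symbols[i], symbols[i + 1]] += freq
        return pairs

    def merge_vocab(pair, vocab):
        """Merge the most frequent pair into a single symbol everywhere."""
        bigram = re.escape(' '.join(pair))
        pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

    # Toy word-frequency vocabulary: characters separated by spaces, '</w>' marks word end.
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
    for _ in range(5):
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        print('merged:', best)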

MT papers at NIPS 2017

New architectures: The Transformer

  • Motivation: Can we sidestep the sequential nature of training and compute in a more time- and space-efficient way?
  • Method: Multi-head self-attention, residual connections, sinusoidal positional encoding, Adam with warm-up, label-smoothed cross-entropy regularization, …
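Two of the ingredients above fit in a few lines of numpy: sinusoidal positional encodings and single-head scaled dot-product self-attention. Omitting the learned Q/K/V projections and using toy dimensions are simplifying assumptions for illustration.

    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sine
        pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cosine
        return pe

    def self_attention(x):
        # In the real model, Q, K, V come from learned projections of x.
        q, k, v = x, x, x
        scores = q @ k.T / np.sqrt(x.shape[-1])          # scaled dot products
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
        return weights @ v

    x = np.random.randn(6, 16) + positional_encoding(6, 16)  # 6 tokens, d_model = 16
    print(self_attention(x).shape)                            # (6, 16)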

New architectures: The Transformer

  • Training time: 12 hrs (base model) / 3.5 days (big model) with 8 x NVIDIA P100 (en-de)
  • SOTA on WMT translation task (en-de, en-fr)

New decoders: Deliberation Networks

  • Motivation: Isn't one-pass decoding a bit lacking? When writing, reading, or translating, people go over the same sentence several times.
  • Method: Use two levels of decoders (a toy sketch follows below)
    1. generate a raw draft sequence
    2. polish and refine it through deliberation
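A hedged toy sketch of the two-pass idea in PyTorch: a first decoder drafts, a second decoder refines while also seeing the draft. The GRU stand-ins, the shared toy embedding, and mean-pooling the draft states (instead of the attention over the first-pass output used in the paper) are all simplifying assumptions.

    import torch
    import torch.nn as nn

    class ToyDeliberationNMT(nn.Module):
        def __init__(self, vocab=1000, dim=64):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder1 = nn.GRU(dim, dim, batch_first=True)       # pass 1: raw draft
            self.decoder2 = nn.GRU(2 * dim, dim, batch_first=True)   # pass 2: polish
            self.out1 = nn.Linear(dim, vocab)
            self.out2 = nn.Linear(dim, vocab)

        def forward(self, src, tgt_in):
            tgt_emb = self.emb(tgt_in)
            _, h = self.encoder(self.emb(src))             # encode the source
            d1_states, _ = self.decoder1(tgt_emb, h)       # pass 1: generate raw draft states
            draft_logits = self.out1(d1_states)
            # Pass 2: refine, conditioning on both the source and the draft.
            # (Mean-pooling the draft is a stand-in for attending over it.)
            draft_ctx = d1_states.mean(dim=1, keepdim=True).expand_as(tgt_emb)
            d2_states, _ = self.decoder2(torch.cat([tgt_emb, draft_ctx], dim=-1), h)
            return draft_logits, self.out2(d2_states)

    model = ToyDeliberationNMT()
    src = torch.randint(0, 1000, (2, 7))      # a batch of 2 source sentences
    tgt_in = torch.randint(0, 1000, (2, 9))   # shifted target inputs
    draft_logits, refined_logits = model(src, tgt_in)   # both are (2, 9, 1000)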

New decoders: Deliberation Networks

  • Real SOTA…?

New decoders: Decoding with Value Networks

  • Objective: Overcome the myopic bias of beam search and thereby improve decoding
  • Model: Use a recurrent BLEU prediction network (a value network)
    • The value function is estimated with a pairwise ranking loss (instead of an MSE loss)
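A minimal sketch of the pairwise ranking idea behind that training signal: the value network should score the candidate that eventually reaches higher BLEU above the lower-BLEU one. The margin form below is an assumption for illustration; the paper's exact formulation may differ in details.

    import torch
    import torch.nn.functional as F

    def pairwise_ranking_loss(v_better, v_worse, margin=0.1):
        """v_better / v_worse: predicted values for the higher- / lower-BLEU candidate."""
        return F.relu(margin - (v_better - v_worse)).mean()

    # Toy predicted values for 4 candidate pairs drawn from beam-search prefixes.
    v_better = torch.tensor([0.8, 0.6, 0.7, 0.9])
    v_worse = torch.tensor([0.5, 0.7, 0.2, 0.4])
    print(pairwise_ranking_loss(v_better, v_worse))

At decoding time the predicted value is combined with the model's log-probability to rank beam candidates, which is what counteracts the myopic (greedy) behavior of plain beam search.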

Efficient computation: SVD-Softmax

  • Requires only about 20% of the arithmetic operations of the full softmax (for an 800K-word vocabulary)
  • More than a three-fold speedup on a GPU
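A numpy sketch of where those savings come from: factor the output projection with an SVD, compute cheap low-rank "preview" logits for every word, then redo the full dot product only for a shortlist of top candidates. The dimensions, preview width, and shortlist size below are toy choices, not the paper's settings.

    import numpy as np

    V, D = 8000, 256            # vocabulary size, hidden size (toy values)
    W = np.random.randn(V, D)   # output projection
    h = np.random.randn(D)      # decoder hidden state at one step

    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U * S                   # (V, D): rows of W expressed in the SVD basis
    h_tilde = Vt @ h            # rotate the hidden state once per step

    W_width, N_top = 64, 500    # preview width and number of refined candidates
    preview = B[:, :W_width] @ h_tilde[:W_width]      # cheap approximate logits for all words
    top = np.argpartition(-preview, N_top)[:N_top]    # shortlist of likely words
    refined = preview.copy()
    refined[top] = B[top] @ h_tilde                   # exact logits only for the shortlist

    probs = np.exp(refined - refined.max())
    probs /= probs.sum()                              # approximate softmax distribution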

External memory: Unbounded cache model

  • A large-scale non-parametric memory component (storing millions of items) that keeps all hidden activations/representations seen in the past
  • Equipped with efficient search
    • approximate nearest neighbor search
    • quantization
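A small numpy sketch of the cache idea: store (hidden state, next word) pairs seen so far, retrieve the k nearest stored states for the current hidden state, and turn them into a word distribution interpolated with the model's own prediction. The brute-force search, dot-product similarity, and interpolation weight below are illustrative assumptions; the real system uses approximate nearest neighbors and quantization to stay efficient.

    import numpy as np

    def cache_distribution(h, cache_h, cache_w, vocab_size, k=4, temp=1.0):
        sims = cache_h @ h                              # similarity to all stored states
        nearest = np.argsort(-sims)[:k]                 # brute-force k nearest neighbors
        weights = np.exp(sims[nearest] / temp)
        weights /= weights.sum()
        p = np.zeros(vocab_size)
        for w, wt in zip(cache_w[nearest], weights):    # kernel-density style estimate
            p[w] += wt
        return p

    vocab_size, dim = 1000, 32
    cache_h = np.random.randn(5000, dim)                # past hidden states
    cache_w = np.random.randint(0, vocab_size, 5000)    # the words that followed them
    h = np.random.randn(dim)

    p_model = np.full(vocab_size, 1.0 / vocab_size)     # stand-in for the model's distribution
    lam = 0.2                                           # interpolation weight (assumption)
    p = (1 - lam) * p_model + lam * cache_distribution(h, cache_h, cache_w, vocab_size)
    print(p.sum())                                      # ~1.0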

Evaluation: Babble Labble (demo)

And more…

Most surprising session

“Factorized deep retrieval” by Ed Chi (Google)

Most amusing session

“Improvised comedy” @ Machine Learning for Creativity and Design Workshop

Personal moment #1

Mentoring sessions at @WiML

With Joelle Pineau

cf. Black in AI was great, too!

Personal moment #2

Morning run with @hardmaru at Long Beach

Personal moment #3

Papago product placement (PPL) w/ The “Unbreakable” Cho

Impressions of NIPS 2017 in short

  • HUGE…… and intense (gotta work out)
  • Diverse participants, variety of topics
  • Networking is fun!

Opportunities at Papago

…are open at all times!

Interested in foreign languages?
Interested in lowering language barriers?

Send your CV to [email protected]!