We use Python 3 in this tutorial, but provide minimal guidelines for Python 2.

Terminology

English          한국어     Description
---------------  --------   -----------------------------------------------------------
Document         문서       -
Corpus           말뭉치     A set of documents
Token            토큰       A meaningful element in a text, such as a word, phrase, or symbol
Morpheme         형태소     The smallest meaningful unit in a language
POS              품사       Part-of-speech (ex: noun)
Classification   분류       A supervised learning task where $X$ and $y$ are given, and $y$ is a set of discrete classes
Clustering       군집화     An unsupervised learning task where only $X$ is given

Text analysis process

  1. Load text
  2. Tokenize text (ex: stemming, morphological analysis)
  3. Tag tokens (ex: POS, NER)
  4. Select tokens (features) and/or filter/rank them (ex: stopword removal, TF-IDF)
  5. ...and so on (ex: calculate word/document similarities, cluster documents)
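
As a preview, here is a minimal sketch of steps 1-4 using NLTK (the function choices are illustrative, and the relevant NLTK data packages are assumed to be downloaded; each step is covered in detail below):

    from nltk import word_tokenize, pos_tag
    from nltk.corpus import stopwords

    text = open('some_file.txt').read()         # 1. Load text
    tokens = word_tokenize(text)                # 2. Tokenize
    tagged = pos_tag(tokens)                    # 3. Tag tokens (POS)
    stops = set(stopwords.words('english'))
    words = [w for w, tag in tagged if w.lower() not in stops]  # 4. Filter stopwords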

Python Packages for Text Mining and NLP

...that we use in this tutorial.

  1. NLTK: Provides modules for text analysis (mostly language independent)

    pip install nltk
    
  2. KoNLPy: Provides modules for Korean text analysis

    pip install konlpy
    
  3. Gensim: Provides modules for topic modeling and calculating similarities among documents

    pip install -U gensim
    
  4. Twython: Provides easy access to the Twitter API

    pip install twython
    
    • Example: Getting "Samsung (삼성)" related tweets
      from twython import Twython
      import settings as s    # Create a file named settings.py, and put oauth KEY values inside
      twitter = Twython(s.APP_KEY, s.APP_SECRET, s.OAUTH_TOKEN, s.OAUTH_TOKEN_SECRET)
      tweets = twitter.search(q='삼성', count=100)
      data = [(t['user']['screen_name'], t['text'], t['created_at']) for t in tweets['statuses']]
      

Text exploration

1. Read document

As example documents, we select Jane Austen's Emma for English, and the Korea National Assembly's bill number 1809890 for Korean. Alternatively, you can use a document of your own with open('some_file.txt').read().

  • English

    from nltk.corpus import gutenberg   # Docs from project gutenberg.org
    files_en = gutenberg.fileids()      # Get file ids
    doc_en = gutenberg.open('austen-emma.txt').read()
    
  • Korean

    from konlpy.corpus import kobill    # Docs from pokr.kr/bill
    files_ko = kobill.fileids()         # Get file ids
    doc_ko = kobill.open('1809890.txt').read()
    

2. Tokenize

There are numerous ways to tokenize a document.

Here, we use nltk.regexp_tokenize for English, and konlpy.tag.Twitter.morphs for Korean.

  • English

    from nltk import regexp_tokenize
    pattern = r'''(?x) ([A-Z]\.)+ | \w+(-\w+)* | \$?\d+(\.\d+)?%? | \.\.\. | [][.,;"'?():-_`]'''
    tokens_en = regexp_tokenize(doc_en, pattern)
    
  • Korean

    from konlpy.tag import Twitter; t = Twitter()
    tokens_ko = t.morphs(doc_ko)
    
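A quick, illustrative sanity check of the results before moving on:

    print(tokens_en[:10])   # first few English tokens
    print(tokens_ko[:10])   # first few Korean tokens
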

3. Load tokens with nltk.Text()

  • English

    import nltk
    en = nltk.Text(tokens_en)
    
  • Korean

    import nltk
    ko = nltk.Text(tokens_ko, name='대한민국 국회 의안 제 1809890호')   # For Python 2, input `name` as u'유니코드'
    

nltk.Text() is a convenient wrapper for exploring a document. Note for Python 2: `name`, and all Korean text in the examples below, must be passed as Unicode strings (i.e., u'유니코드').

  1. Tokens

    • English

      print(len(en.tokens))       # returns number of tokens (document length)
      print(len(set(en.tokens)))  # returns number of unique tokens
      en.vocab()                  # returns frequency distribution
      

      191061
      7927
      FreqDist({',': 12018, '.': 8853, 'to': 5127, 'the': 4844, 'and': 4653, 'of': 4278, '"': 4187, 'I': 3177, 'a': 3000, 'was': 2385, ...})
      

    • Korean

      print(len(ko.tokens))       # returns number of tokens (document length)
      print(len(set(ko.tokens)))  # returns number of unique tokens
      ko.vocab()                  # returns frequency distribution
      

      1707
      476
      FreqDist({'.': 61, '의': 46, '육아휴직': 38, '을': 34, '(': 27, ',': 26, '이': 26, ')': 26, '에': 24, '자': 24, ...})
      

  2. Plot frequency distributions

    • English

      en.plot(50)     # Plot sorted frequency of top 50 tokens
      

    • Korean

      ko.plot(50)     # Plot sorted frequency of top 50 tokens
      

    Tip: To save a plot programmatically, rather than through the GUI, overwrite pylab.show with pylab.savefig before drawing the plot (reference):

    from matplotlib import pylab
    pylab.show = lambda: pylab.savefig('some_filename.png')
    

    Troubleshooting: If you see rectangles instead of letters in the saved plot file, the default font lacks Korean glyphs; include the following configuration before drawing the plot:

    from matplotlib import font_manager, rc
    font_fname = 'c:/windows/fonts/gulim.ttc'     # A font of your choice
    font_name = font_manager.FontProperties(fname=font_fname).get_name()
    rc('font', family=font_name)
    

    Some example fonts:

    • Mac OS: /Library/Fonts/AppleGothic.ttf
  3. Count

    • English

      en.count('Emma')        # Counts occurrences
      

      865
      

    • Korean

      ko.count('초등학교')   # Counts occurrences
      

      6
      

  4. Dispersion plot

    • English

      en.dispersion_plot(['Emma', 'Frank', 'Jane'])
      

    • Korean

      ko.dispersion_plot(['육아휴직', '초등학교', '공무원'])
      

  5. Concordance

    • English

      en.concordance('Emma', lines=5)
      

      Displaying 5 of 865 matches:
                                           Emma by Jane Austen 1816 ] VOLUME I CHAPT
                                           Emma Woodhouse , handsome , clever , and 
      both daughters , but particularly of Emma . Between them it was more the int
       friend very mutually attached , and Emma doing just what she liked ; highly e
      r own . The real evils , indeed , of Emma ' s situation were the power of havi
      

    • Korean (or, use konlpy.utils.concordance)

      ko.concordance('초등학교')
      

      Displaying 6 of 6 matches:
       ․ 김정훈 김학송 의원 ( 10 인 ) 제안 이유 및 주요 내용 초등학교 저학년 의 경우 에도 부모 의 따뜻한 사랑 과 보살핌 이 필요 한
       을 할 수 있는 자녀 의 나이 는 만 6 세 이하 로 되어 있어 초등학교 저학년 인 자녀 를 돌보기 위해서 는 해당 부모님 은 일자리 를 
       다 . 제 63 조제 2 항제 4 호 중 “ 만 6 세 이하 의 초등학교 취학 전 자녀 를 ” 을 “ 만 8 세 이하 ( 취학 중인 경우 
       전 자녀 를 ” 을 “ 만 8 세 이하 ( 취학 중인 경우 에는 초등학교 2 학년 이하 를 말한 다 ) 의 자녀 를 ” 로 한 다 . 부 
       . ∼ 3 . ( 현행 과 같 음 ) 4 . 만 6 세 이하 의 초등학교 취 4 . 만 8 세 이하 ( 취학 중인 경우 학 전 자녀 를 양
      세 이하 ( 취학 중인 경우 학 전 자녀 를 양육 하기 위하 에는 초등학교 2 학년 이하 를 여 필요하거 나 여자 공무원 이 말한 다 ) 의
      

  6. Find similar words

    • English

      en.similar('Emma')
      en.similar('Frank')
      

      she it he i harriet you her jane him that me and all they them there herself was hartfield be
      mr mrs emma harriet you it her she he him hartfield them jane that isabella all herself look i me
      

    • Korean

      ko.similar('자녀')
      ko.similar('육아휴직')
      

      논의
      None
      

  7. Collocations

    • English

      en.collocations()
      

      Frank Churchill; Miss Woodhouse; Miss Bates; Jane Fairfax; Miss
      Fairfax; every thing; young man; every body; great deal; dare say;
      John Knightley; Maple Grove; Miss Smith; Miss Taylor; Robert Martin;
      Colonel Campbell; Box Hill; said Emma; Harriet Smith; William Larkins
      

    • Korean

      ko.collocations()
      

      초등학교 저학년; 육아휴직 대상
      
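These exploration methods also compose well; for instance, an illustrative follow-up that filters out the punctuation dominating en.vocab() above (assuming NLTK 3's FreqDist.most_common):

    fd = en.vocab()   # the FreqDist from above
    print([w for w, _ in fd.most_common(50) if w.isalpha()][:10])
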

For more information on nltk.Text(), see the source code or API.

Tagging and chunking

Until now, we used segmented text, namely tokens, to explore our sample documents. Now let's classify words into given classes, namely part-of-speech tags, and chunk the text into larger pieces.

1. POS tagging

There are numerous ways of tagging a text. Among them, the most widely used and most developed form of tagging is arguably POS tagging.

Since a whole document is too long for observing a parsed structure, let's use one short sentence for each language.

  • English

    tokens = "The little yellow dog barked at the Persian cat".split()
    tags_en = nltk.pos_tag(tokens)
    

    [('The', 'DT'),
     ('little', 'JJ'),
     ('yellow', 'NN'),
     ('dog', 'NN'),
     ('barked', 'VBD'),
     ('at', 'IN'),
     ('the', 'DT'),
     ('Persian', 'NNP'),
     ('cat', 'NN')]
    

  • Korean

    from konlpy.tag import Twitter; t = Twitter()
    tags_ko = t.pos("작고 노란 강아지가 페르시안 고양이에게 짖었다")
    

    [('작고', 'Noun'),
     ('노란', 'Adjective'),
     ('강아지', 'Noun'),
     ('가', 'Josa'),
     ('페르시안', 'Noun'),
     ('고양이', 'Noun'),
     ('에게', 'Josa'),
     ('짖었', 'Noun'),
     ('다', 'Josa')]
    
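Note that the toy tagger segments the last word imperfectly ('짖었' is tagged as a Noun above). Twitter.pos() also accepts norm and stem options (used later in the topic-modeling section), which may yield cleaner tags; an illustrative call:

    t.pos("작고 노란 강아지가 페르시안 고양이에게 짖었다", norm=True, stem=True)
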

2. Noun phrase chunking

nltk.RegexpParser() is a great way to start chunking.

  • English

    parser_en = nltk.RegexpParser("NP: {<DT>?<JJ>?<NN.*>*}")
    chunks_en = parser_en.parse(tags_en)
    chunks_en.draw()
    

  • Korean

    parser_ko = nltk.RegexpParser("NP: {<Adjective>*<Noun>*}")
    chunks_ko = parser_ko.parse(tags_ko)
    chunks_ko.draw()
    
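draw() opens a GUI window. To extract the noun phrases as plain text instead, a minimal sketch over NLTK's Tree API (assuming NLTK 3):

    for subtree in chunks_en.subtrees(filter=lambda t: t.label() == 'NP'):
        print(' '.join(word for word, tag in subtree.leaves()))
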

For more information on chunking, refer to Extracting Information from Text for English, and Chunking for Korean.

Topic modeling

  • Topic modeling in a nutshell
  • History

    • LSI: Learns latent topics by performing a matrix decomposition (SVD) on the term-document matrix (a toy SVD sketch follows this list)
    • LDA: A generative probabilistic model that assumes a Dirichlet prior over the latent topics
    • HDP: A natural nonparametric generalization of LDA, where the number of topics can be unbounded and learned from data
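
To make the LSI description concrete, here is a toy SVD sketch on a tiny, made-up term-document matrix with numpy (illustrative numbers only):

    import numpy as np

    X = np.array([[1, 0, 2],                 # toy term-document matrix:
                  [0, 1, 1],                 # rows = terms, columns = documents
                  [3, 0, 0]])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2                                    # number of latent topics to keep
    topic_docs = np.diag(s[:k]).dot(Vt[:k])  # documents in k-dim topic space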

1. Preprocessing

  1. Load documents

    • English

      from nltk.corpus import reuters
      docs_en = [reuters.words(i) for i in reuters.fileids()]
      
    • Korean

      from konlpy.corpus import kobill
      docs_ko = [kobill.open(i).read() for i in kobill.fileids()]
      
  2. Tokenize

    • English

      texts_en = docs_en # because we loaded tokenized documents in step 1
      print(texts_en[0])
      

      ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', ...]
      

    • Korean

      from konlpy.tag import Twitter; t = Twitter()
      pos = lambda d: ['/'.join(p) for p in t.pos(d, stem=True, norm=True)]
      texts_ko = [pos(doc) for doc in docs_ko]
      print(texts_ko[0])
      

      ['지방공무원법/Noun', '일부/Noun', '개정/Noun', '법률/Noun', '안/Noun', '(/Punctuation', '정의화/Noun', '의원/Noun', ...]
      

  3. Encode tokens to integers

    • English

      from gensim import corpora
      dictionary_en = corpora.Dictionary(texts_en)
      dictionary_en.save('en.dict')  # save dictionary to file for future use
      
    • Korean

      from gensim import corpora
      dictionary_ko = corpora.Dictionary(texts_ko)
      dictionary_ko.save('ko.dict')  # save dictionary to file for future use
      
  4. Calculate TF-IDF

    • English

      from gensim import models
      tf_en = [dictionary_en.doc2bow(text) for text in texts_en]
      tfidf_model_en = models.TfidfModel(tf_en)
      tfidf_en = tfidf_model_en[tf_en]
      corpora.MmCorpus.serialize('en.mm', tfidf_en) # save corpus for future use (reloading is sketched below)
      
      # print first 10 elements of the first document's raw term-frequency (bow) vector
      print(tfidf_en.corpus[0][:10])
      # print the top 10 elements of the same vector
      print(sorted(tfidf_en.corpus[0], key=lambda x: x[1], reverse=True)[:10])
      # print token of most frequent element
      print(dictionary_en.get(9))
      

      [(0, 7), (1, 3), (2, 13), (3, 2), (4, 1), (5, 1), (6, 20), (7, 6), (8, 10), (9, 62)]
      [(9, 62), (363, 32), (276, 30), (371, 26), (6, 20), (96, 19), (112, 19), (326, 16), (118, 14), (2, 13)]
      '.'
      

    • Korean

      from gensim import models
      tf_ko = [dictionary_ko.doc2bow(text) for text in texts_ko]
      tfidf_model_ko = models.TfidfModel(tf_ko)
      tfidf_ko = tfidf_model_ko[tf_ko]
      corpora.MmCorpus.serialize('ko.mm', tfidf_ko) # save corpus to file for future use
      
      # print first 10 elements of the first document's raw term-frequency (bow) vector
      print(tfidf_ko.corpus[0][:10])
      # print the top 10 elements of the same vector
      print(sorted(tfidf_ko.corpus[0], key=lambda x: x[1], reverse=True)[:10])
      # print token of most frequent element
      print(dictionary_ko.get(414))
      

      [(0, 10), (1, 27), (2, 1), (3, 26), (4, 3), (5, 26), (6, 4), (7, 2), (8, 1), (9, 1)]
      [(414, 71), (14, 61), (309, 38), (314, 38), (313, 28), (1, 27), (3, 26), (5, 26), (353, 22), (13, 21)]
      '하다/Verb'
      
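The dictionary and corpus files saved in steps 3 and 4 can be reloaded in a later session instead of being recomputed; a minimal sketch with the file names used above:

    from gensim import corpora
    dictionary_ko = corpora.Dictionary.load('ko.dict')
    tfidf_ko = corpora.MmCorpus('ko.mm')
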

2. Train topic models

  1. LSI

    • English

      ntopics, nwords = 3, 5
      lsi_en = models.lsimodel.LsiModel(tfidf_en, id2word=dictionary_en, num_topics=ntopics)
      print(lsi_en.print_topics(num_topics=ntopics, num_words=nwords))
      

      ['0.509*"vs" + 0.272*"000" + 0.258*"cts" + 0.243*"loss" + 0.238*"mln"',
      '-0.294*"the" + 0.237*"vs" + -0.176*"to" + -0.148*"in" + -0.137*"pct"',
      '0.331*"Record" + 0.316*"div" + 0.312*"Pay" + 0.303*"Qtly" + 0.268*"prior"']
      

    • Korean

      ntopics, nwords = 3, 5
      lsi_ko = models.lsimodel.LsiModel(tfidf_ko, id2word=dictionary_ko, num_topics=ntopics)
      print(lsi_ko.print_topics(num_topics=ntopics, num_words=nwords))
      

      ['0.518*"육아휴직/Noun" + 0.257*"만/Noun" + 0.227*"×/Foreign" + 0.214*"대체/Noun" + 0.201*"고용/Noun"',
       '0.449*"파견/Noun" + 0.412*"부대/Noun" + 0.267*"UAE/Alpha" + 0.243*"○/Foreign" + 0.192*"국군/Noun"',
       '0.326*"결혼/Noun" + 0.315*"예고/Noun" + 0.285*"손해/Noun" + 0.205*"ㆍ/Foreign" + 0.197*"원사/Noun"']
      

  2. LDA

    • English

      import numpy as np; np.random.seed(42)  # optional
      lda_en = models.ldamodel.LdaModel(tfidf_en, id2word=dictionary_en, num_topics=ntopics)
      print(lda_en.print_topics(num_topics=ntopics, num_words=nwords))
      

      ['0.005*the + 0.003*to + 0.003*pct + 0.002*of + 0.002*said',
       '0.005*cts + 0.005*Record + 0.005*div + 0.004*Pay + 0.004*Qtly',
       '0.010*vs + 0.006*mln + 0.006*000 + 0.005*loss + 0.004*cts']
      

    • Korean

      import numpy as np; np.random.seed(42)  # optional
      lda_ko = models.ldamodel.LdaModel(tfidf_ko, id2word=dictionary_ko, num_topics=ntopics)
      print(lda_ko.print_topics(num_topics=ntopics, num_words=nwords))
      

      ['0.001*학위/Noun + 0.001*파견/Noun + 0.001*손해/Noun + 0.001*간호/Noun + 0.001*소말리아/Noun',
       '0.002*파견/Noun + 0.002*부대/Noun + 0.001*UAE/Alpha + 0.001*손해/Noun + 0.001*○/Foreign',
       '0.003*육아휴직/Noun + 0.002*만/Noun + 0.002*×/Foreign + 0.002*대체/Noun + 0.002*고용/Noun']
      

  3. HDP

    • English

      import numpy as np; np.random.seed(42)  # optional
      hdp_en = models.hdpmodel.HdpModel(tfidf_en, id2word=dictionary_en)
      print(hdp_en.print_topics(topics=ntopics, topn=nwords))
      

      ['topic 0: 0.005*the + 0.003*to + 0.002*in + 0.002*a + 0.002*of',
       'topic 1: 0.008*vs + 0.005*000 + 0.004*loss + 0.004*mln + 0.004*cts',
       'topic 2: 0.001*the + 0.001*vs + 0.001*in + 0.001*to + 0.001*mln']
      

    • Korean

      import numpy as np; np.random.seed(42)  # optional
      hdp_ko = models.hdpmodel.HdpModel(tfidf_ko, id2word=dictionary_ko)
      print(hdp_ko.print_topics(topics=ntopics, topn=nwords))
      

      ['topic 0: 0.004*소집/Noun + 0.004*도/Josa + 0.004*’/Foreign + 0.004*「/Foreign + 0.004*9892/Number',
       'topic 1: 0.004*이애주/Noun + 0.004*年/Foreign + 0.004*意思/Foreign + 0.004*마찰/Noun + 0.004*고려/Noun',
       'topic 2: 0.005*명시/Noun + 0.004*영업정지/Noun + 0.004*세로/Noun + 0.004*중개업/Noun + 0.004*다양하다/Adjective']
      

3. Scoring documents

  • English

    bow = tfidf_model_en[dictionary_en.doc2bow(texts_en[0])]
    sorted(lsi_en[bow], key=lambda x: x[1], reverse=True)
    sorted(lda_en[bow], key=lambda x: x[1], reverse=True)
    sorted(hdp_en[bow], key=lambda x: x[1], reverse=True)
    

    [(0, 0.1336800876240628),
     (2, -0.030832981664564624),
     (1, -0.39895210562646022)]

    [(2, 0.84087091284115845), (0, 0.13882114432084294), (1, 0.020307942837998694)]

    [(0, 0.95369717052959579)]

    bow = tfidf_model_en[dictionary_en.doc2bow(texts_en[1])]
    sorted(lsi_en[bow], key=lambda x: x[1], reverse=True)
    sorted(lda_en[bow], key=lambda x: x[1], reverse=True)
    sorted(hdp_en[bow], key=lambda x: x[1], reverse=True)
    

    [(0, 0.072924758682943097),
     (2, -0.0029545572070390153),
     (1, -0.13195370933374836)]

    [(0, 0.62957273636869904), (2, 0.3270007771486681), (1, 0.043426486482632851)]

    [(0, 0.90574410236561731), (1, 0.010409702375525492)]

  • Korean

    bow = tfidf_model_ko[dictionary_ko.doc2bow(texts_ko[0])]
    sorted(lsi_ko[bow], key=lambda x: x[1], reverse=True)
    sorted(lda_ko[bow], key=lambda x: x[1], reverse=True)
    sorted(hdp_ko[bow], key=lambda x: x[1], reverse=True)
    

    [(0, 0.97829017893328929),
     (1, -0.016909513239922121),
     (2, -0.020121561014425089)]

    [(2, 0.93880436704581616), (0, 0.030626827732744354), (1, 0.030568805221439507)]

    [(0, 0.94848723192042672), (1, 0.014364056233061516), (2, 0.010285449586192942)]

    bow = tfidf_model_ko[dictionary_ko.doc2bow(texts_ko[1])]
    sorted(lsi_ko[bow], key=lambda x: x[1], reverse=True)
    sorted(lda_ko[bow], key=lambda x: x[1], reverse=True)
    sorted(hdp_ko[bow], key=lambda x: x[1], reverse=True)
    

    [(0, 0.97829017893328929),
     (1, -0.016909513239922121),
     (2, -0.020121561014425089)]

    [(2, 0.93881674048370278), (0, 0.0306176131467021), (1, 0.030565646369595065)]

    [(0, 0.94848723192042672), (1, 0.014364056233061516), (2, 0.010285449586192942)]
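
With documents mapped into topic space, gensim can also rank documents by similarity, one of its advertised uses (see the package list above). A minimal sketch with the English LSI model, following gensim's similarity-query pattern:

    from gensim import similarities
    index = similarities.MatrixSimilarity(lsi_en[tfidf_en])        # index all documents in LSI space
    bow_en = tfidf_model_en[dictionary_en.doc2bow(texts_en[0])]
    sims = index[lsi_en[bow_en]]                                   # similarity of document 0 to every document
    print(sorted(enumerate(sims), key=lambda x: x[1], reverse=True)[:5])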

Confident with topic modeling? Try a bigger dataset: Experiments on the English Wikipedia

Word embedding

  • Objective: Learn feature vectors from documents
    • Text is normally represented with one-hot encodings plus hand-crafted features
    • Ex: [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
  • Word embedding: A set of unsupervised feature learning techniques where words are mapped to n-dimensional vectors of real numbers (a continuous space)
    • Use local context to get a more syntactic or semantic representation
    • Ex: v("cat") = [0.2, -0.4, ..., 0.7], v("mat") = [-0.0, -0.2, ..., -0.1]
  • Approaches
    • Neural networks (Bengio et al., 2001, Mikolov et al., 2013)
    • Dimensionality reduction (Lebret et al., 2013)

word2vec (Mikolov et al., 2013)

  • A neural network based embedding method for learning distributed vector representations of words
    • No hidden layers!
  • "an optimized single-machine can train 100B+ words in one day"
  • CBOW & Skip-gram: Two ways of creating the "task" for the neural network
  • Characteristics
    • Places similar words next to each other in a vector space
    • Places similar relations in parallel (preserve linguistic regularities)
      • ex: France: Paris = Germany: Berlin != Italy: Madrid
    • Linguistic regularities (expressed as a gensim query after this list)
      • v(KING) – v(MAN) + v(WOMAN) = v(QUEEN)
      • v(KINGS) – v(KING) + v(QUEEN) = v(QUEENS)
      • v(MADRID) – v(SPAIN) + v(FRANCE) = v(PARIS)
  • Applications
    • Machine translation (Socher et al., 2013)
    • Jointly embedding images and text (Frome et al., 2013, link)
  • Some good references to begin with in case you are interested:
    • http://radimrehurek.com/2014/02/word2vec-tutorial/
    • http://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
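
The regularities above translate directly into gensim's API; an illustrative analogy query, assuming a trained model wv_model_en (built in the toy problem below) whose vocabulary contains these words:

    wv_model_en.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)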

Let's go for it.

word2vec toy problem

  1. Load documents

    • English

      from nltk.corpus import reuters
      docs_en = [reuters.words(i) for i in reuters.fileids()]
      
    • Korean

      from konlpy.corpus import kobill
      docs_ko = [kobill.open(i).read() for i in kobill.fileids()]
      
  2. Tokenize

    • English

      texts_en = docs_en # because we loaded tokenized documents in step 1
      
    • Korean

      from konlpy.tag import Twitter; t = Twitter()
      pos = lambda d: ['/'.join(p) for p in t.pos(d)]
      texts_ko = [pos(doc) for doc in docs_ko]
      
  3. Train

    • English

      from gensim.models import word2vec
      wv_model_en = word2vec.Word2Vec(texts_en)
      wv_model_en.init_sims(replace=True)
      wv_model_en.save('en_word2vec.model')
      
    • Korean

      from gensim.models import word2vec
      wv_model_ko = word2vec.Word2Vec(texts_ko)
      wv_model_ko.init_sims(replace=True)
      wv_model_ko.save('ko_word2vec.model')
      
  4. Test

    • English

      wv_model_en.most_similar('president')
      wv_model_en.most_similar('secretary')
      wv_model_en.most_similar('country')
      

      [('chairman', 0.8655247688293457),
       ('vice', 0.8160154819488525),
       ('executive', 0.8094440698623657),
       ('officer', 0.7894954085350037),
       ('Kjell', 0.7766541838645935),
       ('former', 0.7680522203445435),
       ('chief', 0.7660256028175354),
       ('Robert', 0.7623487114906311),
       ('director', 0.7434573173522949),
       ('Roger', 0.7231118679046631)]

      [('assistant', 0.8573123812675476), ('Carlos', 0.796258807182312), ('Daniel', 0.7900130748748779), ('undersecretary', 0.7888025045394897), ('representative', 0.7878221273422241), ('Deputy', 0.7847912311553955), ('NAWG', 0.7829214930534363), ('Republican', 0.7773356437683105), ('Greek', 0.7752739191055298), ('Papandreou', 0.7684933543205261)]

      [('kingdom', 0.8003361225128174), ('biggest', 0.765742301940918), ('island', 0.7639101147651672), ('founding', 0.7143765687942505), ('nation', 0.7080289125442505), ('fortunes', 0.7054018974304199), ('strength', 0.6875098943710327), ('challenging', 0.6863174438476562), ('actions', 0.6835225820541382), ('departure', 0.6834459900856018)]

    • Korean

      wv_model_ko.most_similar(pos('정부'))
      wv_model_ko.most_similar(pos('초등학교'))
      

      [('경비/Noun', 0.9357226490974426),
       ('선박/Noun', 0.9204540252685547),
       ('연장/Noun', 0.9183653593063354),
       ('임무/Noun', 0.9179578423500061),
       ('우리/Noun', 0.9015840291976929),
       ('목적/Noun', 0.8871368169784546),
       ('기타/Noun', 0.875058650970459),
       ('화/Suffix', 0.8669425249099731),
       ('해역/Noun', 0.8575668334960938),
       ('한국/Noun', 0.8549510836601257)]

      [('취학/Noun', 0.9686248898506165), ('중인/Noun', 0.9336546659469604), ('하더/Verb', 0.8985729217529297), ('정의화/Noun', 0.8843945860862732), ('김정훈/Noun', 0.8682949542999268), ('지방/Noun', 0.8677719831466675), ('조정함/Verb', 0.8617256879806519), ('44/Number', 0.8445801734924316), ('세/Noun', 0.8318654298782349), ('第/Foreign', 0.8222816586494446)]

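Beyond most_similar, a couple of illustrative follow-up queries against the same models with gensim's Word2Vec API:

    wv_model_en.similarity('president', 'chairman')               # cosine similarity of two words
    wv_model_en.doesnt_match(['president', 'chairman', 'cat'])    # pick the odd one out
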
word2vec in the real world

Not enough? Let's look at a real-life example.

  • Data source: Naver News & Naver blog
  • Questions
  • Matching pairs: 그/Noun:남자/Noun = 그녀/Noun:? (an illustrative query follows below)
  • Visualization
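
The matching-pairs question is again an analogy query; an illustrative sketch against the toy model from the previous section, provided these tokens are in its vocabulary (the results shown in the talk came from the much larger news/blog corpus):

    wv_model_ko.most_similar(positive=['남자/Noun', '그녀/Noun'], negative=['그/Noun'], topn=1)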