
Hello, this is takapy (@takapy0210).

This entry is a memo of my solutions to Chapter 4 of 言語処理100本ノック 2020 (NLP 100 Exercise 2020).

As usual, the code is on GitHub.

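Chapter 4: Morphological Analysis

Use MeCab to morphologically analyze the text of Natsume Soseki's novel 『吾輩は猫である』 ("I Am a Cat", neko.txt) and save the result to a file named neko.txt.mecab. Using this file, implement programs that answer the following questions.

For problems 37, 38, and 39, matplotlib or Gnuplot is recommended.

Before starting, I run morphological analysis on the .txt file and write the result to a .mecab file.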

# mecab INPUT -o OUTPUT runs morphological analysis on INPUT and writes the result to OUTPUT
mecab neko.txt -o neko.txt.mecab

The resulting neko.txt.mecab should look like this.

一  名詞,数,*,*,*,*,一,イチ,イチ
EOS
EOS
　 記号,空白,*,*,*,*,　,　,　
吾輩  名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
猫 名詞,一般,*,*,*,*,猫,ネコ,ネコ
で 助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ある  助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル
。 記号,句点,*,*,*,*,。,。,。
EOS
名前  名詞,一般,*,*,*,*,名前,ナマエ,ナマエ

See the following for details on MeCab.

taku910.github.io

I also use plotly for the visualizations in this entry.

plotly.com

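30. Reading the morphological analysis results

"""
Implement a program that reads the morphological analysis results (neko.txt.mecab).
Store each morpheme in a mapping type with the keys surface form (surface), base form (base), part of speech (pos), and part-of-speech subcategory 1 (pos1),
and represent each sentence as a list of morphemes (mapping types). Use the program written here for the remaining problems in Chapter 4.
"""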
def parse_morpheme(morpheme):
    (surface, attr) = morpheme.split('\t')
    attr = attr.split(',')
    morpheme_dict = {
        'surface': surface,
        'base': attr[6],
        'pos': attr[0],
        'pos1': attr[1]
    }
    return morpheme_dict

file = 'neko.txt.mecab'
with open(file, mode='rt', encoding='utf-8') as f:
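    # rstrip('\n') removes only the newline; strip('EOS\n') would also strip
    # any leading or trailing 'E', 'O', 'S' characters from a morpheme line.
    morphemes_list = [s.rstrip('\n') for s in f.readlines()]

# Drop empty lines and the EOS sentence-boundary markers.
morphemes_list = [s for s in morphemes_list if s not in ('', 'EOS')]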
ans_list = list(map(parse_morpheme, morphemes_list))
print(ans_list[:5])

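After reading the file, I drop the unneeded lines (the EOS markers) and store the requested fields of each morpheme in a dict. Note that this yields one flat list of morphemes rather than one list per sentence.
Only the first five entries are printed.

Result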

[{'surface': '一', 'base': '一', 'pos': '名詞', 'pos1': '数'}, {'surface': '\u3000', 'base': '\u3000', 'pos': '記号', 'pos1': '空白'}, {'surface': '吾輩', 'base': '吾輩', 'pos': '名詞', 'pos1': '代名詞'}, {'surface': 'は', 'base': 'は', 'pos': '助詞', 'pos1': '係助詞'}, {'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '一般'}]
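
The problem statement asks for each sentence to be a list of morphemes, while the code above keeps a single flat list. As a rough sketch of how the per-sentence form could be built, the following groups lines at the EOS markers. The function name read_sentences and the sample lines are my own assumptions for illustration, not part of the original solution.

```python
def parse_morpheme(line):
    # One MeCab output line is "surface<TAB>comma-separated features".
    surface, attr = line.split('\t')
    attr = attr.split(',')
    return {'surface': surface, 'base': attr[6], 'pos': attr[0], 'pos1': attr[1]}


def read_sentences(lines):
    """Group parsed morphemes into sentences, using EOS lines as boundaries."""
    sentences = []
    current = []
    for line in lines:
        line = line.rstrip('\n')
        if line == 'EOS':
            if current:  # consecutive EOS lines would otherwise add empty sentences
                sentences.append(current)
            current = []
        elif line:
            current.append(parse_morpheme(line))
    if current:  # keep a final sentence that is not followed by EOS
        sentences.append(current)
    return sentences


# Hand-written sample in the same format as neko.txt.mecab (not the real file).
sample = [
    '一\t名詞,数,*,*,*,*,一,イチ,イチ\n',
    'EOS\n',
    'EOS\n',
    '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n',
    'は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n',
    'EOS\n',
]
sentences = read_sentences(sample)
print([[m['surface'] for m in s] for s in sentences])
```

With the real file, passing the open file object (for example read_sentences(f)) should work the same way, since it also iterates line by line.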

31. Verbs

"""
Extract all surface forms of verbs.
"""
def parse_morpheme(morpheme):
    (surface, attr) = morpheme.split('\t')
    attr = attr.split(',')
    morpheme_dict = {
        'surface': surface,
        'base': attr[6],
        'pos': attr[0],
        'pos1': attr[1]
    }
    return morpheme_dict


def get_value(items, get_type, key, value):
    return [x[get_type] for x in items if key in x and get_type in x and x[key] == value]


file = 'neko.txt.mecab'
with open(file, mode='rt', encoding='utf-8') as f:
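    # Remove only the trailing newline (strip('EOS\n') strips characters, not the word EOS).
    morphemes_list = [s.rstrip('\n') for s in f.readlines()]

# Drop empty lines and the EOS sentence-boundary markers.
morphemes_list = [s for s in morphemes_list if s not in ('', 'EOS')]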
ans_list = list(map(parse_morpheme, morphemes_list))

ans = get_value(ans_list, 'surface', 'pos', '動詞')
print(ans[:5])

I implemented a get_value function that returns the surface of every morpheme whose pos is 動詞 (verb).

Result

['生れ', 'つか', 'し', '泣い', 'し']

32. Base forms of verbs

"""
Extract all base forms of verbs.
"""
def parse_morpheme(morpheme):
    (surface, attr) = morpheme.split('\t')
    attr = attr.split(',')
    morpheme_dict = {
        'surface': surface,
        'base': attr[6],
        'pos': attr[0],
        'pos1': attr[1]
    }
    return morpheme_dict


def get_value(items, get_type, key, value):
    return [x[get_type] for x in items if key in x and get_type in x and x[key] == value]


file = 'neko.txt.mecab'
with open(file, mode='rt', encoding='utf-8') as f:
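    # Remove only the trailing newline (strip('EOS\n') strips characters, not the word EOS).
    morphemes_list = [s.rstrip('\n') for s in f.readlines()]

# Drop empty lines and the EOS sentence-boundary markers.
morphemes_list = [s for s in morphemes_list if s not in ('', 'EOS')]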
ans_list = list(map(parse_morpheme, morphemes_list))

ans = get_value(ans_list, 'base', 'pos', '動詞')
print(ans[:5])

This is the code from 31 with the get_type argument of get_value changed from surface to base.

Result

['生れる', 'つく', 'する', '泣く', 'する']
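
To make the difference between 31 and 32 easier to see, here is a small sketch that calls the same get_value on a few hand-written morphemes and pairs each verb's surface with its base form. The toy list is an assumption for illustration and is not taken from neko.txt.mecab.

```python
def get_value(items, get_type, key, value):
    # Same filter as in the article: pick x[get_type] where x[key] == value.
    return [x[get_type] for x in items if key in x and get_type in x and x[key] == value]


# Hand-written toy morphemes (illustrative only).
toy = [
    {'surface': '泣い', 'base': '泣く', 'pos': '動詞', 'pos1': '自立'},
    {'surface': 'て', 'base': 'て', 'pos': '助詞', 'pos1': '接続助詞'},
    {'surface': 'い', 'base': 'いる', 'pos': '動詞', 'pos1': '非自立'},
    {'surface': 'た', 'base': 'た', 'pos': '助動詞', 'pos1': '*'},
]

surfaces = get_value(toy, 'surface', 'pos', '動詞')
bases = get_value(toy, 'base', 'pos', '動詞')
print(list(zip(surfaces, bases)))
```

Only the second argument changes between the two calls, which is all that separates problem 31 from problem 32.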

33. 「AのB」 (A of B)

"""
Extract noun phrases in which two nouns are joined by 「の」.
"""
def parse_morpheme(morpheme):
    (surface, attr) = morpheme.split('\t')
    attr = attr.split(',')
    morpheme_dict = {
        'surface': surface,
        'base': attr[6],
        'pos': attr[0],
        'pos1': attr[1]
    }
    return morpheme_dict


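def get_value(items):
    # Iterate from index 1 to len-2 so that items[i-1] and items[i+1] always exist
    # (i=0 would wrap around to items[-1]; the last index would raise IndexError).
    return [items[i-1]['surface'] + items[i]['surface'] + items[i+1]['surface']
            for i in range(1, len(items) - 1)
            if items[i]['surface'] == 'の'
            and items[i-1]['pos'] == '名詞'
            and items[i+1]['pos'] == '名詞']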

file = 'neko.txt.mecab'
with open(file, mode='rt', encoding='utf-8') as f:
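    # Remove only the trailing newline (strip('EOS\n') strips characters, not the word EOS).
    morphemes_list = [s.rstrip('\n') for s in f.readlines()]

# Drop empty lines and the EOS sentence-boundary markers.
morphemes_list = [s for s in morphemes_list if s not in ('', 'EOS')]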
ans_list = list(map(parse_morpheme, morphemes_list))

ans = get_value(ans_list)
print(ans[:5])

Inside get_value, a list comprehension extracts every noun + の + noun sequence.

Result

['彼の掌', '掌の上', '書生の顔', 'はずの顔', '顔の真中']
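
Because the morphemes are held in one flat list, a match could in principle span a sentence boundary. A possible alternative, sketched here under the assumption that one sentence is available as its own list, slides a three-item window over it with zip. The name noun_phrases and the toy sentence are hypothetical.

```python
def noun_phrases(sentence):
    """Return 'AのB' strings where A and B are nouns within one sentence."""
    # zip over the sentence and its shifted copies yields every 3-item window.
    triples = zip(sentence, sentence[1:], sentence[2:])
    return [a['surface'] + no['surface'] + b['surface']
            for a, no, b in triples
            if no['surface'] == 'の' and a['pos'] == '名詞' and b['pos'] == '名詞']


# Hand-written toy sentence (illustrative only).
toy_sentence = [
    {'surface': '彼', 'pos': '名詞'},
    {'surface': 'の', 'pos': '助詞'},
    {'surface': '掌', 'pos': '名詞'},
    {'surface': 'の', 'pos': '助詞'},
    {'surface': '上', 'pos': '名詞'},
    {'surface': 'で', 'pos': '助詞'},
]
phrases = noun_phrases(toy_sentence)
print(phrases)
```

The window never reaches outside the list, so no index checks are needed, and sentences shorter than three morphemes simply produce no matches.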

34. Noun sequences

"""
Extract sequences of nouns (nouns that appear consecutively) by longest match.
"""
def parse_morpheme(morpheme):
    (surface, attr) = morpheme.split('\t')
    attr = attr.split(',')
    morpheme_dict = {
        'surface': surface,
        'base': attr[6],
        'pos': attr[0],
        'pos1': attr[1]
    }
    return morpheme_dict


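def get_value(items):
    ret = []
    noun_list = []
    for x in items:
        if x['pos'] == '名詞':
            # Keep extending the current run of nouns.
            noun_list.append(x['surface'])
        else:
            # The run ended; keep it only if it contains two or more nouns.
            if len(noun_list) >= 2:
                ret.append(noun_list)
            noun_list = []
    # Flush a run that reaches the end of the input (looking ahead with
    # items[i+1] would raise IndexError when the last morpheme is a noun).
    if len(noun_list) >= 2:
        ret.append(noun_list)
    return ret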

file = 'neko.txt.mecab'
with open(file, mode='rt', encoding='utf-8') as f:
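    # Remove only the trailing newline (strip('EOS\n') strips characters, not the word EOS).
    morphemes_list = [s.rstrip('\n') for s in f.readlines()]

# Drop empty lines and the EOS sentence-boundary markers.
morphemes_list = [s for s in morphemes_list if s not in ('', 'EOS')]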
ans_list = list(map(parse_morpheme, morphemes_list))

ans = get_value(ans_list)
print(ans[:5])

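get_value extracts the noun sequences.
While nouns keep appearing consecutively they are collected in noun_list, and when the run ends it is appended to ret. Only runs of two or more nouns are kept.

Result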

[['人間', '中'], ['一番', '獰悪'], ['時', '妙'], ['一', '毛'], ['その後', '猫']]
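
The same "collect until the run breaks" idea can probably also be written with itertools.groupby from the standard library, which groups consecutive items that share a key. This is only a sketch. The name noun_runs and the toy data are assumptions for illustration.

```python
from itertools import groupby


def noun_runs(items):
    """Return each maximal run of two or more consecutive nouns as a list of surfaces."""
    runs = []
    # groupby yields one group per maximal stretch of items with the same key value.
    for is_noun, group in groupby(items, key=lambda m: m['pos'] == '名詞'):
        surfaces = [m['surface'] for m in group]
        if is_noun and len(surfaces) >= 2:
            runs.append(surfaces)
    return runs


# Hand-written toy morphemes (illustrative only).
toy = [
    {'surface': '人間', 'pos': '名詞'},
    {'surface': '中', 'pos': '名詞'},
    {'surface': 'で', 'pos': '助詞'},
    {'surface': '一番', 'pos': '名詞'},
    {'surface': '獰悪', 'pos': '名詞'},
    {'surface': 'な', 'pos': '助動詞'},
    {'surface': '種族', 'pos': '名詞'},
]
runs = noun_runs(toy)
print(runs)
```

A single noun such as 種族 forms a run of length one and is skipped, which matches the behaviour described above.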

35. Word frequency

"""
Find the words that appear in the text and their frequencies, and sort them in descending order of frequency.
"""
import pandas as pd
from collections import defaultdict


def parse_morpheme(morpheme):
    (surface, attr) = morpheme.split('\t')
    attr = attr.split(',')
    morpheme_dict = {
        'surface': surface,
        'base': attr[6],
        'pos': attr[0],
        'pos1': attr[1]
    }
    return morpheme_dict


def get_value(items):
    return [x['surface'] for x in items]


def get_freq(value):
    def generate_ngrams(text, n_gram=1):
        token = [token for token in text.lower().split(" ") if token != "" if token]
        ngrams = zip(*[token[i:] for i in range(n_gram)])
        return [" ".join(ngram) for ngram in ngrams]

    freq_dict = defaultdict(int)
    for sent in value:
        for word in generate_ngrams(str(sent)):
            freq_dict[word] += 1

    fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
    fd_sorted.columns = ['word', 'word_count']
    return fd_sorted


file = 'neko.txt.mecab'
with open(file, mode='rt', encoding='utf-8') as f:
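    # Remove only the trailing newline (strip('EOS\n') strips characters, not the word EOS).
    morphemes_list = [s.rstrip('\n') for s in f.readlines()]

# Drop empty lines and the EOS sentence-boundary markers.
morphemes_list = [s for s in morphemes_list if s not in ('', 'EOS')]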
ans_list = list(map(parse_morpheme, morphemes_list))

ans = get_value(ans_list)
ans = get_freq(ans)

print(ans.head())

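get_freq computes the frequencies: it counts each (lowercased) surface form in a defaultdict and returns a DataFrame sorted by count in descending order.

Result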

     word  word_count  
0    の        9194  
1    。        7486  
2    て        6868  
3    、        6772  
4    は        6420
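
For reference, the counting and sorting step can likely be done more compactly with collections.Counter from the standard library. The sketch below uses a tiny hand-written list of surfaces instead of the real data, so the numbers are not the ones shown above.

```python
from collections import Counter

# Hand-written toy surfaces (illustrative only, not from neko.txt.mecab).
surfaces = ['の', '。', 'の', 'て', 'の', '。']

# Counter tallies each distinct item; most_common(n) returns the n
# highest counts as (item, count) pairs in descending order of count.
freq = Counter(surfaces)
top = freq.most_common(2)
print(top)
```

Unlike get_freq above, this does not lowercase the words or build a DataFrame, so it is a simplification rather than a drop-in replacement.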

36. Top 10 most frequent words

"""
Display the 10 most frequent words and their frequencies in a graph (for example, a bar chart).
"""
import pandas as pd
from collections import defaultdict
import plotly.express as px
from plotly.offline import plot


def parse_morpheme(morpheme):
    (surface, attr) = morpheme.split('\t')
    attr = attr.split(',')
    morpheme_dict = {
        'surface': surface,
        'base': attr[6],
        'pos': attr[0],
        'pos1': attr[1]
    }
    return morpheme_dict


def get_value(items):
    return [x['surface'] for x in items]


def get_freq(value):
    def generate_ngrams(text, n_gram=1):
        token = [token for token in text.lower().split(" ") if token != "" if token]
        ngrams = zip(*[token[i:] for i in range(n_gram)])
        return [" ".join(ngram) for ngram in ngrams]

    freq_dict = defaultdict(int)
    for sent in value:
        for word in generate_ngrams(str(sent)):
            freq_dict[word] += 1

    fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
    fd_sorted.columns = ['word', 'word_count']
    return fd_sorted.head(10)


file = 'neko.txt.mecab'
with open(file, mode='rt', encoding='utf-8') as f:
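    # Remove only the trailing newline (strip('EOS\n') strips characters, not the word EOS).
    morphemes_list = [s.rstrip('\n') for s in f.readlines()]

# Drop empty lines and the EOS sentence-boundary markers.
morphemes_list = [s for s in morphemes_list if s not in ('', 'EOS')]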
ans_list = list(map(parse_morpheme, morphemes_list))

ans = get_value(ans_list)
ans = get_freq(ans)

fig = px.bar(
    ans.sort_values('word_count'),
    y='word',
    x='word_count',
    text='word_count',
    orientation='h',
)
fig.update_traces(
    texttemplate='%{text:.2s}',
    textposition='auto',
)
fig.update_layout(
    title='Top 10 most frequent words',
    xaxis_title='Occurrences',
    yaxis_title='Word',
    width=1000,
    height=500,
)
plot(fig, filename='ans_36_plot.html', auto_open=False)

get_freq now returns head(10), so only the top 10 words are kept.
This time I visualized the result with plotly.
Running the script writes ans_36_plot.html to the working directory; open it in a browser to see the chart.

Result

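37. Top 10 words that co-occur with 「猫」

"""
Display the 10 words that most often co-occur with 「猫」 (high co-occurrence frequency) and their frequencies in a graph (for example, a bar chart).
"""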
import pandas as pd
from collections import defaultdict
import plotly.express as px
from plotly.offline import plot


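def parseMecab(block):
    # Parse one EOS-delimited block (one sentence) into a list of morpheme dicts.
    res = []
    for line in block.split('\n'):
        if line == '':
            # The empty entry after the final newline marks the end of the block.
            break
        (surface, attr) = line.split('\t')
        attr = attr.split(',')
        lineDict = {
            'surface': surface,
            'base': attr[6],
            'pos': attr[0],
            'pos1': attr[1]
        }
        res.append(lineDict)
    # Always return the list; returning only inside the loop would yield None
    # for a block that has no trailing newline.
    return res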


def extract(block):
    return [b['base'] for b in block]


filename = 'neko.txt.mecab'
with open(filename, mode='rt', encoding='utf-8') as f:
    blockList = f.read().split('EOS\n')
blockList = list(filter(lambda x: x != '', blockList))
blockList = [parseMecab(block) for block in blockList]
wordList = [extract(block) for block in blockList]
wordList = list(filter(lambda x: '猫' in x, wordList))
d = defaultdict(int)
for word in wordList:
    for w in word:
        if w != '猫':
            d[w] += 1
ans = sorted(d.items(), key=lambda x: x[1], reverse=True)[:10]

ans = pd.DataFrame(ans)
ans.columns = ['word', 'word_count']

fig = px.bar(
    ans.sort_values('word_count'),
    y='word',
    x='word_count',
    text='word_count',
    orientation='h',
)
fig.update_traces(
    texttemplate='%{text:.2s}',
    textposition='auto',
)
fig.update_layout(
    title='Top 10 words co-occurring with 「猫」',
    xaxis_title='Co-occurrences with 「猫」',
    yaxis_title='Word',
    width=1000,
    height=500,
)
plot(fig, filename='ans_37_plot.html', auto_open=False)

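I had not loaded the data with co-occurrence counting in mind, so I cribbed the first half of the processing from u++'s code... 🙇‍♂️
(I did not have the energy to redo everything from problem 30.)
Here a word counts as co-occurring with 「猫」 when it appears in the same sentence (an EOS-delimited block), and words are counted by base form.

言語処理100本ノック, Chapter 4, problem 37: this is the state I ended up in
(I had not loaded the data in a form that lets me count co-occurrences) pic.twitter.com/4R4iWKReuL
— takapy | たかぱい (@takapy0210) May 4, 2020

Running the script writes ans_37_plot.html to the working directory; open it in a browser to see the chart.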
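
To isolate the counting logic from the file handling and plotting, here is a minimal sketch of sentence-level co-occurrence counting on hand-written data. The function name cooccurrence and the toy sentences are assumptions for illustration. The real code above works on base forms parsed from neko.txt.mecab.

```python
from collections import Counter


def cooccurrence(sentences, target):
    """Count words that appear in the same sentence as target."""
    counts = Counter()
    for words in sentences:
        if target in words:
            # Every other word in a sentence containing target is counted once
            # per occurrence, the same way as the loop in the article.
            counts.update(w for w in words if w != target)
    return counts


# Hand-written toy sentences, each a list of base forms (illustrative only).
toy = [
    ['吾輩', 'は', '猫', 'だ'],
    ['名前', 'は', 'ない'],
    ['猫', 'は', '鳴く'],
]
counts = cooccurrence(toy, '猫')
print(counts.most_common(1))
```

The second sentence does not contain 猫, so none of its words contribute to the counts.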

Result

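38. Histogram

"""
Draw a histogram of word frequencies (a bar chart with frequency on the horizontal axis and the number of distinct words that have that frequency on the vertical axis).
"""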
import pandas as pd
from collections import defaultdict
import plotly.express as px
from plotly.offline import plot


def parse_morpheme(morpheme):
    (surface, attr) = morpheme.split('\t')
    attr = attr.split(',')
    morpheme_dict = {
        'surface': surface,
        'base': attr[6],
        'pos': attr[0],
        'pos1': attr[1]
    }
    return morpheme_dict


def get_value(items):
    return [x['surface'] for x in items]


def get_freq(value):
    def generate_ngrams(text, n_gram=1):
        token = [token for token in text.lower().split(" ") if token != "" if token]
        ngrams = zip(*[token[i:] for i in range(n_gram)])
        return [" ".join(ngram) for ngram in ngrams]

    freq_dict = defaultdict(int)
    for sent in value:
        for word in generate_ngrams(str(sent)):
            freq_dict[word] += 1

    fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
    fd_sorted.columns = ['word', 'word_count']
    return fd_sorted


file = 'neko.txt.mecab'
with open(file, mode='rt', encoding='utf-8') as f:
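    # Remove only the trailing newline (strip('EOS\n') strips characters, not the word EOS).
    morphemes_list = [s.rstrip('\n') for s in f.readlines()]

# Drop empty lines and the EOS sentence-boundary markers.
morphemes_list = [s for s in morphemes_list if s not in ('', 'EOS')]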
ans_list = list(map(parse_morpheme, morphemes_list))

ans = get_value(ans_list)
ans = get_freq(ans)

fig = px.histogram(ans, x='word_count', nbins=50)
fig.update_layout(
    title='Histogram of word frequencies',
    xaxis_title='Frequency',
    yaxis_title='Number of distinct words',
    width=1000,
    height=500,
)
plot(fig, filename='ans_38_plot.html', auto_open=False)

Running the script writes ans_38_plot.html to the working directory; open it in a browser to see the chart.
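
What the histogram shows is, for each frequency, how many distinct words occur that many times. As a small sketch of that "count of counts" idea without binning, the following uses collections.Counter on a hand-written frequency table. The numbers are made up for illustration.

```python
from collections import Counter

# Hand-written word -> frequency table (illustrative only).
word_counts = {'の': 3, '。': 2, 'て': 1, '猫': 1, '名前': 1}

# Counting the frequency values themselves gives, for each frequency,
# the number of distinct words that have it.
types_per_freq = Counter(word_counts.values())
print(sorted(types_per_freq.items()))
```

px.histogram above does essentially this, except that it groups the frequencies into bins (nbins=50) before counting.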

Result

39. Zipf's law

"""
Plot a log-log graph with the frequency rank of words on the horizontal axis and their frequency on the vertical axis.
"""
import pandas as pd
import math
from collections import defaultdict
import plotly.express as px
from plotly.offline import plot


def parse_morpheme(morpheme):
    (surface, attr) = morpheme.split('\t')
    attr = attr.split(',')
    morpheme_dict = {
        'surface': surface,
        'base': attr[6],
        'pos': attr[0],
        'pos1': attr[1]
    }
    return morpheme_dict


def get_value(items):
    return [x['surface'] for x in items]


def get_freq(value):
    def generate_ngrams(text, n_gram=1):
        token = [token for token in text.lower().split(" ") if token != "" if token]
        ngrams = zip(*[token[i:] for i in range(n_gram)])
        return [" ".join(ngram) for ngram in ngrams]

    freq_dict = defaultdict(int)
    for sent in value:
        for word in generate_ngrams(str(sent)):
            freq_dict[word] += 1

    fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
    fd_sorted.columns = ['word', 'word_count']
    return fd_sorted


file = 'neko.txt.mecab'
with open(file, mode='rt', encoding='utf-8') as f:
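    # Remove only the trailing newline (strip('EOS\n') strips characters, not the word EOS).
    morphemes_list = [s.rstrip('\n') for s in f.readlines()]

# Drop empty lines and the EOS sentence-boundary markers.
morphemes_list = [s for s in morphemes_list if s not in ('', 'EOS')]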
ans_list = list(map(parse_morpheme, morphemes_list))

ans = get_value(ans_list)
ans = get_freq(ans)
ans['rank_log'] = [math.log(r + 1) for r in range(len(ans))]
ans['count_log'] = [math.log(v) for v in ans['word_count']]

fig = px.scatter(ans, x='rank_log', y='count_log')
fig.update_layout(
    title='Word frequency vs. frequency rank (log-log)',
    xaxis_title='log(frequency rank)',
    yaxis_title='log(frequency)',
    width=800,
    height=600,
)
plot(fig, filename='ans_39_plot.html', auto_open=False)

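A log-log graph is one in which both the x-axis and the y-axis use logarithmic scales. Here the logarithms are computed explicitly as rank_log and count_log and then plotted on ordinary linear axes.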
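
Zipf's law says that frequency is roughly proportional to 1/rank, so on a log-log plot the points should fall near a straight line with slope close to -1. As a rough way to check this numerically, the sketch below fits a least-squares line to log(count) against log(rank). The function name loglog_slope is hypothetical, and the input is synthetic data that follows count = 1200 / rank exactly, not the counts from neko.txt.mecab.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), rank starting at 1."""
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(c) for c in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic counts that obey count = 1200 / rank (not real data),
# so the fitted slope should come out very close to -1.
ideal = [1200 / rank for rank in range(1, 11)]
slope = loglog_slope(ideal)
print(slope)
```

On real text the slope would only be approximately -1, and the fit usually looks worse at the very top and the long tail of the ranking.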

Result

ギークなエンジニアを目指す男

A blog for building up machine learning knowledge

[言語処理100本ノック 2020] Solving Chapter 4 in Python
