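Last time I wrote a text-generation algorithm in Ruby using Markov chains, and splitting the input text gave me a bit of trouble, so here is a note to self.
This assumes languages like English or Spanish, where words are separated by spaces. If you just want to split a sentence into words, .split (as in sentence.split) is enough, but this post is about splitting a text into sentences.
As the example I use the second paragraph of Franz Kafka's "Metamorphosis" (because it also contains a ?). Since the goal is to split by sentence, I use the period and the question mark as delimiters.
At first I simply used split and wrote this.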
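For comparison, here is a minimal sketch of the word-level case. The sample string and the variable names are made up for this illustration and are not part of the original script.

```ruby
# Hypothetical sample sentence, not read from the article's input file.
sentence = "It wasn't a dream."

# Called with no argument, String#split splits on runs of whitespace.
words = sentence.split

p words  # => ["It", "wasn't", "a", "dream."]
```

Note that the period stays attached to the last word, which is the behaviour wanted for the sentence-level split as well.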
    M1 = Metamorphosis.split(/[.?]/)
The result looks like this.
    -----Part 1-----
    What's happened to me
    -----Part 2-----
    he thought
    -----Part 3-----
    It wasn't a dream
    -----Part 4-----
    His room, a proper human room although a little too small, lay peacefully between its four familiar walls
    -----Part 5-----
    A collection of textile samples lay spread out on the table - Samsa was a travelling salesman - and above it there hung a picture that he had recently cut out of an illustrated magazine and housed in a nice, gilded frame
    -----Part 6-----
    It showed a lady fitted out with a fur hat and fur boa who sat upright, raising a heavy fur muff that covered the whole of her lower arm towards the viewer
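The text is split properly, but the delimiters have been removed, so this is not usable. Next I tried the same split method with a small change: the bracket expression is wrapped in parentheses, which makes it a capture group.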
    M2 = Metamorphosis.split(/([.?])/)
This time the output looked like this.
    -----Part 1-----
    What's happened to me
    -----Part 2-----
    ?
    -----Part 3-----
    he thought
    -----Part 4-----
    .
    -----Part 5-----
    It wasn't a dream
    -----Part 6-----
    .
    -----Part 7-----
    His room, a proper human room although a little too small, lay peacefully between its four familiar walls
    -----Part 8-----
    .
    -----Part 9-----
    A collection of textile samples lay spread out on the table - Samsa was a travelling salesman - and above it there hung a picture that he had recently cut out of an illustrated magazine and housed in a nice, gilded frame
    -----Part 10-----
    .
    -----Part 11-----
    It showed a lady fitted out with a fur hat and fur boa who sat upright, raising a heavy fur muff that covered the whole of her lower arm towards the viewer
    -----Part 12-----
    .
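Now the delimiters are included, but each delimiter comes back as a separate element of its own. For the Markov chain algorithm the delimiter has to stay attached to the last word of its sentence, so this is not good enough either.
After some digging I realized that I could use the scan method instead of split. After some trial and error, I wrote this.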
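The difference between the two split calls is easier to see on a short string. This is only a sketch: the inline sample text and the variable names are assumptions for the illustration, while the two patterns are the ones used in this post.

```ruby
# Hypothetical short sample; the article's real input is read from a file.
text = "What's happened to me? he thought. It wasn't a dream."

# Plain character class: the delimiters are consumed and dropped.
without_delims = text.split(/[.?]/)

# Capture group: each delimiter comes back as its own element.
with_delims = text.split(/([.?])/)

p without_delims
# => ["What's happened to me", " he thought", " It wasn't a dream"]
p with_delims
# => ["What's happened to me", "?", " he thought", ".", " It wasn't a dream", "."]
```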
    M3 = Metamorphosis.scan(/[^.?]*./)
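In short, [^.?]* scans a run of characters other than the ones listed inside the brackets, and the trailing . (any single character) then picks up the character that ended the run, which here is the delimiter. This gives the following result.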
    -----Part 1-----
    What's happened to me?
    -----Part 2-----
    he thought.
    -----Part 3-----
    It wasn't a dream.
    -----Part 4-----
    His room, a proper human room although a little too small, lay peacefully between its four familiar walls.
    -----Part 5-----
    A collection of textile samples lay spread out on the table - Samsa was a travelling salesman - and above it there hung a picture that he had recently cut out of an illustrated magazine and housed in a nice, gilded frame.
    -----Part 6-----
    It showed a lady fitted out with a fur hat and fur boa who sat upright, raising a heavy fur muff that covered the whole of her lower arm towards the viewer.
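To look at the scan pattern in isolation, here is a small sketch on inline strings. The sample strings, the variable names, and the stricter pattern /[^.?]*[.?]/ are my own assumptions for comparison; only /[^.?]*./ comes from the script in this post.

```ruby
# Hypothetical short sample; the article's real input is read from a file.
text = "What's happened to me? he thought. It wasn't a dream."

loose  = /[^.?]*./     # pattern from the article: run of non-delimiters, then any one character
strict = /[^.?]*[.?]/  # assumed stricter variant: the last character must be . or ?

loose_result  = text.scan(loose)
strict_result = text.scan(strict)
p loose_result
# => ["What's happened to me?", " he thought.", " It wasn't a dream."]
p strict_result
# => ["What's happened to me?", " he thought.", " It wasn't a dream."]

# The two only differ when the text ends without a delimiter.
unfinished   = "One. Two"
loose_tail   = unfinished.scan(loose)
strict_tail  = unfinished.scan(strict)
p loose_tail   # => ["One.", " Two"]
p strict_tail  # => ["One."]
```

On text where every sentence ends in a period or question mark, both patterns return the same pieces. Which behaviour is preferable for an unterminated tail depends on whether that fragment should be kept.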
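For now, this gives the result I needed. Note that with this approach every sentence from the second one onward inevitably starts with a space. Calling the strip method on each part takes care of that.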
    M3 (after strip):
    -----Part 1-----
    What's happened to me?
    -----Part 2-----
    he thought.
    -----Part 3-----
    It wasn't a dream.
    -----Part 4-----
    His room, a proper human room although a little too small, lay peacefully between its four familiar walls.
    -----Part 5-----
    A collection of textile samples lay spread out on the table - Samsa was a travelling salesman - and above it there hung a picture that he had recently cut out of an illustrated magazine and housed in a nice, gilded frame.
    -----Part 6-----
    It showed a lady fitted out with a fur hat and fur boa who sat upright, raising a heavy fur muff that covered the whole of her lower arm towards the viewer.

    What's happened to me? he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls. A collection of textile samples lay spread out on the table - Samsa was a travelling salesman - and above it there hung a picture that he had recently cut out of an illustrated magazine and housed in a nice, gilded frame. It showed a lady fitted out with a fur hat and fur boa who sat upright, raising a heavy fur muff that covered the whole of her lower arm towards the viewer.
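The whitespace has been removed properly. The input text can now be split into sentences, so everything is ready for generating text with a Markov chain!
For reference, here is the code used this time.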
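The whitespace cleanup can also be sketched on its own. An Array has no strip method, so strip has to be applied to each String element. The array literal below is a hypothetical stand-in for the scan result, and the variable names are made up for the illustration.

```ruby
# Hypothetical pieces shaped like the scan output (note the leading spaces).
pieces = ["What's happened to me?", " he thought.", " It wasn't a dream."]

# map builds a new array of stripped copies and leaves pieces untouched.
clean = pieces.map(&:strip)

p clean
# => ["What's happened to me?", "he thought.", "It wasn't a dream."]
```

The non-bang strip is used here because strip! returns nil when nothing was removed, so its return value is not safe to collect with map.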
    ##Split and Scan Test
    ##Metamorphosis retrieved from: http://www.gutenberg.org/cache/epub/5200/pg5200.txt

    def displayParts(sentences)
      count = 1
      sentences.each do |parts|
        puts "-----Part " + count.to_s + "-----"
        parts.strip! ##Remove unnecessary spaces
        puts parts
        count += 1
      end
    end

    Metamorphosis = ""
    open('.\Metamorphosis.txt') do |t|
      t.each do |line|
        Metamorphosis << line
      end
    end

    M1 = Metamorphosis.split(/[.?]/) ## DON'T return delimiters
    displayParts(M1)

    M2 = Metamorphosis.split(/([.?])/) ## RETURN delimiters
    displayParts(M2)

    M3 = Metamorphosis.scan(/[^.?]*./) ## RETURN delimiters ATTACHED TO words
    displayParts(M3)