行有餘力則以學文 ("When there is strength to spare, devote it to study")

Friday, December 3, 2010

"How to Train a Chinese Berkeley Parser" (如何训练一个中文的Berkeley Parser)

http://playwithnlp.blogspot.com/2010/06/berkeley-parser.html

As far as I can find, this is the only article on the web that covers how to use the Chinese grammar chn_sm5.gr, so I am recording it here.

"Call Stanford Parser in Perl" is really cool

http://layesuen.spaces.live.com/blog/cns!BCB0A55D794BEAF6!1034.entry?wa=wsignin1.0&sa=523996814

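# Compile the Java wrapper class below and bind it into Perl via Inline-Java.
use Inline (
    Java => <<'END_JAVA',

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

// Thin wrapper: loads a serialized grammar and parses one sentence per call.
class Parser {
    LexicalizedParser lexParser;
    public Parser(String model) {
        lexParser = new LexicalizedParser(model);
    }
    public String parse(String sentence) {
        lexParser.parse(sentence);
        return lexParser.getBestParse().toString();
    }
}

END_JAVA

    CLASSPATH => 'stanford-parser.jar',   # jar is expected in the current directory
    EXTRA_JAVA_ARGS => '-mx800m'          # heap size for the JVM started by Inline-Java
);

# Load the English PCFG model, then parse each input line and print its tree.
my $p = Parser->new("englishPCFG.ser.gz");
print $p->parse($_)."\n" while (<>);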

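This feels really cool. It relies mainly on the Inline-Java module.
To run it, put stanford-parser.jar, the parser data file englishPCFG.ser.gz, and this Perl program in the same directory. Inline-Java must of course be able to find your JDK; you can point it there with the J2SDK option.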

A first look at the output of the Stanford parser and the Berkeley parser

Contents of the input file

The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

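The two results are much the same: a list-style (bracketed) tree representation. The Berkeley parser prints the entire result on a single line, which takes up less screen space; the Stanford parser expands the result across lines, which makes the structure easier to see. That is not the main point, though. Although the Stanford parser is updated more frequently, the Berkeley parser seems to get more favorable reviews. Either way, in natural language processing, parsing plain text is only the first step; the applications that follow are where the real action is.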
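
Since the two layouts carry the same bracketed tree, a one-line result can be expanded for reading. The following is a minimal sketch of that idea, not the Stanford parser's own pretty-printer: the class name TreeIndenter and the two-space indent are my own assumptions, and it simply starts a new indented line at every opening parenthesis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of either parser): expands a one-line
// bracketed tree so that every "(" starts a new, indented line.
class TreeIndenter {
    // Tokens are "(", ")", or a run of characters that are neither
    // whitespace nor parentheses (labels and words).
    private static final Pattern TOKEN = Pattern.compile("[()]|[^\\s()]+");

    static String indent(String tree) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        String prev = "";
        Matcher m = TOKEN.matcher(tree);
        while (m.find()) {
            String tok = m.group();
            if (tok.equals("(")) {
                // Every subtree after the first begins on its own line.
                if (out.length() > 0) {
                    out.append('\n');
                    for (int i = 0; i < depth; i++) {
                        out.append("  ");
                    }
                }
                out.append('(');
                depth++;
            } else if (tok.equals(")")) {
                depth--;
                out.append(')');
            } else {
                // A label sits directly after "("; a word follows its tag.
                if (!prev.equals("(")) {
                    out.append(' ');
                }
                out.append(tok);
            }
            prev = tok;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A tiny hand-written tree, not actual parser output.
        System.out.println(indent("(S (NP (DT The) (NN rain)) (VP (VBD fell)))"));
    }
}
```

Each constituent, including every tag/word pair, ends up on its own line, so the result is more verbose than what the Stanford parser prints, but the nesting is easy to follow.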

A first look at the Berkeley parser

The corresponding download directory

http://code.google.com/p/berkeleyparser/downloads/list

Download these files and put them in the same working directory. Assuming the file to be analyzed is named mumbai.txt, type

java -Xms64m -Xmx512m -jar berkeleyParser.jar -gr eng_sm6.gr.gz -inputFile mumbai.txt

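The example in its README omits the -Xms64m -Xmx512m parameters; without them, users may get an out-of-memory error. The other available options are as follows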

-render Write rendered tree to image file. (Default: false)
-inputFile Read input from this file instead of reading it from STDIN.
-substates Output subcategories (only for binarized viterbi trees). (Default: false)
-gr Grammarfile (Required) [required]
-binarize Output binarized trees. (Default: false)
-likelihood Output sentence likelihood, i.e. summing out all parse trees: P(w) (Default: false)
-confidence Output confidence measure, i.e. tree likelihood: P(T|w) (Default: false)
-tokenize Tokenize input first. (Default: false=text is already tokenized)
-scores Output inside scores (only for binarized viterbi trees). (Default: false)
-viterbi Compute viterbi derivation instead of max-rule tree (Default: max-rule)
-chinese Enable some Chinese specific features in the lexicon.
-accurate Set thresholds for accuracy. (Default: set thresholds for efficiency)
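
If you want to launch the parser from another Java program rather than typing the command, the same argument list can be assembled in code. This is only a sketch under my own assumptions: the class name BerkeleyCommand is made up, the heap settings and file names are the ones used in the command above, and I am assuming the boolean options in the list (such as -accurate) are passed as bare flags.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the same command line as shown above.
class BerkeleyCommand {
    static List<String> build(String grammar, String inputFile, String... extraFlags) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java",
                "-Xms64m", "-Xmx512m",        // heap settings, to avoid running out of memory
                "-jar", "berkeleyParser.jar",
                "-gr", grammar,               // grammar file (required)
                "-inputFile", inputFile));    // read input from a file instead of STDIN
        // Assumption: extra options from the list above are appended as-is.
        cmd.addAll(Arrays.asList(extraFlags));
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = build("eng_sm6.gr.gz", "mumbai.txt");
        System.out.println(String.join(" ", cmd));
        // Only launch the parser if the jar is in the working directory.
        if (new File("berkeleyParser.jar").exists()) {
            int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
            System.out.println("exit code: " + exit);
        }
    }
}
```

Keeping the arguments in a list, instead of one long string, avoids quoting problems when a file name contains spaces.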

A first look at the Stanford parser

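In operating system development, Stanford and Berkeley have long competed with each other, and that rivalry has carried over into other areas. In natural language processing, the flagship works are the Stanford parser and the Berkeley parser. Let's look at the Stanford parser first. Visit its home page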

http://nlp.stanford.edu/software/lex-parser.shtml

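to download the latest version. Unpack it into a convenient directory, create the mumbai.txt file as described near the bottom of that page to run the experiment, and enter the following command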

java -mx200m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -retainTMPSubcategories -outputFormat "wordsAndTags,penn,typedDependencies" englishPCFG.ser.gz mumbai.txt

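The author of that page is a developer and so did not think to include the -cp part; for those of us who are purely users, specifying the class path is necessary.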