行有餘力則以學文: December 2010

Thursday, December 9, 2010

How to make a bootable USB flash drive

http://siryeh.com/module-news-display-sid-26.htm

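Of all the Traditional Chinese websites, this is the most accurate write-up~~, though unfortunately the image links are broken~~

Tuesday, December 7, 2010

Presentation technique

On Sunday I attended an excellent talk. The speaker used the PowerPoint slides to the fullest, with not a single dull moment, which reminded me of the Takahashi method (高橋流) of presenting that I once wanted to learn. As it happens, the Chinese translation of the book also came out this year, and I will spend some time reading it later. Of course, every generation produces its own talent, and there are other related write-ups online. I find the review "Presentation Zen – the book is no match for the blog" quite incisive: it brings up Made to Stick (創意黏力學), and it once again steers readers to presentation zen, the temple of Presentation Zen, for a look inside.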

Saturday, December 4, 2010

After downloading the package from the official site http://nlp.stanford.edu/software/segmenter.shtml and unpacking it, run the following command

segment.bat pku test.simp.utf8 UTF-8 0 > out.txt

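The result is saved to out.txt

In an actual test on a Traditional Chinese document the results were not good; after converting the text to Simplified Chinese, however, accuracy exceeded 99%, which is excellent. To process Traditional Chinese directly, some documentation says you can pass the -loadClassifier data\traditional.gz parameter, but I could not find that file. The fallback is to convert the source text to Simplified Chinese before processing. Fortunately, the output does not need to be converted back to Traditional Chinese: as long as the conversion is character for character, corresponding Simplified and Traditional characters stay at the same positions, so keeping the position information is enough to point back into the original document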
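The idea of keeping position information can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `project_segmentation` is hypothetical, the segmenter output is taken to be space-separated tokens, the original text is assumed to contain no whitespace, and the Traditional-to-Simplified conversion is assumed to map every character to exactly one character.

```python
def project_segmentation(original: str, segmented_simplified: str) -> list[str]:
    """Cut `original` at the token boundaries found in the segmenter output.

    Assumes a one-to-one character conversion, so a token of length n in the
    Simplified output covers exactly n characters of the original text.
    """
    tokens = segmented_simplified.split()
    pieces = []
    pos = 0
    for token in tokens:
        # Reuse only the token length; the characters come from the original.
        pieces.append(original[pos:pos + len(token)])
        pos += len(token)
    if pos != len(original):
        # Lengths disagree, so the one-to-one assumption does not hold here.
        raise ValueError("segmented text and original differ in length")
    return pieces


if __name__ == "__main__":
    print(" ".join(project_segmentation("我喜歡學習中文", "我 喜欢 学习 中文")))
```

The length check at the end is a cheap way to notice when a converter has expanded or merged characters, in which case the offsets can no longer be trusted.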

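Friday, December 3, 2010

Web resources on processing Chinese with the Stanford parser

The Chinese grammar file bundled with the official release is chineseFactored.ser.gz; searching with that name as the keyword turns up the following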

http://114.255.218.78/wikiteamwork/images/d/d7/(3)Stanfold_Parser_in_GATE.ppt The syntactic analysis tool Stanfold Parser (sic) and its use in GATE

http://blog.amelielee.com/archives/140#comments Solving the ‘exceeded MAX_ITEMS’ problem in Stanford Parser

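https://mailman.stanford.edu/pipermail/parser-user/2010-August/000652.html [parser-user] stanford parser Chinese word segmentation: a peculiar phenomenon when segmenting “的”

http://blog.csdn.net/leeharry/archive/2008/03/06/2153583.aspx Using the stanford parser. Notably, this article brings out a key point: Chinese text must first be segmented into words by a separate segmentation program before it can be fed to the parser, and this holds for both the S brand (Stanford) and the B brand (Berkeley).

"How to train a Chinese Berkeley Parser"

http://playwithnlp.blogspot.com/2010/06/berkeley-parser.html

This is the only article on the entire web that mentions how to use the Chinese grammar chn_sm5.gr, so I am recording it here

"Call Stanford Parser in Perl", which is really cool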

http://layesuen.spaces.live.com/blog/cns!BCB0A55D794BEAF6!1034.entry?wa=wsignin1.0&sa=523996814

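use strict;
use warnings;

# Compile the Java class below through Inline::Java and bind it into Perl.
use Inline (
    Java => <<'END_JAVA',

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

// Thin wrapper around the Stanford parser so Perl only deals with strings.
class Parser {
    LexicalizedParser lexParser;
    // model: path to a serialized grammar such as englishPCFG.ser.gz
    public Parser(String model) {
        lexParser = new LexicalizedParser(model);
    }
    // Parse one sentence and return the best tree as a string.
    public String parse(String sentence) {
        lexParser.parse(sentence);
        return lexParser.getBestParse().toString();
    }
}

END_JAVA

    CLASSPATH => 'stanford-parser.jar',   # jar that provides LexicalizedParser
    EXTRA_JAVA_ARGS => '-mx800m'          # heap size passed to the JVM
);

# Load the English grammar, then parse each input line as one sentence.
my $p = Parser->new("englishPCFG.ser.gz");
print $p->parse($_)."\n" while (<>);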

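It feels really cool. It mainly relies on the Inline-Java bundle.
To run it, put stanford-parser.jar, the parser data file englishPCFG.ser.gz, and this Perl program in the same directory. Inline-Java must of course be able to find your JDK, which can be specified through the J2SDK option.

A first look at the output of the Stanford parser and the Berkeley parser

Contents of the input file used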

The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

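The results are much the same: a list-style tree representation. The Berkeley parser prints the whole result on a single line, taking up less screen space; the Stanford parser expands the result, making the structure easier to see. That is not the main point, though. Although the Stanford parser is updated more often, the Berkeley parser seems to get more favorable reviews. Either way, in natural language processing, parsing plain text is only the first step; the applications that follow are the real show.

A first look at the Berkeley parser

The corresponding download directory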
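Since both parsers emit the same kind of bracketed, list-style tree, a one-line tree can be expanded into an indented one with a small reader. The Python sketch below is a rough illustration, not either parser's own formatter: `read_tree` and `render` are hypothetical names, the input is assumed to be a well-formed bracketed tree, and the sample sentence is made up for the example rather than taken from real parser output.

```python
import re


def read_tree(text):
    """Parse a bracketed tree into nested lists, e.g. ["NP", ["DT", "The"]]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a label or a word
        node = []
        while tokens[pos] != ")":
            node.append(parse())
        pos += 1  # consume the closing parenthesis
        return node

    tree = parse()
    if pos != len(tokens):
        raise ValueError("unexpected text after the end of the tree")
    return tree


def render(node, depth=0):
    """Return the tree as indented text, one constituent per line."""
    pad = "  " * depth
    if isinstance(node, str):
        return pad + node
    if all(isinstance(child, str) for child in node):
        return pad + "(" + " ".join(node) + ")"  # e.g. (DT The)
    if node and isinstance(node[0], str):
        label, children = node[0], node[1:]
    else:
        label, children = "", node  # unlabeled root, as in "( (S ...))"
    lines = [pad + "(" + label]
    lines.extend(render(child, depth + 1) for child in children)
    lines[-1] += ")"
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(read_tree("(S (NP (DT The) (NN rain)) (VP (VBD stopped)))")))
```

Keeping the tree as plain nested lists makes it easy to hand the same structure to later processing steps instead of re-reading the text output.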

http://code.google.com/p/berkeleyparser/downloads/list

Download these files and put them in the same working directory. Taking mumbai.txt as the name of the file to analyze, enter

java -Xms64m -Xmx512m -jar berkeleyParser.jar -gr eng_sm6.gr.gz -inputFile mumbai.txt

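The example in its readme does not include the -Xms64m -Xmx512m parameters, so users may get an out-of-memory error message; the other available options are as follows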

-render Write rendered tree to image file. (Default: false)
-inputFile Read input from this file instead of reading it from STDIN.
-substates Output subcategories (only for binarized viterbi trees). (Default: false)
-gr Grammarfile (Required) [required]
-binarize Output binarized trees. (Default: false)
-likelihood Output sentence likelihood, i.e. summing out all parse trees: P(w) (Default: false)
-confidence Output confidence measure, i.e. tree likelihood: P(T|w) (Default: false)
-tokenize Tokenize input first. (Default: false=text is already tokenized)
-scores Output inside scores (only for binarized viterbi trees). (Default: false)
-viterbi Compute viterbi derivation instead of max-rule tree (Default: max-rule)
-chinese Enable some Chinese specific features in the lexicon.
-accurate Set thresholds for accuracy. (Default: set thresholds for efficiency)
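When the parser is driven from a script, it helps to build the command line in one place so the heap flags are never forgotten. The Python sketch below is a minimal example under assumptions: `berkeley_command` is a hypothetical helper, the jar and grammar file names are the ones used in the command above, and only the -gr and -inputFile options from the list are used.

```python
import subprocess


def berkeley_command(grammar, input_file, min_heap="64m", max_heap="512m",
                     jar="berkeleyParser.jar"):
    """Build the argument list for one Berkeley parser run."""
    return [
        "java",
        f"-Xms{min_heap}",  # initial JVM heap size
        f"-Xmx{max_heap}",  # maximum JVM heap size
        "-jar", jar,
        "-gr", grammar,
        "-inputFile", input_file,
    ]


if __name__ == "__main__":
    # Requires java and the downloaded files in the current directory.
    command = berkeley_command("eng_sm6.gr.gz", "mumbai.txt")
    result = subprocess.run(command, capture_output=True, text=True)
    print(result.stdout)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.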

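A first look at the Stanford parser

In operating system development, Stanford and Berkeley have long competed with each other, and the rivalry extends to other areas. In natural language processing, their flagship works are the stanford parser and the berkeley parser. Let's look at the Stanford parser first; visit its home page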

http://nlp.stanford.edu/software/lex-parser.shtml

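where the latest version can be downloaded. Unpack it into a convenient directory, create the mumbai.txt file as described toward the bottom of the page to run the experiment, and enter the following command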

java -mx200m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -retainTMPSubcategories -outputFormat "wordsAndTags,penn,typedDependencies" englishPCFG.ser.gz mumbai.txt

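Because the author of that page is a developer, they did not think to include the -cp part; for those of us who are purely users, specifying the class path is necessary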