行有餘力則以學文

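Sunday, June 30, 2013

How to redo in Emacs

C-g C-_

Not an easy key combination to remember: C-g breaks the chain of undo commands, so the next C-_ (undo) reverses the previous undo, which amounts to a redo.

See http://stackoverflow.com/questions/3527142/how-do-you-redo-changes-after-undo-with-emacs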

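Tuesday, June 11, 2013

How to delete all blank lines in Emacs

M-x flush-lines RET ^$ RET

This removes every empty line from point to the end of the buffer, or within the active region. See the explanation at the following link:

http://www.masteringemacs.org/articles/2011/03/16/removing-blank-lines-buffer/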

Thursday, June 6, 2013

Let's look at an example first

For the installation steps, see http://orgmode.org/worg/org-tutorials/org-plot.html

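1. Install gnuplot. On Ubuntu, though, the package to install is gnuplot-x11:

sudo apt-get install gnuplot-x11
(see http://askubuntu.com/questions/217867/just-installed-ubuntu-12-10-and-gnuplot-wxt-terminal-doesnt-work)

After installing, remember to confirm that it works; http://user.frdm.info/ckhung/b/ma/gnuplot.php has examples to test with.

2. In Emacs, install gnuplot-mode through ELPA. See https://github.com/bruceravel/gnuplot-mode

3. For now, only tables in which each column holds one data series can be plotted. If the table is laid out the other way, transpose it with org-table-transpose-table-at-point

4. Where data is missing, remember to fill the cell with ?, otherwise the plot will come out wrong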

Thursday, May 16, 2013

Matching Chinese numerals with regular expressions in Perl

For the Chinese character 一, pressing C-u C-x = in Emacs shows the following:

file code: #xE4 #xB8 #x80 (encoded by coding system utf-8-dos)

character: 一 (displayed as 一) (codepoint 19968, #o47000, #x4e00)

The examples below assume the data file abc is saved as UTF-8. The most basic form is

perl -ne 'print if m/^一、/' abc

Spelling out the UTF-8 bytes also works

perl -ne 'print if m/^\xe4\xb8\x80、/' abc

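The following form is wrong. Without the utf8 pragma, Perl sees the pattern as bytes, so [一] is a class of three separate bytes (E4, B8, 80) and matches a single byte rather than the whole character; it cannot match 一、 and can hit bytes that belong to other characters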

~~perl -ne 'print if m/^[一]、/' abc~~

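Chinese numerals were not assigned consecutive code points in Unicode (一 is U+4E00, 二 is U+4E8C, 三 is U+4E09), so a range in the style of [0-9] is not an option either…

The following form is what appears to work for now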

perl -ne 'print if m/^(一|二|三|四|五|六|七|八|九|十)+、/' abc

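To match arbitrary Chinese characters, the '-Mopen qw/:std :utf8/' option is required. Its main job is to decode the UTF-8 input into Unicode characters automatically, so there is no need to call decode('utf8',$string) by hand. For example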

perl '-Mopen qw/:std :utf8/' -ne 'print if /\p{Han}/' abc
perl '-Mopen qw/:std :utf8/' -ne 'print if /[\x{4e00}-\x{9fcc}]/' abc

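For characters in the basic CJK block (U+4E00 to U+9FCC) the two forms give the same result; \p{Han} additionally covers Han characters outside that range. A literal Chinese character inside //, however, is still treated as raw UTF-8 bytes, so once the input is decoded the following form is wrong

~~perl '-Mopen qw/:std :utf8/' -ne 'print if m/^一、/' abc~~

The literal has to be decoded before it is used in the pattern:

perl '-Mopen qw/:std :utf8/' '-MEncode' -ne '$test=decode("utf8","一");print if m/$test/' abc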

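Ideally I would like to mix literal Chinese characters and metacharacters freely in a regular expression, but I have not found a way to do that so far…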

Tuesday, May 14, 2013

Japanese input with Boshiamy (嘸蝦米)

https://sites.google.com/site/freelearnwang/jp/jp-in-met-1/wuxiami-riwenshurufa

Of these, the long-vowel mark (長音) is a special case: it is entered as "ee," or "ee."

Thursday, May 9, 2013

Displaying and entering special characters in Emacs

http://ergoemacs.org/emacs/keystroke_rep.html

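Text converted from a Word .doc file, for example, tends to contain many ^K characters. They appear where the user, working in an automatically numbered list, pressed Shift+Enter to insert a manual line break. Word also ends lines with \r\n while Emacs uses only \n, so the extra \r shows up as ^M at the end of every line. When Emacs recognizes the file as coming from Windows, the mode line at the bottom left shows U(DOS) and the line endings are displayed correctly; when it fails to recognize this, for instance because several line-ending styles are mixed in one file, the ^M characters remain visible.

To enter such characters, use C-q followed by the control key: C-q C-j, for example, inserts \n (that is, ^J, the character 0x0A) into the buffer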

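Monday, May 6, 2013

Opening UTF-8 files in a Perl one-liner

It started when I tried to match a single Chinese character with a regular expression and, to my surprise, could not get it to work. http://stackoverflow.com/questions/12312310/how-to-match-chinese-characters-in-perl?lq=1 states the solution so confidently that I began to doubt myself…

So I wrote a .pl file to try it out. That failed at first too, but it made me suspect that the problem was the file not being opened as UTF-8. I later found http://zenoga.tumblr.com/post/40094864918/unicode-aware-perl-one-liners , which solved it. It does not look like much, but it cost me half a morning…

The line below is the result. It extracts each run of consecutive Chinese characters, which can be used to split Chinese text into phrases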

perl '-Mopen qw/:std :utf8/' -ne 'while(m/(\p{Han}+)/g){print "$1\n";}' test.simp.utf8

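Word-frequency counting is said to call for a PAT tree, suffix tree, or trie. After some googling, http://search.cpan.org/~avif/Tree-Trie-1.9/Trie.pm was the only one that looked suitable for counting Chinese word frequencies. There is also a ready-made word-segmentation module called mmseg (Lingua::ZH::MMSEG), used as follows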

perl '-Mopen qw/:utf8/' '-MLingua::ZH::MMSEG' -Mutf8 -ne 'print join "\n", mmseg' jian.txt

The next step is to feed the output to AI::Categorizer or AI::Classifier and see how it turns out…

Update 2017-02-08:
On Windows, change the command to
perl "-Mopen qw/:utf8/" -MLingua::ZH::MMSEG -Mutf8 -ne "print join \"\n\", mmseg" XXX