行有餘力則以學文: Matching Chinese numerals with regular expressions in Perl

Thursday, May 16, 2013

For the Chinese character 一 (one), pressing C-u C-x = in Emacs shows the following information

file code: #xE4 #xB8 #x80 (encoded by coding system utf-8-dos)

character: 一 (displayed as 一) (codepoint 19968, #o47000, #x4e00)
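The same information can be reproduced from Perl itself. This is a minimal sketch, assuming the script is saved as UTF-8; the helper names `codepoint_of` and `utf8_bytes_of` are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                      # this source file is UTF-8, so '一' is one character
use Encode qw(encode);

# Return the code point of a single character as "U+XXXX".
sub codepoint_of {
    my ($char) = @_;
    return sprintf('U+%04X', ord($char));
}

# Return the UTF-8 encoding of a character string as space-separated hex bytes.
sub utf8_bytes_of {
    my ($str) = @_;
    my $bytes = encode('UTF-8', $str);
    return join(' ', map { sprintf('%02X', ord($_)) } split(//, $bytes));
}

my $cp    = codepoint_of('一');
my $bytes = utf8_bytes_of('一');
print "$cp\n";       # U+4E00
print "$bytes\n";    # E4 B8 80
```

The three bytes are what a regex sees when the input has not been decoded; the single code point is what it sees after decoding.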

In what follows, the data file abc is assumed to be saved as UTF-8. The most basic form is

perl -ne 'print if m/^一、/' abc

Spelling out the UTF-8 bytes also works

perl -ne 'print if m/^\xe4\xb8\x80、/' abc

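The following form is wrong. Without decoding, [] works on single bytes: [一] becomes a class of the three bytes E4, B8 and 80 and matches only one of them, so it cannot match the whole character, and when unanchored it can hit bytes that belong to other characters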

~~perl -ne 'print if m/^[一]、/' abc~~
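The byte-level behavior can be checked with a short script. This is a sketch under the assumption that the input is raw UTF-8 bytes, as it is for plain `perl -ne`; the sub names are hypothetical, and `encode` is used only to produce the bytes that would otherwise come from the file.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# Raw UTF-8 bytes, as perl -ne sees them without a decoding layer.
my $class = encode('UTF-8', '一');    # three bytes: E4 B8 80
my $sep   = encode('UTF-8', '、');    # three bytes: E3 80 81

# The anchored pattern from the struck-out command, at byte level.
sub anchored_class_matches {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    return $bytes =~ /^[$class]$sep/ ? 1 : 0;
}

# The bare class, unanchored: does any single byte of the text hit it?
sub bare_class_matches {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    return $bytes =~ /[$class]/ ? 1 : 0;
}

print anchored_class_matches('一、'), "\n";   # 0: the class eats only byte E4
print bare_class_matches('二'), "\n";         # 1: 二 is E4 BA 8C, shares byte E4
print bare_class_matches('abc'), "\n";        # 0
```

So the class both misses the intended character and hits unrelated ones, depending on how it is anchored.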

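Because the Chinese numerals were not assigned consecutive code points when Unicode was drawn up, a range in the style of [0-9] cannot be used either…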
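To see why a range cannot work, the code points can simply be listed. This is a small sketch, assuming a UTF-8 source file; `numeral_codepoints` and `is_consecutive` are illustrative names, not part of any library.

```perl
use strict;
use warnings;
use utf8;                              # numerals below are character literals
binmode STDOUT, ':encoding(UTF-8)';    # allow printing the characters themselves

my @numerals = qw(一 二 三 四 五 六 七 八 九 十);

# Map each character to its code point.
sub numeral_codepoints {
    my @chars = @_;
    return map { ord($_) } @chars;
}

# True (1) only if every code point is exactly one more than the previous one.
sub is_consecutive {
    my @cp = @_;
    for my $i (1 .. $#cp) {
        return 0 if $cp[$i] != $cp[$i - 1] + 1;
    }
    return 1;
}

printf("%s U+%04X\n", $_, ord($_)) for @numerals;
print "consecutive: ", is_consecutive(numeral_codepoints(@numerals)), "\n";
```

The ASCII digits pass the same check, which is why [0-9] works for them but has no counterpart here.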

The following form is the one that seems to work so far

perl -ne 'print if m/^(一|二|三|四|五|六|七|八|九|十)+、/' abc

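To match arbitrary Chinese characters, you need the '-Mopen qw/:std :utf8/' option. Its main job is to decode the UTF-8 input into Unicode characters automatically as it is read, so there is no need to call decode('utf8',$string) by hand. For example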

perl '-Mopen qw/:std :utf8/' -ne 'print if /\p{Han}/' abc
perl '-Mopen qw/:std :utf8/' -ne 'print if /[\x{4e00}-\x{9fcc}]/' abc
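The two patterns can be compared character by character. This is a minimal sketch with made-up helper names; the sample code points are chosen for illustration: U+4E00 (一), U+3400 (the first character of CJK Extension A) and U+3007 (〇, ideographic number zero).

```perl
use strict;
use warnings;

# Does a single decoded character belong to the Han script?
sub in_han_script {
    my ($char) = @_;
    return $char =~ /\A\p{Han}\z/ ? 1 : 0;
}

# Does a single decoded character fall inside the explicit range?
sub in_basic_range {
    my ($char) = @_;
    return $char =~ /\A[\x{4e00}-\x{9fcc}]\z/ ? 1 : 0;
}

for my $char ("\x{4e00}", "\x{3400}", "\x{3007}") {
    printf("U+%04X han=%d range=%d\n",
        ord($char), in_han_script($char), in_basic_range($char));
}
# U+4E00 han=1 range=1
# U+3400 han=1 range=0
# U+3007 han=1 range=0
```

For everyday text the two agree, but \p{Han} is the broader test, which matters for characters such as 〇 that sit outside the main block.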

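The \p{Han} and [\x{4e00}-\x{9fcc}] forms behave the same for common Chinese characters, although \p{Han} also covers Han characters outside the U+4E00 to U+9FCC range. A Chinese character written literally inside //, however, is evidently still treated as raw UTF-8 bytes, which no longer match the decoded input, so the following is wrong

~~perl '-Mopen qw/:std :utf8/' -ne 'print if m/^一、/' abc~~

Decoding the literal first makes it work

perl '-Mopen qw/:std :utf8/' '-MEncode' -ne '$test=decode("utf8","一");print if m/$test/' abc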

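Ideally I would also like to mix Chinese characters and wildcards in one regular expression, but for now it looks like there is no solution…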
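One approach that may address this, and which is not tried in the post above, is Perl's `utf8` pragma. It declares that the source code itself is UTF-8, so literals inside // become characters rather than bytes and can sit next to \p{Han} or a character class. This is a sketch assuming the input has already been decoded; the sub name and the sample strings are hypothetical.

```perl
use strict;
use warnings;
use utf8;    # regex literals in this file are characters, not UTF-8 bytes

# Match a decoded line that starts with one or more Chinese numerals,
# then 、, then any Han character: literals, a class and \p{Han} mixed.
sub starts_with_numbered_item {
    my ($line) = @_;
    return $line =~ /^[一二三四五六七八九十]+、\p{Han}/ ? 1 : 0;
}

print starts_with_numbered_item('十一、章'), "\n";    # 1
print starts_with_numbered_item('二、x'), "\n";       # 0: x is not a Han character
print starts_with_numbered_item('abc'), "\n";         # 0

# On the command line, adding -Mutf8 next to '-Mopen qw/:std :utf8/'
# should have the same effect on the -e code (an assumption, not tested
# against the abc file from the post).
```

With characters on both sides of the match, the [] form that fails at byte level behaves as a normal character class.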