When You Have Energy to Spare, Devote It to Study

Wednesday, August 13, 2014

Interesting speech recognition projects on SourceForge

http://cmusphinx.sourceforge.net/

This one is developed by CMU.

http://julius.sourceforge.jp/en_index.php

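This one is developed by Kyoto University.

There is no Chinese-language project yet, but with so many videos and subtitles available online these days, using them to train these recognition engines ought to be quite convenient…

Interesting Python projects on GitHub

Source: https://github.com/trending?l=python

1. scrapy. According to its documentation, it is a web-crawler-style project. Interesting, but the sheer amount of documentation is intimidating at first glance…

2. hardseed. A program written in mainland China for grabbing "Bulbasaur seeds" (slang for torrents)~~; the key point is that it can use a proxy to get around the Golden Shield (the Great Firewall), so it is hugely popular over there…~~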

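Perl one-liner: querying Firefox browser data with mozrepl -- using the table on a The Pirate Bay page as an example

The problem is broken into three parts:

Fetch the page source. The result is a single file; generic.
Extract the table source. The result is row-oriented records whose fields are separated by a specific delimiter; site-specific.
Post-process the table source to get the desired data, arranged in the required format; requirement-specific.

1. Fetch the page source. Getting the source through mozrepl has several advantages: login is not an issue, there is no decompression to deal with, and pages that rely on JavaScript are handled.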

perl -MNet::Telnet -e "$t=Net::Telnet->new(Errmode=>'return');$t->open(Host=>'localhost',Port=>4242) or die $t->errmsg;$t->print('content.document.body.innerHTML');while(defined(my $data=$t->get(Timeout=>1))){print $data;}"

This snippet fetches the source of the page currently displayed in the browser.
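The same idea can be written as a small script. The sketch below is not the method used above: it swaps Net::Telnet for the core IO::Socket::INET and IO::Select modules so that no extra packages are needed, and the helper name repl_eval and the "quiet for one second" stopping rule are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

# Send one JavaScript expression to an already-connected REPL socket and
# collect the reply until the socket stays quiet for $idle seconds.
# repl_eval is a hypothetical helper name, not part of mozrepl.
sub repl_eval {
    my ($sock, $expr, $idle) = @_;
    syswrite($sock, "$expr\n");
    my $sel = IO::Select->new($sock);
    my $out = '';
    while ($sel->can_read($idle)) {
        my $n = sysread($sock, my $buf, 65536);
        last unless $n;          # 0 = peer closed, undef = read error
        $out .= $buf;
    }
    return $out;
}

# Only touch the network when a port is given, e.g. "perl fetch.pl 4242".
if (@ARGV) {
    my $sock = IO::Socket::INET->new(
        PeerAddr => 'localhost',
        PeerPort => $ARGV[0],
        Proto    => 'tcp',
    ) or die "cannot connect to mozrepl: $!";
    repl_eval($sock, '', 1);     # drain the greeting, if any
    print repl_eval($sock, 'content.document.body.innerHTML', 1);
}
```

The output may still contain whatever the REPL wraps around the value (a prompt, surrounding quotes), so the next stage should tolerate some noise around the HTML.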

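2. Extract the table source. The items can be enumerated with either map or foreach; both are shown here for reference. The example page is http://thepiratebay.se/top/all, which lists the day's top 100 resources. To make the result easy to inspect, it is saved to top100.txt.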

perl -MNet::Telnet -e "$t=Net::Telnet->new(Errmode=>'return');$t->open(Host=>'localhost',Port=>4242) or die $t->errmsg;$t->print('content.document.body.innerHTML');while(defined(my $data=$t->get(Timeout=>1))){print $data;}" | perl -MData::Dumper -MHTML::TreeBuilder::XPath -MHTML::Element -e "$string = do { local($/); <> }; $tree=HTML::TreeBuilder::XPath->new_from_content($string);my @results=$tree->findnodes('/html/body/div[@id=\"content\"]/div[@id=\"main-content\"]/table[@id=\"searchResult\"]/tbody');foreach my $table(@results){foreach my $row($table->findnodes('.//tr')){my @cells=$row->findnodes('.//td');print join(\"\n\", map{ $_->as_HTML if ref($_)} $cells[1]->content_list()), \"\n\";foreach $acell($cells[1]->content_list()){print $acell->as_text.\"\n\" if ref($acell);};print $cells[0]->string_value,\"\n\";}print \"\n\";}" > top100.txt

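The first half of this snippet is the page-source fetcher from step 1; a pipe passes its output to the second half, which does the filtering.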
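The two enumeration styles mentioned in step 2 can be compared in isolation. In HTML::Element, content_list returns a mix of element objects and plain text strings, which is why the one-liner checks ref before calling a method. The sketch below uses a hypothetical FakeNode class as a stand-in so that it runs without any CPAN modules; the sample cell contents are made up.

```perl
use strict;
use warnings;

# Minimal stand-in for HTML::Element so the sketch runs without CPAN modules.
# FakeNode is hypothetical; only as_text is imitated.
{
    package FakeNode;
    sub new     { my ($class, $text) = @_; return bless { text => $text }, $class }
    sub as_text { return $_[0]{text} }
}

# map style: filter with grep first so non-objects produce nothing at all.
sub texts_with_map {
    return map { $_->as_text } grep { ref $_ } @_;
}

# foreach style: same result, with room for extra per-item logic.
sub texts_with_foreach {
    my @out;
    foreach my $item (@_) {
        next unless ref $item;   # skip plain strings (bare text)
        push @out, $item->as_text;
    }
    return @out;
}

# Made-up cell contents: two elements with a bare newline between them.
my @cell = (FakeNode->new('Some.Title'), "\n", FakeNode->new('Uploaded 08-13'));
print join("\n", texts_with_map(@cell)), "\n";
print join("\n", texts_with_foreach(@cell)), "\n";
```

Filtering with grep before map avoids the empty entries that a bare "if" inside map leaves behind for non-object items.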

3. To be continued...

Tuesday, August 12, 2014

Perl one-liner: querying Firefox browser data with mozrepl -- using the page title as an example

perl -MNet::Telnet -MEncode -e "$t=Net::Telnet->new(Dump_Log=>\*STDOUT,Errmode=>'return');$t->open(Host=>'localhost',Port=>4242) or die $t->errmsg;$t->print('document.title');while(defined(my $data=$t->get(Timeout=>1))){print encode('big5',decode('utf8',$data));}"

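mozrepl must be installed first.
This example was tested successfully on Windows 8.
Adjust the encode part to suit your environment; on Ubuntu it can be removed entirely.
You might ask why not just use WWW::Mechanize::Firefox. The problem is that nobody has ported it to Windows XD
Let's try porting WWW::Mechanize::Firefox. After downloading and unpacking it, running perl Makefile.PL produces the following warnings: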
Warning: prerequisite HTML::Selector::XPath 0 not found.
Warning: prerequisite MozRepl::RemoteObject 0.31 not found.
Warning: prerequisite Object::Import 0 not found.
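Open ppm and install the three packages above.
MozRepl::RemoteObject may not be installable through ppm. In that case, download the package's tar.gz file, unpack it, run perl Makefile.PL inside the extracted directory, and then copy everything under its lib subdirectory to C:\Perl64\site\lib (adjust for wherever Perl is installed).
As an aside, a package that is not available in ppm can be installed with cpan WWW::Mechanize::Firefox, as long as it does not need a C compiler.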
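The note about adjusting the encode step per environment can be sketched as a small helper. This is only an illustration: the function name recode_for_console, the rule "convert on Windows, pass through elsewhere", and the sample title text are all assumptions, and big5 is simply the encoding used in the one-liner above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Re-encode UTF-8 bytes for the local console.
# $console_enc is an assumption supplied by the caller, e.g. 'big5' for a
# Traditional Chinese Windows console; pass undef to leave data untouched.
sub recode_for_console {
    my ($bytes, $console_enc) = @_;
    return $bytes unless defined $console_enc;
    return encode($console_enc, decode('UTF-8', $bytes));
}

# Hypothetical policy: convert on Windows, pass through elsewhere.
my $console_enc = $^O eq 'MSWin32' ? 'big5' : undef;
my $title_bytes = encode('UTF-8', "\x{6A19}\x{984C}");   # sample title text
binmode STDOUT;
print recode_for_console($title_bytes, $console_enc), "\n";
```

Keeping the conversion in one function means the rest of the script is the same on Windows and Ubuntu; only the value of $console_enc changes.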

Using mozrepl from PuTTY/PieTTY and PHP

http://www.codediesel.com/tools/peeking-inside-firefox-using-mozrepl/

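Oddly, when I use PieTTY the connection is dropped as soon as it is established. PieTTY's own settings probably need looking into, because connecting with a hand-written Perl script does not get cut off…

Thursday, August 7, 2014

Perl one-liner: querying web page data -- using downloading every subtitle on a yyets page as an example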

perl -MLWP::Simple -e "getprint('http://www.yyets.com/search/index?keyword=%E7%A1%85%E8%B0%B7&type=tv');" |perl -e "while(<>){print \"start http://www.yyets.com/subtitle/index/download?id=$1\n\" if m/\"http.+?subtitle\/(.+?)\"/;}" > abc.bat

Then run the generated abc.bat.
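The second stage of that pipeline can also be kept as a reusable script. The sketch below reuses the link pattern and download URL from the one-liner; the function name batch_lines and the file names in the usage comment are hypothetical.

```perl
use strict;
use warnings;

# Turn HTML lines into "start <download URL>" commands for a .bat file.
# The link pattern and download URL follow the one-liner above.
sub batch_lines {
    my @out;
    for my $line (@_) {
        # Like the one-liner, this keeps only the first match on each line.
        next unless $line =~ m{"http.+?subtitle/(.+?)"};
        push @out, "start http://www.yyets.com/subtitle/index/download?id=$1\n";
    }
    return @out;
}

# Usage (hypothetical file names): perl make_bat.pl saved_page.html > abc.bat
print batch_lines(<>) if @ARGV;
```

Separating the extraction into a function makes it easy to check the pattern against a saved copy of the page before running the generated batch file.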

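Wednesday, August 6, 2014

Perl one-liner: querying web page data -- using the 2014 (ROC year 103) university entrance exam (指考) results as an example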

perl -MLWP::Simple -e "for $i(21011601..21011842){getprint('http://fast.uac.edu.tw/'.$i);}" | perl -MHTML::Entities -e "while (<>){print decode_entities( \"$1\n\" )if m/(准考證號 :.*?)<\/BODY/;}"

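This is the form that runs on Windows 8; other platforms may need some adjustments.
The LWP and HTML modules were already installed.
A pipe directs the output of the first program into the second, where while(<>) reads it line by line.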
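The two halves of that pipeline, generating one URL per ID and pulling a labelled field out of each line, can be sketched as two small functions. The names result_urls and extract_result are hypothetical, the label is passed in as a parameter rather than hard-coded, and the entity decoding done with HTML::Entities in the one-liner is left out here so that only core Perl is needed.

```perl
use strict;
use warnings;

# Build one result URL per ID in an inclusive range.
sub result_urls {
    my ($base, $first, $last) = @_;
    return map { $base . $_ } $first .. $last;
}

# Return the text from $label up to the closing BODY tag, or undef if the
# line does not contain it.
sub extract_result {
    my ($line, $label) = @_;
    return $line =~ m{(\Q$label\E.*?)</BODY} ? $1 : undef;
}

# Filter mode (hypothetical file names): perl extract.pl "LABEL" < pages.html
if (@ARGV) {
    my $label = shift @ARGV;
    while (my $line = <STDIN>) {
        my $hit = extract_result($line, $label);
        print "$hit\n" if defined $hit;
    }
}
```

For example, result_urls('http://fast.uac.edu.tw/', 21011601, 21011842) yields the same list of addresses that the for loop in the one-liner visits.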
