行有餘力則以學文

Friday, July 15, 2016

Updating R on Ubuntu

Reference: http://askubuntu.com/questions/503270/problem-with-project-r-installation

One command from that thread is especially important:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

Without it, apt cannot verify the repository's signing key and the update will not go through!

Sunday, June 26, 2016

Querying the NCBI databases with rentrez in RStudio

First, of course, RStudio has to be installed; see this site's earlier post from July 2015.

Then install rentrez:

install.packages("rentrez")

Load rentrez (the lib.loc path below is the one on my machine):

library("rentrez", lib.loc="/usr/local/lib/R/site-library")

The session below follows https://cran.r-project.org/web/packages/rentrez/vignettes/rentrez_tutorial.html

> library("rentrez", lib.loc="/usr/local/lib/R/site-library")
> entrez_dbs()
[1] "pubmed" "protein"
[3] "nuccore" "nucleotide"
[5] "nucgss" "nucest"
[7] "structure" "genome"
[9] "annotinfo" "assembly"
[11] "bioproject" "biosample"
[13] "blastdbinfo" "books"
[15] "cdd" "clinvar"
[17] "clone" "gap"
[19] "gapplus" "grasp"
[21] "dbvar" "gene"
[23] "gds" "geoprofiles"
[25] "homologene" "medgen"
[27] "mesh" "ncbisearch"
[29] "nlmcatalog" "omim"
[31] "orgtrack" "pmc"
[33] "popset" "probe"
[35] "proteinclusters" "pcassay"
[37] "biosystems" "pccompound"
[39] "pcsubstance" "pubmedhealth"
[41] "seqannot" "snp"
[43] "sra" "taxonomy"
[45] "unigene" "gencoll"
[47] "gtr"
> entrez_db_summary("geoprofiles")
DbName: geoprofiles
MenuName: GEO Profiles
Description: Genes Expression Omnibus
DbBuild: Build141002-1115.90
Count: 108708851
LastUpdate: 2016/06/21 04:48

> entrez_db_searchable("geoprofiles")
Searchable fields for database 'geoprofiles'
ALL All terms from all searchable fields
UID Unique number assigned to publication
FILT Limits the records
ORGN Exploded organism names
ACCN Accession for GDS (DataSet), GPL (Platform), GSM (Sample), GSE (Series)
GDST GDS text from title and description
GEOT Sample titles
RTYP Platform reporter type, e.g. genbank, clone, orf
GTYP Type of dataset
VTYP Sample value type, e.g. log ratio, count
NSAM Number of samples
SRC Sample source
ID Spot ID from GEO Platform, SAGE tag, Affy ProbeSet ID
NAME Name or identifier for the spot, e.g. GenBank accession, CLONE_ID, ORF etc.
SYMB Gene symbol (name) from Entrez-Gene or Entrez-UniGene.
GDSC Gene Description
RSTD Ranked standard deviation
RMAX Maximal value of ranks
RMIN Minimal value of ranks
FINF Indicates an interesting or notable uid in the GDS context
FTYP Type of flag that indicates a uid of interest, or outliers etc.
GI GenBank Identifier
ATYP Type of annotation (gene, unigene, nucleotide)
GO Gene Ontology
CHR Chromosomes
CPOS Chromosome base position
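The field tags listed above can be combined into a search term. The following is only a minimal sketch, assuming the usual Entrez `value[FIELD]` query syntax. `build_term` is a hypothetical helper (not part of rentrez), and the gene symbol and organism are illustrative values. The actual search needs network access, so it is left commented out.

```r
# Hypothetical helper (not part of rentrez): turn a named vector such as
# c(SYMB = "TP53") into field-tagged clauses joined with AND.
build_term <- function(fields) {
  clauses <- paste0(unname(fields), "[", names(fields), "]")
  paste(clauses, collapse = " AND ")
}

# Illustrative values only: a gene symbol and an organism name.
term <- build_term(c(SYMB = "TP53", ORGN = "Homo sapiens"))
print(term)

# With network access, the term could then be handed to rentrez:
# library(rentrez)
# res <- entrez_search(db = "geoprofiles", term = term, retmax = 5)
# res$count   # number of matching records
# res$ids     # UIDs of the first retmax hits
```

Keeping the term construction separate from the network call makes it easy to check the query string before sending it to NCBI.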

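Friday, June 17, 2016

Scripting languages make a strong comeback

Chart taken from http://www.tiobe.com/tiobe_index

Python, Perl, and Ruby have all squeezed into the top 10 at the same time, while C's rating has dropped considerably. Isn't that how it should be? The tasks most people face ought to be quick to handle in a scripting language; doing the same in C/C++ takes many times more lines of code and far more time.

That said, C# and Objective-C look to be in danger too…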

Sunday, April 3, 2016

The Windows "The file name(s) would be too long for the destination folder" problem

https://support.microsoft.com/en-us/kb/2891362

http://answers.microsoft.com/en-us/windows/forum/windows_10-files/source-path-too-long-bug-in-windows-10/b0cb82b0-85c1-4fcf-81cd-041b2175563e?page=3

Windows 10 still has not fixed this… sigh

Saturday, March 5, 2016

What is the rationale behind the magic number 30 in statistics? What's the difference between LLN and CLT?

n>30?

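The central limit theorem on its own does not prove that the sampling distribution is approximately normal once n>30, i.e. that the standard normal distribution can stand in for the t distribution. See the following link: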

https://www.researchgate.net/post/What_is_the_rationale_behind_the_magic_number_30_in_statistics
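To see how much the two distributions still differ around n = 30, the two-sided 95% critical values can be compared directly. This is only an illustration using base R's `qt` and `qnorm`; the chosen sample sizes and the 95% level are my own assumptions, not from the linked discussion.

```r
# Two-sided 95% critical values: t with n - 1 degrees of freedom
# versus the standard normal, for a few illustrative sample sizes.
n <- c(5, 10, 30, 100, 1000)
t_crit <- qt(0.975, df = n - 1)
z_crit <- qnorm(0.975)

print(data.frame(n = n,
                 t_crit = round(t_crit, 3),
                 z_crit = round(z_crit, 3),
                 gap = round(t_crit - z_crit, 3)))
```

At n = 30 the t critical value is about 2.045 against 1.960 for the normal, so the gap shrinks steadily but nothing special happens at 30.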

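For now it seems one still has to compute the required power case by case, and work backwards from it to the minimum number of samples to draw.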
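As one way to work backwards from power to sample size, base R's `power.t.test` can solve for n. The inputs here (an effect of 0.5 standard deviations, a 5% significance level, 80% power, a two-sample two-sided t test) are assumptions picked purely for illustration.

```r
# Assumed inputs (illustrative only): effect of 0.5 SD, alpha = 0.05, power = 0.80.
res <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")

# power.t.test returns a fractional n; round up to whole subjects per group.
n_per_group <- ceiling(res$n)
print(n_per_group)
```

Under these assumptions the answer is roughly 64 per group, which shows that the required n depends on the effect size and the target power rather than on a fixed threshold.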

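Drawing n times with replacement vs. drawing n units at once

Historically, the normal distribution was derived from the binomial distribution. The vast majority of authors without a mathematical statistics background misread the n in the central limit theorem (CLT): it is not n units drawn in one sample but "n Bernoulli trials", that is, the number of observations. See the following link: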

https://zh.wikipedia.org/wiki/%E4%B8%AD%E5%BF%83%E6%9E%81%E9%99%90%E5%AE%9A%E7%90%86
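The binomial origin of the normal curve can be checked numerically. The sketch below compares the exact Binomial(n, p) distribution function with its normal approximation as the number of Bernoulli trials grows. The value p = 0.3, the trial counts, and the continuity correction of 0.5 are my own choices, and `max_gap` is a hypothetical helper name.

```r
# Largest gap between the Binomial(n, p) CDF and its normal approximation
# (with a 0.5 continuity correction), checked at every k = 0..n.
max_gap <- function(n, p = 0.3) {
  k <- 0:n
  exact  <- pbinom(k, size = n, prob = p)
  normal <- pnorm((k + 0.5 - n * p) / sqrt(n * p * (1 - p)))
  max(abs(exact - normal))
}

trials <- c(10, 30, 100, 1000)   # number of Bernoulli trials
gaps <- sapply(trials, max_gap)
print(data.frame(trials = trials, max_gap = signif(gaps, 3)))
```

The gap keeps shrinking as the number of trials increases, and again there is no visible break at 30.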

LLN

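Before the argument for the central limit theorem there is a small heading that often gets overlooked: the law of large numbers (LLN). It is the LLN that actually carries the meaning "draw n units at once and, as n grows, the sample mean gets closer to (converges to) the population mean". But unless the population distribution is well behaved (normal or near-normal is the comfortable case; at the very least the population mean must be finite), it may fail to hold!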
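A small simulation makes the point concrete. The two populations are assumptions chosen for illustration: an exponential distribution (skewed, with mean 1) and a Cauchy distribution (whose mean does not exist). `running_mean` is a hypothetical helper name.

```r
set.seed(1)  # fixed seed so the run is repeatable

# Mean of the first 1, 2, ..., n draws.
running_mean <- function(x) cumsum(x) / seq_along(x)

n <- 1e5
checkpoints <- c(10, 100, 1000, 10000, 100000)

expo   <- running_mean(rexp(n, rate = 1))   # population mean is 1
cauchy <- running_mean(rcauchy(n))          # population mean does not exist

print(data.frame(n = checkpoints,
                 exponential = round(expo[checkpoints], 3),
                 cauchy = round(cauchy[checkpoints], 3)))
```

The exponential column should settle near 1, while the Cauchy column has no value to settle on and can keep jumping however large n gets.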

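When the population is finite, the population variance estimated from the sample variance has to be corrected with the "finite population correction factor", because the sample variance consistently overestimates the population variance. Strictly speaking, there is no need to bring up variance in the LLN chapter at all; doing so tends to cause beginners even more confusion, and it is better introduced later, in the material on sampling distributions.

The reason such corrections are rarely seen in everyday statistical calculations is that the various sampling distributions were built on the uncorrected variance, so even though the standard error in use is not an unbiased estimator, nobody notices…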
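For a population small enough to enumerate, the correction can be verified exactly. The sketch below assumes a toy population of five values and samples of size 2 drawn without replacement. It uses the standard results Var(x̄) = (σ²/n)·(N−n)/(N−1) and E[s²] = σ²·N/(N−1), where σ² is the population variance with divisor N.

```r
pop <- c(1, 2, 3, 4, 5)   # illustrative finite population, N = 5
N <- length(pop)
n <- 2

sigma2 <- mean((pop - mean(pop))^2)   # population variance with divisor N

samples <- combn(pop, n)              # every possible sample without replacement
xbar <- colMeans(samples)             # sample mean of each sample
s2   <- apply(samples, 2, var)        # sample variance with divisor n - 1

var_xbar <- mean((xbar - mean(xbar))^2)   # exact variance of the sample mean
fpc      <- (N - n) / (N - 1)             # finite population correction factor

print(c(var_xbar = var_xbar,
        corrected = sigma2 / n * fpc,
        uncorrected = sigma2 / n))
print(c(mean_s2 = mean(s2), sigma2 = sigma2, N_over_N_minus_1 = N / (N - 1)))
```

Here the exact variance of the sample mean matches the corrected formula (0.75) rather than the uncorrected one (1), and the average sample variance (2.5) exceeds σ² (2) by the factor N/(N−1).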