Tuesday, November 4, 2014

R输入输出

R的输入输出主要有几个类型:文本文件,xls文件,二进制文件,数据库文件,R文件,其它统计软件来源的文件,
首先是最简单的,文本文件,包括csv文件。假设我们有一个文件,题为qiuworld.com.txt,内容为
#this is a sample
ArrayDataFile SourceName FactorValue
GSM286765.CEL t_24h_rep1 treated_24h
GSM286759.CEL t_12h_rep1 treated_12h
GSM286763.CEL c_24h_rep2 control_24h
GSM286760.CEL t_12h_rep2 treated_12h
GSM286757.CEL c_12h_rep2 control_12h
GSM286766.CEL t_24h_rep2 treated_24h
GSM286756.CEL c_12h_rep1 control_12h
GSM286762.CEL c_24h_rep1 control_24h
我们现在需要把它读入R,可以使用read.table命令。read.table还有几个快捷的形式,比如read.delim,read.delim2,read.csv,read.csv2。这几个快捷的方式帮助我们减少参数的书写。一般的,如果是csv文件,它的分隔符是逗号,字符串的两边会加上引号,可以直接使用read.csv(“文件名”)的方式读入数据。
> read.table("qiuworld.com.txt",header=TRUE,sep="\t",quote="",comment.char="#")
  ArrayDataFile SourceName FactorValue
1 GSM286765.CEL t_24h_rep1 treated_24h
2 GSM286759.CEL t_12h_rep1 treated_12h
3 GSM286763.CEL c_24h_rep2 control_24h
4 GSM286760.CEL t_12h_rep2 treated_12h
5 GSM286757.CEL c_12h_rep2 control_12h
6 GSM286766.CEL t_24h_rep2 treated_24h
7 GSM286756.CEL c_12h_rep1 control_12h
8 GSM286762.CEL c_24h_rep1 control_24h
> read.delim("qiuworld.com.txt",header=TRUE,sep="\t",quote="",comment.char="#")
  ArrayDataFile SourceName FactorValue
1 GSM286765.CEL t_24h_rep1 treated_24h
2 GSM286759.CEL t_12h_rep1 treated_12h
3 GSM286763.CEL c_24h_rep2 control_24h
4 GSM286760.CEL t_12h_rep2 treated_12h
5 GSM286757.CEL c_12h_rep2 control_12h
6 GSM286766.CEL t_24h_rep2 treated_24h
7 GSM286756.CEL c_12h_rep1 control_12h
8 GSM286762.CEL c_24h_rep1 control_24h
> read.csv("qiuworld.com.txt",header=TRUE,sep="\t",quote="",comment.char="#")
  ArrayDataFile SourceName FactorValue
1 GSM286765.CEL t_24h_rep1 treated_24h
2 GSM286759.CEL t_12h_rep1 treated_12h
3 GSM286763.CEL c_24h_rep2 control_24h
4 GSM286760.CEL t_12h_rep2 treated_12h
5 GSM286757.CEL c_12h_rep2 control_12h
6 GSM286766.CEL t_24h_rep2 treated_24h
7 GSM286756.CEL c_12h_rep1 control_12h
8 GSM286762.CEL c_24h_rep1 control_24h
有时候我们只需要读取文件的前几行,那么可以在上面的命令当中加入nrows参数,比如
> read.delim("qiuworld.com.txt",comment.char="#",nrows=5)
  ArrayDataFile SourceName FactorValue
1 GSM286765.CEL t_24h_rep1 treated_24h
2 GSM286759.CEL t_12h_rep1 treated_12h
3 GSM286763.CEL c_24h_rep2 control_24h
4 GSM286760.CEL t_12h_rep2 treated_12h
5 GSM286757.CEL c_12h_rep2 control_12h

有时候我们需要读取文件中间的几行,那么可以在上面的命令当中加入skip及nrows参数,比如
> read.delim("qiuworld.com.txt",header=F,nrows=5,skip=3)
             V1         V2          V3
1 GSM286759.CEL t_12h_rep1 treated_12h
2 GSM286763.CEL c_24h_rep2 control_24h
3 GSM286760.CEL t_12h_rep2 treated_12h
4 GSM286757.CEL c_12h_rep2 control_12h
5 GSM286766.CEL t_24h_rep2 treated_24h
问题是,以上的文件都是格式化的表格文件。如果文件并非表格式文件,而是一些的字符或者数字,并且每行长度不同呢?我们可以使用scan来输入
> scan("qiuworld.com.txt",what=character(),sep="\t",fill=F,comment.char="#")
Read 27 items
 [1] "ArrayDataFile" "SourceName"    "FactorValue"   "GSM286765.CEL" "t_24h_rep1"    "treated_24h"  
 [7] "GSM286759.CEL" "t_12h_rep1"    "treated_12h"   "GSM286763.CEL" "c_24h_rep2"    "control_24h"  
[13] "GSM286760.CEL" "t_12h_rep2"    "treated_12h"   "GSM286757.CEL" "c_12h_rep2"    "control_12h"  
[19] "GSM286766.CEL" "t_24h_rep2"    "treated_24h"   "GSM286756.CEL" "c_12h_rep1"    "control_12h"  
[25] "GSM286762.CEL" "c_24h_rep1"    "control_24h"
> scan("qiuworld.com.txt",what=list(ArrayDataFile=character(),SourceName=character(),FactorValue=character()),skip=2,nlines=5,comment.char="#")
Read 5 records
$ArrayDataFile
[1] "GSM286765.CEL" "GSM286759.CEL" "GSM286763.CEL" "GSM286760.CEL" "GSM286757.CEL"
 
$SourceName
[1] "t_24h_rep1" "t_12h_rep1" "c_24h_rep2" "t_12h_rep2" "c_12h_rep2"
 
$FactorValue
[1] "treated_24h" "treated_12h" "control_24h" "treated_12h" "control_12h"
scan命令功能强大,但是参数复杂。它还可以实现类似c当中的scanf的功能。这里不介绍。
下面的问题是如何写文件。写文本文件一般使用write.table,write.csv,write.csv2。
> x<-read.table("qiuworld.com.txt",header=TRUE,sep="\t",quote="",comment.char="#")
> write.csv(x,"qiuworld.com.csv",row.names=F)
问题,现在有一个sequence文件(比如fasta, fastq, BAM, gff, bed, 或者 wig)需要读入R,应该怎么办?
> library("Biostrings")
> read.DNAStringSet("qiuworld.com.fasta")
  A DNAStringSet instance of length 7
    width seq                                                                                 names               
[1]   624 TGGTTCAATCTAGTCTACGAATCTTCAGTTTATTGACTAG...TACTACATACACACACAGAAATACATACATACATACACCA I:10035327-10035950
[2]   371 ACAAACTCAAGGAAATTTCGATAAAGCAAGGAATATTGCA...AATCCGAGCACGGAAAAATGGGCCCGCGACCCCCGTTTTC I:10036127-10036497
[3]   211 TAGTTGTGTCTGTCTCCGTCTATGTATGTATGTCTGTGTG...ATAAAGACGGAGAGATGAAAACGAAAGAGGAAAAGGAACA I:10097210-10097420
[4]   165 TAATAAACGATTAGTTGTGGATGATTGTTTACATGATTAG...GCACGGGTAGAAGTACGTGTATGATGCCGATTCCAGGGAT I:10184570-10184734
[5]   249 CGGCGCTGCATTTCAACAACTATTTTGTGAGAGAGAGAAG...TATTCGTATCAAATTTTCACTTAAAAGATGTTTAAACAAA I:10235327-10235575
[6]   582 GATAGATGGGATGATGAACTTGAAGTGCTGGATCATCAAA...TCAGATGGCAAAGTAGGTTCGCGTCTCGATAATCGTGATT I:10265250-10265831
[7]   293 TGGTGTAGATGGTCCAGGAATTTGATCATAAAAAAATTGA...TGTACTGTGATTCTGTCATTTCACTAGTAAAGTCTCTAGT I:10277090-10277382
相关的命令还有
read.BStringSet(filepath, format="fasta",
                nrec=-1L, skip=0L, use.names=TRUE)
read.DNAStringSet(filepath, format="fasta",
                  nrec=-1L, skip=0L, use.names=TRUE)
read.RNAStringSet(filepath, format="fasta",
                  nrec=-1L, skip=0L, use.names=TRUE)
read.AAStringSet(filepath, format="fasta",
                 nrec=-1L, skip=0L, use.names=TRUE)
如果是短序文件的话,可以使用ShortRead库。
> library(ShortRead)
> seq<-readFasta("qiuworld.com.fasta")
> sread(seq)
  A DNAStringSet instance of length 7
    width seq
[1]   624 TGGTTCAATCTAGTCTACGAATCTTCAGTTTATTGACTAGTTAGCAGTAGC...CCATGGCCAGTACTACATACACACACAGAAATACATACATACATACACCA
[2]   371 ACAAACTCAAGGAAATTTCGATAAAGCAAGGAATATTGCAAAATGAACTTG...AAAAGGCTAGAATCCGAGCACGGAAAAATGGGCCCGCGACCCCCGTTTTC
[3]   211 TAGTTGTGTCTGTCTCCGTCTATGTATGTATGTCTGTGTGCGTGATGTGCG...GAAGATGAAAATAAAGACGGAGAGATGAAAACGAAAGAGGAAAAGGAACA
[4]   165 TAATAAACGATTAGTTGTGGATGATTGTTTACATGATTAGATTGGTGTCAG...GCATATACGTGCACGGGTAGAAGTACGTGTATGATGCCGATTCCAGGGAT
[5]   249 CGGCGCTGCATTTCAACAACTATTTTGTGAGAGAGAGAAGAAAAAGAGAAG...AACTTCAATCTATTCGTATCAAATTTTCACTTAAAAGATGTTTAAACAAA
[6]   582 GATAGATGGGATGATGAACTTGAAGTGCTGGATCATCAAATGTTGTCCATC...CAAAATGTTTTCAGATGGCAAAGTAGGTTCGCGTCTCGATAATCGTGATT
[7]   293 TGGTGTAGATGGTCCAGGAATTTGATCATAAAAAAATTGATTATGGTACAA...TCGTCGGCTGTGTACTGTGATTCTGTCATTTCACTAGTAAAGTCTCTAGT
问题,如果我有一个大文件,比如RNAseq分析之后的BAM或者BED文件,需要读入R,应该怎么办?首先需要了解的是,特别大的文件不可能一次读入内存的话,需要使用file和readLines来解决。
> con<-file("qiuworld.com.txt","r")
> while(length(line<-readLines(con,1))>0){cat(paste(line,"\n"))}
#this is a sample 
ArrayDataFile SourceName FactorValue 
GSM286765.CEL t_24h_rep1 treated_24h 
GSM286759.CEL t_12h_rep1 treated_12h 
GSM286763.CEL c_24h_rep2 control_24h 
GSM286760.CEL t_12h_rep2 treated_12h 
GSM286757.CEL c_12h_rep2 control_12h 
GSM286766.CEL t_24h_rep2 treated_24h 
GSM286756.CEL c_12h_rep1 control_12h 
GSM286762.CEL c_24h_rep1 control_24h 
Warning message:
In readLines(con, 1) : incomplete final line found on 'qiuworld.com.txt'
> close(con)
而对于大的BAM文件,需要使用的是Rsamtools包。下面是个例子
   library(Rsamtools)
   fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
 
#获取每个序列的长度 
#实例化GRanges类,每个序列一个实例。
 
   t <- scanBamHeader(fl)[[1]][["targets"]]
   which <- GRanges(names(t), IRanges(1, unname(t)))
 
#枚举序列, 使用'for'
 
   for (i in seq_along(which)) {
       ## read one chromosome
       param <- ScanBamParam(which=which[i], what=character())
       aln <- readGappedAlignments(fl, param=param)
       ## do more work...
   }
 
#或者 lapply
 
   lapply(seq_along(which), function(i, fl, which) {
       param <- ScanBamParam(which=which[i], what=character())
       aln <- readGappedAlignments(fl, param=param)
       ## do more work
       table(width(aln))
   }, fl, which)
读取xls文件。
> library(gdata)
> x <- read.xls("Mouse.MitoCarta.xls",1)
> library(xlsx)
> x<-read.xlsx("Mouse.MitoCarta.xls",1)
读取二进制文件使用readBin函数。代码非常简单。读取二进制文件的问题关键在于全面了解所需要读取的文件的数据格式,否则无法正确地读出内容。
打开文件还是使用file函数,其参数open要设置成rb,r表示读取,b表示二进制。
> to.read = file("http://www.ats.ucla.edu/stat/r/faq/bintest.dat", "rb")
> readBin(to.read, integer(), endian = "little")
[1] 1
> readBin(to.read, integer(), n = 4, endian = "little")
[1] 2 3 4 5
> readBin(to.read, integer(), n = 2, size = 4, endian = "little")
[1] 6 7
写文件就使用writeBin。
> data(mtcars)
> to.write = file("binfile.dat", "wb")
> writeBin(colnames(mtcars)[1:3],to.write)
> writeBin(mtcars$mpg,to.write)
> writeBin(mtcars$cyl,to.write)
> writeBin(mtcars$disp,to.write)
> close(to.write)
R的数据库操作主要有以下几个包:RODBC, RMySQL, ROracle, RJDBC。这里只介绍RMySQL。MySQL被广泛地应用于小型数据库的构建,结合apache, php成为当前网络应用的主流。使用RMySQL获取数据分为四步:建立联接,发送查寻语句,读取数据,关闭联接。这和php, perl等其它语言联接mysql并没什么不同。数据库写入只是查寻语句不同,并且不需要再用fetch读取数据而已。
> library(RMySQL)
> con <- dbConnect(MySQL(), user=user, password=pass, dbname=db, host=host, unix.socket = socket)
> sql<-paste("select count(*) from ",table,sep="")
> rs <- dbSendQuery(con,sql)
> an <- fetch(rs,n=-1)
> dbDisconnect(con)
R读取R文件应该是最轻松的任务了。读就用load,写就用save。
> save(list=ls(all=TRUE),file="all.RData")
> rm(list=ls())
> ls()
character(0)
> load("all.RData")
> ls()
 [1] "biocinstall"             "biocinstallPkgGroups"    "biocinstallRepos"        "biocLite"               
 [5] "con"                     "datavals"                "fl"                      "getBioC"                
 [9] "line"                    "newdata"                 "seq"                     "sourceBiocinstallScript"
[13] "t"                       "to.read"                 "varnames"                "which"                  
[17] "x"                       "zz"
如果是包中已经有的dataset,就用data载入。
> data(mtcars)
> head(mtcars)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
...
读取其它统计软件来源的文件,使用foreign库,例,
>test.stata <- read.dta("test.dta")
>print(test.stata)

No comments:

Post a Comment