R study notes: September 2014

Thursday, September 25, 2014

Creating strings from variables

Problem

You want to do create a string from variables.

Solution

The two common ways of creating strings from variables are the paste function and the sprintf function.paste is more useful for vectors, and sprintf is more useful for precise control of the output.

Using paste()

a <- "apple"
b <- "banana"

# Put a and b together, with a space in between:
paste(a, b)
# "apple banana"

# With no space:
paste(a, b, sep="")
# "applebanana"

# With a comma and space:
paste(a, b, sep=", ")
# "apple, banana"


# With a vector
d <- c("fig", "grapefruit", "honeydew")

# If the input is a vector, use collapse to put the elements together:
paste(d, collapse=", ")
# "fig, grapefruit, honeydew"

# If the input is a scalar and a vector, it puts the scalar with each
# element of the vector, and returns a vector:
paste(a, d)
# "apple fig"  "apple grapefruit"  "apple honeydew"  

# Use sep and collapse:
paste(a, d, sep="-", collapse=", ")
# "apple-fig, apple-grapefruit, apple-honeydew"

Using sprintf()

Another way is to use sprintf function. This is derived from the function of the same name in the C programming language.

To substitute in a string or string variable, use %s:

a <- "string"
sprintf("This is where a %s goes.", a)
# "This is where a string goes."

For integers, use %d or a variant:

x <- 8
sprintf("Regular:%d", x)
# "Regular:8"

# Can print to take some number of characters, leading with spaces.
sprintf("Leading spaces:%4d", x)
# "Leading spaces:   8"

# Can also lead with zeros instead.
sprintf("Leading zeros:%04d", x)
#"Leading zeros:0008:"

For floating-point numbers, use %f for standard notation, and %e or %E for exponential notation. You can also use %g or %G for a "smart" formatter that automatically switches between the two formats, depending on where the significant digits are. The following examples are taken from the R help page for sprintf:

sprintf("%f", pi)         # "3.141593"
sprintf("%.3f", pi)       # "3.142"
sprintf("%1.0f", pi)      # "3"
sprintf("%5.1f", pi)      # "  3.1"
sprintf("%05.1f", pi)     # "003.1"
sprintf("%+f", pi)        # "+3.141593"
sprintf("% f", pi)        # " 3.141593"
sprintf("%-10f", pi)      # "3.141593  "   (left justified)
sprintf("%e", pi)         #"3.141593e+00"
sprintf("%E", pi)         # "3.141593E+00"
sprintf("%g", pi)         # "3.14159"
sprintf("%g",   1e6 * pi) # "3.14159e+06"  (exponential)
sprintf("%.9g", 1e6 * pi) # "3141592.65"   ("fixed")
sprintf("%G", 1e-6 * pi)  # "3.14159E-06"

In the %m.nf format specification: The m represents the field width, which is the minimum number of characters in the output string, and can be padded with leading spaces, or zeros if there is a zero in front of m. The n represents precision, which the number of digits after the decimal.

Other miscellaneous things:

x <- "string"
sprintf("Substitute in multiple strings: %s %s", x, "string2")
# "Substitute in multiple strings: string string2"

# To print a percent sign, use "%%"
sprintf("A single percent sign here %%")
# "A single percent sign here %"

Wednesday, September 24, 2014

R中如何实现格式化输出的小知识点

假设有1位 2位 3位到6位的数字不足六位的前面加0，格式化输出，怎么能实现？

实现一：

substr(as.character(1:10+1000000),2,7)

[1] "000001" "000002" "000003""000004" "000005" "000006" "000007""000008" "000009" "000010"

实现二：

sprintf("%06.f",1:3)

[1] "000001" "000002" "000003"

扩展：

1、任务编程中，通常需要指定的格式。R能有效的控制字符串格式，通过函数sprintf可以实现。

2、Sprintf继承自C语言中的同名函数，经过封装，在R环境中的实现。

3、Sprintf:String_print_format

4、用法：sprintf(fmt,...)　

5、参数fmt，格式表征，为字符串，最大长度8192字节。定义了实现语法，由字符％起始，aAdifeEgGosxX%（其中任意一个）结束，更多详细的解释清参照文档。

6、解决方案实例解释：

%06.f:

%　　　开始符

0　　　是"填空字元"表示,如果长度不足时就用0来填满。

6　　　格式化后总长度

f　　　浮点数，小数位长度可以通过数字控制，如2f即2位

Thursday, September 11, 2014

Box plot

hsb2 <- read.table('http://www.ats.ucla.edu/stat/r/modules/hsb2.csv', header=T, sep=",")
attach(hsb2)

	boxplot(write)
	boxplot(write, xlab="write", boxwex=.4, col="darkblue")
	boxplot(write ~ ses)
	seslab<-as.vector(c("low","medium", "high")) sesf<-factor(ses, label=seslab) boxplot(write ~ sesf, xlab="write by ses")
	boxplot(write ~ sesf, xlab="write by ses", boxwex=.2, notch = TRUE, col = "darkblue")
	# by two factors boxplot(write ~ female + ses, xlab="write by female and ses", boxwex=.2)

Histograms

set.seed(121343)
u <- rnorm(100)

	#default histogram hist(u)
	#with shading hist(u, density=20)
	#with specific number of bins hist(u, density=20, breaks=20)
	# proportion, instead of frequency # also specifying y-axis hist(u, density=20, breaks=-3:3, ylim=c(0,.5), prob=TRUE)
	# overlay normal curve with x-lab and ylim # colored normal curve m<-mean(u) std<-sqrt(var(u)) hist(u, density=20, breaks=20, prob=TRUE, xlab="x-variable", ylim=c(0, 0.7), main="normal curve over histogram") curve(dnorm(x, mean=m, sd=std), col="darkblue", lwd=2, add=TRUE)
	hist(u, density=20, breaks=20, prob=TRUE, xlab="x-variable", ylim=c(0, 0.8), main="Density curve over histogram") lines(density(u), col = "blue")
	# hist(u) is an object # names(uh) will show all of its components uh<-hist(u) plot(uh, ylim=c(0, 40), col="lightgray", xlab="", main="Histogram of u") text(uh$mids, uh$counts+2, label=c(uh$counts))

Wednesday, September 10, 2014

Probabilities and Distributions

1. Generating random samples from a normal distribution

Even though we would like to think of our samples as random, it is in fact almost impossible to generate random numbers on a computer. So, we will admit that we are really drawing a pseudo-random sample. In order to be able to reproduce the results on this page we will set the seed for our pseudo-random number generator to the value of 124 using the set.seed function. (For more information on the random number generator used in R please refer to the help pages for the Random.Seed function which has a very detailed explanation.)

set.seed(124)

It is often very useful to be able to generate a sample from a specific distribution. To generate a sample of size 100 from a standard normal distribution (with mean 0 and standard deviation 1) we use the rnorm function. We only have to supply the n (sample size) argument since mean 0 and standard deviation 1 are the default values for the mean and stdev arguments.

norm <- rnorm(100)

Now let's look at the first 10 observations. We use square brackets to surround the first and last element number. In the output, the number of the first element listed on the line is given in the square brackets. For example, the [9] indicates that the first number given (0.19709386) is the ninth element.

norm[1:10]

##  [1] -1.3851  0.0383 -0.7630  0.2123  1.4255  0.7445  0.7002 -0.2294
##  [9]  0.1971  1.2072

mean(norm)

## [1] 0.00962

sd(norm)

## [1] 0.884

If we want to obtain a sample of values drawn from a normal distribution with a different value for the mean and standard deviation then we just have to use the mean and sd arguments. Let's draw a sample of size 100 from a normal distribution with mean 2 and standard deviation 5.

set.seed(124)
norm <- rnorm(100, 2, 5)
norm[1:10]

##  [1] -4.925  2.192 -1.815  3.062  9.128  5.722  5.501  0.853  2.985  8.036

mean(norm)

## [1] 2.05

sd(norm)

## [1] 4.42

2. Generating random samples from other distributions

Here is a list of the functions that will generate a random sample from other common distributions: runif, rpois, rmvnorm, rnbinom, rbinom, rbeta, rchisq,rexp, rgamma, rlogis, rstab, rt, rgeom, rhyper, rwilcox, rweibull. Each function has its own set of parameter arguments. For example, the rpois function is the random number generator for the Poisson distribution and it has only the parameter argument lambda. The rbinom function is the random number generator for the binomial distribution and it takes two arguments: size and prob. The size argument specifies the number of Bernoulli trials and the probargument specifies the probability of a success for each trial.

# Generating a random sample from a Poisson distribution with lambda=3
set.seed(124)
pois <- rpois(100, lambda = 3)
pois[1:10]

##  [1] 1 2 3 2 2 2 3 3 6 2

mean(pois)

## [1] 2.83

var(pois)

## [1] 2.34

# Generating a random sample from a Binomial distribution with size=20 and
# prob=.2
set.seed(124)
binom <- rbinom(100, 20, 0.2)
binom[1:10]

##  [1] 2 3 4 3 3 3 4 4 7 3

mean(binom)

## [1] 3.85

sd(binom)

## [1] 1.6

3. Other probability and distribution functions

For each of the distributions there are four functions which will generate fundamental quantities of a distribution. Let's consider the normal distribution as an example. We have already given examples of the rnorm function which will generate a random sample from a specific normal distribution. The dnormfunction will generate the density (or point) probability for a specific value for a normal distribution. This function is very useful for creating a plot of a density function of a distribution. In the list of the random number generator functions all the functions started with an "r", similarly the density functions for all the distributions all start with a "d".

# point probability for a specific value of a standard normal dist
dnorm(-1.96)

## [1] 0.0584

# plotting the density function of a normal distribution: N(2, .25)
x <- seq(0, 4, 0.1)

plot(x, dnorm(x, 2, 0.5), type = "l")

# plotting the density function of a binomial distribution: Binom(30, .25)
y <- 0:30
plot(y, dbinom(y, 30, 0.25), type = "h")

It is also possible to calculate p-values using the cumulative distribution functions. For the normal distribution this function is the pnorm and for the other distributions these functions all start with a "p".

# calculating the p-values for the quantiles of a standard normal
1 - pnorm(1.959964)

## [1] 0.025

1 - pnorm(1.644854)

## [1] 0.05

It is also possible to calculate the quantiles for a specific distribution. For the normal distribution this function is the qnorm and for the other distribution these functions all start with a "q".

# calculating the quantiles for the standard normal
qnorm(0.05)

## [1] -1.64

qnorm(0.025)

## [1] -1.96

4. The sample function

The sample function is used to generate a random sample from a given population. It can be used to sample with or without replacement by using thereplace argument (the default is F). The only obligatory argument is a vector of data which will constitute the population from which the sample will be drawn. The default is to create a sample equal in size to the population but by using the size argument any sample size can be specified. A vector of probabilities can also be supplied in the prob argument. This vector has to be equal in length to the size of the population and it will automatically be normalized if its elements do not sum up to one. The default is for every element in the population to have equal chance of being chosen.

# random sample of size 8 from sequence [5, 15]
set.seed(124)

sample(seq(5, 15), 8)

## [1]  5  9 14  8  6 11  7 10

# random permutation of sequence [1, 10]
set.seed(124)

sample(10)

##  [1]  1  4  5  3  2  6  7  8  9 10

# random sample of size 10 from sequence [1, 5] with unequal probabilities
# of being chosen
set.seed(124)

sample(5, 10, prob = c(0.3, 0.4, 0.1, 0.1, 0.1), replace = T)

##  [1] 2 1 1 2 2 2 1 1 4 2

R语言里的矩阵处理学习笔记

关于矩阵，通常都会使用matlab来做处理。其实使用R也可以对矩阵做出一些简单的处理。而R语言中提供的matrix，matlab包也提供了不少关于矩阵处理的东西(可以通过??matlab来查看具体函数）。

一、矩阵的输入

通常我们使用函数matrix来创建矩阵，函数的介绍如下：

matrix(data = NA,nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

如果想将数据按行输入矩阵，参数byrow须改为T。

由于矩阵也是特殊的数组，我们也可以用生成数组的函数array（）。具体格式如下：

array(data = NA, dim= length(data), dimnames = NULL)

这里的dim是一个二维数组，生成的就是矩阵了。

当然dim（）函数，attr()（这个是一个格式转换的函数）也是可以生成矩阵的。

还有如diag（）可以产生对角矩阵这样特殊矩阵的函数。

例如生成下面的这个矩阵（为了便于下面的叙述，这个矩阵记为x，生成命令x<-matrix（1:16,2,8））:

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 3 5 7 9 11 13 15
[2,] 2 4 6 8 10 12 14 16

我们可以使用如下几种命令：

matrix（1:16,2,8）

x<-1:16 ;dim（x）<-c(2,8)

array(1:16,c(2,8))

x<-1:16;attr(x,"dim")<-c(2,8)

二、矩阵的基本操作

这里主要的有：矩阵的加法与乘法，矩阵求秩，矩阵的转置，方阵求行列式，矩阵求逆，解线性方程组

1、矩阵的加法与乘法

加法直接使用加号，实现对应元素相加。但是要注意两个矩阵必须可加

矩阵的乘法：和matlab类似，R也给出了两种乘法：

“*”：对应位置元素做乘法，如x*x得到结果：

> x*x
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 9 25 49 81 121 169 225
[2,] 4 16 36 64 100 144 196 256

"%*%":这个是通常意义下的矩阵乘法，如x%*%t(x)得到结果：

> x%*%t(x)
[,1] [,2]
[1,] 680 744
[2,] 744 816

这里乘法也必须是有意义的才行。

通常我们也使用crossprod()函数来做乘法，crossprod(x,x)效果与x%*%t(x)相同

2、矩阵求秩

这里可以利用qr分解来解决求秩的问题

qr(x)$rank

可以得到x的秩

3、矩阵的转置

常用的命令为t().

R中还可以使用命令aperm()来实现矩阵的广义转置，函数用法如下：

aperm(a, perm, ...)

## Default S3 method:

aperm(a, perm = NULL, resize =TRUE, ...)

## S3 method for class 'table'

aperm(a, perm = NULL, resize =TRUE, keep.class = TRUE, ...)

4、方阵求行列式

命令为det()，无须赘述

5、矩阵求逆与解线性方程组

使用函数solve（）

对于线性方程组b<-A%*%y

的解使用函数solve（A,b)即可

从而我们知道取b为单位阵时即可求逆，通常简化为solve（A)

但是值得注意的是，用直接求逆的办法解线性方程组y<-solve(A)%*%b不仅稳定性低，效率也不咋地。

三、矩阵的分解

这里主要介绍矩阵的lu分解，Cholesky分解，特征值与特征向量，QR分解，奇异值分解

1、LU分解

　　将系数矩阵A转变成等价两个矩阵L和U的乘积，其中L和U分别是下三角和上三角矩阵。当A的所有顺序主子式都不为0时，矩阵A可以分解为A=LU，且分解唯一。其中L是单位下三角矩阵，U是上三角矩阵。

library(Matrix) #这里引入函数包Matrix

> m
[,1] [,2] [,3]
[1,] 2 -1 3
[2,] 1 2 1
[3,] 2 4 2

> l <- lu(m)

> l
'MatrixFactorization' of Formal class 'denseLU' [package "Matrix"]with 3 slots
..@ x : num [1:9] 2 1 0.5 -1 5 0.5 3 -1 0
..@ perm: int [1:3] 1 3 3
..@ Dim : int [1:2] 3 3

> LU <- expand(l) #生成P,L,U

> LU
$L
3 x 3 Matrix of class "dtrMatrix" (unitriangular)
[,1] [,2] [,3]
[1,] 1.0 . .
[2,] 1.0 1.0 .
[3,] 0.5 0.5 1.0

$U
3 x 3 Matrix of class "dtrMatrix"
[,1] [,2] [,3]
[1,] 2 -1 3
[2,] . 5 -1
[3,] . . 0

$P（这个矩阵的意思是保持被变换的矩阵第一行不变，二三行对调）
3 x 3 sparse Matrix of class "pMatrix"

[1,] | . .
[2,] . . |
[3,] . | .

可以验证A = LU$P%*%LU$L%*%LU$U

P为置换矩阵，L为下单位三角矩阵，U为上三角矩阵；

2、Cholesky分解

如果矩阵A为n阶对称正定矩阵，则存在一个非奇异（满秩）的下三角实矩阵L，使得：A=L%*%t(L)当限定的L的对角元素为正时，分解唯一，成为Cholesky分解

> A
[,1] [,2] [,3] [,4]
[1,] 2 1 1 1
[2,] 1 2 1 1
[3,] 1 1 2 1
[4,] 1 1 1 2
> chol(A)
[,1] [,2] [,3] [,4]
[1,] 1.414214 0.7071068 0.7071068 0.7071068
[2,] 0.000000 1.2247449 0.4082483 0.4082483
[3,] 0.000000 0.0000000 1.1547005 0.2886751
[4,] 0.000000 0.0000000 0.0000000 1.1180340
> t(chol(A))%*%chol(A)
[,1] [,2] [,3] [,4]
[1,] 2 1 1 1
[2,] 1 2 1 1
[3,] 1 1 2 1
[4,] 1 1 1 2
> crossprod(chol(A),chol(A))
[,1] [,2] [,3] [,4]
[1,] 2 1 1 1
[2,] 1 2 1 1
[3,] 1 1 2 1
[4,] 1 1 1 2
若矩阵为对称正定矩阵，可以利用Cholesky分解求行列式的值，如：
> prod(diag(chol(A))^2)
[1] 5
> det(A)
[1] 5
若矩阵为对称正定矩阵，可以利用Cholesky分解求矩阵的逆，这时用函数chol2inv()，这种用法更有效。

3、特征值与特征向量

函数eigen（A）用来计算方阵A的特征值与特征向量，返回一个含有特征值与特征向量的列表

> A
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
> eigen(A)
$values
[1] 3.620937e+01 -2.209373e+00 -1.050249e-15 8.203417e-16

$vectors
[,1] [,2] [,3] [,4]
[1,] -0.4140028 -0.82289268 0.4422036 -0.1001707
[2,] -0.4688206 -0.42193991 -0.3487083 0.5349238
[3,] -0.5236384 -0.02098714 -0.6291942 -0.7693354
[4,] -0.5784562 0.37996563 0.5356989 0.3345823

有时只需特征值时，我们使用eigen（A,only.value=T)$value可以快速得到结果

4、QR分解

A为m×n矩阵可以进行QR分解，A=QR，其中：Q'Q＝I，在R中可以用函数qr()进行QR分解，例如：
> A=matrix(1:16,4,4)
> qr(A)
$qr
[,1] [,2] [,3]    [,4]
[1,] -5.4772256 -12.7801930 -2.008316e+01 -2.738613e+01
[2,] 0.3651484 -3.2659863 -6.531973e+00 -9.797959e+00
[3,] 0.5477226 -0.3781696 2.641083e-15 2.056562e-15
[4,] 0.7302967 -0.9124744 8.583032e-01 -2.111449e-16

$rank
[1] 2

$qraux
[1] 1.182574e+00 1.156135e+00 1.513143e+00 2.111449e-16

$pivot
[1] 1 2 3 4

attr(,"class")
[1] "qr"
rank项返回矩阵的秩，qr项包含了矩阵Q和R的信息，要得到矩阵Q和R，可以用函数qr.Q()和qr.R()作用qr()的返回结果，例如：
> qr.R(qr(A))
[,1] [,2] [,3]    [,4]
[1,] -5.477226 -12.780193 -2.008316e+01 -2.738613e+01
[2,] 0.000000 -3.265986 -6.531973e+00 -9.797959e+00
[3,] 0.000000 0.000000 2.641083e-15 2.056562e-15
[4,] 0.000000 0.000000 0.000000e+00 -2.111449e-16
> qr.Q(qr(A))
[,1] [,2] [,3]    [,4]
[1,] -0.1825742 -8.164966e-01 -0.4000874 -0.37407225
[2,] -0.3651484 -4.082483e-01 0.2546329 0.79697056
[3,] -0.5477226 -8.131516e-19 0.6909965 -0.47172438
[4,] -0.7302967 4.082483e-01 -0.5455419 0.04882607
> qr.Q(qr(A))%*%qr.R(qr(A))
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16

> t(qr.Q(qr(A)))%*%qr.Q(qr(A))
[,1] [,2] [,3] [,4]
[1,] 1.000000e+00 -1.457168e-16 -6.760001e-17 -7.659550e-17
[2,] -1.457168e-16 1.000000e+00 -4.269046e-17 7.011739e-17
[3,] -6.760001e-17 -4.269046e-17 1.000000e+00 -1.596437e-16
[4,] -7.659550e-17 7.011739e-17 -1.596437e-16 1.000000e+00
> qr.X(qr(A))
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16

5、svd分解

利用函数svd（）可以对矩阵做svd分解。看一个R提供的例子：

svd> hilbert <- function(n) { i <- 1:n; 1 /outer(i - 1, i, "+") }

svd> X <- hilbert(9)[,1:6]

svd> (s <- svd(X))
$d
[1] 1.668433e+00 2.773727e-01 2.223722e-02 1.084693e-03 3.243788e-05
[6] 5.234864e-07

$u
[,1] [,2]    [,3] [,4]   [,5] [,6]
[1,] -0.7244999 0.6265620 0.27350003 -0.08526902 0.02074121 -0.00402455
[2,] -0.4281556 -0.1298781 -0.64293597 0.55047428 -0.27253421 0.09281592
[3,] -0.3121985 -0.2803679 -0.33633240 -0.31418014 0.61632113-0.44090375
[4,] -0.2478932 -0.3141885 -0.06931246 -0.44667149 0.02945426 0.53011986
[5,] -0.2063780 -0.3140734 0.10786005 -0.30241655 -0.35566839 0.23703838
[6,] -0.1771408 -0.3026808 0.22105904 -0.09041508 -0.38878613-0.26044927
[7,] -0.1553452 -0.2877310 0.29280775 0.11551327 -0.19285565-0.42094482
[8,] -0.1384280 -0.2721599 0.33783778 0.29312535 0.11633231 -0.16079025
[9,] -0.1248940 -0.2571250 0.36542543 0.43884649 0.46496714 0.43459954

$v
[,1] [,2]    [,3] [,4]   [,5] [,6]
[1,] -0.7364928 0.6225002 0.2550021 -0.06976287 0.01328234-0.001588146
[2,] -0.4432826 -0.1818705 -0.6866860 0.50860089 -0.19626669 0.041116974
[3,] -0.3274789 -0.3508553 -0.2611139 -0.50473697 0.61605641 -0.259215626
[4,] -0.2626469 -0.3921783 0.1043599 -0.43747940 -0.40833605 0.638901622
[5,] -0.2204199 -0.3945644 0.3509658 0.01612426 -0.46427916-0.675826789
[6,] -0.1904420 -0.3831871 0.5110654 0.53856351 0.44663632 0.257248908

svd> D <- diag(s$d)

svd> s$u %*% D %*% t(s$v) # X = U D V'
[,1] [,2]    [,3] [,4] [,5]    [,6]
[1,] 1.0000000 0.5000000 0.33333333 0.25000000 0.20000000 0.16666667
[2,] 0.5000000 0.3333333 0.25000000 0.20000000 0.16666667 0.14285714
[3,] 0.3333333 0.2500000 0.20000000 0.16666667 0.14285714 0.12500000
[4,] 0.2500000 0.2000000 0.16666667 0.14285714 0.12500000 0.11111111
[5,] 0.2000000 0.1666667 0.14285714 0.12500000 0.11111111 0.10000000
[6,] 0.1666667 0.1428571 0.12500000 0.11111111 0.10000000 0.09090909
[7,] 0.1428571 0.1250000 0.11111111 0.10000000 0.09090909 0.08333333
[8,] 0.1250000 0.1111111 0.10000000 0.09090909 0.08333333 0.07692308
[9,] 0.1111111 0.1000000 0.09090909 0.08333333 0.07692308 0.07142857

svd> t(s$u) %*% X %*% s$v # D = U' X V
[,1]    [,2] [,3]    [,4] [,5]
[1,] 1.668433e+00 2.009230e-16 -2.333610e-16 1.193300e-16 2.347298e-17
[2,] 1.627828e-17 2.773727e-01 7.318365e-19 3.109966e-17 -5.251265e-17
[3,] 1.828617e-17 1.086828e-17 2.223722e-02 4.511721e-18 1.194020e-17
[4,] 2.420517e-17 1.205777e-17 3.433104e-18 1.084693e-03 -4.584063e-18
[5,] -3.406808e-17 -1.147965e-17 -8.968404e-19 6.405788e-18 3.243788e-05
[6,] -1.591696e-17 2.714931e-18 1.721002e-17 -2.358252e-18 1.170640e-17
[,6]
[1,] 1.015423e-16
[2,] 2.334931e-17
[3,] -1.562373e-17
[4,] 1.364795e-18
[5,] -2.473743e-18
[6,] 5.234864e-07

6、矩阵广义逆(Moore-Penrose)
n×m矩阵A+称为m×n矩阵A的Moore-Penrose逆，如果它满足下列条件：
① A A+A=A；②A+A A+= A+；③(A A+)H=A A+；④(A+A)H= A+A
在R的MASS包中的函数ginv()可计算矩阵A的Moore-Penrose逆，例如：
library(“MASS”)
> A
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
> ginv(A)
[,1] [,2] [,3] [,4]
[1,] -0.285 -0.1075 0.07 0.2475
[2,] -0.145 -0.0525 0.04 0.1325
[3,] -0.005 0.0025 0.01 0.0175
[4,] 0.135 0.0575 -0.02 -0.0975
验证性质1：
> A%*%ginv(A)%*%A
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
验证性质2：
> ginv(A)%*%A%*%ginv(A)
[,1] [,2] [,3] [,4]
[1,] -0.285 -0.1075 0.07 0.2475
[2,] -0.145 -0.0525 0.04 0.1325
[3,] -0.005 0.0025 0.01 0.0175
[4,] 0.135 0.0575 -0.02 -0.0975
验证性质3:
> t(A%*%ginv(A))
[,1] [,2] [,3] [,4]
[1,] 0.7 0.4 0.1 -0.2
[2,] 0.4 0.3 0.2 0.1
[3,] 0.1 0.2 0.3 0.4
[4,] -0.2 0.1 0.4 0.7
> A%*%ginv(A)
[,1] [,2] [,3] [,4]
[1,] 0.7 0.4 0.1 -0.2
[2,] 0.4 0.3 0.2 0.1
[3,] 0.1 0.2 0.3 0.4
[4,] -0.2 0.1 0.4 0.7
验证性质4:
> t(ginv(A)%*%A)
[,1] [,2] [,3] [,4]
[1,] 0.7 0.4 0.1 -0.2
[2,] 0.4 0.3 0.2 0.1
[3,] 0.1 0.2 0.3 0.4
[4,] -0.2 0.1 0.4 0.7
> ginv(A)%*%A
[,1] [,2] [,3] [,4]
[1,] 0.7 0.4 0.1 -0.2
[2,] 0.4 0.3 0.2 0.1
[3,] 0.1 0.2 0.3 0.4
[4,] -0.2 0.1 0.4 0.7

对于矩阵，我们还可以使用cbind(),rbind()构造分块矩阵。

关于矩阵的更多信息可以参阅Douglas Bates的Introduction to the Matrix package.你可以在help.search(Matrix)里找到它。

Coding for Categorical Variables in Regression Models

In R there are at least three different functions that can be used to obtain contrast variables for use in regression or ANOVA. For those shown below, the default contrast coding is "treatment" coding, which is another name for "dummy" coding. This is the coding most familiar to statisticians. "Dummy" or "treatment" coding basically consists of creating dichotomous variables where each level of the categorical variable is contrasted to a specified reference level. In the case of the variable race which has four levels, a typical dummy coding scheme would involve specifying a reference level, let's pick level 1 (which is the default), and then creating three dichotomous variables, where each variable would contrast each of the other levels with level 1. So, we would have a variable which would contrast level 2 with level 1, another variable that would contrast level 3 with level 1 and a third variable that would contrast level 4 with level 1. There are actually four different contrasts coding that have built in functions in R, but we will focus our attention on the treatment (or dummy) coding since it is the most popular choice for data analysts. For more information about different contrasts coding systems and how to implement them in R, please refer to R Library: Coding systems for categorical variables.

For the examples on this page we will be using the hsb2 data set. Let's first read in the data set and create the factor variable race.f based on the variablerace. We will then use the is.factor function to determine if the variable we create is indeed a factor variable, and then we will use the lm function to perform a regression, and get a summary of the regression using the summary function.

hsb2 <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")

1. The factor function

# creating the factor variable
hsb2$race.f <- factor(hsb2$race)
is.factor(hsb2$race.f)

## [1] TRUE

hsb2$race.f[1:15]

##  [1] 4 4 4 4 4 4 3 1 4 3 4 4 4 4 3
## Levels: 1 2 3 4

summary(lm(write ~ race.f, data = hsb2))

## 
## Call:
## lm(formula = write ~ race.f, data = hsb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.055  -5.458   0.972   7.000  18.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    46.46       1.84   25.22  < 2e-16 ***
## race.f2        11.54       3.29    3.51  0.00055 ***
## race.f3         1.74       2.73    0.64  0.52461    
## race.f4         7.60       1.99    3.82  0.00018 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.03 on 196 degrees of freedom
## Multiple R-squared:  0.107, Adjusted R-squared:  0.0934 
## F-statistic: 7.83 on 3 and 196 DF,  p-value: 5.78e-05

You can also use the factor function within the lm function, saving the step of creating the factor variable first.

summary(lm(write ~ factor(race), data = hsb2))

## 
## Call:
## lm(formula = write ~ factor(race), data = hsb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.055  -5.458   0.972   7.000  18.800 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      46.46       1.84   25.22  < 2e-16 ***
## factor(race)2    11.54       3.29    3.51  0.00055 ***
## factor(race)3     1.74       2.73    0.64  0.52461    
## factor(race)4     7.60       1.99    3.82  0.00018 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.03 on 196 degrees of freedom
## Multiple R-squared:  0.107, Adjusted R-squared:  0.0934 
## F-statistic: 7.83 on 3 and 196 DF,  p-value: 5.78e-05

2. Using the C function

The C function (this must be a upper-case "C") allows you to create several different kinds of contrasts, including treatment, Helmert, sum and poly. Treatment is another name for dummy coding. Sum stands for contrasts that sum to zero, such as the type used in ANOVA models. Poly is short for polynomial. Three arguments are used with this function. The first one names the factor to be used, the second indicated the type of contrast to be used (e.g., treatment, Helmert, etc.), and the third indicates the number of contrasts to be set. The default is one less than the number of levels of the factor variable. We will start out by using the treatment contrast. We will accept the default setting for the number of levels, so that argument can be omitted.

hsb2 <- within(hsb2, {
    race.ct <- C(race.f, treatment)
    print(attributes(race.ct))
})

## $levels
## [1] "1" "2" "3" "4"
## 
## $class
## [1] "factor"
## 
## $contrasts
## [1] "contr.treatment"

summary(lm(write ~ race.ct, data = hsb2))

## 
## Call:
## lm(formula = write ~ race.ct, data = hsb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.055  -5.458   0.972   7.000  18.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    46.46       1.84   25.22  < 2e-16 ***
## race.ct2       11.54       3.29    3.51  0.00055 ***
## race.ct3        1.74       2.73    0.64  0.52461    
## race.ct4        7.60       1.99    3.82  0.00018 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.03 on 196 degrees of freedom
## Multiple R-squared:  0.107, Adjusted R-squared:  0.0934 
## F-statistic: 7.83 on 3 and 196 DF,  p-value: 5.78e-05

Now we will try an example using the Helmert coding system which compares each subsequent level to the mean of the previous levels. For example, the third level will be compared with the mean of the first two levels, and the fourth level will be compared to the mean of the first three levels. Also note that, like most functions in R, C is case-sensitive: the arguments for the type of contrast must be in all lower case letters (i.e., typing Helmert will give you a strange error message that does not indicate that the problem is that you need to use a lower-case h (helmert)). We will make two objects using this type of coding: for the first one we will accept the default number of contrasts to be created, and in the second one we will specify that three contrasts are to be made (because the variable race has four levels). As you will see, the difference is found in the output of the attributes function, not in the results of the lm.

hsb2 <- within(hsb2, {
    race.ch <- C(race.f, helmert)
    print(attributes(race.ch))
})

## $levels
## [1] "1" "2" "3" "4"
## 
## $class
## [1] "factor"
## 
## $contrasts
## [1] "contr.helmert"

summary(lm(write ~ race.ch, data = hsb2))

## 
## Call:
## lm(formula = write ~ race.ch, data = hsb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.055  -5.458   0.972   7.000  18.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   51.678      0.982   52.62  < 2e-16 ***
## race.ch1       5.771      1.643    3.51  0.00055 ***
## race.ch2      -1.343      0.867   -1.55  0.12317    
## race.ch3       0.792      0.372    2.13  0.03444 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.03 on 196 degrees of freedom
## Multiple R-squared:  0.107, Adjusted R-squared:  0.0934 
## F-statistic: 7.83 on 3 and 196 DF,  p-value: 5.78e-05

hsb2 <- within(hsb2, {
    race.ch1 <- C(race.f, helmert, 3)
    print(attributes(race.ch1))
})

## $levels
## [1] "1" "2" "3" "4"
## 
## $class
## [1] "factor"
## 
## $contrasts
##   [,1] [,2] [,3]
## 1   -1   -1   -1
## 2    1   -1   -1
## 3    0    2   -1
## 4    0    0    3

summary(lm(write ~ race.ch1, data = hsb2))

## 
## Call:
## lm(formula = write ~ race.ch1, data = hsb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.055  -5.458   0.972   7.000  18.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   51.678      0.982   52.62  < 2e-16 ***
## race.ch11      5.771      1.643    3.51  0.00055 ***
## race.ch12     -1.343      0.867   -1.55  0.12317    
## race.ch13      0.792      0.372    2.13  0.03444 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.03 on 196 degrees of freedom
## Multiple R-squared:  0.107, Adjusted R-squared:  0.0934 
## F-statistic: 7.83 on 3 and 196 DF,  p-value: 5.78e-05

3. Using the contr. function

The contr. function is a little different from the preceding functions, in that it is really two functions. In most cases, you will have function on both sides of <- . On the left side you will usually have the contrasts() function, and on the right contr.treatment(), contr.helmert(), or whatever contrast you want to use. We suggest that you first look at the help file for this function, as the arguments are different for each type of contrast (i.e., treatment, Helmert, sum and poly). For the treatment contrast, the arguments are n, base and contrasts. There is no default for the n argument, so this number must be specified. The default for the base argument is 1, meaning that the first level is used as the reference level. The default for the contrasts argument is TRUE.

One advantage to using the two function method is that it allows you to change the default reference level if you like. We will not show that here, but it can be done using the options() function (see the help file for contrasts for an example of how to do this).

First, we will use the contrasts() function by itself simply to show what it is doing. Please note that while the example works for treatment coding, it does not work for other types of coding.

(a <- contrasts(hsb2$race.f))

Now let's use the contrasts() function with the contr.treatment() function. The results from the linear model (the lm() function) should match those that we have obtained previously. Note that the number given in the parentheses is the number of levels of the factor variable race.

contrasts(hsb2$race.f) <- contr.treatment(4)
summary(lm(write ~ race.f, data = hsb2))

## 
## Call:
## lm(formula = write ~ race.f, data = hsb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.055  -5.458   0.972   7.000  18.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    46.46       1.84   25.22  < 2e-16 ***
## race.f2        11.54       3.29    3.51  0.00055 ***
## race.f3         1.74       2.73    0.64  0.52461    
## race.f4         7.60       1.99    3.82  0.00018 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.03 on 196 degrees of freedom
## Multiple R-squared:  0.107, Adjusted R-squared:  0.0934 
## F-statistic: 7.83 on 3 and 196 DF,  p-value: 5.78e-05

Now let's try changing the reference level to the second level of race.f.

contrasts(hsb2$race.f) <- contr.treatment(4, base = 2)
summary(lm(write ~ race.f, data = hsb2))

## 
## Call:
## lm(formula = write ~ race.f, data = hsb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.055  -5.458   0.972   7.000  18.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    58.00       2.72   21.31  < 2e-16 ***
## race.f1       -11.54       3.29   -3.51  0.00055 ***
## race.f3        -9.80       3.39   -2.89  0.00425 ** 
## race.f4        -3.94       2.82   -1.40  0.16380    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.03 on 196 degrees of freedom
## Multiple R-squared:  0.107, Adjusted R-squared:  0.0934 
## F-statistic: 7.83 on 3 and 196 DF,  p-value: 5.78e-05

Another way of doing the same thing would be to specify which levels of the factor variable race.f are to be included in the model.

summary(lm(write ~ I(race.f == 1) + I(race.f == 3) + I(race.f == 4), data = hsb2))

## 
## Call:
## lm(formula = write ~ I(race.f == 1) + I(race.f == 3) + I(race.f == 
##     4), data = hsb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.055  -5.458   0.972   7.000  18.800 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           58.00       2.72   21.31  < 2e-16 ***
## I(race.f == 1)TRUE   -11.54       3.29   -3.51  0.00055 ***
## I(race.f == 3)TRUE    -9.80       3.39   -2.89  0.00425 ** 
## I(race.f == 4)TRUE    -3.94       2.82   -1.40  0.16380    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.03 on 196 degrees of freedom
## Multiple R-squared:  0.107, Adjusted R-squared:  0.0934 
## F-statistic: 7.83 on 3 and 196 DF,  p-value: 5.78e-05

Now let's try using the Helmert coding.

contrasts(hsb2$race.f) <- contr.helmert(4)
summary(lm(write ~ race.f, data = hsb2))

## 
## Call:
## lm(formula = write ~ race.f, data = hsb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.055  -5.458   0.972   7.000  18.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   51.678      0.982   52.62  < 2e-16 ***
## race.f1        5.771      1.643    3.51  0.00055 ***
## race.f2       -1.343      0.867   -1.55  0.12317    
## race.f3        0.792      0.372    2.13  0.03444 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.03 on 196 degrees of freedom
## Multiple R-squared:  0.107, Adjusted R-squared:  0.0934 
## F-statistic: 7.83 on 3 and 196 DF,  p-value: 5.78e-05