Thursday, September 4, 2014

R学习笔记(三)

1. Simulation

Functions for probability distributions in R:
  • rnorm: generate random Normal variates with a given mean and standard deviation
  • dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
  • pnorm: evaluate the cumulative distribution function for a Normal distribution
  • rpois: generate random Poisson variates with a given rate
Probability distribution functions usually have four functions associated with them.
The functions are prefixed with a
  • d for density
  • r for random number generation
  • p for cumulative distribution
  • q for quantile function
1
2
3
4
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)

1
2
3
4
5
6
7
8
9
10
11
12
13
# Generating random Normal variates
> x <- rnorm(10)
> x
[1] 1.38380206 0.48772671 0.53403109 0.66721944
[5] 0.01585029 0.37945986 1.31096736 0.55330472
[9] 1.22090852 0.45236742
> x <- rnorm(10, 20, 2)
> x
[1] 23.38812 20.16846 21.87999 20.73813 19.59020
[6] 18.73439 18.31721 22.51748 20.36966 21.04371
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.32 19.73 20.55 20.67 21.67 23.39

1
2
3
4
5
6
7
8
9
10
11
12
13
# Setting the random number seed with set.seed ensures reproducibility
> set.seed(1)
> rnorm(5)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808
[5] 0.3295078
> rnorm(5)
[1] -0.8204684 0.4874291 0.7383247 0.5757814
[5] -0.3053884
> set.seed(1)
> rnorm(5)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808
[5] 0.3295078
# Always set the random number seed when conducting a simulation!

1
2
3
4
5
6
7
8
9
10
11
12
13
# Generating Poisson data
> rpois(10, 1)
[1] 3 1 0 1 0 0 1 0 1 1
> rpois(10, 2)
[1] 6 2 2 1 3 2 2 1 1 2
> rpois(10, 20)
[1] 20 11 21 20 20 21 17 15 24 20
> ppois(2, 2) ## Cumulative distribution
[1] 0.6766764 ## Pr(x <= 2)
> ppois(4, 2)
[1] 0.947347 ## Pr(x <= 4)
> ppois(6, 2)
[1] 0.9954662 ## Pr(x <= 6)

1
2
3
4
5
6
7
8
9
# simulate from a linear model
> set.seed(20)
> x <- rnorm(100)
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + e
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.4080 -1.5400 0.6789 0.6893 2.9300 6.5050
> plot(x, y)

1
2
3
4
5
6
7
8
9
# simulate from a Poisson model
> set.seed(1)
> x <- rnorm(100)
> log.mu <- 0.5 + 0.3 * x
> y <- rpois(100, exp(log.mu))
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 1.00 1.55 2.00 6.00
> plot(x, y)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# The sample function draws randomly from a specified set of (scalar) objects allowing
you to sample from arbitrary distributions.
> set.seed(1)
> sample(1:10, 4)
[1] 3 4 5 7
> sample(1:10, 4)
[1] 3 9 8 5
> sample(letters, 5)
[1] "q" "b" "e" "x" "p"
> sample(1:10) ## permutation
[1] 4 7 10 6 9 2 8 3 1 5
> sample(1:10)
[1] 2 3 4 1 9 5 10 8 6 7
> sample(1:10, replace = TRUE) ## Sample w/replacement
[1] 2 9 7 8 2 8 5 9 7 8
Summary
  • Drawing samples from specific probability distributions can be done with r*functions
  • Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc.
  • The sample function can be used to draw random samples from arbitrary vectors
  • Setting the random number generator seed via set.seed is critical for reproducibility

2. Plotting with Base Graphics

Base graphics are usually constructed piecemeal, with each aspect of the plot handled separately through a series of function calls; this is often conceptually simpler and allows plotting to mirror the thought process
Base graphics are used most commonly and are a very powerful system for creating 2-D graphics.
  • Calling plot(x, y) or hist(x) will launch a graphics device (if one is not already open) and draw the plot on the device
  • If the arguments to plot are not of some special class, then the default method for plot is called; this function has many arguments, letting you set the title, x axis lable, y axis label, etc.
  • The base graphics system has many parameters that can set and tweaked; these parameters are documented in ?par; it wouldn’t hurt to memorize this help page!
The par function is used to specify global graphics parameters that a?ect all plots in an R session. These parameters can often be overridden as arguments to specific plotting functions.
  • pch: the plotting symbol (default is open circle)
  • lty: the line type (default is solid line), can be dashed, dotted, etc.
  • lwd: the line width, specified as an integer multiple(default is 1)
  • col: the plotting color, specified as a number, string, or hex code; the colors function gives you a vector of colors by name(default is black)
  • las: the orientation of the axis labels on the plot
  • bg: the background color(default is transparent)
  • mar: the margin size(default is bottom5.1 left4.1 top4.1 right2.1)
  • oma: the outer margin size (default is 0 for all sides)
  • mfrow: number of plots per row, column (plots are filled row-wise)
  • mfcol: number of plots per row, column (plots are filled column-wise)
  • plot: make a scatterplot, or other type of plot depending on the class of the object being plotted
  • lines: add lines to a plot, given a vector x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots
  • points: add points to a plot
  • text: add text labels to a plot using specified x, y coordinates
  • title: add annotations to x, y axis labels, title, subtitle, outer margin
  • mtext: add arbitrary text to the margins (inner or outer) of the plot
  • axis: adding axis ticks/labels
The list of devices is found in ?Devices; there are also devices created by users on CRAN
  • pdf: useful for line-type graphics, vector format, resizes well, usually portable
  • postscript: older format, also vector format and resizes well, usually portable, can be used to create encapsulated postscript files, Windows systems often don't have a postscript viewer
  • xfig: good of you use Unix and want to edit a plot by hand
  • png: bitmapped format, good for line drawings or images with solid colors, uses lossless compression (like the old GIF format), most web browsers can read this format natively, good for plotting many many many points, does not resize well
  • jpeg: good for photographs or natural scenes, uses lossy compression, good for plotting many many many points, does not resize well, can be read by almost any computer and any web browser, not great for line drawings
  • bitmap: needed to create bitmap files (png, jpeg) in certain situations (uses Ghostscript), also can be used to create a variety of other bitmapped formats not mentioned
  • bmp: a nativeWindows bitmapped format

3. Plotting with Lattice

Lattice Functions:
  • xyplot: this is the main function for creating scatterplots
  • bwplot: box-and-whiskers plots (“boxplots”)
  • histogram: histograms
  • stripplot: like a boxplot but with actual points
  • dotplot: plot dots on “violin strings”
  • splom: scatterplot matrix; like pairs in base graphics system
  • levelplot, contourplot: for plotting “image” data
Lattice functions generally take a formula for their first argument, usually of the form:y ~ x | f * g
  • On the left of the ~ is the y variable, on the right is the x variable
  • After the | are conditioning variables — they are optional; the * indicates an interaction
  • The second argument is the data frame or list from which the variables in the formula should be obtained.
  • If no data frame or list is passed, then the parent frame is used.
  • If no other arguments are passed, there are defaults that can be used.
Lattice functions behave differently from base graphics functions in one critical way.
  • Base graphics functions plot data directly the graphics device
  • Lattice graphics functions return an object of class trellis.
  • The print methods for lattice functions actually do the work of plotting the data on the graphics device.
  • Lattice functions return “plot objects” that can, in principle, be stored (but it's usually better to just save the code + data).
  • On the command line, trellis objects are auto-printed so that it appears the function is plotting the data
Lattice functions have a panel function which controls what happens inside each panel of the entire plot.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.5)
f <- gl(2, 50, labels = c("Group 1", "Group 2"))
xyplot(y ~ x | f) # plots y vs. x conditioned on f.
xyplot(y ~ x | f,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.abline(h = median(y),
lty = 2)
})
# plots y vs. x conditioned on f with horizontal (dashed) line drawn at the median of y for each panel.
xyplot(y ~ x | f,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.lmline(x, y, col = 2)
})
# Adding a regression line, fits and plots a simple linear regression line to each panel of the plot.

4. Graphics with ggplot2

The ggplot2 is a data visualization package for the statistical programming language R. Created by Hadley Wickham in 2005, ggplot2 is an implementation of Leland Wilkinson's Grammar of Graphics—a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers. (from Wikipedia)
The ggplot2 package implements a system for creating graphics in R based on a comprehensive and coherent grammar. This provides a consistency to graph creation often lacking in R, and allows the user to create graph types that are innovative and novel.
The simplest approach for creating graphs in ggplot2 is through the qplot() or quick plot function. The format is: qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=,facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)
  • x : x values
  • y : y values
  • data: data frame to use (optional). If not specified, will create one, extracting vectors from the current environment.
  • facets : faceting formula to use. Picks facet_wrap or facet_grid depending on whether the formula is one sided or two-sided
  • margins : whether or not margins will be displayed
  • geom : character vector specifying geom to use. Defaults to "point" if x and y are specified, and "histogram" if only x is specified.
  • stat :character vector specifying statistics to use
  • position :character vector giving position adjustment to use
  • xlim : limits for x axis
  • ylim : limits for y axis
  • log : which variables to log transform ("x", "y", or "xy")
  • main : character vector or expression for plot title
  • xlab : character vector or expression for x axis label
  • ylab : character vector or expression for y axis label
  • asp : the y/x aspect ratio

No comments:

Post a Comment