A short introduction to R

Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova

http://www.dima.unige.it/~rogantin/IIT/

Some advice on how to write and submit programs

• R is case sensitive

• Commands can be written:

– in the Console window (use the up and down arrows to scroll through the history)

– in the Editor window (an ASCII file that can be saved for further processing).

To run commands on one or more lines (or parts of lines), select the text and choose Run from the menu, or press Ctrl+R or, on Windows, F5.

The commands are automatically copied to the Console window and run.

• Several commands can be written on the same line, separated by ;

• A command can continue over several lines (the Console prompt becomes +)

• Output is always written in the Console window (or in the Graphics window)

• Comments are preceded by #

Objects in R

Vectors, lists, arrays, matrices, tables, data frames, ...

Assign a name to an object (all equivalent)

> x=3

> x<-3

> 3->x

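Object names can contain letters, digits, the dot (.) and the underscore (_). A hyphen (-) is not allowed, because it is read as the minus operator. The first character must be a letter (or a dot not followed by a digit).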

Display the content of an object

> x
[1] 3

Object modes

- numeric: x=3

- character: t="Ciao"

- logical: z=TRUE; admitted values: TRUE (or T), FALSE (or F).

> mode(t)

[1] "character"

Comparison and logical operators: <, <=, >, >=, == (equal), != (not equal), & (and), | (or)
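As a minimal sketch of the modes and operators listed above (the object names `n`, `s`, `b` and `v` are illustrative and do not come from the slides):

```r
# One illustrative object per mode (names are hypothetical)
n <- 3
s <- "Ciao"
b <- TRUE

mode(n)   # "numeric"
mode(s)   # "character"
mode(b)   # "logical"

# Comparisons return logical values; & and | combine them element by element
v <- c(1, 5, 10)
inside  <- v > 2 & v <= 10    # FALSE TRUE TRUE
outside <- v < 2 | v != 5     # TRUE FALSE TRUE
```

Because comparisons work element by element, the result has the same length as the vector being tested.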

Vectors

• Assign elements to a vector

pay close attention to the function c (concatenation):

> y = c(10,9,8,7,6,5,4,3,2,1)

• Extract/remove elements from a vector

> y[2] ## 2 is the index of the elements to extract
[1] 9

> y[3:6] ## return the content of y in position 3 to 6
[1] 8 7 6 5

> y[c(2,3,6)]

[1] 9 8 5

> y[-4] ## specify the indices of the elements to remove

[1] 10 9 8 6 5 4 3 2 1

• Logical conditions

> y>=3 ## returns a vector of logical values

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE

> y[y>=3]

[1] 10 9 8 7 6 5 4 3

• Extract the indices of the TRUE elements of a logical vector

> which(y>=3)

[1] 1 2 3 4 5 6 7 8

• Missing values

Default symbol: NA.

> vec=c(13,12,NA,45,72,NA,23)

Remove missing values

> v=na.omit(vec);v
[1] 13 12 45 72 23
attr(,"na.action")
[1] 3 6
attr(,"class")
[1] "omit"

Statistics with missing values

> mean(vec)
[1] NA

> mean(vec,na.rm=T)
[1] 33
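The following sketch shows a few other common ways to inspect and drop missing values, using the same vector as above. The names `n_missing`, `where`, `clean` and `m` are illustrative.

```r
vec <- c(13, 12, NA, 45, 72, NA, 23)

is.na(vec)                      # logical vector, TRUE where a value is missing
n_missing <- sum(is.na(vec))    # number of missing values: 2
where     <- which(is.na(vec))  # their positions: 3 6

# Logical indexing gives a plain numeric vector, without the extra
# attributes that na.omit attaches to its result
clean <- vec[!is.na(vec)]
m <- mean(clean)                # 33, the same as mean(vec, na.rm = TRUE)
```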

• Patterned data

> a=5:15; a

[1] 5 6 7 8 9 10 11 12 13 14 15

> y=15:1; y

[1] 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

> b=seq(5,16,3);b ## the first value is 5; the final value is <= 16; by 3
[1] 5 8 11 14

> c=rep(c(1,2,3),4);c

[1] 1 2 3 1 2 3 1 2 3 1 2 3

> c(rep("M",3),rep("F",4))

[1] "M" "M" "M" "F" "F" "F" "F"

> LETTERS[1:10] # pay attention to the brackets
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

> letters[1:10]

[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
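As a further sketch, `seq` and `rep` also accept named arguments that are often more readable than positional ones. The values below are illustrative and not taken from the slides.

```r
# seq: fix the step (by) or the number of values (length.out)
s1 <- seq(0, 1, by = 0.25)           # 0.00 0.25 0.50 0.75 1.00
s2 <- seq(10, 100, length.out = 4)   # 10 40 70 100

# rep: repeat each element in turn, or give one count per element
r1 <- rep(c(1, 2, 3), each = 2)          # 1 1 2 2 3 3
r2 <- rep(c("M", "F"), times = c(3, 4))  # "M" "M" "M" "F" "F" "F" "F"
```

The last line builds the same vector as `c(rep("M",3),rep("F",4))` in a single call.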

Data frames

Structure of a data frame

• rows: observations (subjects/units/ ...)

• columns: variables

An example

Students in a statistics class measured their pulse rates; then each student tossed a coin: if heads, the student ran for one minute; if tails, the student sat. Then they measured their pulse rates again.

Variables:

PULSE1    First pulse measurement (rate per minute)
PULSE2    Second pulse measurement (rate per minute)
RAN       Whether the student ran or sat (1: heads, ran – 2: tails, sat)
SMOKES    Regular smoker? (1: yes – 2: no)
SEX       (1: M – 2: F)
HEIGHT    Height (inch)
WEIGHT    Weight (lb)
ACTIVITY  Frequency (0: nothing – 1: low – 2: moderate – 3: high)

Observations:

OBS PULSE1 PULSE2 RAN SMOKES SEX HEIGHT WEIGHT ACTIVITY
  1     64     88   1      2   1  66.00    140        2
  2     58     70   1      2   1  72.00    145        2
  3     62     76   1      1   1  73.50    160        3

How to read/create a data frame in R

Example: PULSE

Data contained in an ASCII file, with values separated by blanks.

pulse=read.table("C:/DATA/pulse.txt",header=T,row.names=1)

header=T: the first row contains the names of the variables

row.names=1: the first column contains the names of the subjects

> pulse

PULSE1 PULSE2 RAN SMOKES SEX HEIGHT WEIGHT ACTIVITY

1 64 88 1 2 1 66.00 140 2

2 58 70 1 2 1 72.00 145 2

3 62 76 1 1 1 73.50 160 3

> PULSE1 # error

Error: object 'PULSE1' not found

> pulse$PULSE1

[1] 64 58 62 66 64 74 84 68 62 76 90

[12] 80 92 68 60 62 66 70 68 72 70 74

> attach(pulse)

> PULSE1

[1] 64 58 62 66 64 74 84 68 62 76 90
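After attach, column names can clash with other objects of the same name, so some users prefer `with()`. This sketch uses only the first three rows of the PULSE data printed above. The names `pulse3`, `increase` and `increase2` are illustrative.

```r
# First three rows of the PULSE data, typed in by hand for illustration
pulse3 <- data.frame(PULSE1 = c(64, 58, 62),
                     PULSE2 = c(88, 70, 76),
                     RAN    = c(1, 1, 1))

# with() evaluates an expression using the columns of the data frame,
# without attaching it
increase <- with(pulse3, PULSE2 - PULSE1)    # 24 12 14

# Equivalent, naming the data frame explicitly each time
increase2 <- pulse3$PULSE2 - pulse3$PULSE1
```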

Main options of read.table

- file=""  name of the file, with its path (use / not \); file= can be omitted; alternatively use file.choose()

- sep=""  separator between columns (default: blank; if tab: "\t")

- dec=""  decimal separator (default: dot)

- na.strings=""  string (or vector of strings) denoting missing values (default: NA)

- header=T (or F)  if column names are in the first row (or not)

- row.names=  number or name of the column containing the names of the subjects

- skip=  number of rows to skip at the beginning of the file

- nrows=  number of rows to read

Some examples, and the use of setwd

setwd("C:/DATA/")

cicale=read.table("cicala.txt",header =T,nrows=104,row.names=1, sep=";",dec=",",na.strings="999")

d=read.table(file="C:/DATA/data.txt",header =F, row.names=2,

col.names = c("Fertility","Agriculture","Examination","Education"))
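The options above can be tried without any file on disk by passing the content through the `text=` argument of read.table. The content below is made up for illustration (hypothetical columns `len` and `weight`, with 999 coding a missing value).

```r
# Hypothetical file content: ';' separates columns, ',' is the decimal mark,
# and 999 denotes a missing value
txt <- "id;len;weight
a1;12,5;3,2
a2;999;2,8
a3;10,1;999"

d <- read.table(text = txt, header = TRUE, sep = ";", dec = ",",
                na.strings = "999", row.names = 1)

d        # the id column has become the row names
str(d)   # len and weight are numeric, each with one NA
```

The same arguments apply unchanged when the first argument is a file name.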

Quantitative/qualitative variables

Numeric/character objects

> str(pulse) ### cf. dim(pulse)
'data.frame': 92 obs. of 8 variables:

$ PULSE1 : int 64 58 62 66 64 74 84 68 62 76 ...

$ PULSE2 : int 88 70 76 78 80 84 84 72 75 118 ...

$ RAN : int 1 1 1 1 1 1 1 1 1 1 ...

$ SMOKES : int 2 2 1 1 2 2 2 2 2 2 ...

$ SEX : int 1 1 1 1 1 1 1 1 1 1 ...

$ HEIGHT : num 66 72 73.5 73 69 73 72 74 72 71 ...

$ WEIGHT : int 140 145 160 190 155 165 150 190 195 138 ...

$ ACTIVITY: int 2 2 3 1 2 1 3 2 2 2 ...

Create factors and ordered factors

> ran=factor(RAN,levels=c(1,2),labels=c("run","sit"))

> mode(ran)
[1] "numeric"

> str(ran)

Factor w/ 2 levels "run","sit": 1 1 1 1 1 1 1 1 1 1 ...

> activity=ordered(ACTIVITY,levels=c(0,1,2,3),
+ labels=c("no","low","medium","high"))

> mode(activity)
[1] "numeric"

> str(activity)

Ord.factor w/ 4 levels "no"<"low"<"medium"<..: 3 3 4 2 3 2 4 3 3 3 ...
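A short sketch of what can be done with factors once they are created. The vectors `RAN` and `ACTIVITY` below are small made-up examples that use the same coding as the PULSE data.

```r
RAN <- c(1, 1, 2, 2, 1)                 # illustrative values
ran <- factor(RAN, levels = c(1, 2), labels = c("run", "sit"))

levels(ran)                 # "run" "sit"
counts <- table(ran)        # frequency of each level: run 3, sit 2
codes  <- as.integer(ran)   # underlying integer codes: 1 1 2 2 1

ACTIVITY <- c(2, 0, 3, 1)               # illustrative values
activity <- ordered(ACTIVITY, levels = c(0, 1, 2, 3),
                    labels = c("no", "low", "medium", "high"))

# Ordered factors can be compared with one of their levels
at_least_medium <- activity >= "medium"   # TRUE FALSE TRUE FALSE
```

The integer codes are why `mode()` reports "numeric" for a factor.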

How to create a data frame (continued)

• Data frame from a text string (option: textConnection)

> data_list =
+ "id,type,gender,q1,q2,q3,q4 # no comma at the end of the line!
+ 1,1,f,1,1,5,1
+ 2,2,f,2,1,4,1
+ 3,1,f,2,2,4,3
+ 4,2, ,3,1, ,3
+ 5,1,m,4,5,2,4
+ 6,2,m,5,4,5,5
+ 7,1,m,5,3,4,4
+ 8,2,m,4,5,5,5"

> data = read.table(textConnection(data_list),header=TRUE, sep=",",
+ row.names="id", na.strings=" ")

• Data frame from vectors (command: data.frame)

> y = c(10,9,8,7,6,5,4,3,2,1)

> x = c("A","A","B","B","A","B","A","A","B","B")

> xy=data.frame(x,y)

> xy
  x  y
1 A 10
2 A  9
3 B  8

• Data frame from an Excel sheet

– in Excel: save the sheet in CSV (comma delimited) format, *.csv

– in Notepad (or another text editor): open the *.csv file and check which characters are used as the variable separator and as the decimal separator

– in R: use read.table with suitable options for the separators
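As a sketch of the last step, assume the exported file uses ; between columns and , as the decimal mark, which is common with Italian regional settings. The content and the names `csv_text`, `people` and `people2` are made up for illustration. The wrapper read.csv2 has these separators as defaults.

```r
# Hypothetical content of a CSV file exported from Excel
csv_text <- "name;height;weight
Anna;1,65;58,5
Luca;1,80;77,0"

# read.csv2 calls read.table with header=TRUE, sep=";" and dec="," by default
people <- read.csv2(text = csv_text)

# The same data read with read.table and explicit options
people2 <- read.table(text = csv_text, header = TRUE, sep = ";", dec = ",")

str(people)   # name is character, height and weight are numeric
```

For files with , as separator and . as decimal mark, read.csv plays the same role.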

Functions on a data frame

• Delete rows with missing values

data2=na.omit(data)

• Dimension of the data frame

> dim(data) ## vector with 2 elements
[1] 8 6

> dim(data)[1]

[1] 8

> dim(data)[2]

[1] 6

• Extract the row/column names

> rownames(data)

[1] "1" "2" "3" "4" "5" "6" "7" "8"

> colnames(data) ## equivalently names(data)
[1] "type" "gender" "q1" "q2" "q3" "q4"

• Select rows or columns specified by their indices

– Select rows

> Data=data[2:4,]; Data
  type gender q1 q2 q3 q4
2    2      f  2  1  4  1
3    1      f  2  2  4  3
4    2   <NA>  3  1 NA  3

> Data2=data[c(1,3),];Data2
  type gender q1 q2 q3 q4
1    1      f  1  1  5  1
3    1      f  2  2  4  3

– Select columns

data3=data[,2:4]

• Select a subset (of rows) identified by the values of a variable

– Function subset

> dataM2= subset(data,gender=="m");dataM2
  type gender q1 q2 q3 q4
5    1      m  4  5  2  4
6    2      m  5  4  5  5
7    1      m  5  3  4  4
8    2      m  4  5  5  5

– Other ways

> dataM=data[data$gender=="m",];dataM #4th subject has gender=NA
   type gender q1 q2 q3 q4
NA   NA   <NA> NA NA NA NA
5     1      m  4  5  2  4
6     2      m  5  4  5  5
7     1      m  5  3  4  4
8     2      m  4  5  5  5

> dataM1=na.omit(data[data$gender=="m",]);dataM1
  type gender q1 q2 q3 q4
5    1      m  4  5  2  4
6    2      m  5  4  5  5
7    1      m  5  3  4  4
8    2      m  4  5  5  5
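The row of NAs appears because `data$gender=="m"` is NA for the subject with missing gender. Two common ways to avoid it without calling na.omit are sketched below, on a reduced copy of the example data (only three columns, typed in by hand). The names `m1` and `m2` are illustrative.

```r
# Reduced copy of the example data, for illustration
data <- data.frame(type   = c(1, 2, 1, 2, 1, 2, 1, 2),
                   gender = c("f", "f", "f", NA, "m", "m", "m", "m"),
                   q1     = c(1, 2, 2, 3, 4, 5, 5, 4))

# An NA in the logical index produces a row of NAs: 5 rows instead of 4
n_with_na <- nrow(data[data$gender == "m", ])

# which() keeps only the positions where the condition is TRUE
m1 <- data[which(data$gender == "m"), ]

# %in% returns FALSE, never NA, for a missing value
m2 <- data[data$gender %in% "m", ]
```

Unlike na.omit, both alternatives keep rows that have missing values in other columns.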

• Select a subset of columns

data4=subset(data, select=c(gender,q1,q2))

## or equivalently

data5=subset(data, select=gender:q2)

• Select the first or the last rows

> head(data,2)

type gender q1 q2 q3 q4

1 1 f 1 1 5 1

2 2 f 2 1 4 1

> tail(data,2)

type gender q1 q2 q3 q4

7 1 m 5 3 4 4

8 2 m 4 5 5 5

• Delete rows or columns

d1=data[-c(1,3,7),]

d2=data[,-2]

• Recode a variable

ifelse(..., ...,...)

> attach(data)

> q4_a=ifelse(q4<3,-1,2); q4_a
[1] -1 -1 2 2 2 2 2 2

> q4_b=ifelse(q4<3,1,ifelse(q4==3,0,1)); q4_b
[1] 1 1 0 0 1 1 1 1
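When there are more than two classes, nested ifelse calls become hard to read. One possible alternative, not used in the slides, is `cut()`, which assigns each value to an interval. The breaks and labels below are illustrative choices.

```r
q4 <- c(1, 1, 3, 3, 4, 5, 4, 5)    # values of q4 in the example data

# Three classes: the intervals are (0,2], (2,3] and (3,5]
q4_c <- cut(q4, breaks = c(0, 2, 3, 5), labels = c("low", "mid", "high"))

as.character(q4_c)   # "low" "low" "mid" "mid" "high" "high" "high" "high"
table(q4_c)          # number of subjects in each class
```

The result is a factor, so it can be passed directly to functions such as table.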

• Concatenate “compatible” data frames

– by rows (rbind)

> dataM2= subset(data,gender=="m");dataF2=subset(data,gender=="f")

> dataMF=rbind(dataM2,dataF2);dataMF
type gender q1 q2 q3 q4

5 1 m 4 5 2 4

6 2 m 5 4 5 5

7 1 m 5 3 4 4

8 2 m 4 5 5 5

51 1 f 1 1 5 1

61 2 f 2 1 4 1

71 1 f 2 2 4 3

– by columns (cbind)

> dataA=subset(data,select=q1:q2);dataB=subset(data,select=type:gender)

> dataC=cbind(dataA,dataB);dataC
q1 q2 type gender

1 1 1 1 f

2 2 1 2 f

3 2 2 1 f

4 3 1 2 <NA>

5 4 5 1 m

6 5 4 2 m

7 5 3 1 m

8 4 5 2 m

• Ordering a data frame by one or more variables

> dd2= data[with(data, order(type)), ];dd2
type gender q1 q2 q3 q4

1 1 f 1 1 5 1

3 1 f 2 2 4 3

5 1 m 4 5 2 4

7 1 m 5 3 4 4

2 2 f 2 1 4 1

4 2 <NA> 3 1 NA 3

6 2 m 5 4 5 5

8 2 m 4 5 5 5

> dd3= data[with(data, order(type,-q1)), ]

> dd3 ## first type ascending, then q1 descending
type gender q1 q2 q3 q4

7 1 m 5 3 4 4

5 1 m 4 5 2 4

3 1 f 2 2 4 3

1 1 f 1 1 5 1

6 2 m 5 4 5 5

8 2 m 4 5 5 5

4 2 <NA> 3 1 NA 3

2 2 f 2 1 4 1
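The minus sign used above works only for numeric variables. For a character variable, one option is to negate `xtfrm()`, which converts the values to numeric ranks. This is a sketch on a small made-up data frame. The names `o1` and `o2` are illustrative.

```r
# Small illustrative data frame
data <- data.frame(type   = c(1, 2, 1, 2),
                   gender = c("f", "m", "m", "f"),
                   q1     = c(1, 5, 4, 2))

# gender ascending, then q1 descending
o1 <- order(data$gender, -data$q1)          # 4 1 2 3

# gender descending, then q1 ascending: -data$gender would be an error,
# so the character column is first turned into ranks with xtfrm()
o2 <- order(-xtfrm(data$gender), data$q1)   # 3 2 1 4

data[o1, ]
data[o2, ]
```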

Matrices

• Assign values to a matrix

> A=matrix(c(2,3,4,5),nrow=2,byrow=T)

> B=matrix(c(1,2,0,1),nrow=2,byrow=F)

> colnames(B)=c("C1","C2")

> rownames(B)=c("R1","R2")

> A

[,1] [,2]

[1,] 2 3

[2,] 4 5

> B

   C1 C2
R1  1  0
R2  2  1

• Operations with matrices

Element-wise operations (*, +, /, ...)

> A*B
   C1 C2
R1  2  0
R2  8  5

Matrix multiplication

> A%*%B
     C1 C2
[1,]  8  3
[2,] 14  5

Determinant

> det(A)
[1] -2

Inverse

> solve(A)
     [,1] [,2]
[1,] -2.5  1.5
[2,]  2.0 -1.0

Diagonal

> diag(A)
[1] 2 5

> diag(diag(A))
     [,1] [,2]
[1,]    2    0
[2,]    0    5

Solve AX = B

> solve(A,B)

      C1   C2
[1,] 0.5  1.5
[2,] 0.0 -1.0

Eigenvalues/vectors

> E=eigen(A);E

$values

[1] 7.2749172 -0.2749172

$vectors

           [,1]       [,2]
[1,] -0.4943691 -0.7968121
[2,] -0.8692521  0.6042272

> round(E$values,3)
[1] 7.275 -0.275

> round(E$vectors,3)
       [,1]   [,2]
[1,] -0.494 -0.797
[2,] -0.869  0.604

E is a list with two elements, values and vectors, which are extracted with $.

Note the use of round to limit the number of decimals printed.
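The results of solve and eigen can be checked numerically. The sketch below rebuilds A and B as defined above (without row and column names) and compares with `all.equal`, which tolerates rounding error. The name `X` is illustrative.

```r
A <- matrix(c(2, 3, 4, 5), nrow = 2, byrow = TRUE)
B <- matrix(c(1, 2, 0, 1), nrow = 2, byrow = FALSE)

# X solves A X = B, so A %*% X should reproduce B
X <- solve(A, B)
all.equal(A %*% X, B)                  # TRUE

# The inverse times A is the identity matrix
all.equal(solve(A) %*% A, diag(2))     # TRUE

# Eigen decomposition: A V = V diag(lambda)
E <- eigen(A)
all.equal(A %*% E$vectors, E$vectors %*% diag(E$values))   # TRUE

# The eigenvalues sum to the trace and multiply to the determinant
sum(E$values)    # 7
prod(E$values)   # -2
```

Exact comparison with == is best avoided here, because the results are floating-point numbers.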

• Matrix and Data frame

> mdata=data.matrix(data);mdata
type gender q1 q2 q3 q4

1 1 1 1 1 5 1

2 2 1 2 1 4 1

3 1 1 2 2 4 3

4 2 NA 3 1 NA 3

5 1 2 4 5 2 4

6 2 2 5 4 5 5

7 1 2 5 3 4 4

8 2 2 4 5 5 5

> DB=data.frame(B)

> DB
   C1 C2
R1  1  0
R2  2  1

Note that row and column names are preserved.

Help

• R site

http://www.r-project.org

• help(<function name>) (if you know the name of the function!)

> help(mean)

This opens the help page of the function.

• apropos("<part of function name>")

> apropos("med")

[1] ".__C__namedList" "elNamed" "elNamed<-" "median"

[5] "median.default" "medpolish" "runmed"
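A few other ways to look for help, as a sketch. The interactive commands are wrapped in `if (interactive())` so that the block also runs as a script. The exact list returned by apropos depends on the packages that are loaded.

```r
# apropos accepts a regular expression: names that start with "med"
hits <- apropos("^med")
hits

# args shows the arguments of a function without opening the help page
args(median)

if (interactive()) {
  ?mean                    # shorthand for help(mean)
  help.search("median")    # search the installed documentation by keyword
  example(mean)            # run the examples from the help page
}
```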