- A programming language (interpreter)
- A statistics package
- **An environment for statistical computing and graphics**
- Developed by a community
- Extended with ‘packages’ that contain data, code, and documentation

- R is a flavor of the S computer language
- S was developed by John Chambers at Bell Labs in the late 1970s

*[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.* (John Chambers)

- 1991 R is created by Ross Ihaka and Robert Gentleman
  - Why **R**?
- 1993 R is made public
- 1995 R becomes Open Source (GNU General Public License)
- 1997 R Core Group is formed
- 2000 Version 1.0.0 ships

Why not Stata, or SPSS, or …?

- A flexible programming language
- Open Source and free (philosophy, or practical reasons)
  - It is free to study the code
  - It is free to redistribute it
  - It is free to modify it
  - It is free to redistribute the modified version
  - It is free of charge, too

- Platform independent (Linux, Mac, Windows, desktops, servers)
- Easy to share with others
- Suitable for almost all scientific disciplines
- Extensive, diverse and growing community
- Ever increasing number of packages

- Easy interactions with other programs (web, presentation, databases, big data, etc.)

**Part of the open source toolchain of research (from data analysis to reporting, for the web or a thesis)**

**Advantages**

- Free of charge, easy to install
- Strong community support
- Up-to-date
- Can complement / can be complemented by other programs
- Requires user knowledge - user thinks about what s/he is doing

**Drawbacks**

- Steep learning curve
- Not user friendly
- All objects are stored in the computer memory
- Slower than compiled languages
- Easy to make mistakes, difficult to find the sources of mistakes

The number of scholarly articles found by Google Scholar

Number of people who follow each software package on LinkedIn and Quora

Number of R packages available on its main distribution site

Base R and most R packages are available at cran.r-project.org

- There are thousands of R packages
- A package for almost anything
- More than one package for almost everything

- Commonly used packages
  - data.table
  - ggplot2
  - reshape2
  - dplyr

RStudio can be downloaded from its official website.

- package
- library
- workspace
- environment
- class/object
  - data frame
  - vector

- R is case sensitive!
- R is inconsistent in its naming conventions
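A small sketch of both points. The variable names `value` and `Value` are made up for illustration; the four function names are real base R / utils functions that happen to follow four different naming styles.

```r
# Case sensitivity: these are two distinct objects (hypothetical names)
value <- 1
Value <- 2
value == Value   # FALSE

# Naming styles that coexist in base R:
# dot.separated, Capital.dot, camelCase, snake_case
styles <- c("read.csv", "Sys.time", "colMeans", "seq_len")
vapply(styles, exists, logical(1))   # all TRUE in a default R session
```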

- Install the package on your computer (the package name must be quoted)

`install.packages("ggplot2")`

- Update all packages/one package

```
update.packages()
update.packages(oldPkgs = "ggplot2") # Only consider this package for updating
```

- Load the library (packages)

```
library(ggplot2) # Throws an error if the package is not installed
require(ggplot2) # Returns FALSE with a warning if the package is not installed
```
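The difference between the two calls can be checked without installing anything. This is a minimal sketch that assumes `notARealPackage123` (a made-up name) is not installed; `requireNamespace()` is shown as a third option that tests availability without attaching the package.

```r
# "notARealPackage123" is a made-up name, assumed not to be installed
pkg <- "notARealPackage123"

# library() signals an error; it is caught here so the script can continue
lib_result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)

# require() returns FALSE and issues a warning (suppressed here)
req_result <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)

# requireNamespace() only checks whether the package can be loaded
ns_result <- requireNamespace(pkg, quietly = TRUE)

lib_result   # "error"
req_result   # FALSE
ns_result    # FALSE
```

Because `require()` fails quietly, `library()` is usually the safer choice at the top of a script: a missing package stops the run immediately instead of causing a confusing error later.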

- Session information

`sessionInfo()`

- List all objects in the current environment

`ls()`

- List all objects in a package

`ls("package:ggplot2")`

- Remove objects

```
rm(objectname)
rm(list = ls())
```

- Help

```
?keyword # Help files for functions
??keyword # Search R documentation
```

- Help on the web
- Stack Overflow, or just Google it

- New packages/applications

`2 + 3`

`## [1] 5`

`2 * 5`

`## [1] 10`

`5 / 0`

`## [1] Inf`

`0 / 0`

`## [1] NaN`

`is.na(10 / 0)`

`## [1] FALSE`

`is.finite(10 / 0)`

`## [1] FALSE`
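The special values above behave differently under the various `is.*()` tests. A short sketch that spells out the distinctions between `NA`, `NaN`, and `Inf`:

```r
is.na(NA)                       # TRUE
is.na(NaN)                      # TRUE: NaN also counts as missing
is.nan(NA)                      # FALSE: NA is not "not a number"
is.nan(0 / 0)                   # TRUE
is.infinite(-5 / 0)             # TRUE: negative infinity
is.finite(c(1, Inf, NaN, NA))   # TRUE FALSE FALSE FALSE
```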

`2 * pi`

`## [1] 6.283185`

`sqrt(4^4)`

`## [1] 16`

`2 * 3 + 4`

`## [1] 10`

`15 %/% 4`

`## [1] 3`

`15 %% 4`

`## [1] 3`
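`%/%` is integer division and `%%` is the remainder, so the two results always recombine into the original number. A brief sketch, including the negative case, where R rounds the quotient toward minus infinity:

```r
4 * (15 %/% 4) + 15 %% 4   # 15: quotient * divisor + remainder

(-15) %/% 4                # -4: rounds down, not toward zero
(-15) %% 4                 # 1: the remainder takes the sign of the divisor
```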

`1 & 0`

`## [1] FALSE`

`1 | 0`

`## [1] TRUE`

`3 < 6`

`## [1] TRUE`

`3.33 <= 10 / 3`

`## [1] TRUE`

`TRUE & FALSE`

`## [1] FALSE`

`TRUE | FALSE`

`## [1] TRUE`

`FALSE | !FALSE`

`## [1] TRUE`

`TRUE && FALSE`

`## [1] FALSE`

`FALSE && TRUE`

`## [1] FALSE`
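The single and double forms give the same answer for single values, but they are not interchangeable. As a sketch: `&` and `|` work element by element on vectors, while `&&` and `||` work on single values and stop evaluating as soon as the result is known.

```r
# Element-wise on vectors
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)   # TRUE FALSE FALSE
c(TRUE, FALSE, TRUE) | c(TRUE, TRUE, FALSE)   # TRUE TRUE TRUE

# Short-circuit: the right-hand side is never evaluated here
FALSE && stop("not evaluated")   # FALSE
TRUE || stop("not evaluated")    # TRUE
```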

`exp(1)`

`## [1] 2.718282`

`log(exp(1))`

`## [1] 1`

`log10(10)`

`## [1] 1`

`log2(2^3)`

`## [1] 3`

`round(10/3, digits = 2)`

`## [1] 3.33`

`3*round(10/3, digits = 2)`

`## [1] 9.99`

`ceiling(10/3)`

`## [1] 4`

`floor(10/3)`

`## [1] 3`
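A few edge cases of the rounding functions are worth seeing once. This sketch shows that `round()` sends exact halves to the even neighbor, and that `floor()`, `ceiling()`, and `trunc()` differ for negative numbers.

```r
round(0.5)                   # 0
round(1.5)                   # 2
round(2.5)                   # 2: halves go to the even neighbor

signif(10 / 3, digits = 3)   # 3.33: significant digits, not decimal places

floor(-10 / 3)               # -4: toward minus infinity
ceiling(-10 / 3)             # -3: toward plus infinity
trunc(-10 / 3)               # -3: toward zero
```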

- `<-` is the assignment operator

`a <- 5`

- `=` also works for assignment, but `<-` is preferred

```
b = 3L
a * b
```

`## [1] 15`

- `:` is the sequence operator

```
c <- 1:3
c
```

`## [1] 1 2 3`

`(c <- 100:103)`

`## [1] 100 101 102 103`
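The `:` operator only steps by one. A short sketch of its less obvious behavior, alongside `seq()` and `seq_len()`, which cover the other common cases (no objects are assigned, so the workspace is unchanged):

```r
5:1                         # 5 4 3 2 1: counts down when from > to
1:3 * 2                     # 2 4 6: ":" is evaluated before "*"
seq(1, 10, by = 3)          # 1 4 7 10
seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

1:0                         # 1 0: a common pitfall when a length is zero
seq_len(0)                  # integer(0): the safe alternative
```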

- `c()` combines values into a vector or a list

```
d <- c(1, 2, 3, 4, 5)
e <- c(1:5)
f <- c(1, 2, "a", "b", 5)
```
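The vector `f` mixes numbers and strings, so every element ends up as a string. A sketch of the general rule: a vector holds a single type, and `c()` converts everything to the most flexible type present (logical, then integer, then double, then character). No objects are assigned here.

```r
class(c(TRUE, 2L))           # "integer"
class(c(1L, 2.5))            # "numeric"
class(c(1, "a"))             # "character"
c(TRUE, 2L)                  # 1 2: TRUE became 1

# A list keeps each element's own type
class(list(1, "a", TRUE))    # "list"
```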

- List all objects

`ls()`

`## [1] "a" "b" "c" "d" "e" "f"`

- List all objects with their description

`ls.str()`

```
## a : num 5
## b : int 3
## c : int [1:4] 100 101 102 103
## d : num [1:5] 1 2 3 4 5
## e : int [1:5] 1 2 3 4 5
## f : chr [1:5] "1" "2" "a" "b" "5"
```

- Remove objects

```
rm(a)
ls()
```

`## [1] "b" "c" "d" "e" "f"`

```
rm(list=ls())
ls()
```

`## character(0)`
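`ls()` and `rm()` act on the current environment by default, but both accept an `envir` argument. As a hedged sketch, the same steps can be tried in a separate, throwaway environment (here called `scratch`, a name chosen for illustration) so that nothing in the real workspace is touched.

```r
scratch <- new.env()               # an empty environment to experiment in

assign("a", 5, envir = scratch)    # same effect as `a <- 5`, but inside scratch
assign("b", 3L, envir = scratch)
ls(scratch)                        # "a" "b"

rm("a", envir = scratch)           # remove one object
ls(scratch)                        # "b"

rm(list = ls(scratch), envir = scratch)   # remove everything in scratch
ls(scratch)                               # character(0)
```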

- `#` starts a comment. Anything after the `#` symbol on a line is ignored by R

`# 2 + 3`

Everything in R is an object

- Data

```
a <- c(1:5)
a
```

`## [1] 1 2 3 4 5`

- Functions

`sum(a)`

`## [1] 15`

`sum`

`## function (..., na.rm = FALSE) .Primitive("sum")`

- Graphs, too (although base `plot()` draws as a side effect and returns `NULL`)

`b <- plot(a)`

`b`

`## NULL`

- Results of functions can be stored in objects

```
a <- rnorm(100)
b <- a + rnorm(100)
model_1 <- lm(a ~ b)
model_1
```

```
##
## Call:
## lm(formula = a ~ b)
##
## Coefficients:
## (Intercept) b
## -0.002034 0.520716
```
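Because the fitted model is an ordinary object, its parts can be pulled out and reused. A minimal sketch with separate, made-up object names (`x_demo`, `y_demo`, `fit_demo`) and an arbitrary seed, so the numbers will not match the output printed above:

```r
set.seed(123)                       # arbitrary seed, only for reproducibility
x_demo <- rnorm(100)
y_demo <- x_demo + rnorm(100)
fit_demo <- lm(x_demo ~ y_demo)

class(fit_demo)                     # "lm"
is.list(fit_demo)                   # TRUE: a model is a list with a class
names(coef(fit_demo))               # "(Intercept)" "y_demo"
length(residuals(fit_demo))         # 100, one residual per observation
dim(coef(summary(fit_demo)))        # 2 4: the coefficient table as a matrix
```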

- Look at summaries

`summary(a)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.02800 -0.62460 -0.01513 0.08441 0.92330 2.88700
```

`summary(model_1)`

```
##
## Call:
## lm(formula = a ~ b)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3665 -0.5560 -0.1041 0.4757 1.7655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.002034 0.072584 -0.028 0.978
## b 0.520716 0.048414 10.755 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7214 on 98 degrees of freedom
## Multiple R-squared: 0.5414, Adjusted R-squared: 0.5367
## F-statistic: 115.7 on 1 and 98 DF, p-value: < 2.2e-16
```

- … and plots

`plot(a)`

`plot(a, b)`

`plot(model_1)`

- A function’s output depends on the class of the object!
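Functions such as `summary()` and `plot()` are generic: they look at the class of their argument and call a matching method. The sketch below imitates that mechanism with a made-up generic called `describe()`; the function and its methods are hypothetical and exist only for this illustration.

```r
# A hypothetical generic function and three methods
describe <- function(x, ...) UseMethod("describe")

describe.default <- function(x, ...) paste("object of class", class(x)[1])
describe.numeric <- function(x, ...) paste("numeric vector, mean", mean(x))
describe.factor  <- function(x, ...) paste("factor with", nlevels(x), "levels")

describe(c(2, 4, 6))            # "numeric vector, mean 4"
describe(factor(c("a", "b")))   # "factor with 2 levels"
describe("text")                # "object of class character"
```

The same call, `describe(x)`, produces three different results because R picks the method from the class of `x`.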

- Numeric
  - Integer
  - Double

- Character
- Logical

```
a <- 5
b <- "a"
c <- TRUE
A <- c(1:5)
B <- c("a", "b", "c")
C <- c(TRUE, FALSE, TRUE)
class(a)
```

`## [1] "numeric"`

`class(A)`

`## [1] "integer"`

`class(b)`

`## [1] "character"`

`class(B)`

`## [1] "character"`

`class(c)`

`## [1] "logical"`

`class(C)`

`## [1] "logical"`

`is.numeric(C)`

`## [1] FALSE`

`is.logical(C)`

`## [1] TRUE`

`as.numeric(C)`

`## [1] 1 0 1`

`C + 1`

`## [1] 2 1 2`
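Since `TRUE` counts as 1 and `FALSE` as 0, logical vectors can be summed and averaged directly. A short sketch of this and of a few other explicit conversions (the name `flags` is chosen for illustration):

```r
flags <- c(TRUE, FALSE, TRUE)
sum(flags)                            # 2: the number of TRUE values
mean(flags)                           # 0.6666667: the proportion of TRUE values

as.integer(3.9)                       # 3: truncates, does not round
as.numeric("3.5")                     # 3.5
suppressWarnings(as.numeric("abc"))   # NA: conversion fails (normally with a warning)
```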

`rm(list=ls())`

- Factors are categorical variables
  - Unordered
  - Ordered

```
a <- c("small", "large", "medium")
a
```

`## [1] "small" "large" "medium"`

```
b <- a[sample(3, 10, replace = TRUE)]
b
```

```
## [1] "medium" "medium" "medium" "small" "large" "medium" "small"
## [8] "large" "small" "small"
```

```
bfactor <- factor(b)
bfactor
```

```
## [1] medium medium medium small large medium small large small small
## Levels: large medium small
```

```
bordered <- ordered(bfactor, levels=c("small", "medium", "large"))
bordered
```

```
## [1] medium medium medium small large medium small large small small
## Levels: small < medium < large
```

`as.numeric(bfactor)`

`## [1] 2 2 2 3 1 2 3 1 3 3`

`bfactor`

```
## [1] medium medium medium small large medium small large small small
## Levels: large medium small
```

`as.numeric(bordered)`

`## [1] 2 2 2 1 3 2 1 3 1 1`

`bordered`

```
## [1] medium medium medium small large medium small large small small
## Levels: small < medium < large
```
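The practical benefit of an ordered factor is that comparisons respect the level order. A minimal sketch with a small fixed vector (`sizes`, a made-up example) rather than the random sample above:

```r
sizes <- c("medium", "small", "large", "small")
f <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)

f < "large"      # TRUE TRUE FALSE TRUE
max(f)           # large
as.integer(f)    # 2 1 3 1: the position of each value in the level order
table(f)         # counts per level: small 2, medium 1, large 1
```

The same comparison on an unordered factor returns `NA` with a warning, because the levels have no defined order.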

```
x <- rnorm(10)
y <- x + as.numeric(bordered) + rnorm(10)
summary(lm(y ~ x + bordered))
```

```
##
## Call:
## lm(formula = y ~ x + bordered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6688 -0.2544 0.0000 0.4504 1.2658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9621 0.3222 6.089 0.000892 ***
## x 1.2917 0.3470 3.722 0.009826 **
## bordered.L 1.2009 0.5945 2.020 0.089912 .
## bordered.Q -0.1416 0.6554 -0.216 0.836101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9658 on 6 degrees of freedom
## Multiple R-squared: 0.7831, Adjusted R-squared: 0.6746
## F-statistic: 7.221 on 3 and 6 DF, p-value: 0.02042
```

`summary(lm(y ~ x + bfactor))`

```
##
## Call:
## lm(formula = y ~ x + bfactor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6688 -0.2544 0.0000 0.4504 1.2658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7534 0.6957 3.958 0.00747 **
## x 1.2917 0.3470 3.722 0.00983 **
## bfactormedium -0.6757 0.9466 -0.714 0.50215
## bfactorsmall -1.6983 0.8408 -2.020 0.08991 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9658 on 6 degrees of freedom
## Multiple R-squared: 0.7831, Adjusted R-squared: 0.6746
## F-statistic: 7.221 on 3 and 6 DF, p-value: 0.02042
```
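The two models fit equally well (same residuals, same R-squared) but report different coefficients because R codes the two factor types differently. The sketch below assumes R's default `options("contrasts")` setting: unordered factors get treatment (dummy) contrasts against a baseline level, while ordered factors get polynomial contrasts, which is where the `.L` (linear) and `.Q` (quadratic) terms come from. The object names are chosen for illustration.

```r
lv <- c("small", "medium", "large")
unordered_f <- factor(lv, levels = lv)
ordered_f   <- factor(lv, levels = lv, ordered = TRUE)

contrasts(unordered_f)              # 0/1 dummy columns, "small" is the baseline
contrasts(ordered_f)                # orthogonal polynomial columns

colnames(contrasts(unordered_f))    # "medium" "large"
colnames(contrasts(ordered_f))      # ".L" ".Q"
```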

```
a <- factor(c("3", "11", "2", "23", "313", "2"))
a
```

```
## [1] 3 11 2 23 313 2
## Levels: 11 2 23 3 313
```

`a + 1`

`## Warning in Ops.factor(a, 1): '+' not meaningful for factors`

`## [1] NA NA NA NA NA NA`

```
afnumeric <- as.numeric(a)
afnumeric
```

`## [1] 4 1 2 3 5 2`

```
anumeric <- as.numeric(as.character(a))
anumeric
```

`## [1] 3 11 2 23 313 2`
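`as.numeric()` on a factor returns the internal level codes, not the labels, which is why the intermediate `as.character()` step is needed. A sketch of an equivalent idiom, described in R's own help page for `factor`, that converts the levels once and then indexes them by the codes:

```r
a <- factor(c("3", "11", "2", "23", "313", "2"))

as.numeric(a)                  # 4 1 2 3 5 2: level codes, not the values
as.numeric(as.character(a))    # 3 11 2 23 313 2
as.numeric(levels(a))[a]       # 3 11 2 23 313 2: same result
```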

`rm(list=ls())`

- Dates are stored internally as numeric values (the number of days since the origin date)
- The origin date: January 1, 1970
- Different date formats can be used

```
date1 <- as.Date("02/28/2016", format = "%m/%d/%Y")
date2 <- as.Date("28 February 10", format = "%d %B %y")
date3 <- as.Date("02/30/2016", format = "%m/%d/%Y")
date1
```

`## [1] "2016-02-28"`

`date2`

`## [1] "2010-02-28"`

`date3`

`## [1] NA`

`date1 - date2`

`## Time difference of 2191 days`

`weekdays(date2)`

`## [1] "Sunday"`

`as.numeric(date2)`

`## [1] 14668`
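Because a date is just a day count, ordinary arithmetic works on it. A short sketch using only numeric formats, since month and weekday names such as `%B` and `weekdays()` depend on the system locale:

```r
d <- as.Date("2016-02-28")

d + 1                                    # "2016-02-29": 2016 is a leap year
format(d, "%d/%m/%Y")                    # "28/02/2016"
seq(d, by = "day", length.out = 3)       # "2016-02-28" "2016-02-29" "2016-03-01"

as.numeric(as.Date("1970-01-01"))        # 0: the origin itself
as.Date(14668, origin = "1970-01-01")    # "2010-02-28": back from number to date
```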

- Vector
  - Everything is a vector in R

- Matrix, array
- Data frame
- Data table
- Lists
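A minimal sketch of the base R structures in this list, with made-up object names and values. Data tables are left out because they need the separate data.table package.

```r
# Vector: even a single value is a vector of length 1
is.vector(5)    # TRUE
length(5)       # 1
v <- c(10, 20, 30)

# Matrix: a vector with two dimensions, filled column by column
m <- matrix(1:6, nrow = 2)
dim(m)          # 2 3
m[2, 3]         # 6

# Data frame: columns of equal length that may differ in type
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)
df$label[2]     # "b"

# List: elements of any type and any length
lst <- list(numbers = v, table = df, note = "anything")
class(lst$table)   # "data.frame"
length(lst)        # 3
```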