What is R?

R as software

  • A programming language (interpreted)
  • A statistics package
  • An environment for statistical computing and graphics
  • Developed by a community
    • Extended with ‘packages’ that contain data, code, and documentation

History

  • R is a flavor of the S computer language
  • S was developed by John Chambers at Bell Labs in the late 1970s
    • “[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.” (John Chambers)
  • 1991 R is created by Ross Ihaka and Robert Gentleman
    • Why R?
  • 1993 R is made public
  • 1995 R becomes Open Source (GNU General Public License)
  • 1997 R Core Group is formed
  • 2000 Version 1.0.0 ships

Why R?

Why not?

Why not Stata, or SPSS, or …?

  • A flexible programming language
  • Open source and free (for philosophical or practical reasons)
    • You are free to study the code
    • You are free to redistribute it
    • You are free to modify it
    • You are free to redistribute the modified version
      • It is free of charge, too
  • Platform independent (Linux, Mac, Windows, desktops, servers)
  • Easy to share with others
  • Suitable for almost all scientific disciplines
    • Extensive, diverse and growing community
    • Ever increasing number of packages
  • Easy interactions with other programs (web, presentation, databases, big data, etc.)
  • Part of the open source toolchain of research (from data analysis to reporting, for the web or a thesis)

Pros and cons

  • Advantages
    • Free of charge, easy to install
    • Strong community support
    • Up-to-date
    • Can complement / can be complemented by other programs
    • Requires user knowledge: users have to think about what they are doing
  • Drawbacks
    • Steep learning curve
    • Not user friendly
    • All objects are held in the computer's memory (RAM), which limits the size of the data
    • Slower than compiled languages
    • Easy to make mistakes, difficult to find the sources of mistakes

Some statistics

  • The number of scholarly articles found by Google Scholar (figure: Articles)

  • Number of people who follow each software package on LinkedIn and Quora (figure: Followers)

  • Number of R packages available on its main distribution site (figure: Packages)

Source: Robert A. Muenchen, “The Popularity of Data Analysis Software”, r4stats

Installing R

  • Base R and most R packages are available at cran.r-project.org

  • There are thousands of R packages
    • A package for almost anything
    • More than one package for almost everything
  • Commonly used packages
    • data.table
    • ggplot2
    • reshape2
    • dplyr

Using R

IDE (RStudio)

RStudio can be downloaded from its website.

Basic vocabulary

  • package
  • library
  • workspace
  • environment
  • class/object
    • data frame
    • vector
  • R is case sensitive!
  • R is inconsistent in its naming conventions

Installing R packages

  • Install the package on your computer
install.packages("ggplot2")   # The package name must be quoted
  • Update all packages/one package
update.packages()
update.packages(oldPkgs = "ggplot2")
  • Load the package
library(ggplot2)   # Stops with an error if the package is not installed
require(ggplot2)   # Returns FALSE with a warning if the package is not installed
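
A common pattern is to install a package only when it is missing. The sketch below assumes a CRAN mirror is reachable; the helper name `ensure_package` is hypothetical and not part of R.

```r
# Hypothetical helper: install a package only if it is not yet available
ensure_package <- function(pkg) {
  # requireNamespace() returns TRUE/FALSE instead of stopping with an error
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs an internet connection and a CRAN mirror
  }
  # TRUE if the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}

ensure_package("stats")     # "stats" ships with R, so nothing is installed
```

After this call, `library()` can be used as usual to attach the package.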

Session information

  • Session information
sessionInfo()
  • List all objects in the current environment
ls()
  • List all objects in a package
ls("package:ggplot2")
  • Remove objects
rm(objectname)
rm(list = ls())

Getting help

  • Help
?keyword     # Help files for functions
??keyword    # Search R documentation
  • Help on the web
    • Stack Overflow (a web search usually leads there)
  • New packages/applications
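
Besides `?` and `??`, a few base functions help when only part of a function name is known. A small sketch; the search pattern is just an example.

```r
# apropos() lists objects whose names match a regular expression
read_funs <- apropos("^read\\.csv")
read_funs

# args() shows a function's arguments without opening its help page
args(round)
```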

R as a calculator

2 + 3
## [1] 5
2 * 5
## [1] 10
5 / 0
## [1] Inf
0 / 0
## [1] NaN
is.na(10 / 0)
## [1] FALSE
is.finite(10 / 0)
## [1] FALSE
2 * pi
## [1] 6.283185
sqrt(4^4)
## [1] 16
2 * 3 + 4
## [1] 10
15 %/% 4
## [1] 3
15 %% 4
## [1] 3
1 & 0
## [1] FALSE
1 | 0
## [1] TRUE
3 < 6
## [1] TRUE
3.33 <= 10 / 3
## [1] TRUE
TRUE & FALSE
## [1] FALSE
TRUE | FALSE
## [1] TRUE
FALSE | !FALSE
## [1] TRUE
TRUE && FALSE
## [1] FALSE
FALSE && TRUE
## [1] FALSE
exp(1)
## [1] 2.718282
log(exp(1))
## [1] 1
log10(10)
## [1] 1
log2(2^3)
## [1] 3
round(10/3, digits = 2)
## [1] 3.33
3*round(10/3, digits = 2)
## [1] 9.99
ceiling(10/3)
## [1] 4
floor(10/3)
## [1] 3
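
The session above uses both `&` and `&&` without showing how they differ, and it compares decimal numbers directly. A short sketch of both points, with made-up example vectors `x` and `y`:

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# & and | work element by element and return a vector
x & y
## [1]  TRUE FALSE FALSE
x | y
## [1] TRUE TRUE TRUE

# && and || expect single values and stop as soon as the result is known,
# so the right-hand side is not evaluated here
FALSE && stop("never evaluated")
## [1] FALSE

# Floating-point results are approximate: compare with a tolerance
0.1 + 0.2 == 0.3
## [1] FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE
```

As a rule of thumb, `&` and `|` suit vectors of data, while `&&` and `||` suit single conditions, for example inside `if ()`.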

Key operators

  • <- is the assignment operator
a <- 5
  • = also assigns, but <- is the preferred style
b = 3L
a * b
## [1] 15
  • : is the sequence operator
c <- 1:3
c
## [1] 1 2 3
(c <- 100:103)
## [1] 100 101 102 103
  • c() combines values into a vector or a list
d <- c(1, 2, 3, 4, 5)
e <- c(1:5)
f <- c(1, 2, "a", "b", 5)
  • List all objects
ls()
## [1] "a" "b" "c" "d" "e" "f"
  • List all objects with their description
ls.str()
## a :  num 5
## b :  int 3
## c :  int [1:4] 100 101 102 103
## d :  num [1:5] 1 2 3 4 5
## e :  int [1:5] 1 2 3 4 5
## f :  chr [1:5] "1" "2" "a" "b" "5"
  • Remove objects
rm(a)
ls()
## [1] "b" "c" "d" "e" "f"
rm(list=ls())
ls()
## character(0)
  • # means comment. Anything after the # symbol is ignored by R
# 2 + 3

R objects

Everything in R is an object

  • Data
a <- c(1:5)
a
## [1] 1 2 3 4 5
  • Functions
sum(a)
## [1] 15
sum
## function (..., na.rm = FALSE)  .Primitive("sum")
  • Graphs, too (although base plot() draws the graph as a side effect and returns NULL)
b <- plot(a)

b
## NULL
  • Results of functions can be stored in objects
a <- rnorm(100)
b <- a + rnorm(100)
model_1 <- lm(a ~ b)
model_1
## 
## Call:
## lm(formula = a ~ b)
## 
## Coefficients:
## (Intercept)            b  
##   -0.002034     0.520716
  • Look at summaries
summary(a)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.02800 -0.62460 -0.01513  0.08441  0.92330  2.88700
summary(model_1)
## 
## Call:
## lm(formula = a ~ b)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3665 -0.5560 -0.1041  0.4757  1.7655 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.002034   0.072584  -0.028    0.978    
## b            0.520716   0.048414  10.755   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7214 on 98 degrees of freedom
## Multiple R-squared:  0.5414, Adjusted R-squared:  0.5367 
## F-statistic: 115.7 on 1 and 98 DF,  p-value: < 2.2e-16
  • … and plots
plot(a)

plot(a, b)

plot(model_1)

  • A function’s output depends on the class of the object it is given!
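
The different results of `summary(a)` and `summary(model_1)` come from method dispatch: `summary()` is a generic function that picks a method based on the class of its argument. The sketch below illustrates the mechanism with a toy class; the class name `temperature` and its method are hypothetical, invented for this example.

```r
# A toy class "temperature" (hypothetical, for illustration only)
temps <- structure(c(18.5, 21.0, 19.2), class = "temperature")

# A method is an ordinary function named generic.class
summary.temperature <- function(object, ...) {
  cat("Mean temperature:", mean(unclass(object)), "degrees\n")
  invisible(object)
}

summary(temps)            # dispatches to summary.temperature()
summary(unclass(temps))   # plain numeric vector: the default method is used
```

The same mechanism explains why `plot(a)` and `plot(model_1)` draw different graphs.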

Data modes

  • Numeric
    • Integer
    • Double
  • Character
  • Logical
a <- 5
b <- "a"
c <- TRUE
A <- c(1:5)
B <- c("a", "b", "c")
C <- c(TRUE, FALSE, TRUE)   # T and F also work, but they can be overwritten
class(a)
## [1] "numeric"
class(A)
## [1] "integer"
class(b)
## [1] "character"
class(B)
## [1] "character"
class(c)
## [1] "logical"
class(C)
## [1] "logical"
is.numeric(C)
## [1] FALSE
is.logical(C)
## [1] TRUE
as.numeric(C)
## [1] 1 0 1
C + 1
## [1] 2 1 2
rm(list=ls())
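
`class(5)` returns "numeric" while `class(1:5)` returns "integer"; `typeof()` shows how a value is actually stored. A short sketch, assuming nothing beyond base R:

```r
# Whole numbers are stored as doubles unless marked with the L suffix
typeof(5)
## [1] "double"
typeof(5L)
## [1] "integer"
is.integer(5)
## [1] FALSE

# The other basic types
typeof(1:5)
## [1] "integer"
typeof("a")
## [1] "character"
typeof(TRUE)
## [1] "logical"

# Explicit conversion with the as.* family
as.integer("42") + 1L
## [1] 43
as.logical(c("TRUE", "false", "yes"))
## [1]  TRUE FALSE    NA
```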

Factors

  • Factors are categorical variables
    • Unordered
    • Ordered
a <- c("small", "large", "medium")
a
## [1] "small"  "large"  "medium"
b <- a[sample(3, 10, replace = TRUE)]
b
##  [1] "medium" "medium" "medium" "small"  "large"  "medium" "small" 
##  [8] "large"  "small"  "small"
bfactor <- factor(b)
bfactor
##  [1] medium medium medium small  large  medium small  large  small  small 
## Levels: large medium small
bordered <- ordered(bfactor, levels=c("small", "medium", "large"))
bordered
##  [1] medium medium medium small  large  medium small  large  small  small 
## Levels: small < medium < large
as.numeric(bfactor)
##  [1] 2 2 2 3 1 2 3 1 3 3
bfactor
##  [1] medium medium medium small  large  medium small  large  small  small 
## Levels: large medium small
as.numeric(bordered)
##  [1] 2 2 2 1 3 2 1 3 1 1
bordered
##  [1] medium medium medium small  large  medium small  large  small  small 
## Levels: small < medium < large
x <- rnorm(10)
y <- x + as.numeric(bordered) + rnorm(10)
summary(lm(y ~ x + bordered))
## 
## Call:
## lm(formula = y ~ x + bordered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6688 -0.2544  0.0000  0.4504  1.2658 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.9621     0.3222   6.089 0.000892 ***
## x             1.2917     0.3470   3.722 0.009826 ** 
## bordered.L    1.2009     0.5945   2.020 0.089912 .  
## bordered.Q   -0.1416     0.6554  -0.216 0.836101    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9658 on 6 degrees of freedom
## Multiple R-squared:  0.7831, Adjusted R-squared:  0.6746 
## F-statistic: 7.221 on 3 and 6 DF,  p-value: 0.02042
summary(lm(y ~ x + bfactor))
## 
## Call:
## lm(formula = y ~ x + bfactor)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6688 -0.2544  0.0000  0.4504  1.2658 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     2.7534     0.6957   3.958  0.00747 **
## x               1.2917     0.3470   3.722  0.00983 **
## bfactormedium  -0.6757     0.9466  -0.714  0.50215   
## bfactorsmall   -1.6983     0.8408  -2.020  0.08991 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9658 on 6 degrees of freedom
## Multiple R-squared:  0.7831, Adjusted R-squared:  0.6746 
## F-statistic: 7.221 on 3 and 6 DF,  p-value: 0.02042
a <- factor(c("3", "11", "2", "23", "313", "2"))
a
## [1] 3   11  2   23  313 2  
## Levels: 11 2 23 3 313
a + 1
## Warning in Ops.factor(a, 1): '+' not meaningful for factors
## [1] NA NA NA NA NA NA
afnumeric <- as.numeric(a)
afnumeric
## [1] 4 1 2 3 5 2
anumeric <- as.numeric(as.character(a))
anumeric
## [1]   3  11   2  23 313   2
rm(list=ls())
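
The two regressions above differ only in how the factor is coded: unordered factors use treatment (dummy) contrasts with the first level as reference, while ordered factors use polynomial contrasts by default, which is where the `.L` and `.Q` terms come from. A small sketch of controlling levels, using a made-up vector `size`:

```r
size <- c("medium", "small", "large", "small")

# Give the levels explicitly instead of relying on alphabetical order
size_f <- factor(size, levels = c("small", "medium", "large"))
levels(size_f)
## [1] "small"  "medium" "large"

# In a model, the first level is the reference category; relevel() changes it
size_ref <- relevel(size_f, ref = "large")
levels(size_ref)
## [1] "large"  "small"  "medium"

# Ordered factors also support comparisons between levels
size_o <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
size_o < "large"
## [1]  TRUE  TRUE FALSE  TRUE

# The contrast matrices used by lm() can be inspected directly
contrasts(size_f)   # treatment contrasts
contrasts(size_o)   # polynomial contrasts (.L, .Q)
```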

Dates

  • Dates are stored internally as numbers (days since the origin date)
  • The origin date: January 1, 1970
  • Different date formats can be used
date1 <- as.Date("02/28/2016", format = "%m/%d/%Y")
date2 <- as.Date("28 February 10", format = "%d %B %y")
date3 <- as.Date("02/30/2016", format = "%m/%d/%Y")
date1
## [1] "2016-02-28"
date2
## [1] "2010-02-28"
date3
## [1] NA
date1 - date2
## Time difference of 2191 days
weekdays(date2)
## [1] "Sunday"
as.numeric(date2)
## [1] 14668
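
Because dates are stored as day counts, ordinary arithmetic works on them. A brief sketch with numeric formats only; note that `weekdays()` and the `%B` month name depend on the locale, so their output may differ from the English shown above.

```r
d <- as.Date("2016-02-28")   # ISO format (YYYY-MM-DD) needs no format string

# Adding a number adds days; 2016 is a leap year
d + 1
## [1] "2016-02-29"
d + 2
## [1] "2016-03-01"

# Dates are stored as days since 1970-01-01
as.numeric(as.Date("1970-01-10"))
## [1] 9
as.Date(14668, origin = "1970-01-01")
## [1] "2010-02-28"

# format() turns a date back into text; seq() builds regular sequences
format(d, "%d.%m.%Y")
## [1] "28.02.2016"
seq(d, by = "week", length.out = 3)
## [1] "2016-02-28" "2016-03-06" "2016-03-13"
```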

Data structures

  • Vector
    • Even a single value is a vector of length one
  • Matrix, array
  • Data frame
  • Data table (from the data.table package)
  • Lists
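
A minimal sketch of the base R structures listed above, with made-up values. Data tables are left out because they require the data.table package.

```r
# Vector: values of one type
v <- c(10, 20, 30)
length(v)
## [1] 3

# Matrix: a vector with two dimensions (filled column by column)
m <- matrix(1:6, nrow = 2)
dim(m)
## [1] 2 3
m[2, 3]
## [1] 6

# List: elements may have different types and lengths
l <- list(name = "R", versions = c(1, 2, 3), free = TRUE)
l$name
## [1] "R"

# Data frame: a list of equal-length columns, one per variable
df <- data.frame(id = 1:3, size = c("small", "large", "medium"),
                 stringsAsFactors = FALSE)
nrow(df)
## [1] 3
df$size[2]
## [1] "large"
```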