R (programming language)

Paradigms	Multi-paradigm: procedural, object-oriented, functional, reflective, imperative, array
Designed by	Ross Ihaka and Robert Gentleman
Developer	R Core Team
First appeared	August 1993
Stable release	4.3.2 / 31 October 2023
Typing discipline	Dynamic
Platform	arm64 and x86-64
License	GNU GPL v2
Filename extensions	.r; .rdata; .rds; .rda
Website	www.r-project.org
Influenced by	Lisp; S; Scheme
Influenced	Julia

R is a programming language for statistical computing and graphics. Created by statisticians Ross Ihaka and Robert Gentleman, R is used for data mining, bioinformatics, and data analysis.

The core R language is augmented by a large number of extension packages containing reusable code and documentation.

R is free and open-source software. It is part of the GNU Project and is available under the GNU General Public License. It is written primarily in C, Fortran, and R itself. Precompiled executables are provided for various operating systems. R is supported by the R Core Team and the R Foundation for Statistical Computing.

R has a native command-line interface. In addition, multiple third-party graphical user interfaces are available, such as RStudio, an integrated development environment, and Jupyter, a notebook interface.

History


R was started by professors Ross Ihaka and Robert Gentleman as a programming language to teach introductory statistics at the University of Auckland. The language took heavy inspiration from the S programming language, with most S programs able to run unaltered in R, as well as from Scheme's lexical scoping, allowing for local variables.

The name of the language, R, reflects both its status as a successor to the S language and the shared first letter of its authors' names, Ross and Robert. Ihaka and Gentleman first shared binaries of R on the data archive StatLib and the s-news mailing list in August 1993. In June 1995, statistician Martin Mächler convinced Ihaka and Gentleman to make R free and open-source under the GNU General Public License. Mailing lists for the R project began on 1 April 1997, preceding the release of version 0.50. R officially became a GNU project on 5 December 1997, when version 0.60 was released. The first official 1.0 version was released on 29 February 2000.

The Comprehensive R Archive Network (CRAN) was founded in 1997 by Kurt Hornik and Fritz Leisch to host R's source code, executable files, documentation, and user-created packages. Its name and scope mimic the Comprehensive TeX Archive Network and the Comprehensive Perl Archive Network. CRAN originally had three mirrors and 12 contributed packages. As of December 2022, it has 103 mirrors and 18,976 contributed packages.

The R Core Team was formed in 1997 to further develop the language. As of January 2022, it consists of John Chambers, Gentleman, Ihaka, and Mächler, plus statisticians Douglas Bates, Peter Dalgaard, Kurt Hornik, Michael Lawrence, Friedrich Leisch, Uwe Ligges, Thomas Lumley, Sebastian Meyer, Paul Murrell, Martyn Plummer, Brian Ripley, Deepayan Sarkar, Duncan Temple Lang, Luke Tierney, and Simon Urbanek, as well as computer scientist Tomas Kalibera. Stefano Iacus, Guido Masarotto, Heiner Schwarte, Seth Falcon, Martin Morgan, and Duncan Murdoch are former members. In April 2003, the R Foundation was founded as a non-profit organization to provide further support for the R project.

Features

Data structures

R's data structures include vectors, arrays, lists, data frames, and environments. Vectors are ordered collections of values and can be mapped to arrays of one or more dimensions in a column major order. That is, given an ordered collection of dimensions, one fills in values along the first dimension first, then fills in one-dimensional arrays across the second dimension, and so on.
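As a minimal sketch of the column-major rule described above (the variable names are illustrative only), the same vector can be mapped to a matrix and to a three-dimensional array, and individual elements inspected:

```r
# A vector of six values mapped to a 2 x 3 matrix: values fill down the
# first dimension (rows) before moving to the next column.
v <- 1:6
m <- matrix(v, nrow = 2)
m[2, 1]   # second value of v
m[1, 2]   # third value of v

# The same rule extends to higher dimensions: the first index varies
# fastest, then the second, then the third.
a <- array(1:24, dim = c(2, 3, 4))
a[2, 1, 1]   # 2
a[1, 2, 1]   # 3
a[1, 1, 2]   # 7
```

Here element (i, j, k) of the array holds the value at position i + 2(j - 1) + 6(k - 1) of the underlying vector.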

R supports array arithmetic and in this regard is like languages such as APL and MATLAB. The special case of an array with two dimensions is called a matrix. Lists serve as collections of objects that do not necessarily have the same data type. Data frames contain a list of vectors of the same length, plus a unique set of row names. R has no scalar data type. Instead, a scalar is represented as a length-one vector.
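The structures described above can be illustrated with a short sketch. All object names here are made up for the example.

```r
# Array arithmetic: operations apply element by element.
x <- c(1, 2, 3)
x * 2 + 1                      # 3 5 7

# A two-dimensional array is a matrix.
m <- matrix(1:4, nrow = 2)
is.matrix(m)                   # TRUE

# A list can hold objects of different types.
item <- list(name = "a", values = 1:3, flag = TRUE)
sapply(item, class)            # "character" "integer" "logical"

# A data frame is a list of equal-length vectors with row names.
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
nrow(df)                       # 3

# There is no scalar type: a single number is a vector of length one.
s <- 5
length(s)                      # 1
identical(s, c(5))             # TRUE
```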

Programming

R is an interpreted language; users can access it through a command-line interpreter. If a user types 1+1 at the R command prompt and presses enter, the computer replies with 2.

R supports procedural programming with functions and, for some functions, object-oriented programming with generic functions. Extending it is facilitated by its lexical scoping rules, which are derived from Scheme. R uses S syntax (not to be confused with S-expressions) to represent both data and code. R's extensible object system includes objects for (among others) regression models, time series, and geospatial coordinates. Advanced users can write C, C++, Java, .NET, or Python code to manipulate R objects directly.
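Lexical scoping means that a function can see, and keep, the variables of the environment in which it was defined. A minimal sketch, using the hypothetical helper make_counter:

```r
# make_counter() returns a function that remembers 'count', a variable
# local to the call of make_counter() that created it.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # update 'count' in the enclosing environment
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()
counter_a()   # 1
counter_a()   # 2
counter_b()   # 1, because each counter has its own 'count'
```

The variable count is not visible from the global environment. It exists only in the environment captured by each returned function.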

Functions are first-class objects and can be manipulated in the same way as data objects, facilitating meta-programming that allows multiple dispatch. Function arguments are passed by value, and are lazy—that is to say, they are only evaluated when they are used, not when the function is called. A generic function acts differently depending on the classes of the arguments passed to it. In other words, the generic function dispatches the method implementation specific to that object's class. For example, R has a generic print function that can print almost every class of object in R with print(objectname).
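Both behaviours can be sketched briefly. The function first_only, the generic describe, and the class "circle" below are invented for illustration and are not part of R.

```r
# Lazy evaluation: the second argument is never used, so the call to
# stop() inside it is never evaluated and no error is raised.
first_only <- function(a, b) a
first_only(1, stop("never evaluated"))   # 1

# A generic function dispatches on the class of its first argument.
describe <- function(obj, ...) UseMethod("describe")
describe.default <- function(obj, ...) "a generic object"
describe.circle <- function(obj, ...) paste("a circle of radius", obj$radius)

shape <- structure(list(radius = 2), class = "circle")
describe(shape)   # "a circle of radius 2"
describe(42)      # "a generic object"
```

This sketch uses the S3 system, in which dispatch is based on the first argument only. It is one of several object systems available in R.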

R and its libraries implement various statistical techniques, including linear, generalized linear and nonlinear modeling, classical statistical tests, spatial and time-series analysis, classification, clustering, and others. For computationally intensive tasks, C, C++, and Fortran code can be linked and called at run time. Another of R's strengths is static graphics; it can produce publication-quality graphs that include mathematical symbols.

Packages

An R package is a collection of functions, documentation, and data that extends R. For example, packages add output support for additional graphical devices, transformation features for importing and exporting data, and reporting tools such as RMarkdown, knitr, and Sweave. Easy package installation and use have contributed to the language's adoption in data science.

Multiple packages are included with the basic installation. Additional packages are available on different repositories: CRAN, Bioconductor, R-Forge, Omegahat, GitHub, and others.

The "Task Views" on the CRAN website list packages in fields including Finance, Genetics, High-Performance Computing, Machine Learning, Medical Imaging, Meta-Analysis, Social Sciences, and Spatial Statistics.

The Bioconductor project provides packages for genomic data analysis, including object-oriented data handling and analysis tools for data from Affymetrix, cDNA microarray, and next-generation high-throughput sequencing methods.

Packages add the capability to interface with standalone, interactive graphics.

The Tidyverse package is organized to have a common interface. Each function in the package is designed to work together with the other functions in the package.

A package needs to be installed only once. To install tidyverse:

> install.packages( "tidyverse" )

To load the functions, data, and documentation of a package into the current session, call the library() function. To load tidyverse:

> library( tidyverse )
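Because installation happens once but loading happens in every session, scripts sometimes check whether a package is present before loading it. The following is one possible sketch of that pattern. It uses the stats package as a stand-in, an assumption made so that the example runs without network access, since stats ships with R.

```r
# 'pkg' is a placeholder; replace "stats" with the package you need.
pkg <- "stats"

# requireNamespace() returns FALSE instead of raising an error when the
# package is missing, so installation is attempted only when needed.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that 'pkg' holds the name as a string.
library(pkg, character.only = TRUE)
```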

Interfaces

R comes installed with a command-line console. Various integrated development environments (IDEs) are available for installation. IDEs for R include R.app (OSX/macOS only), Rattle GUI, R Commander, RKWard, RStudio, and Tinn-R.

General purpose IDEs that support R include Eclipse via the StatET plugin and Visual Studio via R Tools for Visual Studio.

Editors that support R include Emacs, Vim via the Nvim-R plugin, Kate, LyX via Sweave, WinEdt, and Jupyter.

Scripting languages that support R include Python, Perl, Ruby, F#, and Julia.

General purpose programming languages that support R include Java via the Rserve socket server and .NET C#.

Statistical frameworks which use R in the background include Jamovi and JASP.

Implementations

The main R implementation is written primarily in C, Fortran, and R itself. Several other implementations are aimed at improving speed or increasing extensibility. A closely related implementation is pqR (pretty quick R) by Radford M. Neal with improved memory management and support for automatic multithreading. Renjin and FastR are Java implementations of R for use in a Java Virtual Machine. CXXR, rho, and Riposte are implementations of R in C++. Renjin, Riposte, and pqR attempt to improve performance by using multiple cores and deferred evaluation.

TIBCO, which previously sold the commercial implementation S-PLUS, built a runtime engine called TERR, which is part of Spotfire.

Microsoft R Open (MRO) was a fully compatible R distribution with modifications for multi-threaded computations. As of 30 June 2021, Microsoft started to phase out MRO in favor of the CRAN distribution.

Community

The R community hosts many conferences and in-person meetups. Some of these groups include:

UseR!: an annual international R user conference
Directions in Statistical Computing (DSC)
R-Ladies: an organization to promote gender diversity in the R community
SatRdays: R-focused conferences held on Saturdays
R Conference
Posit::conf (formerly known as Rstudio::conf)

The R Journal

The R Journal is an open access, refereed journal of the R project. It features short to medium-length articles on the use and development of R, including packages, programming tips, CRAN news, and foundation news.

Comparison with alternatives

SAS

In January 2009, the New York Times ran an article charting the growth of R, noting its extensibility with user-created packages as well as R's open-source nature in contrast to SAS. SAS supports Windows, UNIX, and z/OS. R has precompiled binaries for Windows, macOS, and Linux with the option to compile and install R from source code. SAS can only store data in rectangular data sets while R's more versatile data structures allow it to perform difficult analysis more flexibly. Completely integrating functions in SAS requires a developer's kit but, in R, user-defined functions are already on equal footing with provided functions. In a technical report authored by Patrick Burns in 2007, respondents found R more convenient for periodic reports but preferred SAS for big data problems.

Stata

Both Stata and R are designed to be easily extensible. Output in both is structured so that it can serve as input for further analysis. Both hold data in main memory, which improves performance but limits the size of the data they can handle. R is free software, while Stata is not.

Commercial support

Although R is an open-source project, some companies provide commercial support and extensions.

In 2007, Richard Schultz, Martin Schultz, Steve Weston, and Kirk Mettler founded Revolution Analytics to provide commercial support for Revolution R, their distribution of R, which includes components developed by the company. Major additional components include ParallelR, the R Productivity Environment IDE, RevoScaleR (for big data analysis), RevoDeployR (a web services framework), and the ability to read and write data in the SAS file format. Revolution Analytics offers an R distribution designed to comply with established IQ/OQ/PQ criteria, which enables clients in the pharmaceutical sector to validate their installation of Revolution R. In 2015, Microsoft Corporation acquired Revolution Analytics and integrated the R programming language into SQL Server, Power BI, Azure SQL Managed Instance, Azure Cortana Intelligence, Microsoft ML Server, and Visual Studio 2017.

In October 2011, Oracle announced the Big Data Appliance, which integrates R, Apache Hadoop, Oracle Linux, and a NoSQL database with Exadata hardware. As of 2012, Oracle R Enterprise became one of two components of the "Oracle Advanced Analytics Option" (alongside Oracle Data Mining).

IBM offers support for in-Hadoop execution of R, and provides a programming model for massively parallel in-database analytics in R.

TIBCO offers a runtime version of R as a part of Spotfire.

Mango Solutions offers a validation package for R, ValidR, to comply with drug approval agencies, such as the FDA. These agencies require the use of validated software, as attested by the vendor or sponsor.

Examples

Basic syntax

The following examples illustrate the basic syntax of the language and use of the command-line interface. (An expanded list of standard language features can be found in the R manual, "An Introduction to R".)

In R, the generally preferred assignment operator is an arrow made from two characters <-, although = can be used in some cases.

> x <- 1:6 # Create a numeric vector in the current environment
> y <- x^2 # Create vector based on the values in x.
> print(y) # Print the vector’s contents.
[1]  1  4  9 16 25 36
> z <- x + y # Create a new vector that is the sum of x and y
> z # Print the contents of z to the terminal.
[1]  2  6 12 20 30 42
> z_matrix <- matrix(z, nrow=3) # Create a new matrix that turns the vector z into a 3x2 matrix object
> z_matrix 
     [,1] [,2]
[1,]    2   20
[2,]    6   30
[3,]   12   42
> 2*t(z_matrix)-2 # Transpose the matrix, multiply every element by 2, subtract 2 from each element in the matrix, and return the results to the terminal.
     [,1] [,2] [,3]
[1,]    2   10   22
[2,]   38   58   82
> new_df <- data.frame(t(z_matrix), row.names=c('A','B')) # Create a new data.frame object that contains the data from a transposed z_matrix, with row names 'A' and 'B'
> names(new_df) <- c('X','Y','Z') # Set the column names of new_df as X, Y, and Z.
> print(new_df)  # Print the current results.
   X  Y  Z
A  2  6 12
B 20 30 42
> new_df$Z # Output the Z column
[1] 12 42
> all(new_df$Z == new_df['Z']) && all(new_df[3] == new_df$Z) # The data.frame column Z can be accessed using $Z, ['Z'], or [3] syntax and the values are the same. all() reduces each comparison to a single value, as && requires.
[1] TRUE
> attributes(new_df) # Print attributes information about the new_df object
$names
[1] "X" "Y" "Z"
$row.names
[1] "A" "B"
$class
[1] "data.frame"
> attributes(new_df)$row.names <- c('one','two') # Access and then change the row.names attribute; can also be done using rownames()
> new_df
     X  Y  Z
one  2  6 12
two 20 30 42

Structure of a function

One of R's strengths is the ease of creating new functions. Objects in the function body remain local to the function, and any data type may be returned. Example:

# Declare function "f" with parameters "x", "y"
# that returns a linear combination of x and y.
f <- function(x, y) {
  z <- 3 * x + 4 * y
  return(z) ## the return() function is optional here
}

> f(1, 2)
[1] 11
> f(c(1,2,3), c(5,3,4))
[1] 23 18 25
> f(1:3, 4)
[1] 19 22 25
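The last call above works because R recycles the shorter argument to match the longer one. As a small sketch (the name g is arbitrary), the same function can be written without return(), since a function returns the value of its last evaluated expression:

```r
# Equivalent to f above: the value of the last expression is returned.
g <- function(x, y) 3 * x + 4 * y

g(1:3, 4)          # 4 is recycled to 4 4 4, giving 19 22 25
g(1:4, c(1, 2))    # c(1, 2) is recycled to 1 2 1 2, giving 7 14 13 20
```

Recycling is silent when the longer length is a multiple of the shorter one. Otherwise R still computes a result but issues a warning.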

Modeling and plotting

The R language has built-in support for data modeling and graphics. The following example shows how R can easily generate and plot a linear model with residuals.

> x <- 1:6 # Create x and y values
> y <- x^2  
> model <- lm(y ~ x)  # Linear regression model y = A + B * x.
> summary(model)  # Display an in-depth summary of the model.
Call:
lm(formula = y ~ x)
Residuals:
      1       2       3       4       5       6
 3.3333 -0.6667 -2.6667 -2.6667 -0.6667  3.3333
Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -9.3333     2.8441  -3.282 0.030453 * 
x             7.0000     0.7303   9.585 0.000662 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.055 on 4 degrees of freedom
Multiple R-squared:  0.9583, Adjusted R-squared:  0.9478
F-statistic: 91.88 on 1 and 4 DF,  p-value: 0.000662
> par(mfrow = c(2, 2))  # Create a 2 by 2 layout for figures.
> plot(model)  # Output diagnostic plots of the model.
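Beyond the printed summary, the fitted model object can be queried directly. The following sketch refits the same model and uses the standard accessor functions coef(), residuals(), and predict(). The new x value of 7 is an arbitrary choice for illustration.

```r
x <- 1:6
y <- x^2
model <- lm(y ~ x)

# Estimated intercept and slope, matching the Coefficients table above.
coef(model)

# Residuals as a named numeric vector, one per observation.
residuals(model)

# Prediction for a new observation: intercept + slope * 7.
new_point <- data.frame(x = 7)
predict(model, newdata = new_point)
```

The column name in newdata must match the predictor name used in the model formula, here x.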

Mandelbrot set

The following short R program calculates the Mandelbrot set through the first 20 iterations of the equation z = z² + c, plotted for different complex constants c. This example demonstrates:

use of community-developed external libraries (called packages), such as the caTools package
handling of complex numbers
multidimensional arrays of numbers used as basic data type, see variables C, Z, and X.

install.packages("caTools")  # install external package
library(caTools)             # external package providing write.gif function
jet.colors <- colorRampPalette(c("green", "pink", "#007FFF", "cyan", "#7FFF7F",
                                 "white", "#FF7F00", "red", "#7F0000"))
dx <- 1500                    # define width
dy <- 1400                    # define height
C  <- complex(real = rep(seq(-2.2, 1.0, length.out = dx), each = dy),
              imag = rep(seq(-1.2, 1.2, length.out = dy), dx))
C <- matrix(C, dy, dx)       # reshape as a dy-by-dx matrix of complex numbers
Z <- 0                       # initialize Z to zero
X <- array(0, c(dy, dx, 20)) # initialize output 3D array
for (k in 1:20) {            # loop with 20 iterations
  Z <- Z^2 + C               # the central difference equation
  X[, , k] <- exp(-abs(Z))   # capture results
}
write.gif(X, "Mandelbrot.gif", col = jet.colors, delay = 100)