About R – Still Pond Cytomics

What is R?

R calls itself a “Statistical Programming Environment”. It is a GNU open-source offshoot of the S language, which was developed in the 1970s at Bell Labs by John Chambers and others. It consists of two big pieces: the language and the environment. The environment provides the underlying infrastructure for data handling and storage, definitions for both simple and complex data structures, operators for conducting calculations, and a well-developed graphical subsystem that supports the creation of publication-quality figures. The language is nearly identical to S. Its syntax allows the expression of most programming ideas, such as assignment, conditional processing, loops, branches and recursion.
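To make those language features concrete, here is a minimal sketch in base R. The variable and function names (`counts`, `fact`) and the numbers are invented for illustration and do not come from any real data set.

```r
# Assignment: store a small vector of made-up event counts.
counts <- c(12, 7, 0, 25)

# A loop with conditional processing (branching) on each element.
for (n in counts) {
  if (n == 0) {
    message("empty sample")
  } else {
    message("sample with ", n, " events")
  }
}

# Recursion: a function that calls itself to compute a factorial.
fact <- function(n) {
  if (n <= 1) {
    return(1)
  }
  n * fact(n - 1)
}

fact(5)  # returns 120
```

In everyday R code, loops like this are often replaced by vectorized operations, but the explicit form shows the building blocks most clearly.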

R is supported by a vibrant community. There are about two dozen core contributors, and scores more people who have donated code, fixed bugs and provided documentation. There is an annual international meeting (this year it was in Denmark).

Why R?

A key reason that R has emerged as a very powerful platform for developing algorithms for flow cytometry is the idea of the package. Packages are coherent collections of software that address a fairly specific problem. Specifications for writing packages are extremely well defined, and include documentation as a core element. Packages are contributed freely by their authors (generally under a liberal open source license), and peer-reviewed before inclusion in either R or Bioconductor. Thus, their quality is on average considerably higher than that of much free, open-source software at large (you’ll occasionally come across exceptions to this rule). Packages like flowCore enable you to address flow cytometry-specific data analysis, and once you’ve boiled your data down to a set of “features”, you can proceed to statistical analysis, empirical modeling, visualization and so on, using a vast array of other packages developed for those purposes. Source code is generally distributed as part of the package. If necessary, this allows you to determine, with no ambiguity, what the code is actually doing.
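As a small illustration of that last point, the sketch below inspects the source of `sd`, the standard deviation function in the `stats` package that ships with R. It was chosen only because it is available in every R installation; the same approach applies to functions from contributed packages that are written in R.

```r
# Printing a function object, rather than calling it, shows its definition.
print(sd)

# The definition can also be captured as text and searched.
src <- deparse(sd)
uses_var <- any(grepl("var", src, fixed = TRUE))
uses_var  # TRUE if the definition refers to var()
```

Functions implemented in compiled code show only a thin R wrapper this way. For those, you would need to read the source files distributed with the package.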

What isn’t R?

R is not a point-and-click application. For that matter, neither is Git, nor any of the other tools that we are using. So the entry barrier to R is higher than for much of the software used for flow cytometry data analysis (like FCSExpress, FlowJo, Kaluza, and so forth). R is not intuitive. Fortunately we have RStudio, which is an Integrated Development Environment (IDE) for R. RStudio makes the process of writing and running R code a whole lot easier than it used to be. It does syntax highlighting to make code easy to read. It does autocomplete and in-line completion suggestions. It presents documentation in a nicely formatted way, with search facilities to help you find the documentation you’re looking for. Thus, although it takes a bit of effort to climb R’s learning curve, R provides nearly unlimited flexibility, because if the methods you need aren’t yet available you can write them yourself (that’s what you’re here to learn). It also allows you to process large data collections automatically, run on large compute servers, transact with relational databases, and generate graphics the way you want them and not the way someone else has decided to provide.
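A minimal sketch of what processing a collection automatically can look like, using only base R. The file names, column names (`marker_a`, `marker_b`) and values are simulated for illustration. Real flow data would more likely be FCS files read with a package such as flowCore, but the pattern of listing files and applying one function to each stays the same.

```r
# Simulate a small folder of CSV files standing in for exported data.
set.seed(1)
data_dir <- file.path(tempdir(), "batch_demo")
dir.create(data_dir, showWarnings = FALSE)
for (i in 1:3) {
  sim <- data.frame(marker_a = rnorm(100, mean = i), marker_b = runif(100))
  write.csv(sim, file.path(data_dir, sprintf("sample_%d.csv", i)),
            row.names = FALSE)
}

# Find every matching file, however many there are.
files <- list.files(data_dir, pattern = "^sample_[0-9]+\\.csv$",
                    full.names = TRUE)

# One function describes what to do with a single file...
summarise_file <- function(path) {
  d <- read.csv(path)
  data.frame(file = basename(path),
             n = nrow(d),
             median_a = median(d$marker_a))
}

# ...and lapply() runs it over the whole collection.
results <- do.call(rbind, lapply(files, summarise_file))
print(results)
```

Because the per-file logic lives in a single function, the same script works unchanged whether the folder holds three files or three thousand.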