Decluttering R

Importance of decluttering the R environment

R is a versatile and powerful programming language that enables the user to perform many kinds of statistical and data analysis. As with any other tool, R’s potential largely depends on how well the user knows what it can do. Having used R extensively over a period of time, we have some useful tips that we think will benefit the beginner and the seasoned R user alike. Because R is open source, its adoption has grown rapidly, and many users without any programming or computer science background have been able to benefit from it. Being a newcomer to programming and scripting languages myself, I have fallen prey to several programming and scripting pitfalls. Over time, thanks to plenty of help from experienced colleagues and to the sea of information readily available on the internet, I have picked up several good programming habits that I wish I had known sooner.

The R environment is the workspace of an R session, held in memory, where the objects and functions created during the session are stored. Objects in the environment, such as datasets, can be tinkered with in the R console and can also be saved to the local file system.

The R environment is the end user’s friend: objects and functions can be safely stored there and used for analytic purposes. Data aggregation and modeling often require numerous variations of the same dataset, which means that the environment fills up with objects very quickly.

Here are some good practices we recommend while using RStudio that we think will boost the end user’s productivity and creativity:

1. Functions

Writing R functions has several benefits for the user, some of which are realized immediately and some over a longer period of time. An R function is a reusable piece of code that takes a set of user-defined parameters, performs its analysis, and returns the result in the desired form.

For example, a simple R function can be written to test the normality of a dataset and give a readable output such as

“Conducting Residual Tests

—————————————————————————

Shapiro Residuals Not Normal (p-value: 1.944596e-07)”
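
A minimal sketch of what such a function might look like is given here. It is not the exact function behind the output above: the name check_normality, the alpha argument, and the 0.05 cutoff are all assumptions made for illustration. It relies on shapiro.test from base R's stats package.

```r
# Hypothetical helper: the name, arguments, and 0.05 cutoff are illustrative.
check_normality <- function(x, alpha = 0.05) {
  result <- shapiro.test(x)  # Shapiro-Wilk test from the 'stats' package
  verdict <- if (result$p.value < alpha) "Not Normal" else "Normal"

  # Print a short, readable report
  cat("Conducting Residual Tests\n")
  cat(strrep("-", 40), "\n", sep = "")
  cat("Shapiro Residuals ", verdict,
      " (p-value: ", format(result$p.value), ")\n", sep = "")

  # Return the full test object invisibly so it can still be inspected
  invisible(result)
}

# Example call on a deliberately skewed vector
skewed_values <- exp(seq(0, 5, length.out = 50))
check_normality(skewed_values)
```

Returning the test object invisibly keeps the console output readable while still letting the caller store the result for later use.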

Further explanation of the capabilities and construction of functions is beyond the scope of this article, and we might write another article explaining the basics of writing one. However, functions are completely customizable, and they can output calculations, text, or dataset objects according to the user’s wishes. Functions not only make running models easier, they also make it faster and reduce the likelihood of human error, such as the end user forgetting to run a line of code when running a particular script manually.

Functions can be daisy-chained, as long as they are loaded in the environment, so the end user can write complex functions that make use of other functions. Functions can be sourced into scripts and embedded into other functions, so the code need not be duplicated.
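
As a rough illustration of sourcing and chaining, the sketch below writes a small helper to a temporary file, sources it, and then calls it from a second function. The function names pct_missing and summarise_missing and the toy data frame are made up for this example; in practice the sourced file would be one of your own scripts.

```r
# Write a one-line helper to a temporary file so the example is self-contained.
helper_path <- tempfile(fileext = ".R")
writeLines("pct_missing <- function(x) 100 * mean(is.na(x))", helper_path)

# source() runs the file and loads pct_missing into the environment
source(helper_path)

# A second function that builds on the sourced helper instead of copying its code
summarise_missing <- function(df) {
  vapply(df, pct_missing, numeric(1))
}

toy <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, NA))
missing_report <- summarise_missing(toy)
print(missing_report)
```

If the helper changes later, only the sourced file needs editing, and every function that calls it picks up the change.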

The long-term return of writing functions lies in a better understanding of coding and model building. The end user starts recognizing possibilities that lie within the dataset and the analysis, and can think her or his way through them and realize them in the form of functions. Writing code from the ground up is not required, as old functions can easily be tweaked to satisfy current requirements. Knowing the potential of functions, the end user learns to separate the analysis and the creativity of the process from the coding, and lets the functions do the heavy lifting.

2. Standardization in structure

It’s beneficial to write code in recognizable patterns, so that the end user subconsciously knows where specific elements of the code lie. For example, my scripts usually follow a structure of data exploration, data analysis, and outputs in the form of tests and graphs. Within these subsections, I have other recognizable patterns, such as the names function always preceding the aggregate function, which is usually followed by the merge function. In my data exploration section, I usually begin by importing the data, followed by a line that takes care of missing data, followed by a simple function that gives me a percentage breakdown of the variables in the dataset.

This structure largely depends on the end user’s preferences and the nature of the analysis. However, writing code this way builds a habit of writing standardized models that can be reused and retained for reference, and that don’t lose their relevance in the long term. If the end user archives a lot of code and models and regularly needs to refer to them, standardization helps in understanding old models.
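
To make the idea concrete, here is one possible skeleton for such a script. It is only a sketch: the built-in mtcars dataset stands in for imported data, the object names are invented, and the order of the steps is just one reasonable arrangement rather than a prescribed one. Comment lines ending in four or more dashes are shown as foldable sections in RStudio.

```r
# ---- 1. Data exploration ----
raw_cars <- mtcars                  # stand-in for an import step such as read.csv()
clean_cars <- na.omit(raw_cars)     # one simple way to take care of missing data
# Percentage breakdown of one variable
cyl_pct <- round(100 * prop.table(table(clean_cars$cyl)), 1)
print(cyl_pct)

# ---- 2. Data analysis ----
avg_mpg <- aggregate(mpg ~ cyl, data = clean_cars, FUN = mean)
names(avg_mpg) <- c("cyl", "avg_mpg")            # rename before merging
cars_with_avg <- merge(clean_cars, avg_mpg, by = "cyl")

# ---- 3. Outputs ----
fit <- lm(mpg ~ wt, data = cars_with_avg)
print(shapiro.test(residuals(fit)))              # test on the model residuals
```

Keeping the same three sections in every script makes it easy to jump straight to the part of an old analysis that needs revisiting.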

3. Nomenclature

When performing similar processes, where the end user follows a predetermined course while analyzing data, standardized object and function names prove immensely helpful. It is even more helpful if the names carry some meaning and context, so that they are self-explanatory.

Standardization and automation are two habits that prove really useful for repetitive processes, especially in creative fields like data science, because they keep routine work from interfering with the end user’s thought process.

Just as with standardization in structure, standardization in nomenclature proves really helpful, especially when dealing with lots of datasets, calculations, and functions. Names of datasets can carry meaning and can indicate versions easily through abbreviations and version numbers. Such names are easy to remember and can be reused.
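
One possible naming scheme is sketched below. The prefix, stage, and version pattern (sales_raw, sales_clean_v1, and so on) is purely an illustrative convention, and the tiny data frame is made up.

```r
# Pattern used here (an assumption, not a rule): <subject>_<stage>_<version>
sales_raw      <- data.frame(region = c("NE", "NE", "SW"),
                             amount = c(100, NA, 250))
sales_clean_v1 <- na.omit(sales_raw)                      # rows with NA dropped
sales_agg_v1   <- aggregate(amount ~ region,
                            data = sales_clean_v1, FUN = sum)

# A shared prefix makes related objects easy to list together
sales_objects <- ls(pattern = "^sales_")
print(sales_objects)
```

Because the names sort together, a quick look at the environment shows which stage and version of each dataset is currently loaded.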

4. Saving sessions

Another nifty feature of RStudio, a popular IDE for R, is its ability to save and load sessions. Saving a session saves the entire environment, including the datasets and functions in it. This lets the user save all of the work as it is on the local disk drive and retain it in case anything goes wrong or any of the calculations need to be revisited.

Saved sessions can be a good way to manage versions and iterations of models or functions. Coupled with standardized nomenclature, they can prove really helpful when the end user has several parallel analyses.
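
The same thing can be done from code with base R's save.image() and load(). The sketch below assumes it is run at the top level of a session, and the file name analysis_v1.RData and the object model_input are made up for the example.

```r
# Save to a temporary folder here; a project folder would be used in practice.
session_file <- file.path(tempdir(), "analysis_v1.RData")

model_input <- data.frame(x = 1:3, y = c(2.1, 3.9, 6.2))

# save.image() writes every object in the global environment to one file
save.image(file = session_file)

# Simulate a fresh start by removing the object, then restore the session
rm(model_input)
load(session_file)
print(exists("model_input"))
```

Giving each saved file a versioned name is one way to keep parallel analyses from overwriting each other.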

5. Clearing all of the environment except that one file

Many times, we might need to restart an analysis, and the best way to start over is by clearing the environment. We never know when objects left in the environment from the last session will creep into current calculations and affect the analysis.

However, when dealing with large datasets, clearing the whole environment means reloading the raw dataset each time, which causes additional delays. Wouldn’t it be helpful in this case to have a line of code that cleared all of the environment except for the raw datasets that have not changed over the course of the analysis?

rm(list = setdiff(ls(), c("NameOfDataset1", "NameOfDataset2")))

This line of code does just that. It saves me at least 20 minutes every day, and also saves me from frustration when I’m lost in thoughts of the analysis. I can include as many datasets in this line as I’d like, and the line clears the environment of all its contents except for the ones mentioned.
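
If the line is used often, it can be wrapped in a small helper. The sketch below is one way to do it: the name clear_except and its envir argument are assumptions for illustration, and the demonstration runs on a throwaway environment so that nothing in the real workspace is removed.

```r
# Hypothetical wrapper around rm(list = setdiff(ls(), ...))
clear_except <- function(keep, envir = globalenv()) {
  # Warn about names that were asked to be kept but do not exist
  not_found <- setdiff(keep, ls(envir))
  if (length(not_found) > 0) {
    warning("Not found in environment: ", paste(not_found, collapse = ", "))
  }
  # Remove everything that is not in the keep list
  rm(list = setdiff(ls(envir), keep), envir = envir)
}

# Demonstration on a scratch environment
demo_env <- new.env()
demo_env$raw_sales  <- data.frame(id = 1:3)
demo_env$tmp_model  <- "scratch result"
demo_env$tmp_counts <- c(1, 2)

clear_except("raw_sales", envir = demo_env)
print(ls(demo_env))
```

Note that ls() does not list objects whose names begin with a dot by default, so such hidden objects are left in place by both the one-liner and this wrapper.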

6. R Beeper

Some functions or models take a while to run. If the datasets are large or if the functions are complex and involve performing memory intensive tasks, the code can take anywhere from a few minutes to a few hours to execute and produce the desired output.

During such times, the end user might need to switch to other applications or move away from the computer. There’s a useful package called “beepr”, which plays a beeping sound when the code reaches it, thus notifying the user that the code has finished executing the assigned task. The end user need not stay on the R screen while processes are running, and can return to R promptly after the process has finished, without wasting any time.

The way to use beepr is to place “beep()” at the end of a script, or to embed it at the end of a function so that it runs once the work, such as a long loop, is over. The package is available on CRAN under the name beepr.
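
A small sketch of the second approach follows. The wrapper run_with_beep and the toy task are made up for illustration; the only beepr call used is beep(). The check with requireNamespace() is an added precaution so the code still runs, with a console message instead of a sound, on a machine where beepr is not installed.

```r
# Hypothetical wrapper: runs a task, then signals that it has finished
run_with_beep <- function(task) {
  result <- task()
  if (requireNamespace("beepr", quietly = TRUE)) {
    beepr::beep()                 # play the default notification sound
  } else {
    message("Task finished (install beepr from CRAN to hear a beep).")
  }
  result
}

# Toy stand-in for a long-running job
slow_sum <- function() {
  total <- 0
  for (i in 1:1000) total <- total + i
  total
}

answer <- run_with_beep(slow_sum)
print(answer)
```

In a plain script, the same effect comes from putting beepr::beep() on the last line.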

Apurv currently works as a statistical analyst for Five Element Analytics, an analytics firm based in NY. He graduated from Hofstra University in 2015 with an MBA in Business Analytics.