Intro to programming in R Session 1

Part of QMUL RMC361-2013. © Robert Verity & Yannick Wurm

If you’re familiar with R, jump straight to “New Stuff”.

Table of Contents

Research Methods and Communication - R Session 1

Reminder from last year: R - recapping the basics

The environment

Variables

Good coding practice

Reminder from last year: Data Classes

Reminder from last year: Types of Objects

Scalars

Vectors

Matrices

Data frames and data classes

Reminder from last year: Subsetting

How to subset a data frame

New stuff: Regular Expressions

Basic find and replace

Using "fuzzy" searching

*Bonus Section: importing/exporting

Setting the working directory

Data output into various formats

Reminder from the past - recapping the basics  

The environment

To start with, let us reacquaint ourselves with the R environment. First of all, launch R (in the computer room, probably the “Bio” version if there are several to choose from; but any R install should be fine).

Once you have done this you should see a single window. This is called the console (see Figure 1).

Figure 1   Screenshot of R, showing the console and a new script.

The console is where you go to actually run lines of code. One way of doing this is by simply typing commands directly into the console. For example, have a go at typing the following...
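The original example is not reproduced in this text. As a stand-in (an assumption, not necessarily the authors' exact commands), any simple arithmetic will do. Type each line into the console and press return after it:

```r
# Illustrative console commands (hypothetical stand-ins for the missing example)
1 + 2     # addition
10 / 4    # division
2^3       # exponentiation
```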

Notice that as soon as you press return the command is processed, and the console spits an output back at you. By typing directly into the console in this way you can write code that takes effect immediately. This method is sometimes referred to as programming "on the fly".

The other major way in which you will program in R is by writing code in a separate file called a script. Create a new script by going to File > New script. An empty window pops up, in which you can also type code. Try typing 1+2 in this window and pressing return. Notice that no output is produced - the cursor just moves to the next line. This is because a script is fundamentally different from the console, and in fact works just like any other text editor. In order to run the code that we have written in a script we need to select the line(s) that we want to run and go to Edit > Run line or selection. This copies the selected line(s) over to the console and evaluates them in the order they are written. In this way we can create a long sequence of commands (i.e. a program) in a way that would not be possible by working directly in the console. In general you should work mainly with scripts, and only use the console for small tasks, like checking your script is working as planned.


Tip - to run an entire script, first select all with the keyboard shortcut Ctrl+a, then run selected lines with the shortcut Ctrl+r (Windows only).


Although it is possible to save some of the code that you have evaluated in the console (through the File>Save Workspace option), this method has proven to be slightly temperamental on Queen Mary computers in the past! On the other hand, scripts are nothing more than simple text files, and are therefore very unlikely to cause problems. For this reason we strongly suggest that you write anything you want to keep in scripts, and that you save these scripts in your home directory or on a USB stick.

        It is also worth keeping in mind that the R help is extremely good. Just put a question mark before any function that you do not understand (e.g. ?length) to bring up the help file for that function. There are also many R websites and forums that you may find useful. The best way to learn any programming language is to fiddle - so please fiddle away with these tools to guide you!

Variables

The simple lines of code described above use R as little more than a pocket calculator. We can produce much more powerful programs by using variables. A variable is a symbolic way of storing a particular set of values and/or characters. For example, try typing x<-5 into the console and hitting return. Once we have typed this command the symbol x becomes essentially equivalent to the number 5 from R's point of view. We can check this by typing x into the console and hitting return. R outputs the number 5, exactly as it would do if we typed the number 5 and hit return. Similarly, try typing 3*x^2. We get the answer 75, which is exactly what we would obtain by typing 3*5^2 (the power operation is executed before the multiplication). We can say that x has been assigned the number 5, and the <- operation is called the assignment operator.


Note - the = symbol can also be used for assignment instead of <-, but this is frowned upon by most R users (it makes your code less readable). Thus please try to use <- . When in doubt about how to write something, check Google’s R style guide; it provides standard guidelines which most R users follow.


So, what would happen if we typed y<-3*x^2? We have already seen that R sees the variable x as equivalent to the number 5, meaning this must be the same as typing y<-3*5^2. It follows that the command y<-3*x^2 must assign the value 75 to the variable y. In this way it is possible to create new variables out of old variables, and thereby build up the complexity of a program.
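The assignments described above can be collected into a short sketch. It uses only the commands already mentioned in the text; the comments state what each line should produce.

```r
x <- 5          # assign the value 5 to the variable x
x               # evaluating x on its own prints 5
3 * x^2         # the power is applied before the multiplication, giving 75
y <- 3 * x^2    # build a new variable out of an existing one
y               # y now holds 75
```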

It is worth pointing out that you don't have to give variables abstract mathematical names such as x and y, and in fact most of the time you are better off not doing so. At the other end of the spectrum, it is not a good idea to have variables like joebloggsresearchproject2013variable1. But in general naming variables is not rocket science! Just remember that variable names should be chosen so that if you return to a script after a number of months (or even years), the script will still make sense.


Note - some variable names are not allowed. Typing ?make.names in the console brings up a help file describing the important variable name restrictions.


Good coding practice

Finally, remember to keep your code clean and tidy by making use of comments and white space.  Copy the following code into a new script and run it:

# Copy from this line...

# Define first variable

var1 <- 5

# Define second variable

var2 <- 12

# Multiply them together

var1*var2

# ...to this line

Look at the output produced in the console. Notice that R has simply skipped over all lines starting with the # symbol. This symbol starts a comment, meaning you can type whatever you like after this symbol and it won't be read by the program. This allows you to annotate your code, i.e. to write helpful notes in between all the raw program code. Notice also that white space such as empty lines, spaces and indentations (obtained by pressing the Tab key) are invisible to R. Making good use of white space helps keep your code readable.

The following two examples make this point clear. Both programs do exactly the same thing, but one will make sense one year from now, and the other will not!

Example 1

blah <-0.01

blah2<-0.005

blah3<- 0.0036

bigblah<- blah+ blah2+ blah3

bobob<-100*exp(bigblah*10)

bobob

Example 2

#--------------------------------

# Program:        PopSize.R

# Author:        Bob Verity

# Date:        01/10/2013

# Purpose:

# Works out the size of a population under a simple model of exponential growth. The growth rate is assumed to be equal to the sum of the nutrient content (% sugars), the temperature (centigrade above 20), and the humidity (%). Other parameters include the starting population size (number of individuals) and the time allowed to grow (hours).

#--------------------------------

# Define input parameters

nutrients <- 0.01

temp      <- 0.005

humidity  <- 0.0036

starting.size <- 100

grow.time     <- 10

# Calculate total growth rate

growth.rate <- nutrients+temp+humidity

# Calculate population size at the end of the time period

end.size <- starting.size*exp(growth.rate*grow.time)

# Return end population size in the console

end.size

#--------------------------------

Q2. Write your own well-annotated and fully functional script for calculating the volume of a rectangular room with the following dimensions:

length: 5m

width: 4m

height: 3m

The layout and design of the program are much more important than the calculations here!

Reminder from last year: Data Classes

One thing that it is important to keep in mind is the class of your data. In simple terms, the class of your data tells you whether R interprets the data as numbers, letters, factors, logical values or a number of alternatives. You can find out the class of a variable using the function class(). For example, create the variable mydata<-1024. When you evaluate class(mydata) you will find that this data is of class "numeric". The majority of the data that you will be working with will be numeric, although there are some other classes that you will come across when analyzing real data.

        An important class of data that you might not be familiar with is logical data. Simply put, logical data can only take one of two possible values: TRUE or FALSE. There are a number of different ways of arriving at a logical variable. The most obvious is to simply define a variable as true, for example x<-TRUE (or just x<-T for short), or false, for example x<-FALSE (or just x<-F for short). Try creating a logical variable in this way, and look at the class of the variable - if you have done it correctly it should read "logical" (keep in mind that R is case-sensitive, i.e. x<-false will throw an error). However, this is not the way that logical variables tend to be used in programming. More often than not we arrive at a logical variable through a particular type of calculation, called a logical expression. You can think of a logical expression as a statement that we send to R, which may be a true statement, or it may be a lie! For example, try evaluating the code 5>4. This statement is clearly true, and accordingly R returns the value TRUE. The statement 5<4, on the other hand, returns the value FALSE. We can assign this logical value to a variable by using the logical expression as input to the variable. This sounds complicated, but in practice it is very simple. Take the example x<-(5>4). You can read this to mean "the variable x is assigned the outcome of the logical expression 5>4". In this particular example the variable x will be assigned the logical value TRUE. Notice that the logical expression itself has been placed within round brackets. This is not strictly required, but is good coding practice, as it avoids confusion between the assignment symbol and the logical expression.

        The main logical operators that you should be familiar with are the following:

  1. >, is greater than
  2. <, is less than
  3. >=, is greater than or equal to
  4. <=, is less than or equal to
  5. ==, is equal to
  6. !=, is not equal to

Have a play around with some of these operators in your own made-up logical expressions. Make sure you are comfortable assigning a logical value to a variable.

        We can create more sophisticated logical expressions using the "and" command, and the "or" command. The "and" command is written & (keyboard shortcut Shift+7), while the "or" command is written | (keyboard shortcut Shift+\ on a standard Windows keyboard). These operators can be placed between two or more logical expressions - exactly as you would do in a spoken sentence. For example, the expression (x>5 & x<=10) can be read "x is greater than five, and x is less than or equal to ten". Similarly, the expression (x<6 | x==12) can be read "x is less than six, or x is equal to twelve". By using a combination of these operators, while making good use of round brackets, it is possible to come up with some quite complex statements.
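As a minimal sketch of the logical expressions described above, the lines below assign the outcome of a comparison to a variable and then combine comparisons with & and |. The starting value 7 and the variable names are hypothetical, chosen only for illustration.

```r
x <- 7                                  # hypothetical starting value
is.big <- (x > 5)                       # TRUE, because 7 is greater than 5
class(is.big)                           # "logical"

in.range <- (x > 5 & x <= 10)           # TRUE: both conditions hold
small.or.twelve <- (x < 6 | x == 12)    # FALSE: neither condition holds
not.five <- (x != 5)                    # TRUE: 7 is not equal to 5
```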

As well as numerical and logical data you will come across simple written words, such as species names, or the names of drugs used in a particular trial. Words such as these are represented in quotation marks. For example, try typing mydata<-"Homo.sapiens" and hitting return. By placing this name in quotation marks you tell R that the input data consists of a string of characters. Without the quotation marks R would think that Homo.sapiens was another variable, and would then complain if it didn't find it. If you have input a string of characters correctly you will find that when you evaluate class(mydata) the data is of class "character".

        As you might expect, you cannot perform calculations on character data - there is no such thing as a word multiplied by 5! This is particularly important, as numbers can be stored as either characters or numeric values. For example, create the variable mydata<-"4". If you type class(mydata) you will see that this data is stored as text, rather than as a numerical value. Often this can be the cause of hours of headache!


Tip - if you have a number or a series of numbers that are stored in character form then you can convert them back to actual numbers using the function as.numeric().


Conversely, you can perform calculations on logical data. When being used in a standard calculation, the logical value TRUE is given a numeric value 1, while the logical value FALSE is given a numeric value 0. You can check this with a simple calculation like 5+TRUE, or 10*FALSE. This may be confusing at first, but in fact there are times when this feature is extremely useful.
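The class behaviour described above can be checked with a few lines. This sketch reuses the mydata example from the text; the name mynumber is introduced here purely for illustration.

```r
mydata <- "4"                    # a number stored as text
class(mydata)                    # "character"

mynumber <- as.numeric(mydata)   # convert the text back to an actual number
class(mynumber)                  # "numeric"
mynumber * 5                     # 20; the same calculation on mydata would fail

5 + TRUE                         # 6, because TRUE counts as 1
10 * FALSE                       # 0, because FALSE counts as 0
```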

Reminder from last year: Types of Objects

Scalars

There are many different types of objects available in R. So far we have only looked at objects that contain a single value, as in the variable x<-5. This is called a scalar object, meaning the variable x is assigned just a single value. Scalars are very useful, and it is rare for a program not to contain any scalar values, but it is unlikely that your actual data will be a single entry long! To deal with larger collections of data we need other types of objects.

Vectors

Another very commonly used type of object is the vector. You can create a vector in a number of different ways, including but not limited to the following:
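The original listing is not reproduced in this text. As a stand-in, here are some common ways of creating vectors. The values are hypothetical; vec1 is chosen so that its 4th element is 3, as the text states, and vec3 and vec4 are extra names introduced only for this sketch.

```r
# Hypothetical definitions, not necessarily those of the original handout
vec1 <- c(1, 1, 2, 3, 5, 8)                # combine explicit values with c()
vec2 <- 1:6                                # a run of consecutive integers
vec3 <- seq(from = 0, to = 1, by = 0.2)    # evenly spaced values from 0 to 1
vec4 <- rep(7, times = 6)                  # the same value repeated six times
```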

Hopefully you are familiar with at least some of these methods from last year. Evaluate each of these commands, and in each case have a look at the values that have been assigned to the variable. You will see that these variables do not contain single numbers, as in previous examples. Rather, each variable contains an array of numbers arranged one after the other. This is the defining feature of a vector. The individual values that make up a vector are called the elements of the vector. For example, the number 3 is the 4th element of the vector vec1. You can access the individual elements of a vector using square brackets. For example, the notation vec1[4] returns the 4th element of vec1 only.

Simple calculations can be performed on vectors, in which case the operation is applied to every element of the vector separately. For example, try typing vec1.squared<-vec1^2 in the console. You will find that vec1.squared contains values taken from vec1, where each element has been squared individually. You can also perform operations involving several vectors, as long as the vectors have compatible lengths. For example, try typing combined.vec1<-vec1*vec2. You will find that each of the elements of combined.vec1 is equal to the product of the corresponding elements in vec1 and vec2.
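A short sketch of indexing and element-wise arithmetic follows. The contents of vec1 and vec2 are assumptions (the original definitions are not shown in this text), picked so that the 4th element of vec1 is 3 and both vectors have the same length.

```r
vec1 <- c(1, 1, 2, 3, 5, 8)      # hypothetical values
vec2 <- 1:6                      # hypothetical values, same length as vec1

vec1[4]                          # the 4th element of vec1, which is 3

vec1.squared <- vec1^2           # each element squared: 1 1 4 9 25 64
combined.vec1 <- vec1 * vec2     # products of matching elements: 1 2 6 12 25 48
```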

Matrices

Moving forward, another major type of object in R is the matrix. A matrix is simply a rectangular grid of values. One of the simplest ways of producing a matrix is by combining several vectors through the functions rbind() and cbind().
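As an illustration of rbind() and cbind(), the sketch below stacks two short vectors first as rows and then as columns. The vector names and values are hypothetical.

```r
row.a <- c(1, 2, 3)               # hypothetical vectors of equal length
row.b <- c(4, 5, 6)

by.rows <- rbind(row.a, row.b)    # each vector becomes a row: 2 rows, 3 columns
by.cols <- cbind(row.a, row.b)    # each vector becomes a column: 3 rows, 2 columns

dim(by.rows)                      # 2 3
dim(by.cols)                      # 3 2
```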

You can also create matrices directly in a number of different ways. Have a look at the matrices produced by the following methods, and try to understand where the values come from in each case:

As with vectors, you can index the elements of a matrix using square brackets. Because a matrix has rows and columns, you now need to tell R which row(s) and which column(s) you are interested in. The notation takes the form matrix[row,column]. For example, try evaluating mat1[4,3]. You can look at multiple rows and/or columns by specifying a range of values. For example, try evaluating  mat1[1:4, 1:2].

        One of the most useful tricks that you can use here is to look at all of the rows or columns of a particular matrix by leaving a particular argument blank. For example, you could look at all the rows in the 2nd column of mat1 by evaluating mat1[ ,2]. Similarly, you could look at all of the columns in rows 1 to 3 by evaluating mat1[1:3, ] (it is even possible to leave out the blank, i.e. mat1[1:3,]). Try playing around with different ways of accessing the elements of a matrix until you are comfortable with them. This skill will be very important later on when we look at subsetting objects.
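The indexing forms above can be tried on a small matrix. The matrix demo.mat below is a hypothetical stand-in, not one of the matrices from the original handout.

```r
demo.mat <- matrix(1:12, nrow = 4, ncol = 3)   # hypothetical 4 x 3 matrix

demo.mat[4, 3]        # a single element: row 4, column 3
demo.mat[1:2, 2:3]    # rows 1 to 2 of columns 2 to 3
demo.mat[, 2]         # all rows of column 2
demo.mat[1:3, ]       # all columns of rows 1 to 3
```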

As with vectors, you can perform simple calculations on matrices, in which case the calculation applies to each element separately. For example, try evaluating (mat3+2)*2 and looking at the output. You can also combine the values in several matrices, as long as the dimensions of the matrices are compatible. For example, try evaluating (mat3*100)+mat4. Finally, you can create logical expressions that apply to an entire matrix. For example, try evaluating (mat1>10).
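A sketch of these element-wise matrix calculations is given below. The matrices mat.a and mat.b are hypothetical stand-ins with matching dimensions, since the original matrices are not defined in this text.

```r
mat.a <- matrix(1:6, nrow = 2, ncol = 3)    # hypothetical 2 x 3 matrix
mat.b <- matrix(10, nrow = 2, ncol = 3)     # hypothetical 2 x 3 matrix of tens

shifted <- (mat.a + 2) * 2          # add 2 to every element, then double it
combined <- (mat.a * 100) + mat.b   # element-by-element combination of two matrices
is.large <- (mat.a > 3)             # a logical matrix of TRUE/FALSE values
sum(is.large)                       # counts the TRUE values, here 3
```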

There are a number of useful functions that can be applied to matrices. Have a look at each of the following functions, and try to make sense of the output:
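The original list of functions is not reproduced in this text. The sketch below shows a few base R functions that are commonly applied to matrices; this selection is an assumption and may differ from the handout's own list. The matrix demo.mat is hypothetical.

```r
demo.mat <- matrix(1:6, nrow = 2, ncol = 3)   # hypothetical 2 x 3 matrix

dim(demo.mat)        # number of rows and columns: 2 3
nrow(demo.mat)       # number of rows: 2
ncol(demo.mat)       # number of columns: 3
t(demo.mat)          # transpose, giving a 3 x 2 matrix
rowSums(demo.mat)    # sum of each row: 9 12
colMeans(demo.mat)   # mean of each column: 1.5 3.5 5.5
```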

Keep in mind that if you ever need help in understanding a function, just bring up the help file for that function.

                mat1 <- matrix(1:50, nrow=10, ncol=5)

        What number would we expect to see when we evaluate mat1[1,2]? Why not a different number? 

Data frames and data classes

The final type of object that we will consider is the data frame. On the face of it these look very similar to matrices, but there are some important differences between them. The most important difference is that in a matrix all the elements need to be of the same class, while in a data frame different columns can hold different classes.

Fortunately for us there are already a load of example data frames loaded into R by default. For the next few tasks we will work with the data set Puromycin, which details the reaction velocity versus substrate concentration in an enzymatic reaction involving untreated cells or cells treated with Puromycin. First, load this data into your own personal variable by typing puromycin.data<-Puromycin. From now on we will work with the variable puromycin.data, rather than Puromycin (this ensures that you do not accidentally overwrite the data already in R). Have a look at your variable puromycin.data. Notice that the first two columns contain numbers, while the third column contains the words "treated" or "untreated".


Tip - one of the nice things about data frames is that columns often have names - in this case conc, rate, and state. This means that we can access a particular column using the dollar symbol, for instance puromycin.data$conc. This command is equivalent to selecting a column using square brackets, as in puromycin.data[,1].


We can see that different classes are stored within the same object in the data frame puromycin.data. This is only possible in a data frame, and would not be possible in a matrix. Clearly this is a great advantage in scientific research, in which we will often have data in a range of different classes.
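The points above can be checked directly on the built-in Puromycin data set. This sketch copies the data as described in the text, inspects the class of each column, and confirms that the two ways of selecting a column agree.

```r
puromycin.data <- Puromycin       # work on a copy of the built-in data set

head(puromycin.data)              # first few rows: conc, rate and state
sapply(puromycin.data, class)     # class of each column

# Selecting a column by name or by position gives the same result
identical(puromycin.data$conc, puromycin.data[, 1])
```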

Reminder from last year: Subsetting

Very often we will want to focus our attention on a particular part of a data frame - for example particular rows or columns. In such cases we ideally would like to cut out the part of the data frame that we are interested in, and leave the rest. This is called subsetting a data frame.

How to subset a data frame

We have already come across one way of subsetting through the use of square brackets. By typing, for example, puromycin.data[1:3, ] we can isolate certain rows of the data frame that we are interested in. However, this would become very tedious in certain situations. Imagine that we wanted to look at the rows for which the rate (second column) was less than 100. This would require us to list all such rows manually, as follows:

# List all rows in puromycin.data for which the rate is less than 100.

subsetdata <- puromycin.data[c(1,2,3,13,14,15,16,17), ]

This method becomes cumbersome very quickly, and is simply impossible for very large data sets. In fact, we could come up with a slightly better method using logical expressions (Brownie points to anyone who can figure out how!), although even this method can get tricky.

        A much better method is available through the function subset(). The first argument to this function is the name of the variable that you want to subset. For example, we could type subset(puromycin.data). Evaluating this expression on its own simply returns the whole data frame, as we have not told R which parts of the data we are interested in. This information goes in the second argument to the function, through a logical expression. For example, we could type subset(puromycin.data, subset=(rate<100)). This tells R that we are only interested in those rows for which the rate is less than 100. R does this by using the vector of logicals rate<100. Make sure you understand what this vector represents. Another example would be the code subset(puromycin.data, subset=(state=="treated")). This will return all rows for which the state is equal to "treated". Notice that the factor level must be written in quotation marks here, as R needs to know that it is looking for a particular set of characters, rather than a variable.
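The subset() calls discussed above can be put together as follows. The names low.rate, treated and low.treated are introduced here for illustration, and the last line is an extra example that combines two conditions with &.

```r
puromycin.data <- Puromycin    # copy of the built-in data set

# Rows for which the rate is less than 100
low.rate <- subset(puromycin.data, subset = (rate < 100))

# Rows for which the state is "treated"
treated <- subset(puromycin.data, subset = (state == "treated"))

# Conditions can be combined with & and |
low.treated <- subset(puromycin.data, subset = (rate < 100 & state == "treated"))

nrow(low.rate)                 # number of rows that satisfy rate < 100
```

The row names of low.rate should match the rows listed by hand in the square-bracket version.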

New stuff: Regular Expressions

So far we have assumed that it is only the format of our data that needs changing. Unfortunately, there are situations in which the actual names in our data file are incorrect, or inconsistent. For example, run the following line of code to import the reptile.data data frame:

reptile.data <- read.table("http://yannick.poulet.org/teaching/2013stats/reptile_data.txt",row.names=1)

This data frame details the genus and species names of 16 endangered reptiles, along with the date at which they were listed as endangered. You can load just the names into a separate variable by running the code reptile.names<-row.names(reptile.data).

        Have a close look at these names. Notice that each reptile has been given a unique identification number next to its name (don't look these numbers up - they don't mean anything)! Also, we can see that some names have been recorded incorrectly - the genus names for the Liopholis group have been recorded in lowercase, while we all know that genus names should be capitalized! All in all, this data appears quite "messy", and needs cleaning up.

        The tools that allow us to deal with this sort of problem fall under the heading regular expressions. These consist of a suite of tools that allow us to search for, locate, and replace characters or words within a data set. The really powerful thing about regular expressions is that we can do a "fuzzy" search, meaning the pattern we are searching for has some flexibility built into it.

Basic find and replace

To start with we will consider a simple case of finding a pattern within a vector of names. First of all we will search through the vector reptile.names to find the elements that contain the word "liopholis". The function that allows us to do this is grep(), which has two main arguments: pattern and x. The pattern is the actual word, or part of a word, that we are looking for. The argument x describes the variable that we are searching through. In our case we want to evaluate the following code:

# Search through reptile.names for the word "liopholis", and output positions

grep(pattern="liopholis", x=reptile.names)

The output of this code is a list of numbers. Each of these numbers describes the position of an element in the vector reptile.names that matches the pattern - in this case the 12th and 13th elements. Make sure you fully understand where these numbers came from!

        Sometimes it may be more useful to obtain the actual names within which the pattern was found, rather than a list of positions. We can do this by making use of the additional argument value=T (see the help file for the grep() function for a complete list of possible arguments). The new code reads:

# Search through reptile.names for the word "liopholis", and output names

grep(pattern="liopholis", x=reptile.names, value=T)

Now we find that the output contains the actual elements of the vector that match the pattern, rather than just a list of positions. These names should correspond exactly with the positions found in the previous example.

        Finally, we may want to find and replace the pattern. This can be done using the function gsub(). The function gsub() takes arguments pattern and x, just like the function grep(), but it also has an additional argument replacement. The argument replacement describes the new word, or words, that we want to insert in place of pattern. For example, in the reptile.data the word "liopholis" is a genus name, and so should be capitalized. Thus, we want to replace the word "liopholis" with "Liopholis", as follows:

# Search through reptile.names for the word "liopholis" and replace with the word "Liopholis"

reptile.names2 <- gsub(pattern="liopholis", x=reptile.names, replacement="Liopholis")

The output of this function is a new vector in which the desired replacement has been carried out. Notice that the code above stores this new vector in the variable reptile.names2.
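If the reptile data file cannot be downloaded, the same three steps can be tried on a small made-up vector. The vector demo.names below is hypothetical (its ID numbers and layout are assumptions, not the real data); only the grep() and gsub() calls mirror the text.

```r
# Hypothetical names in an assumed "ID:genus species" layout
demo.names <- c("101:Varanus komodoensis",
                "2045:liopholis guthega",
                "37:liopholis slateri",
                "9:Chelonia mydas")

# Positions of the elements that contain the pattern
grep(pattern = "liopholis", x = demo.names)

# The matching elements themselves
grep(pattern = "liopholis", x = demo.names, value = TRUE)

# Replace the pattern and store the result in a new vector
demo.names2 <- gsub(pattern = "liopholis", x = demo.names, replacement = "Liopholis")
demo.names2
```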

        Experiment with grep() and gsub() until you are confident at using them. Then answer the following questions:

        

Using "fuzzy" searching

One of the most powerful features of regular expressions is the ability to perform "fuzzy" searching. Simply put, by using special characters we can introduce some flexibility into the pattern that we are searching for. A complete list of these special characters can be obtained by typing ?regex. A slightly reduced list of special characters and their meanings is given below.

Special Character    Meaning
.                    Any character
?                    The preceding item is optional and will be matched at most once.
*                    The preceding item will be matched zero or more times.
+                    The preceding item will be matched one or more times.

These special characters can be used on their own, or in combination with one another. To help you out with understanding these symbols, here are a few examples:
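The original examples are not reproduced in this text. As a stand-in, the sketch below tests four patterns against a small hypothetical vector of words using grepl(), which returns TRUE or FALSE for each element. The words and patterns are made up for illustration.

```r
demo.words <- c("color", "colour", "colouur", "clr")   # hypothetical test words

grepl("colou?r", demo.words)   # "u" is optional:       TRUE  TRUE  FALSE FALSE
grepl("colou*r", demo.words)   # zero or more "u":      TRUE  TRUE  TRUE  FALSE
grepl("colou+r", demo.words)   # one or more "u":       FALSE TRUE  TRUE  FALSE
grepl("c.l", demo.words)       # "c", any char, "l":    TRUE  TRUE  TRUE  FALSE
```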

Some of these examples might seem very confusing at first, but if you learn what each special character means on its own and then go through the pattern one at a time you should find that it makes sense.

As an example of how "fuzzy" searching can be useful, we will now use these special characters to remove the ID tags from the reptile names. Notice that the ID numbers are of different lengths, but they are always separated with a colon from the part that we are interested in. Therefore, we can remove these characters by searching for zero or more copies of any character, followed by a colon, and replacing this pattern with an empty string. The single line of code that achieves this is as follows:

reptile.names3 <- gsub(pattern=".*:", x=reptile.names2, replacement="")

Have a look inside the variable reptile.names3. We have successfully isolated the genus and species names away from the pesky ID tags, even though the exact format of the tags may vary between different entries. Tricks like this can save us a great deal of time - especially when our data set is thousands of lines long. In fact, we have only skimmed the surface of what regular expressions can do - I encourage anyone who is interested to take a deeper look.
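The same pattern can be tried on a small made-up vector. The names and ID numbers in demo.names are hypothetical. The last two lines illustrate one thing to keep in mind: .* is greedy, so ".*:" removes everything up to the last colon in a string. The alternative pattern "^[^:]*:" is an addition not taken from the handout; it stops at the first colon.

```r
# Hypothetical names in an assumed "ID:genus species" layout
demo.names <- c("101:Varanus komodoensis",
                "2045:Liopholis guthega",
                "9:Chelonia mydas")

# Remove everything up to and including the colon
demo.clean <- gsub(pattern = ".*:", x = demo.names, replacement = "")
demo.clean

# With more than one colon, the greedy pattern removes up to the LAST colon
gsub(pattern = ".*:", x = "12:Genus species: note", replacement = "")

# A pattern that stops at the FIRST colon instead
sub(pattern = "^[^:]*:", x = "12:Genus species: note", replacement = "")
```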

*Bonus Section: importing/exporting

Data Input from files

Openly available data sets play an increasingly important role in research, so it is vital to know how to import data in various formats into your program. Choosing the right way of importing data will save you time when cleaning the data. To test your abilities, download river.csv and kaiser.xls and try loading them into R.

river.data  <- read.csv("river.csv")      # if you're in the right directory

river.data  <- read.csv(file.choose())    # to choose the file

library(gdata)                            # read.xls() comes from the gdata package, which must be installed

kaiser.data <- read.xls("kaiser.xls", sheet=1)

 

Setting the working directory

Setting the working directory is something you will do very often. It keeps your code and your data together, and lets R find files through short relative paths. For example, if a loop generates a series of graphs from data stored outside the current folder, setting the working directory once saves you from typing the full path every time.

# set the working directory

setwd("~/Documents/R")   # or use the equivalent menu options.

# get all the file names in the current directory

list.files(path=".")

# get the current working directory

getwd()
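A cautious way to experiment with these functions is to remember the current directory, switch to a temporary one, and switch back afterwards. This is only a sketch: it uses R's session temporary folder (tempdir()) rather than a real project folder, and the name old.dir is introduced for illustration.

```r
old.dir <- getwd()        # remember where we started

setwd(tempdir())          # move to R's temporary folder for this session
getwd()                   # confirm the new working directory
list.files(path = ".")    # list the files found there

setwd(old.dir)            # return to the original working directory
```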

 

Data output into various formats

Whether you are recording data or analysing it, it is important to know how to output data in a suitable form. Exporting data in various formats makes it easier to switch between data analysis software (e.g. from R to Excel and back).

# write a new subsetted data frame (create this using your previous skills) into a CSV file

write.csv(my.river.subset, file = "myRiverSubset.csv")

# import this into Excel

# save variables a, b and c into an .Rdata file

save(a, b, c, file="nodeProperty.Rdata")
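To check that exported data can be read back in, a round trip can be sketched as follows. The data frame demo.df, the variables a.val and b.val, and the file names are all hypothetical, and the files are written to R's temporary folder so that nothing in your own directories is touched.

```r
# Hypothetical data frame standing in for a real subset
demo.df <- data.frame(site = c("A", "B", "C"), depth = c(1.2, 3.4, 5.6))

# Write to a CSV file and read it back in
csv.path <- file.path(tempdir(), "demoSubset.csv")
write.csv(demo.df, file = csv.path, row.names = FALSE)   # leave out the row names column
reloaded <- read.csv(csv.path)

# Save two variables to an .Rdata file, remove them, then restore them
a.val <- 1
b.val <- "two"
rdata.path <- file.path(tempdir(), "demoObjects.Rdata")
save(a.val, b.val, file = rdata.path)
rm(a.val, b.val)
load(rdata.path)          # a.val and b.val exist again
```

Setting row.names = FALSE is a design choice: without it, write.csv() adds an extra first column holding the row names, which then shows up as a column when the file is read back.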
