Advanced programming

In this section, we will introduce some advanced programming techniques including for loops and writing custom functions.

You will need to load the breast cancer data for use in this section. Clean the names using the clean_names() function from the {janitor} package, and save it as object mydf.

mydf <- readr::read_csv("~/mmedr/breastcancer.csv") |> 
  janitor::clean_names()

Custom functions

We have previously discussed the role of functions in R, and have seen examples of built-in R functions, such as mean() and table().

But sometimes we’ll want to do something that isn’t included in a built-in R function, or that simplifies use of existing functions.

User-defined functions are created using the function() function.

Basic usage is:

function(arguments) expression

Where arguments are arguments you supply to the function and expression is the expression you want to evaluate.

For more complicated procedures, you can wrap multiple expressions in curly brackets, and can also specify what value to return using the return() function:

function(arguments) {
  expression1
  expression2
  return(value)
  }

For example, I always want to show NA values when I look at a contingency table, which means I have to type in the useNA = "ifany" arguement every time I use the table() function, since the default in that function is to exclude missing values.

To streamline things, I can create a custom function that includes this option:

tabna <- function(x) table(x, useNA = 'ifany')

Now, if we want to get a frequency table of ER/PR+ status that shows how many missing values there are, instead of typing:

table(mydf$er_or_pr_pos, useNA = 'ifany')


   0    1 <NA> 
 376 2537   87

I can type:

tabna(mydf$er_or_pr_pos)

x
   0    1 <NA> 
 376 2537   87

This gets particularly useful for long or complex procedures, but is also really useful for short procedures that will be repeated many times.

Practice problem

Try writing a custom function based on the mean() function but including the option to remove NAs from the calculation, and test it on the variable tumor_size_cm in mydf. Name your function mean_no_na as we will use it later.

Solution

mean_no_na <- function(x) mean(x, na.rm = TRUE)

mean_no_na(mydf$tumor_size_cm)

[1] 2.389084

Loops

Often we will want to repeat a set of operations several times, and we can do so using a loop.

There are three main types of loops in R:

for loop
while loop
repeat loop

We will focus on the for loop today.

Here is a basic example using the print() function to repeatedly print a value:

for (i in 1:5) {
  print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Here are the steps of the execution:

The value of i is set to 1
The value of i is printed to the console (first iteration complete)
The value of i is set to 2 (the for loop loops back to the beginning)
The value of i is printed to the console

And so on until we reach the last value of i, and the process is complete.

Say you have a continuous variable in your dataset that contains missing values, and you want to do mean-value imputation. For mean-value imputation, you simply impute the mean of the non-missing distribution for any missing values. Let’s try to write a for loop to do this with the variable tumor_size_cm in mydf.

First, create a new variable to store the results so that we don’t overwrite the original variable.

mydf$tumor_size_imputed <- mydf$tumor_size_cm

Now we can use our newly created mean_no_na() function to do mean imputation for any missing values.

for(i in 1:nrow(mydf)) {
  if (is.na(mydf$tumor_size_cm[i])) {
    mydf$tumor_size_imputed[i] <- mean_no_na(mydf$tumor_size_cm)
  } else {
    mydf$tumor_size_imputed[i] <- mydf$tumor_size_cm[i]
  }
  }

And we can see that our new variable contains no missing values.

summary(mydf$tumor_size_imputed)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.001745 1.722053 2.389083 2.389083 3.018589 5.696087

And the value for every originally missing value is the same, the mean of non-missing values:

summary(mydf$tumor_size_imputed[is.na(mydf$tumor_size_cm)])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.389   2.389   2.389   2.389   2.389   2.389

Practice problem

Try writing a for loop that creates a new variable that reassigns any tumors in grade I that are >2cm to grade II, and reassigns any tumors in grade II that are >5cm to grade III.

Note that you will need to consider how to handle missing values of tumor size

Solution

mydf$grade_new <- mydf$grade

for(i in 1:nrow(mydf)) {
  if(is.na(mydf$tumor_size_cm[i])) {
    mydf$grade_new[i] <- mydf$grade[i]
  }else if(mydf$grade[i] == "I" & mydf$tumor_size_cm[i] > 2) {
    mydf$grade_new[i] <- "II"
  } else if(mydf$grade[i] == "II" & mydf$tumor_size_cm[i] > 5) {
    mydf$grade_new[i] <- "III"
  }
}

table(mydf$grade, mydf$grade_new)

     
         I   II  III
  I    143  219    0
  II     0 1303    7
  III    0    0 1328

We see that we reassigned 219 patients from grade I to grade II, and 7 patients from grade II to grade III.