Advanced programming
In this section, we will introduce some advanced programming techniques including for loops and writing custom functions.
You will need to load the breast cancer data for use in this section. Clean the names using the clean_names() function from the {janitor} package, and save it as object mydf.
mydf <- readr::read_csv("~/mmedr/breastcancer.csv") |>
  janitor::clean_names()
Custom functions
We have previously discussed the role of functions in R, and have seen examples of built-in R functions, such as mean() and table().
But sometimes we’ll want to do something that isn’t included in a built-in R function, or that simplifies use of existing functions.
User-defined functions are created using the function() function.
Basic usage is:
function(arguments) expression
Where arguments are the arguments you supply to the function and expression is the expression you want to evaluate.
For more complicated procedures, you can wrap multiple expressions in curly brackets, and can also specify what value to return using the return() function:
function(arguments) {
  expression1
  expression2
  return(value)
}
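As a concrete illustration of the multi-expression form, here is a minimal sketch of a function that computes the spread of a numeric vector. The name range_width and its argument x are hypothetical and chosen only for this example.

```r
# Hypothetical example: width of the range of a numeric vector,
# ignoring missing values.
range_width <- function(x) {
  lo <- min(x, na.rm = TRUE)   # expression 1: smallest non-missing value
  hi <- max(x, na.rm = TRUE)   # expression 2: largest non-missing value
  return(hi - lo)              # value handed back to the caller
}

range_width(c(2, 7, NA, 4))
```

If return() is left out, R returns the value of the last expression evaluated in the function body, so the explicit call is optional here.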
For example, I always want to show NA values when I look at a contingency table, which means I have to type in the useNA = "ifany" argument every time I use the table() function, since the default in that function is to exclude missing values.
To streamline things, I can create a custom function that includes this option:
tabna <- function(x) table(x, useNA = 'ifany')
Now, if we want to get a frequency table of ER/PR+ status that shows how many missing values there are, instead of typing:
table(mydf$er_or_pr_pos, useNA = 'ifany')
0 1 <NA>
376 2537 87
I can type:
tabna(mydf$er_or_pr_pos)
x
0 1 <NA>
376 2537 87
Custom functions are particularly useful for long or complex procedures, but they are also handy for short procedures that will be repeated many times.
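To try the wrapper without the breast cancer data, the sketch below applies it to a small made-up vector (status is a hypothetical stand-in for a 0/1 variable with missing values) and compares it with the default behaviour of table().

```r
# Same definition as in the text
tabna <- function(x) table(x, useNA = "ifany")

# Made-up vector standing in for a 0/1 variable with missing values
status <- c(0, 1, 1, NA, 1)

table(status)   # default: the NA is silently dropped
tabna(status)   # the NA gets its own <NA> column
```

The default call counts four values, while the wrapper counts all five because the missing value is tabulated as well.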
Loops
Often we will want to repeat a set of operations several times, and we can do so using a loop.
There are three main types of loops in R:
- for loop
- while loop
- repeat loop
We will focus on the for loop today.
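For comparison only, here is a minimal sketch of the other two loop types, each counting from 1 to 3. The counter names i and j are arbitrary.

```r
# while loop: keeps running as long as the condition is TRUE
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}

# repeat loop: runs until break is called explicitly
j <- 1
repeat {
  print(j)
  j <- j + 1
  if (j > 3) break
}
```

Both need the counter to be updated by hand, whereas a for loop steps through its values automatically.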
Here is a basic example using the print() function to repeatedly print a value:
for (i in 1:5) {
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Here are the steps of the execution:
- The value of i is set to 1
- The value of i is printed to the console (first iteration complete)
- The value of i is set to 2 (the for loop loops back to the beginning)
- The value of i is printed to the console
And so on until we reach the last value of i, and the process is complete.
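In practice a loop usually stores a result at each step rather than printing it. The sketch below follows the same steps as above but saves the square of each element in a pre-allocated vector. The objects values and squares are made up for this illustration.

```r
values <- c(2, 4, 6)                 # made-up input
squares <- numeric(length(values))   # pre-allocate the result vector

# seq_along(values) generates the positions 1, 2, 3
for (i in seq_along(values)) {
  squares[i] <- values[i]^2          # store the result for position i
}

squares
```

Using i as a position index in this way is the same pattern the imputation loop later in this section relies on.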
Say you have a continuous variable in your dataset that contains missing values, and you want to do mean-value imputation. For mean-value imputation, you replace each missing value with the mean of the non-missing values. Let’s try to write a for loop to do this with the variable tumor_size_cm in mydf.
First, create a new variable to store the results so that we don’t overwrite the original variable.
mydf$tumor_size_imputed <- mydf$tumor_size_cm
Now we can use our newly created mean_no_na() function to do mean imputation for any missing values.
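The definition of mean_no_na() does not appear in this section. A minimal sketch, assuming it simply wraps mean() with na.rm = TRUE so that missing values are dropped before averaging:

```r
# Assumed definition (not shown in this section):
# the mean of x with missing values removed
mean_no_na <- function(x) mean(x, na.rm = TRUE)

mean_no_na(c(1, 2, NA, 6))
```

If the function was defined differently earlier, that definition should be used instead.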
for (i in 1:nrow(mydf)) {
  if (is.na(mydf$tumor_size_cm[i])) {
    mydf$tumor_size_imputed[i] <- mean_no_na(mydf$tumor_size_cm)
  } else {
    mydf$tumor_size_imputed[i] <- mydf$tumor_size_cm[i]
  }
}
And we can see that our new variable contains no missing values.
summary(mydf$tumor_size_imputed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.001745 1.722053 2.389083 2.389083 3.018589 5.696087
And the value for every originally missing value is the same, the mean of non-missing values:
summary(mydf$tumor_size_imputed[is.na(mydf$tumor_size_cm)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.389 2.389 2.389 2.389 2.389 2.389
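As a cross-check, the same mean-value imputation can be written without a loop by using a logical index. This is a sketch on a small made-up data frame (toy and size_mean are hypothetical names), not a replacement for the loop exercise above.

```r
# Made-up data standing in for mydf$tumor_size_cm
toy <- data.frame(tumor_size_cm = c(1.5, NA, 2.5, NA, 3.5))

# Compute the mean once, outside of any loop
size_mean <- mean(toy$tumor_size_cm, na.rm = TRUE)

# Copy the original variable, then fill only the missing positions
toy$tumor_size_imputed <- toy$tumor_size_cm
toy$tumor_size_imputed[is.na(toy$tumor_size_cm)] <- size_mean

toy$tumor_size_imputed
```

Computing the mean once and assigning it to all missing positions at the same time avoids recalculating it on every iteration.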