Making your workflow reproducible

UMass Chan R Café

Emily C. Zabor

2025-10-24

Hello

Who am I?

Associate Staff Biostatistician at the Cleveland Clinic in the Department of Quantitative Health Sciences and the Taussig Cancer Institute.

Applied cancer biostatistics and methods research in early phase oncology clinical trial design and methods for retrospective data analyses.

Check out my website for more.

Why am I here?

Good question.

This is not my area of expertise
But I have been doing data analysis projects in R for 15+ years
And I’ve learned a few things along the way
If you are an expert, chime in anytime!

What will I cover today?

How I try to make my project workflow reproducible, including:

{starter} to create standard project frameworks
Folder structure and naming
RStudio projects and {here} package for portability
Quarto for reproducible reporting
{renv} for reproducible environments

Context

Everything here evolved in the context of the work I do and how I do it.

1. Collaborate with doctors on clinical research projects: they send me data, I analyze it
2. Work independently as the only statistician/programmer for a given project
3. Patient data is sensitive
4. Points 2 and 3 mean nothing goes on GitHub
5. Point 2 also means that my interest in reproducibility is for future me and for sound science

The {starter} package

“Provides a toolkit for starting new projects”

Using {starter}: default settings

# install.packages("starter") 

starter::create_project(
  path = fs::path(tempdir(), "My Project Folder"),
  open = FALSE # don't open project in new RStudio session
)

✔ Using "Default Project Template" template

✔ Writing folder 'C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/My Project Folder'

✔ Writing files "README.md", ".gitignore", "My Project Folder.Rproj", and ".Rprofile"

✔ Initialising Git repo

✔ Initialising renv project

- Lockfile written to "C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/My Project Folder/renv.lock".
- renv infrastructure has been generated for project "C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/My Project Folder".

Resulting project structure

Custom {starter} templates

The default is a great start, but I want a bit more:

Shell code files
Include Word reference document for Quarto

See the {starter} website for details on creating custom templates.

The R script that created my custom template is in my personal R package on GitHub here.

Using {starter}: custom template

# devtools::install_github("zabore/ezfun") 

starter::create_project(
  path = fs::path(tempdir(), "example-custom-project"),
  template = ezfun::ez_analysis_template,
  open = FALSE
)

✔ Using "EZ Analysis Template" template

✔ Writing folder 'C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/example-custom-project'

✔ Creating 'C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/example-custom-project/code'

✔ Creating 'C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/example-custom-project/code/templates'

✔ Writing files "README.md", ".gitignore", "example-custom-project.Rproj", ".Rprofile", "code/example-custom-project-munge.R", "code/example-custom-project-report.qmd", and "code/templates/doc_template.docx"

✔ Initialising Git repo

✔ Initialising renv project

- Lockfile written to "C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/example-custom-project/renv.lock".
- renv infrastructure has been generated for project "C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/example-custom-project".

Resulting custom project structure

Structure inside the code folder

Munge file template

Used for data cleaning and pre-processing

Quarto file template

Used for analysis and text

Structure inside the templates folder

Note how this was referenced in the YAML of the Quarto report

See details on how to create your own reference document for Word output here.

Folder structure and naming

Find something that works for you and stick with it.

What I do as a collaborative biostatistician:

Store all project folders on the same drive, backed up by my organization
Each project gets its own folder
Name the folder as “PIName-brief-project-description”.
- For example, a project with Jane Smith about treatment for metastatic breast cancer might be “Smith-metastatic-breast-trt”
Initialize using {starter}
Also add a “data” folder
Project reports produced by Quarto saved in main project folder as, e.g., “Smith-metastatic-breast-trt-report-2025-10-18” for version control
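
The naming convention above can be sketched in a few lines of base R. This is only an illustration: the helper `project_folder_name()` is hypothetical (it is not part of {starter}), and `tempdir()` stands in for the backed-up drive. In practice {starter} creates the project folder, and only the "data" folder is added by hand.

```r
# Hypothetical helper: build "PIName-brief-project-description"
project_folder_name <- function(pi_name, description) {
  # Lower-case the description and turn runs of other characters into "-"
  slug <- gsub("[^a-z0-9]+", "-", tolower(description))
  # Drop any leading or trailing "-"
  slug <- gsub("^-|-$", "", slug)
  paste(pi_name, slug, sep = "-")
}

folder <- project_folder_name("Smith", "metastatic breast trt")

# tempdir() is a stand-in for the organization's backed-up drive
project_dir <- file.path(tempdir(), folder)

# Add the "data" folder inside the project folder
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
```

Keeping the name in one helper makes it harder for folder names to drift from the convention over time.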

RStudio projects

Benefits of working inside an RStudio project include:

Start a fresh R session every time the project is opened
The current working directory is set to the project directory
Previously open R scripts are restored at project startup
Other RStudio settings are restored
Multiple RStudio sessions can be open at one time, running independently in different RStudio projects

Creating RStudio projects

Automatically using the {starter} package
File menu in RStudio
Project menu in RStudio

Workflow with RStudio project

The {here} package

“Easy file referencing in project-oriented workflows”

What does it do?

Creates paths relative to the top-level directory.

# install.packages("here")

here::here()

[1] "D:/zabore.github.io"
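
To make the idea of "paths relative to the top-level directory" concrete, here is a rough base R imitation. It is a sketch, not the package's actual implementation: the helper `find_project_root()` is hypothetical, and it assumes the project root is marked only by an .Rproj file.

```r
# Hypothetical helper: walk up from `start` until a folder containing an
# .Rproj file is found. This only imitates the idea behind here::here().
find_project_root <- function(start = getwd()) {
  dir <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    if (length(list.files(dir, pattern = "\\.Rproj$")) > 0) {
      return(dir)
    }
    parent <- dirname(dir)
    if (identical(parent, dir)) {
      stop("No .Rproj file found at or above ", start)
    }
    dir <- parent
  }
}

# Throwaway folder standing in for a real project
project <- file.path(tempdir(), "demo-project")
dir.create(file.path(project, "code"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(project, "demo-project.Rproj"))

# Starting from the code subfolder, the root is still the project folder
root <- find_project_root(file.path(project, "code"))

# Paths are then built from the root, wherever the script itself lives
data_path <- file.path(root, "data", "mydata.csv")
```

Because the path is anchored to the project root rather than to a hard-coded drive letter or the current working directory, the same code keeps working when the project folder is moved or copied.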

How to use it: examples

Read in data

# install.packages("readr")
df <- readr::read_csv(here::here("data", "mydata.csv"))

Save files

# Open a JPEG graphics device, draw the histogram, then close the device
jpeg(filename = here::here("plots", "myhistogram.jpg"))
hist(rnorm(100))
dev.off()

Quarto reports

I started with R Markdown reports, then switched to Quarto.
Switching was very easy, and I still use a lot of R Markdown-style programming in my Quarto files.
Never again:
- hardcode a number
- have separate documents for text and tables
- manually create tables
- have difficulty updating results when data change
Easily mix code chunks with text
Report numbers in-line in a programmatic way
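
As a small sketch of reporting numbers programmatically, assume a hypothetical data frame `df` with made-up values. Each quantity is computed once in a code chunk, then referenced in the report text with inline R code rather than typed by hand.

```r
# Hypothetical analysis data; in a real report this would be the cleaned data
df <- data.frame(
  age = c(52, 61, 47, 70, 58),
  treated = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Compute every reported quantity once, in code
n_patients <- nrow(df)
median_age <- median(df$age)
pct_treated <- round(100 * mean(df$treated))

# In the .qmd text, refer to the objects with inline R code, for example:
#   A total of `r n_patients` patients were included (median age `r median_age`),
#   of whom `r pct_treated`% received treatment.
# The sprintf() call below builds the same sentence, just to show the result.
summary_sentence <- sprintf(
  "A total of %d patients were included (median age %g), of whom %g%% received treatment.",
  n_patients, median_age, pct_treated
)
```

When the data change, re-rendering updates every number in the text without any manual editing.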

Separate files for data preparation and data reporting

Recall my starter template created two shell documents:

R script where data are cleaned and coded and saved into a .rda file
Quarto file where clean data are read in, analyses done, results reported
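
The hand-off between the two files can be sketched as a save/load round trip. The file name, the toy data frame, and the use of `tempdir()` are assumptions for illustration; in a real project the path would be built with `here::here("data", ...)`.

```r
# Stand-in for the project's data folder
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
rda_path <- file.path(data_dir, "clean-data.rda")

# --- In the R script: clean the data, then save it -----------------
df <- data.frame(id = 1:3, arm = c("A", "B", "A"))  # hypothetical cleaned data
save(df, file = rda_path)

# --- In the Quarto file: load the saved object ---------------------
rm(df)
loaded_names <- load(rda_path)  # restores `df` under its original name
```

Because `load()` restores objects under the names they were saved with, the report can use `df` directly without repeating any cleaning code.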

What do I include?

I write my Quarto reports with four main sections:

Notes/questions: notes on things I did in the data cleaning process that I want to call attention to, e.g., how categories were combined, missing data to address, and data issues or inconsistencies
Background: a brief description of the problem or question being addressed by the project
Methods: a formal statistical methods section that can be copied and pasted directly into the eventual scientific publication
Results: mostly tables and figures, with some text interpretation mixed in

Quarto output options

html: probably the most popular, with many customization options
pdf: the trickiest to use, in my opinion, and requires a LaTeX installation
Word: unpopular, but my preference as it makes it easy to copy and paste entire tables and blocks of text from my report into the publication

Note that you can also make slides in Quarto, like these slides, but that is not our focus today

Components of a Quarto file

The YAML header
Code chunks
Markdown text

The YAML header

Code chunks

Markdown text

Rendering

This places the output file inside the same folder where the .qmd file is saved, in this case in the code folder
I always “Save As” to the main project folder with the date of the file creation for version control
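
The manual "Save As" step could also be scripted. The sketch below is one possible approach, not part of the workflow shown in these slides: it uses a throwaway folder and a placeholder file in place of a real rendered report, and the file names follow the conventions used earlier.

```r
# Throwaway project standing in for a real one
project_dir <- file.path(tempdir(), "Smith-breast-radiation")
dir.create(file.path(project_dir, "code"), recursive = TRUE, showWarnings = FALSE)

# Pretend this is the file Quarto just rendered inside the code folder
rendered <- file.path(project_dir, "code", "Smith-breast-radiation-report.docx")
writeLines("placeholder", rendered)

# Build a dated file name such as "...-report-2025-10-18.docx"
dated_name <- sprintf(
  "Smith-breast-radiation-report-%s.docx",
  format(Sys.Date(), "%Y-%m-%d")
)
dated_copy <- file.path(project_dir, dated_name)

# Copy to the main project folder; never overwrite an existing dated report
copied <- file.copy(rendered, dated_copy, overwrite = FALSE)
```

Setting `overwrite = FALSE` means a report already saved under today's date is left untouched, so `copied` is `FALSE` in that case.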

The {renv} package

“create reproducible environments for your R projects”

Initialize the project

First run renv::init() to initialize a new library. This was done for us with starter::create_project().

Other {renv} functions

install() to install packages from CRAN, GitHub, or Bioconductor
update() gets the latest versions of all dependencies

For collaboration with others:

snapshot() adds metadata about currently used packages to the lockfile
restore() uses metadata from the lockfile to install exactly the same version of every package
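
A typical sequence of these calls might look like the sketch below. The wrapper function `update_project_library()` is hypothetical and exists only so that nothing is installed until it is called from the console inside an {renv} project.

```r
# Hypothetical wrapper: nothing runs until the function is called
# from the console inside an renv-enabled project.
update_project_library <- function(packages) {
  renv::install(packages)  # install into the project library
  renv::snapshot()         # update renv.lock
  invisible(packages)
}

# Example call (not run here):
# update_project_library(c("dplyr", "readr"))

# Later, or on another machine, rebuild the library from renv.lock:
# renv::restore()
```

Even when working alone, taking a snapshot after installing packages keeps the lockfile in step with what the analysis actually used.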

Put it all together

Case study: I am starting a new project with Dr. Jane Smith about the association between radiation treatment and overall survival in women with breast cancer. Dr. Smith has emailed me an Excel dataset to analyze for the project, and we have discussed the analysis plan.

Run `starter::create_project()`

starter::create_project(
  path = fs::path("G:/StatTeam/zabore/Smith-breast-radiation"),
  template = ezfun::ez_analysis_template,
  open = FALSE
)

✔ Using "EZ Analysis Template" template

✔ Writing files

✔ Initialising renv project

- Lockfile written to "G:/StatTeam/zabore/Smith-breast-radiation/renv.lock".
- renv infrastructure has been generated for project "G:/StatTeam/zabore/Smith-breast-radiation".

Notes:

“G:/StatTeam/zabore” is my organization’s preferred and backed-up drive on my computer
A new project folder named “Smith-breast-radiation” will be created and populated

Add a data folder and save the data there

The investigator sent me an Excel file, which I save as is
I also “Save As” a csv, which I’ll import to R for data cleaning

Open the RStudio project

Once in the RStudio project:

Open the two shell files (R script and qmd)
Start to install needed packages using renv::install()

Read in, clean up, and save the data

This is one place where the {here} package will come in handy

library(dplyr)
library(readr)

# Import data ------------------------------------
df0 <-
  read_csv(
    file = here::here("data", "breastcancer.csv")
  )  |>
  janitor::clean_names() |>
  janitor::remove_empty()

# Clean data -------------------------------------
df <- 
  df0 |> 
  mutate(
    # Insert data cleaning steps here
  ) |>
  labelled::set_variable_labels(
    # Insert variable labels here
  )

# Save the data ----------------------------------
save(
  df,
  file = here::here("data", "smith-breast-rt-data.rda"))

Analyze and report in Quarto

View the resulting report

Save the report with a new name

Thank you

Connect with me:

zabore2@ccf.org

https://www.emilyzabor.com/

https://github.com/zabore

https://www.linkedin.com/in/emily-zabor-59b902b7/

https://bsky.app/profile/zabore.bsky.social/
