But I have been doing data analysis projects in R for 15+ years
And I’ve learned a few things along the way
If you are an expert, chime in anytime!
What will I cover today?
How I try to make my project workflow reproducible, including:
{starter} to create standard project frameworks
Folder structure and naming
RStudio projects and {here} package for portability
Quarto for reproducible reporting
{renv} for reproducible environments
Context
Everything here evolved in the context of the work I do and how I do it.
Collaborate with doctors on clinical research projects: they send me data, I analyze it
Work independently as the only statistician/programmer for a given project
Patient data is sensitive
Because I work alone and patient data is sensitive, nothing goes on GitHub
Working alone also means that my interest in reproducibility is for future me and for sound science
The {starter} package
“Provides a toolkit for starting new projects”
Using {starter}: default settings
# install.packages("starter")
starter::create_project(
  path = fs::path(tempdir(), "My Project Folder"),
  open = FALSE # don't open project in new RStudio session
)
✔ Writing files "README.md", ".gitignore", "My Project Folder.Rproj", and ".Rprofile"
✔ Initialising Git repo
✔ Initialising renv project
- Lockfile written to "C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/My Project Folder/renv.lock".
- renv infrastructure has been generated for project "C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/My Project Folder".
Resulting project structure
Custom {starter} templates
The default is a great start, but I want a bit more:
Shell code files
Include Word reference document for Quarto
See the {starter} website for details on creating custom templates.
The R script that created my custom template is in my personal R package on GitHub here.
- Lockfile written to "C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/example-custom-project/renv.lock".
- renv infrastructure has been generated for project "C:/Users/zabore2/AppData/Local/Temp/RtmpMTUbvb/example-custom-project".
Resulting custom project structure
Structure inside the code folder
Munge file template
Used for data cleaning and pre-processing
Quarto file template
Used for analysis and text
Structure inside the templates folder
Note how this was referenced in the YAML of the Quarto report
See details on how to create your own reference document for Word output here.
Folder structure and naming
Find something that works for you and stick with it.
What I do as a collaborative biostatistician:
Store all project folders on the same drive, backed up by my organization
Each project gets its own folder
Name the folder as “PIName-brief-project-description”.
For example, a project with Jane Smith about treatment for metastatic breast cancer might be “Smith-metastatic-breast-trt”
Initialize using {starter}
Also add a “data” folder
Project reports produced by Quarto are saved in the main project folder with the date in the file name, e.g., “Smith-metastatic-breast-trt-report-2025-10-18”, which serves as version control
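The naming convention above can be sketched in a few lines of base R. The helper make_project_name() is hypothetical (it is not part of {starter} or any package), and the project is created under tempdir() only so the example is safe to run anywhere.

```r
# Hypothetical helper: build a "PIName-brief-project-description" folder name
make_project_name <- function(pi_name, description) {
  # Lower-case the description and turn runs of non-alphanumerics into hyphens
  slug <- gsub("[^a-z0-9]+", "-", tolower(description))
  # Drop any leading or trailing hyphen
  slug <- gsub("^-|-$", "", slug)
  paste(pi_name, slug, sep = "-")
}

project_name <- make_project_name("Smith", "metastatic breast trt")
project_name

# Create the project folder with a "data" subfolder
# (tempdir() is used here for illustration; in practice this would be the backed-up drive)
project_dir <- file.path(tempdir(), project_name)
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
```

In a real project the folder itself would be created by {starter}; only the "data" folder would need to be added afterwards.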
RStudio projects
Benefits of working inside an RStudio project include:
Start a fresh R session every time the project is opened
The current working directory is set to the project directory
Previously open R scripts are restored at project startup
Other RStudio settings are restored
Multiple RStudio sessions can be open at one time, running independently in different RStudio projects
Creating RStudio projects
Automatically using the {starter} package
File menu in RStudio
Project menu in RStudio
RStudio project from the file menu
RStudio project from the project menu
Workflow with RStudio project
The {here} package
“Easy file referencing in project-oriented workflows”
What does it do?
Creates paths relative to the top-level directory.
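A minimal sketch of what this looks like in practice, assuming the {here} package is installed and the code is run from inside a project. The file name "breastcancer.csv" is only an example.

```r
# here::here() builds a path starting from the project root, which {here}
# locates by looking for markers such as an .Rproj file
data_path <- here::here("data", "breastcancer.csv")
data_path

# The same call works unchanged on any machine and from any subfolder of the
# project, unlike a hard-coded absolute path or a setwd() call
```

Because the path is assembled from its components, there is no need to worry about forward versus backward slashes across operating systems.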
Quarto for reproducible reporting
I started out writing RMarkdown reports, then switched to Quarto.
The switch was very easy, and I still use a lot of RMarkdown-style programming in my Quarto files.
Never again:
hardcode a number
have separate documents for text and tables
manually create tables
have difficulty updating results when data change
Easily mix code chunks with text
Report numbers in-line in a programmatic way
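As a rough illustration of in-line reporting, the sketch below computes two summary values in a code chunk and shows, in a comment, how they could be referenced from the report text. The data frame is a made-up toy example, not real patient data, and the object names are hypothetical.

```r
# Hypothetical toy data standing in for a cleaned analysis dataset
df <- data.frame(age = c(52, 61, 47, 70, 58))

# Compute the numbers once, in code
n_patients <- nrow(df)
median_age <- median(df$age)

# In the text of the .qmd file, refer to the objects instead of typing numbers:
#   We included `r n_patients` patients with a median age of `r median_age` years.
```

When the data change, re-rendering the report updates every number in the text automatically.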
Separate files for data preparation and data reporting
Recall my starter template created two shell documents:
R script where data are cleaned and coded and saved into a .rda file
Quarto file where clean data are read in, analyses done, results reported
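The hand-off between the two files can be sketched as a save() in the R script followed by a load() in the Quarto file. This example uses a made-up data frame and a file in tempdir() so it can be run anywhere; in a real project the path would point into the project's data folder.

```r
# --- In the data preparation script ---
# Hypothetical cleaned data
df <- data.frame(id = 1:3, treated = c(TRUE, FALSE, TRUE))
rda_path <- file.path(tempdir(), "example-data.rda")
save(df, file = rda_path)

# --- In the Quarto file ---
rm(df)         # simulate a fresh session in which df does not exist yet
load(rda_path) # restores the object under its original name, df
nrow(df)
```

Keeping the cleaning in its own script means the report never touches the raw data, only the saved, cleaned version.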
What do I include?
I write my Quarto reports with four main sections:
Notes/questions: notes on things I did in the data cleaning process that I want to call attention to, e.g., how categories were combined, missing data to address, data issues or inconsistencies
Background: a brief description of the problem or question being addressed by the project
Methods: A formal statistical methods section that can be copied and pasted directly into the eventual scientific publication
Results: Mostly tables and figures with some text interpretation mixed in.
Quarto output options
html: probably the most popular, with many customization options
pdf: the trickiest to use, in my opinion, since it requires a LaTeX installation
Word: unpopular, but my preference as it makes it easy to copy and paste entire tables and blocks of text from my report into the publication
Note that you can also make slides in Quarto, like these slides, but that is not our focus today
Components of a Quarto file
The YAML header
Code chunks
Markdown text
The YAML header
Code chunks
Markdown text
Rendering
This places the output file inside the same folder where the .qmd file is saved, in this case in the code folder
I always “Save As” to the main project folder with the date of the file creation for version control
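The manual “Save As” step could also be scripted. The sketch below is one possible approach using only base R, not the workflow described above; archive_report() is a hypothetical helper, and a placeholder file in tempdir() stands in for a rendered report.

```r
# Hypothetical helper: copy a rendered report to the main project folder,
# appending the date to the file name
archive_report <- function(rendered_file, project_dir, date = Sys.Date()) {
  ext <- tools::file_ext(rendered_file)
  stem <- tools::file_path_sans_ext(basename(rendered_file))
  dated_name <- paste0(stem, "-", format(date, "%Y-%m-%d"), ".", ext)
  dest <- file.path(project_dir, dated_name)
  # overwrite = FALSE so an existing dated version is never replaced
  file.copy(rendered_file, dest, overwrite = FALSE)
  dest
}

# Set up a stand-in project with a placeholder "rendered" report in code/
project_dir <- file.path(tempdir(), "Smith-metastatic-breast-trt")
dir.create(file.path(project_dir, "code"), recursive = TRUE, showWarnings = FALSE)
rendered <- file.path(project_dir, "code", "Smith-metastatic-breast-trt-report.docx")
writeLines("placeholder", rendered)

archived <- archive_report(rendered, project_dir, date = as.Date("2025-10-18"))
basename(archived)
```

The date argument defaults to today's date; it is fixed here only to match the example file name used earlier.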
The {renv} package
“create reproducible environments for your R projects”
Initialize the project
First run renv::init() to initialize a new library. This was done for us with starter::create_project().
Other {renv} functions
install() to install packages from CRAN, GitHub, or Bioconductor
update() gets the latest versions of all dependencies
For collaboration with others:
snapshot() adds metadata about currently used packages to the lockfile
restore() uses metadata from the lockfile to install exactly the same version of every package
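Put in sequence, a typical {renv} session might look like the sketch below. It assumes {renv} is installed and is meant to be run from the console inside a project; the package name "dplyr" is only an example. The calls are wrapped in if (interactive()) so that nothing is installed or modified if the code is sourced non-interactively.

```r
if (interactive()) {
  # One time only: set up the project library and lockfile
  # (already done when the project is created with starter::create_project())
  renv::init()

  # As the project develops: install packages into the project library
  renv::install("dplyr")

  # Record the package versions currently in use in renv.lock
  renv::snapshot()

  # Later, or on another machine: reinstall exactly the recorded versions
  renv::restore()
}
```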
Put it all together
Case study: I am starting a new project with Dr. Jane Smith about the association between radiation treatment and overall survival in women with breast cancer. Dr. Smith has emailed me an Excel dataset to analyze for the project, and we have discussed the analysis plan.
- Lockfile written to "G:/StatTeam/zabore/Smith-breast-radiation/renv.lock".
- renv infrastructure has been generated for project "G:/StatTeam/zabore/Smith-breast-radiation".
Notes:
“G:/StatTeam/zabore” is my organization’s preferred and backed-up drive on my computer
A new project folder named “Smith-breast-radiation” will be created and populated
Add a data folder and save the data there
The investigator sent me an Excel file, which I save as is
I also “Save As” a CSV file, which I’ll import into R for data cleaning
Open the RStudio project
Once in the RStudio project:
Open the two shell files (R script and qmd)
Start to install needed packages using renv::install()
Insert comments on speed and other issues
Read in, clean up, and save the data
This is one place where the {here} package will come in handy
library(dplyr)
library(readr)

# Import data ------------------------------------
df0 <- read_csv(
  file = here::here("data", "breastcancer.csv")
) |>
  janitor::clean_names() |>
  janitor::remove_empty()

# Clean data -------------------------------------
df <- df0 |>
  mutate(
    # Insert data cleaning steps here
  ) |>
  labelled::set_variable_labels(
    # Insert variable labels here
  )

# Save the data ----------------------------------
save(
  df,
  file = here::here("data", "smith-breast-rt-data.rda")
)