Reproducibility recommendations: File organization and RStudio Projects

File-related recommendations to improve the reproducibility of your research

reproducibility
Author

Jelmer Poelstra

Published

September 22, 2025


Artwork by @allison_horst


1 Introduction: Reproducibility

After covering Quarto in the first two sessions, today is the third of a series of Code Club sessions covering several topics under the umbrella of “reproducibility”.

1.1 What is reproducibility?

I would like to start by taking a step back to talk about reproducibility in general. What do we mean by reproducibility? Your research is reproducible when third parties are able to perform the same analysis on your data, and produce the same results.

Reproducibility is perhaps a low bar compared to the related concept of replicability, which is the ability to produce the same (qualitative) results when applying the same analysis to different data. Here is a helpful table showing these two and two other related concepts:

For example:

  • Say that you’ve written a paper in which you present the results of one of your research projects. When this research is fully reproducible, someone else should be able to run the exact same analysis and produce all the results and figures using your paper and its associated documentation.

  • Relatedly, when you work in a reproducible manner and you abandon an analysis for say two years, you will be able to pick up from where you left off without much trouble.


1.2 Using R is already a big step in the right direction!

It is inherently more reproducible to write code, such as in R, than to do analyses by clicking in a program with a Graphical User Interface (GUI). This is because it is easy to save your code and thereby to document and communicate what you did, but rather tedious to do this when you worked in a GUI.

In addition, R is open source and freely available. If you use a proprietary program that requires an expensive license, your work may be reproducible in principle, but for many people, it won’t be in practice.


1.3 Additional aspects of reproducibility and what we’ll cover in Code Club

Research that is fully reproducible should use a set of tools and best-practices related to:

  • File organization (this session)
  • Code style and organization (next two sessions)
  • Version management (Git at the end of the semester)
  • Data and code sharing (In part: Git at the end of the semester)
  • Software management (Within R with renv - may cover this later?)
  • Project documentation

1.4 Today

Today, we’ll go over the following four recommendations related to file organization that improve your research projects’ reproducibility.

  1. Use a self-contained folder for each project
  2. Separate files using a consistent subfolder structure
  3. Use relative paths in your code
  4. Use RStudio Projects, not setwd()

And at the bottom of the page, there is at-home reading material on a fifth recommendation:

  5. Use good file names

2 Use a self-contained folder for each project

Using a self-contained folder, or really a hierarchy of folders, for one project means that you:

  • Don’t mix files for multiple distinct projects inside one folder.
  • Don’t keep files for one project in multiple places.

For example:

A diagram showing two research project folder hierarchies.

Two research project folder hierarchies. Each box is a folder.
The way to read this is that the blue data folder is contained within the project1 folder,
which in turn is contained within the Documents folder.

Using a self-contained folder for each project has many downstream benefits and is fundamental to using other reproducibility-related best practices. It is also more convenient in the long run, as it is easier to find files, harder to accidentally throw away stuff, etc.


3 Separate files using a consistent subfolder structure

Within your project’s directory hierarchy, you should use a consistent subfolder structure to separate different kinds of files – for example, you should separate:

  • Code
  • Raw data
  • Results (including “processed data”)

Additionally, you should use plenty of subfolders to separate different kinds of results, and so on. While the specifics can depend a lot on what kind of data you have and how you are analyzing it, here is one good way of organizing a research project folder:

An example research project folder structure.
(The README file is your overall project documentation file, and additional documentation like lab notebooks is stored in doc.)

Related recommendations
  • Treat raw data as read-only and as highly valuable
  • Avoid manually editing intermediate results
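
A folder structure like this can also be created from R. The following is a minimal sketch, not a prescribed layout: the project name project1, the subfolder names (data, doc, scripts, results/figures) and the file name README.md are illustrative assumptions, and everything is built inside a temporary directory so that none of your own files are touched.

```r
# Build an example project skeleton inside a temporary directory.
# "project1" and the subfolder names are illustrative assumptions.
project_root <- file.path(tempdir(), "project1")

subdirs <- c("data", "doc", "scripts", "results/figures")
for (subdir in subdirs) {
  # recursive = TRUE also creates missing parent dirs (e.g. "results")
  dir.create(file.path(project_root, subdir),
             recursive = TRUE, showWarnings = FALSE)
}

# An (empty) README as the top-level documentation file
file.create(file.path(project_root, "README.md"))

# List the dirs that were created, relative to the project root
created_dirs <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(created_dirs)
```

For a real project, you would replace the temporary directory with the location where you keep your research projects.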

4 Use relative paths in your code

4.1 Key terms

To understand the recommendation to use relative paths in your code, let’s start by going over the following terms:

  1. “Directory” (“dir” for short) is another word for folder that is commonly used in coding contexts.

  2. Your “working directory” is the directory where you are currently located. When you start R, it will always have a starting point at a specific location in your computer. You can see what your working directory is with the function getwd() (short for “get working dir”):

    getwd()
    [1] "/Users/poelstra.1/Library/CloudStorage/Dropbox/mcic/teach/codeclub/codeclub-site/posts/S10E03_reprod_03"
  3. The output of getwd() is a path, which specifies the location of a file or folder on a computer. In R’s output, folders are separated by forward slashes, as shown above.


Native paths on Windows versus Mac and Linux

Even though R will report paths with forward slashes regardless of the operating system, Windows natively uses backslashes. Additionally, there are multiple root directories on Windows, indicated by letters. Here is an example native Windows file path:

C:\Users\John Doe\Desktop\cats.png

Versus an example Mac (or Linux) file path:

/Users/John Doe/Desktop/cats.png

If you are on Windows, we recommend you specify paths in R with forward slashes, because:

  1. It makes the path specification universal, i.e. independent of the operating system.
  2. Backslashes have a separate purpose in R strings, so if you wanted to use them in a path, you’d actually need two. For example: setwd("C:\\Users\\John Doe"). This is confusing and error-prone!
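
To make the backslash issue concrete, here is a small sketch that uses the example Windows path from above. It shows that the doubled backslashes are stored as single ones, and one possible way (an assumption, not the only approach) to convert such a path to forward slashes.

```r
# A native Windows path: each backslash must be doubled inside an R string
win_path <- "C:\\Users\\John Doe\\Desktop\\cats.png"

# writeLines() shows the string as R actually stores it (single backslashes)
writeLines(win_path)

# Convert to forward slashes; fixed = TRUE treats the pattern as a literal backslash
portable_path <- gsub("\\", "/", win_path, fixed = TRUE)
writeLines(portable_path)

# By default, file.path() joins path components with forward slashes
file.path("Desktop", "cats.png")
```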

4.2 Absolute and relative paths

There are two types of paths:

  • Absolute paths, also referred to as full paths, start from (one of) the computer’s root (top-level) directories. They correctly point to a file or folder regardless of what your working dir is.

    If you think of a path as a way to point to a geographic location, then absolute paths are like GPS coordinates.

  • Relative paths start from a specific working dir, and won’t work if you’re located elsewhere.

    Again thinking of a path as a way to point to a geographic location, relative paths are like directions such as “Take the second left”.

The absolute versus a relative path to a file. The relative path has the project1 folder as the starting point.
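
The relationship between the two kinds of paths can be sketched in R as follows. The file data/experiment1.tsv is hypothetical and does not need to exist for this to run: the code only builds the path strings.

```r
# A relative path (hypothetical file) is interpreted from the working dir
rel_path <- "data/experiment1.tsv"

# Gluing the working dir in front gives the absolute path that R would use
abs_path <- file.path(getwd(), rel_path)
abs_path
```

The absolute path printed here will differ between computers and working dirs, while the relative path stays the same.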

Exercise: paths

  1. In the image below, what is the absolute path to todo_list.txt?
  2. If your working dir is oneils, what is the relative path to todo_list.txt?
  3. If your working dir is documents, what is the relative path to todo_list.txt?

A diagram of an example dir structure on a Mac computer.

Example dir structure on a Mac computer, from O’Neil (2019) (link to chapter)

4.3 Why you should prefer relative paths

Don’t absolute paths sound better? What could be a disadvantage of them?

Absolute paths:

  • Don’t generally work across computers
  • Break when you move a project folder hierarchy to a different place on your computer

On the other hand, this does not apply to relative paths that use the root project folder as the working dir. Those paths keep working when moving the folder within and between computers:


Compared to the earlier image, everything was moved into a folder OneDrive.
This resulted in a change in the absolute path, but not to this relative path.

So, relative paths are preferred, but only when used in the following way:

  1. Your working directory is always your project’s root directory (project1 in the example above)
  2. You use relative paths that start at this directory
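
The effect of moving a project can be simulated in R. This is a sketch under some assumptions: the folder names mirror the diagrams (Documents, project1, OneDrive), the file data/experiment1.tsv is hypothetical, and instead of changing the working dir, the code pastes the project root in front of the relative path to mimic what R does when the working dir is the project root.

```r
# Work in a throwaway sandbox dir so no real files are touched
sandbox <- tempfile("paths_demo_")
dir.create(sandbox)

# The same relative path is used before and after the move (hypothetical file)
rel_path <- file.path("data", "experiment1.tsv")

# Project location before the move, with one small data file in it
root_before <- file.path(sandbox, "Documents", "project1")
dir.create(file.path(root_before, "data"), recursive = TRUE)
writeLines("id\tvalue", file.path(root_before, rel_path))

# Move the whole Documents folder into a new OneDrive folder
dir.create(file.path(sandbox, "OneDrive"))
file.rename(file.path(sandbox, "Documents"),
            file.path(sandbox, "OneDrive", "Documents"))
root_after <- file.path(sandbox, "OneDrive", "Documents", "project1")

# The old absolute path is now broken ...
old_abs_works <- file.exists(file.path(root_before, rel_path))
# ... but the relative path still works from the (moved) project root
rel_still_works <- file.exists(file.path(root_after, rel_path))

c(old_abs_works = old_abs_works, rel_still_works = rel_still_works)
```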

5 Use RStudio Projects, not setwd()

If your R working directory is not where you want it to be, you can change it with the setwd() function. However, there is a better way – letting RStudio Projects set your working dir!

5.1 RStudio Projects

RStudio Projects are an RStudio-specific concept. Creating a Project produces a special file with the extension .Rproj, whose location designates the R working directory when you open that project.

Given what we talked about so far, in which dir(s) do you think RStudio Projects should generally be created? In a project’s root directory! For example, in project1 in the diagram above.

Create a new RStudio Project:

  1. Click File > New Project > New Directory > New Project
  2. In the next window:
    • Type codeclub-project-practice as the “Directory name”.
    • You may use the default location (parent dir) or change it – up to you.
    • Don’t check any of the boxes

After RStudio automatically reloads, the R working directory will be set to the dir in which your RStudio Project file is located. Therefore, you should see the file ending in .Rproj in the RStudio Files tab in the lower right pane. Also, you can check your working dir:

getwd()
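
If you prefer to check this from the console rather than the Files tab, a small sketch like the following looks for a file ending in .Rproj in the working dir. The messages are just illustrative wording.

```r
# Look for an RStudio Project file in the current working dir
rproj_files <- list.files(getwd(), pattern = "\\.Rproj$")

if (length(rproj_files) > 0) {
  message("Working dir contains a Project file: ", rproj_files[1])
} else {
  message("No .Rproj file here: is the Project open, or was the working dir changed?")
}
```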

5.2 Why RStudio Projects are useful

In brief, RStudio Projects help you to organize your work and make it more reproducible:

  • When using Projects, you can avoid manually setting your working directory altogether. To refer to files within the project, you can use relative file paths. This way, even if you move the project directory, or copy it to a different computer, the same paths will still work.

  • Projects encourage you to organize research projects inside self-contained folder hierarchies exactly as recommended above.

  • When you switch between Projects, R will restart — and this is a good thing, since you don’t want to randomly carry over objects and loaded packages across research projects.

RStudio Projects are also convenient because:

  • Projects record which scripts (and R Markdown files) are open in RStudio, and will reopen all of those when you reopen the project. This becomes quite handy, say, when you work on three different projects, each of which uses a number of different scripts.
  • The Files tab will always be (or at least start) in sync with your working dir

5.3 Using relative paths to refer to files within the project

Let’s recall the aforementioned way you should use relative paths:

  1. Your working directory should always be your project’s root directory => this is taken care of by using RStudio projects and avoiding setwd()
  2. You should use relative paths that start at this directory => we’ll practice that now

Here are a few examples of using relative paths in R:

# [Don't run - fictional examples]
read_tsv("data/experiment1.tsv")
ggsave("results/figures/barplot.png")

Two key aspects of specifying paths in R:

  • You have to use quotes (either double or single quotes) around the path
  • As soon as you open quotes, either in the console or in your script, R will assume that you are trying to specify a path, and you can use Tab completion to essentially browse through your files. Practice with that in the exercise below!

When you open quotes in the R console, your script, or Quarto code blocks, and press Tab, you can essentially browse your files starting from your working dir.

Accessing files that aren’t in the project dir

Occasionally, you may need to access a file that is outside of your project dir (but the less that happens, the better – and this comes down to proper file organization). In that case, you can either use an absolute path, or a relative path that starts by going “up” one or more levels. You can navigate up with the .. notation, for example:

# [Don't run - fictional examples]
# The file is located in the dir one level up (the parent dir):
read_tsv("../myfile.tsv")

# The file is located in the dir Downloads, which is two levels up:
read_tsv("../../Downloads/myfile.tsv")
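
To see what .. points to on your own computer, you can resolve it to an absolute path. This sketch uses only base R functions and compares the result with the parent dir as reported by dirname().

```r
# ".." refers to the parent of the working dir ...
parent_via_dots <- normalizePath("..", winslash = "/")
# ... which should be the same dir that dirname() reports
parent_via_dirname <- normalizePath(dirname(getwd()), winslash = "/")

parent_via_dots
identical(parent_via_dots, parent_via_dirname)
```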

Exercise: File organization and relative paths

In this exercise, you are working on a mock research project within the folder that you created above (your RStudio Project folder). You will practice with file organization and using relative paths.

  1. You will be writing to file some mock data, a computed summary of the data, a plot based on the data, and an R script. Create an appropriate dir structure for those files within your RStudio Project dir. It’s up to you how you do this: in the RStudio Files pane, in your computer’s file explorer, or with an R function (dir.create()).

    Click here to see a solution

    From the above description, it seems appropriate to create data, results, and scripts dirs. Here, I’m also choosing to create a dir plots within results for plots.

    dir.create("data")
    dir.create("results/plots", recursive = TRUE)
    dir.create("scripts")
  2. Open a new R script (File > New File > R Script) and save it in an appropriate location within your newly created dir structure. Then, add all the code for the next exercises in that script.

    Click here to see a solution

    It would be appropriate to save the script in a dir specifically for scripts, which was called scripts in the answer above. E.g., you could save it as scripts/exercise.R.

  3. Load the tidyverse as follows:

    library(tidyverse)
    ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ✔ dplyr     1.1.4     ✔ readr     2.1.5
    ✔ forcats   1.0.0     ✔ stringr   1.5.1
    ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
    ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
    ✔ purrr     1.1.0     
    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()
    ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    # If that produces an error, run the following command to install, then try again:
    # install.packages("tidyverse")
    # library(tidyverse)
  4. Now, you will have access to a dataset (data frame) called diamonds. Modify the code below to write that data to file in an appropriate location.

    write_tsv(diamonds, "<FILE-PATH>")

    Click here to see a solution

    Assuming that you store the raw data in a dir called data:

    write_tsv(diamonds, "data/diamonds.tsv")
  5. Run the ggplot() code below to create a plot of the diamonds data, then modify the ggsave() line to save your plot to file in an appropriate location.

    ggplot(diamonds, aes(x = carat, y = price)) +
      geom_point()
    ggsave("<FILE-PATH>")

    Click here to see a solution

    Assuming that you store the plot in a dir called results/plots:

    ggplot(diamonds, aes(x = carat, y = price)) +
      geom_point()

    ggsave("results/plots/plot.png")  
    Saving 7 x 5 in image
  6. Run the code below to summarize one aspect of the diamonds data, then modify the write_tsv() line to save your resulting data frame to file in an appropriate location.

    data_summary <- diamonds |>
      summarize(mean_price = mean(price), .by = cut)
    write_tsv(data_summary, "<FILE-PATH>")

    Click here to see a solution

    Assuming that you store the results in a dir called results:

    data_summary <- diamonds |>
      summarize(mean_price = mean(price), .by = cut)
    write_tsv(data_summary, "results/data_summary.tsv")
  7. In the RStudio Files tab, take a look at the files and dir structure you produced.




6 Bonus: Use good file names

Here are three principles for good file names (from Jenny Bryan) — good file names:

  • Are machine-readable
  • Are human-readable
  • Play well with default file ordering

6.1 Machine-readable

Consistent and informative naming helps you to programmatically find and process files.

  • Avoid spaces in file names. More generally, only use the following in file names:
    • Alphanumeric characters A-Z, a-z, 0-9
    • Underscores _ and hyphens (dashes) -
    • Periods (dots) .
  • In file names, you may provide metadata like sample ID and date, allowing you to easily select samples from e.g. a certain month: sample032_2016-05-03.txt
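
As a sketch of what “programmatically finding and processing files” can look like, the code below selects files from one month and extracts the metadata from the names. Only the first file name comes from the example above. The other two are hypothetical, as are the object and column names.

```r
# Hypothetical file names following the sampleID_date pattern
file_names <- c(
  "sample032_2016-05-03.txt",
  "sample033_2016-05-17.txt",
  "sample034_2016-06-01.txt"
)

# Select all samples from May 2016 by matching part of the name
may_files <- grep("_2016-05-", file_names, value = TRUE, fixed = TRUE)
may_files

# Split each name (without extension) on underscores to recover the metadata
name_parts <- strsplit(tools::file_path_sans_ext(file_names), "_", fixed = TRUE)
file_info <- data.frame(
  file = file_names,
  sample_id = vapply(name_parts, `[`, character(1), 1),
  date = as.Date(vapply(name_parts, `[`, character(1), 2))
)
file_info
```

With real files, you could obtain file_names with list.files() instead of typing them out.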

6.2 Human-readable

One good way to combine machine- and human-readability (opinionated recommendations):

  • Use underscores (_) to delimit units you may later want to separate on: sampleID, treatment, date.
  • Within such units, use dashes (-) to delimit words: grass-samples.
  • Limit the use of periods (.) to indicate file extensions.
  • Generally avoid capitals.

For example:

mmus001_treatmentA_filtered.tsv
mmus002_treatmentA_filtered.tsv
.
.
mmus086_treatmentA_filtered.tsv

6.3 Play well with default file ordering

  • Use leading zeros for lexicographic sorting: sample005.
  • Dates should always be written as YYYY-MM-DD: 2020-10-11.
  • Group similar files together by starting with same phrase, and number scripts by execution order:
DE-01_normalize.R
DE-02_test.R
DE-03_process-significant.R
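
The effect of leading zeros and YYYY-MM-DD dates on sorting can be checked with a short sketch. The sample names and dates are made up for illustration, and method = "radix" is used so that the sort order does not depend on your computer’s language settings.

```r
# Without leading zeros, sample 10 sorts before sample 2
unpadded <- paste0("sample", c(1L, 2L, 10L), ".txt")
sort(unpadded, method = "radix")

# sprintf() with %03d pads the number to three digits
padded <- sprintf("sample%03d.txt", c(1L, 2L, 10L))
sort(padded, method = "radix")

# YYYY-MM-DD dates sort chronologically even when treated as plain text
dates <- c("2020-10-11", "2019-12-01", "2020-02-03")
sort(dates, method = "radix")
```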


References

The Turing Way Community. 2025. The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research. https://doi.org/10.5281/zenodo.15213042.
O’Neil, Shawn T. 2019. A Primer for Computational Biology. Oregon State University. https://open.oregonstate.education/computationalbiology/.
Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal. 2017. “Good Enough Practices in Scientific Computing.” PLOS Computational Biology 13 (6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510.