Reading and writing files, project workflow
In this section we learn how to work with a simple project workflow, and how to read files and write processed files.
Reading files and CSV files
In this section we will work with .csv files, a simple format for storing tables in a plain text file (see CSV on Wikipedia). You can open a .csv file in common spreadsheet editors such as MS Excel or LibreOffice Calc. In R we can read it with the function read.csv(). As an argument, we pass the path to the file. The path can be absolute or relative, and it can also be a URL pointing to a file on some server. The same principle applies to reading any other file type such as .shp, .tif, etc., but in those cases there are specific functions for reading the files.
First we will read a simple .csv file with the same data we had in the previous section, but now stored in the file simple_data.csv. We will read the file with read.csv(), pointing to the file with an absolute path.
The absolute path should be replaced with the path to the file on your computer, for example:
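Since the exact absolute path differs on every computer, the sketch below first creates a mock simple_data.csv in a temporary directory (the content is made up), then reads it back using its absolute path - on your machine you would pass the real path to your own file instead.

```r
# Create a small mock simple_data.csv in a temporary directory
path <- file.path(tempdir(), "simple_data.csv")
writeLines(c("id,value", "1,10", "2,20", "3,30"), path)

# Read the file using its absolute path, e.g. on your computer:
# df <- read.csv("C:/Users/you/Documents/simple_data.csv")
df <- read.csv(path)
df
```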
Tip
You can copy the absolute path to the file the same way you would copy the file itself - select the file in the file explorer, copy it (Ctrl+C), and paste it (Ctrl+V) into the source pane in RStudio.
Now we have the data stored in the df object, which is a data.frame object.
Project workflow
R always runs in a specific directory called the working directory. The working directory is important when you want to work with paths or files. You can use absolute paths, but it is better practice to use paths relative to the working directory. You can show the working directory with the function getwd(), or change it with the function setwd(), but that is not good practice either; it is better to build the habit of working with relative paths in a project-directory-based workflow. This ensures that the code will work without changes if you move the project directory to another place, rename part of the path, use another IDE, use Git, etc.
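A minimal sketch of the difference (the relative path shown is an example from the project layout described below):

```r
# Print the current working directory
getwd()

# With a project-based workflow, files are referenced relative to it:
# df <- read.csv("data/simple_data.csv")    # portable, relative path
# rather than an absolute path such as:
# df <- read.csv("C:/Users/you/Documents/first_project/data/simple_data.csv")
```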
While working with data in R, it is good practice to:
- one directory - keep all files (data, scripts, outputs) in one directory (the project or workspace directory) and run R from this directory (the working directory)
Note
This is not strictly necessary for working with R, but it is good practice that avoids building bad habits. In some cases you will need to use absolute paths, e.g. when starting an R project that relies on a dedicated directory of general, frequently used data. But try to avoid this as much as possible, since it prevents sharing the project.
Setting project directory
When starting, just create a directory for your project, move the data files to the data directory, and create your scripts in the scripts directory. Example of a simple project directory structure:
- data/ - directory for all input data
- scripts/ - directory for R scripts
- outputs/ - directory for output files - plots, tables, ...
Tip
You can create this structure in RStudio by creating a new project and then creating the directories and files (scripts, etc.) in the Files pane.
RStudio project
Use the Project feature, which creates a project file .Rproj in the project directory. Loading the .Rproj file automatically sets the working directory to the project directory.
- creating project - File -> New Project...
- opening existing project - File -> Open Project... or File -> Recent Projects, or simply open the .Rproj file in the file explorer
- RStudio opens the last project by default; you can change this in Tools -> Global Options -> General/Basic tab: Default working directory - disable Restore most recently opened project at startup
Note
Other useful settings for working with projects:
- .RData workspace - R can save the entire workspace (all objects, variables, functions, ...) to an .RData file, which can later be restored. I recommend not using the workspace, and instead writing your scripts in a way that they recreate the workspace.
- option to set: Tools -> Global Options -> General/Basic tab: Workspace - uncheck Restore .RData; set Save workspace to .RData on exit to Never
- text encoding - use UTF-8 when saving files.
- option to set: Tools -> Global options -> Code/Saving tab -> Default text encoding: UTF-8
First project
In this example project we write one script that reads the data, performs some simple processing, and writes the output to a new file. We will work with an export from DRUSOP, the digital register of the central list of nature conservation in the Czech Republic.
- Create a new project and directory (first_project, or name it as you want) - this can be done in RStudio
- Create the subdirectories data, scripts and outputs
- Create a new script file first_script.R in the scripts directory
- Copy the export.csv to the data directory
Note
Originally the file was downloaded this way: go to https://drusop.nature.cz/ and select Maloplošná zvláště chráněná území in the Objekty ústředního seznamu section. Click the Export button, check all records with Označ/Zruš VŠE na stránce, select the Excel (CSV) format in Formát and UTF-8 encoding in Kódování, and click Exportovat. This downloads the file export.csv. Move this file to the data directory.
Aim of the project:
Perform some basic exploration of the data, and create a .csv table containing only the protected areas with an area larger than 500 ha.
Dataset
The dataset contains information about smaller specially protected areas in 4 categories: national nature reserves (NPR), nature reserves (PR), national natural monuments (NPP), and natural monuments (PP).
Now you need to check whether the data were read correctly (which is not always the case). You can display the data by calling df as with any other object, or use head() and str() (from the previous section).
head() returns a data.frame with the first 6 rows of the original data frame:
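A minimal sketch of both functions on a tiny mock data frame standing in for the real df read from export.csv (the column names follow the export, the values are made up):

```r
# Mock data frame; in the project this would be the df from read.csv()
df_demo <- data.frame(
  Kategorie   = c("NPR", "PR", "PP", "NPP", "PR", "PP", "NPR"),
  Rozloha..ha = c(612.3, 45.1, 3.2, 18.7, 120.5, 7.9, 890.0)
)

head(df_demo)  # first 6 rows of the data frame
str(df_demo)   # dimensions, column names, data types, first values
```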
Now we can see that the data are messy. A better way to explore the data is the str() function, which returns the structure of the data frame, including the row and column counts, the column data types and names, and examples of the first values in each column.
Tip
The str() and head() functions are the most commonly used functions for exploring data. You can also use the summary() function. These functions are usually executed directly in the console, but you can also use them in a script.
Now we can see that the data were not read correctly. We have 2681 rows (observations) but only 1 column (variable). Evidently this is caused by reading the data with the wrong separator: the read.csv() function uses a comma (,) as the default separator, but the data are separated by semicolons (;). This and many other parameters can be set as arguments of read.csv(); see ?read.csv for help.
To set the separator, we can use the sep argument:
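A self-contained sketch using mock semicolon-separated content in place of the real export.csv (in the project itself this would be read.csv("data/export.csv", sep = ";")):

```r
# Mock content mimicking the semicolon-separated export
csv_text <- "Kategorie;Rozloha..ha\nNPR;612,3\nPR;45,1"

df_demo <- read.csv(text = csv_text, sep = ";")
str(df_demo)  # two columns now, but Rozloha..ha is still character
```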
The structure of the data now looks better, but still not ideal - check the column Rozloha..ha, which has data type chr (character) but should be num (numeric). This is caused by the decimal separator, which is a comma (,) instead of a dot (.). This can be set in the read.csv() function with the dec argument:
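Continuing the same mock-data sketch (in the project: read.csv("data/export.csv", sep = ";", dec = ",")):

```r
csv_text <- "Kategorie;Rozloha..ha\nNPR;612,3\nPR;45,1"

# dec = "," tells R that the comma is the decimal separator
df_demo <- read.csv(text = csv_text, sep = ";", dec = ",")
str(df_demo)  # Rozloha..ha is now numeric
```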
Exploring and cleaning data
Now that we have read the data correctly, we can explore it. We did some exploration with the str() function, but in the following exercises we will go further. This is not a fixed exploration routine; this section simply shows common cases you will encounter while processing data, so you learn how to solve these basic problems.
Dealing with NA - What is the mean area of the protected areas?
Sometimes there are missing values in the data, represented in R by the NA value (Not Available). NA values in the data can prevent some calculations, for example if we try to calculate the mean area of the protected areas with the mean() function:
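For example (a sketch with a mock of the area column; the values are made up):

```r
# Mock of the Rozloha..ha column with one missing value
area <- c(612.3, 45.1, 3.2, NA)

mean(area)  # NA - the missing value makes the result undefined
```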
This returns NA, which indicates that there are NA values in the data.
This can be explored further with the is.na() function, which returns a logical vector of TRUE and FALSE values. This vector can be simply checked with the unique() function (shows the unique values of a vector), or summarized with the table() function (counts the unique values of a vector).
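A sketch on the same mock vector:

```r
area <- c(612.3, 45.1, 3.2, NA)

is.na(area)          # FALSE FALSE FALSE TRUE
unique(is.na(area))  # FALSE TRUE - both values occur, so NA is present
table(is.na(area))   # counts: FALSE 3, TRUE 1
```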
Many functions allow skipping the NA values in a calculation, but it is always good to know that this is the case, rather than having the NA values omitted by default. We can use the na.rm argument to remove the NA values from the calculation:
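Continuing the mock-vector sketch:

```r
area <- c(612.3, 45.1, 3.2, NA)

mean(area, na.rm = TRUE)  # NA values removed before averaging
```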
Explore the NA value of the Rozloha..ha column
We know that we can perform calculations omitting the NA values with the na.rm argument, but it is good to know what the NA value in the data actually is.
We know that is.na() returns a logical vector of TRUE and FALSE values, and we can further use this vector to subset the data frame, returning only the row with an NA value in the Rozloha..ha column:
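A sketch on a mock data frame (in the real data the NA row turns out to be the last one, index 2681; the Nazev column and its values here are made up):

```r
df_demo <- data.frame(
  Nazev       = c("Area A", "Area B", "provider metadata"),
  Rozloha..ha = c(612.3, 45.1, NA)
)

df_demo[is.na(df_demo$Rozloha..ha), ]  # only the row(s) with NA area
```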
The returned row has index 2681, and as we know df has 2681 rows (observations), so it is the last row of the data. It is clear that the last row is not a valid observation; in this case it is a kind of metadata from the data provider, so it should be cleaned out of the data.
Removing the row
In this case there are many ways to remove the row. Here are some simple examples:
- simply with the [-2681] notation, where - means to exclude the row with index 2681:
- with the is.na() function, using the ! operator to negate (reverse) the logical vector, so it performs the opposite operation - get all rows where Rozloha..ha is not NA:
- with a subset based on a string value of some column
Choose the one you like the most, and assign it to the df object:
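The three approaches above can be sketched on the same mock data frame (row index, column name and string value are mock stand-ins for the real data, where the index would be 2681):

```r
df_demo <- data.frame(
  Nazev       = c("Area A", "Area B", "provider metadata"),
  Rozloha..ha = c(612.3, 45.1, NA)
)

# 1) exclude by row index (in the real data: df[-2681, ])
df_demo[-3, ]

# 2) keep only rows where Rozloha..ha is not NA
df_demo[!is.na(df_demo$Rozloha..ha), ]

# 3) subset on a string value of some column
df_demo[df_demo$Nazev != "provider metadata", ]

# assign the chosen variant back to the object
df_demo <- df_demo[!is.na(df_demo$Rozloha..ha), ]
```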
Data exploration
Get the protected areas with area larger than 500 ha and category NPR
What is the total area of the protected areas?
Which protected area is the largest?
Which category has the largest protected area?
How many protected areas are in each category?
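Sketches of how such questions can be answered with base R subsetting, again on a small mock data frame (column names follow the export, values are made up):

```r
df_demo <- data.frame(
  Nazev       = c("Area A", "Area B", "Area C", "Area D"),
  Kategorie   = c("NPR", "PR", "NPR", "PP"),
  Rozloha..ha = c(612.3, 45.1, 890.0, 700.2)
)

# protected areas larger than 500 ha with category NPR
df_demo[df_demo$Rozloha..ha > 500 & df_demo$Kategorie == "NPR", ]

# total area of the protected areas
sum(df_demo$Rozloha..ha)

# the largest protected area
df_demo$Nazev[which.max(df_demo$Rozloha..ha)]

# number of protected areas in each category
table(df_demo$Kategorie)
```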
Writing the output
As we can read .csv files, we can also create/write them. We can use the write.csv() function from the write.* family of functions. The write.csv() function writes a data.frame to a .csv file.
write.csv(df, "outputs/processed_data.csv", row.names = FALSE)
# we use the row.names = FALSE to not write the row names to the file
If the task was: Create an export of the protected areas with an area larger than 500 ha and category NPR, the entire script can look like this:
df <- read.csv("data/export.csv", sep = ";", dec = ",")
df <- df[!is.na(df$Rozloha..ha), ]
df <- df[df$Rozloha..ha > 500,]
df <- df[df$Kategorie == "NPR",]
write.csv(df, "outputs/processed_data.csv", row.names = FALSE)
or using the "and" operator &:
df <- read.csv("data/export.csv", sep = ";", dec = ",")
df <- df[!is.na(df$Rozloha..ha) & df$Rozloha..ha > 500 & df$Kategorie == "NPR", ]
write.csv(df, "outputs/processed_data.csv", row.names = FALSE)
Summary
Main outcomes
- understand the simple project workflow - working directory, project, relative paths
- read CSV files with the read.csv() function, and set its parameters to read the data correctly
- write CSV files with the write.csv() function
- work with NA values - check for NA values with is.na(), remove them from calculations with the na.rm argument, or subset the data to remove rows with NA values
- negate a logical vector with the ! operator, and subset data with - to exclude rows/columns with a specific index
Function overview
- read.csv() - read a CSV file and create a data frame
- write.csv() - write a data frame to a CSV file
- is.na() - return a logical vector of TRUE and FALSE values, where TRUE means the value is NA
- table() - count the unique values of a vector
- unique() - return the unique values of a vector
Practice exploration
Practice some exploration on the Zoraptera Occurrence Dataset (https://zenodo.org/records/14652555). The dataset contains information about occurrences of the order Zoraptera. A direct link to the dataset file is "https://raw.githubusercontent.com/kalab-oto/zoraptera-occurrence-dataset/refs/tags/1.1.0/zoraptera_occs.csv"
How many observations and variables are in the data?
There are 656 rows (observations) and 41 columns (variables)
Is there a species column in the data?
There is no species column
How many records are not identified to the species level?
581
What is the most common species in the data?
Usazoros hubbardi
What is the most common country in the data?
United States of America
How many records are from Fiji?
9