Reading and writing files, project workflow
In this section we learn how to work with a simple project workflow, and how to read files and write processed files.
Reading files and CSV files
In this section we will work with .csv files, a simple format for storing tables in a plain text file (see CSV on Wikipedia). You can open a .csv file in common spreadsheet editors such as MS Excel or LibreOffice Calc. In R we can read it with the function read.csv(). As an argument, we pass the path to the file. The path can be absolute or relative, and it can also be a URL pointing to a file on some server. The same principle applies to reading any other file type such as .shp, .tif, etc., but in those cases there are specific functions for reading the files.
First we will read a simple .csv file with the same data we had in the previous section, but now stored in the file simple_data.csv. We will read the file with read.csv(), pointing to the file with an absolute path.
The absolute path should be replaced with the path to the file on your computer, for example:
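Since the exact absolute path differs on every computer, the sketch below first creates a mock simple_data.csv in a temporary directory (the content is made up), then reads it back using its absolute path - on your machine you would pass the real path to your own file instead.

```r
# Create a small mock simple_data.csv in a temporary directory
path <- file.path(tempdir(), "simple_data.csv")
writeLines(c("id,value", "1,10", "2,20", "3,30"), path)

# Read the file using its absolute path, e.g. on your computer:
# df <- read.csv("C:/Users/you/Documents/simple_data.csv")
df <- read.csv(path)
df
```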
Tip
You can copy the absolute path to the file the same way you would copy the file itself - select the file in the file explorer, copy it (Ctrl+C), and paste it (Ctrl+V) into the source pane in RStudio.
Now we have the data stored in the df object, which is a data.frame object.
Project workflow
R always runs in a specific directory called the working directory. The working directory is important when you want to work with paths or files. You can use absolute paths, but it is better practice to use paths relative to the working directory. You can show the working directory with the function getwd(), or change it with the function setwd(), but that is not good practice either; it is better to build the habit of working with relative paths in a project-directory-based workflow. This ensures that the code will work without changes if you move the project directory to another place, rename part of the path, use another IDE, use Git, etc.
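A minimal sketch of the difference (the relative path shown is an example from the project layout described below):

```r
# Print the current working directory
getwd()

# With a project-based workflow, files are referenced relative to it:
# df <- read.csv("data/simple_data.csv")    # portable, relative path
# rather than an absolute path such as:
# df <- read.csv("C:/Users/you/Documents/first_project/data/simple_data.csv")
```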
While working with data in R, it is good practice to:
- one directory - keep all files (data, scripts, outputs) in one directory (the project or workspace directory) and run R from this directory (the working directory)
Note
This is not strictly necessary for working with R, but it is good practice that avoids building bad habits. In some cases you will need to use absolute paths, e.g. when starting an R project that relies on a dedicated directory of general, frequently used data. But try to avoid this as much as possible, since it prevents sharing the project.
Setting project directory
When starting, just create a directory for your project, move the data files to the data directory, and create your scripts in the scripts directory. Example of a simple project directory structure:
- data/ - directory for all input data
- scripts/ - directory for R scripts
- outputs/ - directory for output files - plots, tables, ...
Tip
You can create this structure in RStudio by creating a new project and then creating the directories and files (scripts, etc.) in the Files pane.
RStudio project
Use the Project feature, which creates a project file .Rproj in the project directory. Loading the .Rproj file automatically sets the working directory to the project directory.
- creating project - File -> New Project...
- opening existing project - File -> Open Project... or File -> Recent Projects, or simply open the .Rproj file in the file explorer
- RStudio opens the last project by default; you can change this in Tools -> Global Options -> General/Basic tab: Default working directory - disable Restore most recently opened project at startup
Note
Other useful settings for working with projects:
- .RData workspace - R can save the entire workspace (all objects, variables, functions, ...) to an .RData file, which can later be restored. I recommend not using the workspace, and instead writing your scripts in a way that they recreate the workspace.
- option to set: Tools -> Global Options -> General/Basic tab: Workspace - uncheck Restore .RData; set Save workspace to .RData on exit to Never
- text encoding - use UTF-8 when saving files.
- option to set: Tools -> Global options -> Code/Saving tab -> Default text encoding: UTF-8
First project
In this example project we write one script that reads the data, performs some simple processing, and writes the output to a new file. We will work with an export from DRUSOP, the digital register of the central list of nature conservation in the Czech Republic.
- Create a new project and directory (first_project, or name it as you want) - this can be done in RStudio
- Create the subdirectories data, scripts and outputs
- Create a new script file first_script.R in the scripts directory
- Copy the export.csv to the data directory
Note
Originally the file was downloaded this way: go to https://drusop.nature.cz/ and select Maloplošná zvláště chráněná území in the Objekty ústředního seznamu section. Click the Export button, check all records with Označ/Zruš VŠE na stránce, select the Excel (CSV) format in Formát and UTF-8 encoding in Kódování, and click Exportovat. This downloads the file export.csv. Move this file to the data directory.
Aim of the project:
Perform some basic exploration of the data, and create a .csv table containing only the protected areas with an area larger than 500 ha.
Dataset
The dataset contains information about smaller specially protected areas in 4 categories: national nature reserves (NPR), nature reserves (PR), national natural monuments (NPP), and natural monuments (PP).
Now you need to check whether the data were read correctly (which is not always the case). You can display the data by calling df as with any other object, or use head() and str() (from the previous section).
head() returns a data.frame with the first 6 rows of the original data frame:
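A minimal sketch of both functions on a tiny mock data frame standing in for the real df read from export.csv (the column names follow the export, the values are made up):

```r
# Mock data frame; in the project this would be the df from read.csv()
df_demo <- data.frame(
  Kategorie   = c("NPR", "PR", "PP", "NPP", "PR", "PP", "NPR"),
  Rozloha..ha = c(612.3, 45.1, 3.2, 18.7, 120.5, 7.9, 890.0)
)

head(df_demo)  # first 6 rows of the data frame
str(df_demo)   # dimensions, column names, data types, first values
```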
Now we can see that the data are messy. A better way to explore the data is the str() function, which returns the structure of the data frame, including the row and column counts, the column data types and names, and examples of the first values in each column.
Tip
The str() and head() functions are the most commonly used functions for exploring data. You can also use the summary() function. These functions are usually executed directly in the console, but you can also use them in a script.
Now we can see that the data were not read correctly. We have 2681 rows (observations) but only 1 column (variable). Evidently this is caused by reading the data with the wrong separator: the read.csv() function uses a comma (,) as the default separator, but the data are separated by semicolons (;). This and many other parameters can be set as arguments of read.csv(); see ?read.csv for help.
To set the separator, we can use the sep argument:
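A self-contained sketch using mock semicolon-separated content in place of the real export.csv (in the project itself this would be read.csv("data/export.csv", sep = ";")):

```r
# Mock content mimicking the semicolon-separated export
csv_text <- "Kategorie;Rozloha..ha\nNPR;612,3\nPR;45,1"

df_demo <- read.csv(text = csv_text, sep = ";")
str(df_demo)  # two columns now, but Rozloha..ha is still character
```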
The structure of the data now looks better, but still not ideal - check the column Rozloha..ha, which has data type chr (character) but should be num (numeric). This is caused by the decimal separator, which is a comma (,) instead of a dot (.). This can be set in the read.csv() function with the dec argument:
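Continuing the same mock-data sketch (in the project: read.csv("data/export.csv", sep = ";", dec = ",")):

```r
csv_text <- "Kategorie;Rozloha..ha\nNPR;612,3\nPR;45,1"

# dec = "," tells R that the comma is the decimal separator
df_demo <- read.csv(text = csv_text, sep = ";", dec = ",")
str(df_demo)  # Rozloha..ha is now numeric
```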
Exploring and cleaning data
Now that we have read the data correctly, we can explore it. We did some exploration with the str() function, but in the following exercises we will go further. This is not a fixed exploration routine; this section simply shows common cases you will encounter while processing data, so you learn how to solve these basic problems.
Dealing with NA - What is the mean area of the protected areas?
Sometimes there are missing values in the data, represented in R by the NA value (Not Available). NA values in the data can prevent some calculations, for example if we try to calculate the mean area of the protected areas with the mean() function:
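For example (a sketch with a mock of the area column; the values are made up):

```r
# Mock of the Rozloha..ha column with one missing value
area <- c(612.3, 45.1, 3.2, NA)

mean(area)  # NA - the missing value makes the result undefined
```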
This returns NA, which indicates that there are NA values in the data.
This can be explored further with the is.na() function, which returns a logical vector of TRUE and FALSE values. This vector can be simply checked with the unique() function (shows the unique values of a vector), or summarized with the table() function (counts the unique values of a vector).
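A sketch on the same mock vector:

```r
area <- c(612.3, 45.1, 3.2, NA)

is.na(area)          # FALSE FALSE FALSE TRUE
unique(is.na(area))  # FALSE TRUE - both values occur, so NA is present
table(is.na(area))   # counts: FALSE 3, TRUE 1
```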
Many functions allow skipping the NA values in a calculation, but it is always good to know that this is the case, rather than having the NA values omitted by default. We can use the na.rm argument to remove the NA values from the calculation:
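Continuing the mock-vector sketch:

```r
area <- c(612.3, 45.1, 3.2, NA)

mean(area, na.rm = TRUE)  # NA values removed before averaging
```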
Explore the NA value of the Rozloha..ha column
We know that we can perform calculations omitting the NA values with the na.rm argument, but it is good to know what the NA value in the data actually is.
We know that is.na() returns a logical vector of TRUE and FALSE values, and we can further use this vector to subset the data frame, returning only the row with an NA value in the Rozloha..ha column:
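A sketch on a mock data frame (in the real data the NA row turns out to be the last one, index 2681; the Nazev column and its values here are made up):

```r
df_demo <- data.frame(
  Nazev       = c("Area A", "Area B", "provider metadata"),
  Rozloha..ha = c(612.3, 45.1, NA)
)

df_demo[is.na(df_demo$Rozloha..ha), ]  # only the row(s) with NA area
```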
The returned row has index 2681, and as we know df has 2681 rows (observations), so it is the last row of the data. It is clear that the last row is not a valid observation; in this case it is a kind of metadata from the data provider, so it should be cleaned out of the data.
Removing the row
In this case there are many ways to remove the row. Here are some simple examples:
- simply with the [-2681] notation, where - means to exclude the row with index 2681:
- with the is.na() function, using the ! operator to negate (reverse) the logical vector, so it performs the opposite operation - get all rows where Rozloha..ha is not NA:
- with a subset based on a string value of some column
Choose the one you like the most, and assign it to the df object:
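The three approaches above can be sketched on the same mock data frame (row index, column name and string value are mock stand-ins for the real data, where the index would be 2681):

```r
df_demo <- data.frame(
  Nazev       = c("Area A", "Area B", "provider metadata"),
  Rozloha..ha = c(612.3, 45.1, NA)
)

# 1) exclude by row index (in the real data: df[-2681, ])
df_demo[-3, ]

# 2) keep only rows where Rozloha..ha is not NA
df_demo[!is.na(df_demo$Rozloha..ha), ]

# 3) subset on a string value of some column
df_demo[df_demo$Nazev != "provider metadata", ]

# assign the chosen variant back to the object
df_demo <- df_demo[!is.na(df_demo$Rozloha..ha), ]
```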
Data exploration
Get the protected areas with area larger than 500 ha and category NPR
What is the total area of the protected areas?
Which protected area is the largest?
Which category has the largest protected area?
How many protected areas are in each category?
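Sketches of how such questions can be answered with base R subsetting, again on a small mock data frame (column names follow the export, values are made up):

```r
df_demo <- data.frame(
  Nazev       = c("Area A", "Area B", "Area C", "Area D"),
  Kategorie   = c("NPR", "PR", "NPR", "PP"),
  Rozloha..ha = c(612.3, 45.1, 890.0, 700.2)
)

# protected areas larger than 500 ha with category NPR
df_demo[df_demo$Rozloha..ha > 500 & df_demo$Kategorie == "NPR", ]

# total area of the protected areas
sum(df_demo$Rozloha..ha)

# the largest protected area
df_demo$Nazev[which.max(df_demo$Rozloha..ha)]

# number of protected areas in each category
table(df_demo$Kategorie)
```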
Writing the output
As we can read .csv files, we can also create/write them. We can use the write.csv() function from the write.* family of functions. The write.csv() function writes a data.frame to a .csv file.
write.csv(df, "outputs/processed_data.csv", row.names = FALSE)
# we use the row.names = FALSE to not write the row names to the file
If the task was: Create an export of the protected areas with an area larger than 500 ha and category NPR, the entire script can look like this:
df <- read.csv("data/export.csv", sep = ";", dec = ",")
df <- df[!is.na(df$Rozloha..ha), ]
df <- df[df$Rozloha..ha > 500,]
df <- df[df$Kategorie == "NPR",]
write.csv(df, "outputs/processed_data.csv", row.names = FALSE)
or using the "and" operator &:
df <- read.csv("data/export.csv", sep = ";", dec = ",")
df <- df[!is.na(df$Rozloha..ha) & df$Rozloha..ha > 500 & df$Kategorie == "NPR", ]
write.csv(df, "outputs/processed_data.csv", row.names = FALSE)
Summary
Main outcomes
- understand the simple project workflow - working directory, project, relative paths
- read CSV files with the read.csv() function, and set its parameters to read the data correctly
- write CSV files with the write.csv() function
- work with NA values - check for NA values with is.na(), remove them from calculations with the na.rm argument, or subset the data to remove rows with NA values
- negate a logical vector with the ! operator, and subset data with - to exclude rows/columns with a specific index
Function overview
- read.csv() - read a CSV file and create a data frame
- write.csv() - write a data frame to a CSV file
- is.na() - return a logical vector of TRUE and FALSE values, where TRUE means the value is NA
- table() - count the unique values of a vector
- unique() - return the unique values of a vector
Practice exploration
Practice some exploration on the Zoraptera Occurrence Dataset (https://zenodo.org/records/14652555). The dataset contains information about occurrences of the order Zoraptera. A direct link to the dataset file is "https://raw.githubusercontent.com/kalab-oto/zoraptera-occurrence-dataset/refs/tags/1.1.0/zoraptera_occs.csv"
How many observations and variables are in the data?
There are 656 rows (observations) and 41 columns (variables)
Is there a species column in the data?
There is no species column
How many records are not identified to the species level?
581
What is the most common species in the data?
Usazoros hubbardi
What is the most common country in the data?
United States of America
How many records are from Fiji?
9