Data frames
A data frame in R is a structure similar to a common table. It is a two-dimensional structure of values in which the columns are vectors of the same length, each with its own data type. It can be created with the data.frame() function or by reading tabular data, for example with the read.csv() function.
Creating data frame
For example, we can create a data frame with three columns (taxon, sp_count and year) and one row with the values "birds", 10 and 2020. Use the desired column names as arguments of the data.frame() function and assign the values to these arguments:
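A minimal sketch of the call described above, using the column names and values from the text:

```r
# Argument names become column names; each value fills the single row
df <- data.frame(taxon = "birds", sp_count = 10, year = 2020)
df
```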
Info
Note that there is also a row number 1 on the left. This is not part of the data; it is a property (attribute) of the data frame object called row names. We will not use row names in this course.
As mentioned, columns are vectors, so they can contain multiple values, but all the vectors have to be of the same length. This is how we can create a data frame with multiple rows:
df <- data.frame(taxon = c("birds", "reptiles", "mammals", "birds"), sp_count = c(10, 2, 4, 15), year = c(2020, 2020, 2020, 2021))
df
It can also be written across multiple lines for better readability:
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)
Note
Also in this case the recycling rule applies:
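A small sketch of recycling inside data.frame(), with example values chosen here for illustration. The single year is repeated for every row. Recycling only works when the longer length is a whole multiple of the shorter one; otherwise data.frame() stops with an error.

```r
# year has length 1 and is recycled to match the two taxon values
df <- data.frame(taxon = c("birds", "reptiles"), year = 2020)
df
```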
Info
Most of the time you will work with data frames created by reading tables from files. We will learn how to work with files in the Working with files, workflow lesson.
Subsetting data - using indices
There are various ways to subset a data frame by row and column names and by position (index). The basic way is with the $ and [] operators. The [] operator works as it does with vectors, but with data frames we can subset both rows and columns.
$ operator
Let's start with $ operator. This simply returns the column (vector) by its name:
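A minimal sketch, using the df from this lesson (rebuilt here so the snippet runs on its own):

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

# Returns the taxon column as a plain vector
df$taxon
```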
Now you can quickly calculate some statistics on the column:
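For instance, a few base R statistics on the sp_count column (the choice of functions here is only an illustration):

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

mean(df$sp_count)  # 7.75
sum(df$sp_count)   # 31
max(df$sp_count)   # 15
```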
We will use this a lot in filtering data during the course.
[] operator
The bracket operator [] is used to subset rows and columns by name or index. The basic syntax is data_frame[row, column], where row and column can each be a name or an index (numeric or logical), the same as when we subset vectors. If you want to subset only rows or only columns, leave the other part empty (e.g. data_frame[row, ]).
Subsetting a single row results in a data frame with one row and all columns:
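A minimal sketch with the df from this lesson (rebuilt so the snippet runs alone):

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

# First row, all columns; the result is still a data frame
df[1, ]
```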
Subsetting with multiple values:
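A sketch that picks several rows at once by passing a vector of indices (rows 1 and 3 are an arbitrary choice for illustration):

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

# Rows 1 and 3, all columns
df[c(1, 3), ]
```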
Subsetting a single column results in a vector with the values of that column (similar to the $ operator):
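A minimal sketch, selecting the column either by name or by position:

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

df[, "sp_count"]  # 10 2 4 15
df[, 2]           # same column, selected by position
```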
But subsetting columns with multiple values returns a data frame:
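A sketch of both variants, by name and by position (the chosen columns are only an example):

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

# Two columns selected by name
df[, c("taxon", "year")]
# The same two columns selected by position
df[, c(1, 3)]
```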
Note
You can also subset with a single index, which returns a data frame with one column: the column selected by its index or name.
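A minimal sketch of single-index subsetting, by name and by position:

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

# One index, no comma: the result is a one-column data frame
df["taxon"]
df[1]
```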
Getting a specific value with row and column index:
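A sketch that picks one cell; the second row of sp_count is an arbitrary example:

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

df[2, "sp_count"]  # 2
df[2, 2]           # same cell, addressed by position
```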
"get the first two values oftaxon column":
Technical note: list? matrix?
In the previous lesson, we skipped the list data type because it will be explained later in the course. When you call the typeof() function on a data frame object, the result is "list". Don't be confused by this: a data frame is a special type of list, which will be explained in detail later in the course.
l <- list(taxon = c("birds", "reptiles", "mammals", "birds"), sp_count = c(10, 2, 4, 15), year = c(2020, 2020, 2020, 2021))
l
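A quick check of the statement about typeof(), sketched with the df from this lesson (rebuilt so the snippet runs alone):

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

typeof(df)   # "list"
class(df)    # "data.frame"
is.list(df)  # TRUE
```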
Simpler multi-dimensional structures also exist in R. These are matrices and arrays, which are atomic vectors with two (matrix) or more (array) dimensions; a matrix is simply a two-dimensional array.
A matrix with 2 rows and 3 columns:
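A minimal sketch; the name m and the values 1 to 6 are chosen here only for illustration:

```r
# matrix() fills the values column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)  # 2 3
```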
While a data frame is a special list, its rectangular structure also makes it similar to a matrix. That is why we can subset data in various ways: with $ and a single index [i] as with a list, but also with two indices [row, column] as with a matrix.
Using logical values for subsetting
Most of the time you will subset data by some condition, and you can do this with logical values in the row index, the same as we did with vectors. This returns only the rows whose positions correspond to the positions of TRUE values in the logical vector.
For example, if you want to subset only rows with sp_count greater than 5, you can do it like this:
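A minimal sketch with the df from this lesson (rebuilt so the snippet runs alone). The condition is evaluated first and yields a logical vector, which is then used as the row index:

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

df$sp_count > 5        # TRUE FALSE FALSE TRUE
df[df$sp_count > 5, ]  # keeps only the rows where the condition is TRUE
```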
So we keep only the 1st and 4th rows, because the condition is TRUE only in these rows.
Another example: get only the sp_count values of birds:
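A sketch that combines a logical row index with a column name:

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

# Rows where taxon is "birds", column sp_count only
df[df$taxon == "birds", "sp_count"]  # 10 15
```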
For birds only, we can then ask, for example, what the minimum number of recorded bird species in the data is:
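Two equivalent ways to compute that minimum, sketched with the df from this lesson (rebuilt so the snippet runs alone):

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

# Filter rows and pick the column in one step
min(df[df$taxon == "birds", "sp_count"])  # 10
# or take the column first and filter the vector
min(df$sp_count[df$taxon == "birds"])     # 10
```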
Mixing the subsetting methods
You can also mix the subsetting methods, for example by combining $ and [] in a single expression.
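A few combinations as a sketch; these particular expressions are illustrations chosen here, not an exhaustive list:

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

# Filter rows with [], then take a column with $
df[df$year == 2020, ]$taxon               # "birds" "reptiles" "mammals"
# Take a column with $, then filter the vector with []
df$taxon[df$sp_count > 5]                 # "birds" "birds"
# Chain two [] calls: bird rows first, then one cell of the result
df[df$taxon == "birds", ][2, "sp_count"]  # 15
```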
Inspecting data frame
The most common function to inspect objects in R is str(), which shows the structure of the object. It reports the class of the object (not so important for now), the number of rows and columns, and the data type of each column.
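A minimal sketch of the call, using the df from this lesson (rebuilt so the snippet runs alone):

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

# Prints one header line plus one line per column
str(df)
```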
'data.frame':    4 obs. of  3 variables:
 $ taxon   : chr  "birds" "reptiles" "mammals" "birds"
 $ sp_count: num  10 2 4 15
 $ year    : num  2020 2020 2020 2021
Question
Try function str() on other objects
The summary() function gives a summary that depends on the kind of object. For a data frame, it gives summary statistics for each column.
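A minimal sketch of the call, using the df from this lesson (rebuilt so the snippet runs alone):

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

# Character columns get length/class/mode, numeric columns get quartiles and mean
summary(df)
```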
    taxon              sp_count          year
 Length:4           Min.   : 2.00   Min.   :2020
 Class :character   1st Qu.: 3.50   1st Qu.:2020
 Mode  :character   Median : 7.00   Median :2020
                    Mean   : 7.75   Mean   :2020
                    3rd Qu.:11.25   3rd Qu.:2020
                    Max.   :15.00   Max.   :2021
To quickly access the dimensions of a data frame, you can use the functions dim(), nrow(), and ncol() or length(). dim() returns a vector with the number of rows and columns, while nrow() and ncol() return the number of rows and the number of columns, respectively:
- dim(): dimensions (rows and columns)
- nrow(): number of rows directly
- ncol(): number of columns directly
- length(): number of columns (same as ncol())
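These calls can be sketched on the df from this lesson (rebuilt so the snippet runs alone):

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

dim(df)     # 4 3
nrow(df)    # 4
ncol(df)    # 3
length(df)  # 3, because a data frame is a list of columns
```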
For larger data frames, you can use the head() and tail() functions to show the first and last rows of a data frame. By default they show 6 rows, but the number of rows can be specified with the n (second) argument:
first 6 rows of data frame, which is in our case all rows
first 2 rows of data frame
last 2 rows of data frame
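The three cases above can be sketched as follows, using the df from this lesson (rebuilt so the snippet runs alone):

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

head(df)         # up to 6 rows; df has only 4, so all are shown
head(df, n = 2)  # first 2 rows
tail(df, n = 2)  # last 2 rows
```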
Editing data frame
Using $ allows you to add a new column if the column name does not yet exist in the data frame:
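A minimal sketch, assuming the new column is called month and every row gets the value 8:

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021)
)

# month does not exist yet, so it is added; the single value is recycled
df$month <- 8
df
```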
     taxon sp_count year month
1    birds       10 2020     8
2 reptiles        2 2020     8
3  mammals        4 2020     8
4    birds       15 2021     8
If the column name already exists, the assignment replaces its values. The recycling rule applies here:
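A sketch of replacing an existing column, assuming a month column that is overwritten with the two values 7 and 8:

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021),
  month = 8
)

# Two values are recycled across four rows: 7 8 7 8
df$month <- c(7, 8)
df
```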
     taxon sp_count year month
1    birds       10 2020     7
2 reptiles        2 2020     8
3  mammals        4 2020     7
4    birds       15 2021     8
A specific value can also be edited with row and column index:
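A minimal sketch, assuming the month in the second row should become 9:

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2020, 2020, 2020, 2021),
  month = c(7, 8, 7, 8)
)

# Overwrite one cell: row 2, column month
df[2, "month"] <- 9
df
```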
     taxon sp_count year month
1    birds       10 2020     7
2 reptiles        2 2020     9
3  mammals        4 2020     7
4    birds       15 2021     8
Values can also be edited with logical values for the row index. For example, suppose you realize that the birds were actually recorded in 2019:
``` r
df[df$taxon == "birds", "year"] <- 2019
# or
# df$year[df$taxon == "birds"] <- 2019
# or
# df[df$taxon == "birds",]$year <- 2019
df
```
     taxon sp_count year month
1    birds       10 2019     7
2 reptiles        2 2020     9
3  mammals        4 2020     7
4    birds       15 2019     8
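The same kind of assignment also accepts one replacement value per selected row. A minimal sketch, assuming the two bird rows should receive the years 2022 and 2025, in row order:

```r
df <- data.frame(
  taxon = c("birds", "reptiles", "mammals", "birds"),
  sp_count = c(10, 2, 4, 15),
  year = c(2019, 2020, 2020, 2019),
  month = c(7, 9, 7, 8)
)

# Two rows match the condition, so two replacement values are supplied
df[df$taxon == "birds", "year"] <- c(2022, 2025)
df
```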
     taxon sp_count year month
1    birds       10 2022     7
2 reptiles        2 2020     9
3  mammals        4 2020     7
4    birds       15 2025     8
Summary
Main outcomes
- understand the structure of a data frame
- create a data frame with the data.frame() function
- use the [] and $ operators with data frames
Function overview
- data.frame() - create a data frame
- str() - show structure of object
- summary() - show summary statistics of each column
- dim() - show dimensions of data frame
- nrow() - show number of rows
- ncol()/length() - show number of columns
- head() - show first rows of data frame
- tail() - show last rows of data frame