6 Subsetting
If you have a selection of items in a basket, picking a few of those items is subsetting; you can also call it selecting a subset. Often we need to select a portion of the elements in a vector, a few columns in a data frame, or a subset of the rows and columns based on some conditions. Subsetting is an essential skill for every data professional.
Subsetting in R is very fast. For some, it may feel natural. For others, it may feel intimidating at first. But trust me, it's easy if you are patient. Let's start with how to subset an atomic vector.
6.1 Subsetting Atomic Vector
6.1.1 Using integer position
There are six ways to subset an atomic vector. However, we will focus on two that you can reuse for other data types.
First, there are three subsetting operators. These are [, [[, and $. Let's see an example.
my_vec = c(10.1, 2.2, 32.3, 5.4)
The digit after the decimal point in each value indicates the position of that element in the vector. If we want to select the first element of the vector, we use the vector name and put the position of the element within the brackets. my_vec[1] will give you the first element of my_vec.
my_vec[1]
[1] 10.1
Notice that we used 1 to identify the first element. In R, the index starts at 1. In many popular programming languages such as Python and Java, the index starts at zero. Just keep this in mind.
To extract the second element, use [2]:
my_vec[2]
[1] 2.2
How about extracting multiple elements? Not a problem. Just combine the elements' positions with the c() function and pass the result to the subsetting operator [ on the my_vec object.
# extract the first and third elements
my_vec[c(1, 3)] # subsetting with integer position
[1] 10.1 32.3
To exclude an element at a particular position, negate the position vector with -c().
# exclude the second element
my_vec[-c(2)]
[1] 10.1 32.3 5.4
However, we cannot include and exclude at the same time; positive and negative positions cannot be mixed in one call.
# keep first element but exclude the second
my_vec[c(1, -2)] # error: only 0's may be mixed with negative subscripts
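The rules for negative positions can be checked with a small sketch. The vector name v is chosen for this example only (it holds the same values as my_vec), and tryCatch() is used here simply so the error can be inspected without stopping the script.

```r
# v is a hypothetical name for this sketch; same values as my_vec
v = c(10.1, 2.2, 32.3, 5.4)

# Mixing positive and negative positions raises an error;
# tryCatch() captures it so we can look at it
result = tryCatch(v[c(1, -2)], error = function(e) e)
inherits(result, "error")   # TRUE

# Zeros are ignored, so they may appear alongside negative positions
v[c(0, -2)]                 # same result as v[-2]

# Several positions can be excluded at once
v[-c(2, 4)]                 # 10.1 32.3
```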
6.1.2 Using logical vectors
We want to select all the elements that are bigger than 9. First, we create a logical vector that marks which elements satisfy the condition.
my_vec > 9
[1] TRUE FALSE TRUE FALSE
# check the type of the resulting vector
typeof(my_vec > 9)
[1] "logical"
Now use the resulting logical vector to subset my_vec, returning only the elements where the condition (greater than 9) is TRUE.
my_vec[my_vec > 9] # subsetting with logical vector
[1] 10.1 32.3
Subsetting with a logical vector is extremely fast. In R, try to avoid for loops where a vectorized alternative exists; use logical vectors for vectorized computation instead.
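To make the contrast with a for loop concrete, here is a minimal sketch that computes the same subset both ways. The names v, big_loop, and big_vec are made up for this illustration.

```r
# v is a hypothetical name for this sketch; same values as my_vec
v = c(10.1, 2.2, 32.3, 5.4)

# Loop version: test each element and grow the result one value at a time
big_loop = c()
for (value in v) {
  if (value > 9) {
    big_loop = c(big_loop, value)
  }
}

# Vectorized version: one comparison, one subsetting step
big_vec = v[v > 9]

identical(big_loop, big_vec)   # TRUE

# Conditions can be combined with & (and) and | (or)
v[v > 3 & v < 30]              # 10.1 5.4

# which() turns the logical vector into integer positions
which(v > 9)                   # 1 3
```

Both versions return the same values; the vectorized form is shorter and leaves the element-by-element work to R.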
6.2 Subsetting Matrix
Since a matrix is a two-dimensional object with rows and columns, subsetting must use both of its dimensions.
Let us create a matrix my_mat whose elements 1:50 are arranged in 10 rows and 5 columns.
my_mat = matrix(1:50, nrow = 10, ncol = 5)
my_mat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 11 21 31 41
[2,] 2 12 22 32 42
[3,] 3 13 23 33 43
[4,] 4 14 24 34 44
[5,] 5 15 25 35 45
[6,] 6 16 26 36 46
[7,] 7 17 27 37 47
[8,] 8 18 28 38 48
[9,] 9 19 29 39 49
[10,] 10 20 30 40 50
The first element is 1, which is located in the first row and first column. That is, the location of the first element is [1, 1], where the first number is the row position and the second number is the column position.
# extract the first element
my_mat[1, 1]
[1] 1
6.2.1 Subsetting entire row of a matrix
# subset the first row
my_mat[1, ]
[1] 1 11 21 31 41
6.2.2 Subsetting multiple rows of a matrix
# subset rows 1, 2, and 4 and return all columns
my_mat[c(1, 2, 4), ]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 11 21 31 41
[2,] 2 12 22 32 42
[3,] 4 14 24 34 44
6.2.3 Subsetting a column
# subset the second column and keep values in all rows
my_mat[ , 2]
[1] 11 12 13 14 15 16 17 18 19 20
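Rows and columns can also be selected together, and a logical condition works on a matrix much as it does on a vector. The sketch below uses a matrix m (a name picked for this example) built the same way as my_mat, and shows the drop argument of [, which controls whether a single row or column is simplified to a plain vector.

```r
# m is a hypothetical name for this sketch; same contents as my_mat
m = matrix(1:50, nrow = 10, ncol = 5)

# Rows 1 to 3 of columns 2 and 4
m[1:3, c(2, 4)]

# A single column is simplified to a plain vector by default
is.matrix(m[, 2])              # FALSE

# drop = FALSE keeps the result as a one-column matrix
dim(m[, 2, drop = FALSE])      # 10 1

# Logical subsetting: rows whose first-column value is greater than 8
m[m[, 1] > 8, ]
```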
6.3 Subsetting List
# Create the list
my_list = list(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'm')
)
# prints the list
my_list
$serial
[1] 1 2 3 4 5
$age
[1] 10 11 20 30 32
$sex
[1] "M" "F" "F" "M" "m"
A list can be subset by name using the $ operator.
my_list$age
[1] 10 11 20 30 32
Alternatively, the [ or the [[ operator can be used, depending on exactly what you want to extract.
# extracts the first item, returned as a list of length one
my_list[1]
$serial
[1] 1 2 3 4 5
# extracts the contents of the first item of the list
my_list[[1]]
[1] 1 2 3 4 5
You can also use the name of an item within the list to extract it.
my_list['age']
$age
[1] 10 11 20 30 32
To extract the second element of the named vector age:
my_list[['age']][2]
[1] 11
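The difference between [ and [[ is easiest to see by checking what each one returns. This is a small sketch on a list named lst, a name chosen for this example only.

```r
# lst is a hypothetical name for this sketch
lst = list(serial = 1:5, age = c(10, 11, 20, 30, 32))

# Single bracket keeps the list wrapper: the result is a list of length 1
is.list(lst[1])                    # TRUE

# Double bracket returns the item itself, here an integer vector
is.list(lst[[1]])                  # FALSE
typeof(lst[["age"]])               # "double"

# $ is shorthand for [[ with a name
identical(lst$age, lst[["age"]])   # TRUE

# Selecting several items at once requires the single bracket
names(lst[c("serial", "age")])     # "serial" "age"
```

As a rule of thumb, reach for [ when you want a smaller list and for [[ or $ when you want the contents of one item.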
You can also use the [[ operator on a data frame to extract its columns and the elements within those columns. This is because a data frame is a collection of atomic vectors and its data type is list. For example, converting my_list to a data frame and checking its type:
typeof(data.frame(my_list))
[1] "list"
6.4 Subsetting Data Frame
Data frames are created using the data.frame() function by supplying a list of columns. A data.frame, as it is typically called, is of the list data type with one important distinction. A list can have elements of unequal length. In a data.frame, all the elements must have the same length, which makes the data.frame a true rectangular array.
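The equal-length requirement can be seen in a short sketch. The lists even and uneven are made up for this illustration, and tryCatch() is used only so the error from data.frame() can be inspected.

```r
# even and uneven are hypothetical names for this sketch
even   = list(a = 1:3, b = c("x", "y", "z"))
uneven = list(a = 1:3, b = c("x", "y"))

# A list is happy to hold elements of different lengths
lengths(uneven)                          # a: 3, b: 2

# Equal-length elements convert to a 3 x 2 data frame
dim(data.frame(even))                    # 3 2

# Lengths 3 and 2 do not form a rectangle, so data.frame() raises an error
result = tryCatch(data.frame(uneven), error = function(e) e)
inherits(result, "error")                # TRUE
```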
x = c(1, 2, 3)
my_list = list(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'M')
)
df = data.frame(my_list)
df
serial age sex
1 1 10 M
2 2 11 F
3 3 20 F
4 4 30 M
5 5 32 M
If you look at the data type of df using typeof(df), you will see it's a list.
typeof(df)
[1] "list"
To view the structure of the df object:
str(df)
'data.frame': 5 obs. of 3 variables:
$ serial: int 1 2 3 4 5
$ age : num 10 11 20 30 32
$ sex : chr "M" "F" "F" "M" ...
To select a column, we use the $ operator.
df$age
[1] 10 11 20 30 32
df$serial
[1] 1 2 3 4 5
df$sex
[1] "M" "F" "F" "M" "M"
The data type of the extracted column age is double. Likewise, the data type of sex is character.
6.4.1 Selecting rows using conditions
Select all rows where the sex is male. First look at the column, then build the logical condition.
df$sex
[1] "M" "F" "F" "M" "M"
df$sex == 'M'
[1] TRUE FALSE FALSE TRUE TRUE
# subset the males
df[df$sex == 'M', ]
serial age sex
1 1 10 M
4 4 30 M
5 5 32 M
If you want to select only the age and sex columns for the rows where sex is 'M':
df[df$sex == 'M', c('age', 'sex')]
age sex
1 10 M
4 30 M
5 32 M
Alternatively, we could use the integer column positions to select the columns.
df[df$sex == 'M', c(2, 3)]
age sex
1 10 M
4 30 M
5 32 M
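Row conditions can be combined just as with vectors. The following sketch rebuilds the same data under the name people (chosen for this example so that df is left untouched) and shows a compound condition, the effect of drop when a single column is selected, and base R's subset() as an alternative spelling.

```r
# people is a hypothetical name for this sketch; same data as df
people = data.frame(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'M')
)

# Males older than 20: combine two conditions with &
people[people$sex == 'M' & people$age > 20, ]

# Selecting a single column returns a plain vector by default
people[people$age > 15, 'age']                 # 20 30 32

# drop = FALSE keeps the result as a data frame
people[people$age > 15, 'age', drop = FALSE]

# subset() expresses the same kind of selection with column names
subset(people, sex == 'F', select = c(serial, age))
```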
6.5 Assigning values with Subsetting
Subsetting can also be used on the left-hand side of an assignment to replace existing values or add new ones. This is also known as ‘setting’ a value.
6.5.1 Atomic Vector
my_vec2 = my_vec
# replace the value in the second position
my_vec2[2] = 20.2
my_vec
[1] 10.1 2.2 32.3 5.4
my_vec2
[1] 10.1 20.2 32.3 5.4
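Assignment also works with a logical condition or with several positions at once. A minimal sketch, using the made-up names v and v2:

```r
# v and v2 are hypothetical names for this sketch; same values as my_vec
v = c(10.1, 2.2, 32.3, 5.4)
v2 = v

# Replace every value greater than 9 with 0
v2[v2 > 9] = 0
v2                       # 0.0 2.2 0.0 5.4

# Replace positions 2 and 4 with two new values
v2[c(2, 4)] = c(-1, -2)
v2                       # 0 -1 0 -2

# The original vector is unchanged because v2 is a copy
v                        # 10.1 2.2 32.3 5.4
```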
6.5.2 Matrix
my_mat2 = my_mat
my_mat2[1, 1] = 10
my_mat2
[,1] [,2] [,3] [,4] [,5]
[1,] 10 11 21 31 41
[2,] 2 12 22 32 42
[3,] 3 13 23 33 43
[4,] 4 14 24 34 44
[5,] 5 15 25 35 45
[6,] 6 16 26 36 46
[7,] 7 17 27 37 47
[8,] 8 18 28 38 48
[9,] 9 19 29 39 49
[10,] 10 20 30 40 50
6.5.3 List
my_list2 = my_list
new_age = my_list$age + 10
my_list2$age = new_age
my_list
$serial
[1] 1 2 3 4 5
$age
[1] 10 11 20 30 32
$sex
[1] "M" "F" "F" "M" "M"
my_list2
$serial
[1] 1 2 3 4 5
$age
[1] 20 21 30 40 42
$sex
[1] "M" "F" "F" "M" "M"
Adding a new element to the list object:
my_list$new_age = my_list$age + 20
my_list
$serial
[1] 1 2 3 4 5
$age
[1] 10 11 20 30 32
$sex
[1] "M" "F" "F" "M" "M"
$new_age
[1] 30 31 40 50 52
6.5.4 Data frames
Since data frames are lists, the same rules apply for subsetting and assigning new values. Adding a new element to the list is equivalent to adding a new column to the data frame.
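As an illustration, the sketch below applies the list-style assignments to a data frame. The names people and people2 and the column age_next_year are invented for this example, and the lowercase 'm' is included on purpose so there is something to recode.

```r
# people and people2 are hypothetical names for this sketch
people = data.frame(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'm')
)
people2 = people

# Add a new column, just like adding a new element to a list
people2$age_next_year = people2$age + 1

# Replace values in an existing column using a logical condition
people2$sex[people2$sex == 'm'] = 'M'

# Assigning NULL removes a column
people2$serial = NULL

names(people2)   # "age" "sex" "age_next_year"
people2$sex      # "M" "F" "F" "M" "M"
```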
6.5.5 Exercise
- Create a matrix object and explore its attributes. What difference do you see from the attributes of a data frame?
x = matrix(1:10, ncol = 2)
x
attributes(x)
- Create a list object and explore its attributes.
- Create a data frame object and explore its attributes.