6  Subsetting

If you have a selection of items in a basket, picking a few of the items from the basket is subsetting. You can also call it selecting a subset. Often we need to select a portion of the elements in a vector, a few columns in a data frame, or a subset of the rows and columns based on some conditions. Subsetting is an essential skill for every data professional.

Subsetting in R is very fast. For some, it may feel natural. For others, it may feel intimidating at first. But trust me, it's easy if you are patient. Let's start with how to subset an atomic vector.

6.1 Subsetting Atomic Vector

6.1.1 Using integer position

There are six ways to subset an atomic vector. However, we will focus on two that you can reuse for other data types.

R has three subsetting operators: [, [[, and $. Let’s see an example.

my_vec = c(10.1, 2.2, 32.3, 5.4)

The digit after the decimal point indicates the position of each element. If we want to select the first element of the vector, we write the vector name and put the position of the element inside the brackets. my_vec[1] will give you the first element of my_vec.

my_vec[1]
[1] 10.1

Notice that we used 1 to identify the first element. In R, the index starts at 1. In many popular programming languages such as Python and Java, the index starts at zero. Just keep this in mind.

To extract the second element, use [2]:

my_vec[2]
[1] 2.2

How about extracting multiple elements? Not a problem. Combine the positions of the elements with the c() function and place the result inside the subsetting operator [ applied to my_vec.

# extract the first and third element
my_vec[c(1, 3)] # subsetting with integer position
[1] 10.1 32.3

To exclude an element at a particular position, negate the position vector with -c():

# exclude the second element
my_vec[-c(2)]
[1] 10.1 32.3  5.4

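However, we cannot include and exclude at the same time. Mixing positive and negative positions raises an error, whose message is shown in the comment below.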

# keep first element but exclude the second
my_vec[c(1, -2)] # only 0's may be mixed with negative subscripts
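The restriction above can be explored with a minimal sketch. Here tryCatch() is used only for illustration, so that the error message is captured instead of stopping the script; the exact wording of the message may differ between R versions. The names msg, first_only, and without_2_and_4 are made up for this example.

```r
my_vec = c(10.1, 2.2, 32.3, 5.4)

# mixing positive and negative positions raises an error;
# tryCatch() captures the message instead of stopping the script
msg = tryCatch(
  my_vec[c(1, -2)],
  error = function(e) conditionMessage(e)
)
is.character(msg) # TRUE: an error message was captured

# to keep only the first element, a positive position is enough
first_only = my_vec[1]

# to drop several elements, put all of their positions inside -c()
without_2_and_4 = my_vec[-c(2, 4)]
without_2_and_4
```

In short, choose one direction per subsetting call: either list the positions to keep, or list the positions to drop.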

6.1.2 Using logical vectors

Suppose we want to select all the elements that are bigger than 9. First, we create a logical vector that records whether each element satisfies the condition:

my_vec > 9
[1]  TRUE FALSE  TRUE FALSE
# check the type of the resulting vector
typeof(my_vec > 9)
[1] "logical"

Now use the resulting logical vector to subset my_vec. Only the elements where the condition (greater than 9) is TRUE are returned.

my_vec[my_vec > 9] # subsetting with logical vector
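[1] 10.1 32.3

Use logical vectors instead of for loops

Subsetting with a logical vector is vectorized: R evaluates the condition and selects the matching elements over the whole vector in one step, which is usually much faster than visiting the elements one by one. In R, try to avoid for loops where a vectorized alternative exists.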
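To make the comparison concrete, here is a small sketch that selects the same elements twice, once with a for loop and once with a logical vector. The names loop_result and vec_result are hypothetical and exist only for this illustration.

```r
my_vec = c(10.1, 2.2, 32.3, 5.4)

# loop version: test each element and grow the result one value at a time
loop_result = c()
for (value in my_vec) {
  if (value > 9) {
    loop_result = c(loop_result, value)
  }
}

# vectorized version: one comparison and one subsetting step
vec_result = my_vec[my_vec > 9]

# both approaches select the same elements
identical(loop_result, vec_result)
```

The two results are the same, but the vectorized version is shorter and avoids repeatedly growing a vector inside the loop.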

6.2 Subsetting Matrix

Since a matrix is a two-dimensional object with rows and columns, subsetting must specify both dimensions.

Let us create a matrix my_mat whose elements 1:50 are arranged in 10 rows and 5 columns.

my_mat = matrix(1:50, nrow = 10, ncol = 5)
my_mat
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1   11   21   31   41
 [2,]    2   12   22   32   42
 [3,]    3   13   23   33   43
 [4,]    4   14   24   34   44
 [5,]    5   15   25   35   45
 [6,]    6   16   26   36   46
 [7,]    7   17   27   37   47
 [8,]    8   18   28   38   48
 [9,]    9   19   29   39   49
[10,]   10   20   30   40   50

The first element is 1, which is located in the first row and first column. That is, the location of the first element is [1, 1], where the first index gives the row position and the second index gives the column position.

# extract the first element
my_mat[1,1]
[1] 1

6.2.1 Subsetting entire row of a matrix

# subset the first row
my_mat[1, ]
[1]  1 11 21 31 41

6.2.2 Subsetting multiple rows of a matrix

# subset rows 1, 2, and 4 and return all columns
my_mat[c(1, 2, 4), ]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1   11   21   31   41
[2,]    2   12   22   32   42
[3,]    4   14   24   34   44

6.2.3 Subsetting a column

# subset the second column and keep values in all rows
my_mat[ , 2]
 [1] 11 12 13 14 15 16 17 18 19 20
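Row and column selections can also be combined in a single call. The sketch below is one possible illustration; the object names block, big_rows, col_vec, and col_mat are made up for this example. It also shows the drop = FALSE argument of [, which keeps the result as a matrix when only one column is selected.

```r
my_mat = matrix(1:50, nrow = 10, ncol = 5)

# rows 1 and 2, columns 2 and 3: the result is a 2 x 2 matrix
block = my_mat[c(1, 2), c(2, 3)]
block

# rows chosen by a condition on the first column, all columns kept
big_rows = my_mat[my_mat[, 1] > 8, ]
big_rows

# a single column is simplified to a vector by default;
# drop = FALSE keeps it as a one-column matrix
col_vec = my_mat[, 2]
col_mat = my_mat[, 2, drop = FALSE]
dim(col_vec) # NULL
dim(col_mat) # 10 1
```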

6.3 Subsetting List

# Create the list

my_list = list(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'm')
)

# prints the list
my_list
$serial
[1] 1 2 3 4 5

$age
[1] 10 11 20 30 32

$sex
[1] "M" "F" "F" "M" "m"

A list can be subsetted using the $ operator too.

my_list$age
[1] 10 11 20 30 32

Alternatively, the [ or the [[ operator can be used, depending on what you want to extract: [ returns a list that contains the selected item, while [[ returns the item itself.

# returns a list that contains the first item
my_list[1]
$serial
[1] 1 2 3 4 5
# returns the first item itself (an integer vector)
my_list[[1]]
[1] 1 2 3 4 5

You can also use the name of the vector within the list to extract the list item:

my_list['age']
$age
[1] 10 11 20 30 32

To extract the second element of the named vector age:

my_list[['age']][2]
[1] 11
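The difference between [ and [[ on a list is easiest to see by checking the class of each result. The following is a minimal sketch with a shortened version of the list; the name mean_age is hypothetical.

```r
my_list = list(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32)
)

# [ keeps the list wrapper: the result is a list of length one
class(my_list['age'])   # "list"

# [[ and $ return the item itself: here, a numeric vector
class(my_list[['age']]) # "numeric"
class(my_list$age)      # "numeric"

# arithmetic works on the extracted vector, not on the one-item list
mean_age = mean(my_list[['age']])
mean_age
```

This is why my_list[['age']][2] works as expected: the second [ is applied to the vector that [[ returned.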

You can also use the [[ operator on a data frame to extract a column, and then the elements within that column. This works because a data frame is a list of equal-length vectors, so its data type is list. For example, after converting my_list to a data frame:

typeof(data.frame(my_list))
[1] "list"

6.4 Subsetting Data Frame

Data frames are created using the data.frame() function by supplying a list of columns. A data frame (often written data.frame) is of list data type, with one important distinction. A list can have elements of unequal length, whereas in a data frame all the columns must have the same length, which makes the data frame a rectangular table.
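The length requirement can be checked with a small sketch. The objects uneven, result, and ok are hypothetical names used only here, and tryCatch() is used so that the error is captured rather than stopping the script.

```r
# a list accepts elements of different lengths
uneven = list(a = 1:3, b = 1:5)
lengths(uneven)

# data.frame() refuses columns of length 3 and 5;
# tryCatch() returns the string 'error' instead of stopping
result = tryCatch(
  data.frame(a = 1:3, b = 1:5),
  error = function(e) 'error'
)
result

# with equal lengths the data frame is created
ok = data.frame(a = 1:3, b = 4:6)
dim(ok) # 3 2
```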

x = c(1, 2, 3)
my_list = list(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32), 
  sex = c('M', 'F', 'F', 'M', 'M')
)
df = data.frame(my_list)

df
  serial age sex
1      1  10   M
2      2  11   F
3      3  20   F
4      4  30   M
5      5  32   M

If you look at the data type of df using typeof(df), you will see it's a list.

typeof(df)
[1] "list"

To view the structure of the df object:

str(df)
'data.frame':   5 obs. of  3 variables:
 $ serial: int  1 2 3 4 5
 $ age   : num  10 11 20 30 32
 $ sex   : chr  "M" "F" "F" "M" ...

To select a column, we use the $ operator:

df$age
[1] 10 11 20 30 32
df$serial
[1] 1 2 3 4 5
df$sex
[1] "M" "F" "F" "M" "M"

The data type of the extracted column age is double. Likewise, the data type of sex is character.
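The column types can be confirmed with typeof(). In this sketch the data frame is rebuilt so that the block runs on its own. Setting stringsAsFactors = FALSE is an assumption added here so that sex stays a character vector in older R versions as well, and col_types is a hypothetical name.

```r
df = data.frame(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'M'),
  stringsAsFactors = FALSE # assumption: keep sex as character in older R too
)

typeof(df$serial) # "integer"
typeof(df$age)    # "double"
typeof(df$sex)    # "character"

# sapply() applies typeof() to every column at once
col_types = sapply(df, typeof)
col_types
```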

6.4.1 Selecting rows using conditions

To select all rows where sex is male, build a logical vector from the sex column and use it in the row position:

df$sex
[1] "M" "F" "F" "M" "M"
df$sex == 'M'
[1]  TRUE FALSE FALSE  TRUE  TRUE
# subset the males
df[df$sex == 'M', ]
  serial age sex
1      1  10   M
4      4  30   M
5      5  32   M

If you want to select only the age and sex columns for the rows where sex is 'M':

df[df$sex == 'M', c('age', 'sex')]
  age sex
1  10   M
4  30   M
5  32   M

Alternatively, we could use the integer positions of the columns to select them:

df[df$sex == 'M', c(2, 3)]
  age sex
1  10   M
4  30   M
5  32   M
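Conditions can also be combined with & (and) and | (or) before they are used to select rows. The following is a sketch on the same data; the names older_males and females_or_young are made up for this example.

```r
df = data.frame(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'M')
)

# males older than 20: both conditions must be TRUE
older_males = df[df$sex == 'M' & df$age > 20, ]
older_males

# females, or anyone younger than 11: at least one condition must be TRUE
females_or_young = df[df$sex == 'F' | df$age < 11, c('serial', 'age')]
females_or_young
```

Note the use of the single & and |, which compare element by element; the double forms && and || are meant for single values and are not suitable for subsetting.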

6.5 Assigning values with Subsetting

Subsetting can also be used to assign new values. This is also known as ‘setting’ a value.

6.5.1 Atomic Vector

my_vec2 = my_vec

# replace the value in the second position

my_vec2[2] = 20.2

my_vec
[1] 10.1  2.2 32.3  5.4
my_vec2
[1] 10.1 20.2 32.3  5.4

6.5.2 Matrix

my_mat2 = my_mat

my_mat2[1, 1] = 10
my_mat2
      [,1] [,2] [,3] [,4] [,5]
 [1,]   10   11   21   31   41
 [2,]    2   12   22   32   42
 [3,]    3   13   23   33   43
 [4,]    4   14   24   34   44
 [5,]    5   15   25   35   45
 [6,]    6   16   26   36   46
 [7,]    7   17   27   37   47
 [8,]    8   18   28   38   48
 [9,]    9   19   29   39   49
[10,]   10   20   30   40   50

6.5.3 List

my_list2 = my_list

new_age = my_list$age + 10
my_list2$age = new_age

my_list
$serial
[1] 1 2 3 4 5

$age
[1] 10 11 20 30 32

$sex
[1] "M" "F" "F" "M" "M"
my_list2
$serial
[1] 1 2 3 4 5

$age
[1] 20 21 30 40 42

$sex
[1] "M" "F" "F" "M" "M"

We can also add a new element to the list object:

my_list$new_age = my_list$age + 20

my_list
$serial
[1] 1 2 3 4 5

$age
[1] 10 11 20 30 32

$sex
[1] "M" "F" "F" "M" "M"

$new_age
[1] 30 31 40 50 52

6.5.4 Data frames

Since data frames are lists, the same rules apply for subsetting and for assigning new values and elements (equivalently, adding new columns to the data frame).
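As an illustration, the sketch below applies the same assignment patterns to a data frame. It works on a fresh copy named df2 (a hypothetical name) so that the original df is left untouched.

```r
df2 = data.frame(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'M')
)

# replace one value: the age in the second row
df2$age[2] = 12

# add a new column, just like adding a new element to a list
df2$new_age = df2$age + 20

# replace values only in the rows that satisfy a condition
df2[df2$sex == 'F', 'age'] = 0

df2
```

The new column is computed before the conditional replacement, so new_age still reflects the ages as they were at that point.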

6.5.5 Exercise

  1. Create a matrix object and explore its attributes. What difference do you see from the attributes of a data frame?
x = matrix(1:10, ncol=2)
x
attributes(x)
  2. Create a list object and explore its attributes.

  3. Create a data frame object and explore its attributes.
