3  Data Types

R’s data types are a bit complex. To keep it simple, everything in R can be thought of as a vector. Vectors are of two kinds.

  • Atomic vector
  • List

Atomic vectors are of several types, as shown in the diagram below.

[Diagram: Atomic Vector branches into Logical, Integer, Double, Character, Factor, and Date.]

3.1 Atomic Vector

Before discussing vectors, we first need to understand the term scalar. A scalar is a single, individual value. For example, the age of one individual is a scalar, whereas the ages of several individuals collected together form a vector.

In most practical situations we work with a vector, which is a collection of scalars of the same type.

In R, we combine values into a vector with the c() function. The c in c() is short for combine.

Let us create four types of atomic vectors.

logical_vec = c(TRUE, TRUE, F, T)
double_vec = c(1, 2, 10, 5)
integer_vec = c(1L, 2L, 10L, 5L)
character_vec = c('Dhaka', 'New York', 'Anything')

To check the type of each vector, use the typeof() function.

typeof(logical_vec)
[1] "logical"
typeof(double_vec)
[1] "double"
typeof(integer_vec)
[1] "integer"
typeof(character_vec)
[1] "character"
# 10L/3L is a double (3.333...); as.integer() drops the fractional part
as.integer(10L/3L)
[1] 3
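R has no separate scalar type: a single value is just a vector of length one. The short sketch below illustrates this, and also shows what happens when values of different types are passed to c(). The variable names age and mixed are made up for illustration.

```r
# A "scalar" in R is a vector of length one
age = 25
is.vector(age)        # TRUE
length(age)           # 1

# An atomic vector holds a single type, so mixed input is coerced
# to the most flexible type present (here: character)
mixed = c(1, 'Dhaka', TRUE)
typeof(mixed)         # "character"
mixed                 # "1" "Dhaka" "TRUE"

# Logical values become numbers when combined with doubles
typeof(c(1.5, TRUE))  # "double"
```

This is why a vector is described as a collection of scalars of the same type: R enforces it silently through coercion.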

3.2 Matrix

Matrices are atomic vectors with additional attributes. For example, a matrix has a dimension attribute, which can be viewed or set with the dim() function. In the example below, an atomic vector is assigned a dimension attribute of 2×2, read as two-by-two. This means the object has two rows and two columns.

a = c(1, 2, 3, 4)
dim(a) = c(2,2)
a
     [,1] [,2]
[1,]    1    3
[2,]    2    4
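To see that the dimension really is just an attribute attached to the vector, one can inspect the object before and after setting it. This is a small sketch; the names v and m are illustrative only.

```r
v = c(1, 2, 3, 4)
attributes(v)      # NULL: a plain vector carries no attributes

m = v
dim(m) = c(2, 2)   # attach a dimension attribute
attributes(m)      # a list with a single element, $dim
is.matrix(m)       # TRUE
typeof(m)          # "double": the underlying type is unchanged

dim(m) = NULL      # removing the attribute gives back a plain vector
is.matrix(m)       # FALSE
```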

We can also create a matrix using the matrix() function as follows:

a = matrix(1:10, nrow = 2, ncol = 5)
a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

The dimension of this matrix is 2 × 5, which means there are 2 rows and 5 columns.
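The dimensions can also be queried programmatically rather than read off the printout. A minimal sketch using base R functions:

```r
a = matrix(1:10, nrow = 2, ncol = 5)

dim(a)     # 2 5  (rows first, then columns)
nrow(a)    # 2
ncol(a)    # 5
length(a)  # 10: the total number of elements in the underlying vector
```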

To learn more about an R function, type a question mark (?) before the function name. For example, to learn about the matrix function, type ?matrix in the R console. A space between ? and matrix is also permitted: ? matrix (notice the space in between) displays the same documentation.

3.2.1 Elements Recycling

One important aspect of the matrix data type is that if fewer elements are supplied than nrow × ncol, the elements are recycled. For example, below we create a matrix from only 9 elements, generated by 1:9. However, nrow = 2 and ncol = 5 indicate that there should be 2 × 5 = 10 elements. Since we have 9 elements to fill 10 cells, R starts recycling from the beginning and prints a warning. Note that this is just a warning; R does not stop the computation because of it.

a = matrix(1:9, nrow = 2, ncol = 5)
Warning in matrix(1:9, nrow = 2, ncol = 5): data length [9] is not a sub-
multiple or multiple of the number of rows [2]
a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8    1

Try another one:

a = matrix(1:5, nrow = 2, ncol = 5)
a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    2    4
[2,]    2    4    1    3    5
Note

Did you notice in what order R fills the cells of a matrix? We are filling 1:5 into 10 cells, and R fills them column-wise. That is, it first fills all the rows of the first column, then moves to the second column and fills all of its rows, and so on, recycling the elements once they are exhausted.
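One way to reason about recycling is to build the recycled vector explicitly. The sketch below uses rep_len() from base R to repeat 1:5 until it has 10 elements, and checks that the result matches what matrix() produces on its own. The names filled, a, and b are illustrative.

```r
# Repeat 1:5 until there are 10 values
filled = rep_len(1:5, length.out = 10)
filled            # 1 2 3 4 5 1 2 3 4 5

a = matrix(1:5, nrow = 2, ncol = 5)      # relies on recycling
b = matrix(filled, nrow = 2, ncol = 5)   # recycling done by hand
identical(a, b)   # TRUE

# Column-wise filling: the third column holds the 5th and 6th values
a[, 3]            # 5 1
```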

How to fill row-wise?

What if you wanted to fill the elements row-wise? Can you figure out how to do that? Hint: open the documentation for matrix using ? matrix in the R console. Then look for the parameter byrow, whose default is FALSE. Change it to byrow = TRUE to fill the values row-wise.

Let's try that with the byrow = TRUE argument.

a = matrix(1:5, nrow = 2, ncol = 5, byrow = TRUE)
a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    1    2    3    4    5

It now fills the elements row-wise. It fills all the columns of the first row, then moves to the second row and fills all of its columns, and so on.
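The difference between the two fill orders is easier to see without recycling. The following sketch fills the same six values both ways; by_col and by_row are illustrative names.

```r
by_col = matrix(1:6, nrow = 2, ncol = 3)                # default: byrow = FALSE
by_row = matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3

# Filling row-wise gives the same result as transposing
# a column-wise matrix with the dimensions swapped
identical(by_row, t(matrix(1:6, nrow = 3, ncol = 2)))   # TRUE
```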

3.3 List

A list is also a vector, but it is a collection of one or more objects, typically atomic vectors, which may differ in type and length. To create a list, we pass one or more atomic vectors to the list() function.

a = list(
  1:10,
  c(T, F, F)
)
a
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
[1]  TRUE FALSE FALSE

The type of a is list, as shown below.

typeof(a)
[1] "list"

Elements of a list can be named.

a = list(
  series = 1:10,
  series2 = c(1, 10),
  tf = c(T, F, F)
)

a
$series
 [1]  1  2  3  4  5  6  7  8  9 10

$series2
[1]  1 10

$tf
[1]  TRUE FALSE FALSE

To view the structure of an R object, use the str() function.

str(a)
List of 3
 $ series : int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ series2: num [1:2] 1 10
 $ tf     : logi [1:3] TRUE FALSE FALSE
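Because each element of a list keeps its own type and length, it can be handy to summarise them per element. A small sketch using base R's lengths() and sapply():

```r
a = list(
  series = 1:10,
  series2 = c(1, 10),
  tf = c(TRUE, FALSE, FALSE)
)

length(a)           # 3: the number of elements in the list
lengths(a)          # length of each element: 10, 2, 3
sapply(a, typeof)   # type of each element: "integer" "double" "logical"
```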

3.3.1 Exercise

Put some quiz questions about the data types discussed so far. Use the list below for guidance.

  • Atomic vector
  • Scalar
  • Integer, Logical, Character, Double
  • List
  • Matrix
  • Dimension of a matrix

3.4 Factor

The factor data type is built on top of integer. Factors are also known as ‘category’ and ‘enumerated’ types. Like the types demonstrated previously, factors are vectors.

city = c('Dhaka', 'Rajshahi', 'Chottogram', 
         'Kumilla', 'Sylhet')
print(city)
[1] "Dhaka"      "Rajshahi"   "Chottogram" "Kumilla"    "Sylhet"    
typeof(city)
[1] "character"

The city object is of character type. You can create a factor from the character vector by wrapping it with the factor() function.

city_factor = factor(city)

You can verify that the type of city_factor is integer:

typeof(city_factor)
[1] "integer"
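The integer type makes more sense once the pieces of a factor are pulled apart: the integers are positions into a vector of level labels. The sketch below assumes the default behaviour of factor(), which sorts the levels alphabetically.

```r
city = c('Dhaka', 'Rajshahi', 'Chottogram', 'Kumilla', 'Sylhet')
city_factor = factor(city)

class(city_factor)        # "factor"
levels(city_factor)       # the labels, sorted alphabetically by default
as.integer(city_factor)   # 2 4 1 3 5: position of each value in the levels

# Indexing the levels by the integer codes recovers the original values
levels(city_factor)[as.integer(city_factor)]
```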
Tip

A factor can only contain predefined values, often called levels. You can set a level even if the data do not contain that value.

You can assign levels to a factor variable:

# social status as a character vector
status_char = c('High', 'Medium')

# social status as a factor vector
status_factor = factor(
  status_char, 
  levels = c('High', 'Medium', 'Low')
)

# print the values
print(status_factor)
[1] High   Medium
Levels: High Medium Low

Running the table() function displays each distinct element of the vector along with the number of times it occurs.

table(status_char)
status_char
  High Medium 
     1      1 

Doing the same on the factor object gives a slightly different result. Because the levels of the factor were explicitly assigned (predefined), the unused level Low appears with a count of 0.

table(status_factor)
status_factor
  High Medium    Low 
     1      1      0 
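If the zero-count level is not wanted, base R's droplevels() removes levels that do not occur in the data. This is an optional extra, sketched here with an illustrative name trimmed.

```r
status_factor = factor(
  c('High', 'Medium'),
  levels = c('High', 'Medium', 'Low')
)
nlevels(status_factor)   # 3

# Drop the levels that never occur in the data
trimmed = droplevels(status_factor)
levels(trimmed)          # "High" "Medium"
table(trimmed)           # no zero-count column for Low
```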

What happens if you assign levels that do not exist in the data?

status_factor_extra = factor(
  status_char, 
  levels = c('High Status', 'Medium Status', 'Low Status')
)

table(status_factor_extra)
status_factor_extra
  High Status Medium Status    Low Status 
            0             0             0 

If you print the status_factor_extra object, you see that the data are all NA because the predetermined levels do not match with the values of the vector.

status_factor_extra
[1] <NA> <NA>
Levels: High Status Medium Status Low Status
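If the intention was to rename the levels rather than replace them, the labels argument of factor() can be used alongside levels. The sketch below contrasts the two; bad and relabelled are illustrative names.

```r
status_char = c('High', 'Medium')

# Levels that do not match the data turn every value into NA
bad = factor(
  status_char,
  levels = c('High Status', 'Medium Status', 'Low Status')
)
is.na(bad)                 # TRUE TRUE

# levels matches the data; labels supplies the display names
relabelled = factor(
  status_char,
  levels = c('High', 'Medium', 'Low'),
  labels = c('High Status', 'Medium Status', 'Low Status')
)
as.character(relabelled)   # "High Status" "Medium Status"
```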

3.4.1 Ordered factor

Factors can be ordered when the values they hold have a natural order. For example, social class is an ordinal measure: it can be ‘high’, ‘medium’, or ‘low’.

To create an ordered factor, use the ordered() function:

social_class = c('Medium', 'Low', 'Low', 'High')
social_class_factor = ordered(
  social_class,
  levels = c('Low', 'Medium', 'High')
)

social_class_factor
[1] Medium Low    Low    High  
Levels: Low < Medium < High
table(social_class_factor)
social_class_factor
   Low Medium   High 
     2      1      1 

Ordering of factor levels is useful and often more meaningful than unordered levels. Many statistical functions will utilize this ordering in statistical modeling and visualizations.
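A quick way to see what the ordering buys is that comparison and sorting become meaningful on an ordered factor. A minimal sketch in base R:

```r
social_class = c('Medium', 'Low', 'Low', 'High')
social_class_factor = ordered(
  social_class,
  levels = c('Low', 'Medium', 'High')
)

is.ordered(social_class_factor)    # TRUE
social_class_factor >= 'Medium'    # TRUE FALSE FALSE TRUE
max(social_class_factor)           # High
sort(social_class_factor)          # Low Low Medium High
```

On an unordered factor, the comparison returns NA with a warning and max() stops with an error, because the levels carry no order.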

3.5 Data Frame and Tibble

The data frame is arguably the most important concept in R. It was unique when it was introduced; later, the idea was brought into Python via the pandas library. Although it is still a widely used data structure, data.frame has its own issues, which are beyond the scope of this course. To overcome some of those issues, tibble was introduced by Wickham et al. ().

3.5.1 Data Frame

Data frames are created using the data.frame() function by supplying a list of columns. A data.frame, as it is typically called, is of the list data type, with one important distinction: a list can have elements of unequal length, whereas in a data.frame all the elements (columns) must have the same length, so that the data.frame is a true rectangular table.

df = data.frame(
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'M')
)
df
  age sex
1  10   M
2  11   F
3  20   F
4  30   M
5  32   M
str(df)
'data.frame':   5 obs. of  2 variables:
 $ age: num  10 11 20 30 32
 $ sex: chr  "M" "F" "F" "M" ...
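The claim that a data.frame is a list with equal-length elements can be checked directly. The sketch below also shows the error raised when column lengths cannot be reconciled; the name result is illustrative, and try() is used only so the script keeps running.

```r
df = data.frame(
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'M')
)

typeof(df)           # "list"
class(df)            # "data.frame"
length(df)           # 2: the number of columns (list elements)
sapply(df, length)   # every column has 5 values

# Columns of length 3 and 2 cannot form a rectangle, so this fails
result = try(
  data.frame(age = c(10, 11, 20), sex = c('M', 'F')),
  silent = TRUE
)
inherits(result, 'try-error')   # TRUE
```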

We can also create a data.frame from a list by wrapping the list object with the data.frame() function.

my_list = list(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32), 
  sex = c('M', 'F', 'F', 'M', 'M')
)
df = data.frame(my_list)

df
  serial age sex
1      1  10   M
2      2  11   F
3      3  20   F
4      4  30   M
5      5  32   M

3.5.2 Tibble

For the most part, we as users of data frames won’t notice the difference; the differences are mostly under the hood. Those interested in learning two important distinctions between the two can visit this link.

We will revisit tibbles shortly.

Do we care whether it’s a tibble or a data.frame? For the most part, the answer is no. But the R ecosystem is evolving and newer libraries will likely use tibble as the default replacement for data.frame.
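As one illustration of a visible difference, selecting a single column with [ behaves differently in the two structures. The sketch below assumes the tibble package is installed and skips the tibble part otherwise. It is only an example, not necessarily one of the distinctions the linked page covers.

```r
df = data.frame(
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'M')
)

# Selecting one column of a data.frame drops down to a plain vector
class(df[, 'age'])            # "numeric"

# Assumption: the tibble package is available
if (requireNamespace('tibble', quietly = TRUE)) {
  tb = tibble::as_tibble(df)
  print(class(tb))            # "tbl_df" "tbl" "data.frame"
  print(class(tb[, 'age']))   # still a tibble, not a vector
}
```

Note that a tibble also carries the data.frame class, which is why most code written for data frames keeps working on tibbles.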

References