3 Data Types
R’s data types are a bit complex. To keep it simple, everything in R can be thought of as a vector. Vectors are of two kinds.
- Atomic vector (all elements must be of same type)
- List (elements can be of different types)
Atomic vectors are of several types as shown in the diagram below.
3.1 Atomic Vector
Before discussing vectors, we first need to understand the term scalar. A scalar is a single, individual value. For example, the age of a single individual is a scalar, whereas the ages of several individuals collected together form a vector.
In most practical situations we work with a vector, which is a collection of scalars of the same type.
In R, we combine values into a vector with the c() function. The c in c() is short for combine.
Let us create four types of atomic vectors.
logical_vec = c(TRUE, TRUE, F, T)
double_vec = c(1, 2, 10, 5)
integer_vec = c(1L, 2L, 10L, 5L)
character_vec = c('Dhaka', 'New York', 'Anything')
To check the type of each vector, use the typeof() function.
typeof(logical_vec)
[1] "logical"
typeof(double_vec)
[1] "double"
typeof(integer_vec)
[1] "integer"
typeof(character_vec)
[1] "character"
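Dividing one integer by another returns a double; as.integer() converts the result back to an integer by dropping the fractional part.
as.integer(10L/3L)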
[1] 3
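Because an atomic vector can hold only one type, c() coerces mixed inputs to a single common type. The sketch below illustrates this; the variable names (mixed_num, mixed_chr, age) are made up for the example and do not appear elsewhere in this chapter.

```r
# Mixing types in c() coerces every element to one common type.
# Coercion order: logical -> integer -> double -> character.
mixed_num = c(1L, 2.5, TRUE)   # integer, double, logical
mixed_chr = c(1, 'a', TRUE)    # double, character, logical

typeof(mixed_num)   # "double"
typeof(mixed_chr)   # "character"

mixed_num           # TRUE is stored as 1
mixed_chr           # every element is stored as a string

# A "scalar" in R is simply a vector of length one.
age = 25
length(age)
is.vector(age)
```

If you need to keep values of different types side by side, a list is the appropriate container.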
3.2 Matrix
Matrices are atomic vectors with a dimension attribute, which can be viewed or set with the dim() function. In the example below, an atomic vector is assigned a dimension attribute of c(2, 2), which turns it into a matrix with 2 rows and 2 columns.
a = c(1, 2, 3, 4)
dim(a) = c(2,2)
a
     [,1] [,2]
[1,]    1    3
[2,]    2    4
We can also create a matrix using the matrix() function, as follows.
a = matrix(1:10, nrow = 2, ncol = 5)
a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
Dimension of this matrix is 2, 5, which means there are 2 rows and 5 columns.
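As a small sketch of how the dimension attribute can be inspected, the following rebuilds the same matrix and queries it with base R functions. The specific queries chosen here are illustrative.

```r
# Rebuild the 2 x 5 matrix from the text.
a = matrix(1:10, nrow = 2, ncol = 5)

dim(a)        # 2 5 (rows, then columns)
nrow(a)       # 2
ncol(a)       # 5
typeof(a)     # "integer": still an atomic vector underneath
a[2, 3]       # element in row 2, column 3 -> 6
attributes(a) # the dim attribute is what makes it a matrix
```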
To learn more about an R function, type a question mark (?) before the function name. For example, to learn about the matrix function, type ?matrix on the R console. A space between ? and matrix is also permitted: ? matrix (notice the space in between) displays the same documentation.
3.2.1 Elements Recycling
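One important aspect of the matrix data type is that if the total number of elements is not the same as nrow × ncol, R recycles the elements to fill the remaining cells. In the example below, we supply only 9 elements, 1:9. However, nrow = 2 and ncol = 5 indicate there should be 2 × 5 = 10 elements, so R issues a warning and reuses the first element to fill the last cell.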
a = matrix(1:9, nrow = 2, ncol = 5)
Warning in matrix(1:9, nrow = 2, ncol = 5): data length [9] is not a sub-multiple or multiple of the number of rows [2]
a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8    1
Try another one
a = matrix(1:5, nrow = 2, ncol = 5)
a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    2    4
[2,]    2    4    1    3    5
Did you notice in what order R fills the cells of a matrix? We are filling 1:5 in the 10 cells and R is filling them column-wise. That is, it first fills the first column and all the rows therein. Then moves to second column and fills all the rows until it exhausts the elements before recycling.
What if you wanted to fill the elements row-wise? Can you figure out how to do that? Hint: open the documentation for matrix using ?matrix on the R console. Then look for the parameter byrow, which defaults to FALSE. Change it to byrow = TRUE to fill the values row-wise.
Let's try that with the byrow = TRUE argument.
a = matrix(1:5, nrow = 2, ncol = 5, byrow = TRUE)
a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    1    2    3    4    5
It now fills the elements row-wise: it fills every column of the first row, then moves to the second row and fills all of its columns, and so on.
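To see the two fill orders side by side without any recycling, here is a minimal sketch using six elements in a 2 x 3 matrix. The names by_col and by_row are hypothetical and used only for this illustration.

```r
# Same data, same shape, different fill order.
by_col = matrix(1:6, nrow = 2, ncol = 3)                # default: byrow = FALSE
by_row = matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]   # first row when filled column-wise: 1 3 5
by_row[1, ]   # first row when filled row-wise:    1 2 3

# Filling row-wise gives the same values as filling a 3 x 2 matrix
# column-wise and then transposing it.
all(by_row == t(matrix(1:6, nrow = 3, ncol = 2)))

# No recycling happens here because length(1:6) equals nrow * ncol.
length(1:6) == nrow(by_col) * ncol(by_col)
```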
3.3 List
A list is also a vector, but unlike an atomic vector its elements can be of different types: each element can be an atomic vector of any type and length, or even another list. To create a list, pass the elements to the list() function.
a = list(
  1:10,
  c(T, F, F)
)
a
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
[[2]]
[1] TRUE FALSE FALSE
Type of a is list as shown below.
typeof(a)
[1] "list"
Elements of a list can be named.
a = list(
  series = 1:10,
  series2 = c(1, 10),
  tf = c(T, F, F)
)
a
$series
[1] 1 2 3 4 5 6 7 8 9 10
$series2
[1] 1 10
$tf
[1] TRUE FALSE FALSE
To view the structure of an R object, use the str() function.
str(a)
List of 3
$ series : int [1:10] 1 2 3 4 5 6 7 8 9 10
$ series2: num [1:2] 1 10
$ tf : logi [1:3] TRUE FALSE FALSE
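Once the elements of a list are named, they can be pulled out by name or by position. The following is a minimal sketch of the common access patterns in base R, reusing the list a from above.

```r
# Rebuild the named list from the text.
a = list(
  series  = 1:10,
  series2 = c(1, 10),
  tf      = c(T, F, F)
)

a$series2           # extract an element by name with $
a[['tf']]           # extract an element by name with [[ ]]
a[[1]][3]           # third value of the first element -> 3

# Single brackets return a (smaller) list, double brackets return the element.
typeof(a['tf'])     # "list"
typeof(a[['tf']])   # "logical"

names(a)            # element names
length(a)           # number of elements in the list -> 3
```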
3.3.1 Exercise
Put some quiz questions about the data types discussed so far. Use the list below for guidance.
- Atomic vector
- Scalar
- Integer, Logical, Character, Double
- List
- Matrix
- Dimension of a matrix
3.4 Factor
The factor data type is built on top of the integer type. Factors are also known as ‘category’ or ‘enumerated’ types. Like the types demonstrated previously, a factor is a vector.
city = c('Dhaka', 'Rajshahi', 'Chottogram',
         'Kumilla', 'Sylhet')
print(city)
[1] "Dhaka" "Rajshahi" "Chottogram" "Kumilla" "Sylhet"
typeof(city)
[1] "character"
The city object is of character type. You can create a factor out of the character vector by wrapping it with the factor() function.
city_factor = factor(city)
You can verify that the type of city_factor is integer.
typeof(city_factor)
[1] "integer"
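To make "built on top of integer" concrete, the sketch below looks at the two pieces a factor stores: the set of levels and an integer code for each value. It uses only base R functions and the city vector from above.

```r
city = c('Dhaka', 'Rajshahi', 'Chottogram', 'Kumilla', 'Sylhet')
city_factor = factor(city)

levels(city_factor)       # the distinct values, sorted by default
as.integer(city_factor)   # integer codes: the position of each value in levels
class(city_factor)        # "factor", even though typeof() is "integer"

# The original strings can be rebuilt by indexing the levels with the codes.
rebuilt = levels(city_factor)[as.integer(city_factor)]
identical(rebuilt, city)
```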
A factor can only contain predefined values, which are called levels. You can define a level even if it does not occur in the data.
You can assign levels to a factor variable explicitly:
# social status as a character vector
status_char = c('High', 'Medium')
# social status as a factor vector
status_factor = factor(
  status_char, levels = c('High', 'Medium', 'Low')
)
# print the values
print(status_factor)
[1] High Medium
Levels: High Medium Low
Running the table() function on a vector displays each distinct value along with the number of times it occurs.
table(status_char)
status_char
High Medium
1 1
If we do the same on the factor object, we see a slightly different result: Low appears with a count of 0 because the levels of the factor were explicitly assigned (predefined).
table(status_factor)
status_factor
High Medium Low
1 1 0
How about if you assign different levels that do not exist in the data?
status_factor_extra = factor(
  status_char, levels = c('High Status', 'Medium Status', 'Low Status')
)
table(status_factor_extra)
status_factor_extra
High Status Medium Status Low Status
0 0 0
If you print the status_factor_extra object, you see that the data are all NA because the predetermined levels do not match the values of the vector.
status_factor_extra
[1] <NA> <NA>
Levels: High Status Medium Status Low Status
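Since unmatched values silently become NA, it can help to check for them before or after the conversion. The following is one possible sketch of such a check; the names wanted_levels and unmatched are hypothetical, introduced only for this example.

```r
status_char = c('High', 'Medium')
wanted_levels = c('High Status', 'Medium Status', 'Low Status')

# Values in the data that are not among the intended levels.
unmatched = setdiff(unique(status_char), wanted_levels)
unmatched                          # "High" "Medium"

# After conversion, every unmatched value has turned into NA.
status_factor_extra = factor(status_char, levels = wanted_levels)
sum(is.na(status_factor_extra))    # 2
```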
3.4.1 Ordered factor
Factors can be ordered when their values have a natural ranking. For example, social class is an ordinal measure: it can be ‘high’, ‘medium’, or ‘low’.
To create an ordered factor, use the ordered() function.
social_class = c('Medium', 'Low', 'Low', 'High')
social_class_factor = ordered(
  social_class, levels = c('Low', 'Medium', 'High')
)
social_class_factor
[1] Medium Low Low High
Levels: Low < Medium < High
table(social_class_factor)
social_class_factor
Low Medium High
2 1 1
Ordering of factor levels is useful and often more meaningful than unordered levels. Many statistical functions will utilize this ordering in statistical modeling and visualizations.
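One practical consequence of the ordering is that comparisons, sorting, and min/max become meaningful. A minimal sketch with the same social_class_factor, using only base R:

```r
social_class = c('Medium', 'Low', 'Low', 'High')
social_class_factor = ordered(
  social_class, levels = c('Low', 'Medium', 'High')
)

# Comparisons follow the level order, not alphabetical order.
above_low = social_class_factor > 'Low'
above_low                        # TRUE FALSE FALSE TRUE

max(social_class_factor)         # High
sort(social_class_factor)        # Low Low Medium High
is.ordered(social_class_factor)  # TRUE
```

The same comparison on an unordered factor would return NA with a warning, because R has no ranking to compare by.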
3.5 Data Frame and Tibble
The data frame is the most important concept in R. It was unique when it was introduced; later, the idea was brought into Python via the pandas library. Although data.frame is still a widely used data structure, it has its own issues, which are beyond the scope of this course. To overcome some of those issues, tibble was introduced by Wickham et al. (2018).
3.5.1 Data Frame
Data frames are created using the data.frame() function by supplying a list of columns. A data.frame, as it is typically called, is of list data type, with one important distinction: a list can have elements of unequal length, whereas in a data.frame all the elements must have the same length, which makes the data.frame a true rectangular array.
df = data.frame(
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'M')
)
df
  age sex
1  10   M
2  11   F
3  20   F
4  30   M
5  32   M
str(df)
'data.frame': 5 obs. of 2 variables:
$ age: num 10 11 20 30 32
$ sex: chr "M" "F" "F" "M" ...
We can also create a data.frame from a list by wrapping the list object with the data.frame() function.
my_list = list(
  serial = 1:5,
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'M')
)
df = data.frame(my_list)
df
  serial age sex
1      1  10   M
2      2  11   F
3      3  20   F
4      4  30   M
5      5  32   M
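The claim that a data.frame is a list of equal-length columns can be checked directly. The sketch below is illustrative: it inspects the two-column df from earlier and shows what happens when the lengths cannot be reconciled. The name result is hypothetical.

```r
df = data.frame(
  age = c(10, 11, 20, 30, 32),
  sex = c('M', 'F', 'F', 'M', 'M')
)

typeof(df)    # "list": each column is one element of the list
class(df)     # "data.frame"
dim(df)       # 5 2 (rows, then columns)

df$age        # columns are extracted like list elements
df[['sex']]

# Rows can be filtered with a logical condition on a column.
older = df[df$age > 15, ]
older

# Columns of lengths 3 and 2 cannot form a rectangle, so data.frame() fails.
# tryCatch() captures the error message instead of stopping the script.
result = tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) conditionMessage(e)
)
result
```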
3.5.2 Tibble
For the most part, we as users of data frames won’t notice the difference; the differences are under the hood. Those interested in two important distinctions between the two can visit this link.
We will revisit tibbles shortly.
Do we care whether it’s a tibble or a data.frame? For the most part, the answer is no. But the R ecosystem is evolving and newer libraries will likely use tibble as the default replacement for data.frame.