Categorical Data
Often, we have to deal with columns in a data frame that take on a small number of levels:
julia> v = ["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"]
6-element Vector{String}:
"Group A"
"Group A"
"Group A"
"Group B"
"Group B"
"Group B"
The naive encoding used in a Vector represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. There are two benefits of doing this. The first is that such vectors will tend to use less memory. The second is that they can be efficiently grouped using the groupby function.
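To make the idea of pooling concrete, here is a minimal hand-rolled sketch in plain Julia. The names pool, refs, and decoded are illustrative only and are not part of any package; the pooled vector types discussed in this section handle this bookkeeping for you.

```julia
v = ["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"]

# The pool stores each distinct level exactly once.
pool = unique(v)

# Each entry is replaced by a small integer index into the pool.
# UInt8 is an illustrative choice that allows up to 255 levels.
refs = UInt8.(indexin(v, pool))

# Indexing the pool with the references recovers the original data.
decoded = pool[refs]
```

Each entry now costs one byte instead of a reference to a full string, and grouping only needs to compare the integer references.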
There are two common types that perform level pooling:
- PooledVector from PooledArrays.jl;
- CategoricalVector from CategoricalArrays.jl.
The difference between PooledVector and CategoricalVector is the following:
- PooledVector is intended for cases where data compression is the only objective;
- CategoricalVector is designed to additionally provide full support for working with categorical variables, both with unordered (nominal variables) and ordered categories (ordinal variables), at the expense of allowing only AbstractString, AbstractChar, or Number element types (optionally in a union with Missing).
CategoricalVector is useful in particular when unique values in the array (levels) should respect a meaningful ordering, like when printing tables, drawing plots or fitting regression models. CategoricalArrays.jl provides functions to set and retrieve this order and compare values according to it. In contrast, the PooledVector type is essentially a drop-in replacement for Vector with almost no user-visible differences except for lower memory use and higher performance.
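The drop-in behaviour of PooledVector can be sketched as follows. This assumes the PooledArrays.jl package is installed; the variable names are illustrative.

```julia
using PooledArrays

v = ["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"]

# Pool the strings; the result still behaves like a vector of strings.
pv = PooledArray(v)

# It is an AbstractVector{String}, so generic vector code accepts it.
is_string_vector = pv isa AbstractVector{String}

# Element access and comparison work exactly as for the plain Vector.
same_contents = pv == v
fourth = pv[4]
```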
Below we show selected examples of working with CategoricalArrays.jl. See the CategoricalArrays.jl package documentation for more information regarding categorical arrays. Also note that in this section we discuss only vectors because we are considering a data frame context. However, in general both packages support arrays of any dimensionality.
In order to follow the examples below you need to install the CategoricalArrays.jl package first.
julia> using CategoricalArrays
julia> cv = categorical(v)
6-element CategoricalArray{String,1,UInt32}:
"Group A"
"Group A"
"Group A"
"Group B"
"Group B"
"Group B"
CategoricalVectors support missing values.
julia> cv = categorical(["Group A", missing, "Group A",
"Group B", "Group B", missing])
6-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Group A"
missing
"Group A"
"Group B"
"Group B"
missing
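As a small sketch of how missing values interact with the rest of the API (assuming a recent CategoricalArrays.jl release; the names cvm, n_missing, and n_present are illustrative), note that missing is stored in the vector but is not counted as a level, and that the usual Base tools for missing data apply:

```julia
using CategoricalArrays

cvm = categorical(["Group A", missing, "Group B", missing])

# missing is stored in the vector but is not one of its levels.
lv = levels(cvm)

# Standard Base functions for missing data work as with any vector.
n_missing = count(ismissing, cvm)
n_present = length(collect(skipmissing(cvm)))
```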
In addition to representing repeated data efficiently, the CategoricalArray type allows us to determine efficiently the allowed levels of the variable at any time using the levels function (note that levels may or may not be actually used in the data):
julia> levels(cv)
2-element Vector{String}:
"Group A"
"Group B"
The levels! function also allows changing the order of appearance of the levels, which can be useful for display purposes or when working with ordered variables.
julia> levels!(cv, ["Group B", "Group A"])
6-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Group A"
missing
"Group A"
"Group B"
"Group B"
missing
julia> levels(cv)
2-element Vector{String}:
"Group B"
"Group A"
julia> sort(cv)
6-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Group B"
"Group B"
"Group A"
"Group A"
missing
missing
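The two points made above about levels (they may include values that no longer occur in the data, and their order drives sorting) can be illustrated with a short sketch. It assumes a recent CategoricalArrays.jl release; the sizes data and variable names are made up for illustration.

```julia
using CategoricalArrays

sizes = categorical(["medium", "low", "high", "low"])

# Levels default to the sorted order of the values: "high", "low", "medium".
# Impose the meaningful order instead.
levels!(sizes, ["low", "medium", "high"])

# levelcode gives the position of each value in the level order.
codes = levelcode.(sizes)

# Sorting follows the level order, not alphabetical order.
sorted = sort(sizes)

# A level stays registered even after its last use is overwritten...
sizes[3] = "low"
still_there = "high" in levels(sizes)

# ...until unused levels are dropped explicitly.
droplevels!(sizes)
```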
By default, a CategoricalVector stores its level references as UInt32 values, so it is able to represent 2^32 different levels. You can use less memory by calling the compress function:
julia> cv = compress(cv)
6-element CategoricalArray{Union{Missing, String},1,UInt8}:
"Group A"
missing
"Group A"
"Group B"
"Group B"
missing
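To get a rough sense of the saving, one can compare the memory footprint before and after compression. This is only a sketch: Base.summarysize reports an approximate size, and the vector below is synthetic data invented for the illustration.

```julia
using CategoricalArrays

# A longer synthetic vector so that the references dominate the footprint.
big = categorical(repeat(["Group A", "Group B"], 5_000))

# compress switches to a smaller reference type when the number of levels allows it.
small = compress(big)

# The contents and levels are unchanged; only the storage differs.
same_values = small == big
same_levels = levels(small) == levels(big)

# Approximate sizes in bytes.
bytes_big = Base.summarysize(big)
bytes_small = Base.summarysize(small)
```

The benefit grows with the length of the vector, since each element shrinks from four bytes to one while the pool of levels stays the same size.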
The categorical function additionally accepts a keyword argument compress which, when set to true, is equivalent to calling compress on the new vector:
julia> cv1 = categorical(["A", "B"], compress=true)
2-element CategoricalArray{String,1,UInt8}:
"A"
"B"
If the ordered keyword argument is set to true, the resulting CategoricalVector will be ordered, which means that its levels can be tested for order (rather than throwing an error):
julia> cv2 = categorical(["A", "B"], ordered=true)
2-element CategoricalArray{String,1,UInt32}:
"A"
"B"
julia> cv1[1] < cv1[2]
ERROR: ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this
julia> cv2[1] < cv2[2]
true
You can check if a CategoricalVector is ordered using the isordered function and change between ordered and unordered using the ordered! function.
julia> isordered(cv1)
false
julia> ordered!(cv1, true)
2-element CategoricalArray{String,1,UInt8}:
"A"
"B"
julia> isordered(cv1)
true
julia> cv1[1] < cv1[2]
true
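Putting level order and orderedness together, here is a brief sketch for an ordinal variable whose natural order is not alphabetical. The ratings data and variable names are invented for the example, and a recent CategoricalArrays.jl release is assumed.

```julia
using CategoricalArrays

ratings = categorical(["low", "high", "medium", "low"])

# Set the meaningful order of the levels first.
levels!(ratings, ["low", "medium", "high"])

# isless follows the level order even while the vector is unordered.
low_before_high = isless(ratings[1], ratings[2])

# Marking the vector as ordered additionally enables the < operator.
ordered!(ratings, true)
low_lt_high = ratings[1] < ratings[2]
medium_lt_high = ratings[3] < ratings[2]
```

Keeping < disabled until the vector is explicitly marked as ordered guards against accidentally comparing nominal categories.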