Functions
Multithreading support
By default, selected operations in DataFrames.jl automatically use multiple threads when available. It is task-based and implemented using the @spawn macro from Julia Base. Functions that take user-defined functions and may run it in parallel accept a threads keyword argument which allows disabling multithreading when the provided function requires serial execution or is not thread-safe.
This is a list of operations that currently make use of multi-threading:
-
DataFrameconstructor withcopycols=true; also recursively all functions that call this constructor, e.g.copy. -
getindexwhen multiple columns are selected. -
groupby(both when hashing is required and when fast path usingDataAPI.refpoolis used). -
*joinfunctions for composing output data frame (but currently not for finding matching rows in joined data frames). -
combine,select[!], andtransform[!]onGroupedDataFramewhen either of the conditions below is met:-
multiple transformations are performed (each transformation is spawned in a separate task)
-
a transformation produces one row per group and the passed transformation is a custom function (i.e. not for standard reductions, which use optimized single-threaded methods).
-
-
dropmissingwhen the provided data frame has more than 1 column andview=false(subsetting of individual columns is spawned in separate tasks).
In general at least Julia 1.4 is required to ensure that multi-threading is used and the Julia process must be started with more than one thread. Some operations turn on multi-threading only if enough rows in the processed data frame are present (the exact threshold when multi-threading is enabled is considered to be undefined and might change in the future).
Except for the list above, where multi-threading is used automatically, all functions provided by DataFrames.jl that update a data frame are not thread safe. This means that while they can be called from any thread, the caller is responsible for ensuring that a given DataFrame object is never modified by one thread while others are using it (either for reading or writing). Using the same DataFrame at the same time from different threads is safe as long as it is not modified.
Constructing data frames
#
DataAPI.allcombinations — Function
allcombinations(DataFrame, pairs::Pair...)
allcombinations(DataFrame; kwargs...)
Create a DataFrame from all combinations of values in passed arguments. The first passed values vary fastest.
Arguments associating a column name with values to expand can be specified either as Pairs passed as positional arguments, or as keyword arguments. Column names must be Symbols or strings and must be unique.
Column value can be a vector which is consumed as is or an object of any other type (except AbstractArray). In the latter case the passed value is treated as having length one for expansion. As a particular rule values stored in a Ref or a 0-dimensional AbstractArray are unwrapped and treated as having length one.
See also: crossjoin can be used to get the cartesian product of rows from passed data frames.
Examples
julia> allcombinations(DataFrame, a=1:2, b='a':'c')
6×2 DataFrame
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 a
3 │ 1 b
4 │ 2 b
5 │ 1 c
6 │ 2 c
julia> allcombinations(DataFrame, "a" => 1:2, "b" => 'a':'c', "c" => "const")
6×3 DataFrame
Row │ a b c
│ Int64 Char String
─────┼─────────────────────
1 │ 1 a const
2 │ 2 a const
3 │ 1 b const
4 │ 2 b const
5 │ 1 c const
6 │ 2 c const
#
Base.copy — Function
copy(df::DataFrame; copycols::Bool=true)
Copy data frame df. If copycols=true (the default), return a new DataFrame holding copies of column vectors in df. If copycols=false, return a new DataFrame sharing column vectors with df.
Metadata: this function preserves all table-level and column-level metadata.
copy(dfr::DataFrameRow)
Construct a NamedTuple with the same contents as the DataFrameRow. This method returns a NamedTuple so that the returned object is not affected by changes to the parent data frame of which dfr is a view.
copy(key::GroupKey)
Construct a NamedTuple with the same contents as the GroupKey.
#
Base.similar — Function
similar(df::AbstractDataFrame, rows::Integer=nrow(df))
Create a new DataFrame with the same column names and column element types as df. An optional second argument can be provided to request a number of rows that is different than the number of rows present in df.
Metadata: this function preserves table-level and column-level :note-style metadata.
Summary information
#
DataAPI.describe — Function
describe(df::AbstractDataFrame; cols=:)
describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...; cols=:)
Return descriptive statistics for a data frame as a new DataFrame where each row represents a variable and each column a summary statistic.
Arguments
-
df: theAbstractDataFrame -
stats::Union{Symbol, Pair}...: the summary statistics to report. Arguments can be:-
A symbol from the list
:mean,:std,:min,:q25,:median,:q75,:max,:sum,:eltype,:nunique,:nuniqueall,:first,:last,:nnonmissing, and:nmissing. The default statistics used are:mean,:min,:median,:max,:nmissing, and:eltype. -
:detailedas the onlySymbolargument to return all statistics except:first,:last,:sum,:nuniqueall, and:nnonmissing. -
:allas the onlySymbolargument to return all statistics. -
A
function => namepair wherenameis aSymbolor string. This will create a column of summary statistics with the provided name.
-
-
cols: a keyword argument allowing to select only a subset or transformation of columns fromdfto describe. Can be any column selector or transformation accepted byselect.
Details
For Real columns, compute the mean, standard deviation, minimum, first quantile, median, third quantile, and maximum. If a column does not derive from Real, describe will attempt to calculate all statistics, using nothing as a fall-back in the case of an error.
When stats contains :nunique, describe will report the number of unique values in a column. If a column’s base type derives from Real, :nunique will return nothings. Use :nuniqueall to report the number of unique values in all columns.
Missing values are filtered in the calculation of all statistics, however the column :nmissing will report the number of missing values of that variable and :nnonmissing the number of non-missing values.
If custom functions are provided, they are called repeatedly with the vector corresponding to each column as the only argument. For columns allowing for missing values, the vector is wrapped in a call to skipmissing: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.
Metadata: this function drops all metadata.
Examples
julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j');
julia> describe(df)
3×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼────────────────────────────────────────────────────────
1 │ i 5.5 1 5.5 10 0 Int64
2 │ x 0.55 0.1 0.55 1.0 0 Float64
3 │ y a j 0 Char
julia> describe(df, :min, :max)
3×3 DataFrame
Row │ variable min max
│ Symbol Any Any
─────┼────────────────────
1 │ i 1 10
2 │ x 0.1 1.0
3 │ y a j
julia> describe(df, :min, sum => :sum)
3×3 DataFrame
Row │ variable min sum
│ Symbol Any Union…
─────┼───────────────────────
1 │ i 1 55
2 │ x 0.1 5.5
3 │ y a
julia> describe(df, :min, sum => :sum, cols=:x)
1×3 DataFrame
Row │ variable min sum
│ Symbol Float64 Float64
─────┼────────────────────────────
1 │ x 0.1 5.5
#
Base.isempty — Function
isempty(df::AbstractDataFrame)
Return true if data frame df has zero rows, and false otherwise.
#
Base.length — Function
length(dfr::DataFrameRow)
Return the number of elements of dfr.
See also: size
Examples
julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :]
DataFrameRow
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
julia> length(dfr)
2
#
DataAPI.ncol — Function
ncol(df::AbstractDataFrame)
Return the number of columns in an AbstractDataFrame df.
Examples
julia> df = DataFrame(i=1:10, x=rand(10), y=rand(["a", "b", "c"], 10));
julia> ncol(df)
3
#
Base.ndims — Function
ndims(::AbstractDataFrame)
ndims(::Type{<:AbstractDataFrame})
Return the number of dimensions of a data frame, which is always 2.
ndims(::DataFrameRow)
ndims(::Type{<:DataFrameRow})
Return the number of dimensions of a data frame row, which is always 1.
#
DataAPI.nrow — Function
nrow(df::AbstractDataFrame)
Return the number of rows in an AbstractDataFrame df.
Examples
julia> df = DataFrame(i=1:10, x=rand(10), y=rand(["a", "b", "c"], 10));
julia> nrow(df)
10
#
DataAPI.rownumber — Function
rownumber(dfr::DataFrameRow)
Return a row number in the AbstractDataFrame that dfr was created from.
Note that this differs from the first element in the tuple returned by parentindices. The latter gives the row number in the parent(dfr), which is the source DataFrame where data that dfr gives access to is stored.
Examples
julia> df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 4 7 10
2 │ 2 5 8 11
3 │ 3 6 9 12
julia> dfr = df[2, :]
DataFrameRow
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
2 │ 2 5 8 11
julia> rownumber(dfr)
2
julia> parentindices(dfr)
(2, Base.OneTo(4))
julia> parent(dfr)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 4 7 10
2 │ 2 5 8 11
3 │ 3 6 9 12
julia> dfv = @view df[2:3, 1:3]
2×3 SubDataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 5 8
2 │ 3 6 9
julia> dfrv = dfv[2, :]
DataFrameRow
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
3 │ 3 6 9
julia> rownumber(dfrv)
2
julia> parentindices(dfrv)
(3, 1:3)
julia> parent(dfrv)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 4 7 10
2 │ 2 5 8 11
3 │ 3 6 9 12
#
Base.show — Function
show([io::IO, ]df::AbstractDataFrame;
allrows::Bool = !get(io, :limit, false),
allcols::Bool = !get(io, :limit, false),
allgroups::Bool = !get(io, :limit, false),
rowlabel::Symbol = :Row,
summary::Bool = true,
eltypes::Bool = true,
truncate::Int = 32,
kwargs...)
Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.
If io is omitted, the result is printed to stdout, and allrows, allcols and allgroups default to false.
Arguments
-
io::IO: The I/O stream to whichdfwill be printed. -
df::AbstractDataFrame: The data frame to print. -
allrows::Bool: Whether to print all rows, rather than a subset that fits the device height. By default this is the case only ifiodoes not have theIOContextpropertylimitset. -
allcols::Bool: Whether to print all columns, rather than a subset that fits the device width. By default this is the case only ifiodoes not have theIOContextpropertylimitset. -
allgroups::Bool: Whether to print all groups rather than the first and last, whendfis aGroupedDataFrame. By default this is the case only ifiodoes not have theIOContextpropertylimitset. -
rowlabel::Symbol = :Row: The label to use for the column containing row numbers. -
summary::Bool = true: Whether to print a brief string summary of the data frame. -
eltypes::Bool = true: Whether to print the column types under column names. -
truncate::Int = 32: the maximal display width the output can use before being truncated (in thetextwidthsense, excluding…). Iftruncateis 0 or less, no truncation is applied. -
kwargs...: Any keyword argument supported by the functionpretty_tableof PrettyTables.jl can be passed here to customize the output.
Examples
julia> using DataFrames
julia> df = DataFrame(A=1:3, B=["x", "y", "z"]);
julia> show(df, show_row_number=false)
3×2 DataFrame
A B
Int64 String
───────────────
1 x
2 y
3 z
show(io::IO, mime::MIME, df::AbstractDataFrame)
Render a data frame to an I/O stream in MIME type mime.
Arguments
-
io::IO: The I/O stream to whichdfwill be printed. -
mime::MIME: supported MIME types are:"text/plain","text/html","text/latex","text/csv","text/tab-separated-values"(the last two MIME types do not support showing#undefvalues) -
df::AbstractDataFrame: The data frame to print.
Additionally selected MIME types support passing the following keyword arguments:
-
MIME type
"text/plain"accepts all listed keyword arguments and their behavior is identical as forshow(::IO, ::AbstractDataFrame) -
MIME type
"text/html"accepts the following keyword arguments:-
eltypes::Bool = true: Whether to print the column types under column names. -
summary::Bool = true: Whether to print a brief string summary of the data frame. -
max_column_width::AbstractString = "": The maximum column width. It must be a string containing a valid CSS length. For example, passing "100px" will limit the width of all columns to 100 pixels. If empty, the columns will be rendered without limits. -
kwargs...: Any keyword argument supported by the functionpretty_tableof PrettyTables.jl can be passed here to customize the output.
-
Examples
julia> show(stdout, MIME("text/latex"), DataFrame(A=1:3, B=["x", "y", "z"]))
\begin{tabular}{r|cc}
& A & B\\
\hline
& Int64 & String\\
\hline
1 & 1 & x \\
2 & 2 & y \\
3 & 3 & z \\
\end{tabular}
14
julia> show(stdout, MIME("text/csv"), DataFrame(A=1:3, B=["x", "y", "z"]))
"A","B"
1,"x"
2,"y"
3,"z"
#
Base.size — Function
size(df::AbstractDataFrame[, dim])
Return a tuple containing the number of rows and columns of df. Optionally a dimension dim can be specified, where 1 corresponds to rows and 2 corresponds to columns.
Examples
julia> df = DataFrame(a=1:3, b='a':'c');
julia> size(df)
(3, 2)
julia> size(df, 1)
3
size(dfr::DataFrameRow[, dim])
Return a 1-tuple containing the number of elements of dfr. If an optional dimension dim is specified, it must be 1, and the number of elements is returned directly as a number.
See also: length
Examples
julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :]
DataFrameRow
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
julia> size(dfr)
(2,)
julia> size(dfr, 1)
2
Working with column names
#
Base.names — Function
names(df::AbstractDataFrame, cols=:)
names(df::DataFrameRow, cols=:)
names(df::GroupedDataFrame, cols=:)
names(df::DataFrameRows, cols=:)
names(df::DataFrameColumns, cols=:)
names(df::GroupKey)
Return a freshly allocated Vector{String} of names of columns contained in df.
If cols is passed then restrict returned column names to those matching the selector (this is useful in particular with regular expressions, Cols, Not, and Between). cols can be:
-
any column selector (
Symbol, string or integer;:,Cols,All,Between,Not, a regular expression, or a vector ofSymbols, strings or integers); these column selectors are documented in the General rules section of the Indexing part of the DataFrames.jl manual -
a
Type, in which case names of columns whoseeltypeis a subtype ofTare returned -
a
Functionpredicate taking the column name as a string and returningtruefor columns that should be kept
See also propertynames which returns a Vector{Symbol}.
Examples
julia> df = DataFrame(x1=[1, missing, missing], x2=[3, 2, 4], x3=[3, missing, 2], x4=Union{Int, Missing}[2, 4, 4])
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64? Int64 Int64? Int64?
─────┼─────────────────────────────────
1 │ 1 3 3 2
2 │ missing 2 missing 4
3 │ missing 4 2 4
julia> names(df)
4-element Vector{String}:
"x1"
"x2"
"x3"
"x4"
julia> names(df, Int) # pick columns whose element type is Int
1-element Vector{String}:
"x2"
julia> names(df, x -> x[end] == '2') # pick columns for which last character in their name is '2'
1-element Vector{String}:
"x2"
julia> fun(col) = sum(skipmissing(col)) >= 10
fun (generic function with 1 method)
julia> names(df, fun.(eachcol(df))) # pick columns for which sum of their elements is at least 10
1-element Vector{String}:
"x4"
julia> names(df, eltype.(eachcol(df)) .>: Missing) # pick columns that allow missing values
3-element Vector{String}:
"x1"
"x3"
"x4"
julia> names(df, any.(ismissing, eachcol(df))) # pick columns that contain missing values
2-element Vector{String}:
"x1"
"x3"
#
Base.propertynames — Function
propertynames(df::AbstractDataFrame)
Return a freshly allocated Vector{Symbol} of names of columns contained in df.
#
DataFrames.rename — Function
rename(df::AbstractDataFrame, vals::AbstractVector{Symbol};
makeunique::Bool=false)
rename(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
makeunique::Bool=false)
rename(df::AbstractDataFrame, (from => to)::Pair...)
rename(df::AbstractDataFrame, d::AbstractDict)
rename(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename(f::Function, df::AbstractDataFrame)
Create a new data frame that is a copy of df with changed column names. Each name is changed at most once. Permutation of names is allowed.
Arguments
-
df: theAbstractDataFrame; if it is aSubDataFramethen renaming is only allowed if it was created using:as a column selector. -
d: anAbstractDictor anAbstractVectorofPairs that maps the original names or column numbers to new names -
f: a function which for each column takes the old name as aStringand returns the new name that gets converted to aSymbol -
vals: new column names as a vector ofSymbols orAbstractStrings of the same length as the number of columns indf -
makeunique: iffalse(the default), an error will be raised if duplicate names are found; iftrue, duplicate names will be suffixed with_i(istarting at 1 for the first duplicate).
If pairs are passed to rename (as positional arguments or in a dictionary or a vector) then:
-
fromvalue can be aSymbol, anAbstractStringor anInteger; -
tovalue can be aSymbolor anAbstractString.
Mixing symbols and strings in to and from is not allowed.
Metadata: this function preserves table-level and column-level :note-style metadata.
Column-level :note-style metadata is considered to be attached to column number: when a column is renamed, its :note-style metadata becomes associated to its new name.
See also: rename!
Examples
julia> df = DataFrame(i=1, x=2, y=3)
1×3 DataFrame
Row │ i x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, [:a, :b, :c])
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, :i => "A", :x => "X")
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, :x => :y, :y => :x)
1×3 DataFrame
Row │ i y x
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, [1 => :A, 2 => :X])
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, Dict("i" => "A", "x" => "X"))
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(uppercase, df)
1×3 DataFrame
Row │ I X Y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
#
DataFrames.rename! — Function
rename!(df::AbstractDataFrame, vals::AbstractVector{Symbol};
makeunique::Bool=false)
rename!(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
makeunique::Bool=false)
rename!(df::AbstractDataFrame, (from => to)::Pair...)
rename!(df::AbstractDataFrame, d::AbstractDict)
rename!(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename!(f::Function, df::AbstractDataFrame)
Rename columns of df in-place. Each name is changed at most once. Permutation of names is allowed.
Arguments
-
df: theAbstractDataFrame -
d: anAbstractDictor anAbstractVectorofPairs that maps the original names or column numbers to new names -
f: a function which for each column takes the old name as aStringand returns the new name that gets converted to aSymbol -
vals: new column names as a vector ofSymbols orAbstractStrings of the same length as the number of columns indf -
makeunique: iffalse(the default), an error will be raised if duplicate names are found; iftrue, duplicate names will be suffixed with_i(istarting at 1 for the first duplicate).
If pairs are passed to rename! (as positional arguments or in a dictionary or a vector) then:
-
fromvalue can be aSymbol, anAbstractStringor anInteger; -
tovalue can be aSymbolor anAbstractString.
Mixing symbols and strings in to and from is not allowed.
Metadata: this function preserves table-level and column-level :note-style metadata.
Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame). Column-level :note-style metadata is considered to be attached to column number: when a column is renamed, its :note-style metadata becomes associated to its new name.
See also: rename
Examples
julia> df = DataFrame(i=1, x=2, y=3)
1×3 DataFrame
Row │ i x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(df, Dict(:i => "A", :x => "X"))
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(df, [:a, :b, :c])
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(df, [:a, :b, :a])
ERROR: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.
julia> rename!(df, [:a, :b, :a], makeunique=true)
1×3 DataFrame
Row │ a b a_1
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(uppercase, df)
1×3 DataFrame
Row │ A B A_1
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
Mutating and transforming data frames and grouped data frames
#
Base.append! — Function
append!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
append!(df::DataFrame, table; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
Add the rows of df2 to the end of df. If the second argument table is not an AbstractDataFrame then it is converted using DataFrame(table, copycols=false) before being appended.
The exact behavior of append! depends on the cols argument:
-
If
cols == :setequal(this is the default) thendf2must contain exactly the same columns asdf(but possibly in a different order). -
If
cols == :orderequalthendf2must contain the same columns in the same order (forAbstractDictthis option requires thatkeys(row)matchespropertynames(df)to allow for support of ordered dicts; however, ifdf2is aDictan error is thrown as it is an unordered collection). -
If
cols == :intersectthendf2may contain more columns thandf, but all column names that are present indfmust be present indf2and only these are used. -
If
cols == :subsetthenappend!behaves like for:intersectbut if some column is missing indf2then amissingvalue is pushed todf. -
If
cols == :unionthenappend!adds columns missing indfthat are present indf2, for columns present indfbut missing indf2amissingvalue is pushed.
If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.
The above rule has the following exceptions:
-
If
dfhas no columns then copies of columns fromdf2are added to it. -
If
df2has no columns then callingappend!leavesdfunchanged.
Please note that append! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
Metadata: table-level :note-style metadata and column-level :note-style metadata for columns present in df are preserved. If new columns are added their :note-style metadata is copied from the appended table. Other metadata is dropped.
See also: use push! to add individual rows to a data frame, prepend! to add a table at the beginning, and vcat to vertically concatenate data frames.
Examples
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4.0:6.0, B=4:6)
3×2 DataFrame
Row │ A B
│ Float64 Int64
─────┼────────────────
1 │ 4.0 4
2 │ 5.0 5
3 │ 6.0 6
julia> append!(df1, df2);
julia> df1
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4
5 │ 5 5
6 │ 6 6
#
DataFrames.combine — Function
combine(df::AbstractDataFrame, args...;
renamecols::Bool=true, threads::Bool=true)
combine(f::Callable, df::AbstractDataFrame;
renamecols::Bool=true, threads::Bool=true)
combine(gd::GroupedDataFrame, args...;
keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
combine(f::Base.Callable, gd::GroupedDataFrame;
keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
Create a new data frame that contains columns from df or gd specified by args and return it. The result can have any number of rows that is determined by the values returned by passed transformations.
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
-
combine: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups inGroupedDataFrame; it is typically used to compute summary statistics by group; forGroupedDataFrameif grouping columns are kept they are put as first columns in the result; -
select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!is an in-place version ofselect; forGroupedDataFrameif grouping columns are kept they are put as first columns in the result; -
transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!is an in-place version oftransform; existing columns in the source data frame are put as first columns in the result;
As a special case, if a GroupedDataFrame that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:
-
standard column selectors (integers,
Symbols, strings, vectors of integers, vectors ofSymbols, vectors of strings,All,Cols,:,Between,Notand regular expressions) -
a
cols => functionpair indicating thatfunctionshould be called with positional arguments holding columnscols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunctionreturns a single value or a vector; the generated name is created by concatenating source column name andfunctionname by default (see examples below). -
a
cols => function => target_colsform additionally explicitly specifying the target column or columns, which must be a single name (as aSymbolor a string), a vector of names orAsTable. Additionally it can be aFunctionwhich takes a string or a vector of strings as an argument containing names of columns selected bycols, and returns the target columns names (all accepted types exceptAsTableare allowed). -
a
col => target_colspair, which renames the columncoltotarget_cols, which must be single name (as aSymbolor a string), a vector of names orAsTable. -
column-independent operations
function => target_colsor justfunctionfor specificfunctions where the input columns are omitted; withouttarget_colsthe new column has the same name asfunction, otherwise it must be single name (as aSymbolor a string). Supportedfunctions are:-
nrowto efficiently compute the number of rows in each group. -
proprowto efficiently compute the proportion of rows in each group. -
eachindexto return a vector holding the number of each row within each group. -
groupindicesto return the group number.
-
-
vectors or matrices containing transformations specified by the
Pairsyntax described in points 2 to 5 -
a function which will be called with a
SubDataFramecorresponding to each group if aGroupedDataFrameis processed, or with the data frame itself if anAbstractDataFrameis processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrameavoids excessive compilation)
Note! If the expression of the form x => y is passed then except for the special convenience form nrow => target_cols it is always interpreted as cols => function. In particular the following expression function => target_cols is not a valid transformation specification.
Note! If cols or target_cols are one of All, Cols, Between, or Not, broadcasting using .=> is supported and is equivalent to broadcasting the result of names(df, cols) or names(df, target_cols). This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).
As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function. The documentation of DataFrames.table_transformation provides more information about this functionality, in particular covering performance considerations.
What is allowed for function to return is determined by the target_cols value:
-
If both
colsandtarget_colsare omitted (so only afunctionis passed), then returning a data frame, a matrix, aNamedTuple, aTables.AbstractRowor aDataFrameRowwill produce multiple columns in the result. Returning any other value produces a single column. -
If
target_colsis aSymbolor a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple, aTables.AbstractRow, or aDataFrameRowraises an error. -
If
target_colsis a vector ofSymbols or strings orAsTableit is assumed thatfunctionreturns multiple columns. Iffunctionreturns one ofAbstractDataFrame,NamedTuple,DataFrameRow,Tables.AbstractRow,AbstractMatrixthen rules described in point 1 above apply. Iffunctionreturns anAbstractVectorthen each element of this vector must support thekeysfunction, which must return a collection ofSymbols, strings or integers; the return value ofkeysmust be identical for all elements. Then as many columns are created as there are elements in the return value of thekeysfunction. Iftarget_colsisAsTablethen their names are set to be equal to the key names except ifkeysreturns integers, in which case they are prefixed byx(so the column names are e.g.x1,x2, …). Iftarget_colsis a vector ofSymbols or strings then column names produced using the rules above are ignored and replaced bytarget_cols(the number of columns must be the same as the length oftarget_colsin this case). Iffunreturns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntablefunction is called on it to get the resulting columns and their names. The names are retained whentarget_colsisAsTableand are replaced iftarget_colsis a vector ofSymbols or strings.
In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.
select/select! and transform/transform! always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).
For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols or a NamedTuple of them if specified columns are wrapped in AsTable. If ByRow is used it is allowed for cols to select an empty set of columns, in which case function is called for each row without any arguments and an empty NamedTuple is passed if empty set of columns is wrapped in AsTable.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular a syntax to move column :col to the first position in the data frame is select!(df, :col, :). On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1, [:x1, :x2], :x1 => :x2, :x1 => identity => :x2, or :x1 => (x -> @view x[inds]) (note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols keyword argument:
-
if
copycols=truethen results of such transformations always perform a copy of the source column or its view; -
if
copycols=falsethen copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using===, which excludes views of source columns) performs a copy;
Note that performing transform! or select! assumes that copycols=false.
If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns and in this case transforming or renaming columns is not allowed.
If a GroupedDataFrame is passed and threads=true (the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure), use locks to control parallel accesses, or threads=false must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame inputs.
In order to improve the performance of the operations some transformations invoke optimized implementation, see DataFrames.table_transformation for details.
Keyword arguments
-
renamecols::Bool=true: whether in thecols => functionform automatically generated column names should include the name of transformation functions or not. -
keepkeys::Bool=true: whether grouping columns ofgdshould be kept in the returned data frame. -
ungroup::Bool=true: whether the return value of the operation ongdshould be a data frame or aGroupedDataFrame. -
threads::Bool=true: whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalseif some transformations require serial execution or are not thread-safe.
Metadata: this function propagates table-level :note-style metadata. Column-level :note-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity or copy to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame if the output has the same name as a grouping column and keepkeys=true, metadata is taken from original grouping column.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> combine(df, :a => sum, nrow, renamecols=false)
1×2 DataFrame
Row │ a nrow
│ Int64 Int64
─────┼──────────────
1 │ 6 3
julia> combine(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
Row │ c b
│ Float64 Int64
─────┼─────────────────
1 │ 0.841471 4
2 │ 0.909297 5
3 │ 0.14112 6
julia> combine(df, :, [:a, :b] => (a, b) -> a .+ b .- sum(b)/length(b))
3×3 DataFrame
Row │ a b a_b_function
│ Int64 Int64 Float64
─────┼────────────────────────────
1 │ 1 4 0.0
2 │ 2 5 2.0
3 │ 3 6 4.0
julia> combine(df, All() .=> [minimum maximum])
1×4 DataFrame
Row │ a_minimum b_minimum a_maximum b_maximum
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────
1 │ 1 4 3 6
julia> using Statistics
julia> combine(df, AsTable(:) => ByRow(mean), renamecols=false)
3×1 DataFrame
Row │ a_b
│ Float64
─────┼─────────
1 │ 2.5
2 │ 3.5
3 │ 4.5
julia> combine(df, AsTable(:) => ByRow(mean) => x -> join(x, "_"))
3×1 DataFrame
Row │ a_b
│ Float64
─────┼─────────
1 │ 2.5
2 │ 3.5
3 │ 4.5
julia> combine(first, df)
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 4 7
2 │ 2 5 8
3 │ 3 6 9
julia> combine(df, AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => :stats,
AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => AsTable)
3×3 DataFrame
Row │ stats mean std
│ NamedTup… Float64 Float64
─────┼───────────────────────────────────────────
1 │ (mean = 4.0, std = 3.0) 4.0 3.0
2 │ (mean = 5.0, std = 3.0) 5.0 3.0
3 │ (mean = 6.0, std = 3.0) 6.0 3.0
julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :c => sum, nrow)
4×3 DataFrame
Row │ a c_sum nrow
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 2
2 │ 2 8 2
3 │ 3 10 2
4 │ 4 12 2
julia> combine(gd, :c => sum, nrow, ungroup=false)
GroupedDataFrame with 4 groups based on key: a
First Group (1 row): a = 1
Row │ a c_sum nrow
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 2
⋮
Last Group (1 row): a = 4
Row │ a c_sum nrow
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 12 2
julia> combine(gd) do d # do syntax for the slower variant
sum(d.c)
end
4×2 DataFrame
Row │ a x1
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 2 8
3 │ 3 10
4 │ 4 12
julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
4×2 DataFrame
Row │ a sum_log_c
│ Int64 Float64
─────┼──────────────────
1 │ 1 1.60944
2 │ 2 2.48491
3 │ 3 3.04452
4 │ 4 3.46574
julia> combine(gd, [:b, :c] .=> sum) # passing a vector of pairs
4×3 DataFrame
Row │ a b_sum c_sum
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 4 6
2 │ 2 2 8
3 │ 3 4 10
4 │ 4 2 12
julia> combine(gd) do sdf # dropping group when DataFrame() is returned
sdf.c[1] != 1 ? sdf : DataFrame()
end
6×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 2
2 │ 2 1 6
3 │ 3 2 3
4 │ 3 2 7
5 │ 4 1 4
6 │ 4 1 8
auto-splatting, renaming and keepkeys
julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
8×3 DataFrame
Row │ b1 c1 b_c_+
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 3
2 │ 2 5 7
3 │ 1 2 3
4 │ 1 6 7
5 │ 2 3 5
6 │ 2 7 9
7 │ 1 4 5
8 │ 1 8 9
broadcasting and column expansion
julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
8×4 DataFrame
Row │ a b min max
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 2
2 │ 1 2 2 5
3 │ 2 1 1 2
4 │ 2 1 1 6
5 │ 3 2 2 3
6 │ 3 2 2 7
7 │ 4 1 1 4
8 │ 4 1 1 8
julia> combine(gd, [:b, :c] .=> Ref) # preventing vector from being spread across multiple rows
4×3 DataFrame
Row │ a b_Ref c_Ref
│ Int64 SubArray… SubArray…
─────┼─────────────────────────────
1 │ 1 [2, 2] [1, 5]
2 │ 2 [1, 1] [2, 6]
3 │ 3 [2, 2] [3, 7]
4 │ 4 [1, 1] [4, 8]
julia> combine(gd, AsTable(Not(:a)) => Ref) # protecting result
4×2 DataFrame
Row │ a b_c_Ref
│ Int64 NamedTup…
─────┼─────────────────────────────────
1 │ 1 (b = [2, 2], c = [1, 5])
2 │ 2 (b = [1, 1], c = [2, 6])
3 │ 3 (b = [2, 2], c = [3, 7])
4 │ 4 (b = [1, 1], c = [4, 8])
julia> combine(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
8×4 DataFrame
Row │ a b c b_c
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 3
2 │ 1 2 5 7
3 │ 2 1 2 3
4 │ 2 1 6 7
5 │ 3 2 3 5
6 │ 3 2 7 9
7 │ 4 1 4 5
8 │ 4 1 8 9
#
DataFrames.fillcombinations — Function
fillcombinations(df::AbstractDataFrame, indexcols;
allowduplicates::Bool=false,
fill=missing)
Generate all combinations of levels of column(s) indexcols in data frame df. Levels and their order are determined by the levels function (i.e. unique values sorted lexicographically by default, or a custom set of levels for e.g. CategoricalArray columns), in addition to missing if present.
For combinations of indexcols not present in df these columns are filled with the fill value (missing by default).
If allowduplicates=false (the default) indexcols may only contain unique combinations of indexcols values. If allowduplicates=true duplicates are allowed.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(x=1:2, y='a':'b', z=["x", "y"])
2×3 DataFrame
Row │ x y z
│ Int64 Char String
─────┼─────────────────────
1 │ 1 a x
2 │ 2 b y
julia> fillcombinations(df, [:x, :y])
4×3 DataFrame
Row │ x y z
│ Int64 Char String?
─────┼──────────────────────
1 │ 1 a x
2 │ 2 a missing
3 │ 1 b missing
4 │ 2 b y
julia> fillcombinations(df, [:y, :z], fill=0)
4×3 DataFrame
Row │ x y z
│ Int64? Char String
─────┼──────────────────────
1 │ 1 a x
2 │ 0 b x
3 │ 0 a y
4 │ 2 b y
#
DataFrames.flatten — Function
flatten(df::AbstractDataFrame, cols; scalar::Type=Union{})
When columns cols of data frame df have iterable elements that define length (for example a Vector of Vectors), return a DataFrame where each element of each col in cols is flattened, meaning the column corresponding to col becomes a longer vector where the original entries are concatenated. Elements of row i of df in columns other than cols will be repeated according to the length of df[i, col]. These lengths must therefore be the same for each col in cols, or else an error is raised. Note that these elements are not copied, and thus if they are mutable changing them in the returned DataFrame will affect df.
cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If scalar is passed then values that have this type in flattened columns are treated as scalars and broadcasted as many times as is needed to match lengths of values stored in other columns. If all values in a row are scalars, a single row is produced.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df1 = DataFrame(a=[1, 2], b=[[1, 2], [3, 4]], c=[[5, 6], [7, 8]])
2×3 DataFrame
Row │ a b c
│ Int64 Array… Array…
─────┼───────────────────────
1 │ 1 [1, 2] [5, 6]
2 │ 2 [3, 4] [7, 8]
julia> flatten(df1, :b)
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Array…
─────┼──────────────────────
1 │ 1 1 [5, 6]
2 │ 1 2 [5, 6]
3 │ 2 3 [7, 8]
4 │ 2 4 [7, 8]
julia> flatten(df1, [:b, :c])
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 5
2 │ 1 2 6
3 │ 2 3 7
4 │ 2 4 8
julia> df2 = DataFrame(a=[1, 2], b=[("p", "q"), ("r", "s")])
2×2 DataFrame
Row │ a b
│ Int64 Tuple…
─────┼───────────────────
1 │ 1 ("p", "q")
2 │ 2 ("r", "s")
julia> flatten(df2, :b)
4×2 DataFrame
Row │ a b
│ Int64 String
─────┼───────────────
1 │ 1 p
2 │ 1 q
3 │ 2 r
4 │ 2 s
julia> df3 = DataFrame(a=[1, 2], b=[[1, 2], [3, 4]], c=[[5, 6], [7]])
2×3 DataFrame
Row │ a b c
│ Int64 Array… Array…
─────┼───────────────────────
1 │ 1 [1, 2] [5, 6]
2 │ 2 [3, 4] [7]
julia> flatten(df3, [:b, :c])
ERROR: ArgumentError: Lengths of iterables stored in columns :b and :c are not the same in row 2
julia> df4 = DataFrame(a=[1, 2, 3],
b=[[1, 2], missing, missing],
c=[[5, 6], missing, [7, 8]])
3×3 DataFrame
Row │ a b c
│ Int64 Array…? Array…?
─────┼─────────────────────────
1 │ 1 [1, 2] [5, 6]
2 │ 2 missing missing
3 │ 3 missing [7, 8]
julia> flatten(df4, [:b, :c], scalar=Missing)
5×3 DataFrame
Row │ a b c
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 1 5
2 │ 1 2 6
3 │ 2 missing missing
4 │ 3 missing 7
5 │ 3 missing 8
#
Base.hcat — Function
hcat(df::AbstractDataFrame...;
makeunique::Bool=false, copycols::Bool=true)
Horizontally concatenate data frames.
If makeunique=false (the default) column names of passed objects must be unique. If makeunique=true then duplicate column names will be suffixed with _i (i starting at 1 for the first duplicate).
If copycols=true (the default) then the DataFrame returned by hcat will contain copied columns from the source data frames. If copycols=false then it will contain columns as they are stored in the source (without copying). This option should be used with caution as mutating either the columns in sources or in the returned DataFrame might lead to the corruption of the other object.
Metadata: hcat propagates table-level :note-style metadata for keys that are present in all passed data frames and have the same value; it propagates column-level :note-style metadata.
Example
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4:6, B=4:6)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 4
2 │ 5 5
3 │ 6 6
julia> df3 = hcat(df1, df2, makeunique=true)
3×4 DataFrame
Row │ A B A_1 B_1
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 1 4 4
2 │ 2 2 5 5
3 │ 3 3 6 6
julia> df3.A === df1.A
false
julia> df3 = hcat(df1, df2, makeunique=true, copycols=false);
julia> df3.A === df1.A
true
#
Base.insert! — Function
insert!(df::DataFrame, index::Integer, row::Union{Tuple, AbstractArray}; promote::Bool=false)
insert!(df::DataFrame, index::Integer, row::Union{DataFrameRow, NamedTuple,
AbstractDict, Tables.AbstractRow};
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset]))
Add one row to df at position index in-place, taking the values from row. index must be a integer between 1 and nrow(df)+1.
Column types of df are preserved, and new values are converted if necessary. An error is thrown if conversion fails.
If row is neither a DataFrameRow, NamedTuple nor AbstractDict then it must be a Tuple or an AbstractArray and columns are matched by order of appearance. In this case row must contain the same number of elements as the number of columns in df.
If row is a DataFrameRow, NamedTuple, AbstractDict, or Tables.AbstractRow then values in row are matched to columns in df based on names. The exact behavior depends on the cols argument value in the following way:
-
If
cols == :setequal(this is the default) thenrowmust contain exactly the same columns asdf(but possibly in a different order). -
If
cols == :orderequalthenrowmust contain the same columns in the same order (forAbstractDictthis option requires thatkeys(row)matchespropertynames(df)to allow for support of ordered dicts; however, ifrowis aDictan error is thrown as it is an unordered collection). -
If
cols == :intersectthenrowmay contain more columns thandf, but all column names that are present indfmust be present inrowand only they are used to populate a new row indf. -
If
cols == :subsetthen the behavior is like for:intersectbut if some column is missing inrowthen amissingvalue is pushed todf. -
If
cols == :unionthen columns missing indfthat are present inroware added todf(usingmissingfor existing rows) and amissingvalue is pushed to columns missing inrowthat are present indf.
If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.
As a special case, if df has no columns and row is a NamedTuple, DataFrameRow, or Tables.AbstractRow, columns are created for all values in row, using their names and order.
Please note that this function must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
Metadata: this function preserves table-level and column-level :note-style metadata.
See also: push!, pushfirst!
Examples
julia> df = DataFrame(A='a':'c', B=1:3)
3×2 DataFrame
Row │ A B
│ Char Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
julia> insert!(df, 2, (true, false), promote=true)
4×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ a 1
2 │ true 0
3 │ b 2
4 │ c 3
julia> insert!(df, 5, df[1, :])
5×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ a 1
2 │ true 0
3 │ b 2
4 │ c 3
5 │ a 1
julia> insert!(df, 1, (C="something", A=11, B=12), cols=:intersect)
6×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ 11 12
2 │ a 1
3 │ true 0
4 │ b 2
5 │ c 3
6 │ a 1
julia> insert!(df, 7, Dict(:A=>1.0, :C=>1.0), cols=:union)
7×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼──────────────────────────
1 │ 11 12 missing
2 │ a 1 missing
3 │ true 0 missing
4 │ b 2 missing
5 │ c 3 missing
6 │ a 1 missing
7 │ 1.0 missing 1.0
julia> insert!(df, 3, NamedTuple(), cols=:subset)
8×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼─────────────────────────────
1 │ 11 12 missing
2 │ a 1 missing
3 │ missing missing missing
4 │ true 0 missing
5 │ b 2 missing
6 │ c 3 missing
7 │ a 1 missing
8 │ 1.0 missing 1.0
#
DataFrames.insertcols — Function
insertcols(df::AbstractDataFrame[, col], (name=>val)::Pair...;
after::Bool=false, makeunique::Bool=false, copycols::Bool=true)
Insert a column into a copy of df data frame using the insertcols! function and return the newly created data frame.
If col is omitted it is set to ncol(df)+1 (the column is inserted as the last column).
Arguments
-
df: the data frame to which we want to add columns -
col: a position at which we want to insert a column, passed as an integer or a column name (a string or aSymbol); the column selected withcoland columns following it are shifted to the right indfafter the operation -
name: the name of the new column -
val: anAbstractVectorgiving the contents of the new column or a value of any type other thanAbstractArraywhich will be repeated to fill a new vector; As a particular rule a values stored in aRefor a0-dimensionalAbstractArrayare unwrapped and treated in the same way -
after: iftruecolumns are inserted aftercol -
makeunique: defines what to do ifnamealready exists indf; if it isfalsean error will be thrown; if it istruea new unique name will be generated by adding a suffix -
copycols: whether vectors passed as columns should be copied
If val is an AbstractRange then the result of collect(val) is inserted.
If df is a SubDataFrame then it must have been created with : as column selector (otherwise an error is thrown). In this case the copycols keyword argument is ignored (i.e. the added column is always copied) and the parent data frame’s column is filled with missing in rows that are filtered out by df.
If df isa DataFrame that has no columns and only values other than AbstractVector are passed then it is used to create a one-element column. If df isa DataFrame that has no columns and at least one AbstractVector is passed then its length is used to determine the number of elements in all created columns. In all other cases the number of rows in all created columns must match nrow(df).
Metadata: this function preserves table-level and column-level :note-style metadata.
See also insertcols!.
Examples
julia> df = DataFrame(a=1:3)
3×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> insertcols(df, 1, :b => 'a':'c')
3×2 DataFrame
Row │ b a
│ Char Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
julia> insertcols(df, :c => 2:4, :c => 3:5, makeunique=true)
3×3 DataFrame
Row │ a c c_1
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
2 │ 2 3 4
3 │ 3 4 5
julia> insertcols(df, :a, :d => 7:9, after=true)
3×2 DataFrame
Row │ a d
│ Int64 Int64
─────┼──────────────
1 │ 1 7
2 │ 2 8
3 │ 3 9
#
DataFrames.insertcols! — Function
insertcols!(df::AbstractDataFrame[, col], (name=>val)::Pair...;
after::Bool=false, makeunique::Bool=false, copycols::Bool=true)
Insert a column into a data frame in place. Return the updated data frame.
If col is omitted it is set to ncol(df)+1 (the column is inserted as the last column).
Arguments
-
df: the data frame to which we want to add columns -
col: a position at which we want to insert a column, passed as an integer or a column name (a string or aSymbol); the column selected withcoland columns following it are shifted to the right indfafter the operation -
name: the name of the new column -
val: anAbstractVectorgiving the contents of the new column or a value of any type other thanAbstractArraywhich will be repeated to fill a new vector; As a particular rule a values stored in aRefor a0-dimensionalAbstractArrayare unwrapped and treated in the same way -
after: iftruecolumns are inserted aftercol -
makeunique: defines what to do ifnamealready exists indf; if it isfalsean error will be thrown; if it istruea new unique name will be generated by adding a suffix -
copycols: whether vectors passed as columns should be copied
If val is an AbstractRange then the result of collect(val) is inserted.
If df is a SubDataFrame then it must have been created with : as column selector (otherwise an error is thrown). In this case the copycols keyword argument is ignored (i.e. the added column is always copied) and the parent data frame’s column is filled with missing in rows that are filtered out by df.
If df isa DataFrame that has no columns and only values other than AbstractVector are passed then it is used to create a one-element column. If df isa DataFrame that has no columns and at least one AbstractVector is passed then its length is used to determine the number of elements in all created columns. In all other cases the number of rows in all created columns must match nrow(df).
Metadata: this function preserves table-level and column-level :note-style metadata.
Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame).
See also insertcols.
Examples
julia> df = DataFrame(a=1:3)
3×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> insertcols!(df, 1, :b => 'a':'c')
3×2 DataFrame
Row │ b a
│ Char Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
julia> insertcols!(df, 2, :c => 2:4, :c => 3:5, makeunique=true)
3×4 DataFrame
Row │ b c c_1 a
│ Char Int64 Int64 Int64
─────┼───────────────────────────
1 │ a 2 3 1
2 │ b 3 4 2
3 │ c 4 5 3
julia> insertcols!(df, :b, :d => 7:9, after=true)
3×5 DataFrame
Row │ b d c c_1 a
│ Char Int64 Int64 Int64 Int64
─────┼──────────────────────────────────
1 │ a 7 2 3 1
2 │ b 8 3 4 2
3 │ c 9 4 5 3
#
Base.invpermute! — Function
invpermute!(df::AbstractDataFrame, p)
Like permute!, but the inverse of the given permutation is applied.
invpermute! will produce a correct result even if some columns of passed data frame or permutation p are identical (checked with ===). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then invpermute! result might be incorrect.
Metadata: this function preserves table-level and column-level :note-style metadata.
Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame).
Examples
julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
julia> permute!(df, [5, 3, 1, 2, 4])
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 5 10 15
2 │ 3 8 13
3 │ 1 6 11
4 │ 2 7 12
5 │ 4 9 14
julia> invpermute!(df, [5, 3, 1, 2, 4])
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
#
DataFrames.mapcols — Function
mapcols(f::Union{Function, Type}, df::AbstractDataFrame)
Return a DataFrame where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).
Note that mapcols guarantees not to reuse the columns from df in the returned DataFrame. If f returns its argument then it gets copied before being stored.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> mapcols(x -> x.^2, df)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 121
2 │ 4 144
3 │ 9 169
4 │ 16 196
#
DataFrames.mapcols! — Function
mapcols!(f::Union{Function, Type}, df::DataFrame)
Update a DataFrame in-place where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).
Note that mapcols! reuses the columns from df if they are returned by f.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> mapcols!(x -> x.^2, df);
julia> df
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 121
2 │ 4 144
3 │ 9 169
4 │ 16 196
#
Base.permute! — Function
permute!(df::AbstractDataFrame, p)
Permute data frame df in-place, according to permutation p. Throws ArgumentError if p is not a permutation.
To return a new data frame instead of permuting df in-place, use df[p, :].
permute! will produce a correct result even if some columns of passed data frame or permutation p are identical (checked with ===). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then permute! result might be incorrect.
Metadata: this function preserves table-level and column-level :note-style metadata.
Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame).
Examples
julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
julia> permute!(df, [5, 3, 1, 2, 4])
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 5 10 15
2 │ 3 8 13
3 │ 1 6 11
4 │ 2 7 12
5 │ 4 9 14
#
Base.prepend! — Function
prepend!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
prepend!(df::DataFrame, table; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
Add the rows of df2 to the beginning of df. If the second argument table is not an AbstractDataFrame then it is converted using DataFrame(table, copycols=false) before being prepended.
The exact behavior of prepend! depends on the cols argument:
-
If
cols == :setequal(this is the default) thendf2must contain exactly the same columns asdf(but possibly in a different order). -
If
cols == :orderequalthendf2must contain the same columns in the same order (forAbstractDictthis option requires thatkeys(row)matchespropertynames(df)to allow for support of ordered dicts; however, ifdf2is aDictan error is thrown as it is an unordered collection). -
If
cols == :intersectthendf2may contain more columns thandf, but all column names that are present indfmust be present indf2and only these are used. -
If
cols == :subsetthenappend!behaves like for:intersectbut if some column is missing indf2then amissingvalue is pushed todf. -
If
cols == :unionthenappend!adds columns missing indfthat are present indf2, for columns present indfbut missing indf2amissingvalue is pushed.
If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.
The above rule has the following exceptions:
-
If
dfhas no columns then copies of columns fromdf2are added to it. -
If
df2has no columns then callingprepend!leavesdfunchanged.
Please note that prepend! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
Metadata: table-level :note-style metadata and column-level :note-style metadata for columns present in df are preserved. If new columns are added their :note-style metadata is copied from the appended table. Other metadata is dropped.
See also: use pushfirst! to add individual rows at the beginning of a data frame, append! to add a table at the end, and vcat to vertically concatenate data frames.
Examples
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4.0:6.0, B=4:6)
3×2 DataFrame
Row │ A B
│ Float64 Int64
─────┼────────────────
1 │ 4.0 4
2 │ 5.0 5
3 │ 6.0 6
julia> prepend!(df1, df2);
julia> df1
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 4
2 │ 5 5
3 │ 6 6
4 │ 1 1
5 │ 2 2
6 │ 3 3
#
Base.push! — Function
push!(df::DataFrame, row::Union{Tuple, AbstractArray}; promote::Bool=false)
push!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict,
Tables.AbstractRow};
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset]))
Add one row at the end of df in-place, taking the values from row.
Column types of df are preserved, and new values are converted if necessary. An error is thrown if conversion fails.
If row is neither a DataFrameRow, NamedTuple nor AbstractDict then it must be a Tuple or an AbstractArray and columns are matched by order of appearance. In this case row must contain the same number of elements as the number of columns in df.
If row is a DataFrameRow, NamedTuple, AbstractDict, or Tables.AbstractRow then values in row are matched to columns in df based on names. The exact behavior depends on the cols argument value in the following way:
-
If
cols == :setequal(this is the default) thenrowmust contain exactly the same columns asdf(but possibly in a different order). -
If
cols == :orderequalthenrowmust contain the same columns in the same order (forAbstractDictthis option requires thatkeys(row)matchespropertynames(df)to allow for support of ordered dicts; however, ifrowis aDictan error is thrown as it is an unordered collection). -
If
cols == :intersectthenrowmay contain more columns thandf, but all column names that are present indfmust be present inrowand only they are used to populate a new row indf. -
If
cols == :subsetthen the behavior is like for:intersectbut if some column is missing inrowthen amissingvalue is pushed todf. -
If
cols == :unionthen columns missing indfthat are present inroware added todf(usingmissingfor existing rows) and amissingvalue is pushed to columns missing inrowthat are present indf.
If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.
As a special case, if df has no columns and row is a NamedTuple, DataFrameRow, or Tables.AbstractRow, columns are created for all values in row, using their names and order.
Please note that this function must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
Metadata: this function preserves table-level and column-level :note-style metadata.
See also: pushfirst!, insert!
Examples
julia> df = DataFrame(A='a':'c', B=1:3)
3×2 DataFrame
Row │ A B
│ Char Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
julia> push!(df, (true, false), promote=true)
4×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ true 0
julia> push!(df, df[1, :])
5×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ true 0
5 │ a 1
julia> push!(df, (C="something", A=11, B=12), cols=:intersect)
6×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ true 0
5 │ a 1
6 │ 11 12
julia> push!(df, Dict(:A=>1.0, :C=>1.0), cols=:union)
7×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼──────────────────────────
1 │ a 1 missing
2 │ b 2 missing
3 │ c 3 missing
4 │ true 0 missing
5 │ a 1 missing
6 │ 11 12 missing
7 │ 1.0 missing 1.0
julia> push!(df, NamedTuple(), cols=:subset)
8×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼─────────────────────────────
1 │ a 1 missing
2 │ b 2 missing
3 │ c 3 missing
4 │ true 0 missing
5 │ a 1 missing
6 │ 11 12 missing
7 │ 1.0 missing 1.0
8 │ missing missing missing
#
Base.pushfirst! — Function
pushfirst!(df::DataFrame, row::Union{Tuple, AbstractArray}; promote::Bool=false)
pushfirst!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict,
Tables.AbstractRow};
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset]))
Add one row at the beginning of df in-place, taking the values from row.
Column types of df are preserved, and new values are converted if necessary. An error is thrown if conversion fails.
If row is neither a DataFrameRow, NamedTuple nor AbstractDict then it must be a Tuple or an AbstractArray and columns are matched by order of appearance. In this case row must contain the same number of elements as the number of columns in df.
If row is a DataFrameRow, NamedTuple, AbstractDict, or Tables.AbstractRow then values in row are matched to columns in df based on names. The exact behavior depends on the cols argument value in the following way:
-
If
cols == :setequal(this is the default) thenrowmust contain exactly the same columns asdf(but possibly in a different order). -
If
cols == :orderequalthenrowmust contain the same columns in the same order (forAbstractDictthis option requires thatkeys(row)matchespropertynames(df)to allow for support of ordered dicts; however, ifrowis aDictan error is thrown as it is an unordered collection). -
If
cols == :intersectthenrowmay contain more columns thandf, but all column names that are present indfmust be present inrowand only they are used to populate a new row indf. -
If
cols == :subsetthen the behavior is like for:intersectbut if some column is missing inrowthen amissingvalue is pushed todf. -
If
cols == :unionthen columns missing indfthat are present inroware added todf(usingmissingfor existing rows) and amissingvalue is pushed to columns missing inrowthat are present indf.
If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.
As a special case, if df has no columns and row is a NamedTuple, DataFrameRow, or Tables.AbstractRow, columns are created for all values in row, using their names and order.
Please note that this function must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(A='a':'c', B=1:3)
3×2 DataFrame
Row │ A B
│ Char Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
julia> pushfirst!(df, (true, false), promote=true)
4×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ true 0
2 │ a 1
3 │ b 2
4 │ c 3
julia> pushfirst!(df, df[1, :])
5×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ true 0
2 │ true 0
3 │ a 1
4 │ b 2
5 │ c 3
julia> pushfirst!(df, (C="something", A=11, B=12), cols=:intersect)
6×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ 11 12
2 │ true 0
3 │ true 0
4 │ a 1
5 │ b 2
6 │ c 3
julia> pushfirst!(df, Dict(:A=>1.0, :C=>1.0), cols=:union)
7×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼──────────────────────────
1 │ 1.0 missing 1.0
2 │ 11 12 missing
3 │ true 0 missing
4 │ true 0 missing
5 │ a 1 missing
6 │ b 2 missing
7 │ c 3 missing
julia> pushfirst!(df, NamedTuple(), cols=:subset)
8×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼─────────────────────────────
1 │ missing missing missing
2 │ 1.0 missing 1.0
3 │ 11 12 missing
4 │ true 0 missing
5 │ true 0 missing
6 │ a 1 missing
7 │ b 2 missing
8 │ c 3 missing
#
Base.reduce — Function
reduce(::typeof(vcat),
dfs::Union{AbstractVector{<:AbstractDataFrame},
Tuple{AbstractDataFrame, Vararg{AbstractDataFrame}}};
cols::Union{Symbol, AbstractVector{Symbol},
AbstractVector{<:AbstractString}}=:setequal,
source::Union{Nothing, Symbol, AbstractString,
Pair{<:Union{Symbol, AbstractString}, <:AbstractVector}}=nothing,
init::AbstractDataFrame=DataFrame())
Efficiently reduce the given vector or tuple of AbstractDataFrames with vcat.
See the vcat docstring for a description of keyword arguments cols and source.
The keyword argument init is the initial value to use in the reductions. It must be a data frame that has zero rows. It is not taken into account when computing the value of the source column nor when determining metadata of the produced data frame.
The column order, names, and types of the resulting DataFrame, and the behavior of cols and source keyword arguments follow the rules specified for vcat of AbstractDataFrames.
Metadata: vcat propagates table-level :note-style metadata for keys that are present in all passed data frames and have the same value. vcat propagates column-level :note-style metadata for keys that are present in all passed data frames that contain this column and have the same value.
Example
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4:6, B=4:6)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 4
2 │ 5 5
3 │ 6 6
julia> df3 = DataFrame(A=7:9, C=7:9)
3×2 DataFrame
Row │ A C
│ Int64 Int64
─────┼──────────────
1 │ 7 7
2 │ 8 8
3 │ 9 9
julia> reduce(vcat, (df1, df2))
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4
5 │ 5 5
6 │ 6 6
julia> reduce(vcat, [df1, df2, df3], cols=:union, source=:source)
9×4 DataFrame
Row │ A B C source
│ Int64 Int64? Int64? Int64
─────┼─────────────────────────────────
1 │ 1 1 missing 1
2 │ 2 2 missing 1
3 │ 3 3 missing 1
4 │ 4 4 missing 2
5 │ 5 5 missing 2
6 │ 6 6 missing 2
7 │ 7 missing 7 3
8 │ 8 missing 8 3
9 │ 9 missing 9 3
#
Base.repeat — Function
repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)
Construct a data frame by repeating rows in df. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated.
Metadata: this function preserves table-level and column-level :note-style metadata.
Example
julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat(df, inner=2, outer=3)
12×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 1 3
3 │ 2 4
4 │ 2 4
5 │ 1 3
6 │ 1 3
7 │ 2 4
8 │ 2 4
9 │ 1 3
10 │ 1 3
11 │ 2 4
12 │ 2 4
repeat(df::AbstractDataFrame, count::Integer)
Construct a data frame by repeating each row in df the number of times specified by count.
Metadata: this function preserves table-level and column-level :note-style metadata.
Example
julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat(df, 2)
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
3 │ 1 3
4 │ 2 4
#
DataFrames.repeat! — Function
repeat!(df::DataFrame; inner::Integer=1, outer::Integer=1)
Update a data frame df in-place by repeating its rows. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated. Columns of df are freshly allocated.
Metadata: this function preserves table-level and column-level :note-style metadata.
Example
julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat!(df, inner=2, outer=3);
julia> df
12×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 1 3
3 │ 2 4
4 │ 2 4
5 │ 1 3
6 │ 1 3
7 │ 2 4
8 │ 2 4
9 │ 1 3
10 │ 1 3
11 │ 2 4
12 │ 2 4
repeat!(df::DataFrame, count::Integer)
Update a data frame df in-place by repeating its rows the number of times specified by count. Columns of df are freshly allocated.
Metadata: this function preserves table-level and column-level :note-style metadata.
Example
julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat(df, 2)
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
3 │ 1 3
4 │ 2 4
#
Base.reverse — Function
reverse(df::AbstractDataFrame, start=1, stop=nrow(df))
Return a data frame containing the rows in df in reversed order. If start and stop are provided, only rows in the start:stop range are affected.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
julia> reverse(df)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 5 10 15
2 │ 4 9 14
3 │ 3 8 13
4 │ 2 7 12
5 │ 1 6 11
julia> reverse(df, 2, 3)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 3 8 13
3 │ 2 7 12
4 │ 4 9 14
5 │ 5 10 15
#
Base.reverse! — Function
reverse!(df::AbstractDataFrame, start=1, stop=nrow(df))
Mutate data frame in-place to reverse its row order. If start and stop are provided, only rows in the start:stop range are affected.
reverse! will produce a correct result even if some columns of passed data frame are identical (checked with ===). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then reverse! result might be incorrect.
Metadata: this function preserves table-level and column-level :note-style metadata.
Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame).
Examples
julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
julia> reverse!(df)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 5 10 15
2 │ 4 9 14
3 │ 3 8 13
4 │ 2 7 12
5 │ 1 6 11
julia> reverse!(df, 2, 3)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 5 10 15
2 │ 3 8 13
3 │ 4 9 14
4 │ 2 7 12
5 │ 1 6 11
#
DataFrames.select — Function
select(df::AbstractDataFrame, args...;
copycols::Bool=true, renamecols::Bool=true, threads::Bool=true)
select(args::Callable, df::DataFrame;
renamecols::Bool=true, threads::Bool=true)
select(gd::GroupedDataFrame, args...;
copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
select(f::Base.Callable, gd::GroupedDataFrame;
copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
Create a new data frame that contains columns from df or gd specified by args and return it. The result is guaranteed to have the same number of rows as df, except when no columns are selected (in which case the result has zero rows).
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
-
combine: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups inGroupedDataFrame; it is typically used to compute summary statistics by group; forGroupedDataFrameif grouping columns are kept they are put as first columns in the result; -
select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!is an in-place version ofselect; forGroupedDataFrameif grouping columns are kept they are put as first columns in the result; -
transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!is an in-place version oftransform; existing columns in the source data frame are put as first columns in the result;
As a special case, if a GroupedDataFrame that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:
-
standard column selectors (integers,
Symbols, strings, vectors of integers, vectors ofSymbols, vectors of strings,All,Cols,:,Between,Notand regular expressions) -
a
cols => functionpair indicating thatfunctionshould be called with positional arguments holding columnscols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunctionreturns a single value or a vector; the generated name is created by concatenating source column name andfunctionname by default (see examples below). -
a
cols => function => target_colsform additionally explicitly specifying the target column or columns, which must be a single name (as aSymbolor a string), a vector of names orAsTable. Additionally it can be aFunctionwhich takes a string or a vector of strings as an argument containing names of columns selected bycols, and returns the target columns names (all accepted types exceptAsTableare allowed). -
a
col => target_colspair, which renames the columncoltotarget_cols, which must be single name (as aSymbolor a string), a vector of names orAsTable. -
column-independent operations
function => target_colsor justfunctionfor specificfunctions where the input columns are omitted; withouttarget_colsthe new column has the same name asfunction, otherwise it must be single name (as aSymbolor a string). Supportedfunctions are:-
nrowto efficiently compute the number of rows in each group. -
proprowto efficiently compute the proportion of rows in each group. -
eachindexto return a vector holding the number of each row within each group. -
groupindicesto return the group number.
-
-
vectors or matrices containing transformations specified by the
Pairsyntax described in points 2 to 5 -
a function which will be called with a
SubDataFramecorresponding to each group if aGroupedDataFrameis processed, or with the data frame itself if anAbstractDataFrameis processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrameavoids excessive compilation)
Note! If the expression of the form x => y is passed then except for the special convenience form nrow => target_cols it is always interpreted as cols => function. In particular the following expression function => target_cols is not a valid transformation specification.
Note! If cols or target_cols are one of All, Cols, Between, or Not, broadcasting using .=> is supported and is equivalent to broadcasting the result of names(df, cols) or names(df, target_cols). This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).
As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function. The documentation of DataFrames.table_transformation provides more information about this functionality, in particular covering performance considerations.
What is allowed for function to return is determined by the target_cols value:
-
If both
colsandtarget_colsare omitted (so only afunctionis passed), then returning a data frame, a matrix, aNamedTuple, aTables.AbstractRowor aDataFrameRowwill produce multiple columns in the result. Returning any other value produces a single column. -
If
target_colsis aSymbolor a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple, aTables.AbstractRow, or aDataFrameRowraises an error. -
If
target_colsis a vector ofSymbols or strings orAsTableit is assumed thatfunctionreturns multiple columns. Iffunctionreturns one ofAbstractDataFrame,NamedTuple,DataFrameRow,Tables.AbstractRow,AbstractMatrixthen rules described in point 1 above apply. Iffunctionreturns anAbstractVectorthen each element of this vector must support thekeysfunction, which must return a collection ofSymbols, strings or integers; the return value ofkeysmust be identical for all elements. Then as many columns are created as there are elements in the return value of thekeysfunction. Iftarget_colsisAsTablethen their names are set to be equal to the key names except ifkeysreturns integers, in which case they are prefixed byx(so the column names are e.g.x1,x2, …). Iftarget_colsis a vector ofSymbols or strings then column names produced using the rules above are ignored and replaced bytarget_cols(the number of columns must be the same as the length oftarget_colsin this case). Iffunreturns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntablefunction is called on it to get the resulting columns and their names. The names are retained whentarget_colsisAsTableand are replaced iftarget_colsis a vector ofSymbols or strings.
In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.
select/select! and transform/transform! always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).
For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols or a NamedTuple of them if specified columns are wrapped in AsTable. If ByRow is used it is allowed for cols to select an empty set of columns, in which case function is called for each row without any arguments and an empty NamedTuple is passed if empty set of columns is wrapped in AsTable.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular a syntax to move column :col to the first position in the data frame is select!(df, :col, :). On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1, [:x1, :x2], :x1 => :x2, :x1 => identity => :x2, or :x1 => (x -> @view x[inds]) (note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols keyword argument:
-
if
copycols=truethen results of such transformations always perform a copy of the source column or its view; -
if
copycols=falsethen copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using===, which excludes views of source columns) performs a copy;
Note that performing transform! or select! assumes that copycols=false.
If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns and in this case transforming or renaming columns is not allowed.
If a GroupedDataFrame is passed and threads=true (the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure), use locks to control parallel accesses, or threads=false must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame inputs.
In order to improve the performance of the operations some transformations invoke optimized implementation, see DataFrames.table_transformation for details.
Keyword arguments
-
copycols::Bool=true: whether columns of the source data frame should be copied if no transformation is applied to them. -
renamecols::Bool=true: whether in thecols => functionform automatically generated column names should include the name of transformation functions or not. -
keepkeys::Bool=true: whether grouping columns ofgdshould be kept in the returned data frame. -
ungroup::Bool=true: whether the return value of the operation ongdshould be a data frame or aGroupedDataFrame. -
threads::Bool=true: whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalseif some transformations require serial execution or are not thread-safe.
Metadata: this function propagates table-level :note-style metadata. Column-level :note-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity or copy to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame if the output has the same name as a grouping column and keepkeys=true, metadata is taken from original grouping column.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> select(df, 2)
3×1 DataFrame
Row │ b
│ Int64
─────┼───────
1 │ 4
2 │ 5
3 │ 6
julia> select(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
Row │ c b
│ Float64 Int64
─────┼─────────────────
1 │ 0.841471 4
2 │ 0.909297 5
3 │ 0.14112 6
julia> select(df, :, [:a, :b] => (a, b) -> a .+ b .- sum(b)/length(b))
3×3 DataFrame
Row │ a b a_b_function
│ Int64 Int64 Float64
─────┼────────────────────────────
1 │ 1 4 0.0
2 │ 2 5 2.0
3 │ 3 6 4.0
julia> select(df, All() .=> [minimum maximum])
3×4 DataFrame
Row │ a_minimum b_minimum a_maximum b_maximum
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────
1 │ 1 4 3 6
2 │ 1 4 3 6
3 │ 1 4 3 6
julia> using Statistics
julia> select(df, AsTable(:) => ByRow(mean), renamecols=false)
3×1 DataFrame
Row │ a_b
│ Float64
─────┼─────────
1 │ 2.5
2 │ 3.5
3 │ 4.5
julia> select(df, AsTable(:) => ByRow(mean) => x -> join(x, "_"))
3×1 DataFrame
Row │ a_b
│ Float64
─────┼─────────
1 │ 2.5
2 │ 3.5
3 │ 4.5
julia> select(first, df)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 1 4
3 │ 1 4
julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 4 7
2 │ 2 5 8
3 │ 3 6 9
julia> select(df, AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => :stats,
AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => AsTable)
3×3 DataFrame
Row │ stats mean std
│ NamedTup… Float64 Float64
─────┼───────────────────────────────────────────
1 │ (mean = 4.0, std = 3.0) 4.0 3.0
2 │ (mean = 5.0, std = 3.0) 5.0 3.0
3 │ (mean = 6.0, std = 3.0) 6.0 3.0
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
b=repeat([2, 1], outer=[4]),
c=1:8)
8×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 1 2
3 │ 1 2 3
4 │ 2 1 4
5 │ 2 2 5
6 │ 1 1 6
7 │ 1 2 7
8 │ 2 1 8
julia> gd = groupby(df, :a)
GroupedDataFrame with 2 groups based on key: a
First Group (5 rows): a = 1
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 1 2
3 │ 1 2 3
4 │ 1 1 6
5 │ 1 2 7
⋮
Last Group (3 rows): a = 2
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 4
2 │ 2 2 5
3 │ 2 1 8
specifying a name for target column
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c)
8×2 DataFrame
Row │ a sum_log_c
│ Int64 Float64
─────┼──────────────────
1 │ 1 5.52943
2 │ 1 5.52943
3 │ 1 5.52943
4 │ 2 5.07517
5 │ 2 5.07517
6 │ 1 5.52943
7 │ 1 5.52943
8 │ 2 5.07517
julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs
8×3 DataFrame
Row │ a b_sum c_sum
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 8 19
2 │ 1 8 19
3 │ 1 8 19
4 │ 2 4 17
5 │ 2 4 17
6 │ 1 8 19
7 │ 1 8 19
8 │ 2 4 17
multiple arguments, renaming and keepkeys
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> select(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
8×3 DataFrame
Row │ b1 c1 b_c_+
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 3
2 │ 1 2 3
3 │ 2 3 5
4 │ 1 4 5
5 │ 2 5 7
6 │ 1 6 7
7 │ 2 7 9
8 │ 1 8 9
broadcasting and column expansion
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> select(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
8×4 DataFrame
Row │ a b min max
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 2
2 │ 1 1 1 2
3 │ 1 2 2 3
4 │ 2 1 1 4
5 │ 2 2 2 5
6 │ 1 1 1 6
7 │ 1 2 2 7
8 │ 2 1 1 8
julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
8×4 DataFrame
Row │ a b c b_c
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 3
2 │ 1 1 2 3
3 │ 1 2 3 5
4 │ 2 1 4 5
5 │ 2 2 5 7
6 │ 1 1 6 7
7 │ 1 2 7 9
8 │ 2 1 8 9
column-independent operations
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> select(gd, nrow, proprow, groupindices, eachindex)
8×5 DataFrame
Row │ a nrow proprow groupindices eachindex
│ Int64 Int64 Float64 Int64 Int64
─────┼────────────────────────────────────────────────
1 │ 1 5 0.625 1 1
2 │ 1 5 0.625 1 2
3 │ 1 5 0.625 1 3
4 │ 2 3 0.375 2 1
5 │ 2 3 0.375 2 2
6 │ 1 5 0.625 1 4
7 │ 1 5 0.625 1 5
8 │ 2 3 0.375 2 3
#
DataFrames.select! — Function
select!(df::AbstractDataFrame, args...;
renamecols::Bool=true, threads::Bool=true)
select!(args::Base.Callable, df::DataFrame;
renamecols::Bool=true, threads::Bool=true)
select!(gd::GroupedDataFrame, args...; ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
Mutate df or gd in place to retain only columns or transformations specified by args... and return it. The result is guaranteed to have the same number of rows as df or parent of gd, except when no columns are selected (in which case the result has zero rows).
If a SubDataFrame or GroupedDataFrame{SubDataFrame} is passed, the parent data frame is updated using columns generated by args..., following the same rules as indexing:
-
for existing columns filtered-out rows are filled with values present in the old columns
-
for new columns (which is only allowed if
SubDataFramewas created with:as column selector) filtered-out rows are filled withmissing -
dropped columns (which are only allowed if
SubDataFramewas created with:as column selector) are removed -
if
SubDataFramewas not created with:as column selector thenselect!is only allowed if the transformations keep exactly the same sequence of column names as is in the passeddf
If a GroupedDataFrame is passed then it is updated to reflect the new rows of its updated parent. If there are independent GroupedDataFrame objects constructed using the same parent data frame they might get corrupt.
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
-
combine: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups inGroupedDataFrame; it is typically used to compute summary statistics by group; forGroupedDataFrameif grouping columns are kept they are put as first columns in the result; -
select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!is an in-place version ofselect; forGroupedDataFrameif grouping columns are kept they are put as first columns in the result; -
transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!is an in-place version oftransform; existing columns in the source data frame are put as first columns in the result;
As a special case, if a GroupedDataFrame that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:
-
standard column selectors (integers,
Symbols, strings, vectors of integers, vectors ofSymbols, vectors of strings,All,Cols,:,Between,Notand regular expressions) -
a
cols => functionpair indicating thatfunctionshould be called with positional arguments holding columnscols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunctionreturns a single value or a vector; the generated name is created by concatenating source column name andfunctionname by default (see examples below). -
a
cols => function => target_colsform additionally explicitly specifying the target column or columns, which must be a single name (as aSymbolor a string), a vector of names orAsTable. Additionally it can be aFunctionwhich takes a string or a vector of strings as an argument containing names of columns selected bycols, and returns the target columns names (all accepted types exceptAsTableare allowed). -
a
col => target_colspair, which renames the columncoltotarget_cols, which must be single name (as aSymbolor a string), a vector of names orAsTable. -
column-independent operations
function => target_colsor justfunctionfor specificfunctions where the input columns are omitted; withouttarget_colsthe new column has the same name asfunction, otherwise it must be single name (as aSymbolor a string). Supportedfunctions are:-
nrowto efficiently compute the number of rows in each group. -
proprowto efficiently compute the proportion of rows in each group. -
eachindexto return a vector holding the number of each row within each group. -
groupindicesto return the group number.
-
-
vectors or matrices containing transformations specified by the
Pairsyntax described in points 2 to 5 -
a function which will be called with a
SubDataFramecorresponding to each group if aGroupedDataFrameis processed, or with the data frame itself if anAbstractDataFrameis processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrameavoids excessive compilation)
Note! If the expression of the form x => y is passed then except for the special convenience form nrow => target_cols it is always interpreted as cols => function. In particular the following expression function => target_cols is not a valid transformation specification.
Note! If cols or target_cols are one of All, Cols, Between, or Not, broadcasting using .=> is supported and is equivalent to broadcasting the result of names(df, cols) or names(df, target_cols). This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).
As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function. The documentation of DataFrames.table_transformation provides more information about this functionality, in particular covering performance considerations.
What is allowed for function to return is determined by the target_cols value:
-
If both
colsandtarget_colsare omitted (so only afunctionis passed), then returning a data frame, a matrix, aNamedTuple, aTables.AbstractRowor aDataFrameRowwill produce multiple columns in the result. Returning any other value produces a single column. -
If
target_colsis aSymbolor a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple, aTables.AbstractRow, or aDataFrameRowraises an error. -
If
target_colsis a vector ofSymbols or strings orAsTableit is assumed thatfunctionreturns multiple columns. Iffunctionreturns one ofAbstractDataFrame,NamedTuple,DataFrameRow,Tables.AbstractRow,AbstractMatrixthen rules described in point 1 above apply. Iffunctionreturns anAbstractVectorthen each element of this vector must support thekeysfunction, which must return a collection ofSymbols, strings or integers; the return value ofkeysmust be identical for all elements. Then as many columns are created as there are elements in the return value of thekeysfunction. Iftarget_colsisAsTablethen their names are set to be equal to the key names except ifkeysreturns integers, in which case they are prefixed byx(so the column names are e.g.x1,x2, …). Iftarget_colsis a vector ofSymbols or strings then column names produced using the rules above are ignored and replaced bytarget_cols(the number of columns must be the same as the length oftarget_colsin this case). Iffunreturns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntablefunction is called on it to get the resulting columns and their names. The names are retained whentarget_colsisAsTableand are replaced iftarget_colsis a vector ofSymbols or strings.
In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.
select/select! and transform/transform! always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).
For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols or a NamedTuple of them if specified columns are wrapped in AsTable. If ByRow is used it is allowed for cols to select an empty set of columns, in which case function is called for each row without any arguments and an empty NamedTuple is passed if empty set of columns is wrapped in AsTable.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular a syntax to move column :col to the first position in the data frame is select!(df, :col, :). On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1, [:x1, :x2], :x1 => :x2, :x1 => identity => :x2, or :x1 => (x -> @view x[inds]) (note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols keyword argument:
-
if
copycols=truethen results of such transformations always perform a copy of the source column or its view; -
if
copycols=falsethen copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using===, which excludes views of source columns) performs a copy;
Note that performing transform! or select! assumes that copycols=false.
If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns and in this case transforming or renaming columns is not allowed.
If a GroupedDataFrame is passed and threads=true (the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure), use locks to control parallel accesses, or threads=false must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame inputs.
In order to improve the performance of the operations some transformations invoke optimized implementation, see DataFrames.table_transformation for details.
Keyword arguments
-
renamecols::Bool=true: whether in thecols => functionform automatically generated column names should include the name of transformation functions or not. -
ungroup::Bool=true: whether the return value of the operation ongdshould be a data frame or aGroupedDataFrame. -
threads::Bool=true: whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalseif some transformations require serial execution or are not thread-safe.
Metadata: this function propagates table-level :note-style metadata. Column-level :note-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity or copy to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame if the output has the same name as a grouping column and keepkeys=true, metadata is taken from original grouping column.
See select for examples.
#
Random.shuffle — Function
shuffle([rng=GLOBAL_RNG,] df::AbstractDataFrame)
Return a copy of df with randomly permuted rows. The optional rng argument specifies a random number generator.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> using Random
julia> rng = MersenneTwister(1234);
julia> shuffle(rng, DataFrame(a=1:5, b=1:5))
5×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 2
2 │ 1 1
3 │ 4 4
4 │ 3 3
5 │ 5 5
#
Random.shuffle! — Function
shuffle!([rng=GLOBAL_RNG,] df::AbstractDataFrame)
Randomly permute rows of df in-place. The optional rng argument specifies a random number generator.
shuffle! will produce a correct result even if some columns of passed data frame are identical (checked with ===). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then shuffle! result might be incorrect.
Metadata: this function preserves table-level and column-level :note-style metadata.
Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame).
Examples
julia> using Random
julia> rng = MersenneTwister(1234);
julia> shuffle!(rng, DataFrame(a=1:5, b=1:5))
5×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 2
2 │ 1 1
3 │ 4 4
4 │ 3 3
5 │ 5 5
#
DataFrames.table_transformation — Function
table_transformation(df_sel::AbstractDataFrame, fun)
This is the function called when AsTable(...) => fun is requested. The df_sel argument is a data frame storing columns selected by the AsTable(...) selector.
By default it calls default_table_transformation. However, it is allowed to add special methods for specific types of fun, as long as the result matches what would be produced by default_table_transformation, except that it is allowed to perform eltype conversion of the resulting vectors or value type promotions that are consistent with promote_type.
It is guaranteed that df_sel has at least one column.
The main use of special table_transformation methods is to provide more efficient than the default implementations of requested fun transformation.
This function might become a part of the public API of DataFrames.jl in the future, currently it should be considered experimental.
Fast paths are implemented within DataFrames.jl for the following functions fun:
-
sum,ByRow(sum),ByRow(sum∘skipmissing) -
length,ByRow(length),ByRow(length∘skipmissing) -
mean,ByRow(mean),ByRow(mean∘skipmissing) -
ByRow(var),ByRow(var∘skipmissing) -
ByRow(std),ByRow(std∘skipmissing) -
ByRow(median),ByRow(median∘skipmissing) -
minimum,ByRow(minimum),ByRow(minimum∘skipmissing) -
maximum,ByRow(maximum),ByRow(maximum∘skipmissing) -
fun∘collectandByRow(fun∘collect)wherefunis any function
Note that in order to improve the performance ByRow(sum), ByRow(sum∘skipmissing), ByRow(mean), and ByRow(mean∘skipmissing) perform all operations in the target element type. In some very rare cases (like mixing very large Int64 values and Float64 values) it can lead to a result different from the one that would be obtained by calling the function outside of DataFrames.jl. The way to avoid this precision loss is to use an anonymous function, e.g. instead of ByRow(sum) use ByRow(x -> sum(x)). However, in general for such scenarios even standard aggregation functions should not be considered to provide reliable output, and users are recommended to switch to higher precision calculations. An example of a case when standard sum is affected by the situation discussed is:
julia> sum(Any[typemax(Int), typemax(Int), 1.0]) -1.0 julia> sum(Any[1.0, typemax(Int), typemax(Int)]) 1.8446744073709552e19
#
DataFrames.transform — Function
transform(df::AbstractDataFrame, args...;
copycols::Bool=true, renamecols::Bool=true, threads::Bool=true)
transform(f::Callable, df::DataFrame;
renamecols::Bool=true, threads::Bool=true)
transform(gd::GroupedDataFrame, args...;
copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
transform(f::Base.Callable, gd::GroupedDataFrame;
copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
Create a new data frame that contains columns from df or gd plus columns specified by args and return it. The result is guaranteed to have the same number of rows as df. Equivalent to select(df, :, args...) or select(gd, :, args...).
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
-
combine: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups inGroupedDataFrame; it is typically used to compute summary statistics by group; forGroupedDataFrameif grouping columns are kept they are put as first columns in the result; -
select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!is an in-place version ofselect; forGroupedDataFrameif grouping columns are kept they are put as first columns in the result; -
transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!is an in-place version oftransform; existing columns in the source data frame are put as first columns in the result;
As a special case, if a GroupedDataFrame that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:
-
standard column selectors (integers,
Symbols, strings, vectors of integers, vectors ofSymbols, vectors of strings,All,Cols,:,Between,Notand regular expressions) -
a
cols => functionpair indicating thatfunctionshould be called with positional arguments holding columnscols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunctionreturns a single value or a vector; the generated name is created by concatenating source column name andfunctionname by default (see examples below). -
a
cols => function => target_colsform additionally explicitly specifying the target column or columns, which must be a single name (as aSymbolor a string), a vector of names orAsTable. Additionally it can be aFunctionwhich takes a string or a vector of strings as an argument containing names of columns selected bycols, and returns the target columns names (all accepted types exceptAsTableare allowed). -
a
col => target_colspair, which renames the columncoltotarget_cols, which must be single name (as aSymbolor a string), a vector of names orAsTable. -
column-independent operations
function => target_colsor justfunctionfor specificfunctions where the input columns are omitted; withouttarget_colsthe new column has the same name asfunction, otherwise it must be single name (as aSymbolor a string). Supportedfunctions are:-
nrowto efficiently compute the number of rows in each group. -
proprowto efficiently compute the proportion of rows in each group. -
eachindexto return a vector holding the number of each row within each group. -
groupindicesto return the group number.
-
-
vectors or matrices containing transformations specified by the
Pairsyntax described in points 2 to 5 -
a function which will be called with a
SubDataFramecorresponding to each group if aGroupedDataFrameis processed, or with the data frame itself if anAbstractDataFrameis processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrameavoids excessive compilation)
Note! If the expression of the form x => y is passed then except for the special convenience form nrow => target_cols it is always interpreted as cols => function. In particular the following expression function => target_cols is not a valid transformation specification.
Note! If cols or target_cols are one of All, Cols, Between, or Not, broadcasting using .=> is supported and is equivalent to broadcasting the result of names(df, cols) or names(df, target_cols). This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).
As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function. The documentation of DataFrames.table_transformation provides more information about this functionality, in particular covering performance considerations.
What is allowed for function to return is determined by the target_cols value:
-
If both
colsandtarget_colsare omitted (so only afunctionis passed), then returning a data frame, a matrix, aNamedTuple, aTables.AbstractRowor aDataFrameRowwill produce multiple columns in the result. Returning any other value produces a single column. -
If
target_colsis aSymbolor a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple, aTables.AbstractRow, or aDataFrameRowraises an error. -
If
target_colsis a vector ofSymbols or strings orAsTableit is assumed thatfunctionreturns multiple columns. Iffunctionreturns one ofAbstractDataFrame,NamedTuple,DataFrameRow,Tables.AbstractRow,AbstractMatrixthen rules described in point 1 above apply. Iffunctionreturns anAbstractVectorthen each element of this vector must support thekeysfunction, which must return a collection ofSymbols, strings or integers; the return value ofkeysmust be identical for all elements. Then as many columns are created as there are elements in the return value of thekeysfunction. Iftarget_colsisAsTablethen their names are set to be equal to the key names except ifkeysreturns integers, in which case they are prefixed byx(so the column names are e.g.x1,x2, …). Iftarget_colsis a vector ofSymbols or strings then column names produced using the rules above are ignored and replaced bytarget_cols(the number of columns must be the same as the length oftarget_colsin this case). Iffunreturns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntablefunction is called on it to get the resulting columns and their names. The names are retained whentarget_colsisAsTableand are replaced iftarget_colsis a vector ofSymbols or strings.
In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.
select/select! and transform/transform! always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).
For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols or a NamedTuple of them if specified columns are wrapped in AsTable. If ByRow is used it is allowed for cols to select an empty set of columns, in which case function is called for each row without any arguments and an empty NamedTuple is passed if empty set of columns is wrapped in AsTable.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular a syntax to move column :col to the first position in the data frame is select!(df, :col, :). On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1, [:x1, :x2], :x1 => :x2, :x1 => identity => :x2, or :x1 => (x -> @view x[inds]) (note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols keyword argument:
-
if
copycols=truethen results of such transformations always perform a copy of the source column or its view; -
if
copycols=falsethen copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using===, which excludes views of source columns) performs a copy;
Note that performing transform! or select! assumes that copycols=false.
If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns and in this case transforming or renaming columns is not allowed.
If a GroupedDataFrame is passed and threads=true (the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure), use locks to control parallel accesses, or threads=false must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame inputs.
In order to improve the performance of the operations some transformations invoke optimized implementation, see DataFrames.table_transformation for details.
Keyword arguments
-
copycols::Bool=true: whether columns of the source data frame should be copied if no transformation is applied to them. -
renamecols::Bool=true: whether in thecols => functionform automatically generated column names should include the name of transformation functions or not. -
keepkeys::Bool=true: whether grouping columns ofgdshould be kept in the returned data frame. -
ungroup::Bool=true: whether the return value of the operation ongdshould be a data frame or aGroupedDataFrame. -
threads::Bool=true: whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalseif some transformations require serial execution or are not thread-safe.
Note that when the first argument is a GroupedDataFrame, keepkeys=false is needed to be able to return a different value for the grouping column:
Metadata: this function propagates table-level :note-style metadata. Column-level :note-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity or copy to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame if the output has the same name as a grouping column and keepkeys=true, metadata is taken from original grouping column.
Examples
julia> gdf = groupby(DataFrame(x=1:2), :x)
GroupedDataFrame with 2 groups based on key: x
First Group (1 row): x = 1
Row │ x
│ Int64
─────┼───────
1 │ 1
⋮
Last Group (1 row): x = 2
Row │ x
│ Int64
─────┼───────
1 │ 2
julia> transform(gdf, x -> (x=10,), keepkeys=false)
2×1 DataFrame
Row │ x
│ Int64
─────┼───────
1 │ 10
2 │ 10
julia> transform(gdf, x -> (x=10,), keepkeys=true)
ERROR: ArgumentError: column :x in returned data frame is not equal to grouping key :x
See select for more examples.
#
DataFrames.transform! — Function
transform!(df::AbstractDataFrame, args...;
renamecols::Bool=true, threads::Bool=true)
transform!(args::Callable, df::AbstractDataFrame;
renamecols::Bool=true, threads::Bool=true)
transform!(gd::GroupedDataFrame, args...;
ungroup::Bool=true, renamecols::Bool=true, threads::Bool=true)
transform!(f::Base.Callable, gd::GroupedDataFrame;
ungroup::Bool=true, renamecols::Bool=true, threads::Bool=true)
Mutate df or gd in place to add columns specified by args... and return it. The result is guaranteed to have the same number of rows as df. Equivalent to select!(df, :, args...) or select!(gd, :, args...), except that column renaming performs a copy.
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
-
combine: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups inGroupedDataFrame; it is typically used to compute summary statistics by group; forGroupedDataFrameif grouping columns are kept they are put as first columns in the result; -
select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!is an in-place version ofselect; forGroupedDataFrameif grouping columns are kept they are put as first columns in the result; -
transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!is an in-place version oftransform; existing columns in the source data frame are put as first columns in the result;
As a special case, if a GroupedDataFrame that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:
-
standard column selectors (integers,
Symbols, strings, vectors of integers, vectors ofSymbols, vectors of strings,All,Cols,:,Between,Notand regular expressions) -
a
cols => functionpair indicating thatfunctionshould be called with positional arguments holding columnscols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunctionreturns a single value or a vector; the generated name is created by concatenating source column name andfunctionname by default (see examples below). -
a
cols => function => target_colsform additionally explicitly specifying the target column or columns, which must be a single name (as aSymbolor a string), a vector of names orAsTable. Additionally it can be aFunctionwhich takes a string or a vector of strings as an argument containing names of columns selected bycols, and returns the target columns names (all accepted types exceptAsTableare allowed). -
a
col => target_colspair, which renames the columncoltotarget_cols, which must be single name (as aSymbolor a string), a vector of names orAsTable. -
column-independent operations
function => target_colsor justfunctionfor specificfunctions where the input columns are omitted; withouttarget_colsthe new column has the same name asfunction, otherwise it must be single name (as aSymbolor a string). Supportedfunctions are:-
nrowto efficiently compute the number of rows in each group. -
proprowto efficiently compute the proportion of rows in each group. -
eachindexto return a vector holding the number of each row within each group. -
groupindicesto return the group number.
-
-
vectors or matrices containing transformations specified by the
Pairsyntax described in points 2 to 5 -
a function which will be called with a
SubDataFramecorresponding to each group if aGroupedDataFrameis processed, or with the data frame itself if anAbstractDataFrameis processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrameavoids excessive compilation)
Note! If the expression of the form x => y is passed then except for the special convenience form nrow => target_cols it is always interpreted as cols => function. In particular the following expression function => target_cols is not a valid transformation specification.
Note! If cols or target_cols are one of All, Cols, Between, or Not, broadcasting using .=> is supported and is equivalent to broadcasting the result of names(df, cols) or names(df, target_cols). This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).
As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function. The documentation of DataFrames.table_transformation provides more information about this functionality, in particular covering performance considerations.
What is allowed for function to return is determined by the target_cols value:
-
If both
colsandtarget_colsare omitted (so only afunctionis passed), then returning a data frame, a matrix, aNamedTuple, aTables.AbstractRowor aDataFrameRowwill produce multiple columns in the result. Returning any other value produces a single column. -
If
target_colsis aSymbolor a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple, aTables.AbstractRow, or aDataFrameRowraises an error. -
If
target_colsis a vector ofSymbols or strings orAsTableit is assumed thatfunctionreturns multiple columns. Iffunctionreturns one ofAbstractDataFrame,NamedTuple,DataFrameRow,Tables.AbstractRow,AbstractMatrixthen rules described in point 1 above apply. Iffunctionreturns anAbstractVectorthen each element of this vector must support thekeysfunction, which must return a collection ofSymbols, strings or integers; the return value ofkeysmust be identical for all elements. Then as many columns are created as there are elements in the return value of thekeysfunction. Iftarget_colsisAsTablethen their names are set to be equal to the key names except ifkeysreturns integers, in which case they are prefixed byx(so the column names are e.g.x1,x2, …). Iftarget_colsis a vector ofSymbols or strings then column names produced using the rules above are ignored and replaced bytarget_cols(the number of columns must be the same as the length oftarget_colsin this case). Iffunreturns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntablefunction is called on it to get the resulting columns and their names. The names are retained whentarget_colsisAsTableand are replaced iftarget_colsis a vector ofSymbols or strings.
In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.
select/select! and transform/transform! always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).
For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols or a NamedTuple of them if specified columns are wrapped in AsTable. If ByRow is used it is allowed for cols to select an empty set of columns, in which case function is called for each row without any arguments and an empty NamedTuple is passed if empty set of columns is wrapped in AsTable.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular a syntax to move column :col to the first position in the data frame is select!(df, :col, :). On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1, [:x1, :x2], :x1 => :x2, :x1 => identity => :x2, or :x1 => (x -> @view x[inds]) (note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols keyword argument:
-
if
copycols=truethen results of such transformations always perform a copy of the source column or its view; -
if
copycols=falsethen copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using===, which excludes views of source columns) performs a copy;
Note that performing transform! or select! assumes that copycols=false.
If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns and in this case transforming or renaming columns is not allowed.
If a GroupedDataFrame is passed and threads=true (the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure), use locks to control parallel accesses, or threads=false must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame inputs.
In order to improve the performance of the operations some transformations invoke optimized implementation, see DataFrames.table_transformation for details.
Keyword arguments
-
renamecols::Bool=true: whether in thecols => functionform automatically generated column names should include the name of transformation functions or not. -
ungroup::Bool=true: whether the return value of the operation ongdshould be a data frame or aGroupedDataFrame. -
threads::Bool=true: whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalseif some transformations require serial execution or are not thread-safe.
Metadata: this function propagates table-level :note-style metadata. Column-level :note-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity or copy to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame if the output has the same name as a grouping column and keepkeys=true, metadata is taken from original grouping column.
See select for examples.
#
Base.vcat — Function
vcat(dfs::AbstractDataFrame...;
cols::Union{Symbol, AbstractVector{Symbol},
AbstractVector{<:AbstractString}}=:setequal,
source::Union{Nothing, Symbol, AbstractString,
Pair{<:Union{Symbol, AbstractString}, <:AbstractVector}}=nothing)
Vertically concatenate AbstractDataFrames.
The cols keyword argument determines the columns of the returned data frame:
-
:setequal: require all data frames to have the same column names disregarding order. If they appear in different orders, the order of the first provided data frame is used. -
:orderequal: require all data frames to have the same column names and in the same order. -
:intersect: only the columns present in all provided data frames are kept. If the intersection is empty, an empty data frame is returned. -
:union: columns present in at least one of the provided data frames are kept. Columns not present in some data frames are filled withmissingwhere necessary. -
A vector of
Symbols or strings: only listed columns are kept. Columns not present in some data frames are filled withmissingwhere necessary.
The source keyword argument, if not nothing (the default), specifies the additional column to be added in the last position in the resulting data frame that will identify the source data frame. It can be a Symbol or an AbstractString, in which case the identifier will be the number of the passed source data frame, or a Pair consisting of a Symbol or an AbstractString and of a vector specifying the data frame identifiers (which do not have to be unique). The name of the source column is not allowed to be present in any source data frame.
The order of columns is determined by the order they appear in the included data frames, searching through the header of the first data frame, then the second, etc.
The element types of columns are determined using promote_type, as with vcat for AbstractVectors.
vcat ignores empty data frames when composing the result (except for metadata), making it possible to initialize an empty data frame at the beginning of a loop and vcat onto it.
Metadata: vcat propagates table-level :note-style metadata for keys that are present in all passed data frames and have the same value. vcat propagates column-level :note-style metadata for keys that are present in all passed data frames that contain this column and have the same value.
Example
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4:6, B=4:6)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 4
2 │ 5 5
3 │ 6 6
julia> df3 = DataFrame(A=7:9, C=7:9)
3×2 DataFrame
Row │ A C
│ Int64 Int64
─────┼──────────────
1 │ 7 7
2 │ 8 8
3 │ 9 9
julia> df4 = DataFrame()
0×0 DataFrame
julia> vcat(df1, df2)
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4
5 │ 5 5
6 │ 6 6
julia> vcat(df1, df3, cols=:union)
6×3 DataFrame
Row │ A B C
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 1 missing
2 │ 2 2 missing
3 │ 3 3 missing
4 │ 7 missing 7
5 │ 8 missing 8
6 │ 9 missing 9
julia> vcat(df1, df3, cols=:intersect)
6×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 7
5 │ 8
6 │ 9
julia> vcat(df4, df1)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> vcat(df1, df2, df3, df4, cols=:union, source="source")
9×4 DataFrame
Row │ A B C source
│ Int64 Int64? Int64? Int64
─────┼─────────────────────────────────
1 │ 1 1 missing 1
2 │ 2 2 missing 1
3 │ 3 3 missing 1
4 │ 4 4 missing 2
5 │ 5 5 missing 2
6 │ 6 6 missing 2
7 │ 7 missing 7 3
8 │ 8 missing 8 3
9 │ 9 missing 9 3
julia> vcat(df1, df2, df4, df3, cols=:union, source=:source => 'a':'d')
9×4 DataFrame
Row │ A B C source
│ Int64 Int64? Int64? Char
─────┼─────────────────────────────────
1 │ 1 1 missing a
2 │ 2 2 missing a
3 │ 3 3 missing a
4 │ 4 4 missing b
5 │ 5 5 missing b
6 │ 6 6 missing b
7 │ 7 missing 7 d
8 │ 8 missing 8 d
9 │ 9 missing 9 d
Reshaping data frames between tall and wide formats
#
Base.stack — Function
stack(df::AbstractDataFrame[, measure_vars[, id_vars] ];
variable_name=:variable, value_name=:value,
view::Bool=false, variable_eltype::Type=String)
Stack a data frame df, i.e. convert it from wide to long format.
Return the long-format DataFrame with: columns for each of the id_vars, column value_name (:value by default) holding the values of the stacked columns (measure_vars), and column variable_name (:variable by default) a vector holding the name of the corresponding measure_vars variable.
If view=true then return a stacked view of a data frame (long format). The result is a view because the columns are special AbstractVectors that return views into the original data frame.
Arguments
-
df: the AbstractDataFrame to be stacked -
measure_vars: the columns to be stacked (the measurement variables), as a column selector (Symbol, string or integer;:,Cols,All,Between,Not, a regular expression, or a vector ofSymbols, strings or integers). If neithermeasure_varsorid_varsare given,measure_varsdefaults to all floating point columns. -
id_vars: the identifier columns that are repeated during stacking, as a column selector (Symbol, string or integer;:,Cols,All,Between,Not, a regular expression, or a vector ofSymbols, strings or integers). Defaults to all variables that are notmeasure_vars -
variable_name: the name (Symbolor string) of the new stacked column that shall hold the names of each ofmeasure_vars -
value_name: the name (Symbolor string) of the new stacked column containing the values from each ofmeasure_vars -
view: whether the stacked data frame should be a view rather than contain freshly allocated vectors. -
variable_eltype: determines the element type of columnvariable_name. By default aPooledArray{String}is created. Ifvariable_eltype=SymbolaPooledVector{Symbol}is created, and ifvariable_eltype=CategoricalValue{String}aCategoricalArray{String}is produced (callusing CategoricalArraysfirst if needed) Passing any other typeTwill produce aPooledVector{T}column as long as it supports conversion fromString. Whenview=true, aRepeatedVector{T}is produced.
Metadata: table-level :note-style metadata and column-level :note-style metadata for identifier columns are preserved.
Examples
julia> df = DataFrame(a=repeat(1:3, inner=2),
b=repeat(1:2, inner=3),
c=repeat(1:1, inner=6),
d=repeat(1:6, inner=1),
e=string.('a':'f'))
6×5 DataFrame
Row │ a b c d e
│ Int64 Int64 Int64 Int64 String
─────┼────────────────────────────────────
1 │ 1 1 1 1 a
2 │ 1 1 1 2 b
3 │ 2 1 1 3 c
4 │ 2 2 1 4 d
5 │ 3 2 1 5 e
6 │ 3 2 1 6 f
julia> stack(df, [:c, :d])
12×5 DataFrame
Row │ a b e variable value
│ Int64 Int64 String String Int64
─────┼───────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
julia> stack(df, [:c, :d], [:a])
12×3 DataFrame
Row │ a variable value
│ Int64 String Int64
─────┼────────────────────────
1 │ 1 c 1
2 │ 1 c 1
3 │ 2 c 1
4 │ 2 c 1
5 │ 3 c 1
6 │ 3 c 1
7 │ 1 d 1
8 │ 1 d 2
9 │ 2 d 3
10 │ 2 d 4
11 │ 3 d 5
12 │ 3 d 6
julia> stack(df, Not([:a, :b, :e]))
12×5 DataFrame
Row │ a b e variable value
│ Int64 Int64 String String Int64
─────┼───────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
julia> stack(df, Not([:a, :b, :e]), variable_name=:somemeasure)
12×5 DataFrame
Row │ a b e somemeasure value
│ Int64 Int64 String String Int64
─────┼──────────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
#
DataFrames.unstack — Function
unstack(df::AbstractDataFrame, rowkeys, colkey, value;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
unstack(df::AbstractDataFrame, colkey, value;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
unstack(df::AbstractDataFrame;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
Unstack data frame df, i.e. convert it from long to wide format.
Row and column keys are ordered in the order of their first appearance.
Positional arguments
-
df: the AbstractDataFrame to be unstacked -
rowkeys: the columns with a unique key for each row, if not given, find a key by grouping on anything not acolkeyorvalue. Can be any column selector (Symbol, string or integer;:,Cols,All,Between,Not, a regular expression, or a vector ofSymbols, strings or integers). Ifrowkeyscontains no columns all rows are assumed to have the same key. -
colkey: the column (Symbol, string or integer) holding the column names in wide format, defaults to:variable -
values: the column storing values (Symbol, string or integer), defaults to:value
Keyword arguments
-
renamecols: a function called on each unique value incolkey; it must return the name of the column to be created (typically as a string or aSymbol). Duplicates in resulting names when converted toSymbolare not allowed. By default no transformation is performed. -
allowmissing: iffalse(the default) then an error is thrown ifcolkeycontainsmissingvalues; iftruethen a column referring tomissingvalue is created. -
combine: ifonly(the default) then an error is thrown if combination ofrowkeysandcolkeycontains duplicate entries. Otherwise the passed value must be a function that is called on a vector view containing all elements for each combination ofrowkeysandcolkeypresent in the data. -
fill: missing row/column combinations are filled with this value. The default ismissing. If thevaluecolumn is aCategoricalVectorandfillis notmissingthen in order to keep unstacked value columns alsoCategoricalVectorthefillmust be passed asCategoricalValue -
threads: whethercombinefunction may be run in separate tasks which can execute in parallel (possibly being applied to multiple groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalseifcombinerequires serial execution or is not thread-safe.
Metadata: table-level :note-style metadata and column-level :note-style metadata for row keys columns are preserved.
Deprecations
-
allowduplicateskeyword argument is deprecated; instead usecombinekeyword argument; an equivalent toallowduplicates=trueiscombine=lastand toallowduplicates=falseiscombine=only(the default);
Examples
julia> wide = DataFrame(id=1:6,
a=repeat(1:3, inner=2),
b=repeat(1.0:2.0, inner=3),
c=repeat(1.0:1.0, inner=6),
d=repeat(1.0:3.0, inner=2))
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> long = stack(wide)
18×4 DataFrame
Row │ id a variable value
│ Int64 Int64 String Float64
─────┼─────────────────────────────────
1 │ 1 1 b 1.0
2 │ 2 1 b 1.0
3 │ 3 2 b 1.0
4 │ 4 2 b 2.0
5 │ 5 3 b 2.0
6 │ 6 3 b 2.0
7 │ 1 1 c 1.0
8 │ 2 1 c 1.0
⋮ │ ⋮ ⋮ ⋮ ⋮
12 │ 6 3 c 1.0
13 │ 1 1 d 1.0
14 │ 2 1 d 1.0
15 │ 3 2 d 2.0
16 │ 4 2 d 2.0
17 │ 5 3 d 3.0
18 │ 6 3 d 3.0
3 rows omitted
julia> unstack(long)
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64? Float64? Float64?
─────┼────────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> unstack(long, :variable, :value)
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64? Float64? Float64?
─────┼────────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> unstack(long, :id, :variable, :value)
6×4 DataFrame
Row │ id b c d
│ Int64 Float64? Float64? Float64?
─────┼─────────────────────────────────────
1 │ 1 1.0 1.0 1.0
2 │ 2 1.0 1.0 1.0
3 │ 3 1.0 1.0 2.0
4 │ 4 2.0 1.0 2.0
5 │ 5 2.0 1.0 3.0
6 │ 6 2.0 1.0 3.0
julia> unstack(long, [:id, :a], :variable, :value)
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64? Float64? Float64?
─────┼────────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> unstack(long, :id, :variable, :value, renamecols=x->Symbol(:_, x))
6×4 DataFrame
Row │ id _b _c _d
│ Int64 Float64? Float64? Float64?
─────┼─────────────────────────────────────
1 │ 1 1.0 1.0 1.0
2 │ 2 1.0 1.0 1.0
3 │ 3 1.0 1.0 2.0
4 │ 4 2.0 1.0 2.0
5 │ 5 2.0 1.0 3.0
6 │ 6 2.0 1.0 3.0
Note that there are some differences between the widened results above.
julia> df = DataFrame(id=["1", "1", "2"],
variable=["Var1", "Var2", "Var1"],
value=[1, 2, 3])
3×3 DataFrame
Row │ id variable value
│ String String Int64
─────┼─────────────────────────
1 │ 1 Var1 1
2 │ 1 Var2 2
3 │ 2 Var1 3
julia> unstack(df, :variable, :value, fill=0)
2×3 DataFrame
Row │ id Var1 Var2
│ String Int64 Int64
─────┼──────────────────────
1 │ 1 1 2
2 │ 2 3 0
julia> df = DataFrame(cols=["a", "a", "b"], values=[1, 2, 4])
3×2 DataFrame
Row │ cols values
│ String Int64
─────┼────────────────
1 │ a 1
2 │ a 2
3 │ b 4
julia> unstack(df, :cols, :values, combine=copy)
1×2 DataFrame
Row │ a b
│ Array…? Array…?
─────┼──────────────────
1 │ [1, 2] [4]
julia> unstack(df, :cols, :values, combine=sum)
1×2 DataFrame
Row │ a b
│ Int64? Int64?
─────┼────────────────
1 │ 3 4
#
Base.permutedims — Function
permutedims(df::AbstractDataFrame,
[src_namescol::Union{Int, Symbol, AbstractString}],
[dest_namescol::Union{Symbol, AbstractString}];
makeunique::Bool=false, strict::Bool=true)
Turn df on its side such that rows become columns and values in the column indexed by src_namescol become the names of new columns. In the resulting DataFrame, column names of df will become the first column with name specified by dest_namescol.
Arguments
-
df: theAbstractDataFrame -
src_namescol: the column that will become the new header. If omitted then column names:x1,:x2, … are generated automatically. -
dest_namescol: the name of the first column in the returnedDataFrame. Defaults to the same name assrc_namescol. Not supported whensrc_namescolis a vector or is omitted. -
makeunique: iffalse(the default), an error will be raised if duplicate names are found; iftrue, duplicate names will be suffixed with_i(istarting at 1 for the first duplicate). Not supported whensrc_namescolis omitted. -
strict: iftrue(the default), an error will be raised if the values contained in thesrc_namescolare not allSymbolor allAbstractString, or can all be converted toStringusingconvert. Iffalsethen any values are accepted and the will be changed to strings using thestringfunction. Not supported whensrc_namescolis a vector or is omitted.
The element types of columns in resulting DataFrame (other than the first column if it is created from df column names, which always has element type String) will depend on the element types of all input columns based on the result of promote_type. That is, if the source data frame contains Int and Float64 columns, resulting columns will have element type Float64. If the source has Int and String columns, resulting columns will have element type Any.
|
Metadata: table-level :note-style metadata is preserved and column-level metadata is dropped.
Examples
julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> permutedims(df)
2×2 DataFrame
Row │ x1 x2
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 3 4
julia> permutedims(df, [:p, :q])
2×2 DataFrame
Row │ p q
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 3 4
julia> df1 = DataFrame(a=["x", "y"], b=[1.0, 2.0], c=[3, 4], d=[true, false])
2×4 DataFrame
Row │ a b c d
│ String Float64 Int64 Bool
─────┼───────────────────────────────
1 │ x 1.0 3 true
2 │ y 2.0 4 false
julia> permutedims(df1, 1) # note the column types
3×3 DataFrame
Row │ a x y
│ String Float64 Float64
─────┼──────────────────────────
1 │ b 1.0 2.0
2 │ c 3.0 4.0
3 │ d 1.0 0.0
julia> df2 = DataFrame(a=["x", "y"], b=[1, "two"], c=[3, 4], d=[true, false])
2×4 DataFrame
Row │ a b c d
│ String Any Int64 Bool
─────┼───────────────────────────
1 │ x 1 3 true
2 │ y two 4 false
julia> permutedims(df2, 1, "different_name")
3×3 DataFrame
Row │ different_name x y
│ String Any Any
─────┼─────────────────────────────
1 │ b 1 two
2 │ c 3 4
3 │ d true false
Sorting
#
Base.issorted — Function
issorted(df::AbstractDataFrame, cols=All();
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)
Test whether data frame df sorted by column(s) cols. Checking against multiple columns is done lexicographically.
cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). If cols selects no columns, check whether df is sorted on all columns (this behaviour is deprecated and will change in future versions).
If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below).
Since having repeated elements makes multiple sorting orders valid, the checkunique keyword allows for the situation to be caught. If checkunique is true and duplicate elements are found an error will be thrown. The use of the checkunique keyword is only supported when neither the by nor the lt keywords are being used. Similarly, the use of order(...) clauses that specify either by or lt are not supported, but specifying rev by itself is allowed.
The by keyword allows providing a function that will be applied to each cell before comparison; the lt keyword allows providing a custom "less than" function. If both by and lt are specified, the lt function is applied to the result of the by function.
Keyword arguments specifying sorting order (rev, lt or by) can either be a single value, or a vector of length equal to the number of columns the operation is performed on. When a single value is passed, it applies to all columns. When a vector is passed, each entry applies to the column in the corresponding position in cols.
Examples
julia> df = DataFrame(a=[1, 2, 3, 4], b=[4, 3, 2, 1])
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 3
3 │ 3 2
4 │ 4 1
julia> issorted(df)
true
julia> issorted(df, :a)
true
julia> issorted(df, :b)
false
julia> issorted(df, :b, rev=true)
true
#
DataFrames.order — Function
order(col::ColumnIndex; kwargs...)
Specify sorting order for a column col in a data frame. kwargs can be lt, by, rev, and order with values following the rules defined in sort!.
Examples
julia> df = DataFrame(x=[-3, -1, 0, 2, 4], y=1:5)
5×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ -3 1
2 │ -1 2
3 │ 0 3
4 │ 2 4
5 │ 4 5
julia> sort(df, order(:x, rev=true))
5×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 4 5
2 │ 2 4
3 │ 0 3
4 │ -1 2
5 │ -3 1
julia> sort(df, order(:x, by=abs))
5×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 0 3
2 │ -1 2
3 │ 2 4
4 │ -3 1
5 │ 4 5
#
Base.sort — Function
sort(df::AbstractDataFrame, cols=All();
alg::Union{Algorithm, Nothing}=nothing,
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
view::Bool=false,
checkunique::Bool=false)
Return a data frame containing the rows in df sorted by column(s) cols. Sorting on multiple columns is done lexicographically.
cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). If cols selects no columns, sort df on all columns (this behaviour is deprecated and will change in future versions).
If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below).
Since having repeated elements makes multiple sorting orders valid, the checkunique keyword allows for the situation to be caught. If checkunique is true and duplicate elements are found an error will be thrown. The use of the checkunique keyword is only supported when neither the by nor the lt keywords are being used. Similarly, the use of order(...) clauses that specify either by or lt are not supported, but specifying rev by itself is allowed.
The by keyword allows providing a function that will be applied to each cell before comparison; the lt keyword allows providing a custom "less than" function. If both by and lt are specified, the lt function is applied to the result of the by function.
Keyword arguments specifying sorting order (rev, lt or by) can either be a single value, or a vector of length equal to the number of columns the operation is performed on. When a single value is passed, it applies to all columns. When a vector is passed, each entry applies to the column in the corresponding position in cols.
If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df.
If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> sort(df, :x)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
julia> sort(df, [:x, :y])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 b
2 │ 1 c
3 │ 2 a
4 │ 3 b
julia> sort(df, [:x, :y], rev=true)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
3 │ 1 c
4 │ 1 b
julia> sort(df, [:x, order(:y, rev=true)])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
#
Base.sort! — Function
sort!(df::AbstractDataFrame, cols=All();
alg::Union{Algorithm, Nothing}=nothing,
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)
Sort data frame df by column(s) cols by permuting its rows in-place. Sorting on multiple columns is done lexicographicallly.
cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). If cols selects no columns, sort df on all columns (this behaviour is deprecated and will change in future versions).
If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below).
Since having repeated elements makes multiple sorting orders valid, the checkunique keyword allows for the situation to be caught. If checkunique is true and duplicate elements are found an error will be thrown. The use of the checkunique keyword is only supported when neither the by nor the lt keywords are being used. Similarly, the use of order(...) clauses that specify either by or lt are not supported, but specifying rev by itself is allowed.
The by keyword allows providing a function that will be applied to each cell before comparison; the lt keyword allows providing a custom "less than" function. If both by and lt are specified, the lt function is applied to the result of the by function.
Keyword arguments specifying sorting order (rev, lt or by) can either be a single value, or a vector of length equal to the number of columns the operation is performed on. When a single value is passed, it applies to all columns. When a vector is passed, each entry applies to the column in the corresponding position in cols.
If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df.
sort! will produce a correct result even if some columns of passed data frame are identical (checked with ===). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then sort! result might be incorrect.
Metadata: this function preserves table-level and column-level :note-style metadata.
Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame).
Examples
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> sort!(df, :x)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
julia> sort!(df, [:x, :y])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 b
2 │ 1 c
3 │ 2 a
4 │ 3 b
julia> sort!(df, [:x, :y], rev=true)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
3 │ 1 c
4 │ 1 b
julia> sort!(df, [:x, order(:y, rev=true)])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
#
Base.sortperm — Function
sortperm(df::AbstractDataFrame, cols=All();
alg::Union{Algorithm, Nothing}=nothing,
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)
Return a permutation vector of row indices of data frame df that puts them in sorted order according to column(s) cols. Order on multiple columns is computed lexicographically.
cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). If cols selects no columns, return permutation vector based on sorting all columns (this behaviour is deprecated and will change in future versions).
If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below).
Since having repeated elements makes multiple sorting orders valid, the checkunique keyword allows for the situation to be caught. If checkunique is true and duplicate elements are found an error will be thrown. The use of the checkunique keyword is only supported when neither the by nor the lt keywords are being used. Similarly, the use of order(...) clauses that specify either by or lt are not supported, but specifying rev by itself is allowed.
The by keyword allows providing a function that will be applied to each cell before comparison; the lt keyword allows providing a custom "less than" function. If both by and lt are specified, the lt function is applied to the result of the by function.
Keyword arguments specifying sorting order (rev, lt or by) can either be a single value, or a vector of length equal to the number of columns the operation is performed on. When a single value is passed, it applies to all columns. When a vector is passed, each entry applies to the column in the corresponding position in cols.
If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df.
Examples
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> sortperm(df, :x)
4-element Vector{Int64}:
2
4
3
1
julia> sortperm(df, [:x, :y])
4-element Vector{Int64}:
4
2
3
1
julia> sortperm(df, [:x, :y], rev=true)
4-element Vector{Int64}:
1
3
2
4
julia> sortperm(df, [:x, order(:y, rev=true)])
4-element Vector{Int64}:
2
4
3
1
Joining
#
DataAPI.antijoin — Function
antijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error)
Perform an anti join of two data frame objects and return a DataFrame containing the result. An anti join returns the subset of rows of df1 that do not match with the keys in df2.
The order of rows in the result is kept from df1.
Arguments
-
df1,df2: theAbstractDataFramesto be joined
Keyword Arguments
-
on: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). Aleft=>rightpair of names can be used instead of a name, for the case where a key has different names indf1anddf2(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal.onis a required argument. -
makeunique: ignored as no columns are added todf1columns (it is provided for consistency with other functions). -
validate: whether to check that columns passed as theonargument define unique keys in each input data frame (according toisequal). Can be a tuple or a pair, with the first element indicating whether to run check fordf1and the second element fordf2. By default no check is performed. -
matchmissing: if equal to:errorthrow an error ifmissingis present inoncolumns; if equal to:equalthenmissingis allowed and missings are matched; if equal to:notequalthen missings are dropped indf2oncolumns.
It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
Metadata: table-level and column-level :note-style metadata are taken from df1.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> antijoin(name, job, on = :ID)
1×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 3 Joe Blogs
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> antijoin(name, job2, on = :ID => :identifier)
1×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 3 Joe Blogs
julia> antijoin(name, job2, on = [:ID => :identifier])
1×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 3 Joe Blogs
#
DataAPI.crossjoin — Function
crossjoin(df1::AbstractDataFrame, df2::AbstractDataFrame;
makeunique::Bool=false, renamecols=identity => identity)
crossjoin(df1, df2, dfs...; makeunique = false)
Perform a cross join of two or more data frame objects and return a DataFrame containing the result. A cross join returns the cartesian product of rows from all passed data frames, where the first passed data frame is assigned to the dimension that changes the slowest and the last data frame is assigned to the dimension that changes the fastest.
Arguments
-
df1,df2,dfs...: theAbstractDataFramesto be joined
Keyword Arguments
-
makeunique: iffalse(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue, duplicate names will be suffixed with_i(istarting at 1 for the first duplicate). -
renamecols: aPairspecifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbolcan be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString.
If more than two data frames are passed, the join is performed recursively with left associativity.
Metadata: table-level :note-style metadata is preserved only for keys which are defined in all passed tables and have the same value. Column-level :note-style metadata is preserved from both tables.
Examples
julia> df1 = DataFrame(X=1:3)
3×1 DataFrame
Row │ X
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> df2 = DataFrame(Y=["a", "b"])
2×1 DataFrame
Row │ Y
│ String
─────┼────────
1 │ a
2 │ b
julia> crossjoin(df1, df2)
6×2 DataFrame
Row │ X Y
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 b
3 │ 2 a
4 │ 2 b
5 │ 3 a
6 │ 3 b
#
DataAPI.innerjoin — Function
innerjoin(df1, df2; on, makeunique=false, validate=(false, false),
renamecols=(identity => identity), matchmissing=:error,
order=:undefined)
innerjoin(df1, df2, dfs...; on, makeunique=false,
validate=(false, false), matchmissing=:error,
order=:undefined)
Perform an inner join of two or more data frame objects and return a DataFrame containing the result. An inner join includes rows with keys that match in all passed data frames.
In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df1. This behavior may change in future releases.
Arguments
-
df1,df2,dfs...: theAbstractDataFramesto be joined
Keyword Arguments
-
on: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). When joining only two data frames, aleft=>rightpair of names can be used instead of a name, for the case where a key has different names indf1anddf2(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal.onis a required argument. -
makeunique: iffalse(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue, duplicate names will be suffixed with_i(istarting at 1 for the first duplicate). -
validate: whether to check that columns passed as theonargument define unique keys in each input data frame (according toisequal). Can be a tuple or a pair, with the first element indicating whether to run check fordf1and the second element fordf2. By default no check is performed. -
renamecols: aPairspecifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbolcan be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString. Note thatrenamecolsdoes not affectoncolumns, whose names are always taken from the left data frame and left unchanged. -
matchmissing: if equal to:errorthrow an error ifmissingis present inoncolumns; if equal to:equalthenmissingis allowed and missings are matched; if equal to:notequalthen missings are dropped indf1anddf2oncolumns. -
order: if:undefined(the default) the order of rows in the result is undefined and may change in future releases. If:leftthen the order of rows from the left data frame is retained. If:rightthen the order of rows from the right data frame is retained.
It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
If more than two data frames are passed, the join is performed recursively with left associativity. In this case the validate keyword argument is applied recursively with left associativity.
Metadata: table-level :note-style metadata and column-level :note-style metadata for key columns is preserved only for keys which are defined in all passed tables and have the same value. Column-level :note-style metadata is preserved for all other columns.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> innerjoin(name, job, on = :ID)
2×3 DataFrame
Row │ ID Name Job
│ Int64 String String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> innerjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
2×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String String
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
julia> innerjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
2×3 DataFrame
Row │ ID NAME job
│ Int64 String String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
#
DataAPI.leftjoin — Function
leftjoin(df1, df2; on, makeunique=false, source=nothing, validate=(false, false),
renamecols=(identity => identity), matchmissing=:error, order=:undefined)
Perform a left join of two data frame objects and return a DataFrame containing the result. A left join includes all rows from df1.
In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df1. This behavior may change in future releases.
Arguments
-
df1,df2: theAbstractDataFramesto be joined
Keyword Arguments
-
on: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). Aleft=>rightpair of names can be used instead of a name, for the case where a key has different names indf1anddf2(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal.onis a required argument. -
makeunique: iffalse(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue, duplicate names will be suffixed with_i(istarting at 1 for the first duplicate). -
source: Default:nothing. If aSymbolor string, adds indicator column with the given name, for whether a row appeared in onlydf1("left_only") or in both ("both"). If the name is already in use, the column name will be modified ifmakeunique=true. -
validate: whether to check that columns passed as theonargument define unique keys in each input data frame (according toisequal). Can be a tuple or a pair, with the first element indicating whether to run check fordf1and the second element fordf2. By default no check is performed. -
renamecols: aPairspecifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbolcan be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString. Note thatrenamecolsdoes not affectoncolumns, whose names are always taken from the left data frame and left unchanged. -
matchmissing: if equal to:errorthrow an error ifmissingis present inoncolumns; if equal to:equalthenmissingis allowed and missings are matched; if equal to:notequalthen missings are dropped indf2oncolumns. -
order: if:undefined(the default) the order of rows in the result is undefined and may change in future releases. If:leftthen the order of rows from the left data frame is retained. If:rightthen the order of rows from the right data frame is retained (non-matching rows are put at the end).
All columns of the returned data frame will support missing values.
It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
Metadata: table-level and column-level :note-style metadata is taken from df1 (including key columns), except for columns added to it from df2, whose column-level :note-style metadata is taken from df2.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> leftjoin(name, job, on = :ID)
3×3 DataFrame
Row │ ID Name Job
│ Int64 String String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> leftjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
3×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String String?
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
julia> leftjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
3×3 DataFrame
Row │ ID NAME job
│ Int64 String String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
#
DataFrames.leftjoin! — Function
leftjoin!(df1, df2; on, makeunique=false, source=nothing,
matchmissing=:error)
Perform a left join of two data frame objects by updating the df1 with the joined columns from df2.
A left join includes all rows from df1 and leaves all rows and columns from df1 untouched. Note that each row in df1 must have at most one match in df2. Otherwise, this function would not be able to execute the join in-place since new rows would need to be added to df1.
Arguments
-
df1,df2: theAbstractDataFramesto be joined
Keyword Arguments
-
on: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). Aleft=>rightpair of names can be used instead of a name, for the case where a key has different names indf1anddf2(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal.onis a required argument. -
makeunique: iffalse(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue, duplicate names will be suffixed with_i(istarting at 1 for the first duplicate). -
source: Default:nothing. If aSymbolor string, adds indicator column with the given name, for whether a row appeared in onlydf1("left_only") or in both ("both"). If the name is already in use, the column name will be modified ifmakeunique=true. -
matchmissing: if equal to:errorthrow an error ifmissingis present inoncolumns; if equal to:equalthenmissingis allowed and missings are matched; if equal to:notequalthen missings are dropped indf2oncolumns.
The columns added to df1 from df2 will support missing values.
It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.
Metadata: table-level and column-level :note-style metadata are taken from df1 (including key columns), except for columns added to it from df2, whose column-level :note-style metadata is taken from df2.
See also: leftjoin.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> leftjoin!(name, job, on = :ID)
3×3 DataFrame
Row │ ID Name Job
│ Int64 String String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> leftjoin!(name, job2, on = :ID => :identifier, makeunique=true, source=:source)
3×5 DataFrame
Row │ ID Name Job Job_1 source
│ Int64 String String? String? String
─────┼───────────────────────────────────────────────
1 │ 1 John Doe Lawyer Lawyer both
2 │ 2 Jane Doe Doctor Doctor both
3 │ 3 Joe Blogs missing missing left_only
#
DataAPI.outerjoin — Function
outerjoin(df1, df2; on, makeunique=false, source=nothing, validate=(false, false),
renamecols=(identity => identity), matchmissing=:error, order=:undefined)
outerjoin(df1, df2, dfs...; on, makeunique = false,
validate = (false, false), matchmissing=:error, order=:undefined)
Perform an outer join of two or more data frame objects and return a DataFrame containing the result. An outer join includes rows with keys that appear in any of the passed data frames.
The order of rows in the result is undefined and may change in future releases.
In the returned data frame the type of the columns on which the data frames are joined is determined by the element type of these columns both df1 and df2. This behavior may change in future releases.
Arguments
-
df1,df2,dfs...: theAbstractDataFramesto be joined
Keyword Arguments
-
on: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). When joining only two data frames, aleft=>rightpair of names can be used instead of a name, for the case where a key has different names indf1anddf2(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal.onis a required argument. -
makeunique: iffalse(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue, duplicate names will be suffixed with_i(istarting at 1 for the first duplicate). -
source: Default:nothing. If aSymbolor string, adds indicator column with the given name for whether a row appeared in onlydf1("left_only"), onlydf2("right_only") or in both ("both"). If the name is already in use, the column name will be modified ifmakeunique=true. This argument is only supported when joining exactly two data frames. -
validate: whether to check that columns passed as theonargument define unique keys in each input data frame (according toisequal). Can be a tuple or a pair, with the first element indicating whether to run check fordf1and the second element fordf2. By default no check is performed. -
renamecols: aPairspecifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbolcan be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString. Note thatrenamecolsdoes not affectoncolumns, whose names are always taken from the left data frame and left unchanged. -
matchmissing: if equal to:errorthrow an error ifmissingis present inoncolumns; if equal to:equalthenmissingis allowed and missings are matched. -
order: if:undefined(the default) the order of rows in the result is undefined and may change in future releases. If:leftthen the order of rows from the left data frame is retained (non-matching rows are put at the end). If:rightthen the order of rows from the right data frame is retained (non-matching rows are put at the end).
All columns of the returned data frame will support missing values.
It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
If more than two data frames are passed, the join is performed recursively with left associativity. In this case the indicator keyword argument is not supported and validate keyword argument is applied recursively with left associativity.
Metadata: table-level :note-style metadata and column-level :note-style metadata for key columns is preserved only for keys which are defined in all passed tables and have the same value. Column-level :note-style metadata is preserved for all other columns.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> outerjoin(name, job, on = :ID)
4×3 DataFrame
Row │ ID Name Job
│ Int64 String? String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
4 │ 4 missing Farmer
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> outerjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
4×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String? String?
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
4 │ 4 missing Farmer
julia> outerjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
4×3 DataFrame
Row │ ID NAME job
│ Int64 String? String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
4 │ 4 missing Farmer
#
DataAPI.rightjoin — Function
rightjoin(df1, df2; on, makeunique=false, source=nothing,
validate=(false, false), renamecols=(identity => identity),
matchmissing=:error, order=:undefined)
Perform a right join on two data frame objects and return a DataFrame containing the result. A right join includes all rows from df2.
The order of rows in the result is undefined and may change in future releases.
In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df2. This behavior may change in future releases.
Arguments
-
df1,df2: theAbstractDataFramesto be joined
Keyword Arguments
-
on: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). Aleft=>rightpair of names can be used instead of a name, for the case where a key has different names indf1anddf2(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal.onis a required argument. -
makeunique: iffalse(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue, duplicate names will be suffixed with_i(istarting at 1 for the first duplicate). -
source: Default:nothing. If aSymbolor string, adds indicator column with the given name for whether a row appeared in onlydf2("right_only") or in both ("both"). If the name is already in use, the column name will be modified ifmakeunique=true. -
validate: whether to check that columns passed as theonargument define unique keys in each input data frame (according toisequal). Can be a tuple or a pair, with the first element indicating whether to run check fordf1and the second element fordf2. By default no check is performed. -
renamecols: aPairspecifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbolcan be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString. Note thatrenamecolsdoes not affectoncolumns, whose names are always taken from the left data frame and left unchanged. -
matchmissing: if equal to:errorthrow an error ifmissingis present inoncolumns; if equal to:equalthenmissingis allowed and missings are matched; if equal to:notequalthen missings are dropped indf1oncolumns. -
order: if:undefined(the default) the order of rows in the result is undefined and may change in future releases. If:leftthen the order of rows from the left data frame is retained (non-matching rows are put at the end). If:rightthen the order of rows from the right data frame is retained.
All columns of the returned data frame will support missing values.
It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
Metadata: table-level and column-level :note-style metadata is taken from df2 (including key columns), except for columns added to it from df1, whose column-level :note-style metadata is taken from df1.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> rightjoin(name, job, on = :ID)
3×3 DataFrame
Row │ ID Name Job
│ Int64 String? String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 4 missing Farmer
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> rightjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
3×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String? String
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 4 missing Farmer
julia> rightjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
3×3 DataFrame
Row │ ID NAME job
│ Int64 String? String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 4 missing Farmer
#
DataAPI.semijoin — Function
semijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error)
Perform a semi join of two data frame objects and return a DataFrame containing the result. A semi join returns the subset of rows of df1 that match with the keys in df2.
The order of rows in the result is kept from df1.
Arguments
-
df1,df2: theAbstractDataFramesto be joined
Keyword Arguments
-
on: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). Aleft=>rightpair of names can be used instead of a name, for the case where a key has different names indf1anddf2(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal.onis a required argument. -
makeunique: ignored as no columns are added todf1columns (it is provided for consistency with other functions). -
indicator: Default:nothing. If aSymbolor string, adds categorical indicator column with the given name for whether a row appeared in onlydf1("left_only"), onlydf2("right_only") or in both ("both"). If the name is already in use, the column name will be modified ifmakeunique=true. -
validate: whether to check that columns passed as theonargument define unique keys in each input data frame (according toisequal). Can be a tuple or a pair, with the first element indicating whether to run check fordf1and the second element fordf2. By default no check is performed. -
matchmissing: if equal to:errorthrow an error ifmissingis present inoncolumns; if equal to:equalthenmissingis allowed and missings are matched; if equal to:notequalthen missings are dropped indf2oncolumns.
It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
Metadata: table-level and column-level :note-style metadata are taken from df1.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> semijoin(name, job, on = :ID)
2×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼─────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> semijoin(name, job2, on = :ID => :identifier)
2×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼─────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
julia> semijoin(name, job2, on = [:ID => :identifier])
2×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼─────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
Grouping
#
Base.get — Function
get(gd::GroupedDataFrame, key, default)
Get a group based on the values of the grouping columns.
key may be a GroupKey, NamedTuple or Tuple of grouping column values (in the same order as the cols argument to groupby). It may also be an AbstractDict, in which case the order of the arguments does not matter.
Examples
julia> df = DataFrame(a=repeat([:foo, :bar, :baz], outer=[2]),
b=repeat([2, 1], outer=[3]),
c=1:6);
julia> gd = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = :foo
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ foo 2 1
2 │ foo 1 4
⋮
Last Group (2 rows): a = :baz
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ baz 2 3
2 │ baz 1 6
julia> get(gd, (a=:bar,), nothing)
2×3 SubDataFrame
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ bar 1 2
2 │ bar 2 5
julia> get(gd, (:baz,), nothing)
2×3 SubDataFrame
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ baz 2 3
2 │ baz 1 6
julia> get(gd, (:qux,), nothing)
#
DataFrames.groupby — Function
groupby(d::AbstractDataFrame, cols;
sort::Union{Bool, Nothing, NamedTuple}=nothing,
skipmissing::Bool=false)
Return a GroupedDataFrame representing a view of an AbstractDataFrame split into row groups.
Arguments
-
df: anAbstractDataFrameto split -
cols: data frame columns to group by. Can be any column selector (Symbol, string or integer;:,Cols,All,Between,Not, a regular expression, or a vector ofSymbols, strings or integers). In particular if the selector picks no columns then a single-groupGroupedDataFrameis created. As a special case, ifcolsis a single column or a vector of columns then it can contain columns wrapped inorderthat will be used to determine the order of groups ifsortistrueor aNamedTuple(ifsortisfalse, then passingorderis an error; ifsortisnothingthen it is set totruewhenorderis passed). -
sort: ifsort=truesort groups according to the values of the grouping columnscols; ifsort=falsegroups are created in their order of appearance indf; ifsort=nothing(the default) then the fastest available grouping algorithm is picked and in consequence the order of groups in the result is undefined and may change in future releases; below a description of the current implementation is provided. Additionallysortcan be aNamedTuplehaving some or all ofalg,lt,by,rev, andorderfields. In this case the groups are sorted and their order follows thesortpermorder. -
skipmissing: whether to skip groups withmissingvalues in one of the grouping columnscols
Details
An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into df. Within each group, the order of rows in df is preserved.
A GroupedDataFrame also supports indexing by groups, select, transform, and combine (which applies a function to each group and combines the result into a data frame).
GroupedDataFrame also supports the dictionary interface. The keys are GroupKey objects returned by keys(::GroupedDataFrame), which can also be used to get the values of the grouping columns for each group. Tuples and NamedTuples containing the values of the grouping columns (in the same order as the cols argument) are also accepted as indices. Finally, an AbstractDict can be used to index into a grouped data frame where the keys are column names of the data frame. The order of the keys does not matter in this case.
In the current implementation if sort=nothing groups are ordered following the order of appearance of values in the grouping columns, except when all grouping columns provide non-nothing DataAPI.refpool, in which case the order of groups follows the order of values returned by DataAPI.refpool. As a particular application of this rule if all cols are CategoricalVectors then groups are always sorted. Integer columns with a narrow range also use this this optimization, so to the order of groups when grouping on integer columns is undefined. A column is considered to be an integer column when deciding on the grouping algorithm choice if its eltype is a subtype of Union{Missing, Real}, all its elements are either missing or pass isinteger test, and none of them is equal to -0.0.
See also
Examples
julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (2 rows): a = 1
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
⋮
Last Group (2 rows): a = 4
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 1 4
2 │ 4 1 8
julia> gd[1]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
julia> last(gd)
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 1 4
2 │ 4 1 8
julia> gd[(a=3,)]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
julia> gd[Dict("a" => 3)]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
julia> gd[(3,)]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
julia> k = first(keys(gd))
GroupKey: (a = 1,)
julia> gd[k]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
julia> for g in gd
println(g)
end
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 2
2 │ 2 1 6
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 1 4
2 │ 4 1 8
#
DataFrames.groupcols — Function
groupcols(gd::GroupedDataFrame)
Return a vector of Symbol column names in parent(gd) used for grouping.
#
DataFrames.groupindices — Function
groupindices(gd::GroupedDataFrame)
Return a vector of group indices for each row of parent(gd).
Rows appearing in group gd[i] are attributed index i. Rows not present in any group are attributed missing (this can happen if skipmissing=true was passed when creating gd, or if gd is a subset from a larger GroupedDataFrame).
The groupindices => target_col_name syntax (or just groupindices without specifying the target column name) is also supported in the transformation mini-language when passing a GroupedDataFrame to transformation functions (combine, select, etc.).
Examples
julia> df = DataFrame(id=["a", "c", "b", "b", "a"])
5×1 DataFrame
Row │ id
│ String
─────┼────────
1 │ a
2 │ c
3 │ b
4 │ b
5 │ a
julia> gdf = groupby(df, :id);
julia> combine(gdf, groupindices)
3×2 DataFrame
Row │ id groupindices
│ String Int64
─────┼──────────────────────
1 │ a 1
2 │ c 2
3 │ b 3
julia> select(gdf, groupindices => :gid)
5×2 DataFrame
Row │ id gid
│ String Int64
─────┼───────────────
1 │ a 1
2 │ c 2
3 │ b 3
4 │ b 3
5 │ a 1
#
Base.keys — Function
keys(gd::GroupedDataFrame)
Get the set of keys for each group of the GroupedDataFrame gd as a GroupKeys object. Each key is a GroupKey, which behaves like a NamedTuple holding the values of the grouping columns for a given group. Unlike the equivalent Tuple, NamedTuple, and AbstractDict, these keys can be used to index into gd efficiently. The ordering of the keys is identical to the ordering of the groups of gd under iteration and integer indexing.
Examples
julia> df = DataFrame(a=repeat([:foo, :bar, :baz], outer=[4]),
b=repeat([2, 1], outer=[6]),
c=1:12);
julia> gd = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = :foo, b = 2
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ foo 2 1
2 │ foo 2 7
⋮
Last Group (2 rows): a = :baz, b = 1
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ baz 1 6
2 │ baz 1 12
julia> keys(gd)
6-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (a = :foo, b = 2)
GroupKey: (a = :bar, b = 1)
GroupKey: (a = :baz, b = 2)
GroupKey: (a = :foo, b = 1)
GroupKey: (a = :bar, b = 2)
GroupKey: (a = :baz, b = 1)
julia> k = keys(gd)[1]
GroupKey: (a = :foo, b = 2)
julia> keys(k)
2-element Vector{Symbol}:
:a
:b
julia> values(k) # Same as Tuple(k)
(:foo, 2)
julia> NamedTuple(k)
(a = :foo, b = 2)
julia> k.a
:foo
julia> k[:a]
:foo
julia> k[1]
:foo
Keys can be used as indices to retrieve the corresponding group from their GroupedDataFrame:
julia> gd[k]
2×3 SubDataFrame
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ foo 2 1
2 │ foo 2 7
julia> gd[keys(gd)[1]] == gd[1]
true
keys(dfc::DataFrameColumns)
Get a vector of column names of dfc as Symbols.
#
Base.parent — Function
parent(gd::GroupedDataFrame)
Return the parent data frame of gd.
#
DataFrames.proprow — Function
proprow
Compute the proportion of rows which belong to each group, i.e. its number of rows divided by the total number of rows in a GroupedDataFrame.
This function can only be used in the transformation mini-language via the proprow => target_col_name syntax (or just proprow without specifying the target column name), when passing a GroupedDataFrame to transformation functions (combine, select, etc.).
Examples
julia> df = DataFrame(id=["a", "c", "b", "b", "a", "b"])
6×1 DataFrame
Row │ id
│ String
─────┼────────
1 │ a
2 │ c
3 │ b
4 │ b
5 │ a
6 │ b
julia> gdf = groupby(df, :id);
julia> combine(gdf, proprow)
3×2 DataFrame
Row │ id proprow
│ String Float64
─────┼──────────────────
1 │ a 0.333333
2 │ c 0.166667
3 │ b 0.5
julia> select(gdf, proprow => :frac)
6×2 DataFrame
Row │ id frac
│ String Float64
─────┼──────────────────
1 │ a 0.333333
2 │ c 0.166667
3 │ b 0.5
4 │ b 0.5
5 │ a 0.333333
6 │ b 0.5
#
DataFrames.valuecols — Function
valuecols(gd::GroupedDataFrame)
Return a vector of Symbol column names in parent(gd) not used for grouping.
Filtering rows
#
Base.allunique — Function
allunique(df::AbstractDataFrame, cols=:)
Return true if none of the rows of df are duplicated. Two rows are duplicates if all their columns contain equal values (according to isequal) for all columns in cols (by default, all columns).
Arguments
-
df:AbstractDataFrame -
cols: a selector specifying the column(s) or their transformations to compare. Can be any column selector or transformation accepted byselect.
Examples
julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> allunique(df)
true
julia> allunique(df, :x)
false
julia> allunique(df, :i => ByRow(isodd))
false
#
Base.deleteat! — Function
deleteat!(df::DataFrame, inds)
Delete rows specified by inds from a DataFrame df in place and return it.
Internally deleteat! is called for all columns so inds must be: a vector of sorted and unique integers, a boolean vector, an integer, or Not wrapping any valid selector.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> deleteat!(df, 2)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 3 6
#
Base.empty — Function
empty(df::AbstractDataFrame)
Create a new DataFrame with the same column names and column element types as df but with zero rows.
Metadata: this function preserves table-level and column-level :note-style metadata.
#
Base.empty! — Function
empty!(df::DataFrame)
Remove all rows from df, making each of its columns empty.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> empty!(df)
0×2 DataFrame
Row │ a b
│ Int64 Int64
─────┴──────────────
julia> df.a, df.b
(Int64[], Int64[])
#
Base.filter — Function
filter(fun, df::AbstractDataFrame; view::Bool=false)
filter(cols => fun, df::AbstractDataFrame; view::Bool=false)
Return a data frame containing only rows from df for which fun returns true.
If cols is not specified then the predicate fun is passed DataFrameRows. Elements of a DataFrameRow may be accessed with dot syntax or column indexing inside fun.
If cols is specified then the predicate fun is passed elements of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.
If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned.
Passing cols leads to a more efficient execution of the operation for large data frames.
|
This method is defined so that DataFrames.jl implements the Julia API for collections, but it is generally recommended to use the |
Metadata: this function preserves table-level and column-level :note-style metadata.
See also: filter!
Examples
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> filter(row -> row.x > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter(row -> row["x"] > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter(:x => x -> x > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter([:x, :y] => (x, y) -> x == 1 || y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
julia> filter(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
filter(fun, gdf::GroupedDataFrame; ungroup::Bool=false) filter(cols => fun, gdf::GroupedDataFrame; ungroup::Bool=false)
Return only groups in gd for which fun returns true as a GroupedDataFrame if ungroup=false (the default), or as a data frame if ungroup=true.
If cols is not specified then the predicate fun is called with a SubDataFrame for each group.
If cols is specified then the predicate fun is called for each group with views of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.
|
This method is defined so that DataFrames.jl implements the Julia API for collections, but it is generally recommended to use the |
Examples
julia> df = DataFrame(g=[1, 2], x=['a', 'b']);
julia> gd = groupby(df, :g)
GroupedDataFrame with 2 groups based on key: g
First Group (1 row): g = 1
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 1 a
⋮
Last Group (1 row): g = 2
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 2 b
julia> filter(x -> x.x[1] == 'a', gd)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 1 a
julia> filter(:x => x -> x[1] == 'a', gd)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 1 a
julia> filter(:x => x -> x[1] == 'a', gd, ungroup=true)
1×2 DataFrame
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 1 a
#
Base.filter! — Function
filter!(fun, df::AbstractDataFrame)
filter!(cols => fun, df::AbstractDataFrame)
Remove rows from data frame df for which fun returns false.
If cols is not specified then the predicate fun is passed DataFrameRows. Elements of a DataFrameRow may be accessed with dot syntax or column indexing inside fun.
If cols is specified then the predicate fun is passed elements of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.
Passing cols leads to a more efficient execution of the operation for large data frames.
|
This method is defined so that DataFrames.jl implements the Julia API for collections, but it is generally recommended to use the |
Metadata: this function preserves table-level and column-level :note-style metadata.
See also: filter
Examples
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> filter!(row -> row.x > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter!(row -> row["x"] > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter!(:x => x -> x == 3, df)
1×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"]);
julia> filter!([:x, :y] => (x, y) -> x == 1 || y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"]);
julia> filter!(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
#
Base.keepat! — Function
keepat!(df::DataFrame, inds)
Delete rows at all indices not specified by inds from a DataFrame df in place and return it.
Internally deleteat! is called for all columns so inds must be: a vector of sorted and unique integers, a boolean vector, an integer, or Not wrapping any valid selector.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> keepat!(df, [1, 3])
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 3 6
#
Base.first — Function
first(df::AbstractDataFrame)
Get the first row of df as a DataFrameRow.
Metadata: this function preserves table-level and column-level :note-style metadata.
first(df::AbstractDataFrame, n::Integer; view::Bool=false)
Get a data frame with the n first rows of df.
If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned.
Metadata: this function preserves table-level and column-level :note-style metadata.
#
Base.last — Function
last(df::AbstractDataFrame)
Get the last row of df as a DataFrameRow.
Metadata: this function preserves table-level and column-level :note-style metadata.
last(df::AbstractDataFrame, n::Integer; view::Bool=false)
Get a data frame with the n last rows of df.
If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned.
Metadata: this function preserves table-level and column-level :note-style metadata.
#
DataFrames.nonunique — Function
nonunique(df::AbstractDataFrame; keep::Symbol=:first)
nonunique(df::AbstractDataFrame, cols; keep::Symbol=:first)
Return a Vector{Bool} in which true entries indicate duplicate rows.
Duplicate rows are those for which at least another row contains equal values (according to isequal) for all columns in cols (by default, all columns). If keep=:first (the default), only the first occurrence of a set of duplicate rows is indicated with a false entry. If keep=:last, only the last occurrence of a set of duplicate rows is indicated with a false entry. If keep=:noduplicates, only rows without any duplicates are indicated with a false entry.
Arguments
-
df:AbstractDataFrame -
cols: a selector specifying the column(s) or their transformations to compare. Can be any column selector or transformation accepted byselectthat returns at least one column ifdfhas at least one column.
Examples
julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> df = vcat(df, df)
8×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
5 │ 1 1
6 │ 2 2
7 │ 3 1
8 │ 4 2
julia> nonunique(df)
8-element Vector{Bool}:
0
0
0
0
1
1
1
1
julia> nonunique(df, keep=:last)
8-element Vector{Bool}:
1
1
1
1
0
0
0
0
julia> nonunique(df, 2)
8-element Vector{Bool}:
0
0
1
1
1
1
1
1
#
Base.Iterators.only — Function
only(df::AbstractDataFrame)
If df has a single row return it as a DataFrameRow; otherwise throw ArgumentError.
Metadata: this function preserves table-level and column-level :note-style metadata.
#
Base.pop! — Function
pop!(df::DataFrame)
Remove the last row from df and return a NamedTuple created from this row.
|
Using this method for very wide data frames may lead to expensive compilation. |
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> pop!(df)
(a = 3, b = 6)
julia> df
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
#
Base.popat! — Function
popat!(df::DataFrame, i::Integer)
Remove the i-th row from df and return a NamedTuple created from this row.
|
Using this method for very wide data frames may lead to expensive compilation. |
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> popat!(df, 2)
(a = 2, b = 5)
julia> df
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 3 6
#
Base.popfirst! — Function
popfirst!(df::DataFrame)
Remove the first row from df and return a NamedTuple created from this row.
|
Using this method for very wide data frames may lead to expensive compilation. |
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> popfirst!(df)
(a = 1, b = 4)
julia> df
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 5
2 │ 3 6
#
Base.resize! — Function
resize!(df::DataFrame, n::Integer)
Resize df to have n rows by calling resize! on all columns of df.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> resize!(df, 2)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
#
DataFrames.subset — Function
subset(df::AbstractDataFrame, args...;
skipmissing::Bool=false, view::Bool=false, threads::Bool=true)
subset(gdf::GroupedDataFrame, args...;
skipmissing::Bool=false, view::Bool=false,
ungroup::Bool=true, threads::Bool=true)
Return a copy of data frame df or parent of gdf containing only rows for which all values produced by transformation(s) args for a given row are true. All transformations must produce vectors containing true or false. When the first argument is a GroupedDataFrame, transformations are also allowed to return a single true or false value, which results in including or excluding a whole group.
If skipmissing=false (the default) args are required to produce results containing only Bool values. If skipmissing=true, additionally missing is allowed and it is treated as false (i.e. rows for which one of the conditions returns missing are skipped).
Each argument passed in args can be any specifier following the rules described for select with the restriction that:
-
specifying target column name is not allowed as
subsetdoes not create new columns; -
every passed transformation must return a scalar or a vector (returning
AbstractDataFrame,NamedTuple,DataFrameRoworAbstractMatrixis not supported).
If view=true a SubDataFrame view is returned instead of a DataFrame.
If ungroup=false the resulting data frame is re-grouped based on the same grouping columns as gdf and a GroupedDataFrame is returned (preserving the order of groups from gdf).
If threads=true (the default) transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set to false if some transformations require serial execution or are not thread-safe.
If a GroupedDataFrame is passed then it must include all groups present in the parent data frame, like in select!.
|
Note that as the |
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(id=1:4, x=[true, false, true, false],
y=[true, true, false, false],
z=[true, true, missing, missing], v=[1, 2, 11, 12])
4×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 1 true true true 1
2 │ 2 false true true 2
3 │ 3 true false missing 11
4 │ 4 false false missing 12
julia> subset(df, :x)
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼────────────────────────────────────
1 │ 1 true true true 1
2 │ 3 true false missing 11
julia> subset(df, :v => x -> x .> 3)
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 3 true false missing 11
2 │ 4 false false missing 12
julia> subset(df, :x, :y => ByRow(!))
1×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼────────────────────────────────────
1 │ 3 true false missing 11
julia> subset(df, :x, :z, skipmissing=true)
1×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────
1 │ 1 true true true 1
julia> subset(df, :x, :z)
ERROR: ArgumentError: missing was returned in condition number 2 but only true or false are allowed; pass skipmissing=true to skip missing values
julia> subset(groupby(df, :y), :v => x -> x .> minimum(x))
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 2 false true true 2
2 │ 4 false false missing 12
julia> subset(groupby(df, :y), :v => x -> minimum(x) > 5)
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 3 true false missing 11
2 │ 4 false false missing 12
#
DataFrames.subset! — Function
subset!(df::AbstractDataFrame, args...;
skipmissing::Bool=false, threads::Bool=true)
subset!(gdf::GroupedDataFrame{DataFrame}, args...;
skipmissing::Bool=false, ungroup::Bool=true, threads::Bool=true)
Update data frame df or the parent of gdf in place to contain only rows for which all values produced by transformation(s) args for a given row is true. All transformations must produce vectors containing true or false. When the first argument is a GroupedDataFrame, transformations are also allowed to return a single true or false value, which results in including or excluding a whole group.
If skipmissing=false (the default) args are required to produce results containing only Bool values. If skipmissing=true, additionally missing is allowed and it is treated as false (i.e. rows for which one of the conditions returns missing are skipped).
Each argument passed in args can be any specifier following the rules described for select with the restriction that:
-
specifying target column name is not allowed as
subset!does not create new columns; -
every passed transformation must return a scalar or a vector (returning
AbstractDataFrame,NamedTuple,DataFrameRoworAbstractMatrixis not supported).
If ungroup=false the passed GroupedDataFrame gdf is updated (preserving the order of its groups) and returned.
If threads=true (the default) transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set to false if some transformations require serial execution or are not thread-safe.
If GroupedDataFrame is subsetted then it must include all groups present in the parent data frame, like in select!. In this case the passed GroupedDataFrame is updated to have correct groups after its parent is updated.
|
Note that as the |
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(id=1:4, x=[true, false, true, false], y=[true, true, false, false])
4×3 DataFrame
Row │ id x y
│ Int64 Bool Bool
─────┼─────────────────────
1 │ 1 true true
2 │ 2 false true
3 │ 3 true false
4 │ 4 false false
julia> subset!(df, :x, :y => ByRow(!));
julia> df
1×3 DataFrame
Row │ id x y
│ Int64 Bool Bool
─────┼────────────────────
1 │ 3 true false
julia> df = DataFrame(id=1:4, y=[true, true, false, false], v=[1, 2, 11, 12]);
julia> subset!(groupby(df, :y), :v => x -> x .> minimum(x));
julia> df
2×3 DataFrame
Row │ id y v
│ Int64 Bool Int64
─────┼─────────────────────
1 │ 2 true 2
2 │ 4 false 12
julia> df = DataFrame(id=1:4, x=[true, false, true, false],
z=[true, true, missing, missing], v=1:4)
4×4 DataFrame
Row │ id x z v
│ Int64 Bool Bool? Int64
─────┼──────────────────────────────
1 │ 1 true true 1
2 │ 2 false true 2
3 │ 3 true missing 3
4 │ 4 false missing 4
julia> subset!(df, :x, :z)
ERROR: ArgumentError: missing was returned in condition number 2 but only true or false are allowed; pass skipmissing=true to skip missing values
julia> subset!(df, :x, :z, skipmissing=true);
julia> df
1×4 DataFrame
Row │ id x z v
│ Int64 Bool Bool? Int64
─────┼───────────────────────────
1 │ 1 true true 1
julia> df = DataFrame(id=1:4, x=[true, false, true, false], y=[true, true, false, false],
z=[true, true, missing, missing], v=[1, 2, 11, 12]);
julia> subset!(groupby(df, :y), :v => x -> x .> minimum(x));
julia> df
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 2 false true true 2
2 │ 4 false false missing 12
julia> df = DataFrame(id=1:4, x=[true, false, true, false], y=[true, true, false, false],
z=[true, true, missing, missing], v=[1, 2, 11, 12]);
julia> subset!(groupby(df, :y), :v => x -> minimum(x) > 5);
julia> df
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 3 true false missing 11
2 │ 4 false false missing 12
#
Base.unique — Function
unique(df::AbstractDataFrame; view::Bool=false, keep::Symbol=:first)
unique(df::AbstractDataFrame, cols; view::Bool=false, keep::Symbol=:first)
Return a data frame containing only unique rows in df.
Non-unique (duplicate) rows are those for which at least another row contains equal values (according to isequal) for all columns in cols (by default, all columns). If keep=:first (the default), only the first occurrence of a set of duplicate rows is kept. If keep=:last, only the last occurrence of a set of duplicate rows is kept. If keep=:noduplicates, only rows without any duplicates are kept.
If view=false a freshly allocated DataFrame is returned, and if view=true then a SubDataFrame view into df is returned.
Arguments
-
df: the AbstractDataFrame -
cols: a selector specifying the column(s) or their transformations to compare. Can be any column selector or transformation accepted byselectthat returns at least one column ifdfhas at least one column.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> df = vcat(df, df)
8×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
5 │ 1 1
6 │ 2 2
7 │ 3 1
8 │ 4 2
julia> unique(df) # doesn't modify df
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> unique(df, 2)
2×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
julia> unique(df, keep=:noduplicates)
0×2 DataFrame
Row │ i x
│ Int64 Int64
─────┴──────────────
#
Base.unique! — Function
unique!(df::AbstractDataFrame; keep::Symbol=:first)
unique!(df::AbstractDataFrame, cols; keep::Symbol=:first)
Update df in-place to contain only unique rows.
Non-unique (duplicate) rows are those for which at least another row contains equal values (according to isequal) for all columns in cols (by default, all columns). If keep=:first (the default), only the first occurrence of a set of duplicate rows is kept. If keep=:last, only the last occurrence of a set of duplicate rows is kept. If keep=:noduplicates, only rows without any duplicates are kept.
Arguments
-
df: the AbstractDataFrame -
cols: column indicator (Symbol,Int,Vector{Symbol},Regex, etc.) specifying the column(s) to compare. Can be any column selector or transformation accepted byselectthat returns at least one column ifdfhas at least one column.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> df = vcat(df, df)
8×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
5 │ 1 1
6 │ 2 2
7 │ 3 1
8 │ 4 2
julia> unique!(copy(df)) # modifies df
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> unique(df, keep=:noduplicates)
0×2 DataFrame
Row │ i x
│ Int64 Int64
─────┴──────────────
Working with missing values
#
Missings.allowmissing — Function
allowmissing(df::AbstractDataFrame, cols=:)
Return a copy of data frame df with columns cols converted to element type Union{T, Missing} from T to allow support for missing values.
cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If cols is omitted all columns in the data frame are converted.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(a=[1, 2])
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
julia> allowmissing(df)
2×1 DataFrame
Row │ a
│ Int64?
─────┼────────
1 │ 1
2 │ 2
#
DataFrames.allowmissing! — Function
allowmissing!(df::DataFrame, cols=:)
Convert columns cols of data frame df from element type T to Union{T, Missing} to support missing values.
cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If cols is omitted all columns in the data frame are converted.
Metadata: this function preserves table-level and column-level :note-style metadata.
#
DataFrames.completecases — Function
completecases(df::AbstractDataFrame, cols=:)
Return a Boolean vector with true entries indicating rows without missing values (complete cases) in data frame df.
If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers) that returns at least one column if df has at least one column.
See also: dropmissing and dropmissing!. Use findall(completecases(df)) to get the indices of the rows.
Examples
julia> df = DataFrame(i=1:5,
x=[missing, 4, missing, 2, 1],
y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
julia> completecases(df)
5-element BitVector:
0
0
0
1
1
julia> completecases(df, :x)
5-element BitVector:
0
1
0
1
1
julia> completecases(df, [:x, :y])
5-element BitVector:
0
0
0
1
1
#
Missings.disallowmissing — Function
disallowmissing(df::AbstractDataFrame, cols=:; error::Bool=true)
Return a copy of data frame df with columns cols converted from element type Union{T, Missing} to T to drop support for missing values.
cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If cols is omitted all columns in the data frame are converted.
If error=false then columns containing a missing value will be skipped instead of throwing an error.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(a=Union{Int, Missing}[1, 2])
2×1 DataFrame
Row │ a
│ Int64?
─────┼────────
1 │ 1
2 │ 2
julia> disallowmissing(df)
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
julia> df = DataFrame(a=[1, missing])
2×1 DataFrame
Row │ a
│ Int64?
─────┼─────────
1 │ 1
2 │ missing
julia> disallowmissing(df, error=false)
2×1 DataFrame
Row │ a
│ Int64?
─────┼─────────
1 │ 1
2 │ missing
#
DataFrames.disallowmissing! — Function
disallowmissing!(df::DataFrame, cols=:; error::Bool=true)
Convert columns cols of data frame df from element type Union{T, Missing} to T to drop support for missing values.
cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If cols is omitted all columns in the data frame are converted.
If error=false then columns containing a missing value will be skipped instead of throwing an error.
Metadata: this function preserves table-level and column-level :note-style metadata.
#
DataFrames.dropmissing — Function
dropmissing(df::AbstractDataFrame, cols=:; view::Bool=false, disallowmissing::Bool=!view)
Return a data frame excluding rows with missing values in df.
If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned. In this case disallowmissing must be false.
If disallowmissing is true (the default when view is false) then columns specified in cols will be converted so as not to allow for missing values using disallowmissing!.
See also: completecases and dropmissing!.
Metadata: this function preserves table-level and column-level :note-style metadata.
Examples
julia> df = DataFrame(i=1:5,
x=[missing, 4, missing, 2, 1],
y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
julia> dropmissing(df)
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing(df, disallowmissing=false)
2×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼────────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing(df, :x)
3×3 DataFrame
Row │ i x y
│ Int64 Int64 String?
─────┼───────────────────────
1 │ 2 4 missing
2 │ 4 2 d
3 │ 5 1 e
julia> dropmissing(df, [:x, :y])
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
#
DataFrames.dropmissing! — Function
dropmissing!(df::AbstractDataFrame, cols=:; disallowmissing::Bool=true)
Remove rows with missing values from data frame df and return it.
If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If disallowmissing is true (the default) then the cols columns will get converted using disallowmissing!.
Metadata: this function preserves table-level and column-level :note-style metadata.
See also: dropmissing and completecases.
julia> df = DataFrame(i=1:5,
x=[missing, 4, missing, 2, 1],
y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
julia> dropmissing!(copy(df))
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing!(copy(df), disallowmissing=false)
2×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼────────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing!(copy(df), :x)
3×3 DataFrame
Row │ i x y
│ Int64 Int64 String?
─────┼───────────────────────
1 │ 2 4 missing
2 │ 4 2 d
3 │ 5 1 e
julia> dropmissing!(df, [:x, :y])
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
Iteration
#
Base.eachcol — Function
eachcol(df::AbstractDataFrame)
Return a DataFrameColumns object that is a vector-like that allows iterating an AbstractDataFrame column by column.
Indexing into DataFrameColumns objects using integer, Symbol or string returns the corresponding column (without copying). Indexing into DataFrameColumns objects using a multiple column selector returns a subsetted DataFrameColumns object with a new parent containing only the selected columns (without copying).
DataFrameColumns supports most of the AbstractVector API. The key differences are that it is read-only and that the keys function returns a vector of Symbols (and not integers as for normal vectors).
In particular findnext, findprev, findfirst, findlast, and findall functions are supported, and in findnext and findprev functions it is allowed to pass an integer, string, or Symbol as a reference index.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> eachcol(df)
4×2 DataFrameColumns
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> collect(eachcol(df))
2-element Vector{AbstractVector}:
[1, 2, 3, 4]
[11, 12, 13, 14]
julia> map(eachcol(df)) do col
maximum(col) - minimum(col)
end
2-element Vector{Int64}:
3
3
julia> sum.(eachcol(df))
2-element Vector{Int64}:
10
50
#
Base.eachrow — Function
eachrow(df::AbstractDataFrame)
Return a DataFrameRows that iterates a data frame row by row, with each row represented as a DataFrameRow.
Because DataFrameRows have an eltype of Any, use copy(dfr::DataFrameRow) to obtain a named tuple, which supports iteration and property access like a DataFrameRow, but also passes information on the eltypes of the columns of df.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> eachrow(df)
4×2 DataFrameRows
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> copy.(eachrow(df))
4-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 11)
(x = 2, y = 12)
(x = 3, y = 13)
(x = 4, y = 14)
julia> eachrow(view(df, [4, 3], [2, 1]))
2×2 DataFrameRows
Row │ y x
│ Int64 Int64
─────┼──────────────
1 │ 14 4
2 │ 13 3
#
Base.values — Function
values(dfc::DataFrameColumns)
Get a vector of columns from dfc.
#
Base.pairs — Function
pairs(dfc::DataFrameColumns)
Return an iterator of pairs associating the name of each column of dfc with the corresponding column vector, i.e. name => col where name is the column name of the column col.
#
Base.Iterators.partition — Function
Iterators.partition(df::AbstractDataFrame, n::Integer)
Iterate over df data frame n rows at a time, returning each block as a SubDataFrame.
Examples
julia> collect(Iterators.partition(DataFrame(x=1:5), 2))
3-element Vector{SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}}:
2×1 SubDataFrame
Row │ x
│ Int64
─────┼───────
1 │ 1
2 │ 2
2×1 SubDataFrame
Row │ x
│ Int64
─────┼───────
1 │ 3
2 │ 4
1×1 SubDataFrame
Row │ x
│ Int64
─────┼───────
1 │ 5
Iterators.partition(dfr::DataFrameRows, n::Integer)
Iterate over DataFrameRows dfr n rows at a time, returning each block as a DataFrameRows over a view of rows of parent of dfr.
Examples
julia> collect(Iterators.partition(eachrow(DataFrame(x=1:5)), 2))
3-element Vector{DataFrames.DataFrameRows{SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}}}:
2×1 DataFrameRows
Row │ x
│ Int64
─────┼───────
1 │ 1
2 │ 2
2×1 DataFrameRows
Row │ x
│ Int64
─────┼───────
1 │ 3
2 │ 4
1×1 DataFrameRows
Row │ x
│ Int64
─────┼───────
1 │ 5
Equality
#
Base.isapprox — Function
isapprox(df1::AbstractDataFrame, df2::AbstractDataFrame;
rtol::Real=atol>0 ? 0 : √eps, atol::Real=0,
nans::Bool=false, norm::Function=norm)
Inexact equality comparison. df1 and df2 must have the same size and column names. Return true if isapprox with given keyword arguments applied to all pairs of columns stored in df1 and df2 returns true.
Metadata
#
DataAPI.metadata — Function
metadata(df::AbstractDataFrame, key::AbstractString, [default]; style::Bool=false)
metadata(dfr::DataFrameRow, key::AbstractString, [default]; style::Bool=false)
metadata(dfc::DataFrameColumns, key::AbstractString, [default]; style::Bool=false)
metadata(dfr::DataFrameRows, key::AbstractString, [default]; style::Bool=false)
Return table-level metadata value associated with df for key key. If style=true return a tuple of metadata value and metadata style.
SubDataFrame and DataFrameRow expose only :note-style metadata of their parent.
If default is passed then return it if key does not exist; if style=true return (default, :default).
See also: metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.
Examples
julia> df = DataFrame(a=1, b=2);
julia> metadatakeys(df)
()
julia> metadata!(df, "name", "example", style=:note);
julia> metadatakeys(df)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> metadata(df, "name")
"example"
julia> metadata(df, "name", style=true)
("example", :note)
julia> deletemetadata!(df, "name");
julia> metadatakeys(df)
()
<a id='DataAPI.metadatakeys' href='#DataAPI.metadatakeys'>#</a> **`DataAPI.metadatakeys`** — *Function*. ```julia metadatakeys(df::AbstractDataFrame) metadatakeys(dfr::DataFrameRow) metadatakeys(dfc::DataFrameColumns) metadatakeys(dfr::DataFrameRows)
Return an iterator of table-level metadata keys which are set in the object.
Values can be accessed using metadata(df, key).
SubDataFrame and DataFrameRow expose only :note-style metadata keys of their parent.
See also: metadata, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.
Examples
julia> df = DataFrame(a=1, b=2);
julia> metadatakeys(df)
()
julia> metadata!(df, "name", "example", style=:note);
julia> metadatakeys(df)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> metadata(df, "name")
"example"
julia> metadata(df, "name", style=true)
("example", :note)
julia> deletemetadata!(df, "name");
julia> metadatakeys(df)
()
#
DataAPI.metadata! — Function
metadata!(df::AbstractDataFrame, key::AbstractString, value; style::Symbol=:default)
metadata!(dfr::DataFrameRow, key::AbstractString, value; style::Symbol=:default)
metadata!(dfc::DataFrameColumns, key::AbstractString, value; style::Symbol=:default)
metadata!(dfr::DataFrameRows, key::AbstractString, value; style::Symbol=:default)
Set table-level metadata for object df for key key to have value value and style style (:default by default) and return df.
For SubDataFrame and DataFrameRow only :note-style is allowed. Trying to set a key-value pair for which the key already exists in the parent data frame with another style throws an error.
See also: metadata, metadatakeys, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.
Examples
julia> df = DataFrame(a=1, b=2);
julia> metadatakeys(df)
()
julia> metadata!(df, "name", "example", style=:note);
julia> metadatakeys(df)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> metadata(df, "name")
"example"
julia> metadata(df, "name", style=true)
("example", :note)
julia> deletemetadata!(df, "name");
julia> metadatakeys(df)
()
<a id='DataAPI.deletemetadata!' href='#DataAPI.deletemetadata!'>#</a> **`DataAPI.deletemetadata!`** — *Function*. ```julia deletemetadata!(df::AbstractDataFrame, key::AbstractString) deletemetadata!(dfr::DataFrameRow, key::AbstractString) deletemetadata!(dfc::DataFrameColumns, key::AbstractString) deletemetadata!(dfr::DataFrameRows, key::AbstractString)
Delete table-level metadata from object df for key key and return df. If key does not exist, return df without modification.
For SubDataFrame and DataFrameRow only :note-style metadata from their parent can be deleted (as other styles are not propagated to views).
See also: metadata, metadatakeys, metadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.
Examples
julia> df = DataFrame(a=1, b=2);
julia> metadatakeys(df)
()
julia> metadata!(df, "name", "example", style=:note);
julia> metadatakeys(df)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> metadata(df, "name")
"example"
julia> metadata(df, "name", style=true)
("example", :note)
julia> deletemetadata!(df, "name");
julia> metadatakeys(df)
()
<a id='DataAPI.emptymetadata!' href='#DataAPI.emptymetadata!'>#</a> **`DataAPI.emptymetadata!`** — *Function*. ```julia emptymetadata!(df::AbstractDataFrame) emptymetadata!(dfr::DataFrameRow) emptymetadata!(dfc::DataFrameColumns) emptymetadata!(dfr::DataFrameRows)
Delete all table-level metadata from object df.
For SubDataFrame and DataFrameRow only :note-style metadata from their parent can be deleted (as other styles are not propagated to views).
See also: metadata, metadatakeys, metadata!, deletemetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.
Examples
julia> df = DataFrame(a=1, b=2);
julia> metadatakeys(df)
()
julia> metadata!(df, "name", "example", style=:note);
julia> metadatakeys(df)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> metadata(df, "name")
"example"
julia> metadata(df, "name", style=true)
("example", :note)
julia> emptymetadata!(df);
julia> metadatakeys(df)
()
#
DataAPI.colmetadata — Function
colmetadata(df::AbstractDataFrame, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)
colmetadata(dfr::DataFrameRow, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)
colmetadata(dfc::DataFrameColumns, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)
colmetadata(dfr::DataFrameRows, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)
Return column-level metadata value associated with df for column col and key key.
SubDataFrame and DataFrameRow expose only :note-style metadata of their parent.
If default is passed then return it if key does not exist for column col; if style=true return (default, :default). If col does not exist in df always throw an error.
See also: metadata, metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.
Examples
julia> df = DataFrame(a=1, b=2);
julia> colmetadatakeys(df)
()
julia> colmetadata!(df, :a, "name", "example", style=:note);
julia> collect(colmetadatakeys(df))
1-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
:a => ["name"]
julia> colmetadatakeys(df, :a)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> colmetadata(df, :a, "name")
"example"
julia> colmetadata(df, :a, "name", style=true)
("example", :note)
julia> deletecolmetadata!(df, :a, "name");
julia> colmetadatakeys(df)
()
<a id='DataAPI.colmetadatakeys' href='#DataAPI.colmetadatakeys'>#</a> **`DataAPI.colmetadatakeys`** — *Function*. ```julia colmetadatakeys(df::AbstractDataFrame, [col::ColumnIndex]) colmetadatakeys(dfr::DataFrameRow, [col::ColumnIndex]) colmetadatakeys(dfc::DataFrameColumns, [col::ColumnIndex]) colmetadatakeys(dfr::DataFrameRows, [col::ColumnIndex])
If col is passed return an iterator of column-level metadata keys which are set for column col. If col is not passed return an iterator of col => colmetadatakeys(x, col) pairs for all columns that have metadata, where col are Symbol.
Values can be accessed using colmetadata(df, col, key).
SubDataFrame and DataFrameRow expose only :note-style metadata of their parent.
See also: metadata, metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadata!, deletecolmetadata!, emptycolmetadata!.
Examples
julia> df = DataFrame(a=1, b=2);
julia> colmetadatakeys(df)
()
julia> colmetadata!(df, :a, "name", "example", style=:note);
julia> collect(colmetadatakeys(df))
1-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
:a => ["name"]
julia> colmetadatakeys(df, :a)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> colmetadata(df, :a, "name")
"example"
julia> colmetadata(df, :a, "name", style=true)
("example", :note)
julia> deletecolmetadata!(df, :a, "name");
julia> colmetadatakeys(df)
()
<a id='DataAPI.colmetadata!' href='#DataAPI.colmetadata!'>#</a> **`DataAPI.colmetadata!`** — *Function*. ```julia colmetadata!(df::AbstractDataFrame, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default) colmetadata!(dfr::DataFrameRow, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default) colmetadata!(dfc::DataFrameColumns, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default) colmetadata!(dfr::DataFrameRows, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default)
Set column-level metadata in df for column col and key key to have value value and style style (:default by default) and return df.
For SubDataFrame and DataFrameRow only :note style is allowed. Trying to set a key-value pair for which the key already exists in the parent data frame with another style throws an error.
See also: metadata, metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, deletecolmetadata!, emptycolmetadata!.
Examples
julia> df = DataFrame(a=1, b=2);
julia> colmetadatakeys(df)
()
julia> colmetadata!(df, :a, "name", "example", style=:note);
julia> collect(colmetadatakeys(df))
1-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
:a => ["name"]
julia> colmetadatakeys(df, :a)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> colmetadata(df, :a, "name")
"example"
julia> colmetadata(df, :a, "name", style=true)
("example", :note)
julia> deletecolmetadata!(df, :a, "name");
julia> colmetadatakeys(df)
()
<a id='DataAPI.deletecolmetadata!' href='#DataAPI.deletecolmetadata!'>#</a> **`DataAPI.deletecolmetadata!`** — *Function*. ```julia deletecolmetadata!(df::AbstractDataFrame, col::ColumnIndex, key::AbstractString) deletecolmetadata!(dfr::DataFrameRow, col::ColumnIndex, key::AbstractString) deletecolmetadata!(dfc::DataFrameColumns, col::ColumnIndex, key::AbstractString) deletecolmetadata!(dfr::DataFrameRows, col::ColumnIndex, key::AbstractString)
Delete column-level metadata set in df for column col and key key and return df.
For SubDataFrame and DataFrameRow only :note-style metadata from their parent can be deleted (as other styles are not propagated to views).
See also: metadata, metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, emptycolmetadata!.
Examples
julia> df = DataFrame(a=1, b=2);
julia> colmetadatakeys(df)
()
julia> colmetadata!(df, :a, "name", "example", style=:note);
julia> collect(colmetadatakeys(df))
1-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
:a => ["name"]
julia> colmetadatakeys(df, :a)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> colmetadata(df, :a, "name")
"example"
julia> colmetadata(df, :a, "name", style=true)
("example", :note)
julia> deletecolmetadata!(df, :a, "name");
julia> colmetadatakeys(df)
()
#
DataAPI.emptycolmetadata! — Function
emptycolmetadata!(df::AbstractDataFrame, [col::ColumnIndex])
emptycolmetadata!(dfr::DataFrameRow, [col::ColumnIndex])
emptycolmetadata!(dfc::DataFrameColumns, [col::ColumnIndex])
emptycolmetadata!(dfr::DataFrameRows, [col::ColumnIndex])
Delete column-level metadata set in df for column col and key key and return df.
For SubDataFrame and DataFrameRow only :note-style metadata from their parent can be deleted (as other styles are not propagated to views).
See also: metadata, metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!.
Examples
julia> df = DataFrame(a=1, b=2);
julia> colmetadata!(df, :a, "name", "example", style=:note);
julia> collect(colmetadatakeys(df))
1-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
:a => ["name"]
julia> colmetadatakeys(df, :a)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> colmetadata(df, :a, "name")
"example"
julia> colmetadata(df, :a, "name", style=true)
("example", :note)
julia> emptycolmetadata!(df, :a);
julia> colmetadatakeys(df)
()