Tables
In this example, we will look in detail at tables and how to use them in Engee scripts. Let's start by installing and loading the auxiliary library and creating a table.
import Pkg
Pkg.add("TypedTables")
using TypedTables
t = Table(a = [1, 2, 3], b = [2.0, 4.0, 6.0])
Next, we'll look at how to interact with tables in detail. Let's access the first row and the first column of the table.
t[1]
t.a
Now let's describe what a table is and where it is used.
A Table is a Julia array type in which each element (row) is a NamedTuple. In particular:
Externally, a Table is an array of named tuples. That is, each row of a Table is a NamedTuple, a built-in Julia type that is easy to use and very efficient. In subtype notation, Table <: AbstractArray{<:NamedTuple}.
Internally, a Table stores a (named) tuple of arrays, which is a convenient structure for storing column-based tabular data.
Thus, manipulating Table data is as easy as working with arrays and named tuples, and that efficiency, simplicity and fun are part of the very ideology of Julia.
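The two views of a table described above can be checked directly. The following is a minimal sketch that reuses the table from the beginning of this example; the variable names first_row and cols are chosen here for illustration only.

```julia
using TypedTables

t = Table(a = [1, 2, 3], b = [2.0, 4.0, 6.0])

# From the outside, the table behaves like a vector of rows.
is_vector = t isa AbstractVector         # a one-dimensional AbstractArray
rows_are_tuples = eltype(t) <: NamedTuple
first_row = t[1]                         # (a = 1, b = 2.0)

# On the inside, the data is stored column by column.
cols = columns(t)                        # NamedTuple of the column arrays
same_array = cols.a === t.a              # t.a is the stored column itself
```

Since t.a returns the stored column itself, changing an element of that array also changes what the corresponding row shows.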
A Table and its columns can be AbstractArrays of any dimensionality. This allows you to take advantage of Julia's powerful array features, such as multidimensional broadcasting. Each column must be an array of the same dimensionality and size as the other columns.
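As a short illustration of a table that is not one-dimensional, the sketch below builds a table from two 2×2 matrices. The table m and its values are made up for this example, and map is used for the element-wise computation.

```julia
using TypedTables

# Both columns are 2×2 matrices, so the table is a 2×2 array of rows.
m = Table(a = [1 2; 3 4], b = [0.5 1.0; 1.5 2.0])

dims = size(m)     # (2, 2)
cell = m[2, 1]     # (a = 3, b = 1.5)

# An element-wise operation over the rows gives an ordinary 2×2 matrix.
totals = map(row -> row.a + row.b, m)
```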
The goal of TypedTables.jl is to introduce very few new concepts and require minimal learning, so that you can manipulate tabular data natively. The Table type is a simple wrapper over columns that presents the well-known and extremely productive AbstractArray interface. If you are familiar with arrays and named tuples, you can write your own data analytics using Table.
However, this functionality would be of little use if the data container were inherently slow, or if idiomatic usage led the programmer into performance pitfalls. With Table, for loops over rows can run at the speed of handwritten code in a statically compiled language such as C, because the compiler is fully aware of the type of each column. Thus, users can write generic functions using a combination of handwritten loops and calls to functions such as map, filter, reduce, group and innerjoin, as well as high-level interfaces provided by packages like Query.jl, and still get optimal performance.
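The sketch below shows the handwritten-loop style next to map, filter and reduce on a small table. The helper total_age and the sample data are assumptions made for illustration; group and innerjoin are left out because they come from a separate package.

```julia
using TypedTables

t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

# Handwritten loop over rows; each row is a NamedTuple with known field types.
function total_age(table)
    total = 0
    for row in table
        total += row.age
    end
    return total
end

total = total_age(t)                      # 104

ages  = map(row -> row.age, t)            # one value per row
older = filter(row -> row.age > 30, t)    # only the rows with age > 30
mean_age = reduce(+, t.age) / length(t)   # reduce over a single column
```

Wrapping the loop in a function lets the compiler specialize it for the concrete column types of the table that is passed in.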
Finally, since Table has no opinion on the underlying array storage (and acts more as a convenient metaprogramming layer), the arrays representing each column can have quite different properties - for example, support for in-memory, out-of-core, and distributed workloads (see "Data Representation" for more details).
Ways to create a table
The easiest way to create a table from columns is to use keyword arguments.
t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
The constructor will equally accept a NamedTuple of columns, as in Table((name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])) (note the extra parentheses).
It is also easy to convert a vector of named tuples based on row storage to column storage using the Table constructor.
Table([(name = "Alice", age = 25), (name = "Bob", age = 42), (name = "Charlie", age = 37)])
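To see that the three construction styles produce the same table, they can be compared directly. This is a small sketch; the names by_keywords, by_tuple and by_rows are introduced only for this comparison.

```julia
using TypedTables

by_keywords = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
by_tuple    = Table((name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37]))
by_rows     = Table([(name = "Alice", age = 25),
                     (name = "Bob", age = 42),
                     (name = "Charlie", age = 37)])

# All three hold the same rows.
all_equal = by_keywords == by_tuple == by_rows

# The row-based input is now available as columns.
ages = by_rows.age
```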
Accessing data stored in Tables
Let's start with rows. In our table, a row is simply a NamedTuple, which, as we saw earlier, is easy to access: just index the table by row number.
t[1]
Multiple rows can be indexed similarly to standard arrays.
t[2:3]
We can also get the table's dimensions using the standard length and size functions. It is important to note that the result of size does not include the number of columns.
length(t)
size(t)
A column can be accessed by its name using dot syntax.
t.name
The easiest way to retrieve more than one column is to create a new table from the columns (as in table2 = Table(column1 = table1.column1, column2 = table1.column2, ...)).
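As a concrete version of that pattern, the sketch below selects two of three columns. The people table and its city column are made-up sample data.

```julia
using TypedTables

people = Table(name = ["Alice", "Bob", "Charlie"],
               age  = [25, 42, 37],
               city = ["Oslo", "Lima", "Cairo"])

# Build a new table from the columns we want to keep.
subset = Table(name = people.name, city = people.city)

kept = columnnames(subset)    # (:name, :city)
second_row = subset[2]        # (name = "Bob", city = "Lima")
```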
All columns can be accessed at once, as a NamedTuple of arrays, using the columns function.
columns(t)
In addition, the columnnames function returns the names of the columns.
columnnames(t)
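Because size and length describe only the rows, the number of columns has to be obtained separately. One possible way, sketched here with the same sample table, is to count the column names.

```julia
using TypedTables

t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

nrows = length(t)                 # 3
dims  = size(t)                   # (3,), rows only
names = columnnames(t)            # (:name, :age)
ncols = length(names)             # 2
```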
From all of the above, let's highlight two equivalent ways of obtaining a data cell.
t[1].name
t.name[1]
Comparison of TypedTables and DataFrames
For those who have experience using the DataFrames.jl package, this comparison may be useful:
1. The set of columns stored in a Table is immutable: you cannot add, delete or rename columns. However, it is very easy to create a new table with different columns, which encourages a functional programming style for working with the outer data structure. (See also FlexTable, a more flexible alternative.) IndexedTables and JuliaDB take a similar approach, whereas DataFrames uses an untyped vector of columns.
2. The columns themselves can be mutable. You can modify data in one or more columns, as well as add or remove rows. Thus, operations on the data (rather than on the data structure) can take an imperative form if desired.
3. The column types are known to the compiler, which makes direct operations such as iterating over the rows of a Table very fast. The programmer is free to write low-level for loops, use operations such as map, filter, reduce, group and innerjoin, or use a high-level query interface such as Query.jl, all with the high performance one would expect from a statically compiled language.
4. Conversely, the Julia compiler spends effort keeping track of the names and types of all the columns in a table. If you have a very large number of columns (many hundreds), Table is probably not a suitable data structure (the dynamically sized and typed vector of columns used by DataFrames is a better fit here).
5. A Table can be an array of any dimensionality.
6. Unlike a DataFrame, you cannot access a single cell in a single getindex call (you must first retrieve a column and then index a cell in that column). Similarly, the number of columns is not part of the size or length of a Table.
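The difference between changing the data and changing the structure can be sketched as follows. The sample values and the senior column are assumptions added for illustration.

```julia
using TypedTables

t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

# Data can be changed in place: edit a cell through its column, append a row.
t.age[2] = 43
push!(t, (name = "Dana", age = 29))

# The set of columns is fixed, so "adding a column" means building a new table.
t2 = Table(name = t.name, age = t.age, senior = t.age .>= 40)

# A single cell takes two steps: row then field, or column then index.
by_row = t[2].age
by_col = t.age[2]
```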
Conclusion
One way to check which approach suits your task better, the statically typed Table or the dynamic DataFrames, is to look at the code you write: does it tend to refer to columns by name, or are the column names more dynamic (for example, requiring iteration over columns)? The first case favors Table, the second DataFrames.
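The two styles of code mentioned above might look like this. This is only a sketch on a small sample table, and the dictionary column_types is a name chosen for this example.

```julia
using TypedTables

t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

# Static style: the column is referred to by name in the source code.
mean_age = sum(t.age) / length(t)

# Dynamic style: the code iterates over whatever columns happen to exist.
column_types = Dict{Symbol, Type}()
for (colname, col) in pairs(columns(t))
    column_types[colname] = eltype(col)
end
```

If most of your code looks like the first form, the typed Table fits well; if it mostly looks like the second, a dynamic container may be more convenient.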