Tables

This example takes a detailed look at tables and how to use them in Engee scripts. We start by loading the supporting library and creating a table.

In [ ]:
import Pkg
Pkg.add("TypedTables")
using TypedTables
   Resolving package versions...
   Installed Indexing ────────── v1.1.1
   Installed SplitApplyCombine ─ v1.2.3
   Installed TypedTables ─────── v1.4.6
   Installed Dictionaries ────── v0.4.2
    Updating `/user/.project/Project.toml`
  [9d95f2ec] + TypedTables v1.4.6
    Updating `/user/.project/Manifest.toml`
  [85a47980] + Dictionaries v0.4.2
  [313cdc1a] + Indexing v1.1.1
  [03a91e81] + SplitApplyCombine v1.2.3
  [9d95f2ec] + TypedTables v1.4.6
Precompiling project...
Indexing
Dictionaries
SplitApplyCombine
TypedTables
  4 dependencies successfully precompiled in 18 seconds. 88 already precompiled. 4 skipped during auto due to previous errors.
In [ ]:
t = Table(a = [1, 2, 3], b = [2.0, 4.0, 6.0])
Out[0]:
Table with 2 columns and 3 rows:
     a  b
   ┌───────
 1 │ 1  2.0
 2 │ 2  4.0
 3 │ 3  6.0

Next, we'll look in detail at how to work with tables. Let's access the first row and then the first column of the table.

In [ ]:
t[1]
Out[0]:
(a = 1, b = 2.0)
In [ ]:
t.a
Out[0]:
3-element Vector{Int64}:
 1
 2
 3

Now let's describe what a table is and where it is useful.

A Table is a Julia array type in which each element (row) is a NamedTuple. In particular:

  1. Externally, a Table is an array of named tuples. That is, each row of a Table is represented as a NamedTuple, a built-in Julia type that is easy to use and very efficient. In subtype notation, Table <: AbstractArray{<:NamedTuple}.

  2. Internally, a Table stores a (named) tuple of arrays, which is a convenient structure for column-based storage of tabular data.
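Both views can be checked directly in a script. The following is a minimal sketch, assuming TypedTables is installed; the variable names (tbl, cols and so on) are illustrative and not part of the library.

```julia
using TypedTables

tbl = Table(a = [1, 2, 3], b = [2.0, 4.0, 6.0])

# Outer view: the table is a one-dimensional array of named tuples.
is_row_array = tbl isa AbstractVector{<:NamedTuple}
first_row = tbl[1]

# Inner view: the storage is a named tuple of column arrays.
cols = columns(tbl)

# Property access returns the stored column itself, not a copy.
same_storage = cols.a === tbl.a
```

Because the column is the stored array itself, changes made through tbl.a are visible in the rows of tbl.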

Thus, manipulating data in a Table is as easy as working with arrays and named tuples, which Julia was designed to make simple, efficient and fun.

Tables and their columns can be an AbstractArray of any dimensionality. This lets you take advantage of Julia's powerful array features, such as multidimensional broadcasting. Each column must be an array with the same dimensionality and size as the other columns.
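As an illustration of a table with more than one dimension, here is a small sketch in which both columns are 2×2 matrices. The names grid, x and y and the data values are made up for the example.

```julia
using TypedTables

# Both columns are 2×2 matrices, so the table is a 2×2 array of rows.
grid = Table(x = [1 2; 3 4], y = [10.0 20.0; 30.0 40.0])

dims = size(grid)    # dimensions are taken from the columns
cell = grid[1, 2]    # the row stored at position (1, 2)

# Broadcasting works on columns as on any other array.
scaled = grid.x .* grid.y
```

Passing columns of different sizes to the constructor is expected to fail, since every column must have the same shape.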

The goal of TypedTables.jl is to introduce very few concepts, with a minimal learning curve, so that you can manipulate tabular data naturally. The Table type is a simple wrapper over columns that presents the well-known and extremely productive AbstractArray interface. If you are familiar with arrays and named tuples, you can write your own data analytics with a Table.

However, this would be of little use if the data container were inherently slow, or if idiomatic usage patterns led to performance pitfalls. Here, for loops over the rows of a Table can run at the speed of handwritten code in a statically compiled language such as C, because the compiler is fully aware of the type of each column. Thus, users can write generic functions using a mixture of handwritten loops, calls to functions such as map, filter, reduce, group and innerjoin, and high-level interfaces provided by packages like Query.jl, and still get optimal performance.
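The sketch below mixes a handwritten loop with generic array functions on the same table. It is an illustration only: the helper total_age and the variable names are hypothetical, and no timing claims are made here.

```julia
using TypedTables

people = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

# Handwritten loop: each row is a NamedTuple with concretely typed fields.
function total_age(table)
    total = 0
    for row in table
        total += row.age
    end
    return total
end

total = total_age(people)

# The same data processed with generic array functions.
ages_next_year = map(row -> row.age + 1, people)
over_30 = filter(row -> row.age > 30, people)
oldest = reduce((a, b) -> a.age >= b.age ? a : b, people)
```

Here filter returns a new Table, while map returns an ordinary vector because the function produces plain integers rather than named tuples.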

Finally, since Table has no opinion on the underlying array storage (and acts more as a convenient metaprogramming layer), the arrays representing each column can have quite different properties - for example, support for in-memory, out-of-core, and distributed workloads (see "Data Representation" for more details).

Ways to create a table

The easiest way to create a table from columns is to use keyword arguments.

In [ ]:
t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
Out[0]:
Table with 2 columns and 3 rows:
     name     age
   ┌─────────────
 1 │ Alice    25
 2 │ Bob      42
 3 │ Charlie  37

The constructor equally accepts a NamedTuple of columns, as in Table((name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])) (note the extra parentheses).

It is also easy to convert a row-based vector of named tuples to column-based storage using the Table constructor.

In [ ]:
Table([(name = "Alice", age = 25), (name = "Bob", age = 42), (name = "Charlie", age = 37)])
Out[0]:
Table with 2 columns and 3 rows:
     name     age
   ┌─────────────
 1 │ Alice    25
 2 │ Bob      42
 3 │ Charlie  37

Accessing data stored in Tables

Let's start with rows. A row of our table is simply a NamedTuple, which, as we saw earlier, is easy to access: just index the table with the row number.

In [ ]:
t[1]
Out[0]:
(name = "Alice", age = 25)

Multiple rows can be indexed similarly to standard arrays.

In [ ]:
t[2:3]
Out[0]:
Table with 2 columns and 2 rows:
     name     age
   ┌─────────────
 1 │ Bob      42
 2 │ Charlie  37

We can also get table dimensions using standard Engee functions. Note that the number of columns is not included in the result of size.

In [ ]:
length(t)
Out[0]:
3
In [ ]:
size(t)
Out[0]:
(3,)

Columns can be accessed by name using dot syntax.

In [ ]:
t.name
Out[0]:
3-element Vector{String}:
 "Alice"
 "Bob"
 "Charlie"

The easiest way to retrieve more than one column is to create a new table from the columns (as in table2 = Table(column1 = table1.column1, column2 = table1.column2, ...)).
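A short sketch of this pattern follows. The table contents, the extra city column and the new column name person are made up for illustration.

```julia
using TypedTables

table1 = Table(
    name = ["Alice", "Bob", "Charlie"],
    age = [25, 42, 37],
    city = ["Oslo", "Rome", "Lima"],
)

# Keep two of the three columns, renaming one of them along the way.
table2 = Table(person = table1.name, age = table1.age)

selected_names = columnnames(table2)
```

The same approach covers dropping or renaming columns: list only the columns you want under the names you want.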

All columns can be retrieved at once as a NamedTuple of arrays using the columns function.

In [ ]:
columns(t)
Out[0]:
(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

In addition, the columnnames function returns the column names.

In [ ]:
columnnames(t)
Out[0]:
(:name, :age)

From all of the above, let's highlight two equivalent ways of obtaining a data cell.

In [ ]:
t[1].name
Out[0]:
"Alice"
In [ ]:
t.name[1]
Out[0]:
"Alice"

Comparison of TypedTables and DataFrames

For those who have experience using the DataFrames.jl package, this comparison may be useful:

  1. The set of columns stored in a Table is immutable: you cannot add, delete or rename columns. However, it is very cheap to create a new table with different columns, which encourages a functional programming style for working with the outer data structure. (See also FlexTable, a more flexible alternative.) IndexedTables and JuliaDB take a similar approach, whereas DataFrames uses an untyped vector of columns.

  2. The columns themselves can be mutable. You can modify data in one or more columns, as well as add or remove rows. Thus, operations on the data (rather than on the data structure) can take an imperative form if desired.

  3. The column types are known to the compiler, which makes direct operations such as iterating over Table rows very fast. The programmer is free to write a combination of low-level for loops, use operations such as map, filter, reduce, group and innerjoin, or use a high-level query interface such as Query.jl, all with the high performance one would expect from a statically compiled language.

  4. Conversely, the Julia compiler spends effort keeping track of the names and types of all the columns in a table. If you have a very large number of columns (many hundreds), Table is probably not a suitable data structure (the dynamically sized and dynamically typed vector of columns used by DataFrame is a better fit here).

  5. Tables can be an array of any dimensionality.

  6. Unlike DataFrame, you cannot access a single cell in a single getindex call (you must first retrieve a column and then index a cell from that column). Similarly, the number of columns does not participate in the size or length of a Table.
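The first two points of the list can be sketched as follows: data and rows are changed in place, while a new column requires building a new table. The names staff, staff2 and senior are hypothetical, chosen only for this example.

```julia
using TypedTables

staff = Table(name = ["Alice", "Bob"], age = [25, 42])

# Modify data inside an existing column.
staff.age[2] = 43

# Replace a whole row, then append a new one.
staff[1] = (name = "Alice", age = 26)
push!(staff, (name = "Charlie", age = 37))

# The set of columns is fixed, so "adding" a column means building a new table.
staff2 = Table(name = staff.name, age = staff.age, senior = staff.age .> 40)
```

The original staff table still has two columns after this code runs; only staff2 has the third.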

Conclusion

To decide which approach suits your task better, the statically compiled Table or the dynamic DataFrames, check how your code refers to columns. If it tends to refer to columns by name, Table is a good fit; if column names are more dynamic (and, for example, require iterating over columns), the DataFrames approach may be more appropriate.
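The two styles of column access can be contrasted in a short sketch. The variable names are illustrative; the code uses only columns and standard Julia functions.

```julia
using TypedTables

people = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

# Static style: the column name is written in the code.
mean_age = sum(people.age) / length(people)

# Dynamic style: iterate over whatever columns happen to exist.
column_types = Dict(name => eltype(col) for (name, col) in pairs(columns(people)))
```

If most of your code looks like the second form, with column names only known at run time, that is a hint that a dynamic container may serve you better.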