Tables
In this example, we take a detailed look at tables and how they can be used in Engee scripts.
Let's start by installing and loading the TypedTables package and creating a table.
import Pkg
Pkg.add("TypedTables")
using TypedTables
t = Table(a = [1, 2, 3], b = [2.0, 4.0, 6.0])
Next, we go through the ways of interacting with tables. Let's access the first row and the first column of the table.
t[1]
t.a
Now let's describe what a table is and where it is useful.
A Table is a Julia array type in which each element (row) is a NamedTuple. In particular:
- Externally, a Table is an array of named tuples: each row of the table is represented as a Julia NamedTuple, which is easy to use and efficient. In subtype notation, Table <: AbstractArray{<:NamedTuple}.
- Internally, a Table stores a (named) tuple of arrays, which is a convenient structure for column-based storage of tabular data.
Manipulating the data in a Table is therefore as simple as working with arrays and named tuples, and Julia is designed to make exactly that efficient, simple, and enjoyable.
A Table and its columns can be AbstractArrays of any dimensionality. This lets you take advantage of the powerful features of Julia arrays, such as multidimensional broadcasting. Every column must be an array of the same dimensionality and size as the other columns.
The goal of TypedTables.jl is to introduce very few concepts, with minimal learning overhead, so that you can start manipulating tabular data immediately. The Table type is a thin wrapper over columns that exposes the well-known and highly productive AbstractArray interface. If you are familiar with arrays and named tuples, you can write your own data analysis code using a Table.
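As a minimal sketch of the point above, the following builds a two-dimensional table from two matrix columns. The name grid and the numeric values are illustrative assumptions, not part of the original example.

```julia
using TypedTables

# Two matrix columns of identical size (2×2) give a two-dimensional table.
grid = Table(a = [1 2; 3 4], b = [2.0 4.0; 6.0 8.0])

# The size of the table is the shape of its columns.
dims = size(grid)

# Indexing with two indices returns one row (a named tuple).
cell = grid[2, 1]

# Columns are ordinary matrices, so broadcasting applies to them directly.
total = grid.a .+ grid.b
```

Here dims is (2, 2), cell is (a = 3, b = 6.0), and total is a 2×2 matrix of element-wise sums.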
However, this functionality would be of little use if the data container were inherently slow, or if idiomatic usage led to performance pitfalls. With a Table, for loops over rows can run at the speed of handwritten code in a statically compiled language such as C, because the compiler knows the type of every column. Users can therefore write generic functions that combine handwritten loops, calls to functions such as map, filter, reduce, group and innerjoin, and high-level interfaces provided by packages such as Query.jl, and still get optimal performance.
Finally, since a Table is agnostic about the underlying array storage (and acts more as a convenient metaprogramming layer), the arrays representing each column can have quite different properties, for example support for in-memory, out-of-core, and distributed workloads (see Data Representation for more information).
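The mix of handwritten loops and higher-order functions described above might look like the sketch below. The helper total_age and the table name people are hypothetical names chosen for illustration.

```julia
using TypedTables

people = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

# A handwritten loop over rows; each row is a NamedTuple with typed fields.
function total_age(table)
    total = 0
    for row in table
        total += row.age
    end
    return total
end

loop_total = total_age(people)

# The same table also works with the usual higher-order functions.
older = filter(row -> row.age > 30, people)   # rows that pass the predicate
ages = map(row -> row.age, people)            # one value per row
reduced_total = reduce(+, ages)
```

Both loop_total and reduced_total give the same sum, and older keeps only the rows for Bob and Charlie.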
Ways to create a table
The easiest way to create a table from columns is to use keyword arguments.
t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
The constructor equally accepts a NamedTuple of columns, as in Table((name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])) (note the extra parentheses).
In addition, the Table constructor makes it easy to convert a row-based vector of named tuples into columnar storage.
Table([(name = "Alice", age = 25), (name = "Bob", age = 42), (name = "Charlie", age = 37)])
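A short sketch comparing the three construction styles shown above; the variable names by_keywords, by_tuple and by_rows are illustrative assumptions.

```julia
using TypedTables

# Keyword arguments, a NamedTuple of columns, and a vector of rows.
by_keywords = Table(name = ["Alice", "Bob"], age = [25, 42])
by_tuple    = Table((name = ["Alice", "Bob"], age = [25, 42]))
by_rows     = Table([(name = "Alice", age = 25), (name = "Bob", age = 42)])

# All three hold the same rows.
same = by_keywords == by_tuple == by_rows

# After conversion from rows, the ages are available as one column.
age_column = by_rows.age
```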
Accessing data stored in a Table
Let's start with rows. In our table, a row is just a NamedTuple, which, as we saw earlier, is easy to access: simply index the table by row number.
t[1]
Multiple rows can be indexed in the same way as standard arrays.
t[2:3]
We can also get the dimensions of the table using the standard array functions. Note that size does not include the number of columns: it reports only the shape of the array of rows.
length(t)
size(t)
A column can be accessed by its name.
t.name
The easiest way to extract more than one column is to create a new table from the columns (as in table2 = Table(column1 = table1.column1, column2 = table1.column2, ...)).
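As a sketch of this pattern, the example below keeps two of three columns by building a new table. The city column and its values are hypothetical data added only for illustration.

```julia
using TypedTables

# The city column is made up for this example.
t3 = Table(name = ["Alice", "Bob", "Charlie"],
           age = [25, 42, 37],
           city = ["Oslo", "Rome", "Lima"])

# Keep two of the three columns by passing them to the constructor.
subset = Table(name = t3.name, city = t3.city)

subset_names = columnnames(subset)
first_row = subset[1]
```

The new table has the columns name and city only, and its first row is (name = "Alice", city = "Oslo").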
The columns can be accessed directly as a NamedTuple of arrays using the columns function.
columns(t)
In addition, we can call columnnames to get the column names.
columnnames(t)
It follows from the above that there are two equivalent ways to get a single cell.
t[1].name
t.name[1]
Comparison of TypedTables and DataFrames
For those who have experience with the DataFrames.jl package, this comparison may be useful:
- The set of columns stored in a Table is immutable: you cannot add, delete, or rename columns. However, it is very cheap to create a new table with different columns, which encourages a functional programming style for working with the outer data structure (see also FlexTable, a more flexible alternative). For comparison, IndexedTables and JuliaDB take a similar approach, whereas DataFrames uses an untyped vector of columns.
- The columns themselves can be mutable. You can modify data in one or more columns, as well as add or remove rows. Thus, operations on the data (rather than on the data structure) can take an imperative form, if desired.
- Column types are known to the compiler, which makes direct operations, such as iterating over the rows of a Table, very fast. A programmer can freely combine low-level for loops, operations such as map, filter, reduce, group or innerjoin, and a high-level query interface such as Query.jl, all with the performance that can be expected from a statically compiled language.
- Conversely, the Julia compiler spends effort tracking the names and types of all columns in the table. If you have a very large number of columns (many hundreds), a Table may not be the right data structure; the dynamically sized and typed vector of columns used by a DataFrame may be more appropriate.
- A Table can be an array of any dimensionality.
- Unlike with a DataFrame, you cannot access a single cell in one getindex call (you must first extract the column and then index the cell in that column). Similarly, the number of columns does not contribute to the size or length of a Table.
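The difference between changing data and changing the set of columns can be sketched as follows. The row for "Dana" and the column name person are hypothetical, introduced only for this illustration.

```julia
using TypedTables

t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

# Data inside a column can be changed in place.
t.age[2] = 43

# Rows can be appended; the named tuple must match the table's columns.
push!(t, (name = "Dana", age = 29))

# A cell is reached in two steps: take the column, then index it.
cell = t.age[4]

# The set of columns is fixed, so renaming a column means building a new table.
renamed = Table(person = t.name, age = t.age)
```

After these steps t has four rows, while renamed is a separate table whose first column is called person.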
Conclusion
A good test of whether a statically compiled Table or a dynamic approach such as DataFrames suits your task better is to look at how the code refers to columns. If it tends to refer to columns by name, a Table is a natural fit; if column names are more dynamic (and, for example, iteration over columns is required), a DataFrame is likely to be more appropriate.