Engee documentation
Notebook

Advanced file management

Let's imagine a situation where we have several data sets that are spread out in folders, and there may also be garbage in the folders (pictures with funny raccoons, text documents). I would like to be able to find the necessary files for later work with them. The problem is that Julia implements only basic work with files - <https://engee.com/helpcenter/stable/ru-en/julia/base/file.html >, which is clearly insufficient for such a task.

In this post, I will make a smarter tool for working with files using a data structure known as a tree.

Creating a tree

Let's observe that the structure of directories and files is strikingly similar to a tree. Trees in programming are a data structure that describes a special kind of graphs. In such graphs, all nodes except the "root" have one parent.:

image.png

Optimal algorithms for graph traversal and graph modification are known for such graphs, and they are often already implemented. My idea is as follows: to represent the contents of a folder and its subfolders as a tree, each node of which will be a file or folder. Each node will store separately the path, name, extension, date and time of creation and modification, as well as the folder attribute. And in order to organize the tree structure, I will store the "descendants" of this node.:

In [ ]:
using Dates

struct FileTreeNode
    path::String
    name::String
    ext::String
    isdir::Bool
    created::DateTime
    modified::DateTime
    children::Vector{FileTreeNode}
end

This way, I will be able to traverse the tree of files and folders much easier and faster than if I were using the Julia standard library.

To get the date and time when the file was created and modified, as well as to get the file name, path, and extension, I will write additional functions:

In [ ]:
function get_metadata(path::String)
    st = stat(path)
    created = unix2datetime(st.ctime)
    modified = unix2datetime(st.mtime)
    return created, modified
end

function split_name_ext(path::String)
    name = basename(path)
    base, ext = splitext(name)
    return base, ext
end
Out[0]:
split_name_ext (generic function with 1 method)

Growing a tree

The necessary data structures and auxiliary functions have been created, and you can proceed to the implementation of the directory structure tree.

Let's create a function that will receive the creation and modification time of a file or folder, the name and extensions for a certain path. Next, using readdir we will get a list of files and folders inside the current folder.

We will repeat the same operation for each of the detected folders. That is, our function will call itself. This technique is called recursion.

Let's add one more limitation to our function: the search depth. This restriction will limit the nesting level of folders to crawl and speed up tree building.

As a result of this function, we get a tree node.

In [ ]:
function build_tree(path::String; maxdepth=typemax(Int), depth=0)
    is_dir = isdir(path)

    name, ext = split_name_ext(path)
    created, modified = get_metadata(path)
    if is_dir && depth < maxdepth
        entries = readdir(path; join=true)
        children = [
            build_tree(e; maxdepth, depth=depth+1)
            for e in sort(entries)
        ]
    else
        children = FileTreeNode[]
    end

    return FileTreeNode(path, name, ext, is_dir, created, modified, children)
end
Out[0]:
build_tree (generic function with 1 method)

Next, we will simplify the task of visualizing the tree using the AbstractTrees.jl library. We will define two new functions for our tree:

  • children() - to get the descendants of a tree node
  • printnode() - to display the node on the screen
In [ ]:
import Pkg
Pkg.add("AbstractTrees")
import AbstractTrees: children, printnode

children(node::FileTreeNode) = node.children

function printnode(io::IO, node::FileTreeNode)

    if node.isdir
        print(io, "📁 ", node.name)
    else
        print(io, "📄 ", node.name, node.ext)
    end
end
   Resolving package versions...
   Installed FiniteDiff ──────────── v2.29.0
   Installed LazyArrays ──────────── v2.9.5
   Installed LineSearch ──────────── v0.1.6
   Installed DiffEqCallbacks ─────── v4.12.0
   Installed SparseMatrixColorings ─ v0.4.26
   Installed DiffEqNoiseProcess ──── v5.27.0
     Project No packages added to or removed from `~/.project/Project.toml`
    Manifest No packages added to or removed from `~/.project/Manifest.toml`
Out[0]:
printnode (generic function with 5 methods)

Ready! Let's check that the tree is going to:

In [ ]:
tree = build_tree(@__DIR__,maxdepth=1)
using AbstractTrees: print_tree
print_tree(tree)
📁 FileOps
├─ 📁 .git
├─ 📁 DataDepot
├─ 📄 dirprint.ngscript
├─ 📄 filetree_ops.ngscript
└─ 📄 filetree_ops_1.ipynb

Putting the tree into practice

Let's glue together 3 datasets located in the DataDepot folder.

First, let's see what's in our folder.:

In [ ]:
tree = build_tree(joinpath(@__DIR__,"DataDepot"),maxdepth=2)
print_tree(tree)
📁 DataDepot
├─ 📁 S1
│  ├─ 📄 data1.csv
│  └─ 📄 trash.rc
├─ 📁 S2
│  ├─ 📄 data2.csv
│  └─ 📄 notsotrash.dc
└─ 📁 S3
   └─ 📄 data3.csv

Then, we will get all *.csv files by bypassing the tree created above.:

In [ ]:
function find_files_by_ext(node::FileTreeNode, ext::String,acc=String[])
    if !startswith(ext,".")
        ext = "."*ext
    end
    if isequal(node.ext,ext)
        push!(acc, node.path)
        println("$(acc)")
    end
    for c in node.children
        accchild = find_files_by_ext(c, ext)
        if ~isempty(accchild)
            append!(acc,accchild)

        end
    end
    return acc
end

csv_files = find_files_by_ext(tree,"csv")
["/user/work/FileOps/DataDepot/S1/data1.csv"]
["/user/work/FileOps/DataDepot/S2/data2.csv"]
["/user/work/FileOps/DataDepot/S3/data3.csv"]
Out[0]:
3-element Vector{String}:
 "/user/work/FileOps/DataDepot/S1/data1.csv"
 "/user/work/FileOps/DataDepot/S2/data2.csv"
 "/user/work/FileOps/DataDepot/S3/data3.csv"

Now let's download them and glue them together.:

In [ ]:
using CSV
df_v = Vector{DataFrame}()
for f in csv_files
    df = CSV.read(joinpath(pwd(),f),DataFrame)
    println("Read $(nrow(df)) lines from the $f file")
    push!(df_v,df)
end
df_v = reduce(vcat,df_v)
10 lines have been read from the /user/work/FileOps/DataDepot/S1/data1.csv file
10 lines have been read from the /user/work/FileOps/DataDepot/S2/data2.csv file
10 lines have been read from the /user/work/FileOps/DataDepot/S3/data3.csv file
Out[0]:
30×3 DataFrame
5 rows omitted
Rowx1x2x3
Float64Float64Float64
10.8947660.736590.0567356
20.7408720.4223710.447714
30.6577770.8259970.168484
40.8940870.02343830.256112
50.8466650.03092550.895526
60.4627590.8838610.8525
70.1848760.4524160.432112
80.6785190.3619010.00114721
90.07336790.1528410.837117
100.5375860.1761290.0318991
110.8328410.3048720.843521
120.2476430.6342890.639223
130.3063910.0744850.85542
190.7236510.01892840.845076
200.288120.9466720.700205
210.8161040.01173130.145698
220.2942450.2190640.343588
230.3882120.5163990.235681
240.4011220.4960150.420165
250.1296090.7331280.098028
260.659690.1508910.102492
270.9379920.9765670.137387
280.8437420.8598980.522218
290.2764890.880830.271357
300.6450550.9117670.108152

Conclusions

In this publication, we looked at an example of using classical data structures from programming to solve practical engineering problems. In subsequent publications, other programming techniques that facilitate the tasks of technical calculations will be considered.