
Advanced file management

Let's imagine that we have several data sets scattered across folders, and that those folders may also contain clutter (pictures of funny raccoons, text documents). I would like to be able to find the files I need so that I can work with them later. Out of the box, Julia provides only basic file operations: <https://engee.com/helpcenter/stable/ru-en/julia/base/file.html>.

In this post, I will make a smarter tool for working with files.

How will we work?

We won't go into the details of how the file system works, since we don't need them. It is enough to observe that the structure of directories and files closely resembles a tree. A tree is a special kind of graph in which every node except the "root" has exactly one parent:

[Figure: image.png]

Efficient algorithms for traversing and modifying such graphs are well known, and they are often already implemented. My idea is as follows: I will represent the contents of a folder as a tree in which each node is a file or a folder. For each node I will store the path, name, extension, date and time of creation and modification, and a flag that marks folders. To form the tree structure, each node will also store its "descendants" (child nodes):

In [ ]:
using Dates

struct FileTreeNode
    path::String
    name::String
    ext::String
    isdir::Bool
    created::DateTime
    modified::DateTime
    children::Vector{FileTreeNode}
end

Additionally, we will create auxiliary functions that return the date and time when a file was created and modified, as well as its name and extension:

In [ ]:
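function get_metadata(path::String)
    st = stat(path)

    # Caveat: per the Julia docs, ctime is the time the file's metadata was
    # last changed, so it only approximates the creation time.
    created = unix2datetime(st.ctime)
    modified = unix2datetime(st.mtime)  # unix2datetime yields UTC values

    return created, modified
end

# Split the last path component into (name without extension, extension).
function split_name_ext(path::String)
    name = basename(path)
    base, ext = splitext(name)
    return base, ext
end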
Out[0]:
split_name_ext (generic function with 1 method)
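
To see what these helpers return, here is a small sketch that applies the same Base calls (basename, splitext, stat, unix2datetime) to a throwaway file. The file name measurements.csv and the temporary directory are made up for this demonstration and are not part of the original notebook.

```julia
using Dates

# Hypothetical file, created only for this demonstration.
dir = mktempdir()
path = joinpath(dir, "measurements.csv")
write(path, "t,x\n0,1\n")

# The same calls that split_name_ext and get_metadata make internally.
base, ext = splitext(basename(path))
modified = unix2datetime(stat(path).mtime)

println(base, " | ", ext, " | ", modified)

# Edge cases worth knowing about:
tar_parts = splitext("archive.tar.gz")   # ("archive.tar", ".gz")
dot_parts = splitext(".gitignore")       # (".gitignore", "")
```

Note that splitext only splits off the last extension, and that a leading dot (as in .gitignore) is not treated as an extension.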

Growing a tree

We now have everything we need to build a tree from the directory structure.

Using readdir, we get the list of files and folders inside the current folder, and then repeat the same operation for every folder we find. A function that calls itself in this way is called recursive.

In programming terms, we perform a recursive traversal of the tree of files and folders.

Let's also add one restriction to our function: a maximum search depth.

In [ ]:
function build_tree(path::String; maxdepth=typemax(Int), depth=0)
    # isdir follows symbolic links, so set maxdepth if link cycles are possible
    is_dir = isdir(path)

    name, ext = split_name_ext(path)
    created, modified = get_metadata(path)
    if is_dir && depth < maxdepth
        # join=true returns full paths; readdir already sorts entries by default
        entries = readdir(path; join=true)
        children = [
            build_tree(e; maxdepth, depth=depth+1)
            for e in entries
        ]
    else
        # files, and folders at the depth limit, become leaves
        children = FileTreeNode[]
    end

    return FileTreeNode(path, name, ext, is_dir, created, modified, children)
end
Out[0]:
build_tree (generic function with 1 method)
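
To make the effect of the depth limit visible, here is a minimal sketch of the same recursion pattern that only collects paths instead of building nodes. The helper name list_paths and the folder layout (a, b, top.txt and so on) are hypothetical and exist only for this illustration.

```julia
# Hypothetical layout, created in a temporary directory for the demonstration.
root = mktempdir()
mkpath(joinpath(root, "a", "b"))
write(joinpath(root, "top.txt"), "")
write(joinpath(root, "a", "mid.txt"), "")
write(joinpath(root, "a", "b", "deep.txt"), "")

# Collect paths recursively, descending at most `maxdepth` levels below `path`.
function list_paths(path; maxdepth=typemax(Int), depth=0)
    found = [path]
    if isdir(path) && depth < maxdepth
        for entry in readdir(path; join=true)
            append!(found, list_paths(entry; maxdepth, depth=depth + 1))
        end
    end
    return found
end

shallow = list_paths(root; maxdepth=1)   # root, a, top.txt
full = list_paths(root)                  # every file and folder

println(length(shallow), " entries with maxdepth=1, ", length(full), " without a limit")
```

With maxdepth=1 the folder a is listed but not opened, which is the same cut-off that build_tree applies when depth reaches maxdepth.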
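
To display the tree, we implement two functions from the AbstractTrees interface: children returns the descendants of a node, and printnode prints a single node.

In [ ]: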
import AbstractTrees: children, printnode

children(node::FileTreeNode) = node.children

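function printnode(io::IO, node::FileTreeNode)
    if node.isdir
        # splitext also splits dotted folder names (e.g. "data.v2"),
        # so print both parts to show the full folder name
        print(io, "📁 ", node.name, node.ext)
    else
        print(io, "📄 ", node.name, node.ext)
    end
end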
Out[0]:
printnode (generic function with 5 methods)
In [ ]:
using AbstractTrees: print_tree

tree = build_tree("."; maxdepth=2)
print_tree(tree)
nothing
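
With the tree in memory, the original goal (picking out the needed files among the clutter) reduces to a walk over the nodes. Below is one possible sketch, not the notebook's own implementation: DemoNode is a simplified, hypothetical stand-in for FileTreeNode, and find_by_ext is a helper name I made up for the example.

```julia
# Simplified, hypothetical stand-in for FileTreeNode (only the fields used here).
struct DemoNode
    name::String
    ext::String
    isdir::Bool
    children::Vector{DemoNode}
end

# Depth-first walk that collects every file whose extension equals `ext`.
function find_by_ext(node::DemoNode, ext::AbstractString)
    hits = DemoNode[]
    if !node.isdir && node.ext == ext
        push!(hits, node)
    end
    for child in node.children
        append!(hits, find_by_ext(child, ext))
    end
    return hits
end

# A small hand-built tree: two data files and one raccoon picture.
demo = DemoNode("data", "", true, [
    DemoNode("raccoon", ".png", false, DemoNode[]),
    DemoNode("run1", ".csv", false, DemoNode[]),
    DemoNode("nested", "", true, [DemoNode("run2", ".csv", false, DemoNode[])]),
])

csv_files = find_by_ext(demo, ".csv")
println([n.name * n.ext for n in csv_files])
```

The same function body should work on the real tree if DemoNode is replaced with FileTreeNode, since it only touches the isdir, ext and children fields.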