Working with ZIP/XML files: converting MATLAB live scripts to the ngscript format

In this example we show how to work with OPC (Open Packaging Conventions) files, that is, ZIP containers holding a set of XML and other files. The format is widespread: it underlies the Office formats (DOCX, XLSX, etc.) and is used by many engineering packages (Autodesk, Simulink, Engee).

We will convert a technical computing document from the mlx format to the ngscript format, transferring all text and code cells, illustrations, hyperlinks and formulas from one document to the other.

Introduction

Ready-made libraries usually exist for popular formats such as Office Open XML (for example, XLSX.jl for spreadsheets). But we often need to quickly process a file format for which no library exists yet, or for which the existing libraries ignore the document elements we need. Let's imagine such a scenario, in which the processing has to be done by hand. As a tutorial example, we will walk through a program that converts MATLAB live scripts (the LiveScript format) to the ngscript format.

For relatively low-level work with these file formats we need the following libraries:

# Base64 ships with Julia, so only the external packages are installed here
import Pkg; Pkg.add(["ZipFile", "EzXML", "JSON"])

using EzXML, ZipFile, JSON, Base64

If any of them is not installed yet, uncomment the following line (remove the # symbol) and run the cell.

#]add EzXML ZipFile JSON Base64

The installation only needs to be performed once, although you may repeat it occasionally to update the library versions.

What an MLX file consists of

The Live Code file format uses Open Packaging Conventions technology, which is an extension of the zip file format. The code and formatted content are stored in an XML document, separate from the output, using the Office Open XML format. To inspect the contents of such a file, simply change its extension to .zip and unpack it using the context menu of the Engee file browser.

Let's examine the contents of an .mlx file after unpacking. We will need the following files from the archive:

document.xml, where all textual information of the document is stored
document.xml.rels - catalogue of additional materials included in the document (formulas, illustrations)

The media folder contains the illustrations inserted in the document, and the mathml folder contains the formulas used, in MathML format.
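
The archive layout can also be inspected from code rather than through the file browser. The following is a minimal sketch, assuming the ZipFile package is installed. It builds a small stand-in archive and lists its entries; the entry names `matlab/document.xml` and `media/image1.png` are illustrative assumptions, not taken from a real .mlx file.

```julia
using ZipFile

# Build a tiny stand-in archive; the entry names are illustrative assumptions
archive_path = tempname() * ".zip"
writer = ZipFile.Writer(archive_path)
for (name, data) in [("matlab/document.xml", "<doc/>"), ("media/image1.png", "not a real image")]
    entry = ZipFile.addfile(writer, name)  # adding a new entry closes the previous one
    write(entry, data)
end
close(writer)

# List the entries, exactly as one would for a real .mlx file
reader = ZipFile.Reader(archive_path)
entry_names = [f.name for f in reader.files]
close(reader)

println(entry_names)
```

Pointing `ZipFile.Reader` at a real .mlx file should work the same way, since renaming the file to .zip is only needed for tools that decide by extension.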

Loading and processing MLX files

To make our code easier to reuse (and to keep it tidy), we organise it as a set of functions.

Here are the functions we will implement at this stage:

getting the list of .mlx files in a directory,
unpacking the archive and reading the files we need,
processing a file of links to media files,
getting a list of cells from an XML file,
translating a cell from XML format to JSON format.

And one auxiliary function for working with illustrations embedded in the file:

getting MIME information about illustrations in the required format (from a name like "image.png", produce the MIME identifier "image/png").

The first step is to get the list of mlx files in a directory.

function get_list_of_files( base_folder )
    # Scan the given directory (non-recursively, without looking into subfolders)
    filenames = readdir( base_folder)
    
    # Keep only the files with the `.mlx` extension
    list_of_files = [joinpath(base_folder,fname) for fname in filenames if endswith( fname, ".mlx")]
end;

Unpack the mlx file and collect the contents we are interested in.

function get_mlx_content( mlx_full_filename )
    
    # Open the archive to read its contents
    mlx_reader = ZipFile.Reader( mlx_full_filename )
    
    # Read the files we are interested in
    document_file = read( [f for f in mlx_reader.files if endswith(f.name, "document.xml")][1], String )
    rels_files_list = [f for f in mlx_reader.files if endswith(f.name, "document.xml.rels")]
    relations_file = length(rels_files_list) > 0 ? read( rels_files_list[1], String ) : nothing;
    
    # This dictionary maps each media file name to its base64-encoded contents
    media_files_list = Dict([ ("../" * f.name, base64encode( read(f, String) )) for f in mlx_reader.files if occursin("media/", f.name ) ])
    
    # Close the archive
    close( mlx_reader )
    
    return document_file, relations_file, media_files_list
end;

Function for reading the media register (pictures and equations).

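function read_rels_file( input_string )

    # Maps each relationship Id to its Target path
    rels_dict = Dict()
    # The archive may contain no relationships file at all
    if !isnothing( input_string ) && !isempty( input_string )
        rels_tree = EzXML.parsexml( input_string );
        # The document uses a default namespace, so bind it to a prefix for XPath
        ns = namespace( rels_tree.root )
        for Rel in findall( "w:Relationship", root(rels_tree), ["w"=>ns])
            rels_dict[Rel["Id"]] = Rel["Target"]
        end
    end
    
    return rels_dict
end;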
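
To show what this function consumes, here is a minimal sketch that parses a simplified relationships document with EzXML. The `Id` and `Target` values are made up for illustration, and a real document.xml.rels carries more attributes than shown here.

```julia
using EzXML

# Simplified stand-in for document.xml.rels; Id and Target values are illustrative
rels_xml = """
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Target="../media/image1.png"/>
  <Relationship Id="rId2" Target="../mathml/eq1.xml"/>
</Relationships>
"""

doc = parsexml(rels_xml)
ns_uri = namespace(root(doc))

# XPath has no notion of a default namespace, so bind the URI to a prefix ("r" is arbitrary)
targets = Dict{String,String}()
for rel in findall("r:Relationship", root(doc), ["r" => ns_uri])
    targets[rel["Id"]] = rel["Target"]
end

println(targets["rId1"])
```

The prefix used in the XPath query does not have to match anything in the file; only the namespace URI matters, which is why the tutorial code can reuse the prefix `w` everywhere.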

Creating a list of cells in the format we need.

function file_to_cells_list( input_string  )
    
    tree = EzXML.parsexml( input_string );
    ns = namespace( tree.root )
    
    # This list collects a pair of values for each cell (paragraph) of the source document: its style and its content
    parsed_mlx = []

    body_node = findfirst( "w:body", root( tree ), ["w"=>ns] );
    for p in findall( "w:p", body_node, ["w"=>ns] )
        
        # Store the paragraph style separately; this node usually occurs once per paragraph
        pStyle = ""
        pPr_node = findfirst("w:pPr", p, ["w"=>ns])
        if !isnothing( pPr_node )
            # Skip the paragraph if it is a section break
            if !isnothing( findfirst("w:sectPr", pPr_node, ["w"=>ns]) ) continue; end;
            # Regular paragraph style
            pStyle_node = findfirst("w:pStyle", pPr_node, ["w"=>ns]);
            if !isnothing( pStyle_node ) pStyle = pStyle_node["w:val"]; end;
        end;
        
        # Now walk through all child nodes of the paragraph (runs and similar fragments)
        pContent = []
        element_name = nothing;
        for run in findall("w:*", p, ["w"=>ns])
            # Paragraph properties were already handled above
            if run.name == "pPr" continue; end;
            runProperty_node = findfirst("w:rPr", run, ["w"=>ns])
            run_content = run.content;
            
            if run.name == "customXml"
                element_name = run["w:element"];
                if element_name == "image"
                    imageNode = findfirst("w:customXmlPr", run, ["w"=>ns])
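                    for attr in findall("w:attr", imageNode, ["w"=>ns])
                        # The relationship id points to the image entry in document.xml.rels
                        if attr["w:name"] == "relationshipId"
                            run_content = attr["w:val"];
                        end
                    end
                end
            elseif run.name == "hyperlink" # Fragment of type w:hyperlink
                # haskey tests for the attribute itself; `in attributes(run)` compares a String with nodes and is always false
                hyperlink_target = haskey(run, "w:docLocation") ? run["w:docLocation"] : nothing;
                run_content = (run.content, hyperlink_target)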
            else
                element_name = nothing;
            end;
            append!( pContent, [(run.name, element_name, runProperty_node, run_content)] );
        end
        
        # Add the style and the paragraph content to the list of cells
        push!( parsed_mlx, (pStyle, pContent) )
    end
    
    return (parsed_mlx, ns)

end;
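
The core of this function is a namespaced XPath walk over paragraphs. The following reduced sketch shows only that part on a hand-written fragment; the fragment is an assumption about what a minimal document.xml could look like, and it keeps just the paragraph style and the text.

```julia
using EzXML

# Hand-written stand-in for document.xml, reduced to three paragraphs
document_xml = """
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="title"/></w:pPr><w:r><w:t>Demo</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="code"/></w:pPr><w:r><w:t>x = 1</w:t></w:r></w:p>
    <w:p><w:r><w:t>Plain text</w:t></w:r></w:p>
  </w:body>
</w:document>
"""

doc = parsexml(document_xml)
ns = ["w" => namespace(root(doc))]

# Collect (style, text) pairs; a paragraph without w:pStyle gets an empty style
cells = Tuple{String,String}[]
for p in findall("w:body/w:p", root(doc), ns)
    style_node = findfirst("w:pPr/w:pStyle", p, ns)
    style = isnothing(style_node) ? "" : style_node["w:val"]
    push!(cells, (style, nodecontent(p)))
end

foreach(println, cells)
```

The full function above goes further than this sketch: it keeps every child fragment separately so that images, hyperlinks and monospace runs can be treated differently later.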

Create a function that returns meta-information about the illustrations for inclusion in the final document.

function process_image_info( image_name )
    
    image_description = ""
    image_name = lowercase(image_name)
    if endswith( image_name, ".png" ) image_description = "image/png"
    elseif endswith( image_name, ".jpg" ) image_description = "image/jpeg"
    elseif endswith( image_name, ".jpeg" ) image_description = "image/jpeg"
    elseif endswith( image_name, ".gif" ) image_description = "image/gif"
    elseif endswith( image_name, ".svg" ) image_description = "image/svg+xml"
    else image_description = "image/unknown"; end;
    
    image_base64_prefix = "data:" * image_description * ";base64,"
    return image_description, image_base64_prefix
end;
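
For comparison, the same mapping can be written as a lookup table. This is only a sketch of an alternative, not the tutorial's function: `MIME_BY_EXT` and `data_uri` are hypothetical names, and the four bytes passed in are just the beginning of the PNG signature, not a complete image.

```julia
using Base64

# Hypothetical lookup table covering the same extensions as process_image_info
const MIME_BY_EXT = Dict(
    ".png"  => "image/png",
    ".jpg"  => "image/jpeg",
    ".jpeg" => "image/jpeg",
    ".gif"  => "image/gif",
    ".svg"  => "image/svg+xml",
)

# Build a data URI from a file name and raw bytes
function data_uri(image_name::AbstractString, bytes::Vector{UInt8})
    ext = lowercase(splitext(image_name)[2])
    mime = get(MIME_BY_EXT, ext, "image/unknown")
    return "data:" * mime * ";base64," * base64encode(bytes)
end

uri = data_uri("Figure.PNG", UInt8[0x89, 0x50, 0x4e, 0x47])
println(uri)  # data:image/png;base64,iVBORw==
```

A table keeps the list of supported types in one place, which makes it easier to extend than a chain of `elseif` branches.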

Translate a cell from XML format to JSON format.

function xml_text_cell_to_plain_text( cell_info, ns, rels_dict, media_files_list )
    
    cell_type, content = cell_info
    attachments = []

    # Sometimes the cell style determines how the output line starts (in markdown)
    if cell_type == "title" plain_text = "# ";
    elseif cell_type == "heading" plain_text = "## ";
    else plain_text = ""; end;
    
    for (run_name, run_element_type, runProperty_node, run_content) in content
        
        if run_name == "pPr" continue
        elseif run_name == "customXml"
            # If the fragment is a mathematical expression
            if run_element_type == "equation"
                plain_text = plain_text * "\$" * run_content * "\$"
            # If the fragment is an illustration
            elseif run_element_type == "image"
                image_name = split( rels_dict[run_content], "/")[end]
                image_content = media_files_list[rels_dict[run_content]]
                image_type, image_prefix = process_image_info( image_name )
                image_base64_content = image_prefix * image_content
                append!( attachments, [(image_name, image_type, image_base64_content)] )
                # Append rather than overwrite, so that text preceding the image is kept
                plain_text = plain_text * "![$image_name](attachment:$image_name)"
            end
        # If the fragment is a hyperlink
        elseif run_name == "hyperlink"
            (hlink_name, hlink_target) = run_content
            if isnothing(hlink_target) plain_text = plain_text * hlink_name;
            else plain_text = plain_text * "[" * hlink_name * "](" * hlink_target * ")"; end;
        # If the fragment is inline code (monospace font)
        elseif !isnothing(runProperty_node) && !isnothing(findfirst("w:rFonts", runProperty_node, ["w"=>ns])) && findfirst("w:rFonts", runProperty_node, ["w"=>ns])["w:cs"] == "monospace"
            plain_text = plain_text * "`" * run_content * "`";
        else
            # The fragment is plain text
            plain_text = plain_text * run_content;
        end
    end
    
    return (cell_type, plain_text, attachments)
end;
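
The markdown assembly rules can be tried out without any XML. In the sketch below, the run representation `(kind, text, target)` and the function name `runs_to_markdown` are simplifications invented for illustration, and the URL is a placeholder.

```julia
# Simplified runs: (kind, text, target); this layout is an assumption of the sketch
runs = [
    (:text, "See ", nothing),
    (:hyperlink, "the docs", "https://example.com"),
    (:text, " and call ", nothing),
    (:code, "plot(x)", nothing),
    (:text, " for ", nothing),
    (:equation, "x^2", nothing),
]

function runs_to_markdown(style, runs)
    # The paragraph style decides the markdown prefix, as in the tutorial function
    prefix = style == "title" ? "# " : style == "heading" ? "## " : ""
    io = IOBuffer()
    print(io, prefix)
    for (kind, text, target) in runs
        if kind == :hyperlink && !isnothing(target)
            print(io, "[", text, "](", target, ")")
        elseif kind == :code
            print(io, "`", text, "`")
        elseif kind == :equation
            print(io, "\$", text, "\$")
        else
            print(io, text)  # plain text and hyperlinks without a target
        end
    end
    return String(take!(io))
end

markdown = runs_to_markdown("heading", runs)
println(markdown)
```

Writing into an `IOBuffer` avoids building many intermediate strings, although for documents of this size repeated concatenation, as used in the tutorial, is perfectly adequate.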

Checking the input files

You can check how many cells, media files, and other pieces of information have been read from each input .mlx file.

for mlx_filename in get_list_of_files( "$(@__DIR__)/input" )
    
    (mlx_file, rels_file, media_list) = get_mlx_content( mlx_filename )    
    println(  )
    println( "* File ", mlx_filename, " contains:" )
    
    cell_list, ns = file_to_cells_list( mlx_file )
    println( length([c for c in cell_list if c[1] != "code"]), " text cells")
    println( length([c for c in cell_list if c[1] == "code"]), " code cells")
    
    rels_dict = read_rels_file( rels_file )
    if length(keys(rels_dict)) > 0
        print( length(keys(rels_dict)), " references to external files" );
        println( " (of which ", length([trg for (ref,trg) in rels_dict if occursin("../media", trg)]), " point to illustrations)")
    end;
    
end

* File /user/prestart/mlx_to_ngscript_conversion/input/OverviewCreatingAndConcatenatingExample.mlx contains:
30 text cells
19 code cells

* File /user/prestart/mlx_to_ngscript_conversion/input/nddemo.mlx contains:
25 text cells
13 code cells
3 references to external files (of which 3 point to illustrations)

Creating a template for the output ngscript files

Now we need to prepare an .ngscript file into which we will put the cells of our document. The Engee script format is backwards compatible, although changes sometimes occur. To have a reasonably up-to-date template for an ngscript document, we will use the very document that is currently open in front of you, the script mlx_to_ngscript_parser.ngscript.

Let's load this ngscript file and derive three templates from it:

document template,
text cell template,
code cell template,

which we will supplement with information as we process the input mlx files.

doc_template = JSON.parsefile( "$(@__DIR__)/mlx_to_ngscript_parser.ngscript" );
code_cell_template = [c for c in doc_template["cells"] if c["cell_type"]=="code" ][1];
text_cell_template = [c for c in doc_template["cells"] if c["cell_type"]=="markdown" ][1];
text_cell_template["isParagraph"] = false; # Adjustment to produce a "regular" paragraph rather than a heading
doc_template["cells"] = [];

Let's see what templates we have obtained:

doc_template

Dict{String, Any} with 4 entries:
  "cells"          => Any[]
  "nbformat_minor" => 5
  "metadata"       => Dict{String, Any}("engee"=>Dict{String, Any}(), "language…
  "nbformat"       => 4

text_cell_template

Dict{String, Any} with 6 entries:
  "cell_type"   => "markdown"
  "isParagraph" => false
  "source"      => Any["# Работа с файлами файлов ZIP/XML на примере перевода l…
  "id"          => "6e8ad43b"
  "attachments" => Dict{String, Any}()
  "metadata"    => Dict{String, Any}("name"=>"Название секции", "engee"=>Dict{S…

code_cell_template

Dict{String, Any} with 6 entries:
  "outputs"         => Any[]
  "cell_type"       => "code"
  "source"          => Any["using EzXML, ZipFile, JSON, Base64"]
  "id"              => "582c09f7"
  "metadata"        => Dict{String, Any}("name"=>"Название секции", "engee"=>Di…
  "execution_count" => 0

These three templates will be enough for us to generate a new .ngscript file for each .mlx file in the input file directory.
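
The conversion loop in the next section copies these templates with `deepcopy` before filling them in. The following sketch, on a stand-in template whose contents are made up, illustrates why a plain `copy` would not be enough for nested structures like these.

```julia
# A stand-in cell template; the keys mirror those of the text cell template
template = Dict{String,Any}(
    "cell_type" => "markdown",
    "source" => Any["placeholder"],
    "attachments" => Dict{String,Any}(),
)

# copy() duplicates only the outer Dict, so the nested "source" array is shared
shallow = copy(template)
shallow["source"][1] = "changed through the shallow copy"
shared = template["source"][1]

# Restore the template, then repeat the experiment with deepcopy()
template["source"][1] = "placeholder"
deep = deepcopy(template)
deep["source"][1] = "changed through the deep copy"
isolated = template["source"][1]

println(shared)    # the template was modified through the shallow copy
println(isolated)  # the template is untouched after the deep copy
```

With a shallow copy, every generated cell would end up sharing one `source` array and one `attachments` dictionary with the template.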

Filling the JSON template

In the final part of this script, we will do the following:

Go through all mlx files in the directory
Process the contents of each file
For each one, create an ngscript document from the template
Add text and code cells to it one by one, not forgetting the graphic attachments.

In addition, we will add everything needed for the MATLAB code to run in the Engee environment:

We will add a call to the required library at the beginning of the file,
For all cells that produce plots (for example, via plot or scatter), we will add commands that save the plot to file storage and display it using the Images.jl library, so that the plot appears in the report.

for mlx_filename in get_list_of_files( "$(@__DIR__)/input" )
    
    # Gather everything we need to know about the current mlx file
    (mlx_file, rels_file, media_list) = get_mlx_content( mlx_filename )
    cell_list, ns = file_to_cells_list( mlx_file )
    rels_dict = read_rels_file( rels_file )
    
    # Create a template for the new document and add a first cell with initialisation code
    ngscript_doc = deepcopy(doc_template);
    new_cell = deepcopy( code_cell_template );
    new_cell["source"][1] = "using MATLAB\nmat\"cd('\$(@__DIR__)')\"";
    push!( ngscript_doc["cells"], new_cell ); # Add the cell to the document
    
    # Count saved plots across the whole document so that their file names do not collide
    plot_counter = 0
    
    for cell in cell_list
        
        cell_type, plain_text, attachments = xml_text_cell_to_plain_text( cell, ns, rels_dict, media_list )
        
        if cell_type == "code"
            
            # Move the MATLAB code into the cell and wrap it in mat"""..."""
            new_cell = deepcopy( code_cell_template )
            new_cell["source"][1] = "mat\"\"\"\n" * plain_text * "\n\"\"\"";
            
            # If the cell contains the substring plot or scatter, add
            # MATLAB commands that save the figure
            # and Engee commands that display it in the report
            if occursin("plot", plain_text) || occursin("scatter", plain_text)
                new_cell["source"][1] = "mat\"\"\"\n" * plain_text * "\nsaveas(gcf,'plot_$(plot_counter).png')\n\"\"\"\nusing Images; load(\"plot_$(plot_counter).png\")";
                plot_counter = plot_counter + 1;
            end;
            
            push!( ngscript_doc["cells"], new_cell );
            
        else

            # Add a text cell
            new_cell = deepcopy( text_cell_template )
            new_cell["source"][1] = plain_text
            # Add the attachments to the cell
            if length(attachments) != 0
                for (image_name, image_type, image_base64_content) in attachments
                    new_cell["attachments"][image_name] = Dict()
                    new_cell["attachments"][image_name][image_type] = image_base64_content
                end
            end
            push!( ngscript_doc["cells"], new_cell )
        
        end
    
    end
    
    # Save the script under a new name
    new_script_name = replace( mlx_filename, ".mlx" => ".ngscript" )
    new_script_name = replace( new_script_name, "input" => "output")
    
    # Serialise the document first, then save the ngscript file
    stringdata = JSON.json( ngscript_doc )
    open(new_script_name, "w") do f write(f, stringdata); end;
    
end
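
The wrapping of code cells can be isolated into a small pure function, which makes the plot numbering easy to check. This is a sketch under the same substring heuristic as the loop above; `wrap_matlab_cell` is a hypothetical helper name, and the function only builds strings, so it runs without MATLAB or Images.jl.

```julia
# Hypothetical helper: wraps MATLAB code in mat"""...""" and, for plotting cells,
# appends commands that save the figure and load it back for display
function wrap_matlab_cell(code::AbstractString, plot_counter::Int)
    needs_plot = occursin("plot", code) || occursin("scatter", code)
    if !needs_plot
        return "mat\"\"\"\n" * code * "\n\"\"\"", plot_counter
    end
    file = "plot_$(plot_counter).png"
    source = "mat\"\"\"\n" * code * "\nsaveas(gcf,'" * file * "')\n\"\"\"\n" *
             "using Images; load(\"" * file * "\")"
    return source, plot_counter + 1
end

# The counter is threaded through the calls, so every plot gets its own file name
counter = 0
first_source, counter = wrap_matlab_cell("plot(1:10)", counter)
second_source, counter = wrap_matlab_cell("x = 1", counter)
third_source, counter = wrap_matlab_cell("scatter(x, y)", counter)

println(third_source)
```

Note that the heuristic matches any occurrence of the substring, so a cell that merely mentions `plot` in a comment or a variable name would also receive the extra commands.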

Conclusion

In this demonstration, we converted files from the MLX format (OPC, based on ZIP/XML) to the ngscript format (based on JSON). The result is a tool that transfers part of the information from a set of MATLAB live scripts to .ngscript documents on the Engee platform.

The individual functions of this example show how to open archives, work with XML data, and generate JSON-based documents. The example also demonstrates the technical computing capabilities of Engee, which are available alongside its model-based design tools.