Working with ZIP/XML files: translating MATLAB live scripts to the ngscript format¶
In this example we show how to work with OPC (Open Packaging Conventions) files, i.e. ZIP containers holding a set of XML and other files. This format is found everywhere: formats of this type are used in all Office applications (DOCX, XLSX, etc.) and in many engineering packages (Autodesk, Simulink, Engee).
We will convert a technical computing file from the mlx format to the ngscript format, transferring all text and code cells, illustrations, hyperlinks and formulas from one document to the other.
Introduction¶
Ready-made libraries are usually available for popular formats such as Office Open XML (e.g. XLSX.jl for spreadsheets). But often we need to quickly process a file format for which no library exists yet, or for which the existing libraries ignore the document elements we need. Let's imagine such a scenario, in which we have to do the processing by hand. As a tutorial example, we will write a program that converts MATLAB LiveScript files to the ngscript format.
For relatively low-level work with these file formats we need the following libraries:
import Pkg; Pkg.add(["ZipFile", "EzXML", "JSON"])
using EzXML, ZipFile, JSON, Base64
If any of them are not already installed, run the following cell after removing the # symbol (that is, uncommenting the line).
#]add EzXML ZipFile JSON Base64
This installation only needs to be performed once, although you may rerun it from time to time to update the library versions.
What an MLX file consists of¶
The Live Code file format uses Open Packaging Conventions technology, which is an extension of the ZIP file format. The code and formatted content are stored in an XML document that differs from a document in the Office Open XML format. To inspect the contents of such a file, simply change its extension to *.zip and then unzip it using the context menu of the Engee file browser.
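Since an OPC file is a ZIP container, a quick sanity check before unpacking is to look at its first bytes. The sketch below is only an illustration: the helper name `looks_like_zip` is hypothetical, and it relies on the fact that a non-empty ZIP archive starts with the local file header signature `PK\x03\x04`.

```julia
# Hypothetical helper: report whether a file starts with the ZIP
# local-file-header signature "PK\x03\x04" (bytes 0x50 0x4b 0x03 0x04).
function looks_like_zip(path::AbstractString)
    open(path, "r") do io
        signature = read(io, 4)   # reads at most 4 bytes
        return signature == UInt8[0x50, 0x4b, 0x03, 0x04]
    end
end

# Self-contained demonstration on two temporary files
zip_like = tempname()
write(zip_like, UInt8[0x50, 0x4b, 0x03, 0x04, 0x14, 0x00])
text_file = tempname()
write(text_file, "not an archive")

println(looks_like_zip(zip_like))
println(looks_like_zip(text_file))
```

The check only inspects the signature; it does not validate the rest of the archive.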
Let's examine the contents of the *.mlx file after unpacking. We will need the following files from the archive:
- document.xml, where all the textual information of the document is stored,
- document.xml.rels, a catalogue of additional materials included in the document (formulas, illustrations).
The folder media contains the illustrations inserted in the document, and the folder mathml contains the formulas used, in MathML format.
Loading and processing MLX files¶
To make our code easier to reuse (and to keep it tidy), we organise it as a set of functions.
Here are the functions we will implement at this stage:
- getting the list of mlx files in a directory,
- unpacking the archive and reading the files we need,
- processing the file of links to media files,
- getting a list of cells from an XML file,
- translating a cell from XML format to JSON format.
And one auxiliary function for working with illustrations embedded in the file:
- getting MIME information about illustrations in the required format (for example, turning a name like "image.png" into the MIME identifier "image/png").
The first thing to do is to get a list of the mlx files in the directory.
function get_list_of_files( base_folder )
    # Scan the target directory (non-recursively, without descending into subfolders)
    filenames = readdir( base_folder )
    # Keep only the files with the `.mlx` extension
    list_of_files = [joinpath(base_folder, fname) for fname in filenames if endswith( fname, ".mlx" )]
    return list_of_files
end;
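The filter above matches the extension exactly as written, so a file named `REPORT.MLX` would be skipped. If that matters, a case-insensitive variant can be sketched with `splitext`; the function name `list_mlx_files` and the demo file names are hypothetical.

```julia
# Hypothetical variant of the directory scan that ignores the case of the extension
function list_mlx_files(base_folder::AbstractString)
    return [joinpath(base_folder, name) for name in readdir(base_folder)
            if lowercase(splitext(name)[2]) == ".mlx"]
end

# Demonstration in a temporary directory with three empty files
demo_dir = mktempdir()
for name in ("a.mlx", "B.MLX", "notes.txt")
    touch(joinpath(demo_dir, name))
end
found = basename.(list_mlx_files(demo_dir))
println(found)
```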
Unzip the mlx file and collect the contents we are interested in.
function get_mlx_content( mlx_full_filename )
    # Open the archive to read its contents
    mlx_reader = ZipFile.Reader( mlx_full_filename )
    # Read the files we are interested in
    document_file = read( [f for f in mlx_reader.files if endswith(f.name, "document.xml")][1], String )
    rels_files_list = [f for f in mlx_reader.files if endswith(f.name, "document.xml.rels")]
    relations_file = length(rels_files_list) > 0 ? read( rels_files_list[1], String ) : nothing;
    # This dictionary holds pairs of file name and Base64-encoded content
    media_files_list = Dict([ ("../" * f.name, base64encode( read(f, String) )) for f in mlx_reader.files if occursin("media/", f.name ) ])
    # Close the archive
    close( mlx_reader )
    return document_file, relations_file, media_files_list
end;
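The media files are stored as Base64 text so that binary image data can later be embedded in a JSON document. The following minimal sketch shows the round trip with the `Base64` standard library; the eight bytes of the PNG file signature serve as stand-in image data.

```julia
using Base64

# First eight bytes of every PNG file, used here as stand-in image data
png_signature = UInt8[0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]

encoded = base64encode(png_signature)   # plain text that is safe to embed in JSON
decoded = base64decode(encoded)         # back to the original bytes

println(encoded)
println(decoded == png_signature)
```

Decoding returns exactly the bytes that were encoded, which is why the images survive the conversion unchanged.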
A function for reading the registry of media files (images and equations).
function read_rels_file( input_string )
    rels_dict = Dict()
    # The archive may contain no relationships file at all
    if !isnothing( input_string ) && !isempty( input_string )
        rels_tree = EzXML.parsexml( input_string );
        ns = namespace( rels_tree.root )
        # Map every relationship Id to the path of its target file
        for Rel in findall( "w:Relationship", root(rels_tree), ["w"=>ns])
            rels_dict[Rel["Id"]] = Rel["Target"]
        end
    end
    return rels_dict
end;
Creating a list of cells in the format we need.
function file_to_cells_list( input_string )
    tree = EzXML.parsexml( input_string );
    ns = namespace( tree.root )
    # This list collects a pair of values for every cell (paragraph) of the source document: its style and its content
    parsed_mlx = []
    body_node = findfirst( "w:body", root( tree ), ["w"=>ns] );
    for p in findall( "w:p", body_node, ["w"=>ns] )
        # Store the paragraph style separately; this node usually occurs once per paragraph
        pStyle = ""
        pPr_node = findfirst("w:pPr", p, ["w"=>ns])
        if !isnothing( pPr_node )
            # Skip the paragraph if it is a section break
            if !isnothing( findfirst("w:sectPr", pPr_node, ["w"=>ns]) ) continue; end;
            # Regular paragraph style
            pStyle_node = findfirst("w:pStyle", pPr_node, ["w"=>ns]);
            if !isnothing( pStyle_node ) pStyle = pStyle_node["w:val"]; end;
        end;
        # Now walk through all run-type nodes (the fragments of the paragraph)
        pContent = []
        element_name = nothing;
        for run in findall("w:*", p, ["w"=>ns])
            if run.name == "pPr" continue; end;
            runProperty_node = findfirst("w:rPr", run, ["w"=>ns])
            run_content = run.content;
            if run.name == "customXml"
                element_name = run["w:element"];
                if element_name == "image"
                    imageNode = findfirst("w:customXmlPr", run, ["w"=>ns])
                    for attr in findall("w:attr", imageNode, ["w"=>ns])
                        if attr["w:name"] == "relationshipId"
                            run_content = attr["w:val"];
                        end
                    end
                end
            elseif run.name == "hyperlink" # Fragment of type w:hyperlink
                # haskey tests for the attribute; `in attributes(run)` would compare a string with nodes and never match
                hyperlink_target = haskey(run, "w:docLocation") ? run["w:docLocation"] : nothing;
                run_content = (run.content, hyperlink_target)
            else
                element_name = nothing;
            end;
            push!( pContent, (run.name, element_name, runProperty_node, run_content) );
        end
        # Add the style and the paragraph content to the list of cells
        push!( parsed_mlx, (pStyle, pContent) )
    end
    return (parsed_mlx, ns)
end;
Create a function that returns meta-information about the illustrations for inclusion in the final document.
function process_image_info( image_name )
    image_description = ""
    image_name = lowercase(image_name)
    if endswith( image_name, ".png" ) image_description = "image/png"
    elseif endswith( image_name, ".jpg" ) image_description = "image/jpeg"
    elseif endswith( image_name, ".jpeg" ) image_description = "image/jpeg"
    elseif endswith( image_name, ".gif" ) image_description = "image/gif"
    elseif endswith( image_name, ".svg" ) image_description = "image/svg+xml"
    else image_description = "image/unknown"; end;
    image_base64_prefix = "data:" * image_description * ";base64,"
    return image_description, image_base64_prefix
end;
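The same mapping can also be written as a lookup table, which is easier to extend than a chain of `elseif` branches. This is only a sketch of an alternative: the names `image_mime_types` and `image_data_uri_prefix` are hypothetical, and the list of extensions simply mirrors the function above.

```julia
# Hypothetical table-driven variant; the extensions mirror process_image_info
image_mime_types = Dict(
    ".png"  => "image/png",
    ".jpg"  => "image/jpeg",
    ".jpeg" => "image/jpeg",
    ".gif"  => "image/gif",
    ".svg"  => "image/svg+xml",
)

function image_data_uri_prefix(image_name::AbstractString)
    # splitext returns (name, extension); the extension includes the leading dot
    extension = lowercase(splitext(image_name)[2])
    mime = get(image_mime_types, extension, "image/unknown")
    return mime, "data:" * mime * ";base64,"
end

println(image_data_uri_prefix("Figure_1.PNG"))
```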
Translate a cell from its XML representation into Markdown text and a list of attachments, ready to be placed into the JSON document.
function xml_text_cell_to_plain_text( cell_info, ns, rels_dict, media_files_list )
    cell_type, content = cell_info
    attachments = []
    # Sometimes the cell style determines the beginning of the output line (in markdown)
    if cell_type == "title" plain_text = "# ";
    elseif cell_type == "heading" plain_text = "## ";
    else plain_text = ""; end;
    for (run_name, run_element_type, runProperty_node, run_content) in content
        if run_name == "pPr" continue
        elseif run_name == "customXml"
            # The fragment is a mathematical expression
            if run_element_type == "equation"
                plain_text = plain_text * "\$" * run_content * "\$"
            # The fragment is an illustration
            elseif run_element_type == "image"
                image_name = split( rels_dict[run_content], "/")[end]
                image_content = media_files_list[rels_dict[run_content]]
                image_type, image_prefix = process_image_info( image_name )
                image_base64_content = image_prefix * image_content
                push!( attachments, (image_name, image_type, image_base64_content) )
                plain_text = ""
            end
        # The fragment is a hyperlink
        elseif run_name == "hyperlink"
            (hlink_name, hlink_target) = run_content
            if isnothing(hlink_target) plain_text = plain_text * hlink_name;
            else plain_text = plain_text * "[" * hlink_name * "](" * hlink_target * ")"; end;
        # The fragment is a code expression (monospaced font)
        elseif !isnothing(runProperty_node) && !isnothing(findfirst("w:rFonts", runProperty_node, ["w"=>ns])) && findfirst("w:rFonts", runProperty_node, ["w"=>ns])["w:cs"] == "monospace"
            plain_text = plain_text * "`" * run_content * "`";
        else
            # The cell contains plain text
            plain_text = plain_text * run_content;
        end
    end
    return (cell_type, plain_text, attachments)
end;
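To make the output of this function more concrete, the sketch below assembles one Markdown line from the same kinds of fragments: a heading prefix, a hyperlink, inline code and an inline formula. The helper names are hypothetical and `https://example.com` is a placeholder address.

```julia
# Hypothetical helpers showing the Markdown produced for each fragment type
markdown_link(text, target) = isnothing(target) ? text : "[" * text * "](" * target * ")"
markdown_code(text) = "`" * text * "`"
markdown_math(text) = "\$" * text * "\$"

# A second-level heading that mixes all three fragment types
line = "## " * "See " * markdown_link("the docs", "https://example.com") *
       ", call " * markdown_code("plot(x)") * " for " * markdown_math("y = x^2")
println(line)
```

A link without a target degrades to plain text, which matches the behaviour of the hyperlink branch above.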
Check function for examining input files¶
You can check how many cells, media files, and other pieces of information have been read from each input .mlx file.
for mlx_filename in get_list_of_files( "$(@__DIR__)/input" )
    (mlx_file, rels_file, media_list) = get_mlx_content( mlx_filename )
    println( )
    println( "* File ", mlx_filename, " contains:" )
    cell_list, ns = file_to_cells_list( mlx_file )
    println( length([c for c in cell_list if c[1] != "code"]), " text cells")
    println( length([c for c in cell_list if c[1] == "code"]), " code cells")
    rels_dict = read_rels_file( rels_file )
    if length(keys(rels_dict)) > 0
        print( length(keys(rels_dict)), " references to external files" );
        println( " (of which ", length([trg for (ref,trg) in rels_dict if occursin("../media", trg)]), " to illustrations)")
    end;
end
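The counts in this loop are computed by building a temporary array and taking its length. An equivalent, slightly more direct formulation uses `count` with a predicate. The sketch below uses a hand-written stand-in for the list returned by `file_to_cells_list`; the variable `sample_cells` is hypothetical.

```julia
# Stand-in for the list returned by file_to_cells_list: (style, content) pairs
sample_cells = [("title", []), ("text", []), ("code", []), ("code", [])]

# count applies the predicate to every element and counts the `true` results
n_code = count(cell -> cell[1] == "code", sample_cells)
n_text = length(sample_cells) - n_code
println(n_text, " text cells, ", n_code, " code cells")
```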
Creating a template for ngscript summary files¶
Now we need to prepare an .ngscript file into which we will put the cells of our document. The Engee script format is backwards compatible, although changes to it do occur from time to time. To have a reasonably up-to-date template for the ngscript document, we will use the very document that is open in front of you, the script mlx_to_ngscript_parser.ngscript.
Let's load this ngscript file and derive three templates from it:
- document template,
- text cell template,
- code cell template,
which we will fill with information as we process the input mlx files.
doc_template = JSON.parsefile( "$(@__DIR__)/mlx_to_ngscript_parser.ngscript" );
code_cell_template = [c for c in doc_template["cells"] if c["cell_type"]=="code" ][1];
text_cell_template = [c for c in doc_template["cells"] if c["cell_type"]=="markdown" ][1];
text_cell_template["isParagraph"] = false; # Adjustment that produces a "regular" paragraph rather than a heading
doc_template["cells"] = [];
Let's see what templates we have received:
doc_template
text_cell_template
code_cell_template
These three templates are enough to generate a new .ngscript file for each .mlx file in the input directory.
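Because the same templates are reused for every new cell, each cell has to be an independent copy. The sketch below shows why `deepcopy` is the safer choice for nested dictionaries: a plain `copy` shares the inner arrays with the template. The dictionary here is a simplified stand-in, not the real structure of an .ngscript cell.

```julia
# Simplified stand-in for a cell template; the real one comes from the parsed .ngscript file
template = Dict("cell_type" => "markdown", "source" => [""], "attachments" => Dict())

shallow = copy(template)         # shares the inner "source" array with the template
shallow["source"][1] = "oops"
println(template["source"][1])   # the template itself was modified

template["source"][1] = ""       # restore the template
deep = deepcopy(template)        # fully independent copy
deep["source"][1] = "safe"
println(template["source"][1] == "")
```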
Filling the JSON template¶
In the final function of this script, we will do the following:
- Go through all mlx files in the directory
- Process the contents of each file
- For each one, create an ngscript document from the template
- Add text and code cells to it one by one, not forgetting the graphic attachments.
In addition, we will create everything necessary for the MATLAB script to run in the Engee environment:
- We will add a call to the required library at the beginning of the file,
- For all cells that produce plots (for example, with plot and scatter), we will add commands to save the plot to file storage and display it using the Images.jl library, so that the plot appears in the report.
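The wrapping of a code cell can be sketched in isolation as a small function that builds the text of the generated cell. The helper name `wrap_matlab_cell` is hypothetical. Note that the detection is a plain substring search, so it also fires on names such as `subplot`.

```julia
# Hypothetical helper: wrap MATLAB source in a mat"""...""" block and, for plotting
# code, append commands that save the figure and load it back for display.
function wrap_matlab_cell(matlab_code::AbstractString, plot_index::Integer)
    draws_plot = occursin("plot", matlab_code) || occursin("scatter", matlab_code)
    if !draws_plot
        return "mat\"\"\"\n" * matlab_code * "\n\"\"\""
    end
    image_file = "plot_$(plot_index).png"
    return "mat\"\"\"\n" * matlab_code * "\nsaveas(gcf,'" * image_file * "')\n\"\"\"\n" *
           "using Images; load(\"" * image_file * "\")"
end

println(wrap_matlab_cell("plot(1:10)", 0))
```

Passing a different `plot_index` for every plotting cell keeps the saved images from overwriting one another.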
for mlx_filename in get_list_of_files( "$(@__DIR__)/input" )
    # Find out everything we need about the next mlx file
    (mlx_file, rels_file, media_list) = get_mlx_content( mlx_filename )
    cell_list, ns = file_to_cells_list( mlx_file )
    rels_dict = read_rels_file( rels_file )
    # Create a template for the new document and add a cell with initialisation code
    ngscript_doc = deepcopy(doc_template);
    new_cell = deepcopy( code_cell_template );
    new_cell["source"][1] = "using MATLAB\nmat\"cd('\$(@__DIR__)')\"";
    push!( ngscript_doc["cells"], new_cell ); # Add the cell to the document
    # Count plots per document so that every saved image gets its own file name
    plot_counter = 0
    for cell in cell_list
        cell_type, plain_text, attachments = xml_text_cell_to_plain_text( cell, ns, rels_dict, media_list )
        if cell_type == "code"
            # Move the MATLAB code into a cell and wrap it in mat"""..."""
            new_cell = deepcopy( code_cell_template )
            new_cell["source"][1] = "mat\"\"\"\n" * plain_text * "\n\"\"\"";
            # If the cell contains the substring plot or scatter, add
            # MATLAB instructions that save the plot
            # and Engee instructions that display it in the report
            if occursin("plot", plain_text) || occursin("scatter", plain_text)
                new_cell["source"][1] = "mat\"\"\"\n" * plain_text * "\nsaveas(gcf,'plot_$(plot_counter).png')\n\"\"\"\nusing Images; load(\"plot_$(plot_counter).png\")";
                plot_counter = plot_counter + 1;
            end;
            push!( ngscript_doc["cells"], new_cell );
        else
            # Add a text cell
            new_cell = deepcopy( text_cell_template )
            new_cell["source"][1] = plain_text
            # Add the attachments to the cell
            if length(attachments) != 0
                for (image_name, image_type, image_base64_content) in attachments
                    new_cell["attachments"][image_name] = Dict()
                    new_cell["attachments"][image_name][image_type] = image_base64_content
                end
            end
            push!( ngscript_doc["cells"], new_cell )
        end
    end
    # Save the script under a new name
    new_script_name = replace( mlx_filename, ".mlx" => ".ngscript" )
    new_script_name = replace( new_script_name, "input" => "output")
    # Make sure the output directory exists
    mkpath( dirname(new_script_name) )
    # Serialise the document and save the ngscript file
    stringdata = JSON.json( ngscript_doc )
    open(new_script_name, "w") do f write(f, stringdata); end;
end
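The output name in this loop is derived with `replace`, which substitutes every occurrence of "input", including one inside the file name itself. If that is a concern, the path can be assembled from its parts instead. This is a sketch under the assumption that all converted files go into a single folder; the helper `output_script_path` and the file name `input_signals.mlx` are hypothetical.

```julia
# Hypothetical helper: place the converted file in `output_folder`, keeping the base name
function output_script_path(mlx_filename::AbstractString, output_folder::AbstractString)
    base_name = first(splitext(basename(mlx_filename)))   # file name without extension
    return joinpath(output_folder, base_name * ".ngscript")
end

source_path = joinpath("input", "input_signals.mlx")
println(output_script_path(source_path, "output"))

# For comparison: replace changes the file name as well as the folder
println(replace(source_path, "input" => "output"))
```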
Conclusion¶
In this demonstration, we converted a file from MLX/OPC (based on ZIP/XML) to ngscript (based on JSON). The result is a tool that transfers some of the information from a set of LiveScript documents of the MATLAB environment to .ngscript documents of the Engee platform.
From the individual functions of this example, you can learn about opening files, working with XML data, and generating JSON-based documents. This example also demonstrates the strong technical computing capabilities of Engee, available in conjunction with model-driven design tools.