An Introduction to the Lena Data Analysis Architectural Framework

Hello, Habr! I will talk about the architectural framework that I am developing.


Architecture determines the most general structure of a program and the interaction of its components. As a framework, Lena implements a specific architecture for data analysis (more about it below) and provides the user with classes and functions that are useful within that architecture.


Lena is written in the popular Python language and works with Python 2, Python 3 and PyPy. It is published under the free Apache License, version 2. It is still being developed, but the features described in this manual are already in use, are tested (total coverage of the framework is about 90%) and are unlikely to change. Lena arose from the analysis of experimental data in neutrino physics and is named after the great Siberian river.



Architecture issues arise, as a rule, in large and medium-sized projects. If you are thinking about using this framework, then here is a brief overview of its tasks and advantages.


From a programming point of view:


  • modularity, loose coupling. Algorithms can be easily added, replaced or reused.
  • ( ). . PyPy " ".
  • . . .
  • . , . . .
  • , .

, Python, , .


:


  • ( ).
  • (, , ).
  • . , , .
  • .

(tutorial) – Lena. , , , , . . .




Lena





. , , . .


, . Lena , , . , , .



Lena




Lena — . .


Lena . , :


>>> from __future__ import print_function
>>> from lena.core import Sequence
>>> s = Sequence(
...     lambda i: pow(-1, i) * (2 * i + 1),
... )
>>> results = s.run([0, 1, 2, 3])
>>> for res in results:
...     print(res)
1
-3
5
-7

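Lena supports both Python 2 and 3, so the first line imports the print function to make the example work in either version. The other lines work in both versions without changes.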


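A Sequence can be created from several elements. To make it do the actual work, call its run method. The argument of run is an iterable (here, a list of four numbers).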


To obtain all the results, we iterate over them in a for loop.
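To make the idea of chaining elements concrete, here is a minimal sketch in plain Python, without Lena. The helper run_sequence is a hypothetical name introduced only for this illustration; it is not how Lena's Sequence is implemented, it only shows values passing through callables one after another.

```python
def run_sequence(elements, flow):
    """Pass every value of *flow* through each callable, one after another.

    A hypothetical helper for illustration; not part of Lena.
    """
    for element in elements:
        # map() applies the element to each value of the current flow
        flow = map(element, flow)
    return flow


results = run_sequence(
    [lambda i: pow(-1, i) * (2 * i + 1)],
    [0, 1, 2, 3],
)
print(list(results))  # [1, -3, 5, -7]
```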


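Let us move to a more complex example. It is often convenient not to pass any data into a function, but to let it obtain the data from somewhere itself. In this case, use a Source: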


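# Sum is a user-defined element; its class is given later in this tutorial
# and must be defined before this code is run.
from lena.core import Sequence, Source
from lena.flow import CountFrom, ISlice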

s = Sequence(
    lambda i: pow(-1, i) * (2 * i + 1),
)
spi = Source(
    CountFrom(0),
    s,
    ISlice(10**6),
    lambda x: 4./x,
    Sum(),
)
results = list(spi())
# [3.1415916535897743]

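The first element of a Source must have a __call__ method that accepts no arguments and generates values itself. These values are passed along the sequence: each following element receives the results of the previous one, and calling the sequence gives the results of the last element.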
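For comparison, the same computation can be sketched with the standard library alone. This is not Lena code: the function name approximate_pi is hypothetical, and itertools.count and itertools.islice stand in for CountFrom and ISlice. A smaller number of terms is used so that the example runs quickly.

```python
import itertools
import math


def approximate_pi(n_terms):
    """Sum the first *n_terms* terms of the series 4 - 4/3 + 4/5 - ..."""
    numbers = itertools.count(0)                       # plays the role of CountFrom(0)
    odd = (pow(-1, i) * (2 * i + 1) for i in numbers)  # the lambda from s
    first = itertools.islice(odd, n_terms)             # plays the role of ISlice(n_terms)
    terms = (4. / x for x in first)                    # the second lambda
    return sum(terms)                                  # plays the role of Sum()


approx = approximate_pi(10**5)
print(abs(approx - math.pi) < 1e-4)  # True
```

Every stage is a generator, so no intermediate list of terms is ever built.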


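CountFrom is an element that produces an infinite series of numbers. Elements must be functions or objects, but not classes¹. The starting number is passed to CountFrom during its initialization (here, zero). The initialization arguments of CountFrom are start (zero by default) and step (1 by default).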


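The following elements of a Source (if there are any) must be callables or objects with a run method. Together they can form a Sequence.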


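Sequences can be joined together. In our example, the previously defined sequence s is used as the second element of the Source. Nothing would change if we used the lambda from s instead of s itself.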


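A Sequence can be placed before, after or inside another Sequence. A Sequence cannot be placed before a Source, because a Source does not accept an incoming flow.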


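Note: if we try to create a Sequence with a Source in the middle, initialization fails immediately with a LenaTypeError (a subtype of Python's TypeError).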

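All Lena exceptions are subclasses of LenaException. They are raised as early as possible (not after a long analysis has been completed and discarded).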

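Since an infinite series cannot be used in practice, it must be stopped at some point. We take its first million items with ISlice. ISlice and CountFrom are similar to the islice and count functions from the itertools module of the Python standard library. ISlice can also be initialized with start, stop[, step] arguments, which allow skipping an initial or final part of the data (defined by index) or taking every step-th item (if step is two, the items with even indices are used).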
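Since ISlice follows the model of itertools.islice, the meaning of start, stop and step can be tried out with the standard library functions. The snippet below uses only itertools; it makes no claim about ISlice beyond the analogy stated above.

```python
from itertools import count, islice

# only a stop value: the first five items of an infinite series
first_five = list(islice(count(0), 5))
print(first_five)    # [0, 1, 2, 3, 4]

# start and stop: skip the items with indices 0 and 1, stop before index 8
window = list(islice(count(0), 2, 8))
print(window)        # [2, 3, 4, 5, 6, 7]

# start, stop and step: every second item, beginning with index 0
even_indices = list(islice(count(0), 0, 10, 2))
print(even_indices)  # [0, 2, 4, 6, 8]
```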


, .


.




Sum is implemented as a simple class with a run method, which accepts the incoming flow:



class Sum():
    def run(self, flow):
        s = 0
        for val in flow:
            s += val
        yield s

Note that the sum is not returned with return, but yielded with yield. The yield keyword turns a Python function into a generator.
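The behaviour of such an element can be checked without Lena, because run is an ordinary generator method. The sketch below repeats the Sum class so that it is self-contained; the input numbers are arbitrary.

```python
class Sum(object):
    """Sum all values of the incoming flow."""

    def run(self, flow):
        s = 0
        for val in flow:
            s += val
        # a single value is yielded after the whole flow is consumed
        yield s


results = Sum().run([1, 2, 3, 4])
# results is a generator; nothing has been summed yet
print(list(results))  # [10]
```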


Python.


>>> results = s.run([0, 1, 2, 3])

Sequence run . , , , . , . ( ) :


>>> for res in results:
...     print(res)

:


  • . . , , , . , .
  • . -. , , .

Python yield. Lena. run, . , , , , - .


(yield) . (flow) . , (value).
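A small experiment, independent of Lena, shows that the body of a generator does not run until a value is requested. The names calls and square are hypothetical and exist only in this example.

```python
calls = []  # records which values have actually been processed


def square(flow):
    """Yield squares of incoming values, logging each processed value."""
    for val in flow:
        calls.append(val)
        yield val * val


results = square([1, 2, 3])
print(calls)         # []  (creating the generator computes nothing)
first = next(results)
print(first, calls)  # 1 [1]
rest = list(results)
print(rest, calls)   # [4, 9] [1, 2, 3]
```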




Lena . — , .


Lena , . Jinja . Lena , . LaTeX:


% histogram_1d.tex
\documentclass{standalone}
\usepackage{tikz}
\usepackage{pgfplots}
\pgfplotsset{compat=1.15}

\begin{document}
\begin{tikzpicture}
\begin{axis}[]
\addplot [
    const plot,
]
table [col sep=comma, header=false] {\VAR{ output.filepath }};
\end{axis}
\end{tikzpicture}
\end{document}

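This is an ordinary TikZ template with one exception: the line with \VAR{ output.filepath }. During rendering, \VAR{ var } is replaced with the value of var. This allows one template to be used for many plots instead of creating an almost identical file for each of them. In this example, the variable output.filepath is passed in the rendering context.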


:


\BLOCK{ set var = variable if variable else '' }
\begin{tikzpicture}
\begin{axis}[
    \BLOCK{ if var.latex_name }
        xlabel = { $\VAR{ var.latex_name }$
        \BLOCK{ if var.unit }
            [$\mathrm{\VAR{ var.unit }}$]
        \BLOCK{ endif }
        },
    \BLOCK{ endif }
]
...

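If the context contains a variable, it is called var for brevity. If it has a latex_name and a unit, they are used in the label of the x axis. For example, the label could become x [m] or E [keV]. If no name or unit was provided, the plot is rendered without a label, but also without an error or a crash.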


Jinja offers rich possibilities for programming templates. For details, see the Jinja documentation².


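To use Jinja with LaTeX, Lena slightly changes its default syntax³: blocks and variables are enclosed in \BLOCK and \VAR respectively.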
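The effect of \VAR substitution can be imitated with a few lines of standard Python. This is only a stand-in for illustration and not Jinja: the function render_vars is hypothetical, it handles only dotted names inside \VAR{ }, and the value output/x.csv is an assumed example path.

```python
import re

# matches \VAR{ dotted.name } and captures the dotted name
VAR_PATTERN = re.compile(r"\\VAR\{\s*([\w.]+)\s*\}")


def render_vars(template, context):
    """Replace each \\VAR{ a.b } with context["a"]["b"] (illustration only)."""
    def lookup(match):
        value = context
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)
    return VAR_PATTERN.sub(lookup, template)


line = r"table [col sep=comma, header=false] {\VAR{ output.filepath }};"
context = {"output": {"filepath": "output/x.csv"}}  # assumed example value
print(render_vars(line, context))
# table [col sep=comma, header=false] {output/x.csv};
```

A real template engine also provides conditions and loops, which is what the \BLOCK syntax is for.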


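A context is a simple Python dictionary. The flow in Lena consists of (data, context) pairs. It is usually not called a dataflow, because it contains context as well. As shown earlier, context is not required for Lena sequences. However, it greatly simplifies the creation of plots and carries additional information together with the data. To add context to the flow, simply yield it together with the data, as in the following example: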


class ReadData():
    """Read data from CSV files."""

    def run(self, flow):
        """Read filenames from flow and yield vectors.

        If vector component could not be cast to float,
        *ValueError* is raised.
        """
        for filename in flow:
            with open(filename, "r") as fil:
                for line in fil:
                    vec = [float(coord)
                           for coord in line.split(',')]
                    # (data, context) pair
                    yield (vec, {"data": {"filename": filename}})

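We read the names of the files from the incoming flow and yield coordinate vectors. The file name is added to a nested dictionary called "data" (the name is arbitrary). With this name, the file name can be referred to in templates as data["filename"] or simply data.filename.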
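To see the (data, context) pairs that ReadData produces, the element can be run on a small temporary file. The sketch repeats the class so that it is self-contained; the helper read_sample and the file contents are made up for this illustration.

```python
import os
import tempfile


class ReadData(object):
    """Read data from CSV files."""

    def run(self, flow):
        for filename in flow:
            with open(filename, "r") as fil:
                for line in fil:
                    vec = [float(coord) for coord in line.split(',')]
                    # (data, context) pair
                    yield (vec, {"data": {"filename": filename}})


def read_sample():
    """Write a two-line CSV file and read it back as (data, context) pairs."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "w") as fil:
            fil.write("1.0,2.0,3.0\n-4.5,0,7\n")
        # list() consumes the generator before the file is removed
        return path, list(ReadData().run([path]))
    finally:
        os.remove(path)


path, pairs = read_sample()
for data, context in pairs:
    print(data, context["data"]["filename"] == path)
```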


-, HTML LaTeX , , . , . — , - ( ).


Lena. , .


, , . , , .




. x.


The code for this tutorial is located in docs/examples/tutorial.

main.py


from __future__ import print_function

import os

from lena.core import Sequence, Source
from lena.math import mesh
from lena.output import HistToCSV, Writer, LaTeXToPDF, PDFToPNG
from lena.output import MakeFilename, RenderLaTeX
from lena.structures import Histogram

from read_data import ReadData

def main():
    data_file = os.path.join("..", "data", "normal_3d.csv")
    s = Sequence(
        ReadData(),
        lambda dt: (dt[0][0], dt[1]),
        Histogram(mesh((-10, 10), 10)),
        HistToCSV(),
        MakeFilename("x"),
        Writer("output"),
        RenderLaTeX("histogram_1d.tex"),
        Writer("output"),
        LaTeXToPDF(),
        PDFToPNG(),
    )
    results = s.run([data_file])
    print(list(results))

if __name__ == "__main__":
    main()

, output/, :


$ python main.py
pdflatex -halt-on-error -interaction batchmode -output-directory output output/x.tex
pdftoppm output/x.pdf output/x -png -singlefile
[('output/x.png', {'output': {'filetype': 'png'}, 'data': {'filename': '../data/normal_3d.csv'}, 'histogram': {'ranges': [(-10, 10)], 'dim': 1, 'nbins': [10]}})]

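During the run, LaTeXToPDF calls pdflatex, and PDFToPNG calls pdftoppm. The commands are printed with all their arguments, so if an error occurs during LaTeX rendering, you can run the command manually until the rendered file output/x.tex is fixed (and then fix the template).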


— , (run) . , , ( , ). , ( ) output/x.png.


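Let us look at the sequence in more detail. The sequence s is run on one data file (the list could contain more). Since ReadData produces (data, context) pairs, the lambda leaves the context unchanged and takes the first component of each incoming vector (the vector itself being the first part of the (data, context) pair).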
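The effect of this lambda can be checked on a single pair in plain Python. The vector and the file name below are made-up sample values, and the name select_x is introduced only for this example.

```python
# the same lambda as in main.py, given a name for the example
select_x = lambda dt: (dt[0][0], dt[1])

# a made-up (data, context) pair, shaped like the output of ReadData
pair = ([1.5, -2.0, 0.3], {"data": {"filename": "normal_3d.csv"}})

x, context = select_x(pair)
print(x)                   # 1.5
print(context is pair[1])  # True: the context is passed on unchanged
```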


lambda , . , .


The resulting x components fill a Histogram, which is initialized with edges defined by a mesh from -10 to 10 with 10 bins.
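What such a mesh and a one-dimensional histogram amount to can be sketched in plain Python. This is not Lena's implementation: make_edges and fill_histogram are hypothetical helpers, and the treatment of values outside the range (ignored here, with the upper edge excluded) is an assumption of this sketch.

```python
def make_edges(ranges, nbins):
    """Return nbins + 1 equally spaced bin edges for a (low, high) range."""
    low, high = ranges
    width = (high - low) / float(nbins)
    return [low + i * width for i in range(nbins + 1)]


def fill_histogram(edges, values):
    """Count values in half-open bins [edge_i, edge_i+1); ignore the rest."""
    nbins = len(edges) - 1
    low, high = edges[0], edges[-1]
    width = (high - low) / float(nbins)
    bins = [0] * nbins
    for x in values:
        if low <= x < high:
            # min() guards against rounding at the upper edge
            bins[min(int((x - low) / width), nbins - 1)] += 1
    return bins


edges = make_edges((-10, 10), 10)
print(len(edges))  # 11
print(fill_histogram(edges, [-9.5, 0.0, 0.1, 9.99, 10.0, 25]))
# [1, 0, 0, 0, 0, 2, 0, 0, 0, 1]
```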


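After the histogram has been filled with the complete flow, it is converted to CSV text (comma-separated values). For external programs (such as pdflatex) to use the resulting table, it must be written to a file.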


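MakeFilename adds a file name to the context["output"] dictionary. context.output.filename is the file name without the path and the extension (the extension is set by other elements depending on the data format: first it is a csv table, then it may become a pdf plot, and so on). Since only one file is expected, we simply call it x.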


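Writer writes text data to the file system. It is initialized with the name of the output directory. For a value to be written, its context must contain an "output" subdictionary.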


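Once the csv table has been produced, we render the LaTeX template histogram_1d.tex with that table and the context, and convert the plot to pdf and png. As before, RenderLaTeX produces text, which must be written to the file system before it can be used.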


: , . Lena, .




:


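from lena.context import Context
from lena.core import Sequence
from lena.flow import Cache, End, ISlice, Print
from lena.math import mesh
from lena.output import HistToCSV
from lena.structures import Histogram

from read_data import ReadData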

s = Sequence(
    Print(),
    ReadData(),
    # Print(),
    ISlice(1000),
    lambda val: val[0][0], # data.x
    Histogram(mesh((-10, 10), 10)),
    Context(),
    Cache("x_hist.pkl"),
    # End(),
    HistToCSV(),
    # ...
)

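Print outputs the values that pass through it. If we suspect an error or want to see what exactly happens at a given point, we can put any number of Print elements wherever we want. There is no need to search through other files and add print statements there.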


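ISlice, which we met earlier when approximating pi, limits the flow to the specified number of items. If we are not yet sure that our analysis is correct, we can select only a small amount of data to test it.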


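Context is an element and, at the same time, a subclass of the Python dictionary, so it can be used as a context when formatted output is needed. If a Context object is placed in a sequence, it converts the context part of the flow to its own class, which is printed with indentation (not in one line, as a usual dictionary). This can help when many nested contexts are inspected manually.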


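Cache stores the incoming flow in a file or loads it from there. Inserting this element can save a lot of time if the preceding computations are long. If the file exists, Cache reads the flow from it, and the preceding elements are not run; if the file is missing, Cache fills it with the incoming flow. Cache uses pickle, which can serialize and deserialize most Python objects (except the code of functions). If the flow contains objects that cannot be pickled, they should not pass through Cache. To recompute the data, delete the Cache file.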
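The caching idea can be sketched with pickle and a generator. The class FileCache below is a hypothetical stand-in written for this illustration; it is not Lena's Cache and says nothing about its actual implementation or file format.

```python
import os
import pickle
import shutil
import tempfile


class FileCache(object):
    """Store the incoming flow in a file, or load it if the file exists."""

    def __init__(self, filename):
        self._filename = filename

    def run(self, flow):
        if os.path.exists(self._filename):
            # the incoming flow is never iterated, so the preceding
            # (lazy) computations are not run at all
            with open(self._filename, "rb") as fil:
                values = pickle.load(fil)
        else:
            values = list(flow)
            with open(self._filename, "wb") as fil:
                pickle.dump(values, fil)
        for val in values:
            yield val


computed = []  # records the values that were really computed


def expensive(values):
    for val in values:
        computed.append(val)
        yield val * 2


cache_dir = tempfile.mkdtemp()
cache_file = os.path.join(cache_dir, "doubled.pkl")
first = list(FileCache(cache_file).run(expensive([1, 2, 3])))
second = list(FileCache(cache_file).run(expensive([1, 2, 3])))
shutil.rmtree(cache_dir)

print(first, second)  # [2, 4, 6] [2, 4, 6]
print(computed)       # [1, 2, 3]: the second run used the cache
```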


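End runs all preceding elements and stops the analysis at that point. If we enabled this element in the example, Cache would be filled or read (as it would be without End), but nothing would be passed to HistToCSV and the elements after it. End can be used when we know for sure that the rest of the analysis is incomplete and will fail.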





Lena , . , , . , .


(callable) . , , . , .


. — . , .


. Sequence , . Source Sequence, .


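Sequence: elements have a __call__(value) or run(flow) method (or are callables); usage: s.run(flow)
Source: the first element has a __call__() method (or is callable), the other elements form a Sequence; usage: s()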

, , , . .



  1. End. :


    class End(object):
        """Stop sequence here."""
    
        def run(self, flow):
            """Exhaust all preceding flow and stop iteration."""
            for val in flow:
                pass
            raise StopIteration()

    main.py . ,


    Traceback (most recent call last):
      File "main.py", line 46, in <module>
        main()
      File "main.py", line 42, in main
        results = s.run([data_file])
      File "lena/core/sequence.py", line 70, in run
        flow = elem.run(flow)
      File "main.py", line 24, in run
        raise StopIteration()
    StopIteration

    , , , . , StopIteration . ?


  2. , . , .


  3. Count , . , . ? , .


  4. , .


    " - ",- . " CSV, , , ,… , , code bloat ( )."


    ? ?


  5. ** Sum . , , .


    Sum , ? ? .



Answers to the exercises are given at the end of the manual.


Footnotes


1. This feature may be added in the future.
2. Jinja documentation
3. Using Jinja for LaTeX layout was proposed here and here; the syntax of the templates was taken from the original article.


Alternatives


Ruffus is a computational pipeline for Python used in science and bioinformatics. It connects program components through writing and reading files.

