Quick calculation of Excel formulas in C#

How often do you hear from customers that they will send the data in Excel, or get asked to import or export data in an Excel-compatible format? I am sure that in most fields Excel remains one of the most popular, powerful, and at the same time simple and convenient tools. But the most problematic point is always the integration of such data with various automated systems. Our team was asked to explore the possibility of performing data calculations using the settings from a customer-supplied Excel file.


If you need to choose a high-performance library for working with Excel files, or if you are looking for a solution for calculating complex financial (and not only financial) data with a convenient out-of-the-box tool for managing and visualizing formulas, read on.

While studying the requirements of a new project in the company, we found one very interesting point: “The developed solution should be able to calculate the cost of the product (namely, a balcony) when the configuration changes and instantly display the new cost in the interface. It must be possible to upload an Excel file with all the calculation functions and component price lists.” The client needed a portal for designing balcony configurations depending on the sizes, shapes, types of glazing and materials used, types of fastenings, and many other related parameters that are used to calculate the exact production cost and reliability indicators.

Let us formalize the input requirements to make the context of the problem clearer:

  • Fast recalculation of prices in the interface when the balcony parameters change;
  • Fast calculation of design data covering many different balcony configurations and individual offers, supplied in a separate calculation file;
  • In the long term, the development of functionality with various resource-intensive operations (parameter-optimization tasks, etc.);
  • All of this runs on a remote server exposed via an API, because all formulas are the intellectual property of the customer and must not be visible to third parties;
  • A peak-loaded input data stream: the user can change parameters frequently and quickly to fit the product configuration to their requirements.

It all sounds very unusual and super tempting, so let's get started!

Turnkey solutions and data preparation


Any research into a complex problem begins with surfing StackOverflow, GitHub, and numerous forums in search of ready-made solutions.

We selected several ready-made solutions that support reading data from Excel files and can perform calculations based on the formulas specified inside the document. Among these libraries there are both completely free solutions and commercial products.


The next step is to write load tests and measure the runtime of each library. To do this, we prepare test data: we create a new Excel file, define 10 cells for the input parameters and one (remember this moment) formula that produces a result using all the input parameters. In the formula, we try to use as many mathematical functions of varying computational complexity as possible and combine them in tricky ways.
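
Preparing such a file can itself be scripted. Below is a minimal sketch using the EPPlus API; the sheet layout matches the description above, while the file name is illustrative:

using System.IO;
using OfficeOpenXml;

// A sketch: build the test workbook programmatically with EPPlus.
// Column B holds the 10 input parameters, B11 holds the single test formula.
// Note: EPPlus 5 additionally requires setting ExcelPackage.LicenseContext.
using (var package = new ExcelPackage())
{
    var worksheet = package.Workbook.Worksheets.Add("Test");
    for (int row = 1; row <= 10; row++)
    {
        worksheet.Cells[row, 2].Value = 0.0; // placeholder input parameter
    }

    // The Excel equivalent of the translated native formula shown below,
    // with B1..B10 standing in for p[0]..p[9]
    worksheet.Cells[11, 2].Formula =
        "POWER(B1*B9/B5*SIN(B6)*COS(B3)+ABS(B2-SUM(B3:B9))*SQRT(B1*B1+B2*B2)/2*PI(),B10)";

    package.SaveAs(new FileInfo("FormulaTest.xlsx"));
}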

Because this is also about money (the cost of the product), it is important to look at the accuracy of the results. In this case, we take the values obtained via Excel Interop as the reference, because data obtained this way is calculated by the Excel engine itself and equals the values the customers see when developing the formulas and calculating the cost manually.

As the reference runtime, we use native code written by hand in pure C# that reproduces the same formula written in Excel.

The translated initial test formula:

public double Execute(double[] p)
{
    return Math.Pow(p[0] * p[8] / p[4] * Math.Sin(p[5]) * Math.Cos(p[2]) +
                          Math.Abs(p[1] - (p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + p[8]))
                          * Math.Sqrt(p[0] * p[0] + p[1] * p[1]) / 2.0 * Math.PI, p[9]);
}

We form a stream of random input data for N iterations (in this case, we use 10,000 vectors).

We run the calculations for each vector of input parameters over the whole stream, collect the results, and measure the library initialization time and the overall calculation time.

To compare the accuracy of the results, we use two indicators: the standard deviation and the percentage of values matching within a certain accuracy step epsilon. If you randomly generate a list of floating-point input values, you must fix the precision of the input parameters - this is what allows you to obtain correct results. Otherwise, the random numbers can differ by many orders of magnitude, which can greatly affect the accuracy of the results and distort the error estimate.
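
As a sketch, both indicators can be computed as follows (the method and variable names are illustrative):

using System;
using System.Linq;

// A sketch of the two accuracy indicators, assuming `reference` holds the
// Excel Interop results and `actual` holds one library's results.
static void CompareAccuracy(double[] reference, double[] actual, double epsilon,
    out double stdDeviation, out double matchPercent)
{
    // Per-vector differences between the library and the reference
    var deltas = reference.Zip(actual, (r, a) => r - a).ToArray();

    // Standard deviation of the differences
    var mean = deltas.Average();
    stdDeviation = Math.Sqrt(deltas.Sum(d => (d - mean) * (d - mean)) / deltas.Length);

    // Percentage of results matching the reference within epsilon
    matchPercent = 100.0 * deltas.Count(d => Math.Abs(d) < epsilon) / deltas.Length;
}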

Because we initially assume that we will be operating with constant values for the cost of materials, as well as some constants from various fields of knowledge, we can accept that all input parameters have values accurate to 3 decimal places. Ideally, you would specify a valid range of values for each parameter and generate random values only within those ranges, but since the test formula was composed arbitrarily, without any mathematical or physical justification, calculating such ranges for all 10 parameters in a reasonable amount of time is not possible. Therefore, the calculations occasionally produce a computation error. We exclude such input vectors from the accuracy indicators at calculation time, but we count these errors as a separate characteristic.
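
Under these assumptions, generating the input stream might look like this sketch (the 0..100 value range and the fixed seed are illustrative):

using System;

// A sketch: generate the stream of N random input vectors, each value
// rounded to 3 decimal places.
static double[][] GenerateInputVectors(int count, int parameterCount, Random random)
{
    var vectors = new double[count][];
    for (int i = 0; i < count; i++)
    {
        vectors[i] = new double[parameterCount];
        for (int j = 0; j < parameterCount; j++)
        {
            // Fixed precision keeps all inputs at a comparable order of magnitude
            vectors[i][j] = Math.Round(random.NextDouble() * 100.0, 3);
        }
    }
    return vectors;
}

// Usage: 10,000 vectors of 10 parameters each
var inputs = GenerateInputVectors(10000, 10, new Random(42));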

Test Architecture 


For each library, a separate class was created that implements the ITestExecutor interface, which includes 3 methods: SetUp, Execute, and TearDown.

public interface ITestExecutor
{
    // Prepare the library for work: open the file, allocate resources
    void SetUp();
    // Perform the calculation for a single vector of input parameters
    double Execute(double[] p);
    // Release all resources once the calculations are finished
    void TearDown();
}

The SetUp and TearDown methods are used only once during the testing of a library and are not included when measuring calculation time over the entire input data set.
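
The measuring loop built on top of this interface might look like the following sketch (the Stopwatch-based timing is an assumption about the harness, not code taken from it):

using System;
using System.Diagnostics;

// A sketch of the per-library measurement loop
static double[] RunTest(ITestExecutor executor, double[][] inputs, out TimeSpan elapsed)
{
    var results = new double[inputs.Length];

    executor.SetUp(); // initialization is timed separately and excluded here

    var stopwatch = Stopwatch.StartNew();
    for (int i = 0; i < inputs.Length; i++)
    {
        results[i] = executor.Execute(inputs[i]);
    }
    stopwatch.Stop();
    elapsed = stopwatch.Elapsed;

    executor.TearDown();
    return results;
}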

As a result, the testing algorithm boiled down to the following:

  • Data preparation (we form a stream of input vectors of a given precision);
  • Allocation of memory for the resulting data (an array of results for each library, an array with error counts);
  • Initialization of the libraries;
  • Obtaining calculation results for each library, saving the result into the pre-allocated array and recording the execution time;
  • Completion of work with the libraries;
  • Analysis of the data.

A graphical representation of this algorithm is presented below.

Flow chart of performance and accuracy testing of Excel support libraries


Results of the first iteration


Index                    | Native   | EPPlus 4 & 5 | NPOI     | Spire    | Excel Interop
Initialization time (ms) | 0        | 257          | 266      | 632      | 1653
Avg time per pass (ms)   | 0.0002   | 0.4086       | 0.6847   | 6.9782   | 38.8423
Avg deviation            | 0.000394 | 0.000395     | 0.000237 | 0.000631 | n/a
Accuracy                 | 99.99%   | 99.92%       | 99.97%   | 99.84%   | n/a
Errors                   | 0.0%     | 1.94%        | 1.94%    | 1.52%    | 1.94%

Why are the results of EPPlus 5 and EPPlus 4 combined?
In this test, EPPlus 4 and EPPlus 5 showed practically identical initialization times, calculation times, and accuracy, so their results are reported in a single column of the table.

The first iteration of tests showed good results, and the leaders among the libraries became visible immediately. As can be seen from the data above, the absolute leader in this situation is the EPPlus library.

The results are not impressive compared to the native code, but one can live with them.

We could have stopped there, but after talking with the customers, the first doubts crept in: the results looked too good.


Features of working with the Spire library
On some input vectors, Spire threw an InvalidCastException while evaluating the Excel formula. To keep the test running over the whole input stream, the calculation call was wrapped in try...catch, and such cases were counted as errors.
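
A sketch of how such failures can be counted without aborting the whole run, reusing the harness variables (executor, inputs, results) from the measuring loop above:

// A sketch: count library failures instead of aborting the whole stream
int errorCount = 0;
for (int i = 0; i < inputs.Length; i++)
{
    try
    {
        results[i] = executor.Execute(inputs[i]);
    }
    catch (InvalidCastException)
    {
        // Vectors that fail inside the library are excluded from the
        // accuracy indicators and counted as a separate characteristic
        results[i] = double.NaN;
        errorCount++;
    }
}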

Pitfalls of the first test iteration


Above, I asked you to pay attention to one point that played an important role in the results. We suspected that the formulas in the customer's real Excel file would be far from primitive, with many dependencies inside, but we did not suspect how strongly this could affect the indicators. Nevertheless, when compiling the test data I did not initially account for this, and the information from the customer (at least the fact that no fewer than 120 input parameters are used in the final file) hinted that we needed to think again and add formulas with dependencies between cells.

We start preparing new data for the next testing iteration. We keep the 10 input parameters and add 4 new formulas that depend only on the parameters, plus 1 aggregating formula that is based on those four formula cells and also relies on the input values.

The new formula that will be used for subsequent tests:

public double Execute(double[] p)
{
    var price1 = Math.Pow(p[0] * p[8] / p[4] * Math.Sin(p[5]) * Math.Cos(p[2]) +
                          Math.Abs(p[1] - (p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + p[8]))
                          * Math.Sqrt(p[0] * p[0] + p[1] * p[1]) / 2.0 * Math.PI, p[9]);

    var price2 = p[4] * p[5] * p[2] / Math.Max(1, Math.Abs(p[7]));

    var price3 = Math.Abs(p[7] - p[3]) * p[2];

    var price4 = Math.Sqrt(Math.Abs(p[1] * p[2] + p[3] * p[4] + p[5] * p[6]) + 1.0);

    var price5 = p[0] * Math.Cos(p[1]) + p[2] * Math.Sin(p[1]);

    var sum = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + p[8] + p[9];

    var price6 = sum / Math.Max(1, Math.Abs((p[0] + p[1] + p[2] + p[3]) / 4.0))
                 + sum / Math.Max(1, Math.Abs((p[4] + p[5] + p[6] + p[7] + p[8] + p[9]) / 6.0));

    var pricingAverage = (price1 + price2 + price3 + price4 + price5 + price6) / 6.0;

    return pricingAverage / Math.Max(1, Math.Abs(price1 + price2 + price3 + price4 + price5 + price6));
}

As you can see, the resulting formula turned out to be much more complicated, which naturally affected the results - and not for the better. The table with the results is presented below:

Index                    | Native   | EPPlus 4 & 5   | NPOI           | Spire         | Excel Interop
Initialization time (ms) | 0        | 241            | 368            | 722           | 1640
Avg time per pass (ms)   | 0.0004   | 0.9174 (+124%) | 1.8996 (+177%) | 7.7647 (+11%) | 50.7194 (+30%)
Avg deviation            | 0.035884 | 0.000000       | 0.000000       | 0.000000      | n/a
Accuracy                 | 98.79%   | 100.00%        | 100.00%        | 100.00%       | n/a
Errors                   | 0.0%     | 0.3%           | 0.3%           | 0.28%         | 0.3%


Note: because the Excel Interop times are too large, they had to be excluded from the chart.

As can be seen from the results, the situation has become completely unsuitable for production use. Slightly saddened, we stock up on coffee and dive deeper into the research - straight into code generation.


Code Generation


If you have never faced a similar task, here is a brief excursion.

Code generation is a way of solving a problem by dynamically generating source code based on input data, with subsequent compilation and execution. There is static code generation, which happens while the project is being built (as an example, T4MVC, which creates new code based on templates and metadata that can then be used when writing the main application code), and dynamic code generation, which happens at runtime.
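
To make the dynamic variant concrete, here is a minimal sketch using the classic CodeDOM compiler API (Roslyn would work just as well; the class and method names are illustrative):

using System;
using System.CodeDom.Compiler;
using Microsoft.CSharp;

static class CodeGenDemo
{
    // A minimal sketch of dynamic code generation: compile a string of C#
    // source at runtime and invoke the resulting method via reflection.
    static void Main()
    {
        var source = @"
            public static class Generated
            {
                public static double Execute(double[] p)
                {
                    return p[0] + p[1]; // stand-in for a translated Excel formula
                }
            }";

        using (var provider = new CSharpCodeProvider())
        {
            var parameters = new CompilerParameters { GenerateInMemory = true };
            CompilerResults compiled = provider.CompileAssemblyFromSource(parameters, source);

            var method = compiled.CompiledAssembly.GetType("Generated").GetMethod("Execute");
            var result = (double)method.Invoke(null, new object[] { new[] { 1.0, 2.0 } });
            Console.WriteLine(result); // prints 3
        }
    }
}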

Our task is to form, based on the source data (formulas from Excel), a new function that computes the result from a vector of input values.

To do this, we must:

  • Read the formula from the file;
  • Collect all its dependencies;
  • Translate it into C# code;
  • Compile the code into an assembly;
  • Load the assembly into memory;
  • Execute the resulting function.

All the libraries reviewed are suitable for reading formulas; however, EPPlus turned out to have the most convenient interface for this functionality. Having rummaged a bit through the source code of this library, I discovered public classes for forming a list of tokens and for its further transformation into an expression tree. Bingo, I thought! A ready-made expression tree out of the box is ideal - just walk it and emit our C# code.

But a big catch was waiting for me when I began to study the nodes of the resulting expression tree. Some nodes, in particular calls to Excel functions, encapsulate the information about the function used and its input parameters and do not provide any public access to this data. Therefore, work with the ready-made expression tree had to be put aside.

We go one level lower and try to work with the list of tokens. Everything is quite simple here: we have tokens, and each has a type and a value. Since we are given a function and need to produce a function, we can convert the tokens to their C# equivalents instead. The main thing in this approach is to provide compatible functions. Most mathematical functions are already compatible - cosine, sine, square root, raising to a power. But the aggregation functions - maximum, minimum, sum - needed extra work. The main difference is that in Excel these functions work with a range of values. For simplicity of the prototype, we make functions that take a list of parameters as input, having expanded the range of values into a flat list beforehand. This way we get a correct and compatible conversion from Excel syntax to C# syntax.
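
What might such a compatibility layer look like? Only the ExcelCompiledFunctions class name appears later in this article; the member signatures below are an assumption:

using System;
using System.Linq;

// A sketch of the compatibility layer; only the class name is taken from
// the article, the member signatures are assumptions.
public static class ExcelCompiledFunctions
{
    // Excel aggregate functions work over ranges; the translator expands a
    // range into a flat argument list, so params arrays are sufficient here.
    public static double Max(params double[] values) => values.Max();
    public static double Min(params double[] values) => values.Min();
    public static double Sum(params double[] values) => values.Sum();

    // Excel constants are functions (e.g. PI()), so they are wrapped too.
    public static double Pi() => Math.PI;
}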

Below is the main code for converting the list of tokens of an Excel formula into valid C# code.

private string TransformToSharpCode(string formula, ParsingContext parsingContext)
{
    // Initialize basic compile components, e.g. lexer
    var lexer = new Lexer(parsingContext.Configuration.FunctionRepository, parsingContext.NameValueProvider);

    // List of dependency variables that can be filled during formula transformation
    var variables = new Dictionary<string, string>();
    using (var scope = parsingContext.Scopes.NewScope(RangeAddress.Empty))
    {
        // Take resulting code line
        var compiledResultCode = TransformToSharpCode(formula, parsingContext, scope, lexer, variables);

        var output = new StringBuilder();

        // Define dependency variables in reverse order.
        foreach (var variableDefinition in variables.Reverse())
        {
            output.AppendLine($"var {variableDefinition.Key} = {variableDefinition.Value};");
        }

        // Take the result
        output.AppendLine($"return {compiledResultCode};");

        return output.ToString();
    }
}

// Assumed overload: the calls above and below pass a formula string, so the
// string is first tokenized via EPPlus's ILexer, and the work is delegated
// to the token-based version.
private string TransformToSharpCode(string formula, ParsingContext parsingContext, ParsingScope scope, ILexer lexer, Dictionary<string, string> variables)
{
    var tokens = lexer.Tokenize(formula).ToList();
    return TransformToSharpCode(tokens, parsingContext, scope, lexer, variables);
}

private string TransformToSharpCode(ICollection<Token> tokens, ParsingContext parsingContext, ParsingScope scope, ILexer lexer, Dictionary<string, string> variables)
{
    var output = new StringBuilder();

    foreach (Token token in tokens)
    {
        switch (token.TokenType)
        {
            case TokenType.Function:
                output.Append(BuildFunctionName(token.Value));
                break;
            case TokenType.OpeningParenthesis:
                output.Append("(");
                break;
            case TokenType.ClosingParenthesis:
                output.Append(")");
                break;
            case TokenType.Comma:
                output.Append(", ");
                break;
            case TokenType.ExcelAddress:
                var address = token.Value;
                output.Append(TransformAddressToSharpCode(address, parsingContext, scope, lexer, variables));
                break;
            case TokenType.Decimal:
            case TokenType.Integer:
            case TokenType.Boolean:
                output.Append(token.Value);
                break;
            case TokenType.Operator:
                output.Append(token.Value);
                break;
        }
    }

    return output.ToString();
}

The next nuance of the conversion was the use of Excel constants - in Excel they are functions, so in C# they also have to be wrapped in functions (such as the Pi() wrapper in the sketch above).

It remains to solve only one question: converting cell references into parameters. When a token contains information about a cell, we first determine what is stored in that cell. If it is a formula, we expand it recursively. If it is a constant, we replace it with a C# analog of the reference, of the form p[row, column], where p can be either a two-dimensional array or an indexed-access class for correct data mapping. With a range of cells we do the same, just pre-expanding the range into individual cells and processing them separately. Thus, we cover the main functionality for translating an Excel function.


Below is the code for converting a reference to an Excel spreadsheet cell into valid C# code:

private string TransformAddressToSharpCode(string excelAddress, ParsingContext parsingContext, ParsingScope scope, ILexer lexer, Dictionary<string, string> variables)
{
    // Try to parse excel range of addresses
    // Currently, reference to another worksheet in address string is not supported

    var rangeParts = excelAddress.Split(':');
    if (rangeParts.Length == 1)
    {
        // Unpack single excel address
        return UnpackExcelAddress(excelAddress, parsingContext, scope, lexer, variables);
    }

    // Parse excel range address
    ParseAddressToIndexes(rangeParts[0], out int startRowIndex, out int startColumnIndex);
    ParseAddressToIndexes(rangeParts[1], out int endRowIndex, out int endColumnIndex);

    var rowDelta = endRowIndex - startRowIndex;
    var columnDelta = endColumnIndex - startColumnIndex;

    var allAccessors = new List<string>(Math.Abs(rowDelta * columnDelta));

    // TODO This part of code doesn't support reverse-ordered range address
    for (var rowIndex = startRowIndex; rowIndex <= endRowIndex; rowIndex++)
    {
        for (var columnIndex = startColumnIndex; columnIndex <= endColumnIndex; columnIndex++)
        {
            // Unpack single excel address
            allAccessors.Add(UnpackExcelAddress(rowIndex, columnIndex, parsingContext, scope, lexer, variables));
        }
    }

    return string.Join(", ", allAccessors);
}

private string UnpackExcelAddress(string excelAddress, ParsingContext parsingContext, ParsingScope scope, ILexer lexer, Dictionary<string, string> variables)
{
    ParseAddressToIndexes(excelAddress, out int rowIndex, out int columnIndex);
    return UnpackExcelAddress(rowIndex, columnIndex, parsingContext, scope, lexer, variables);
}

private string UnpackExcelAddress(int rowIndex, int columnIndex, ParsingContext parsingContext, ParsingScope scope, ILexer lexer, Dictionary<string, string> variables)
{
    var formula = parsingContext.ExcelDataProvider.GetRangeFormula(_worksheet.Name, rowIndex, columnIndex);
    if (string.IsNullOrWhiteSpace(formula))
    {
        // When excel address doesn't contain information about any excel formula, we should just use external input data parameter provider.
        return $"p[{rowIndex},{columnIndex}]";
    }

    // When a formula is provided, check whether a variable for it has already been defined
    // TODO Worksheet name is not included in the variable name; potentially that can cause conflicts
    // Extracting and reusing calculations via local variables improves performance by ~0.0045 ms
    var cellVariableId = $"C{rowIndex}R{columnIndex}";
    if (variables.ContainsKey(cellVariableId))
    {
        return cellVariableId;
    }

    // When variable does not exist, transform new formula and register that to variable scope
    variables.Add(cellVariableId, TransformToSharpCode(formula, parsingContext, scope, lexer, variables));

    return cellVariableId;
}

It remains only to wrap the resulting converted code into a static function, reference the assembly with the compatibility functions, and compile a dynamic assembly. We load it into memory, obtain a reference to our function - and it is ready to use.

We write a wrapper class for testing and run tests with time measurement. 

public void SetUp()
{
    // Initialize excel package by EPPlus library
    _package = new ExcelPackage(new FileInfo(_fileName));
    _workbook = _package.Workbook;
    _worksheet = _workbook.Worksheets[1];

    _inputRange = new ExcelRange[10];
    for (int rowIndex = 0; rowIndex < 10; rowIndex++)
    {
        _inputRange[rowIndex] = _worksheet.Cells[rowIndex + 1, 2];
    }

    // Access to result cell and extract formula string
    _resultRange = _worksheet.Cells[11, 2];

    var formula = _resultRange.Formula;

    // Initialize parsing context and setup data provider
    var parsingContext = ParsingContext.Create();
    parsingContext.ExcelDataProvider = new EpplusExcelDataProvider(_package);

    // Transform Excel formula to CSharp code
    var sourceCode = TransformToSharpCode(formula, parsingContext);

    // Compile CSharp code to IL dynamic assembly via helper wrappers
    _code = CodeGenerator.CreateCode<double>(
        sourceCode,
        new string[]
        {
            // List of used namespaces
            "System", // Required for Math functions
            "ExcelCalculations.PerformanceTests" // Required for Excel function wrappers stored at ExcelCompiledFunctions static class
        },
        new string[]
        {
            // Add reference to current compiled assembly, that is required to use Excel function wrappers located at ExcelCompiledFunctions static class
            "....\\bin\\Debug\\ExcelCalculations.PerformanceTests.exe"
        },
        // Declare the formula's single input parameter:
        // the abstract parameter p is an interface for accessing the input values.
        new CodeParameter("p", typeof(IExcelValueProvider))
    );
}

As a result, we have a prototype of this solution, which we mark as EPPlusCompiled, Mark-I. After running the tests, we get the long-awaited result: an acceleration of almost 300 times. Not bad already, but the resulting code is still 16 times slower than the native one. Can we do better?

Yes, we can! Let's try to improve the result by replacing references to dependent formula cells with variables. Our test formula uses dependent cells multiple times, so the first version of the translator calculated the same data repeatedly. Therefore, it was decided to use intermediate variables in the calculations. After extending the code generator with dependent-variable generation, we gained another 2x performance increase. This improvement is called EPPlusCompiled, Mark-II. The comparison table is presented below:

Library                 | Avg time (ms) | Slowdown factor
Native                  | 0.0004        | 1
EPPlusCompiled, Mark-II | 0.003         | 8
EPPlusCompiled, Mark-I  | 0.0061        | 16
EPPlus                  | 1.2089        | 3023
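
To make the Mark-II improvement concrete, here is an illustration (not actual generator output) of the same translated fragment before and after introducing intermediate variables, assuming IExcelValueProvider exposes a two-dimensional indexer in the p[row, column] style described above:

// Mark-I style: the formula of a dependent cell is inlined at every
// reference, so the same subexpression is evaluated twice.
public static double ExecuteMarkI(IExcelValueProvider p)
{
    return ExcelCompiledFunctions.Max(p[1, 2] * p[2, 2], p[1, 2] * p[2, 2] + p[3, 2]);
}

// Mark-II style: each dependent cell becomes a local variable computed once
// and reused, which halved the calculation time in our tests.
public static double ExecuteMarkII(IExcelValueProvider p)
{
    var C1R2 = p[1, 2] * p[2, 2];
    return ExcelCompiledFunctions.Max(C1R2, C1R2 + p[3, 2]);
}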

Under these conditions and the time limits allotted for the task, we got a result that approaches the performance of the native code with a small lag of 8 times, versus the original version's lag of several orders of magnitude (3023 times). But is it possible to improve the result further and get as close as possible to the native code if the time limits are removed, and how worthwhile would that be?

My answer is yes, but unfortunately I no longer had time to implement these techniques. Perhaps I will devote a separate article to this topic. At the moment, I can only describe the main ideas and options, writing them down as short theses that have been verified by reverse transformation. By reverse transformation I mean degrading the hand-written native code toward the generated code. This approach allows you to check certain theses quickly enough and does not require significant changes in the code. It also answers the question of how much the native code's performance degrades under certain conditions, which means that if the generated code is automatically improved in the opposite direction, we can expect a performance gain with a similar coefficient.
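
For example, one can degrade the native reference function by replacing direct array access with access through an abstraction, as the generated code does, and compare the timings of the two variants (a sketch; no overhead numbers are claimed here):

// Native reference: direct array access
public double Execute(double[] p)
{
    return p[0] * p[1];
}

// "Degraded" variant: the same formula, but reading inputs through an
// interface, as the generated code does; comparing the two timings shows
// how much the abstraction alone costs.
public double ExecuteDegraded(IExcelValueProvider p)
{
    return p[1, 2] * p[2, 2];
}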

Theses


  1. Replacing the abstract value-access class with a plain array brings the generated code closer to native performance;
  2. Functions that are not inlined cost measurably more than inline operations;
  3. The compatibility wrappers for Sum, Max, and Min are noticeably slower than direct operators;
  4. Expanding Sum calls inline removes this overhead;
  5. Precomputing (constant) cell values at generation time saves runtime work.


Index                    | Native   | EPPlusCompiled, Mark-II | EPPlus 4 & 5 | NPOI   | Spire  | Excel Interop
Initialization time (ms) | 0        | 239                     | 241          | 368    | 722    | 1640
Avg time per pass (ms)   | 0.0004   | 0.003                   | 0.9174       | 1.8996 | 7.7647 | 50.7194
Avg deviation            | 0.035884 | 0.0                     | 0.0          | 0.0    | 0.0    | n/a
Accuracy                 | 98.79%   | 100.0%                  | 100.0%       | 100.0% | 100.0% | n/a
Errors                   | 0.0%     | 0.0%                    | 0.3%         | 0.3%   | 0.28%  | 0.3%


Summing up, we now have a mechanism for converting formulas directly from a customer's Excel document into working code on the server. It gives us the incredible flexibility of integrating Excel with any business solution, without losing the powerful user interface that a huge number of users are accustomed to. Could we have developed an equally convenient interface, with the same set of data analysis and visualization tools as Excel, within such a short development period?

And what are the most unusual and interesting integrations with Excel documents that you have had to implement?
