Converting text documents to XML in C#

Recently I needed to extract text from office documents (docx, xlsx, rtf, doc, xls, odt and ods). The task was complicated by the requirement to present the text as clean xml, free of garbage and structured as conveniently as possible for further parsing.


Interop was ruled out immediately: it is cumbersome, largely redundant for this task, and requires MS Office to be installed on the server. In the end a solution was found and implemented in an internal project. The search, however, turned out to be so difficult and non-trivial, for lack of any publicly available guides, that I decided to write a library in my spare time to solve this task, and also to write a kind of guide so that developers who read it can get at least a superficial understanding of the problem.


Before describing the solution, I suggest looking at some conclusions I reached in the course of my research:


  1. There is no ready-made solution for the .NET platform that handles all of the listed formats, so in places we will have to resort to workarounds of our own.
  2. Do not expect to find a good guide to Microsoft OpenXML online: to get to grips with this library you will have to strain your eyes, pore over StackOverflow and play with the debugger.
  3. Yes, I did manage to tame the dragon.

I should say up front that the library is not finished yet, but it is being actively written (as far as free time allows). The plan is to write a separate post for each format and, alongside each post, to update the GitHub repository from which the sources can be obtained.


Working with xlsx and docx


.xlsx


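It is no secret that docx and xlsx documents are essentially zip archives with xml files inside. To see this, change the file extension to zip and open the archive. The contents of the worksheets are located at \xl\worksheets.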
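The same check can be done from code instead of renaming the file by hand. The sketch below is only an illustration and uses nothing but System.IO.Compression; the class and method names (PackageInspector, ListWorksheetEntries) are made up for this example and are not part of the library described here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class PackageInspector
{
    // Returns the names of the worksheet xml files stored inside an xlsx package.
    // Inside the archive the folder \xl\worksheets appears as "xl/worksheets/".
    public static List<string> ListWorksheetEntries(Stream xlsx)
    {
        using (var archive = new ZipArchive(xlsx, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries
                .Select(e => e.FullName)
                .Where(n => n.StartsWith("xl/worksheets/", StringComparison.OrdinalIgnoreCase)
                         && n.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                .ToList();
        }
    }
}
```

Calling PackageInspector.ListWorksheetEntries on a stream opened over an xlsx file should return one entry per sheet; relationship files such as xl/worksheets/_rels/sheet1.xml.rels are filtered out because they do not end in .xml.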


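If we open one of the sheet files of an Excel document, we can see that a cell may contain both a formula (the <f> tag) and a value (the <v> tag). String values, however, are shared and stored separately in the file sharedStrings.xml, which is located in \xl.
In other words, for a string the cell itself holds only a reference to the shared string.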
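To make the relationship between <f>, <v> and sharedStrings.xml concrete, here is a small sketch that reads the raw xml with System.Xml.Linq, without the OpenXML SDK. It is not the approach used in the rest of the article, and RawSheetReader with its two methods is a hypothetical helper written only for this illustration. It assumes that a cell marked t="s" stores a zero-based index into the shared strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RawSheetReader
{
    // Namespace of SpreadsheetML elements such as <c>, <f>, <v> and <si>.
    static readonly XNamespace S = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Reads the text of every <si> item of sharedStrings.xml, in document order.
    public static List<string> ReadSharedStrings(string sharedStringsXml)
    {
        return XDocument.Parse(sharedStringsXml).Root
            .Elements(S + "si")
            // Rich text is split across several <t> elements, so join them.
            .Select(si => string.Concat(si.Descendants(S + "t").Select(t => t.Value)))
            .ToList();
    }

    // Returns the value of every cell of a sheet, resolving shared strings.
    public static List<string> ReadCellValues(string sheetXml, IReadOnlyList<string> shared)
    {
        var result = new List<string>();
        foreach (XElement c in XDocument.Parse(sheetXml).Descendants(S + "c"))
        {
            // <v> holds the value; for a formula cell it is the cached result.
            string v = (string)c.Element(S + "v") ?? string.Empty;
            // t="s" marks a cell whose <v> is an index into the shared strings.
            bool isShared = (string)c.Attribute("t") == "s";
            result.Add(isShared ? shared[int.Parse(v)] : v);
        }
        return result;
    }
}
```

The formula text in <f> is ignored here on purpose: for text extraction only the value in <v> matters.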


First of all, we define the interface IConvertable:


using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace ConverterToXml.Converters
{
    interface IConvertable
    {
        string Convert(Stream stream);
        string ConvertByFile(String path);

    }
}

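It declares two methods: string Convert(Stream stream) (for when the document arrives as a stream rather than as a file on disk) and string ConvertByFile(String path), which takes the path to a file.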
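The article does not show an implementation of ConvertByFile, so the following is only a guess at how the path-based method could delegate to the stream-based one. FileConversion is a hypothetical helper; the convert delegate stands in for a converter's Convert(Stream) method.

```csharp
using System;
using System.IO;

public static class FileConversion
{
    // Opens the file read-only and passes the stream to the supplied converter.
    public static string ConvertByFile(string path, Func<Stream, string> convert)
    {
        if (!File.Exists(path))
            throw new FileNotFoundException("Document not found.", path);

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            return convert(fs);
        }
    }
}
```

Inside a class that implements IConvertable, the body of ConvertByFile could then be as short as return FileConversion.ConvertByFile(path, Convert);.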


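Next, create the class XlsxToXml, which implements IConvertable, and add the NuGet package DocumentFormat.OpenXml (at the time of writing, version 2.10.0).


The main work happens in the method string SpreadsheetProcess(Stream memStream), which is called from string Convert(Stream stream).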


        public string Convert(Stream memStream)
        {
            return SpreadsheetProcess(memStream);
        }

Now let us look at string SpreadsheetProcess(Stream memStream) itself:


// Requires: using System.Linq; using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Spreadsheet;
string SpreadsheetProcess(Stream memStream)
{
    // Rewind before opening so the package is read from the start of the stream.
    memStream.Position = 0;
    using (SpreadsheetDocument doc = SpreadsheetDocument.Open(memStream, false))
    {
        StringBuilder sb = new StringBuilder(1000);
        sb.Append("<?xml version=\"1.0\"?><documents><document>");
        // The shared string part may be absent if the workbook has no text cells.
        SharedStringTable sharedStringTable = doc.WorkbookPart.SharedStringTablePart?.SharedStringTable;
        int sheetIndex = 0;
        foreach (WorksheetPart worksheetPart in doc.WorkbookPart.WorksheetParts)
        {
            WorkSheetProcess(sb, sharedStringTable, worksheetPart, doc, sheetIndex);
            sheetIndex++;
        }
        sb.Append("</document></documents>");
        return sb.ToString();
    }
}

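So, string SpreadsheetProcess(Stream memStream) does the following:


  1. In the using block, the Excel document is opened. In DocumentFormat.OpenXml, xlsx files are represented by the SpreadsheetDocument class.


  2. A StringBuilder sb is created with an initial capacity of 1000 characters. StringBuilder grows on its own when needed, but since the output will clearly be long, it makes sense to reserve some space right away.


  3. The table of shared strings is obtained. It is reached from the SpreadsheetDocument as follows:
    SharedStringTable sharedStringTable = doc.WorkbookPart.SharedStringTablePart.SharedStringTable.


  4. Then each sheet of the workbook is processed in a loop: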


    foreach (WorksheetPart worksheetPart in doc.WorkbookPart.WorksheetParts)
    {
        WorkSheetProcess(sb, sharedStringTable, worksheetPart, doc, sheetIndex);
        sheetIndex++;
    }


    Each sheet is handled by WorkSheetProcess(sb, sharedStringTable, worksheetPart, doc, sheetIndex);:


    private void WorkSheetProcess(StringBuilder sb, SharedStringTable sharedStringTable, WorksheetPart worksheetPart, SpreadsheetDocument doc,
        int sheetIndex)
    {
        string sheetName = doc.WorkbookPart.Workbook.Descendants<Sheet>().ElementAt(sheetIndex).Name.ToString();
        sb.Append($"<sheet name=\"{sheetName}\">");
        foreach (SheetData sheetData in worksheetPart.Worksheet.Elements<SheetData>())
        {
            if (sheetData.HasChildren)
            {
                foreach (Row row in sheetData.Elements<Row>())
                {
                    RowProcess(row, sb, sharedStringTable);
                }
            }
        }
        sb.Append("</sheet>");
    }

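  5. First, the sheet name is obtained:
    string sheetName = doc.WorkbookPart.Workbook.Descendants<Sheet>().ElementAt(sheetIndex).Name.ToString();
    This line was found with the debugger rather than in any guide. In the debugger, open QuickWatch with Shift+F9, go from doc to WorkbookPart -> Workbook and call Descendants(); among the results are the Sheet elements, each of which holds the name of a sheet.


  6. A foreach loop then goes through the sheet data. If sheetData has child elements, each row is passed to RowProcess: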


    foreach (SheetData sheetData in worksheetPart.Worksheet.Elements<SheetData>())
    {
        if (sheetData.HasChildren)
        {
            foreach (Row row in sheetData.Elements<Row>())
            {
                RowProcess(row, sb, sharedStringTable);
            }
        }
    }

  7. The method void RowProcess(Row row, StringBuilder sb, SharedStringTable sharedStringTable) looks like this:


    void RowProcess(Row row, StringBuilder sb, SharedStringTable sharedStringTable)
    {
        sb.Append("<row>");
        foreach (Cell cell in row.Elements<Cell>())
        {
            string cellValue = string.Empty;
            sb.Append("<cell>");
            if (cell.CellFormula != null)
            {
                // Formula cell: take the cached result, which may be missing
                // if the formula has never been calculated.
                cellValue = cell.CellValue?.InnerText ?? string.Empty;
                sb.Append(cellValue);
                sb.Append("</cell>");
                continue;
            }
            cellValue = cell.InnerText;
            if (cell.DataType != null && cell.DataType == CellValues.SharedString)
            {
                // The cell stores an index into the shared string table.
                sb.Append(sharedStringTable.ElementAt(Int32.Parse(cellValue)).InnerText);
            }
            else
            {
                sb.Append(cellValue);
            }
            sb.Append("</cell>");
        }
        sb.Append("</row>");
    }

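    In the loop foreach (Cell cell in row.Elements<Cell>()), each cell is first checked for a formula:


    if (cell.CellFormula != null)
    {
        cellValue = cell.CellValue?.InnerText ?? string.Empty;
        sb.Append(cellValue);
        sb.Append("</cell>");
        continue;
    }

    If the cell contains a formula, we take its computed value (cellValue = cell.CellValue?.InnerText ?? string.Empty;) and move on to the next cell.
    If there is no formula, we check whether the value is a shared string; if it is, we fetch it from the shared string table by its index:


    if (cell.DataType != null && cell.DataType == CellValues.SharedString)
    {
        sb.Append(sharedStringTable.ElementAt(Int32.Parse(cellValue)).InnerText);
    }

    Otherwise, the cell text is written out as is.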
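One caveat about step 5: taking the sheet name by index assumes that WorksheetParts are enumerated in the same order as the Sheet elements of the workbook, and as far as I know that is not guaranteed. The link between a sheet and its part is the relationship id (the r:id attribute of the sheet element). In the SDK this should be expressible by comparing doc.WorkbookPart.GetIdOfPart(worksheetPart) with Sheet.Id. The sketch below shows the same idea on the raw xml of workbook.xml and workbook.xml.rels, so that it does not depend on the SDK; SheetNameResolver is a hypothetical name used only here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SheetNameResolver
{
    static readonly XNamespace Main = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";
    static readonly XNamespace Rel = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
    static readonly XNamespace Pkg = "http://schemas.openxmlformats.org/package/2006/relationships";

    // Maps each worksheet part (path relative to \xl) to the sheet name shown in Excel.
    public static Dictionary<string, string> MapPartsToNames(string workbookXml, string workbookRelsXml)
    {
        // Relationship id -> target part, for example "rId1" -> "worksheets/sheet1.xml".
        Dictionary<string, string> targets = XDocument.Parse(workbookRelsXml)
            .Descendants(Pkg + "Relationship")
            .ToDictionary(r => (string)r.Attribute("Id"), r => (string)r.Attribute("Target"));

        var result = new Dictionary<string, string>();
        foreach (XElement sheet in XDocument.Parse(workbookXml).Descendants(Main + "sheet"))
        {
            // The r:id attribute ties the <sheet> element to its worksheet part.
            string relId = (string)sheet.Attribute(Rel + "id");
            if (relId != null && targets.TryGetValue(relId, out string target))
                result[target] = (string)sheet.Attribute("name");
        }
        return result;
    }
}
```

With this mapping the name no longer depends on the position of the sheet in either collection.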




.docx


Now let us move from Excel documents to Word documents.


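As before, change the extension to zip and look inside the archive. In the word folder there is a file named document, and that is the one we need: it holds the contents of the document.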


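The text itself sits in w:t tags, which are nested in runs (w:r), which in turn are nested in paragraphs (w:p). Lists need a separate mention: a list item is a paragraph containing a w:numPr element, which stores the nesting level (w:ilvl) and the id of the list the item belongs to (w:numId).

Items of the same list share the same list id, which is what lets us tell one list from another.

Tables are simpler. They are made of rows, w:tr, and cells, w:tc.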
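To see these tags in action without the SDK, the following sketch walks the body of a document.xml with System.Xml.Linq and reports what it finds. It is an illustration of the structure only, not the converter from this article; RawDocxReader and Describe are hypothetical names, and the output format is invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RawDocxReader
{
    // Namespace bound to the w: prefix in WordprocessingML.
    static readonly XNamespace W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Describes every body-level paragraph and table row as one line of text.
    public static List<string> Describe(string documentXml)
    {
        var lines = new List<string>();
        XElement body = XDocument.Parse(documentXml).Root.Element(W + "body");
        foreach (XElement el in body.Elements())
        {
            if (el.Name == W + "p")
            {
                // A paragraph's text is spread over the w:t elements of its runs.
                string text = string.Concat(el.Descendants(W + "t").Select(t => t.Value));
                XElement numPr = el.Element(W + "pPr")?.Element(W + "numPr");
                if (numPr == null)
                {
                    lines.Add($"paragraph: {text}");
                }
                else
                {
                    // w:ilvl is the nesting level, w:numId the id of the list.
                    string level = (string)numPr.Element(W + "ilvl")?.Attribute(W + "val");
                    string listId = (string)numPr.Element(W + "numId")?.Attribute(W + "val");
                    lines.Add($"list {listId} level {level}: {text}");
                }
            }
            else if (el.Name == W + "tbl")
            {
                foreach (XElement tr in el.Elements(W + "tr"))
                {
                    IEnumerable<string> cells = tr.Elements(W + "tc")
                        .Select(tc => string.Concat(tc.Descendants(W + "t").Select(t => t.Value)));
                    lines.Add("row: " + string.Join(" | ", cells));
                }
            }
        }
        return lines;
    }
}
```

Note that both the element names and the val attributes live in the w: namespace, which is why every lookup goes through W.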


Before starting to code, I want to draw attention to one very important nuance (yes, as in the joke about Petka and Vasily Ivanovich). When parsing lists, especially nested ones, the items of a list may be separated by some inserted text, an image or anything else. The question then is when to emit the closing tag of the list. My suggestion, which admittedly smells of hacks and reinventing the wheel, is to add a dictionary whose keys are list ids and whose values are the id of the paragraph that comes last in that list (yes, it turns out that every paragraph in a document has its own unique id). That may sound complicated, but I think it becomes clearer once you look at the implementation:
// CurrentListID (int) and InList (bool) are fields of the converter class.
public string Convert(Stream memStream)
{
    // Key: list id; value: id of the last paragraph belonging to that list.
    Dictionary<int, string> listEl = new Dictionary<int, string>();
    string xml = string.Empty;
    memStream.Position = 0;
    using (WordprocessingDocument doc = WordprocessingDocument.Open(memStream, false))
    {
        StringBuilder sb = new StringBuilder(1000);
        sb.Append("<?xml version=\"1.0\"?><documents><document>");
        Body docBody = doc.MainDocumentPart.Document.Body;
        CreateDictList(listEl, docBody);
        foreach (var element in docBody.ChildElements)
        {
            try
            {
                switch (element)
                {
                    case Paragraph p:
                        // Only list items have a numbering id. The null-conditional calls keep
                        // paragraphs that have other properties from throwing and being dropped.
                        int? listId = p.GetFirstChild<ParagraphProperties>()
                            ?.GetFirstChild<NumberingProperties>()
                            ?.GetFirstChild<NumberingId>()?.Val?.Value;
                        if (listId == null)
                        {
                            SimpleParagraph(sb, p);
                            break;
                        }
                        if (listId.Value != CurrentListID)
                        {
                            // First item of a new list: open the list element.
                            CurrentListID = listId.Value;
                            sb.Append($"<li id=\"{CurrentListID}\">");
                            InList = true;
                        }
                        ListParagraph(sb, p);
                        // Close the list after its last paragraph.
                        // An XML end tag cannot carry attributes.
                        string paragraphId = p.ParagraphId?.Value;
                        if (paragraphId != null && listEl.ContainsValue(paragraphId))
                        {
                            sb.Append("</li>");
                        }
                        break;
                    case Table t:
                        Table(sb, t);
                        break;
                }
            }
            catch (Exception)
            {
                // Skip an element that cannot be processed and carry on with the rest.
                continue;
            }
        }
        sb.Append("</document></documents>");
        xml = sb.ToString();
    }
    return xml;
}

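  1. Dictionary<int, string> listEl = new Dictionary<int, string>(); is the dictionary that records, for each list, its last paragraph.


  2. using (WordprocessingDocument doc = WordprocessingDocument.Open(memStream, false)) creates doc, an instance of the WordprocessingDocument class, which represents the Word document (in OpenXML terms).


  3. StringBuilder sb = new StringBuilder(1000); is where the resulting xml is accumulated.


  4. Body docBody = doc.MainDocumentPart.Document.Body; gets the body of the document, which is what we will work with.


  5. CreateDictList(listEl, docBody); walks the elements of the body in a foreach loop and fills the dictionary: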


    void CreateDictList(Dictionary<int, string> listEl, Body docBody)
    {
        foreach (var el in docBody.ChildElements)
        {
            // Only list paragraphs carry a numbering id; everything else yields null.
            int? listId = el.GetFirstChild<ParagraphProperties>()
                ?.GetFirstChild<NumberingProperties>()
                ?.GetFirstChild<NumberingId>()?.Val?.Value;
            string paragraphId = (el as Paragraph)?.ParagraphId?.Value;
            if (listId.HasValue && paragraphId != null)
            {
                // Later paragraphs overwrite earlier ones, so the last one wins.
                listEl[listId.Value] = paragraphId;
            }
        }
    }

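    The chain GetFirstChild<ParagraphProperties>().GetFirstChild<NumberingProperties>().GetFirstChild<NumberingId>().Val; returns the id of the list the paragraph belongs to (see the Open XML SDK documentation: https://docs.microsoft.com/ru-ru/office/open-xml/open-xml-sdk).


  6. With the dictionary filled, the elements of the body are processed one after another in a foreach loop. For each element we determine what it is: a list paragraph, an ordinary paragraph or a table, and handle it accordingly: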


    try
    {
        switch (element)
        {
            case Paragraph p:
                int? listId = p.GetFirstChild<ParagraphProperties>()
                    ?.GetFirstChild<NumberingProperties>()
                    ?.GetFirstChild<NumberingId>()?.Val?.Value;
                if (listId == null) // ordinary paragraph, not a list item
                {
                    SimpleParagraph(sb, p);
                    break;
                }
                if (listId.Value != CurrentListID) // first item of a new list
                {
                    CurrentListID = listId.Value;
                    sb.Append($"<li id=\"{CurrentListID}\">");
                    InList = true;
                }
                ListParagraph(sb, p);
                string paragraphId = p.ParagraphId?.Value;
                if (paragraphId != null && listEl.ContainsValue(paragraphId)) // last paragraph of a list
                {
                    sb.Append("</li>");
                }
                break;
            case Table t:
                Table(sb, t);
                break;
        }
    }

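    The try-catch block is there because the document may contain an element that the switch-case does not handle properly. If processing an element throws, that element is skipped and the loop continues.


  7. A paragraph that is a list item is handled by ListParagraph(sb, (Paragraph)element);: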


    void ListParagraph(StringBuilder sb, Paragraph p)
    {
        NumberingProperties numPr = p.GetFirstChild<ParagraphProperties>().GetFirstChild<NumberingProperties>();
        // nesting level of the item
        var level = numPr.GetFirstChild<NumberingLevelReference>()?.Val;
        // id of the list
        var id = numPr.GetFirstChild<NumberingId>().Val;
        // An XML end tag cannot carry attributes, so only the start tag has them.
        sb.Append($"<ul id=\"{id}\" level=\"{level}\"><p>{p.InnerText}</p></ul>");
    }

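    The paragraph is wrapped in a <ul> tag, with the list id and the nesting level as attributes.


  8. A paragraph that is not a list item is handled by SimpleParagraph(sb, (Paragraph)element);: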


    void SimpleParagraph(StringBuilder sb, Paragraph p)
    {
        sb.Append($"<p>{p.InnerText}</p>");
    }

    Here the paragraph text is simply wrapped in a <p> tag.


  9. The table is processed in the method Table(sb, (Table)element);:


    void Table(StringBuilder sb, Table table)
    {
        sb.Append("<table>");
        foreach (var row in table.Elements<TableRow>())
        {
            sb.Append("<row>");
            foreach (var cell in row.Elements<TableCell>())
            {
                sb.Append($"<cell>{cell.InnerText}</cell>");
            }
            sb.Append("</row>");
        }
        sb.Append("</table>");
    }

    Processing such an element is quite trivial: we read the rows, split them into cells, take the values from the cells, wrap them in <cell> tags, pack those into <row> tags and put all of it inside <table>.
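One thing the methods above do not do is escape the text they insert. If a cell or a paragraph contains a character such as < or &, the resulting string will no longer be well-formed xml. A possible remedy, shown here only as a sketch, is to run the text through SecurityElement.Escape from the standard library before appending it. XmlText.AppendElement is a hypothetical helper and is not part of the code in this article.

```csharp
using System;
using System.Security;
using System.Text;

public static class XmlText
{
    // Appends <tag>text</tag>, replacing the characters <, >, &, " and ' in the
    // text with the corresponding xml entities. The tag name itself is not checked.
    public static void AppendElement(StringBuilder sb, string tag, string text)
    {
        sb.Append('<').Append(tag).Append('>');
        sb.Append(SecurityElement.Escape(text ?? string.Empty));
        sb.Append("</").Append(tag).Append('>');
    }
}
```

With such a helper, a line like sb.Append($"<cell>{cell.InnerText}</cell>"); could become XmlText.AppendElement(sb, "cell", cell.InnerText);. Attribute values, such as the sheet name, would need the same treatment.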



With that, I propose we consider the task solved for documents in docx and xlsx format.


The source code can be viewed in the repository at the link


Rtf conversion article

