Converting rtf to xml in C#

Continuing the series of posts on converting text files to xml with C#, I suggest moving on to rtf files.


You would expect that for a format this old and this common there would be a library that converts all the data to xml with a single method call, or at least some solution from Microsoft along the lines of OpenXML. If there were one, this article would not have been written.


So what is a Rich Text Format (rtf) file? By and large, the contents of the file are already structured and even resemble a mixture of json, xml and xpath. This is easy to verify by saving a Word document in rtf format and then opening it in a text editor such as Notepad++.



The piece of the file worth noting is the one that carries the encoding information: in general, the control words are encoded in ANSI, and the text itself is encoded with code page 1251 (ansicpg1251).
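To make the encoding remark concrete, here is a minimal sketch of how text in such a file could be decoded by hand. It assumes that the non-ASCII text is stored as \'xx hex escapes and that the code page is announced by the \ansicpg control word. The sample rtf string, the helper names DetectCodePage and DecodeHexEscapes, and the fallback to code page 1252 are illustrative assumptions, not something taken from the article. It also assumes .NET Core 3.0 or later, where CodePagesEncodingProvider is available.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class RtfEncodingDemo
{
    // Hypothetical helper: reads the number after the \ansicpg control word.
    // Falling back to 1252 when the control word is missing is an assumption.
    public static int DetectCodePage(string rtf)
    {
        Match match = Regex.Match(rtf, @"\\ansicpg(\d+)");
        return match.Success
            ? int.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture)
            : 1252;
    }

    // Hypothetical helper: replaces each run of \'xx escapes with the
    // characters those bytes represent in the given encoding.
    public static string DecodeHexEscapes(string rtf, Encoding encoding)
    {
        return Regex.Replace(rtf, @"(?:\\'[0-9a-fA-F]{2})+", m =>
        {
            string hex = m.Value.Replace("\\'", "");
            byte[] bytes = new byte[hex.Length / 2];
            for (int i = 0; i < bytes.Length; i++)
            {
                bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
            }
            return encoding.GetString(bytes);
        });
    }

    public static void Main()
    {
        // Makes legacy code pages such as 1251 available on .NET Core
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Made-up sample: six bytes of code page 1251 text
        string rtf = @"{\rtf1\ansi\ansicpg1251 \'cf\'f0\'e8\'e2\'e5\'f2}";

        int codePage = DetectCodePage(rtf);
        string decoded = DecodeHexEscapes(rtf, Encoding.GetEncoding(codePage));

        Console.WriteLine(codePage);
        Console.WriteLine(decoded);
    }
}
```

This only handles the hex escapes. A real rtf file also contains font tables, groups and Unicode escapes, which is why a dedicated library is the more practical route.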





The rtf could also be read with a RichTextBox, but that brings in a dependency on Windows.Forms.

So, install the RtfPipe package from nuget.


RtfPipe converts rtf to html. The html can then be processed with HtmlAgilityPack.




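using System.IO;
using System.Text;
using RtfPipe;

public string Convert(Stream stream)
{
    // Rewind the stream and read the whole rtf document into a string
    stream.Position = 0;
    string rtf;
    using (StreamReader sr = new StreamReader(stream))
    {
        rtf = sr.ReadToEnd();
    }
    // Register the code page encodings that RtfPipe needs on .Net Core
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    // Convert the rtf to html with RtfPipe
    var html = Rtf.ToHtml(rtf);
    // Clean up the html and wrap it in xml
    return ClearHtml(html);
}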

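  1. The stream is read into a string, which is then passed to RtfPipe.
  2. On .Net Core, Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); has to be called for RtfPipe to work. On .Net Framework this line is not needed.
  3. var html = Rtf.ToHtml(rtf); converts the rtf into html.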


The string ClearHtml(string html) method cleans up the html and turns it into xml:


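using System.IO;
using System.Text;
using HtmlAgilityPack;

string ClearHtml(string html)
{
    // Load the html into an HtmlAgilityPack document
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    // Select the elements that carry a style attribute
    // (SelectNodes returns null when nothing matches)
    var elementsWithStyleAttribute = doc.DocumentNode.SelectNodes("//*[@style]");
    // Select the formatting tags that are not needed in the xml
    var excessNodes = doc.DocumentNode.SelectNodes("//b|//u|//strong|//br");
    // Replace each unneeded tag with its text, leaving sibling nodes intact
    if (excessNodes != null)
    {
        foreach (var element in excessNodes)
        {
            element.ParentNode.ReplaceChild(doc.CreateTextNode(element.InnerText), element);
        }
    }
    // Drop the style attributes
    if (elementsWithStyleAttribute != null)
    {
        foreach (var element in elementsWithStyleAttribute)
        {
            element.Attributes["style"].Remove();
        }
    }
    // Serialize the cleaned html back to a string
    using (StringWriter writer = new StringWriter())
    {
        doc.Save(writer);
        html = writer.ToString();
    }
    // Wrap the html in xml, closing the elements in the reverse order of opening
    StringBuilder xml = new StringBuilder();
    xml.Append("<?xml version=\"1.0\"?><documents><document>");
    xml.Append(html);
    xml.Append("</document></documents>");
    return xml.ToString();
}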

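  1. The HtmlAgilityPack package is installed from nuget.
  2. The html is loaded into doc, an HtmlDocument.
  3. The unneeded tags and the style attributes are removed.
  4. Finally, the xml is assembled with a StringBuilder.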
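
Since the xml here is assembled by string concatenation, it may be worth checking that the result is actually well-formed before passing it on. Below is a minimal sketch using XDocument.Parse from System.Xml.Linq. The helper name IsWellFormedXml and the sample strings are hypothetical and are not part of the original code.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlCheckDemo
{
    // Hypothetical helper: returns true when the string parses as
    // well-formed xml, otherwise reports the parser's error message.
    public static bool IsWellFormedXml(string xml, out string error)
    {
        try
        {
            XDocument.Parse(xml);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        // Wrapper elements closed in the reverse order of opening
        string good = "<?xml version=\"1.0\"?><documents><document><p>text</p></document></documents>";
        // Wrapper elements closed in the wrong order
        string bad = "<?xml version=\"1.0\"?><documents><document><p>text</p></documents></document>";

        Console.WriteLine(IsWellFormedXml(good, out _));
        Console.WriteLine(IsWellFormedXml(bad, out string error));
        Console.WriteLine(error);
    }
}
```

A check like this would also flag html leftovers that are not valid xml, such as an unclosed br tag or an &nbsp; entity, so it can serve as a cheap guard at the end of the conversion.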



