Continuing the series of posts on converting text files to XML with C#, I propose moving on to RTF files.
This format is old and very common, so you would expect either a library that converts all the data to XML with a single method call or, at the very least, a solution from Microsoft comparable to OpenXML. If one existed, though, this article would not have been written.
So what is a Rich Text Format (RTF) file? By and large, its contents are already structured and even resemble a mixture of JSON, XML and XPath. This is easy to verify by saving a Word document in RTF format and then opening it in a text editor such as Notepad++:

I highlighted in red the piece of the file that holds the encoding information: in general, the control words are written in ANSI (\ansi), while the text itself is encoded with code page 1251 (\ansicpg1251). The text itself looks like this:

, , , , … , , , , php.
, , , , , , , . - .
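As a small illustration of the header described above, the sketch below pulls the declared code page out of an RTF string. The class and method names (`RtfHeader`, `GetAnsiCodePage`) are hypothetical and not part of any library; the sketch only assumes that the header carries an `\ansicpgN` control word, as in the file shown earlier.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class RtfHeader
{
    // Returns the code page declared by the \ansicpgN control word,
    // or null when the header does not contain one.
    public static int? GetAnsiCodePage(string rtf)
    {
        Match match = Regex.Match(rtf, @"\\ansicpg(\d+)");
        return match.Success ? int.Parse(match.Groups[1].Value) : (int?)null;
    }
}
```

For a header such as `{\rtf1\ansi\ansicpg1251 ...}` this returns 1251; for a file without the control word it returns null, and the caller has to decide on a fallback.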
, , , rtf RichTextBox
, :
- -,
Windows.Forms
, . - -, . ,
RichTextBox
, , .
So, install the RtfPipe package from NuGet. RtfPipe converts RTF to HTML, and the resulting HTML is then processed with HtmlAgilityPack.
, - ? : , , , , . , rtf- , , , , — , , , , , .
, .
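// Requires: using System.IO; using System.Text; using RtfPipe;
public string Convert(Stream stream)
{
    // Rewind in case the caller has already read from the stream.
    stream.Position = 0;

    string rtf;
    using (StreamReader sr = new StreamReader(stream))
    {
        rtf = sr.ReadToEnd();
    }

    // On .NET Core this makes Windows code pages such as 1251 available.
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

    var html = Rtf.ToHtml(rtf);
    return ClearHtml(html);
}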
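First, the stream is read into a string, which is then passed to RtfPipe. On .NET Core, Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); has to be called before RtfPipe is used; on .NET Framework this call is not needed. var html = Rtf.ToHtml(rtf); then returns the document as HTML.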
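To show why the code page provider matters, here is a minimal sketch that decodes RTF `\'hh` hex escapes with a Windows code page. It is not how RtfPipe works internally; the `RtfHexText` class is hypothetical, and the sketch assumes a runtime where `CodePagesEncodingProvider` is available (.NET Core 3.0 or later, or the System.Text.Encoding.CodePages package).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, for illustration only.
public static class RtfHexText
{
    // Decodes a run of RTF \'hh escapes (for example \'cf\'f0) using the
    // given code page. Anything that is not an escape is skipped in this sketch.
    public static string Decode(string escaped, int codePage)
    {
        // Without this registration, .NET Core does not know code page 1251
        // and Encoding.GetEncoding would throw.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding encoding = Encoding.GetEncoding(codePage);

        var bytes = new List<byte>();
        int i = 0;
        while (i + 3 < escaped.Length)
        {
            if (escaped[i] == '\\' && escaped[i + 1] == '\'')
            {
                // The two characters after \' are one byte in hexadecimal.
                bytes.Add(Convert.ToByte(escaped.Substring(i + 2, 2), 16));
                i += 4;
            }
            else
            {
                i++;
            }
        }
        return encoding.GetString(bytes.ToArray());
    }
}
```

Calling `RtfHexText.Decode(@"\'cf\'f0\'e8\'e2\'e5\'f2", 1251)` yields the Russian word "Привет", since each escape is one byte of code page 1251.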

The string ClearHtml(string html) method removes the unneeded markup and wraps the result in XML:
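// Requires: using System.IO; using System.Text; using HtmlAgilityPack;
string ClearHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null when nothing matches, so guard both queries.
    var excessNodes = doc.DocumentNode.SelectNodes("//b|//u|//strong|//br");
    if (excessNodes != null)
    {
        foreach (var element in excessNodes)
        {
            // Replace only the formatting tag with its text, leaving siblings intact.
            element.ParentNode.ReplaceChild(doc.CreateTextNode(element.InnerText), element);
        }
    }

    var elementsWithStyleAttribute = doc.DocumentNode.SelectNodes("//*[@style]");
    if (elementsWithStyleAttribute != null)
    {
        foreach (var element in elementsWithStyleAttribute)
        {
            element.Attributes["style"].Remove();
        }
    }

    using (StringWriter writer = new StringWriter())
    {
        doc.Save(writer);
        html = writer.ToString();
    }

    // Closing tags must mirror the opening order: </document> before </documents>.
    StringBuilder xml = new StringBuilder();
    xml.Append("<?xml version=\"1.0\"?><documents><document>");
    xml.Append(html);
    xml.Append("</document></documents>");
    return xml.ToString();
}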
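First, using the HtmlAgilityPack package from NuGet, a doc object of type HtmlDocument is created and the html is loaded into it. Second, the formatting tags and the style attributes are removed. Third, a StringBuilder is used to wrap the result in XML.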
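Because the XML is assembled by string concatenation, nothing guarantees that the result is well-formed: an unclosed tag, an HTML entity such as `&nbsp;`, or closing tags in the wrong order would all break it. A minimal sketch of a check, assuming System.Xml.Linq is available, could look like this; the `XmlCheck` class is hypothetical and not part of the original code.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class XmlCheck
{
    // Returns true when the string parses as XML; otherwise returns false
    // and reports the parser's message through the out parameter.
    public static bool IsWellFormed(string xml, out string error)
    {
        try
        {
            XDocument.Parse(xml);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

Running the output of ClearHtml through such a check before saving it makes a malformed document show up immediately instead of in whatever consumes the XML later.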
, , , .
docx xlsx