Sunday, October 23, 2011

Parsing and Manipulating HTML strings - C#

Recently I ran into a situation where I needed to retrieve an HTML string created and stored by an ASP application. After retrieving the string, I had to parse it and replace some deprecated attributes with equivalent alternatives. For instance, the size attribute on the font element needed to be replaced with the CSS font-size property.

Snippet 1:

<font size="1">......</font>

to 

<font style="font-size: 9px;">......</font>


One additional thing to note here: since the size attribute has no direct equivalent in terms of unit and value, I simply mapped it to the size expected by the invoking applications.

Fine, now for the core of the issue: how to manipulate the HTML string. Initially I thought of using regex. Then I remembered a nice tool, Html Agility Pack, which I had explored a few months back. It's an HTML parser that lets us select an HTML element and manipulate it very easily. It uses XPath to navigate through the elements. If you don't know XPath, no problem. You can refer to examples on the net (such as on MSDN and w3schools) and find the XPath expression that best suits your requirement in less than a minute.
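To get a feel for XPath selection, here is a minimal sketch. It assumes the Html Agility Pack assembly is referenced in the project; the XPathSamples class and the CountMatches helper are hypothetical names used only for illustration.

```csharp
using HtmlAgilityPack; // third-party library, must be referenced by the project

public static class XPathSamples
{
    // Returns how many nodes in the given HTML match the XPath expression.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    public static int CountMatches(string html, string xpath)
    {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }
}
```

With this helper, "//font" selects every font element, "//font[@size]" only those that carry a size attribute, and "//font[@size='1']" only those whose size is exactly 1.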

Following is the code for the situation I mentioned above.
Snippet 2:

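// Requires a reference to Html Agility Pack and, at the top of the file:
//     using HtmlAgilityPack;
// HtmlDocument below is HtmlAgilityPack.HtmlDocument,
// not System.Windows.Forms.HtmlDocument.

// Replaces the deprecated size attribute on font elements with an inline CSS font-size.
private string RestructureDeprecatedAttributes(string htmlText)
{
    HtmlDocument document = new HtmlDocument();
    document.LoadHtml(htmlText);

    // SelectNodes returns null when no node matches the XPath expression.
    HtmlNodeCollection fontNodeCollection = document.DocumentNode.SelectNodes("//font[@size]");

    if (fontNodeCollection != null)
    {
        foreach (HtmlNode fontNode in fontNodeCollection)
        {
            HtmlAttribute sizeAttribute = fontNode.Attributes["size"];
            int size;

            // TryParse skips empty or non-numeric values instead of throwing a FormatException.
            if (int.TryParse(sizeAttribute.Value, out size) && size > 0)
            {
                // Note: SetAttributeValue overwrites any existing style attribute.
                if (sizeAttribute.Value.Equals("1"))
                {
                    fontNode.SetAttributeValue("style", "font-size: 9px;");
                }
                else
                {
                    fontNode.SetAttributeValue("style", "font-size: 11px;");
                }

                fontNode.Attributes["size"].Remove();
            }
        }
    }

    // Serialize the modified document back to an HTML string.
    return document.DocumentNode.WriteTo();
}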

The XPath selectors are really powerful and let us select exactly the nodes that match our criteria. Html Agility Pack also provides several methods that make manipulating HTML strings very easy, as you can infer from the above code.
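One caveat about Snippet 2: SetAttributeValue replaces the whole style attribute, so any inline styles already present on a font element are lost. If that matters for your markup, a small helper along these lines could append the new declaration instead. This is only a sketch; StyleHelper and AppendDeclaration are hypothetical names, and it does not check whether the existing style already contains a font-size.

```csharp
using System;

public static class StyleHelper
{
    // Appends a CSS declaration to an existing inline style value
    // instead of overwriting it.
    public static string AppendDeclaration(string existingStyle, string declaration)
    {
        string current = (existingStyle ?? string.Empty).Trim();
        string addition = declaration.Trim();

        if (current.Length == 0)
        {
            return addition;
        }

        // Make sure the existing declarations are terminated before appending.
        if (!current.EndsWith(";", StringComparison.Ordinal))
        {
            current += ";";
        }

        return current + " " + addition;
    }
}

// Possible use inside the loop of Snippet 2 (assumes Html Agility Pack):
//     string oldStyle = fontNode.GetAttributeValue("style", "");
//     fontNode.SetAttributeValue("style",
//         StyleHelper.AppendDeclaration(oldStyle, "font-size: 9px;"));
```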

2 comments:

  1. Thank you for posting your solution, however I keep getting

    [At the line 3: HtmlDocument document = new HtmlDocument()]

    - System.Windows.Forms.HtmlDocument has no constructors defined

    Replies
    1. Thank you for your comment.
      Actually, it's the Html Agility Pack (HAP) (http://htmlagilitypack.codeplex.com/) that I've used in the snippet. Just download it from CodePlex and add a reference to it in your project. The class you need is HtmlAgilityPack.HtmlDocument. Hope this answers your question.


Creative Commons License
This work by Tito is licensed under a Creative Commons Attribution 3.0 Unported License.