Remove Certain HTML tags in C#

I’m trying to remove certain HTML tags in C#. Given markup like this:

<div>
    <blockquote style="font-size: 30px" width="300px">
For 50 years, WWF has been protecting the future of nature. The world's leading conservation organization, WWF works in 100 countries and is supported by 1.2 million members in the United States and close to 5 million globally.
    </blockquote>
</div>

I want the result to be:

<div>For 50 years, WWF has been protecting the future of nature. The world's leading conservation organization, WWF works in 100 countries and is supported by 1.2 million members in the United States and close to 5 million globally.</div>

So far, I’ve tried the regex (<.+?)\s+style\s*=\s*([""']).*?\2(.*?>), but it only removes the style attribute, and I’m not sure how to achieve the result I want.

Thanks!

Answer

As far as I can see, you want to remove the HTML elements that have a style attribute, together with their closing tags, while keeping their content. Unfortunately, there is no good way to do that with regexes. XSLT, on the other hand, is the right tool for the job:

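<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Copy attributes, and every node that has no style attribute -->
  <xsl:template match="@*|node()[not(@style)]">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

What’s happening here? The <xsl:template match="@*|node()[not(@style)]"> part matches every attribute and every node that does not have a style attribute. The <xsl:copy>...</xsl:copy> part copies each of them and then processes its attributes and children in turn. An element that has a style attribute matches no template, so it is not copied. XSLT's built-in rule still processes its children, which is why the text inside the blockquote survives while the blockquote tags themselves disappear.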
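Since the question is about C#, here is a minimal sketch of how such a stylesheet could be applied with XslCompiledTransform. The class name StyledElementRemover and its Transform method are hypothetical names chosen for illustration. The stylesheet is inlined with the match pattern written as @*|node()[not(@style)], so that attributes on the kept elements are copied as attributes, and an xsl:output element is added to suppress the XML declaration; both are assumptions rather than part of the original answer. It also assumes the input is well-formed XML (XHTML), since an XSLT processor will reject ordinary tag-soup HTML.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class StyledElementRemover
{
    // Inlined stylesheet (assumption): copies attributes and every node without
    // a style attribute. xsl:output is added here to omit the XML declaration.
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" omit-xml-declaration=""yes""/>
  <xsl:template match=""@*|node()[not(@style)]"">
    <xsl:copy>
      <xsl:apply-templates select=""@*|node()""/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>";

    // Hypothetical helper: runs the stylesheet over a well-formed XML string
    // and returns the transformed markup as a string.
    public static string Transform(string xml)
    {
        var xslt = new XslCompiledTransform();
        using (var stylesheetReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(stylesheetReader);
        }

        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            // No stylesheet parameters are needed, so the argument list is null.
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Calling StyledElementRemover.Transform on the markup from the question should drop the blockquote tags and keep the div with the text inside it. Whitespace around the text is preserved as it appears in the input, so the result may need trimming if an exact single-line match is required.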

For the record, this is a slight variant of the XSLT identity transformation:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
