Find outermost top level XML/HTML tags with regular expressions

I’m working on a big personal project (which I’m going to release soon), and in it I need to parse strings containing XHTML tags in order to extract the top-level occurrences of a given tag name, i.e. from:

<onetag id="t1">
    <onetag id="t1-1"></onetag>
    <onetag id="t1-2"></onetag>
</onetag>
<onetag id="t2"></onetag>
<onetag id="t3"></onetag>
<onetag id="t4">
    <onetag id="t4-1"></onetag>
</onetag>

I have to get 4 tags (t1, t2, t3 and t4, with t1 and t4 containing their child nodes).
My regex knowledge is unfortunately very basic, so I googled for a ready-to-use regex, but none satisfied my needs: none of the examples I found handled nested tags properly. So, after some hours of testing, I wrote my own regex (my first real one). The result is the following:

var pattern = /<(onetag)[^<>]*>(<\1[^<>]*><\/\1>)*<\/\1>/gi;

In my case I’m using the pattern in JavaScript, but I think it can be used in any language, because it doesn’t rely on advanced features like “atomic grouping” and that kind of “black magic”. To match the desired tag, replace “onetag” with the tag you are looking for (even a tag with a namespace, like “<foo:mytag>”).
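As a small illustration of swapping in another tag name, the pattern can be built from a string instead of being edited by hand. This is only a sketch: buildTopLevelPattern is a hypothetical helper name, not something from the project, and it inherits the limits of the pattern above (one level of nesting, with child tags that have no content of their own).

```javascript
// Hypothetical helper: builds the pattern shown above for any tag name.
function buildTopLevelPattern(tagName) {
  // Escape regex metacharacters so the tag name is matched literally.
  var escaped = tagName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Same structure as the literal pattern; backslashes are doubled
  // because the pattern is written inside a string here.
  return new RegExp(
    '<(' + escaped + ')[^<>]*>(<\\1[^<>]*><\\/\\1>)*<\\/\\1>',
    'gi'
  );
}

// Usage with a namespaced tag on a single-line string.
var sample =
  '<foo:mytag a="1"><foo:mytag></foo:mytag></foo:mytag>' +
  '<foo:mytag></foo:mytag>';
var found = sample.match(buildTopLevelPattern('foo:mytag'));
console.log(found.length); // 2
```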

EDIT:

The pattern above only works when applied to a single-line string (i.e. var myString = “<onetag id='t1'>…”). If you use it on a “complex string” (a string containing spaces and new lines), it won’t work properly. Fortunately, you can remove the “bad characters” beforehand with a simple replace:

var parsedString = originalString.replace(/\s(?!\w)/gi, '').match(pattern);

\s(?!\w) matches any whitespace character (space, tab or new line) that is not followed by a word character (letter, digit or underscore), so the spaces between tag attributes are not removed.
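To see the two steps together, here is a short sketch that runs the cleanup and the pattern on the first sample from this post. The variable names are just for illustration, and the sample contains no text nodes, so no meaningful whitespace is lost by the cleanup.

```javascript
var pattern = /<(onetag)[^<>]*>(<\1[^<>]*><\/\1>)*<\/\1>/gi;

// The first sample, written as a multi-line string.
var originalString = [
  '<onetag id="t1">',
  '    <onetag id="t1-1"></onetag>',
  '    <onetag id="t1-2"></onetag>',
  '</onetag>',
  '<onetag id="t2"></onetag>',
  '<onetag id="t3"></onetag>',
  '<onetag id="t4">',
  '    <onetag id="t4-1"></onetag>',
  '</onetag>'
].join('\n');

// Drop every whitespace character that is not followed by a word
// character; the space before each "id" attribute survives.
var stripped = originalString.replace(/\s(?!\w)/g, '');

// With the g flag, match() returns an array of all full matches.
var matches = stripped.match(pattern);

console.log(matches.length); // 4
console.log(matches[1]);     // <onetag id="t2"></onetag>
```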

EDIT 2:

The pattern /<(onetag)[^<>]*>(<\1[^<>]*><\/\1>)*<\/\1>/gi won’t work properly when several types of tags are nested, i.e.:

<onetag id="t1">
    <anothertag>
         <onetag id="t1-1"></onetag>
         <onetag id="t1-2"></onetag>
    </anothertag>
</onetag>
<onetag id="t2"></onetag>
<onetag id="t3"></onetag>
<onetag id="t4">
    <anothertag>
        <onetag id="t4-1"></onetag>
    </anothertag>
</onetag>

The updated pattern is the following:

var newP = /<(onetag)[^<>]*>.*?(<\1[^<>]*>.*?<\/\1>)*.*?<\/\1>/gi;

I hope this will work without further modifications :P
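One caveat before relying on the updated pattern: tracing it by hand suggests that the lazy .*? stops at the first </onetag> it meets, so on the sample above the first match appears to end right after the t1-1 child instead of at the end of t1. A regular expression without recursion cannot count nesting depth, so for arbitrarily nested markup a small scanner is probably a more dependable option. The following is a minimal sketch, not part of the original solution: findTopLevelTags is a hypothetical helper, and it assumes well-formed input in which every tag has an explicit closing tag (no self-closing <onetag/> form).

```javascript
// Hypothetical helper: collects the outermost <tagName>...</tagName>
// blocks by counting how deeply the tag is currently nested.
function findTopLevelTags(source, tagName) {
  // Escape regex metacharacters so the tag name is matched literally.
  var escaped = tagName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Matches an opening or a closing tag of that name. Group 1 is "/"
  // for a closing tag and empty for an opening tag. The lookahead
  // stops "onetag" from also matching a longer name like "onetagx".
  var tagRe = new RegExp('<(\\/?)' + escaped + '(?=[\\s>])[^<>]*>', 'gi');
  var results = [];
  var depth = 0;
  var start = -1;
  var m;
  while ((m = tagRe.exec(source)) !== null) {
    if (m[1] === '') {
      // Opening tag: remember where an outermost block begins.
      if (depth === 0) start = m.index;
      depth++;
    } else if (depth > 0) {
      // Closing tag: a block is complete when depth returns to zero.
      depth--;
      if (depth === 0) results.push(source.slice(start, tagRe.lastIndex));
    }
  }
  return results;
}

// The second sample; new lines can stay, no cleanup step is needed.
var nested = [
  '<onetag id="t1">',
  '    <anothertag>',
  '         <onetag id="t1-1"></onetag>',
  '         <onetag id="t1-2"></onetag>',
  '    </anothertag>',
  '</onetag>',
  '<onetag id="t2"></onetag>',
  '<onetag id="t3"></onetag>',
  '<onetag id="t4">',
  '    <anothertag>',
  '        <onetag id="t4-1"></onetag>',
  '    </anothertag>',
  '</onetag>'
].join('\n');

var topLevel = findTopLevelTags(nested, 'onetag');
console.log(topLevel.length); // 4
```

Because the scanner only looks at opening and closing tags of the requested name, other tags such as anothertag can appear at any depth without affecting the count.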

  • Hi,

    I want to write a regex for this code. Can you help me? It is similar to yours, but I am not getting any data. My requirement is to get the first-level P tags. My content is:

    oneone-one h two three