Reading querystring variables using SSI and regular expressions

Recently I worked on a project in which I had to rely on SSI (Apache’s Server Side Includes) to read URL parameters and use them to dynamically include certain HTML files with the “include virtual” directive. Unfortunately the documentation available online is not exhaustive, and I had to figure some things out by myself.
Anyway, according to the docs, there are several global variables we can use in SSI. Two of them, DOCUMENT_URI and QUERY_STRING, are the ones we can use to handle the page URL. The first returns the (%-decoded) URL path of the document; the second returns the part of the URL that follows the “?”.
So, how can we extract the variables we want from these strings, since SSI doesn’t offer methods such as “substring”, “split”, “indexOf” and the like? The answer is: by using regular expressions in a tricky and ingenious way!
SSI offers a basic way to implement a decision flow (if, elif, else). The if command has an attribute, expr, which holds the expression to be evaluated, and in this attribute it is also possible to use a regex to test a given pattern. Knowing this, it is possible to declare an SSI variable that represents the desired querystring parameter in the following way:

<!--#if expr="$QUERY_STRING = /year=([0-9]{4})/" -->
	<!--ssi-comment: year found -->
	<!--#set var="year" value="$1" -->
<!--#else -->
	<!--ssi-comment: year NOT found -->
	<!--#set var="year" value="$DATE_LOCAL" --> 
<!--#endif -->

In the code above I’m looking for a querystring parameter called year, which must be a 4-digit ({4}) number ([0-9]).
If the pattern matches, the value captured by the regex ($1) is assigned to the SSI variable year; otherwise the current year of the server date ($DATE_LOCAL) is assigned.
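
To see what the regex in the if expression captures, the same test can be tried outside Apache. What follows is only a JavaScript sketch of the logic, not SSI code: the function name readYear and the fallbackYear parameter are made up for illustration, with fallbackYear standing in for $DATE_LOCAL.

```javascript
// Hypothetical helper mirroring the SSI block above: it returns the captured
// year when the query string matches, otherwise the supplied fallback
// (which plays the role of $DATE_LOCAL formatted with %Y).
function readYear(queryString, fallbackYear) {
  const match = /year=([0-9]{4})/.exec(queryString);
  // match[1] corresponds to $1 in the SSI version
  return match ? match[1] : String(fallbackYear);
}

console.log(readYear('year=2010&month=05', 2024)); // "2010"
console.log(readYear('month=05', 2024));           // "2024"
// The pattern is not anchored, so it also matches inside longer names/values
console.log(readYear('birthyear=19855', 2024));    // "1985"
```

The last call shows that the pattern is fairly loose. If that matters, a stricter variant such as /(^|&)year=([0-9]{4})(&|$)/ could be used, in which case the year ends up in the second capture group.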

Notes:
1. “ssi-comment:” is not special syntax, just a comment style I decided to adopt to keep the code readable and understandable.
2. To get only the year from the $DATE_LOCAL variable, you must configure the time format with <!--#config timefmt="%Y" --> before using it.

Find outermost top level XML/HTML tags with regular expressions

I’m working on a big personal project (which I’m going to release soon), and in this project I need to parse strings containing XHTML tags in order to extract the top-level occurrences of a given tag name, i.e. from:

<onetag id="t1">
    <onetag id="t1-1"></onetag>
    <onetag id="t1-2"></onetag>
</onetag>
<onetag id="t2"></onetag>
<onetag id="t3"></onetag>
<onetag id="t4">
    <onetag id="t4-1"></onetag>
</onetag>

I have to get 4 tags (t1, t2, t3, t4 with t1 and t4 containing their child nodes).
My regex knowledge is unfortunately very basic, so I googled for a ready-to-use regex, but none satisfied my needs: none of the examples I found handled nested tags properly. So, after some hours of testing, I wrote my own regex (my first real one). The result is the following:

var pattern = /<(onetag)[^<>]*>(<\1[^<>]*><\/\1>)*<\/\1>/gi;

In my case I’m using the pattern in JavaScript, but I think it can be used in any language, because it doesn’t rely on advanced features like “atomic grouping” and that kind of “black magic”. To match the tag you want, replace “onetag” with the name of the tag you are looking for (even a tag with a namespace like “<foo:mytag>”).
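
Replacing the tag name by hand can be avoided by building the pattern from a string. This is a small sketch under the same constraints as the pattern above (single-line input, children that are themselves empty); topLevelPattern and the variable names are ones I made up for the example.

```javascript
// Hypothetical helper: builds the pattern above for an arbitrary tag name.
function topLevelPattern(tagName) {
  // Escape regex metacharacters so a name such as "foo.bar" is taken literally
  const name = tagName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(
    '<(' + name + ')[^<>]*>(<\\1[^<>]*></\\1>)*</\\1>', 'gi');
}

const singleLine =
  '<onetag id="t1"><onetag id="t1-1"></onetag><onetag id="t1-2"></onetag></onetag>' +
  '<onetag id="t2"></onetag><onetag id="t3"></onetag>' +
  '<onetag id="t4"><onetag id="t4-1"></onetag></onetag>';

const topLevel = singleLine.match(topLevelPattern('onetag'));
console.log(topLevel.length); // 4
console.log(topLevel[1]);     // <onetag id="t2"></onetag>
```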

EDIT:

The pattern above works only when applied to a single-line string (i.e. var myString = “<onetag id='t1'>…”). If you use it on a “complex string” (a string containing spaces and new lines) it won’t work properly. Fortunately you can remove the “bad characters” beforehand with a simple replace:

var parsedString = originalString.replace(/\s(?!\w)/gi, '').match(pattern);

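\s(?!\w) matches any whitespace character (space, tab or new line) that is not followed by an alphanumeric character or an underscore, so the spaces between tag attributes are not removed. Note that match() returns an array of matches (or null when nothing matches), so parsedString holds the extracted tags rather than a string.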
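
A quick way to check the clean-up step is to run it on the multi-line sample from the beginning of the post. In this sketch the variable names cleaned and matches are mine; the two regular expressions are the ones shown above.

```javascript
const pattern = /<(onetag)[^<>]*>(<\1[^<>]*><\/\1>)*<\/\1>/gi;

const originalString = [
  '<onetag id="t1">',
  '    <onetag id="t1-1"></onetag>',
  '    <onetag id="t1-2"></onetag>',
  '</onetag>',
  '<onetag id="t2"></onetag>',
  '<onetag id="t3"></onetag>',
  '<onetag id="t4">',
  '    <onetag id="t4-1"></onetag>',
  '</onetag>'
].join('\n');

// Drop whitespace that is not followed by a word character; the space in
// '<onetag id=' survives because it is followed by "i".
const cleaned = originalString.replace(/\s(?!\w)/g, '');

// Fall back to an empty array when nothing matches
const matches = cleaned.match(pattern) || [];
console.log(matches.length); // 4
console.log(matches[3]);     // <onetag id="t4"><onetag id="t4-1"></onetag></onetag>
```

Keep in mind that whitespace directly in front of a letter, digit or underscore is kept, so a new line placed right before text content would still be in the cleaned string.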

EDIT 2:

The pattern /<(onetag)[^<>]*>(<\1[^<>]*><\/\1>)*<\/\1>/gi won’t work properly when several types of tags are nested, i.e.:

<onetag id="t1">
    <anothertag>
         <onetag id="t1-1"></onetag>
         <onetag id="t1-2"></onetag>
    </anothertag>
</onetag>
<onetag id="t2"></onetag>
<onetag id="t3"></onetag>
<onetag id="t4">
    <anothertag>
        <onetag id="t4-1"></onetag>
    </anothertag>
</onetag>

The updated pattern is the following:

var newP = /<(onetag)[^<>]*>.*?(<\1[^<>]*>.*?<\/\1>)*.*?<\/\1>/gi;

I hope this works without further modifications :P
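
One caveat I should add: JavaScript regular expressions have no recursion, and the lazy .*? in the updated pattern stops at the first closing tag it can reach, which on the sample above is the one that closes t1-1. A possible alternative is to scan the opening and closing tags in order and count the nesting depth. The sketch below assumes well-formed input without self-closing tags; findTopLevel and the variable names are hypothetical.

```javascript
// Hypothetical helper: returns the outermost <tagName>...</tagName> blocks
// by counting nesting depth. Assumes well-formed input, no self-closing
// tags, and a tag name without regex metacharacters.
function findTopLevel(source, tagName) {
  // Matches either an opening or a closing tag of the given name
  const tagRe = new RegExp('<(/?)' + tagName + '(?=[\\s>])[^<>]*>', 'gi');
  const results = [];
  let depth = 0;
  let start = -1;
  let m;
  while ((m = tagRe.exec(source)) !== null) {
    if (m[1] === '') {            // opening tag
      if (depth === 0) start = m.index;
      depth++;
    } else if (depth > 0) {       // closing tag
      depth--;
      if (depth === 0) results.push(source.slice(start, tagRe.lastIndex));
    }
  }
  return results;
}

const nested = [
  '<onetag id="t1">',
  '    <anothertag>',
  '         <onetag id="t1-1"></onetag>',
  '         <onetag id="t1-2"></onetag>',
  '    </anothertag>',
  '</onetag>',
  '<onetag id="t2"></onetag>',
  '<onetag id="t3"></onetag>',
  '<onetag id="t4">',
  '    <anothertag>',
  '        <onetag id="t4-1"></onetag>',
  '    </anothertag>',
  '</onetag>'
].join('\n');

// The updated pattern ends its first match at the first closing tag it meets
const newP = /<(onetag)[^<>]*>.*?(<\1[^<>]*>.*?<\/\1>)*.*?<\/\1>/gi;
const flat = nested.replace(/\s(?!\w)/g, '');
console.log(flat.match(newP)[0]);
// <onetag id="t1"><anothertag><onetag id="t1-1"></onetag>

// Depth counting works on the original multi-line string
const found = findTopLevel(nested, 'onetag');
console.log(found.length); // 4
console.log(found[1]);     // <onetag id="t2"></onetag>
```

Because the function slices the original string, the whitespace clean-up is not needed here, and the children of t1 and t4 come back exactly as they were written.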