<p>I named this post after the Stack Overflow question I posted, because it shows how a simple question led me down the rabbit hole that is regular expressions. For those who are familiar with Perl and regexes, this might be trivial; for those who are not, it might be boring. I don't think there is any middle ground. Anyway, let's get started.</p>
<p>I had this question when I was trying to use Perl to filter text, mainly in the form </p>
<p>I "selected" the matched substrings as groups and then printed only those substrings, leaving the rest behind. It worked. To make a group match zero or one time, we use the <code>?</code> quantifier. This was exactly my case with the <code>test</code> substring, so naturally it made sense to do this, right?</p>
<p>A sad, double-spaced emptiness in between the first and second substring. Clearly something went wrong. I will skip the trial and error and go straight to how we end up with the final result.</p>
<p>One of the issues with my expression was the use of <code>.*</code>. In regular expressions, it matches any character (except a newline, by default) zero or more times, with <b>greedy</b> matching. Regex engines like Perl's do this with a process known as backtracking: a greedy quantifier swallows as much of the input feed as it can, and the engine then spits it back bit by bit until the next part of the pattern can match. In my expression, the <code>.*</code> after <code>(This)</code> matches the rest of the line, so there is nothing left in the input feed. Since the next part, <code>(test)?</code>, was optional, matching nothing was still considered acceptable, and Perl went on its merry way to the part after that, which was <code>.*(string).*</code>. Here the match was mandatory, so the engine backtracks from the end of the line to the point where <code>string</code> matches. That part works, but by then <code>test</code> has already been swallowed by the first <code>.*</code>, so the optional group stays empty.</p>
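<p>The greedy behaviour can be reproduced with a small sketch. It is written in Python rather than Perl, on the assumption that Python's <code>re</code> module backtracks the same way for these constructs. The input line is hypothetical, and the pattern is reassembled from the pieces quoted in this post, so treat both as illustrations rather than the exact originals.</p>

```python
import re

# Hypothetical input line; the sample text from the original question is not shown here.
line = "This is a test string for regex"

# Pattern reassembled from the fragments quoted in the post (an assumption).
greedy = re.compile(r"(This).*(test)?.*(string).*")

greedy_groups = greedy.search(line).groups()

# Python reports an unmatched optional group as None; print it as an empty string.
greedy_output = " ".join(g or "" for g in greedy_groups)

print(greedy_groups)   # the optional (test)? group is None
print(greedy_output)   # note the double space where "test" should be
```

<p>The first <code>.*</code> only gives back enough characters for <code>string</code> to match, so <code>(test)?</code> is tried at a position that is already past <code>test</code>.</p>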
<p>So, we use <b>non-greedy matching</b> then, right? We replace <code>.*</code> with <code>.*?</code> and everything will be fine. It turns out not. We still get the same string.</p>
<p>This is because non-greedy matching matches the shortest string it can, which here is the empty string. The <code>(test)?</code> group is then tried right after <code>This</code>, where <code>test</code> does not start, so it matches nothing. That is acceptable because the group is optional, and the engine carries on with the rest of the expression just as before. In the end, this does not work out either. It turns out my expression required a rewrite, not a small fix. This is the solution I learnt from the amazing guys over at Stack Overflow.</p>
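<p>A sketch of the non-greedy attempt, again in Python on the assumption that its <code>re</code> module treats <code>.*?</code> as Perl does. The input line is hypothetical, and the sketch assumes every <code>.*</code> before a group was swapped for <code>.*?</code>.</p>

```python
import re

# Hypothetical input line, not the one from the original question.
line = "This is a test string for regex"

# Assumed non-greedy variant of the expression discussed in the post.
lazy = re.compile(r"(This).*?(test)?.*?(string).*")

lazy_groups = lazy.search(line).groups()

# Unmatched optional groups are None in Python; show them as empty strings.
lazy_output = " ".join(g or "" for g in lazy_groups)

print(lazy_groups)   # (test)? is still None
print(lazy_output)   # still double-spaced
```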
<p><code>?!</code> means <b>negative lookahead</b>. The group <code>(?!test)</code> fails at any position where <code>test</code> comes next in the input feed, and it consumes no characters itself. The <code>.</code> after the group is a single-character wildcard. Combining the two with the external parentheses,</p>
<p>we get a very interesting interaction. <code>?:</code> means <b>non-capturing group</b>. Text matched by a non-capturing group is still consumed, but it is not saved as a numbered capture. The <code>*</code> then repeats the group zero or more times. Combining all three of the previous interactions into</p>
<p>we get the following: <b>match characters one at a time, for as long as the pattern <code>test</code> does not start at the current position</b>, inside a non-capturing group. When the pattern is found in the input feed, <b>the matching stops at that point, leaving the pattern intact at the start of the remaining input feed</b>. The next part of the expression is our <code>(test)?</code>, which matches our perfectly placed pattern from the input feed! Our optional capturing group now works, and the rest is history. We finally get our highly anticipated</p>
<div style="text-align:center"><code>This test string</code></div>
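<p>The lookahead construct described above can be sketched as follows, in Python on the assumption that its <code>re</code> module handles <code>(?!...)</code> and <code>(?:...)</code> as Perl does. The full pattern is assembled from the fragments explained in this post and the input line is hypothetical, so neither is necessarily the exact text from the Stack Overflow answer.</p>

```python
import re

# Hypothetical input line, not the one from the original question.
line = "This is a test string for regex"

# Step 1: on its own, the lookahead-guarded group stops right before "test".
skipped = re.match(r"(?:(?!test).)*", " is a test string").group(0)

# Step 2: assumed full expression, with the guarded group replacing the first ".*".
fixed = re.compile(r"(This)(?:(?!test).)*(test)?.*(string).*")
fixed_groups = fixed.search(line).groups()
fixed_output = " ".join(g or "" for g in fixed_groups)

print(repr(skipped))   # the text consumed before "test"
print(fixed_groups)    # all three groups are now filled
print(fixed_output)
```

<p>Because the guarded group cannot step over <code>test</code>, the optional group gets its chance at exactly the right position, while a line without <code>test</code> still matches with the group left empty.</p>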
<p>Thank you for sticking around until the end. From this one small question I learnt a great deal about how regex engines behave, and about many more features of regular expressions. Was diving into the rabbit hole worth it? Hell yes. Thanks again for reading.</p>
<p>P.S. I also learnt a lot about HTML for this post. All hail divs.</p>