<header>
<h1>? Matched Expression Does Not Print in Perl</h1>
</header>
<p>I named this post after the Stack Overflow question I asked, because it shows how a simple question led me down the rabbit hole that is regular expressions. For those who are familiar with Perl and regexes, this might be trivial; for those who are not, this might be boring. I don't think there is any position in between. Anyway, let's get started.</p>
<p>I had this question when I was trying to use Perl to filter text, mainly in the form </p>
<div style="text-align:center"><code>printf "Some text\n" | perl -pe 's/text/fun/'</code></div>
<p>to get</p>
<div style="text-align:center"><code>Some fun</code> .</div>
<p>I should note up front that I was using Perl as part of a larger Bash script, not writing standalone Perl scripts.</p>
<p>So, we have a string of text</p>
<div style="text-align:center"><code>This is a string.</code></div>
<p>However, sometimes we get </p>
<div style="text-align:center"><code>This is a test string.</code></div>
<p>We want to match</p>
<div style="text-align:center"><code>This string</code> ,</div>
<p>but when the second string appears, we want to match the <code>test</code> word as well. So, the regular expression has to match</p>
<div style="text-align:center"><code>This (test) string</code></div>
<p>Sounds straightforward, right?</p>
<h2>How do we set out to print an optionally matched substring?</h2>
<p>In our case, the substring is <code>test</code>, and it is optionally matched. Prior to this problem, I was using this</p>
<div style="text-align:center"><code>s/.*(This).*(string).*/$1 $2/</code> ,</div>
<p> I "selected" the matched substrings as capture groups and then printed only those substrings, leaving the rest behind. It worked. To make a group match zero or one time, we use the <code>?</code> quantifier. This was exactly my case with the <code>test</code> substring, so naturally, it made sense to do this, right?</p>
<div style="text-align:center"><code>s/.*(This).*(test)?.*(string).*/$1 $2 $3/</code></div>
<p>Except that it didn't work.</p>
<p>The above expression gave me</p>
<div style="text-align:center"><code>This &nbsp;string</code> .</div>
<p>A sad, double-spaced emptiness where the second substring should have been. Clearly something went wrong. I will skip the detours and go straight to how we arrive at the final result.</p>
<p>One of the issues with my expression was the use of <code>.*</code>. In regular expressions, <code>.</code> matches any character (except a newline) and <code>*</code> repeats it any number of times, with <b>greedy</b> matching. Regex engines use a process known as backtracking to do the matching: a greedy pattern swallows as much of the input as it can, and then gives it back bit by bit until the next part of the expression can match. In my expression, the <code>.*</code> after <code>(This)</code> matches the rest of the line, so there is nothing left in the input. Since the next part, <code>(test)?</code>, was optional, matching nothing was still considered acceptable and Perl went on its merry way to <code>.*(string).*</code>. Here, the match was mandatory, so the engine backtracks from the end of the line to the point where <code>string</code> matches. It stops as soon as the whole expression matches, and that happens before it ever backs up far enough for <code>(test)?</code> to sit at the word <code>test</code>, so the group stays empty.</p>
<p>So, we use <b>non-greedy matching</b> then, right? We replace <code>.*</code> with <code>.*?</code> and everything will be fine. Turns out no. We still get the same string</p>
<div style="text-align:center"><code>This &nbsp;string</code> .</div>
<p>This is because non-greedy matching matches the shortest string it can, which at first is nothing. The <code>(test)?</code> group then looks at the very next part of the input, which is not <code>test</code>, but that is OK because the group is optional, so it matches nothing as well. The last <code>.*?</code> then stretches just far enough to reach <code>string</code>. In the end, this does not work out either. Turns out my expression required a rewrite, not a small fix. This is the solution I learnt from the amazing guys over at Stack Overflow.</p>
<div style="text-align:center"><code>s/.*?(This)(?:(?!test).)*(test)?.*?(string).*/$1 $2 $3/</code></div>
<p>I hope I have not bored you by this point, because there is a lot to decipher from this solution. The first part is</p>
<div style="text-align:center"><code>(?!test).</code></div>
<p><code>(?!</code> ... <code>)</code> is a <b>negative lookahead</b>. It succeeds only when the pattern inside it, here <code>test</code>, does <i>not</i> match at the current position, and it consumes no input either way. The <code>.</code> after the lookahead is a single-character wildcard. Combining with the external parentheses,</p>
<div style="text-align:center"><code>(?: ... )*</code></div>
<p>we get a very interesting interaction. <code>(?:</code> starts a <b>non-capturing group</b>: it groups like ordinary parentheses, but whatever it matches is not stored in <code>$1</code>, <code>$2</code> and so on. The <code>*</code> repeats the group any number of times. Combining all three of the previous pieces into</p>
<div style="text-align:center"><code>(?:(?!test).)*</code> ,</div>
<p>we get the following: match <b>any run of characters that does not contain the pattern <code>test</code></b>, one character at a time, in a non-capturing group. When the pattern is found in the input feed, <b>the matching stops at that point, leaving the pattern intact at the start of the remaining input</b>. In turn, the next part of the expression is our <code>(test)?</code>, which matches our perfectly placed pattern! Our optional capturing group finally works, and the rest is history. We finally get our highly anticipated</p>
<div style="text-align:center"><code>This test string</code></div>
<p>Thank you for sticking around until the end. I learnt a great deal about how regex engines behave, and about many more features of regular expressions, from this small question. Was diving into the rabbit hole worth it? Hell yes. Thanks again for reading.</p>
<p>P.S. I also learnt a lot about HTML for this post. All hail divs.</p>