Move indexes backwards by 1
@@ -1,61 +1,20 @@
|
||||
<header>
|
||||
<h1>? Matched Expression Does Not Print in Perl</h1>
|
||||
<h1>The SBC Change</h1>
|
||||
</header>
|
||||
|
||||
<p>I named this post after the Stack Overflow question I posted, because it shows how a simple question led me down the rabbit hole that is regular expressions. For those who are familiar with Perl and regexes, this might be trivial; for others who are not, this might be boring. I don't think there is any position in between. Anyway, let's get started.</p>
|
||||
<p>Over the past few months, I had a bit of spare cash in hand, so I thought it was time for an upgrade. But which upgrade path should I take? I agonized over this decision for a very long time. Should I pick up a used 1U server? Should I build a new PC with all these new Ryzen options? Or should I just upgrade my current desktop? In the end, I went with the most economical solution of all: the humble Raspberry Pi 4. I fitted it with heatsinks and a fan to remove more heat. It is powered by the official USB-C charger. That is all. Here is a picture of it in all its glory. I swear I will take a better picture of it one day.</p>
|
||||
|
||||
<p>I had this question when I was trying to use Perl to filter text, mainly in the form </p>
|
||||
<div style="text-align:center"><code>printf "Some text\n" | perl -pe 's/text/fun/'</code></div>
|
||||
<p>to get</p>
|
||||
<div style="text-align:center"><code>Some fun</code> .</div>
|
||||
<p> I should preface this by saying that I was using Perl as part of a larger Bash script, not writing a standalone Perl script.
|
||||
<img src="../images/20200302-rpi.webp" margin=auto alt="Raspberry Pi 4">
|
||||
|
||||
<p>So, we have a string of text</p>
|
||||
<div style="text-align:center"><code>This is a string.</code></div>
|
||||
<p>However, sometimes we get </p>
|
||||
</p>
|
||||
<div style="text-align:center"><code>This is a test string.</code></div>
|
||||
<p>We want to match</p>
|
||||
<div style="text-align:center"><code>This string</code> ,</div>
|
||||
<p>but when the second string appears, we want to match the <code>test</code> word as well. So, the regular expression has to match</p>
|
||||
<div style="text-align:center"><code>This (test) string</code></div>
|
||||
<p>Sounds straightforward, right?</p>
|
||||
<p>You are still reading? Well, I was planning to have a lab of my own for all kinds of applications, you see. I was also blinded by the plethora of stuff to experiment with. But I got greedy. Looking at the terabytes of LRDIMMs you can fit into a 2U hyperscale server does things to your mind. Never mind the fact that it was all just a pipe dream. I would never have the cash to buy all that RAM, nor the applications to fully utilize it. The same goes for all the CPU cores too. Do I *REALLY* need a 44-thread CPU drawing a few hundred watts at idle? Nah.</p>
|
||||
|
||||
<h2>How do we set out to print an optionally matched substring?</h2>
|
||||
<p>What about upgrading my CPU then? Turns out my CPU is so old that there are no upgrade options that can justify forking out any amount of money for it. I'd rather just build a new PC with the money. But then, even a new build with a Ryzen 3600 still costs a pretty penny. Also, I would have to find space somewhere to fit another micro-ATX case, with all its cables and ventilation concerns.</p>
|
||||
|
||||
<p>In our case, the substring is <code>test</code>, and it is optionally matched. Prior to this problem, I was using this</p>
|
||||
<div style="text-align:center"><code>s/.*(This).*(string).*/$1 $2/</code> ,</div>
|
||||
<p> I "selected" the matched substrings as capture groups and printed only those, leaving the rest behind. It worked. To make a group match zero or one times, we use the <code>?</code> quantifier. This was exactly my case with the <code>test</code> substring, so naturally it made sense to do this, right?</p>
|
||||
<div style="text-align:center"><code>s/.*(This).*(test)?.*(string).*/$1 $2 $3/</code></div>
|
||||
<p>Except that it didn't work.</p>
|
||||
|
||||
<p>The above expression gave me</p>
|
||||
<div style="text-align:center"><code>This string</code> .</div>
|
||||
<p>A sad, double-spaced emptiness in between the first and second substring. Clearly something went wrong. I will skip ahead and walk you through how we end up with the final result.</p>
|
||||
<p>One of the issues with my expression was the use of <code>.*</code>. In regular expressions, <code>.*</code> matches any number of characters (other than a newline, by default) with <b>greedy</b> matching. Regex engines handle this with a process known as backtracking: they swallow up as much of the input as they can, and then give it back bit by bit until the next part of the pattern can match. In my expression, the <code>.*</code> after <code>(This)</code> first matches the rest of the line, so there is nothing left in the input. Since the next part, <code>(test)?</code>, is optional, matching nothing is still acceptable, and Perl goes on its merry way to <code>.*(string).*</code>. Here the match is mandatory, so the engine backtracks from the end of the line to the point where <code>string</code> matches. By then <code>test</code> has already been swallowed by the first <code>.*</code>, so the optional group stays empty.</p>
|
||||
|
||||
<p>So, we use <b>non-greedy matching</b> then, right? We replace <code>.*</code> with <code>.*?</code> and everything will be fine. Turns out no. We still get the same string</p>
|
||||
<div style="text-align:center"><code>This string</code> .</div>
|
||||
<p>This is because non-greedy matching matches the shortest string it can, which at first is nothing. The <code>(test)?</code> group then fails to match at that position, right after <code>This</code>, but that is OK because it is optional. The following <code>.*?</code> then grows one character at a time until <code>string</code> matches, stepping over <code>test</code> along the way. In the end, this does not work out either. Turns out my expression required a rewrite, not a small fix. This is the solution I learnt from the amazing guys over at Stack Overflow.</p>
|
||||
|
||||
<div style="text-align:center"><code>s/.*?(This)(?:(?!test).)*(test)?.*?(string).*/$1 $2 $3/</code></div>
|
||||
|
||||
<p>I hope I have not bored you by this point, because there is a lot to decipher in this solution. The first part is</p>
|
||||
<div style="text-align:center"><code>(?!test).</code></div>
|
||||
<p><code>(?!...)</code> is a <b>negative lookahead</b>. It consumes no input; it makes the match fail at the current position if <code>test</code> begins there. The <code> . </code> after the lookahead is a single character wildcard. Combining with the external parentheses,</p>
|
||||
<div style="text-align:center"><code>(?: ... )*</code></div>
|
||||
<p>we get a very interesting interaction. <code>(?:...)</code> is a <b>non-capturing group</b>: it groups the pattern without storing what it matched in a numbered variable, so <code>test</code> and <code>string</code> remain <code>$2</code> and <code>$3</code>. The <code>*</code> repeats the group zero or more times. Combining all three of the previous pieces into</p>
|
||||
<div style="text-align:center"><code>(?:(?!test).)*</code> ,</div>
|
||||
<p>we get the following: match <b>any run of characters that does not contain the pattern <code>test</code></b>, without capturing it. When the pattern is found in the input, <b>the matching stops at that point, leaving the pattern intact at the start of the remaining input</b>. In turn, the next part of the expression is our <code>(test)?</code>, which matches our perfectly placed pattern! Our optional capturing group finally works, and the rest is history. We get our highly anticipated</p>
|
||||
<div style="text-align:center"><code>This test string</code></div>
|
||||
|
||||
<p>Thank you for sticking around until the end. I learnt a great deal about how regex engines behave, and many more functions of regular expressions from this small question. Was diving into the rabbit hole worth it? Hell yes. Thanks again for reading.</p>
|
||||
|
||||
<p>P.S. I also learnt a lot about HTML for this post. All hail divs.</p>
|
||||
<p>Which brings me back to the little Raspberry Pi 4. 4 cores, 4GB of RAM, 32GB of storage, and gigabit Ethernet. After giving myself a wake up call, I realized nothing fit my needs quite like this open-source oriented SBC. Yes, it isn't 'FSF certified' free, but given the reputation the Raspberry Pi Foundation has built over the years, I am more than happy to support their cause. I look forward to migrating my experiments to this new addition to the family.</p>
|
||||
|
||||
<hr>
|
||||
|
||||
<p><div class="navbar">
|
||||
<div><a href="blog-003">Prev</a></div>
|
||||
<div><a href="blog-005">Next</a></div>
|
||||
<div><a href="blog-004">Prev</a></div>
|
||||
<div><a href="blog-006">Next</a></div>
|
||||
</div></p>
|
||||
|