Regex: Can you escape?

ARGH!! My regex is not working!! It is not matching what it should. What could the problem be?

The situation

Sometimes you need to match a regex containing special characters. Sometimes these characters are also special in the language you are working in. In this post I will go through a few different languages (JavaScript, PowerShell, C#) and for each I will point out a few pitfalls and how to work with them.

For each language we will look at the following situations (problems):

  1. Matching part of a URL (getting www.andreasbijl.com from https://www.andreasbijl.com/whatever)
  2. Matching part of a local path (getting WindowsPowerShell from C:\Windows\System32\WindowsPowerShell\)
  3. Matching quotation symbols (getting the href, title and text from <a href="https://www.andreasbijl.com/" title='Mixed Quotes here'>A Link</a>)

Quick explanation of used regex parts

  • [] square brackets match a single character from the set listed inside them
  • [^] square brackets that start with a ^ match any single character except the ones listed inside them
  • a dot . matches any single character (by default every character except a newline)
  • an asterisk * after something means zero or more of it, matching as much as possible. It can follow a dot, a character or a set of square brackets
  • an asterisk followed by a question mark *? also means zero or more, but this time matching as few characters as possible
  • a plus + after a character, dot or set of square brackets means one or more of the preceding item
  • round brackets () capture a certain part of the data. When they are used the result is an array: the first item is the whole matched part, followed by one item for each pair of brackets
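The parts listed above can be tried out in a few lines of JavaScript. This is only a small sketch; the sample strings and variable names are made up for illustration.

```javascript
// Hypothetical sample text, chosen only to illustrate the regex parts.
const sample = 'aaa bbb 123';

// [0-9] matches one digit; + repeats it one or more times.
const digits = sample.match(/[0-9]+/)[0];

// [^ ] matches one character that is NOT a space.
const firstWord = sample.match(/[^ ]+/)[0];

// .* takes as much as possible, .*? takes as little as possible.
const greedy = 'x<a><b>y'.match(/<.*>/)[0];
const lazy = 'x<a><b>y'.match(/<.*?>/)[0];

// () captures parts of the match: index 0 is the whole match,
// index 1 and 2 are the two bracket groups.
const groups = 'key=value'.match(/([a-z]+)=([a-z]+)/);

console.log(digits, firstWord, greedy, lazy, groups[1], groups[2]);
```

Here the greedy version returns both tags at once, while the lazy version stops at the first closing bracket.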

JavaScript

Basic Regex in JavaScript

In JavaScript a regex is enclosed in forward slashes like so:
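A minimal sketch of such a literal; the pattern hello and the test strings are placeholders chosen for illustration.

```javascript
// The pattern sits between two forward slashes; no quotes are involved.
const greeting = /hello/;

// test() returns true when the pattern is found somewhere in the string.
const found = greeting.test('say hello to regex');
const notFound = greeting.test('goodbye');

console.log(found, notFound);
```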

The benefit of this is that quotation marks do not need to be escaped. The forward slash itself, however, does need to be escaped. Since it is a common symbol in URLs, it is good to keep this in mind while constructing or copying regexes in JavaScript. The symbol used to escape is the backslash, so a regular expression matching a single forward slash in JavaScript would look like this:
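A small sketch of that escape, plus a URL-like pattern to show how the backslashes pile up. The variable names and test strings are assumptions for illustration.

```javascript
// One forward slash, escaped with a backslash: slash, backslash, slash, slash.
const oneSlash = /\//;
const hasSlash = oneSlash.test('a/b');

// The two slashes after "https:" each need their own backslash.
const protocol = /https:\/\//;
const hasProtocol = protocol.test('https://www.andreasbijl.com/whatever');

// Quotation marks, single or double, need no escaping inside the slashes.
const quoted = /"[^"]*"|'[^']*'/;
const hasQuoted = quoted.test("title='Mixed Quotes here'");

console.log(hasSlash, hasProtocol, hasQuoted);
```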

As you can see, this can become complicated quite quickly.

Solution for Problem 1 in JavaScript
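One possible way to solve this, given as a sketch rather than a definitive answer. The pattern and the variable names are assumptions; other patterns would work just as well.

```javascript
const url = 'https://www.andreasbijl.com/whatever';

// Match the protocol, then capture everything up to the next forward slash.
// Every forward slash in the pattern is escaped with a backslash.
const hostMatch = url.match(/https?:\/\/([^\/]+)/);

// Index 0 holds the whole match, index 1 the captured host name.
const host = hostMatch ? hostMatch[1] : null;

console.log(host);
```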

Solution for Problem 2 in JavaScript
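A possible approach, sketched under the assumption that the wanted folder is always the last one in the path and that the path ends with a backslash. Note that the backslashes are escaped twice for different reasons: once in the string holding the path and once in the regular expression.

```javascript
// In a JavaScript string each backslash is written as two backslashes.
const path = 'C:\\Windows\\System32\\WindowsPowerShell\\';

// In the regex a literal backslash is also written as two backslashes.
// Capture the characters between the last two backslashes of the path.
const folderMatch = path.match(/\\([^\\]+)\\$/);

const folder = folderMatch ? folderMatch[1] : null;

console.log(folder);
```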

Solution for Problem 3 in JavaScript
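A sketch of one way to do this, assuming the attributes appear in exactly this order and are separated by single spaces. The link is stored in a template literal (backticks) so that neither kind of quote needs escaping in the string either.

```javascript
const html = `<a href="https://www.andreasbijl.com/" title='Mixed Quotes here'>A Link</a>`;

// ["'] accepts either kind of quote and needs no escaping in a regex literal.
// The only escaped character is the forward slash in the closing </a>.
const linkPattern = /<a href=["']([^"']*)["'] title=["']([^"']*)["']>([^<]*)<\/a>/;

const linkMatch = html.match(linkPattern);

// Skip index 0 (the whole match) and pick up the three captured groups.
const [, href, title, linkText] = linkMatch;

console.log(href, title, linkText);
```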

Conclusion for JavaScript

In the first case we saw a negative side effect of regular expressions being enclosed in forward slashes: the forward slashes in the URL need to be escaped. However, as you can see in example three, a benefit of this is that quotation marks (single or double) do not need to be escaped (just don’t forget about the forward slash in the closing tag of the link element). The second example is relatively typical, as in regular expressions it is expected that you need to escape the escape character if you want to match it.

PowerShell

Basic Regex in PowerShell

In PowerShell a regular expression is enclosed in regular quotation marks (single or double). What makes PowerShell different from the other two languages here is that the escape symbol for strings is not the backslash but the backtick (`), which on US keyboards sits left of the 1 key, directly below the Escape key.

Solution for Problem 1 in PowerShell

Solution for Problem 2 in PowerShell

Solution for Problem 3 in PowerShell

Conclusion for PowerShell

Problem 1 was easy; nothing needed escaping. For Problem 2 only the backslash needed escaping, as the backslash is the escape symbol inside regular expressions (note that inside strings the backslash does not need to be escaped). For Problem 3 the only thing to keep in mind is to use the correct escape symbol when escaping the quotation marks (the backtick, not the backslash). The fact that in PowerShell you need to deal with two types of escape characters may make it more complicated to work with regular expressions. You constantly need to keep in mind whether you are escaping something in the regular expression or in the string itself.

C#

Basic Regex in C#

In C# the escape character inside a string is the same as inside a regular expression: the backslash. This can lead to an overdose of backslashes. However, there is an alternative: a literal (verbatim) string, prefixed with @, in which backslashes are not treated as escape characters. This brings its own complications to the party. I will be showing both variants for C#; choose whichever you prefer, as both have upsides and downsides.

Solution for Problem 1 in C#

Solution for Problem 2 in C#

Solution for Problem 3 in C#

Conclusion for C#

Problem 1 is very straightforward in C#; no characters need escaping. Problem 3 is relatively easy as well: only the double quotation marks need escaping, in both normal and literal strings. The only difference is the method of escaping: a backslash in normal strings and doubled double quotation marks ("") in a literal string. Problem 2 is the most complicated in C#. As you can see, the regular expression in a normal string becomes very complex because every backslash needs to be double escaped (once for the string and once for the regular expression). Using a literal string makes this a lot easier, as you only need to escape the backslashes once to accommodate the regular expression. The path variable also no longer needs any escaped backslashes, which makes it much easier to read as well.

General Conclusions

All three of the languages shown have their own upsides and downsides. JavaScript and PowerShell share the situation where the delimiter of the regular expression is different from the escape character used in the language. This means you need to keep in mind which characters need to be escaped and how to do it. In C# backslashes need to be double escaped (once for the string they are in and once for the regular expression). Using a literal string reduces this to a single escape per backslash, but it means that double quotes need to be escaped by doubling them.

Quick overview:

JavaScript

  • Regex delimited by forward slashes
  • Need to escape forward slashes inside regular expression
  • Quotation marks are no problem inside regular expression

PowerShell

  • Regex delimited by normal string delimiters (single or double quotes)
  • Escape symbol inside a string is the backtick (`), on a US keyboard left of the 1 key and below the Escape key
  • Escape symbol for regex is still the backslash
  • Keep in mind what you are escaping, a character in the string, or a symbol in the regular expression

C#

  • Regex delimited by normal string delimiters (double quotes)
  • Escape symbol for string and regular expression is the same, which leads to double escaped backslashes \\\\
  • A literal string (prefixed with @) prevents double escaping of backslashes, but in turn requires double quotation marks to be escaped by doubling them: "".

Simple search with PowerShell

One day I was looking for a certain video file on my computer. I didn’t know where to look, but I knew the name (in this case it was a video I had posted on YouTube, where I could easily find the original file name).

With the standard search bar in the Windows Explorer window the file could not be found (it was not indexed). I figured: “how hard can it be to use PowerShell to look for a file with a certain name?”. I limited my scope to searching on the file name only. After this I constructed a fairly simple but effective script which did just that. Not only did I find the file, it even appeared multiple times on my 2TB drive (several copies of the same file). It went through the 2TB drive faster than I expected (the benefit in this case was the limited scope, which only looked at the names of files).

I wanted to share this simple script (and really, it doesn’t get much more straightforward than this), which empowers you to search large numbers of files as long as you know a part of the file name. If you are familiar with regular expressions you can use them in the search. If you are not familiar with them you can still use a part of the file name.

This is the script:

To make the function easier to work with I also created an alias “ff” as you can see in the last line.

Most of the lines consist of the two mandatory parameters: Path and regexSearchString (don’t worry if you do not know much about regular expressions; normal text will also work).

The actual search is basically a one-liner. It collects all files below the provided path (as you can see by the switches “-File” and “-Recurse”). If no files are found (or the path is not a valid path) the result will be $null, which indicates there are no results for the current path/search string combination. When all files are collected, a filter matches each file name against the regular expression (or search string).

If called directly it will simply write the result to the screen:

It makes more sense to store the result in a variable like this:

Here $result will contain a collection of FileInfo objects; if there is only one match it will not be a collection but a single FileInfo object.

If you are only interested in the location of the files you may choose to only collect the FullName (path+filename) property of the objects. You can do this by piping it to a select:

Here $result will be a collection of strings (or a single string if there is only one match).

If you are looking for a directory of which you know the name, the script can easily be modified to look for directories. Simply replace “-File” with “-Directory”, the rest works the same.