Regex: Can you escape?

ARGH!! My regex is not working!! It is not matching what it should. What could the problem be?

The situation

Sometimes you need to match a regex containing special characters. Sometimes these characters are also special in the language you are working in. In this post I will go through a few different languages (JavaScript, PowerShell, C#) and for each I will point out a few pitfalls and how to work with them.

For each language we will look at the following situations (problems):

  1. Matching part of a URL (getting www.andreasbijl.com from https://www.andreasbijl.com/whatever)
  2. Matching part of a local path (getting WindowsPowerShell from C:\Windows\System32\WindowsPowerShell\)
  3. Matching quotation symbols (getting the href, title and text from <a href="https://www.andreasbijl.com/" title='Mixed Quotes here'>A Link</a>)

Quick explanation of used regex parts

  • [] square brackets match only the characters listed inside the square brackets
  • [^] square brackets that start with a ^ match all characters except the ones inside the square brackets
  • a dot . matches any single character (by default, any character except a line break)
  • an asterisk * behind something means zero or more (where it will attempt to get as much as possible). It can be behind a dot, a character or a set of square brackets
  • an asterisk followed by a question mark *? means zero or more, but this time as few characters as possible
  • a plus + behind a character, dot or square brackets means one or more of the preceding item
  • normal brackets () are used to capture a certain part of the data. When they are used, the result is an array where the first item is the whole matched part, followed by one item for each group of brackets
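
The building blocks above can be tried out in a few lines of JavaScript. This is only a minimal sketch; the sample strings and variable names are made up for illustration.

```javascript
// Sample input, invented for this illustration.
const sample = "<b>one</b><b>two</b>";

// Greedy: .* grabs as much as possible, so it runs up to the last </b>.
const greedy = /<b>(.*)<\/b>/.exec(sample);

// Lazy: .*? grabs as little as possible, so it stops at the first </b>.
const lazy = /<b>(.*?)<\/b>/.exec(sample);

// [^\/]+ matches one or more characters that are not a forward slash.
const notSlash = /[^\/]+/.exec("abc/def");

// [abc]+ matches one or more of the listed characters only.
const onlyAbc = /[abc]+/.exec("xxcabyy");

console.log(greedy[1]);   // one</b><b>two
console.log(lazy[1]);     // one
console.log(notSlash[0]); // abc
console.log(onlyAbc[0]);  // cab
```

Item 0 of each result holds the whole match, and item 1 holds what the first pair of brackets captured.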

JavaScript

Basic Regex in JavaScript

In JavaScript a regex is enclosed in forward slashes like so:

/.*/

The benefit of this is that quotation symbols do not need to be escaped. However, the forward slash itself does need to be escaped. Since this is a common symbol in URLs, it is good to keep this in mind while constructing or copying regexes in JavaScript. The symbol used to escape is the backslash, so a regular expression matching a single forward slash in JavaScript looks like this:

/\//

As you can see, this can become complicated quite quickly.
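
One way to sidestep the escaped forward slashes is to build the regex from a string with the RegExp constructor. The sketch below assumes nothing beyond standard JavaScript; the variable names are made up for illustration. The trade-off is that the pattern now lives inside a string, so any backslash meant for the regex has to be escaped for the string first.

```javascript
// Literal syntax: every forward slash in the pattern must be escaped.
const literal = /https?:\/\/([^\/]+)/;

// Constructor syntax: the pattern is a string, so forward slashes are plain characters.
const constructed = new RegExp("https?://([^/]+)");

// Both forms find the same host name.
const exampleUrl = "https://www.andreasbijl.com/whatever";
console.log(literal.exec(exampleUrl)[1]);     // www.andreasbijl.com
console.log(constructed.exec(exampleUrl)[1]); // www.andreasbijl.com

// The price: a regex backslash must be doubled inside the string.
// "\\.com" is the string \.com, which the regex reads as a literal dot followed by com.
const literalDot = new RegExp("\\.com");
console.log(literalDot.test("andreasbijl.com")); // true
console.log(literalDot.test("andreasbijlXcom")); // false
```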

Solution for Problem 1 in JavaScript

var url = "https://www.andreasbijl.com/whatever";
/https?:\/\/([^\/]+)/.exec(url)[1];
//Result is: www.andreasbijl.com (group 1 from what is captured)

Solution for Problem 2 in JavaScript

//We need to escape the backslashes here as this is the escape symbol in JavaScript
var path = "C:\\Windows\\System32\\WindowsPowerShell\\";

/C:\\Windows\\System32\\([^\\]+)/.exec(path)[1];
// Result is WindowsPowerShell

Solution for Problem 3 in JavaScript

//We need to escape the double quotation symbols as this serves as delimiter for the string.
var linkElement = "<a href=\"https://www.andreasbijl.com/\" title='Mixed Quotes here'>A Link</a>";
/<a href="([^"]+)" title='([^']+)'>(.*?)<\/a>/.exec(linkElement);
//The result is an array where the first item will contain the entire link element
//The second item will contain the contents of the href attribute
//The third item will contain the contents of the title attribute
//The fourth item will contain the text of the link

Conclusion for JavaScript

In the first case we saw a negative side effect of regular expressions being enclosed in forward slashes: every forward slash in the URL had to be escaped. However, as you can see in example three, a benefit of this is that quotation symbols (single or double) do not need to be escaped (just don't forget about the forward slash in the closing tag of the link element). The second example is relatively typical, as in regular expressions it is expected that you need to escape the escape character if you want to match it.
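
One pitfall the examples above skip over: exec returns null when the regex does not match, so indexing the result directly with [1] throws an error. A minimal sketch of a guard follows; firstGroup is a hypothetical helper name, not part of JavaScript or any library.

```javascript
// Hypothetical helper: returns capture group 1, or null when nothing matches.
function firstGroup(regex, text) {
  const match = regex.exec(text);
  return match === null ? null : match[1];
}

const hostPattern = /https?:\/\/([^\/]+)/;

console.log(firstGroup(hostPattern, "https://www.andreasbijl.com/whatever")); // www.andreasbijl.com
console.log(firstGroup(hostPattern, "C:\\Windows\\System32"));                 // null
```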

PowerShell

Basic Regex in PowerShell

In PowerShell a regular expression is enclosed in regular quotation symbols (single or double). What makes PowerShell different from the other two languages here is that the escape symbol inside a string is not the backslash but the backtick (`), which on US keyboards sits left of the 1 key, directly below the Escape key. Inside the regular expression itself the escape symbol is still the backslash.

Solution for Problem 1 in PowerShell

$url = "https://www.andreasbijl.com/whatever"
$url -Match "https?://([^/]+)"
$Matches[1]
#$Matches[1] will contain: www.andreasbijl.com

Solution for Problem 2 in PowerShell

#No need to escape the backslashes here as they are not the escape symbol
$path = "C:\Windows\System32\WindowsPowerShell\"
$path -Match "C:\\Windows\\System32\\([^\\]+)\\"
$Matches[1]
#$Matches[1] will contain: WindowsPowerShell

Solution for Problem 3 in PowerShell

#Note that the double quotes inside the element are escaped using the PowerShell escape symbol
$linkElement = "<a href=`"https://www.andreasbijl.com/`" title='Mixed Quotes here'>A Link</a>"
$linkElement -Match "<a href=`"([^`"]+)`" title='([^']+)'>(.*?)</a>"
$Matches
#$Matches will be a hashtable where the first entry (key 0) will contain the entire link element
#The second entry (key 1) will contain the contents of the href attribute
#The third entry (key 2) will contain the contents of the title attribute
#The fourth entry (key 3) will contain the text of the link

Conclusion for PowerShell

Problem 1 was easy; nothing needed escaping. For Problem 2 only the backslash needed escaping, as the backslash is the escape symbol inside regular expressions (note that inside strings the backslash does not need to be escaped). For Problem 3 the only thing to keep in mind is to use the correct escape symbol for the quotation marks: the backtick, not the backslash. The fact that you need to deal with two different escape characters can make regular expressions more complicated to work with in PowerShell. You constantly need to keep in mind whether you are escaping something for the regular expression or for the string itself.

C#

Basic Regex in C#

In C# the escape character inside a string is the same as inside a regular expression: the backslash. This can lead to an overdose of backslashes. However, there is an alternative: a literal (verbatim) string, prefixed with @, in which backslashes are not treated as escape characters. This brings its own complications, though. I will show both variants for C#; choose whichever you prefer, as both have upsides and downsides.

Solution for Problem 1 in C#

string url = "https://www.andreasbijl.com/whatever";
System.Text.RegularExpressions.Regex.Match(url, "https?://([^/]+)");
//Result is: www.andreasbijl.com (group 1 from the resulting Match object)

Solution for Problem 2 in C#

//Example with normal strings
string path = "C:\\Windows\\System32\\WindowsPowerShell\\";
System.Text.RegularExpressions.Regex.Match(path, "C:\\\\Windows\\\\System32\\\\([^\\\\]+)\\\\");
// Result is WindowsPowerShell (group 1 from the resulting Match object)

//Example with literal strings
string path = @"C:\Windows\System32\WindowsPowerShell\";
System.Text.RegularExpressions.Regex.Match(path, @"C:\\Windows\\System32\\([^\\]+)\\");
// Result is WindowsPowerShell (group 1 from the resulting Match object)

Solution for Problem 3 in C#

//Example with normal strings
string linkElement = "<a href=\"https://www.andreasbijl.com/\" title='Mixed Quotes here'>A Link</a>";
System.Text.RegularExpressions.Regex.Match(linkElement, "<a href=\"([^\"]+)\" title='([^']+)'>(.*?)</a>");
//The result is a Match object with groups where the first group will contain the entire link element
//The second group will contain the contents of the href attribute
//The third group will contain the contents of the title attribute
//The fourth group will contain the text of the link

//Example with literal strings
string linkElement = @"<a href=""https://www.andreasbijl.com/"" title='Mixed Quotes here'>A Link</a>";
System.Text.RegularExpressions.Regex.Match(linkElement, @"<a href=""([^""]+)"" title='([^']+)'>(.*?)</a>");
//The result is a Match object with groups where the first group will contain the entire link element
//The second group will contain the contents of the href attribute
//The third group will contain the contents of the title attribute
//The fourth group will contain the text of the link

Conclusion for C#

Problem 1 is very straightforward in C#; there are no characters that need escaping. Problem 3 is relatively easy as well: only the double quotation symbols need escaping, in both normal and literal strings. The only difference is the method of escaping: a backslash in normal strings, and a doubled double quotation mark ("") in a literal string. Problem 2 is the most complicated one for C#. As you can see, the regular expression using normal strings becomes very complex, as all backslashes need to be escaped twice (once for the string and once for the regular expression). Using literal strings makes this a lot easier, as you only need to escape the backslashes once to accommodate the regular expression. The path variable also no longer needs any escaped backslashes at all, which makes it much easier to read as well.

General Conclusions

All three of the languages shown have their own upsides and downsides. JavaScript and PowerShell share the situation where the rules for the regular expression differ from the rules for ordinary strings: in JavaScript the regex is delimited by forward slashes, which then need escaping inside the regex, and in PowerShell the escape character for strings (the backtick) differs from the escape character for regular expressions (the backslash). This means you need to keep in mind which characters need to be escaped and how to do it. In C# the backslashes need to be escaped twice (once for the string they are in and once for the regular expression). Using a literal string reduces this to a single escape of backslashes, but it means that double quotes need to be escaped by doubling them ("").
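
The double-escaping effect is not unique to C#. It can be reproduced in JavaScript as soon as a regex is built from a string, and String.raw then plays a role similar to the C# literal string. The following is a minimal sketch with made-up variable names, not a recommendation of one style over the other.

```javascript
// The path from Problem 2, with backslashes escaped for the JavaScript string.
const path = "C:\\Windows\\System32\\WindowsPowerShell\\";

// Normal string: each regex backslash (\\) must be escaped again for the string (\\\\).
const doubled = new RegExp("C:\\\\Windows\\\\System32\\\\([^\\\\]+)");

// String.raw: backslashes are kept as typed, so one level of escaping is enough.
const raw = new RegExp(String.raw`C:\\Windows\\System32\\([^\\]+)`);

console.log(doubled.source === raw.source); // true
console.log(doubled.exec(path)[1]);         // WindowsPowerShell
console.log(raw.exec(path)[1]);             // WindowsPowerShell
```

Both constructor calls hand the same pattern to the regex engine; only the number of backslashes typed in the source code differs.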

Quick overview:

JavaScript

  • Regex delimited by forward slashes
  • Need to escape forward slashes inside regular expression
  • Quotation marks are no problem inside regular expression

PowerShell

  • Regex delimited by normal string delimiters (single or double quotes)
  • Escape symbol inside a string is the backtick (`), on US keyboards left of the 1 key and below the Escape key
  • Escape symbol for regex is still the backslash
  • Keep in mind what you are escaping, a character in the string, or a symbol in the regular expression

C#

  • Regex delimited by normal string delimiters (double quotes)
  • Escape symbol for string and regular expression is the same, which leads to double escaped backslashes \\\\
  • A literal string can be used to prevent double escaping of backslashes, but in turn it requires double quotation marks to be escaped by doubling them ("").