Regex: Can you escape?

ARGH!! My regex is not working!! It is not matching what it should. What could the problem be?

The situation

Sometimes you need to match a regex containing special characters. Sometimes these characters are also special in the language you are working in. In this post I will go through a few different languages (JavaScript, PowerShell, C#) and for each I will point out a few pitfalls and how to work with them.

For each language we will look at the following situations (problems):

  1. Matching part of a URL (getting www.andreasbijl.com from https://www.andreasbijl.com/whatever)
  2. Matching part of a local path (getting WindowsPowerShell from C:\Windows\System32\WindowsPowerShell\)
  3. Matching quotation symbols (getting the href, title and text from <a href="https://www.andreasbijl.com/" title='Mixed Quotes here'>A Link</a>)

Quick explanation of used regex parts

  • [] square brackets match only characters that are inside the square brackets
  • [^] square brackets that start with a ^ match all characters except the ones inside the square brackets
  • a dot . matches any character (by default any character except a line break)
  • an asterisk * behind something means zero or more (where it will attempt to get as much as possible). It can be behind a dot, a character or a set of square brackets
  • an asterisk followed by a question mark *? means zero or more, but this time as few characters as possible
  • a plus + behind a character, dot or square brackets means one or more of the preceding item
  • normal brackets () are used to capture a certain part of the data. When these are used the result is an array, where the first item is the whole matched part, followed by one item for each group of brackets
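The parts listed above can be tried out in a small JavaScript sketch. The sample text and variable names are made up for illustration and are not part of the three problems.

```javascript
// Hypothetical sample text, chosen only to illustrate the regex parts above.
const sample = "<b>one</b><b>two</b>";

// [] matches only the listed characters: here every vowel in the sample.
const vowels = sample.match(/[aeiou]/g);

// [^] matches everything except the listed characters: the name of the first tag.
const tagName = /<([^<>]+)>/.exec(sample)[1];

// .* is greedy: it runs on to the LAST closing tag.
const greedy = /<b>(.*)<\/b>/.exec(sample)[1];

// .*? is lazy: it stops at the FIRST closing tag.
const lazy = /<b>(.*?)<\/b>/.exec(sample)[1];

// + needs at least one of the preceding item.
const digits = /[0-9]+/.exec("room 101")[0];

// () captures: index 0 is the whole match, index 1 the first group.
const result = /<b>(.*?)<\/b>/.exec(sample);

console.log(vowels);    // [ 'o', 'e', 'o' ]
console.log(tagName);   // b
console.log(greedy);    // one</b><b>two
console.log(lazy);      // one
console.log(digits);    // 101
console.log(result[0]); // <b>one</b>
console.log(result[1]); // one
```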

JavaScript

Basic Regex in JavaScript

In JavaScript a regex is enclosed in forward slashes like so:

/.*/

The benefit of this is that quotation marks do not need to be escaped. However, the forward slash itself does need to be escaped. Since this is a common symbol in URLs, it is good to keep this in mind while constructing or copying regexes in JavaScript. The symbol used to escape is the backslash, so a regular expression matching a single forward slash in JavaScript looks like this:

/\//

As you can see, this can become complicated quite quickly.
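One way around escaping forward slashes is the RegExp constructor, which takes the pattern as a normal string. The sketch below compares the two forms; the variable names are made up for illustration. The trade-off is that the string consumes one level of backslashes, so a backslash meant for the regex has to be doubled.

```javascript
const url = "https://www.andreasbijl.com/whatever";

// Literal syntax: every forward slash in the pattern must be escaped.
const literal = /https?:\/\/([^\/]+)/;

// Constructor syntax: the pattern is a string, so forward slashes are fine as-is.
const constructed = new RegExp("https?://([^/]+)");

// The price: a regex that matches one backslash needs four of them in the string
// (the string turns \\\\ into \\, which the regex reads as one literal backslash).
const oneBackslash = new RegExp("\\\\");

console.log(literal.exec(url)[1]);      // www.andreasbijl.com
console.log(constructed.exec(url)[1]);  // www.andreasbijl.com
console.log(oneBackslash.test("a\\b")); // true
```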

Solution for Problem 1 in JavaScript

var url = "https://www.andreasbijl.com/whatever";
/https?:\/\/([^\/]+)/.exec(url)[1];
//Result is: www.andreasbijl.com (group 1 from what is captured)

Solution for Problem 2 in JavaScript

//We need to escape the backslashes here as this is the escape symbol in JavaScript
var path = "C:\\Windows\\System32\\WindowsPowerShell\\";

/C:\\Windows\\System32\\([^\\]+)/.exec(path)[1];
// Result is WindowsPowerShell
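If the doubled backslashes in the path string bother you, String.raw offers an alternative in modern JavaScript. This is a small sketch with made-up variable names. Note that the trailing backslash of the path is left off here, because a backslash directly before the closing backtick would escape it.

```javascript
// String.raw keeps backslashes exactly as typed, so the path needs no doubling.
// The trailing backslash is omitted: right before the closing backtick it would
// escape the backtick and the template would not end there.
const rawPath = String.raw`C:\Windows\System32\WindowsPowerShell`;

// The same path as a normal string, with every backslash doubled.
const escapedPath = "C:\\Windows\\System32\\WindowsPowerShell";

// The regex literal still needs its own escaping for each backslash.
const folderName = /C:\\Windows\\System32\\([^\\]+)/.exec(rawPath)[1];

console.log(rawPath === escapedPath); // true
console.log(folderName);              // WindowsPowerShell
```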

Solution for Problem 3 in JavaScript

//We need to escape the double quotation symbols as this serves as delimiter for the string.
var linkElement = "<a href=\"https://www.andreasbijl.com/\" title='Mixed Quotes here'>A Link</a>";
/<a href="([^"]+)" title='([^']+)'>(.*?)<\/a>/.exec(linkElement);
//The result is an array where the first element will contain the entire link element
//The second item will contain the contents of the href attribute
//The third item will contain the contents of the title attribute
//The fourth item will contain the text of the link
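Since exec returns null when nothing matches, it can help to wrap the call before reading the groups. A minimal sketch, where parseLink and linkPattern are hypothetical names introduced only for this example:

```javascript
// Same pattern as in the solution for problem 3.
const linkPattern = /<a href="([^"]+)" title='([^']+)'>(.*?)<\/a>/;

// Hypothetical helper: returns the captured parts, or null when there is no match.
function parseLink(html) {
  const match = linkPattern.exec(html);
  if (match === null) {
    return null;
  }
  // Index 0 is the whole match, indexes 1 to 3 are the three groups.
  const [whole, href, title, text] = match;
  return { whole, href, title, text };
}

const parsed = parseLink("<a href=\"https://www.andreasbijl.com/\" title='Mixed Quotes here'>A Link</a>");
console.log(parsed.href);             // https://www.andreasbijl.com/
console.log(parsed.title);            // Mixed Quotes here
console.log(parsed.text);             // A Link
console.log(parseLink("not a link")); // null
```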

Conclusion for JavaScript

In the first case we saw a negative side effect of regular expressions being enclosed by forward slashes. However, as you can see in example three, a benefit of this is that quotation marks (single or double) do not need to be escaped (just don't forget about the forward slash in the closing tag of the link element). The second example is fairly typical: in regular expressions you are expected to escape the escape character if you want to match it.
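When part of a pattern comes from a variable rather than being typed by hand, the escaping can be done by a helper. The sketch below uses a commonly seen escapeRegExp function; it is not built into JavaScript here, and the names are chosen for illustration.

```javascript
// Puts a backslash in front of every character that is special in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// A prefix that is known only at runtime (hypothetical input).
const prefix = "C:\\Windows\\System32\\";

// Escape the prefix, then append a group that captures up to the next backslash.
const pattern = new RegExp(escapeRegExp(prefix) + "([^\\\\]+)");

const folder = pattern.exec("C:\\Windows\\System32\\WindowsPowerShell\\")[1];
console.log(folder); // WindowsPowerShell
```

Because the pattern is passed to the RegExp constructor as a string, forward slashes do not need escaping in this form.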

PowerShell

Basic Regex in PowerShell

In PowerShell a regular expression is enclosed in regular quotation marks (single or double). What makes PowerShell different from the other two languages here is that the escape symbol inside double-quoted strings is not the backslash but the backtick (`), also called the grave accent. On US keyboards it sits left of the 1 key, directly below the Escape key.

Solution for Problem 1 in PowerShell

$url = "https://www.andreasbijl.com/whatever"
$url -Match "https?://([^/]+)"
$Matches[1]
#$Matches[1] will contain: www.andreasbijl.com

Solution for Problem 2 in PowerShell

#No need to escape the backslashes here as they are not the escape symbol
$path = "C:\Windows\System32\WindowsPowerShell\"
$path -Match "C:\\Windows\\System32\\([^\\]+)\\"
$Matches[1]
#$Matches[1] will contain: WindowsPowerShell

Solution for Problem 3 in PowerShell

#Note that the double quotes inside the element are escaped using the PowerShell escape symbol
$linkElement = "<a href=`"https://www.andreasbijl.com/`" title='Mixed Quotes here'>A Link</a>"
$linkElement -Match "<a href=`"([^`"]+)`" title='([^']+)'>(.*?)</a>"
$Matches
#$Matches will be a hashtable where key 0 contains the entire link element
#Key 1 contains the contents of the href attribute
#Key 2 contains the contents of the title attribute
#Key 3 contains the text of the link

Conclusion Regex in PowerShell

Problem 1 was easy; nothing needed escaping. For problem 2 only the backslash needed escaping, as the backslash is the escape symbol inside regular expressions (inside PowerShell strings the backslash does not need to be escaped). For problem 3 the only thing to keep in mind is to escape the quotation marks with the correct escape symbol (the backtick, not the backslash). Having to deal with two different escape characters can make regular expressions more complicated to work with in PowerShell: you constantly need to keep in mind whether you are escaping something for the regular expression or for the string itself.

C#

Basic Regex in C#

In C# the escape character inside a string is the same as inside a regular expression: the backslash. This can lead to an overdose of backslashes. There is an alternative, the literal (verbatim) string prefixed with @, in which backslashes are not treated as escape characters, but this brings its own complications. I will show both variants for C#; choose whichever you prefer, as both have upsides and downsides.

Solution for Problem 1 in C#

string url = "https://www.andreasbijl.com/whatever";
System.Text.RegularExpressions.Regex.Match(url, "https?://([^/]+)");
//Result is: www.andreasbijl.com (Groups[1].Value of the resulting Match object)

Solution for Problem 2 in C#

//Example with normal strings
string path = "C:\\Windows\\System32\\WindowsPowerShell\\";
System.Text.RegularExpressions.Regex.Match(path, "C:\\\\Windows\\\\System32\\\\([^\\\\]+)\\\\");
// Result is WindowsPowerShell (group 1 from the resulting Match object)

//Example with literal strings
string path = @"C:\Windows\System32\WindowsPowerShell\";
System.Text.RegularExpressions.Regex.Match(path, @"C:\\Windows\\System32\\([^\\]+)\\");
// Result is WindowsPowerShell (group 1 from the resulting Match object)

Solution for Problem 3 in C#

//Example with normal strings
string linkElement = "<a href=\"https://www.andreasbijl.com/\" title='Mixed Quotes here'>A Link</a>";
System.Text.RegularExpressions.Regex.Match(linkElement, "<a href=\"([^\"]+)\" title='([^']+)'>(.*?)</a>");
//The result is a Match object with groups where the first group will contain the entire link element
//The second group will contain the contents of the href attribute
//The third group will contain the contents of the title attribute
//The fourth group will contain the text of the link

//Example with literal strings
string linkElement = @"<a href=""https://www.andreasbijl.com/"" title='Mixed Quotes here'>A Link</a>";
System.Text.RegularExpressions.Regex.Match(linkElement, @"<a href=""([^""]+)"" title='([^']+)'>(.*?)</a>");
//The result is a Match object with groups where the first group will contain the entire link element
//The second group will contain the contents of the href attribute
//The third group will contain the contents of the title attribute
//The fourth group will contain the text of the link

Conclusion Regex in C#

Problem 1 is very straightforward in C#; no characters need escaping. Problem 3 is relatively easy as well: only the double quotation marks need escaping, with either normal or literal strings. The only difference is the method of escaping: a backslash in normal strings, and a doubled double quotation mark ("") in a literal string. Problem 2 is the most complicated for C#. As you can see, the regular expression using normal strings becomes very complex, as all backslashes need to be double escaped (once for the string and once for the regular expression). Using literal strings makes this a lot easier, as you only need to escape the backslashes once to accommodate the regular expression. The path variable also no longer needs any escaped backslashes at all, which makes it much easier to read.

General Conclusions

All three languages have their own upsides and downsides. In JavaScript and PowerShell, what needs escaping for the language (the forward slash delimiter in JavaScript, quotation marks via the backtick in PowerShell) is separate from what needs escaping for the regular expression (via the backslash). This means you need to keep in mind which characters need to be escaped and how to do it. In C# backslashes need to be double escaped (once for the string they are in and once for the regular expression). Using a literal string reduces this to a single escape of backslashes, but it means that double quotes need to be escaped by doubling them ("").

Quick overview:

JavaScript

  • Regex delimited by forward slashes
  • Need to escape forward slashes inside regular expression
  • Quotation marks are no problem inside regular expression

PowerShell

  • Regex delimited by normal string delimiters (single or double quotes)
  • Escape symbol inside a string is the backtick (`), on US keyboards left of the 1 key and below the Escape key
  • Escape symbol for regex is still the backslash
  • Keep in mind what you are escaping, a character in the string, or a symbol in the regular expression

C#

  • Regex delimited by normal string delimiters (double quotes)
  • Escape symbol for string and regular expression is the same, which leads to double escaped backslashes \\\\
  • A literal string can be used to prevent double escaping backslashes, but in turn it requires double quotation marks to be escaped by doubling them: "".

Programming languages

I enjoy using different programming languages and styles, as most languages have specific things they are designed for. Different languages have different strengths and weaknesses.

Java

Java has great portability (supported on almost any platform) and OOP capabilities. Java has multiple easy to use editors, one of the most popular being Eclipse. The IDE is lightweight and itself portable (so it can run from a USB stick and you can take it anywhere with you). Because of its popularity there is a lot of information about Java to be found on the internet.

C#

C# builds on the extensive .NET framework (as does VB.NET) and can create nice native looking applications for the Windows platform. It is also easy to pick up, and with a free programming environment (Visual Studio Community Edition) it takes little effort to get started. Because of the great popularity of the language a lot of information can be found on the internet. Apart from community support there is also support from Microsoft in the form of TechNet or MSDN.

PowerShell

PowerShell is a language based on the .NET framework (just like C# and VB.NET). It comes by default with Windows 7/Windows Server 2008 R2 (PowerShell 2.0) and newer versions. It also includes a nice ISE (Integrated Scripting Environment), which comes as a feature that can be activated.

As it is built on top of .NET, the entire framework is available to work with. Every variable or return value is a .NET object. A particularly strong feature of PowerShell is the concept of piping, which allows the output of one cmdlet to be used as input of another cmdlet.

Another benefit is that it can be run without needing to build an executable. It combines scripting power (known from batch files) with the power of a full programming language supported by the entire .NET framework. This enables you to create a Windows dialog from a few lines of PowerShell code. Through this it is possible to start a PowerShell script and then display a GUI with buttons, graphs, sliders and so on.

Another advantage of PowerShell over C# is (IMHO) the ease of handling custom objects. Importing a CSV will result in a collection of custom objects with properties named after every column of the CSV file. When creating custom objects you can also add members (properties) programmatically as you go along (at runtime). These new objects can then be gathered in a collection of your custom objects. Finally, these can be exported back to a CSV or passed along to a function or cmdlet. This would not be easy to do in a language like C#, because you would have to define the objects beforehand and every change in the imported CSV file would require a rewrite of the code.

C

When it comes to writing fast code, C/C++ is a common choice. I like C especially because it brings the programmer close to the core of programming. There are no strings, just arrays of characters. When traversing an array of integers the steps in memory are larger than when traversing a character array (aka string), which gives a better understanding of how much memory you are consuming. Memory is rarely an issue nowadays, but some understanding of it can be quite useful when programming for custom devices with limited memory, such as the integrated electronics in a camera or a remote control.

Web based

Web based languages differ from the previously mentioned languages in that they only produce HTML and CSS as output. Browsers then decide how the eventual output is rendered.

JavaScript

As some may have noticed, I did not mention JavaScript as an output. This is because JavaScript itself is not rendered; it is the results of JavaScript that are rendered: changes in CSS, HTML or text. JavaScript is one of the most used web based languages. No matter what the server side system is, the client side logic is almost always JavaScript.

For JavaScript there are many (often minified) libraries available. A popular one is jQuery. When it comes to programming languages, the focus is shifting more and more towards client side code inside a browser (like JavaScript), which requires future devices to have a browser and an internet connection. Complex calculations can still happen server side, but then through asynchronous calls to services.

PHP

For server side code there are many ways to go. A popular language is PHP (the site you are looking at right now runs on PHP). It is free to use and has many (free) editors available. PHP supports object oriented programming and is easy to get started with.

ASP.NET

Microsoft offers ASP.NET as a server side language. Like most of Microsoft's languages it is supported by the .NET framework and can use most of the .NET libraries (because of its web based nature it does not support all Windows based features). There are many extensions to ASP.NET, like Microsoft's MVC, which allows easy creation of data driven sites.