Regex: Can you escape?

ARGH!! My regex is not working!! It is not matching what it should. What could the problem be?

The situation

Sometimes you need to match a regex containing special characters. Sometimes these characters are also special in the language you are working in. In this post I will go through a few different languages (JavaScript, PowerShell, C#) and for each I will point out a few pitfalls and how to work with them.

For each language we will look at the following situations (problems):

  1. Matching part of a URL (getting www.andreasbijl.com from https://www.andreasbijl.com/whatever)
  2. Matching part of a local path (getting WindowsPowerShell from C:\Windows\System32\WindowsPowerShell\)
  3. Matching quotation symbols (getting the href, title and text from <a href="https://www.andreasbijl.com/" title='Mixed Quotes here'>A Link</a>)

Quick explanation of used regex parts

  • [] square brackets match one character out of the characters listed inside the square brackets
  • [^] square brackets that start with a ^ match any character except the ones inside the square brackets
  • a dot . matches any character (by default every character except a line break)
  • an asterisk * behind something means zero or more (it is greedy: it will grab as much as possible). It can follow a dot, a character or a set of square brackets
  • an asterisk followed by a question mark *? means zero or more, but this time as few characters as possible
  • a plus + behind a character, dot or square brackets means one or more of the preceding item
  • normal brackets () capture a certain part of the data. When these are used the result is an array, where the first item is the whole matched part, followed by one item for each group of brackets
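
To see these building blocks in action, here is a minimal sketch in JavaScript. The sample strings are made up for this illustration and are not part of the three problems.

```javascript
// Made-up sample text, only for illustration.
const sample = "<b>one</b><b>two</b>";

// Greedy: .* grabs as much as possible, so it runs up to the LAST </b>.
const greedy = /<b>(.*)<\/b>/.exec(sample)[1];

// Lazy: .*? grabs as little as possible, so it stops at the FIRST </b>.
const lazy = /<b>(.*?)<\/b>/.exec(sample)[1];

// [^<]+ matches one or more characters that are not "<".
const negated = /<b>([^<]+)/.exec(sample)[1];

// Capture groups: item 0 is the whole match, then one item per group.
const groups = /(\d+)-(\d+)/.exec("10-20");

console.log(greedy);    // one</b><b>two
console.log(lazy);      // one
console.log(negated);   // one
console.log(groups[0]); // 10-20
console.log(groups[1]); // 10
console.log(groups[2]); // 20
```

The difference between the greedy and the lazy version is exactly why the link example later on uses (.*?) for the link text.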

JavaScript

Basic Regex in JavaScript

In JavaScript a regex is enclosed in forward slashes like so:

/.*/

The benefit of this is that quotation symbols do not need to be escaped. However, the forward slash itself does need to be escaped. Since this is a common symbol in URLs, it is good to keep this in mind while constructing or copying regexes in JavaScript. The symbol used to escape is the backslash, so a regular expression matching a single forward slash in JavaScript would look like this:

/\//

As you can see, this can become complicated quite quickly.
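
If the escaped slashes bother you, one alternative is to build the regex from a string with the RegExp constructor. The sketch below is only an illustration of the trade-off: the forward slashes become plain characters, but every backslash that belongs to the regex now has to be doubled because it lives inside a string.

```javascript
const url = "https://www.andreasbijl.com/whatever";

// Literal form: the forward slash is the delimiter, so it must be escaped.
const literal = /https?:\/\/([^\/]+)/;

// Constructor form: no slash delimiter, so forward slashes need no escaping.
const built = new RegExp("https?://([^/]+)");

console.log(literal.exec(url)[1]); // www.andreasbijl.com
console.log(built.exec(url)[1]);   // www.andreasbijl.com

// The price: a regex backslash must be doubled inside the string,
// so the regex \d+ is written as "\\d+".
const digits = new RegExp("\\d+");
console.log(digits.exec("abc123")[0]); // 123
```

Which form reads better depends on the pattern: many slashes favour the constructor, many backslashes favour the literal.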

Solution for Problem 1 in JavaScript

var url = "https://www.andreasbijl.com/whatever";
/https?:\/\/([^\/]+)/.exec(url)[1];
//Result is: www.andreasbijl.com (group 1 from what is captured)

Solution for Problem 2 in JavaScript

//We need to escape the backslashes here as this is the escape symbol in JavaScript
var path = "C:\\Windows\\System32\\WindowsPowerShell\\";

/C:\\Windows\\System32\\([^\\]+)/.exec(path)[1];
// Result is WindowsPowerShell

Solution for Problem 3 in JavaScript

//We need to escape the double quotation symbols as this serves as delimiter for the string.
var linkElement = "<a href=\"https://www.andreasbijl.com/\" title='Mixed Quotes here'>A Link</a>";
/<a href="([^"]+)" title='([^']+)'>(.*?)<\/a>/.exec(linkElement);
//The result is an array where the first element will contain the entire link element
//The second item will contain the contents of the href attribute
//The third item will contain the contents of the title attribute
//The fourth item will contain the text of the link

Conclusion for JavaScript

In the first case we saw a negative side effect of regular expressions being enclosed by forward slashes. However, as you can see in example three, a benefit of this is that quotation symbols (single or double) do not need to be escaped (just don’t forget about the forward slash in the closing tag of the link element). The second example is relatively typical, as in regular expressions it is expected that you need to escape the escape character if you want to match it.
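
When part of the pattern comes from a plain string (a folder name, for example), escaping it by hand gets error-prone. A common approach is a small helper that escapes every character that is special in a regex. The function name escapeRegExp and the way the pattern is assembled below are my own choices for this sketch; it is not a built-in function.

```javascript
// Hypothetical helper (not built in): puts a backslash in front of every
// character that has a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// The fixed part of the path, written as a normal JavaScript string
// (so every backslash is doubled for the string).
const prefix = "C:\\Windows\\System32\\";

// The helper doubles the backslashes once more for the regex.
// "([^\\\\]+)" is the string form of the regex ([^\\]+).
const pattern = new RegExp(escapeRegExp(prefix) + "([^\\\\]+)");

const path = "C:\\Windows\\System32\\WindowsPowerShell\\";
console.log(pattern.exec(path)[1]); // WindowsPowerShell
```

Note that the capture group is written by hand and still needs four backslashes in the string: two for the string, and those two again for the regex.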

PowerShell

Basic Regex in PowerShell

In PowerShell a regular expression is enclosed by regular quotation symbols (single or double). What makes PowerShell different from the other two languages here is that the escape symbol inside a string is not the backslash but the backtick (`), which on US keyboards sits left of the 1 key, directly below the Escape key. Inside the regular expression itself the escape symbol is still the backslash.

Solution for Problem 1 in PowerShell

$url = "https://www.andreasbijl.com/whatever"
$url -Match "https?://([^/]+)"
$Matches[1]
#$Matches[1] will contain: www.andreasbijl.com

Solution for Problem 2 in PowerShell

#No need to escape the backslashes here as they are not the escape symbol
$path = "C:\Windows\System32\WindowsPowerShell\"
$path -Match "C:\\Windows\\System32\\([^\\]+)\\"
$Matches[1]
#$Matches[1] will contain: WindowsPowerShell

Solution for Problem 3 in PowerShell

#Note that the double quotes inside the element are escaped using the PowerShell escape symbol
$linkElement = "<a href=`"https://www.andreasbijl.com/`" title='Mixed Quotes here'>A Link</a>"
$linkElement -Match "<a href=`"([^`"]+)`" title='([^']+)'>(.*?)</a>"
$Matches
#$Matches will be a hashtable where the item with key 0 will contain the entire link element
#The item with key 1 will contain the contents of the href attribute
#The item with key 2 will contain the contents of the title attribute
#The item with key 3 will contain the text of the link

Conclusion Regex in PowerShell

Problem 1 was easy; nothing needed escaping. For Problem 2 only the backslash needed escaping, as the backslash is the escape symbol inside regular expressions (note that inside strings the backslash does not need to be escaped). For Problem 3 the only thing to keep in mind is to use the correct escape symbol while escaping the quotation marks: the backtick, not the backslash. The fact that in PowerShell you need to deal with two types of escape characters may make it more complicated to work with regular expressions. You constantly need to keep in mind whether you are escaping something for the regular expression or for the string itself.

C#

Basic Regex in C#

In C# the escape character inside a string is the same as inside a regular expression: the backslash. This can lead to an overdose of backslashes. However, there is an alternative: a literal (verbatim) string, prefixed with @, in which backslashes are not treated as escape characters. This brings its own complications to the party. I will be showing both variants for C#; choose whichever you prefer, as both have upsides and downsides.

Solution for Problem 1 in C#

string url = "https://www.andreasbijl.com/whatever";
System.Text.RegularExpressions.Regex.Match(url, "https?://([^/]+)");
//Result is: www.andreasbijl.com (group 1 from the resulting Match object)

Solution for Problem 2 in C#

//Example with normal strings
string path = "C:\\Windows\\System32\\WindowsPowerShell\\";
System.Text.RegularExpressions.Regex.Match(path, "C:\\\\Windows\\\\System32\\\\([^\\\\]+)\\\\");
// Result is WindowsPowerShell (group 1 from the resulting Match object)

//Example with literal strings
string path = @"C:\Windows\System32\WindowsPowerShell\";
System.Text.RegularExpressions.Regex.Match(path, @"C:\\Windows\\System32\\([^\\]+)\\");
// Result is WindowsPowerShell (group 1 from the resulting Match object)

Solution for Problem 3 in C#

//Example with normal strings
string linkElement = "<a href=\"https://www.andreasbijl.com/\" title='Mixed Quotes here'>A Link</a>";
System.Text.RegularExpressions.Regex.Match(linkElement, "<a href=\"([^\"]+)\" title='([^']+)'>(.*?)</a>");
//The result is a Match object with groups where the first group will contain the entire link element
//The second group will contain the contents of the href attribute
//The third group will contain the contents of the title attribute
//The fourth group will contain the text of the link
//Example with literal strings
string linkElement = @"<a href=""https://www.andreasbijl.com/"" title='Mixed Quotes here'>A Link</a>";
System.Text.RegularExpressions.Regex.Match(linkElement, @"<a href=""([^""]+)"" title='([^']+)'>(.*?)</a>");
//The result is a Match object with groups where the first group will contain the entire link element
//The second group will contain the contents of the href attribute
//The third group will contain the contents of the title attribute
//The fourth group will contain the text of the link

Conclusion Regex in C#

Problem 1 is very straightforward in C#; there are no characters that need escaping. Problem 3 is relatively easy as well: only the double quotation symbols need escaping, in both normal and literal strings. The only difference is the method of escaping: a backslash in normal strings, and two double quotation marks in a row in a literal string. Problem 2 is the most complicated for C#. As you can see, the regular expression using normal strings becomes very complex, as all backslashes need to be double escaped (once for the string and once for the regular expression). Using literal strings makes this a lot easier, as you only need to escape the backslashes once to accommodate the regular expression. Also, the path variable no longer needs any escaped backslashes at all, which makes it much easier to read.

General Conclusions

All three of the shown languages have their own upsides and downsides. JavaScript and PowerShell share the situation where the language adds a special character of its own on top of the regex escape character: the forward slash delimiter in JavaScript and the backtick string escape in PowerShell. This means you need to keep in mind which characters need to be escaped and how to do it. For C# the backslashes need to be double escaped (once for the string they are in and once for the regular expression). Using a literal string will reduce this to a single escape of backslashes; however, this means that double quotes need to be escaped by writing two double quotes in a row.

Quick overview:

JavaScript

  • Regex delimited by forward slashes
  • Need to escape forward slashes inside regular expression
  • Quotation marks are no problem inside regular expression

PowerShell

  • Regex delimited by normal string delimiters (single or double quotes)
  • Escape symbol inside a string is the backtick (`), on US keyboards left of the 1 key and below the Escape key
  • Escape symbol for regex is still the backslash
  • Keep in mind what you are escaping, a character in the string, or a symbol in the regular expression

C#

  • Regex delimited by normal string delimiters (double quotes)
  • Escape symbol for string and regular expression is the same, which leads to double escaped backslashes \\\\
  • A literal string can be used to prevent double escaping backslashes, but in turn it requires double quotation marks to be escaped by writing two of them in a row "".

Connect to SharePoint Online CSOM through ADFS with PowerShell

To manage a SharePoint Online environment I find the CSOM (Client Side Object Model) for SharePoint very useful. Until now we used a separate account for this. The UPN of this account was in this form: [account name]@[tenant name].onmicrosoft.com. This was very practical as it even allows access when ADFS is down.

Being one of the admins of the Office 365 environment, I was able to create such an account. However, there may be plenty of situations where one would like to query a site or site collection, but cannot use CSOM because of ADFS authentication. For the latter I found a solution which I will share here.

First of all, let’s look at the answer given to this question by “Brite Shiny” (who also asked the question). This lists the prerequisites needed to authenticate through ADFS.

Summarized these are needed:

  1. Uninstalled the SharePoint Online Management Shell – I found this was not necessary in my case. However, it may be necessary in other cases.
  2. Installed the latest SharePoint Online Client Components SDK (http://www.microsoft.com/en-us/download/details.aspx?id=42038). As “Brite Shiny” explains, note the “Online” part, as it is different from the SharePoint 2013 Client Components SDK
  3. Ensured that the Microsoft.SharePoint.Client and Microsoft.SharePoint.Client.Runtime dlls in the program are loaded from the 16 version in the Web Server Extensions folder

Besides this you also need at least PowerShell 3.0 (otherwise you can’t use the needed dlls).

PowerShell 3.0 can be downloaded as part of Windows Management Framework 3.0.

For the script I give credit to Michael Blumenthal. On his (old) blog he published the original script, to which I made some minor adjustments.

Short and sweet, here is the script:

Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"
$webUrl = Read-Host -Prompt "HTTPS URL for your SP Online 2013 site"
$username = Read-Host -Prompt "Email address for logging into that site"
$password = Read-Host -Prompt "Password for $username" -AsSecureString
$ctx = New-Object Microsoft.SharePoint.Client.ClientContext($webUrl)
$ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($username, $password)
$web = $ctx.Web
$lists = $web.Lists
$ctx.Load($lists)
$ctx.ExecuteQuery()
$lists | select Title

As you may notice, I specify the full path to each of the dlls so I am sure that the correct version is loaded.

As you may imagine, instead of just getting the Title property of all lists there is so much more that can be done. However, I leave it to each of you to decide how far you want to go in scripting against SharePoint Online.

Programming languages

I enjoy using different programming languages and styles, as most languages are designed for specific things. Different languages have different strengths and weaknesses.

Java

Java has great portability (supported on almost any platform) and OOP capabilities. Java has multiple easy to use editors, one of the most popular being Eclipse. The IDE is lightweight and is itself portable (so it can run from a USB stick and you can take it anywhere with you). Because of its popularity there is a lot of information to be found about Java on the internet.

C#

C# builds on the extensive .NET framework (as does VB.NET) and can create nice native looking applications for the Windows platform. It is also easy to pick up, and with a free programming environment (Visual Studio Community Edition) it takes little effort to get started. Because of the great popularity of the language a lot of information can be found on the internet. Apart from community support there is also support from Microsoft in the form of Technet or MSDN.

PowerShell

PowerShell is a language based on the .NET framework (just like C# and VB.NET). It comes by default with Windows 7/Windows Server 2008 R2 (PowerShell 2.0) and newer versions. It also includes a nice ISE (Integrated Scripting Environment), which comes as a feature that can be activated.

As it is built on top of .NET, the entire framework is available to work with. Every variable or return value is a .NET object. A particularly strong feature of PowerShell is the concept of piping, which allows the output of one cmdlet to be used as input of another cmdlet.

Another benefit is that it can be run without needing to build an executable. It combines scripting power (known from batch files) with the power of a full programming language supported by the entire .NET framework. This enables you to create a Windows dialog from a few lines of PowerShell code. Through this it is possible to start a PowerShell script and then display a GUI with buttons, graphs, sliders and so on.

Another advantage of PowerShell over C# is (IMHO) the ease of handling custom objects. Importing a csv will result in a collection of custom objects with a property for every column of the CSV file. Also, when creating custom objects you can add members (properties) programmatically as you go along (at runtime). These new objects can then be gathered in a collection of your custom objects. Finally these can be exported back to a csv or passed along to a function or cmdlet. This would not be easy to do in a language like C#, because you would have to define the objects beforehand and every change in the imported csv file would require a rewrite of the code.

C

When it comes to writing fast code, C/C++ is a common choice. I like C especially because it brings the programmer close to the core of programming. There are no strings, just arrays of characters. When traversing an array of integers the steps in memory are larger than when traversing a character array (aka string), which brings a better understanding of how much memory you are consuming. Currently memory is rarely an issue, but some understanding of this can be quite useful when programming for custom devices with limited memory. This is the case with integrated electronics in devices such as a camera or a remote control.

Web based

Web based languages differ from previously mentioned languages in the aspect that they only have html and css as output. Browsers then decide what happens with the rendering of the eventual output.

JavaScript

As some may have noted, I did not mention JavaScript as an output. This is because JavaScript itself cannot be rendered. It is the results of JavaScript that are rendered: changes in css, html or text. JavaScript is one of the most used web based languages. No matter what the server side system is, the client side logic is almost always JavaScript.

For JavaScript there are many (often minified) libraries available. A popular one is jQuery. When it comes to programming languages the focus is shifting more and more towards client side code inside a browser (like JavaScript), which requires future devices to have a browser and an internet connection. Complex calculations can still happen server side, but then through asynchronous calls to services.

PHP

For server side code there are many ways to go. A popular language is PHP (the site you are looking at right now runs on PHP). It is free to use and has many (free) editors available. PHP supports object oriented programming and is easy to get started with.

ASP.NET

Microsoft offers ASP.NET as a server side language. Like most of Microsoft’s languages it is supported by the .NET framework and can use most of the .NET libraries (because of its web based nature it does not support all Windows based features). There are many extensions on ASP.NET, like Microsoft’s MVC, which allows easy creation of data driven sites.

Handling big data with PowerShell

Background

Recently I was involved in a project where a large amount of e-mails with logs had to be inserted into a BI database. A colleague asked if I could assist in taking a raw text export from Outlook and transforming the data from inside the e-mails (including some metadata of the e-mails themselves) to a csv file with certain formatting.

I gladly accepted this challenge and after designing the basic logic of extracting the data and transforming it to the required format I ran into performance issues.

Our current input file was a small part of the total load (36 MB out of 20 GB of PST files). The input file contained over 1.5 million lines of text which needed to be transformed to approximately 500,000 lines in a CSV file.

I have transformed xml to csv before; in that case the input file was only 5 MB of XML. There I loaded the input file into memory and then wrote every extracted csv line directly to the file.

For the text file of 36 MB my idea was to use the same approach and write the 500,000 lines directly to the CSV file.

I first tested with a small portion of the file (100 out of 5695 parts of the input file). Writing every line directly to the output file took about 100 seconds. This would mean that the total file would take about 96 minutes. Since this file was only a small portion of the large total, I wanted to improve performance before applying this to the main bulk of data.

This got me to try to reduce IO and instead store the result in memory and write it as a whole after the whole file is completed. As before I tested this with a subset of 100 out of 5695 parts of the input file. This approach reduced the running time from 100 seconds to 3.2 seconds, which is over 30 times faster.

With this result I wanted to immediately run this theory on the entire test file (46 MB of text). I expected the script to finish in 3 to 4 minutes. However, after 20 minutes (during which the CPU core PowerShell was using was maxed out) it was still not complete. The cause was that, because of the large size, the physical memory was not sufficient, so swapping occurred (which resulted in the IO I was trying to avoid). This got me back to the drawing board to figure out a solution. First off, instead of storing the input file in a variable (through System.IO.File.ReadAllLines()), I found that if I put this call inside my loop PowerShell would claim less memory than before.

Foreach ($line in [System.IO.File]::ReadAllLines($inputPath))
{
    #Some code
}

The next improvement however is where I really gained most. Instead of writing every line or writing everything at once, I started writing batches of lines to the output file. In my case I started with batches of 100 (out of the 5695 parts) and wrote them to file as they were completed. With this configuration all 5695 parts were completed in 8 minutes (a lot faster than the previously estimated 96 minutes or the more than 20 minute approach).

I have yet to figure out what the perfect balance will be between the amount of lines to write at one time and the amount of disk IO. For me the optimal performance was around 220 lines in one batch, but this may be different for other similar solutions. Best tip I can give you here: keep tweaking to find your sweet spot between disk I/O and memory usage.

Solution

Below, in short, is how my solution was built to write csv in batches instead of “at once” or “per line”.

Param(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$InputFile,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$OutputFile,
    [Parameter(Mandatory=$false)]
    [int]$BatchSize = 100
)
[int]$Counter = 0;
[string]$csvHeader = "Column1;Column2;Column3"; #this might be a lot more columns..
$csvHeader | Out-File -FilePath $OutputFile -Encoding utf8; #write header to file
$collection = @();
foreach($line in [System.IO.File]::ReadAllLines($InputFile)) #this is mostly to save some memory as opposed to storing all lines in a variable
{
    #Do complex logic
    #Add items to $collection
    #Do more intensive stuff

    #Depending on which unit you want to count (csv line, parts, or other) you can place this in the appropriate location
    #In my case I did this every time I started with another of the 5695 parts, you can also use this to count csv lines,
    #but in that case I would recommend using a larger batch number (e.g. 10,000)
    $Counter++;

    if($Counter % $BatchSize -eq 0) #The following code will be executed whenever the current $Counter is divisible by $BatchSize (the modulus is zero).
    {
        $collection | Out-File -FilePath $OutputFile -Encoding utf8 -Append; #append current batch to file
        $collection = @() #clear the collection for the next batch
    }
}
$collection | Out-File -FilePath $OutputFile -Encoding utf8 -Append; #append the final lines (which did not reach the batch size limit)

PowerShell – Create collections of custom objects

Background

Today a colleague asked me how he could store object collections in memory (a PowerShell variable) instead of writing and reading to/from CSV files. While searching for it he found tons of examples, but most were written specifically for one target; he needed something more basic and flexible. He asked me if I already had a topic about it on my blog. Sadly I had to disappoint him: it wasn’t on my blog, yet. However, I knew the answer and I will now also share it on my blog.

Creating the collection

Let’s start off with creating an ArrayList in PowerShell:

$collectionVariable = New-Object System.Collections.ArrayList

Done. This is our generic collection, it can contain any PowerShell (.NET) object.

Before adding items to the collection, be very aware that when the collection is displayed as a table or exported to a CSV file, the fields of the first item dictate which columns are shown.

For example take object $A and object $B.
Object $A has two string fields: “fieldA” and “fieldB”
Object $B also has two string fields: “fieldB” and “fieldC”
If object $A is added to our new empty collection, the output would have two columns: “fieldA” and “fieldB”.
If we would then add $B to the same collection, its row would show an empty value under “fieldA” and no column for “fieldC” (fieldB would be shown normally). The object itself still has its “fieldC”; it is just not part of the output.
Keep this in mind when adding different types of objects to a collection.

Creating a custom object

Creating a custom object is easy:

$item = New-Object System.Object

This creates an empty System.Object object. This Object has no fields and only has four methods:
bool Equals(System.Object obj)
int GetHashCode()
type GetType()
string ToString()

This makes it an ideal object to start with as we can manually define every field.

So how do we add fields to our empty object?

Like this:

$item | Add-Member -MemberType NoteProperty -Name "Field1" -Value "value"

This example creates a field named “Field1” with the value “value”. You can also pass a variable as value or even a field of a different object. To add multiple fields just repeat the line with different “Name” values.

This method can also be used to add fields to existing objects. For example you can read a csv file, add fields (for example a calculated field based on values of other fields) to the objects and then add all of those to a new (empty) collection which you can then write to a csv again or process further.

Adding the custom object to the ArrayList

We now have an ArrayList and need to put our custom object in it.

For people who are used to .NET and the way lists work, the method will be mostly unsurprising. There is only one thing to keep in mind: the Add method returns the index of the new item in the list. If you do not need this (and don’t want a series of indexes appearing on the console) you can send the result to null, as in the example below:

$collectionVariable.Add($item) | Out-Null

This will add the custom object to our ArrayList and will ignore the returned value.

Putting it all together

For this example I add ten objects to an ArrayList; the ten objects are the same but you can modify this to your own specific situation.

$collectionWithItems = New-Object System.Collections.ArrayList
for($i = 0; $i -lt 10; $i++)
{
    $temp = New-Object System.Object
    $temp | Add-Member -MemberType NoteProperty -Name "Field1" -Value "Value1"
    $temp | Add-Member -MemberType NoteProperty -Name "Field2" -Value "Value2"
    $temp | Add-Member -MemberType NoteProperty -Name "Field3" -Value "Value3"
    $collectionWithItems.Add($temp) | Out-Null
}

If I then call $collectionWithItems it will return the collection. This is the output of $collectionWithItems when called after the for loop:

Field1                      Field2                      Field3
------                      ------                      ------
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3

Update (2017-06-13): Quicker/dirtier way to create objects with certain fields

Another way (though less pretty) is to do a select statement on any object; this will create a PSCustomObject with only the selected properties. Instead of the above example, where it took 4 lines to create an object with 3 properties, this object can be created with all three fields in 1 line. However, filling the fields still requires some additional lines, so it also ends up taking 4 lines to create and fill the object.

$collectionWithItems = New-Object System.Collections.ArrayList
for($i = 0; $i -lt 10; $i++)
{
    $temp = "" | select "Field1", "Field2", "Field3"
    $temp.Field1 = "Value1"
    $temp.Field2 = "Value2"
    $temp.Field3 = "Value3"
    $collectionWithItems.Add($temp) | Out-Null
}