Regex: Can you escape?

ARGH!! My regex is not working!! It is not matching what it should. What could the problem be?

The situation

Sometimes you need to match a regex containing special characters. Sometimes these characters are also special in the language you are working in. In this post I will go through a few different languages (JavaScript, PowerShell, C#) and for each I will point out a few pitfalls and how to work with them.

For each language we will look at the following situations (problems):

  1. Matching part of a URL (getting www.andreasbijl.com from https://www.andreasbijl.com/whatever)
  2. Matching part of a local path (getting WindowsPowerShell from C:\Windows\System32\WindowsPowerShell\)
  3. Matching quotation symbols (getting the href, title and text from <a href=”https://www.andreasbijl.com/” title=’Mixed Quotes here’>A Link</a>)

Quick explanation of used regex parts

  • [] square brackets match only characters that are inside the square brackets
  • [^] square brackets that start with a ^ match all characters excep the ones inside the square brackets
  • dot matches all characters
  • an asterisk * behind something means zero or more (where it will attempt to get as much as possible). It can be behind a dot, a character or a set of square brackets
  • an asterisk followed by a question mark *? means zero or more, but this time as few characters as possible.
  • a plus behind a character, dot or square brackets means 1 or more of the proceeding items.
  • normal brackets () are used to capture a certain part of the data, in case these are used the result is an array, where the first item is the whole matched part, and then there is an item for each group of brackets.

JavaScript

Basic Regex in JavaScript

In JavaScript a regex is enclosed in forward slashes like so:

The benefit of this is that Quotation symbols do not need to be escaped. However the forward slash itself does need to be escaped. Since this is a common symbol in URLs, it is good to keep this in mind while constructing or copying regexes in JavaScript. The symbol used to escape is the backslash, so a regular expressing matching a (one) forward slash in JavaScript would look like this:

As you may see this can be come complicated quite quickly.

Solution for Problem 1 in JavaScript

Solution for Problem 2 in JavaScript

Solution for Problem 3 in JavaScript

Conclusion for JavaScript

In the first case we has a negative side effect of regular expressions being enclosed by forward slashes. However, as you can see in example three, a benefit of this is that quotation symbols (single or double) do not need to be escaped (just don’t forget about the forward slash in the closing of the link element). The second example is relatively typical, as in regular expressions it is expected that you need to escape the escape character if you want to match it.

PowerShell

Basic Regex in PowerShell

In PowerShell a regular expression is enclosed by regular quotation symbols (single or double), however what makes PowerShell different from the other two languages here is that the escape symbol is not the backslash, but instead it is the the apostrophe sign, on US keyboards left of the 1 key directly below the escape key.

Solution for Problem 1 in PowerShell

Solution for Problem 2 in PowerShell

Solution for Problem 3 in PowerShell

Conclusion Regex in PowerShell

Problem 1 was easy; nothing needed escaping. For problem 2 only the backslash needed escaping as the backslash is the escape symbol inside regular expressions (note that for inside strings the backslash does not need to be escaped). For Problem 3 the only thing to keep in mind is that while escaping the quotation marks to use the correct escape symbol (not the forward slash). The fact that in PowerShell you need to deal with two types of escape characters may make it more complicated to work with regular expressions in PowerShell. You need to constantly keep in mind, if you need to escape something in the regular expression or in the string itself.

C#

Basic Regex in C#

In C# the escape character inside a string is the same as inside a regular expression. This can lead to an overdose in backslashes. However there is an alternative to provide a literal string where slashes are ignored, but this may bring different situations to the party. I will be showing both variants for C#, choose which you prefer, both have upsides and downsides.

Solution for Problem 1 in C#

Solution for Problem 2 in C#

Solution for Problem 3 in C#

Conclusion Regex in C#

Problem 1 is very straightforward in C#; no characters that need escaping. Problem 3 is relatively easy as well where only the double quotation symbols need escaping using either the normal or literal strings. The only difference is method of escaping, using a backslash in normal strings and a double, double quotation marks in a literal string. Problem 2 becomes most complicated for C# as you can see the regular expression using normal strings becomes very complex as all backslashes need to be double escaped (once for the string and once for the regular expression). Using the literal strings makes this a lot easier where you only need to escape the backslashes once to accommodate the regular expression. Also the path variable no longer needs and escaped backslashes at all, which makes it much easier to read as well.

General Conclusions

All three of the shown examples have their own up and down sides. For JavaScript and PowerShell they share the situation where the delimiter of the regular expression is different from the escape character used in the language. This means you need to keep in mind which characters need to be escaped and how to do it. For C# it both the backslashes need to be double escaped (once for the string they are in and once because it is needed for the regular expression). Using a literal string will reduce this to a single escape of backslashes, however this means that double quotes need to be escaped by using double, double quotes.

Quick overview:

JavaScript

  • Regex delimited by forward slashes
  • Need to escape forward slashes inside regular expression
  • Quotation marks are no problem inside regular expression

PowerShell

  • Regex delimited by normal string delimiters (single or double quotes)
  • Escape symbol inside a string is the apostrophe on US keyboard left of the 1 key, and below the escape key
  • Escape symbol for regex is still the backslash
  • Keep in mind what you are escaping, a character in the string, or a symbol in the regular expression

C#

  • Regex delimited by normal string delimiters (double quotes)
  • Escape symbol for string and regular expression is the same, which leads to double escaped backslashes \\\\
  • Using a literal string can be used to prevent double escaping backslashes, but in turn it requires double quotation marks to be escaped using double, double quotation marks “”.

Remove items by ID or server relative path in one line

Removing a specific set of items (this could also be documents) in SharePoint from a list (or Library) may be challenging. If all items could be removed of a few specific folders this could be easily done manually. But if only some items need to be removed (and in other similar cases) it may be useful to script the removal of items.

Assuming there is already a correct client context object $ctx (please have a look for some of my other posts if you want to know how to get a Client Context Object) I will show a few quick examples of how to remove an item if you know either its ID (and the title of the list or library it lives in) or its server relative path (only applies to files or folders).

By ID:

By Server relativeUrl (File):

By Server relativeUrl (Folder or Document Set):

You could optimize performance by running the ExecuteQuery command after each 100 removals. However, if the first of the 100 goes wrong, the other 99 are not processed. That is why I recommend to use this as mentioned above (have one ExecuteQuery for each item that is removed).

Connect to SharePoint Online CSOM through ADFS with PowerShell

To manage a SharePoint Online environment I find the CSOM (Client Side Object Model) for SharePoint very usefull. Untill now we used a separate account for this. The UPN of this account was in this form: [account name]@[tenant name].onmicrosoft.com. This was very practical as it even allows access when ADFS is down.

Being one of the admins of the Office 365 enviroment I was able to create such an account. However there may be plenty of situations when one would like to query a site or site collection, but cannot use CSOM because of ADFS authentication. For the latter I found a solution which I will share here.

First of all lets look the answer given to this question by “Brite Shiny” (who also asked the question). This lists the prerequisites needed to authenticate through ADFS.

Summarized these are needed:

  1. Uninstalled the SharePoint Online Management Shell – I found this was not necessary in my case. However, it may be necessary in other cases.
  2. Installed the latest SharePoint Online Client Components SDK (http://www.microsoft.com/en-us/download/details.aspx?id=42038). As “Brite Shiny” explains note the “Online” part, as it is different from the SharePoint 2013 Client Components SDK
  3. Ensured that the Microsoft.SharePoint.Client and Microsoft.SharePoint.Client.Runtime dlls in the program loaded from the 16 version in the Web Server Extensions folder

Besides this you also need at least PowerShell 3.0 (otherwise you can’t use the needed dlls).

PowerShell 3.0 can be downloaded as part of Windows Management Framework 3.0.

For the script I give credit to Michael Blumenthal. On his (old) blog he posted this post to which I made some minor adjustments.

Short and sweet, here is the script:

As you may notice, I specify the full path to each of the dll’s so I am sure that the correct version is loaded.

As you may imagine, in stead of just getting the Title property of all lists there is so much more that can be done. However I leave this to each to decide for their own how far they want to go to script against SharePoint Online.

Programming languages

I enjoy using different programming languages and styles. As most languages have specific things they are designed for. Different languages have different strengths and weaknesses.

Java

Java has great portability (supported on almost any platform) and OOP capabilities. Java has multiple easy to use editors. One of the most popular being Eclipse. The IDE is lightweight and itself is portable (so it can run from a USB stick and you can take it anywhere with you). Because of it’s popularity there is a lot of information to find about Java on the internet.

C#

C# builds on the extensive .NET framework (as well as VB.NET) and can create nice native looking applications for the Windows platform. It is also easy to pick up, and with a free programming environment (Visual Studio Community Edition) it take little effort to get started. Because of the great popularity of the language a lot of information can be found on the internet. Apart from community support there is also support from Microsoft in the form of Technet or MSDN.

PowerShell

PowerShell is a language based on the .NET framework (just like C# and VB.NET). It comes default with Windows 7/Windows Server 2008 R2 (PowerShell 2.0) and newer versions. It also included a nice ISE (Integrated Scripting Environment) which comes as a feature which can be activated.

As it is build on top of .NET the entire framework is available to work with. Every variable or return value is a .NET object. A particular strong feature of PowerShell is the concept of piping, which allows the output of one cmdlet to be used as input of another cmdlet.

Another benifit is it can be run without needing to build an executable. It combines scripting power (known from batch files) with the power of a full programming language supported by the entire .NET framework. This enables you to create a Windows dialog from a few lines of PowerShell code. Through this it is possible to start a PowerShell script and then display a GUI with buttons, graphs, sliders and so on.

Also an advantage of PowerShell over C# is (IMHO) the ease of handling custom objects. Importing a csv will result in a collection of custom objects with properties with the names of every column of the CSV file. Also when creating custom objects you can add members  (properties) programmatically as you go along (runtime). These new objects can then be gathered is a collection of your custom objects. Finally these can be exported back to a csv or passed along to a function or cmdlet. This would not be easy to do in a language like C# because you would have to define the objects beforehand and every change in the imported csv file would require a rewrite of the code.

C

When it comes to writing fast code C/C++ is a commonly used language. I like C especially because it brings the programmer close to the core of programming. There are no strings, just arrays of characters. When traversing an array of integers the steps in memory are larger then when traversing a character array (aka string). Which brings a better understanding of how much memory you are consuming. Currently memory is rarely an issue. Some understanding of this can be quite useful when programming for custom devices with limited memory. This is the case with integrated electronics in devices such as a camera or a remote control.

Web based

Web based languages differ from previously mentioned languages in the aspect that they only have html and css as output. Browsers then decide what happens with the rendering of the eventual output.

JavaScript

As some have noted I did not mention JavaScript as an output. This is because JavaScript itself cannot be rendered. It are the results of JavaScript that are rendered. Changes in css, html or text. JavaScript is one of the most used web based language. No matter what the server side system is, the client side logic almost always is JavaScript.

For JavaScript there are many (often minified) libraries available. A popular one is Jquery. When it comes to programming languages the focus is shifting more and more towards client side code inside a browser (like JavaScript). Which requires future devices to have a browser and an internet connection. Complex calculations can still happen server side. But then through asynchronous calls to services.

PHP

For server sides code there are many ways to go. A popular language is PHP (the site you are looking at right now runs on PHP.), it is free to use and has many (free) editors available. PHP supports Object Oriented programming and is easy to get started with.

ASP.NET

Microsoft offers ASP.NET as server side language. As most of Microsofts languages it is supported by the .NET framework and can use most of the .NET libraries (because the web based nature it does not support all windows based features). There are many extensions on ASP.NET, like Microsofts MVC which allows easy creation of data driven sites.

Handling big data with PowerShell

Background

Recently I was involded in a project where a large amount of e-mails with logs had to inserted into a BI database. A colleague asked if I could assist in transforming a raw text export from outlook and transform the data from inside the e-mails (including some metadata of the e-mails themselves) to a csv file with certain formatting.

I gladly accepted this challenge and after designing the basic logic of extracting the data and transforming it to the required format I ran into performance issues.

Our current input file was a small part of the total load (36 MB out of 20 GB of PST files). The input file contained over 1.5 million lines of text which needed to be transformed to approximately 500,000 lines in a CSV file.

I have transformed xml to csv before, in this case the input file was only 5MB of XML. Here I loaded the inputfile into memory and then wrote every extracted csv line directly to the file.

For the text file of 36 MB my idea was to use the same approach and write the 500,000 lines directly to the CSV file.

I first tested with a small portion of the file (100 out of 5695 parts of the input file). Writing every line directly to the output file costed about 100 seconds. This would mean that the total file would take about 96 minutes. Since this file was only a small portion of the large total I wanted to improve performance before applying this to the main bulk of data.

This got my to try and reduce IO and instead store the result in memory and write it as a whole after the whole file is completed. As before I tested this with a subset of 100 out of 5695 parts of the input file. This approach reduced the running time from 100 seconds to 3.2 seconds; a reduction of 33 times.

With this result I wanted to immediately run this theory on the entire test file (46 MB of text). I expected the script to finish in 3 to 4 minutes. However after 20 minutes (during which the CPU PowerShell was using was maxed out) it was still not complete. The cause of this was because of the large size, the physical memory was not sufficient so swapping occured (which resulted in the IO I was trying to avoid). This got me back to the drawing board to figure out a solution. First of, instead of storing the input file in a variable (through System.IO.File.ReadAllLines()) I found that if I put this inside my loop PowerShell would claim less memory as before.

The next improvement however is where I really gained most. Instead of writing every line or writing everything at once I started writing batches of lines to the output file. I my case I started with batches of 100 (out of the 5695 parts) and write them to file as they where complete. With this configuration all 5695 parts where completed in 8 minutes (a lot faster then the previously estimated 96 minutes or the more than 20 minute approach).

I have yet to figure out what the perfect balance will be between the amount of lines to write at one time and the amount of disk IO. For me the optimal performance was around 220 line in one batch, but this is may be different for other similar solutions. Best tip I can give you here: keep tweaking to find your sweet spot between disk I/O and Memory usage.

Solution

Below in short how my solution was built to write csv in batches instead of “at once” or “per line”.

PowerShell – Create collections of custom objects

Background

Today a colleague asked my how he could store object collections in memory (a PowerShell variable) instead of writing and reading to/from CSV files. While searching for it he found tons of examples but most were written specifically for one target, he needed something more basic and flexible. He asked me if I already had a topic about it on my blog. Sadly I had to disappoint him, it wasn’t on my blog, yet. However I knew the answer and I will now also share this on my blog.

Creating the collection

Lets start of with creating an ArrayList in PowerShell:

Done. This is our generic collection, it can contain any PowerShell (.NET) object.

Before adding items to the collection be very aware that the fields of the first item added dictate which fields the collection will have.

For example take object $A and object $B.
Object $A has two string fields: “fieldA” and “fieldB”
Object $B also has two string fields: “fieldB” and “fieldC”
If object $A is added to our new empty collection, the collection would then have two fields: “fieldA” and “fieldB”.
If we would then add $B to the same collection the item in the collection would have an empty value in the field “fieldA” and no field “fieldC” (fieldB would be added normally to the list).
Keep this in mind when adding different types of objects to a collection.

Creating a custom object

Creating a custom object is easy:

This creates an empty System.Object object. This Object has no fields and only has four methods:
bool Equals(System.Object obj)
int GetHashCode()
type GetType()
string ToString()

This makes it an ideal object to start with as we can manually define every field.

So how do we add fields to our empty object?

Like this:

This example create a field named “Field1” with the value “value”, you can also pass a variable as value or even a field of a different object. For adding multiple field just repeat the line with different “Name” values.

This method can also be used to add fields to existing objects. For example you can read a csv file, add fields (for example a calculated field based on values of other fields) to the objects and then add all of those to a new (empty) collection which you can then write to a csv again or process further.

Adding the custom object to the ArrayList

We now have an ArrayList and need to put our custom object in it.

For people who are used to .NET and the way the lists work the method will be mostly unsurprising. There is only one thing to keep in mind, the Add method returns the index for the new item in the array. If you do not need this (and don’t want a series of indexes appearing on the console) you could output the result to null as in below example:

This will add the custom object to our ArrayList and will ignore the returned value.

Putting it all together

For this example I add ten objects to an ArrayList; the ten objects are the same but you can modify this to your own specific situation.

If I would then call $collectionWithItems it will return the collection. This is what the output is of $collectionWithItems when called after the for loop:

Update (2017-06-13): Quicker/dirtier way to create objects with certain fields

Another way (though less pretty) is to do a select statement on any object, this will create a PSCustomObject with only the selected properties. Instead of the above example where it took 4 lines to create an object with 3 properties, this object can be created with all three fields in 1 line. However adding the contents to the fields might still require some additional lines which makes it also end up with 4 lines to create and fill the object.

 

Collect site collection information from Office 365

Collecting users and groups from Office 365 is a relatively long running script (as mentioned in my previous post). Fortunately collection more general site collection information is easier. There is not very much data that can be collected, this is because a limitation in PowerShell’s access to SharePoint Online (before the upgrade to SharePoint 2013 this was not possible at all for SharePoint Online).

This script is a fairly short one but can be expanded in several ways (of which I will offer a few suggestions).

What it all comes down to can be summarized in this one-liner (don’t forget you will need to be connected to the Office 365 Service and the SharePoint Online service before running the scripts on this post):

In this case you may replace the variable $outputPath with the path where you want to have the results written to.

The following example is what I use at a client to monitor site collection storage quota’s (among with a few other things):

Now compared to the one-liner it may seem a bit overwhelming, but the output will be more relevant then before (where all values of all sites where returned).

For example if you pass the parameter -selectSites you will be able to select the sites from which you want to collect data (which I personally find the easiest way to choose which sites I want to receive details about), another option would be to replace the Out-GridView by a Where-Object and filter the results in that way.

In lines 25 through 28 I calculate the percentage of the assigned stored that is in use. If you are not interested in this number you can simply remove this from the script (in this casealso remove line 37 which will try to write the percentage), however I think that in most cases this is relevant and useful information. You could also add other calculations (for example a percentage of used Resources or the ammount of sub webs).

The part from line 30 through 38 does nothing more than adding only the data that is relevant for my report to the output. If you want to collect additional information (like Resource usage etc.) you can add this to the $temp object just like the other properties are added.

Using .NET objects with PowerShell

Example of showing a folder dialog

In this short example I will demonstrate how easy it is to use .net objects from PowerShell. This practically enables you to develop straight from PowerShell (although there is no designer.cs nor a visual interface).

In the example I will open a dialog were the user can select a folder which then will be stored in a variable (only if the user clicked “OK”).

Here is the script:

First I check if System.Windows.Forms is already loaded (line 2)
[appdomain]::CurrentDomain.GetAssemblies() returns all currently loaded assemblies. If none of them have the name “System.Windows.Forms” it will be loaded then (at line 4).

In the same way other assemblies can be loaded (like System.Drawing, System.IO, System.Net etc.).

After this I simply create a FolderBrowserDialog object (line 6) using the default constructor with no parameters.

Next I call the “ShowDialog” function on the FolderBrowserDialog object and store the result in a DialogResult object (line 7)

If this object is equal to DialogResult.OK (line 9) the path will be stored in the $path variable (line 11).

Finally the $path variable is returned (line 13). This will be $null if the dialog was canceled.

Simple search with PowerShell

One day I was looking for a certain video file on my computer but I didn’t know where to look, however I new the name (in this case it was a video I posted on YouTube and there I could easily find the original file name).

With the standard search bar in the Windows Explorer window the file could not be found (it was not indexed). I figured: “how hard can it be to use PowerShell to look for a file with a certain name?”. I limited my scope to only search based on the file name. After this I constructed a fairly simple but effective script which did just that. Not only did I find the file, it even appeared multiple times on my 2TB drive (several copies of the same file). It went through the 2TB drive faster then I would expect (benefit is this case was the limited scope which only looked at the names of files).

I wanted to share this simple (and really, it doesn’t get much more straight forward then this) script which enables empowers you to search large amounts of files as long as you know a part of the filename. If you are familiar with regular expressions you can use this in the search. If you are not familiar with them you can still use a part of the file name.

This is the script:

To make the function easier to work with I also created an alias “ff” as you can see in the last line.

Most of the lines consist of the two mandatory parameters: Path and regexSearchString (don’t worry if you do not know much about regular expressions; normal text will also work).

The actual search is basically a one liner. It collects all files below the provided path (as you can see by the switches “-File” and “-Recurse”. In addition if no files are found (or the path is no valid path) the result will be $null, which indicates there are no results for the current patch/search string combination. When all files are collected there is a filter which matches each file name against the regular expression (or search string).

If called directly it will simply write the result to the screen:

It makes more sense to store the result in a variable like this:

Here $result will contain a collection fileInfo objects; if there is only one match it will be no collection but it will be a fileInfo object directly.

If you are only interested in the location of the files you may choose to only collect the FullName (path+filename) property of the objects. You can do this by piping it to a select:

Here $result will be a collection of strings (or a single string if there is only one match).

If you are looking for a directory of which you know the name, the script can easily be modified to look for directories. Simply replace “-File” with “-Directory”, the rest works the same.

Get groups with users from SharePoint Online

One of the things PowerShell enables you to do with Office 365 (particularly SharePoint Online) is collecting bulk info. In this post I will be providing a nice little script which can be used to collect groups from site collections including the names of users in those groups.

The main reason you might want to collect this is the information takes quite some time to be collected. By the time the information would be needed It would take a long unnecessarily amount of  time. If the data however is already collected the requested information can be looked up quickly. The only real downside is that your data used will be “old” data. How old depends on how often you execute the function in this post.

Before going into detail about what the script does, let me elaborate about what goes in and what comes out.

There is one mandatory parameter which must be specified: “outputFullFilePath”. This will be the path where the csv will be stored. Providing an invalid or unreachable path will result in the output being lost.

Optional parameters are:

  • csvSeparator: this will be used as separator for the output csv file, by default its value is ‘;’
  • internalSeparator: this will be used as separator inside csv fields (make sure it is different from the csvSeperator), by default its value = ‘,’
  • selectSites: if selected you will be prompted to select of which site collections the groups will be collected (this is a switch it requires no value, if omitted its value is false).

The output will be a csv file with the following headers: SiteCollectionUrl, LoginName, Title, OwnerLoginName, OwnerTitle, GroupUsers, GroupRoles

If the output file is opened in Microsoft Excel the columns can be used for filtering and searching. Making it an east way to find out who is in which group or where a certain person has access over all selected site collections.

Important note: groups can only be collected if the account that runs the script is site collection admin. Tenant admin is not enough! The account has to be specified at each site collection as site collection admin.

Important note: before the following script can be run a connection to the Microsoft Online service and the SharePoint Online service must be established. For more information on how to achieve this, check out this previous post.

Here is the total script (further down I will highlight the main parts of the script):

At line 14 we create a generic collection (which can hold any type of object). At line 58 each group is added to this collection. At line 67 this collection is exported to the csv file which is specified at the outputFullFilePath parameter.

If the switch is set to manually select site collections a prompt will be shown. This will be in form of an Out-Gridview (line 18). You can select multiple items with Ctrl or Shift. If manual selection of sites is off (not set) then the groups of all site collections will be collected. Because of the time it takes to collect groups it is advised to only collect the most important site collections. Keep in mind that the collection of the groups is dependant on the permissions of the account that runs them. If the account is not site collection admin of one site no groups will be collected and the host will show a red line where the site collection URL is mentioned (line 64).

Because the process may take a while I added a progress indicator. It does not give an accurate estimation for the remaining time (as it only counts the amount of site collections and not the remaining groups or users). For this three variables are used. They are defined at lines 24 through 26. At line 29 the counter is raised by one for every site collection. At line 30 through 32 the count is written to the host including the URL of the current site collection. Note the switch “NoNewLine” which means that the success or error message (lines 60 and 64) are places behind it in stead of below the counter.

The main loops are quite simple. First there is a loop through all site collections (starts at line 27). Inside this loop there is a loop which loops all groups for each site collection (starts at line 35). Inside each group, all users are added to a string, also all roles of the group (these are only roles on site collection / root site level). After the users and roles are collected the site collection URL, the groups users and the group roles are added to the group object (at lines 55 through 57). Finally the group object is added to the siteCollectionGroups collection.

At the bottom of the script there are three lines commented. The first of the three provides a brief explanation of the two following examples.

The first example (second comment line) is a minimum required use of the function. It only specifies the outputFullFilePath (if this parameter is omitted you will be prompted to enter it before the script is ran.

The second example (third comment line) has all optional parameters, this includes the separators and the manual selection switch.

Save the script someplace, remove the hash (#) before one of the examples, and modify this as it suits your need. Then simply run the file and wait… After completion check the file in the location that is specified in the script and start working the numbers.

Because the file is in CSV format it is easy to load it in PowerShell and use scripting to quickly analyse data.

In my next post I will share a followup script which collects external users over all site collections using the output csv of this script as input.