Handling big data with PowerShell

Background

Recently I was involved in a project where a large number of e-mails containing logs had to be inserted into a BI database. A colleague asked if I could help transform a raw text export from Outlook, turning the data inside the e-mails (including some metadata of the e-mails themselves) into a CSV file with specific formatting.

I gladly accepted this challenge, and after designing the basic logic for extracting the data and transforming it to the required format, I ran into performance issues.

Our current input file was a small part of the total load (36 MB out of 20 GB of PST files). The input file contained over 1.5 million lines of text which needed to be transformed to approximately 500,000 lines in a CSV file.

I have transformed XML to CSV before, but in that case the input file was only 5 MB of XML. There I loaded the input file into memory and then wrote every extracted CSV line directly to the file.

For the text file of 36 MB my idea was to use the same approach and write the 500,000 lines directly to the CSV file.

I first tested with a small portion of the file (100 out of 5695 parts of the input file). Writing every line directly to the output file took about 100 seconds. This would mean that the total file would take about 96 minutes. Since this file was only a small portion of the large total, I wanted to improve performance before applying this to the main bulk of data.

This got me to try reducing I/O by storing the result in memory and writing it out in one go once the whole file had been processed. As before, I tested this with a subset of 100 out of 5695 parts of the input file. This approach reduced the running time from 100 seconds to 3.2 seconds, which is roughly 31 times faster.

With this result I wanted to immediately test this theory on the entire test file (36 MB of text). I expected the script to finish in 3 to 4 minutes. However, after 20 minutes (during which PowerShell maxed out the CPU) it was still not complete. The cause was the size of the data: physical memory was not sufficient, so swapping occurred (which resulted in exactly the I/O I was trying to avoid).

This got me back to the drawing board to figure out a solution. First off, instead of storing the input file in a variable (through [System.IO.File]::ReadAllLines()), I found that if I put this call inside my loop, PowerShell would claim less memory than before.
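The difference between the two ways of reading can be sketched as follows. This is a minimal illustration, not the original script: the sample file and the counter variables are made up, and the third variant ([System.IO.File]::ReadLines, which streams lines lazily) is an alternative that the text above does not mention.

```powershell
# Hypothetical sample file so the sketch runs on its own.
$inputPath = Join-Path ([System.IO.Path]::GetTempPath()) 'readlines-sample.txt'
[System.IO.File]::WriteAllLines($inputPath, [string[]]@('first', 'second', 'third'))

# Variant 1: the array of lines stays referenced by $allLines until the script ends.
$allLines = [System.IO.File]::ReadAllLines($inputPath)
$countFromVariable = 0
foreach ($line in $allLines) { $countFromVariable++ }

# Variant 2 (as described above): the call sits in the loop header,
# so no script variable keeps a reference to the array afterwards.
$countInLoop = 0
foreach ($line in [System.IO.File]::ReadAllLines($inputPath)) { $countInLoop++ }

# Variant 3 (an alternative, not from the original text): ReadLines yields
# one line at a time instead of building the full array first.
$countStreamed = 0
foreach ($line in [System.IO.File]::ReadLines($inputPath)) { $countStreamed++ }
```

How much memory each variant saves will depend on the input and the PowerShell version, so it is worth measuring on your own data.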

The next improvement, however, is where I gained the most. Instead of writing every line separately or writing everything at once, I started writing batches of lines to the output file. In my case I started with batches of 100 (out of the 5695 parts) and wrote them to file as they were completed. With this configuration all 5695 parts were completed in 8 minutes (a lot faster than the previously estimated 96 minutes or the approach that took more than 20 minutes).

I have yet to figure out the perfect balance between the number of lines to write at one time and the amount of disk I/O. For me the optimal performance was around 220 lines per batch, but this may be different for other similar solutions. Best tip I can give you here: keep tweaking to find your sweet spot between disk I/O and memory usage.

Solution

Below is a short outline of how my solution writes the CSV in batches instead of “at once” or “per line”.
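A minimal sketch of the batching idea, not necessarily identical to the original script. The file names, the sample input, the placeholder transformation and the variable names ($batchSize, $batch) are all assumptions made for illustration; the sketch batches output lines, whereas the real script may have batched per e-mail part.

```powershell
# Hypothetical paths and batch size; tune $batchSize for your own data.
$tempDir    = [System.IO.Path]::GetTempPath()
$inputPath  = Join-Path $tempDir 'batch-input.txt'
$outputPath = Join-Path $tempDir 'batch-output.csv'
$batchSize  = 100

# Create a small sample input so the sketch runs on its own.
$sampleLines = [string[]](1..250 | ForEach-Object { "part $_" })
[System.IO.File]::WriteAllLines($inputPath, $sampleLines)
if (Test-Path $outputPath) { Remove-Item $outputPath }

# Buffer that holds the CSV lines of the current batch.
$batch = New-Object 'System.Collections.Generic.List[string]'

foreach ($line in [System.IO.File]::ReadAllLines($inputPath)) {
    # Placeholder transformation: the real extraction logic would go here.
    $batch.Add(('"{0}";"{1}"' -f $line, $line.Length))

    # Flush the buffer to disk once it reaches the batch size.
    if ($batch.Count -ge $batchSize) {
        [System.IO.File]::AppendAllLines($outputPath, $batch)
        $batch.Clear()
    }
}

# Write whatever is left in the final, partial batch.
if ($batch.Count -gt 0) {
    [System.IO.File]::AppendAllLines($outputPath, $batch)
    $batch.Clear()
}
```

Keeping the buffer small limits memory use, while flushing only once per batch keeps the number of disk writes far below one per line.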

Get groups with users from SharePoint Online

One of the things PowerShell enables you to do with Office 365 (particularly SharePoint Online) is collecting bulk info. In this post I will be providing a nice little script which can be used to collect groups from site collections including the names of users in those groups.

The main reason you might want to collect this in advance is that the information takes quite some time to gather. Collecting it only at the moment it is needed would take an unnecessarily long time. If the data has already been collected, however, the requested information can be looked up quickly. The only real downside is that you will be working with “old” data. How old depends on how often you execute the function in this post.

Before going into detail about what the script does, let me elaborate about what goes in and what comes out.

There is one mandatory parameter which must be specified: “outputFullFilePath”. This will be the path where the csv will be stored. Providing an invalid or unreachable path will result in the output being lost.

Optional parameters are:

  • csvSeparator: this will be used as separator for the output csv file; by default its value is ‘;’
  • internalSeparator: this will be used as separator inside csv fields (make sure it is different from the csvSeparator); by default its value is ‘,’
  • selectSites: if set, you will be prompted to select the site collections for which the groups will be collected (this is a switch, so it requires no value; if omitted its value is false).
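The parameters described above could be declared roughly like this. It is a sketch under assumptions: the function name Show-GroupExportSettings is hypothetical, and the body only validates and echoes the settings instead of collecting any groups.

```powershell
# Hypothetical function name; only the parameter names and defaults come from the text.
function Show-GroupExportSettings {
    [CmdletBinding()]
    param(
        # Mandatory: PowerShell prompts for it when it is omitted.
        [Parameter(Mandatory = $true)]
        [string]$outputFullFilePath,

        # Separator between csv columns.
        [string]$csvSeparator = ';',

        # Separator between values inside a single csv field.
        [string]$internalSeparator = ',',

        # Switch: false unless specified on the command line.
        [switch]$selectSites
    )

    # Guard added for this sketch: identical separators would corrupt the csv.
    if ($csvSeparator -eq $internalSeparator) {
        throw 'csvSeparator and internalSeparator must be different.'
    }

    [pscustomobject]@{
        OutputFullFilePath = $outputFullFilePath
        CsvSeparator       = $csvSeparator
        InternalSeparator  = $internalSeparator
        SelectSites        = $selectSites.IsPresent
    }
}

# Example call that relies on the defaults.
$settings = Show-GroupExportSettings -outputFullFilePath 'C:\temp\groups.csv'
```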

The output will be a csv file with the following headers: SiteCollectionUrl, LoginName, Title, OwnerLoginName, OwnerTitle, GroupUsers, GroupRoles

If the output file is opened in Microsoft Excel, the columns can be used for filtering and searching. This makes it an easy way to find out who is in which group, or where a certain person has access across all selected site collections.

Important note: groups can only be collected if the account that runs the script is site collection admin. Tenant admin is not enough! The account has to be specified at each site collection as site collection admin.

Important note: before the following script can be run a connection to the Microsoft Online service and the SharePoint Online service must be established. For more information on how to achieve this, check out this previous post.

Here is the total script (further down I will highlight the main parts of the script):

At line 14 we create a generic collection (which can hold any type of object). At line 58 each group is added to this collection. At line 67 this collection is exported to the csv file which is specified at the outputFullFilePath parameter.
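A minimal sketch of this collect-then-export pattern, assuming the collection is a List of objects. The group objects and URLs below are made up and stand in for what the real script collects from SharePoint Online; the output path is a temporary file chosen for illustration.

```powershell
# Hypothetical output path; in the script this comes from the outputFullFilePath parameter.
$outputFullFilePath = Join-Path ([System.IO.Path]::GetTempPath()) 'groups-export.csv'

# Generic collection that can hold any type of object.
$siteCollectionGroups = New-Object 'System.Collections.Generic.List[System.Object]'

# Made-up group objects standing in for the groups collected per site collection.
$siteCollectionGroups.Add([pscustomobject]@{
    SiteCollectionUrl = 'https://contoso.sharepoint.com/sites/a'
    Title             = 'A Members'
    GroupUsers        = 'alice@contoso.com,bob@contoso.com'
})
$siteCollectionGroups.Add([pscustomobject]@{
    SiteCollectionUrl = 'https://contoso.sharepoint.com/sites/b'
    Title             = 'B Owners'
    GroupUsers        = 'carol@contoso.com'
})

# Export everything in one go once collection has finished.
$siteCollectionGroups |
    Export-Csv -Path $outputFullFilePath -Delimiter ';' -NoTypeInformation
```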

If the switch is set to manually select site collections, a prompt will be shown. This will be in the form of an Out-GridView (line 18). You can select multiple items with Ctrl or Shift. If manual selection of sites is off (not set), the groups of all site collections will be collected. Because of the time it takes to collect groups, it is advised to only collect the most important site collections. Keep in mind that the collection of the groups depends on the permissions of the account that runs the script. If the account is not site collection admin of a site, no groups will be collected for it and the host will show a red line mentioning the site collection URL (line 64).

Because the process may take a while, I added a progress indicator. It does not give an accurate estimate of the remaining time (as it only counts the number of site collections and not the remaining groups or users). For this, three variables are used. They are defined at lines 24 through 26. At line 29 the counter is raised by one for every site collection. At lines 30 through 32 the count is written to the host, including the URL of the current site collection. Note the switch “NoNewLine”, which means that the success or error message (lines 60 and 64) is placed after the counter on the same line instead of below it.
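The counter idea can be sketched like this. The site URLs, the variable names and the messages are assumptions for illustration, and the placeholder inside the try block stands in for the actual group collection.

```powershell
# Made-up site collection URLs standing in for the real list.
$siteUrls = @(
    'https://contoso.sharepoint.com/sites/a',
    'https://contoso.sharepoint.com/sites/b'
)

$total   = $siteUrls.Count
$counter = 0

foreach ($url in $siteUrls) {
    $counter++

    # -NoNewline keeps the cursor on this line so the result is appended after it.
    Write-Host ('[{0}/{1}] {2} ' -f $counter, $total, $url) -NoNewline

    try {
        # Placeholder for collecting the groups of this site collection.
        $null = $url.Length
        Write-Host 'done' -ForegroundColor Green
    }
    catch {
        # Shown in red on the same line as the site collection URL.
        Write-Host 'failed' -ForegroundColor Red
    }
}
```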

The main loops are quite simple. First there is a loop through all site collections (starts at line 27). Inside this loop there is a loop through all groups of each site collection (starts at line 35). For each group, all users are added to a string, as are all roles of the group (these are only roles on site collection / root site level). After the users and roles are collected, the site collection URL, the group's users and the group's roles are added to the group object (at lines 55 through 57). Finally the group object is added to the siteCollectionGroups collection.
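The inner step for a single group could look roughly like the sketch below. The $group object is a made-up stand-in; it assumes the real group objects (for example from Get-SPOSiteGroup) expose Users and Roles as collections, which should be verified against your module version.

```powershell
$internalSeparator = ','
$siteUrl = 'https://contoso.sharepoint.com/sites/a'   # made-up URL

# Stand-in for one group object; property names are assumptions.
$group = [pscustomobject]@{
    LoginName = 'Project Members'
    Title     = 'Project Members'
    Users     = @('alice@contoso.com', 'bob@contoso.com')
    Roles     = @('Edit', 'Read')
}

# Flatten the users and roles into single strings for the csv fields.
$groupUsers = $group.Users -join $internalSeparator
$groupRoles = $group.Roles -join $internalSeparator

# Attach the extra columns to the group object itself.
$group | Add-Member -NotePropertyName SiteCollectionUrl -NotePropertyValue $siteUrl
$group | Add-Member -NotePropertyName GroupUsers -NotePropertyValue $groupUsers
$group | Add-Member -NotePropertyName GroupRoles -NotePropertyValue $groupRoles

# Add the enriched group to the collection that is exported later.
$siteCollectionGroups = New-Object 'System.Collections.Generic.List[System.Object]'
$siteCollectionGroups.Add($group)
```

Joining with a separator that differs from the csv separator is what keeps a multi-user field inside a single column.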

At the bottom of the script there are three lines commented. The first of the three provides a brief explanation of the two following examples.

The first example (second comment line) is the minimum required use of the function. It only specifies the outputFullFilePath (if this parameter is omitted you will be prompted to enter it before the script is run).

The second example (third comment line) has all optional parameters; this includes the separators and the manual selection switch.

Save the script someplace, remove the hash (#) before one of the examples, and modify this as it suits your need. Then simply run the file and wait… After completion check the file in the location that is specified in the script and start working the numbers.

Because the file is in CSV format it is easy to load it in PowerShell and use scripting to quickly analyse data.
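As an illustration, the sketch below looks up every group that contains a given user. The sample rows, the file path and the user name are made up; the header names and the default separators (‘;’ between columns, ‘,’ inside fields) follow the description earlier in this post.

```powershell
# Hypothetical sample file with the same headers as the script output.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'groups-sample.csv'
@(
    'SiteCollectionUrl;LoginName;Title;OwnerLoginName;OwnerTitle;GroupUsers;GroupRoles'
    'https://contoso.sharepoint.com/sites/a;A Members;A Members;A Owners;A Owners;alice@contoso.com,bob@contoso.com;Edit'
    'https://contoso.sharepoint.com/sites/b;B Visitors;B Visitors;B Owners;B Owners;alice@contoso.com;Read'
    'https://contoso.sharepoint.com/sites/c;C Members;C Members;C Owners;C Owners;carol@contoso.com;Edit'
) | Set-Content -Path $csvPath

# Load the csv using the same column separator the export used.
$groups = Import-Csv -Path $csvPath -Delimiter ';'

# Find every group in which the given user appears.
$user = 'alice@contoso.com'
$hits = @($groups | Where-Object { ($_.GroupUsers -split ',') -contains $user })

# Show where the user has access and with which roles.
$hits | Select-Object SiteCollectionUrl, Title, GroupRoles
```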

In my next post I will share a follow-up script which collects external users across all site collections, using the output csv of this script as input.