Handling big data with PowerShell

Background

Recently I was involved in a project where a large number of e-mails containing logs had to be inserted into a BI database. A colleague asked if I could assist in transforming a raw text export from Outlook: the data inside the e-mails (including some metadata of the e-mails themselves) had to be converted to a CSV file with specific formatting.

I gladly accepted the challenge, but after designing the basic logic for extracting the data and transforming it to the required format, I ran into performance issues.

Our current input file was only a small part of the total load (36 MB out of 20 GB of PST files). It contained over 1.5 million lines of text, which needed to be transformed into approximately 500,000 lines of CSV.

I had transformed XML to CSV before, but in that case the input file was only 5 MB of XML. There I loaded the input file into memory and then wrote every extracted CSV line directly to the output file.

For the 36 MB text file my idea was to use the same approach and write the 500,000 lines directly to the CSV file.

I first tested with a small portion of the file (100 out of the 5,695 parts of the input file). Writing every line directly to the output file took about 100 seconds, which extrapolates to roughly 96 minutes for the complete file. Since this file was only a small portion of the total, I wanted to improve performance before applying the script to the main bulk of the data.

This led me to try to reduce I/O: store the result in memory and write it all at once after the whole file had been processed. As before, I tested this with a subset of 100 out of the 5,695 parts. This approach reduced the running time from 100 seconds to 3.2 seconds, roughly a 31-fold speedup.

With this result I immediately wanted to run the same approach on the entire test file (36 MB of text). I expected the script to finish in 3 to 4 minutes; however, after 20 minutes (during which PowerShell maxed out its CPU) it was still not complete. The cause: at this size, physical memory was no longer sufficient and swapping occurred, which reintroduced exactly the disk I/O I was trying to avoid. This got me back to the drawing board. First off, I found that instead of storing the input file in a variable via System.IO.File.ReadAllLines() and looping over that, calling it directly inside the loop made PowerShell claim less memory:

foreach ($line in [System.IO.File]::ReadAllLines($inputPath))
{
    # Some code
}

The next improvement, however, is where I really gained the most. Instead of writing every line individually or writing everything at once, I started writing batches of lines to the output file. In my case I started with batches of 100 (out of the 5,695 parts) and wrote them to the file as they were completed. With this configuration all 5,695 parts were done in 8 minutes: a lot faster than the previously estimated 96 minutes, or the in-memory approach that ran for more than 20 minutes.

I have yet to figure out the perfect balance between the number of lines to write at one time and the amount of disk I/O. For me the optimum was around 220 lines per batch, but this may differ for other, similar solutions. The best tip I can give you here: keep tweaking to find your sweet spot between disk I/O and memory usage.
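To find that sweet spot, one approach is to time a few candidate batch sizes with Measure-Command. This is only a sketch: the script name Convert-LogToCsv.ps1 and its file paths are placeholders for your own batch-writing script.

$batchSizes = 50, 100, 220, 500, 1000
foreach ($size in $batchSizes)
{
    # Measure-Command returns a TimeSpan for the enclosed script block
    $elapsed = Measure-Command {
        .\Convert-LogToCsv.ps1 -InputFile .\input.txt -OutputFile .\out.csv -BatchSize $size
    }
    Write-Host ("BatchSize {0,5}: {1:N1} seconds" -f $size, $elapsed.TotalSeconds)
}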

Solution

Below, in short, is how my solution was built to write the CSV in batches instead of “at once” or “per line”.

Param(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$InputFile,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$OutputFile,
    [Parameter(Mandatory=$false)]
    [int]$BatchSize = 100
)
[int]$Counter = 0;
[string]$csvHeader = "Column1;Column2;Column3"; #this might be a lot more columns...
$csvHeader | Out-File -FilePath $OutputFile -Encoding utf8; #write the header to the file
$collection = @();
foreach($line in [System.IO.File]::ReadAllLines($InputFile)) #reading inside the loop saves memory compared to storing all lines in a variable first
{
    #Do complex logic
    #Add items to $collection
    #Do more intensive stuff

    #Depending on which unit you want to count (CSV lines, parts, or something else) place this in the appropriate location.
    #In my case I incremented it every time I started another of the 5,695 parts. You can also count CSV lines,
    #but then I would recommend a larger batch size (e.g. 10,000).
    $Counter++;

    if($Counter % $BatchSize -eq 0) #executed whenever $Counter is divisible by $BatchSize (the modulus is zero)
    {
        $collection | Out-File -FilePath $OutputFile -Encoding utf8 -Append; #append the current batch to the file
        $collection = @(); #clear the collection for the next batch
    }
}
$collection | Out-File -FilePath $OutputFile -Encoding utf8 -Append; #append the final lines (which did not reach the batch size limit)

PowerShell – Create collections of custom objects

Background

Today a colleague asked me how he could store object collections in memory (a PowerShell variable) instead of writing to and reading from CSV files. While searching he found tons of examples, but most were written for one specific scenario; he needed something more basic and flexible. He asked me if I already had a topic about it on my blog. Sadly I had to disappoint him: it wasn’t on my blog, yet. I knew the answer, however, and I will now also share it here.

Creating the collection

Let's start off with creating an ArrayList in PowerShell:

$collectionVariable = New-Object System.Collections.ArrayList

Done. This is our generic collection, it can contain any PowerShell (.NET) object.

Before adding items to the collection, be aware that the fields of the first item added dictate which fields PowerShell's default table output shows for the whole collection (each object still keeps its own fields, but not all of them are displayed).

For example, take object $A and object $B.
Object $A has two string fields: “fieldA” and “fieldB”.
Object $B also has two string fields: “fieldB” and “fieldC”.
If object $A is added to our new, empty collection, the output columns would be “fieldA” and “fieldB”.
If we then add $B to the same collection, its row would show an empty value for “fieldA” and no “fieldC” column (“fieldB” would be shown normally).
Keep this in mind when adding different types of objects to one collection.
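A minimal sketch of this display behaviour (the property names are just examples):

$list = New-Object System.Collections.ArrayList
$A = "" | Select-Object fieldA, fieldB
$B = "" | Select-Object fieldB, fieldC
$list.Add($A) | Out-Null
$list.Add($B) | Out-Null
$list   # the table shows only the columns fieldA and fieldB;
        # $B's row has an empty fieldA and its fieldC is not displayed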

Creating a custom object

Creating a custom object is easy:

$item = New-Object System.Object

This creates an empty System.Object object. This Object has no fields and only has four methods:
bool Equals(System.Object obj)
int GetHashCode()
type GetType()
string ToString()

This makes it an ideal object to start with as we can manually define every field.

So how do we add fields to our empty object?

Like this:

$item | Add-Member -MemberType NoteProperty -Name "Field1" -Value "value"

This example creates a field named “Field1” with the value “value”; you can also pass a variable as the value, or even a field of a different object. To add multiple fields, just repeat the line with different “Name” values.

This method can also be used to add fields to existing objects. For example, you can read a CSV file, add fields to the objects (such as a calculated field based on the values of other fields) and add all of them to a new (empty) collection, which you can then write to a CSV again or process further.
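As a sketch of that idea (the file names and the columns “Price” and “Quantity” are hypothetical):

$enriched = New-Object System.Collections.ArrayList
foreach ($row in Import-Csv -Path .\input.csv)
{
    # Add a calculated field based on two assumed numeric columns
    $row | Add-Member -MemberType NoteProperty -Name "Total" -Value ([double]$row.Price * [int]$row.Quantity)
    $enriched.Add($row) | Out-Null
}
$enriched | Export-Csv -Path .\output.csv -NoTypeInformation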

Adding the custom object to the ArrayList

We now have an ArrayList and need to put our custom object in it.

For people who are used to .NET and the way its lists work, the method will be mostly unsurprising. There is only one thing to keep in mind: the Add method returns the index of the new item in the list. If you do not need this (and don’t want a series of indexes appearing on the console), you can send the result to null as in the example below:

$collectionVariable.Add($object) | Out-Null

This will add the custom object to our ArrayList and will ignore the returned value.

Putting it all together

For this example I add ten objects to an ArrayList; the ten objects are the same but you can modify this to your own specific situation.

$collectionWithItems = New-Object System.Collections.ArrayList
for($i = 0; $i -lt 10; $i++)
{
    $temp = New-Object System.Object
    $temp | Add-Member -MemberType NoteProperty -Name "Field1" -Value "Value1"
    $temp | Add-Member -MemberType NoteProperty -Name "Field2" -Value "Value2"
    $temp | Add-Member -MemberType NoteProperty -Name "Field3" -Value "Value3"
    $collectionWithItems.Add($temp) | Out-Null
}

If I then call $collectionWithItems, it returns the collection. This is the output of $collectionWithItems when called after the for loop:

Field1                      Field2                      Field3
------                      ------                      ------
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3
Value1                      Value2                      Value3

Update (2017-06-13): Quicker/dirtier way to create objects with certain fields

Another (though less pretty) way is to run a select statement on any object; this creates a PSCustomObject with only the selected properties. Where the example above took four lines to create an object with three properties, this object can be created with all three fields in one line. Filling the fields may still require additional lines, however, so it can also end up at four lines to create and fill the object.

$collectionWithItems = New-Object System.Collections.ArrayList
for($i = 0; $i -lt 10; $i++)
{
    $temp = "" | select "Field1", "Field2", "Field3"
    $temp.Field1 = "Value1"
    $temp.Field2 = "Value2"
    $temp.Field3 = "Value3"
    $collectionWithItems.Add($temp) | Out-Null
}
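On PowerShell 3.0 and later there is also the [PSCustomObject] type accelerator, which creates and fills the object in a single expression. A sketch of the same loop using that syntax:

$collectionWithItems = New-Object System.Collections.ArrayList
for($i = 0; $i -lt 10; $i++)
{
    # Create and fill the object in one expression
    $temp = [PSCustomObject]@{
        Field1 = "Value1"
        Field2 = "Value2"
        Field3 = "Value3"
    }
    $collectionWithItems.Add($temp) | Out-Null
}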


Using .NET objects with PowerShell

Example of showing a folder dialog

In this short example I will demonstrate how easy it is to use .NET objects from PowerShell. This practically enables you to develop straight from PowerShell (although there is no designer.cs nor a visual interface).

In the example I will open a dialog where the user can select a folder, which will then be stored in a variable (only if the user clicked “OK”).

Here is the script:

function Get-Folder { 
    if(!([appdomain]::CurrentDomain.GetAssemblies() | ? {$_.GetName().Name -eq "System.Windows.Forms"}))
    {
        Add-Type -AssemblyName System.Windows.Forms
    }
    [System.Windows.Forms.FolderBrowserDialog]$fbdlg = New-Object System.Windows.Forms.FolderBrowserDialog
    [System.Windows.Forms.DialogResult]$result = $fbdlg.ShowDialog();
    $path = $null;
    if($result -eq [System.Windows.Forms.DialogResult]::OK)
    {
        $path = $fbdlg.SelectedPath;
    }
    return $path
}

First I check whether System.Windows.Forms is already loaded (line 2):
[appdomain]::CurrentDomain.GetAssemblies() returns all currently loaded assemblies. If none of them is named “System.Windows.Forms”, the assembly is loaded (line 4).

In the same way other assemblies can be loaded (like System.Drawing, System.IO, System.Net etc.).

After this I simply create a FolderBrowserDialog object (line 6) using the default constructor with no parameters.

Next I call the “ShowDialog” function on the FolderBrowserDialog object and store the result in a DialogResult object (line 7).

If this object is equal to DialogResult.OK (line 9) the path will be stored in the $path variable (line 11).

Finally the $path variable is returned (line 13). This will be $null if the dialog was canceled.
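A short usage sketch of the function above:

$folder = Get-Folder   # opens the folder browser dialog
if ($folder -ne $null)
{
    Write-Host "Selected folder: $folder"
}
else
{
    Write-Host "Dialog was canceled"
}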