Handling big data with PowerShell

Background

Recently I was involved in a project where a large number of e-mails containing logs had to be inserted into a BI database. A colleague asked if I could assist in taking a raw text export from Outlook and transforming the data from inside the e-mails (including some metadata of the e-mails themselves) into a CSV file with a certain formatting.

I gladly accepted this challenge, and after designing the basic logic for extracting the data and transforming it into the required format, I ran into performance issues.

Our current input file was a small part of the total load (36 MB out of 20 GB of PST files). The input file contained over 1.5 million lines of text which needed to be transformed to approximately 500,000 lines in a CSV file.

I had transformed XML to CSV before, but in that case the input file was only 5 MB of XML. There I loaded the input file into memory and then wrote every extracted CSV line directly to the output file.

For the 36 MB text file my idea was to use the same approach and write the 500,000 lines directly to the CSV file.

I first tested with a small portion of the file (100 out of the 5,695 parts in the input file). Writing every line directly to the output file took about 100 seconds, which meant the complete file would take roughly 96 minutes. Since this file was only a small portion of the total, I wanted to improve performance before applying this to the main bulk of data.
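As a rough sketch (the variable names are placeholders, not the actual production code), that per-line approach looked like this:

$allLines = [System.IO.File]::ReadAllLines($inputPath) #read the whole input into memory up front
foreach ($line in $allLines)
{
    $csvLine = $line #stand-in for the real extraction/transformation logic
    $csvLine | Out-File -FilePath $outputPath -Encoding utf8 -Append #one disk write per line
}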

This got me to try to reduce I/O by storing the result in memory instead and writing it in one go after the whole file was processed. As before, I tested this with a subset of 100 out of the 5,695 parts of the input file. This approach reduced the running time from 100 seconds to 3.2 seconds; roughly a 30-fold reduction.
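The in-memory variant, again only as a sketch with placeholder names:

$allLines = [System.IO.File]::ReadAllLines($inputPath) #same up-front read as before
$collection = @()
foreach ($line in $allLines)
{
    $csvLine = $line #stand-in for the real extraction/transformation logic
    $collection += $csvLine #collect in memory instead of touching the disk
}
$collection | Out-File -FilePath $outputPath -Encoding utf8 #one single write at the end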

With this result I wanted to immediately run this approach on the entire test file (36 MB of text). I expected the script to finish in 3 to 4 minutes. However, after 20 minutes (during which the CPU used by PowerShell was maxed out) it was still not complete. The cause was the sheer size of the data: physical memory was not sufficient, so swapping occurred (which resulted in exactly the disk I/O I was trying to avoid). This got me back to the drawing board to figure out a solution. First off, instead of storing the result of [System.IO.File]::ReadAllLines() in a variable, I found that PowerShell claimed less memory than before if I put that call directly inside my loop:

Foreach ($line in [System.IO.File]::ReadAllLines($inputPath))
{
    #Some code
}

The next improvement, however, is where I really gained the most. Instead of writing every line individually or writing everything at once, I started writing batches of lines to the output file. In my case I started with batches of 100 (out of the 5,695 parts) and wrote them to the file as they were completed. With this configuration all 5,695 parts were completed in 8 minutes (a lot faster than the previously estimated 96 minutes, or the more-than-20-minute in-memory approach).

I have yet to figure out what the perfect balance is between the number of lines to write at one time and the amount of disk I/O. For me the optimal performance was at around 220 lines per batch, but this may be different for other, similar solutions. The best tip I can give you here: keep tweaking to find your sweet spot between disk I/O and memory usage.
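One way to find that sweet spot is to simply time a few runs with different batch sizes. Assuming the batching logic from the Solution section below is saved as a script called Convert-Logs.ps1 (the script name, paths and batch sizes here are just examples), something like this will do:

foreach ($size in 50, 100, 220, 500, 1000)
{
    $elapsed = Measure-Command {
        .\Convert-Logs.ps1 -InputFile .\export.txt -OutputFile .\result.csv -BatchSize $size
    }
    "{0,5} lines per batch: {1:N1} seconds" -f $size, $elapsed.TotalSeconds #print the timing per batch size
}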

Solution

Below is, in short, how my solution was built to write the CSV in batches instead of “at once” or “per line”.

Param(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$InputFile,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$OutputFile,
    [Parameter(Mandatory=$false)]
    [int]$BatchSize = 100
)
[int]$Counter = 0;
[string]$csvHeader = "Column1;Column2;Column3"; #this might be a lot more columns..
$csvHeader | Out-File -FilePath $OutputFile -Encoding utf8; #write header to file
$collection = @();
foreach($line in [System.IO.File]::ReadAllLines($InputFile)) #iterating over this directly (instead of storing all lines in a variable first) saves some memory
{
    #Do complex logic
    #Add items to $collection
    #Do more intensive stuff

    #Depending on which unit you want to count (csv line, parts, or other) you can place this in the appropriate location
    #In my case I did this every time I started with another of the 5695 parts; you can also use this to count csv lines,
    #but in that case I would recommend using a larger batch number (e.g. 10,000)
    $Counter++;

    if($Counter % $BatchSize -eq 0) #The following code is executed whenever the current $Counter is divisible by $BatchSize (the modulus is zero)
    {
        $collection | Out-File -FilePath $OutputFile -Encoding utf8 -Append; #append current batch to file
        $collection = @(); #clear the collection for the next batch
    }
}
$collection | Out-File -FilePath $OutputFile -Encoding utf8 -Append; #append the final lines (which did not reach the batch size limit)
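A call to the script would then look something like this (again, the script name and paths are just examples):

.\Convert-Logs.ps1 -InputFile .\export.txt -OutputFile .\result.csv -BatchSize 220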

Simple search with PowerShell

One day I was looking for a certain video file on my computer, but I didn’t know where to look. I did, however, know the name (in this case it was a video I had posted on YouTube, where I could easily find the original file name).

With the standard search bar in the Windows Explorer window the file could not be found (it was not indexed). I figured: “how hard can it be to use PowerShell to look for a file with a certain name?”. I limited my scope to searching based on the file name only. I then constructed a fairly simple but effective script which did just that. Not only did I find the file, it even appeared multiple times on my 2 TB drive (several copies of the same file). The script went through the 2 TB drive faster than I expected (the benefit in this case was the limited scope, which only looked at the names of files).

I wanted to share this simple (and really, it doesn’t get much more straightforward than this) script which empowers you to search through large amounts of files as long as you know a part of the file name. If you are familiar with regular expressions you can use them in the search; if you are not, you can still use a part of the file name.

This is the script:

function Find-File {
    Param(
        [Parameter(Mandatory=$true)]
        [ValidateNotNull()]
        [string]$Path,
        [Parameter(Mandatory=$true)]
        [ValidateNotNull()]
        [string]$regexSearchString
    )
    return Get-ChildItem -Path $Path -Recurse -File -ErrorAction SilentlyContinue | Where-Object {$_.Name -match $regexSearchString};
}
New-Alias -Name ff -Value Find-File;

To make the function easier to work with I also created an alias “ff” as you can see in the last line.

Most of the lines consist of the two mandatory parameters: Path and regexSearchString (don’t worry if you do not know much about regular expressions; normal text will also work).
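For example, both of these calls work; the path and search terms are of course just illustrations:

ff -Path D:\Videos -regexSearchString 'holiday'          #plain text fragment of the file name
ff -Path D:\Videos -regexSearchString '^holiday.*\.mp4$' #full regular expression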

The actual search is basically a one-liner. It collects all files below the provided path (as you can see by the switches “-File” and “-Recurse”). In addition, if no files are found (or the path is not a valid path) the result will be $null, which indicates there are no results for the current path/search string combination. When all files are collected there is a filter which matches each file name against the regular expression (or search string).

If called directly it will simply write the result to the screen:

ff -Path c:\ -regexSearchString test

It makes more sense to store the result in a variable like this:

$result = ff -Path c:\ -regexSearchString test

Here $result will contain a collection of FileInfo objects; if there is only one match it will not be a collection, but a single FileInfo object.
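If you always want to be able to treat the result as a collection (for example to use .Count), you can wrap the call in PowerShell’s array subexpression operator; this is standard PowerShell behaviour, not something specific to this function:

$result = @(ff -Path c:\ -regexSearchString test) #always an array, even with zero or one match
"Found $($result.Count) file(s)"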

If you are only interested in the location of the files you may choose to only collect the FullName (path+filename) property of the objects. You can do this by piping it to a select:

$result = ff -Path c:\ -regexSearchString test | Select FullName

Here $result will be a collection of objects that only have a FullName property (or a single such object if there is only one match).
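If you would rather have plain path strings, you can expand the property instead (again standard Select-Object behaviour):

$result = ff -Path c:\ -regexSearchString test | Select -ExpandProperty FullName #plain strings instead of objects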

If you are looking for a directory of which you know the name, the script can easily be modified to look for directories. Simply replace “-File” with “-Directory”; the rest works the same.
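For reference, such a directory-searching variant could look like this (the Find-Directory name and the fd alias are just suggestions):

function Find-Directory {
    Param(
        [Parameter(Mandatory=$true)]
        [ValidateNotNull()]
        [string]$Path,
        [Parameter(Mandatory=$true)]
        [ValidateNotNull()]
        [string]$regexSearchString
    )
    return Get-ChildItem -Path $Path -Recurse -Directory -ErrorAction SilentlyContinue | Where-Object {$_.Name -match $regexSearchString};
}
New-Alias -Name fd -Value Find-Directory;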