Handling big data with PowerShell


Recently I was involded in a project where a large amount of e-mails with logs had to inserted into a BI database. A colleague asked if I could assist in transforming a raw text export from outlook and transform the data from inside the e-mails (including some metadata of the e-mails themselves) to a csv file with certain formatting.

I gladly accepted this challenge and after designing the basic logic of extracting the data and transforming it to the required format I ran into performance issues.

Our current input file was a small part of the total load (36 MB out of 20 GB of PST files). The input file contained over 1.5 million lines of text which needed to be transformed to approximately 500,000 lines in a CSV file.

I have transformed xml to csv before, in this case the input file was only 5MB of XML. Here I loaded the inputfile into memory and then wrote every extracted csv line directly to the file.

For the text file of 36 MB my idea was to use the same approach and write the 500,000 lines directly to the CSV file.

I first tested with a small portion of the file (100 out of 5695 parts of the input file). Writing every line directly to the output file costed about 100 seconds. This would mean that the total file would take about 96 minutes. Since this file was only a small portion of the large total I wanted to improve performance before applying this to the main bulk of data.

This got my to try and reduce IO and instead store the result in memory and write it as a whole after the whole file is completed. As before I tested this with a subset of 100 out of 5695 parts of the input file. This approach reduced the running time from 100 seconds to 3.2 seconds; a reduction of 33 times.

With this result I wanted to immediately run this theory on the entire test file (46 MB of text). I expected the script to finish in 3 to 4 minutes. However after 20 minutes (during which the CPU PowerShell was using was maxed out) it was still not complete. The cause of this was because of the large size, the physical memory was not sufficient so swapping occured (which resulted in the IO I was trying to avoid). This got me back to the drawing board to figure out a solution. First of, instead of storing the input file in a variable (through System.IO.File.ReadAllLines()) I found that if I put this inside my loop PowerShell would claim less memory as before.

The next improvement however is where I really gained most. Instead of writing every line or writing everything at once I started writing batches of lines to the output file. I my case I started with batches of 100 (out of the 5695 parts) and write them to file as they where complete. With this configuration all 5695 parts where completed in 8 minutes (a lot faster then the previously estimated 96 minutes or the more than 20 minute approach).

I have yet to figure out what the perfect balance will be between the amount of lines to write at one time and the amount of disk IO. For me the optimal performance was around 220 line in one batch, but this is may be different for other similar solutions. Best tip I can give you here: keep tweaking to find your sweet spot between disk I/O and Memory usage.


Below in short how my solution was built to write csv in batches instead of “at once” or “per line”.