Question:
I am collecting performance counters from NetApp Performance Manager software (OPM). OPM saves 30 days worth of data in MySQL database. So i have to put two queries to retrieve the data:
- First Query on 30th of every month and save in csv file.
- Second Query on 1st of every month and save in csv file.
Then merge the two csv files to get data if there are 31 days in a month.
Both files look like below:
1 2 3 4 5 |
"Time","objid","cpuBusy","cifsOps","avgLatency" "2016:06:04 00:04","72","50.6196","2069.11","7622.1" "2016:06:04 00:09","72","30.2233","2075.94","7633.27" "2016:06:04 00:14","72","35.2559","1980.64","8352.17" |
When i merge the two csv files with below code. I get duplicate rows with data from same data/time.
1 2 |
@(Import-Csv au2004npsa003-mm-business.csv) + @(Import-Csv au2004npsa003-nn-business.csv) | export-csv joined.csv -NoTypeInformation |
How can i merge the two csv files without getting duplicate data?
I have tried select -unique
however, it gives just one row.
Answer:
As for why Select-Object -Unique
didn’t work:
Select-Object -Unique
, when given instances of reference types (other than strings), compares their.ToString()
values in order to determine uniqueness.[pscustomobject]
instances, such as the onesImport-Csv
creates, regrettably return the empty string from their.ToString()
method.- This long-standing bug, still present as of PowerShell (Core) 7.2, was originally reported in GitHub issue #6163.
Thus, all input objects compare the same, and only the first input object is ever returned.
S9uare’s helpful Select-Object -Property * -Unique
approach overcomes this problem by forcing all properties to be compared invidually, but comes with a performance caveat:
The input objects are effectively recreated, and comparing all property values is overkill in this case, because comparing Time
values would suffice; with large input files, processing can take a long time.
Since the data at hand comes from CSV files, the performance problem can be helped with string processing, using Get-Content
rather than Import-Csv
:
1 2 3 4 |
Get-Content au2004npsa003-mm-business.csv, au2004npsa003-nn-business.csv | Select-Object -Unique | Set-Content -Encoding ASCII joined.csv |
Note that I’m using -Encoding ASCII
to mimic Export-Csv
‘s default behavior; change as needed.
With input objects that are strings, Select-Object -Unique
works as expected – and is faster.
Note, however, that with large input files that you may run out of memory, given that Select-Object
needs to build up an in-memory data structure containing all rows in order to determine uniqueness.