Transforming and Flattening Data

Transforming and flattening lists

A while back I had a the need to take what were essentially audit logs, where multiple records existed for each id, and flatten then into a single record for each id.

I pulled the data into python from a database as a list of over 100,000 tuples. Here is an example of the data I was looking at.

RECORD,ID,OLD_VALUE,NEW_VALUE,DATE,TABLE,COLUMN
1, id1, value1,  value2,  02/01/2020, table1, column1
2, id1, value2,  value3,  02/02/2020, table1, column1
3, id1, value3,  value1,  02/03/2020, table1, column1
4, id2, value4,  value5,  02/01/2020, table1, column1
5, id2, value5,  ,        02/03/2020, table1, column1
6, id3, value6,  value7,  02/01/2020, table1, column1
7, id3, value8,  value9,  02/02/2020, table1, column1
8, id4, value10, value11, 02/01/2020, table1, column1
9, id4, value11, value12, 02/02/2020, table1, column1
10,id4, value12, value4,  02/03/2020, table1, column1

This was an audit log for a series of undesired changes. Here is an example of the progression of these changes:

  • 02/01/2020: a large number of records were changed and iterated to the next highest available number
  • 02/02/2020: a similar number of records (but not all) were changed again, again iterating to the next available number
  • 02/03/2020: an additional event took place that resulted in many of the records being corrected (returning to their previous numbers), but some ended at yet another iteration higher, while still others ended up missing a value altogether.

In the sample information above there are 4 different unintended transformations of data that occurred:

  • id1: Example series of changes that ended back at the correct value:
  • id2: Example of series of changes that ended with a complete removal of the value:
  • id3: Example of series of changes that ended at a different number:
  • id4: Example of series of changes that ended up a different id's value (data not only incorrect, but conflicting):

In addition to the obvious differences between these 4 different changes, there are some that are harder to see:

  • Some id records went through 3 changes, others went through 2 (this one is not hard to see)
  • The last change of some id records was on 02/02, while others' last change was on 02/03

Altogether, the different factors that need to be understood are the following:

  • Some records changed each of the 3 days, some only on 2
  • Some of the records that only changed 2 times had their final changed on day 2, while others skipped day 2 and had their final change on day 3
  • Some records eventually were self corrected to their original value
  • Some records iterated 2 times and ended at a different number
  • Some records eventually ended up with a final value of null
  • Some records ended up at the original value of a different id

This example, as messy as it is, also is cleaner that the situation itself. The follow challenges existed:

  • There were some that only had 1 audit record where the original value was immediately replaced with a null
  • The list of tuples was not sorted in any fashion
  • The dates were actually spread over a period of 14 or 15 days, with the changes for a single record following anywhere in that time period -- not a clean 3 days like the example here.

The tuples that represent the audits for these changes are all over the place. In the end, what I wanted to see clearly what I was dealing with. In order to do that I needed to have a list of new records that would clearly show me the following for each id:

  • original value, and the date that value was lost
  • final value, and the date that value was added

Step 1: Gather the changes associated with a single id together, identified by that id

import csv
from collections import defaultdict

reader = csv.DictReader(open('audit_record.csv'))

dict_by_user = defaultdict
for i in reader:
    dict_by_user[i['ID']].append(i)

audit_summary_l = []
for i in dict_by_user:
    if dict_by_user:
        temp_dict = {}
        max = '0'
        min = '99/99/9999'
        for record in dict_by_user[i]:
            if record['DATE'] > max:
                max = record['DATE']
                last = record
            if record['DATE'] < min:
                min = record['DATE']
                first = record
        temp_dict = last
        temp_dict['OLD_VALUE'] = first['OLD_VALUE']
        temp_dict['ORIGINAL_DATE'] = first['DATE']
        audit_summary_l.append(temp_dict)

columns = ['ID','OLD_VALUE','ORIGINAL_DATE','NEW_VALUE',
           'DATE','TABLE','COLUMN','RECORD']
with open('audit_summary.csv', 'w') as output_file:
    dict_writer = csv.DictWriter(output_file, fieldnames=columns, lineterminator='\n')
    dict_writer.writeheader()
    for data in audit_summary_l:
        dict_writer.writerow(data)

This results in the following csv data:

ID,OLD_VALUE,ORIGINAL_DATE,NEW_VALUE,DATE,TABLE,COLUMN,RECORD
id1, value1,  02/01/2020, value1, 02/03/2020, table1,column1, 3
id2, value4,  02/01/2020,       , 02/03/2020, table1,column1, 5
id3, value6,  02/01/2020, value9, 02/02/2020, table1,column1, 7
id4, value10, 02/01/2020, value4, 02/03/2020,table 1,column1, 10

You'll only receive email when they publish something new.

More from theogeek
All posts