Igor Kromin |   Consultant. Coder. Blogger. Tinkerer. Gamer.

One of my tasks recently has been to analyse erroneous duplicate data in our production system and to come up with a number of data fixes that could be executed using simple SQL update statements. The sanitised data was provided to me as an Excel spreadsheet containing just enough information to work out what should be done with the duplicate, but not enough to identify who the data belongs to. This meant that I could work on it outside of the production environment, but it also meant that I needed to provide a way for the production operations team to reproduce what I did on real data and with relative ease.

After some thinking I've decided to implement my solution in Excel, in fact, in the very same spreadsheet that was given to me. I didn't use any external scripts, nor any macros. Everything was done using some clever column positioning and standard Excel formulas.

My input data looked something like this (not what was given to me for obvious reasons):

Without going into details of the data, or the analysis of the data fixes, lets assume that the ID column in conjunction with the version and status columns is the primary key for each row (as well as in the database). What I wanted to produce is an update SQL statement for certain rows that I identified as needing fixing.

I wanted the update statement to look something like this:
update user.table1 set status='XXX', version=99 where id='YYY' and version=ZZZ and status='AAA';
update user.table1 set status='XXX' where id='YYY' and version=ZZZ and status='AAA';
update user.table1 set version=99 where id='YYY' and version=ZZZ and status='AAA';

To break this down, I wanted to set a new status and/or version for an existing record, with these rules:
  • Update can update the status OR the version
  • Update can update BOTH status and version
  • Only some rows were to be updated (as marked)

To aid in this, I added three new columns to my spreadsheet: NeedUpdate, NewVersion, NewStatus. The NeedUpdate column would designate whether an update statement should be generated for that row, the two other columns contained new values to set in the update statement.

The resulting spreadsheet took on this form after my analysis (example values only, not representative of actual values):

Now I was ready to implement the SQL generator. I did this in a new worksheet. The original worksheet was called 'Export Worksheet', which is important for the formulas shown below.

Before writing the formulas, I broke down the update SQL statement into the following parts (one column per part):
  • Update clause - 'update user.table1 set'
  • 1st value set e.g. version=99
  • Set list comma separator (if both status and version are being set)
  • 2nd value set e.g. status='AAA'
  • Where clause
  • Where clause conditions

Looking back on this, the where clause and it's conditions could have been combined.

Now for each of the cell formulas, I did a check to see whether that row was to be updated e.g. if the F2 cell in my data worksheet had the value 'y'. If it did, the cell would either get a value from the formula or further conditional processing was applied by combining IF statements.

The set list comma separator was a special case, it could only appear if both the status and version columns were being updated, for this I used the IF function in combination with the ISBLANK function to check whether values in the data workbook were set that corresponded to these variables.

The formulas ended up looking like this (split over multiple lines for readability here):
 A1 cell
=IF(EXACT('Export Worksheet'!F2, "y"),
"update user.table1 set",
 B1 cell
=IF(EXACT('Export Worksheet'!F2, "y"),
IF(ISBLANK('Export Worksheet'!G2),
CONCATENATE("version=", 'Export Worksheet'!G2)),
 C1 cell
=IF(ISBLANK('Export Worksheet'!H2),
IF(ISBLANK('Export Worksheet'!G2),
 D1 cell
=IF(EXACT('Export Worksheet'!F2, "y"),
IF(ISBLANK('Export Worksheet'!H2),
CONCATENATE("status='", 'Export Worksheet'!H2,"'")),
 E1 cell
=IF(EXACT('Export Worksheet'!F2, "y"),
 F1 cell
=IF(EXACT('Export Worksheet'!F2, "y"),
"id='", 'Export Worksheet'!A2,
"' and version=", 'Export Worksheet'!C2,
" and status='", 'Export Worksheet'!D2, "';"),

This produced output like this:

Generating all of the SQL was a matter of selecting the first row and dragging down. Any rows from the data worksheet that needed updates had SQL generated, others just got a blank line.

At this point it was a matter of copy/pasting the worksheet into SQL Developer to execute the queries.

This is nice and simple and doesn't require any shell scripts or using the SQL Loader. This also means that the spreadsheet is completely reusable, just paste new data values into it and the SQL is automatically generated for you!


Have comments or feedback on what I wrote? Please share them below!
comments powered by Disqus
Other posts you may like...