I recently had the rather unpleasant task of writing some PHP to compare a CSV file (with some 22,000+ entries) against a MySQL database. With the CSV file holding the master copy of the data, the script would update, insert, and delete rows in MySQL to match. It needed to run as a daily cron job on my (shared) Dreamhost box.
This would normally be simple enough, using a status field in the CSV file to flag rows that had been updated. Unfortunately there was no status field, and none could be added; in fact the CSV file could not be modified at all. The only way to check whether a row had been modified was to do a field-by-field comparison on every row.
I started off with a single script that imported the CSV into an array and also fetched all rows from the database table. With some looping to search through all rows, and all fields in each row, I got the script to work. Great! (I thought.)
But with each of the 22,000 CSV rows being checked against ~22,000 database rows (22,000 × 22,000 = 484 million iterations), the script took minutes to execute, and if left long enough it ate up 100% CPU (through PHP). Even an exponential back-off search took too long.
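For the record, the naive approach looked roughly like this. A minimal sketch only; the file name, table name, and field layout are made up for illustration:

```php
<?php
// Naive O(n^2) sketch: every CSV row is compared against every DB row in PHP.
// 'master.csv', 'live_table', and the 'id' column are hypothetical.

$csvRows = array();
if (($fh = fopen('master.csv', 'r')) !== false) {
    while (($row = fgetcsv($fh)) !== false) {
        $csvRows[] = $row;
    }
    fclose($fh);
}

$dbRows = array(); // in the real script, filled from "SELECT * FROM live_table"

foreach ($csvRows as $csvRow) {        // ~22,000 iterations...
    foreach ($dbRows as $dbRow) {      // ...each scanning ~22,000 rows
        if ($csvRow[0] == $dbRow['id']) {
            // field-by-field comparison here to decide on an UPDATE
            break;
        }
    }
}
```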
On Dreamhost, any script that nears 100% CPU usage is killed automatically. A major rethink was required, so I decided to split the script in two:
- Script 1 creates a temporary table in the database and simply imports the CSV file into it, row by row (see the sketch after this list).
- Script 2, running a few minutes later, compares the two tables using MySQL queries (rather than a PHP loop search); after performing all updates, inserts, and deletes, it drops the temporary table.
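Script 1 is the easy half. A minimal sketch, assuming PDO and invented table/column names (note that a true MySQL TEMPORARY table would vanish when script 1's connection closes, so a plain scratch table that script 2 drops later is what actually works across two cron runs):

```php
<?php
// Script 1 (sketch): import the CSV into a scratch table, row by row.
// Connection details, 'csv_import', and the columns are assumptions.
$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('DROP TABLE IF EXISTS csv_import');
$db->exec('CREATE TABLE csv_import (
               id INT PRIMARY KEY,
               name VARCHAR(255),
               price DECIMAL(10,2)
           )');

$insert = $db->prepare('INSERT INTO csv_import (id, name, price) VALUES (?, ?, ?)');

if (($fh = fopen('master.csv', 'r')) !== false) {
    while (($row = fgetcsv($fh)) !== false) {
        $insert->execute($row);
    }
    fclose($fh);
}
```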
The comparison script (2) works by looking for id matches between the two tables and marking any rows it finds. When a match is found, both rows are fetched and a field-by-field comparison determines whether an UPDATE statement is needed.
Finally, any rows in the master CSV table that were never matched are inserted into the live table, and any rows in the live table that were never matched are deleted.
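My script fetches each matched pair and compares the fields in PHP before issuing an UPDATE, but the same three steps can also be expressed as plain MySQL joins. A sketch, reusing the invented csv_import/live_table names from above:

```php
<?php
// Script 2 (sketch): compare the scratch table against the live table
// entirely in SQL. Table and column names are assumptions.
$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// 1. Update live rows whose id matches but whose fields differ.
$db->exec('UPDATE live_table l
           JOIN csv_import c ON c.id = l.id
           SET l.name = c.name, l.price = c.price
           WHERE l.name <> c.name OR l.price <> c.price');

// 2. Insert CSV rows that have no match in the live table.
$db->exec('INSERT INTO live_table (id, name, price)
           SELECT c.id, c.name, c.price
           FROM csv_import c
           LEFT JOIN live_table l ON l.id = c.id
           WHERE l.id IS NULL');

// 3. Delete live rows that have no match in the CSV table.
$db->exec('DELETE l FROM live_table l
           LEFT JOIN csv_import c ON c.id = l.id
           WHERE c.id IS NULL');

// 4. Clean up the scratch table.
$db->exec('DROP TABLE csv_import');
```

Because the joins hit an indexed id column, MySQL does the matching in one pass per statement instead of PHP looping 484 million times.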
Using two tables for the comparison, rather than looping and searching in PHP, means the strain now falls on MySQL rather than PHP. Dreamhost seems to tolerate this, and the PHP script's execution time dropped from minutes to seconds.
And why am I explaining all this? you ask:
1. So I can remember what on earth I did; and
2. I'm curious to know whether anyone can think of a better way to do this, bearing in mind the limiting factors: the CSV file CANNOT be altered in any way, the script has to execute in seconds, and CPU usage cannot approach 100%.
1 comment so far
someone Nov 06, 2011
Why not just export the DB and use comm to find the changes, and awk to fetch only the rows you need from the CSV?