I Will Buy You Lunch…
May 16, 2008 – 11:29 am…or maybe even dinner, if you will build a usable table differencing tool. I have some SQLite database files under version control that I need to merge (updated on my laptop yesterday while flying, not remembering that I’d made some updates on my desktop machine as well). Ditto spreadsheets: I store all my grades in Excel now, but every once in a while, I have to merge changes made at home with changes made in the office. Both are multi-tabular data (”multi” because there can be many tables in a database, and many worksheets in a spreadsheet); after images, I’d guess tabular is the most common format that people want to put into version control that diff(1) can’t deal with. I can’t promise wealth or fame, but you’d sure be popular.
8 Responses to “I Will Buy You Lunch…”
I bet you’d find it a lot easier to compare tables of data using Resolver One.
By Michael Foord on May 16, 2008
Can you not dump the tables to INSERT statements and use a text differ? How big are the tables?
By Robert Brewer on May 16, 2008
@Michael: Can it load from SQLite, MySQL, PostgreSQL, etc., is it table-aware, and does it have a nice diff UI?
@Robert: yep, and I can dump two images as columns of (X, Y, R, G, B) and diff those textually too
By Greg Wilson on May 16, 2008
Given two versions of a grid of data it would be trivial to report what cells have changed. The problem is in expressing those differences in a meaningful way:
* has a column of three cells been inserted pushing original values down?
* has a group of cells been copied and used to overwrite another area of the grid? Multiple times? 3 times from position P1 with x-offset X and y offset Y?
For such a tool it would merit dinner at Claridges (http://www.gordonramsay.com/claridges/)
- Paddy.
By Paddy3118 on May 17, 2008
I ran into this issue a while ago and came up with some one off hacks.
Because the data was Django tables where the primary key is always ‘id’ and the table schema’s are quite regular (Complex multi-column PK’s and django generic relations were a PITA).
Export the tables as CSV files.
I exported 3 files. The common ancestor before any changes were made, then each of the two end results. I then used Araxis Merge to do a three-way merge of the files into an end result.
http://www.araxis.com/merge/index.html
There are better 3-way merge tools out there. Araxis was just what I had on hand at the time.
Then it was a re-import of the data from CSV.
The key was to deal with managing the PK’s. This was a pain and I have some ideas on automating this with difflib and python, but have not spent any time on it. The process went like this:
1. mark down the last PK id from the common ancestor in each table.
2. Mark down the last PK id from the A CSVs in each table.
3. keep a mapping of column FK’s (Django made this dead simple)
4. Update all the PK id’s in the B CSVs which are greater than those in the common ancestor to be values greater than those in the A CSVs.
5. Update all the B FK id’s for those entries
I did all that in python code, not by hand. The Django ORM made it a simple process, and I actually did it as part of the CSV exports. The common ancestor came from a backup (I assume you are backing up your data).
Then I did the merge with the merge tool, and re-imported the CSV files. If I had had ClearCase available I would have used their merge tool and most likely would have had no work to do. But purchasing a $200K license for just that is a bit much.
By Doug Napoleone on May 17, 2008
(On behalf of David Goodger, whose comments are being eaten by my spam filter):
I heard about “Redgate SQL Compare” (http://www.red-gate.com/products/SQL_Compare/index.htm) from Joel Spolsky on the StackOverflow podcast #4 (http://blog.stackoverflow.com/). Joel recommends it highly.
By Greg Wilson on May 18, 2008
@All: right, but I want to compare *any* tabular data using *one* tool, just as I can compare *any* line-oriented textual data using diff. (I’d like a better XML diff tool, too, one that doesn’t rely on the textual representation, but that’s actually a lower priority.) Seems to me that we ought to be able to do all tabular with one interface/algorithm plus some adapters, or am I being naive?
By Greg Wilson on May 18, 2008
Resolver One *doesn’t* have a built in diff tool.
What it is is a programming platform with a spreadsheet interface and full access to the spreadsheet object model from your code.
This makes it ideal for working with tables of data as you don’t have to create a UI.
It uses IronPython, and can work with any database for which there is a .NET interface - i.e. most of them. If you have a named data source (or know the connection string) on your computer you can create ‘database worksheets’ that automatically pull in their contents from a database, or you can do it from a few lines of usercode.
You have the full power of Python (and Python libraries that work with .NET, or any .NET libraries) to work with the data.
See the following two sites for details and examples:
http://www.resolversystems.com
http://www.resolverhacks.net
On the main site we’ve now done a series of two minute screencasts that highlight some of the most important features.
By Michael Foord on May 22, 2008