I was recently creating with Ruby simple
Extract-Transform-Load tool for importing data set from XML file to our databse. At the beginning it seemed very easy, although painful task. The first problem I was trying to solve was testing. Nothing interesting, I thought. And I was almost right. Almost...
Testing algorithm was very simple:
1. expected = predefined data set (so called test fixture)
2. import data from XML file to DB
3. actual = load imported data from DB
4. assert_equal expected, actual, "all data should be equal"
The problem was in assert_equal, because "default" equal was not I wanted to be.
Default equal is shallow, i.e. compares only objects itself without its dependencies.
I needed deep equality.Example 1
title_a = ['Some title', author_a]
author_a = ['Gauss', 'C']In this example title and author are in relation "has a" - title_a has author_a. This causes "standard" equal to not work correctly. Let for example
author_b = ['Gauss', 'C'].Authors a and b are equal, but have different identifiers (identities, references), thus if we define
title_b = ['Some title', author_b]we'll get title_a not equal title_b. To obtain expected equality result I must redefine default equality operator. So far, so obvious...
The
problem was, that I had to
redefine this
operator in all data model
classes. Ordinary thing, I thought. I was always doing this way. This is very simple, I have to choose which "fields" compare directly and which "through references". Very simple. And stupid. And
error prone. And time consuming. And very
hard to maintain. Why, the hell, languages I am working with do not provide such obvious functionality!? Maybe I missed something? Maybe everyone knows how to do this very easily, except me?
I did small
research and it appeared that such
functionality is
implemented in...
Eiffel, but only partailly [1]. Java, C++, Ruby and even Smalltalk don't have such language feature.
Maybe it is hard? I thought. My
first idea was to
compare two directed
graphs created from object instances and (some*) references between them. The problem is well known, it is called
graph isomorphism. Quickly it appeared, that graph isomorphism (GI) problem has no polynomial solution. Curiously,
it is believed that GI is neither P nor NP-complete... There is special complexity class, called GI-complete, for this problem and all problems with polynomial-time Turing reduction. It is belived, although obviously (P=NP?) not proven, that:
P-complete < GI-complete < NP-complete
Anyway it seemed, that although
solution is
possible, it may be
not efficient...
* - more about this in my next post
My
second idea was that, comparing to GI problem, we have additional information - "root" vertices we are starting comparison from, and thus the solution may be more efficient than for GI problem. After some googling I found I was right. I found very interesting paper [2] which, besides formal point of view on deep equality, defines
polynomial algorithm for the problem, even in the case with circular references! Additionally, algorithm looks quite simple to implement.
My third idea was, that the solution may be even simpler if we assume no circular references. I don't know if this is true, but I would like to check this, because my...
Fourth,
final idea, was to
implement deep equality solution for
Ruby. But
before I'll do this
I will write next post entitled
"Deep equality and Aggregation" i.a. describing how to declare which relations should be taken into consideration by deep_equal operation.
References:
1.
Copying and Comparing: Problems and Solutions, P. Grogono &
M. Sakkinen, 2000
2.
Deep Equality Revisited, S. Abiteboul & J.V. den Bussche, 1995