Sunday, February 24, 2008

Deep equality in modern languages

I was recently creating with Ruby simple Extract-Transform-Load tool for importing data set from XML file to our databse. At the beginning it seemed very easy, although painful task. The first problem I was trying to solve was testing. Nothing interesting, I thought. And I was almost right. Almost...

Testing algorithm was very simple:

1. expected = predefined data set (so called test fixture)
2. import data from XML file to DB
3. actual = load imported data from DB
4. assert_equal expected, actual, "all data should be equal"

The problem was in assert_equal, because "default" equal was not I wanted to be. Default equal is shallow, i.e. compares only objects itself without its dependencies. I needed deep equality.

Example 1

title_a = ['Some title', author_a]
author_a = ['Gauss', 'C']



In this example title and author are in relation "has a" - title_a has author_a. This causes "standard" equal to not work correctly. Let for example

author_b = ['Gauss', 'C'].

Authors a and b are equal, but have different identifiers (identities, references), thus if we define

title_b = ['Some title', author_b]

we'll get title_a not equal title_b. To obtain expected equality result I must redefine default equality operator. So far, so obvious...

The problem was, that I had to redefine this operator in all data model classes. Ordinary thing, I thought. I was always doing this way. This is very simple, I have to choose which "fields" compare directly and which "through references". Very simple. And stupid. And error prone. And time consuming. And very hard to maintain. Why, the hell, languages I am working with do not provide such obvious functionality!? Maybe I missed something? Maybe everyone knows how to do this very easily, except me?

I did small research and it appeared that such functionality is implemented in... Eiffel, but only partailly [1]. Java, C++, Ruby and even Smalltalk don't have such language feature.

Maybe it is hard? I thought. My first idea was to compare two directed graphs created from object instances and (some*) references between them. The problem is well known, it is called graph isomorphism. Quickly it appeared, that graph isomorphism (GI) problem has no polynomial solution. Curiously, it is believed that GI is neither P nor NP-complete... There is special complexity class, called GI-complete, for this problem and all problems with polynomial-time Turing reduction. It is belived, although obviously (P=NP?) not proven, that:

P-complete < GI-complete < NP-complete

Anyway it seemed, that although solution is possible, it may be not efficient...

* - more about this in my next post

My second idea was that, comparing to GI problem, we have additional information - "root" vertices we are starting comparison from, and thus the solution may be more efficient than for GI problem. After some googling I found I was right. I found very interesting paper [2] which, besides formal point of view on deep equality, defines polynomial algorithm for the problem, even in the case with circular references! Additionally, algorithm looks quite simple to implement.

My third idea was, that the solution may be even simpler if we assume no circular references. I don't know if this is true, but I would like to check this, because my...

Fourth, final idea, was to implement deep equality solution for Ruby. But before I'll do this I will write next post entitled "Deep equality and Aggregation" i.a. describing how to declare which relations should be taken into consideration by deep_equal operation.

References:

1. Copying and Comparing: Problems and Solutions, P. Grogono &
    M. Sakkinen, 2000
2. Deep Equality Revisited, S. Abiteboul & J.V. den Bussche, 1995

1 comment:

knowak said...

Interesting. If a reverse transform (DB to XML) was easy to code you could use existing tools for deep XML comparison.

Maybe you wouldn't even have to build a complete XML tree - list of XPath-like expressions should suffice.