SSDDiff — a diff for semistructured data

Semistructured data is a generic term for data that carries structural information without being tabular or very tightly restricted. XML and HTML are the most prominent examples. You normally would not use the term for database tables, for example, which do not allow nesting of entries.
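To make the nesting point concrete, here is a minimal sketch in Python using only the standard library. The sample document and the helper names depth and outline are made up for this illustration; none of this is taken from SSDDiff itself, which is written in C++. It shows that entries may nest to different depths and that siblings of the same kind need not share the same shape, which a flat table row cannot express directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration only.
SAMPLE = """
<addressbook>
  <person>
    <name>Alice</name>
    <phones>
      <phone type="home">555-0100</phone>
      <phone type="work">555-0101</phone>
    </phones>
  </person>
  <person>
    <name>Bob</name>
  </person>
</addressbook>
"""


def depth(element):
    """Return the nesting depth of an element (a leaf has depth 1)."""
    return 1 + max((depth(child) for child in element), default=0)


def outline(element, level=0):
    """Return one indented line per element, showing the tree shape."""
    lines = ["  " * level + element.tag]
    for child in element:
        lines.extend(outline(child, level + 1))
    return lines


root = ET.fromstring(SAMPLE)
tree_depth = depth(root)
tree_outline = outline(root)

print(tree_depth)
print("\n".join(tree_outline))
```

Note that the two person entries differ in structure: one has a nested list of phone numbers, the other has none. A database table would need a fixed set of columns, or a second table, to hold the same information.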

While this application currently only supports XML, the algorithms should be able to process other semistructured data as well.

The approach used here is usually much slower than other well-known xmldiff applications; however, it produces better results in many "tricky" cases. You could say that other xmldiff applications try to do a syntactic diff, whereas SSDDiff tries to do a semantic diff.

News:

2006-03-02: Autotoolized the source code; it should be even easier to compile now (read: ./configure && make)

2005-08-04: Added a tarball for easier download (no subversion necessary) to the ssddiff alioth release area.
I also updated/continued the doxygen documentation.

2005-08-04: Just finished my talk at the conference. Slides are temporarily available here, and you can find the article in the Proceedings.

2005-05-15: There will be a talk about SSDDiff at Extreme Markup Languages 2005 on Thursday August 4 2005.

2005-05-10: I started "doxyfying" the source code so it's easier for others to understand. Lots of cleanup is still needed... I hope I'll find time to do so.
Here are the current docs.

2005-02-10: I flagged the current SVN version as 0.1. Consider this an alpha release of the application. I also clarified the licence by adding the GPLv2 as a COPYING file and a corresponding line to each file.

State of the project

There is a project thesis available which explains the reasoning, the benefits and the algorithm.

An example implementation (C++, libxml) was checked into the Subversion repository at svn.debian.org. You can also browse the source code using websvn. Beware that this implementation is development code that has not been properly cleaned up yet.

SSDDiff was presented on Thursday, August 4, 2005 at Extreme Markup Languages 2005.

Contributing:

I really appreciate contributions, especially if someone could help me with autotoolifying the package, bringing it to a beta release, improving the fast mode, making a library...

The only big demand I have for contributions: all contributed code must be distributable under both the GPL and the LGPL, so that I (or a future maintainer, if I resign) can change the licence to the LGPL in case this is considered desirable.

For more information, please contact Erich Schubert (Homepage).