2010-02-09

RDBMS to RDF... WTF?

I've been tasked with taking a few bits of the USGS National Hydrography Dataset (NHD) and turning them into RDF (or N3) so that they can be published on the CEGIS website and handed out to the SOCoP. Should be easy, right? I mean, there are lots and lots of converters out there that automatically generate RDF from things like an Excel spreadsheet or an XML dump of the data. There's even a wonderful toolset called D2RQ that helps you build mappings between an RDBMS and RDF and even acts as a bridge to an RDF store like Sesame.

But I have three caveats:

1. My data is currently in Oracle 11g Spatial. Not a problem for D2RQ (except for the spatial bits). So far, however, I can get a mapping out of D2RQ but can't get the data out as RDF. To complicate matters, the NHD is broken into 8 tables (not counting spatial). Each of the six study areas I am looking at was dumped into its own set of tables, e.g., NHDWATERBODY_WV. So I have 48 tables (not counting spatial) instead of eight. The mapping generated by D2RQ includes all 48 tables (even though the mappings for each table type are identical across study areas) plus the spatial ones. It's just under 8000 lines long.

2. Whatever output I come up with needs to import nicely into TopBraid Composer. Fortunately, I have a demo of TBCM right now, so I can actually test my output. The reason for this requirement is that this is the only way my project lead knows how to say "Eric, you are done." I managed to take some delimited text versions of my tables and convert them to N3 using tab2n3.py - but TBC complains about not having a valid namespace (which led me to another issue with my project lead - but that discussion will have to wait - suffice it to say that the URI for your namespace has to be a valid URI).

3. I'm supposed to be using Oracle whenever possible (actually, yesterday morning I was told I could only use Oracle but I managed to argue my way out of that requirement).
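Regarding the first caveat, here is a minimal sketch of how the 48 table names could be collapsed back into 8 table types, so that one mapping per type could be written once and reused. It assumes every table follows the NHDWATERBODY_WV pattern (type, underscore, study area). Apart from NHDWATERBODY_WV, the table names and area codes below are made up for illustration.

```python
from collections import defaultdict


def group_by_table_type(table_names):
    """Group names like 'NHDWATERBODY_WV' by the part before the last underscore.

    Returns a dict mapping table type -> list of study-area suffixes.
    """
    groups = defaultdict(list)
    for name in table_names:
        base, sep, area = name.rpartition("_")
        if not sep:
            # No underscore at all: this name does not follow the assumed pattern.
            raise ValueError(f"unexpected table name: {name!r}")
        groups[base].append(area)
    return dict(groups)


# Hypothetical names, except NHDWATERBODY_WV, which is the example from above.
tables = ["NHDWATERBODY_WV", "NHDWATERBODY_XX", "NHDFLOWLINE_WV", "NHDFLOWLINE_XX"]
grouped = group_by_table_type(tables)
```

With the real list of 48 names, `grouped` should end up with 8 keys of six study areas each, which is a quick sanity check before hand-editing any mapping.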

I've successfully managed to get some of the data into TBC as triples - but I'm generally not happy with their structure. You see, it's really easy to wrap an RDF conversion with what I guess would be called "converter triples". So the triples I get have either been completely stripped of meaning or have this extra layer of cruft. It's even worse if I go from XML to RDF because I get a layer of cruft in the XML as well.

Today my goal is to try to hand-write an ontology/namespace for the 8 table types. There are only 8 of them and they only use standard types in Oracle. Then I'll tweak my Python to generate N3 files of the data based on these ontologies. There is also a set of undocumented (or the documentation isn't online any more) PL/SQL procedures from the Relational.OWL project.
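As a rough idea of what the tweaked Python might look like, here is a sketch that writes one table row straight to N3 with no intermediate "converter" layer. Everything in it is an assumption for illustration: the `nhd` prefix, the example.org namespace, the `Waterbody` class, and the column names are not from the real ontology. The URI check is only a stand-in for "has a scheme and a host"; I don't know exactly what TBC checks.

```python
from urllib.parse import urlparse


def prefix_declaration(prefix, namespace_uri):
    """Return an N3 @prefix line, rejecting namespaces that are not absolute URIs."""
    parts = urlparse(namespace_uri)
    if not (parts.scheme and parts.netloc):
        raise ValueError(f"namespace is not an absolute URI: {namespace_uri!r}")
    return f"@prefix {prefix}: <{namespace_uri}> .\n"


def n3_literal(value):
    """Format a Python value as an N3 literal (booleans, numbers, or quoted strings)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    text = str(value)
    text = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + text + '"'


def row_to_n3(prefix, class_name, key, row):
    """Turn one row (a dict of column -> value) into a single N3 statement block."""
    subject = f"{prefix}:{class_name.lower()}_{key}"
    lines = [f"{subject} a {prefix}:{class_name}"]
    for column, value in row.items():
        if value is None:  # skip NULL columns rather than emit empty literals
            continue
        lines.append(f"    {prefix}:{column.lower()} {n3_literal(value)}")
    return " ;\n".join(lines) + " .\n"


# Hypothetical namespace, class, and columns.
header = prefix_declaration("nhd", "http://example.org/nhd#")
row = {"GNIS_NAME": "Some Lake", "AREASQKM": 1.5, "FTYPE": None}
statement = row_to_n3("nhd", "Waterbody", 123, row)
```

The point of the design is that each column becomes a property in the hand-written namespace and each row becomes an instance of its table-type class, so the triples carry the meaning directly. Keys or column names containing characters that aren't legal in a prefixed name would need extra cleanup that this sketch doesn't do.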