viksoe.dk

OLE DB Provider for XML


This article was submitted .


While researching how I could generate data reports for a project, I found myself needing access to XML with a regular database interface. After looking for a solution on the Internet, I found to my surprise that I had only two options:
  • Use the Simple Provider or SQLXML from Microsoft. For some reason this requires a MS SQL Server.
  • Purchase an XML ODBC Driver from Merant, which comes with a price-tag of $2,500 for a server license (who are they kidding?)
None of these options were very compelling, especially because all I needed was to retrieve a few nodes and deliver the result to a reporting utility. So after carefully planning a design, I wrote my own OLE DB Provider for XML.

An OLE DB Provider is a database driver, which uses Microsoft COM technology. The more famous layer, ADO, is built on top of OLE DB and leverages an "easy access to databases for Visual Basic programmers"-layer.
OLE DB is not solely targeted for SQL databases, but rather tries to encompass all kinds of data sources. For this reason (and because it uses a lot of COM stuff) it is a rather complex task to develop them.

The XML Provider

This Provider allows any OLE-DB / ADO application to access data in tabular and hierarchical formats. Using standard structured query language (SQL) to query and retrieve XML data.
The Provider is built by using existing XML technologies. Specifically the Microsoft XML Parser 6.0 is required. Performance was never a consideration for this project; simplicity was.

SQL conformance

The Provider supports a minimum of SQL, enough to allow searches and filtering of columns. This basically means that you can use a simple standard SQL query such as...
SELECT Name, Phone FROM Employees WHERE Phone = '1234'

The Provider deduces the table relations itself, so no additional information is required to describe how SQL relations are structured in the document (no additional schema information required).

It is possible to use AND and OR logical connectives to refine searches in the WHERE clause, as well as most arithmethic operators. It basically translates what you write into an XPath query so that should give you an idea of the limitations of the query processor.

The Provider is a read-only provider. It does not support INSERT, UPDATE or DELETE data manipulation.
There is no support for built-in functions, grouping or calculated column values. Nor does it support table alias syntax (it can be specified but is ignored). No aggregated functions are implemented.

Tables

XML excels at hierarchial structuring. SQL on the other hand tends to look at relations. It is not easy to mix the two or create a query language that seems natural for the other.

The XML Provider looks for repeating nodes at the same hierarchy level and treats them as tables. Meaning, all nodes that contain child nodes are potential tables.
It determines the columns available in the table by looking at the child nodes of the first entry it finds.

Here is a sample:

<AddressBook>
  <Person familyName="Gates" firstName="Bob">
    <Address street="Pine Rd." number="1239" state="CA"/>
    <JobInfo jobDescription="Manager"/>
    <Description>Family man</Description>
  </Person>
  <Person familyName="Brown" firstName="Robert">
    <Address street="118 St." number="344" state="NY"/>
    <JobInfo jobDescription="Group Leader"/>
    <Description>Hacker</Description>
  </Person>
  <Person familyName="DePaiva" firstName="Kassie">
    <Address street="Pine Rd." number="1234" state="CA"/>
    <JobInfo jobDescription="Actor"/>
    <Description>Actor</Description>
  </Person>
</AddressBook>
In this XML document, the "Person" table contains the set of Person nodes. Each record row in this table contains columns, such as "familyName", "firstName" and "Description". The Address and JobInfo nodes are not part of the table, because they do not contain a text node.

There are several other tables, which can be queried in this sample. They are "AddressBook", "Address" and "JobInfo".

Table Hints

XML documents are set up in hierarchical structure. It may be nessecary to indicate which hierarchy level represents a table. For this purpose, you can use table hints.
The syntax is:
SELECT <fields> FROM <table> hint("<XPath expression>")
Here is a sample.
SELECT * FROM 
  Person, Address hint("./Address[@state='CA']")
This finds all persons, but selects only those with an address in California. Of course, this is the same as writing...
SELECT * FROM Person, Address WHERE state='CA'
But hints are still useful when you need to navigate to parent nodes, or specific grand-children nodes.

Columns

All immediate child nodes in a table that do not themselves contain child nodes are cadidates as selectable table columns. All node attributes are included as columns as well.
Remember that columns names are case-sensitive and that attribute columns are not referenced with their @-character prefix.

If your query contains multiple relations (tables) you may reference columns in different tables by prefixing with the table-name. This is standard SQL stuff.

SELECT Address.street FROM Person, Address
  WHERE Person.firstName='Bob' AND Address.state='CA'

Special Columns

A special column is available for all tables. The rowid column is automatically included and represents a unique identifier for each row. Sort of the same as a unique key in a relational database. The key is generated using the generate-id() XPath method.

Data types

The Provider data source can read any valid XML - including raw XML. Raw XML is XML without schema information. For this kind of XML, all columns are treated as string columns. If schema information is present the Provider will load it and attempt to do mapping between the most ordinary data types. This includes integers and numeric values.

String columns are truncated to 120 characters! This is configurable through the Providers properties as described later.

Joining tables

Only simple table joins are supported.

The provider supports two types of joins. The first is a little non-standard.

SELECT firstName, state FROM Person, Address
This is an otherwise rarely used SQL syntax, but the Provider accepts this and assumes that the Address table is located below the previous table, the Person table in this sample. It then does a simple table outer join and includes all addresses of the person in the result set. Use a WHERE clause to filter records.

The second join syntax looks more like the standard SQL syntax.

SELECT firstName FROM Person, Address WHERE
  Person.id = Address.personId
When the provider stumbles over this syntax, it assumes that the Address is not necessarily placed directly below the Person table and scans the entire XML document for Address nodes.

The Provider always starts from the left-most table (as defined in the FROM clause) and works its way through the remaining tables. It is usually a good idea to start searches with the top-most XML nodes, and adding child node tables as they appear in the XML document.

Because of the nesting of table access when you don't provide a unique ID for the join, you may need a table hint when you join more than 2 tables:

SELECT firstName, state, jobDescription FROM Person, Address,
  JobInfo hint("../JobInfo")

Sorting

Sorting is possible by using the ORDER BY SQL clause. However, only columns from the first table can be included in the sort list.

Reserved Keyword

The following SQL keywords are reserved as they are used by the SQL engine.
SELECT UPDATE DELETE 
FROM WHERE ORDER BY 
SET AND OR 
Keywords are case-insensitive. Do not use these names for XML nodes.

Using the Provider

Configuration properties

Any OLE DB Provider can be configured by setting a number of properties on the data source object. The XML Provider defines a few of its own values:

  • XML Root - Sets the default root to query for the first table. Use this to minimize the initial search.
  • Max String Length - All string columns are truncated to this size.
  • Validate on Parse - Validate document when loaded. Default is True.
  • Resolve Externals - Indicates whether external definitions, resolvable namespaces, document type definition (DTD) external subsets, and external entity references, are to be resolved at load time. Default is True.
  • Protect Missing Tables - Indicates whether unknown tables in a query causes a failure or the Provider to ignore/skip the table. Default is False (failure).

Accessing the data

The following JavaScript sample opens an XML document and reads the firstName field from the table.
var adUseClient = 3;
var oDb = new ActiveXObject("ADODB.Connection");
oDb.Open("Provider=XmlOleDb.XML; Location=C:\\xml\\Test.xml");
oDb.CursorLocation = adUseClient;
var sSQL = "SELECT * FROM Person";
var oRec = oDb.Execute(sSQL);
while( !oRec.EOF ) {
   WScript.Echo( oRec("firstName") );
   oRec.MoveNext();
}
oRec.Close();
oDb.Close();
The use of adUseClient may be needed if ADO behaves strangely when it tries to access it using a server cursor.

Accessing the Provider from C++ is similar, and the standard ATL Consumer templates can be used, as well as most other OLE DB framworks. Note that the Provider's ProgID is XmlOleDb.XML.

The provider also works with the Data Bound controls in Visual Basic 6. You need to use the ADO Data Source as the data source control.

A final note...

This provider has not been thoroughly tested. It is not unlikely that you will stumble into problems and get all sorts of strange errors. Even worse, if you hit a bug the driver will probably crumble and Dr. Watson takes over the conversation... You have been warned.

And please don't ask me to publish the source code for this component. The source code for the XML OLEDB Provider will not be released.

Installation Requirements

Microsoft XML Parser 6.0

Installation Guide

  • Extract and copy the files to a local folder.
  • Execute the "install.js" script.

Download Files  This software is free!

DownloadOLE DB Provider for XML (377 Kb)

To the top