Publishing RDF from CSV via Microsoft® Excel®, XML Schema, SAWSDL and XSLT
0.4.0.6
Introduction
This is a technology demonstration aimed at technically-minded data publishers who are interested in the semantic web and linked data, but at present have some data in very basic formats like CSV, and have Microsoft Excel 2003 or 2007. The following is not a demonstration of any best publishing practice, and if the RDF produced is in any way plausible, then that is a happy accident. The CSV format and Excel application were chosen because of their current popularity. Excel 2007 was primarily tested, although the required XML features are also available in Excel 2003.
The document types used by this demonstration.
Getting started
- You will need Microsoft Excel 2003 or 2007 to follow the demonstration.
- If you use Excel 2007, you will need to have the Developer tab shown. Microsoft Office Button > Excel Options > Popular > Top options for working with Excel > Show Developer tab in the ribbon.
- Download the sample files, and extract them into a working folder. Save all of the files you produce in this folder too.
- You will need Internet access to allow the Dublin Core DCEMS and W3C SAWSDL schemas to be referenced. If you wish to work without Internet access, download the DCEMS schema, save it in your working folder, and change the appropriate schemaLocation in dataset.xsd to dc.xsd; then download the SAWSDL schema, save it in your working folder, and change the appropriate schemaLocation in dataset.xsd to sawsdl.xsd.
Description of the files
- dataset.xsd is a sample XML schema file which will be mapped to your data in Excel. It also has some extra features which will be discussed later;
- datasettordf.xslt is an XSLT transformation stylesheet, which takes an XML document like sample.xml and creates an RDF document based on it and its schema (dataset.xsd);
- sample.csv is a comma separated file, which will be opened in Excel 2007;
- sampleexcel2003.csv is an alternative comma separated file with UTC dates, which Excel 2003 apparently needs. This has datetimes in UTC format (otherwise they are unrecognized), and lacks the multi-valued properties4, as these are untested with this version;
- sample.rdf is the final RDF document created from sample.xml and dataset.xsd by datasettordf.xslt;
- sample.xml is an XML file exported from the Excel document created from sample.csv.
Basic procedure for exporting XML from Excel
Here is a walk-through of the steps to assign an XML schema to Excel, map it to a data list, and export the data as XML. Refer to Excel documentation for more detail.
- Open the sample.csv file in Excel 2007 (or sampleexcel2003.csv in Excel 2003) and save it as an ordinary Excel workbook, call the file myentity.xlsx (or myentity.xls in Excel 2003).
- In Excel 2007, Developer > XML > Source > XML Maps... Add > choose dataset.xsd > Open (ignore warnings about unsupported elements or structures) > Yes > OK. This should open a map called dataSet_Map in the XML Source. In Excel 2003, access the same functionality from Data > XML > XML Source...
- Make sure that in the XML Source > Options > My Data Has Headings is ticked.
- Map the XML elements in dataSet_Map to your columns by right-clicking on the elements under dataSet/record and choosing Map element:
- for the Dublin Core elements identifier, title and publisher, right-click on the <value>.
- for the other elements, just right-click on their names (property1, property2, property3 and property4).
- Click on the heading of the matching data column (ID for identifier, Name for title, and so on, including the heading) and OK. You can also drag and drop the element on the column heading.
- Once you have mapped all six elements to the six columns provided in the sample, save the Excel document.
- Check that Excel can export the data as XML by clicking on Verify Map for Export...
- Check that Excel will validate the data exported as XML against the schema in Developer > XML > Map Properties > Validate data against schema for import and export (or Data > XML > XML Map Properties...).
- From the Developer tab > XML > Export the data as an XML file (or Data > XML > Export), call it mysample.xml. You can open this file in a text editor, web browser or an XML editor.
It is a pity that Excel does not appear to maintain a link to the schema in exported XML, or let you assign an XSLT stylesheet to your XML data on export. In the next section, we'll add both to the mysample.xml document.
Relinking the XML document to its schema and transforming it to RDF with XSLT
When editing XML documents, if you are new to XML, please bear in mind that the syntax is case sensitive, and the document needs to be well-formed.
Link sample XML to the schema
- Open mysample.xml in an editor.
- Replace the line(s)
<ns1:dataSet xmlns:ns2="http://purl.org/dc/elements/1.1/" xmlns:ns1="http://example.com/data/dataset" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
with
<ns1:dataSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://example.com/data/dataset dataset.xsd" xmlns:ns1="http://example.com/data/dataset"
xmlns:ns2="http://purl.org/dc/elements/1.1/">
and save mysample.xml.
The XML document holding the data now references its schema; now it can be validated easily.
Link sample XML to the XSLT style sheet
- Open mysample.xml in an editor.
- Insert the line
<?xml-stylesheet type="text/xsl" href="datasettordf.xslt"?>
immediately below the XML declaration:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
and while you are at it, you could change standalone to no, since you have attached a schema in the previous section.
- In theory, you could now open up mysample.xml in a modern browser, and it would do the transformation to RDF-XML for you. In practice, I think these browsers have their limitations, even when I rewrote the XSLT to use older XPath 1.0 functions instead of XPath 2.0. The transformation into RDF does work with the Saxon-EE 9.2.0.2 and Altova XML processors in tests.
The datasettordf.xslt style sheet is a rough draft. Such a style sheet should be written to be as generic as possible, so it picks up everything it needs from any suitable XML document and associated schema.
Publishing the sample data
You could now publish the sample data online, as CSV, Excel, XML or RDF. You do not have to transform the data first; a server language like ASP.NET will let you publish an XML document transformed by an XSLT style sheet very simply with an XML control. As before, the XSLT may have to be rewritten to meet the constraints of various XML processors.
Review of process
We started with a CSV document:
"ID","Name","Published by","Property 1","Property 2","Property 3","Properties 4"
101882,"","http://example.com","apple",432.4,02/11/2009 12:33,"GrannySmith CoxsOrangePippin"
324332,"Sample","http://example.com","banana",235.5,03/11/2009 14:31,"Cavendish GrosMichel"
351232,"Sample","http://example.com","cherry",,04/11/2009 16:29,"Bing Rainier"
643324,"Sample","http://example.com","apple",343,05/11/2009 18:27,"GoldenDelicious"
844556,"Sample","http://example.com","apple",1435.9,,"GrannySmith CoxsOrangePippin GoldenDelicious"
934530,"Sample","http://example.com","cherry",453.8,07/11/2009 22:23,"RoyalAnn"
and ended up with an RDF document:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:data="http://example.com/data/dataset" xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:sawsdl="http://www.w3.org/ns/sawsdl" xmlns="http://example.com/data/dataset#">
<!--$schemaUrl is dataset.xsd-->
<!--$targetNamespace is http://example.com/data/dataset-->
<record xmlns="http://example.com/data/dataset" rdf:about="http://example.com/data/dataset/entity#">
<publisher xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#"
>http://example.com</publisher>
<property1 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
>apple</property1>
<property2 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal"
>432.4</property2>
<property3 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2009-11-02T12:33:00.000</property3>
<properties4 xmlns="http://example.com/data/dataset/entity#" rdf:parseType="Collection">
<rdf:Description rdf:about="http://example.com/data/dataset/entity/propertyList/GrannySmith"/>
<rdf:Description rdf:about="http://example.com/data/dataset/entity/propertyList/CoxsOrangePippin"/>
</properties4>
</record>
<record xmlns="http://example.com/data/dataset" rdf:about="http://example.com/data/dataset/entity#">
<rdfs:label>Sample</rdfs:label>
<publisher xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#"
>http://example.com</publisher>
<property1 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
>banana</property1>
<property2 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal"
>235.5</property2>
<property3 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2009-11-03T14:31:00.000</property3>
<properties4 xmlns="http://example.com/data/dataset/entity#" rdf:parseType="Collection">
<rdf:Description rdf:about="http://example.com/data/dataset/entity/propertyList/Cavendish"/>
<rdf:Description rdf:about="http://example.com/data/dataset/entity/propertyList/GrosMichel"/>
</properties4>
</record>
<record xmlns="http://example.com/data/dataset" rdf:about="http://example.com/data/dataset/entity#">
<rdfs:label>Sample</rdfs:label>
<publisher xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#"
>http://example.com</publisher>
<property1 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
>cherry</property1>
<property3 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2009-11-04T16:29:00.000</property3>
<properties4 xmlns="http://example.com/data/dataset/entity#" rdf:parseType="Collection">
<rdf:Description rdf:about="http://example.com/data/dataset/entity/propertyList/Bing"/>
<rdf:Description rdf:about="http://example.com/data/dataset/entity/propertyList/Rainier"/>
</properties4>
</record>
<record xmlns="http://example.com/data/dataset" rdf:about="http://example.com/data/dataset/entity#">
<rdfs:label>Sample</rdfs:label>
<publisher xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#"
>http://example.com</publisher>
<property1 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
>apple</property1>
<property2 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal"
>343</property2>
<property3 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2009-11-05T18:27:00.000</property3>
<properties4 xmlns="http://example.com/data/dataset/entity#" rdf:parseType="Collection">
<rdf:Description rdf:about="http://example.com/data/dataset/entity/propertyList/GoldenDelicious"/>
</properties4>
</record>
<record xmlns="http://example.com/data/dataset" rdf:about="http://example.com/data/dataset/entity#">
<rdfs:label>Sample</rdfs:label>
<publisher xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#"
>http://example.com</publisher>
<property1 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
>apple</property1>
<property2 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal"
>1435.9</property2>
<property3 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"/>
<properties4 xmlns="http://example.com/data/dataset/entity#" rdf:parseType="Collection">
<rdf:Description rdf:about="http://example.com/data/dataset/entity/propertyList/GrannySmith"/>
<rdf:Description rdf:about="http://example.com/data/dataset/entity/propertyList/CoxsOrangePippin"/>
<rdf:Description rdf:about="http://example.com/data/dataset/entity/propertyList/GoldenDelicious"/>
</properties4>
</record>
<record xmlns="http://example.com/data/dataset" rdf:about="http://example.com/data/dataset/entity#">
<rdfs:label>Sample</rdfs:label>
<publisher xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#"
>http://example.com</publisher>
<property1 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
>cherry</property1>
<property2 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal"
>453.8</property2>
<property3 xmlns="http://example.com/data/dataset/entity#" rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2009-11-07T22:23:00.000</property3>
<properties4 xmlns="http://example.com/data/dataset/entity#" rdf:parseType="Collection">
<rdf:Description rdf:about="http://example.com/data/dataset/entity/propertyList/RoyalAnn"/>
</properties4>
</record>
</rdf:RDF>
Adapting this approach to your data
This is not a tool or a solution, but you could adapt this approach to your own CSV data. The key lines are in dataset.xsd:
<xs:element ref="dc:identifier" sawsdl:modelReference="http://purl.org/dc/elements/1.1/identifier" />
<xs:element ref="dc:title" sawsdl:modelReference="http://purl.org/dc/elements/1.1/title" minOccurs="0"/>
<xs:element ref="dc:publisher" sawsdl:modelReference="http://purl.org/dc/elements/1.1/publisher" />
<xs:element name="property1" type="xs:string" />
<xs:element name="property2" type="xs:decimal" minOccurs="0"/><!-- removed sawsdl:modelReference="http://example.org/ontologies/SampleOntology#Quantity" -->
<xs:element name="property3" type="xs:dateTime" nillable="true"/>
<xs:element name="properties4" type="propertyList">
<xs:annotation><xs:documentation xml:lang="en-gb">A list of restricted strings. These will be separated by spaces.</xs:documentation></xs:annotation>
</xs:element>
I would probably keep the identifier and try to match something descriptive to title. Otherwise, this set of elements could be edited to match your own data set.
SAWSDL could be used to semantically annotate the schema, and these annotations can be pulled through into the resultant RDF document.
References
Change history
- 0.3.0 (2009-11-27)
- added a multi-valued property (Properties 4) to the CSV;
- updated the schema with a matching element (properties4) which has a list datatype;
- updated the XSLT to produce an RDF Collection set of elements for properties4;
- produced RDF output that validates on the W3C validator.
- 0.4.0 (2009-11-30)
- tried to shape the RDF output in a manner that makes sense;
- added named types to the dataset schema which can be referenced in the final RDF;
- lifted earlier constraint of aiming to use only XPath 1.0 only functions.
Intellectual property rights for this code
To the extent possible under law, Sleeping Dog
has waived all copyright and related or neighboring rights to
Publishing RDF from CSV via Microsoft Excel, XML Schema, SAWSDL and XSLT.
This work is published from
United Kingdom.