Repository Schema + XML Schema = Object Database
One of the great things about the web is that it is an object database. However, querying, mashing up, aggregating, or reusing the data in web documents is difficult when there is no systematic way to grab multiple documents and query them. To sidestep this problem, publishers often build APIs that pull information from a database. Repository Schemas are a way to make such APIs unnecessary by providing a simple map to the data.
Click here for a slide presentation of the Repository Schema idea.
There is also some important work on what the XQueries would look like if there were a Repository Schema, at http://en.wikibooks.org/wiki/XQuery/Link_gathering
One additional piece to connect Repositories is called Rosetta Stone Documents. More on that soon.
Outline of Repository Schema
One portion of the Repository Schema supports URL discovery, both by reconstructing URLs from variables (which XML Schemas can define) and by discovery through XPath (finding URLs in navigation pages, indexes, and site maps, as well as in search results, Atom/RSS lists, etc.). The other portions map out the components, objects, and subobjects (content divs, microformatted or RDFa objects, etc.) along with the metadata. Here is a quick outline of all of the parts. Note that a Repository Schema can be built by anyone, not just the repository publisher (as opposed to most APIs, which are publisher generated). This allows for realtime queries and mashups, especially using XQuery. It also provides transparency by creating an audit trail back to the raw data (again, especially with the open standard XQuery). Important: the Repository Schema will have a standard XSL to make it human readable and perhaps to help with building the URLs and starting searches.
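The first kind of URL discovery mentioned above, rebuilding URLs from a pattern plus variables, can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the `${name}` placeholder syntax, the `example.org` pattern, the variable names `year` and `state`, and the helper `expand_url_pattern` are hypothetical and are not part of any Repository Schema definition. In a real schema the allowed values would come from the XML Schema that defines each variable.

```python
from itertools import product
from string import Template


def expand_url_pattern(pattern, variables):
    """Return one URL for every combination of the variable values.

    pattern   -- URL containing ${name} placeholders (hypothetical syntax)
    variables -- dict mapping each placeholder name to a list of values
    """
    names = list(variables)
    urls = []
    # Walk the Cartesian product of all value lists, in the order given.
    for combo in product(*(variables[name] for name in names)):
        urls.append(Template(pattern).substitute(dict(zip(names, combo))))
    return urls


# Hypothetical repository whose documents are addressed by year and state.
pattern = "http://example.org/reports/${year}/${state}.xml"
urls = expand_url_pattern(pattern, {"year": ["2008", "2009"], "state": ["ca", "ny"]})
for url in urls:
    print(url)
```

The resulting list of URLs is what a query engine would then fetch and query as a single set of documents.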
The outline of the Repository Schema (thanks to Chris Wallace who has been very helpful):
The Repository Schema is used to define a set of documents, preferably XML, and the objects or useful information contained in them:
- for URL discovery,
  - by using URL patterns and teasing out their variables
  - and/or indicating XPath to bring back URLs
    - based on navigation links
    - or based on search result links
      - either from the repository publisher, who allows for variables or keywords for search
      - or from an external search engine
- indicating the purpose and pieces of the documents, including
  - pointing out the XML Schema(s) of the documents or creating new ones (for example, to be more or less restrictive)
  - mapping the metadata for the documents
  - describing the contained objects that can be used
    - content of a web page
    - microformatted/RDFa objects
      - determine the type from the container the object appears in (e.g. a phone list as opposed to a phone number for web site help), and then show the XML Schema for the object (create or point to XML Schemas inferred from the RDFa or microformatted objects)
    - standard objects for web pages that were created by advocatehope.org including:
- pointing to XSL transforms that can be used for the documents
  - to create Semantic/RDF versions
  - to create PDF, XHTML or other human readable formats
  - to create SIMILE.mit.edu-like versions for displays
  - to transform XML from using nonstandard/proprietary tags to standard ones
- pointing to external indexes
  - to faceted browsing indexes
  - to indexes that XQuery engines can access, rather than downloading all the repository documents in realtime, to allow faster manipulation/searching of the data
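The XPath branch of URL discovery in the outline above (bringing back URLs from navigation links) can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample navigation page, the `example.org` base URL, the path expression, and the helper `discover_urls` are all hypothetical. The sketch uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath; the approach described in this document would more likely use a full XPath or XQuery engine.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical well-formed navigation page published by a repository.
NAV_PAGE = """<html><body>
<ul id="nav">
  <li><a href="/docs/a.xml">Document A</a></li>
  <li><a href="docs/b.xml">Document B</a></li>
</ul>
<p><a href="/about.html">About this site</a></p>
</body></html>"""


def discover_urls(xml_text, link_path, base_url):
    """Return absolute URLs for the href of every element matched by link_path."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.findall(link_path):
        href = element.get("href")
        if href:
            # Resolve relative links against the page they were found on.
            found.append(urljoin(base_url, href))
    return found


# The path a schema author might record: only links inside the navigation list.
found = discover_urls(NAV_PAGE, ".//ul[@id='nav']/li/a", "http://example.org/index.html")
for url in found:
    print(url)
```

Because the path is restricted to the navigation list, the "About" link is not returned; only the two document URLs are. The same idea applies to search result pages or Atom/RSS lists, with a different path recorded for each.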