June 07, 2008

CRIG DRY (Don't Repeat Yourself) Metadata Barcamp

Just back from another of Paul "IE Demonstrator"-pushing Walk'n'David Flanders' CRIG DRY (Don't Repeat Yourself) Metadata workshop/barcamp'n'free beer affairs, where the hardcore(?!) of the UK's academic repository hackers were plotting how to get the most out of helper webservices being produced by other JISC-funded webservices projects...

As I bookmarked the various services, it struck me that I'd already bookmarked at least one of them - and have just never revisited it... Hmm...

Anyway, here were the services:

  • STAR - a semantic archaeology thesaurus webservice (which put me in mind of the Archaeology Data Service);
  • GeoCrossWalk - a geocoding and gazetteer service (of which, more below);
  • Names - an authority name providing service; this is only just beginning, and I do wonder about its viability... the idea is to provide an authority name provider for institutions and academic authors. What I'd like to see is something that in the first instance mines and annotates the names of authors who've written documents that appear in local open research repositories, and then exposes that information to an academic people search tool. Practical and pragmatic, in other words...
  • HILT - another 'high level' thesaurus; I didn't really get this (yet another etc etc...?)
  • Deposit PLAIT: metadata usage across repositories, and automated metadata creation (once I saw the 'metadata extraction from document templates' bit of this, I thought of the (user created) bibliographic data scraping tools used in CiteULike);
  • CRIG Service Messaging Architecture: orchestrating RESTful web services.
A couple of things to note: JISC apparently likes to fund SOAP-powered webservices. Whilst these might conceivably make sense for complicated web service transactions, they're probably overkill in our sector most of the time (a sigh went up from the developers whenever a SOAP interface was mentioned).

Now I know there are tools around that are making playing with services much easier (some of the IBM mashup tools, for example, and I think Popfly?) but sometimes it's just so much quicker to be able to hack around an almost human readable URL and look directly at whatever data the webservice actually returns in a browser, than to have to use a tool to put the webservice call together, get the data out, and then make it readable...
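Just to show what I mean by a quick hit - here's the sort of throwaway Python snippet I'd reach for, eyeballing whatever a service returns (the URL and parameters are made up, of course):

    # Hand-crank a readable URL and eyeball whatever comes back...
    from urllib.request import urlopen
    from urllib.parse import urlencode

    params = urlencode({"q": "Bath", "format": "json"})  # hypothetical query string
    url = "https://example.ac.uk/lookup?" + params       # hypothetical service URL

    with urlopen(url) as resp:
        print(resp.read().decode("utf-8"))  # just look at the raw payload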

Something that came up a couple of times was managing webservice workflow. In SOAP based processes, of course, there are all manner of (complicated) ways of managing orchestration. But in REST, it's not so easy (unless you're happy with a simple pipeline... Hmm, I wonder how far things have moved on lately?). Scott Wilson mentioned a stalled workflow standard for RESTful services from a decade or so ago that may well be worth a look: Simple Workflow Access Protocol (SWAP); see also this old discussion about Developing Web Services Choreography Standards – The Case of REST vs. SOAP.
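For the simple pipeline case, at least, you don't need much machinery at all - just feed the output of one call into the next. Something like this sketch (both endpoints invented for the purposes of illustration):

    import json
    from urllib.request import urlopen
    from urllib.parse import quote

    def get_json(url):
        # Fetch a URL and parse the JSON body.
        with urlopen(url) as resp:
            return json.load(resp)

    # Step 1: pull terms out of a document (hypothetical extraction service)...
    terms = get_json("https://example.ac.uk/extract?doc=some-id")["terms"]

    # Step 2: ...then feed each term to a second service (hypothetical lookup).
    for term in terms:
        hits = get_json("https://example.ac.uk/lookup?q=" + quote(term))
        print(term, "->", hits)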

Another issue relating to webservices is one of IPR and licensing. Take for example the geoCrossWalk project. This project actually has two components - a geocoding service, that will parse a block of text, pull out placenames (among other things) and geocode them, using either "low quality" data (which is probably good enough to help you get thinking about how to build an app, IMHO) or good quality, Ordnance Survey data - which is licensed and requires authenticated access...
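If the geocoding half works the way I imagine - and this is just a guess at the sort of API it might expose, with the endpoint, parameters and response shape all invented - the call might look something like:

    import json
    from urllib.request import urlopen
    from urllib.parse import urlencode

    text = "The excavation near Bath uncovered Roman remains along the Avon."
    params = urlencode({"text": text, "quality": "low"})  # 'low' = the freely usable data

    # Hypothetical geocoding endpoint: text in, placenames + coordinates out.
    with urlopen("https://example.ac.uk/geocode?" + params) as resp:
        for place in json.load(resp)["places"]:  # assumed response structure
            print(place["name"], place["lat"], place["lon"])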

I did have a quick look at the Edina Digimap pages (where access to the GeoCrossWalk API is hidden away), but I couldn't log in... I guess I could have tried to scour the OU library website, looking for a way in, or a password, or something, but I wanted a quick hit, couldn't get it, so went off to the bar instead...

How ironic would it be for a researcher to fight with their institution about opening up their research data, only to find out they couldn't expose it on a public website because the JISC geocoding service had inherited O/S license conditions that said they couldn't? This isn't at all like software licensed under an aggressive open source license contaminating all the code it comes into contact with, but in some respects it has a similar feel to it... That is, licensed metadata effectively closing the original data it annotates... There has to be a better way for providers of commercial data to license their wares and generate revenue from them... There has to be... They really, really need to sort out a new business model... Because if they don't, we'll just start using good enough data that we can find for free...

(One of the tellings of 'fairy gold' is that it turns to dust if not passed on before midnight of the day you received it (or is that dawn of the following day..?!;-) That is, you can spend it, but you can't hoard it. The data providers need to find ways of generating revenue from the flow rate of their data, rather than hoarding it in the hope of a big payoff down the line. If I'm making money in part from your data, I'll give you a cut... But don't prevent me from getting the data to see if I can make use of it in the first place...)

The other part of geoCrossWalk is the CrossWalk part proper. This gazetteer will take one sort of geo reference, such as a place name, postcode, lat/long, region name, local government ward identifier, etc etc, and return references in other forms, including 'polygon bounding boxes' (there must be a proper term for this?) that describe an area: so for example, I could ask about "Bath" and get a KML overlay back that plots the extent of Bath on a map. I was also shown a demo of an overlay for the river Tay, and so on. Given an area overlay, it then becomes possible to ask for all the place names that lie within 1km of the Tay, for example, or all the postcode areas in Bath.
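Again guessing at what the API might look like (these calls are illustrative, not the real GeoCrossWalk interface), the crosswalk half might be driven something like this:

    from urllib.request import urlopen
    from urllib.parse import urlencode

    # Place name in, KML area overlay out (hypothetical call)...
    with urlopen("https://example.ac.uk/crosswalk?" +
                 urlencode({"name": "Bath", "out": "kml"})) as resp:
        kml_overlay = resp.read().decode("utf-8")  # drop straight onto a map

    # ...and a within-distance query against a feature (also hypothetical).
    with urlopen("https://example.ac.uk/within?" +
                 urlencode({"feature": "River Tay", "radius_km": 1,
                            "type": "placename"})) as resp:
        print(resp.read().decode("utf-8"))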

One service I thought might be quite handy relates to the display of multiple poorly located markers on a map - for example, markers for 100 photos tagged 'Bath'. By adding a hovering overlay marking out 'Bath' on the map, each marker could be placed somewhere within it, and set to jiggle - like the jiggling or jittery icons on an iPhone or iPod touch when you edit the 'desktop' on those devices. The jitter suggests that the location is uncertain, and maybe allows the marker to be pushed down to a lower layer that is more specific, such as a general postcode area - though again, the actual specific location may still be unknown. Only when the marker is jiggle free do you know it is precisely located... (of course, some markers may not designate a point - they may designate an area...)
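A toy version of the jiggling marker idea, just to make it concrete (the bounding box numbers are plucked from the air): markers we only know to be 'somewhere in Bath' get re-scattered inside the area's bounding box on each redraw, so the jitter itself signals the uncertainty:

    import random

    # Rough bounding box for the 'Bath' overlay (minx, miny, maxx, maxy) - made up.
    BATH_BBOX = (-2.43, 51.35, -2.31, 51.41)

    def jiggle(bbox):
        # Pick a random position inside the bounding box for this 'frame'.
        minx, miny, maxx, maxy = bbox
        return (random.uniform(minx, maxx), random.uniform(miny, maxy))

    # Each redraw, every uncertain marker picks a new spot inside the overlay;
    # a precisely located marker would just return its fixed coordinates.
    for photo in ["photo1", "photo2", "photo3"]:
        print(photo, jiggle(BATH_BBOX))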

As to how to geocode repository data, the assumption was that geocoding would be part of the ingest process whereby documents are added to a repository and annotated with geometadata. Paul Walk suggested that this could actually be done post hoc - a search is made over a repository and the results are then geocoded and displayed. (I guess that ideally, the geodata would then be submitted back to the repository?) This put me in mind of Weinberger's aphorism of 'filter on the way out'... If the geodata is not the primary or most important search facet, why go through the pain of adding it to every record, many of which are unlikely to be surfaced ever again, anyway?! The geodata then becomes a search refinement that can be offered as a 'search within results' feature?
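In code, the 'filter on the way out' version might look like the following sketch - search first, geocode only what comes back (both service calls hypothetical):

    import json
    from urllib.request import urlopen
    from urllib.parse import urlencode

    def get_json(url):
        with urlopen(url) as resp:
            return json.load(resp)

    # Search the repository first (hypothetical search endpoint)...
    results = get_json("https://example.ac.uk/repo/search?" +
                       urlencode({"q": "roman settlement"}))["records"]

    # ...then geocode just those hits, on the way out (hypothetical geocoder),
    # and offer the places as a 'search within results' refinement.
    for record in results:
        geo = get_json("https://example.ac.uk/geocode?" +
                       urlencode({"text": record["abstract"]}))
        record["places"] = geo.get("places", [])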

I'd actually thought about this post hoc search process earlier in the day, again in conversation with Paul (who also referred to the NaCTeM TerMine term extraction service, which I need to have a quick play with... is it as easy to use as the Yahoo Pipes Term Extraction service, or the not quite so easy to use Reuters Calais service (SOAP, sigh...;-)?). This time it related to thesaurus data - an obvious use case I could see was in providing thesaurus generated 'did you mean?' or 'search within results' facets on a search interface.
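The 'did you mean?' facet idea could be wired up along these lines (a sketch, assuming a thesaurus service that returns related terms for a query; the endpoint and response field are invented):

    import json
    from urllib.request import urlopen
    from urllib.parse import urlencode

    query = "bronze age settlement"

    # Ask a hypothetical thesaurus service for terms related to the query...
    with urlopen("https://example.ac.uk/thesaurus/related?" +
                 urlencode({"term": query})) as resp:
        related = json.load(resp)["related"]  # assumed response field

    # ...and surface them as 'did you mean?' / 'search within results' facets.
    print("Did you mean:", ", ".join(related))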

One thing I came to realise is that for all their good points, many JISC services don't offer a quick hit. In many cases, you can't just start cranking out RESTful URLs, or cobbling Pipes together. You can't get away with not reading the documentation... You have to invest time in finding out what the service is good for, before deciding whether or not it's good for what you want... You can't just have a quick play...

As to other highlights from the day - the ORE work that Richard Jones and ?? are doing looks really fascinating, offering the ability to unpack repository holdings into aggregate objects (that is, disaggregatable ones). I didn't make time to talk to Richard - big mistake... Ho hum...

So going forward, what would I like to see on the webservices front? For every JISC web service, how about one or two of the calls exposed via a Yahoo Pipes or a Popfly block...? Or maybe provided as 'mashlets' in some of the enterprise mashup environments?

Blogged with the Flock Browser


Posted by ajh59 at June 7, 2008 12:26 AM