Friday, March 04, 2005

InfoWorld: Implementing real-world structured searches


STRATEGIC DEVELOPER: JON UDELL
http://www.infoworld.com
Thursday, March 3, 2005

IMPLEMENTING REAL-WORLD STRUCTURED SEARCHES

By Jon Udell

Posted February 25, 2005 3:00 PM Pacific Time

In the early days of XML, smart search was often cited as a key benefit. Instead of just trawling for single-celled keywords in an ocean of undifferentiated text, the story went, we'd navigate islands of structure looking for more evolved creatures. Product descriptions, calendar events, and media objects are all examples of the kinds of things we were meant to be finding by now.


That vision hasn't materialized yet, but I'm not ready to give up on the idea. A year ago I wrote about my efforts to chart "a middle course between the Scylla of simple full-text search and the Charybdis of unwieldy tagging schemes and brittle ontologies." The Scylla of this myth was Google's Sergey Brin, and the Charybdis was the W3C's Tim Berners-Lee. Between Brin's "we don't need no stinking structure" and Berners-Lee's "wrap everything in RDF (Resource Description Framework) and OWL (Web Ontology Language)," there is a vast, fertile middle ground awaiting discovery.

For example, the current craze for tagging things -- Flickr photos as well as del.icio.us and Furl URLs -- shows that people are more likely than you'd guess to add structure to content. Under what conditions will they make the effort? First, tagging must be easy -- a two-second no-brainer. Second, it must deliver both instant gratification and longer-term value to the person doing the tagging. Third and most important, it must occur in a shared context so that network effects can kick in.

Of course, some tags are implicitly woven into the fabric of our content. Consider, for example, the recent Demo conference in Scottsdale, Ariz. As information about the event flowed into the blogosphere, a likely tag to hang on conference-related items would have been the distinctive name Demo@15. And sure enough, that tag was used on both Flickr and del.icio.us, although by only one person. (Hint to conference planners: If you want the blogosphere to synchronize its coverage of your event, pick a tag and promote it.)

But there are also implicit tags -- namely links -- that identify items about the conference, and a new service I built this week is helping me find them. After Jason Hunter showed me Mark Logic's XQuery-based XML database, Content Interaction Server, in a screencast, I set up an instance of it and began pumping in the RSS feeds of all the blogs I read. Then I wrote a query that combines free-text search for items containing the strings "Demo" or "Demo@15" with structured search for items that contain links to demo.com. It yielded a nice list of Demo-related items that I couldn't have built any other way.

The service works by converting the HTML content of my feeds into well-formed XHTML, storing it in the Mark Logic database, and then using the XQuery engine to perform hybrid free-text and structured searches. Although the vocabulary of XHTML is not very rich, certain elements -- notably links -- carry a latent semantic payload.
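
The hybrid query described above can be sketched in a few lines. This is not the XQuery that runs against the Mark Logic database; it is a minimal Python illustration using only the standard library, and it makes several assumptions: the feed items (ITEMS) are invented samples already converted to well-formed XHTML, the function names hybrid_search and links_to are hypothetical, and the free-text and structured conditions are combined with OR, which the description above does not specify.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical feed items, already converted to well-formed XHTML.
ITEMS = {
    "item-1": '<div xmlns="http://www.w3.org/1999/xhtml"><p>Notes from '
              '<a href="http://www.demo.com/">the show</a> in Scottsdale.</p></div>',
    "item-2": '<div xmlns="http://www.w3.org/1999/xhtml">'
              '<p>Live from Demo@15, but no link.</p></div>',
    "item-3": '<div xmlns="http://www.w3.org/1999/xhtml"><p>A demographic study, '
              'with <a href="http://example.org/">a link</a>.</p></div>',
}

# Free-text side: the whole word "Demo" or the tag "Demo@15" (case-sensitive).
TEXT_PATTERN = re.compile(r"\bDemo(?:@15)?\b")


def links_to(root, domain):
    """Structured side: True if any <a href> points at the domain or a subdomain."""
    for anchor in root.iter(XHTML_NS + "a"):
        host = urlparse(anchor.get("href", "")).hostname or ""
        if host == domain or host.endswith("." + domain):
            return True
    return False


def hybrid_search(items, domain="demo.com"):
    """Return ids of items that match the text pattern OR link to the domain."""
    hits = []
    for item_id, xhtml in items.items():
        root = ET.fromstring(xhtml)
        text = "".join(root.itertext())  # element text with the markup stripped
        if TEXT_PATTERN.search(text) or links_to(root, domain):
            hits.append(item_id)
    return hits


print(hybrid_search(ITEMS))
```

In this sample, item-1 is found only through its link and item-2 only through its text, while item-3 ("demographic") matches neither test. The link is doing the work of an implicit tag.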

It's also possible to enrich the semantic payload of blog content, and on my own blog I've been doing that for a while. Using my XPath query service, you can easily find quotes by Ward Cunningham, Python code fragments, and a number of other things I'm marking with simple CSS tags. Can these ad hoc syntaxes be collaboratively extended? If we can get structured search working for the whole blogosphere, we'll find out.
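
As a rough illustration of searching for content marked this way, the Python sketch below selects elements by CSS class from a well-formed XHTML fragment. It is not the XPath query service itself. The sample entry, the class names ("quote", "ward-cunningham", "code", "python"), and the helper find_by_class are all assumptions made for the example; the actual markup conventions on the blog may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical blog entry; the class names are illustrative only.
ENTRY = """<div xmlns="http://www.w3.org/1999/xhtml">
  <p class="quote ward-cunningham">Placeholder quote text.</p>
  <pre class="code python">print("hello")</pre>
  <p>Unmarked paragraph.</p>
</div>"""


def find_by_class(root, *tokens):
    """Return descendant elements whose class attribute holds every given token."""
    wanted = set(tokens)
    # The XPath subset in ElementTree can select on attribute presence;
    # the token test is done in Python because contains() is not supported.
    return [
        el for el in root.iterfind(".//*[@class]")
        if wanted <= set(el.get("class").split())
    ]


root = ET.fromstring(ENTRY)
for el in find_by_class(root, "code", "python"):
    print(el.text)
```

Because the query tests only for class tokens, anyone who adopts the same tokens in their own XHTML becomes searchable in the same way, which is the kind of collaborative extension the question above has in mind.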

Jon Udell is lead analyst at the InfoWorld Test Center.


Copyright (C) 2005 InfoWorld Media Group, 501 Second St., San Francisco, CA 94107
