The Most Interesting Data-Integration Challenge Right Now? The Web


What's the most interesting data-integration problem imaginable?


Alon Halevy, a former computer science professor at the University of Washington, believes it's integrating the Web. And by Web, he doesn't just mean the crawlable information we all know and love. He means the down-deep data hidden in databases that are connected to the Web but not accessible to a search engine. This is the information used to create dynamic pages, schedule flights and so on.


It's called the Deep Web, or sometimes the Invisible Web. Of course, this isn't news to technologists, but this week, The New York Times gives us a peek at how some in the computer science field are approaching this mother-of-all integration challenges.


Halevy is now leading a team at -- where else? -- Google that's attempting to solve this. Google's tactic is to use a program to analyze the contents of every database it encounters on the Web. Every. Database. The reasoning is that you need to know what's in a database before you can decide whether to search it for information. The program works by finding a form on a Web page and then guessing at likely query terms, based on the Web site's content. Once it gets a match, "the search engine then analyzes the results and develops a predictive model of what the database contains," explains the Times.
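
To make the find-a-form, guess-terms, analyze-results loop a little more concrete, here is a toy sketch in Python. It is not Google's code, and the Times article doesn't describe the implementation. Everything named here (FormFinder, candidate_terms, probe, the fake_submit stand-in for a real form submission, and the sample page and records) is made up for illustration. The "predictive model" is assumed to be nothing more than a count of the words that come back.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collects text-input names found inside <form> tags, plus page text."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.text_inputs = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            # An <input> with no type attribute defaults to a text field.
            if attrs.get("type", "text") == "text" and attrs.get("name"):
                self.text_inputs.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

    def handle_data(self, data):
        self.text_chunks.append(data)


def candidate_terms(text, limit=5):
    """Guess likely query terms: the most frequent words of 4+ letters."""
    words = re.findall(r"[a-z]{4,}", text.lower())
    return [word for word, _ in Counter(words).most_common(limit)]


def probe(page_html, submit, limit=5):
    """Probe the first text field with guessed terms and count result words.

    `submit(field, term)` is a hypothetical callable that returns the
    records the form would yield; a real crawler would issue HTTP requests.
    """
    finder = FormFinder()
    finder.feed(page_html)
    if not finder.text_inputs:
        return Counter()
    field = finder.text_inputs[0]
    model = Counter()
    for term in candidate_terms(" ".join(finder.text_chunks), limit):
        for record in submit(field, term):
            model.update(re.findall(r"[a-z]{4,}", record.lower()))
    return model


# Made-up page and backing "database" so the sketch runs without a network.
PAGE = """
<html><body>
<h1>Flight search: flights from Seattle</h1>
<p>Find cheap flights, compare flights and book.</p>
<form action="/search"><input type="text" name="q"><input type="submit"></form>
</body></html>
"""

FAKE_DB = [
    "Seattle to Boston flight, nonstop",
    "Seattle to Denver flight",
    "Hotel in Boston",
]


def fake_submit(field, term):
    # Stand-in for submitting the form: a simple substring match on each row.
    return [row for row in FAKE_DB if term in row.lower()]


model = probe(PAGE, fake_submit)
print(model.most_common(3))
```

In this toy version, the guessed terms only ever surface the flight records, so the resulting counts suggest the database is mostly about flights out of Seattle. The hotel record never shows up, which hints at why guessing good query terms is the hard part.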


Of course, Halevy and Google aren't the only ones trying to solve the Deep Web integration problem. But as the reigning champ of search engines, Google may have the most at stake, because whoever solves this problem would become the next King of Search Engine Hill.


But there's more at stake here. As the article points out, this technology could be a major boon for businesses and the ever-elusive Semantic Web:

This level of data integration could eventually point the way toward something like the Semantic Web, the much-promoted - but so far unrealized - vision of a Web of interconnected data. Deep Web technologies hold the promise of achieving similar benefits at a much lower cost, by automating the process of analyzing database structures and cross-referencing the results.