Semantic Web technology makes data integration more flexible, reducing a lot of the work previously required for new integration and for revamping databases, according to Lee Feigenbaum, vice president of marketing and technology for Cambridge Semantics. He also co-chairs the W3C SPARQL Working Group; SPARQL is an RDF query language and is considered a key technology for the Semantic Web. Here, he explains who some of the main players are in this space, and why you should buy rather than build.
In part one of this interview, he explained the differences in semantic technologies and what’s changed in the past year.
Lawson: How does it change what you're able to do when you talk about BI and master data management?
Feigenbaum: There are a couple of key things that are going on here. As a precursor, generally speaking, this is an overlay technology and that’s super important to adoption. This is not people going in and ripping out their SQL databases and putting in Semantic Web databases. Instead, these are technologies that lie on top of existing systems, existing databases, and sort of virtualize the data into the Semantic Web format.
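As an illustration of the overlay idea, here is a minimal sketch in Python of exposing rows from an existing SQL table as subject-predicate-object triples without changing the underlying database. It is not how any particular product works: the `rows_to_triples` helper, the `http://example.org/` base URI, and the `account` table are all hypothetical, and a real deployment would more likely use a standard mapping language and an RDF library.

```python
import sqlite3

def rows_to_triples(conn, table, key_column, base="http://example.org/"):
    """Expose each row of `table` as (subject, predicate, object) triples.

    Hypothetical helper: the URI scheme below is an assumption made for
    illustration, not a standard mapping.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    triples = set()
    for row in cursor:
        record = dict(zip(columns, row))
        # The row's key becomes the subject identifier.
        subject = f"{base}{table}/{record[key_column]}"
        for column, value in record.items():
            if column != key_column and value is not None:
                # Each remaining column becomes a predicate.
                triples.add((subject, f"{base}{table}#{column}", value))
    return triples

# The existing relational database stays in place (made-up sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 'Alice', 250.0)")

account_triples = rows_to_triples(conn, "account", "id")
```

The source table is only read, never altered, which is the point of an overlay: the graph view sits on top of the system that already exists.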
The benefit of doing that is it gives you a very flexible data integration layer. What I mean by flexible integration layer is you can integrate different sources of data incrementally. So today I can integrate two databases and start using that and if two days from now I want to start integrating data from a third database or from spreadsheets, I can do that without having to build an entirely new data model.
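The incremental integration described here can be sketched with plain Python sets of triples. This is a simplified illustration, not a real triple store: the source names, the `cust:42` identifier, and the sample values are invented, and the sketch assumes all sources already share the same identifier for the same customer, which in practice takes mapping work.

```python
def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Day one: two sources integrated (hypothetical data).
crm = {("cust:42", "name", "Acme Corp"), ("cust:42", "region", "EMEA")}
billing = {("cust:42", "openInvoices", 3)}
graph = crm | billing

# Two days later: a spreadsheet is added. No existing model is rebuilt;
# the new triples are simply merged into the same graph.
spreadsheet = {("cust:42", "accountManager", "J. Doe")}
graph |= spreadsheet

# One query now spans all three sources.
facts = {p: o for _, p, o in match(graph, s="cust:42")}
```

Adding the third source is a set union rather than a schema migration, which is the property the incremental approach relies on.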
We had a conversation maybe a year ago with a big financial services company and they were telling us how they did an audit of all their traditional, relational, SQL-based data warehouses and they found that to do something as simple as add a single field to a data warehouse that was already deployed cost them $10,000, let alone any more complicated change like bringing in a whole new source of data. And what usually happens in IT, because of that, is that you have these long projects. You have projects that are 9, 12, 18 months where you have the long requirements gathering phase where you have to get all the stakeholders to the table and agree on exactly what you want the final integration project to look like, and then you go off and you build your warehouse and you define all of your ETL jobs from your different source systems and you get it up and running. And you use it for one or two use cases.
But then somebody comes along and they say, “Hey, this is great. Can we add this other database in so we can use this for this third and this fourth use case?” And IT says, “Well, we’ll put that on the to-do list for the next time we revisit this project,” which typically will be, you know, another two years down the road. Depending on what you're doing, these lifecycles for these data warehouses can be three years or eight years, right, but it’s a very slow process. It’s not responsive to change.
There are two key things Semantic Web tech is doing in the data integration world and both fall under the mantle of flexibility. The first is this idea that you can incrementally integrate things. You can do it ad hoc, as needed, in a matter of days or weeks instead of months and years. The second is that the graph model that the Semantic Web uses is a much more flexible way to represent information than the relational model. And what I mean by that is it can be used much more universally for data integration.
Say you're applying text analytics to financial disclosures. Companies file their disclosures with the SEC on a quarterly basis and you're going to run text analytics on that and pull out information that you want to put into your data warehouse along with information from your internal databases. You would basically have to say when you're designing your data warehouse, “Well, I want to get these four pieces of information out of my text analytics tool,” because the relational model is not sort of this universal way to represent information. It’s a very rigid, constrained way to represent information. So I need to make specific columns or specific dimensions in my database for those four pieces of information.
So then I go and I deploy my warehouse and I start running all my ETL jobs and running my text analytics and populating the warehouse and a month later, I’m saying, hey, these reports have a lot more interesting information, and some of the reports have this particular piece of information that others don’t, and wouldn’t it be great if I could capture some of the forward-looking statements that are in these reports? I also want to capture the overall sentiment of the report and sometimes there’s footnotes on those reports about changes in executive management at the company and I don’t have anywhere in my warehouse to put that information.
So I’m basically losing all of this information that I can’t easily map to my warehouse or my relational database. The universality or the flexibility that the Semantic Web approach gives you is the flipside of that. It’s the idea that I can take data, regardless of the source, and continue to extend my model and bring it in, even if I didn’t anticipate a particular kind of information up front when I started the project.
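The contrast in the disclosure example can be sketched as follows. This is an illustration under stated assumptions, not a real warehouse or text analytics tool: the column names, the `filing:acme-q1` identifier, and every extracted value are made up.

```python
# Columns fixed at design time (hypothetical warehouse schema).
WAREHOUSE_COLUMNS = ("revenue", "net_income", "eps", "filing_date")

def load_relational(extracted):
    """Keep only fields that have a predefined column.

    Returns the row plus the names of fields that had nowhere to go.
    """
    row = {c: extracted.get(c) for c in WAREHOUSE_COLUMNS}
    dropped = sorted(k for k in extracted if k not in WAREHOUSE_COLUMNS)
    return row, dropped

def load_graph(subject, extracted):
    """Store every extracted field as a triple, anticipated or not."""
    return {(subject, key, value) for key, value in extracted.items()}

# Output of a hypothetical text analytics run (invented values).
extracted = {
    "revenue": 1200,
    "net_income": 150,
    "eps": 1.5,
    "filing_date": "2012-03-31",
    "sentiment": "cautious",
    "executive_change": "new CFO",
}

row, dropped = load_relational(extracted)
triples = load_graph("filing:acme-q1", extracted)
```

In the fixed-column path, `sentiment` and `executive_change` are lost because no column was planned for them; in the graph path they are kept as two more triples.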
Lawson: I understand data modeling is the biggest part of most integration projects anyway, or a significant cost.
Feigenbaum: Oh, absolutely. David Wood, who is a colleague in the Semantic Web community, coined the phrase “Cooperation without coordination,” and the idea is traditionally you need to do exactly what you just said: that big data modeling exercise, right? And you need to do it all up front.
It’s not like you don’t have to do any data modeling in the Semantic Web world. You still need to say, “Hey, this information here represents an account,” and “This information here represents a trade,” and, “This represents a drug.” But you don’t need that upfront coordination. You can still get cooperation and you can still build integrated models that you can do multi-dimensional integrated analysis on top of. And that’s a big difference.
Lawson: So what does Cambridge Semantics sell? Can companies buy a Semantic Web solution?
Feigenbaum: We are an enterprise Semantic Web software vendor. We sell middleware and tools for making just the sort of high-level stuff I’m talking about. We sell connectors for mapping information from relational databases into a Semantic Web world, or taking spreadsheet data and mapping it to the Semantic Web world, or taking unstructured content and mapping it to the Semantic Web world, and then tools for your ordinary, non-technical business end users to consume all this information. So dashboard tools, visualization tools, reporting tools to bring information back out into, like, Excel, for instance.
Then there’s a bunch of plumbing that goes along with that; it’s not just about data integration and visualization. You also want to be able to act on the information you brought together and do things like workflows and alerts and kicking off services and stuff like that. So there’s a whole sort of middleware part to what we do.
We sell this software suite to our customers, who are very large enterprises. Our primary industries are pharma and finance. We work with companies like Merck and Johnson & Johnson; Biogen, that’s another big pharma company. Staples uses our stuff, and they use it for pretty diverse types of applications, but at a high level, the reason they choose the software is this improved flexibility to integrate very diverse data and do it in sort of reasonable timeframes and reasonable levels of effort.
But beyond Cambridge Semantics, there are a whole bunch of vendors that are selling the stuff, from other small guys to big players. Oracle has a version of the database that supports the Semantic Web standards. IBM announced a new version of DB2 a few months ago that supports the Semantic Web standards. Cray has a spinoff or a subsidiary company called YarcData (Yarc is Cray backwards). YarcData is basically using, you know, one of the Cray huge, in-memory supercomputers as a sort of Semantic Web appliance. There are a lot of vendors that are starting to sell this.
We talked about what’s changed in the past year or two. One of the things that’s changed is if I talk to somebody who’s looking to get started with this stuff, and wants to build it all, I tell that company unequivocally that that is now a mistake. There are enough options out there for both sort of full toolsets and sort of individual components that you could use to get started. It’s mature enough that you should just, you know, buy something from a vendor, or there are open source tools as well, and that’s how you should get started.