Q & A with Open Calais Guru Tom Tague
Deborah Gage
Tague also manages feedback from the 13,000 developers associated with Open Calais and talks to potential customers and partners. He claims he has 6,000 e-mails, plus a petabyte of Twitter. He joined Thomson Reuters two years ago as part of the publisher's acquisition of ClearForest, which developed the software for Open Calais. "The technology was amazing to me, but somehow it was not exploding in the market," Tague said, when asked why he joined ClearForest. "It seemed like an interesting problem to take on, to have technology looking for a problem and then to figure out the problem set it was applicable to." The software is free to all users, because, as Tague says, "the wider range of things people throw at it, the better." Tague talked with SemanticWeb.com's Deborah Gage.
Q: What is so special about Calais?
Tague: We take text and tear it apart, down to the most fundamental parts of speech. We understand nouns and verbs and patterns, and we send them through hundreds of thousands of rules so we can extract entities -- things, people, organizations. Unlike a lot of solutions, we identify these things and extract facts -- if we see John Doe, CEO of IBM, we've found a person-to-position relationship. We also understand events. If John Doe announced that Jane Doe will assume the position of CFO next July, that's a management-change event. We understand events, ranging from management changes to labor actions to earnings announcements to product recalls, and we've started to expand over the last year and a half into natural disasters and sporting events and album releases, things that are happening. We know there's an explosion of content and the noise level is high, so the more finely I can filter that to get it to me as an information consumer, the better.
Q: You started out working with financial publishers -- Thomson Reuters -- and now you're working with other publishers, like the Huffington Post and CBS Interactive. How is that different?
It's wildly different with other publishers. One of our longer-term strategic goals is to connect all of the business-relevant information in the world. But CBS and the Huffington Post are news, and the range of things that the other 13,000 developers throw at Calais is unbelievable -- from French novels to scripts to e-mails, and with every one of those things we learn a bit about how to handle content.
Q: What have you learned so far?
When we're looking at Web content, we all know there's a fairly low signal-to-noise ratio, but we're surprised at how low it is. The amount of content out there that is duplicated or derivative is overwhelming. It wouldn't surprise me if 90 percent of it was duplicate information. Also, it's pabulum, but the world is more complex than we thought.
The type of events and information that people are interested in tracking is incredibly diverse. Where we had the luxury for 10 years of carefully evolving out the software to understand bits and pieces of business domains, we now have to explore mechanisms to evolve it very rapidly. If somebody wants us to do a good job extracting information about fashion articles, and we've never seen those before, we've had to develop technology and procedures to get good in that new domain very rapidly. The degree of specialization in those domains is amazing -- fashion, museums, sports.
Q: Does Calais pass judgment on the quality or truth of information?
No, and it's an interesting challenge that everybody is facing in this space -- the trust issue. We look at content, analyze it, and send it back. We do things inside the document on what we think is important, but we're a simple trusting machine. I think the next level up the food chain is where issues of trust have to be addressed. If publishers know the source of documents they're sending us, they can use that to start to understand credibility around those documents. If the only things someone is sending us are curated news articles that they produced, they already know the trust relationship. But if they're scraping blogs, they'd better be pretty careful making decisions. Here's how Reuters has evolved to meet the trust issue: If we extract a company from an article, we go through a laborious project to understand which company it is, then create linked data assets, basically XML out there on the web, that says I found IBM, where is my info about IBM stored, and here's how to get it. The whole linked data world is exploding, and we will invest to make sure that if you come to us and we say we found IBM or this product, you'll be able to trust that. It's not like going to Wikipedia and wondering if somebody spammed the page.
Q: You've talked about Web 3.0 and how it needs to clean up the mess left by Web 2.0 -- that from a user's perspective, the experience on the Web right now is not good. Why not?
All I know is that content has gotten crazy. It's all over the place, there are different takes, it's hard to get a complete view of any story. That was hard in the days of newspapers -- you'd have to read five newspapers to get a complete story -- and it's been multiplied on the Web. I go here to be social and here to read news and here to get gossip and blogs. It's very fragmented, although for the tiny subset of the world's population who can use RSS and filters it's a little better. But that feels like bubble gum and baling wire to me. Shouldn't the Web be better than a newspaper? On the New York Times home page, underneath every story there's a widget that says, here's other related stories from non-New York Times media. It's a clumsy user interface, but at least they're trying to say we're not the only source of the news. I'd like to see more of that, but less clumsy.
Q: You've been hard on semantic search, which Microsoft and other companies have made big bets on. Why?
There are certain domains where embedding search has high value, but those domains are small -- good examples are pharmaco or real estate. But look at consumer search -- Google's got 75 percent of the market, Microsoft has 15 percent, and Yahoo 5 percent, or maybe it's the other way around, and you're left with a tiny little air bubble of marketplace available. The number two and number three players in the big search marketplace don't make money. I'm a technologist at heart, and we (get excited and) say this is so obvious, but we never step back and say, but nobody cares. I ran out of patience with the Paris Hilton example: will I get the entertainer or the hotel? It's bogus. Type "hotel paris Hilton," and you won't even see the entertainer on the first two pages of results.
There will be pieces built into search engines like Bing, and surely in Google, which is starting to add faceted navigation components to search so that if I search for jaguar, is it a car or animal, and the search engine might be smart enough to have two tabs. But it's fairly subtle and only relevant to a small subset of searches.
Q: What innovations in semantic technology are coming next?
I expect to see a rush of activity around the browser and the device. I think all it's going to take is a few examples to show how to create a compelling user experience, attach an ad and make money, and then there will be an explosion. Although my gut feeling is that it will be an explosion of low quality, because most of them don't have a lot to offer for that experience other than social networking around it. I think the other big thing will be the adoption of semantic technology in the back office. It's not about the beautiful user interface or dimensional navigation or cooler search engine optimization -- it's about workflow management, the automation of manual functions, letting one editor do the work of 1.5 editors in a publishing environment. That's what we're starting to see. It's simple triage ... if we can help an editor throw away 500 out of 1000 articles that may come in. The data coming in shows, in a very straightforward manner, that we can reduce the workload of monitoring corporate events by about 60 percent. The hope is that it frees up the editor to do more editorial things. But however it's used, it changes the cost and efficiency structure. Those are real businesses that might make money.
Q: What are the pitfalls of semantic technology -- what about privacy, for instance?
The biggest pitfall now is the chasm between expectations and the reality of what the technology can do. Most casual participants see it and say, my god, it's artificial intelligence. Pretty soon we won't even need writers -- we'll just aggregate Twitter. And I say, whoa, slow down.
One issue is expectations management. Also, as with any technology designed to deal with large amounts of information, there will be privacy and trust issues and we need to pay a lot of attention to those. From day one, we have erred on the side of privacy and transparency and not retaining data to make sure people are comfortable with the service. I think another thing that's interesting is that we're starting to see certain businesses where the barriers to the business are dramatically lowered because of the technology. For example, there are a lot of media monitoring companies, and you just went from hiring 100 editors and 500 more offshore to two guys in a garage buying a content feed and using Open Calais. We've seen more based on those than any other category. I think there will be other business models where the economics of starting in the marketplace will be dramatically lowered, but we haven't figured those out yet.