Why the Deep Web Needs the Semantic Web
Jennifer Zaino
But the two are more interrelated than the article suggests. To get some insight into whether the Deep Web and the Semantic Web are competing or complementary agendas, SemanticWeb.com conducted an email interview with Professor James Geller of the New Jersey Institute of Technology and Professor Soon Ae Chun of the College of Staten Island. They were the chairs of last year's "The Semantic Web Meets the Deep Web" workshop in Washington, D.C., and have authored a number of papers and articles on the topic.
As background, the professors note that the Deep Web, as defined by M. K. Bergman, deals with Web pages that traditional search engines cannot "see," either because they are created on the fly, or because they are hidden in some way. For example, the Web pages of most e-commerce sites are built dynamically based on backend databases. A ticketing Web site of an airline, for instance, stores information about flights in such a database and dynamically creates a display according to the needs of the traveler. Another component of the Deep Web consists of files in formats that are not understood by common search engines.
Because the Deep Web is not indexed by search engines, a user needs to visit a Web site and use its front-end to find out whether a product or service that he is interested in is available at all, the experts explain. That potentially leads to inefficiencies, such as loss of productivity or wasted time. A user who wants to book a flight to Korea, for example, could wind up losing time searching for such a flight on the Web site of a carrier that doesn't travel there. But that could be avoided if the cities stored in the backend databases of all airlines were indexed, because that carrier would not even show up in Web search results for "flight tickets to Korea." If the contents of the Deep Web could be indexed, search would be much more effective, the professors say.
In fact, they conclude that the only reason e-commerce is successful is that a few large suppliers in every category have everything -- Amazon, for instance. If the Web consisted of thousands of mom-and-pop-size e-stores, the lack of support for Deep Web indexing would have made e-commerce a resounding failure, they claim. Our Q&A picks up here:
Q:So it's correct to conclude that on the Deep Web, it is possible to get to information -- but not very easily?
A:As mentioned, the Deep Web can be accessed through forms, where a person enters a request or issues a query to retrieve Deep Web data. This is the "normal," manual way to access the Deep Web. For some Web sites it is possible to extract information from their backend databases by using repeated "robotic" queries. But many Web sites use various techniques to make this impossible. Ticketmaster is especially known for that. A human user has to type in a word, displayed to her in a distorted manner, before the system accepts any request from this user. Ticketmaster had to do this to limit the number of tickets bought by each user. But not all Web sites are so hostile. Using a robot program, we were able to find that a certain airline flies to about 1,200 airports, what their names are, and which cities they are close to.
Getting information from the Deep Web is in general not a trivial task. A separate robot program has to be written for every Web page, and if the Web page owner changes the page layout, the robot usually breaks. Some Deep Web providers may offer an API or Web service to deliver their Deep Web data. These Web services have to be described and their interface information has to be published for discovery.
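To make the brittleness of such robots concrete, here is a minimal sketch in Python of the "scraping" half of a robot program. It is not the professors' code. The result page, its table layout, and the names SAMPLE_RESULT_PAGE, DestinationScraper and scrape_destinations are all assumptions made for illustration, and a real robot would first submit the site's form to obtain the page.

```python
from html.parser import HTMLParser

# Hypothetical result page, standing in for what a form query might return.
SAMPLE_RESULT_PAGE = """
<table id="destinations">
  <tr><th>Code</th><th>Airport</th><th>Nearby city</th></tr>
  <tr><td>ICN</td><td>Incheon International</td><td>Seoul</td></tr>
  <tr><td>CDG</td><td>Charles de Gaulle</td><td>Paris</td></tr>
</table>
"""


class DestinationScraper(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(tuple(cell.strip() for cell in self._row))
            self._row = None


def scrape_destinations(page_html):
    """Return (code, airport, city) tuples found in the page's table rows."""
    parser = DestinationScraper()
    parser.feed(page_html)
    return parser.rows
```

The scraper is tied to the assumption that results arrive as table rows. If the site owner switched to, say, a list of div elements, scrape_destinations would return nothing until the robot was rewritten, which is the maintenance burden described above.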
Q:Why is the Semantic Web potentially an answer to these problems?
A:The big vision of the Semantic Web is to automate tasks that humans do on the WWW. This requires agent programs that find resources (pages or services) on the Web. These agent programs will only be effective if they have access to information and knowledge. The standard repository for knowledge in computers is called an "ontology." Ontologies make it possible to bridge the gap between user expressions and raw data. For example, if a user requests vacation flights to "Southeast Asia," most airline systems would not be able to respond, because they contain information about flights to Korea, Japan, China, Vietnam, etc., but they lack the knowledge that these countries are in Southeast Asia. An ontology would contain this kind of knowledge and can be used to correctly translate such a query into something that the database would understand.
In our research, we define the "Semantic Deep Web" as the combination of Semantic Web and Deep Web techniques and structures. The connection between the Deep Web and the Semantic Web is two-fold. First, it is possible to include indexing information for relevant Web sites directly in a Semantic Web ontology. Thus, a query for flights to Southeast Asia would be directly translated into links to airlines flying to countries in Southeast Asia. Second, the task of building ontologies is very difficult. Many researchers have attempted to build ontologies automatically by natural language processing techniques. But natural language processing is still not a proven technology. English is hard to process. Databases, by contrast, are very well-behaved. If our program gets access to a database column that contains Paris, London, Rome, Seoul, and Vienna, then we can be pretty sure that every item in this column is a city. That allows us to automatically build an ontology that has much more knowledge about cities than the average person.
Furthermore, we can link the knowledge about cities directly to the airline, allowing us to only consider an airline that flies to our desired destination. In the long run, what we hope to see is that there will be a new generation of Ontology-Enabled Web Browsers. Initially the user would select an interest area, and the Web browser would load any known relevant ontologies. Later on, this process would be automated. However, there are still challenging research problems on the way to this goal.
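The two ideas in this answer, building ontology entries from a well-behaved database column and using the ontology to translate a region query into matching airlines, can be sketched as follows. This is an illustrative sketch in Python, not the research prototype. The routes table, the airline names AirA and AirB, the REGION_OF_COUNTRY facts, and the function names are all hypothetical, and the ontology is reduced to three plain dictionaries.

```python
import sqlite3

# Region knowledge a knowledge engineer might supply by hand (illustrative).
REGION_OF_COUNTRY = {
    "Vietnam": "Southeast Asia",
    "Thailand": "Southeast Asia",
    "France": "Europe",
    "Italy": "Europe",
}


def build_demo_db():
    """Create a hypothetical airline backend table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE routes (airline TEXT, city TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO routes VALUES (?, ?, ?)",
        [
            ("AirA", "Hanoi", "Vietnam"),
            ("AirA", "Bangkok", "Thailand"),
            ("AirB", "Paris", "France"),
            ("AirB", "Rome", "Italy"),
        ],
    )
    return conn


def build_ontology(conn, region_of_country):
    """Turn database columns into ontology facts.

    Every value in the city column is assumed to be a City, every value in
    the country column a Country; each city is linked to the airlines that
    serve it.
    """
    ontology = {
        "instance_of": {},
        "located_in": dict(region_of_country),
        "served_by": {},
    }
    query = "SELECT airline, city, country FROM routes"
    for airline, city, country in conn.execute(query):
        ontology["instance_of"][city] = "City"
        ontology["instance_of"][country] = "Country"
        ontology["located_in"][city] = country
        ontology["served_by"].setdefault(city, set()).add(airline)
    return ontology


def airlines_flying_to(ontology, place_name):
    """Return airlines serving a city, or any city located in a country/region."""

    def is_within(place):
        # Follow located_in links upward (city -> country -> region).
        while place in ontology["located_in"]:
            place = ontology["located_in"][place]
            if place == place_name:
                return True
        return False

    airlines = set()
    for city, carriers in ontology["served_by"].items():
        if city == place_name or is_within(city):
            airlines |= carriers
    return sorted(airlines)
```

With this data, a call such as airlines_flying_to(build_ontology(build_demo_db(), REGION_OF_COUNTRY), "Southeast Asia") considers only AirA, even though no row in the table mentions "Southeast Asia". The region knowledge comes from the ontology, while the city and airline facts come from the database.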
Q:Where can this lead for business or scientific communities, as well as general consumers?
A:The issue of dynamic composition requires bringing in semantic issues. Businesses rely on data to generate meaningful information to make timely, accurate decisions that help them stay competitive. Many business intelligence (BI) tools thus need to identify appropriate data sources, extract relevant data from the sources, and integrate data items to generate knowledge. With the Semantic Web, businesses also expect automated agents to identify the relevant Web sources for searching and mining their contents.
The ability to access, combine and fuse the hidden data from diverse Deep Web sources is required, not only in business intelligence and decision making but also for scientific discovery and even for personal decision making. For instance, scientists need to be able to access large, diverse data sets that are locked into separate Deep Web sources in order to discover bigger patterns and interesting trends. Thus we need a search engine that can access Deep Web data. With the Semantic Web vision, automated software agents will perform reasoning and inference using knowledge base-like ontologies to find appropriate data services and these services will then be used for creating dynamic composite business processes. The current Web services that serve data from the Deep Web -- we call them "Deep Web Services" -- can be described in standard WSDL (Web Service Description Language) and published and discovered using UDDI or local directory services, which basically describe the input and output interface of a Web service. A generic search engine should be able to search and locate the best fitting Web services in order to deliver the appropriate Deep Web data. This requires a rich annotation of Deep Web services that illustrates a Deep Web site's data contents, not just the service's input and output protocol, and additional information on the pragmatics -- that is, the intended usage of the Deep Web Service.
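The difference between describing only a service's interface and also annotating its contents can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not WSDL or UDDI. The DeepWebService record, the two example services, and the functions match_by_interface and match_by_content are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DeepWebService:
    name: str
    inputs: tuple    # interface: roughly what an input/output description covers
    outputs: tuple
    content: dict = field(default_factory=dict)  # annotation: concept -> values held
    intended_use: str = ""                       # pragmatics: what the service is for


# Two hypothetical services with identical interfaces but different data.
REGISTRY = [
    DeepWebService(
        name="AirA flight search",
        inputs=("origin", "destination", "date"),
        outputs=("flight_list",),
        content={"destination_city": {"Hanoi", "Bangkok"}},
        intended_use="ticket booking",
    ),
    DeepWebService(
        name="AirB flight search",
        inputs=("origin", "destination", "date"),
        outputs=("flight_list",),
        content={"destination_city": {"Paris", "Rome"}},
        intended_use="ticket booking",
    ),
]


def match_by_interface(registry, inputs, outputs):
    """Find services by input/output signature alone."""
    return [
        s.name
        for s in registry
        if s.inputs == tuple(inputs) and s.outputs == tuple(outputs)
    ]


def match_by_content(registry, concept, value):
    """Find services whose content annotation says they hold the value."""
    return [s.name for s in registry if value in s.content.get(concept, ())]
```

Matching on the interface returns both services, because both accept an origin, a destination and a date. Only the content annotation lets a search engine or agent see that a request about Hanoi should go to the first service and not the second.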
Q:What research approaches, techniques and methodologies do you believe are needed to model, query, extract and annotate Deep Web resources to provide a semantic layer on top of the Deep Web?
A:In our approach we start by identifying a topic area of interest. Right now we are looking specifically at "famous people," such as singers, athletes and, dare we say it, computer science researchers. The first step consists of identifying Web sites with backend databases that contain information about such people, i.e., in a structured format. Then a robot program needs to be written that queries this Web site and "scrapes the screen" to extract and store the results. Some Web sites are cooperative by offering near misses. Others only return perfect matches or error messages. Yet others block you, as mentioned before. The results are then incorporated into an ontology that was partly built by a knowledge engineer. Thus the knowledge engineer determines that she is interested in cities, singers, etc., but the robot program retrieves the cities and singers in numbers that would be difficult for her to collect by hand.
And, as mentioned [in the example of flight research above], we have built a research prototype dealing with flights, airports and cities. (The work on the airline Semantic Deep Web system was performed by Dr. Yoo Jung An as part of her PhD dissertation at NJIT. Dr. An is now a visiting professor at Fairleigh Dickinson University.) The "famous person" system was started last September. We have not deployed any Semantic Deep Web applications. As mentioned, this is work-intensive research. But we are quite sure there will be major progress in this area, at least on the Deep Web side. Google has intensified its efforts to search and index the Deep Web.
Q:How far off are we from being able to exploit Semantic Web technologies to uncover Deep Web assets on a large scale?
A:The answer to this really depends on two issues. One issue is how many resources can be invested into such a task. As an academic research organization we are limited in this respect. If a company with the deep pockets of Google gets into the Deep Web, we will see results soon. On the other hand, ontologies are still primarily viewed as research vehicles, rightfully or not. Predictions in this general area have often turned out to be wrong, so I am reluctant to make any.
Q:Are there any concerns/challenges to this idea -- for example, issues around content ownership? That is, some of these Deep Web assets might be purposefully locked away from general view -- so do the keepers of that content have to agree to buy into this vision and actively support it, or are there legitimate ways around that? And why should content owners agree in the first place -- what is the benefit for them?
A:This is a complex issue. It is quite obvious that a company would not want a competitor to mine a list of its suppliers from its database. Nor would a bank want its competitors to get a list of its customers. On the other hand, every company wants to be listed by Google, and as high up as possible. No company I know of has ever refused to get their pages searched by Yahoo, Google, etc. So we need to educate the marketplace to extend this attitude to some of their backend data. Just as there is a big Open Source movement, we need an Open Data movement. Companies will need to segregate their backend data into private and public components, and then they should "invite us" to search and index their public Deep Web components.
There is a need for a certain critical mass; after that the bandwagon effect will take care of the rest.