The Pedantic Web Group to the RDF Rescue!
Jennifer Zaino
How’s your RDF? If it could be in better shape, some folks may be able to help: The Pedantic Web Group was recently formed by researchers at the Digital Enterprise Research Institute (DERI) and Institute AIFB at the Universitaet Karlsruhe (see previous article).
SemanticWeb.com: Have you and your colleagues observed a growing trend of RDF data being published? What might that lead you to conclude about the growing maturity of the semantic web?
Hogan: There is certainly an encouraging trend of growth in RDF data -- in terms of quality, heterogeneity and quantity -- being published to the Web. Back in 2005 when I started working in the area of Semantic Web research, RDF Web data consisted of a number of interlinked FOAF profiles and some data published under the auspices of various research projects or geek curiosity. The quality of the data was, as I remember, quite poor. Publishers were reluctant to use URIs to name their resources, vocabularies were replete with errors, and interlinking between datasets was either poor or nonexistent. My own FOAF file, at that time, was no different; they were certainly more innocent times.
Jump to late 2009 and we've come a long way. More specifically, under the pragmatic guidance of the Linked Data movement, RDF data published on the Web has come a long way. The Linked Data movement has been integral to the maturation of RDF Web publishing, not merely by promoting a set of pragmatic best practices, but also by refocusing efforts on producing data: before, data was often published in RDF as an afterthought, or for the purposes of a specific application. Now, Linked Data advocates publishing RDF data on the Web as a worthwhile endeavor in itself. As such, we are now on our way to solving the chicken-and-egg problem with the Semantic Web with respect to which comes first: the data or the applications. Now is an exciting time for R&D into applications which can exploit the fruits of Linked Data. As compared to four or five years ago, data quality has improved as, for example, publishers understand the importance of using URIs to name things, and that those URIs should be dereferenceable.
Quantity and heterogeneity have also increased, as the March '09 LOD cloud [refer http://linkeddata.org] can attest; data is being published by governmental and commercial entities and is becoming more 'general-interest'. And the trend is continuing; for example, there were two exciting announcements at ISWC last week: that Drupal 7 core will support SIOC/RDFa exports by default and that the New York Times is planning to produce Linked Data exports.
SemanticWeb.com: How did the idea for Pedantic-web.org come about -- what had you and your colleagues who formed the group been observing around published RDF that led you to think there are problems to be addressed if the vision of the web of data is to be fully realized?
Hogan: There are a number of researchers in DERI Galway, like myself, working on various applications and research avenues to locate, crawl, clean, reason, index, and provide search and browsing over RDF data published on the Web; most of this work has been grounded by the requirements of end-user applications like Sindice and SWSE and more recent systems like Sig.ma and VisiNav; working for the purposes of such end-user applications keeps a researcher like myself honest.
Unsurprisingly, Web data is quite noisy (if not sometimes deafening): throughout various incarnations of various systems worked on by various researchers, this undeniable fact has reared its ugly head to me, and numerous other colleagues, in many unusual error logs, non-terminating tasks, hours spent debugging data, etc. We initially thought about setting up a support group, but thought better and set up the Pedantic Web Group instead.
There are a number of ways to tackle such noise: identify and ignore the data that causes such problems; build multitudinous workarounds to make the system Web-tolerant; drop features of the system that cause such problems; and request the original publisher to fix the data.
We've mainly relied on the first three approaches over the years, but in various discussions with colleagues, we've come to see the bigger picture -- outside of our respective projects -- and realize the importance of engaging with publishers, educating them with respect to commonly observed errors, contacting them to improve the quality of their data, and coordinating efforts to make related datasets more inter-operable with each other.
SemanticWeb.com: What do you think leads to “broken data”?
Hogan: Honestly, the main factor that leads to broken data is the current lack of prominent applications which use that data. When people were creating HTML documents, they could check their handiwork in the browser-of-the-day and if something didn't look okay, they would fix it. Publishing RDF data has previously been a secondary activity for many people, be it for academic purposes or simple curiosity. Publishers of RDF have not been subject to the tangible results of their errors: even if a publisher mistakes
Another major factor is the lack of purposeful education, tools and support for inexperienced publishers. RDF, as a data model centering around triples, is inherently fairly straightforward. Businesses and organizations should not be afraid to get their hands dirty in this respect: publishing RDF is only complex if you let it be. In fact, publishers do certainly tend to get it almost right; in our experience, most errors are easily pinned down and easily fixed -- assuming the co-operation of the publisher/maintainer. Our purpose, as pedants, is to worry about the small things, to try to educate and support publishers, and to show them that publishing RDF can be surprisingly straightforward.
SemanticWeb.com: What are the “risks” of bad RDF data, so to speak? Does it potentially compromise the semantic web?
Hogan: From a personal perspective, the risks of bad RDF data are excessive consumption of coffee and cigarettes, restless nights, irritability, and hours spent debugging and hacking at the expense of enjoying what's left of my youth.
With respect to the bigger picture, noisy data can slow down development and adoption of applications which rely on RDF Web data, significantly impact the usability of such systems, and reduce the precision of their results. Applications may have to disable certain features which, although theoretically useful over the data, are not practically tolerant to noise; such applications must be developed in a robust manner, and with this comes a certain compromise. However, for example, Google has shown that it is possible to be tolerant to Web data -- and even deliberate spamming -- with little or no compromise. In (more than) a couple of words, noisy Web data contributes to the barriers-to-entry associated with Semantic Web adoption -- and may frighten potential adopters away -- but should be less of a problem for mature applications.
SemanticWeb.com: As a more specific example of the above, perhaps you can cite how poor quality data affects some of your and your colleagues’ own projects that depend on the data?
Hogan: Hmm… how long do you have? Well, to start, if you do a Google search for the string "08445a31a78661b5c746feff39a9db6e4e2cc5cf", Google will kindly return you nearly a million results (at the moment). Almost all of the results are RDF/FOAF descriptions of people. So what's the string? It's the anti-spam encoded SHA1 value [Ed note: SHA-1 being the best established of the existing SHA cryptographic hash functions] of the "
Had I read the above paragraph in late 2006, I could have saved myself a lot of stress.
At the time, we were working on a means of performing "object consolidation" for the SWSE system: a method for using uniquely-identifying properties like
As another example, the DBpedia exporter for Wikipedia data uses the property
More recently, the New York Times export of Linked Data could not be processed by Sig.ma due to a problem with content negotiation on the NYT servers -- similar content negotiation problems are quite common on the Web. The Sig.ma team takes pride in offering exploration of up-to-date RDF data as soon as it is released on the Web: when interesting new data -- such as that exported by the NYT -- doesn't show up, users will inevitably be disappointed.
SemanticWeb.com: In what ways will your group be able to help solve these problems?
Hogan: Our primary objective is to increase the quality of RDF data published on the Web: over the years, we've collected a "laundry list" of problems we would like sorted, and our first tentative steps have been to contact publishers informing them of errors that exist in their data, how the errors might have arisen, why they are problematic, and some suggested fixes. We see this as the main focus of the group: to contact publishers of data with known problems and request fixes. Oftentimes, simply showing the publisher that someone cares about the quality of their data can provide them enough impetus to co-operate with us and fix issues.
Of course, we understand that there are many other people in industry and academia lamenting into coffee mugs late at night for similar reasons to us; we encourage anyone who finds an error or interoperability issue with a dataset to not only contact the publisher, but to CC the pedantic-web mailing list and coordinate with us. In this respect, we see ourselves as a community-driven effort, and as the go-to point for anybody out there who encounters problems with RDF publishing in their applications.
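Ed note: To make the "object consolidation" problem mentioned earlier in this answer more concrete, here is a minimal sketch in Python. It is not the SWSE implementation. The property name `foaf:mbox_sha1sum`, the `ex:` identifiers, the all-zero hash and the threshold of two subjects are all assumptions made for illustration; only the long hash string is taken from the interview. The idea is to group resources that share a value for a supposedly uniquely-identifying property, and to flag values shared by more resources than one would expect.

```python
from collections import defaultdict

# Assumed property name; the interview's own example did not survive.
IFP = "foaf:mbox_sha1sum"
# The value quoted in the interview, shared by many unrelated FOAF profiles.
SHARED_VALUE = "08445a31a78661b5c746feff39a9db6e4e2cc5cf"

# Hypothetical (subject, property, value) triples.
TRIPLES = [
    ("ex:alice", IFP, SHARED_VALUE),
    ("ex:bob", IFP, SHARED_VALUE),
    ("ex:carol", IFP, SHARED_VALUE),
    ("ex:dave", IFP, "0" * 40),        # made-up hash
    ("ex:dave-work", IFP, "0" * 40),   # same person, second profile
]


def consolidate(triples, ifps=frozenset({IFP})):
    """Group subjects by the value they give for an identifying property."""
    groups = defaultdict(set)
    for subject, prop, value in triples:
        if prop in ifps:
            groups[(prop, value)].add(subject)
    return groups


def suspicious_values(groups, max_subjects=2):
    """Return the (property, value) keys shared by more than max_subjects subjects."""
    return sorted(key for key, subjects in groups.items()
                  if len(subjects) > max_subjects)


GROUPS = consolidate(TRIPLES)
SUSPICIOUS = suspicious_values(GROUPS)
```

With this toy data, the two `ex:dave` profiles are grouped as intended, while the value quoted in the interview lands in `SUSPICIOUS` because three unrelated people share it. A naive consolidator without such a check would treat all three as one person.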
Equally important as contacting individual publishers about errors is, of course, raising awareness with respect to publishing high-quality data, and avoiding common errors. A large part of this takes place in discussions like this, communication through the mailing list, presence at conferences, etc. Hand-in-hand with awareness comes education. Once we make people more conscious of the importance of error-free publishing, we need pragmatic educational material on how to achieve that goal: for example, discussion on frequently observed errors, lists of validation tools, common misconceptions in RDF, etc. We still have a lot to do in this regard.
SemanticWeb.com: Is this happening on a voluntary basis?
Hogan: This is on a voluntary -- albeit self-interested -- basis: the better the data is, the more convincing the tools we are researching and developing will be to eagle-eyed audiences. We have no plans for creating a "service" at the moment; we would like to see publishers creating better quality RDF data more autonomously, using various tools and material we currently provide and eventually plan to provide. Improving data quality will probably happen naturally, but we hope to accelerate the process a bit.
At this stage of Semantic Web adoption, community-driven efforts along our lines are much more beneficial than commercially driven efforts: I don't personally believe that there is currently much of a market for the type of service we provide. We are trying to leverage people's passion for RDF publishing, as opposed to their wallets (not that the latter is, by any means, a bad thing -- it's just a little early).
SemanticWeb.com: Have you had any interest yet and if so how many projects are you and your colleagues already tackling? And do they mostly come from business, academic, individual user domains?
Hogan: We now have 62 pedants subscribed to our mailing list as of Nov. 5, '09.
Most of these memberships have come from word-of-mouth, Twitter, social networks, and presence at ISWC. We've contacted just over a dozen publishers from various backgrounds and with various issues, ranging from one individual who copied and pasted a member's FOAF file and forgot to change some important values; to businesses like the New York Times; to academics and vocabulary maintainers for issues in FOAF, vCard, iCal, OpenCyc, QDOS; to two exporters of LastFM data with interoperability issues; etc. As a start, we're generally tackling more prominent publishers with more prominent errors. To our own immeasurable relief, publishers largely respond, whether to report a quick-fix, to seek further clarification, to discuss possible fixes or difficulties, or to assure us that they are aware of the issue and that it will be fixed in an impending update.
It is important to note that some of these exporters publish millions of documents about millions of entities; some of the vocabularies are used in tens of millions of documents on the Web. Simple fixes to such exporters and vocabularies -- which can sometimes only take a few minutes -- can have an immediate and positive effect on the quality of millions upon millions of documents describing millions upon millions of resources, and can indirectly improve the quality of my sleep.
SemanticWeb.com: Do you see any efforts underway in the industry that might help users get RDF data right the first time so that at some point in the future maybe you and your colleagues won’t have to do as much cleanup?
Hogan: Awareness of the issues is important, and will spread. Also, tools will naturally mature, and publishing will become more automatic and stable. For example, take Drupal exporting SIOC (along with existing exporters for WordPress and phpBB); such systems offer RDF exports with little or no intervention on the part of the content provider.
One important effort in helping current publishers to help themselves is offering a set of comprehensive validators. As an analogy, the RDF/XML validation service provided by the W3C is well-known and well-used: in our experience, RDF/XML syntax errors are relatively rare, and since RDF/XML is not exactly an intuitive syntax (certainly for me: only after spending two weeks writing an RDF/XML parser could I understand all of it), we would reasonably assume that the presence of a well-known validator is responsible for the large percentage of valid RDF/XML documents on the Web.
SemanticWeb.com: What best practices might you recommend users follow?
Hogan: Besides well-known -- and sometimes difficult-to-comprehend -- W3C primers and best-practice documents, we have provided a document outlining frequently observed problems in RDF publishing and a list of validation tools, and intend to provide a document discussing common misconceptions relating to RDF. Our material is aimed more at people who are already familiar with, or have already tried, RDF publishing to some extent.
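Ed note: In the spirit of publishers helping themselves, the content negotiation problems mentioned in the interview are one thing a publisher could plausibly check on their own. The following is a rough sketch using only the Python standard library, not a tool provided by the Pedantic Web Group. The function names, the list of accepted media types and the example URL are assumptions chosen for illustration. It asks a server for RDF/XML via the `Accept` header and then checks whether the returned `Content-Type` is an RDF media type.

```python
import urllib.request

# Assumed set of media types to count as "RDF"; adjust as needed.
RDF_MEDIA_TYPES = {
    "application/rdf+xml",
    "text/turtle",
    "application/n-triples",
}


def build_rdf_request(url):
    """Build a request that asks the server for RDF/XML."""
    return urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})


def is_rdf_content_type(header_value):
    """Check a Content-Type header value, ignoring parameters such as charset."""
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in RDF_MEDIA_TYPES


def serves_rdf(url, timeout=10):
    """Fetch url (following redirects) and report whether RDF came back."""
    with urllib.request.urlopen(build_rdf_request(url), timeout=timeout) as response:
        return is_rdf_content_type(response.headers.get("Content-Type", ""))
```

A publisher would call something like `serves_rdf("http://example.org/id/thing")` against one of their own resource URIs (the URL here is a placeholder). A result of `False` suggests the server is returning HTML or some other format even when RDF is explicitly requested, which is the kind of issue a consuming application would trip over.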