Nathan's Blog: Hypertext, Hypermedia and the Semantic Web: What is all this Semantics stuff anyway?

What is the ‘Semantic Web’?

‘The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. The first steps in weaving the Semantic Web into the structure of the existing Web are already under way. In the near future, these developments will usher in significant new functionality as machines become much better able to process and “understand” the data that they merely display at present’, (Berners-Lee et al 2001).

What are the design principles of the Semantic Web?

‘The essential property of the World Wide Web is its universality. The power of a hypertext link is that “anything can link to anything”... Like the Internet, the Semantic Web will be as decentralized as possible... Decentralization requires compromises: the Web had to throw away the ideal of total consistency of all its interconnections, ushering in the infamous message “Error 404: Not Found” but allowing unchecked exponential growth’, (Berners-Lee et al 2001).

We can trace this property back to the original Web’s design principles, particularly the Web’s ability to record random associations between objects.

‘The Web was designed to be a universal space of information, so when you make a bookmark or a hypertext link, you should be able to make that link to absolutely any piece of information that can be accessed using networks. The universality is essential to the Web: it loses its power if there are certain types of things to which you can’t link’, (Berners-Lee 1998).

‘The second part of the dream was... The computer re-enters the scene visibly as a software agent, doing anything it can to help us deal with the bulk of data, to take over the tedium of anything that can be reduced to a rational process, and to manage the scale of our human systems’, (Berners-Lee 1998).

How will the Semantic Web work?

‘For the semantic web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning’, (Berners-Lee et al 2001).

‘Traditional knowledge-representation systems typically have been centralized, requiring everyone to share exactly the same definition of common concepts such as “parent” or “vehicle”. But central control is stifling, and increasing the size and scope of such a system rapidly becomes unmanageable’, (Berners-Lee et al 2001).

‘For example, a genealogy system, acting on a database of family trees, might include the rule “a wife of an uncle is an aunt”. Even if the data could be transferred from one system to another, the rules, existing in a completely different form, usually could not’, (Berners-Lee et al 2001).

Sir Tim Berners-Lee’s point here is that data is usually stored on a range of different systems, and the ‘semantic’ rules that define objects are held in whatever format each information storage system happens to use. When data moves from one information system to another, the rule sets usually cannot travel with it.
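To make the genealogy rule concrete, here is a minimal Python sketch of ‘a wife of an uncle is an aunt’ applied to a couple of facts. The names, the relation labels (wife_of, uncle_of, aunt_of) and the infer_aunts function are my own assumptions for illustration and do not come from any real genealogy system. Notice that the rule lives in the program rather than in the data, which is exactly why it would not survive a move to another system.

```python
# Hypothetical family facts stored as (subject, relation, object) tuples.
facts = {
    ("Bob", "uncle_of", "Carol"),
    ("Alice", "wife_of", "Bob"),
}


def infer_aunts(facts):
    """Apply the rule: a wife of an uncle is an aunt."""
    inferred = set()
    for wife, relation, husband in facts:
        if relation != "wife_of":
            continue
        # Look for anyone the husband is an uncle of.
        for uncle, other_relation, relative in facts:
            if other_relation == "uncle_of" and uncle == husband:
                inferred.add((wife, "aunt_of", relative))
    return inferred


aunts = infer_aunts(facts)
print(aunts)
```

The facts could be exported to a file easily enough, but the logic inside infer_aunts would have to be rewritten by hand for any other system.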

‘Moreover, these systems usually carefully limit the questions that can be asked so that the computer can answer reliably or answer at all. The problem is reminiscent of [Kurt] Gödel’s [incompleteness] theorem from mathematics: any system that is complex enough to be useful also encompasses unanswerable questions... Semantic Web researchers, in contrast, accept that paradoxes and unanswerable questions are a price that must be paid to achieve versatility. We make the language for the rules as expressive as needed to allow the Web to reason as widely as desired’, (Berners-Lee et al 2001).

‘Early in the Web’s development, detractors pointed out that it could never be a well-organized library; without a central database and tree structure, one would never be sure of finding everything. They were right’, (Berners-Lee et al 2001).

‘The challenge of the Semantic Web, therefore, is to provide a language that expresses both data and rules for reasoning about the data and that allows rules from any existing knowledge-representation system to be exported onto the Web’, (Berners-Lee et al 2001).

The earlier genealogy example is particularly close to my heart: GEDCOM is certainly a ‘traditional knowledge representation system’. New genealogy formats are being created, each with its own way of defining data. We must therefore find a way of associating data...

eXtensible Markup Language (XML)

XML is a markup language: tags are inserted into a document to label the data they enclose. It was developed by an XML Working Group formed under the auspices of the World Wide Web Consortium (W3C) in 1996, (W3C 2006).

‘XML’s power comes from the fact that it can be used regardless of the platform, language, or data store of the system using it to expose datasets’, (Evjen et al 2007).

‘XML is considered ideal for data representation purposes because it enables developers to structure XML documents as they see fit. For this reason, it is also a bit chaotic. Sending self-structured XML documents between dissimilar systems doesn’t make a lot of sense – it requires custom building of both the exposure and consumption models for each communication pair’, (Evjen et al 2007).

So really, everyone can create their own definition of how to represent data using XML. Again, the genealogy developers are doing exactly this, so how will we associate data held in their different formats?
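As a rough illustration of that ‘custom building ... for each communication pair’, the sketch below uses Python’s standard xml.etree.ElementTree module to read the same person from two made-up genealogy formats. Both XML layouts, the person’s details and the two reader functions are hypothetical; the point is only that each self-defined structure needs its own hand-written reader.

```python
import xml.etree.ElementTree as ET

# Two hypothetical genealogy formats describing the same person.
format_a = "<person><name>Ada Smith</name><born>1850</born></person>"
format_b = '<individual fullName="Ada Smith"><birth year="1850"/></individual>'


def read_format_a(xml_text):
    """Format A keeps its values in child elements."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        "birth_year": int(root.findtext("born")),
    }


def read_format_b(xml_text):
    """Format B keeps its values in attributes."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("fullName"),
        "birth_year": int(root.find("birth").get("year")),
    }


person_a = read_format_a(format_a)
person_b = read_format_b(format_b)
print(person_a == person_b)
```

Both documents are perfectly valid XML, yet nothing in either one tells a program that name and fullName mean the same thing.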

Resource Description Framework (RDF)

In 1997, the W3C defined the first Resource Description Framework specification. It became a W3C recommendation in 1999. But what exactly does it do?

‘Meaning is expressed by RDF, which encodes it in sets of triples, each triple being rather like the subject, verb and object of an elementary sentence. These triples can be written using XML tags. In RDF, a document makes assertions that particular things (people, Web pages or whatever) have properties (such as “is a sister of”, “is the author of”) with certain values (another person, another Web page). This structure turns out to be a natural way to describe the vast majority of the data processed by machines’, (Berners-Lee et al 2001).

‘Subject and object are each identified by a Universal Resource Identifier (URI), just as used in a link on a Web page. (URLs, Uniform Resource Locators, are the most common type of URI). The verbs are also identified by URIs, which enables anyone to define a new concept, a new verb, just by defining a URI for it somewhere on the Web’, (Berners-Lee et al 2001).

So, as a genealogist, I can use one URI (a web page) to represent one person and another URI to represent another person, then link the two together using an RDF file at an intermediate location (a third URI) that defines their relationship.
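The subject, verb and object structure can be sketched in a few lines of Python. This is not real RDF syntax or an RDF library, just plain tuples, and every URI below is a hypothetical one on the reserved example.org domain. The objects_of helper is likewise my own invention for the example.

```python
# All URIs here are hypothetical and use the reserved example.org domain.
ALICE = "http://example.org/people/alice"
BOB = "http://example.org/people/bob"
SISTER_OF = "http://example.org/relations/isSisterOf"

# Each triple reads like an elementary sentence: subject, verb, object.
triples = [
    (ALICE, SISTER_OF, BOB),
]


def objects_of(triples, subject, predicate):
    """Return every object linked to the subject by the given verb."""
    return [o for s, p, o in triples if s == subject and p == predicate]


siblings = objects_of(triples, ALICE, SISTER_OF)
print(siblings)
```

Because the verb is itself a URI, anyone could publish a new relationship such as a godparent link without asking permission from a central authority.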

‘Two databases may use different identifiers for what is in fact the same concept... A program that wants to compare or combine information across the two databases has to know that these two terms are being used to mean the same thing. Ideally, the program must have a way to discover such common meanings for whatever databases it encounters’, (Berners-Lee et al 2001).

Yes, as I said earlier, I may want to link with another genealogy held in a different system elsewhere, which uses its own XML and RDF structures.
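To make the problem concrete, suppose two genealogy databases each record a date of birth under their own identifier. In this Python sketch a hand-written equivalence table stands in for the ‘common meanings’ that a program would ideally discover for itself. The URIs, the records and the normalise function are all assumptions made up for the example.

```python
# Hypothetical property URIs used by two independent genealogy databases.
DB1_BIRTH = "http://example.org/db1/birthDate"
DB2_BIRTH = "http://example.net/db2/dateOfBirth"

db1 = [("http://example.org/db1/p1", DB1_BIRTH, "1850-03-02")]
db2 = [("http://example.net/db2/x9", DB2_BIRTH, "1871-11-20")]

# Declared equivalences: each foreign term maps to one shared term.
same_as = {DB2_BIRTH: DB1_BIRTH}


def normalise(triples, same_as):
    """Rewrite each verb to its shared term, leaving unknown verbs alone."""
    return [(s, same_as.get(p, p), o) for s, p, o in triples]


combined = normalise(db1, same_as) + normalise(db2, same_as)
birth_dates = [o for _, p, o in combined if p == DB1_BIRTH]
print(birth_dates)
```

Without the same_as table, a query for DB1_BIRTH would silently miss every record from the second database.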

‘A solution to this problem is provided by the third basic component of the Semantic Web, collections of information called ontologies... [In terms of the Semantic Web] an ontology is a document or file that formally defines the relations among terms. The most typical kind of ontology for the Web has a taxonomy and a set of inference rules’, (Berners-Lee et al 2001).

‘The taxonomy defines classes of objects and relations among them... Classes, subclasses and relations among entities are a very powerful tool for Web use. We can express a large number of relations among entities by assigning properties to classes and allowing subclasses to inherit such properties’, (Berners-Lee et al 2001).
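Subclass inheritance of properties can be sketched as follows. The two classes, their properties and the properties_of function are hypothetical and chosen only to show the mechanism; a real ontology language would express this declaratively rather than in Python.

```python
# Hypothetical taxonomy: each class names its parent (None at the root).
subclass_of = {"Location": None, "Address": "Location"}

# Properties declared directly on each class.
class_properties = {
    "Location": {"city_code"},
    "Address": {"street"},
}


def properties_of(cls):
    """Collect a class's own properties plus everything it inherits."""
    props = set()
    while cls is not None:
        props |= class_properties.get(cls, set())
        cls = subclass_of.get(cls)
    return props


address_props = properties_of("Address")
print(sorted(address_props))
```

Here city_code is declared once, on Location, and Address picks it up simply by being a subclass.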

‘Inference rules in ontologies supply further power... A program could then readily deduce, for instance, that a Cornell University address, being in Ithaca, must be in New York State, which is in the U.S., and therefore should be formatted to U.S. standards. The computer doesn’t truly “understand” any of this information, but it can now manipulate the terms much more effectively in ways that are useful and meaningful to the human user’, (Berners-Lee et al 2001).
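The Cornell deduction is a chain of ‘is in’ steps followed transitively. A minimal Python sketch, assuming each place records only its immediate container and that the data contains no cycles, might look like this. The located_in table and the all_locations function are my own illustration, not anything defined by an ontology standard.

```python
# Hypothetical "located in" facts, one step each.
located_in = {
    "Cornell University": "Ithaca",
    "Ithaca": "New York State",
    "New York State": "U.S.",
}


def all_locations(place, located_in):
    """Follow the located-in links until there is nowhere further to go."""
    chain = []
    while place in located_in:
        place = located_in[place]
        chain.append(place)
    return chain


chain = all_locations("Cornell University", located_in)
address_format = "U.S." if "U.S." in chain else "unknown"
print(chain, address_format)
```

The program never ‘understands’ what Ithaca is; it only follows links, which is the distinction the quotation draws.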

‘Ontologies can enhance the functioning of the Web in many ways. They can be used in a simple fashion to improve the accuracy of Web searches and the search program can look for only those pages that refer to a precise concept instead of all the ones using ambiguous keywords. More advanced applications will use ontologies to relate the information on a page to the associated knowledge structures and inference rules’, (Berners-Lee et al 2001).

Ontologies can be defined using the Web Ontology Language (OWL). Isn’t that neat?

‘Another vital feature will be digital signatures, which are encrypted blocks of data that computers and agents can use to verify that the attached information has been provided by a specific trusted source’, (Berners-Lee et al 2001).

Digital signatures can be used to sign the objects (XML files) or links (RDF) to ensure their validity.
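The sign-then-verify flow can be sketched in Python, with one loud caveat: the standard library has no public-key signing, so this example swaps in a shared-secret HMAC as a stand-in. It is not a true digital signature, because anyone who can verify could also sign. The key, the signed statement and the sign and verify functions are all hypothetical.

```python
import hashlib
import hmac

# Assumption: a shared secret stands in for a publisher's private key.
SECRET_KEY = b"hypothetical-shared-secret"


def sign(document: bytes) -> str:
    """Return a keyed SHA-256 digest of the document as hex text."""
    return hmac.new(SECRET_KEY, document, hashlib.sha256).hexdigest()


def verify(document: bytes, signature: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    return hmac.compare_digest(sign(document), signature)


# A hypothetical RDF-style link to be signed.
rdf_link = (
    b"<http://example.org/people/alice> "
    b"<http://example.org/relations/isSisterOf> "
    b"<http://example.org/people/bob> ."
)

signature = sign(rdf_link)
ok = verify(rdf_link, signature)
tampered_ok = verify(rdf_link + b" ", signature)
print(ok, tampered_ok)
```

Even a single added space changes the digest, so a checking module can detect that the link was altered after it was signed.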

On their own, digital signatures only let us assert that an object is linked with a person, or that someone is who they say they are. Matched with checking modules in applications, however, they could make the following situations possible:

‘Proxy caches... will be able to check that they are really acting in accordance with the publisher’s wishes when it comes to re-distributing material [e.g. distribution controls selected dependent on the publisher’s certificate]. A browser will be able to get an assurance, before imparting personal information in a Web form, on how that information will be used [a digitally signed Web service]. People will be able to endorse Web pages that they perceive to be of value [a digitally signed hyperlink]. Search engines will be able to take such endorsements into account and give results that are perceived to be of much higher quality’, (Berners-Lee 1998).

‘When we have this, we will be able to ask the computer not just for information, but why we should believe it. Imagine an ‘Oh, yeah?’ button on your browser’, (Berners-Lee 1998).

Berners-Lee, T., 1998. Realising the Full Potential of the Web. Available from: http://www.w3.org/1998/02/Potential.html [Accessed 25 February 2009].
Berners-Lee, T., Hendler, J. & Lassila, O., 2001. The Semantic Web. Scientific American. Available from: http://www.sciam.com/article.cfm?id=the-semantic-web [Accessed 27 February 2009].
Evjen, B., Sharkey, K., Thangarathinam, T., Kay, M., Vernet, A. & Ferguson, S., 2007. Professional XML. John Wiley & Sons.
W3C, 2006. Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation 16 August 2006. World Wide Web Consortium. Available from: http://www.w3.org/TR/2006/REC-xml-20060816/ [Accessed 9 January 2009].

Thursday, 19 March 2009
