Wednesday, 25 March 2009

Web Dev Research Presentation

A presentation about Web services, developed for my degree's web programming module.
It includes TCP/IP, FTP and HTTP due to assignment requirements.

Monday, 23 March 2009

Collaboration: A Vision for Content Management

Reading an interview with Ross Mayfield, CEO and founder of Socialtext, found in ‘Wikinomics’ by Tapscott and Williams...
‘One of a growing number of start-ups that have emerged to supply social computing technologies (especially wikis) to enterprises.’
‘“For a long time,” says Mayfield, “personal productivity tools and applications – the kind that Microsoft makes – have been centred on a single user who generates documents. You also have highly structured enterprise systems designed and implemented from the top down – in many ways as an instrument of control – with rigid work flow, business rules, and ontologies that users must fit themselves into. The problem is that users don’t like using those kinds of tools, and what they end up doing is trying to circumvent them. That’s why ninety percent of collaboration exists in emails.”’
‘Mayfield argues that traditional organizations have reached a point where e-mail itself is breaking. “You could argue that ten or twenty percent of e-mail is productive”.’
‘Mayfield thinks the solution is collaboration tools that adapt to the habits of workplace teams and social networks rather than the other way around.’
So, in a similar way to the envisaged genealogy software, each employee is represented by an XML or Web page object. FOAF or similar networks can be created and used as required to provide information about how the employees relate to each other, without prescribing a hierarchy of any kind.

A modern Content Management System (CMS) should provide the facility for each employee to create blogs, wiki articles etc., in the same way that they expect to be able to use Social Networking software at home. The difference is that these activities are only posted to an internal intranet (unless the employee needs to link to some public facility like Wikipedia).

Concerning Mayfield’s comment about spam... I think it may be better to employ a Twitter-style stream. Admittedly 80% may be irrelevant, but it would enable workers to know what each other is doing. I would also build in an ‘ignore hashtag’ function though, or even ‘block hashtag’, as I don’t necessarily want to know about every conversation. (I need to find this function within Twitter!)
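
As a rough sketch of what an ‘ignore hashtag’ function might look like, assuming status updates are plain strings and hashtags are words prefixed with ‘#’ (the sample updates and the function names are made up for illustration):

```python
import re

# Hypothetical status updates; the text and hashtags are illustrative only.
updates = [
    "Finished the quarterly report #finance",
    "Anyone for lunch? #social",
    "Deployed the new intranet wiki #it #wiki",
]

def hashtags(text):
    """Return the set of lower-cased hashtags found in a status update."""
    return {tag.lower() for tag in re.findall(r"#(\w+)", text)}

def visible(stream, ignored):
    """Keep only the updates that carry none of the ignored hashtags."""
    ignored = {tag.lower() for tag in ignored}
    return [text for text in stream if not (hashtags(text) & ignored)]

# Hide everything tagged #social and keep the rest of the stream.
filtered = visible(updates, ignored={"social"})
print(filtered)
```

A ‘block hashtag’ variant could reuse the same filter, with the ignored set stored per user rather than passed in each time.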

And, oh yes, of course I think that we can apply the same system to genealogy.
  1. Tapscott, D. & Williams, A., 2008. Wikinomics, Atlantic Books.

Thursday, 19 March 2009

The Semantic Web relating to Genealogy: Thoughts Re-jiggled

‘Associating a URI with a resource means that anyone can link to it, refer to it, or retrieve a representation of it’, (Shadbolt et al 2006).
‘Much of the motivation for the Semantic Web comes from the value locked in relational databases. To release this value, database objects must be exported to the Web as first-class objects and therefore must be mapped into a system of URIs’, (Shadbolt et al 2006).
So I envisage that any genealogical object (such as a person, family, source, repository, place, note or media) must exist as an individual XML file on the Web that can then be linked to, as desired or required.

As far as possible, these files should be normalised. In the same way as the database rule that a table should relate to ‘the key, the whole key and nothing but the key’, objects should only include tags that relate to them.

Folksonomies
‘[Folksonomies] represent a structure that emerges organically when individuals manage their own information requirements. Folksonomies arise when a large number of people are interested in particular information and are encouraged to describe it – or tag it’, (Shadbolt et al 2006).
‘Rather than a centralized form of classification, users can assign keywords to documents or other information sources’, (Shadbolt et al 2006).
This links with my vision of future online genealogy: objects could be linked by tagging them and specifying a description of the relationship.

If we have two XML files, perhaps both representing people in our family tree, a user could tag one object from another as being related in some way, perhaps as a cousin for example. Behind the scenes the tagging application creates an RDF file which describes that XML file A, representing person A, is the cousin of XML file B, representing person B.
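
A minimal sketch of the statement such a tagging application might write, using the line-based N-Triples syntax for RDF. The example.org addresses and the ‘cousinOf’ property are placeholders I have assumed; they do not come from FOAF or any other published vocabulary:

```python
# Hypothetical URIs: neither the example.org addresses nor the "cousinOf"
# property come from a real genealogy site or a published vocabulary.
PERSON_A = "http://example.org/people/person-a.xml"
PERSON_B = "http://example.org/people/person-b.xml"
COUSIN_OF = "http://example.org/relations#cousinOf"

def tag_relationship(subject, predicate, obj):
    """Return one RDF statement as an N-Triples line (all three parts are URIs)."""
    return f"<{subject}> <{predicate}> <{obj}> ."

# "XML file A is the cousin of XML file B", stored apart from both files.
triple = tag_relationship(PERSON_A, COUSIN_OF, PERSON_B)
print(triple)
```

The statement lives in its own file, so neither person's XML file has to change when the link is made.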

Alternatively, pictures of a person could be tagged (in a similar way to Facebook, Flickr etc). Again, an RDF file would be created describing the tagged picture as a picture of my ancestor, who is represented by an XML file elsewhere.

Consider the situation where a paragraph in an online text mentions a relative of some kind. Again, a link could be made between the two in the same way as described above.
‘But folksonomies serve very different purposes from ontologies. Ontologies are attempts to more carefully define parts of the data world and to allow mappings and interactions between data held in different formats. Ontologies refer by virtue of URIs; tags use words’, (Shadbolt et al 2006).
I don’t see this as an ‘either/or’ situation; I think we need both. An ontology is needed to set a standard for the basic types of genealogical link (e.g. parent, spouse, sibling) and to ensure system compatibility. A folksonomy has the particular advantage that it can cover inadequacies of the ontology.

In any application users should be given the choice of creating a standard type of link, as defined in the ontology, or if the ontology is missing a link type they could define it themselves, (folksonomy).
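
The choice described above could be sketched roughly as follows. The ontology here is a plain dictionary with placeholder URIs that I have invented for the example, and the file names are hypothetical:

```python
# Hypothetical ontology: the URIs below are placeholders, not a published vocabulary.
ONTOLOGY = {
    "parent": "http://example.org/genealogy#parentOf",
    "spouse": "http://example.org/genealogy#spouseOf",
    "sibling": "http://example.org/genealogy#siblingOf",
}

def make_link(source, target, link_type):
    """Describe a link using the ontology when possible, else a free-text tag."""
    key = link_type.strip().lower()
    if key in ONTOLOGY:
        # A standard link type: refer to it by URI so other systems can match it.
        return {"from": source, "to": target, "kind": "ontology", "type": ONTOLOGY[key]}
    # The ontology has no such link type: keep the user's own word (folksonomy).
    return {"from": source, "to": target, "kind": "folksonomy", "type": link_type.strip()}

standard = make_link("person-a.xml", "person-b.xml", "Spouse")
custom = make_link("person-a.xml", "person-c.xml", "godparent")
print(standard["kind"], custom["kind"])
```

Free-text tags that turn out to be popular would be natural candidates for later inclusion in the ontology.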

Ontologies become particularly important when we think on a global scale, in that ontologies can be language independent, whereas a folksonomy is very hard to translate.

The Friend Of A Friend (FOAF) project exists as an ontology that could, and should, be used in an online genealogy system.

Whether using a prescribed ontology, or a folksonomy, creating tags must be simple in order to encourage participation.

The Principle of Least Power - Keep It Simple Stupid
‘When Berners-Lee developed the Web, he took the salient ideas of hypertext and SGML syntax and removed complexities such as backward hyperlinks. At the time, many criticized their absence from HTML because, without them, pages can simply vanish and links can break. But the need to control both the linking and linked pages is a burden to authoring, sharing, and copying’, (McCool 2006).
In my research (Thomas 2009), the idea of splitting one GEDCOM file into multiple XML files, one XML file for each object, has raised similar concerns. In my opinion, though, the benefits outlined in this blog outweigh this problem.
‘Early forms of HTML paid no regard to SGML document-type definitions (DTDs). Berners-Lee simply ignored these difficult to create and understand declarations of how markup tags are used’, (McCool 2006).
In a similar way, it does not really matter how people define their data in XML, as long as there are ontologies that allow us to associate XML tags.
‘[Folksonomies have] no notion of synonyms or disambiguation... For a Web community with simple, easy-to-use authoring tools that support synonyms, disambiguation, and categories, we can look to Wikipedia... Wikipedia calls synonyms redirect pages, and disambiguation is explicitly handled via special pages’, (McCool 2006).
Wikipedia is different from my vision in that there is no XML-based data behind the presentation layer. Where not implied by the ontology, presentation-layer pages (HTML) could be used in the same way as Wikipedia to address the issues described in the above paragraph.

Conclusion

It is clear that the GEDCOM format would never be able to facilitate the Semantic Web.
I have created a project, called GenPACK, which breaks GEDCOM files down into XML files, retaining the same tags as GEDCOM. (Some call this an intermediate format).
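
To illustrate the general idea (this is not GenPACK's actual code, only a small sketch under my own assumptions), the following splits a hand-written GEDCOM fragment into one XML tree per level-0 record, reusing the GEDCOM tags as element names. The ‘id’ attribute and the function names are invented for the example:

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written GEDCOM fragment used only for illustration.
GEDCOM = """\
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1900
0 @I2@ INDI
1 NAME Mary /Jones/
"""

def parse_line(line):
    """Split a GEDCOM line into (level, xref, tag, value)."""
    parts = line.split(" ", 2)
    level = int(parts[0])
    if parts[1].startswith("@"):          # record line such as "0 @I1@ INDI"
        tag, _, value = parts[2].partition(" ")
        return level, parts[1].strip("@"), tag, value
    value = parts[2] if len(parts) > 2 else ""
    return level, None, parts[1], value

def split_records(text):
    """Return {xref: Element}, one XML tree per level-0 GEDCOM record."""
    records = {}
    stack = []                            # stack[i] is the open element at level i
    for line in text.splitlines():
        if not line.strip():
            continue
        level, xref, tag, value = parse_line(line)
        if level == 0:
            element = ET.Element(tag)
            if xref:
                element.set("id", xref)   # "id" is an assumed attribute name
                records[xref] = element
            stack = [element]
        else:
            del stack[level:]             # close elements at this level or deeper
            element = ET.SubElement(stack[level - 1], tag)
            stack.append(element)
        if value:
            element.text = value
    return records

records = split_records(GEDCOM)
for xref, element in records.items():
    # Each record could now be written to its own file, e.g. I1.xml.
    print(xref, ET.tostring(element, encoding="unicode"))
```

The sketch ignores GEDCOM details such as CONT/CONC continuation lines and character encodings, which a real converter would need to handle.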

Although GenPACK is currently being written to provide linking by importing those XML files into a database, from this research it is clear that the next step is to use RDF files to imply relations between these files. In this process I will need to analyse FOAF to see how it can be applied to this situation.

The GenPACK project can be found at: http://sourceforge.net/projects/genpack/

  1. McCool, R., 2006. Rethinking the Semantic Web, Part 2. IEEE Internet Computing, 10(1), 93-96.
  2. Shadbolt, N., Hall, W. & Berners-Lee, T., 2006. The Semantic Web Revisited. IEEE Intelligent Systems, 21(3), 96-101.
  3. Thomas, N., 2009. It’s All Relative: But is GEDCOM still a member of the family? Pre-Print Conference Paper. Available from: http://genpack.wiki.sourceforge.net/space/showimage/ItsAllRelativeD01InternetVersion.doc/ [Accessed 19 March 2009].

Hypertext, Hypermedia and the Semantic Web: What is all this Semantics stuff anyway?

What is the ‘Semantic Web’?
‘The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. The first steps in weaving the Semantic Web into the structure of the existing Web are already under way. In the near future, these developments will usher in significant new functionality as machines become much better able to process and “understand” the data that they merely display at present’, (Berners-Lee et al 2001).
What are the design principles of the Semantic Web?
‘The essential property of the World Wide Web is its universality. The power of a hypertext link is that “anything can link to anything”... Like the Internet, the Semantic Web will be as decentralized as possible... Decentralization requires compromises: the Web had to throw away the ideal of total consistency of all its interconnections, ushering in the infamous message “Error 404: Not Found” but allowing unchecked exponential growth’, (Berners-Lee et al 2001).
We can trace this property back to the original Web’s design principles, particularly the Web’s ability to record random associations between objects.
‘The Web was designed to be a universal space of information, so when you make a bookmark or a hypertext link, you should be able to make that link to absolutely any piece of information that can be accessed using networks. The universality is essential to the Web: it loses its power if there are certain types of things to which you can’t link’, (Berners-Lee 1998).
‘The second part of the dream was... The computer re-enters the scene visibly as a software agent, doing anything it can to help us deal with the bulk of data, to take over the tedium of anything that can be reduced to a rational process, and to manage the scale of our human systems’, (Berners-Lee 1998).
How will the Semantic Web work?
‘For the semantic web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning’, (Berners-Lee et al 2001).
‘Traditional knowledge-representation systems typically have been centralized, requiring everyone to share exactly the same definition of common concepts such as “parent” or “vehicle”. But central control is stifling, and increasing the size and scope of such a system rapidly becomes unmanageable’, (Berners-Lee et al 2001).
‘For example, a genealogy system, acting on a database of family trees, might include the rule “a wife of an uncle is an aunt”. Even if the data could be transferred from one system to another, the rules, existing in a completely different form, usually could not’, (Berners-Lee et al 2001).
The point that Sir Tim Berners-Lee is making here is that data is usually stored on a range of different systems, and the ‘semantic’ rules that define objects are held in a variety of formats that depend on the system used to store the data. The rule sets therefore cannot travel with the data when it is moved from one information system to another.
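
To make the quoted rule concrete, here is a small sketch that applies “a wife of an uncle is an aunt” to a handful of facts. The names and the relation labels (‘wifeOf’, ‘uncleOf’, ‘auntOf’) are invented for the example:

```python
# Hypothetical family facts as (subject, relation, object) triples.
facts = {
    ("Bob", "uncleOf", "Alice"),
    ("Carol", "wifeOf", "Bob"),
}

def infer_aunts(triples):
    """Apply the rule: if W is the wife of U and U is an uncle of N, W is an aunt of N."""
    inferred = set()
    for wife, rel1, husband in triples:
        if rel1 != "wifeOf":
            continue
        for uncle, rel2, niece_or_nephew in triples:
            if rel2 == "uncleOf" and uncle == husband:
                inferred.add((wife, "auntOf", niece_or_nephew))
    return inferred

aunts = infer_aunts(facts)
print(aunts)
```

Here the rule is buried in program code, which is exactly the problem being described: the facts could be exported easily, but the rule could not.
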
‘Moreover, these systems usually carefully limit the questions that can be asked so that the computer can answer reliably or answer at all. The problem is reminiscent of [Kurt] Gödel’s [incompleteness] theorem from mathematics: any system that is complex enough to be useful also encompasses unanswerable questions... Semantic Web researchers, in contrast, accept that paradoxes and unanswerable questions are a price that must be paid to achieve versatility. We make the language for the rules as expressive as needed to allow the Web to reason as widely as desired’, (Berners-Lee et al 2001).
‘Early in the Web’s development, detractors pointed out that it could never be a well-organized library; without a central database and tree structure, one would never be sure of finding everything. They were right’, (Berners-Lee et al 2001).
‘The challenge of the Semantic Web, therefore, is to provide a language that expresses both data and rules for reasoning about the data and that allows rules from any existing knowledge-representation system to be exported onto the Web’, (Berners-Lee et al 2001).
The earlier example about a genealogy system is particularly close to my heart: GEDCOM is certainly a ‘traditional knowledge-representation system’. New genealogy formats are being created, each with its own way of defining data. We must therefore find a way of associating data...

eXtensible Markup Language (XML)

XML is a Markup Language, meaning that tags are inserted into the document to signify that the data between them is related to the tag. It was developed by the World Wide Web Consortium’s (W3C) XML Working Group in 1996, (W3C 2006).
‘XML’s power comes from the fact that it can be used regardless of the platform, language, or data store of the system using it to expose datasets’, (Evjen et al 2007).
‘XML is considered ideal for data representation purposes because it enables developers to structure XML documents as they see fit. For this reason, it is also a bit chaotic. Sending self-structured XML documents between dissimilar systems doesn’t make a lot of sense – it requires custom building of both the exposure and consumption models for each communication pair’, (Evjen et al 2007).
So really, everyone can create their own definition of how to represent data using XML. Again, the genealogy developers are doing this, so how will we associate?
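
A small sketch of the problem: the same person described in two invented XML vocabularies (neither is a real standard), each needing its own hand-written reader before the two records can even be compared:

```python
import xml.etree.ElementTree as ET

# Two invented vocabularies for the same person; neither is a real standard.
doc_a = "<person><name>John Smith</name><born>1900</born></person>"
doc_b = "<INDI><NAME>John Smith</NAME><BIRT><DATE>1900</DATE></BIRT></INDI>"

def read_a(xml_text):
    """Reader that only understands the first vocabulary."""
    root = ET.fromstring(xml_text)
    return {"name": root.findtext("name"), "birth_year": root.findtext("born")}

def read_b(xml_text):
    """Reader that only understands the second vocabulary."""
    root = ET.fromstring(xml_text)
    return {"name": root.findtext("NAME"), "birth_year": root.findtext("BIRT/DATE")}

# Only after both custom readers have run can the two records be matched.
same_person = read_a(doc_a) == read_b(doc_b)
print(same_person)
```

Every additional format means another reader, which is the ‘custom building for each communication pair’ that the quote warns about.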

Resource Description Framework (RDF)


In 1997, the W3C defined the first Resource Description Framework specification. It became a W3C recommendation in 1999. But what exactly does it do?
‘Meaning is expressed by RDF, which encodes it in sets of triples, each triple being rather like the subject, verb and object of an elementary sentence. These triples can be written using XML tags. In RDF, a document makes assertions that particular things (people, Web pages or whatever) have properties (such as “is a sister of”, “is the author of”) with certain values (another person, another Web page). This structure turns out to be a natural way to describe the vast majority of the data processed by machines’, (Berners-Lee et al 2001).
‘Subject and object are each identified by a Universal Resource Identifier (URI), just as used in a link on a Web page. (URLs, Uniform Resource Locators, are the most common type of URI). The verbs are also identified by URIs, which enables anyone to define a new concept, a new verb, just by defining a URI for it somewhere on the Web’, (Berners-Lee et al 2001).
So, as a genealogist, I can use one URI (a web page) to represent one person and another URI to represent another person, and then link them together using an RDF file at an intermediate location (URI) which defines their relationship.
‘Two databases may use different identifiers for what is in fact the same concept... A program that wants to compare or combine information across the two databases has to know that these two terms are being used to mean the same thing. Ideally, the program must have a way to discover such common meanings for whatever databases it encounters’, (Berners-Lee et al 2001).
Yes, as I said earlier, I may want to link with another genealogy held in a different system elsewhere, which uses its own XML and RDF structures.
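
As a rough illustration of discovering ‘common meanings’, the sketch below maps two systems' identifiers for the same concept onto one shared term before comparing statements. All of the URIs are placeholders assumed for the example, and a real system would load the mapping from an ontology rather than hard-code it:

```python
# Hypothetical property URIs from two systems, mapped onto one shared concept.
EQUIVALENT_TO = {
    "http://example.org/system-a#birthplace": "http://example.org/shared#birthPlace",
    "http://example.org/system-b#placeOfBirth": "http://example.org/shared#birthPlace",
}

def normalise(triple):
    """Rewrite a triple's property to the shared term, if a mapping is known."""
    subject, prop, obj = triple
    return (subject, EQUIVALENT_TO.get(prop, prop), obj)

from_a = ("http://example.org/people/john", "http://example.org/system-a#birthplace", "Leeds")
from_b = ("http://example.org/people/john", "http://example.org/system-b#placeOfBirth", "Leeds")

# The raw triples differ, but they say the same thing once normalised.
print(from_a == from_b, normalise(from_a) == normalise(from_b))
```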
‘A solution to this problem is provided by the third basic component of the Semantic Web, collections of information called ontologies... [In terms of the Semantic Web] an ontology is a document or file that formally defines the relations among terms. The most typical kind of ontology for the Web has a taxonomy and a set of inference rules’, (Berners-Lee et al 2001).
‘The taxonomy defines classes of objects and relations among them... Classes, subclasses and relations among entities are a very powerful tool for Web use. We can express a large number of relations among entities by assigning properties to classes and allowing subclasses to inherit such properties’, (Berners-Lee et al 2001).
‘Inference rules in ontologies supply further power... A program could then readily deduce, for instance, that a Cornell University address, being in Ithaca, must be in New York State, which is in the U.S., and therefore should be formatted to U.S. standards. The computer doesn’t truly “understand” any of this information, but it can now manipulate the terms much more effectively in ways that are useful and meaningful to the human user’, (Berners-Lee et al 2001).
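
The Cornell example can be sketched as a chain of ‘located in’ facts that a program follows transitively. The facts come from the quote; the dictionary layout and function names are my own assumptions:

```python
# Facts taken from the quoted example; the dictionary layout is an assumption.
LOCATED_IN = {
    "Cornell University": "Ithaca",
    "Ithaca": "New York State",
    "New York State": "U.S.",
}

def containing_places(place):
    """Follow 'located in' links transitively and list every enclosing place."""
    chain = []
    while place in LOCATED_IN:
        place = LOCATED_IN[place]
        chain.append(place)
    return chain

def address_format(place):
    """Deduce which address standard applies from the enclosing places."""
    return "U.S." if "U.S." in containing_places(place) else "unknown"

print(containing_places("Cornell University"))
print(address_format("Cornell University"))
```

The program never ‘understands’ what Ithaca is; it only follows the links, which is the point the quote makes.
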
‘Ontologies can enhance the functioning of the Web in many ways. They can be used in a simple fashion to improve the accuracy of Web searches and the search program can look for only those pages that refer to a precise concept instead of all the ones using ambiguous keywords. More advanced applications will use ontologies to relate the information on a page to the associated knowledge structures and inference rules’, (Berners-Lee et al 2001).
Ontologies can be defined using the Web Ontology Language (OWL). Isn’t that neat?
‘Another vital feature will be digital signatures, which are encrypted blocks of data that computers and agents can use to verify that the attached information has been provided by a specific trusted source’, (Berners-Lee et al 2001).
Digital signatures can be used to sign the objects (XML files) or links (RDF) to ensure their validity.

Digital signatures only enable us to assert that an object is linked with a person, or that they are who they say they are. Matched with checking modules in applications, however, the following situations could be possible:
‘Proxy caches... will be able to check that they are really acting in accordance with the publisher’s wishes when it comes to re-distributing material [e.g. distribution controls selected dependent on the publisher’s certificate]. A browser will be able to get an assurance, before imparting personal information in a Web form, on how that information will be used [a digitally signed Web service]. People will be able to endorse Web pages that they perceive to be of value [a digitally signed hyperlink]. Search engines will be able to take such endorsements into account and give results that are perceived to be of much higher quality’, (Berners-Lee 1998).
‘When we have this, we will be able to ask the computer not just for information, but why we should believe it. Imagine an ‘Oh, yeah?’ button on your browser’, (Berners-Lee 1998).
  1. Berners-Lee, T., 1998. Realising the Full Potential of the Web. Available at: http://www.w3.org/1998/02/Potential.html [Accessed February 25, 2009].
  2. Berners-Lee, T., Hendler, J. & Lassila, O., 2001. The Semantic Web: Scientific American. Scientific American Magazine. Available from: http://www.sciam.com/article.cfm?id=the-semantic-web [Accessed February 27, 2009].
  3. Evjen, B., Sharkey, K., Thangarathinam, T., Kay, M., Vernet, A. & Ferguson, S., 2007. Professional XML. John Wiley & Sons.
  4. W3C, 2006. Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation 16 August 2006. World Wide Web Consortium. Available from: http://www.w3.org/TR/2006/REC-xml-20060816/ [Accessed 09 January 2009].

Hypertext, Hypermedia and the Semantic Web: The Web Itself

Sir Tim Berners-Lee described computing in 1980 as a world of ‘incompatible networks, disk formats, data formats and character encoding schemes’. This was particularly frustrating ‘given that... to a greater extent computers were being used directly for most information handling, and so almost anything one might want to know was almost certainly recorded magnetically somewhere’, (Berners-Lee 1996).

The ‘Design Criteria’ of the World Wide Web, described in Sir Tim Berners-Lee’s 1996 paper, make very interesting reading:
  1. ‘An information system must be able to record random associations between any arbitrary objects, unlike most database systems’.
  2. ‘To make a link from one system to another should be an incremental effort, not requiring un-scalable operations such as the merging of databases’.
  3. ‘Any attempt to constrain users as a whole to the use of particular languages or operating systems was always doomed to failure’.
  4. ‘Information must be available on all platforms, including future ones’.
  5. ‘Any attempt to constrain the mental model users have of data into a given pattern was always doomed to failure’.
  6. ‘Entering or correcting [information] must be trivial for the person directly knowledgeable’.
The Web is formed around three common standards: the Address Space, Hyper-Text Transfer Protocol (HTTP) and Hyper-Text Mark-up Language (HTML), all originally designed by Sir Tim Berners-Lee.

The Web was designed around a principle of minimal constraint, in order that it could be incrementally improved by future developers. Additionally, the Web’s standards needed to be modular and support information-hiding, so that anybody designing anything on top of those standards did not have to know how the standards actually worked, (Berners-Lee 1996).
‘A test of this ability was to replace them with older specifications, and demonstrate the ability to intermix those with the new. Thus, the old FTP protocol could be intermixed with the new HTTP protocol in the address space, and conventional text documents could be intermixed with the new hypertext documents’, (Berners-Lee 1996).
Also, as a further example, we can look at HTTP’s ability to carry images (JPG, PNG, VRML) or even Java code.
‘Typically, hypertext systems were built around a database of links. This did not scale... However, it did guarantee that links would be consistent and links to documents would be removed when documents were removed. The removal of this feature was the principle compromise made in the [World Wide Web] architecture... allowing references to be made without consultation with the destination, allowed the scalability which the later growth of the web exploited’, (Berners-Lee 1996).
File Transfer Protocol (FTP) existed when the web was first developed, but was ‘not optimal for the web, in that it was too slow and not sufficiently rich in features’, (Berners-Lee 1996). So the Hyper-Text Transfer Protocol (HTTP) was created.

Universal Resource Identifiers (URIs) are the primary element of Web architecture. ‘Any new space of any kind which has some kind of identifying, naming or addressing syntax can be mapped into a printable syntax and given a prefix’, (Berners-Lee 1996). ‘URIs are generally treated as opaque strings: client software is not allowed to look inside them and to draw conclusions about the object referenced’, (Berners-Lee 1996). ‘HTTP URIs are resolved... by splitting them into two halves. The first half is applied to the Domain Name Service to discover a suitable server, and the second half is an opaque string which is handed to that server’, (Berners-Lee 1996).
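
The ‘two halves’ of an HTTP URI can be seen with Python's standard urllib.parse module. The function name is my own, and the DNS lookup and the request itself are left out of this sketch:

```python
from urllib.parse import urlsplit

def resolve_halves(uri):
    """Split an HTTP URI into the server name and the opaque string sent to it."""
    parts = urlsplit(uri)
    server_half = parts.hostname                      # looked up via DNS
    opaque_half = parts.path + ("?" + parts.query if parts.query else "")
    return server_half, opaque_half

host, rest = resolve_halves("http://www.w3.org/1998/02/Potential.html")
print(host)   # the half a client resolves to find a server
print(rest)   # the half only that server is expected to interpret
```

The client makes no assumptions about what the second half means; only the server that receives it does.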

Hyper-Text Markup Language (HTML) was defined as the data format to be transmitted over HTTP. HTML was based around SGML in order to encourage its adoption by those already using SGML.

The initial prototype browser was written in NeXTStep in late 1990. It allowed HTML to be edited as well as browsed. The small number of NeXT machines in use limited its visibility, so in 1991 a read-only ‘line mode’ browser was written. This enabled the early web to be viewed on a range of systems. As more people became involved, full browsers were written.

In 1993, rumours spread that the Web’s competitor ‘Gopher’ was to become a licensed product. As a result, a mass of people and organisations transferred their hypermedia systems to be WWW systems instead.

The World Wide Web Consortium (W3C) was formed in 1994. The rest is history...

Wednesday, 18 March 2009

Chromon


Repeat the musical pattern