We have written a lot here about the vision of building a structured layer on top of the current web. We discussed two approaches: annotating billions of HTML documents from the bottom up, or building top-down tools that can automagically interpret the existing information. Together these approaches would result in a global database that would make the web even more connected. The ability to correlate content and concepts across web sites would reduce the time spent searching and would enable the discovery of related information.
In previous posts we discussed the difficulties with the bottom-up approach to the Semantic Web – a sophisticated form of annotating information using tools like RDF and OWL. Among the factors that impair the web-wide adoption of these tools are their complexity and the lack of clear end user benefits.
On the other hand, the top-down approach that we discussed does not place any burden on content owners and delivers instant benefits to end users. Yet the top-down tools run into a difficulty of their own: interpreting raw information is not that simple. Typical solutions focus on a single vertical, and even then they make mistakes.
What if there were some minimal annotation in the content to help top-down tools interpret it? In this post we look at how content owners can implement simple annotation strategies that help top-down tools and search engines make the web more structured.
Annotation Basics – Headers
It is striking how many sites today do not use meta tags in the head of the document to provide the bare minimum information about a page’s content. Forget building a smarter web; this is just plain bad SEO practice. The work that goes into generating great content can be undermined by the lack of a succinct, meaningful description of that content. Every page on the web should have the following information filled in:
- title – a sentence briefly describing the site/page
- description – a paragraph about the site/page
- keywords – a list of keywords that describe the site/page
Note that it makes sense to provide different information for the root page and subsequent pages. For example, for a newspaper or a blog, the root page should provide information about the site at large, while individual article and post pages should contain information about that specific page, not the overall site.
The New York Times’ web site shows how to use meta tags properly. For example, its article on the slowdown in US growth includes the following meta data:
- title – U.S. Growth Slowed Drastically in 4th Quarter
- description – The economy expanded by a weak 0.6 percent in the latest indication of a substantial slowdown and perhaps a recession.
- keywords – United States Economy,Gross Domestic Product
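To see why these three fields matter to a top-down tool, here is a minimal sketch of how a crawler might read them, using only Python's standard library. The class and function names (HeadMetaParser, extract_basics) and the sample page are our own illustration, not code from any real crawler; the sample values are the ones quoted above. Note that the title comes from the page's title element, while description and keywords come from meta tags.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect the <title> text and all named <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attrs.get("name")
            content = attrs.get("content")
            if name and content is not None:
                # Lower-case the name so "Keywords" and "keywords" match.
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append each one.
        if self._in_title:
            self.title += data


def extract_basics(html_text):
    """Return the title, description and keyword list of an HTML page."""
    parser = HeadMetaParser()
    parser.feed(html_text)
    parser.close()
    raw_keywords = parser.meta.get("keywords", "")
    keywords = [k.strip() for k in raw_keywords.split(",") if k.strip()]
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": keywords,
    }


# Hypothetical page head built from the values quoted in this post.
SAMPLE_PAGE = """<html><head>
<title>U.S. Growth Slowed Drastically in 4th Quarter</title>
<meta name="description" content="The economy expanded by a weak 0.6 percent in the latest indication of a substantial slowdown and perhaps a recession.">
<meta name="keywords" content="United States Economy,Gross Domestic Product">
</head><body></body></html>"""

if __name__ == "__main__":
    print(extract_basics(SAMPLE_PAGE))
```

A page that omits these tags simply yields empty values, which is exactly the situation that forces top-down tools to guess from the body text.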
The New York Times is actually a great example of taking the basics of annotation and building on top of them. Each page includes an extended set of rich meta data, including the author of the article, the date it was published, a thumbnail image URL, creator, category and even ticker symbols for public companies mentioned in the article. Certainly, the New York Times provides a really great set of information, perhaps even wider than needed for most content, but let's focus on the fields that should be used on a wider scale.
author: Web content is produced by people and for people. With the rise of social culture we are increasingly interested in finding bits of everyone’s identity around the web. If something piqued your interest enough for you to blog or write an article about it, the least you can do is put your name on it. Having people attached to content would allow seamless navigation from one to the other. There is already a standard meta tag for this, with a suggestive name: author.
thumbnail: We love pictures. Since the launch of Flickr we can’t live without them. Facebook’s success owes a lot to photo sharing. With bandwidth becoming cheap, we are becoming increasingly visual. We do not want text; we want pictures. So if a news article or blog post contains an image, it is simple to do what the Times did and generate a meta tag for it. There is no standard meta tag for this as far as I know, but any of these names would do: thumb, image, picture, thumbnail, etc.
date: As we become a real-time culture, the freshness of content becomes paramount. Tagging the page with a date is an important way of helping to place the page in time. Most blog posts and articles contain dates anyway, and having a standard date header would make the date simple to find and unambiguous.
location: Location is becoming increasingly important as well. With GPS and widely available Internet access, we can easily let people know where we are and take advantage of local services. If an article or post is related to a specific location, there is a conventional way of annotating it. The technical term for annotating content with location information is geotagging. It generally means attaching a pair of latitude and longitude coordinates to the content. A more relaxed form is to specify country/region/city, which is described in detail by the Geo microformat specification. While specifying exact coordinates may be difficult, even something as simple as the geo header New York, NY would be very helpful.
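As a rough sketch of how little work this takes on the publishing side, the snippet below renders the four extended fields as meta tags. Everything in it is illustrative: render_meta_tags is our own helper, the article values are made up, and apart from author the tag names are assumptions. The name thumbnail is one of the candidates listed above, date is not a standardized name, and geo.placename and geo.position follow a commonly seen geo meta tag convention rather than anything this post prescribes.

```python
from html import escape


def render_meta_tags(fields):
    """Render a name -> content mapping as <meta> tags, skipping empty values."""
    lines = []
    for name, content in fields.items():
        if not content:
            continue  # Leave a field out rather than emit an empty tag.
        lines.append(
            '<meta name="{}" content="{}">'.format(
                escape(name, quote=True), escape(content, quote=True)
            )
        )
    return "\n".join(lines)


# Hypothetical article data; none of these values come from a real page.
article = {
    "author": "Jane Doe",
    "thumbnail": "https://example.com/images/story-thumb.jpg",
    "date": "2008-01-31",                # ISO 8601 date, an assumed format
    "geo.placename": "New York, NY",     # the relaxed, human-readable form
    "geo.position": "40.7128;-74.0060",  # approximate latitude;longitude
}

if __name__ == "__main__":
    print(render_meta_tags(article))
```

Escaping the values keeps an ampersand or a quotation mark in an author name or place name from breaking the markup, and skipping empty fields means a post without a location simply carries no geo tags.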