People, places, events; our browsers (and other machines) should be able to recognize these things as easily as you and I can. That’s the promise of the semantic web. But users are not the ones who will make the semantic web work. Microformats extend existing XHTML tags to put human-readable information into machine-parsable form (<span class="location">Argent Hotel, San Francisco, CA</span>). But as simple a solution as that is, it will never have mainstream appeal.
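To make "machine-parsable" a bit more concrete, here is a minimal sketch of how a program might pull that location back out of the markup. It uses only Python's standard-library html.parser; the class and function names (ClassTextExtractor, extract_by_class) are my own illustrative choices, not part of any microformats toolkit, and the sketch assumes well-formed XHTML where every opened tag is closed.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given class name.

    Hypothetical helper for illustration; assumes well-formed XHTML
    (every start tag has a matching end tag or is self-closed).
    """

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.buffer = []    # text pieces of the element being read
        self.values = []    # finished results

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested tag inside a match
        elif self.class_name in classes:
            self.depth = 1              # entering a matching element
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:         # leaving the matching element
                self.values.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_by_class(markup, class_name):
    parser = ClassTextExtractor(class_name)
    parser.feed(markup)
    parser.close()
    return parser.values


marked_up = '<p>Meet at <span class="location">Argent Hotel, San Francisco, CA</span>.</p>'
print(extract_by_class(marked_up, "location"))
```

The same sentence without the span would come back as an empty list, which is the whole point: the words are identical to a person, but only the marked-up version tells a machine which part is the location.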

Unlike text-formatting XHTML tags like <strong> or <blockquote>, data-formatting tags and classes have no immediate visible effect on their contents. Can you tell the difference between a piece of text marked up with a microformat class and the same text left plain? In a browser, the two look identical. Because of this, users have no reason to ever remember to format their information with machine-readable tags. Even a rich-text-editing-like interface isn’t that helpful, because users lack a reason to want to make their data machine-readable. Only techies and smart business people care about that; for everyone else, human-readable is good enough. People write for people, not machines. So services will have to write for services.
We already have our information put into semantic form on a regular basis. Every time we type our name into a textfield created for that purpose, the machines of at least one service (Facebook, Yahoo! Mail, Wachovia) store that not just as a bit of text, but as text identified specifically as our name. Not just our names, but often our addresses, our interests, our friends, our job titles, and many other kinds of information. Data silos have existed for ages, of course; but giving the semantic web at large access to data silos is a key step forward in making semantic tools useful. It is just as key as collecting data from the distributed web, if not more so.

Not only is that data all in a central location for easy access, it’s often the kind of data we don’t explicitly state in our natural usage of the rest of the web. How often do you write on your blog or in an IM, “I live in Boston, MA. My interests are coffee, technology, startups, and communications. My relationship status is single, my height is 5’8”, and my eye color varies between blue and green”? (Which reminds me, I really need to increase the female demographic of my readership.)
Finally, access to data silos is as important as access to distributed data because data silos are where the majority of people are (and will be for the foreseeable future). There are only 165,700,000 sites in existence, whereas Facebook alone has over 70 million active users (and Facebook gets a quarter of the traffic Yahoo! does). The majority of people’s online presences will be through a centralized service, rather than their own site.
But that doesn’t mean there isn’t any value in all of the data available on the distributed web. After all, even just a few sites can put out a lot of content. But for the reasons outlined above, most of these sites won’t be putting their own content into a semantic format. That’s where semantic creators come in.
Data scraping is as simple as understanding the common format a specific type of data is usually put in (myname@mydomain.com or 555-555-5555), and having a web crawler find that information, extract it, and put it into the desired markup or database entry fields. As natural language processing advances, we’ll start seeing more services that recognize the kind of information I expressed explicitly above (interests, relationship status, etc.), even when it’s only expressed implicitly.
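As a rough sketch of that kind of format-based scraping, the snippet below uses Python's re module to find e-mail addresses and phone numbers shaped like the two examples above. The patterns are deliberately simplified assumptions (real-world e-mail and phone formats vary far more), the function name scrape_contacts is hypothetical, and the crawling step that would fetch the page text is left out.

```python
import re

# Simplified patterns, assumed for illustration only:
# an address like myname@mydomain.com and a number like 555-555-5555.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def scrape_contacts(text):
    """Return the e-mail addresses and phone numbers found in a block of text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


page_text = "Write to myname@mydomain.com or call 555-555-5555."
print(scrape_contacts(page_text))
# {'emails': ['myname@mydomain.com'], 'phones': ['555-555-5555']}
```

Once the values are labeled as "emails" and "phones" rather than sitting in a sentence, they can be dropped into database fields or wrapped in markup.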
Those kinds of services will operate either constantly or on demand. They will be the middlemen between semantic tools and the rest of the web, collecting all of the information we put out and putting it into a machine-readable format services can use (and changing it from the format one service uses to the format another service prefers, until semantic markup becomes more standardized).
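The middleman role described above could look something like the following sketch: take a record in one service's field names and re-express it as hCard-style markup. The input field names (full_name, mail, phone) are hypothetical, invented for an imaginary source service; the output class names (vcard, fn, email, tel) are hCard property names. It uses only the standard library.

```python
from html import escape

# Maps a hypothetical source service's field names to hCard class names.
FIELD_MAP = {"full_name": "fn", "mail": "email", "phone": "tel"}


def to_hcard(record):
    """Render a flat record as a minimal hCard-style XHTML fragment."""
    parts = []
    for source_key, hcard_class in FIELD_MAP.items():
        value = record.get(source_key)
        if value:  # skip fields the source record does not have
            parts.append(f'<span class="{hcard_class}">{escape(value)}</span>')
    return '<div class="vcard">' + " ".join(parts) + "</div>"


record = {"full_name": "Example Person", "mail": "myname@mydomain.com"}
print(to_hcard(record))
```

Supporting another target format would mostly mean swapping out the mapping table and the rendering step, which is why this kind of translation suits a service better than a person.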
So we have all that information in machine-readable format; great! Now what can we do with it? We can:
The possibilities for users to enhance their web experience with semantic data are endless. But it’s up to developers to create the services that give them that chance.
If browsers built in microformat awareness (or everyone installed the Operator add-on for Firefox), hCard and other microformats would get much more attention.
I can tell the difference between the two entries at the start of this blog post because one shows up in Operator and the other doesn’t.
But I share your belief that the content creation and consumption tools and services have to become aware before individual authors are likely to engage with microformats.
Once decent mashups/aggregations start to gain speed, people will take the time to mark up their content, hopefully in a systematic/programmatic way.
— John Eckman · 05/29/2008 11:08 AM · #
Absolutely, browser plugins could do the job; but not everyone’s going to install them. When Firefox 3 was first being talked about last year, I was really excited about the possibility of microformat support! But look what’s happened…
If Firefox doesn’t support microformat recognition natively, who’s going to? Maybe Opera, but they’re hardly mainstream.
Even as mashup and aggregation tools become popular, I still don’t think the majority of content creators will make the conscious connection between their everyday content creation and wanting that content to be included in mashup/aggregation services.
The best hope is that the services/tools they use for content creation will save them the trouble of having to make that conscious connection, and automatically mark up relevant data for them, the same way this content box automatically puts each paragraph into <p> tags for me, or e-mail programs automatically append your signature block to e-mails based on your one-time selection of that preference and input of data.
Thanks for commenting, John!
— Jay Neely · 05/29/2008 11:31 AM · #