Information Architecture for the World Wide Web, Second Edition, by Louis Rosenfeld and Peter Morville. O'Reilly, 2002, 486pp. US$39.95, CDN$61.95, UK£28.50
Since the dawn of the World Wide Web, more than a billion web pages have been written. The corporate web site has grown from perhaps a dozen pages to, in some cases, several hundred. As the web has grown more and more complex, it is no longer possible for one person to wear all the hats necessary to design comprehensive, information-rich sites. We have web programmers, web designers, content managers, usability testers, and project managers. And information architects. Though the one-man web shop never consciously paid much attention to it, the organization and classification of a site's content is of the utmost importance now that more and more web surfers use sites for specific tasks, whether searching for information, carrying out a financial transaction, or conducting online research.
The field of information architecture is still young and constantly evolving. Rosenfeld and Morville, pioneers in the field, wrote the first edition of this book in 1998. The four years since have seen the rise and fall of the "Internet economy." Many lessons have been learned, chiefly in economics. But perhaps now more than ever, the prudent use of resources is essential. With their battle scars still fresh, the authors have poured their experience and their experiences into an almost completely rewritten book.
As you would expect, a book written by information architects is exquisitely organized. They divide their book into six major sections:
As an information architecture newbie, I'm particularly grateful for the first two sections. They matter all the more because, although IA has evolved into a discrete discipline all its own, current economic conditions still dictate that, in many companies, someone will have to wear the IA hat in addition to their other responsibilities. A good primer is therefore in order, and at almost half the book's length, these two sections provide a thorough introduction to the field for the non-specialist.
To be completely honest, trying to give you a taste of the content of this book is a little like trying to take a drink from a fire hose, so I'm going to focus on one area where I can present some useful information without reproducing the entire book in this column. The book contains a surprising amount of information on search engines. Search is usually a humble element on a site's main page, but an impressive amount of behind-the-scenes work is hidden in the (usually) simple text box and submit button. Let's see if we can get a better understanding.
Chapter 8 is devoted to Search Systems, and makes up about 10% of the book. In an earlier chapter, the authors describe several models of user behavior when "searching" for information on the web. Some users are looking for a specific piece of information, such as the population of a town, or a ZIP code. They call this type of searching known-item searching (or, more colorfully, "The Perfect Catch"), and it's probably the first thing that comes to mind when thinking of search engines. But searching is much more complex than that. Sometimes users are looking for more than just one answer. For instance, if I want to search for some reviews of the film "The Royal Tenenbaums," I'm not looking for one "right" answer, but for several different viewpoints. This type of user behavior is described as "Lobster Trapping." Finally, Rosenfeld and Morville describe a type of exhaustive search that might be done by a graduate student, or even someone "Googling" their blind date. They call this type of searching "Indiscriminate Driftnetting," where the searcher wants every piece of information available. Though users employ a variety of behaviors to find information, including browsing and using portals and directories, the most common way to find information is to use a search engine. It's not hard, then, to see that designing a search engine is more difficult than it might first appear.
It helps to break down searching into its component parts:
There are many decisions to be made for each of the components. Though the focus of the chapter is on designing site-specific search engines, many lessons can be learned from the ongoing refinement of the web-wide search engines.
The first question the information architect needs to ask when considering a search engine is whether one is actually needed. If users can find information easily just by browsing, a search engine might not be useful. But if the site has a lot of content, especially text-rich content such as white papers, press releases, and the like, then a search engine is essential. A related question is whether a search engine is just a band-aid solution to a poorly designed navigation system. Poor navigation and other architectural weaknesses need to be addressed first, and fixing them may well eliminate the need for a search engine altogether.
When it's decided that a search engine is required, the authors warn the information architect not to leave too much up to the IT (information technology) department. Though the programming will be done by the programmers, the information architect represents the user, and has to advocate for the people who will be using the technology. The IT people can get the search engine to "work," but only the IA people can tell whether the working search engine is of any use.
Once a search engine is chosen, or developed internally, the information architect must decide which content on the web site should be indexed for searching. Search engines can be comprehensive, and that might be a reason to index the content of every single page on your site, but it might make more sense to break the site into logical "search zones" in order to facilitate deeper, more relevant searches. Criteria for search zones can be anything, but examples might be subject/topic, geography, chronology, author, or department/business unit.
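To make the idea of search zones a little more concrete, here is a rough sketch of my own (it is not taken from the book). It assumes each page has been tagged with a single zone and builds a separate word index per zone, so a query can be run against one zone or against the whole site. The page list, the zone labels, and the function names are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical pages: (url, zone, text). Zone labels are illustrative only.
PAGES = [
    ("/press/2002-launch", "press", "new product launch announced"),
    ("/papers/search-design", "papers", "white paper on search design"),
    ("/press/2001-award", "press", "company wins design award"),
]


def build_zone_indexes(pages):
    """Build one inverted index (word -> set of URLs) per search zone."""
    indexes = defaultdict(lambda: defaultdict(set))
    for url, zone, text in pages:
        for word in text.lower().split():
            indexes[zone][word].add(url)
    return indexes


def search(indexes, term, zone=None):
    """Look up a term in one zone, or in every zone when zone is None."""
    zones = [zone] if zone is not None else list(indexes)
    hits = set()
    for z in zones:
        if z in indexes:  # an unknown zone simply contributes no hits
            hits |= indexes[z].get(term.lower(), set())
    return sorted(hits)


indexes = build_zone_indexes(PAGES)
site_wide = search(indexes, "design")                 # both zones
press_only = search(indexes, "design", zone="press")  # narrower search
```

Here the site-wide query finds the white paper and the press release, while the press-only query returns just the press release, which is the kind of narrowing a zone is meant to provide.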
Retrieval algorithms rely on pattern matching: they look through all the documents on the web site for the query terms, either in the exact sequence in which they appear in the query or in close proximity to one another. There are two poles on the "settings" axis for returning results. High "recall" means that the search engine will return the largest number of results, with varying degrees of relevance. High "precision" will return fewer results, but of closer relevance to the original query. Both approaches have their benefits. For those who are "driftnetting," the high recall approach will be best. For those who are "lobster trapping" or looking for the "perfect catch," high precision is a better setting. One way to control these settings is stemming, which takes a query term, reduces it to its root, and returns results for the original term and several variants. Some search algorithms apply higher levels of stemming, and some don't use it at all. Higher levels of stemming return more results, and since they won't all contain the original query word, this is a high recall approach. Low or no stemming returns more precise, but fewer, results (a high precision approach). Another way to shift the balance between precision and recall is to feed "good" results back into the algorithm, which then returns "similar" pages (a high recall approach).
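The trade-off is easier to see with numbers. The sketch below is my own illustration, not the authors': it uses the standard definitions of precision (relevant results retrieved divided by all results retrieved) and recall (relevant results retrieved divided by all relevant documents), a deliberately crude suffix-stripping function standing in for a real stemmer, and four made-up documents. Which documents count as "relevant" is an assumption I chose by hand for the example.

```python
def crude_stem(word):
    """Strip a few common English suffixes. A toy stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ers", "er", "es", "s"):
        # Only strip when at least three letters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-collection, invented for this example.
DOCS = {
    "a": "search tips for beginners",
    "b": "searching the archive",
    "c": "notes from expert searchers",
    "d": "press release archive",
}


def retrieve(query, docs, stem=False):
    """Return the ids of documents containing the query term, optionally stemmed."""
    normalize = crude_stem if stem else str.lower
    target = normalize(query)
    return {
        doc_id
        for doc_id, text in docs.items()
        if target in {normalize(w) for w in text.split()}
    }


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed relevance judgment: the user wants advice on how to search.
RELEVANT = {"a", "b"}

exact = retrieve("search", DOCS)               # matches only "search"
stemmed = retrieve("search", DOCS, stem=True)  # also "searching", "searchers"
print(precision_recall(exact, RELEVANT))
print(precision_recall(stemmed, RELEVANT))
```

Under these assumptions the exact match returns one document, which is relevant, but misses the other relevant one (perfect precision, half the recall). Turning stemming on finds both relevant documents and drags in a third that is not relevant, so recall rises while precision falls.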
Presenting results also requires some reflection by the information architect. Now that the search engine has assembled some good matches, there are two questions to answer: what exactly should be displayed, and how should those results be listed?
The simple guideline for the first question is to display less information to those who know what they are looking for, and more information to those who are not quite sure. By offering more description of the document, the results page offers driftnetters or lobster trappers more information to determine whether the particular result is a "good" document. A good counterbalance to this advice is the observation that most users hesitate to click through to the second or subsequent page of results. Other good advice includes letting the searcher know how many total results have been returned, and offering her a chance to revise or narrow her search from the results pages.
When it comes to listing results, there are two ways to organize information: sorting and ranking. Sorting can be carried out in any number of ways, alphabetically and chronologically being the two most common. Ranking involves using data from the search algorithm to determine relevancy or popularity. Sorting is useful for those users who need to make a choice or decision. Giving these users the ability to sort according to their own needs is very important. On the other hand, ranking results by relevance is usually more useful to those who are seeking knowledge and don't necessarily need a large number of results (note the "recall" vs. "precision" dichotomy again). The main lesson here is that it's very important to know what sort of users are searching on your site.
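The difference between sorting and ranking comes down to which attribute orders the list. In the sketch below (again my own, with invented titles, dates, and scores), sorting uses a property of the document itself, while ranking uses a relevance score that I simply assume the search algorithm has already produced.

```python
from datetime import date

# Hypothetical result set; the "score" field is an assumed relevance
# value that a retrieval algorithm would supply.
RESULTS = [
    {"title": "Zoning Rules", "published": date(2001, 5, 1), "score": 0.91},
    {"title": "Annual Report", "published": date(2002, 3, 15), "score": 0.42},
    {"title": "Search Tips", "published": date(2000, 11, 30), "score": 0.77},
]


def sort_alphabetically(results):
    """Sorting: order by a document attribute, here the title."""
    return sorted(results, key=lambda r: r["title"].lower())


def sort_chronologically(results, newest_first=True):
    """Sorting: order by publication date."""
    return sorted(results, key=lambda r: r["published"], reverse=newest_first)


def rank_by_relevance(results):
    """Ranking: order by the score the search algorithm assigned."""
    return sorted(results, key=lambda r: r["score"], reverse=True)


for label, ordering in [
    ("alphabetical", sort_alphabetically(RESULTS)),
    ("newest first", sort_chronologically(RESULTS)),
    ("by relevance", rank_by_relevance(RESULTS)),
]:
    print(label, [r["title"] for r in ordering])
```

The same three results come back in three different orders, which is why it helps to know whether your users are making a choice (and want to sort) or seeking the best answer (and want a ranking).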
Accommodating the various types of user will be the determining factor behind the design of the search interface. Giving users the option of using natural language or Boolean operators, or letting them decide how many results will be returned per page, or even whether stemming will be applied, will give them a feeling of control and make their searches more fruitful. However, a simple "search box" lets them search the entire site quickly without having to make a lot of decisions. This is why many sites have the simple search box on the front page, with a link to an "advanced search" page for those users who are willing to refine their search criteria. The authors also recommend educating "simple searchers" by giving them the opportunity to refine their search from the results pages.
The authors admit that the search chapter is the longest chapter in their book, but since so much of information architecture is ensuring that information can be found, this should not be surprising. The chapter contains much more information than I've covered here, in addition to many examples. And this chapter is just a taste of what Morville and Rosenfeld have to teach us.
Information Architecture for the World Wide Web is an introductory course in a discipline of which we are all slowly becoming practitioners. That it is such an enjoyable course is due entirely to the knowledge and experience of the authors. Their humility, evident in their willingness to point the reader to other sources of information, is also refreshing. The mixture of theoretical and practical material is particularly useful, especially the chapter on "selling" the need for information architecture in a skeptical, post-"Bubble" economy. I'm confident that this book can teach almost anyone the beginnings of what they need to know about how to define a web site's structure to facilitate information retrieval. Or, to cut the jargon, to make a web site work.