Sailing Page Technical Report
I choose free libraries as the best agencies for improving the masses of the people, because they give nothing for nothing. They only help those who help themselves. They never pauperize. They reach the aspiring and open to these the chief treasures of the world -- those stored up in books. A taste for reading drives out lower tastes.
- Andrew Carnegie
In the pre-computer days of my youth, when you went to the local public library with a question, the librarian didn't reach for the book of "all knowledge," but instead guided you to books with more prosaic titles such as the "Science Citation Index" or "The Readers' Guide to Periodical Literature." It was in these more specialized indices that I spent a good bit of my high school years, learning to write papers, and just maybe learning a useful fact or two.
Nowadays on the Internet, it seems, we use search engines that claim to index all knowledge, or at least that tiny smudge of knowledge that has made its way onto the Internet. Search engines incorporate web walkers that visit the hyperlinks on the pages they already know about, index the new pages they find, and move on to the hyperlinks present on those pages. Folks can also register their pages with search engines, which is another way the web of links grows.
Even in early 1995, a year after I started my sailing page, it became clear to me that general search engines weren't producing particularly useful results for sailing queries. I then happened upon the set of ideas and technologies that my former colleague George Furnas dubbed "Restricted Domain Search."
Another colleague, Larry Stead, had just written a quite flexible web walker in the programming language Lisp. I had started running this web walker against my sailing pages to check for moved or dead links. The walker would start at my sailing page, walk one level off my site, and check that all the links were still good. Given the rate at which things were changing on the net, this was a useful application of the walker. It then occurred to me that many of the sites I pointed to also pointed to sailing material; that is, they described a set of sites more dense in sailing information than, say, a randomly chosen set of sites. It was clear that if I could provide a search facility over this larger set, it would be a valuable service: it would cover a larger base of sites than I could assemble by hand, and, because the information was denser, it had the potential to produce better hits than a more general search engine, with fewer false hits.
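The Lisp walker itself isn't reproduced here, but the one-level link check is simple enough to sketch. Below is a minimal, illustrative version in Java: fetch the start page, pull out its links, and ask each link whether it still answers. The class name, the start URL, and the crude href extraction are my own assumptions for the sketch, not the actual walker.

    import java.io.*;
    import java.net.*;
    import java.util.*;
    import java.util.regex.*;

    // Minimal sketch of a one-level link checker (the original walker was in Lisp).
    // Names and URLs here are illustrative, not the actual walker's API.
    public class LinkChecker {
        private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            String start = "http://example.com/sailing.html";   // hypothetical start page
            for (String link : extractLinks(fetch(start), start)) {
                if (!link.startsWith("http")) continue;          // only check web links
                int code = status(link);
                if (code >= 400 || code < 0) {
                    System.out.println("DEAD (" + code + "): " + link);
                }
            }
        }

        static String fetch(String url) throws IOException {
            try (InputStream in = new URL(url).openStream()) {
                return new String(in.readAllBytes());
            }
        }

        static List<String> extractLinks(String html, String base) {
            List<String> links = new ArrayList<>();
            Matcher m = HREF.matcher(html);   // crude href extraction; a real walker parses HTML
            while (m.find()) {
                try {
                    links.add(new URL(new URL(base), m.group(1)).toString());
                } catch (MalformedURLException e) {
                    // skip links we cannot resolve
                }
            }
            return links;
        }

        static int status(String url) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                conn.setRequestMethod("HEAD");   // a HEAD request is enough to see if the page still exists
                conn.setConnectTimeout(5000);
                return conn.getResponseCode();
            } catch (IOException e) {
                return -1;                       // unreachable counts as dead
            }
        }
    }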
A group of researchers at Bellcore, including George Furnas and another colleague at the time, Susan Dumais, developed and patented a search technology called Latent Semantic Indexing (LSI), which was available to me and has some unique properties for capturing synonyms and disambiguating queries. A large number of papers have been written on LSI, so I refer you to them for the real scoop. My simplified view is that LSI takes a bunch of documents and places them in a very high-dimensional space (thousands of dimensions). In an interesting way, it projects those documents onto a lower-dimensional space (hundreds of dimensions) such that relevant documents end up close to each other. A query (which is just a document) is placed in this new space, and the documents close to the query are retrieved. Another way to think about the process is that LSI clusters similar documents together.
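To make that simplified view a little more concrete, here is an illustrative sketch. It assumes the singular value decomposition at the heart of LSI has already been computed elsewhere, giving a term matrix U, the singular values, and each document as a k-dimensional vector; a query is then folded into the same reduced space and documents are ranked by cosine similarity to it. This is geometry only, not the actual back end.

    import java.util.*;

    // Sketch of LSI-style retrieval, assuming the SVD of the term-by-document matrix
    // has been computed elsewhere: U (terms x k), sigma (k singular values),
    // and docVecs (documents x k). All names and data are illustrative.
    public class LsiRetrieval {

        // Fold a raw query term vector into the k-dimensional LSI space:
        // qHat = q^T * U_k * Sigma_k^{-1}
        static double[] foldIn(double[] query, double[][] U, double[] sigma) {
            int k = sigma.length;
            double[] qHat = new double[k];
            for (int j = 0; j < k; j++) {
                double sum = 0.0;
                for (int i = 0; i < query.length; i++) sum += query[i] * U[i][j];
                qHat[j] = sum / sigma[j];
            }
            return qHat;
        }

        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
            return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
        }

        // Rank documents by closeness to the folded-in query; best matches first.
        static Integer[] rank(double[] qHat, double[][] docVecs) {
            double[] score = new double[docVecs.length];
            Integer[] order = new Integer[docVecs.length];
            for (int i = 0; i < docVecs.length; i++) {
                score[i] = cosine(qHat, docVecs[i]);
                order[i] = i;
            }
            Arrays.sort(order, (x, y) -> Double.compare(score[y], score[x]));
            return order;
        }
    }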
In early April 1995, I used the web walker to walk out three levels from my sailing page, and with a good deal of help from Susan Dumais, I indexed the material that came back. On April 4, 1995, I finished the cgi-bin programs that glued LSI to my sailing page and made my experimental sailing search available to the world. It has been quite successful and is always among the top 10 pages requested from my site. Todd Letsche, an intern, greatly reworked the LSI back end in 1996, and that is the code I am currently using.
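For those unfamiliar with cgi-bin glue, its general shape is roughly as follows. This is only an illustrative sketch: the web server hands the program the query in the QUERY_STRING environment variable, the program runs the query against the search back end, and it prints an HTML result page. The runLsiQuery placeholder and the URLs are made up; the real LSI back end is not shown.

    import java.io.*;
    import java.net.URLDecoder;

    // Illustrative sketch of cgi-bin glue between a search form and an LSI back end.
    public class SailingSearchCgi {
        public static void main(String[] args) throws Exception {
            String qs = System.getenv("QUERY_STRING");            // e.g. "q=spinnaker+trim"
            String query = qs == null ? "" : URLDecoder.decode(qs.replaceFirst("^q=", ""), "UTF-8");

            System.out.println("Content-type: text/html");
            System.out.println();                                 // blank line ends the CGI header
            System.out.println("<html><body><h1>Sailing search results</h1><ol>");
            for (String hit : runLsiQuery(query)) {               // placeholder for the LSI back end
                System.out.println("<li><a href=\"" + hit + "\">" + hit + "</a></li>");
            }
            System.out.println("</ol></body></html>");
        }

        // Placeholder: in a real system this hands the query to the LSI index
        // and gets back the closest documents.
        static String[] runLsiQuery(String query) {
            return new String[] { "http://example.com/sailing/faq.html" };   // made-up result
        }
    }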
While this body of indexed material has been very useful, from the start it was clear that the process wasn't working exactly as envisioned. For instance, the Cal Tech sailing club (one of the early clubs with material online) has a link to the main Cal Tech home page, which then has links all over Cal Tech, none of which is useful to a sailing search. Excluding the Cal Tech home page was easy, since the web walker provided a very general regular expression language for including or excluding pages. The difficulty was going through the results after a walk and finding the irrelevant pages.
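I don't reproduce the walker's actual include/exclude language here, but the effect is easy to sketch. The illustrative Java snippet below applies a list of exclusion regular expressions to candidate URLs before they are walked or indexed; the patterns and URLs are made-up examples.

    import java.util.*;
    import java.util.regex.*;

    // Sketch of applying an exclusion list of regular expressions to candidate URLs.
    // The patterns and URLs below are invented for illustration.
    public class ExcludeList {
        private final List<Pattern> excludes = new ArrayList<>();

        void add(String regex) { excludes.add(Pattern.compile(regex)); }

        boolean allowed(String url) {
            for (Pattern p : excludes) {
                if (p.matcher(url).find()) return false;   // any match means: do not walk or index
            }
            return true;
        }

        public static void main(String[] args) {
            ExcludeList list = new ExcludeList();
            list.add("^http://www\\.caltech\\.edu/?$");     // hypothetical: skip the main Cal Tech home page
            list.add("/cgi-bin/");                          // hypothetical: skip dynamic pages
            System.out.println(list.allowed("http://www.caltech.edu/"));            // false
            System.out.println(list.allowed("http://sail.caltech.edu/club.html"));  // true (made-up URL)
        }
    }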
I built a fairly simple tool for viewing pages and writing exclusion lists using the dired (directory editor) mode of GNU Emacs. I would point at a file, dired would direct a Netscape browser to display the page, and if the page should be excluded I would hit a key. These excluded pages were collected, left out of the indexing, and excluded from the next walk. Even for moderately sized walks this tool was usable, if exceedingly tedious.
As always, I had an idea: what if I could use LSI as a simple classifier of web pages, not just as the search engine? First, I would build an LSI space of known sailing and non-sailing materials. Then I'd start the web walker out. When a new page was retrieved, I would see whether it was close to other sailing material; if so, I'd index it and note its links to walk, and otherwise I'd toss it. The risk of such a system is that the classifier might accidentally toss useful material or include irrelevant material, so the initial world has to be constructed carefully.
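One plausible way to turn LSI into such a go/no-go classifier (my own assumption about the decision rule, not a description of what was actually built) is to fold each new page into the LSI space and compare it to the centroid of the known sailing documents, with a threshold tuned on the carefully constructed initial set:

    // Sketch of LSI as a yes/no classifier for the walker. The centroid-plus-threshold
    // decision rule is an assumption made for illustration.
    public class SailingClassifier {
        private final double[] centroid;    // mean of known sailing documents in the LSI space
        private final double threshold;     // minimum cosine similarity to count as "sailing"

        SailingClassifier(double[][] knownSailingVecs, double threshold) {
            int k = knownSailingVecs[0].length;
            centroid = new double[k];
            for (double[] doc : knownSailingVecs)
                for (int j = 0; j < k; j++) centroid[j] += doc[j] / knownSailingVecs.length;
            this.threshold = threshold;
        }

        // True: index the page and note its links to walk. False: toss it.
        boolean isSailing(double[] pageVec) {
            return cosine(pageVec, centroid) >= threshold;
        }

        private static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
            return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
        }
    }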
When I first contemplated this new technology, I thought of augmenting the Lisp walker, but it had really run out of steam. For another project I got approval to write a web walker in Java (my second walker -- my first was in GNU Emacs elisp). With the newly designed walker it was simple to architect in a place for an LSI interface object to control which pages get walked. I have done some small, promising experiments, but was unable to obtain further funding to continue this work (though the new walker is in use in a number of projects).
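The hook itself might look something like the interface below; the names are illustrative, not the walker's actual API. The filter folds each retrieved page into the LSI space and asks a classifier like the one sketched above whether the page is worth indexing and following.

    import java.util.function.Function;

    // Sketch of a hook a walker could expose so an LSI object controls which pages are walked.
    public interface PageFilter {
        // Return true to index the page and queue its outgoing links;
        // return false and the walker tosses the page.
        boolean shouldWalk(String url, String pageText);
    }

    // An LSI-backed filter: fold the page text into the LSI space and consult
    // the classifier from the previous sketch.
    class LsiPageFilter implements PageFilter {
        private final Function<String, double[]> foldIn;   // text -> k-dimensional LSI vector
        private final SailingClassifier classifier;        // from the sketch above

        LsiPageFilter(Function<String, double[]> foldIn, SailingClassifier classifier) {
            this.foldIn = foldIn;
            this.classifier = classifier;
        }

        public boolean shouldWalk(String url, String pageText) {
            return classifier.isSailing(foldIn.apply(pageText));
        }
    }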
I have some proposals in to continue this work, and hope to actually collect some data on whether these ideas are effective or not. In the meantime, I have scaled back to two levels the distance the walker walks from my page, which cuts down on the irrelevant paths, and I continue to add to the exclude list as time allows.
Egan, D., Lesk, M.E., Ketchum, D., Lochbaum, C.C., Remde, J.R., and Landauer, T.K. Hypertext for the Electronic Library? CORE Sample Results. In Hypertext '89 (Pittsburgh, Nov. 1989), ACM, New York, 299-312.
Allen, R.B. Electronic Proceedings for IWANNT'93. Short Papers, ACM SIGCHI '94, Boston, 1994.
Note: in the database world, this type of work is referred to as "focused crawling." See, for instance, Focused Crawling Using Context Graphs by M. Diligenti, F.M. Coetzee, S. Lawrence, C.L. Giles, and M. Gori.