HTML Unleashed PRE: Strategies for Indexing and Search Engines

The search interface is the visible part of a search engine's iceberg. Every
day millions of people enter myriads of keywords into search forms
and get innumerable URLs in response. This is already one of
the biggest and most intensively used information resources on Earth.
I'm not going to teach you how to use search engines, as
that's beyond the scope of this book. However, to create
search-friendly HTML documents, you must be aware of the range of
features offered to the users of modern search engines.
| |
| |
All major search engines offer, besides the simplest form of query
with one or several keywords, some additional search options. However,
the scope of these features varies significantly, and no standard
syntax for invoking them has yet been established. Among the most common
search options are:
- Boolean operators: AND (find all),
OR (find any), AND NOT (exclude) to combine
keywords in queries;
- phrase search: looking for the keywords only if
they're positioned in the document next to each other, in this
particular order;
- proximity: looking for the keywords only if they're
close enough to each other (the notion of "close enough" ranges from
2 intervening words for WebCrawler to 25 words for Lycos);
- media search: looking for pages containing Java
applets, Shockwave objects, and so on;
- special searches: looking for keywords or URLs within
links, image names, document titles;
- various search constraints: limiting the search
to documents created within a given time span, specifying a document
language (Alta Vista), and so on.
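To make the first three options concrete, here is a minimal sketch of
how Boolean, phrase, and proximity matching might work on a single
document. The function names and the simple tokenizer are my own
assumptions for illustration; no search engine mentioned here
documents its implementation this way.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_all(tokens, keywords):
    """AND: every keyword occurs somewhere in the document."""
    return all(k in tokens for k in keywords)

def match_any(tokens, keywords):
    """OR: at least one keyword occurs in the document."""
    return any(k in tokens for k in keywords)

def match_phrase(tokens, phrase):
    """Phrase search: the words occur next to each other, in this order."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def match_near(tokens, a, b, max_between):
    """Proximity: a and b occur with at most max_between words between them."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) - 1 <= max_between
               for i in pos_a for j in pos_b if i != j)

doc = "Search engines index HTML documents; a search form accepts keywords."
tokens = tokenize(doc)
print(match_all(tokens, ["html", "keywords"]))     # AND
print(match_phrase(tokens, ["search", "form"]))    # phrase
print(match_near(tokens, "engines", "html", 2))    # proximity, WebCrawler-style limit
```

AND NOT falls out of the same pieces: a document matches if
match_all() succeeds for the wanted keywords and match_any() fails for
the excluded ones.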
You should be aware that even with a full inventory of these bells
and whistles, you cannot expect a search engine to offer capabilities
comparable to, say, the Search dialog in Microsoft Word.
For example, Alta Vista suggests using its database as a spelling
dictionary: search for CDROM and CD-ROM and see
which will "win" by yielding more results. A bright idea, but you
can't settle the controversy of World Wide Web vs. World-Wide Web
in the same way, simply because the system treats
both hyphens and spaces as "separators" and cannot differentiate
between them. Those accustomed to regular expressions such as
those used in Perl or awk can't even dream of using something
similar with search engines.
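The separator problem is easy to see in a small sketch. The
index_terms() function below is a hypothetical indexer that splits on
any run of non-alphanumeric characters; it is only an assumption about
how such a system might behave, not Alta Vista's actual code.

```python
import re

def index_terms(text):
    """Split on any run of non-alphanumeric characters (hypothetical indexer)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

# Hyphens and spaces are both separators, so these two spellings
# produce the same list of terms and cannot be told apart.
print(index_terms("World Wide Web") == index_terms("World-Wide Web"))

# CDROM is one term while CD-ROM becomes two, so these queries do differ.
print(index_terms("CDROM"), index_terms("CD-ROM"))
```

Under this assumption, the CDROM vs. CD-ROM comparison works only
because removing the hyphen changes the number of terms, which is not
the case for World Wide Web.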
In the future, search engines may offer more sophisticated
options, although for now, their search interfaces seem
to be developing in another direction, described in the
following subsection.
| |
| |
Recently, several search engines have developed schemes to categorize
search results by grouping those with a similar "keyword
spectrum." By selecting the Refine button in Alta Vista, you get a
list of several categories that your results fall into, and you can
include or exclude any category in the next search
iteration.
Similarly, Excite invites you to
"Select words to add to your search," with these additional keywords
extracted from the results just obtained. This selection
allows you to narrow the search in a much more efficient fashion
than you could do by blindly trying different keywords.
Northern Light Search also
sorts its search results into "folders" based on their content and
the domain URL. All these features make really powerful
searching possible by interactively detecting trends in the data.
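As a rough illustration of how refinement keywords could be extracted
from a result set, the sketch below counts the words that appear in the
most result documents, leaving out the query words themselves. The
suggest_terms() function, the tiny stopword list, and the sample
results are all assumptions of mine; Excite's real algorithm is not
described here.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def suggest_terms(results, query, limit=3):
    """Return the words found in the most result documents, query words excluded."""
    query_words = set(tokenize(query))
    counts = Counter()
    for text in results:
        # Count each word once per document, so one long page cannot dominate.
        counts.update(set(tokenize(text)) - query_words - STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]

results = ["jaguar car dealer", "jaguar car engine", "jaguar cat habitat"]
print(suggest_terms(results, "jaguar", limit=1))
```

Picking one of the suggested words and adding it to the query is then
just another search iteration.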
| |
| |
All search engines rank their results so that more relevant
documents are at the top of the list. This sorting is based
on, first, the frequency of keywords within a document, and second,
the distance of keyword occurrences from the beginning of the
document.
In other words, if one document contains two matches for a keyword
and another is identical but contains only one, the first document
will be closer to the top of the list. If two documents are identical
except that one has a keyword positioned closer to the top
(especially in the document title), it will come first.
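The two ranking principles can be sketched in a few lines. The
weighting below (keyword count plus a bonus of at most 1 for an early
first occurrence) is purely my assumption, chosen so that frequency
dominates and position breaks ties; real engines do not publish their
formulas. The score() function expects a single-word keyword.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(text, keyword):
    """Score a document: keyword frequency first, earliness of first match second."""
    tokens = tokenize(text)
    positions = [i for i, t in enumerate(tokens) if t == keyword.lower()]
    if not positions:
        return 0.0
    frequency = len(positions)
    # The bonus is 1.0 when the keyword is the first word and shrinks after that.
    earliness = 1.0 / (1 + positions[0])
    return frequency + earliness

def rank(documents, keyword):
    """Return document names ordered from most to least relevant."""
    return sorted(documents,
                  key=lambda name: score(documents[name], keyword),
                  reverse=True)

documents = {
    "a": "spiders crawl the web",
    "b": "the web is big and the web is old",
    "c": "notes on how spiders crawl the web",
}
print(rank(documents, "web"))
```

Here document "b" wins on frequency, and "a" beats "c" only because
its single match sits closer to the beginning.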
In addition to these principles, some search engines use extra factors,
called relevancy boosters, to determine the ranking order. For
instance, HotBot and Infoseek
favor those documents that make use of META tags over their
METAless peers.
WebCrawler relies on link
popularity: if a page is linked frequently from other pages and
sites, it is considered "more authoritative" and gets some priority
on the list of results. Excite, being a combination of a search engine
and a directory, quite naturally gives preference to those pages that
are reviewed in its directory.
Finally, all search engines try to fight unfair practices of some
webmasters who attempt to fool the ranking algorithm by repeating
keywords to improve their effective frequency in the documents. You
might have noticed pages with a tail of hundreds of repeated keywords
(usually made invisible in browsers by changing font color, but still
visible to search engines) or pages with multiple TITLE elements
(again, only the first one is visible in browsers, but all are indexed
by a spider). Now, not only do such "keyword spammers" not receive
high rankings, but many search engines also automatically exclude them from
the database. (For more on spamming, see "The Meta Controversy" later
in this chapter.)
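To show what a spider could measure on such pages, here is a
hypothetical sketch that reports two signals: the number of TITLE
elements and the share of the page's words taken up by its single most
frequent word. The spam_signals() function and both signals are my own
assumptions; the engines do not disclose how they actually detect
spamming.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TitleCounter(HTMLParser):
    """Count the TITLE start tags found in a page."""
    def __init__(self):
        super().__init__()
        self.titles = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.titles += 1

def spam_signals(html):
    """Return the TITLE count and the share of the most frequent word."""
    counter = TitleCounter()
    counter.feed(html)
    counter.close()
    # Strip tags crudely; font color is ignored, so hidden text still counts.
    text = re.sub(r"<[^>]*>", " ", html)
    words = re.findall(r"[a-z0-9]+", text.lower())
    top_share = 0.0
    if words:
        top_share = Counter(words).most_common(1)[0][1] / len(words)
    return {"title_count": counter.titles, "top_word_share": top_share}

page = ("<html><head><title>Cars</title><title>cars cars</title></head>"
        "<body>cars cars cars cheap</body></html>")
print(spam_signals(page))
```

Any thresholds applied to these numbers would be a policy decision of
the individual engine.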
| |
| |
Usually, lists of search results contain document titles, URLs,
summaries, sometimes the dates the documents were created (or, with
some search engines, the dates they were added to the database), and
document sizes. For compiling document summaries, several approaches
have been developed.
Many search engines use META descriptions provided by
page authors, but when META data is unavailable, they
usually take the first 100 or 200 characters of page text. Excite
stands apart by ignoring META tags altogether and employing
a sophisticated (but not particularly well-performing) algorithm
that extracts sentences appearing to be the "theme" of the page and
presents them as the page's summary.
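The common approach (a META description if there is one, otherwise the
beginning of the page text) might be sketched as follows, using
Python's standard html.parser module. The SummaryParser class, the
summarize() function, and the decision to skip TITLE, SCRIPT, and STYLE
content are my assumptions, not the behavior of any particular engine.

```python
from html.parser import HTMLParser

class SummaryParser(HTMLParser):
    """Collect the META description and the visible page text."""
    SKIPPED = ("title", "script", "style")

    def __init__(self):
        super().__init__()
        self.description = None
        self.text_parts = []
        self._skip = 0  # depth inside elements whose text is not page text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag in self.SKIPPED:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)

def summarize(html, limit=200):
    """Prefer the author's META description; fall back to leading page text."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    if parser.description:
        return parser.description.strip()
    # Collapse runs of whitespace before cutting the text to size.
    text = " ".join(" ".join(parser.text_parts).split())
    return text[:limit]

page = ('<html><head><title>T</title>'
        '<meta name="description" content="A page about spiders.">'
        '</head><body><p>Body text here.</p></body></html>')
print(summarize(page))
```

Setting limit to 100 or 200 reproduces the two cutoffs mentioned above.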
However, the solution that seems optimal to me is that used by
Aport, a Russian search
engine. Instead of generating summaries, Aport just lists,
for each document found, the sentences from the document that
matched the query. Indeed, in order to decide if a document is
worth browsing, we're often more interested in the context of the
keyword match than in what sort of document it is.
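A keyword-in-context listing of this kind is simple to sketch. The
matching_sentences() function below, with its naive sentence splitter,
is my own illustration of the idea and is not based on Aport's actual
implementation.

```python
import re

def matching_sentences(text, query):
    """Return the sentences that contain at least one query word."""
    keywords = {word.lower() for word in query.split()}
    # Naive splitter: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if keywords & words:
            hits.append(sentence)
    return hits

text = "Spiders crawl the Web. They follow links. A spider revisits pages."
print(matching_sentences(text, "links"))
```

Note that the match is on whole words, so a query for "spiders" does
not pick up the sentence that mentions only "spider".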
Aport has a number of other features unique among search engines.
For example, it allows you to retrieve a text-only reconstruction of the
document directly from the search engine's database, in case the
original document (or the server it's stored on) is inaccessible.