Put The Squeeze On Web Searches
Sometimes it seems like we spend half our lives searching. Be it for money, the car keys, or the meaning of life, there is always gratification in finding what you are looking for. Think about that some more, and you realize that it is not always good to find more than you are seeking. Finding keys to a car you sold long ago is not very helpful, and neither is searching the Web for the meaning of life and finding more than 10 million pages, all with different answers.
It would be wonderful if this month's column found the meaning of life, but it doesn't. Instead, I discuss one of my pet peeves -- watching people search the Web ineffectively. So often in classes I notice people doing a search and becoming frustrated by the sheer volume of responses. The problem lies not in the amount of data available (more is better), but in the search methods used.
What really amazes me is that these same people can spend hours fixing a program, researching an engineering problem, or solving a computer problem. I guess that performing an accurate Web search is not that important, but it sure can save time if you get a short list of relevant items instead of a huge list that can knot your stomach.
TAPE @ ALTAVISTA
Let's start with an example of an ineffective search. Suppose we're looking for tape cartridges for our backup system that uses the TR4 format. If we enter tr4 tape cartridges into a search engine like www.altavista.digital.com, we get over 240,000 matches. Because the search engines put the "best" matches first, chances are that what we were looking for is near the top, but still...
If we have a little bit of Web savvy, we might realize that we did not want tape or cartridges, but instead wanted tape cartridges as a phrase. To be more specific, tell the search engine to treat the two words as one by making a phrase out of them. This is done with quotes. Let's try a search for tr4 "tape cartridges".
HIT ME WITH YOUR BEST
This returns over 11,300 "hits" in the database. Still daunting, and still lots of things in there that we did not want. If you spend a little time in the help pages on the search Web sites, you'll find some meta-characters that most search engines accept. One of the most commonly used is +. It requires the word or phrase that follows. Knowing that, we can trim the search result down to a smaller and more relevant 4,100 or so with the pattern tr4 +"tape cartridges".
This seems pretty good, but it actually did not do what we wanted. To prove it, look at what each half returns when searched on its own. The phrase "tape cartridges" brings back over 4,100 matches, about the same count as the pattern above, while tr4 retrieves almost 7,300.
What the original pattern really said was: return pages that contain "tape cartridges", and I don't care whether "tr4" shows up. What I really wanted was pages that had info about tape cartridges of the tr4 type. Thus, I really meant to require both.
BACKING UP YOUR SPORTS CAR
The pattern +tr4 +"tape cartridges" returned fewer than 50 hits, and all of them contained both terms. I can be pretty sure that all pages written by Triumph TR4 automobile buffs were removed. Keep in mind that I also might have lost pages that described tr4 tapes but didn't use the phrase "tape cartridges." Sometimes you do not want to get overly specific.
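To make the difference between the two patterns concrete, here is a minimal Python sketch. It is not how altavista works internally: the function names page_matches and search, the three sample pages, and the plain substring test are all assumptions made for illustration. It treats + terms as required and unmarked terms as optional.

```python
def page_matches(text, required=(), optional=()):
    """Return True if a page would be listed for the given terms.

    Assumption: a term "matches" when it appears anywhere in the text,
    ignoring case. Real engines index whole words and rank the results.
    """
    text = text.lower()
    # Every + term must be present.
    if not all(term.lower() in text for term in required):
        return False
    # With no + terms at all, at least one plain term must appear.
    return bool(required) or any(term.lower() in text for term in optional)


# Hypothetical pages, invented for this example.
PAGES = {
    "backup": "TR4 tape cartridges for your backup drive",
    "generic": "Bulk tape cartridges in every format",
    "car": "Restoring a Triumph TR4 roadster",
}


def search(required=(), optional=()):
    """Return the sorted names of the sample pages that match."""
    return sorted(name for name, text in PAGES.items()
                  if page_matches(text, required, optional))


# tr4 +"tape cartridges": only the phrase is required.
loose = search(required=["tape cartridges"], optional=["tr4"])
# +tr4 +"tape cartridges": both are required.
strict = search(required=["tr4", "tape cartridges"])
```

In this toy index the loose pattern still lists the generic cartridge page, while the strict pattern keeps only the page that mentions both terms.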
Two of the more commonly used Web search engines (altavista.digital.com and yahoo.com) provide similar search directives. Details are available from links on the search pages, but here is my take.
The first thing to realize is that the "main" search pages require a different syntax than the "advanced" or "search options" pages that are linked near the search input boxes. This is rather annoying, but then, so is the whole current Web atmosphere. The list below describes only the syntax used in the main search pages.
The primary characters recognized are:
- " used to combine words into phrases,
- + require the following phrase in any match,
- - omit any page that contains the following phrase,
- * match partial words,
- section: limits matches to sections of an HTML document.
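As a rough illustration, the characters in the list above can be picked out of a query string with one regular expression. This is a sketch under stated assumptions, not either engine's actual parser: the names TOKEN, MODES and parse_query, and the labels "require", "exclude" and "optional", are hypothetical.

```python
import re

# One token: optional + or -, optional "section:" prefix, then either a
# quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'([+-]?)(?:(\w+):)?(?:"([^"]+)"|(\S+))')

# Hypothetical labels for the three ways a term can be marked.
MODES = {"+": "require", "-": "exclude", "": "optional"}


def parse_query(query):
    """Split a query into (mode, section, text, wildcard) tuples."""
    terms = []
    for sign, section, phrase, word in TOKEN.findall(query):
        text = phrase or word
        wildcard = text.endswith("*")  # trailing * marks a partial word
        terms.append((MODES[sign], section or None,
                      text.rstrip("*"), wildcard))
    return terms
```

Calling parse_query('tr4 +"tape cartridges"') yields one optional term and one required phrase, which is exactly the distinction that matters when a search returns more than expected.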
Because we have already described the first two in the list, we'll start with the - symbol. It means that any page containing this word should not be included. If we continue with the example started above, we could search for tr4 -triumph and get roughly 6,430 hits, compared to the almost 7,300 for tr4 alone. I guess there aren't that many British sports car fans.
The * character is used much as it is in matching filenames from a UNIX shell. It means unlimited text can follow. My first attempt to use it revealed that it will not match with fewer than three characters in front of it, a limit that prevents too many matches. Here was my attempt: tr*. The goal was to find pages about the Triumph TR3 and TR4 cars. It returned zero matches.
I tried tr3 and that returned 11,000 or so. My next attempt was to be more specific, but I ended up getting lucky. When I tried +tr* +triumph I got the following error:
Sorry, wildcard (*) must be at least three characters from start of word: tr*
Interesting that there was no error from the original tr* search, but heck, that type of thing is what keeps a computer educator in business. A better example of using the * character might be if I were a fly fisherman and wanted pages about how to tie a fly that matches a particularly flavorful insect. The search pattern baetis returned almost 900 matches, but baetis* returned over 2,260, since there are several suffixes to the name baetis. Interestingly, you cannot use the * to match the beginning of a word, only the end. The pattern *fred returned nothing, while fred* returns over five million.
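The wildcard rules observed above (a trailing * only, with at least three characters in front of it) can be sketched in a few lines of Python. The function name wildcard_match, the constant MIN_PREFIX, and the choice to raise ValueError for a short prefix are assumptions for illustration; the error text simply echoes the message quoted earlier.

```python
MIN_PREFIX = 3  # characters required before the *, per the error message


def wildcard_match(pattern, word):
    """Test one word against a pattern that may end in a single *."""
    pattern, word = pattern.lower(), word.lower()
    if "*" not in pattern:
        return pattern == word
    prefix = pattern[:-1]
    # A * anywhere but the very end never matches in this sketch,
    # mirroring the empty result for *fred.
    if not pattern.endswith("*") or "*" in prefix:
        return False
    if len(prefix) < MIN_PREFIX:
        raise ValueError(
            "wildcard (*) must be at least three characters "
            "from start of word: " + pattern)
    return word.startswith(prefix)
```

Under these assumptions, wildcard_match("baetis*", "baetisca") is true, wildcard_match("*fred", "alfred") is false, and wildcard_match("tr*", "tr4") raises the error instead of quietly returning nothing.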
The final directives mentioned above vary between search engines, but serve the same purpose. You can look for a match in specified sections of an HTML document. Many of them require some familiarity with HTML, but others just plain make sense.
There are many sections you can search in. For example, the following all refer to sections in the altavista site: anchor: link: text: title: url: and there are several more. Yahoo's engine lets you abbreviate sections to just one character, for example title to just t:. Actually, Yahoo only allows two sections, URL (u:) and title (t:).
If we were sick of work and wanted to go flyfishing in the Bahamas, we might search altavista with +flyfishing +bahamas and get almost 950 hits. By searching for +title:Flyfishing +title:Bahamas there are only three, all definitely with information that I need, and, right now, that's exactly what I want.
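For the curious, a title: restriction amounts to testing the words against the page's title element only, rather than the whole document. The sketch below uses Python's standard html.parser module; the class TitleExtractor, the function title_has_all, and both sample pages are invented for illustration and say nothing about how altavista actually indexes titles.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def title_has_all(html, words):
    """Return True if every word appears in the page title, ignoring case."""
    parser = TitleExtractor()
    parser.feed(html)
    title = parser.title.lower()
    return all(word.lower() in title for word in words)


# Hypothetical pages: the same words, in the title or only in the body.
IN_TITLE = ("<html><head><title>Flyfishing the Bahamas flats</title></head>"
            "<body>Guides and lodges</body></html>")
IN_BODY = ("<html><head><title>Trip report</title></head>"
           "<body>Flyfishing in the Bahamas</body></html>")
```

Here title_has_all(IN_TITLE, ["Flyfishing", "Bahamas"]) is true and the same call on IN_BODY is false, which is why the title-restricted search trims the list so sharply.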
--Fred was last seen searching for title: +the +meaning +of +life in the +Bahamas.