A Look at Search Engines
icekin — Wed, 2006/03/08 - 06:20

This paper attempts to explain the basic idea behind search engines, their pros and cons as well as take a look at new trends in searching. Any feedback appreciated.
Note : While most of the theory presented here is accurate, some of it is based on logical deduction. For example, visitors have told me that result sorting on Google is purely pagerank based. Logically, we can guess that Google and other search engines also track outgoing links to determine popularity, but exactly how and in what proportions each of these factors determines the final result is not clearly known.
Contents
1. Parts of a Search Engine
a) Spider
b) Index
c) Algorithm
d) Server
e) Webpage
2. How It Works
3. Limitations and Weakness
a) No coverage of the Deep Web
b) No access to restricted or private sites
c) Search Engine Spam and Bombing
d) Problem with word count relevancy method
Appendix: A look at E-Advertising
4. Advantages of Search Engines
5. Trouble Shooting your Search
6. List of Search Engines
a) General
b) Multi & Meta
c) Clustering
e) Specialized
f) Experimental
7. Bibliography
One of the main instruments of Internet Research is the Search Engine. Before we look at how to use these effectively, we will examine through the course of this section, how a search engine works and its strengths and weaknesses.
1) Parts of a Search Engine
Search engines generally comprise the following main components:
a) Spider (also called ant or crawler)
The role of a spider is to start at one page, then visit every link on that page, then visit every link on each of those pages and so on in a recursive fashion. Assuming that the starting page has a good number of links to begin with, the spider will in time construct a huge database of web pages. Each page that the spider visits is copied onto the hard disk of the search engine company. Now, it is possible that many pages on the internet may link to the same webpage. Thus, each time the spider visits a page, it will check it against its index to see if that page is already there. If so, it will also check whether the version of the page it is viewing is newer than the one in its database by comparing the date, size and content of the page. If the current page is newer, it will update the search engine index with the new copy of the page. Most times the copy of the old page is removed from the search engine's records. But if the new page is a blank or error page, then the spider will retain the old copy of the page.
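The crawling behaviour described above can be sketched in a few lines of Python. This is only an illustration: the tiny in-memory WEB dictionary, the example URLs and the crawl function are made up for the sketch, a queue is used in place of literal recursion, and a real spider would fetch pages over the network and compare dates and sizes as well as content.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page content, outgoing links).
WEB = {
    "a.example": ("Home page", ["b.example", "c.example"]),
    "b.example": ("About shoes", ["c.example"]),
    "c.example": ("", ["a.example"]),  # currently serves a blank page
}

def crawl(start, web, index):
    """Visit every page reachable from start and update the index copies."""
    queue = deque([start])
    seen = {start}
    while queue:
        url = queue.popleft()
        content, links = web[url]
        # Store the page if it is new or changed; keep the old copy
        # when the freshly fetched page is blank.
        if content and index.get(url) != content:
            index[url] = content
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# The index already holds an older copy of one page.
index = {"c.example": "Old copy of page C"}
crawl("a.example", WEB, index)
```

After the crawl, the index holds fresh copies of a.example and b.example, while the old copy of c.example is retained because the page currently being served is blank.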
b) Index
The database of web pages on the search engine's hard disk is called the index. Different search companies may address the index using different names. Google, for instance, calls theirs a cache. One must remember that at any one time the company will have employed several spiders across the internet. Thus the rate at which the index builds up with content is high. But then again, there are an enormous number of web pages on the internet and the rate at which new pages, blogs and sites are appearing is ever increasing.
A search engine's spiders are constantly scouring the internet, day and night. But due to the huge volume of web pages combined with the rate of increase of new pages and updates to current pages, a search engine's spiders will not be able to keep a constant eye on every page on the net. Thus, there is a time span between a spider's visit to a page and its next visit to that page. This time span will vary from one search engine to another.
In Google or Yahoo, you can check the time of the spider's last visit to a page by clicking on the cached copy link next to the search result. This will take you to the search engine's index copy of that page. The date and time that the page was copied to the index will also be given. The time span between the spider's visits may vary; sometimes it's over a week.
You may realize that if there are copies of so many pages in the index, it must naturally be quite large. Its size would be in the order of several hundred terabytes. However, the index does not reside on a single hard disk, which would be impossible since there is no hard disk with that much capacity. The index of most search companies is stored across the world as explained in the Server section later.
c) Algorithm
In a simple definition, the algorithm is basically the idea behind how something works. For search engines, the algorithm is implemented in the form of a computer program that resides on the search engine's servers. There are several algorithms and several resulting programs, each one responsible for some task. For example, the decision taken by the spider whether or not to refresh its index copy of a certain webpage is governed by an algorithm.
Search companies invest a huge amount of research effort in making better algorithms and more efficient ways of indexing.
Speed Factors
Even with an efficient algorithm, a well organized index and high end hardware, it may still take a long time (in computing terms) for data to travel over a network between the search engine and your computer. This may be due to low bandwidth and connection speed on the user's end, congested networks and large latency in general. Latency is the minimum amount of time taken for a single data packet to travel between two locations on a network. Latency generally increases as the geographical distance between the two locations increases. This is natural since more distance means the data packet has to travel through more wire and make stops at more routers along the way. Routers are, simply put, devices that move data between large networks.
One way to overcome the problem of high latency is to set up search engine servers all over the world so that each one can cater to the users in its region. This keeps the search engine site fast loading and quick in response. Google and Yahoo for instance have servers in almost every country and sometimes several servers in different locations of the same country.
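To get a feel for how distance affects latency, here is a rough back-of-the-envelope sketch. The signal speed (about two thirds of the speed of light, a common rule of thumb for optical fibre), the server names and the distances are all assumptions made for illustration; real latency also depends on routing and congestion.

```python
# Assumed signal speed in optical fibre, in kilometres per second.
SPEED_IN_FIBRE_KM_PER_S = 200_000

def min_latency_ms(distance_km, hops=0, per_hop_ms=0.0):
    """Lower bound on one-way latency: travel time plus delay at each router."""
    travel_ms = distance_km * 1000 / SPEED_IN_FIBRE_KM_PER_S
    return travel_ms + hops * per_hop_ms

# Hypothetical distances from a user to three servers of the same company.
SERVERS_KM = {"local": 300, "regional": 4_000, "overseas": 15_000}

latencies = {name: min_latency_ms(km) for name, km in SERVERS_KM.items()}
nearest = min(latencies, key=latencies.get)
```

Even in this idealised model the overseas server is fifty times slower to reach than the local one, which is why it pays to answer each query from a nearby server.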
d) Server
The servers are like mini search engines of their own. They each have the same type of algorithms and spiders installed in them to do indexing of the internet. But they only carry part of the search engine index on their own local hard disks. These are the parts that the server's own spiders gathered when scouring the internet. This means that the search engine's index is stored all over the world in bits and pieces across the company's various servers. But when a person queries the search engine, he queries the entire index. The different servers are connected through private high speed data lines with large bandwidth. All the company's servers are allowed to access the data on any other company server through these data lines.
When a person queries a search engine, the request goes to the nearest server, which then passes the request on to all the other servers. In addition, it also checks its own index for the keyword that was searched. It then collects and compiles all the results together and displays them to the user. This concept is called distributed computing, and the storage behind it is a distributed file system. The actual implementation of this idea may vary a little between companies. For example, Google calls theirs the GFS (Google File System).
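The query flow just described can be sketched as follows. The server names, the pages they hold and both functions are invented for this example; a real system would send the sub-queries over the network in parallel rather than loop over dictionaries.

```python
# Hypothetical servers: each one holds only part of the index (URL -> text).
SERVERS = {
    "asia": {"a.example/shoes": "running shoes on sale"},
    "europe": {"b.example/tanks": "water tanks", "c.example/run": "running tips"},
    "america": {"d.example/news": "shoes in the news"},
}

def search_shard(shard, keyword):
    """Return the URLs in one server's partial index that contain the keyword."""
    return [url for url, text in shard.items() if keyword in text.split()]

def distributed_search(nearest, keyword, servers):
    """The nearest server checks its own part, asks the others, then merges."""
    results = search_shard(servers[nearest], keyword)
    for name, shard in servers.items():
        if name != nearest:
            results.extend(search_shard(shard, keyword))
    return sorted(results)

hits = distributed_search("asia", "shoes", SERVERS)
```

The user only talks to the "asia" server, yet the result list also contains the page held on the "america" server.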
e) Search Engine's Webpage
Some search companies have also taken the approach of keeping the main or first page of the search engine small in size so that it loads quickly. This is done by making the page more text based, with fewer pictures. An example of this approach can be seen in Google, All the Web and Exalead. Yahoo and MSN still have very large opening pages, but that is because they need to offer more than just search; they are online portals. They do however have a small search page for those who just wish to search. This can be found at Yahoo Search and MSN Search.
Lastly, the same search engine page has copies located on all the servers. When a user types the URL of the webpage, the user's IP address is determined to check his geographical location. Even the web page that loads in the browser is the copy from that nearby server. E.g. See IP to Country to map an IP address to a geographical location.
2) How it Works
a) Querying process
The keywords are basically the key terms in whatever is typed into the search engine. Most search engines have an algorithm that 'refines' the entire expression that is typed in to extract just the keywords. The refining process usually removes all the joining words like and, or, to, the, is etc. In Google, this is stated at the top of the page if refining was done. In addition, the refining process also corrects common spelling mistakes.
The keywords are then checked against all the pages in the index. All the pages that have those words appearing in them are then returned as the results. In most cases, singular and plural forms are taken into account by the search engine as well. This means that if you searched for the word 'shoe' but there is a page with only the word 'shoes' in its content, even that page would be returned with the results.
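A minimal sketch of the querying process above is given below. The stop word list, the very crude plural handling in stem and the three sample pages are assumptions made for the example; spelling correction is left out, and real engines use far more careful word normalisation.

```python
STOP_WORDS = {"and", "or", "to", "the", "is", "a", "for", "of"}

def refine(query):
    """Lower-case the query and drop joining words, keeping only keywords."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def stem(word):
    """Crude plural handling (an assumption): strip one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def build_index(pages):
    """Map each stemmed word to the set of page names that contain it."""
    index = {}
    for name, text in pages.items():
        for word in text.lower().split():
            index.setdefault(stem(word), set()).add(name)
    return index

def lookup(query, index):
    """Return the pages that contain every keyword of the refined query."""
    sets = [index.get(stem(w), set()) for w in refine(query)]
    return set.intersection(*sets) if sets else set()

PAGES = {
    "page1": "cheap shoes for running",
    "page2": "how to repair a shoe",
    "page3": "water tanks for sale",
}
index = build_index(PAGES)
matches = lookup("the shoe", index)
```

Searching for 'the shoe' drops 'the' and returns both page1 (which only contains 'shoes') and page2.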
b) Results Display
The returned results are then displayed, often in order of relevance. Relevance is, however, a concept best understood by humans. Many words have multiple meanings when used in different contexts. For example, water tanks, fuel tanks and military tanks are commonly called tanks. The second problem in relevance is deciding in which order or rank to display the pages. In the past, search engines used to list the pages in which the keywords appeared the most times at the top. The natural idea was that if a certain keyword like shoe appeared many times on a page, then that page must contain a lot of data on the subject and hence is most relevant.
This method of ranking also placed the perfect matches at the top and the near matches later. For instance, in the search for 'shoes', the pages with shoes would all be ranked before those that have only shoe in their content.
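The word count method of ranking can be sketched like this. The function name, the simple rule for finding the near match (dropping a trailing 's') and the sample pages are assumptions made for the illustration.

```python
def rank_by_word_count(keyword, pages):
    """Order pages by number of exact matches, then by number of near matches."""
    singular = keyword[:-1] if keyword.endswith("s") else keyword
    scored = []
    for name, text in pages.items():
        words = text.lower().split()
        exact = words.count(keyword)
        near = words.count(singular) if singular != keyword else 0
        if exact or near:
            scored.append((exact, near, name))
    # Most exact matches first; near matches only break ties.
    scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
    return [name for _, _, name in scored]

PAGES = {
    "page1": "shoes shoes and more shoes",
    "page2": "a shoe is a shoe is a shoe is a shoe",
    "page3": "buy shoes here",
    "page4": "water tanks",
}
ranking = rank_by_word_count("shoes", PAGES)
```

Here page2 mentions 'shoe' four times but never 'shoes', so it is ranked after both pages that contain the exact keyword.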
These ideas were used in all the early search engines. But later Google came up with a different concept for relevance. The patented Page rank algorithm used by Google sorts results out in order of popularity.
There are 3 factors for determining a page's position in the results through page rank :
I) Pages with lots of other pages linking to them are given a higher rank.
II) The rank of the pages that are linking to a page is also taken into account. For example, if a page with a high rank links to another page, then it is "vouching" for that page. This means that the linked page will also be placed higher up the ranks, and so on.
III) A counter is present for every page in Google's index that increments every time that link is clicked. This means that the page that most people went to from Google after searching for a certain keyword is possibly given higher ranking because it is more popular.
Page rank worked well because it used human activity to measure relevance. Most people or other sites won't link to a site unless that site is good and has useful content. Page rank assumes that most humans think somewhat alike and are probably looking for the same type of information when they type a certain keyword. The page rank algorithm gets better as more people use the search engine because the popularity of each webpage is determined from a larger sample of users, leading to more accuracy. As time passes, the index also becomes bigger as more pages are indexed, and possibly more sites linking to a certain site can be found.
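Factors I and II above can be illustrated with a small sketch of the link-based calculation. It follows the publicly described form of the algorithm with a damping factor of 0.85; the link graph is made up, the click counter of factor III is not modelled, and how Google actually combines these signals is not known, so treat this as an illustration only.

```python
def page_rank(links, damping=0.85, iterations=100):
    """Repeatedly pass each page's rank along its outgoing links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links shares its rank with every page.
            targets = outgoing or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "hub": ["popular"],
    "fan1": ["popular"],
    "fan2": ["popular", "hub"],
    "popular": ["hub"],
    "lonely": ["popular"],
}
ranks = page_rank(LINKS)
```

In this graph 'popular' ends up on top because most pages link to it (factor I). 'hub' has only two incoming links, but one of them comes from 'popular', which vouches for it and lifts it well above the pages that nobody links to (factor II).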
3) Limitations and Weakness of Search Engines
a) No Coverage of the Deep Web
Since the spiders visit every webpage through the pages that link to it, what about the pages that no other page links to? In reality there are thousands of such pages on the internet and they all contain information that no search engine can gather. Of course, as a webpage becomes popular, more pages will obviously link to it, but till then there is no way to locate it. There is an option to get a site indexed by submitting its URL manually to the search engine. However, many site owners don't bother to do this because they think their site will be found by the spider eventually.
There are a whole lot of pages that are dynamically generated in real time only when they are requested. For example, consider an NIH database on diseases. The information on the diseases is stored as plain text in a database. The template for the webpage and the related images are also stored separately. When you use the NIH search page, you query their database. The related images, text and other data on the subject are then found and inserted into the template to create the webpage for displaying. Hence there is no direct link from anywhere else to this newly created page. It cannot be indexed. Such pages are called invisible pages and along with the un-indexable pages they form the 'invisible' internet.
Dynamic pages are created purely for viewing and will be destroyed when the user leaves the site. The only exception is caching, in which the most popular pages are pre-fetched and stored to avoid too many similar queries to the database. Even so, the cache is not indexable by spiders. The advantage of dynamic pages is that they use less hard disk space and allow greater control over the appearance of the site. A person could change the appearance of all the pages on the website by simply changing the template.
As sites on the internet grow in size, they start to switch to dynamic technology to make further development of the site easier. At the same time, the switch also makes the invisible internet grow. Since the invisible internet cannot be indexed, it cannot be found by search engines. It is estimated that what search engines currently index is only 16% of the internet (study by Lawrence and Giles, 2000). Soon, even that figure might decrease. Some argue that the corresponding growth in the number of normal indexable web pages every day will make up for the loss of search results through the invisible internet. But the growth in normal static web pages comes from blogs and personal websites. This can hardly be considered an equivalent replacement for data from organizations like NIH or PubMed.
b) Inability to Access Private or Restricted sites
There are several sites on the internet that have restricted access. An example is Experts-Exchange which requires you to buy an annual subscription to access their database. This can also be described as being part of the invisible internet. However, there are some special search engines that can let you access even this content. An example is the Dow Jones owned Factiva database.
Then there are some sites that simply cannot be accessed. These are most often company intranets, although employees of the company can also reach them through the internet. The only way to access this information is to be an employee of the company. Since such information is often company specific, it does not matter to the typical internet researcher. Another example is specific sections of defense related sites, mostly ending with the suffix .mil. The existence of these is often less known due to their high security, but it is possible that any user might accidentally stumble upon them. He will however not be allowed to proceed further without proper authentication.
c) Problem with Pagerank - Bombing
The problem with the page rank idea is that the most popular meaning of a word may not be what a user is looking for all the time. For example, during a well publicized war like the Gulf War, most users may have gone to the pages on military tanks, hence causing a hard time for a person trying to find a decent site that sells water tanks.
The page rank idea can also be easily abused. People could encourage others to place links to a certain other page to push up its popularity. A group of users could constantly keep searching for a keyword and then clicking a certain page on the results, just to push up its popularity. While it seems unlikely that people would waste time just to misuse the page rank idea and make it lead to an irrelevant site, it has happened quite often in recent history.
The term 'Google Bombing' was used to describe the action of purposely trying to raise the popularity of a site to make it rank higher on Google. Many people, especially those with commercial sites have engaged in Google Bombing to make their site rank higher.
This becomes an even bigger problem if the search engine becomes popular because then it will cause more people to click on the sites at the top of the results, which are bombed. Each click raises the popularity of the site, leading to further unintended bombing. 75% of the visitors to nearly any web site find it through a search engine, mostly Google. For commercial sites that sell products, this means that Google could give them more sales if their result appeared at the top. As a result, many commercial sites initially engaged in Google Bombing.
Pro Googlers initially took the stand that this wouldn't distort their results much because the sample of users was so large that a few individuals would find it exceedingly difficult to distort the results on their own. Later on, a cookie was implemented to determine whether a user searching for a certain keyword had searched for the same keyword earlier. If he had, then any number of clicks on a certain link would not alter its click counter. The cookie was used to determine whether the query was coming from the same computer. This measure somewhat curbed Google bombing, but it was still possible if a person decided to start a large movement and encouraged everyone to click a certain site. This too happened in reality with the keyword 'WMD', which led to a page of jokes about the US invasion of Iraq instead of information on Weapons of Mass Destruction. Other hacked keywords included 'unelectable' and 'failure', both of which led to the White House page about George Bush.
Some Googlers stated that these two cases were only because those sites were actually popular and people did actually read the information on them rather than just click to improve page ranking by linking to them. However, this only serves to show that popularity is not the absolute measure of relevance. Despite its loopholes, the idea of determining relevancy by popularity has gained widespread usage in industry. Soon after, most search engines, including MSN and Yahoo used some kind of system similar to Page rank to determine popularity.
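The cookie check mentioned earlier in this section, and the reason it cannot stop an organised group, can be sketched as below. This follows the scheme as described in this article, not any known Google implementation; the ClickCounter class and the cookie identifiers are hypothetical.

```python
class ClickCounter:
    """Counts result clicks, at most once per (cookie, keyword) pair."""

    def __init__(self):
        self.clicks = {}   # (keyword, url) -> number of counted clicks
        self.seen = set()  # (cookie_id, keyword) pairs already counted

    def record(self, cookie_id, keyword, url):
        """Count the click unless this cookie already searched this keyword."""
        if (cookie_id, keyword) in self.seen:
            return False
        self.seen.add((cookie_id, keyword))
        self.clicks[(keyword, url)] = self.clicks.get((keyword, url), 0) + 1
        return True

counter = ClickCounter()
for _ in range(100):                      # one user clicking repeatedly
    counter.record("cookie-1", "tanks", "a.example")
for cookie in ("cookie-2", "cookie-3"):   # an organised group of users
    counter.record(cookie, "tanks", "a.example")
total = counter.clicks[("tanks", "a.example")]
```

The hundred clicks from one computer count only once, but every extra participant still adds a click, so a large enough movement can push a page up regardless.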
Note: Google has recently implemented another system called TrustRank; where a small seed of sites are verified by a person and used as a starting point for indexing further portions of the web. Sites directly connected from the starting site are placed higher up on results. Google says that this will help reduce search engine spam. Even with a small initial seed of only around 200 sites, result sorting is greatly improved.
d) Problem with Relevancy by Word Count
This is the alternative idea for ranking pages based on how many times that word appeared in the page. Some small site level search engines used this method. However just like Google Bombers, there are a lot of jokers on the internet who will create a webpage displaying the same word on it a million times. Such pages have no real information but will rank highly in this method of ranking.
This system can in fact be more easily hacked than page rank because it takes just 'Copy and paste' to type the same word again. Such dummy pages could be created in a few minutes and displayed since the cost of publishing on the internet is free in the current era of Blogging. As a result, page rank is probably a lot better. Relevancy through word count is not used anymore.
e) Sponsored Links
By now, it is clear that running a search engine company is no cheap task. There are high fixed operating costs like electricity for servers and machines and paying the staff. The variable costs include leasing high speed lines and paying for added bandwidth. Search engine companies realized quickly that users rely on them to find information and even shop. A recent study showed that more than 70% of medium to high income workers prefer to shop online. Shopping online saves time and money since a person would otherwise have to pay for fuel or transportation to get to the shop. In the case of standard products like stationery or basic supplies, most shoppers don't care to examine the product in detail before buying since they usually work all right. Most e-shops like Amazon give an assurance that the product will be delivered undamaged or the money will be refunded. In many special offers, even shipping is free beyond a certain amount of purchase. Online shops are generally cheaper to operate since there is no need to pay salesmen, operate a store, keep a checkout counter etc. This large saving is passed on to consumers in the form of lower prices.
i) E-Advertising
These e-shops are found, like any other site, through search engines. Here, the search companies decided to take a share of the profit. They started leasing keywords to the e-Shops on an annual basis. For example, if Nike wanted their site to be ranked first whenever a person searched for shoes, they would simply rent the keyword 'shoe' by paying Google the annual lease fee. For popular keywords, many bidders would be competing to get the keyword. The keyword is generally leased to the highest bidder. As one may guess, very popular keywords may fetch high amounts for the search company. When a site pays to get its ranking higher up, it is called a sponsored link. Also see A Look at E-Advertising.
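The leasing idea can be sketched as a simple highest-bid-wins rule. The advertisers, amounts and function names below are invented for the example, and real keyword auctions are considerably more elaborate than this.

```python
def lease_keywords(bids):
    """Given (keyword, advertiser, amount) bids, lease each keyword to the top bid."""
    best = {}
    for keyword, advertiser, amount in bids:
        if keyword not in best or amount > best[keyword][1]:
            best[keyword] = (advertiser, amount)
    return {keyword: winner for keyword, (winner, _) in best.items()}

def build_results(keyword, organic, leases):
    """Place the sponsored link first, labelled, ahead of the ordinary results."""
    results = []
    if keyword in leases:
        results.append(("sponsored", leases[keyword]))
    results.extend(("organic", url) for url in organic)
    return results

# Hypothetical bids for two keywords.
BIDS = [
    ("shoe", "shop-a.example", 5000),
    ("shoe", "shop-b.example", 7500),
    ("tank", "shop-c.example", 1200),
]
leases = lease_keywords(BIDS)
page = build_results("shoe", ["reviews.example", "shop-a.example"], leases)
```

The shop with the highest bid for 'shoe' is listed first, above the results that earned their place through ranking alone.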
By pushing up a site to the top of the rankings, the idea of relevancy becomes distorted once again. As of 2003, a mandatory law passed by the US Government required all search companies to clearly distinguish the sponsored results from the real results. To comply, most search companies simply put a small icon next to the sponsored results to distinguish them from the real results. Most users however don't notice these icons or small print. Google and Exalead are two major search engines that display the search results in two columns with the real results on one side and the sponsored links on another, making the difference clear.
Note: You can see a real time graph of bids placed on various keywords by advertisers here at Yahoo Research.
In view of online shopping, many search companies have commented that the sponsored links are actually useful since many users are actually looking for e-shops through the search engines. However, this fails to address the problem of those who are looking purely for research data and facts. There is also the problem of determining to what extent a site is informative or commercial. Many informative sites also sell products or write reviews for specific brands and manufacturers. One example is How Stuff Works. Some sites could be accused of promoting a product, even if the site in reality provides data about troubleshooting that product. Examples are the many sites that provide information about making modifications to cell phones of particular brands.
Note: Also see Yahoo Mindset under experimental search engines at the end of this document.
To date, there is no major unbiased, sponsor-less, free to use search engine that scours the web. Every well known search engine company is a commercial venture. See the Non Commercial search engines and Notable exceptions below for some clearly unbiased search engines.
ii) Exceptions in Commercial Search
The engines below are commercial ventures, but not backed by a large corporation. This alone does not make them better, but at least the results won't favor any one company in particular, unlike MSN Search which is known to give better placement to Microsoft products and services. Even Yahoo always shows result links to Yahoo services at the top.
Gigablast can deliver results nearly as relevant as those of Yahoo or Google. Gigablast itself is more of a one man operation at the moment. The Gigablast search software requires much less hardware to run effectively and can be purchased for use on corporate servers.
Mozdex uses the open source Nutch search software. Htdig is another example of a well known search program. Such search programs are often used to power a custom search engine for small sites. Mozdex is an attempt to extend this idea to make a World Wide Web search. Since the underlying technology remains open, people know that it will always be a bias free search engine.
iii) Non-Commercial Engines
This is based on the idea of a distributed search engine running off the processing power and bandwidth of its users' computers. This is how the search features on the decentralized file sharing programs (e.g. Gnutella) work. Grub was an unsuccessful attempt by Looksmart to start a distributed search engine that would run off the power of normal PCs. The Grub project has gone offline as of late. Fortunately, others have not given up. Yacy is an open source P2P based web search engine. Since it is owned by no one, it is free from advertisements or bias in result ranking. The reliability is also higher as there are literally thousands of servers (every computer running the software) rather than a select few as with a commercial search engine. Majestic-12 is another example of a distributed search engine. You can try the Majestic-12 search here.
f) Old, Out of Date Index
Since spiders visit a site only once in a while, if any changes in content are made in between, that change will not be immediately reflected in the search engine's index. Thus, when searching through a search engine, you are not searching the latest 'copy' of the internet. This is more of a reality that should just be accepted, rather than a problem. Pages across the web are constantly being updated with fresh content. It would be impossible to ever conduct a search on the latest 'copy' of the internet. In fact, by the time you finish reading this line, a few thousand pages will probably have published new content. Thus, when a person uses a search engine, he is conducting a retrospective search, meaning he is searching the past.
When this site was published, it contained about fifteen articles. MSN indexed the site within five days of my submission, Yahoo did it within ten days while Google took over two weeks.
Note: There is one site that appears to offer a solution to this. PubSub offers a prospective search. Instead of an index of web pages, PubSub keeps an index of search terms used by other surfers on their engine. Thus, it can return better sorted results on hot news items. PubSub thus tracks new news as it appears on the web. MSN Search also offers an option to sort results based on more recently updated sites being ranked higher.
g) Non-Uniform Spider Distribution and Coverage
Another issue arises because search companies are biased in their distribution of spiders across the web. They might send more spiders to refresh the index of sponsored pages and popular sites more often and cause the spiders to visit the other pages less frequently. Thus, some portions of the index will be distinctly newer than all the rest. Search companies state that they do this because some sites are updated more frequently as they are news or service related, unlike personal homepages which are updated once in a while. While this may be generally true, there are some who update homepages regularly. In addition, human behavior is unpredictable, and during certain massive events, people may suddenly start to update their sites. In particular, the few days following the September 11th attacks on the Twin Towers saw a sharp increase in the amount of blogging as many individuals started expressing their own views openly on the matter. Those new blogs took a while before they became listed on search engines.
This is also to be accepted. Any company will try to maximize efficiency by allocating the limited resources, in this case, spiders, in such a way as to cover most of the web in real time.
h) Limited Search through Index
Despite amassing a huge index, the algorithm never queries the entire database to generate the results. To save time, the querying is simply done for a few milliseconds and if a sufficient number of results are gathered, these are then returned. If you notice, some search engines state clearly that they obtained "...about n results", where n is some number. For an example, see the top right side column of Google's returned results on any subject.
As a result, the obtained results are never a comprehensive collection of all the pages containing that information. However, the algorithm first ranks the pages in the database in descending order of popularity. The query proceeds in the same way, top down. Thus, the most popular pages with that keyword will almost certainly be returned. This method works fine because in general most people don't go beyond the first few pages to find the data that they are looking for.
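The top down, cut short query described above can be sketched as follows. To keep the example repeatable it stops after a fixed number of results rather than after a few milliseconds; the function name and the sample pages are assumptions.

```python
def limited_search(keyword, pages_by_popularity, wanted=2):
    """Scan pages from most to least popular and stop once enough match."""
    results = []
    scanned = 0
    for url, text in pages_by_popularity:
        scanned += 1
        if keyword in text.split():
            results.append(url)
            if len(results) == wanted:
                break  # enough results gathered; the rest is never examined
    return results, scanned

# Hypothetical index, already sorted in descending order of popularity.
PAGES = [
    ("p1.example", "shoes shop"),
    ("p2.example", "water tanks"),
    ("p3.example", "shoes review"),
    ("p4.example", "shoes blog"),
    ("p5.example", "old shoes"),
]
results, scanned = limited_search("shoes", PAGES)
```

Only the first three pages are examined. The two less popular pages that also mention 'shoes' are never reached, which is why the result list is not comprehensive but does contain the most popular matches.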
i) Restricted Display of Results
Some of the results returned by the search engine will be omitted from being displayed because of legal action. The best example is in the file sharing market. When Kazaa started inserting spyware into their applications, some users decided to hack the application and remove the spyware. They distributed their 'clean' versions mainly at http://www.kazaalite.tk . Ridiculous as it may seem, Kazaa managed to get a court order to make all major search engines remove that site and others hosting the 'clean' version from the results. See this notice for more details. Also see Digital Millennium Copyright Act. In Google, you can find a small footer message that explains this if any of the results have been prevented from being displayed.
The problem is that the website of an application, even if hacked, is a source of information. A researcher trying to gather data on the subject of file sharing will have a hard time locating these 'banned' sites that actually contain more information and less advertising.
Many search engines have implemented a program to clean the results before they are displayed. The cleaning program removes sites with sexually explicit content or racist propaganda. While most users may not be seeking to find such a site, consider the case of the Abu Ghraib Camp in Iraq where prisoners were tortured. A person seeking information to write an article about this would never find some of the sites due to the results cleaning. A search for Abu Ghraib on Google Images omits all the sites with explicit pictures. The cleaning program has its own proprietary names. Google calls theirs Safe Search.
The idea of cleaning the results was brought about to protect children from the content on the internet. It is possible in most search engines to disable this option under the site's settings, but some users do not know of this. Some search engines like Google display a footer at the bottom of the page that states whether Safe Search was used in the results.
4) Advantages of Search Engines
The strengths are obvious. They are the best way to generate some content on a subject, especially when all the researcher is armed with is a keyword and no list of any specialized sites on the subject. Search engines work best when what you are seeking is popular information like the reviews of the latest novels or movies. The development in search engines has reduced the quantity of bookmarks in the average web surfer's list since it's now possible to retrieve common information on demand rather than bookmark it. In short, they are a definite instrument in every online researcher's arsenal and are here to stay.
5) Trouble Shooting your Search
When we seek to conduct internet research, we must first start by narrowing down our area of search. The best way is to write on a piece of paper the exact question as it appears in our mind. The question may be vague but it should contain all the information about what we seek.
a) Designing a query string
E.g. - A person seeks to find information about a green fruit that is about the size of a fist. Then, he formulates the following idea:
"How can I find information about a Green fist sized fruit?"
The next step is to isolate the keywords in the sentence: "Green, Fruit, Fist-Sized"
These words are then entered into the search engine. Since not all the keywords may be commonly found on web pages, it may be necessary to replace some of them with synonyms. Fist-Sized, for instance, could be replaced by the terms "Small" or "Palm-Sized" instead.
Note: Ask Jeeves is a search engine that allows the user to enter the entire question as it is. The search engine does the task of filtering the keywords. Most search engines like Yahoo and Google also allow this now. There is also a separate set of engines called answer and natural language engines that attempt to directly answer the question asked, if possible. Their algorithms are quite different from a normal search engine. See section 6)d) later in this document.
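The keyword isolation step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Ask Jeeves or any other engine actually filters a question; the stop-word list and the function name extract_keywords are made up for this example.

```python
import re

# Hypothetical stop-word list for illustration; real engines use far larger ones.
STOP_WORDS = {"how", "can", "i", "find", "information", "about",
              "a", "an", "the", "of", "is", "to"}

def extract_keywords(question):
    """Return the words of the question that are not stop words, in order."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [w for w in words if w not in STOP_WORDS]

question = "How can I find information about a Green fist sized fruit?"
print(extract_keywords(question))  # ['green', 'fist', 'sized', 'fruit']
```

What remains after filtering is the same set of keywords we isolated by hand: Green, Fist-Sized and Fruit.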
The search engine then returns all the pages with the words Green, Fruit and Fist-Sized in them. However, the problem remains that these words may not appear together in the same sentence. Consider the following:
"Heat until the soup becomes green. Then take a fist-sized measure of salt and add it to the cooking pot. The finished dish is best served with fruit and salad."
This excerpt contains all the keywords that we entered, but certainly does not contain the information that we were looking for. This problem is commonly encountered. It lies in the fact that when we type the expression into the search engine as
Green Fist-Sized Fruit
We are actually entering the expression "Green" AND "Fist-Sized" AND "Fruit"
The search engine starts to search for the individual terms rather than the whole expression "Green Fist-Sized Fruit"
To get around this, we need to write the expression as "Green Fist-Sized Fruit", including the inverted commas. Then only pages that contain the entire expression, with the words together and in that order, will be returned. Also see Google Cheatsheet and Google Operators for some information on Boolean operators in search.
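The difference between the two kinds of query can be shown with a small Python sketch. It is a simplification written for this paper, not the matching code of any real search engine: the function names are hypothetical, and the second sample page is a made-up sentence.

```python
import re

def tokenize(text):
    """Split text into lowercase words, treating punctuation and hyphens as separators."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all_terms(text, query):
    """AND search: every query word appears somewhere in the text."""
    words = set(tokenize(text))
    return all(term in words for term in tokenize(query))

def matches_phrase(text, query):
    """Phrase search: the query words appear next to each other, in order."""
    words = tokenize(text)
    phrase = tokenize(query)
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))

recipe = ("Heat until the soup becomes green. Then take a fist-sized measure "
          "of salt and add it to the cooking pot. The finished dish is best "
          "served with fruit and salad.")
fruit_page = "The guava is a green fist-sized fruit."  # made-up sample page
query = "Green Fist-Sized Fruit"

print(matches_all_terms(recipe, query))   # True
print(matches_phrase(recipe, query))      # False
print(matches_phrase(fruit_page, query))  # True
```

The recipe passes the AND search because each word occurs somewhere in it, but only the second page passes the phrase search.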
b) Acronyms and Abbreviations
Many people looking for information on UFOs may end up at the site of the popular band, UFO. The problem is that many terms have more than one meaning. A good way to get around this problem is to use the full term rather than the acronym. For example, search for the expression "Unidentified Flying Object" rather than simply UFO, if you are not getting the search results that you want.
6) List of Search Engines
a) General Search Engines
Different search engines have different indexes. See Yahoo versus Google results for comparisons on a single keyword. A study by Dogpile shows less than 3% overlap between the results of Google, Yahoo, and Ask Jeeves; 85% of the results were unique to one of the three engines. So, use a variety of search engines. Meta search comes in handy here; it can run a query across multiple engines at once and combine all the results into a single page for display. See the section on meta search below. Noodletools has a quick reference to choosing the best search engine for your needs.
Google has a large index and a good ranking algorithm that can find most of the information. It also supports Boolean searching under advanced search options and allows for searching within results etc. It also tries to cover the invisible internet as far as possible by simultaneously searching the same keyword through both USENET (via Google Groups) and the Open Directory Project. In fact, the Google directory is identical to the Open Directory Project, but with the results ranked in order of PageRank. Google Guide has a good list of tips on using Google.
At one time, Yahoo was powered by Google, but now it has its own index after acquiring several other smaller search companies like Inktomi and All The Web. The top results between Yahoo and Google are not always the same, despite popular belief. Running a search by both the engines always helps. Yahoo also owns Altavista.
Gigablast is relatively new. Gigablast guesses related keywords from a search query. This technology is called Giga Bits. Using this, one can enter a query like 'Who is the president of the USA?' and the Giga Bit with the highest percentage, standing for the highest relevance, is President Bush. This technology works well in many cases.
Owned by Ask Jeeves. Teoma uses a link popularity algorithm. Unlike Google's PageRank, Teoma's technology (Subject-Specific Popularity) analyzes links in context to rank a web page's importance within its specific subject. For instance, a web page about 'baseball' will rank higher if other web pages about 'baseball' link to it. It uses subject-specific link popularity to compute "authoritativeness" of a search result. The Teoma technology also incorporates patented click popularity techniques, originally from DirectHit search engine, which Ask Jeeves acquired in 2000.
Not quite a search engine, but handy when trying to find information on recent topics.
Quote: "Traditional search stores data and then allows you to find documents within that store of data. PubSub operates by first storing your subscription query, and then watching for new information that matches it. Your query will be checked against every piece of new information passing through our Matching Engine."
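The idea in the quote, storing the query first and then checking every new item against it, can be sketched as follows. This is a toy illustration written for this paper and not PubSub's actual Matching Engine; the class, its method names and the subscriber names are all hypothetical.

```python
class MatchingEngine:
    """Toy prospective search: queries are stored, new items are matched against them."""

    def __init__(self):
        self.subscriptions = {}  # subscriber name -> set of required words

    def subscribe(self, subscriber, query):
        """Store the subscriber's query as a set of lowercase words."""
        self.subscriptions[subscriber] = set(query.lower().split())

    def publish(self, item):
        """Return the subscribers whose query words all appear in the new item."""
        words = set(item.lower().split())
        return sorted(name for name, terms in self.subscriptions.items()
                      if terms <= words)

engine = MatchingEngine()
engine.subscribe("alice", "search engine")
engine.subscribe("bob", "baseball")
print(engine.publish("New search engine launched today"))  # ['alice']
```

Note that nothing is searched until a new item arrives, which is the reverse of the traditional approach where the documents are stored and the query arrives later.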
b) Multi and Meta Engines
Dogpile is a metasearch engine that "fetches" results from About, Ask Jeeves, FindWhat, Google, LookSmart, MSN, Teoma, Yahoo! and several other popular search engines. Dogpile is owned by Infospace, which also owns Metacrawler, Webcrawler and Excite.
The popular AskJeeves and Search are also meta search engines. Several more meta search engines can be found under the Clustering Search Engines section below.
Multi search engines combine results from other search engines (similar to meta search), databases like PubMed and Nationmaster, public directories and various other sources. Multi search engines can be thought of as the ultimate WWW level search, since they extend their queries to far more places than most normal or meta search engines.
IXQuick searches using five major engines. Its most attractive feature is the sorting and navigation of results. Every result is given a rating of 1 to 5 stars based on how many search engines placed it within their top 10 results. If all 5 engines gave a certain result a top 10 place, then that site gets 5 stars. It is also possible to further filter the results by clicking the tick or cross next to each result. The tick will refresh the results to only include other sites of the same topic or cluster group. A cross will refresh the results by removing topics like it. This also provides an excellent way for IXQuick to get statistics on which results customers best associate with certain keywords. IXQuick also has an international phone directory and shopping price search (similar to Froogle).
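The star rating described above amounts to counting, for each result, how many engines placed it in their top 10. A minimal sketch of that counting is given below. It is a guess at the general idea and not IXQuick's actual code; the engine names and URLs are placeholders.

```python
def star_ratings(top_results_by_engine):
    """Give each URL one star per engine that lists it in its top 10 results."""
    stars = {}
    for results in top_results_by_engine.values():
        for url in set(results[:10]):  # set() so an engine cannot vote twice
            stars[url] = stars.get(url, 0) + 1
    # Most stars first; ties are broken alphabetically by URL.
    return sorted(stars.items(), key=lambda pair: (-pair[1], pair[0]))

# Placeholder engines and URLs, for illustration only.
tops = {
    "engine_a": ["a.example", "b.example"],
    "engine_b": ["b.example", "c.example"],
    "engine_c": ["b.example", "a.example"],
}
print(star_ratings(tops))
# [('b.example', 3), ('a.example', 2), ('c.example', 1)]
```

With five engines the highest possible score is five stars, which matches the rating scale described above.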
A meta search and clustering engine. Allows results to be sorted in clusters, separately based on different search engines and blended. When blended, all the results are displayed together on a single page, but again ranked by popularity based on QkSearch algorithm. QkSearch has the best search interface of all, even better than Google or Yahoo. All the search options are easily available on a single screen. Also allows searching by geographical location, file type and sources.
An easy to use, complete multi search tool that can narrow searching to specific categories. Multiz and Muse Seek are some other multi search tools available.
All multi search engines have the neat feature of being able to look through specific topics or file types like audio, video etc. Multi search is useful and will probably be implemented in many other major search engines at some time, if not done already.
c) Clustering Engines
Clustering allows results to be sorted in groups according to categories and sub categories. Most clustering search engines also do meta search. While the taxonomy still leaves much to be desired (cross categorization is still unrefined), clustering presents the results in a much better way than standard search results. Since the clustering is done by computers, there is often too much categorization. On the other hand, while looking through the results, one might discover related categories of interest. Clustering is currently a buzzword in search and clearly the next big investment by search companies.
Owned by Vivisimo. Under advanced options, you can set the search engines to include. Probably the best basic clustering search engine for the present, based on the relevancy of the returned results.
The site offers a graphic display of the taxonomy using Flash. There is also a normal text based cluster display, but the graphic display is where the strength of Kartoo lies. The graph is generated in real time and displays links between sites as well. Thus one can not only see results by categories but also see how the sites connect to each other. How this information will benefit normal end users is not clear, but those who study data mining and methods of data retrieval and display will find this interesting.
Another clustering engine with a visual display of results as a graph of interconnected nodes rather than just by category. This one does not need flash to view though. Between this and Kartoo, Kartoo has better appearance since flash can produce nice animation.
Exalead, a company based in France. Most notable feature is the proximity search which allows results to be sorted in order of distance between words. Their engine also has a phonetics search, localized search, and clustering.
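Sorting by distance between words can be illustrated with a short Python sketch. How Exalead actually implements proximity search is not known to me, so this is only an assumption about the general idea: the distance is taken as the smallest number of word positions between the two search terms, and the function names are made up.

```python
import re

def min_distance(text, word_a, word_b):
    """Smallest gap, in word positions, between word_a and word_b (None if either is missing)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a]
    pos_b = [i for i, w in enumerate(words) if w == word_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

def rank_by_proximity(pages, word_a, word_b):
    """Keep pages containing both words, with the closest pair of words first."""
    scored = [(min_distance(page, word_a, word_b), page) for page in pages]
    matching = [item for item in scored if item[0] is not None]
    return [page for _, page in sorted(matching, key=lambda item: item[0])]

pages = [
    "Green soup served with fresh fruit",
    "A green fruit salad",
    "Fruit only",
]
print(rank_by_proximity(pages, "green", "fruit"))
# ['A green fruit salad', 'Green soup served with fresh fruit']
```

The page where the two words sit side by side is ranked first, and the page that lacks one of the words is dropped altogether.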
d) Natural Language & Answer Based Engines
These are search engines that attempt to provide direct answers and straightforward information to the question in addition to pointing to a list of links as search results. While Ask Jeeves also appears to do this, it is really just a meta search engine.
One of the best answer engines around. Provides answers along with the source web page.
Newer than brainboost, but promising. Also clusters results.
Other sites include Answers Corporation, which gives direct definitions as answers in addition to links to other sites which have relevant information, Wikipedia almost always being among the results.
e) Specialized Search Engines
These are search engines that are specialized at searching a particular type of document or site (e.g. news sites, blogs or FTP sites etc.)
This is a standard site to search for definitions in several dictionaries. It comes with a useful Reverse Search tool to allow you to find a word after supplying a description of it in a sentence or phrase. Likewise, Symbols is a search engine for all kinds of symbols like highway, street signs and more.
This is a search engine for source code. Probably useful for teachers who need to check whether their students cheated in an assignment by copying code off some existing open source project.
As the name states, this site is aimed at finding articles, academic or otherwise from a variety of sources.
Search engine that scours all well known wikis for information.
Many more specialized search engines can be found here.
f) Experimental Search Engines
The below search engines are still being tested, but are usable. They represent new methods of retrieving and viewing information.
Yahoo has introduced Mindset, a new way of sorting the search results according to whether they are more informative or more commercial. The order of the results changes in real-time as the bar is moved between informative and commercial.
Yahoo has taken a new approach to contextual searching. They have taken the idea of typing a question like Ask Jeeves, but improved upon the idea further. Yahoo Q is a new search engine that allows a person to conduct a search even if he cannot clearly specify the question. The Y Q search box allows any text, ranging from a few words to an entire page to be entered into it. Y Q then determines the appropriate context from the given text and keywords. This search engine is still under testing and development, but has already been released for the public to try out.
Yahoo's answer to making access to paid content easier. Allows searching through major paid databases like Lexis Nexis and Factiva for free. However, one needs to have a subscription with that site in order to view the full article. Google Print does a similar thing, allowing searching of printed material (including copyrighted books) for data, but only the first page of a copyrighted book will be displayed, with a link to purchase the book.
More Yahoo inventions and experiments found at: Yahoo Next.
A search engine that can suggest various combinations of queries as you type inside the box. Don't really know the purpose, since most people know what they want to search for. I suppose it can help if you only know part of a sentence or word and need to search for that content.
Allows for searching of printed mail order catalogs. Even in the age of digital advertising, mail order catalogs are still used widely to promote products.
More Google inventions and experiments can be found at Google Labs.
7) Bibliography
For a more detailed list of search engines, see http://en.wikipedia.org/wiki/List_of_search_engines
'Indexed doesn't mean you'll find it' http://www.workingfaster.com/sitelines/archives/2004_05.html#000205
Search Engine Watch keeps an eye on new search technologies on the internet.


