A Look at Searching
icekin — Tue, 2006/03/07 - 12:20
Contents
1. Introduction
a) The internet
b) The problem
2. Information - a broad idea
3. Finding and retrieving data
a) Search Engines
- Appendix: Study of Search Engines
b) Web Directories
i) Directory Links
c) Starter Pages
i) Creating Starter Pages
ii) Starter Pages Links
d) Site Maps
e) Databases
i) Database Links
- Appendix: A Look into Online Information Databases
f) Message Boards
- Appendix: Guide to Using Forums
g) Mailing Lists
h) Newsletters
i) Newsletters & Spam
i) IRCs & MUDs (including MOOs)
j) Gopher
k) FTP
l) File Sharing Networks
m) Email
n) Other Sources
i) Wikis
ii) Blogs
iii) Internet Archives
o) Non-Conventional Networks
i) Social Bookmarking
ii) Public Conversations
iii) Webrings
iv) Groups and Communities
v) Social Networks
4. Future Trends
5. Evaluating a Source of Information
a) Age
b) Level of Acceptance
c) Referrals
d) Quality
e) Credentials
6. Summary
7. Bibliography
1) Introduction
a) The Internet
The internet, especially since its widespread implementation as the World Wide Web (WWW) in the 90s, has served two major roles - as a means of communication and as a source of information. With the advent of IRC, email and messengers, the internet's use as a communication medium is well known and documented. The widespread growth of the internet over the last decade has been primarily due to its open nature. This means that anyone can add anything they want to the internet. Even if web hosting companies disapprove of the content, a person can always start his own web server for hosting the content with any basic computer and an internet connection.
Due to this, the amount of information on the internet has grown rapidly. This makes the internet a powerful source of information since it currently contains more information than all the libraries of the world put together. It may also be possible to find some data only on the internet since not everyone may have the money to get their ideas and writings published on paper.
b) The Problem
The problem associated with this explosive growth is a rapid decline in the quality of content. A good number of websites on the internet currently consist of blogs, which are ready-to-publish journals in which anyone can publish their ideas. While some of the information in a blog or website may be true, there is no hard and fast way to verify it. I have written this guide to give people the skills needed to sort through the vast quantity of information on the internet and find exactly what they are seeking. I have also described how to evaluate a source of information to determine its reliability and how to cite internet references.
2) Information - a broad idea
The word information in the context of computing refers to data. Data can come in several forms ranging from pure text (e.g. a .txt file), a text and picture document (e.g. a Word Processor Document or webpage), text and animated picture (e.g. video files, animation extensions like .swf), music (e.g. mp3, ogg, wma, wav etc.), archived files (.zip, .7z, etc) and so on. We will now examine various ways in which information can be retrieved from the internet.
3) Ways to Find and Retrieve Information
a) Search Engines
These present a quick way to sort through a large mass of information. In recent times their popularity has soared, but they still fall well short of perfection and accuracy, especially on specialized topics. The one obvious weakness of search engines is that they cannot access the invisible internet effectively. This refers to pages on the internet that cannot be indexed because no other pages link to them. Pages that are stored in a database and are only pulled out through search queries are typical examples.
See A Look at Search Engines for a more detailed explanation about the invisible internet, how search engines work, their strengths, weaknesses and trends.
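To make the idea of the invisible internet concrete, here is a minimal sketch in Python of a spider that can only follow links. The link graph and page names are made up for illustration, and a real spider is far more elaborate. Any page that no other page links to is never reached, and so never indexed.

```python
from collections import deque

def crawl(links, seeds):
    """Return the set of pages a link-following spider can reach from the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "home": ["about", "catalog"],
    "about": ["home"],
    "catalog": ["home"],
    # This page is only generated by a search query; nothing links to it.
    "catalog?item=42": ["home"],
}

indexed = crawl(links, ["home"])
invisible = sorted(set(links) - indexed)
print("Indexed:", sorted(indexed))
print("Invisible:", invisible)
```

The database-generated page exists on the site, but because the spider starts from "home" and only follows links, it ends up in the invisible list.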
b) Web Directories - Products of Human effort
Unlike a spider, which simply indexes every site that it comes across on the internet, web directories are constructed by humans. This means that every site is first visited, checked carefully for acceptable content and then added to the directory. The people involved in building these directories are usually volunteers who do so purely out of interest and because they have some useful links to contribute. An example is the Open Directory Project. Some directories rely only on professionals to build up the directory. An example is the Librarian's Index to the Internet, a project run by the state of California. The government asks librarians who have knowledge of certain subjects to gather useful sites on those subjects and insert them into the index. In a directory, everything is displayed in a structured format. The links are organized by subject and subcategory. Since every link that is added undergoes extensive checks from other members and higher level editors, the final product comprises only quality data. Directories try to keep out commercial sites like those that appear under 'sponsored links', unless those sites also happen to contain information.
As one might guess, a directory has far fewer links on a subject than the results generated by a search engine. The Open Directory, for instance, has a few hundred thousand links. A search engine would probably gather that many results on just one keyword alone. The advantage of a directory is that the few links that are present on a subject will usually contain all the information you will need for your research since they are quality sites.
A directory is the place to be when you have a specific but well understood subject related keyword like Aeronautics or Protocols. Both words are specific to a certain degree, but are related to specific subjects. If the keyword is too specific like 'Gram Staining Procedures' or 'Thermal Coefficient of Volumetric Expansion' then a search engine is better. But, a directory would be able to yield links to the general subjects, in this case Cell Biology and Thermal Physics. The user could then search those sites for Gram Staining or Volumetric Expansion and probably get good results.
If the subject is too general, a Web Directory would also probably be useless. For example, a search for 'Influence of 15th to 18th century music on today's artists' would yield almost nothing out of a web directory.
Web directories grow at a snail's pace compared to Search engines since every page must be reviewed. Thus, the oldest Web Directories are often the best. Web Directories have the ability to cover most of the Invisible Internet and have a better way of determining relevancy.
i) Directory Links
Galaxy | Invisible Web | RDN | Virtual Reference Library | Virtual Library | Gimpsy | BUBL
c) Starter Pages
These are similar to web directories, but generally all the links are on one page. These may even be pages that were created and owned by a single person, like the links page on a personal site. A starter page is meant to provide a list of about 100 or so high quality links that cover a wide range of topics. These pages are meant to act as a quick start to finding data. An example is the online Yellow Pages.
i) Creating your own Starter pages
As a person browses the internet, he will, over time, come across several high quality sites. Most surfers use the Favorites or Bookmarks feature of their browser to remember these sites so that they can visit them later. Some people have even organized their bookmarks into folders by category to make them easier to use.
It is possible to share these bookmarks with others by simply exporting them as an .html or .txt file. Yahoo has a service called Bookmarks, which lets you import your bookmarks online and then browse them from Yahoo. It also makes it easy to publish and share your bookmarks online with others.
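As a rough illustration of turning exported bookmarks into a plain list that can be shared, the Python sketch below reads the HTML file that most browsers produce when exporting bookmarks and prints one line per link. The sample file contents and the example.org addresses are made up, and the parser assumes the common export layout in which folder names sit in H3 tags followed by a DL list. Treat it as a starting point rather than a complete converter.

```python
from html.parser import HTMLParser

class BookmarkParser(HTMLParser):
    """Collect (folder, title, url) tuples from an exported bookmarks file."""

    def __init__(self):
        super().__init__()
        self.folders = []     # one entry per open <DL>; None for an unnamed list
        self.pending = None   # folder name read from <H3>, waiting for its <DL>
        self.in_h3 = False
        self.href = None      # URL of the <A> currently being read
        self.text = []
        self.bookmarks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3, self.text = True, []
        elif tag == "a":
            self.href, self.text = dict(attrs).get("href"), []
        elif tag == "dl":
            self.folders.append(self.pending)
            self.pending = None

    def handle_data(self, data):
        if self.in_h3 or self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3":
            self.pending, self.in_h3 = "".join(self.text).strip(), False
        elif tag == "a" and self.href is not None:
            folder = "/".join(name for name in self.folders if name)
            self.bookmarks.append((folder, "".join(self.text).strip(), self.href))
            self.href = None
        elif tag == "dl" and self.folders:
            self.folders.pop()

# Made-up sample in the usual export layout.
sample = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3>Reference</H3>
    <DL><p>
        <DT><A HREF="http://example.org/dictionary">Dictionary</A>
    </DL><p>
    <DT><A HREF="http://example.org/news">News</A>
</DL><p>
"""

parser = BookmarkParser()
parser.feed(sample)
parser.close()
for folder, title, url in parser.bookmarks:
    print(f"{folder or 'Unsorted'}: {title} - {url}")
```

The printed lines can be saved as a .txt file and passed around as a simple starter page.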
An extended form of Yahoo! Bookmarks lies in Furl. Furl is a free service by Looksmart that adds a toolbar to any browser to allow any webpage to be bookmarked easily. However, unlike a traditional bookmark, it actually indexes a copy of that page under your account username. This is a personalized index which is regularly refreshed by Furl's spiders. Unlike traditional bookmarks, you can actually search through the content of the bookmarked pages through Furl. Furl has gained popularity quickly and it is possible to view several other users' starter pages for information. Other services offering a personalized search engine like Furl include Spurl and Rollyo. Users can also recommend new sites to one another or copy another user's list of sites to their own search profile. Google also offers personalized search, but no associated social bookmarking service yet. You can however view detailed statistics about your search activity to understand your own browsing habits online. Starter pages are sorted by specific category and subjects. This leads us to another related concept: Social bookmarking, found later in this document.
Search engines, web directories and starter pages never provide any information directly. They merely provide links to the information. A person may have to explore through several such links to gather a large amount of information. Several sites will usually repeat similar information. This can prove useful as a cross check on accuracy.
ii) Links to Starter Pages
Reference Desk is a directory that also has a useful starter page.
d) Site Maps & Links
These used to be a standard feature on several corporate websites. The site map shows a graph or tree of all the main sections of a site. If you cannot find something on a site through the site's search, a site map is a good alternative. Not all sites have site maps. First, many sites, such as blogs and news sites like Slashdot, simply publish content under single-level categories without sub-classification. Secondly, there are a lot of orphan pages on a site. These are pages that don't belong under any of the site's categories and hence cannot be classified under a site map. Site maps work somewhat closely with search engines. Sites with site maps make it easier for spiders to index them hierarchically and hence may get a higher rank in search results. Prospective webmasters are often strongly encouraged to include a site map for easy navigation by visitors.
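The tree structure of a site map can be sketched in a few lines of Python. The page paths here are invented for illustration. The code groups each page under its parent section and prints the result as an indented outline, which is essentially what a site map page shows.

```python
def build_site_map(paths):
    """Turn a flat list of URL paths into a nested dictionary of sections."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
    return tree

def render(tree, depth=0):
    """Return the tree as a list of lines, indented two spaces per level."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * depth + name)
        lines.extend(render(tree[name], depth + 1))
    return lines

# Hypothetical pages of a small corporate site.
paths = ["/products/phones", "/products/laptops", "/support/faq", "/about"]
site_map = render(build_site_map(paths))
print("\n".join(site_map))
```

A page that sits under none of the sections would have no sensible place in this outline, which is the orphan page problem described above.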
The links page is generally found on several sites. It usually gives links to other relevant information found elsewhere on the internet. Some sites have a jumpstart page. This is a combination of a starter page, site map and links. Blogs usually include a list of links to other blogs of similar nature or to the blogger's friends. See Blogrolling for an example.
e) Databases
This is a significant portion of the invisible internet. A lot of data is stored in databases, and this data has no direct links pointing to it. Such data can only be retrieved by typing a query into the site's search engine and then clicking on the results obtained. Databases are often run by specialty sites, which focus on one particular area. For example, PubMed, a health related site, keeps a detailed database of citations and abstracts from the biomedical literature. Though databases are a rich source of information, locating one that is specialized to a person's area of search is tedious. Good databases are generally found through links from starter pages, recommendations from friends and wiki articles like those on Wikipedia. CompletePlanet is a site that maintains a directory of directories and databases on the internet according to subject and category. CompletePlanet's list of databases should be the first place to head to after a search engine. Even search companies have started to realize the importance of databases. Since search engines cannot directly index a database, they have devised a method of keeping a list of popular keywords and associating some high quality databases on that subject with those terms. Thus, when a person searches, the search engine also sends that same query to an associated database and retrieves those results for display with the overall search results. One may find links to databases near the top of most search results. Yahoo, Google and some other companies currently use this method.
Also see A Look into Online Information Databases.
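The keyword-to-database approach described in this section can be sketched roughly as follows in Python. Everything here is hypothetical: the function names, the tiny web index and the stand-in medical database are invented for illustration, and this is not how any particular search company actually implements it.

```python
def federated_search(query, web_index, keyword_to_databases, databases):
    """Combine ordinary indexed results with results pulled from associated databases."""
    words = query.lower().split()

    # Ordinary results: pages whose indexed text contains every query word.
    web_hits = [url for url, text in web_index.items()
                if all(word in text.lower() for word in words)]

    # Look up which databases are associated with the query's keywords.
    db_names = []
    for word in words:
        for name in keyword_to_databases.get(word, []):
            if name not in db_names:
                db_names.append(name)

    # Forward the same query to each associated database.
    db_hits = []
    for name in db_names:
        for record in databases[name](query):
            db_hits.append(f"[{name}] {record}")

    # Database results are shown ahead of the ordinary results.
    return db_hits + web_hits

def medical_db(query):
    """Stand-in for a specialty database with its own search form."""
    records = ["Migraine: symptoms and treatment", "Asthma: overview"]
    words = query.lower().split()
    return [r for r in records if any(w in r.lower() for w in words)]

web_index = {"http://example.org/headaches": "Notes on migraine triggers"}
keyword_to_databases = {"migraine": ["medical_db"]}
databases = {"medical_db": medical_db}

results = federated_search("migraine", web_index, keyword_to_databases, databases)
for line in results:
    print(line)
```

The point of the design is that the search engine never indexes the database itself. It only needs to know which keywords should trigger a lookup.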
f) Message Boards
These are also called Forums or Bulletin Board Systems (BBS). Technically, a BBS refers to another type of implementation that was popular in the 1980s (e.g. Fidotel BBS). Message boards are numerous and are devoted to a variety of topics. Name a topic, and you could probably find a forum about it. Forums are only a good source of information as long as they have active users who are willing to help. The oldest form of message board lies on USENET and it is still the largest. Some forums require the user to make an account, usually free of charge, before he can log into the forum to read messages or post. As a result, such a forum contains invisible content and cannot be indexed by search engines. An exception to this is USENET, in which all the posts are actually displayed in the form of a webpage for anyone to view.
How does one find a forum on a specific topic? The best way is to first find a site on that subject and then see whether it has an associated forum. Most popular sites have associated forums on the same subject. There are also some forum directories that list forums on the internet by category (e.g. TILE).
Most forums have a built-in search capability that can return all the posts containing given keywords. Having located a forum with a large user base, one can proceed to search it for data. This way, the invisible net can be accessed. It costs nothing to open a message board online these days. Some forums require membership (usually free) before one can search them. Since there may be thousands of forums on the same subject, it is not practical to join each one just to conduct a search. The link below gives a guide on choosing, joining and posting on forums.
Also see A Guide to Message Boards.
g) Mailing Lists
A mailing list refers to a list of email addresses belonging to several users, who are often part of some group. Emails sent to the mailing list's address are received by everyone on the list. This can be used to ask the entire list a question or to make a quick announcement. Users who read it can then reply to the specific person (like a private message on forums) or reply to the list at large so that everyone can read and benefit from the information (like a reply to a forum post). Newsletters (see below) also use a mailing list to send all subscribed users their regular content in the inbox. Many communities use mailing lists to communicate and discuss ideas. Some social network services like Yahoo Groups also have a mailing list in addition to a regular forum. The main problems with mailing lists are that messages can accumulate quickly in the inbox and that it becomes difficult to follow the discussion because it is not presented in a threaded manner like forum posts. With the increased storage capacity of email services like Gmail, the first is not really a problem. Certain email clients offer a threaded view for mailing lists so that discussions become easier to follow. Gmail automatically stacks mailing list messages in a threaded order as long as the subject line does not change. It is possible to post to Usenet through a mailing list address. There are also sites that maintain archives of popular mailing lists so that the information in them can be searched at a future date. Gmane is one such example. Forums, message boards and mailing lists work along much the same lines and are even interconnected in their application and usage.
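Grouping messages by an unchanged subject line, as described above, can be sketched in Python like this. The messages are invented, and the approach is a simplification: many mail clients also look at headers such as In-Reply-To and References, which this sketch ignores.

```python
import re

def normalize(subject):
    """Strip leading 'Re:' / 'Fwd:' prefixes so replies share a key with the original."""
    stripped = re.sub(r"^\s*((re|fwd?)\s*:\s*)+", "", subject, flags=re.IGNORECASE)
    return stripped.strip().lower()

def group_by_subject(messages):
    """Group (sender, subject) pairs into threads, keeping arrival order."""
    threads = {}
    for sender, subject in messages:
        threads.setdefault(normalize(subject), []).append((sender, subject))
    return threads

# Hypothetical mailing list traffic in the order it arrived.
messages = [
    ("alice", "Install problem"),
    ("bob", "Re: Install problem"),
    ("carol", "Meeting notes"),
    ("dave", "RE: Re: install problem"),
]

threads = group_by_subject(messages)
for topic, posts in threads.items():
    print(topic, "->", [sender for sender, _ in posts])
```

If someone changes the subject line partway through a discussion, this kind of grouping starts a new thread, which matches the behavior described for subject-based stacking.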
h) Newsletters
These are closely associated with any site that publishes news regularly. A newsletter is a copy of that news report delivered to your email inbox for reading. Many sites keep the newsletter content minimal, providing only excerpts of the main article and a link to the rest of it. The site also usually contains archives of all the past newsletters and publications. Newsletters are a good way to keep an eye on several activities, but one still needs to read the entire article for the full picture. Some newsletters have no sites associated with them, and the content in these is often missed by search engines. To locate these, use a newsletter directory like TILE.
i) Newsletters and Spam
Newsletters can accumulate very fast in your email inbox. Several newsletters have become associated with spam, meaning more of brand advertising than actual content and information. It has become difficult to distinguish between the two. Thus, the best way to find high quality newsletters is to go through referrals given by users on forums and through links from websites. Some forums have an option to concatenate all the recent active discussions into the form of a newsletter for email delivery. This is a good way to keep an eye on recent forum discussions, especially if you are a regular poster there.
i) IRC, MUDs and MOOs
IRC stands for Internet Relay Chat. It is more than just a place for social gathering; it can also be used for serious discussions. Several open source projects like Ubuntu use IRC much as companies use boardrooms. There are channels dedicated to specific topics where a person can obtain instant information. Think of this as a forum in which people can hold discussions in real time. Naturally, there are pranksters and spammers on IRC channels, but they usually get booted out fast. Information on IRC should be treated like a forum post; always cross check it for accuracy. The same rules of courtesy used in forums also apply on IRC. Every channel has its own rules. Take time to read these. If someone invites you to a war of words, never engage. If you find an IRC channel unhelpful, simply go find another one. Popular topics often have several alternate channels for discussion. Try Search IRC to find new IRC channels or browse the list by topic. IRC Junkie also maintains an archive of all the messages being exchanged on several public IRC channels. While it's not as organized as a forum discussion, it's a good source of information at times and is currently the only way to search past IRC discussions.
MUD stands for Multi User Dungeon. While widely used for playing games, MUDs can also be used for education and discussion. While not as well known as IRC, MUDs have been around longer. MOOs (MUD Object Oriented) are MUD based virtual reality systems that are commonly used in long distance education (e.g. Lingua MOO). A more detailed list of educational MOOs can be found here.
j) Gopher
The predecessor to the World Wide Web, Gopher was a popular tool for seeking information in the early 1990s before HTTP took off. Gopher sites begin with gopher:// and can be accessed with any standard web browser. Jughead and Veronica are two engines used for searching Gopher sites. Gopher sites today are all but extinct, since site owners have moved most of the information they held to the WWW.
k) FTP
File Transfer Protocol or FTP is an old protocol used for transferring files over networks. Plain FTP has little security (SFTP, i.e. Secure FTP, is the exception) and offers fast transfer speeds with resume functionality, if supported by server and client. FTP servers can be located using a search engine. FTP is used mainly for distributing open source software and content like the BSDs, Linux distributions and public domain books (see Project Gutenberg). Archie was the first FTP search engine. Today, it has been phased out and most of the content on FTP servers is usually listed on web sites with links to the server's address. FTPSearchEngines, FileSearching and Midida are sites that offer FTP search functionality. More information on FTP can be found at FTP Planet.
l) File Sharing Networks
I was initially apprehensive about suggesting this, lest it be taken as a sign of promoting piracy. But file sharing networks are fast becoming an excellent source of information, especially for that which is out of print or off the web. Consider a book which is banned by the government for its political views. Someone on a file sharing network may have an e-copy or the book's scanned pages. The network itself is not illegal, nor is it illegal to share files on it. However, users can engage in illegal activity by sharing copyrighted files and other information on the network. Unfortunately, the entire network comes under attack from RIAA, MPAA and the FBI. Always protect your privacy on these networks by using an anonymity program like EFF Tor or JAP so that no one can track you down.
Some file sharing networks also offer other features like chat rooms (e.g. WinMX, Direct Connect etc.). Many of these rooms have an environment similar to IRC channels. Hence, a person can discuss on specific topics and get information. It is also possible to use Usenet and IRC for file sharing though this requires a little more effort and is not as reliable in file transfer mechanisms and speeds.
m) Email
It is not so commonly known, but there are ways to get specific information through email. I am not just referring to the newsletters that get delivered to the inbox. There are certain sites which specialize in providing user support through email. While many such sites charge money, some are free. Protonic and 5starsupport are two examples. Google Answers is another example of support by email, although this costs a little (US $2.50 is the minimum). One advantage of Google Answers is that you can read the questions and answers posted by others. Thus, Google Answers can also be considered a public conversation. Forums and IRC let you gather a large number of answers to a question quickly. Email support means that the replies will usually come from a small team of experts. The advantage is that the replies are sensible and will usually point to a possible solution. You will not get replies like 'Go Google it up' or 'RTFM', which can happen on forums.
n) Other Sources
These are new age sources of information that are a mix of one or more of the features from the above types.
i) Wikis
Strictly speaking, they are just web sites. However, wikis represent a collaborative effort towards building information. Since anyone can edit them freely, wikis can amass content quickly. The content could be of low quality initially, but could improve over time. Wikipedia and Infoanarchy are two examples of wikis that started out plainly and now serve as excellent sources of information. Any search engine will usually list several links to wikis among its results. If you fail to find some information at a wiki, especially on a current topic, consider revisiting the site in a few days. Someone might have added data by then. Wikia, formerly known as Wikicities, is a free wiki hosting service and hosts a large number of wikis by topic.
ii) Blogs
After the founding of Blogger and Livejournal, these have shot up in number on the internet. In a search for any keyword, one can often find a number of blog posts in the results. Blogs, while interesting to read, may not always be accurate. Data from a blog is unsuitable for quoting as a reference, even for daily homework. Blogs are the best source of information when looking for data from a customer's or user's perspective. For example, suppose you have been ripped off recently by being sold some inferior quality product. Are you looking for people with a similar experience to find out what they did to get their cash back? Blogs, along with forums like Annoyances, could be the best source here. Blogs have an ability to cover small niche areas that large data sites like Wikipedia and Answers miss out on. There are some search engines that only search blogs: NewsIsFree, Technorati, Feedster, Google blog search, IceRocket and Bloglines. Blogs are closely related to social networks and can be used as a medium of discussion, much like forums.
iii) Internet Archives
Remember how Deja (now Google owned) archived all Usenet messages for the public to search and use? The Wayback Machine is a project that similarly archives entire WWW sites and pages for everyone to search and use. This is not an easy task, and it certainly needs a lot of hard disk space and bandwidth. If you are looking for a website that has been taken off the internet, head over to the Wayback Machine and you might find it. I managed to obtain useful information from the archives of my older websites that had been removed by the web host. A mirror of the Wayback Machine is located at Bibliotheca Alexandrina.
o) Non-Conventional Networks
These networks represent a new trend. They offer one or two unique features, along with a mix of several features found on the above networks.
i) Social Bookmarking
Social bookmarking is an attempt to sort links through a community effort. This is analogous to the way Wikipedia uses the community to create and sort information. In social bookmarking, there are usually no sub hierarchies for classification. Social bookmarks are a decent source of starter pages. Del.icio.us is probably the best known social bookmarking service. Recently, there have been some open source implementations of social bookmarking like del.lirio.us and Scuttle. Also see my write up on social bookmarking.
ii) Public Conversations
This allows a person to start a private or public conversation about any topic that other people can come and post on. Conversate is one example. It's very much along the lines of posting comments under a blog post or social tagging of bookmarks on del.icio.us. It was designed as an alternative to having an email based discussion. Certain topics of conversation attract lots of comments and provide a source of information for the rest of us. Google Answers, Yahoo Answers, JustCurio.Us and Askeet are examples of other services offering this. Like blogs and popular sites, these services offer RSS feeds.
iii) Webrings
A webring is a set of websites that belong to the same category, sharing similar content. These are similar to directories where websites are also classified by category. However, unlike directories, webrings are not always moderated before a website is added to the ring. All the sites that belong to a certain webring usually display a notice somewhere on the site stating so. By clicking on the webring, you can easily locate a large number of sites of the same genre. Sites like Webrings keep a directory of all known webrings. One can also search this database for locating webrings of a particular category of interest.
iv) Groups and Communities
These are like forums, but ones where images, videos and other types of files can also be shared (e.g. Yahoo Groups). They also offer a mailing list feature, and all mailing list posts are archived along with the message board posts. One needs to join the group to be able to post messages, but many groups allow a person to view all the messages freely. Since Yahoo hosts these groups, bandwidth and space for hosting the files are limited. Many software authors use Yahoo Groups to provide support to their community of users. Various interest groups can also be found here. Other similar services include MSN Communities.
v) Social Networks
Services like Friendster, Hi5, Yelp and Orkut come under this. Hi5 also allows formation of groups, with message boards within the community focused towards specific interests. Yelp was designed to share recommendations and reviews among friends, but also offers private messaging, message boards and other community related features. While the main purpose of these networks is social interaction, information can occasionally be found within the comments posted by various users. It is however hard to sort through this information since it can't be searched easily. One can approach these social network groups and post information, similar to a forum. The community on these networks, while vibrant, was initially more geared towards socializing rather than helping and advising like the normal technical forums. However, with their growing popularity, social networks act as an important medium of communication and could serve as information sources in the future. See this list of social networks.
4) Future Trends
Recently, many social networks have begun to offer features that allow publishing of data like user blogs. Some blog service providers have also developed into social networks, like Livejournal and Xanga, which allow a feature similar to blogrolling so that one can build networks within blog users. Some sites like deviantART, an online community of artists, have also started offering blogs. Frankly, the lines between social networking, blogging, online albums (e.g. Flickr) etc. have started to become quite blurred. I expect features like social bookmarking, photo & video blogging, podcasting and other services to join the mix soon (e.g. see Yahoo 360). It is quite a sight for the observer on the outside. Since the best way to keep users coming back is to offer more features, every community based and oriented service is jumping on the bandwagon regardless of their original goals for starting the service.
5) Evaluating a source of information
We will now look at how a piece of information can be checked for degree of reliability. This is especially important if you are looking to use the information on a research paper or school project where data accuracy is vital.
Factors to consider:
a) Age of site
An older site that has been on the net longer will have been read and evaluated by a lot of other people. Also look at the site's traffic. High traffic means the site is consulted by many others.
b) General level of acceptance of site
Ask yourself how the site is rated amongst the community. Does it attract a lot of criticism? If so, is the criticism valid and justified? Has the site author made any attempt to counter his critics or explain his work in more detail? Many sites that publish articles have a comments feature, allowing people to post comments below the article for feedback. Sometimes this can lead to a lengthy discussion. Comment features are commonly found on blogs below each post. Wikis offer the option of a separate discussion page for every article. If there are comments there, you can read them to know what everyone else thinks. Some non-wiki or non-blog sites also offer a space for discussion on the forums for each article. A link will usually be placed from each article to the relevant thread.
c) Referring sites
If a lot of other sites keep quoting articles from a particular source, then that source should be worth looking at. Search engines use this logic to rank web sites. Some folks have stated that the sites that a site refers you to are an indication of its quality. This is somewhat illogical since everyone will try to link to good sites. But that doesn't mean all those referring sites are also good.
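The idea that a frequently quoted source is worth a look can be illustrated with a very crude inbound link count in Python. The site names and links are made up, and real search engines weigh links in far more elaborate ways than this simple tally.

```python
from collections import Counter

def inbound_link_counts(links):
    """Count how many distinct other sites link to each site."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):      # count each referring site only once
            if target != source:         # ignore links a site makes to itself
                counts[target] += 1
    return counts

# Hypothetical sites and the sites they link to.
links = {
    "site-a": ["site-c", "site-d"],
    "site-b": ["site-c"],
    "site-c": ["site-a"],
    "site-d": ["site-c", "site-c"],
}

counts = inbound_link_counts(links)
for site, n in counts.most_common():
    print(site, n)
```

Here site-c comes out on top because three different sites refer to it. Note that the count says nothing about whether the referring sites themselves are any good, which is the caveat raised above.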
d) Quality of work
Does the author attempt to justify all the information with logical arguments? This is particularly important on scientific discussions. If the author uses historical references, check the source quoted for accuracy. Dates, names of people and places etc. should never be totally erroneous, even if the author is an amateur. Are there any inconsistencies or self contradictions in the work? If so, are they explained, maybe later on in the document? Always look for cited works. If the article is a research paper, authors will usually include a bibliography.
e) Author Credentials
Some questions to ask: Is the author an authority on the subject? If so, what are his qualifications and achievements? Has the author published a lot of related material and research on the subject before? Note that there are many good authors without top notch credentials. This doesn't mean their work is inferior. Technical documentation for several open source projects is written by ordinary folks, often not even computer scientists or engineers. If the author is a professional, it adds to the credibility of the data to some extent.
As a last step, check some books, whether paper or e-books, for material on the same subject and cross-check against them for more accuracy.
6) Summary
Having seen various sources of information on the internet and methods to retrieve them, we now come to a conclusion. Some types of data are more prevalent on some networks. If you are looking for banned files and out of copyright old movies, a file sharing network may be a better place to start. Spyware free versions of popular software (e.g. Kazaa Lite etc.) are also found on file sharing networks. Search engines are the best place to go when you need to locate specific content, like the website of a brand or product. Search engines are probably a good place to start when you have no clue where else to go. Most people always run a query through a search engine in addition to trying other methods of data retrieval. Websites of major libraries are a good place to go when looking for academic information, historical records etc. Wikis and databases are useful when looking for extended information about a topic. Blogs are used to find user comments and opinions. Forums, IRC, newsletters, email support, social networks etc. are useful when all these methods fail to yield any information and you need a person to answer your queries. Internet archives are useful when you need to search the web across a larger timeline, especially if many sites containing that data have been banned or taken off the internet.
Regarding information, it is unnecessary to do a complete evaluation unless you feel doubtful about the information provided. Most people go by gut feeling to determine whether the information sounds right. The best way is to develop logical thinking that helps in selecting information by asking questions. If you feel comfortable, then the data is probably okay. Most people take data from a number of sites anyway, so facts are usually cross checked.
7) Bibliography
Most of the content was written off the top of my head. The most used resource in writing this was Wikipedia.
Research Buzz has some useful information on internet research.
Neal M. Holtz has a detailed list of links to various electronic journals online.
Diana Hacker has a Research and Documentation site that has information on finding new sources and citing proper references.
First Monday has published guidelines on how to write articles for its authors. Also useful to follow for anyone looking to write research reports.



