A new study by scientists from IBM Research, Compaq Corporate Research Laboratories and AltaVista presents a comprehensive “map” of the World Wide Web, and uncovers inadequate linking between regions of the Internet that can make navigation difficult — or in some cases, impossible.
Based on an analysis of more than 600 million Web pages, the findings contradict earlier, small-sample studies that suggested a high degree of connectivity on the Web. According to the study, the Web is divided into four large regions, each containing approximately the same number of pages. Massive constellations of Web sites in the outer regions are inaccessible by links, the most common route of travel between sites for Web surfers.
Giant Bow Tie
The image of the Web that emerged through the research is that of a bow tie. The knot of the bow tie is the Web’s “strongly-connected core” and contains almost one-third of all Web sites. In the core region, Web surfers can use links to travel between the sites with relative ease.
One bow of the tie contains “origination” pages, and the other contains “termination” pages. Origination pages, constituting almost one-quarter of the Web, allow users to reach the connected core eventually. However, those pages cannot be reached from the core.
Termination pages, which also constitute approximately one-quarter of the Web, can be reached from the connected core, but do not link back to it.
The fourth and final region, constituting approximately 22 percent of the Web, contains “disconnected” pages. Disconnected pages are sometimes connected to origination and/or termination pages, but Web surfers in the connected core cannot reach them by following links.
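The four regions can be illustrated on a toy link graph. The sketch below is not the researchers' method; it is a minimal illustration that assumes the Web is given as a dictionary of links and that one page known to be in the core is supplied. The names `toy_web`, `classify`, `reachable` and `reverse` are hypothetical.

```python
from collections import deque

def reachable(graph, start):
    """Return every page reachable from `start` by following links (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def reverse(graph):
    """Flip every link so that edges point from target back to source."""
    flipped = {page: [] for page in graph}
    for page, targets in graph.items():
        for target in targets:
            flipped.setdefault(target, []).append(page)
    return flipped

def classify(graph, core_page):
    """Label each page relative to the core that contains `core_page`."""
    forward = reachable(graph, core_page)            # pages the core can reach
    backward = reachable(reverse(graph), core_page)  # pages that can reach the core
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    labels = {}
    for page in pages:
        if page in forward and page in backward:
            labels[page] = "core"
        elif page in backward:
            labels[page] = "origination"
        elif page in forward:
            labels[page] = "termination"
        else:
            labels[page] = "disconnected"
    return labels

# Hypothetical toy Web: keys are pages, values are the pages they link to.
toy_web = {
    "a": ["b"], "b": ["c"], "c": ["a", "t"],  # a, b, c link in a cycle: the core
    "o": ["a", "x"],                           # o links into the core
    "t": [],                                   # t is linked from the core only
    "x": [],                                   # x hangs off an origination page
}
labels = classify(toy_web, "a")
```

In this toy graph, page `x` is linked from an origination page yet still lands in the disconnected region, because no path leads to it from the core and it has no path back.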
What Does It Mean?
The researchers believe that scientific and business communities will be able to improve connectivity on the Web if they understand the bow tie theory and identify which category their Web pages fall into.
For example, understanding connectivity problems can help search engine developers such as AltaVista design more effective Web “crawling” strategies. Crawling the Web, then indexing the pages found, is one of the fundamental methods used by search engines to organize the Internet. Currently, it is estimated that the most popular search engines index only 16 to 25 percent of the Web’s URLs.
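To see why the bow tie matters for crawling, consider a crawler that starts from a set of seed pages and only follows links. The sketch below is a simplified illustration, not any search engine's actual crawler; the in-memory dictionary `toy_links` stands in for fetching real pages, and all page names are hypothetical.

```python
from collections import deque

def crawl(links, seeds):
    """Breadth-first crawl: follow links from the seeds, indexing each page once."""
    indexed = set(seeds)
    frontier = deque(seeds)
    while frontier:
        page = frontier.popleft()
        for target in links.get(page, ()):
            if target not in indexed:
                indexed.add(target)
                frontier.append(target)
    return indexed

# Hypothetical toy Web: keys are pages, values are the pages they link to.
toy_links = {
    "core1": ["core2"], "core2": ["core1", "term"],  # two core pages, one termination link
    "orig": ["core1"],                                # origination page linking into the core
    "term": [],                                       # termination page with no links out
    "island": [],                                     # disconnected page
}

from_core = crawl(toy_links, ["core1"])                # seeded in the core only
with_extra_seed = crawl(toy_links, ["core1", "orig"])  # also seeded in the origination region
coverage = len(from_core) / len(toy_links)             # fraction of pages indexed from the core
```

A crawl seeded only in the core never finds `orig` or `island`. Adding a seed in the origination region picks up `orig`, while `island` stays unreachable by links from either seed.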
With the new data in hand, search engines will be able to improve methods for ranking Web sites. Currently, ranking in search engine results relies heavily on the number of links to and from a site, which encourages “link spamming” to artificially inflate a site’s ranking.
Even as awareness grows and more pages become part of the connected core, new pages will continue to be created in all three outer regions of the Web, the report predicts.