Debate Over the Size of the Web


How big is the World Wide Web? Many Internet engineers consider that query one of those imponderable philosophical questions, like how many angels can dance on the head of a pin.

But the question about the size of the Web came under intense debate last week after Yahoo announced at an Internet search engine conference in Santa Clara, Calif., that its search engine index - an accounting of the number of documents that can be located from its databases - had reached 19.2 billion.

Because the number was more than twice as large as the number of documents (8.1 billion) currently reported by Google, Yahoo's fierce competitor and Silicon Valley neighbor, the announcement - actually a brief mention in a Yahoo company Web log - set off a spat. Google questioned the way its rival was counting its numbers.

Sergey Brin, Google's co-founder, suggested that the Yahoo index was inflated with duplicate entries in such a way as to cut its effectiveness despite its large size.

"The comprehensiveness of any search engine should be measured by real Web pages that can be returned in response to real search queries and verified to be unique," he said on Friday. "We report the total index size of Google based on this approach."

But Yahoo executives stood by their earlier statement. "The number of documents in our index is accurate," Jeff Weiner, senior vice president of Yahoo's search and marketplace group, said on Saturday. "We're proud of the accomplishments of our search engineers and scientists and look forward to continuing to satisfy our users by delivering the world's highest-quality search experience."

The scope of Internet search engines, and thus indirectly the size of the Internet, has long been a lively area of computer science research and debate.

Moreover, all camps in the discussion are quick to note that index size is only loosely - and possibly even somewhat inversely - related to the quality of results returned.

The major commercial search engines use software programs known as Web crawlers to scour the Internet systematically for documents and index them.
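As a rough illustration of that process (not either company's actual crawler), a minimal breadth-first crawler in Python might look like the sketch below. The seed URL, page limit and link-extraction pattern are placeholders; production crawlers also respect robots.txt, throttle their requests and distribute the work across many machines.

```python
# Minimal breadth-first crawler sketch; real crawlers add politeness delays,
# robots.txt handling, content deduplication and distributed work queues.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl starting from seed_urls, returning url -> HTML."""
    queue = deque(seed_urls)      # frontier of URLs still to fetch
    seen = set(seed_urls)         # never enqueue the same URL twice
    documents = {}                # fetched pages, keyed by URL

    while queue and len(documents) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue              # unreachable or unreadable page; skip it
        documents[url] = html
        # Pull href targets out of the page and add unseen absolute URLs to the frontier.
        for link in re.findall(r'href="([^"#]+)"', html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return documents

# Hypothetical seed; the pages returned here are what would later be indexed.
pages = crawl(["https://example.com/"], max_pages=10)
```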

The indexes themselves are maintained as arcane structures of computer data that permit the search engines to return lists of hundreds of answers in fractions of a second when Web users enter terms like "Britney Spears" or "Iraq and weapons of mass destruction."
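At their core, those structures are inverted indexes: for every term, a list of the documents that contain it, so answering a query means intersecting a few posting lists rather than scanning the whole collection. A toy sketch, with made-up documents standing in for crawled pages:

```python
from collections import defaultdict

def build_index(documents):
    """Inverted index: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (posting-list intersection)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "iraq and weapons of mass destruction",
        2: "britney spears tour dates",
        3: "iraq reconstruction report"}
idx = build_index(docs)
print(search(idx, "iraq weapons"))   # -> {1}
```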

On Sunday, researchers at the National Center for Supercomputing Applications attempted to shed light on the debate by running a large number of random searches against both indexes. From a random sample of 10,012 queries, they concluded that Google, on average, returned 166.9 percent more results than Yahoo; in only three percent of the cases did the Yahoo searches return more results than Google's. The group said the Yahoo index claim was suspicious.
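The group's exact methodology was not published with the result, but the shape of the comparison is straightforward: issue the same random queries to both engines, record how many results each reports, and average the difference. A minimal sketch, assuming hypothetical counting functions for the two engines, since the article does not describe how they were queried:

```python
import statistics

def compare_result_counts(queries, count_from_a, count_from_b):
    """Compare two engines' reported result counts over the same queries.

    count_from_a / count_from_b are callables returning the number of hits
    each engine reports for a query; they are stand-ins for whatever
    interface the researchers actually used.
    """
    pct_more = []   # per query: how many percent more results A reports than B
    b_larger = 0    # queries for which B reports more results than A
    for q in queries:
        a, b = count_from_a(q), count_from_b(q)
        if b > a:
            b_larger += 1
        if b > 0:
            pct_more.append(100.0 * (a - b) / b)
    return {
        "avg_pct_more_from_a": statistics.mean(pct_more),
        "pct_queries_b_larger": 100.0 * b_larger / len(queries),
    }
```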

Neither Yahoo nor Google makes public the software algorithms that underlie its collection methods. In fact, those details are closely guarded secrets that lie at the heart of the heated competition among Google, Yahoo and Microsoft over who can provide the most relevant answers to a user's query.

"It's a little bit silly," said Christopher Manning, a Stanford University professor who teaches a course on information retrieval. "It's difficult, and the whole question of how big indexes are has clearly become extremely political and commercial."

Even if the methodology is unclear, there is no shortage of outside speculation about what the different numbers mean, if anything.

Jean Veronis, a linguist in France and director of the Centre Informatique pour les Lettres et Sciences Humaines, posted a discussion on a blog noting that the increase in Yahoo references in French appeared consistent with the larger overall number that Yahoo was now reporting.

He added a caveat, however. "All of this should of course be taken with a large pinch of salt," he wrote. "So far, I haven't quite caught Yahoo red-handed when it comes to fiddling the books, but this could simply be because they are smarter with their figures than their competitors ;-)"

In contrast, a fellow blogger, Akash Jain, did his own random query test and wrote that it appeared that Google's index remained about 50 percent larger.

Other search engine specialists remained skeptical about the ability to estimate Web or index size as long as the search engines were being secretive about their methods. "I don't have any good way of checking," said Raul Valdes-Perez, a computer scientist who is chief executive of Vivisimo, which operates the Clusty search engine. "It feels a little like Harvard and Yale decided to argue over who has the most books in their respective libraries."

By JOHN MARKOFF
The New York Times
