Debate Over the Size of the Web

How big is the World Wide Web? Many Internet engineers consider that query one of those imponderable philosophical questions, like how many angels can dance on the head of a pin.

But the question about the size of the Web came under intense debate last week after Yahoo announced at an Internet search engine conference in Santa Clara, Calif., that its search engine index - an accounting of the number of documents that can be located from its databases - had reached 19.2 billion.

Because the number was more than twice as large as the number of documents (8.1 billion) currently reported by Google, Yahoo's fierce competitor and Silicon Valley neighbor, the announcement - actually a brief mention in a Yahoo company Web log - set off a spat. Google questioned the way its rival was counting its numbers.

Sergey Brin, Google's co-founder, suggested that the Yahoo index was inflated with duplicate entries in such a way as to cut its effectiveness despite its large size.

"The comprehensiveness of any search engine should be measured by real Web pages that can be returned in response to real search queries and verified to be unique," he said on Friday. "We report the total index size of Google based on this approach."
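Neither company has published how it counts documents, so the following is only a minimal sketch of the general idea behind counting "unique" pages: treat two documents as duplicates when their text is identical after trivial normalization. The function name and the normalization rule are assumptions made for illustration, not either engine's method.

```python
import hashlib


def count_unique_documents(documents):
    """Return (total, unique) for a list of document texts.

    Assumption: two documents are duplicates if their text matches
    after lowercasing and collapsing whitespace. Real engines use
    far more elaborate near-duplicate detection.
    """
    seen = set()
    for text in documents:
        # Normalize case and whitespace so trivial variants collide.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        seen.add(digest)
    return len(documents), len(seen)
```

Under this rule an index holding the same page twice would report a larger total than its unique count, which is the kind of gap Mr. Brin's remark points to.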

But Yahoo executives stood by their earlier statement. "The number of documents in our index is accurate," Jeff Weiner, senior vice president of Yahoo's search and marketplace group, said on Saturday. "We're proud of the accomplishments of our search engineers and scientists and look forward to continuing to satisfy our users by delivering the world's highest-quality search experience."

The scope of Internet search engines, and thus indirectly the size of the Internet, has long been a lively area of computer science research and debate.

Moreover, all camps in the discussion are quick to note that index size is only loosely - and possibly even somewhat inversely - related to the quality of results returned.

The major commercial search engines use software programs known as Web crawlers to scour the Internet systematically for documents and index them.

The indexes themselves are maintained as arcane structures of computer data that permit the search engines to return lists of hundreds of answers in fractions of a second when Web users enter terms like "Britney Spears" or "Iraq and weapons of mass destruction."
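The commercial index formats are proprietary, but the textbook structure behind fast term lookup is the inverted index, which maps each term to the set of documents containing it. The sketch below is a toy version under that assumption; the function names and the whitespace tokenizer are hypothetical simplifications.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the set of document ids containing it.

    `documents` is a dict of {doc_id: text}. Tokenizing on whitespace
    is a simplification chosen for illustration.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's postings, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Because each lookup touches only the postings for the query terms, answering a query does not require scanning every document, which is what makes sub-second responses plausible at scale.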

On Sunday, researchers at the National Center for Supercomputing Applications attempted to shed light on the debate by performing a large number of random searches on both indexes. They ran a random sample of 10,012 queries and concluded that Google, on average, returned 166.9 percent more results than Yahoo. In only three percent of the cases did the Yahoo searches return more results than Google. The group said the Yahoo index claim was suspicious.
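The researchers' exact procedure is not described here. As a rough sketch of how two such summary figures could be computed from paired result counts, assume each query yields one count per engine; the function name, the per-query averaging, and the skipping of zero-count queries are all assumptions made for illustration.

```python
from statistics import mean


def compare_result_counts(counts_a, counts_b):
    """Summarize paired per-query result counts from engines A and B.

    Returns (pct_more, share_b_wins):
      pct_more     - mean percentage by which A's count exceeds B's,
                     averaged over queries where B returned results
      share_b_wins - fraction of all queries where B returned more
    Assumption: averaging per-query percentages is only one of
    several plausible ways to define "on average".
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both engines need one count per query")
    if not counts_a:
        raise ValueError("no queries to compare")
    # Skip queries where B returned nothing to avoid dividing by zero.
    usable = [(a, b) for a, b in zip(counts_a, counts_b) if b > 0]
    if not usable:
        raise ValueError("engine B returned no results for any query")
    pct_more = mean((a - b) / b * 100 for a, b in usable)
    b_wins = sum(1 for a, b in zip(counts_a, counts_b) if b > a)
    return pct_more, b_wins / len(counts_a)
```

For example, counts of 300, 200 and 50 for engine A against 100 each for engine B give per-query differences of 200, 100 and -50 percent, so A averages about 83.3 percent more results while B wins one query in three.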

Neither Yahoo nor Google makes public the software algorithms that underlie their collection methods. In fact, those details are closely guarded secrets, which lie near the heart of heated competition now going on between Google, Yahoo and Microsoft over who can provide the most relevant answers to a user's query.

"It's a little bit silly," said Christopher Manning, a Stanford University professor who teaches a course on information retrieval. "It's difficult, and the whole question of how big indexes are has clearly become extremely political and commercial."

Even if the methodology is unclear, there is no shortage of outside speculation about what the different numbers mean, if anything.

Jean Veronis, a linguist in France and director of the Centre Informatique pour les Lettres et Sciences Humaines, posted a discussion on a blog noting that the increase in Yahoo references in French appeared consistent with the larger overall number that Yahoo was now reporting.

He added a caveat, however. "All of this should of course be taken with a large pinch of salt," he wrote. "So far, I haven't quite caught Yahoo red-handed when it comes to fiddling the books, but this could simply be because they are smarter with their figures than their competitors ;-)"

In contrast, a fellow blogger, Akash Jain, did his own random query test and wrote that it appeared that Google's index remained about 50 percent larger.

Other search engine specialists remained skeptical about the ability to estimate Web or index size as long as the search engines were being secretive about their methods. "I don't have any good way of checking," said Raul Valdes-Perez, a computer scientist who is chief executive of Vivisimo, which operates the Clusty search engine. "It feels a little like Harvard and Yale decided to argue over who has the most books in their respective libraries."

By JOHN MARKOFF
The New York Times
