Debate Over the Size of the Web

How big is the World Wide Web? Many Internet engineers consider that query one of those imponderable philosophical questions, like how many angels can dance on the head of a pin.

But the question about the size of the Web came under intense debate last week after Yahoo announced at an Internet search engine conference in Santa Clara, Calif., that its search engine index - an accounting of the number of documents that can be located from its databases - had reached 19.2 billion.

Because the number was more than twice as large as the number of documents (8.1 billion) currently reported by Google, Yahoo's fierce competitor and Silicon Valley neighbor, the announcement - actually a brief mention in a Yahoo company Web log - set off a spat. Google questioned the way its rival was counting its numbers.

Sergey Brin, Google's co-founder, suggested that the Yahoo index was inflated with duplicate entries in such a way as to cut its effectiveness despite its large size.

"The comprehensiveness of any search engine should be measured by real Web pages that can be returned in response to real search queries and verified to be unique," he said on Friday. "We report the total index size of Google based on this approach."

But Yahoo executives stood by their earlier statement. "The number of documents in our index is accurate," Jeff Weiner, senior vice president of Yahoo's search and marketplace group, said on Saturday. "We're proud of the accomplishments of our search engineers and scientists and look forward to continuing to satisfy our users by delivering the world's highest-quality search experience."

The scope of Internet search engines, and thus indirectly the size of the Internet, has long been a lively area of computer science research and debate.

Moreover, all camps in the discussion are quick to note that index size is only loosely - and possibly even somewhat inversely - related to the quality of results returned.

The major commercial search engines use software programs known as Web crawlers to scour the Internet systematically for documents and index them.

The indexes themselves are maintained as arcane structures of computer data that permit the search engines to return lists of hundreds of answers in fractions of a second when Web users enter terms like "Britney Spears" or "Iraq and weapons of mass destruction."
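The real index structures are proprietary, but the basic idea can be illustrated with a toy inverted index, which maps each term to the set of documents that contain it. The sketch below is only an illustration under simple assumptions (whitespace tokenization, exact term matching, everything held in memory). The names `build_index` and `search` and the sample documents are hypothetical and are not drawn from any search engine's actual code.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start with the postings for the first term, then intersect the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Britney Spears tour dates",
    2: "Iraq and weapons of mass destruction",
    3: "Britney Spears new album",
}
index = build_index(docs)
matches = search(index, "Britney Spears")
```

Because a lookup touches only the entries for the query terms rather than every document, a structure of this kind can answer quickly even when the collection is very large.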

On Sunday, researchers at the National Center for Supercomputing Applications attempted to shed light on the debate by performing a large number of random searches on both indexes. They ran a random sample of 10,012 queries and concluded that Google, on average, returned 166.9 percent more results than Yahoo. In only three percent of the cases did the Yahoo searches return more results than Google. The group said the Yahoo index claim was suspicious.
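The researchers' exact method is not described here, but a comparison of this kind can be sketched as follows. The code assumes that "on average" means the mean of the per-query percentage differences, which is only one possible reading. The function name `compare_counts` and the sample counts are hypothetical and are not the study's data.

```python
def compare_counts(pairs):
    """Compare per-query result counts from two engines, A and B.

    pairs: list of (count_a, count_b) tuples, one per query, with count_b > 0.
    Returns the mean percentage by which A exceeds B, and the share of
    queries for which B returned more results than A.
    """
    pct_more = [100.0 * (a - b) / b for a, b in pairs]
    mean_pct = sum(pct_more) / len(pct_more)
    share_b_wins = sum(1 for a, b in pairs if b > a) / len(pairs)
    return mean_pct, share_b_wins


# Hypothetical result counts for four queries (engine A, engine B).
sample = [(300, 100), (150, 100), (80, 100), (200, 100)]
mean_pct, share_b_wins = compare_counts(sample)
```

A mean of percentage differences can be dominated by a few queries on which one engine returns very few results, which is one reason such estimates are open to dispute.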

Neither Yahoo nor Google makes public the software algorithms that underlie their collection methods. In fact, those details are closely guarded secrets, which lie near the heart of heated competition now going on between Google, Yahoo and Microsoft over who can provide the most relevant answers to a user's query.

"It's a little bit silly," said Christopher Manning, a Stanford University professor who teaches a course on information retrieval. "It's difficult, and the whole question of how big indexes are has clearly become extremely political and commercial."

Even if the methodology is unclear, there is no shortage of outside speculation about what the different numbers mean, if anything.

Jean Veronis, a linguist in France and director of the Centre Informatique pour les Lettres et Sciences Humaines, posted a discussion on a blog noting that the increase in Yahoo references in French appeared consistent with the larger overall number that Yahoo was now reporting.

He added a caveat, however. "All of this should of course be taken with a large pinch of salt," he wrote. "So far, I haven't quite caught Yahoo red-handed when it comes to fiddling the books, but this could simply be because they are smarter with their figures than their competitors ;-)"

In contrast, a fellow blogger, Akash Jain, did his own random query test and wrote that it appeared that Google's index remained about 50 percent larger.

Other search engine specialists remained skeptical about the ability to estimate Web or index size as long as the search engines were being secretive about their methods. "I don't have any good way of checking," said Raul Valdes-Perez, a computer scientist who is chief executive of Vivisimo, which operates the Clusty search engine. "It feels a little like Harvard and Yale decided to argue over who has the most books in their respective libraries."

By JOHN MARKOFF
The New York Times
