Debate Over the Size of the Web

Filed under: Web

How big is the World Wide Web? Many Internet engineers consider that query one of those imponderable philosophical questions, like how many angels can dance on the head of a pin.

But the question about the size of the Web came under intense debate last week after Yahoo announced at an Internet search engine conference in Santa Clara, Calif., that its search engine index - an accounting of the number of documents that can be located from its databases - had reached 19.2 billion.

Because the number was more than twice as large as the number of documents (8.1 billion) currently reported by Google, Yahoo's fierce competitor and Silicon Valley neighbor, the announcement - actually a brief mention in a Yahoo company Web log - set off a spat. Google questioned the way its rival was counting its numbers.

Sergey Brin, Google's co-founder, suggested that the Yahoo index was inflated with duplicate entries in such a way as to cut its effectiveness despite its large size.

"The comprehensiveness of any search engine should be measured by real Web pages that can be returned in response to real search queries and verified to be unique," he said on Friday. "We report the total index size of Google based on this approach."
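Brin's uniqueness test can be pictured with a small sketch. The following is purely illustrative, assuming a simple content-hashing approach of our own; neither company's actual verification method is public.

```python
# Illustrative sketch: counting only pages that are verifiably unique,
# in the spirit of Brin's remark. The normalization rules here are our
# own assumption, not either company's methodology.
import hashlib

def fingerprint(html: str) -> str:
    """Reduce a page to a hash so near-identical copies collide."""
    normalized = " ".join(html.lower().split())  # crude normalization
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def unique_pages(pages: list[str]) -> int:
    """Count distinct documents among those returned for sample queries."""
    return len({fingerprint(p) for p in pages})

pages = ["<html>hello world</html>", "<html>HELLO   world</html>", "<html>other</html>"]
print(unique_pages(pages))  # 2 -- duplicates inflate a raw count, not this one
```

On this view, an index padded with duplicates reports a large raw document count while the verified-unique count stays much smaller.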

But Yahoo executives stood by their earlier statement. "The number of documents in our index is accurate," Jeff Weiner, senior vice president of Yahoo's search and marketplace group, said on Saturday. "We're proud of the accomplishments of our search engineers and scientists and look forward to continuing to satisfy our users by delivering the world's highest-quality search experience."

The scope of Internet search engines, and thus indirectly the size of the Internet, has long been a lively area of computer science research and debate.

Moreover, all camps in the discussion are quick to note that index size is only loosely - and possibly even somewhat inversely - related to the quality of results returned.

The major commercial search engines use software programs known as Web crawlers to scour the Internet systematically for documents and index them.
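The crawl loop itself is conceptually simple. Here is a minimal sketch of the idea in Python; real crawlers add politeness rules (robots.txt, rate limits), deduplication, and distributed work queues, none of which appear in this toy version.

```python
# Toy breadth-first Web crawler: fetch a page, harvest its links,
# and queue them for fetching in turn. Illustrative only.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed: str, limit: int = 10):
    """Yield (url, html) pairs, breadth-first from a seed URL."""
    queue, seen = [seed], {seed}
    while queue and len(seen) <= limit:
        url = queue.pop(0)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue
        yield url, html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
```

Each fetched page is then handed to the indexer, which is where the counting disputes begin: whether two nearly identical pages should count once or twice.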

The indexes themselves are maintained as arcane structures of computer data that permit the search engines to return lists of hundreds of answers in fractions of a second when Web users enter terms like "Britney Spears" or "Iraq and weapons of mass destruction."
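The core of such a structure is typically an inverted index: a map from each term to the documents containing it, so a query only touches the postings for its own terms rather than scanning every document. A minimal sketch, with production concerns like compression, ranking, and sharding left out:

```python
# Minimal inverted index: term -> set of document ids.
from collections import defaultdict

index: dict[str, set[int]] = defaultdict(set)

def add_document(doc_id: int, text: str) -> None:
    """Register every term of a document in the index."""
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query: str) -> set[int]:
    """Return ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

add_document(1, "Iraq and weapons of mass destruction")
add_document(2, "Britney Spears news")
print(search("weapons of mass destruction"))  # {1}
```

Because a lookup touches only a handful of postings lists, answers come back in fractions of a second even over billions of documents.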

On Sunday, researchers at the National Center for Supercomputing Applications attempted to shed light on the debate by running a large number of random searches against both indexes. From a random sample of 10,012 queries, they concluded that Google, on average, returned 166.9 percent more results than Yahoo. In only three percent of the cases did the Yahoo searches return more results than Google. The group said the Yahoo index claim was suspicious.
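The shape of such a study is easy to sketch. The version below is our own reconstruction of the general random-query method, not the NCSA code; the counter functions stand in for however one would obtain each engine's reported result count for a query.

```python
# Hedged sketch of a random-query comparison between two engines.
# The stub counters below are placeholders, not real engine queries.
import random

def compare_engines(queries, count_a, count_b):
    """Run the same queries on two engines; compare total result counts."""
    total_a = total_b = b_wins = 0
    for query in queries:
        a, b = count_a(query), count_b(query)
        total_a += a
        total_b += b
        if b > a:
            b_wins += 1
    excess_pct = (total_a - total_b) / total_b * 100  # how much more A returns
    return {"a_excess_pct": excess_pct, "b_win_share": b_wins / len(queries)}

random.seed(0)
queries = [f"term{i} term{i + 1}" for i in range(10_012)]
print(compare_engines(queries,
                      lambda q: random.randint(100, 1_000),  # stub engine A
                      lambda q: random.randint(50, 600)))    # stub engine B
```

A "166.9 percent more results" figure corresponds to the excess percentage computed here: one engine's total result count divided by the other's, minus one.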

Neither Yahoo nor Google makes public the software algorithms that underlie their collection methods. In fact, those details are closely guarded secrets, which lie near the heart of heated competition now going on between Google, Yahoo and Microsoft over who can provide the most relevant answers to a user's query.

"It's a little bit silly," said Christopher Manning, a Stanford University professor who teaches a course on information retrieval. "It's difficult, and the whole question of how big indexes are has clearly become extremely political and commercial."

Even if the methodology is unclear, there is no shortage of outside speculation about what the different numbers mean, if anything.

Jean Veronis, a linguist in France and director of the Centre Informatique pour les Lettres et Sciences Humaines, posted a discussion on a blog noting that the increase in Yahoo references in French appeared consistent with the larger overall number that Yahoo was now reporting.

He added a caveat, however. "All of this should of course be taken with a large pinch of salt," he wrote. "So far, I haven't quite caught Yahoo red-handed when it comes to fiddling the books, but this could simply be because they are smarter with their figures than their competitors ;-)"

In contrast, a fellow blogger, Akash Jain, did his own random query test and wrote that it appeared that Google's index remained about 50 percent larger.

Other search engine specialists remained skeptical about the ability to estimate Web or index size as long as the search engines were being secretive about their methods. "I don't have any good way of checking," said Raul Valdes-Perez, a computer scientist who is chief executive of Vivisimo, which operates the Clusty search engine. "It feels a little like Harvard and Yale decided to argue over who has the most books in their respective libraries."

By JOHN MARKOFF
The New York Times

More in Tux Machines

Security: Uber Sued, Intel ‘Damage Control’, ZDNet FUD, and XFRM Privilege Escalation

  • Uber hit with 2 lawsuits over gigantic 2016 data breach
    In the 48 hours since the explosive revelations that Uber sustained a massive data breach in 2016, two separate proposed class-action lawsuits have been filed in different federal courts across California. The cases allege substantial negligence on Uber’s part: plaintiffs say the company failed to keep safe the data of the 50 million customers and 7 million drivers affected. Uber reportedly paid $100,000 to delete the stolen data and keep news of the breach quiet. On Tuesday, CEO Dara Khosrowshahi wrote: “None of this should have happened, and I will not make excuses for it.”
  • Intel Releases Linux-Compatible Tool For Confirming ME Vulnerabilities [Ed: ‘Damage control’ strategy is to make it look like just a bug.]
    While Intel ME security issues have been discussed for months, confirming fears that had circulated for years, this week Intel published the SA-00086 security advisory following its own internal review of the ME/TXE/SPS components. The impact is that an attacker could crash the system or cause instability, load and execute arbitrary code outside the visibility of the user and operating system, and trigger other possible issues.
  • Open source's big weak spot? Flawed libraries lurking in key apps [Ed: Linux basher Liam Tung entertains FUD firm Snyk and Microsoft because it suits the employer's agenda]
  • SSD Advisory – Linux Kernel XFRM Privilege Escalation

gThumb 3.6 GNOME Image Viewer Released with Better Wayland and HiDPI Support

gThumb, the open-source image viewer for the GNOME desktop environment, has been updated this week to version 3.6, a new stable branch that introduces numerous new features and improvements. gThumb 3.6 comes with better support for the next-generation Wayland display server, as the built-in video player, color profiles, and application icon received Wayland support. The video player component received a "Loop" button to allow you to loop videos, and there's now support for HiDPI displays.

The app also ships with a color picker, a new option to open files in full-screen, a zoom popover that offers different zoom commands and a zoom slider, support for double-click activation, faster image loading, aspect ratio filtering, and the ability to display the description of the color profile in the property view.

Read more

Also: Many Broadway HTML5 Backend Improvements Land In GTK4

ExTiX 18.0, 64bit, with Deepin Desktop 15.5 (made in China!) and Refracta Tools – Create your own ExTiX/Ubuntu/Deepin system in minutes!

I’ve made a new extra version of ExTiX with the Deepin 15.5 Desktop (made in China!). Deepin is devoted to providing a beautiful, easy-to-use, safe and reliable system for global users. Only a minimum of packages are installed in ExTiX Deepin; you can of course install all the packages you want, even while running ExTiX Deepin live, i.e. from a DVD or USB stick. Study all installed packages in ExTiX Deepin.

Read more

Also: ExTiX, the Ultimate Linux System, Now Has a Deepin Edition Based on Ubuntu 17.10

Also: Kali Linux 2017.3 Brings New Hacking Tools — Download ISO And Torrent Files Here

Graphics: Greenfield, Polaris, Ryzen

  • Greenfield: An In-Browser HTML5 Wayland Compositor
    Earlier this year we covered the Westfield project, which brought Wayland to HTML5/JavaScript by providing a Wayland protocol parser and generator for JavaScript. That code has now morphed into Greenfield, a working, in-browser HTML5 Wayland compositor.
  • New Polaris Firmware Blobs Hit Linux-Firmware.Git
    Updated firmware files for the command processor (CP) on AMD Polaris graphics cards have landed in linux-firmware.git. The updated files are light on details beyond being for the CP and coming from AMD's internal 577de7b1 Git state.
  • Report: Ryzen "Raven Ridge" APU Not Using HBM2 Memory
    Rather than the Vega graphics on Raven Ridge using HBM2 memory, it appears that at least some models are just using onboard DDR4 memory. FUDZilla is reporting today that there is just 256MB of onboard DDR4 memory being used by the new APU, at least for the Ryzen 5 APU found in the HP Envy x360, the first Raven Ridge APU system to market.