Debate Over the Size of the Web

Filed under
Web

How big is the World Wide Web? Many Internet engineers consider that query one of those imponderable philosophical questions, like how many angels can dance on the head of a pin.

But the question about the size of the Web came under intense debate last week after Yahoo announced at an Internet search engine conference in Santa Clara, Calif., that its search engine index - an accounting of the number of documents that can be located from its databases - had reached 19.2 billion.

Because the number was more than twice as large as the number of documents (8.1 billion) currently reported by Google, Yahoo's fierce competitor and Silicon Valley neighbor, the announcement - actually a brief mention in a Yahoo company Web log - set off a spat. Google questioned the way its rival was counting its numbers.

Sergey Brin, Google's co-founder, suggested that the Yahoo index was inflated with duplicate entries in such a way as to cut its effectiveness despite its large size.

"The comprehensiveness of any search engine should be measured by real Web pages that can be returned in response to real search queries and verified to be unique," he said on Friday. "We report the total index size of Google based on this approach."

But Yahoo executives stood by their earlier statement. "The number of documents in our index is accurate," Jeff Weiner, senior vice president of Yahoo's search and marketplace group, said on Saturday. "We're proud of the accomplishments of our search engineers and scientists and look forward to continuing to satisfy our users by delivering the world's highest-quality search experience."

The scope of Internet search engines, and thus indirectly the size of the Internet, has long been a lively area of computer science research and debate.

Moreover, all camps in the discussion are quick to note that index size is only loosely - and possibly even somewhat inversely - related to the quality of results returned.

The major commercial search engines use software programs known as Web crawlers to scour the Internet systematically for documents and index them.

The indexes themselves are maintained as arcane structures of computer data that permit the search engines to return lists of hundreds of answers in fractions of a second when Web users enter terms like "Britney Spears" or "Iraq and weapons of mass destruction."
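The actual structures Google and Yahoo use are, as the article notes, closely guarded secrets, but the textbook data structure behind this kind of fast keyword lookup is an inverted index: a map from each term to the set of documents containing it. A minimal sketch over a toy corpus (the documents and AND-query semantics here are illustrative assumptions, not either company's implementation):

```python
from collections import defaultdict

# Toy corpus standing in for crawled Web documents (hypothetical).
documents = {
    "doc1": "weapons of mass destruction report",
    "doc2": "mass media coverage of the report",
    "doc3": "destruction of old archives",
}

# Build the inverted index: each term maps to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

print(sorted(search("mass destruction")))  # → ['doc1']
```

Because each lookup is a hash-table access followed by set intersections, answering a query never requires scanning the documents themselves, which is what makes sub-second responses possible even at billions of documents.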

On Sunday, researchers at the National Center for Supercomputing Applications attempted to shed light on the debate by running a large number of random searches against both indexes. From a random sample of 10,012 queries, they concluded that Google, on average, returned 166.9 percent more results than Yahoo. In only three percent of the cases did the Yahoo searches return more results than Google. The group said the Yahoo index claim was suspicious.
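The shape of such a comparison can be sketched in a few lines. The per-engine count functions below are synthetic stand-ins (random numbers), not real search APIs; an actual study like the NCSA one would issue live queries to each engine and record the reported hit counts:

```python
import random

# Hypothetical stand-ins for two engines' reported hit counts per query.
# A real test would query the live search engines instead.
def engine_a_count(query, rng):
    return rng.randint(100, 10_000)

def engine_b_count(query, rng):
    return rng.randint(50, 5_000)

def compare(queries, seed=0):
    """Fraction of queries where A returns more results, and mean A/B ratio."""
    rng = random.Random(seed)
    a_wins = 0
    ratio_sum = 0.0
    for q in queries:
        a = engine_a_count(q, rng)
        b = engine_b_count(q, rng)
        a_wins += a > b
        ratio_sum += a / b
    return a_wins / len(queries), ratio_sum / len(queries)

share_a_larger, avg_ratio = compare([f"query-{i}" for i in range(10_012)])
print(f"A larger on {share_a_larger:.0%} of queries; mean ratio {avg_ratio:.2f}")
```

Note that this estimates relative coverage only for queries like those sampled; as the researchers' critics could point out, it says nothing directly about total index size.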

Neither Yahoo nor Google makes public the software algorithms that underlie their collection methods. In fact, those details are closely guarded secrets, which lie near the heart of the heated competition now going on among Google, Yahoo and Microsoft over which can provide the most relevant answers to a user's query.

"It's a little bit silly," said Christopher Manning, a Stanford University professor who teaches a course on information retrieval. "It's difficult, and the whole question of how big indexes are has clearly become extremely political and commercial."

Even if the methodology is unclear, there is no shortage of outside speculation about what the different numbers mean, if anything.

Jean Veronis, a linguist in France and director of the Centre Informatique pour les Lettres et Sciences Humaines, posted a discussion on a blog noting that the increase in Yahoo references in French appeared consistent with the larger overall number that Yahoo was now reporting.

He added a caveat, however. "All of this should of course be taken with a large pinch of salt," he wrote. "So far, I haven't quite caught Yahoo red-handed when it comes to fiddling the books, but this could simply be because they are smarter with their figures than their competitors ;-)"

In contrast, a fellow blogger, Akash Jain, did his own random query test and wrote that it appeared that Google's index remained about 50 percent larger.

Other search engine specialists remained skeptical about the ability to estimate Web or index size as long as the search engines were being secretive about their methods. "I don't have any good way of checking," said Raul Valdes-Perez, a computer scientist who is chief executive of Vivisimo, which operates the Clusty search engine. "It feels a little like Harvard and Yale decided to argue over who has the most books in their respective libraries."

By JOHN MARKOFF
The New York Times

More in Tux Machines

Security News

  • Windows 10 least secure of Windows versions: study
    Windows 10 was the least secure of current Windows versions in 2016, with 46% more vulnerabilities than either Windows 8 or 8.1, according to an analysis of Microsoft's own security bulletins for the year. Security firm Avecto said its research, titled "2016 Microsoft Vulnerabilities Study: Mitigating risk by removing user privileges", had also found that the vast majority of vulnerabilities found in Microsoft products could be mitigated by removing admin rights. The research found that, despite Microsoft's claim that Windows 10 is its "most secure" operating system, it had 395 vulnerabilities in 2016, while Windows 8 and 8.1 each had 265. The research also found that 530 Microsoft vulnerabilities were reported in all, marginally up from the 524 reported in 2015, with 189 given a critical rating; 94% of these could be mitigated by removing admin rights, up from 85% in 2015.
  • Windows 10 Creators Update can block Win32 apps if they’re not from the Store [Ed: By Microsoft Peter. People who put Vista 10 on a PC totally lose control of that PC; remember, the OS itself is malware, as per textbook definitions. With DRM and other antifeatures expect copyright enforcement on the desktop soon.]
    The latest Windows 10 Insider Preview build doesn't add much in the way of features—it's mostly just bug fixes—but one small new feature has been spotted, and it could be contentious. Vitor Mikaelson noticed that the latest build lets you restrict the installation of applications built using the Win32 API.
  • Router assimilated into the Borg, sends 3TB in 24 hours
    "Well, f**k." Harsh language was appropriate under the circumstances. My router had just been hacked. Setting up a reliable home network has always been a challenge for me. I live in a cramped three-story house, and I don't like running cables. So my router's position is determined by the fiber modem in a corner on the bottom floor. Not long after we moved in, I realized that our old Airport Extreme was not delivering much signal to the attic, where two game-obsessed occupants fought for bandwidth. I tried all sorts of things. I extended the network. I used Ethernet-over-powerline connectors to deliver network access. I made a mystic circle and danced naked under the full moon. We lost neighbors, but we didn't gain a signal.
  • Purism's Librem 13 Coreboot Port Now "100%" Complete
    According to Purism's Youness Alaoui, their Coreboot port to the Librem 13 v1 laptop is now considered complete. The Librem 13 had long been talked about as shipping with Coreboot rather than a proprietary BIOS, though the initial models still shipped with a conventional BIOS. Finally, in 2017, they now have Coreboot at what they consider to be 100% complete for this Linux-friendly laptop.
  • The Librem 13 v1 coreboot port is now complete
    Here is the news you've been waiting for: the coreboot port for the Librem 13 v1 is 100% done! I fixed all of the remaining issues; it is now fully working and stable, ready for others to enjoy. I fixed the instability problem with the M.2 SATA port, finished running all the tests to ensure coreboot is working correctly, fixed the headphone jack that was not working, made the boot prettier, and started investigating the Intel Management Engine issue.
  • Linux Update Fixes 11-Year-Old Flaw
    Andrey Konovalov, a security researcher at Google, found a use-after-free hole within Linux, CSO Online reported. This particular flaw is of interest because it appears to be situational. It only showed up in kernels built with a certain configuration option — CONFIG_IP_DCCP — enabled.
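Because the DCCP flaw only affects kernels built with CONFIG_IP_DCCP, whether a given machine is exposed depends on its build configuration, which most distributions save to disk. A quick check (file locations vary by distribution; the paths below are common but not universal, and the blacklist line is a general module-disabling technique rather than an official vendor-prescribed fix):

```shell
# Check whether the running kernel was built with DCCP support.
# Many distributions save the build config under /boot:
grep CONFIG_IP_DCCP "/boot/config-$(uname -r)"

# Some kernels expose the config via /proc instead:
zgrep CONFIG_IP_DCCP /proc/config.gz

# If DCCP is built as a loadable module (=m), blacklisting it
# prevents it from being loaded, mitigating the flaw without a reboot:
echo "install dccp /bin/false" | sudo tee /etc/modprobe.d/disable-dccp.conf
```

If the grep shows `CONFIG_IP_DCCP is not set`, the kernel is not affected by this particular bug.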

Kerala saves Rs 300 cr as schools switch to open software

The Kerala government has saved Rs 300 crore through the introduction and adoption of Free & Open Source Software (FOSS) in the school education sector, a state government official said on Sunday. IT became a compulsory subject in Kerala schools in 2003, but it was only in 2005 that FOSS was introduced in a phased manner and began to replace proprietary software. The curriculum committee's decision to implement it in the higher secondary sector has now also been carried out. Read more

Tired of Windows and Mac computer systems? Linux may now be ready for prime time

Are you a bit tired of the same old options of salt and pepper, meaning having to choose only between the venerable Windows and Mac computer operating systems? Looking to branch out a bit, maybe take a walk on the wild side, learn some new things and save money? If so, the Linux operating system, which has been around for a long time and is used and loved by many hard-core techies and developers, may now be ready for prime time with the masses. Read more

Braswell based Pico-ITX SBC offers multiple expansion options

Axiomtek’s PICO300 is a Pico-ITX SBC with Intel Braswell, SATA-600, extended temperature support, and both a mini-PCIe and homegrown expansion connector. Axiomtek has launched a variation on its recently announced Intel Apollo Lake based PICO312 SBC that switches to the older Intel Braswell generation and offers a slightly reduced feature set. The board layout has also changed somewhat, with LVDS, SATA, and USB ports all changing location. Read more