This article is interesting not so much because it discusses search engines, but because the writer brings in textual and linguistic theory, as well as hinting at some implications for evolutionary theory.
Web search is a multipart process. First, Google crawls the Web to collect the contents of every accessible site. This data is broken down into an index (organized by word, just like the index of a textbook), a way of finding any page based on its content. Every time a user types a query, the index is combed for relevant pages, returning a list that commonly numbers in the hundreds of thousands, or millions. The trickiest part, though, is the ranking process — determining which of those pages belong at the top of the list.
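The "index organized by word" described above is what is usually called an inverted index. As a rough illustration of the idea (a toy sketch, not Google's actual data structure — the documents and queries here are invented), finding pages by content amounts to intersecting per-word document sets:

```python
# A toy inverted index: maps each word to the set of documents containing it.
from collections import defaultdict

docs = {
    1: "hot dog with mustard at baseball games",
    2: "how to boil water for tea",
    3: "puppy training tips for a new dog",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query):
    """Return the IDs of documents containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = index[words[0]].copy()
    for word in words[1:]:
        result &= index[word]
    return result

print(sorted(search("dog")))       # documents 1 and 3
print(sorted(search("hot dog")))   # only document 1
```

At web scale this lookup returns the "hundreds of thousands, or millions" of candidate pages the article mentions; everything interesting happens afterward, in ranking.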
That’s where the contextual signals come in. All search engines incorporate them, but none has added as many or made use of them as skillfully as Google has. PageRank itself is a signal, an attribute of a Web page (in this case, its importance relative to the rest of the Web) that can be used to help determine relevance. Some of the signals now seem obvious. Early on, Google’s algorithm gave special consideration to the title on a Web page — clearly an important signal for determining relevance. Another key technique exploited anchor text, the words that make up the actual hyperlink connecting one page to another . . . Later signals included attributes like freshness (for certain queries, pages created more recently may be more valuable than older ones) and location (Google knows the rough geographic coordinates of searchers and favors local results). The search engine currently uses more than 200 signals to help rank its results.
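PageRank, the signal named above, can be sketched in a few lines. This is a minimal power-iteration version on an invented three-page link graph, not the production algorithm, which operates at web scale with many refinements:

```python
# Minimal PageRank sketch: a page's importance is the chance that a
# "random surfer" following links (with occasional random jumps) lands on it.

links = {            # page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)  # split rank among outlinks
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

ranks = pagerank(links)
# "c" is linked to by both "a" and "b", so it ends up ranked highest.
```

The point of the sketch is that importance is recursive: a page matters because pages that matter link to it. The other 200-odd signals (freshness, location, anchor text) are then blended with this score.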
A search engine needs to determine the best (most relevant) pages to return in response to a query. It has to “know” all the possible elements, and then it has to interpret contextual clues. To do this, the search engine draws on information about the behavior of various entities:
The data people generate when they search — what results they click on, what words they replace in the query when they’re unsatisfied, how their queries match with their physical locations — turns out to be an invaluable resource in discovering new signals and improving the relevance of results. . . . Google has used its huge mass of collected data to bolster its algorithm with an amazingly deep knowledge base that helps interpret the complex intent of cryptic queries.
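One way to picture click data becoming a signal, as the excerpt describes: blend a baseline text-matching score with the observed click-through rate for each result. This is a hypothetical sketch — the page names, scores, rates, and the blending weight are all invented for illustration:

```python
# Hypothetical sketch: aggregated click data as a relevance signal.

# For a given query: fraction of impressions that drew a click (invented).
click_rate = {"page_a": 0.02, "page_b": 0.31, "page_c": 0.11}

# Baseline relevance scores from the text-matching stage (invented).
base_score = {"page_a": 0.9, "page_b": 0.6, "page_c": 0.7}

def rerank(pages, weight=0.6):
    """Blend the baseline score with the click-through signal."""
    combined = {p: (1 - weight) * base_score[p] + weight * click_rate[p]
                for p in pages}
    return sorted(pages, key=combined.get, reverse=True)

# page_b's strong click-through lifts it above page_a's higher text score.
print(rerank(["page_a", "page_b", "page_c"]))
# -> ['page_b', 'page_a', 'page_c']
```

Real systems are far more guarded (clicks are noisy and position-biased), but the principle is the same: user behavior feeds back into the ranking.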
The algorithms have to incorporate a lot of general and specific rules about culture and language:
Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches.
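The Wittgensteinian idea — that a word's meaning lives in its contexts — reduces to something concrete: characterize each word by the words found near it, and compare those context sets. A crude sketch, with an invented four-sentence corpus standing in for billions of documents:

```python
# Context-based word similarity: words are characterized by the words
# that co-occur with them. Corpus and similarity measure are invented.
from collections import Counter

corpus = [
    "hot dog with mustard and bread at the baseball game",
    "hot dog stand selling mustard and bread",
    "boiling water is hot on the stove",
    "the puppy is a young dog playing with a ball",
]

def context_counts(word):
    """Count words that co-occur in the same sentence as `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def overlap(word_a, word_b):
    """Crude similarity: number of shared context words."""
    return len(set(context_counts(word_a)) & set(context_counts(word_b)))

# "dog" shares food-related context with "mustard", not with "boiling",
# so "hot dog" gets tied to bread and mustard rather than boiled puppies.
print(overlap("dog", "mustard") > overlap("dog", "boiling"))  # True
```

Modern systems use far better statistics than raw overlap, but the disambiguation mechanism in the excerpt is exactly this: contexts, not dictionary definitions.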
The algorithms, and the search engine itself, must undergo constant testing and improvement:
Google employs hundreds of people around the world to sit at their home computer and judge results for various queries, marking whether the tweaks return better or worse results than before. But Google also has a larger army of testers — its billions of users, virtually all of whom are unwittingly participating in its constant quality experiments. Every time engineers want to test a tweak, they run the new algorithm on a tiny percentage of random users, letting the rest of the site’s searchers serve as a massive control group. There are so many changes to measure that Google has discarded the traditional scientific nostrum that only one experiment should be conducted at a time. “On most Google queries, you’re actually in multiple control or experimental groups simultaneously,” says search quality engineer Patrick Riley. Then he corrects himself. “Essentially,” he says, “all the queries are involved in some test.”
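How can one query sit "in multiple control or experimental groups simultaneously"? One standard way to get that behavior (a sketch under assumed details — the experiment names, percentages, and hashing scheme are invented, not Google's) is for each experiment to draw its own independent sample by hashing the user ID:

```python
# Sketch: each experiment hashes the user ID independently into its own
# buckets, so a user can be in zero, one, or several treatments at once.
import hashlib

EXPERIMENTS = {
    "ranking_tweak_1": 0.01,   # 1% of queries see this change
    "snippet_length":  0.05,
    "freshness_boost": 0.02,
}

def bucket(user_id, experiment):
    """Deterministically map (user, experiment) to a number in [0, 1)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def active_experiments(user_id):
    """Experiments whose treatment group this user falls into."""
    return [name for name, fraction in EXPERIMENTS.items()
            if bucket(user_id, name) < fraction]
```

Because each experiment samples independently, the overlaps average out across the control groups, which is what lets many tweaks be measured at once instead of one at a time.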
Success, however, is not measured by objective criteria pertaining to the content of the resulting pages. It is measured by the behavior of searchers: whether or not they are “satisfied.”
From an evolutionary perspective, the “search engine” corresponds to whatever drives organisms to evolve, and the text strings it searches correspond to genetic codes. Applying this analogy to evolution requires multiple methods for organisms to search for and find relevant genetic codes. It also requires all the genetic codes to be “indexed,” and it requires some way for the evolutionary mechanism itself to adapt. The ideas are intriguing, but they seem impossible unless you propose some type of universal interconnectedness and guiding principles.
Some type of intelligent design or creationist theory might suffice for this. If you can’t accept the idea of a transcendental creator, then you have to propose a pantheistic or panentheistic system. This is where the naturalistic, atheistic evolutionary hypothesis breaks down. It simply cannot handle the extreme complexity that is observed in the real world, especially if it is limited to “random mutation” and “natural selection” as the only mechanisms. It’s a really myopic, dogmatic, and atomistic nineteenth-century paradigm that has very little explanatory power or usefulness.
My simplistic analogy is not meant to “disprove” evolutionary theory; it is simply meant to illustrate how inadequate and ignorant such a theory appears from a modern information-centered viewpoint. Of course, nineteenth-century creationism is equally inadequate. That’s just one reason why I think the contemporary “Darwinism vs. Creationism” debate is pointless.