Paul Ryan - GoTo

Paul Ryan was a Unix kernel hacker before he joined GoTo in April of 1999.

Agenda of the Presentation

  1. Search engines
    1. Origins of search engines
    2. How search engines work
    3. Who is the biggest
    4. Why GoTo is the “coolest”
  2. What is needed to have the second largest search engine
    1. Architecture, Infrastructure
    2. Performance
    3. Operations
  3. What kind of, and how many people are needed for a successful business?
  4. Where is the ‘net going? What will happen to search engines in the future?
  1. The Origins of Search Engines
    • ’90 Archie - ftp (File Transfer Protocol) indexing and retrieval
    • ’92 Gopher - a document network (non-FTP); there is a movement to bring this back
    • ’92-’93 - Bots start to surface
      • WWW Wanderer (Wandex) - indexed servers first, then URLs
      • Aliweb - indexes web like Archie
    • ‘93+ Spiders
      • WWW Worm
      • (Excite) Architext from Stanford
  2. How search engines work

Problems with spiders

Spiders generate a lot of data without the intelligence to map pages to a topic space.
     e.g. for a page containing “China”: is the person searching for the country? Plates? A person?

Problems with this today include the spamming of engines.

One solution to this is a searchable directory with human-crafted hierarchies.

Examples include:

  • Tradewave Galaxy 1/94
  • Yahoo! 4/94 Created by Filo and Yang of Stanford

Metasearchers

These spread out the searches to several engines and collate the responses into one result page that is hopefully of better quality than the individual result pages.

Examples:

  • Metacrawler
  • Search.com
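
A metasearcher of this kind can be sketched in a few lines. The engine backends below are hypothetical stubs standing in for real HTTP calls to the underlying engines; the point is the collation step, which merges several result lists into one deduplicated page.

```python
# Minimal metasearcher sketch: fan a query out to several engines and
# collate the responses into one deduplicated, rank-merged result list.
# engine_a/engine_b are made-up stubs, not real search APIs.

def engine_a(query):
    return ["http://a.example/1", "http://shared.example/x"]

def engine_b(query):
    return ["http://shared.example/x", "http://b.example/2"]

def metasearch(query, engines):
    # Interleave results round-robin so each engine's top hits surface
    # early, and drop URLs already seen (the collation step).
    seen, merged = set(), []
    columns = [engine(query) for engine in engines]
    for rank in range(max(len(c) for c in columns)):
        for column in columns:
            if rank < len(column) and column[rank] not in seen:
                seen.add(column[rank])
                merged.append(column[rank])
    return merged

results = metasearch("china", [engine_a, engine_b])
```

The round-robin merge is only one possible collation policy; real metasearchers also re-score results using each engine's relevance data.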

The Crawler-based Search Engines

Examples:

  • Lycos (7/94) - the wolf spider
  • Infoseek (4/94)
  • Altavista (12/95)
  • Inktomi (Slurp) - HotBot (5/96) - (name based on the Plains Indians' spider myth)

The Directory/Editorial-based Search Engines

Examples:

  • Yahoo! (4/94)
  • LookSmart (5/95)
  • Snap.com
  • ODP (NewHoo) -- dmoz (1/98)
  • Ask Jeeves (4/97)
  • GoTo (6/98)

Crawlers start with a list of URLs, and then apply the following algorithm to each URL:

  1. Get the HTML page
  2. Follow the links on the page

They often check two additional sources of information on the sites as well.

  • META tags - directives embedded in an HTML page that can give instructions to robots

  • Robots.txt files - a file at the root of a site's directory hierarchy that controls where crawlers can crawl

    Example:

         # /robots.txt file for http://goto.com/
         # disallow all robots from crawling GoTo
         User-agent: *
         Disallow: /
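
The two-step crawl loop above, combined with the robots.txt check, can be sketched using Python's standard urllib.robotparser. Fetching and link extraction are stubbed out (a real crawler would make HTTP requests and parse HTML); the robots.txt lines are the GoTo example shown here.

```python
from collections import deque
from urllib import robotparser

# Sketch of the crawl loop: start from seed URLs, visit each page,
# follow its links, and honor /robots.txt before fetching anything.
ROBOTS_TXT = [
    "# /robots.txt file for http://goto.com/",
    "User-agent: *",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT)

def crawl(seed_urls, fetch_links, limit=100):
    queue, visited = deque(seed_urls), set()
    while queue and len(visited) < limit:
        url = queue.popleft()
        if url in visited or not rp.can_fetch("*", url):
            continue  # skip already-seen or disallowed pages
        visited.add(url)                 # step 1: get the page
        queue.extend(fetch_links(url))   # step 2: follow its links
    return visited
```

With the Disallow: / rule in effect, `rp.can_fetch("*", ...)` returns False for every URL on the site, so a polite crawler visits nothing.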

Inktomi is an architecture-only company; other companies pay for feeds of results. Companies that take advantage of its services include Yahoo! (whose web results are now powered by Google), HotBot, and many others. Inktomi is mostly used for fall-through placement: the search engines use their own results first (usually bidded or paid inclusion) and the Inktomi results after those.

Google was started for fun by Sergey Brin and Larry Page, and the company is currently searching for a viable revenue model. It powers Yahoo!, Virgin.net, and others.

Inktomi's “Slurp” crawler follows this algorithm:

  • Starts with the submitted URLs
  • Searches for various things within those pages, such as: Page title, Description META tag, Keyword META tag, and text blurbs in the document (not images though)
    Caveats
    • It ignores frames
    • Will look for spamming and will drop the page if spamming exists
    • Uses a 4 week incremental cycle
    • Many indices are created for different customers, depending on each customer's specifications.

Google uses the Backrub/Googlebot crawler.

This crawler uses an interesting algorithm to determine page rank.

  1. It ranks pages higher when many other pages link to them; the higher the ranking of the pages that link to a page, the higher that page's ranking.
  2. Also, the more links on a page, the smaller the ranking boost each of those links confers. This differs from most other crawlers, which use word counts to rank pages.
  3. Caches all pages into a huge database
    (some major issues with this include: copyright infringement of the pages, and the changing content on the pages, etc…)
  4. It also uses certain tweaks for ranking of pages
    • Domain tweaks (.edu,.org,.gov)
    • Bias against large pages
    • Bias against dynamic pages
    See www.searchengineworld.com/google for more on the original design.
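
    The link-based ranking idea in points 1 and 2 can be sketched as a small power-iteration PageRank. The graph and damping factor below are illustrative, not Google's actual values.

```python
# Minimal PageRank sketch: a page ranks higher when highly ranked pages
# link to it, and each link's contribution is diluted by the number of
# links on the linking page (more links => smaller boost per link).

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

# Toy link graph: a -> b -> c -> a (a symmetric cycle).
graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

    In this symmetric cycle every page ends up with equal rank; breaking the symmetry (e.g. adding a second link into one page) immediately raises that page's score.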

    Who is the biggest

    • Yahoo! 100 million
    • Alta Vista 50 million
    • Google 50 million
    • Inktomi 40 million
    • Everyone else 10 million or less

    This matches the “bow tie” model of the web. The central section represents 120 million visitors, the sides around 100 million visitors each, and tendrils lead off into the Internet with smaller numbers of visitors.

    Search engine revenue models:

    • CPM (cost per thousand) models
      • Banner ads
      • Hybrid text ads
      • AdWords (Google)
    • Pay to get into directory (pay more, get better ranking)
      • LookSmart
      • Yahoo!
    • CPC (Cost per Click)
      • Goto
      • Inktomi

    How Does GoTo Work?

    The basic business model of GoTo is to provide the medium for textual advertisements. Advertisers provide GoTo with information for listings. GoTo charges the advertisers for each click in the search listings.

    Basically, GoTo is a giant ad auction. The top placement in a search goes to the highest bidder, and that can change in real time (i.e. if an advertiser bids more, he sees an immediate rise in his ranking). The average bid is $0.17 per click.
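
    The auction can be sketched as follows. Advertiser names and bid amounts are made up; the point is that ordering is purely by bid and a new bid reorders the listings immediately.

```python
# Sketch of the real-time ad auction: listings for a search term are
# ordered by bid price, and raising a bid immediately changes placement.

def ranked_listings(bids):
    # Highest bid wins top placement; ties are broken by advertiser
    # name here purely to make the ordering deterministic.
    return sorted(bids.items(), key=lambda kv: (-kv[1], kv[0]))

bids = {"flowers-r-us": 0.17, "bloomco": 0.15}
top = ranked_listings(bids)[0][0]        # flowers-r-us currently leads

bids["bloomco"] = 0.25                   # real-time bid increase
top_after = ranked_listings(bids)[0][0]  # bloomco now holds top placement
```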

    Each bidder's site goes to a group of editors who decide whether the link is relevant and well-behaved (working back buttons, no pop-ups, etc.)

    GoTo serves some searches from its own site (5%), but the majority of searches come through GoTo’s partner sites (Microsoft, MetaCrawler, Alta Vista, AOL, Netscape, CNET, etc.)

    Since GoTo makes money when users search (and click), GoTo pays sites to include GoTo listings.

    Scale of operations:

    • Search Volume - 70mm+/day, capacity for 210mm/day
    • 300mm impressions/day
    • 10mm clicks/day - comparable to a medium/large phone company
    • 6mm+ search listings
    • 40,000+ advertisers

    3 of the top 10 searches (30%) are for other search engines; “Yahoo!” is among the top 6 search terms on all search engines.

    As was stated before, 5% of GoTo’s hits are on its own site; the other 95% are in the form of an XML feed to the partner sites.

    Yahoo! has 18,000 advertisers whereas GoTo has 1.2 billion, because GoTo serves the small guys who want very specific hits.

    Yahoo! has 100 editors building its directory, while GoTo has 100 editors checking over the advertisers.

    The GoTo site basically breaks into 3 parts:

    • Search serving systems
    • Advertiser Management Systems
    • Event Tracking, Fraud Detecting, Data Reporting

    GoTo’s systems seem deceptively simple. Advertisers provide the content in the form of search listings, the content is ordered by bid price, and advertisers are charged for resulting clicks. The complexity of these systems comes from the scale of the problem (number of advertisers, search listings, searches per day, etc.) and from some non-obvious complications (e.g. fraud detection).

    Despite this apparent simplicity, there are several challenges:

    1. High Availability - Noah’s Ark Approach (everything goes in 2s) - no single point of failure.
      • Load Balancers
      • State Migration
    2. Scalability - No architectural changes should be needed to add or subtract serving capability.
    3. Extensibility - can add search features incrementally.
    4. Distributed content - Multiple sites currently serving all partners.

    Advertiser Management Systems

    DirecTraffic Center

    Manage balance, report activity, real-time bid changes, add/modify/delete search listings

    Account Monitoring

    The real ‘special sauce’: it handles the business transactions for the system by processing inputs to control accounts. It includes functionality such as automatic credit card billing for additional impressions, and cutting off an account's listings in the search engine when the account is out of money.

    Editorial Processing

    GoTo is in fact a publishing business.

    GoTo employs 100 editors with a workflow of 50,000-100,000 work orders per month. These editors have to review all listings (with some automated help) using an EJB-backed desktop application (written in Swing).

    Fraud Detection and Reporting

    This system decides what should be considered clicks. It is fed by LWES (the Light Weight Event System), a front-end system that throws UDP multicast-based events. These events include searches, clicks (redirects), and navigation; they are composed of key/value pairs and are captured by separate journaling systems (in pairs).

    Why is UDP Used?

    According to Mr. Ryan, UDP is much faster and uses fewer resources than an equivalent TCP connection, and a stateful connection is not needed when using switches. Plus, it makes adding additional front-end systems extremely simple: plug one in, it starts sending out packets, and they start getting picked up.
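
    The fire-and-forget event idea can be sketched with plain UDP sockets. This does not reproduce the actual LWES wire format or its multicast group; it just shows a connectionless key/value event sent to a journaling listener, here on localhost with a JSON encoding chosen for the example.

```python
import json
import socket

# Sketch of UDP-based eventing: a front-end emits key/value events
# (searches, clicks, navigation) with no connection and no ack, and a
# separate journaling process picks them up off the wire.

def emit_event(sock, addr, event_type, **pairs):
    payload = json.dumps({"type": event_type, **pairs}).encode()
    sock.sendto(payload, addr)  # connectionless: cheap to send, no state
    return payload

# A stand-in journaling listener on a free localhost port.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
recv.settimeout(5)
send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

emit_event(send, recv.getsockname(), "click", term="china", listing="123")
data, _ = recv.recvfrom(4096)
event = json.loads(data)
```

    Because the sender keeps no connection state, adding another front-end really is just a matter of having it start emitting packets, which matches the plug-it-in description above.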

    Fraud Detection

    Result clicks = the clicks for which advertisers get charged.

    Clicks go to the fraud detection system, a patent-pending system that monitors behavior on GoTo's web site to detect potentially fraudulent activity. It analyzes all of the clicks on the site (millions of transactions daily) and decides whether each is malicious or benign, performing sophisticated rule-based and statistically derived event filtering.

    GoTo employs an 8 person Fraud Squad consisting of developers and analysts who constantly monitor and improve the fraud detection techniques and tools, and manage the issue treatment and resolution processes.

    Errors

    There are several techniques used to determine fraud. But before discussing the types of fraud, we should first decide what types of errors can be present in the system.

    Some errors are inadvertent accidents, such as a person clicking the wrong link and immediately correcting him/herself, Mac users double-clicking a link (which tallies 2 hits instead of one), spiders crawling through a large number of links, or advertisers checking their own listings.

    Malicious errors

    E.g. stockholders clicking through links trying to make revenue for the company (which is counter-productive because it drives down the bid price of the links), advertisers clicking through the links of their competitors, or bored crackers playing with the system.

    Fraud Detection

    Fraud is detected using constantly evolving, sophisticated filters that go through the events generated within the system, looking for suspicious activity.

    Two different types of filters are used. The first is deterministic: rule-based filters that cover user sessions, IP addresses, and search terms, and catch the blatant abuses (repetitive clicking, repetitive searching, “speed” clicking).

    The second type is probabilistic: behavior-pattern-based filters that discard clicks grouped together in a non-typical way, finding suspicious patterns in the system's click-throughs.
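
    A deterministic filter of the kind described can be sketched as a simple rule over (IP, listing) pairs. The threshold and the click records are invented for illustration; GoTo's actual rules are not public.

```python
from collections import Counter

# Rule-based (deterministic) click filter: flag blatant repetitive
# clicking on the same listing from the same IP. The threshold is a
# hypothetical value, not a real GoTo parameter.
REPEAT_THRESHOLD = 3

def filter_clicks(clicks):
    counts = Counter((c["ip"], c["listing"]) for c in clicks)
    valid, suspect = [], []
    for c in clicks:
        if counts[(c["ip"], c["listing"])] > REPEAT_THRESHOLD:
            suspect.append(c)   # blatant repetition: do not bill these
        else:
            valid.append(c)     # advertisers are charged for these
    return valid, suspect

clicks = [{"ip": "10.0.0.1", "listing": "42"}] * 5 + [
    {"ip": "10.0.0.2", "listing": "42"},
]
valid, suspect = filter_clicks(clicks)
```

    The probabilistic filters described above would sit behind rules like this one, scoring the clicks that pass the blatant-abuse checks against historical behavior patterns.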

    This feat is accomplished in real time thanks to a backend network of about 30 simple computers (mostly just cheap x86 Linux boxes). These computers do simple calculations and aggregations while working together; a control and processing language is used to describe the calculations and process the data on the component machines.

    Click events are scored by placing each click into various “buckets” of validity created from historical patterns of behavior on the site (based on both valid and fraudulent inputs). Most companies do simple work here, but GoTo does complex pattern-matching. This results in a loss of income, but clients are happier and will pay more because they know they are being charged for real hits, with fewer bogus (fraudulent and accidental) hits.

    GoTo Servers

    A combination of Netscape Enterprise Server and Apache/mod_perl. For the highest-volume sites, Apache/mod_perl is used because it is much quicker. For the most content-rich sites, however, JSPs are used. These are much slower, because their garbage collection takes up 6 seconds out of every minute, i.e. 1/10 of the serving time is spent doing garbage collection.

    • Search Serving Platforms
      • 100+ Sun e420R, 450mhz (4), 4GB
      • ATG/Dynamo/Java, and Apache/mod_perl
      • Gigabit site backbone
      • InterNAP
      • Multiple (3) co-location facilities
      • Search serving feeds include HTML and XML all through HTTP (1.0 or 1.1)
      • Global Load Balancing (Arrowpoint)
      • Distributed content caching (Akamai)
    • Backend Platforms
      • Data repository (16TB) for search and click events - several (4) e4500 Sun/Oracle 8i machines connected to an MTI SAN
      • Fraud detection through an array of (3) Intel/Linux machines, utilizing custom detection systems.
      • CRM via Silknet (NT/2000)
      • N-tier application backbone via EJB (Weblogic) servers - application integration all through XML
      • Complete DR site for fast recovery

    GoTo has facilities in multiple locations: three search-serving sites in Sunnyvale, CA, Reston, VA, and Dublin, Ireland; offices in Pasadena, San Mateo, Raleigh-Durham, and London; a Development and Test site in Burbank; and a new Backend Processing site in Las Vegas.

    Conclusion

    • The battle for fastest service has played out well for GoTo so far, though there is still room for improvement. In tests of the respective search engines' speed, Yahoo! is by far the fastest, followed by AltaVista.
    • According to Mr. Ryan, the future is not pretty for the stickiness model. Companies will need to offer enough features that people keep visiting the site and seeing the ads. He also sees many waves of consolidation in the near future. Lastly, he predicts that search engines are dead: the general concept lacks a revenue model, which is needed for any successful company.
    • Search → Portal → ?
    • What are the ranking methods of the future?
    • IBM Clever may be a viable competitor, as may Google’s method.

    15% of GoTo’s advertisers are for adult content. Of the 100 editors employed by GoTo, only 20 can check the adult sites for truthful advertising and the like. The smartest advertisers change their bids within the day, making sure their name comes up on top when they most want it to.

    Useful Links