Search Log Analysis

Learning Notes for CMU Course 11-642: Search Engines

Posted by haohanz May 02, 2019 · Stay Hungry · Stay Foolish

Why Search Log Analysis

  • Query frequency follows a power law (same shape as Zipf’s law); a fitting sketch appears after this list: \begin{equation} Frequency(q) = K \times Rank(q)^{-\alpha} \end{equation}
    • Where K is a positive constant
    • α is about 2.4 for the Excite query log
    • Most unique queries occur very rarely; a small percentage of the unique queries are very common
    • The overlap among frequent queries decreases as the time window grows
    • About 50% of the unique queries (20% of the total query volume) have never been seen before
  • Web search behavior -> 3 dimensions
    • What
    • Who
    • How - session characteristics
      • Features:
        • The variation of the documents you click on
        • % of queries with low/high click entropy (click entropy measures the variation of the clicked documents)
        • The session length and queries per session
  • “Who”
    • Representing users:
      • Cluster users based on their queries
      • First, build a pseudo-document for each user; it contains the categories of the top 10 results for all the queries in their search log
      • Use Jensen-Shannon divergence or cosine similarity to find similar users (a JSD sketch appears after this list)
      • The categories come from the Yahoo Directory
    • Informational Users (how)
      • Less single-click sessions
      • More Non-navigational queries
      • Query suggestions are used more
      • Well-educated
      • Above-average income
    • Navigational Users (how)
      • Queries like Facebook and YouTube; mostly representative of the entire population
      • More single-click
      • More navigational
      • Fewer query suggestions used
    • Transactional Users (how)
      • Low income and education
      • Diverse clicks
      • Less interaction
      • Has overlap with navigational users
      • Search for shopping, adult content, gaming
    • Highly educated users
      • Use query suggestions more but click on them less
      • Tend to have shorter sessions
      • Submit more tail queries (more precise)
    • => Query log analysis is useful for designing personalization strategies
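
A minimal sketch of fitting the power-law exponent above: rank unique queries by descending frequency, then run least-squares regression in log-log space, where the slope recovers -α. The toy log and numbers are made up for illustration.

```python
import numpy as np
from collections import Counter

# Toy query log; a real log would have millions of lines.
query_log = ["facebook", "facebook", "facebook", "youtube", "youtube",
             "youtube", "youtube", "weather", "weather", "cmu 11-642"]

# Rank unique queries by descending frequency.
counts = sorted(Counter(query_log).values(), reverse=True)
ranks = np.arange(1, len(counts) + 1)

# Frequency(q) = K * Rank(q)^(-alpha)  =>  log f = log K - alpha * log r,
# so the slope of a least-squares line in log-log space is -alpha.
slope, log_k = np.polyfit(np.log(ranks), np.log(counts), deg=1)
print(f"alpha ~= {-slope:.2f}, K ~= {np.exp(log_k):.2f}")
```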
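
And a minimal sketch of the Jensen-Shannon divergence used to compare users, assuming each user is already represented as a distribution over categories; the distributions below are hypothetical stand-ins for Yahoo Directory category counts.

```python
import numpy as np

def jsd(p, q):
    """JSD(p, q) = 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p+q)/2.
    Symmetric, and bounded by log 2 when using the natural log."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        nz = a > 0  # by convention, 0 * log(0/x) = 0
        return np.sum(a[nz] * np.log(a[nz] / b[nz]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical per-user category distributions over (News, Sports, Shopping).
user_a = [0.7, 0.2, 0.1]
user_b = [0.1, 0.3, 0.6]
print(jsd(user_a, user_b))  # lower value = more similar users
```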

How to Segment Search Logs into Sessions

  • Methods
    • Using time: if the gap between two queries is less than a threshold δ, they are in the same session (a segmentation sketch appears after this list)
    • Using term overlap (CT): if two queries share terms, they are in the same session
    • Using rewrite classes (RT): common reformulation patterns (add, delete, replace…)
    • Both CT and RT yield high precision but low recall
    • Other features to segment a log
      • The JSD of top 10 results of q1 and q2
      • The Yahoo page category overlap of the top 10 results of q1 and q2
      • The co-occurrence of queries in a query log
      • Etc.
  • Differentiate the goal and mission
    • Information need == goal == sub tasks
    • Mission == a set of related information needs, which is higher-level or an extended information need (e.g. how to go to PIT)
    • Boundary for mission and boundary for goal: the queries must be sequential!
    • Same task: the queries need not be sequential; a pair of separate queries may come from the same mission or goal
    • The time span of a mission is longer than that of a goal
    • Method of detecting “same task” for pairs of related queries - classification-based features
      • Temporal
        • Time delta; whether the two queries are sequential or not
      • Edit distance (the distance between two strings; a Levenshtein sketch appears after this list)
        • Character- and token-based metrics (e.g., CT, RT)
      • The occurrence of the <q1, q2> pair in a larger query log
      • The Web search results: the cosine similarity (or JSD) of the top 50 search results for each query
      • To decide if <q1, q2> are related queries (same for mission/goal), use two methods:
        • Heuristic (only time delta, CT, RT)
        • A classification-based approach - use the features mentioned above to do classification.
        • The classification approach is somewhat better, but the heuristic is not bad!
      • The “same task” task achieves higher accuracy than the adjacent-query “boundary” task, because “same task” has more negative samples.
      • Edit distance is very important, cosine distance is OK, and time alone is primitive, but the combination works well.
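
A minimal sketch of the time-gap heuristic, assuming a single user's log of (timestamp, query) pairs sorted by time; the 30-minute δ and the log entries are illustrative.

```python
from datetime import datetime, timedelta

def segment_sessions(log, delta=timedelta(minutes=30)):
    """Start a new session whenever the gap between two
    consecutive queries exceeds delta."""
    sessions, current = [], []
    prev_time = None
    for time, query in log:
        if prev_time is not None and time - prev_time > delta:
            sessions.append(current)
            current = []
        current.append(query)
        prev_time = time
    if current:
        sessions.append(current)
    return sessions

# Toy single-user log (timestamps are made up).
log = [(datetime(2019, 5, 2, 9, 0), "cmu"),
       (datetime(2019, 5, 2, 9, 5), "cmu 11-642"),
       (datetime(2019, 5, 2, 14, 0), "pittsburgh weather")]
print(segment_sessions(log))  # [['cmu', 'cmu 11-642'], ['pittsburgh weather']]
```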
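
And a minimal sketch of the character-level edit distance feature (standard dynamic-programming Levenshtein distance; a token-based variant would run the same recurrence over word lists instead of characters).

```python
def edit_distance(s1, s2):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("pittsburgh wether", "pittsburgh weather"))  # 1
```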

Query Suggestions

  • Source
    • Pseudo documents
      • For many queries - almost any query can use this method
    • Co-occurrence statistics
      • For reasonably frequent queries - works for mid-frequency and common queries
  • Pseudo documents
    • Build pseudo-documents from sessions, treating the last query of a session as the successful query
    • Use the incoming query to search the corpus of pseudo-documents
  • Co-occurrence Statistics
    • Get Set1 and Set2
      • Set1: the most frequent queries with q as a substring
      • Set2: the most frequent queries that follow q
    • Score each query in the union of Set1 and Set2 (a scoring sketch follows this list): \begin{equation} Score(q_s) = \frac{count(q_s) + \lambda_1}{N_1 + \lambda_1} \times \frac{count_{follows}(q, q_s) + \lambda_2}{N_2 + \lambda_2} \end{equation}
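
A minimal sketch of the smoothed score above, assuming the counts are precomputed from a query log; the λ values, counts, N1, and N2 below are made up (N1 and N2 are the normalizing totals for the frequency and follows statistics).

```python
def suggestion_score(count_qs, count_follows, n1, n2, lam1=1.0, lam2=1.0):
    """Score(q_s) = (count(q_s)+λ1)/(N1+λ1) * (count_follows(q,q_s)+λ2)/(N2+λ2).
    The λ terms smooth the estimates so rare candidates are not scored zero."""
    popularity = (count_qs + lam1) / (n1 + lam1)
    follows = (count_follows + lam2) / (n2 + lam2)
    return popularity * follows

# Hypothetical counts for a candidate q_s given q: q_s occurs 500 times in the
# log (N1 = 10000), and follows q in 40 of the N2 = 200 observed successors.
print(suggestion_score(count_qs=500, count_follows=40, n1=10000, n2=200))
```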

Query Intents

  • Find the most popular intent of a query
    • Steps
      • Expand
      • Filter to improve precision
      • Cluster - e.g., using text-based distances/similarities
      • Random walk to estimate popularity (a sketch appears after this list)
      • Name → assign the name of the query with the highest weight from the random walk
    • Reformulation patterns and click information can be combined to identify the common intents of an ambiguous query.
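
A minimal sketch of estimating popularity with a random walk with restart over a weighted graph of candidate intent queries; the graph and its edge weights are made up (in practice they could come from reformulation and click co-occurrence statistics).

```python
import numpy as np

def random_walk_weights(adjacency, restart=0.15, iters=100):
    """Stationary distribution of a random walk with restart over a
    weighted graph; more probability mass = a more popular node.
    Assumes every node has at least one edge."""
    a = np.asarray(adjacency, dtype=float)
    transition = a / a.sum(axis=0, keepdims=True)  # column-stochastic
    n = a.shape[0]
    w = np.full(n, 1.0 / n)
    for _ in range(iters):
        w = restart / n + (1 - restart) * transition @ w
    return w

# Toy graph over 3 candidate intent queries (weights are made up).
adj = [[0, 5, 1],
       [5, 0, 2],
       [1, 2, 0]]
print(random_walk_weights(adj))  # name the intent after the highest-weight query
```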

Click Models

  • Models
    • DCTR (document-based click-through rate)
    • PBM (position-based model)
    • RCM (random click model)
    • RCTR (rank-based click-through rate)
    • CM (cascade model)
  • Use of click models
    • Generate artificial clicks
    • Generate artificial relevance estimates r, e.g., from DCTR (a sketch follows this list)
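
A minimal sketch of the DCTR idea: estimate the relevance of a (query, document) pair as its empirical click-through rate over SERP impressions. The log records below are made up.

```python
from collections import defaultdict

# Each record: (query, doc, clicked) from SERP impressions (made-up data).
log = [("cmu", "cmu.edu", True), ("cmu", "cmu.edu", True),
       ("cmu", "en.wikipedia.org/wiki/CMU", False), ("cmu", "cmu.edu", False)]

impressions = defaultdict(int)
clicks = defaultdict(int)
for query, doc, clicked in log:
    impressions[(query, doc)] += 1
    clicks[(query, doc)] += clicked

# DCTR: P(click | q, d) ~= clicks / impressions, usable as artificial relevance.
for pair in impressions:
    print(pair, clicks[pair] / impressions[pair])
```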

Within-Session Learning

  • Key idea: procedural vs. declarative
  • Features for user expertise and behavior
    • Display time (short)
    • Query complexity (high)
    • Entropy (fewer topics means lower entropy; an entropy sketch appears after this list)
    • Focus (category)
    • Domain count (number of unique domains in the SERPs)
  • Results over sessions:
    • Domain count increases over a session; declarative sessions have a higher domain count than procedural ones
    • For both domain count and query complexity: declarative > procedural
    • Features to identify whether a page is a declarative page
      • Text length, sentence length, term length
      • Coverage of query terms in the title, distribution of query terms
      • Part-of-speech (POS) distribution
      • Page complexity, query complexity, reading difficulty metrics
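
A minimal sketch of the entropy feature: Shannon entropy over the empirical distribution of topics (the same computation applies to click entropy over clicked documents). The topic labels are made up.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy H = -sum(p * log2 p) of the empirical
    distribution; fewer distinct items means lower entropy."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Made-up topic labels of pages visited in two sessions.
focused = ["Java", "Java", "Java", "Python"]        # low entropy (~0.81)
diverse = ["Java", "Cooking", "Sports", "Travel"]   # high entropy (2.0)
print(entropy(focused), entropy(diverse))
```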