Why Apply Deep Learning in Search Engines?
More features, more complex combinations of weights and features, and fewer hand-crafted features and ranking functions.
DSSM
Motivation
- Representation: solve the query-document vocabulary mismatch by representing text in a concept-based space
- Loss:
\begin{equation}
L(\Lambda) = -\log \prod_{(Q,\,D^+)} \Pr(D^+ \mid Q)
\end{equation}
Where
\begin{equation}
\Pr(D \mid Q) = \frac{\exp(\lambda R(Q, D))}{\sum_{D' \in \mathbf{D}} \exp(\lambda R(Q, D'))}
\end{equation}
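A minimal sketch of how this posterior and loss can be computed, assuming the 128-dimensional query and document vectors produced by the projection layers are already available; the function names, the smoothing value, and the number of sampled negatives are illustrative, not taken from the paper's code:

```python
import numpy as np

def cosine(q_vec, d_vec):
    """R(Q, D): cosine similarity between projected query and document vectors."""
    return float(np.dot(q_vec, d_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

def dssm_loss(q_vec, pos_doc_vec, neg_doc_vecs, smoothing=10.0):
    """-log Pr(D+ | Q) for one query: a softmax over the clicked document and
    the sampled non-clicked documents, with `smoothing` playing the role of lambda."""
    docs = [pos_doc_vec] + list(neg_doc_vecs)
    scores = np.array([smoothing * cosine(q_vec, d) for d in docs])
    # log softmax of the clicked document (index 0)
    log_prob_pos = scores[0] - np.log(np.sum(np.exp(scores)))
    return -log_prob_pos

# Toy usage with random 128-dimensional semantic vectors
rng = np.random.default_rng(0)
query = rng.normal(size=128)
clicked = rng.normal(size=128)
non_clicked = [rng.normal(size=128) for _ in range(4)]  # e.g. 4 sampled negatives
print(dssm_loss(query, clicked, non_clicked))
```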
- Word hashing: map the 500k-term vocabulary to a 30k-dimensional letter-trigram vector (see the hashing sketch at the end of this section)
- Robust to the out-of-vocabulary problem
- Low collision rate: 22 collisions out of 500k terms
- The hashed representation captures orthographic similarity (terms with similar spellings get similar continuous representations):
- Spelling variants
- Hyphenation
- Misspellings
- Case
- But it does not capture conceptual similarity
- After projection, the relevance score is the cosine similarity between the 128-dimensional query representation and the 128-dimensional document representation (both projected from raw tf vectors).
- How training data is selected:
- Use the query log: relevant documents are the clicked ones; non-relevant documents are sampled from the top N
- Use pairwise training
- Pointwise/listwise training is not used due to accuracy concerns
- Minimize the loss function: only the positive (clicked) samples appear in the objective
- Different from our target:
- With many pairs, it tends to give higher scores to the clicked documents.
- Drawbacks:
- It is better than BM25, but this is not surprising: supervised > unsupervised
- It only re-ranks the top 15 documents, which is not robust
- No idf or document length is considered, only tf. Perhaps these are not needed when re-ranking the top 15 from a stronger ranker, or when re-ranking titles only.
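A minimal sketch of the letter-trigram hashing mentioned in the word-hashing bullet above; the paper indexes a fixed dictionary of observed trigrams, whereas this sketch uses Python's built-in hash() as a stand-in, so the exact indices are illustrative:

```python
from collections import Counter

def letter_trigrams(term):
    """Split a term into letter trigrams, with '#' marking the word boundary:
    'good' -> ['#go', 'goo', 'ood', 'od#']."""
    padded = "#" + term.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def hash_term(term, dim=30_000):
    """Map a term to a sparse count vector over a fixed trigram space.
    hash() stands in for a fixed trigram-to-index lookup table."""
    vec = Counter()
    for tri in letter_trigrams(term):
        vec[hash(tri) % dim] += 1
    return vec

# Orthographically similar terms share most trigrams, so their vectors overlap:
print(letter_trigrams("misspell"))  # ['#mi', 'mis', 'iss', 'ssp', 'spe', 'pel', 'ell', 'll#']
print(letter_trigrams("mispell"))   # ['#mi', 'mis', 'isp', 'spe', 'pel', 'ell', 'll#']
```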
DRMM
Continuous Representation in Both DSSM and DRMM (word2vec): Why DRMM?
- Previous continuous-representation models ignore the strong “exact match” signal.
Key idea of DRMM & Steps
- First use word2vec to represent terms from the two sources: query terms and document terms are each mapped to pre-trained embeddings.
- Structure:
- -> Embedding layer
- -> Pyramid pooling (binning)
- -> Feed-forward matching to learn from the $(q_i, d_j)$ interactions
- -> Term gating using idf
- -> Sum the scores of all query terms (linear combination)
\begin{equation}
g_i = \frac{\exp(w \cdot \mathrm{idf}(q_i))}{\sum_{j=1}^{n} \exp(w \cdot \mathrm{idf}(q_j))}
\end{equation}
where $i$ indexes query term $q_i$ in query $q$ and $w$ is a learned gating weight.
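A minimal sketch of the binning and idf gating above, assuming pre-trained term embeddings; the bin count, the log-count histogram variant, and `term_scorer` (standing in for the trained feed-forward matching network) are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

def cosine_sims(q_emb, doc_embs):
    """Cosine similarity between one query-term embedding and every document-term embedding."""
    q = q_emb / np.linalg.norm(q_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return d @ q

def matching_histogram(sims, n_bins=30):
    """Bin the similarities over [-1, 1] and take log counts; the paper keeps a
    dedicated [1, 1] bin for exact matches, here they simply land in the top bin."""
    counts, _ = np.histogram(sims, bins=n_bins, range=(-1.0, 1.0))
    return np.log1p(counts)

def idf_gates(idfs, w=1.0):
    """Softmax gate g_i over the query terms, following the equation above."""
    z = np.exp(w * np.asarray(idfs, dtype=float))
    return z / z.sum()

def drmm_score(q_embs, q_idfs, doc_embs, term_scorer, w=1.0):
    """Gated sum over query terms of the per-term score; `term_scorer` stands in
    for the trained feed-forward matching network applied to each histogram."""
    gates = idf_gates(q_idfs, w)
    per_term = [term_scorer(matching_histogram(cosine_sims(q, doc_embs)))
                for q in q_embs]
    return float(np.dot(gates, per_term))
```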
- Training:
- Pairwise training: minimize the hinge loss
\begin{equation}
L(q, d^+, d^-) = \max(0,\; 1 - s(q, d^+) + s(q, d^-))
\end{equation}
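A minimal sketch of this loss, where `score_pos` and `score_neg` are the model's scores $s(q, d^+)$ and $s(q, d^-)$ for one training pair:

```python
def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    """max(0, 1 - s(q, d+) + s(q, d-)): zero once the relevant document
    outscores the non-relevant one by at least the margin."""
    return max(0.0, margin - score_pos + score_neg)

# e.g. s(q, d+) = 0.9, s(q, d-) = 0.4  ->  loss = 0.5
print(pairwise_hinge_loss(0.9, 0.4))
```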
- Evaluation:
- Better than RankSVM
- Should it also be compared to query expansion? It uses word2vec, which is trained on a large amount of external text
What are the similarities and differences between DRMM and older retrieval models?
- Same:
- Use tf, idf, log(tf)
- Can represent exact matches between query terms and document terms
- Use a bag-of-words model
- Sum the scores over query terms
- Diff:
- Use “bin” values for matches of different quality
- Exact and soft matches between query terms and document terms are both represented by continuous representations (word2vec)
- It learns how to combine evidence non-linearly during the feed-forward matching phase: a non-linear combination of match values of different quality (feed-forward matching after pyramid pooling)
- Interaction patterns for matching can be learned, e.g. hierarchical patterns via a CNN
K-NRM
Motivation
- Improve on DRMM: train all parts of the model end-to-end by replacing the pyramid pooling layer, which is not differentiable, with a differentiable kernel-pooling layer
- Use search log data: much larger than TREC data, but noisier
- Reuse as many parts of DRMM as possible
What's the same as DRMM?
- Text representation of query and docs
What's different from DRMM?
- Soft-count the term frequencies of soft matches (kernel pooling instead of pyramid pooling)
- Embeddings are tuned for ad-hoc search, which needs more training data
- Consistent training (end-to-end)
- No idf gating as in DRMM (not needed when only re-ranking the top results from a strong engine)
- Stronger soft-match features (more tolerant of noise)
- A simpler treatment of exact-match features: one exact-match score per (q, d) pair
Architecture of K-NRM:
- -> Embedding layer
- -> Translation layer: the $M_{n \times m}$ matrix of query-document term similarities
- -> Kernel pooling
- Each bin has a kernel centered at the bin's center; 11 bins -> 11 kernels
- Each query term $q_i$ gets a soft-TF score per kernel: the sum of the kernel values over all document terms whose similarity falls in that range
- Thus each $q_i$ has 11 scores after kernel pooling, just like DRMM's pyramid pooling output
- Log each length-11 vector and sum over all query terms, giving a 1×11 output (K = 11): one feature per kernel
- Why log-sum? It penalizes query terms with few matches
- -> Feed-forward matching (a 1-layer NN, trained pairwise)
- -> Final score
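A minimal sketch of the translation and kernel-pooling layers above; the eleven kernel means and widths follow the common setup of ten soft-match kernels plus one sharp near-exact-match kernel, and the embedding sizes in the usage example are arbitrary:

```python
import numpy as np

def translation_matrix(q_embs, d_embs):
    """M (n x m): cosine similarity between every query term and every document term."""
    q = q_embs / np.linalg.norm(q_embs, axis=1, keepdims=True)
    d = d_embs / np.linalg.norm(d_embs, axis=1, keepdims=True)
    return q @ d.T

def kernel_pooling(M, mus, sigmas):
    """Each RBF kernel softly counts the document terms whose similarity with a
    query term lies near its mean mu (soft-TF); the log of the per-term soft-TFs
    is then summed over query terms -> one feature per kernel."""
    k = np.exp(-(M[:, :, None] - mus) ** 2 / (2 * sigmas ** 2))  # (n, m, K)
    soft_tf = k.sum(axis=1)                                      # (n, K): per query term
    return np.log(np.clip(soft_tf, 1e-10, None)).sum(axis=0)     # (K,): log-sum over terms

# 11 kernels: 10 soft-match kernels spread over [-1, 1] plus one sharp
# exact-match kernel at mu = 1 with a tiny sigma
mus = np.array([-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0])
sigmas = np.array([0.1] * 10 + [1e-3])

rng = np.random.default_rng(0)
features = kernel_pooling(translation_matrix(rng.normal(size=(3, 50)),
                                             rng.normal(size=(20, 50))), mus, sigmas)
print(features.shape)  # (11,) -- one pooled feature per kernel for this (q, d) pair
```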
Training of K-NRM
- Exactly the same as DRMM (pairwise hinge loss)
- Conv-KNRM > K-NRM > DRMM > RankSVM > DSSM
Why is K-NRM better? What is the source of its effectiveness?
- Embeddings trained end-to-end for ad-hoc search
- Tuning the word embeddings changes the similarity of many term pairs (58%)
- Pre-trained word2vec gives pairs conceptual similarity
- After tuning, such words are de-coupled, e.g. cats vs. dogs
- Rare words are penalized by the log summation
- Soft-match patterns are discovered, e.g. (pdf, reader)
- The matching strengths are changed
- Kernel pooling is more effective than other pooling methods: it combines matches within a similarity-score range
Conv-KNRM
Motivation:
- To support matching between n-grams of different lengths.
The source of effectiveness for Conv-KNRM:
- (N-grams)
- A CNN to form n-grams
- Fewer parameters to train than discrete n-grams
- (Cross-matching n-grams of different lengths)
- Another use of soft matching
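A minimal sketch of the n-gram composition and cross-matching, assuming pre-trained term embeddings; the filter sizes, ReLU nonlinearity, and random weights are illustrative, and each resulting translation matrix would feed the same kernel pooling as in K-NRM:

```python
import numpy as np

def ngram_compose(term_embs, W, b):
    """1-D convolution that turns each window of h consecutive L-dim term
    embeddings into one F-dim n-gram embedding (W has shape (F, h * L))."""
    h = W.shape[1] // term_embs.shape[1]
    windows = np.stack([term_embs[i:i + h].reshape(-1)
                        for i in range(len(term_embs) - h + 1)])
    return np.maximum(0.0, windows @ W.T + b)  # ReLU over the convolved windows

def cross_match(q_ngrams, d_ngrams):
    """Cross-match n-grams of all lengths: one cosine translation matrix per
    (query n-gram length, document n-gram length) pair, each then kernel-pooled."""
    matrices = {}
    for hq, q in q_ngrams.items():
        for hd, d in d_ngrams.items():
            qn = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-10)
            dn = d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-10)
            matrices[(hq, hd)] = qn @ dn.T
    return matrices

# Toy usage: unigram (h=1) and bigram (h=2) filters over 50-dim term embeddings
rng = np.random.default_rng(0)
L, F = 50, 64
q_terms, d_terms = rng.normal(size=(3, L)), rng.normal(size=(20, L))
filters = {h: (rng.normal(size=(F, h * L)), np.zeros(F)) for h in (1, 2)}
q_ngrams = {h: ngram_compose(q_terms, W, b) for h, (W, b) in filters.items()}
d_ngrams = {h: ngram_compose(d_terms, W, b) for h, (W, b) in filters.items()}
print({k: v.shape for k, v in cross_match(q_ngrams, d_ngrams).items()})
# {(1, 1): (3, 20), (1, 2): (3, 19), (2, 1): (2, 20), (2, 2): (2, 19)}
```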
Representation-Based vs. Interaction-Based Models
- Similarity
- Identify local matches between two pieces of text - cosine similarity of term vectors
- Use continuous representation of text (embedding)
- Difference
- Interaction-based: learn interaction patterns for matching
- Often hierarchical patterns
- E.g. CNN
- Interaction-based: time complexity is high, since each query term has to be compared with each document term, so these models are usually used for re-ranking
- Interaction-based: consider both positive and negative examples via the hinge loss on pairwise training data, while the representation-based DSSM uses only the positive (clicked) ones, an objective that differs from our ranking target
- Interaction-based: integrate exact-match and soft-match signals
- Older systems have only discrete signals (classic IR) or only continuous ones (DSSM), not both
- The combination is effective and robust
- Interaction-based models require more training data; static embeddings need little data.
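A minimal sketch of the contrast, assuming term embeddings are already available; mean pooling here is only a stand-in for DSSM's learned projection layers:

```python
import numpy as np

def representation_score(q_embs, d_embs):
    """Representation-based: collapse each text to a single vector first
    (mean pooling stands in for DSSM's projection), then take one cosine."""
    q, d = q_embs.mean(axis=0), d_embs.mean(axis=0)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def interaction_matrix(q_embs, d_embs):
    """Interaction-based: keep every local (query term, document term) cosine;
    the O(|q| * |d|) comparisons are why these models are usually used for re-ranking."""
    q = q_embs / np.linalg.norm(q_embs, axis=1, keepdims=True)
    d = d_embs / np.linalg.norm(d_embs, axis=1, keepdims=True)
    return q @ d.T
```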