Let’s go through the Explain API with a set of example documents. In my case I am going to use a small list of movie quotes.
POST /_bulk
{ "index" : { "_index" : "movie_quotes" } }
{ "title" : "The Incredibles", "quote": "Never look back, darling. It distracts from the now" }
{ "index" : { "_index" : "movie_quotes" } }
{ "title" : "The Lion King", "quote": "Oh yes, the past can hurt. But, you can either run from it or learn from it" }
{ "index" : { "_index" : "movie_quotes" } }
{ "title" : "Toy Story", "quote": "To infinity and beyond" }
{ "index" : { "_index" : "movie_quotes" } }
{ "title" : "Ratatouille", "quote": "You must not let anyone define your limits because of where you come from" }
{ "index" : { "_index" : "movie_quotes" } }
{ "title" : "Lilo and Stitch", "quote": "Ohana means family, family means nobody gets left behind. Or forgotten" }
Elasticsearch scores documents with the BM25 algorithm by default. The values it works with are:
- boost: constant 2.2 = (k1 + 1); ignore it, it is not relevant for ordering
- freq: the number of times this term appears in the field
- N: the total number of documents in the shard
- n: the number of documents that contain this term
- k1: constant 1.2, term saturation parameter, can be changed
- b: constant 0.75, length normalization parameter, can be changed
- dl: length of the field, specifically the number of terms in the field
- avgdl: the average length of this field across every document in the shard
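Written out, these values combine as follows. This is a transcription of the three "computed as" descriptions in the explain output further down, with log read as the natural logarithm (an assumption, but it is the reading that reproduces the values in that output):

```latex
\begin{aligned}
\text{score} &= \text{boost} \cdot \text{idf} \cdot \text{tf} \\
\text{idf} &= \ln\left(1 + \frac{N - n + 0.5}{n + 0.5}\right) \\
\text{tf} &= \frac{\text{freq}}{\text{freq} + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}
\end{aligned}
```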
GET /movie_quotes/_search
{
  "explain": true,
  "query": {
    "match": {
      "quote": "the"
    }
  }
}
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.94581884,
"hits": [
{
"_shard": "[movie_quotes][0]",
"_node": "M9Dx5c1BTk6ehVVSCiJAvQ",
"_index": "movie_quotes",
"_id": "LMpi64YBn8MlrX4RQHf9",
"_score": 0.94581884,
"_source": {
"title": "The Incredibles",
"quote": "Never look back, darling. It distracts from the now"
},
"_explanation": {
"value": 0.94581884,
"description": "weight(quote:the in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.94581884,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 0.87546873,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 2,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 5,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.4910714,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 9,
"description": "dl, length of field",
"details": []
},
{
"value": 11,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[movie_quotes][0]",
"_node": "M9Dx5c1BTk6ehVVSCiJAvQ",
"_index": "movie_quotes",
"_id": "Lcpi64YBn8MlrX4RQHf9",
"_score": 0.71575475,
"_source": {
"title": "The Lion King",
"quote": "Oh yes, the past can hurt. But, you can either run from it or learn from it"
},
"_explanation": {
"value": 0.71575475,
"description": "weight(quote:the in 1) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.71575475,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 0.87546873,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 2,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 5,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.3716216,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 17,
"description": "dl, length of field",
"details": []
},
{
"value": 11,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
}
]
}
}
For the document that came first, The Incredibles, we get the score 0.94581884, calculated as follows:
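Plugging the values from the explain output for this hit into the formulas it describes gives roughly:

```latex
\begin{aligned}
\text{idf} &= \ln\left(1 + \frac{5 - 2 + 0.5}{2 + 0.5}\right) = \ln(2.4) \approx 0.87546873 \\
\text{tf} &= \frac{1}{1 + 1.2 \cdot \left(1 - 0.75 + 0.75 \cdot \frac{9}{11}\right)} \approx 0.4910714 \\
\text{score} &= 2.2 \cdot 0.87546873 \cdot 0.4910714 \approx 0.94581884
\end{aligned}
```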
For the second document, The Lion King, we get 0.71575475, calculated as follows:
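Using the values from the explain output for this hit, where only dl differs (17 instead of 9), the same steps give roughly:

```latex
\begin{aligned}
\text{idf} &= \ln\left(1 + \frac{5 - 2 + 0.5}{2 + 0.5}\right) = \ln(2.4) \approx 0.87546873 \\
\text{tf} &= \frac{1}{1 + 1.2 \cdot \left(1 - 0.75 + 0.75 \cdot \frac{17}{11}\right)} \approx 0.3716216 \\
\text{score} &= 2.2 \cdot 0.87546873 \cdot 0.3716216 \approx 0.71575475
\end{aligned}
```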
In this example the two calculations are almost identical. The only difference is that the field in the second document is longer by 8 terms (dl is 17 rather than 9). The algorithm is designed so that a match in a longer field counts for less. The reasoning is that the shorter the field that contains the term, the more of its real estate is taken up by the term, so the term must be more valuable there.
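To experiment with the effect of field length outside Elasticsearch, here is a minimal Python sketch of the formulas shown in the explain output. The function name bm25_score is made up for this illustration and is not part of any Elasticsearch client. It uses double precision, so results can differ from Elasticsearch's values in the last digits.

```python
import math


def bm25_score(freq, dl, avgdl, N, n, k1=1.2, b=0.75):
    """Score one term in one field, following the explain output's formulas."""
    boost = k1 + 1  # 2.2 with the default k1, matching "boost" in the output
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf


# Inputs copied from the explain output for the query "the"
incredibles = bm25_score(freq=1, dl=9, avgdl=11, N=5, n=2)
lion_king = bm25_score(freq=1, dl=17, avgdl=11, N=5, n=2)

print(f"The Incredibles: {incredibles:.4f}")  # 0.9458
print(f"The Lion King:   {lion_king:.4f}")    # 0.7158
```

Changing dl while holding everything else fixed is a quick way to see how much a longer field pulls the score down.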
GET /movie_quotes/_search
{
  "explain": true,
  "query": {
    "match": {
      "quote": "you"
    }
  }
}
Skipping the full output this time, for the document that came first, Ratatouille, we get the score 1.1180129, calculated as follows:
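Because the output is skipped, the inputs here are assumptions: freq = 2 and dl = 14 come from counting the terms in the Ratatouille quote, and n = 2, N = 5 and avgdl = 11 are taken to be the same as in the first query. With those inputs the formulas reproduce the reported score:

```latex
\begin{aligned}
\text{idf} &= \ln\left(1 + \frac{5 - 2 + 0.5}{2 + 0.5}\right) = \ln(2.4) \approx 0.8755 \\
\text{tf} &= \frac{2}{2 + 1.2 \cdot \left(1 - 0.75 + 0.75 \cdot \frac{14}{11}\right)} \approx 0.5805 \\
\text{score} &= 2.2 \cdot \text{idf} \cdot \text{tf} \approx 1.1180
\end{aligned}
```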
For the second document, The Lion King, we get 0.71575475, calculated as follows:
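Assuming freq = 1 and dl = 17 for this quote, and the same n, N and avgdl as in the first query (the output is skipped, so these are not read from it), the steps are:

```latex
\begin{aligned}
\text{idf} &= \ln(2.4) \approx 0.8755 \\
\text{tf} &= \frac{1}{1 + 1.2 \cdot \left(1 - 0.75 + 0.75 \cdot \frac{17}{11}\right)} \approx 0.3716 \\
\text{score} &= 2.2 \cdot \text{idf} \cdot \text{tf} \approx 0.7158
\end{aligned}
```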
In this example the calculations are very similar again, but the two documents have very different scores, more so than in the shorter field example. The differences now are in both term frequency and field length: in the first document the field is shorter and the term frequency is higher. Field length matters, but frequency matters more, up to a point (see the next example). We also see that the second document has the same score as the second document in the previous example. This is because the circumstances are the same: the same frequency and the same field length. In fact, it is the same document.
If the term frequency is so important, can I just make sure my document always comes to the top by repeating the same term over and over?
POST /_bulk
{ "index" : { "_index" : "movie_quotes" } }
{ "title" : "Movie 1", "quote": "Movie movie movie movie." }
{ "index" : { "_index" : "movie_quotes" } }
{ "title" : "Movie 2", "quote": "Movie movie movie movie movie movie movie movie." }
GET /movie_quotes/_search
{
  "explain": true,
  "query": {
    "match": {
      "quote": "movie"
    }
  }
}
For Movie 2 we get 2.2614799 and for Movie 1 we get 2.1889362. These two scores are very close. The reason is that freq appears in both the numerator and the denominator of the tf calculation. At first, as the frequency of the term increases, the score rises quickly, but when the frequency gets too high each extra occurrence matters less and less, even though the Movie 2 document has double the term frequency of Movie 1.
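The saturation effect can be sketched in a few lines of Python. The inputs are assumptions rather than values read from an explain output: N = 7 documents in one shard, n = 2 documents containing "movie", and avgdl = 67 / 7 from counting the terms in all seven quotes. The helper bm25_score is a hypothetical name used only for this illustration. With these inputs the sketch lands on the two scores reported above.

```python
import math


def bm25_score(freq, dl, avgdl, N, n, k1=1.2, b=0.75):
    """Score one term in one field, following the explain output's formulas."""
    boost = k1 + 1
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf


# Assumed index statistics after adding Movie 1 and Movie 2 (not from the source)
N, n, avgdl = 7, 2, 67 / 7

movie_1 = bm25_score(freq=4, dl=4, avgdl=avgdl, N=N, n=n)
movie_2 = bm25_score(freq=8, dl=8, avgdl=avgdl, N=N, n=n)
print(f"Movie 1: {movie_1:.4f}")  # 2.1889
print(f"Movie 2: {movie_2:.4f}")  # 2.2615

# Hypothetical documents made only of the repeated term; avgdl is held
# fixed for simplicity, although adding real documents would change it.
scores = [bm25_score(r, r, avgdl, N, n) for r in (1, 2, 4, 8, 16, 32, 64)]
for repeats, score in zip((1, 2, 4, 8, 16, 32, 64), scores):
    print(repeats, round(score, 4))
```

Each doubling of the repetitions adds less to the score than the previous one, which is why stuffing a field with the same term stops paying off quickly.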
These are short examples with a simple match query, and we have not seen the interplay of every possible condition yet. Even so, this is a good starting point to get to grips with the scores that our documents get and to understand how to start tuning our documents to take advantage of this algorithm. It is important to understand at this point that the exact value of a score is irrelevant; only the relative score is useful for ordering. Some further examples to try out on your own:
- Search for terms such as the, an or a, also called stop words. This sample set is too small to really see how stop words can affect the number of documents returned.
- Search for terms such as film, movie, flick, actor, camera and so on. With a larger movie database you will find that searching for some of these terms will bring back results with the wrong relevance.