Jak MongoDB radzi sobie z długością dokumentu w indeksie tekstu i wyniku tekstu?

Punktacja opiera się na liczbie dopasowań tematycznych, ale istnieje również wbudowany współczynnik, który dostosowuje wynik dla dopasowań w stosunku do całkowitej długości pola (z usuniętymi słowami stop). Jeśli Twój dłuższy tekst zawiera bardziej trafne słowa do zapytania, doda to do wyniku. Dłuższy tekst, który nie pasuje do zapytania, obniży wynik.

Fragment kodu źródłowego MongoDB 3.2 w serwisie GitHub (źródło/mongo/db/fts/fts_spec.cpp ):

   for (ScoreHelperMap::const_iterator i = terms.begin(); i != terms.end(); ++i) {
        const string& term = i->first;
        const ScoreHelperStruct& data = i->second;

        // in order to adjust weights as a function of term count as it
        // relates to total field length. ie. is this the only word or
        // a frequently occuring term? or does it only show up once in
        // a long block of text?

        double coeff = (0.5 * data.count / numTokens) + 0.5;

        // if term is identical to the raw form of the
        // field (untokenized) give it a small boost.
        double adjustment = 1;
        if (raw.size() == term.length() && raw.equalCaseInsensitive(term))
            adjustment += 0.1;

        double& score = (*docScores)[term];
        score += (weight * data.freq * coeff * adjustment);
        verify(score <= MAX_WEIGHT);
    }
}

Konfigurowanie danych testowych, aby zobaczyć wpływ współczynnika długości na bardzo prostym przykładzie:

db.articles.insert([
    { headline: "Rock" },
    { headline: "Rocks" },
    { headline: "Rock paper" },
    { headline: "Rock paper scissors" },
])

db.articles.createIndex({ "headline": "text"})

db.articles.find(
    { $text: { $search: "rock" }},
    { _id:0, headline:1, score: { $meta: "textScore" }}
).sort({ score: { $meta: "textScore" }})

Wyniki z adnotacjami:

// Exact match of raw term to indexed field
// Coefficent is 1, plus 0.1 bonus for identical match of raw term
{
  "headline": "Rock",
  "score": 1.1
}

// Match of stemmed term to indexed field ("rocks" stems to "rock")
// Coefficent is 1
{
  "headline": "Rocks",
  "score": 1
}

// Two terms, one matching
// Coefficient is 0.75: (0.5 * 1 match / 2 terms) + 0.5
{
  "headline": "Rock paper",
  "score": 0.75
}

// Three terms, one matching
// Coefficient is 0.66: (0.5 * 1 match / 3 terms) + 0.5
{
  "headline": "Rock paper scissors",
  "score": 0.6666666666666666
}