TF-IDF

`TF-IDF(x) = TF(x)*IDF(x)`

TF (Term Frequency)

tf(w,d) = count(w,d) / size(d)

count(w,d): the number of times word w occurs in document d
size(d): the total number of words in document d

// If every document is assumed to have the same length, this simplifies to count(w,d).

use std::collections::HashMap;

/// For each word in the document, count how often it occurs in the document.
/// This is the simplified count(w,d) form: counts are not divided by size(d).
fn term_frequency(document: &str) -> HashMap<&str, usize> {
    document
        // good enough definition of "word" for this exercise
        .split_whitespace()
        // using fold to get some extra practice with monoids. Using a for loop is also totally fine.
        .fold(HashMap::default(), |mut hash_map, word| {
            hash_map.entry(word).and_modify(|c| *c += 1).or_insert(1);
            hash_map
        })
}
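
`term_frequency` returns raw counts, which is the simplified count(w,d) form. When documents differ in length, the full definition tf(w,d) = count(w,d) / size(d) is needed. A minimal sketch of that variant follows; the name `normalized_term_frequency` is hypothetical, and it uses a plain for loop instead of fold so that size(d) can be tallied in the same pass.

```rust
use std::collections::HashMap;

/// Normalized term frequency: tf(w,d) = count(w,d) / size(d).
/// "Word" means a whitespace-separated token, as in `term_frequency`.
fn normalized_term_frequency(document: &str) -> HashMap<&str, f64> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    let mut size = 0usize; // size(d): total number of words in the document

    for word in document.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
        size += 1;
    }

    // An empty document yields an empty map, so no division by zero occurs.
    let size = size as f64;
    counts
        .into_iter()
        .map(|(word, count)| (word, count as f64 / size))
        .collect()
}
```

For the document `"a b a c"`, size(d) is 4, so `a` maps to 2/4 and `b` and `c` each map to 1/4.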

IDF (Inverse Document Frequency)

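idf(w) = log(n / (docs(w, D) + 1))

docs(w, D): the number of documents in D that contain word w. A word may not appear in any document, so 1 is always added to the denominator to avoid dividing by zero.
n: the total number of documents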
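
The two definitions can be combined into a rough end-to-end sketch. Several things here are assumptions, not part of the notes: the function names `inverse_document_frequency` and `tf_idf` are hypothetical, the `+ 1` is read as smoothing the denominator, docs(w, D) is taken to be the number of documents containing w, and the logarithm is the natural log, since no base is given.

```rust
use std::collections::{HashMap, HashSet};

/// idf(w) = ln(n / (docs(w, D) + 1)) for every word seen in `documents`.
/// Assumption: natural log; docs(w, D) counts documents containing w.
fn inverse_document_frequency<'a>(documents: &[&'a str]) -> HashMap<&'a str, f64> {
    let n = documents.len() as f64;
    let mut doc_counts: HashMap<&'a str, usize> = HashMap::new();

    for &document in documents {
        // Deduplicate so each document contributes at most once per word.
        let unique: HashSet<&'a str> = document.split_whitespace().collect();
        for word in unique {
            *doc_counts.entry(word).or_insert(0) += 1;
        }
    }

    doc_counts
        .into_iter()
        .map(|(word, docs)| (word, (n / (docs as f64 + 1.0)).ln()))
        .collect()
}

/// TF-IDF(w, d) = tf(w, d) * idf(w), computed for each document in turn.
/// Uses the length-normalized tf(w,d) = count(w,d) / size(d).
fn tf_idf<'a>(documents: &[&'a str]) -> Vec<HashMap<&'a str, f64>> {
    let idf = inverse_document_frequency(documents);

    documents
        .iter()
        .map(|&document| {
            let mut counts: HashMap<&'a str, usize> = HashMap::new();
            let mut size = 0usize;
            for word in document.split_whitespace() {
                *counts.entry(word).or_insert(0) += 1;
                size += 1;
            }

            let size = size as f64;
            counts
                .into_iter()
                // Every word of every document is present in `idf` by construction.
                .map(|(word, count)| (word, (count as f64 / size) * idf[word]))
                .collect()
        })
        .collect()
}
```

With this smoothing, a word that appears in all n documents gets idf = ln(n / (n + 1)), which is slightly negative, so very common words end up with scores at or below zero. Calling `tf_idf(&["a b", "a c", "a d"])` returns one score map per document, in input order.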