TF-IDF(x) = TF(x)*IDF(x)
tf(w,d) = count(w,d) / size(d)
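count(w, d): the number of times word w occurs in document d
size(d): the total number of words in document d
// If every document is assumed to have the same length, this can be simplified to count(w, d).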
use std::collections::HashMap;

/// For each word in the document, how often does it occur in the document
fn term_frequency(document: &str) -> HashMap<&str, usize> {
    document
        // good enough definition of "word" for this exercise
        .split_whitespace()
        // using fold to get some extra practice with monoids. Using a for loop is also totally fine.
        .fold(HashMap::default(), |mut hash_map, word| {
            hash_map.entry(word).and_modify(|c| *c += 1).or_insert(1);
            hash_map
        })
}
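The function above returns raw counts, i.e. the simplified count(w, d) form. Below is a minimal sketch of the length-normalized tf(w, d) = count(w, d) / size(d), assuming the same whitespace tokenization. The name `normalized_term_frequency` is hypothetical and only used for illustration.

```rust
use std::collections::HashMap;

/// tf(w, d) = count(w, d) / size(d), where a "word" is a whitespace-separated token.
/// (Hypothetical helper name, not part of the original notes.)
fn normalized_term_frequency(document: &str) -> HashMap<&str, f64> {
    // count(w, d): raw number of occurrences of each word
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for word in document.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    // size(d): total number of words in the document
    let size: usize = counts.values().sum();
    // An empty document yields an empty map, so no division by zero happens here.
    counts
        .into_iter()
        .map(|(word, count)| (word, count as f64 / size as f64))
        .collect()
}
```

For the document "a b a c", this gives 0.5 for "a" and 0.25 for each of "b" and "c".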
- **IDF (Inverse Document Frequency)**
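idf(w) = log(n / (docs(w, D) + 1))
docs(w, D): the number of documents in D that contain word w. A word may not appear in any document, so 1 is always added to the denominator to avoid division by zero.
n: the total number of documents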
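A rough sketch of how the IDF formula and the final TF-IDF product might be computed, under a few assumptions that are not in the notes: docs(w, D) is read as the number of documents containing w, log is taken as the natural logarithm, and words are whitespace-separated tokens. The function names `inverse_document_frequency` and `tf_idf` are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// idf(w) = ln(n / (docs(w, D) + 1)) for every word that occurs in the corpus.
/// Assumption: natural log; the notes do not fix a base.
fn inverse_document_frequency<'a>(documents: &[&'a str]) -> HashMap<&'a str, f64> {
    let n = documents.len() as f64;
    // docs(w, D): number of documents that contain w
    let mut doc_counts: HashMap<&str, usize> = HashMap::new();
    for &document in documents {
        // Deduplicate so each document contributes at most 1 per word.
        let unique: HashSet<&str> = document.split_whitespace().collect();
        for word in unique {
            *doc_counts.entry(word).or_insert(0) += 1;
        }
    }
    doc_counts
        .into_iter()
        .map(|(word, docs)| (word, (n / (docs as f64 + 1.0)).ln()))
        .collect()
}

/// tf-idf(w, d) = tf(w, d) * idf(w) for every document in the corpus,
/// with tf(w, d) = count(w, d) / size(d).
fn tf_idf<'a>(documents: &[&'a str]) -> Vec<HashMap<&'a str, f64>> {
    let idf = inverse_document_frequency(documents);
    documents
        .iter()
        .map(|&document| {
            let words: Vec<&str> = document.split_whitespace().collect();
            let size = words.len() as f64;
            let mut counts: HashMap<&str, usize> = HashMap::new();
            for word in words {
                *counts.entry(word).or_insert(0) += 1;
            }
            // Every word of every document is a key of `idf`, so indexing cannot panic.
            counts
                .into_iter()
                .map(|(word, count)| (word, count as f64 / size * idf[word]))
                .collect()
        })
        .collect()
}
```

One consequence of adding 1 to the denominator: a word that occurs in n - 1 documents gets idf = ln(1) = 0, and a word that occurs in all n documents gets a slightly negative idf. Whether that is acceptable depends on how the scores are used.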