I'm trying to understand the difference between entropy and cross-entropy, as I often hear about the entropy of a language and the cross-entropy of a language model, and I want to understand the link between the two.
To simplify things, let's consider a language (with a vocabulary) and a language model trained on that language.
We'll work at the character level (which gives us 26 characters) and with a limited number of words (let's take the 20 names below).
prenoms = [
    "Alice", "Alfred", "Alina", "Aline", "Alexandre",
    "Alicia", "Alison", "Alma", "Alva", "Elise",
    "Elisa", "Eliane", "Alain", "Amélie", "Arline",
    "Olivier", "Oline", "Alva", "Eliott", "Julien"
]
How do we calculate the entropy over these 20 names (i.e., the entropy of our language) and the cross-entropy of our language model (let's take a unigram model, or any language model you prefer, to help me understand)?
If you have a more relevant example, I’m open to it.
PS: My confusion comes from the fact that, in the general definitions, we talk about a single (language) distribution P when calculating entropy (without quite knowing how to compute it), but about two distributions P and Q when calculating cross-entropy (and in the cross-entropy-loss case, P is a one-hot vector).
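To make the PS concrete for myself, here is a tiny sketch of how I currently picture the two formulas; the p_true / q_model values are made up, purely for illustration:

import math

def entropy(p):
    # H(P) = -sum_x P(x) * log2 P(x)
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    # H(P, Q) = -sum_x P(x) * log2 Q(x)
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

p_true = {"a": 0.5, "b": 0.25, "c": 0.25}   # toy "true" distribution P
q_model = {"a": 0.4, "b": 0.4, "c": 0.2}    # toy model distribution Q

print(entropy(p_true))                 # H(P) = 1.5 bits
print(cross_entropy(p_true, q_model))  # H(P, Q) >= H(P)

# The cross-entropy-loss case: P is one-hot, so the sum collapses
# to -log2 Q(correct symbol)
p_onehot = {"a": 1.0, "b": 0.0, "c": 0.0}
print(cross_entropy(p_onehot, q_model))  # == -math.log2(q_model["a"])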
PS2: Some Python code would help me understand; here is my full attempt, based on my reading of the Jurafsky book.
import math
from collections import Counter


def distribution_ngrams(text, n=4):
    """Return the empirical distribution of character n-grams of `text`."""
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(ngrams)
    total = len(ngrams)
    # Relative frequency of each n-gram
    return {ngram: count / total for ngram, count in counts.items()}
def language_entropy_ngrams(text, n_approx=4):
    """
    Estimate the entropy of a text using n-grams (in theory we would take a
    very large n and an infinite sequence L).
    """
    distribution = distribution_ngrams(text, n_approx)
    # Shannon entropy of the n-gram distribution, in bits
    entropy = -sum(p * math.log2(p) for p in distribution.values())
    # Normalize by the n-gram length to get bits per character (entropy rate)
    return entropy / n_approx
def model_cross_entropy(text, n_approx=4):
    """
    Cross-entropy between the text's n-gram distribution (playing the role of P)
    and a unigram language model (Q), in bits per character.
    """
    unigram_model_distribution = distribution_ngrams(text, 1)
    # Approximate the "true" distribution P by the n-gram frequencies of the text
    language_model_distribution_approximation = distribution_ngrams(text, n_approx)
    cross_entropy = 0.0
    for ngram, p in language_model_distribution_approximation.items():
        # The unigram model scores an n-gram as the product of its character probabilities
        q = 1.0
        for c in ngram:
            q *= unigram_model_distribution[c]
        cross_entropy -= p * math.log2(q)
    return cross_entropy / n_approx
if __name__ == "__main__":
    # Each first name can be seen as a sequence of characters
    prenoms = ["Alice", "Alfred", "Alina", "Aline", "Alexandre", "Alicia",
               "Alison", "Alma", "Alva", "Elise", "Elisa", "Eliane", "Alain",
               "Amélie", "Arline", "Olivier", "Oline", "Alva", "Eliott", "Julien"]
    # The corpus/language L can be seen as the concatenation of those sequences
    L = ''.join(prenoms).lower()
    print(language_entropy_ngrams(L))
    print(model_cross_entropy(L))
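For comparison, my reading of the Jurafsky book is that in practice the cross-entropy of a model on a text is often estimated as the average negative log2-probability the model assigns to each symbol of the text. Here is a minimal self-contained sketch of that view for the character-level unigram model (the function name unigram_cross_entropy_per_char is mine, just for illustration):

import math
from collections import Counter

def unigram_cross_entropy_per_char(text):
    """Average -log2 q(c) over the characters of `text`, where q is the
    unigram (character-frequency) model estimated on that same text."""
    counts = Counter(text)
    total = len(text)
    q = {c: n / total for c, n in counts.items()}
    return -sum(math.log2(q[c]) for c in text) / len(text)

prenoms = ["Alice", "Alfred", "Alina", "Aline", "Alexandre", "Alicia",
           "Alison", "Alma", "Alva", "Elise", "Elisa", "Eliane", "Alain",
           "Amélie", "Arline", "Olivier", "Oline", "Alva", "Eliott", "Julien"]
L = ''.join(prenoms).lower()
print(unigram_cross_entropy_per_char(L))

If I'm not mistaken, since q is estimated on the same text, this average is exactly the entropy of the unigram distribution, and that is part of what I'd like to see disentangled: which quantity plays the role of P and which plays the role of Q.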