Python's Gensim package offers a dynamic topic model called LdaSeqModel(). I have run into the same problem described in this issue on the Gensim mailing list, which remains unsolved. The model infers a topic that looks logically impossible: it assigns a non-zero probability to a word in a time slice in which that word never occurs. Here is a reproduction of the problem:

from gensim.corpora import Dictionary
from gensim.models import LdaSeqModel

common_texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]
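# Map each token to an integer id, then convert every document to bag-of-words form
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Fit a single topic over two time slices: documents 0-4 and documents 5-8
model = LdaSeqModel(corpus=common_corpus, id2word=common_dictionary, num_topics=1, time_slice=[5, 4])

# Show the word probabilities of topic 0 in each time slice
model.print_topic_times(topic=0)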

time_slice=[5, 4] means that the first time slice consists of the first 5 documents in common_texts and the second of the remaining 4. The term graph does not occur in the first time slice, yet print_topic_times() reports a non-zero probability for it there. The output is:

[[('system', 0.13896054593348167),
  ('user', 0.10696589214152682),
  ('trees', 0.10664464447111177),
  ('graph', 0.10643809153102356),
  ('computer', 0.07494460648968987),
  ('human', 0.07494460648968987),
  ('interface', 0.07494460648968987),
  ('response', 0.07494460648968987),
  ('time', 0.07494460648968987),
  ('eps', 0.07494460648968987),
  ('minors', 0.07474199433434457),
  ('survey', 0.01658119265037265)],
 [('system', 0.13882862152464212),
  ('graph', 0.10742799576320598),
  ('trees', 0.10713473662111127),
  ('user', 0.1064043188010877),
  ('minors', 0.07517325760789559),
  ('computer', 0.07474729274679391),
  ('human', 0.07474729274679391),
  ('interface', 0.07474729274679391),
  ('response', 0.07474729274679391),
  ('time', 0.07474729274679391),
  ('eps', 0.07474729274679391),
  ('survey', 0.01654731320129382)]]
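
To double-check the premise independently of Gensim, here is a minimal sketch in plain Python that rebuilds the vocabulary of each time slice and compares it with the probabilities reported for the first slice. The helper slice_vocabularies and the names reported_slice_0, unseen_but_weighted, total_mass and unseen_mass are hypothetical and only serve this illustration; the numbers are copied from the output above.

```python
from itertools import accumulate

common_texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]
time_slice = [5, 4]

# Probabilities reported for the first time slice, copied from the output above.
reported_slice_0 = {
    'system': 0.13896054593348167,
    'user': 0.10696589214152682,
    'trees': 0.10664464447111177,
    'graph': 0.10643809153102356,
    'computer': 0.07494460648968987,
    'human': 0.07494460648968987,
    'interface': 0.07494460648968987,
    'response': 0.07494460648968987,
    'time': 0.07494460648968987,
    'eps': 0.07494460648968987,
    'minors': 0.07474199433434457,
    'survey': 0.01658119265037265,
}


def slice_vocabularies(texts, slice_sizes):
    """Return one set of observed words per time slice (hypothetical helper)."""
    ends = list(accumulate(slice_sizes))   # [5, 9] for slice sizes [5, 4]
    starts = [0] + ends[:-1]               # [0, 5]
    return [
        {word for doc in texts[start:end] for word in doc}
        for start, end in zip(starts, ends)
    ]


vocabs = slice_vocabularies(common_texts, time_slice)

# Words that receive positive probability in slice 0 although no document
# in that slice contains them.
unseen_but_weighted = {
    word: prob
    for word, prob in reported_slice_0.items()
    if word not in vocabs[0] and prob > 0
}

total_mass = sum(reported_slice_0.values())
unseen_mass = sum(unseen_but_weighted.values())

print(sorted(unseen_but_weighted))  # ['graph', 'minors', 'trees']
print(total_mass)
print(unseen_mass)
```

The reported values for the first slice sum to 1 up to floating-point rounding, so the list is a full distribution over all 12 dictionary words, and a bit under 0.29 of its mass sits on graph, minors and trees, none of which appear in documents 0-4. This only confirms the observation; it does not by itself explain why the model behaves this way.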

Do I have to set additional parameters to obtain correct results?

I have run this with Python 3.10.12 and Gensim 4.3.3.

Update January 23, 2025

I've experimented with the alphas, passes, and em_min_iter parameters; none of them has any effect on the problem.
