python - word reduction of a list of words - Stack Overflow

Besides using a chatbot like o1 mini, is there a more local way to reduce a list of words from 10 words down to 3? I feel that, given context, word association, and the semantic meanings of the words, there has to be some form of extrapolation that can reduce the list, but so far my attempts have not been nearly as good.

The ultimate goal is to reduce a list to something searchable.

example using ChatGPT o1 mini

reduce the list of words to 3 terms able to used in search. Combine words if they make sense in context to all other words in the list. Each term should combine words meaningfully, and no word should be repeated across the terms. return in list form

["rpg", "role playing", "fantasy", "monster", "Dungeons", "dragons", "master", "monster", "job", "class"]

result: Dungeons Dragons, Role Playing, Monster Class

Here are the attempts I tried using Python:

attempt 1: UMAP, Cosine similarity

from typing import List

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

def concatenate_overlap(list_of_list):
    # Repeatedly union any two groups that share an index until none overlap.
    pooled = [set(subList) for subList in list_of_list]
    merging = True
    while merging:
        merging = False
        for i, group in enumerate(pooled):
            merged = next((g for g in pooled[i + 1:] if g.intersection(group)), None)
            if not merged:
                continue
            group.update(merged)
            pooled.remove(merged)
            merging = True
    return [list(x) for x in pooled]


text_list = ["rpg", "role playing", "fantasy", "monster", "Dungeons", "dragons", "master", "monster", "job", "class"]

MODEL_NAME = 'Alibaba-NLP/gte-multilingual-base'
min_similarity = 0.8  # cosine-similarity threshold for treating two entries as related
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
batch_dict = tokenizer(text_list, max_length=8192, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
# Use the first token of each entry as its embedding, keeping at most 768 dimensions.
embeddings = outputs.last_hidden_state[:, 0, :768]
similarity_matrix = cosine_similarity(embeddings.detach().numpy())
np.fill_diagonal(similarity_matrix, 0)  # ignore self-similarity
to_merge_list: List[List[int]] = []
for idx, topic_similarity_scores in enumerate(similarity_matrix):
    # Indices of the entries whose similarity to entry idx exceeds the threshold.
    similar_words = list(np.where(topic_similarity_scores > min_similarity)[0])
    similar_words.append(idx)
    to_merge_list.append(similar_words)

to_concat = concatenate_overlap(to_merge_list)
# Keep one representative entry per merged group.
words = [text_list[a[0]] for a in to_concat]

result: ['rpg', 'monster', 'Dungeons', 'dragons', 'master', 'monster', 'job', 'class']
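
The threshold-and-merge step in attempt 1 can be looked at on its own, without downloading a model. The sketch below is only an illustration: the five words and the values in `toy_similarity` are made up by hand (they are not real embedding scores), and `merge_overlapping` is a hypothetical helper that does the same kind of merging as `concatenate_overlap`.

```python
import numpy as np

def merge_overlapping(groups):
    """Union index groups that share at least one member (hypothetical helper)."""
    pooled = [set(g) for g in groups]
    merging = True
    while merging:
        merging = False
        for i in range(len(pooled)):
            for j in range(i + 1, len(pooled)):
                if pooled[i] & pooled[j]:
                    # Fold group j into group i, then rescan from the start.
                    pooled[i] |= pooled.pop(j)
                    merging = True
                    break
            if merging:
                break
    return [sorted(g) for g in pooled]

toy_words = ["rpg", "role playing", "dungeons", "dragons", "job"]

# Hand-made, symmetric similarity scores (assumed values, not model output).
toy_similarity = np.array([
    [1.00, 0.90, 0.30, 0.20, 0.10],
    [0.90, 1.00, 0.25, 0.20, 0.10],
    [0.30, 0.25, 1.00, 0.85, 0.05],
    [0.20, 0.20, 0.85, 1.00, 0.05],
    [0.10, 0.10, 0.05, 0.05, 1.00],
])
np.fill_diagonal(toy_similarity, 0)  # ignore self-similarity

threshold = 0.8
candidate_groups = []
for idx, row in enumerate(toy_similarity):
    neighbours = [int(j) for j in np.where(row > threshold)[0]]
    candidate_groups.append(neighbours + [idx])

groups = merge_overlapping(candidate_groups)
grouped_words = [[toy_words[i] for i in g] for g in groups]
print(grouped_words)
```

With real embeddings, the number of groups that come out depends on how many pairs clear the threshold; any pair scoring below it stays in separate groups.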

attempt 2: summarization model

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

word_list = ["rpg", "role playing", "fantasy", "monster", "Dungeons", "dragons", "master", "monster", "job", "class"]
text = ", ".join(word_list)
summarizer(text, max_length=50, do_sample=False)

result: rpgs, role playing, fantasy, monster, Dungeons, dragons, master, monster. job, class. rpg, roleplaying, fantasy,. monster, dungeons, dragons,. master, monsters, master. job,. class

edit: the criterion is simple: shrink the word list to a 3- or 4-word phrase that can be searched. In the case of ChatGPT o1 mini, it gave me 3 multi-word expressions for searching for the two examples. Ideally a local method would give me something just as small.

I plan to take word lists and boil them down to searchable prompts.

edit 2: to provide more context on what the expected outcome should be, here is a unit test and an explanation.

input: ["bad", "horrible", "unfun", "waste", "big", "mutha", "truckers", "racing", "incomplete", "big mutha"]

output: ["big mutha truckers", "unfun", "incomplete"]

Contextually, the list of 10 words focuses on a game called "big mutha truckers", so that would get pulled from the word list. The second term would come from the synonyms that describe the game as not good, so "unfun" or "bad" or even "horrible racing". Lastly, "incomplete" would get pulled because it isn't a close synonym of "bad" and is descriptive.

With these 3 terms, I can throw them into Google, and ideally search results of negative reviews for big mutha truckers should appear.
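
To make the unit test above concrete, here is a minimal sketch of a checker for the stated criteria. It does not do the reduction itself. `check_reduction` is a hypothetical name, and the rule that every output word must come from the input list is my assumption based on the examples; the other two rules (at most 3 terms, no word repeated across terms) come from the prompt.

```python
def check_reduction(words, terms, max_terms=3):
    """Return a list of problems with `terms`; an empty list means they pass.

    Hypothetical helper. Comparison is case-insensitive and splits
    multi-word entries such as "big mutha" into single words.
    """
    problems = []
    if len(terms) > max_terms:
        problems.append(f"expected at most {max_terms} terms, got {len(terms)}")

    # Every individual word that appears anywhere in the input list.
    vocabulary = {w.lower() for entry in words for w in entry.split()}

    seen = set()
    for term in terms:
        for word in term.lower().split():
            if word not in vocabulary:
                # Assumed rule: output words must come from the input list.
                problems.append(f"'{word}' is not in the input list")
            if word in seen:
                problems.append(f"'{word}' is repeated across terms")
            seen.add(word)
    return problems


word_list = ["bad", "horrible", "unfun", "waste", "big", "mutha",
             "truckers", "racing", "incomplete", "big mutha"]
expected = ["big mutha truckers", "unfun", "incomplete"]

print(check_reduction(word_list, expected))                       # prints []
print(check_reduction(word_list, ["big mutha", "big truckers"]))  # reports 'big' as repeated
```

A checker like this only verifies the shape of an answer. Whether the chosen terms are the contextually right ones, as in the explanation above, still has to be judged separately.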
