Besides using a chatbot like o1-mini, is there a more local way to reduce a list of 10 words down to 3? I suspect that, because of context, word association, and the semantic meanings of the words, some form of extrapolation is needed to reduce the list, but so far my attempts have not been nearly as good.
The ultimate goal is to reduce a list to something searchable.
Example using ChatGPT o1-mini, with this prompt:
reduce the list of words to 3 terms able to used in search. Combine words if they make sense in context to all other words in the list. Each term should combine words meaningfully, and no word should be repeated across the terms. return in list form
["rpg", "role playing", "fantasy", "monster", "Dungeons", "dragons", "master", "monster", "job", "class"]
result: Dungeons Dragons, Role Playing, Monster Class
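The rules in that prompt can be checked mechanically, which makes it easier to compare a local method against the chatbot output. Below is a minimal sketch; `check_terms` is a hypothetical helper name, and the rule that every word in a term must come from the source list is my assumption (the prompt only says to reduce the list).

```python
def check_terms(terms, source_words, max_terms=3):
    """Return a list of rule violations; an empty list means the terms pass."""
    problems = []
    if len(terms) > max_terms:
        problems.append(f"too many terms: {len(terms)} > {max_terms}")
    # Assumption: every word in a term must come from the source list.
    allowed = {w.lower() for entry in source_words for w in entry.split()}
    seen = set()
    for term in terms:
        for word in term.lower().split():
            if word not in allowed:
                problems.append(f"'{word}' is not in the source list")
            if word in seen:
                problems.append(f"'{word}' is repeated across terms")
            seen.add(word)
    return problems


source_words = ["rpg", "role playing", "fantasy", "monster", "Dungeons",
                "dragons", "master", "monster", "job", "class"]
print(check_terms(["Dungeons Dragons", "Role Playing", "Monster Class"], source_words))
# []
print(check_terms(["Dungeons Dragons", "Dragons Master"], source_words))
# ["'dragons' is repeated across terms"]
```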
Here are the attempts I tried using Python.
attempt 1: UMAP, Cosine similarity
from typing import List

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

def concatenate_overlap(list_of_list):
    # Repeatedly merge groups that share at least one element until none overlap.
    pooled = [set(subList) for subList in list_of_list]
    merging = True
    while merging:
        merging = False
        for i, group in enumerate(pooled):
            merged = next((g for g in pooled[i + 1:] if g.intersection(group)), None)
            if not merged:
                continue
            group.update(merged)
            pooled.remove(merged)
            merging = True
    return [list(x) for x in pooled]

text_list = ["rpg", "role playing", "fantasy", "monster", "Dungeons", "dragons", "master", "monster", "job", "class"]
MODEL_NAME = 'Alibaba-NLP/gte-multilingual-base'
min_similarity = 0.8  # cosine-similarity threshold above which two entries count as similar
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
batch_dict = tokenizer(text_list, max_length=8192, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
# Use the hidden state of the first token as the embedding of each entry.
embeddings = outputs.last_hidden_state[:, 0]
similarity_matrix = cosine_similarity(embeddings.detach())
np.fill_diagonal(similarity_matrix, 0)  # ignore self-similarity
to_merge_list: List[List[int]] = []
for idx, topic_similarity_scores in enumerate(similarity_matrix):
    # Indices of entries whose similarity to entry idx exceeds the threshold.
    similar_words = list(np.where(topic_similarity_scores > min_similarity)[0])
    similar_words.append(idx)
    to_merge_list.append(similar_words)
to_concat = concatenate_overlap(to_merge_list)
# Keep one representative entry per merged group.
words = [text_list[a[0]] for a in to_concat]
result: ['rpg', 'monster', 'Dungeons', 'dragons', 'master', 'monster', 'job', 'class']
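A fixed threshold gives no control over how many groups come out, which is why 8 entries survive above. One alternative (a swapped-in technique, not what the code above does) is to keep merging the two most similar groups until exactly k remain, i.e. average-linkage agglomerative clustering. The sketch below is pure Python; `cluster_to_k` is a hypothetical name and the similarity values are made up by hand for illustration, not produced by the embedding model.

```python
from itertools import combinations


def cluster_to_k(similarity, k):
    """Greedy average-linkage merging on a square similarity matrix until k groups remain."""
    clusters = [[i] for i in range(len(similarity))]

    def avg_sim(a, b):
        # Mean pairwise similarity between the members of two groups.
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Pick the pair of groups with the highest average similarity (a < b).
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hand-made toy values (assumption), only to show the mechanics.
toy_words = ["dungeons", "dragons", "rpg", "role playing", "monster"]
toy_similarity = [
    [1.0, 0.9, 0.4, 0.3, 0.5],
    [0.9, 1.0, 0.4, 0.3, 0.6],
    [0.4, 0.4, 1.0, 0.85, 0.2],
    [0.3, 0.3, 0.85, 1.0, 0.2],
    [0.5, 0.6, 0.2, 0.2, 1.0],
]
groups = cluster_to_k(toy_similarity, 3)
print([[toy_words[i] for i in g] for g in groups])
# [['dungeons', 'dragons'], ['rpg', 'role playing'], ['monster']]
```

With real embeddings, the matrix returned by `cosine_similarity` could be passed in as `similarity` (after `.tolist()` or as-is, since it is indexed as `similarity[i][j]`). Whether the resulting groups are meaningful still depends on the embedding quality.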
attempt 2: summarization model
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
word_list = ["rpg", "role playing", "fantasy", "monster", "Dungeons", "dragons", "master", "monster", "job", "class"]
text = ", ".join(word_list)
summarizer(text, max_length=50, do_sample=False)
result: rpgs, role playing, fantasy, monster, Dungeons, dragons, master, monster. job, class. rpg, roleplaying, fantasy,. monster, dungeons, dragons,. master, monsters, master. job,. class
edit: the criteria are simple: shrink the word list to a 3- or 4-word phrase that can be searched. ChatGPT o1-mini gave me 3 multi-word expressions for searching for each of the two examples. Ideally a local method would give me something just as small.
I plan to take word lists and boil them down to searchable prompts.
edit 2: to provide more context on what the expected outcome should be, here is a unit test and an explanation.
input: ["bad", "horrible", "unfun", "waste", "big", "mutha", "truckers", "racing", "incomplete", "big mutha"]
output: ["big mutha truckers", "unfun", "incomplete"]
Contextually, the list of 10 words focuses on a game called "big mutha truckers", so that would get pulled from the word list. The second term would cover all the synonyms that describe the game as not good, so "unfun" or "bad" or even "horrible racing". Lastly, "incomplete" would get pulled because it isn't a close synonym of "bad" and is descriptive.
With these 3 terms, I can throw them into Google, and ideally search results with negative reviews of big mutha truckers should appear.
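One small piece of this can be done without any model: the input already contains the phrase "big mutha", so the single words "big" and "mutha" are redundant. The sketch below is only a preprocessing step under that assumption (a single word is dropped when it appears inside a multi-word entry of the same list); `drop_subsumed` is a hypothetical helper name, and it lowercases and de-duplicates the entries as a side effect. It does not attach "truckers" to the phrase or pick between the synonyms.

```python
def drop_subsumed(words):
    """De-duplicate entries (case-insensitively) and drop single words
    that already occur inside a multi-word entry of the same list."""
    unique = list(dict.fromkeys(w.lower() for w in words))  # keeps first-seen order
    phrase_tokens = {tok for w in unique if " " in w for tok in w.split()}
    return [w for w in unique if " " in w or w not in phrase_tokens]


word_list = ["bad", "horrible", "unfun", "waste", "big", "mutha",
             "truckers", "racing", "incomplete", "big mutha"]
print(drop_subsumed(word_list))
# ['bad', 'horrible', 'unfun', 'waste', 'truckers', 'racing', 'incomplete', 'big mutha']
```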