Besides using a chatbot like o1-mini, is there a more local way to reduce a list of 10 words down to 3 terms? Given the context, word associations, and semantic meanings of the words, I feel there has to be some way to infer a shorter list, but so far my attempts have not been nearly as good.

The ultimate goal is to reduce a list to something that can be searchable.

Example using ChatGPT o1-mini:

reduce the list of words to 3 terms able to used in search. Combine words if they make sense in context to all other words in the list. Each term should combine words meaningfully, and no word should be repeated across the terms. return in list form

["rpg", "role playing", "fantasy", "monster", "Dungeons", "dragons", "master", "monster", "job", "class"]

result: Dungeons Dragons, Role Playing, Monster Class

Here are the attempts I tried using Python.

attempt 1: UMAP, Cosine similarity

from typing import List

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

def concatenate_overlap(list_of_list):
    pooled = [set(subList) for subList in list_of_list]
    merging = True
    while merging:
        merging = False
        for i, group in enumerate(pooled):
            merged = next((g for g in pooled[i + 1:] if g.intersection(group)), None)
            if not merged: 
                continue
            group.update(merged)
            pooled.remove(merged)
            merging = True
    return [list(x) for x in pooled]


text_list = ["rpg", "role playing", "fantasy", "monster", "Dungeons", "dragons", "master", "monster", "job", "class"]

MODEL_NAME = 'Alibaba-NLP/gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
batch_dict = tokenizer(text_list, max_length=8192, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
# Take the [CLS] token embedding of each input and keep its first 768 dimensions
embeddings = outputs.last_hidden_state[:, 0, :768]
similarity_matrix = cosine_similarity(embeddings.detach())
np.fill_diagonal(similarity_matrix, 0)
to_merge_list: List[List[int]] = []
for idx, topic_similarity_scores in enumerate(similarity_matrix):
    # Indices of all other entries whose similarity to this one exceeds 0.8
    similar_words = list(np.where(topic_similarity_scores > 0.8)[0])
    similar_words.append(idx)
    to_merge_list.append(similar_words)

to_concat = concatenate_overlap(to_merge_list)
words = [text_list[a[0]] for a in to_concat]

result: ['rpg', 'monster', 'Dungeons', 'dragons', 'master', 'monster', 'job', 'class']
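
For reference, the grouping step of attempt 1 can be looked at in isolation from the embedding model. The following is only a minimal sketch: the 2-D vectors are made up for illustration (they are not real embeddings), and `cosine` and `group_by_similarity` are hypothetical helper names. It links any two entries whose cosine similarity exceeds the threshold and returns the resulting connected groups of indices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def group_by_similarity(vectors, threshold=0.8):
    """Group indices that are linked, directly or transitively, by similarity > threshold."""
    n = len(vectors)
    parent = list(range(n))  # union-find forest: each index starts as its own group

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data (assumption): "bad" and "horrible" point in nearly the same direction.
words = ["bad", "horrible", "racing"]
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
groups = group_by_similarity(toy_vectors)
print([[words[i] for i in g] for g in groups])
```

With real embeddings, how many groups come out depends entirely on the threshold, so a fixed value such as 0.8 will not necessarily leave exactly 3 groups.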

attempt 2: summarization model

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

word_list = ["rpg", "role playing", "fantasy", "monster", "Dungeons", "dragons", "master", "monster", "job", "class"]
text = ", ".join(word_list)
summarizer(text, max_length=50, do_sample=False)

result: rpgs, role playing, fantasy, monster, Dungeons, dragons, master, monster. job, class. rpg, roleplaying, fantasy,. monster, dungeons, dragons,. master, monsters, master. job,. class

edit: the criterion is simple: shrink the word list to a 3- or 4-word phrase that is searchable. ChatGPT o1-mini gave me 3 multi-word expressions for searching for the two examples. Ideally a local approach would give me something just as small.

I plan to take word lists and boil them down to searchable prompts

edit 2: to provide more context on what the expected outcome should be, here is a unit test and an explanation.

input: ["bad", "horrible", "unfun", "waste", "big", "mutha", "truckers", "racing", "incomplete", "big mutha"]

output: ["big mutha truckers", "unfun", "incomplete"]

Contextually, the list of 10 words focuses on a game called "big mutha truckers", so that name would get pulled from the word list first. The second term would cover all the synonyms saying the game is not good, so "unfun" or "bad" or even "horrible racing". Lastly, "incomplete" would get pulled, as it isn't a close synonym of "bad" and is descriptive.

With these 3 terms, I can put them into Google, and ideally search results for negative reviews of big mutha truckers should appear.
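
The criteria above can be written down as a small check that any candidate reduction has to pass. This is only a sketch: `is_valid_reduction` is a hypothetical helper name, and the rule that every output word must come from the input list is my reading of the examples, not something stated outright. It does not judge whether the terms are meaningful, only that the list is short, reuses no word, and stays within the input vocabulary.

```python
def is_valid_reduction(words, terms, max_terms=3):
    """Check a candidate reduction against the stated criteria."""
    # Vocabulary: every individual word of the input list, lowercased.
    vocab = {w for entry in words for w in entry.lower().split()}
    # Every individual word used across the output terms, lowercased.
    used = [w for term in terms for w in term.lower().split()]
    return (
        0 < len(terms) <= max_terms          # short enough to search with
        and all(w in vocab for w in used)    # assumption: only input words are allowed
        and len(used) == len(set(used))      # no word repeated across terms
    )


words = ["bad", "horrible", "unfun", "waste", "big", "mutha",
         "truckers", "racing", "incomplete", "big mutha"]
expected = ["big mutha truckers", "unfun", "incomplete"]

print(is_valid_reduction(words, expected))                        # expected output passes
print(is_valid_reduction(words, ["big mutha", "big truckers"]))   # "big" is repeated
```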
