I have a model from Hugging Face and would like to use it for performing word comparisons. At first I thought of performing a series of similarity calculations across the words of interest, but I quickly found that the number of pairwise comparisons grows quadratically as the number of words expands.
A solution I thought about is plotting a skip-gram style map where all words land on a two-dimensional plane, so that I can simply cluster the coordinates to find similar words. The problem here is that this requires a BERT model and a low-dimensional embedding layer that can be mapped.
Since I only have a pretrained model, I don't know if I can create a skip-gram from it. I was hoping to calculate the embeddings and, through some transformation, convert them into coordinates that I can plot myself. I don't know, though, whether this is possible or reasonable.
I tried it anyway with the code below:
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer
# target word
word = ["Slartibartfast"]
# model setup
model = 'Alibaba-NLP/gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(model)
auto_model = AutoModel.from_pretrained(model, trust_remote_code=True)
# embed and calculate
batch_dict = tokenizer(word, max_length=8192, padding=True, truncation=True, return_tensors='pt')
result = auto_model(**batch_dict)
embeddings = result.last_hidden_state[:, 0]  # CLS-token embedding, shape (1, 768)
# transform to coordinates
clayer = TSNE(n_components=3, learning_rate='auto', init='random', perplexity=50)
embedding_numpy = embeddings.detach().numpy()
clayer.fit_transform(embedding_numpy)  # crashes here saying perplexity must be less than n_samples
1 Answer
After more thorough reading, it was brought to my attention that it would be impossible to use t-SNE in the manner I was hoping, because the coordinates generated by t-SNE are only representative of the data it was fitted on. Fitting again on new data, or trying to transform data that was not in the original set, produces outputs that are not on a comparable range.
I found a replacement for t-SNE called UMAP. UMAP also performs dimensionality reduction, but a fitted reducer can be reused: new data can be transformed into the same coordinate space as the data it was fitted on.
I will explore UMAP and see if it works for what I need.
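A minimal sketch of what that could look like, assuming the same GTE model as in the question and the umap-learn package; the reference word list, the embed helper, and the parameter values here are illustrative choices, not something from the original post:

import torch
import umap  # provided by the umap-learn package
from transformers import AutoModel, AutoTokenizer

model_name = 'Alibaba-NLP/gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
auto_model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def embed(words):
    # tokenize the batch and take the CLS-token vector for each word
    batch = tokenizer(words, max_length=8192, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        output = auto_model(**batch)
    return output.last_hidden_state[:, 0].numpy()

# fit the reducer once on a reference vocabulary (hypothetical word list)
reference_words = ["king", "queen", "apple", "banana", "car", "bicycle"]
reducer = umap.UMAP(n_components=3, n_neighbors=5, random_state=42)
reference_coords = reducer.fit_transform(embed(reference_words))

# later, project new words into the same coordinate space
new_coords = reducer.transform(embed(["Slartibartfast"]))

Unlike scikit-learn's TSNE, which only offers fit_transform, the fitted UMAP object exposes a transform method, so coordinates of new words stay comparable to the reference set.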