I'm pretty new to Python and text analysis in general; I'm working on a project for a class. I'm reading in a bunch of free text from .csv files that came from Excel. There are over 200,000 rows.
I read them in with just pd.read_csv()
and then
df['Text'].fillna('').apply(str)
df['Text'].str.replace(r"[^a-zA-Z]", " ", regex=True)
df.dropna()
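For reference, here is a minimal runnable sketch of that cleaning step on a made-up three-row frame (the column name `Text` matches the real data; the contents are invented). Note that `fillna`, `str.replace`, and `dropna` all return new objects, so the results have to be assigned back for the column to actually change:

```python
import pandas as pd

# Hypothetical miniature of the real 200,000-row frame.
df = pd.DataFrame({'Text': ['Hello, World! 123', None, 'Another row.']})

# Assign the results back; these methods do not modify df in place.
df['Text'] = df['Text'].fillna('').apply(str)
df['Text'] = df['Text'].str.replace(r"[^a-zA-Z]", " ", regex=True)

print(df['Text'].tolist())
# ['Hello  World     ', '', 'Another row ']
```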
Then I've defined
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def preprocess_text(text):
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    words = word_tokenize(text)  # Tokenize text
    words = [word for word in words if word not in stopwords.words('english')]  # Remove stopwords
    return words  # Return list of words
But when I call that on my dataframe:

df['cleaned_text'] = df['Text'].apply(preprocess_text)

I get:

AttributeError: 'float' object has no attribute 'lower'
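That error is reproducible with a tiny Series containing a missing value; `read_csv` loads empty cells as NaN, which is a float, and floats have no string methods. A minimal sketch (invented two-row data):

```python
import pandas as pd

# An empty csv cell becomes NaN, a float, in an otherwise string column.
s = pd.Series(['some text', float('nan')])

try:
    s.apply(lambda t: t.lower())
except AttributeError as e:
    print(e)  # 'float' object has no attribute 'lower'
```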
I went back and modified the function
def preprocess_text(text):
    try:
        text = re.sub(r'\d+', '', text)  # Remove numbers
    except (TypeError, AttributeError):
        print(text)
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    words = word_tokenize(text)  # Tokenize text
    words = [word for word in words if word not in stopwords.words('english')]  # Remove stopwords
    return words  # Return list of words
And the value that gets printed when the error occurs is just nan.
Any pointers on how to isolate where in this mass of text the error is occurring? Or better yet, a preprocessing step that would eliminate it?
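For example, one way I've seen to locate rows like this (assuming, as above, that the culprits are NaN floats from empty cells) is to filter on type; a sketch with an invented three-row frame:

```python
import pandas as pd

df = pd.DataFrame({'Text': ['fine', float('nan'), 'also fine']})

# Rows whose value is not a str (NaN from an empty csv cell is a float).
bad = df[df['Text'].apply(lambda v: not isinstance(v, str))]
print(bad.index.tolist())  # [1]

# Or replace them with empty strings before preprocessing:
df['Text'] = df['Text'].fillna('')
```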
Tags: pandas | How to isolate problematic text in a large csv file with Python - Stack Overflow