I'm pretty new to Python and text analysis in general; I'm working on a project for a class. I'm reading in a bunch of free text from .csv files that were exported from Excel. There are over 200,000 rows.

I read them in with just pd.read_csv() and then

df['Text'].fillna('').apply(str)
df['Text'].str.replace(r"[^a-zA-Z]", " ", regex=True)
df.dropna()
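
For completeness, the load itself is just the default call; the file name below is a placeholder for the real Excel exports.

import pandas as pd

# Placeholder path -- the actual data is a set of .csv files exported from Excel
df = pd.read_csv('free_text_export.csv')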

Then I've defined

def preprocess_text(text):
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    words = word_tokenize(text)  # Tokenize text
    words = [word for word in words if word not in stopwords.words('english')]  # Remove stopwords
    return words  # Return list of words
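
For reference, these are the imports and one-time NLTK downloads that preprocess_text assumes:

import re
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')      # tokenizer models (newer NLTK releases may also ask for 'punkt_tab')
nltk.download('stopwords')  # English stopword list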

But when I call that on my dataframe I get

df['cleaned_text'] = df['Text'].apply(preprocess_text)

AttributeError: 'float' object has no attribute 'lower'

I went back and modified the function to print whatever value triggers the error:

def preprocess_text(text):
    try:
        text = re.sub(r'\d+', '', text)  # Remove numbers
    except TypeError:
        print(text)
    except AttributeError:
        print(text)
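    # Note: execution falls through here even when an exception is caught above,
    # so a nan value still reaches text.lower() below.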
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    words = word_tokenize(text)  # Tokenize text
    words = [word for word in words if word not in stopwords.words('english')]  # Remove stopwords
    return words  # Return list of words

And the text that gets printed when the error occurs is just nan.
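
That makes me think the offending cells are pandas missing values, which come through as floats. A quick sketch of what I assume is the same failure in isolation:

val = float('nan')   # how pandas represents a missing cell in an object column
type(val)            # <class 'float'>
val.lower()          # AttributeError: 'float' object has no attribute 'lower'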

Any pointers on how to isolate where in this mass of text the error is occurring? Or, better yet, is there a pre-processing step that would eliminate the problem entirely?
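
To be concrete, by "isolate" I mean something like the sketch below: pull out the rows whose 'Text' value isn't actually a string so I can inspect them (the bad_rows name is just for illustration). I'm not sure whether this is the idiomatic pandas way to do it.

# Flag rows where 'Text' is not a str (e.g. float nan) so they can be inspected
bad_rows = df[~df['Text'].apply(lambda v: isinstance(v, str))]
print(len(bad_rows))
print(bad_rows.head())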
