I'm pretty new to Python and text analysis in general; I'm working on a project for a class. I'm reading in a bunch of free text from .csv files that were exported from Excel. There are over 200,000 rows.

I read them in with just pd.read_csv() and then

df['Text'].fillna('').apply(str)
df['Text'].str.replace(r"[^a-zA-Z]", " ", regex=True)
df.dropna()
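
For completeness, the load itself is just the default call; the file name below is a placeholder for the real Excel exports.

import pandas as pd

# Placeholder path -- the actual data is a set of .csv files exported from Excel
df = pd.read_csv('free_text_export.csv')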

Then I've defined

def preprocess_text(text):
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    words = word_tokenize(text)  # Tokenize text
    words = [word for word in words if word not in stopwords.words('english')]  # Remove stopwords
    return words  # Return list of words
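
For reference, these are the imports and one-time NLTK downloads that preprocess_text assumes:

import re
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')      # tokenizer models (newer NLTK releases may also ask for 'punkt_tab')
nltk.download('stopwords')  # English stopword list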

But when I call that on my dataframe I get

df['cleaned_text'] = df['Text'].apply(preprocess_text)

AttributeError: 'float' object has no attribute 'lower'

I went back and modified the function to print whatever value triggers the error:

def preprocess_text(text):
    try:
        text = re.sub(r'\d+', '', text)  # Remove numbers
    except TypeError:
        print(text)
    except AttributeError:
        print(text)
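    # Note: execution falls through here even when an exception is caught above,
    # so a nan value still reaches text.lower() below.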
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    words = word_tokenize(text)  # Tokenize text
    words = [word for word in words if word not in stopwords.words('english')]  # Remove stopwords
    return words  # Return list of words

And the text that gets printed when the error occurs is just nan.
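
That makes me think the offending cells are pandas missing values, which come through as floats. A quick sketch of what I assume is the same failure in isolation:

val = float('nan')   # how pandas represents a missing cell in an object column
type(val)            # <class 'float'>
val.lower()          # AttributeError: 'float' object has no attribute 'lower'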

Any pointers on how to isolate where in this mass of text the error is occurring? Or, better yet, is there a pre-processing step that would eliminate the problem entirely?
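
To be concrete, by "isolate" I mean something like the sketch below: pull out the rows whose 'Text' value isn't actually a string so I can inspect them (the bad_rows name is just for illustration). I'm not sure whether this is the idiomatic pandas way to do it.

# Flag rows where 'Text' is not a str (e.g. float nan) so they can be inspected
bad_rows = df[~df['Text'].apply(lambda v: isinstance(v, str))]
print(len(bad_rows))
print(bad_rows.head())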
