
I have a dataset with over 60k rows of imaginary purchase data. I need to clean up the names used in certain columns such as Name, City, Store, etc.

Many of the names in the dataset have "typos" consisting of 3 random letters appended to the end of the string. For example, the name "New York" is repeated 70 times, but 10 of those instances look like "New Yorkxyz", with "xyz" being random letters; or "Chicago" is repeated twice and one of those instances is "Chicagoijk". What I've tried so far, given my unfamiliarity with Python, is:

  • First check if a string is repeated multiple times, and replace all the other instances that start with the letters of the said string, with it.

  • If no other instances were found, check again with the same string but without the last 3 letters.
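One way to realize the mapping those two steps are aiming for, sketched on a toy Series built from the example values in the question (`s` stands in for one of the columns; this is my own illustration, not the code below):

```python
import pandas as pd

s = pd.Series(["New York"] * 3 + ["New Yorkxyz", "Chicago", "Chicago", "Chicagoijk"])

# values already seen in the column
known = set(s.value_counts().index)

# a value is treated as a typo when dropping its last 3 characters
# yields a value that also occurs in the column
fixed = s.map(lambda v: v[:-3] if len(v) > 3 and v[:-3] in known else v)

print(fixed.tolist())
```

This relies on the clean spelling appearing at least once in the column, which matches the examples above.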

I'm not sure whether the algorithm I came up with is correct, because so far it hasn't worked, but that may just be down to my unfamiliarity with Python.

The libraries that we were told we can use are pandas, matplotlib, seaborn, nltk, sklearn, numpy, imblearn, and scipy.

import pandas as pd


df = pd.read_csv("purchases.csv")  
 

columns_to_clean = ["City", "Name", "Store"]

def fix_typos_smart(df, columns):
    for col in columns:
        if col in df.columns:
            df[col] = df[col].astype(str)  #ensure all values are strings
            
            #count occurrences of full words
            full_word_counts = df[col].value_counts()

            #create a mapping of words to replace
            correction_map = {}

            for word, count in full_word_counts.items():
                if count > 1:  #if the full word appears more than once
                    for typo in df[col].unique():
                        if typo.startswith(word):  #find all variants that start with the full word
                            correction_map[typo] = word
            
            #if no full-word matches, check for shortened words
            if not correction_map:
                short_word_counts = df[col].apply(lambda x: x[:-3] if len(x) > 3 else x).value_counts()
                valid_shortened_words = set(short_word_counts[short_word_counts > 1].index)

                for word in df[col].unique():
                    if len(word) > 3:
                        shortened = word[:-3]
                        if shortened in valid_shortened_words:
                            correction_map[word] = shortened
            
            
            df[col] = df[col].replace(correction_map)

    return df

#apply the typo correction function
df = fix_typos_smart(df, columns_to_clean)

df.to_csv("purchasesfixed.csv", index=False) #save

asked Mar 31 at 0:12 by Jared Cage; edited Apr 1 at 21:57 by desertnaut
  • Please provide a representative small sample of input data and expected vs. actual output. – Mark Tolonen Commented Mar 31 at 0:34
  • What do you mean exactly by "it has not worked"? Do you have an error message? – rehaqds Commented Mar 31 at 6:32
  • If there was just one occurrence of "New Yorkxyz" in your dataset, how would you know that it was typo? – Adon Bilivit Commented Mar 31 at 7:47

1 Answer

import pandas as pd
from collections import defaultdict


def fix_typos(df, columns_to_clean):
    df_cleaned = df.copy()

    for col in columns_to_clean:
        if col not in df_cleaned.columns:
            continue

        df_cleaned[col] = df_cleaned[col].astype(str)

        # group each value with its base form: a value joins the group of
        # value[:-3] when that shorter string also occurs in the column,
        # i.e. the value looks like "base + 3 extra letters"
        uniques = df_cleaned[col].unique()
        value_set = set(uniques)

        string_groups = defaultdict(list)
        for value in uniques:
            if len(value) > 3 and value[:-3] in value_set:
                base = value[:-3]
            else:
                base = value
            string_groups[base].append(value)

        # map every variant in a group back to the group's base form
        corrections = {}
        for base, variants in string_groups.items():
            for variant in variants:
                if variant != base:
                    corrections[variant] = base

        df_cleaned[col] = df_cleaned[col].replace(corrections)

    return df_cleaned


df = pd.read_csv("purchases.csv")
columns_to_clean = ["City", "Name", "Store"]
cleaned_df = fix_typos(df, columns_to_clean)
print(cleaned_df)

I didn't have your data, so I checked the code against randomly generated samples instead.

I hope the corrected code works for your assignment.
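On a 60k-row dataset, a vectorized variant of the same idea may be faster than a Python-level loop over unique values. This is my own sketch (the column values are made up for illustration), not part of the answer above:

```python
import pandas as pd


def strip_suffixes(series: pd.Series) -> pd.Series:
    s = series.astype(str)
    known = set(s.unique())

    # candidate base form: the value minus its last 3 characters
    base = s.str.slice(0, -3)

    # a value is a typo when it is long enough and its base form
    # also occurs somewhere in the column
    is_typo = (s.str.len() > 3) & base.isin(known)

    # keep the original value where it is not a typo, else use the base
    return s.where(~is_typo, base)


df = pd.DataFrame({"City": ["New York", "New York", "New Yorkxyz",
                            "Chicago", "Chicagoijk"]})
df["City"] = strip_suffixes(df["City"])
```

Like the loop-based version, this assumes the clean spelling occurs at least once; a name that only ever appears with a suffix cannot be recovered this way.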

Tags: python | How to fix strings with typos in pandas dataframe | Stack Overflow