
I have a dataset with over 60k rows of imaginary purchase data. I need to clean up the names used in certain columns such as Name, City, Store, etc.

Many of the names in the dataset have "typos" consisting of 3 random letters appended to the end of the string. For example, the name "New York" is repeated 70 times, but 10 of those instances look like "New Yorkxyz", with "xyz" being random letters; or "Chicago" is repeated twice and one of those instances is "Chicagoijk". What I've tried so far, given my unfamiliarity with Python, is:

  • First check if a string is repeated multiple times, and replace all the other instances that start with the letters of the said string, with it.

  • If no other instances were found, check again with the same string but without the last 3 letters.
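One way to realize the mapping those two steps are aiming for, sketched on a toy Series built from the example values in the question (`s` stands in for one of the columns; this is my own illustration, not the code below):

```python
import pandas as pd

s = pd.Series(["New York"] * 3 + ["New Yorkxyz", "Chicago", "Chicago", "Chicagoijk"])

# values already seen in the column
known = set(s.value_counts().index)

# a value is treated as a typo when dropping its last 3 characters
# yields a value that also occurs in the column
fixed = s.map(lambda v: v[:-3] if len(v) > 3 and v[:-3] in known else v)

print(fixed.tolist())
```

This relies on the clean spelling appearing at least once in the column, which matches the examples above.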

I'm not sure whether the algorithm I came up with is correct, because so far it hasn't worked, but that may just be down to my unfamiliarity with Python.

The libraries that we were told we can use are pandas, matplotlib, seaborn, nltk, sklearn, numpy, imblearn, and scipy.

import pandas as pd


df = pd.read_csv("purchases.csv")  
 

columns_to_clean = ["City", "Name", "Store"]

def fix_typos_smart(df, columns):
    for col in columns:
        if col in df.columns:
            df[col] = df[col].astype(str)  #ensure all values are strings
            
            #count occurrences of full words
            full_word_counts = df[col].value_counts()

            #create a mapping of words to replace
            correction_map = {}

            for word, count in full_word_counts.items():
                if count > 1:  #if the full word appears more than once
                    for typo in df[col].unique():
                        if typo.startswith(word):  #find all variants that start with the full word
                            correction_map[typo] = word
            
            #if no full-word matches, check for shortened words
            if not correction_map:
                short_word_counts = df[col].apply(lambda x: x[:-3] if len(x) > 3 else x).value_counts()
                valid_shortened_words = set(short_word_counts[short_word_counts > 1].index)

                for word in df[col].unique():
                    if len(word) > 3:
                        shortened = word[:-3]
                        if shortened in valid_shortened_words:
                            correction_map[word] = shortened
            
            
            df[col] = df[col].replace(correction_map)

    return df

#apply the typo correction function
df = fix_typos_smart(df, columns_to_clean)

df.to_csv("purchasesfixed.csv", index=False) #save

asked Mar 31 at 0:12 by Jared Cage; edited Apr 1 at 21:57 by desertnaut
  • Please provide a representative small sample of input data and expected vs. actual output. – Mark Tolonen Commented Mar 31 at 0:34
  • What do you mean exactly by "it has not worked"? Do you have an error message? – rehaqds Commented Mar 31 at 6:32
  • If there was just one occurrence of "New Yorkxyz" in your dataset, how would you know that it was typo? – Adon Bilivit Commented Mar 31 at 7:47

1 Answer

import pandas as pd
from collections import defaultdict


def fix_typos(df, columns_to_clean):
    df_cleaned = df.copy()

    for col in columns_to_clean:
        if col not in df_cleaned.columns:
            continue

        df_cleaned[col] = df_cleaned[col].astype(str)

        # group each value with its base form: a value joins the group of
        # value[:-3] when that shorter string also occurs in the column,
        # i.e. the value looks like "base + 3 extra letters"
        uniques = df_cleaned[col].unique()
        value_set = set(uniques)

        string_groups = defaultdict(list)
        for value in uniques:
            if len(value) > 3 and value[:-3] in value_set:
                base = value[:-3]
            else:
                base = value
            string_groups[base].append(value)

        # map every variant in a group back to the group's base form
        corrections = {}
        for base, variants in string_groups.items():
            for variant in variants:
                if variant != base:
                    corrections[variant] = base

        df_cleaned[col] = df_cleaned[col].replace(corrections)

    return df_cleaned


df = pd.read_csv("purchases.csv")
columns_to_clean = ["City", "Name", "Store"]
cleaned_df = fix_typos(df, columns_to_clean)
print(cleaned_df)

I didn't have your data, so I checked the code against randomly generated samples instead.

I hope the corrected code works for your assignment.
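On a 60k-row dataset, a vectorized variant of the same idea may be faster than a Python-level loop over unique values. This is my own sketch (the column values are made up for illustration), not part of the answer above:

```python
import pandas as pd


def strip_suffixes(series: pd.Series) -> pd.Series:
    s = series.astype(str)
    known = set(s.unique())

    # candidate base form: the value minus its last 3 characters
    base = s.str.slice(0, -3)

    # a value is a typo when it is long enough and its base form
    # also occurs somewhere in the column
    is_typo = (s.str.len() > 3) & base.isin(known)

    # keep the original value where it is not a typo, else use the base
    return s.where(~is_typo, base)


df = pd.DataFrame({"City": ["New York", "New York", "New Yorkxyz",
                            "Chicago", "Chicagoijk"]})
df["City"] = strip_suffixes(df["City"])
```

Like the loop-based version, this assumes the clean spelling occurs at least once; a name that only ever appears with a suffix cannot be recovered this way.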

Tags: python | How to fix strings with typos in pandas dataframe | Stack Overflow