python - How to fix strings with typos in pandas dataframe? - Stack Overflow
I have a dataset with over 60k rows representing an imaginary list of purchases. I need to clean up the names used in certain columns such as Name, City, Store, etc.
A lot of the names in the dataset have "typos" consisting of 3 random letters added to the end of the string. For example, the name "New York" appears 70 times, but 10 of those instances look like "New Yorkxyz", with "xyz" being random letters; or "Chicago" appears twice, and one of those instances is "Chicagoijk". What I've tried so far, given my unfamiliarity with Python, is:
First, check whether a string is repeated multiple times, and replace all other instances that start with that string with it.
If no other instances are found, check again with the same string but without its last 3 letters.
I'm not sure the algorithm I came up with is correct, because so far it hasn't worked, but that might be due to my unfamiliarity with Python.
The libraries that we were told we can use are pandas, matplotlib, seaborn, nltk, sklearn, numpy, imblearn, and scipy.
import pandas as pd
import re
from fuzzywuzzy import process

df = pd.read_csv("purchases.csv")
columns_to_clean = ["City", "Name", "Store"]

def fix_typos_smart(df, columns):
    for col in columns:
        if col in df.columns:
            df[col] = df[col].astype(str)  # ensure all values are strings
            # count occurrences of full words
            full_word_counts = df[col].value_counts()
            # create a mapping of words to replace
            correction_map = {}
            for word, count in full_word_counts.items():
                if count > 1:  # if the full word appears more than once
                    for typo in df[col].unique():
                        if typo.startswith(word):  # find all variants that start with the full word
                            correction_map[typo] = word
            # if no full-word matches, check for shortened words
            if not correction_map:
                short_word_counts = df[col].apply(lambda x: x[:-3] if len(x) > 3 else x).value_counts()
                valid_shortened_words = set(short_word_counts[short_word_counts > 1].index)
                for word in df[col].unique():
                    if len(word) > 3:
                        shortened = word[:-3]
                        if shortened in valid_shortened_words:
                            correction_map[word] = shortened
            df[col] = df[col].replace(correction_map)
    return df

# apply the typo correction function
df = fix_typos_smart(df, columns_to_clean)
df.to_csv("purchasesfixed.csv", index=False)  # save
edited Apr 1 at 21:57 by desertnaut
asked Mar 31 at 0:12 by Jared Cage
- Please provide a representative small sample of input data and expected vs. actual output. – Mark Tolonen Commented Mar 31 at 0:34
- What exactly do you mean by "it hasn't worked"? Do you get an error message? – rehaqds Commented Mar 31 at 6:32
- If there was just one occurrence of "New Yorkxyz" in your dataset, how would you know that it was typo? – Adon Bilivit Commented Mar 31 at 7:47
1 Answer
import pandas as pd
from collections import defaultdict

def fix_typos(df, columns_to_clean):
    df_cleaned = df.copy()
    for col in columns_to_clean:
        if col not in df_cleaned.columns:
            continue
        df_cleaned[col] = df_cleaned[col].astype(str)
        values = set(df_cleaned[col].unique())
        # group every value under the string obtained by dropping
        # its last 3 characters
        string_groups = defaultdict(list)
        for value in values:
            if len(value) > 3:
                string_groups[value[:-3]].append(value)
        corrections = {}
        for base, variants in string_groups.items():
            if base in values:
                # the clean form itself appears in the column:
                # map each "base + 3 random letters" variant back to it
                for variant in variants:
                    corrections[variant] = base
            elif len(variants) > 1:
                # no clean form appears, but several values share the same
                # stem, so treat the shared stem as the intended name
                for variant in variants:
                    corrections[variant] = base
        df_cleaned[col] = df_cleaned[col].replace(corrections)
    return df_cleaned

df = pd.read_csv("purchases.csv")
columns_to_clean = ["City", "Name", "Store"]
cleaned_df = fix_typos(df, columns_to_clean)
print(cleaned_df)
I didn't have your data, so I verified the code against randomly generated samples matching your description.
I hope this corrected code works for you.
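Since the original CSV isn't available, here is a minimal self-contained sketch of the suffix-stripping idea on invented data (the column names match the question; the values and the `strip_suffix_typos` helper are made up for illustration). A value is treated as a typo when removing its last 3 characters leaves a string that already exists in the same column:

```python
import pandas as pd

# hypothetical sample mimicking the question's typo pattern:
# three random letters appended to an otherwise clean name
df = pd.DataFrame({
    "City": ["New York", "New York", "New Yorkxyz", "Chicago", "Chicagoijk"],
    "Name": ["Alice", "Alicetqp", "Bob", "Bob", "Bob"],
})

def strip_suffix_typos(df, columns):
    """Map any value whose last 3 characters, when removed,
    leave a value that already exists in the same column."""
    df = df.copy()
    for col in columns:
        values = set(df[col].astype(str))
        corrections = {v: v[:-3] for v in values
                       if len(v) > 3 and v[:-3] in values}
        df[col] = df[col].replace(corrections)
    return df

cleaned = strip_suffix_typos(df, ["City", "Name"])
print(cleaned["City"].tolist())  # ['New York', 'New York', 'New York', 'Chicago', 'Chicago']
print(cleaned["Name"].tolist())  # ['Alice', 'Alice', 'Bob', 'Bob', 'Bob']
```

Note that this simple check only fires when the clean name also appears somewhere in the column; names that occur only in their typo form would need the grouping step described in the question.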