For example:
S1: "some filename contains few words.txt"
S2: "some filename contains few words - draft.txt"
S3: "some filename contains few words - another draft.txt"
S4: "some filename not contains few words.txt"
It is important to note that I may get S2 or S3 as the first string, with the others to be matched against it.
EDITED: I have the "master" string, and I need to find matches.
Let's say that in the first round I found the typos.
Now I have to match only whole words.
I want to be able to decide that 5 out of 7 words match, or 7 out of 10. The exact value of "X out of Y" is less important.
The important thing is how to find that the difference is X words, no matter where they are in the sentence.
Thanks
asked Jun 20, 2012 at 7:19 by ps ps, edited Jun 20, 2012 at 8:50
- Add appropriate language tag. – dirkgently Commented Jun 20, 2012 at 7:27
- Not sure if this can be accomplished with RegEx, but take a look at the Levenshtein distance algorithm: dotnetperls.com/levenshtein – Oliver Kötter Commented Jun 20, 2012 at 7:29
- This isn't really appropriate for regex; more like finding common substrings. – Joey Commented Jun 20, 2012 at 7:41
- @dirkgentle The logic is what matters to me. The rest is a technicality. I'd prefer C#, JavaScript or Java. – ps ps Commented Jun 20, 2012 at 7:44
- There are multiple similarity measures you could use for this. Levenshtein distance is certainly a popular candidate, but it might not correspond exactly to your intuition of a similarity percentage. You could also look at the n-gram algorithms used e.g. for language identification; Google for TextCat. – tripleee Commented Jun 20, 2012 at 7:45
2 Answers
This isn't a regex problem.
You don't specify a language, but if you're using Java, there's the getLevenshteinDistance method of StringUtils (Apache Commons Lang). From the javadocs:
Find the Levenshtein distance between two Strings.
This is the number of changes needed to change one String into another, where each change is a single character modification (deletion, insertion or substitution).
Usage:
    int distance = StringUtils.getLevenshteinDistance(
        "some filename contains few words.txt",
        "some filename not contains few words.txt"
    );
To match by some percentage, you have to decide which string is the "master", since the input strings can have different lengths. The distance might consist entirely of deletions: "cat" and "cataract" have a distance of 5. Defining what a "90% match" should be is also a bit difficult. Look at our cat example: 100% of the string "cat" is found in "cataract", but they're not exactly the same string. You'll have to decide these rules depending on your use case.
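As a rough illustration of the percentage idea, here is a minimal self-contained sketch. It computes the Levenshtein distance with a plain dynamic-programming loop (so it does not depend on StringUtils) and normalizes by the length of the longer string. The class name LevenshteinSimilarity, the method names, and the normalization rule are assumptions made for this example, not something the javadocs prescribe; dividing by the length of the "master" string instead would be an equally defensible choice.

```java
final class LevenshteinSimilarity {

    // Two-row dynamic-programming Levenshtein distance
    // (single-character insertions, deletions and substitutions).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed rule for this sketch: 1 - distance / length of the longer string.
    // Two empty strings are treated as identical.
    static double similarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / longer;
    }
}
```

Under this rule "cat" and "cataract" score 1 - 5/8 = 0.375, while S1 and S4 from the question differ only by the four inserted characters of "not " and score 1 - 4/40 = 0.9.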
Update
If your "difference" should be word-based, it'd be relatively easy to split the string on word boundaries and construct a Map from each resulting word to its count. Comparing the generated maps for each string should then give you a rough "similarity" measurement. For example:
    import java.util.HashMap;
    import java.util.Map;

    // Count how many times each whitespace-separated word occurs in str.
    public HashMap<String, Integer> countWords(String str) {
        HashMap<String, Integer> counts = new HashMap<String, Integer>();
        for (String s : str.split("\\s+")) {
            if (!s.isEmpty()) {
                if (counts.containsKey(s)) {
                    counts.put(s, counts.get(s) + 1);
                } else {
                    counts.put(s, 1);
                }
            }
        }
        return counts;
    }

    // ...

    String s1 = "some filename contains few words.txt";
    String s2 = "some filename not contains few words.txt";
    HashMap<String, Integer> s1Counts = countWords(s1);
    HashMap<String, Integer> s2Counts = countWords(s2);

    // assume s1 is the "master" string; count its total number of words
    int s1Total = 0, s2Total = 0;
    for (Integer i : s1Counts.values()) {
        s1Total += i;
    }

    // iterate over the words in s1 and count how many of them also occur in s2
    // (a repeated word counts at most as often as it appears in s2)
    for (Map.Entry<String, Integer> entry : s1Counts.entrySet()) {
        if (s2Counts.containsKey(entry.getKey())) {
            s2Total += Math.min(entry.getValue(), s2Counts.get(entry.getKey()));
        }
    }

    // result
    System.out.println(s2Total + " out of " + s1Total + " words match.");
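The count above only looks from the "master" string outward, so an extra word in the other string (such as "not") goes unnoticed. If what you want is "the difference is X words, no matter where they are", a possible variant is to compare the two word counts in both directions. The sketch below is only one way to do that; the class name WordDifference and the method names are made up for this example, and it keeps the same whitespace splitting as above, so "words.txt" and "words" count as different words.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class WordDifference {

    // Count how many times each whitespace-separated word occurs.
    static Map<String, Integer> countWords(String str) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : str.split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Number of words that would have to be added or removed to turn one
    // bag of words into the other. Word order is ignored.
    static int difference(String a, String b) {
        Map<String, Integer> countsA = countWords(a);
        Map<String, Integer> countsB = countWords(b);
        Set<String> allWords = new HashSet<>(countsA.keySet());
        allWords.addAll(countsB.keySet());
        int diff = 0;
        for (String word : allWords) {
            diff += Math.abs(countsA.getOrDefault(word, 0) - countsB.getOrDefault(word, 0));
        }
        return diff;
    }
}
```

For the strings in the question, S1 vs S4 gives a difference of 1 (the word "not"). S1 vs S2 gives 4, because "words.txt" on one side and "words", "-" and "draft.txt" on the other all count as unmatched; stripping the file extension and punctuation before splitting would lower that number.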
I think it's worth taking a look at the Apache commons-text class JaroWinklerDistance:
Find the Jaro Winkler Distance which indicates the similarity score between two CharSequences.
distance.apply(null, null) = IllegalArgumentException
distance.apply("","") = 0.0
distance.apply("","a") = 0.0
distance.apply("aaapppp", "") = 0.0
distance.apply("frog", "fog") = 0.93
distance.apply("fly", "ant") = 0.0
distance.apply("elephant", "hippo") = 0.44
distance.apply("hippo", "elephant") = 0.44
distance.apply("hippo", "zzzzzzzz") = 0.0
distance.apply("hello", "hallo") = 0.88
distance.apply("ABC Corporation", "ABC Corp") = 0.93
distance.apply("D N H Enterprises Inc", "D & H Enterprises, Inc.") = 0.95
distance.apply("My Gym Children's Fitness Center", "My Gym. Childrens Fitness") = 0.92
distance.apply("PENNSYLVANIA", "PENNCISYLVNIA") = 0.88
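To give an idea of where such scores come from, here is a small from-scratch sketch of the textbook Jaro-Winkler similarity. It is not the commons-text implementation and may differ from it in edge cases; the class name JaroWinklerSketch and the constants (scaling factor 0.1, prefix limit 4, boost threshold 0.7) are the conventional Winkler values, assumed here for illustration.

```java
final class JaroWinklerSketch {

    // Jaro similarity in [0, 1]; returns 0.0 if either input is empty.
    static double jaro(CharSequence a, CharSequence b) {
        int la = a.length();
        int lb = b.length();
        if (la == 0 || lb == 0) {
            return 0.0;
        }
        // Characters match only if they are equal and at most `window` positions apart.
        int window = Math.max(0, Math.max(la, lb) / 2 - 1);
        boolean[] usedA = new boolean[la];
        boolean[] usedB = new boolean[lb];
        int matches = 0;
        for (int i = 0; i < la; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(lb - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!usedB[j] && a.charAt(i) == b.charAt(j)) {
                    usedA[i] = true;
                    usedB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Walk the matched characters of both strings in order and count
        // the positions where they disagree.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < la; i++) {
            if (!usedA[i]) {
                continue;
            }
            while (!usedB[j]) {
                j++;
            }
            if (a.charAt(i) != b.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / la + m / lb + (m - transpositions) / m) / 3.0;
    }

    // Winkler's adjustment: boost the Jaro score for a shared prefix of up to 4 characters.
    static double jaroWinkler(CharSequence a, CharSequence b) {
        double j = jaro(a, b);
        if (j < 0.7) {
            return j; // assumed threshold: low scores get no prefix boost
        }
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

With these assumptions, "frog" vs "fog" gives 0.925 and "hello" vs "hallo" gives 0.88, which is consistent with the rounded values listed above.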
Source: c# - How to match 2 strings by X% (i.e. >90% matching) - Stack Overflow