c# - How to match 2 strings by X% (i.e. >90% matching) - Stack Overflow


For example:

S1: "some filename contains few words.txt"
S2: "some filename contains few words - draft.txt"
S3: "some filename contains few words - another draft.txt"
S4: "some filename not contains few words.txt"

It is important to note that I may get S2 or S3 as the first string and need the others to match it.

EDITED: I have the "master" string, and I need to find matches.

Let's say that in the first round I found the typos.

Now I have to match only whole words.

I want to be able to decide that 5 out of 7 words match, or 7 out of 10. The exact number of "X out of Y" is less important.

The important thing is how to find that the difference is X words, no matter where they are in the sentence.

Thanks

asked Jun 20, 2012 at 7:19 by ps ps, edited Jun 20, 2012 at 8:50

Add appropriate language tag. – dirkgently Commented Jun 20, 2012 at 7:27
Not sure if this can be accomplished with RegEx, but take a look at the Levenshtein distance algorithm: dotnetperls.com/levenshtein – Oliver Kötter Commented Jun 20, 2012 at 7:29
This isn't really appropriate for regex; more like finding common substrings. – Joey Commented Jun 20, 2012 at 7:41
@dirkgentle The logic is of importance to me. The rest is technicality.. I'd rather c#, javascript or java. – ps ps Commented Jun 20, 2012 at 7:44
There are multiple similarity measures you could use for this. Levenshtein distance is certainly a popular candidate, but it might not correspond exactly to your intuition of a similarity percentage. You could also look at the n-gram algorithms used e.g. for language identification; Google for TextCat. – tripleee Commented Jun 20, 2012 at 7:45

2 Answers

This isn't a regex problem.

You don't specify a language, but if you're using Java, there's the getLevenshteinDistance method of StringUtils in Apache Commons Lang. From the javadocs:

Find the Levenshtein distance between two Strings.

This is the number of changes needed to change one String into another, where each change is a single character modification (deletion, insertion or substitution).

Usage:

int distance = StringUtils.getLevenshteinDistance(
    "some filename contains few words.txt",
    "some filename not contains few words.txt"
);

To match by some percentage, you have to decide which string is the "master", since the input strings can have different lengths. The distance might consist entirely of insertions or deletions, so "cat" and "cataract" have a distance of 5. Defining what a "90% match" should be is also a bit difficult. Look at our cat example: 100% of the string "cat" is found in "cataract", but they're not the same string. You'll have to decide these rules depending on your use case.
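
One common way to turn the distance into a percentage is to divide it by the length of the longer string and subtract the result from 1. The sketch below is only an illustration of that idea: the class name LevenshteinSimilarity and the choice to normalise by the longer string are assumptions, not part of StringUtils, and the distance is computed by hand so that the example has no dependencies.

```java
final class LevenshteinSimilarity {

    // Edit distance by dynamic programming, keeping only two rows of the table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: 1.0 for identical strings, 0.0 when the distance
    // equals the length of the longer string.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }
}
```

With this normalisation, "cat" against "cataract" scores 1 - 5/8 = 0.375, while S1 against S4 from the question scores 1 - 4/40 = 0.9, because the only difference is the four inserted characters "not ".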

Update

If your "difference" should be word-based, it'd be relatively easy to split the string on word boundaries and construct a Map from the resultant word to the count for each word. Comparing the generated maps for each string should then give you a rough "similarity" measurement. For example:

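import java.util.HashMap;
import java.util.Map;

// Counts how often each whitespace-separated word occurs in str.
public HashMap<String, Integer> countWords(String str) {
    HashMap<String, Integer> counts = new HashMap<String, Integer>();
    for(String s : str.split("\\s+")) {
        // split() can yield an empty first token when str starts with whitespace
        if(!s.isEmpty()) {
            if(counts.containsKey(s)) {
                counts.put(s, counts.get(s) + 1);
            } else {
                counts.put(s, 1);
            }
        }
    }
    return counts;
}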

// ...

String s1 = "some filename contains few words.txt";
String s2 = "some filename not contains few words.txt";
HashMap<String, Integer> s1Counts = countWords(s1);
HashMap<String, Integer> s2Counts = countWords(s2);
// assume s1 is "master" string, count the total number of words
int s1Total = 0, s2Total = 0;
for(Integer i : s1Counts.values()) {
    s1Total += i;
}
// iterate over words in s1, find the number of matching words in s2
for(Map.Entry<String, Integer> entry : s1Counts.entrySet()) {
    if(s2Counts.containsKey(entry.getKey())) {
        if(s2Counts.get(entry.getKey()) >= entry.getValue()) {
            s2Total += entry.getValue();
        } else {
            s2Total += s2Counts.get(entry.getKey());
        }
    }
}
// result
System.out.println(s2Total + " out of " + s1Total + " words match.");
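
The same counting idea can be wrapped into a ratio, so that a threshold such as "5 out of 7" becomes a single comparison. The following is a rough sketch under a few assumptions that are not in the code above: the class and method names are made up for illustration, words are runs of letters or digits compared case-insensitively (so "words.txt" yields "words" and "txt", and a lone "-" is dropped), and symmetricRatio divides by the larger word count so that extra words in the candidate lower the score.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class WordOverlap {

    // Assumption: a word is a run of letters or digits, compared case-insensitively.
    static Map<String, Integer> countWords(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Total number of words, counting repeats.
    private static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int n : counts.values()) {
            sum += n;
        }
        return sum;
    }

    // Number of words shared by both strings, counting repeats at most as often
    // as they occur in each string.
    private static int shared(Map<String, Integer> a, Map<String, Integer> b) {
        int sum = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            sum += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        return sum;
    }

    // Fraction of the master's words that also occur in the candidate.
    static double matchRatio(String master, String candidate) {
        Map<String, Integer> m = countWords(master);
        int words = total(m);
        return words == 0 ? 0.0 : (double) shared(m, countWords(candidate)) / words;
    }

    // Shared words divided by the larger word count, so extra words count against the match.
    static double symmetricRatio(String a, String b) {
        Map<String, Integer> ca = countWords(a);
        Map<String, Integer> cb = countWords(b);
        int words = Math.max(total(ca), total(cb));
        return words == 0 ? 0.0 : (double) shared(ca, cb) / words;
    }
}
```

Taking S1 from the question as the master, matchRatio gives 1.0 for S2, S3 and S4, because every word of S1 appears in each of them. symmetricRatio gives 6/7 for S2 and S4 and 6/8 for S3, which may be closer to the "X out of Y" measure the question asks about. A candidate is then accepted when its ratio is at or above whatever threshold you choose.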

I think it is worth mentioning the Apache commons-text class JaroWinklerDistance:

Find the Jaro Winkler Distance which indicates the similarity score between two CharSequences.
 distance.apply(null, null)          = IllegalArgumentException
 distance.apply("","")               = 0.0
 distance.apply("","a")              = 0.0
 distance.apply("aaapppp", "")       = 0.0
 distance.apply("frog", "fog")       = 0.93
 distance.apply("fly", "ant")        = 0.0
 distance.apply("elephant", "hippo") = 0.44
 distance.apply("hippo", "elephant") = 0.44
 distance.apply("hippo", "zzzzzzzz") = 0.0
 distance.apply("hello", "hallo")    = 0.88
 distance.apply("ABC Corporation", "ABC Corp") = 0.93
 distance.apply("D N H Enterprises Inc", "D & H Enterprises, Inc.") = 0.95
 distance.apply("My Gym Children's Fitness Center", "My Gym. Childrens Fitness") = 0.92
 distance.apply("PENNSYLVANIA", "PENNCISYLVNIA")    = 0.88
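
To show roughly what such a score measures, here is a dependency-free sketch of the textbook Jaro-Winkler similarity. It is not the commons-text implementation and may differ from it on some inputs; the class name, the 0.1 prefix scaling factor and the prefix cap of 4 characters are the usual textbook choices and are assumptions here.

```java
final class JaroWinkler {

    // Jaro similarity: based on matching characters found within a sliding
    // window, and on how many of those matches appear in a different order.
    static double jaro(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0; // same convention as the javadoc table above
        }
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Walk both strings' matched characters in order and count the
        // positions where they disagree; a transposition is two such positions.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) {
                continue;
            }
            while (!bMatched[k]) {
                k++;
            }
            if (a.charAt(i) != b.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double t = outOfOrder / 2.0;
        return (m / a.length() + m / b.length() + (m - t) / m) / 3.0;
    }

    // Winkler's adjustment: boost the Jaro score for a shared prefix of up to 4 characters.
    static double jaroWinkler(String a, String b) {
        double score = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return score + prefix * 0.1 * (1.0 - score);
    }
}
```

For "frog" and "fog" this gives 0.925, and for "hello" and "hallo" about 0.88, in line with the rounded values in the table above. Because the score works on characters and rewards a common prefix, it suits the typo-finding round better than the whole-word round described in the question.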
