admin管理员组文章数量:1400681
Given a string, I want to find the longest substring such that the count of each character does not exceed the number of unique characters in that substring.
For example:
Input: 'aaabb'
Output: 'aabb'
I understand how to do this brute-force: Generate all substrings while tracking the count of each character and the number of distinct characters in each substring. If there's a substring where the condition is violated, we move the start of the window forward. Else, we move the end of the window forward.
The brute-force approach is in O(n^2)
. Is there a more efficient way to do this?
Given a string, I want to find the longest substring such that the count of each character does not exceed the number of unique characters in that substring.
For example:
Input: 'aaabb'
Output: 'aabb'
I understand how to do this brute-force: Generate all substrings while tracking the count of each character and the number of distinct characters in each substring. If there's a substring where the condition is violated, we move the start of the window forward. Else, we move the end of the window forward.
The brute-force approach is in O(n^2)
. Is there a more efficient way to do this?
2 Answers
Reset to default 4Here's an O(n^2) algorithm, I think. (note that I don't think the O(n^2) solution given in the question works)
O(n^2) in the n >>> # of unique characters
case:
Let S be the input string.
- Scan
S
and produceC
the set of distinct characters inS
. Let|C|
denote the length ofC
. - For each character
c
in C, produce a cumulative-occurrences arrayO_c
, where eachO_c[i]
is the number ofc
in the substringS[0:i]
. This is O(n * |C|). - Given a start and endpoint, we can get the number of characters
c
in the substring viaO_c[end] - O_c[start]
. Obtain the maximum occurrences and the number of non-zero occurrences and use these to check validity. - Using 3, check each start and endpoint in O(n^2 * |C|). If we can assume that the number of characters is small compared to
n
, that's O(n^2).
It seems like if there is an improvement over O(n^2), it'd be some way to avoid checking every start and endpoint in 4.
O(n) in the n >>> (# of unique characters)^2
case:
- If for some substring
S[i:j]
there are more occurrences of some characterc
than|C|
, then anyS[i:k]
withk > j
is an invalid substring (because all the unique characters in the string cannot make up for the current occurrences ofc
). Thus, we can immediately incrementi
once we find this condition. - For substring of length
l
, at least one character will have at leastceil(l / |C|)
occurrences (pigeonhole argument). - So the maximal length of a valid substring is
|C|^2
.
If we have few distinct characters m (say, if string consists of small letters only) we can check substrings with exactly m distinct characters each of them has frequency m or less, if there are no such substrings we can check for m - 1 distinct characters etc.
In the worst case we have O(n * m) time
C# Code:
public static string LongestSubstring(string value) {
if (string.IsNullOrEmpty(value))
return "";
string result = value.Substring(0, 1);
int distinct = value.Distinct().Count();
for (int count = distinct; count > 1; --count) {
var freqs = new Dictionary<char, int>();
var found = false;
for (int left = 0, right = 0; right < value.Length; ++right) {
var letter = value[right];
if (freqs.TryGetValue(letter, out var freq))
freqs[letter] += 1;
else
freqs.Add(letter, 1);
while (freqs.Count > count || freqs[letter] > count) {
var leftLetter = value[left++];
if (freqs[leftLetter] == 1)
freqs.Remove(leftLetter);
else
freqs[leftLetter] -= 1;
}
if (freqs.Count == count && result.Length < right - left + 1) {
found = true;
result = value.Substring(left, right - left + 1);
}
}
if (found)
return result;
}
return result;
}
Demo:
var tests = new string[] {
"aaaa",
"aaabb",
"aaabbc",
"aaaabbaaaaaaccaaaaa",
"aaaabbaaadaaaccaaaaa",
};
var report = string.Join(Environment.NewLine, tests
.Select(test => $"{test,30} -> {LongestSubstring(test)}"));
Console.WriteLine(report);
Output:
aaaa -> a
aaabb -> aabb
aaabbc -> aaabbc
aaaabbaaaaaaccaaaaa -> aabb
aaaabbaaadaaaccaaaaa -> bbaaad
Fiddle
本文标签:
版权声明:本文标题:algorithm - Find longest substring s.t the count of each character does not exceed the number of unique characters - Stack Overf 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1744224782a2596036.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
AAABCDAAAAAEFGHAAA
the substringBCDAAAAA
is invalid but becomes valid again once you reachBCDAAAAAEFGH
– Kaia Commented Mar 25 at 0:03