admin管理员组

文章数量:1400681

Given a string, I want to find the longest substring such that the count of each character does not exceed the number of unique characters in that substring.

For example:

Input: 'aaabb'
Output: 'aabb'

I understand how to do this brute-force: Generate all substrings while tracking the count of each character and the number of distinct characters in each substring. If there's a substring where the condition is violated, we move the start of the window forward. Else, we move the end of the window forward.

The brute-force approach is in O(n^2). Is there a more efficient way to do this?

Given a string, I want to find the longest substring such that the count of each character does not exceed the number of unique characters in that substring.

For example:

Input: 'aaabb'
Output: 'aabb'

I understand how to do this brute-force: Generate all substrings while tracking the count of each character and the number of distinct characters in each substring. If there's a substring where the condition is violated, we move the start of the window forward. Else, we move the end of the window forward.

The brute-force approach is in O(n^2). Is there a more efficient way to do this?

Share Improve this question asked Mar 24 at 22:59 Mark CarothersMark Carothers 311 bronze badge 3
  • I don't think the brute force algorithm described works. e.g. "aaaabcdef" we would check "aa" and decide it violates the condition and move the start of window forward, even though if we continue moving the tail forward we'll eventually reach a substring that meets the condition, right? – Kaia Commented Mar 24 at 23:38
  • The most brute-force possible is "choose a start and end. then iterate through that substring in O(n) time and see if it works" for a O(n^3) performance, right? – Kaia Commented Mar 24 at 23:40
  • And in particular, it's not just a matter of choosing the right starting point (as it might seem from aaaabcdef, where if we start from the right side with a greedy approach we're golden). I don't think you can pick a starting pivot and then add items to the left or right until you have a maximal thing, since e.g. AAABCDAAAAAEFGHAAA the substring BCDAAAAA is invalid but becomes valid again once you reach BCDAAAAAEFGH – Kaia Commented Mar 25 at 0:03
Add a comment  | 

2 Answers 2

Reset to default 4

Here's an O(n^2) algorithm, I think. (note that I don't think the O(n^2) solution given in the question works)

O(n^2) in the n >>> # of unique characters case:

Let S be the input string.

  1. Scan S and produce C the set of distinct characters in S. Let |C| denote the length of C.
  2. For each character c in C, produce a cumulative-occurrences array O_c, where each O_c[i] is the number of c in the substring S[0:i]. This is O(n * |C|).
  3. Given a start and endpoint, we can get the number of characters c in the substring via O_c[end] - O_c[start]. Obtain the maximum occurrences and the number of non-zero occurrences and use these to check validity.
  4. Using 3, check each start and endpoint in O(n^2 * |C|). If we can assume that the number of characters is small compared to n, that's O(n^2).

It seems like if there is an improvement over O(n^2), it'd be some way to avoid checking every start and endpoint in 4.

O(n) in the n >>> (# of unique characters)^2 case:

  1. If for some substring S[i:j] there are more occurrences of some character c than |C|, then any S[i:k] with k > j is an invalid substring (because all the unique characters in the string cannot make up for the current occurrences of c). Thus, we can immediately increment i once we find this condition.
  2. For substring of length l, at least one character will have at least ceil(l / |C|) occurrences (pigeonhole argument).
  3. So the maximal length of a valid substring is |C|^2.

If we have few distinct characters m (say, if string consists of small letters only) we can check substrings with exactly m distinct characters each of them has frequency m or less, if there are no such substrings we can check for m - 1 distinct characters etc.

In the worst case we have O(n * m) time

C# Code:

public static string LongestSubstring(string value) {
  if (string.IsNullOrEmpty(value))
    return "";

  string result = value.Substring(0, 1);

  int distinct = value.Distinct().Count();

  for (int count = distinct; count > 1; --count) {
    var freqs = new Dictionary<char, int>();
    var found = false;

    for (int left = 0, right = 0; right < value.Length; ++right) {
      var letter = value[right];

      if (freqs.TryGetValue(letter, out var freq))
        freqs[letter] += 1;
      else
        freqs.Add(letter, 1);

      while (freqs.Count > count || freqs[letter] > count) {
        var leftLetter = value[left++];

        if (freqs[leftLetter] == 1)
          freqs.Remove(leftLetter);
        else
          freqs[leftLetter] -= 1;
      }

      if (freqs.Count == count && result.Length < right - left + 1) {
        found = true;
        result = value.Substring(left, right - left + 1);
      }
    }

    if (found)
      return result;
  }

  return result;
}

Demo:

var tests = new string[] {
  "aaaa",
  "aaabb",
  "aaabbc",
  "aaaabbaaaaaaccaaaaa",
  "aaaabbaaadaaaccaaaaa",
};
      
var report = string.Join(Environment.NewLine, tests
  .Select(test => $"{test,30} -> {LongestSubstring(test)}"));
      
Console.WriteLine(report);  

Output:

                          aaaa -> a
                         aaabb -> aabb
                        aaabbc -> aaabbc
           aaaabbaaaaaaccaaaaa -> aabb
          aaaabbaaadaaaccaaaaa -> bbaaad

Fiddle

本文标签: