删除字符串中的重复字符

This is from cracking the Coding Interview Book.

这是来自破解编码面试手册。

Design an algorithm and write code to remove the duplicate characters in a string without using any additional buffer. NOTE: One or two additional variables are fine. An extra copy of the array is not.

设计一种算法，并编写代码，在不使用任何额外缓冲区的情况下删除字符串中的重复字符。注意:一个或两个额外的变量是可以的。数组的额外副本不是。

In the book it says time complexity is $O(N^2)$. How can we tell that the time complexity is $O(N^2)$ from the solution? I have questions as to how the solution is removing duplicate characters. I have included them in the inline comments below.

在书中时间复杂度是O(N ^ 2)美元。我们如何告诉时间复杂度是O(N ^ 2)美元的解决方案吗?我对如何删除重复字符有疑问。我已经将它们包含在下面的内联注释中。

   public static void removeDuplicates(char[] str) {
      if (str == null) return; // if the array is empty return  nothing?
      int len = str.length; // get the length of the array 
      if (len < 2) return; // if the array length is only 1 return nothing?
      int tail = 1; // initialise tail variable as 1 ! 
      // Why are we starting at second character here (at i = 1),why not start at i = 0 ? 
      for (int i = 1; i < len; ++i) { 
        int j;

        for (j = 0; j < tail; ++j) { // so are we checking if j is less then 1 as tail has been initialized to 1 
          if (str[i] == str[j]) break; // stop, if we find duplicate.
        }

        if (j == tail) { why are we comparing j to tail(1) here ?
          str[tail] = str[i]; // assigning the value 
          ++tail; // incrementing tail
        }
      }
      str[tail] = 0; //setting last element as 0 
    }


 -

4 个解决方案

#1

I completly rely on @pbabcdefp comment as I'm to lazy to test it, but it seems like your algorithm does not work.

我完全依赖@pbabcdefp注释，因为我懒得对它进行测试，但您的算法似乎不能工作。

I did not like it anyway, here how I would do it with explanation in comments :

无论如何，我不喜欢它，在这里我要用注释来解释:

public static void main(String[] args) {
    removeDuplicates(new char[]{'a','a','b','b','c','d','c'});
}

public static final void removeDuplicates(char[] str)
{
    /*
     * If the str is not instantiated, or there is maximum 1 char there is no need to remove duplicates as 
     * it is just impossible to have duplicate with 1 char.
    */
    if (str == null || str.length < 2)
        return;

    //loop over each char
    for(int i = 0; i < str.length; i++)
    {
        //no need to check with earlier character as they were already checked, so we start at i + 1
        for(int j = i + 1; j < str.length; j++)
        {
            //if they match we clear
             if(str[j] == str[i])
                str[j] = ' ';
        }
    }

    System.out.println(str);
}

That would output a b cd.

这会输出一个b cd。

#2

First of all, this is a great book, I wish to recommend to everyone!

首先，这是一本很棒的书，我想向大家推荐!

Generally if you are allowed to use a lots of memory, you can save time, if you are allowed to use a few variables, then you still can solve this problem by a slower algorithm. And there is the complete brute-force algorithm, when you check every possible solution.

一般来说，如果允许你使用大量的内存，你可以节省时间，如果允许你使用一些变量，那么你仍然可以用一个较慢的算法来解决这个问题。有一个完整的蛮力算法，当你检查每一个可能的解。

public static void removeDuplicates(char[] str) {
  if (str == null) return; // if the array is empty return  nothing?

The input is a string pointer, so the string exists somewhere in the memory, the code will may modify it, but it stays at the same place. That's why the return type of the function is void, so it doesn't return anything. When it returns, the string at its original place is without duplication.

输入是一个字符串指针，所以字符串存在于内存中的某个地方，代码可能会修改它，但它会保持在相同的位置。这就是为什么函数的返回类型为void，所以它不返回任何东西。当它返回时，在其原始位置的字符串没有重复。

  int len = str.length; // get the length of the array 
  if (len < 2) return; // if the array length is only 1 return nothing?

Same as above, no return value. If the string is less then 2 character, then it cannot contain a duplication.

与上面一样，没有返回值。如果字符串小于2个字符，则不能包含复制。

From here the logic is the following: Take the i-th character. Check if it was existing before this place in the string. If it exists, then the algorithm deletes the i-th character. If it doesn't exists then it stays in the string.

从这里，逻辑如下:取第i个字符。检查它是否存在于字符串中的这个位置之前。如果存在，算法将删除第i个字符。如果它不存在，那么它就留在字符串中。

The proof that it's the right algorithm: None of the characters will stay which existed earlier in the string. If a character would exists later in the string, it would be deleted because of the previous rule.

证明它是正确的算法:字符串中早期存在的所有字符都不会保留。如果字符串后面存在一个字符，由于前面的规则，它将被删除。

If this would be the algorithm, it would work fine, but there would be "Empty" characters in the string. The string wouldn't be smaller, even tough it should contain less characters.

如果这是算法，它会工作得很好，但是字符串中会有“空”字符。字符串不会更小，即使很硬，也应该包含更少的字符。

That's why the algorithm keeps track on the "tail of the output string". That's why the tail is equals 1 at the beginning, since the 1st character will be definitely part of the result string. When the current character should be deleted, the output string's tail wont move, no new character added to the result. When the current character should be part of the result, then it gets copied to the tail of the result string.

这就是为什么算法会跟踪“输出字符串的尾部”。这就是为什么在开始时尾巴等于1，因为第一个字符肯定是结果字符串的一部分。当应该删除当前字符时，输出字符串的尾部不会移动，不会向结果添加新字符。当当前字符应该是结果的一部分时，它将被复制到结果字符串的尾部。

When the algorithm reaches the end of the input string, it closes the result string.

当算法到达输入字符串的末尾时，它关闭结果字符串。

Complexity: It means, relative to the size of the input, which is called 'n' how many steps the algorithm has to take. Typically cycles and recursions counts only. This code has 2 for loop embedded into each other. The external goes from 1 to n every time. The internal one goes from 0 to tail where tail goes from 1 to n. So the worst case scenario, the internal one goes by average from 1 to n/2. This means your complexity is n*(n/2). Since 2 is a constant, your complexity is n*n.

复杂度:指的是相对于输入的大小，也就是所谓的“n”，即算法需要走多少步。典型的循环和递归只计算。这段代码内嵌了两个for循环。外部元素每次从1到n。内部的从0到尾部尾巴从1到n，最坏的情况，内部的从1到n/2。这意味着您的复杂性是n*(n/2)。因为2是常数，所以复杂度是n*n。

#3

The O time complexity is about worst-case. Ignoring the array you get and the actions you do on it, when you have 2 nested for loops bounded by the length of the string, your complexity couldn't be higher than n^2, and thus it is O(n^2) (it is only an upper bound, if you'd like to show that it's also a lower bound more work should be done).

O时间复杂度是最坏情况。忽略了数组和行动你做,当你有两个嵌套的循环以该字符串的长度为界,你的复杂性不能高于n ^ 2,因此它是O(n ^ 2)(只有一个上限,如果你想显示它也是一个下界应该做更多的工作)。

#4

O(N^2) basically means that as the number of inputs increases, N being the number of inputs, the complexity (number of operations performed) will scale porportional to N^2 + some constant value.

O(N ^ 2)基本上意味着随着输入的数量增加,N被输入的数量,规模复杂性(操作)将porportional N ^ 2 +一个常数的值。

So looking at the code, str.length is N. For each element, you compare it to each other element, N compared N times = N^2.

所以看代码,str.length N为每个元素,你比较每个其他元素,N N次相比= N ^ 2。

Now O(N^2) is not meant to be exact. By definition it is concerned with only the non-constant factors that contribute to complexity growth. It will never tell you how quickly a particular algorithm will run, it purely tells you how the time it takes to run will scale with fluctuations in the number of elements being operated on.

现在O(N ^ 2)并不意味着是恰当的。根据定义，它只关心导致复杂性增长的非常数因素。它永远不会告诉你一个特定的算法会运行多快，它只会告诉你它运行的时间会随着运行的元素数量的波动而变化。

#1