Repeated DNA Sequences Solution

Question

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", return ["AAAAACCCCC", "CCCCCAAAAA"].

Solution -- Bit Manipulation

The straightforward idea is to store every 10-letter substring in a set. Both the time complexity and the space cost are O(n). Looking at the space cost in more detail: a Java char takes 2 bytes, so each 10-letter substring needs 20 bytes of character data, or roughly 20n bytes in total (ignoring object overhead).
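As a point of comparison, the set-of-substrings idea can be sketched as follows. This is only an illustration, not the solution this post settles on; the class name NaiveRepeatedDna and the method name findRepeated are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class and method names, used only for this sketch.
class NaiveRepeatedDna {
    static List<String> findRepeated(String s) {
        Set<String> seen = new HashSet<>();
        // LinkedHashSet keeps the order in which repeats are first detected.
        Set<String> repeated = new LinkedHashSet<>();
        for (int i = 0; i + 10 <= s.length(); i++) {
            String window = s.substring(i, i + 10);
            // add() returns false when the window was already in the set.
            if (!seen.add(window)) {
                repeated.add(window);
            }
        }
        return new ArrayList<>(repeated);
    }
}
```

Every window is kept as a full String here, which is where the 20 bytes per substring come from.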

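If we instead represent each DNA substring as an integer, the space drops to roughly 4n bytes. Each of the four nucleotides fits in 2 bits, so a 10-letter substring fits in 20 bits of a 32-bit int.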
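To make the integer representation concrete, here is a minimal sketch of packing a window into an int and unpacking it again. The class DnaEncoding and its methods are hypothetical helpers, and the sketch assumes the input contains only the letters A, C, G, and T.

```java
// Hypothetical helper class, used only for this sketch.
class DnaEncoding {
    // Maps A, C, G, T to 0..3; rejects any other character.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Packs the window into an int, 2 bits per letter, first letter in the highest bits.
    static int encode(String window) {
        int value = 0;
        for (int i = 0; i < window.length(); i++) {
            value = (value << 2) | code(window.charAt(i));
        }
        return value;
    }

    // Reverses encode() for a window of the given length.
    static String decode(int value, int length) {
        char[] out = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            out[i] = "ACGT".charAt(value & 3);
            value >>>= 2;
        }
        return new String(out);
    }
}
```

With this mapping, "AAAAACCCCC" becomes binary 0101010101 (decimal 341), and ten T's fill all 20 bits.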

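 import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;

 // Wrapper class added so the method compiles on its own.
 public class Solution {
     public List<String> findRepeatedDnaSequences(String s) {
         List<String> result = new ArrayList<String>();
         int len = s.length();
         if (len < 10) {
             return result;
         }

         // Each nucleotide gets a 2-bit code.
         Map<Character, Integer> map = new HashMap<Character, Integer>();
         map.put('A', 0);
         map.put('C', 1);
         map.put('G', 2);
         map.put('T', 3);

         Set<Integer> temp = new HashSet<Integer>();  // windows seen so far
         Set<Integer> added = new HashSet<Integer>(); // windows already in result
         int mask = (1 << 20) - 1;                    // keeps the lowest 20 bits
         int hash = 0;

         for (int i = 0; i < len; i++) {
             // Each of A, C, G, T fits in 2 bits, so shift left by 2 and append.
             // Masking drops the letter that just left the 10-letter window.
             hash = ((hash << 2) + map.get(s.charAt(i))) & mask;
             if (i < 9) {
                 continue; // the first full window ends at index 9
             }
             if (temp.contains(hash) && !added.contains(hash)) {
                 result.add(s.substring(i - 9, i + 1));
                 added.add(hash); // report each repeated window only once
             } else {
                 temp.add(hash);
             }
         }
         return result;
     }
 }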
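The key step in the loop is the rolling update: shifting in a new letter and masking to 20 bits should give the same value as encoding the new window from scratch. The following sketch checks that property in isolation; RollingWindowCheck, encode, and roll are hypothetical names, and code() assumes valid nucleotides (indexOf returns -1 for anything else).

```java
// Hypothetical helper class, used only for this sketch.
class RollingWindowCheck {
    static final int MASK = (1 << 20) - 1; // lowest 20 bits = 10 letters

    // Position in "ACGT" doubles as the 2-bit code; assumes c is a valid nucleotide.
    static int code(char c) {
        return "ACGT".indexOf(c);
    }

    // Encodes a whole window from scratch.
    static int encode(String window) {
        int value = 0;
        for (int i = 0; i < window.length(); i++) {
            value = (value << 2) | code(window.charAt(i));
        }
        return value;
    }

    // Slides the window one letter to the right.
    static int roll(int hash, char next) {
        return ((hash << 2) | code(next)) & MASK;
    }
}
```

For example, rolling the encoding of "AAAAACCCCC" forward with 'A' yields the encoding of "AAAACCCCCA", the next window in the sample input.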
