如何在f#中找到字符串中的子字符串?

时间:2022-10-14 20:58:07

I found a "fun" project online for f# and the idea behind it is to find the number of substrings within a given string.

我在网上找到了一个f#的“有趣”项目,其背后的想法是在给定的字符串中找到子字符串的数目。

Here's the prompt:

提示:

Description:
You are given a DNA sequence:
a string that contains only characters 'A', 'C', 'G', and 'T'.
Your task is to calculate the number of substrings of sequence,
in which each of the symbols appears the same number of times.

Example 1:
For sequence = "ACGTACGT", the output should be 6
All substrings of length 4 contain each symbol exactly once (+5),
and the whole sequence contains each symbol twice (+1).

Example 2:
For sequence = "AAACCGGTTT", the output should be 1
Only substring "AACCGGTT" satisfies the criterion above: it contains each symbol twice.


Input: String, a sequence that consists only of symbols 'A', 'C', 'G', and 'T'.
Length constraint: 0 < sequence.length < 100000.

Output: Integer, the number of substrings where each symbol appears equally many times.

I'm not exactly sure where to go with this, or more specifically what to do. I've looked around on the internet to try and find what I'm supposed to do and I've only found the following code (I added the input variable, var variable, and changed the show "things" to input then the substring to search for (i hope that makes sense)):

我不确定该怎么做,或者更具体地说该怎么做。我在网上查了一下,想看看我应该做什么,我只找到了下面的代码(我添加了输入变量var变量,并将show“things”改为输入,然后是要搜索的子字符串(我希望这是有意义的):

open System

let countSubstring (where :string) (what : string) =
match what with
| "" -> 0
| _ -> (where.Length - where.Replace(what, @"").Length) / what.Length


[<EntryPoint>]
let main argv =

let input = System.Console.ReadLine();
let var = input.Length;
Console.WriteLine(var);
let show where what =
    printfn @"countSubstring(""%s"", ""%s"") = %d" where what (countSubstring where what)
show input "ACGT"
show input "CGTA"
show input "GTAC"
show input "TACG"
0

Anyways, if anyone can help me with this, it would be greatly appreciated.

无论如何,如果有人能帮助我,我将不胜感激。

Thanks in advance

谢谢提前

2 个解决方案

#1


2  

First declare a function numberACGT that from a string returns 1 if the number of characters A is the same as C, G and T and 0 otherwise. For this, declare an array N of 4 integers initialized to 0 and run throw the string, incrementing the corresponding counter. In late compare array elements between them.

首先声明一个函数numberACGT,如果字符串a的字符数与C、G、T和0相同,则返回1。为此,声明初始化为0的4个整数的数组N,并运行抛出字符串,递增相应的计数器。最后比较它们之间的数组元素。

Then for each sub-string (fixed length multiple of 4) call numberACGT and add to counter count (initialized to 0 at the beginning)

然后对每个子字符串(固定长度4的倍数)调用numberACGT并添加计数器计数(初始化为0)

let numberACGT (aString:string) =
    let N = Array.create 4 (0:int)
    let last = aString.Length - 1 
    for i = 0 to last do
        match aString.[i] with
        | 'A' -> N.[0] <- N.[0] + 1
        | 'C' -> N.[1] <- N.[1] + 1
        | 'G' -> N.[2] <- N.[2] + 1
        | _ -> N.[3] <- N.[3] + 1
    if (N.[0] = N.[1]) && (N.[1] = N.[2]) && (N.[2] = N.[3]) then 1 else 0 

let numberSubStrings (aString:string) =
    let mutable count = 0
    let len = aString.Length 
    for k = 1 to len / 4 do //only multiple of 4
        for pos = 0 to len - 4*k do
            count <- count + numberACGT (aString.[pos..pos+4*k-1])
    count

I hope that it is fast enough.

我希望它足够快。

[<EntryPoint>]
let main argv = 
  let stopWatch = System.Diagnostics.Stopwatch.StartNew()
  let input =  Console.ReadLine() in
    printf "%i  " (numberSubStrings input)
  stopWatch.Stop()
  let g =  Console.ReadLine()
  0

Result:

结果:

62    4.542700

An new version in O(n²):

新版本的O(n²):

let numberSubStringsBis (aString:string) =
    let mutable count = 0 
    let len = aString.Length 
    for pos = 0 to len - 1 do
        let mutable a = 0 
        let mutable  c = 0 
        let mutable g = 0 
        let mutable t = 0 
        let mutable k = pos 
        while k + 3 <= len - 1 do
            for i in [k..k+3] do
                match aString.[i] with
                | 'A' -> a <- a + 1
                | 'C' -> c <- c + 1
                | 'G' -> g <- g + 1
                | _ -> t <- t + 1
            k <- k + 4 
            if a=c && c=g && g=t then count <- count + 1               
    count

#2


2  

Here is a solution that generates all substrings that have length divisible by four and then counts how many of those have equal amount of symbols. Note that if the length of a substring is not divisible by four it cannot have equal amount of four different symbols.

这里有一个解决方案,生成所有长度可被4整除的子字符串,然后计算有多少子字符串具有相同数量的符号。注意,如果子字符串的长度不能被4整除,那么它不可能有相同数量的4个不同的符号。

let hasEqualAmountOfSymbols (substring : string) =
    let symbolAppearances =
        ['A'; 'C'; 'G'; 'T']
        |> List.map (fun symbol ->
            substring
            |> Seq.filter ((=) symbol)
            |> Seq.length)
    symbolAppearances
    |> List.pairwise
    |> List.forall (fun (x, y) -> x = y)


let countSubstrings input =
    let potentialSubstrings =
        let lastIndex = String.length input - 1
        [ for i in 0 .. lastIndex do
            for j in i + 3 .. 4 .. lastIndex do
                yield input.Substring(i, j - i + 1) ]
    potentialSubstrings
    |> List.filter hasEqualAmountOfSymbols
    |> List.length


countSubstrings "ACGTACGT" // -> 6
countSubstrings "AAACCGGTTT" // -> 1

#1


2  

First declare a function numberACGT that from a string returns 1 if the number of characters A is the same as C, G and T and 0 otherwise. For this, declare an array N of 4 integers initialized to 0 and run throw the string, incrementing the corresponding counter. In late compare array elements between them.

首先声明一个函数numberACGT,如果字符串a的字符数与C、G、T和0相同,则返回1。为此,声明初始化为0的4个整数的数组N,并运行抛出字符串,递增相应的计数器。最后比较它们之间的数组元素。

Then for each sub-string (fixed length multiple of 4) call numberACGT and add to counter count (initialized to 0 at the beginning)

然后对每个子字符串(固定长度4的倍数)调用numberACGT并添加计数器计数(初始化为0)

let numberACGT (aString:string) =
    let N = Array.create 4 (0:int)
    let last = aString.Length - 1 
    for i = 0 to last do
        match aString.[i] with
        | 'A' -> N.[0] <- N.[0] + 1
        | 'C' -> N.[1] <- N.[1] + 1
        | 'G' -> N.[2] <- N.[2] + 1
        | _ -> N.[3] <- N.[3] + 1
    if (N.[0] = N.[1]) && (N.[1] = N.[2]) && (N.[2] = N.[3]) then 1 else 0 

let numberSubStrings (aString:string) =
    let mutable count = 0
    let len = aString.Length 
    for k = 1 to len / 4 do //only multiple of 4
        for pos = 0 to len - 4*k do
            count <- count + numberACGT (aString.[pos..pos+4*k-1])
    count

I hope that it is fast enough.

我希望它足够快。

[<EntryPoint>]
let main argv = 
  let stopWatch = System.Diagnostics.Stopwatch.StartNew()
  let input =  Console.ReadLine() in
    printf "%i  " (numberSubStrings input)
  stopWatch.Stop()
  let g =  Console.ReadLine()
  0

Result:

结果:

62    4.542700

An new version in O(n²):

新版本的O(n²):

let numberSubStringsBis (aString:string) =
    let mutable count = 0 
    let len = aString.Length 
    for pos = 0 to len - 1 do
        let mutable a = 0 
        let mutable  c = 0 
        let mutable g = 0 
        let mutable t = 0 
        let mutable k = pos 
        while k + 3 <= len - 1 do
            for i in [k..k+3] do
                match aString.[i] with
                | 'A' -> a <- a + 1
                | 'C' -> c <- c + 1
                | 'G' -> g <- g + 1
                | _ -> t <- t + 1
            k <- k + 4 
            if a=c && c=g && g=t then count <- count + 1               
    count

#2


2  

Here is a solution that generates all substrings that have length divisible by four and then counts how many of those have equal amount of symbols. Note that if the length of a substring is not divisible by four it cannot have equal amount of four different symbols.

这里有一个解决方案,生成所有长度可被4整除的子字符串,然后计算有多少子字符串具有相同数量的符号。注意,如果子字符串的长度不能被4整除,那么它不可能有相同数量的4个不同的符号。

let hasEqualAmountOfSymbols (substring : string) =
    let symbolAppearances =
        ['A'; 'C'; 'G'; 'T']
        |> List.map (fun symbol ->
            substring
            |> Seq.filter ((=) symbol)
            |> Seq.length)
    symbolAppearances
    |> List.pairwise
    |> List.forall (fun (x, y) -> x = y)


let countSubstrings input =
    let potentialSubstrings =
        let lastIndex = String.length input - 1
        [ for i in 0 .. lastIndex do
            for j in i + 3 .. 4 .. lastIndex do
                yield input.Substring(i, j - i + 1) ]
    potentialSubstrings
    |> List.filter hasEqualAmountOfSymbols
    |> List.length


countSubstrings "ACGTACGT" // -> 6
countSubstrings "AAACCGGTTT" // -> 1