子字符串查找

时间:2021-11-09 18:48:46

一、KMP算法

1、KMP算法实现(p500):

public class KMP 
{
	private String pat;
	private int[][] dfa;//确定有限状态自动机
	public void dfa(String pat)//构造DFA
	{
		this.pat = pat;
		int M = pat.length();
		int R = 256;
		dfa = new int[R][M];
		dfa[pat.charAt(0)][0] = 1;
		for (int X = 0, j = 1; j < M; j++)
		{
			for (int c = 0; c < R; c++)
				dfa[c][j] = dfa[c][X];
			dfa[pat.charAt(j)][j] = j + 1;
			X = dfa[pat.charAt(j)][X];
		}
	}
	public int search(String txt) 
	{
		int i,j,M = pat.length(),N = txt.length();
		for (i = 0, j = 0; i < N && j < M; i ++)
			j = dfa[txt.charAt(i)][j];
		if (j == M) return i - M;
		else return N;
	}
	
	public static void main(String[] args) 
	{
		String pat = "ABABAC";
		String txt = "ABABCABABAC";
		KMP kmp = new KMP();
		kmp.dfa(pat);
		System.out.println(kmp.search(txt));
		// TODO Auto-generated method stub

	}

}

2、对于长度为M的模式字符串和长度为N的文本,KMP字符串查找算法访问的字符不会超过M+N个。

3、KMP算法由于不回退,适用于长度不确定的输入流中进行查找,以及在重复性很高的文本中查找重复性很高的模式。

二、Boyer-Moore字符串查找算法(启发式的处理不匹配字符)

1、实现(p504)

public class BoyerMoore//启发式处理不匹配字符 
{
	private String pat;
	private int[] right;//记录每个字符在模式中出现的最右位置
	BoyerMoore(String pat)
	{
		this.pat = pat;
		int R = 256;
		int M = pat.length();
		right = new int[R];
		for (int c = 0; c < R; c++)
			right[c] = -1;
		for (int j = 0; j < M; j++)
			right[pat.charAt(j)] = j;
		
	}
	public int search(String txt)
	{
		int N = txt.length();
		int M = pat.length();
		int skip;
		for (int i = 0; i <= N-M; i += skip)
		{
			skip = 0;
			for (int j = M - 1; j >= 0; j--)
				if (txt.charAt(i + j) != pat.charAt(j))
				{
					skip = j - right[txt.charAt(i+j)];
					if (skip < 0) 
						skip = 1;
					break;
				}
			if (skip == 0) return i;
		}
		return N;
	}
	public static void main(String[] args) 
	{
		// TODO Auto-generated method stub
		String pat = "ABABAC";
		String txt = "ABABCABABAC";
		BoyerMoore bm = new BoyerMoore(pat);
		System.out.println(bm.search(txt));
	}

}

2、在一般情况下,对于长度为N的文本和长度为M的模式字符串,使用了Boyer-Moore的子字符串查找算法通过启发式处理不匹配的字符需要~N/M次字符串比较。

三、Rabin-Krap指纹字符串查找算法(基于散列的字符串查找算法)

1、实现(p508):

public class RabinKarp
{

	private String pat;
	private int R = 256;
	private long Q;
	private long RM;
	private long patHash;
	private int M;
	public RabinKarp(String pat)
	{
		this.pat = pat;
		this.M = pat.length();
		Q = 97;
		RM = 1;
		for (int i = 1; i <= M - 1; i++)
			RM = (RM * R) % Q;
		patHash = hash(pat, M);
	}
	private long hash(String pat, int M)
	{
		long h = 0;
		for (int i = 0; i < M; i++)
		{
			h = (R * h + pat.charAt(i)) % Q;
		}
		return h;
	}
	public boolean check(int i)
	{
		return true;
	}
	public int search(String txt)
	{
		int N = txt.length();
		long txtHash = hash(txt, M);
		if (patHash == txtHash&& check(0)) return 0;
		for (int i = M; i < N; i++)
		{
			txtHash = (txtHash + Q - RM * txt.charAt(i-M) % Q) % Q;
			txtHash = (txtHash * R + txt.charAt(i)) % Q;
			if (patHash == txtHash && check(i - M + 1)) return i - M + 1;
			
		}
		return N;
	}
	public static void main(String[] args) 
	{
		// TODO Auto-generated method stub
		String pat = "ABABAC";
		String txt = "ABABCABABAC";
		RabinKarp rk = new RabinKarp(pat);
		System.out.println(rk.search(txt));
	}

}
2、使用蒙特卡洛算法的Rabin-Karp子字符串查找算法的运行时间是线性级别且出错概率极小;使用拉斯维加斯算法的Rabin-Karp算法能够保证正确性且性能及其接近线性级别。

四、各种字符串查找算法实现成本总结:

算法
版本 最坏情况操作次数 一般情况操作次数 在文中是否回退 正确性 格外空间需要
Knuth-Morris-Pratt 完整的DFA 2N 1.1N MR
Boyer-Moore算法 启发式查找不匹配的字符 MN N/M R
Rabin-Krap算法 蒙特卡洛算法
拉斯维加斯算法
7N
7N*
7N
1
暴力算法    MN 1.1N 1