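To write your own tokenizer, you normally subclass the Tokenizer class. In earlier versions, overriding the Next() method was all it took. Tokenizer is a fairly simple class, and its parent class TokenStream is simpler still, hardly different from an interface: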
// Version 2.1
public abstract class TokenStream
{
    /// <summary>Returns the next token in the stream, or null at EOS.</summary>
    public abstract Token Next();

    /// <summary>Releases resources associated with this stream.</summary>
    public virtual void Close()
    {
    }
}
In version 2.3.1 this abstract class changed considerably. Here is the TokenStream code from 2.3.1:

public abstract class TokenStream
{
    /// <summary>Returns the next token in the stream, or null at EOS.
    /// The returned Token is a "full private copy" (not
    /// re-used across calls to next()) but will be slower
    /// than calling {@link #Next(Token)} instead..
    /// </summary>
    public virtual Token Next()
    {
        Token result = Next(new Token());

        if (result != null)
        {
            Payload p = result.GetPayload();
            if (p != null)
            {
                result.SetPayload((Payload) p.Clone());
            }
        }

        return result;
    }

    /// <summary>Returns the next token in the stream, or null at EOS.
    /// When possible, the input Token should be used as the
    /// returned Token (this gives fastest tokenization
    /// performance), but this is not required and a new Token
    /// may be returned. Callers may re-use a single Token
    /// instance for successive calls to this method.
    /// <p>
    /// This implicitly defines a "contract" between
    /// consumers (callers of this method) and
    /// producers (implementations of this method
    /// that are the source for tokens):
    /// <ul>
    /// <li>A consumer must fully consume the previously
    /// returned Token before calling this method again.</li>
    /// <li>A producer must call {@link Token#Clear()}
    /// before setting the fields in it & returning it</li>
    /// </ul>
    /// Note that a {@link TokenFilter} is considered a consumer.
    /// </summary>
    /// <param name="result">a Token that may or may not be used to return
    /// </param>
    /// <returns> next token in the stream or null if end-of-stream was hit
    /// </returns>
    public virtual Token Next(Token result)
    {
        return Next();
    }

    /// <summary>Resets this stream to the beginning. This is an
    /// optional operation, so subclasses may or may not
    /// implement this method. Reset() is not needed for
    /// the standard indexing process. However, if the Tokens
    /// of a TokenStream are intended to be consumed more than
    /// once, it is necessary to implement reset().
    /// </summary>
    public virtual void Reset()
    {
    }

    /// <summary>Releases resources associated with this stream. </summary>
    public virtual void Close()
    {
    }
}
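As you can see, version 2.3.1 adds a Reset method and an overload of Next, and Next() is no longer abstract.
Its subclass Tokenizer overrides Close and adds a Reset(TextReader) overload, but leaves both Next methods untouched: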

public abstract class Tokenizer : TokenStream
{
    /// <summary>The text source for this Tokenizer. </summary>
    protected internal System.IO.TextReader input;

    /// <summary>Construct a tokenizer with null input. </summary>
    protected internal Tokenizer()
    {
    }

    /// <summary>Construct a token stream processing the given input. </summary>
    protected internal Tokenizer(System.IO.TextReader input)
    {
        this.input = input;
    }

    /// <summary>By default, closes the input Reader. </summary>
    public override void Close()
    {
        if (input != null)
        {
            input.Close();
        }
    }

    /// <summary>Expert: Reset the tokenizer to a new reader. Typically, an
    /// analyzer (in its reusableTokenStream method) will use
    /// this to re-use a previously created tokenizer.
    /// </summary>
    public virtual void Reset(System.IO.TextReader input)
    {
        this.input = input;
    }
}
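Now suppose you create a class that inherits from Tokenizer, get pulled away to do something else, and on your return go straight to writing the calling code. Heh, here comes the problem: you get a stack overflow error, and you have no idea where it came from. The cause is in the base class: neither Next overload is abstract any more, Next() calls Next(Token), and Next(Token) calls Next() right back, so a subclass that overrides neither of them recurses until the stack runs out. The compiler no longer reminds you that something is missing. That is rather unkind of the base class, even though Next is a method every subclass is bound to override.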
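To make the way out concrete, here is a minimal sketch of a tokenizer that overrides Next(Token) and so breaks the cycle. It does not use the real Lucene.Net assembly: Token, TokenStream and Tokenizer below are simplified stand-ins modeled on the 2.3.1 code quoted above (the payload cloning is left out, and the Text, Start and End fields are my own assumptions, not the real Token members). SimpleWhitespaceTokenizer and Demo are hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Simplified stand-in for Lucene.Net's Token (fields are assumptions).
public class Token
{
    public string Text = "";
    public int Start, End;
    public void Clear() { Text = ""; Start = 0; End = 0; }
}

// Stand-in mirroring the 2.3.1 defaults: each overload calls the other.
public abstract class TokenStream
{
    public virtual Token Next() { return Next(new Token()); }
    public virtual Token Next(Token result) { return Next(); }
    public virtual void Close() { }
}

public abstract class Tokenizer : TokenStream
{
    protected TextReader input;
    protected Tokenizer(TextReader input) { this.input = input; }
    public override void Close() { if (input != null) input.Close(); }
}

// Hypothetical tokenizer that splits the input on whitespace.
public class SimpleWhitespaceTokenizer : Tokenizer
{
    private int offset; // number of characters consumed so far

    public SimpleWhitespaceTokenizer(TextReader input) : base(input) { }

    // Overriding this overload ends the Next() <-> Next(Token) recursion.
    public override Token Next(Token result)
    {
        int c;
        // Skip whitespace in front of the next token.
        while ((c = input.Read()) != -1 && char.IsWhiteSpace((char)c)) offset++;
        if (c == -1) return null; // end of stream

        int start = offset;
        var sb = new StringBuilder();
        do
        {
            sb.Append((char)c);
            offset++;
        } while ((c = input.Read()) != -1 && !char.IsWhiteSpace((char)c));
        if (c != -1) offset++; // account for the whitespace that ended the token

        result.Clear(); // producer side of the contract: clear before filling
        result.Text = sb.ToString();
        result.Start = start;
        result.End = start + sb.Length;
        return result;
    }
}

public static class Demo
{
    public static List<string> Tokenize(string text)
    {
        var terms = new List<string>();
        TokenStream stream = new SimpleWhitespaceTokenizer(new StringReader(text));
        Token reusable = new Token(); // consumer side: one instance, reused
        for (Token t = stream.Next(reusable); t != null; t = stream.Next(reusable))
            terms.Add(t.Text + "[" + t.Start + "," + t.End + ")");
        stream.Close();
        return terms;
    }

    public static void Main()
    {
        // prints: hello[0,5) lucene[7,13) net[14,17)
        Console.WriteLine(string.Join(" ", Tokenize("hello  lucene net")));
    }
}
```

Overriding either overload is enough to stop the recursion, since the other one then delegates to your implementation. Overriding Next(Token) is the choice the 2.3.1 doc comment points toward, because it lets callers reuse one Token instance instead of allocating a new one per call.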