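To write your own word segmenter, you usually subclass Tokenizer. In earlier versions, overriding Next() was all it took. Tokenizer is a fairly simple class, and its parent class TokenStream is simpler still, little more than an interface: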
// Version 2.1
public abstract class TokenStream
{
    /// <summary>Returns the next token in the stream, or null at EOS.</summary>
    public abstract Token Next();

    /// <summary>Releases resources associated with this stream.</summary>
    public virtual void Close()
    {
    }
}
In version 2.3.1 this abstract class changed considerably. Here is the 2.3.1 TokenStream source:
public abstract class TokenStream
{
    /// <summary>Returns the next token in the stream, or null at EOS.
    /// The returned Token is a "full private copy" (not
    /// re-used across calls to next()) but will be slower
    /// than calling {@link #Next(Token)} instead..
    /// </summary>
    public virtual Token Next()
    {
        Token result = Next(new Token());

        if (result != null)
        {
            Payload p = result.GetPayload();
            if (p != null)
            {
                result.SetPayload((Payload) p.Clone());
            }
        }

        return result;
    }

    /// <summary>Returns the next token in the stream, or null at EOS.
    /// When possible, the input Token should be used as the
    /// returned Token (this gives fastest tokenization
    /// performance), but this is not required and a new Token
    /// may be returned. Callers may re-use a single Token
    /// instance for successive calls to this method.
    /// <p>
    /// This implicitly defines a "contract" between
    /// consumers (callers of this method) and
    /// producers (implementations of this method
    /// that are the source for tokens):
    /// <ul>
    /// <li>A consumer must fully consume the previously
    /// returned Token before calling this method again.</li>
    /// <li>A producer must call {@link Token#Clear()}
    /// before setting the fields in it & returning it</li>
    /// </ul>
    /// Note that a {@link TokenFilter} is considered a consumer.
    /// </summary>
    /// <param name="result">a Token that may or may not be used to return
    /// </param>
    /// <returns> next token in the stream or null if end-of-stream was hit
    /// </returns>
    public virtual Token Next(Token result)
    {
        return Next();
    }

    /// <summary>Resets this stream to the beginning. This is an
    /// optional operation, so subclasses may or may not
    /// implement this method. Reset() is not needed for
    /// the standard indexing process. However, if the Tokens
    /// of a TokenStream are intended to be consumed more than
    /// once, it is necessary to implement reset().
    /// </summary>
    public virtual void Reset()
    {
    }

    /// <summary>Releases resources associated with this stream. </summary>
    public virtual void Close()
    {
    }
}
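As you can see, version 2.3.1 adds a Reset method and an overload of Next that takes a Token. Note also that Next() is no longer abstract: it is now virtual, with a default body that calls Next(Token).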
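The consumer/producer contract described in the Next(Token) doc comment can be sketched roughly as follows. This is only an illustration under stated assumptions: SketchToken, WhitespaceSketchStream and ContractDemo are hypothetical stand-ins invented here, not Lucene.Net types, and they model nothing beyond the token text.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical stand-in for Lucene.Net's Token; only the text is modelled.
public sealed class SketchToken
{
    private readonly StringBuilder term = new StringBuilder();

    public void Clear() { term.Length = 0; }
    public void Append(char c) { term.Append(c); }
    public string TermText() { return term.ToString(); }
}

// Hypothetical producer: splits on whitespace and fills the caller's token.
public sealed class WhitespaceSketchStream
{
    private readonly TextReader input;

    public WhitespaceSketchStream(TextReader input) { this.input = input; }

    // Mirrors the shape of Next(Token): returns null at end of stream.
    public SketchToken Next(SketchToken result)
    {
        result.Clear(); // producer side of the contract: clear before filling

        int c;
        // Skip leading whitespace.
        while ((c = input.Read()) != -1 && char.IsWhiteSpace((char)c)) { }
        if (c == -1) return null;

        // Accumulate characters until whitespace or end of input.
        do
        {
            result.Append((char)c);
        } while ((c = input.Read()) != -1 && !char.IsWhiteSpace((char)c));

        return result;
    }
}

public static class ContractDemo
{
    public static List<string> Collect(string text)
    {
        var stream = new WhitespaceSketchStream(new StringReader(text));
        var reusable = new SketchToken(); // one instance for every call
        var terms = new List<string>();

        SketchToken t;
        while ((t = stream.Next(reusable)) != null)
        {
            // Consumer side: copy what we need before calling Next again,
            // because the same instance is about to be overwritten.
            terms.Add(t.TermText());
        }
        return terms;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Collect("  hello  token stream ")));
    }
}
```

The point of the design is that only one token object is allocated for the whole stream, which is why the consumer has to finish with each token before asking for the next one.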
Its subclass Tokenizer overrides Close and adds a Reset(TextReader) overload, but leaves both Next methods untouched:
public abstract class Tokenizer : TokenStream
{
    /// <summary>The text source for this Tokenizer. </summary>
    protected internal System.IO.TextReader input;

    /// <summary>Construct a tokenizer with null input. </summary>
    protected internal Tokenizer()
    {
    }

    /// <summary>Construct a token stream processing the given input. </summary>
    protected internal Tokenizer(System.IO.TextReader input)
    {
        this.input = input;
    }

    /// <summary>By default, closes the input Reader. </summary>
    public override void Close()
    {
        if (input != null)
        {
            input.Close();
        }
    }

    /// <summary>Expert: Reset the tokenizer to a new reader. Typically, an
    /// analyzer (in its reusableTokenStream method) will use
    /// this to re-use a previously created tokenizer.
    /// </summary>
    public virtual void Reset(System.IO.TextReader input)
    {
        this.input = input;
    }
}
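Now suppose you create a class that inherits from Tokenizer, get pulled away to do something else, and come back later to write the calling code. Heh, here comes trouble: it dies with a stack overflow error, and you have no idea where it came from. The cause is that neither Next method is abstract any more, so the compiler no longer forces you to override one. The default Next() calls Next(Token), and the default Next(Token) calls Next() straight back, so a subclass that overrides neither recurses until the stack runs out. The parent class really isn't playing fair here, even though Next is bound to be overridden eventually.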
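To make the cycle concrete, here is a minimal sketch that copies only the two default Next bodies into stand-in classes. Everything in it (SketchToken, SketchStream, ForgottenTokenizer, FixedTokenizer, RecursionDemo, the MaxDepth guard) is a hypothetical name made up for illustration. The depth counter is an assumption added so the sketch can finish: a real StackOverflowException cannot be caught in .NET and would simply kill the process.

```csharp
using System;

// Hypothetical placeholder for Lucene.Net's Token.
public sealed class SketchToken { }

// Hypothetical stand-in that mirrors only the two default Next methods of the
// 2.3.1 TokenStream. A depth counter replaces the real stack overflow.
public class SketchStream
{
    public const int MaxDepth = 1000;
    public int Depth;

    public virtual SketchToken Next()
    {
        Guard();
        return Next(new SketchToken()); // default Next() delegates to Next(token)
    }

    public virtual SketchToken Next(SketchToken result)
    {
        Guard();
        return Next(); // default Next(token) delegates straight back
    }

    private void Guard()
    {
        if (++Depth > MaxDepth)
            throw new InvalidOperationException("mutual recursion detected");
    }
}

// Overrides neither method, like the forgotten subclass in the story.
public class ForgottenTokenizer : SketchStream { }

// Overrides the Next(token) overload, which breaks the cycle.
public class FixedTokenizer : SketchStream
{
    private bool done;

    public override SketchToken Next(SketchToken result)
    {
        if (done) return null; // end of stream after one token
        done = true;
        return result;
    }
}

public static class RecursionDemo
{
    // Returns true if calling Next() runs into the recursion guard.
    public static bool Recurses(SketchStream stream)
    {
        try { stream.Next(); return false; }
        catch (InvalidOperationException) { return true; }
    }

    public static void Main()
    {
        Console.WriteLine(Recurses(new ForgottenTokenizer())); // True
        Console.WriteLine(Recurses(new FixedTokenizer()));     // False
    }
}
```

Overriding either overload is enough to break the loop. Overriding Next(Token) is the one the doc comment recommends for speed, since the caller's token can then be reused.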