Oracle Text: How to sanitize user input

Date: 2022-09-20 11:22:38

If anyone has experience using Oracle Text (CTXSYS.CONTEXT), I'm wondering how to handle user input when the user wants to search for names that may contain an apostrophe.

Escaping the apostrophe seems to work in some cases, but not for a trailing 's at the end of a word: s is in the list of stop words, so it seems to get removed.

We currently change simple query text (i.e. anything that's just letters) to %text%, for example:

contains(field, :text) > 0

A search for O'Neil works, but Joe's doesn't.

Has anyone using Oracle Text dealt with this issue?

3 answers

#1


2  

Escape all special characters with backslashes. Curly braces won't work with substring searches because they define complete tokens; for example, %{ello}% won't match the token 'Hello'.

Escaped space characters will be included in the search token, so the search string '%stay\ near\ me%' will be treated as a literal string "stay near me" and will not invoke the 'near' operator.
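
As a rough illustration of the backslash approach, here is a minimal Python sketch. The function names are hypothetical, and rather than relying on a list of Oracle Text's special characters it simply escapes every character that is not a letter or digit, which is an assumption made for simplicity and not something Oracle prescribes.

```python
def escape_for_contains(text: str) -> str:
    """Backslash-escape every character that is not a letter or digit.

    Assumption: escaping more characters than strictly necessary is harmless,
    so no list of Oracle Text special characters is maintained here.
    """
    return "".join(ch if ch.isalnum() else "\\" + ch for ch in text)


def substring_query(text: str) -> str:
    """Wrap the escaped input in % wildcards for a substring-style search."""
    return "%" + escape_for_contains(text) + "%"


print(substring_query("stay near me"))  # %stay\ near\ me%
print(substring_query("O'Neil"))        # %O\'Neil%
```

The result is meant to be passed as the bind value of contains(field, :text) > 0. Empty input would produce %%, so the caller should probably reject it before querying.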

If you are indexing short strings (names, etc.) and you want Oracle Text to behave exactly like the LIKE operator, you must write your own lexer that won't create tokens for individual words. (Unfortunately, CATSEARCH does not support substring search.)

It is probably a good idea to change the searches to use Oracle Text's semantics, with token matching. For some applications, though, the wildcard expansion of multiple (short) tokens and numeric tokens will create too many hits for search strings that users would reasonably expect to work.

E.g., a search for "%I\ AM\ NUMBER\ 9%" will most likely fail if there are a lot of numeric tokens in the indexed data, since all tokens ending with 'I' and all tokens starting with '9' must be looked up and merged before the result can be returned.

'I' and 'AM' are probably also in the default stoplist and will be ignored entirely, so for this hypothetical application a null stoplist may be used if these tokens are important.

#2


0  

Using PARAMETERS('STOPLIST ctxsys.empty_stoplist') when indexing would include all alphabetical tokens in the index. Accented characters are indexed as well. Non-alphabetical characters are generally treated as whitespace by BASIC_LEXER.
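
For illustration, the index DDL with that PARAMETERS clause could be generated like this. It is only a sketch: the helper name and the index, table, and column names (people_name_idx, people, name) are made up, and the identifier check is a deliberately strict assumption, since identifiers cannot be passed as bind variables.

```python
import re

# Conservative pattern for unquoted identifiers (an assumption, stricter than Oracle's rules).
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def context_index_ddl(index_name: str, table: str, column: str) -> str:
    """Build CREATE INDEX DDL for a CONTEXT index that uses the empty stoplist."""
    for name in (index_name, table, column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"CREATE INDEX {index_name} ON {table} ({column}) "
        "INDEXTYPE IS CTXSYS.CONTEXT "
        "PARAMETERS ('STOPLIST CTXSYS.EMPTY_STOPLIST')"
    )


# Hypothetical names, used only to show the shape of the statement.
print(context_index_ddl("people_name_idx", "people", "name"))
```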

Also, the CONTEXT grammar uses a lot of operators that include symbols and reserved words such as WITHIN, NEAR, and ABOUT. These all have to be escaped somehow in the input. If you need to search for substrings, the correct approach is to escape all characters with \. This is an answer to a related question here: Oracle text escaping with curly braces and wildcards. If your requirement is to search for whole terms (names, etc.), you can use the simpler {input} escaping.
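
A minimal sketch of the {input} style for whole-term searches follows. The function name is hypothetical, and doubling a closing brace inside the braces is my reading of the Oracle Text documentation, so treat that detail as an assumption to verify against your version.

```python
def brace_escape(text: str) -> str:
    """Wrap user input in braces so operators and reserved words are taken literally.

    Assumption: a closing brace inside an escaped sequence is written as }}.
    """
    return "{" + text.replace("}", "}}") + "}"


print(brace_escape("Joe's"))  # {Joe's}
print(brace_escape("NEAR"))   # {NEAR}
```

As noted in the first answer, this form cannot be combined with % wildcards for substring matching.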

#3


-2  

Forget about sanitizing. Why? Refer to http://en.wikipedia.org/wiki/SQL_injection .

What kind of database interface API are you using? Perl DBI, ODBC, and JDBC all support parameterized queries or prepared statements. If you're using a native interface that doesn't support them, then God bless you.
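
For what it's worth, a bind variable and query-text escaping are not mutually exclusive. The sketch below keeps the user input out of the SQL string by binding it, but still escapes it, because, as far as I understand, the bound value is parsed by the Oracle Text query grammar regardless of how it reaches CONTAINS. The table and column names (people, id, name) and both function names are hypothetical.

```python
def escape_for_contains(text: str) -> str:
    """Backslash-escape every character that is not a letter or digit."""
    return "".join(ch if ch.isalnum() else "\\" + ch for ch in text)


def contains_query(user_text: str):
    """Return a SQL string with a named bind plus the matching parameters."""
    # The SQL text is constant; user input only ever travels as a bind value.
    sql = "SELECT id FROM people WHERE CONTAINS(name, :text) > 0"
    params = {"text": "%" + escape_for_contains(user_text) + "%"}
    return sql, params


sql, params = contains_query("Joe's")
print(sql)
print(params["text"])  # %Joe\'s%
# With a DB-API driver that supports named binds, this would be run as:
# cursor.execute(sql, params)
```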
