How to split a string into tokens?

If I have a string

'x+13.5*10x-4e1'

how can I split it into the following list of tokens?

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']

Currently I'm using the shlex module:

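import shlex

def tokenize(expression):
    # Do not name the argument `str`: that would shadow the built-in.
    lexer = shlex.shlex(expression)
    token_list = []
    for token in lexer:
        token_list.append(token)  # shlex already yields strings
    return token_list

tokenize('x+13.5*10x-4e1')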

But this returns:

['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']

So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers then somehow splitting them, but not sure about how to do this or how to add them all back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.

In an ideal world, e and E would not be recognised as letters in the same way, so

'-4e1'

would become

['-', '4e1']

but

'-4x1'

would become

['-', '4', 'x', '1']

Can anybody help?

3 Solutions

#1

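Use the regular expression module's split() function to split at:

'\d+' -- digits (number characters) and
'\W+' -- non-word characters.

Because the pattern is wrapped in a capturing group, re.split() keeps the matched delimiters in the result; the list comprehension only drops the empty strings.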

CODE:

import re

print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']

If you don't want to separate the dot (so that a floating-point number in the expression stays in one piece), use this instead:

'[\d.]+' -- digit or dot characters (although this also accepts malformed input such as 13.5.5)

CODE:

print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
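
The question's "ideal world" case (keeping 4e1 together while still splitting 4x1) is not covered by the patterns above. One possible sketch, assuming an exponent is always digits, e or E, then digits with no sign, is to match tokens with re.findall() instead of splitting. The names TOKEN_RE and split_tokens are made up for this illustration:

```python
import re

# Alternatives are tried left to right, so the exponent form has to come
# before the plain-digits form; otherwise '4e1' would be cut at '4'.
TOKEN_RE = re.compile(
    r'\d+[eE]\d+'    # digits, e/E, digits (assumed exponent form, no sign)
    r'|\d+'          # plain run of digits
    r'|[A-Za-z_]+'   # run of letters
    r'|[^\w\s]'      # any single non-word, non-space character
)

def split_tokens(expression):
    """Return a flat, ordered list of tokens (hypothetical helper)."""
    return TOKEN_RE.findall(expression)

print(split_tokens('-4e1'))            # ['-', '4e1']
print(split_tokens('-4x1'))            # ['-', '4', 'x', '1']
print(split_tokens('x+13.5*10x-4e1'))
```

Since findall() returns matches in the order they occur, the list stays flat and ordered. Whitespace is skipped rather than returned as a token.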

#2

Another alternative not suggested here is to use the nltk.tokenize module.

#3

Well, the problem does not seem to be that simple. I think a good way to get a robust (but, unfortunately, not so short) solution is to use Python Lex-Yacc (PLY) to build a full-fledged tokenizer. Lex-Yacc is common practice for this (not only in Python), so ready-made grammars for a simple arithmetic tokenizer (like this one) may already exist, and you only have to adapt them to your specific needs.
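
To give a rough idea of what such a rule-based lexer looks like without installing anything, here is a standard-library stand-in. It does not use PLY; it applies the same idea (one named rule per token type) with re named groups. The rule names, the operator set and the lex helper are assumptions chosen for this illustration:

```python
import re

# One (name, pattern) rule per token type; earlier rules take priority.
TOKEN_SPEC = [
    ('NUMBER', r'\d+(?:\.\d+)?(?:[eE]\d+)?'),  # 10, 13.5, 4e1
    ('NAME',   r'[A-Za-z_]+'),                 # variable names
    ('OP',     r'[+\-*/()]'),                  # assumed operator set
    ('SKIP',   r'\s+'),                        # whitespace, ignored
    ('ERROR',  r'.'),                          # anything else
]
MASTER_RE = re.compile(
    '|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPEC)
)

def lex(expression):
    """Return a flat list of (kind, text) pairs (hypothetical helper)."""
    tokens = []
    for match in MASTER_RE.finditer(expression):
        kind, text = match.lastgroup, match.group()
        if kind == 'SKIP':
            continue
        if kind == 'ERROR':
            raise ValueError(
                f'unexpected character {text!r} at position {match.start()}'
            )
        tokens.append((kind, text))
    return tokens

print(lex('x+13.5*10x-4e1'))
print([text for _, text in lex('-4x1')])  # ['-', '4', 'x', '1']
```

Unlike the split-based answers, this reports unknown characters instead of silently passing them through, and each token carries its type, which is what a parser built on top would need.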
