NumPy has some very useful string operations, which vectorize the usual Python string operations.
Compared to these operations and to pandas.str, the numpy strings module seems to be missing a very important one: the ability to slice each string in the array. For example,
a = numpy.array(['hello', 'how', 'are', 'you'])
numpy.char.sliceStr(a, slice(1, 3))  # hypothetical function, shown only to illustrate the wish
>>> numpy.array(['el', 'ow', 're', 'ou'])
Am I missing some obvious method in the module with this functionality? Otherwise, is there a fast vectorized way to achieve this?
4 solutions
#1
11
Here's a vectorized approach -
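def slicer_vectorized(a, start, end):
    # View the strings as single characters, one row per string, and slice columns.
    b = a.view((str, 1)).reshape(len(a), -1)[:, start:end]
    # Rejoin into strings of length end - start (Python 2 era calls;
    # newer NumPy spells these tobytes/frombuffer).
    return np.fromstring(b.tostring(), dtype=(str, end - start))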
Sample run -
In [68]: a = np.array(['hello', 'how', 'are', 'you'])
In [69]: slicer_vectorized(a,1,3)
Out[69]:
array(['el', 'ow', 're', 'ou'],
dtype='|S2')
In [70]: slicer_vectorized(a,0,3)
Out[70]:
array(['hel', 'how', 'are', 'you'],
dtype='|S3')
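For readers on Python 3, the same idea can be sketched roughly as follows. This is an adaptation, not the answer's own code: the name slice_fixed_width is made up for illustration, and it assumes a 1-D array of dtype U or S with 0 <= start < end. Note that the slice is taken against the padded fixed-width buffer, so negative indices would count from the array's string width rather than from each string's own length.

```python
import numpy as np

def slice_fixed_width(a, start, end):
    """Slice every string in a 1-D fixed-width string array (hypothetical helper)."""
    a = np.ascontiguousarray(a)
    kind = a.dtype.kind                      # 'U' (unicode) or 'S' (bytes)
    if kind not in "US":
        raise TypeError("expected a fixed-width string array")
    # Number of characters each element can hold.
    width = a.dtype.itemsize // np.dtype(kind + "1").itemsize
    # One row per string, one column per character.
    chars = a.view(kind + "1").reshape(len(a), width)
    # The column slice is not contiguous, so copy it before re-viewing.
    sliced = np.ascontiguousarray(chars[:, start:end])
    out_width = sliced.shape[1]
    # Merge each row of single characters back into one string.
    return sliced.view(f"{kind}{out_width}").ravel()

a = np.array(['hello', 'how', 'are', 'you'])
result = slice_fixed_width(a, 1, 3)
```

Because the work is done on the underlying buffer, no Python-level function is called per element, which is where the speed-up over the per-element approaches comes from.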
Runtime test -
I timed all the approaches posted by other authors that I could run on my end, along with the vectorized approach from earlier in this post.
Here are the timings -
In [53]: # Setup input array
...: a = np.array(['hello', 'how', 'are', 'you'])
...: a = np.repeat(a,10000)
...:
# @Alberto Garcia-Raboso's answer
In [54]: %timeit slicer(1, 3)(a)
10 loops, best of 3: 23.5 ms per loop
# @hpaulj's answer
In [55]: %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
100 loops, best of 3: 11.6 ms per loop
# Using a list comprehension
In [56]: %timeit np.array([i[1:3] for i in a])
100 loops, best of 3: 12.1 ms per loop
# From this post
In [57]: %timeit slicer_vectorized(a,1,3)
1000 loops, best of 3: 787 µs per loop
#2
4
Most, if not all, of the functions in np.char apply existing str methods to each element of the array. That is a little faster than direct iteration (or vectorize), but not drastically so.
There isn't a string slicer; at least not by that sort of name. Closest is indexing with a slice:
In [274]: 'astring'[1:3]
Out[274]: 'st'
In [275]: 'astring'.__getitem__
Out[275]: <method-wrapper '__getitem__' of str object at 0xb3866c20>
In [276]: 'astring'.__getitem__(slice(1,4))
Out[276]: 'str'
An iterative approach can use frompyfunc (which is also used by vectorize):
In [277]: a = np.array(['hello', 'how', 'are', 'you'])
In [278]: np.frompyfunc(lambda x:x[1:3],1,1)(a)
Out[278]: array(['el', 'ow', 're', 'ou'], dtype=object)
In [279]: np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
Out[279]:
array(['el', 'ow', 're', 'ou'],
dtype='<U2')
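The frompyfunc idea can be wrapped so that it accepts a slice object, much like the function wished for in the question. This is only a sketch: slice_each is a made-up name, and it assumes a 1-D array of Python 3 strings.

```python
import numpy as np

def slice_each(a, sl):
    """Apply the slice object `sl` to every string in `a` (hypothetical helper)."""
    # frompyfunc calls the lambda once per element and returns an object array.
    pick = np.frompyfunc(lambda s: s[sl], 1, 1)
    out = pick(a)
    # Convert the object array back to a fixed-width unicode array.
    return out.astype(str)

a = np.array(['hello', 'how', 'are', 'you'])
middle = slice_each(a, slice(1, 3))
backwards = slice_each(a, slice(None, None, -1))
```

Since each element is sliced as an ordinary Python string, steps and negative indices behave exactly as they do for str, at the cost of one Python call per element.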
I could view it as a single-character array and slice that:
In [289]: a.view('U1').reshape(4,-1)[:,1:3]
Out[289]:
array([['e', 'l'],
['o', 'w'],
['r', 'e'],
['o', 'u']],
dtype='<U1')
I still need to figure out how to convert it back to 'U2'.
In [290]: a.view('U1').reshape(4,-1)[:,1:3].copy().view('U2')
Out[290]:
array([['el'],
['ow'],
['re'],
['ou']],
dtype='<U2')
The initial view step shows the databuffer as Py3 characters (these would be bytes in an S or Py2 string case):
In [284]: a.view('U1')
Out[284]:
array(['h', 'e', 'l', 'l', 'o', 'h', 'o', 'w', '', '', 'a', 'r', 'e', '',
'', 'y', 'o', 'u', '', ''],
dtype='<U1')
Picking the 1:3 columns amounts to picking a.view('U1')[[1,2,6,7,11,12,16,17]] and then reshaping and viewing. Without getting into details, I'm not surprised that it requires a copy.
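The index arithmetic behind that statement can be checked with a small sketch. The variable names here are illustrative only. Each string occupies `width` consecutive slots in the flat character view, so column c of row r sits at flat position r * width + c.

```python
import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])
chars = a.view('U1')                                   # flat view, 20 single characters
width = a.dtype.itemsize // np.dtype('U1').itemsize    # 5 characters per string

# Flat positions of columns 1 and 2 in every row.
cols = np.arange(1, 3)
flat_idx = (np.arange(len(a))[:, None] * width + cols).ravel()

by_index = chars[flat_idx]                             # fancy indexing on the flat view
by_slice = chars.reshape(len(a), width)[:, 1:3]        # column slice on the 2-D view

# The column slice skips characters between rows, so it is not one solid block.
is_contiguous = by_slice.flags['C_CONTIGUOUS']
```

The selected characters are scattered through the buffer rather than adjacent, which is plausibly why the session above calls .copy() before viewing the result as 'U2'.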
#3
3
Interesting omission... I guess you can always write your own:
import numpy as np

def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

a = np.array(['hello', 'how', 'are', 'you'])
print(slicer(1, 3)(a)) # => ['el' 'ow' 're' 'ou']
EDIT: Here are some benchmarks using the text of Ulysses by James Joyce.
It seems the clear winner is @hpaulj's last strategy.
@Divakar gets into the race improving on @hpaulj's last strategy.
import numpy as np
import requests

ulysses = requests.get('http://www.gutenberg.org/files/4300/4300-0.txt').text
a = np.array(ulysses.split())

# Ufunc
def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

%timeit slicer(1, 3)(a)
# => 1 loop, best of 3: 221 ms per loop

# Non-mutating loop
def loop1(a):
    out = np.empty(len(a), dtype=object)
    for i, word in enumerate(a):
        out[i] = word[1:3]
    return out  # return the result so the function is usable outside the benchmark

%timeit loop1(a)
# => 1 loop, best of 3: 262 ms per loop

# Mutating loop
def loop2(a):
    for i in range(len(a)):
        a[i] = a[i][1:3]

b = a.copy()
%timeit -n 1 -r 1 loop2(b)
# => 1 loop, best of 1: 285 ms per loop

# From @hpaulj's answer
%timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
# => 10 loops, best of 3: 141 ms per loop

%timeit np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
# => 1 loop, best of 3: 170 ms per loop

%timeit a.view('U1').reshape(len(a),-1)[:,1:3].astype(object).sum(axis=1)
# => 10 loops, best of 3: 60.7 ms per loop

def slicer_vectorized(a,start,end):
    b = a.view('S1').reshape(len(a),-1)[:,start:end]
    return np.fromstring(b.tostring(),dtype='S'+str(end-start))

%timeit slicer_vectorized(a,1,3)
# => The slowest run took 5.34 times longer than the fastest.
# This could mean that an intermediate result is being cached.
# 10 loops, best of 3: 16.8 ms per loop
#4
2
To solve this, so far I've been transforming the numpy array to a pandas Series and back. It is not a pretty solution, but it works, and it is relatively fast.
a = numpy.array(['hello', 'how', 'are', 'you'])
pandas.Series(a).str[1:3].values
array(['el', 'ow', 're', 'ou'], dtype=object)