How to create a custom numpy dtype using cython

There are examples for creating custom numpy dtypes using C here:

Additionally, it seems to be possible to create custom ufuncs in cython:

It seems like it should also be possible to create a dtype using cython (and then create custom ufuncs for it). Is it possible? If so, can you post an example?

USE CASE:

I want to do some survival analysis. The basic data elements are survival times (floats) with associated censor values (False if the associated time represents a failure time and True if it instead represents a censoring time (i.e., no failure occurred during the period of observation)).

Obviously I could just use two numpy arrays to store these values: a float array for the times and a bool array for the censor values. However, I want to account for the possibility of an event occurring multiple times (this is a good model for, say, heart attacks - you can have more than one). In this case, I need an array of objects which I call MultiEvents. Each MultiEvent contains a sequence of floats (uncensored failure times) and an observation period (also a float). Note that the number of failures is not the same for all MultiEvents.

I need to be able to perform a few operations on an array of MultiEvents:

1. Get the number of failures for each MultiEvent.
2. Get the censored time (that is, the period of observation minus the sum of all failure times).
3. Calculate a log likelihood based on additional arrays of parameters (such as an array of hazard values). For example, the log likelihood for a single MultiEvent M and constant hazard value h would be something like:

sum(log(h) + h*t for t in M.times) - h*(M.period - sum(M.times))

where M.times is the list (array, whatever) of failure times and M.period is the total observation period. I want the proper numpy broadcasting rules to apply, so that I can do:

log_lik = logp(M_vec,h_vec)

and it will work as long as the dimensions of M_vec and h_vec are compatible.

My current implementation uses numpy.vectorize. That works well enough for 1 and 2, but it is too slow for 3. Note also that I can't do this because the number of failures in my MultiEvent objects is not known ahead of time.

2 Answers

#1

Numpy arrays are most suitable for data types with fixed size. If the objects in the array are not fixed size (such as your MultiEvent) the operations can become much slower.

I would recommend storing all of the survival times in a 1d linear record array with 3 fields: event_id, time, period. Each event can appear multiple times in the array:

>>> import numpy as np
>>> rawdata = [(1, 0.4, 4), (1, 0.6, 6), (2, 2.6, 6)]
>>> npdata = np.rec.fromrecords(rawdata, names='event_id,time,period')
>>> print(npdata)
[(1, 0.4, 4) (1, 0.6, 6) (2, 2.6, 6)]

To get the data for a specific event you could use fancy (boolean mask) indexing:

>>> eventdata = npdata[npdata.event_id == 1]
>>> print(eventdata)
[(1, 0.4, 4) (1, 0.6, 6)]
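
The same flat layout can also give per-event aggregates without a Python loop. The following is only a minimal sketch: the sample data is made up for illustration, and it assumes that every row of an event carries the same observation period. It uses np.unique and np.bincount to get the number of failures and the censored time for each event:

```python
import numpy as np

# Hypothetical sample data: (event_id, failure time, observation period).
rawdata = [(1, 0.4, 4.0), (1, 0.6, 4.0), (2, 2.6, 6.0)]
npdata = np.rec.fromrecords(rawdata, names='event_id,time,period')

# Map every row to a dense group index 0..n_events-1.
ids, first_row, group = np.unique(
    npdata.event_id, return_index=True, return_inverse=True)

n_failures = np.bincount(group)                       # failures per event
total_time = np.bincount(group, weights=npdata.time)  # summed failure times
period = npdata.period[first_row]                     # one period per event
censored_time = period - total_time
```

Note that an event with no failures has no rows in this layout, so such events would probably need to be tracked separately (for example in a second array of periods).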

The advantage of this approach is that you can easily integrate it with your ndarray-based functions. You can also access these arrays from cython as described in the manual:

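cimport numpy as np   # compile-time declarations (np.int32_t, np.ndarray, ...)
import numpy as np    # run-time module (np.zeros, np.dtype, ...)

# Packed struct whose layout matches the unaligned structured dtype below.
cdef packed struct Event:
    np.int32_t event_id
    np.float64_t time
    np.float64_t period

def f():
    # Typed buffer access: each element of b is read as an Event struct.
    cdef np.ndarray[Event] b = np.zeros(10,
        dtype=np.dtype([('event_id', np.int32),
                        ('time', np.float64),
                        ('period', np.float64)]))
    <...>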
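
Before reaching for Cython, it may be worth checking whether plain NumPy on the flat layout is already fast enough. As a rough sketch, the log likelihood quoted in the question could be written as below. The function name logp_flat is hypothetical, the formula is copied from the question as-is, and each event is assumed to have a single observation period repeated on all of its rows:

```python
import numpy as np

def logp_flat(event_id, time, period, h):
    """Per-event log likelihood, using the formula quoted in the question.

    h may be a scalar or any array that broadcasts against the
    per-event results (one entry per distinct event_id).
    """
    # Dense group index for every row, plus the first row of each event.
    ids, first_row, group = np.unique(
        event_id, return_index=True, return_inverse=True)
    n = np.bincount(group)                    # number of failures per event
    total = np.bincount(group, weights=time)  # sum of failure times per event
    per = np.asarray(period, dtype=float)[first_row]
    h = np.asarray(h, dtype=float)
    # sum(log(h) + h*t) - h*(period - sum(t)), vectorised over events.
    return n * np.log(h) + h * total - h * (per - total)
```

Called as logp_flat(npdata.event_id, npdata.time, npdata.period, h_vec), this returns one value per event and follows the usual broadcasting rules for h_vec.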

#2

I apologise for not answering the question directly, but I've had similar problems before. If I understand correctly, the real problem you're now having is that you have variable-length data, which is really, really not one of the strengths of numpy, and is the reason you're running into performance issues. Unless you know in advance the maximum number of entries for a MultiEvent, you'll have problems, and even then you'll be wasting loads of memory/disk space on zero padding for those events that aren't multi events.

You have data points with more than one field, some of which are related to other fields, and some of which need to be identified in groups. This hints strongly that you should consider a database of some form for storing this information, for performance, memory, space-on-disk and sanity reasons.

It will be much easier for a person new to your code to understand a simple database schema than a complicated, hacked-on-numpy structure that will be frustratingly slow and bloated. SQL queries are quick and easy to write in comparison.

Based on my understanding of your explanation, I would suggest having Event and MultiEvent tables, where each Event entry has a foreign key into the MultiEvent table where relevant.
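
To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions chosen for illustration, not a schema taken from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE multi_event (
    id     INTEGER PRIMARY KEY,
    period REAL NOT NULL              -- total observation period
);
CREATE TABLE event (
    id             INTEGER PRIMARY KEY,
    multi_event_id INTEGER NOT NULL REFERENCES multi_event(id),
    time           REAL NOT NULL      -- one uncensored failure time
);
""")

# Hypothetical data: MultiEvent 3 has no failures at all.
conn.executemany("INSERT INTO multi_event (id, period) VALUES (?, ?)",
                 [(1, 4.0), (2, 6.0), (3, 5.0)])
conn.executemany("INSERT INTO event (multi_event_id, time) VALUES (?, ?)",
                 [(1, 0.4), (1, 0.6), (2, 2.6)])

# Number of failures and censored time per MultiEvent; the LEFT JOIN
# keeps MultiEvents that have no rows in the event table.
rows = conn.execute("""
SELECT m.id,
       COUNT(e.id) AS n_failures,
       m.period - COALESCE(SUM(e.time), 0.0) AS censored_time
FROM multi_event AS m
LEFT JOIN event AS e ON e.multi_event_id = m.id
GROUP BY m.id
ORDER BY m.id
""").fetchall()
```

Each row in rows is a tuple (id, n_failures, censored_time), which can be loaded back into numpy arrays for the numerical part of the analysis.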
