Adding text column and row headers to a numpy array

I am creating a 2d summary matrix from a 3d array using the following code:

numTests=len(TestIDs)
numColumns=11
numRows=6
SummaryMeansArray =  p.array([])
summary3dArray = ma.zeros((numTests,numColumns,numRows))

j=0
for j in range(0,len(TestIDs)):
    print 'j is:  ',j
    TestID=str(TestIDs[j])
    print 'TestID is:  ',TestID
    reader=csv.reader(inputfile)

    m=1
    for row in reader:
        if row[0]!='TestID':
            summary3dArray[j,1,m] =row[2]
            summary3dArray[j,2,m] =row[3]
            summary3dArray[j,3,m] =row[4]
            summary3dArray[j,4,m] =row[5]
            summary3dArray[j,5,m] =row[6]
            summary3dArray[j,6,m] =row[7]
            summary3dArray[j,7,m] =row[8]
            summary3dArray[j,8,m] =row[9]
            summary3dArray[j,9,m] =row[10]
            summary3dArray[j,10,m] =row[11]
            m+=1
    inputfile.close()
outputfile=open(outputFileName, "wb")
writer = csv.writer(outputfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
outputfile.close()

smith='test'

summary3dArray.mask = (summary3dArray.data == 0) # mask all data equal to zero
summaryMeansArray = mean(summary3dArray, axis=0) # the returned shape is (numColumns,numRows)
print 'SummaryMeansArray is:  ',summaryMeansArray

The data returned by printing the 2d matrix is:

SummaryMeansArray is:   [[-- -- -- -- -- --]  
[-- 0.872486111111 0.665114583333 0.578107142857 0.495854166667 0.531722222222]  
[-- 69.6520408802 91.3136933451 106.82865123 125.834593798 112.847127834]  
[-- 1.26883876577 1.64726525154 1.82965948427 1.93913919335 1.81572414167]  
[-- 0.0707222222222 0.0696458333333 0.0654285714286 0.06196875 0.0669444444444]  
[-- 0.219861111055 0.195958333333 0.179925 0.1641875 0.177]  
[-- 0.290583333278 0.265604166667 0.245353571429 0.22615625 0.243944444444]  
[-- 24.1924238322 23.4668576333 23.2784801383 22.8667912971 21.0416383955]  
[-- 90.7234287345 108.496149905 112.364863351 113.57480005 144.061033524]  
[-- 6.16448575902 9.7494285825 11.6270150699 13.5876342704 16.2569218735]  
[-- 0.052665615304 0.069989497088 0.0783212378582 0.0846757181338 0.0862920065249]]

I have two questions:
1.) I want to add textual row headers and column headers to summaryMeansArray, but I am getting error messages when I try to do this now. What is the proper syntax for adding row headers and column headers in this code?

2.) Is summaryMeansArray set up to have 11 columns and 6 rows? My understanding is that the proper syntax is columns,rows. However, it seems to be printing out 11 rows and 6 columns above. Is this just because python groups each column's data within its own brackets by convention? Or did I mess up the syntax?

2 answers

#1

1.) I would recommend storing column and row header information in a separate data structure. Although NumPy arrays can store mixed data types (in this case strings and floats), I try to avoid it: mixing data types is messy and seems inefficient to me. If you want to, you can make your own class that holds both the matrix data and the header information, which seems like a cleaner solution to me.

2.) No, summaryMeansArray is set up to have 11 rows and 6 columns. The first dimension of a matrix is the number of rows. You can get the transpose of summaryMeansArray with summaryMeansArray.T. When you take the mean of summary3dArray along axis 0, the next axis becomes the rows and the one after that becomes the columns.
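
As a rough illustration of the axis behaviour described above, the sketch below builds a stand-in masked array with the same shape as summary3dArray and averages it over axis 0. The data values are made up (the real ones come from the CSV files), and the snake_case names are chosen only for this example.

```python
import numpy as np
import numpy.ma as ma

num_tests, num_columns, num_rows = 3, 11, 6

# Stand-in data (an assumption): the real values would be read from the CSV files.
data = np.arange(num_tests * num_columns * num_rows, dtype=float)
summary_3d = ma.masked_equal(data.reshape(num_tests, num_columns, num_rows), 0)

# Averaging over axis 0 removes that axis; the remaining axes keep their order,
# so the old axis 1 (length 11) becomes the rows and axis 2 (length 6) the columns.
summary_means = summary_3d.mean(axis=0)

print(summary_means.shape)    # shape of the 2d result
print(summary_means.T.shape)  # shape after transposing
```

If 6 rows and 11 columns are wanted for display, transposing with `.T` just before printing or writing is probably simpler than reordering the dimensions of the 3d array.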

Edit: As per request, you can create a python list from a numpy array with the method tolist(). For instance,

newMeansArray = summaryMeansArray.tolist()

Then you can insert the column headers using

newMeansArray.insert(0,headers)

Inserting the row headers can be done with:

newMeansArray[i].insert(0,rowheader)

for each row i. Of course, if you've already inserted the column headers, then the counting for i starts with 1 rather than 0.
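
Put together, the steps above might look like the following sketch. The array is a small stand-in for summaryMeansArray, and the header strings are hypothetical placeholders. Here the row headers are inserted before the column headers, so the index i still starts at 0.

```python
import numpy as np
import numpy.ma as ma

# Small stand-in for summaryMeansArray: 3 rows, 2 columns, first row masked.
raw = np.array([[0.0, 0.0], [0.87, 0.66], [69.6, 91.3]])
summary_means = ma.masked_equal(raw, 0)

# Hypothetical labels; substitute the real ones.
col_headers = ["metric", "col_a", "col_b"]  # first entry labels the row-header column
row_headers = ["row_0", "row_1", "row_2"]

# tolist() gives nested Python lists; masked entries become None.
table = summary_means.tolist()

# Prepend each row's header while the list indices still match the array rows.
for i, header in enumerate(row_headers):
    table[i].insert(0, header)

# Then add the column headers as the first row.
table.insert(0, col_headers)

for line in table:
    print(line)
```

The resulting list of lists can be passed row by row to `csv.writer(...).writerow`, which keeps the numeric array itself free of strings.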

#2

I agree with Justin Peel's answer regarding question #1 (row and column headers).

I created my own class that allows me to decorate a matrix with extra data necessary to my task at hand (for example: row and column labels, a descriptive text for each row, or numerical properties of a row that are external to or independent of the matrix values).

My first solution that I used for almost 2 years was to have an object for each matrix row, where I would store each row's matrix values in a dictionary, with the dictionary key (ID) providing the second piece of information for that pair's matrix value. This was quite useful, especially for non-square matrices, and matrix manipulations and output were isolated cleanly.

However, I ran into a problem with this design: scalability. When using square, symmetric matrices, I needed 91 MB of memory for a 1000x1000 matrix, 327 MB of memory for a 2000x2000 matrix, and 1900 MB of memory for a 5000x5000 matrix. For my recent project that works on the order of 20000x20000 matrix entries, I will quickly and disastrously use up all of my workstation's 8GB of RAM and more.

My second solution was to have a single dictionary of (ID1,ID2)-->value mappings. Compared to my first solution, a 1000x1000 matrix required only 20 MB of memory. This solution also fails miserably in the scalability department, but in a different way, because the time to create and store C(1000+1,2)=500500 mappings was over 3 minutes, compared to 0.88 seconds when using my first design.
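
The answer does not show its implementation, so the following is only a guess at what a single (ID1,ID2)-->value dictionary for a symmetric matrix could look like. It assumes that each unordered pair is stored once, diagonal included, which is what the count C(n+1,2) suggests. The IDs and the `lookup` helper are hypothetical.

```python
from itertools import combinations_with_replacement

ids = ["a", "b", "c", "d"]  # hypothetical row/column IDs

# One entry per unordered pair, diagonal included: C(n + 1, 2) entries in total.
values = {}
for id1, id2 in combinations_with_replacement(ids, 2):
    values[(id1, id2)] = float(len(values))  # placeholder value

def lookup(id1, id2):
    """Return the value for a pair regardless of argument order."""
    if (id1, id2) in values:
        return values[(id1, id2)]
    return values[(id2, id1)]

n = len(ids)
print(len(values), n * (n + 1) // 2)      # number of stored pairs vs. C(n + 1, 2)
print(lookup("b", "a") == lookup("a", "b"))
```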

My third and current solution was to create a mapping between the numpy matrix row/column index and a matrix row/column label. Using numpy directly with a 5000x5000 matrix required 202 MB of memory on my system, a 10000x10000 matrix required 774 MB, and a 20000x20000 matrix required 3000 MB. A mapping of 20000 IDs to row/column indexes required 5 MB of memory on my system, which is negligible compared to the value matrix itself.
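
A minimal sketch of this third approach, assuming a plain dictionary is used for the label-to-index mapping (the answer does not say which structure it actually uses). The labels and the `set_value`/`get_value` helpers are hypothetical.

```python
import numpy as np

labels = ["a", "b", "c"]  # hypothetical row/column IDs

# Map each label to its row/column index once.
index_of = {label: i for i, label in enumerate(labels)}

# The values live in an ordinary float64 array; labels never enter it.
matrix = np.zeros((len(labels), len(labels)))

def set_value(row_label, col_label, value):
    """Store a value addressed by labels instead of integer indexes."""
    matrix[index_of[row_label], index_of[col_label]] = value

def get_value(row_label, col_label):
    """Read a value addressed by labels."""
    return matrix[index_of[row_label], index_of[col_label]]

set_value("a", "c", 1.5)
print(get_value("a", "c"))
print(matrix.nbytes)  # 8 bytes per float64 element
```

At 8 bytes per element, a 5000x5000 float64 array takes 200,000,000 bytes, which is in line with the 202 MB reported for that size.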

If one is processing only small matrices less than 100x100 elements, then my first solution will be quick and the implemented data structure will be easy to manipulate and extend. However, if you are thinking of large-scale processing, then I recommend the third solution.
