I want to insert an array created from a Pandas DataFrame in Python that contains empty values; these default to np.nan in a Pandas DataFrame. I don't want them stored as 'NaN' in my PostgreSQL database. I want my PostgreSQL arrays to contain empty values, like '{123,24,,23}', so they are not counted by aggregate functions such as the mean or standard deviation across indices. I am not sure whether sparse arrays are even possible in PostgreSQL. There won't be many sparse arrays in my dataset; I am just testing this as an edge case.
My table schema:
create_table = '''
CREATE TABLE {t} (
    patient_id VARCHAR[20] PRIMARY KEY,
    gene_expression double precision []
);
'''
The relevant Python code (I don't know how to write the proper SQL here). I converted the array into a string, because Python lists cannot be sparse:
df = df.fillna('')

NCI = [1]
MCI = [2, 3]
AD = [4, 5]
other = [6]

insert_sql = '''
INSERT INTO {t} (patient_id, gene_expression)
VALUES (%s, %s);
'''

cur = psql_conn.cursor()
for index, row in df.iterrows():
    arr = row[2:].tolist()
    postgres_arr = ','.join(map(str, arr))
    if row['DIAGNOSIS'].isdigit():
        if int(row['DIAGNOSIS']) in NCI:
            cur.execute(insert_sql.format(t='nci'), (row['PATIENT_ID'], postgres_arr,))
        elif int(row['DIAGNOSIS']) in MCI:
            cur.execute(insert_sql.format(t='mci'), (row['PATIENT_ID'], postgres_arr,))
        elif int(row['DIAGNOSIS']) in AD:
            cur.execute(insert_sql.format(t='ad'), (row['PATIENT_ID'], postgres_arr,))
        elif int(row['DIAGNOSIS']) in other:
            cur.execute(insert_sql.format(t='other'), (row['PATIENT_ID'], postgres_arr,))
    elif row['DIAGNOSIS'] == '':
        cur.execute(insert_sql.format(t='na'), (row['PATIENT_ID'], postgres_arr,))
    else:
        print('ERROR: unknown diagnosis {d}.'.format(d=row['DIAGNOSIS']))
psql_conn.commit()
cur.close()
My Error:
psycopg2.DataError: malformed array literal: "{2.0,2.4,}"
LINE 3: VALUES ('X100_120417','{2.0,2.4,}');
                              ^
DETAIL:  Unexpected "}" character.
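As context for the error: a PostgreSQL array literal cannot contain an empty element, so "{2.0,2.4,}" is rejected; a missing element has to be written as NULL. Below is a minimal sketch of one way to sidestep hand-building the literal, assuming psycopg2's standard adaptation of Python lists to arrays (None becomes NULL); the to_pg_array helper is hypothetical:

import math

# Hypothetical helper: map NaN to None so psycopg2 sends a PostgreSQL
# array with NULL elements instead of a hand-built string literal.
def to_pg_array(values):
    return [None if (isinstance(v, float) and math.isnan(v)) else v
            for v in values]

cur = psql_conn.cursor()  # psql_conn as in the code above
cur.execute(
    'INSERT INTO nci (patient_id, gene_expression) VALUES (%s, %s);',
    ('X100_120417', to_pg_array([2.0, 2.4, float('nan')])),
)
psql_conn.commit()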
2 Answers
#1
If you want to create a column with a maximum length, use parentheses, not square brackets: change VARCHAR[20] to VARCHAR(20) in the CREATE TABLE statement. Otherwise the first %s is expected to be an array, while you are passing a varchar. Here is a sample; note that patient_id is created as an array, not a varchar:
t=# CREATE TABLE so23 (
    patient_id VARCHAR[20] PRIMARY KEY,
    gene_expression double precision []
);
CREATE TABLE
t=# \d+ so23
                                  Table "public.so23"
     Column      |        Type         | Modifiers | Storage  | Stats target | Description
-----------------+---------------------+-----------+----------+--------------+-------------
 patient_id      | character varying[] | not null  | extended |              |
 gene_expression | double precision[]  |           | extended |              |
Indexes:
    "so23_pkey" PRIMARY KEY, btree (patient_id)
#2
After a few hours of trial and error:
Load this Pandas DataFrame df from some CSV file:
+----+-------+--------------+
| id | stuff | array |
+----+-------+--------------+
| 0 | a | {1,2,3} |
| 1 | b | {1,np.nan,3} |
| 2 | 45 | {np.nan,4,2} |
+----+-------+--------------+
Process it in Pandas using:
df = df.fillna('NULL')

insert_sql = '''
INSERT INTO {t} (patient_id, gene_expression)
VALUES (%s, %s);
'''

for index, row in df.iterrows():
    arr = row[2:].tolist()
    postgres_arr = '{' + ','.join(map(str, arr)) + '}'
    cur.execute(insert_sql.format(t='my_table'), (row['id'], postgres_arr,))
My main issue was realizing that the string 'NULL' inside an array literal is parsed as the PostgreSQL NULL keyword. NULL is ignored in calculations, so aggregate functions return a value as if the NULL elements were not there, whereas with NaN every operation involving it results in NaN.
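To see the difference this makes, here is a quick check against avg() (a sketch, assuming the psql_conn connection from the question):

cur = psql_conn.cursor()

# NULL elements are skipped by aggregates...
cur.execute("SELECT avg(x) FROM unnest('{1.0,NULL,3.0}'::float8[]) AS x;")
print(cur.fetchone()[0])  # 2.0 -- the NULL element is ignored

# ...whereas a single NaN makes the whole aggregate NaN.
cur.execute("SELECT avg(x) FROM unnest('{1.0,NaN,3.0}'::float8[]) AS x;")
print(cur.fetchone()[0])  # nan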