Loading JSON from a complex CSV/DataFrame for MongoDB while retaining dtypes

Date: 2021-12-18 21:31:33

I am trying to build a JSON tree for a queryable MongoDB database from some disparate CSV/Excel files. The data is often incomplete and linked by a subject id.

Example data below:

    subid,firstvisit,name,contact,dob,gender,visitdate1,age,visitcategory,samplenumber,label_on_sample,completed_by
    1,12/31/11,Bob,,12/31/00,Male,,,,,,
    1,,,,,,12/31/15,17,Baseline Visit,,,
    1,,,,,,12/31/16,18,Follow Up Visit,,,
    1,,,,,,12/31/17,18,Follow Up Visit,,,
    1,,,,12/31/00,Male,,17,,XXX123,1,Sally
    2,1/1/12,,,1/1/01,Female,,,,,,
    2,,,,,,1/1/11,10,Baseline Visit,,,
    2,,,,,,1/1/12,11,Follow Up Visit,,,
    2,,,,,,1/1/13,12,Follow Up Visit,,,
    2,,,,,,1/1/14,13,Follow Up Visit,,,
    2,,,,,,1/1/15,14,Follow Up Visit,,,
    2,,,,1/1/01,Female,,15,,YYY456,2,
    2,,,,1/1/01,Female,,15,,ZZZ789,2,Sally

I'd like the output to look something like this:

[
    {
        "subject_id": "1",
        "name": "Bob",
        "dob": "12/31/00",
        "gender": "Male",
        "visits": {
            "12/31/15": {
                "age": "17",
                "visit_category": "Baseline Visit"
            },
            "12/31/16": {
                "age": "18",
                "visit_category": "Follow Up Visit"
            },
            "12/31/17": {
                "age": "18",
                "visit_category": "Follow Up Visit"
            }
        },
        "samples": {
            "XXX123": {
                "completed_by": "Sally",
                "label_on_sample": "1"
            }
        }
    },
    {
        "subject_id": "2",
        "name": null,
        "dob": "1/1/01",
        "gender": "Female",
        "visits": {
            "1/1/11": {
                "age": "10",
                "visit_category": "Baseline Visit"
            },
            "1/1/12": {
                "age": "11",
                "visit_category": "Follow Up Visit"
            },
            "1/1/13": {
                "age": "12",
                "visit_category": "Follow Up Visit"
            },
            "1/1/14": {
                "age": "13",
                "visit_category": "Follow Up Visit"
            },
            "1/1/15": {
                "age": "14",
                "visit_category": "Follow Up Visit"
            }
        },
        "samples": {
            "YYY456": {
                "completed_by": null,
                "label_on_sample": "2"
            },
            "ZZZ789": {
                "completed_by": "Sally",
                "label_on_sample": "2"
            }
        }
    }
]

I have a program that puts all of this into the right structure, but unfortunately, because it uses csv's DictReader, every value comes in as a string, which makes it difficult to query in meaningful ways. The code is below:

from collections import defaultdict
from csv import DictReader


def solution(csv_filename):
    # One document per subject id, filled in as rows are encountered.
    by_subject_id = defaultdict(lambda: {
        'name': None,
        'dob': None,
        'gender': None,
        'visits': {},
        'samples': {}
    })

    with open(csv_filename) as f:
        dict_reader = DictReader(f)
        for row in dict_reader:
            non_empty = {k: v for k, v in row.items() if v}
            subject_id = non_empty['subid']  # must have to group by
            first_visit = non_empty.get('firstvisit')  # optional
            sample = non_empty.get('samplenumber')  # optional
            visit = non_empty.get('visitdate1')  # optional

            if first_visit:
                by_subject_id[subject_id].update({
                    'name': non_empty.get('name'),
                    'dob': non_empty.get('dob'),
                    'gender': non_empty.get('gender')
                })
            elif visit:
                by_subject_id[subject_id]['visits'][visit] = {
                    'age': non_empty.get('age'),
                    'visit_category': non_empty.get('visitcategory')
                }
            elif sample:
                by_subject_id[subject_id]['samples'][sample] = {
                    'completed_by': non_empty.get('completed_by'),
                    'label_on_sample': non_empty.get('label_on_sample')
                }
    return [{'subject_id': k, **v} for k, v in by_subject_id.items()]
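
To make the problem concrete: everything DictReader yields is a str, so once the documents land in MongoDB a numeric filter quietly matches nothing ('data.csv' here just stands in for the sample data above):

from csv import DictReader

with open('data.csv') as f:
    first_row = next(DictReader(f))

print(type(first_row['subid']))  # <class 'str'> -- every field, even numbers and dates
# In MongoDB, a query such as {'age': {'$gt': 16}} will not match the string "17",
# because comparison operators only match values of a comparable BSON type.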

What would be the best way to solve this issue? Could I convert this to work with a DataFrame and hopefully retain the dtypes?

Thanks so much for any advice. New to Mongo, just trying to get something that works.

1 Answer

#1

This is not the best solution, but using pandas might be helpful for keeping the types of the values. I did not look at the efficiency of the code, just the part that reads the CSV file, but you can do:

import pandas as pd
def solution(csv_filename):
    by_subject_id = defaultdict(lambda: {
        .
        .
    })

    df = pd.read_csv(csv_filename).fillna('')
    for _, row in df.iterrows():
        non_empty = {k: v for k, v in row.items() if v != ''}
        subject_id = non_empty['subid']  # must have to group by
        .
        .
        .

I tried to keep the changed lines to a few; everything else is the same. Ultimately, it would be better if you could pass your cleaned DataFrame directly as the parameter instead of reading the CSV file. Otherwise, you can add dtype= in read_csv(), such as:

df = pd.read_csv(csv_filename, dtype={'subid': int, 'age': int}).fillna('')

Add any other columns whose type you want to keep.

I hope it helps you
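
For completeness, here is a fuller sketch of this pandas route that keeps the original grouping logic but preserves numeric types end to end. It is only a sketch under a few assumptions: the column names are the ones from the sample CSV above, and 'data.csv', 'mydb' and 'subjects' in the usage snippet are made-up names. Two details matter with this data: a plain int dtype for 'age' makes read_csv raise because some rows leave it blank, so the nullable 'Int64' dtype is used instead, and fillna('') would push numeric columns back to object dtype, so the sketch filters with pd.notna() instead.

from collections import defaultdict

import pandas as pd


def solution(csv_filename):
    by_subject_id = defaultdict(lambda: {
        'name': None,
        'dob': None,
        'gender': None,
        'visits': {},
        'samples': {}
    })

    # Nullable 'Int64' keeps ages as integers even though some rows leave
    # the column blank; dates stay as strings, matching the desired output.
    df = pd.read_csv(csv_filename, dtype={'subid': 'Int64', 'age': 'Int64'})

    for _, row in df.iterrows():
        # pd.notna() works for every dtype, numeric or string, so no
        # fillna('') is needed.
        non_empty = {k: v for k, v in row.items() if pd.notna(v)}

        subject_id = int(non_empty['subid'])       # must have, to group by
        first_visit = non_empty.get('firstvisit')  # optional
        sample = non_empty.get('samplenumber')     # optional
        visit = non_empty.get('visitdate1')        # optional

        if first_visit:
            by_subject_id[subject_id].update({
                'name': non_empty.get('name'),
                'dob': non_empty.get('dob'),
                'gender': non_empty.get('gender')
            })
        elif visit:
            age = non_empty.get('age')
            by_subject_id[subject_id]['visits'][visit] = {
                # int() drops the numpy scalar type, which neither json
                # nor pymongo can encode.
                'age': int(age) if age is not None else None,
                'visit_category': non_empty.get('visitcategory')
            }
        elif sample:
            by_subject_id[subject_id]['samples'][sample] = {
                'completed_by': non_empty.get('completed_by'),
                'label_on_sample': non_empty.get('label_on_sample')
            }

    return [{'subject_id': k, **v} for k, v in by_subject_id.items()]

With real numbers in the documents, they can go straight into MongoDB and be queried numerically:

from pymongo import MongoClient

docs = solution('data.csv')             # hypothetical file name
client = MongoClient()                  # assumes a local mongod
client.mydb.subjects.insert_many(docs)
print(client.mydb.subjects.count_documents({'visits.12/31/15.age': {'$gt': 16}}))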
