使用Regex从CSV中提取数据并将其转换为JSON

时间:2021-12-28 18:24:02

Imagine you have a table in a CSV file with this kind of layout:

假设在CSV文件中有一个表具有这种布局:

name,property1 [unit1],property2 [unit2]
name1,4.5,2.3
name2,3.2,7.4
name3,5.5,6.1

I need to convert each row to this kind of JSON structure (ie, for row 1):

我需要将每一行转换为这种JSON结构(即第一行):

{
    "name1": [
        {
            "properties": [
                {
                    "property_1": "_value_",
                    "unit": "unit1"
                },
                {
                    "property_2": "_value_",
                    "unit": "unit2"
                }
            ]
        }
    ]
}

On top of it all, I have to explain that I am using Qt 4.7 and can't update; also, I can't install Qxt so I'm relying on qt-json for the JSON parsing/encoding. More, the CSV file is not created/maintained by me, so I can't really change it either.

最重要的是,我必须解释我正在使用Qt 4.7,不能更新;另外,我不能安装Qxt,所以我依赖于JSON解析/编码的qt-json。而且,CSV文件不是由我创建/维护的,所以我也不能真正更改它。

So with all of this, I realised I need a few things, so this is a kind of multiple question:

有了这些,我意识到我需要一些东西,所以这是一个多重问题:

  • how should I write the RegEx to read the unit in each column's header? Please note that the unit is enclosed in rect-parenthesis.
  • 如何编写RegEx来读取每个列的标题中的单元?请注意,该单元包含在括号中。
  • imagine I extract both the header row and the other rows into a QList<QString>, separating each column as a string. How can I manage to sync all the bits of data in order to create the JSON structure I need on a QString? (I think I need it in a QString so I can dump each row in a different file, but I'm open to other options as well)
  • 假设我将头行和其他行都提取到QList 中,将每一列分隔为字符串。如何设法同步所有数据位,以便在QString上创建所需的JSON结构?(我想我需要它在QString中,这样我就可以在不同的文件中转储每一行,但是我也可以使用其他选项)

Just one final note - I also need to this to be somewhat scalable. The CSV files on which this will be apllied are very heterogenous in column count: some have 8 columns, others have 20.

最后一点要注意的是,我还需要使其具有一定的可扩展性。CSV文件在列数中是非常异构的:有些有8列,有些有20列。

I know it is not a good practice to post "multiquestions", but the thing is I'm feeling too overwhelmed with all of this, and because I have virtually no experience with Qt, I can't even define a plan to attack this. Hope someone can share some pointers. Thanks!

我知道发布“多问题”不是一个好习惯,但问题是我对所有这些都感到不知所措,而且因为我几乎没有Qt的经验,我甚至不能定义一个计划来解决这个问题。希望有人能分享一些建议。谢谢!

EDIT So, I've been thinking a little more about this and I don't actually know if this is a good idea/feasible but here is what I thought of:

编辑所以,我一直在思考这个问题,我不知道这是不是一个好主意/可行的方法,但我的想法是:

  • when going through the header row, I would check if each column string had a hit for the RegEx. If so, I would store the column index and the unit string in a list;
  • 在遍历标题行时,我将检查每个列字符串是否对RegEx有命中。如果是,我将把列索引和单位字符串存储在一个列表中;
  • then, when going through the other rows, in order to parse them into JSON, I would check in each column if it matched the index in the previous list, and if so, I would then add the unit to the map (as qt-json docs explains)
  • 然后,在遍历其他行时,为了将它们解析为JSON,我将检查每一列是否与前面列表中的索引匹配,如果匹配,然后将该单元添加到映射中(正如qt- docs解释的那样)

Does this make any sense? Can anyone mock up a skeleton I can work on for this?

这有意义吗?有人能模拟出一具骨架吗?

EDIT2

EDIT2

I've managed to get a few things working so far, but still not working as it should. Right now I have managed to read properly from the CSV file, but the output isn't coming out right. Can anyone share some insight?

到目前为止,我已经设法让一些东西发挥作用,但仍然没有发挥应有的作用。现在我已经成功地从CSV文件中读取了数据,但是输出结果并不正确。有人能分享一些见解吗?

NOTE: the processLineFromCSV function returns a QStringList obtained like so: QStringList cells = line.split(separator_char);

注意:processLineFromCSV函数返回如下所示的QStringList: QStringList cells = line.split(separator_char);

NOTE2: the RegEx was obtained from this answer.

注2:RegEx是从这个答案中获得的。

NOTE3: Check below for the type of output I'm getting. Right now I think the problem relates more to the usage of the qt-json lib than actually the rest of the code, but any help is welcome! :)

NOTE3:检查下面的输出类型。现在我认为问题更多地与qt-json库的使用有关,而不是代码的其他部分,但是任何帮助都是受欢迎的!:)

The code so far:

到目前为止,代码:

QFile file(csvfile);

    if (file.open(QIODevice::ReadOnly | QIODevice::Text))
    {
        bool first = true;
        QVariantMap map;
        QVariantMap propertyMap;
        QList<QVariant> generalList, propertiesList;

        while (!file.atEnd())
        {
            QString line = file.readLine();
            if(first == true){
                headerList = processLineFromCSV(line, separator_char);
                first = false;
            }else{

            QStringList cellList = processLineFromCSV(line, separator_char);

            int i=0;

            for(i; i<cellList.size(); i++)
            {
                // check the header cell for "[unit]" string
                // returns -1 if does not have the string
                // if it has the string, it's stored in capturedUnits[1]
                int test = exp.indexIn(headerList.at(i));

                // store the captured units in a QStringList
                QStringList capturedUnits = exp.capturedTexts();

                if(test==-1){ // if header does not have a captured unit - general column
                    QString name = headerList.at(i);
                    QString sanitizeName= name.remove(exp.capturedTexts().at(0), Qt::CaseSensitive);
                    map[sanitizeName] = cellList.at(i);
                }
                else{ // if header string has a captured unit - property column

                    QString propertyName = headerList.at(i); // extract string in header
                    QString sanitizedPropertyName = propertyName.remove(exp); //remove the unit regex from the string
                    sanitizedPropertyName.remove(QChar('\n'), Qt::CaseSensitive); // clear newlines

                    if(sanitizedPropertyName.startsWith('"') && sanitizedPropertyName.endsWith('"'))
                    {
                        sanitizedPropertyName.remove(0,1);
                        sanitizedPropertyName.remove(sanitizedPropertyName.length(),1);
                    }

                    QString value =cellList.at(i); // extract string in value
                    QString sanitizedValue = value.remove(QChar('\n'), Qt::CaseSensitive); // clear newlines

                    if(sanitizedValue.startsWith('"') && sanitizedValue.endsWith('"'))
                    {
                        sanitizedValue.remove(0,1);
                        sanitizedValue.remove(sanitizedValue.length(),1);
                    }

                    propertyMap[sanitizedPropertyName]= sanitizedValue; // map the property: value pair
                    propertyMap["unit"] = capturedUnits.at(1); // map the unit: [unit] value pair

                    QByteArray general = QtJson::serialize(map); // serialize the pair for general column
                    QByteArray properties = QtJson::serialize(propertyMap); // serialize the pair for property column

                    QVariant genVar(general);
                    QVariant propVar(properties);

                    generalList.append(genVar);
                    propertiesList.append(propVar);
                }
            }
        }}
        QByteArray finalGeneral = QtJson::serialize(generalList);
        QByteArray finalProperties = QtJson::serialize(propertiesList);

        qDebug() << finalGeneral;
        qDebug() << finalProperties;


        file.close();
    }

The ouput:

输出:

"[
    "{ \"name\" : \"name1\" }",
    "{ \"name\" : \"name1\" }",
    "{ \"name\" : \"name2\" }",
    "{ \"name\" : \"name2\" }",
    "{ \"name\" : \"name3\" }",
    "{ \"name\" : \"name3\" }"
]" 
"[
    "{ \"property1 \" : \"4.5\", \"unit\" : \"unit1\" }",
    "{ \"property1 \" : \"4.5\", \"property2 \" : \"2.3\", \"unit\" : \"unit2\" }",
    "{ \"property1 \" : \"3.2\", \"property2 \" : \"2.3\", \"unit\" : \"unit1\" }",
    "{ \"property1 \" : \"3.2\", \"property2 \" : \"7.4\", \"unit\" : \"unit2\" }",
    "{ \"property1 \" : \"5.5\", \"property2 \" : \"7.4\", \"unit\" : \"unit1\" }",
    "{ \"property1 \" : \"5.5\", \"property2 \" : \"6.1\", \"unit\" : \"unit2\" }"
]"

3 个解决方案

#1


1  

This should be a good start for you:

这对你来说应该是个好的开始:

QString csv = "name,property1 [unit1],property2 [unit2],property3 [unit3]\n"
              "name1,4.5,2.3\n"
              "name2,3.2,7.4\n"
              "name3,5.5,6.1,4.3\n";

QStringList csvRows = csv.split('\n', QString::SkipEmptyParts);
QStringList csvHeader = csvRows.takeFirst().split(',');
csvHeader.removeFirst();

foreach(QString row, csvRows) {
    QStringList values = row.split(',');
    QString rowName = values.takeFirst();

    QVariantList properties;
    for(int i = 0; i < values.size(); i++) {
        QString value = values[i];
        QStringList propParts = csvHeader[i].split(' ');
        QString propName = propParts[0];
        QString propType = propParts[1].mid(1, propParts[1].size() - 2);

        QVariantMap property;
        property[propName] = value;
        property["unit"] = propType;

        properties.append(property);
    }

    QVariantMap propertyObj;
    propertyObj["properties"] = properties;
    QVariantList propList;
    propList.append(propertyObj);

    QVariantMap root;
    root[rowName] = propList;

    QByteArray json = QtJson::serialize(root);
    qDebug() << json;

    // Now you can save json to a file
}

#2


1  

Joum.

Joum。

Just seen your response to my comment. I don't have much experience with QT either, but a quick outline....

刚刚看到你对我的评论的回应。我没有太多经验,QT,但一个快速概述....

Extract the data one line at a time, and 'split' it into an array. If you are using CSV you need to be sure that there are no data points that have a comma in them, or the split will result in a real mess. Check with whoever extracted the data if they can use another 'less common' separator (eg a '|' is good). if you data is all numeric that is great, but be wary of locations that use the comma as a decimal separator :(

每次只提取一行数据,并将其“拆分”为一个数组。如果您使用的是CSV,那么您需要确保其中没有包含逗号的数据点,否则将导致真正的混乱。如果他们可以使用另一个“不太常见”的分隔符(如“|”是好的),请检查提取数据的人。如果您的数据都是数值的,那就太棒了,但是要注意使用逗号作为小数分隔符的位置:(

I hope that you have 1 'table' per file, if not you need to be able to 'identify' when a new table starts somehow, this could be interesting / fun - depends on your outlook ;).

我希望每个文件都有一个“表”,如果没有的话,您需要能够“识别”新表何时开始,这可能会很有趣/有趣——这取决于您的outlook;

At the end you will have a collection of 'string arrays' (a table of some sort) hopefully the first is your header info. If you have mutliple tables, you will deal with them one at a time

最后,您将拥有“字符串数组”(某种表)的集合,希望第一个是您的头信息。如果您有多组表,您将一次处理一个

You should now be able to 'output' each table in good JSON format.

现在应该能够以良好的JSON格式“输出”每个表。

Getting your 'units' from the header rows: If you know in advance where they are located (ie the index in the array) you can plan for extracting the info (using a regex if you wish) in the correct index locations.

从标题行获取“单位”:如果您事先知道它们的位置(即数组中的索引),您可以计划在正确的索引位置提取信息(如果您愿意,可以使用regex)。

Last point. If your csv file is very long (hundreds of lines), just grab the first few into a new test file for quicker debuging, then once you are happy, enlarge it a bit and check the output format... then again once you are happy that there are no other bugs... for the whole file Likewise if you have multiple tables in your file, start with the first one only, then add the first part of a second... test.... add a third.... test etc etc etc until you are happy

最后一点。如果您的csv文件非常长(数百行),只需将前几行抓取到一个新的测试文件中,以便更快地进行调试,那么一旦您满意了,请放大它并检查输出格式……再一次,当你对没有其他bug感到高兴的时候……同样,对于整个文件,如果您的文件中有多个表,那么只从第一个表开始,然后添加第二个表的第一部分……测试....添加第三个....测试等等,直到你开心为止

David.

大卫。

A possibly better solution, after reading your comment about wanting some form of 'synchronisation'. NOTE: this may seem a little more complex, but I think it would be a more flexible solution in the end. Also does this data not exist in a DB somewhere (who gave it to you?), can they give you direct read access to the underlying DB and tables? if so, you can jump straight to the 'output each table to JSON' step.

在阅读了你关于想要某种形式的“同步”的评论后,一个可能更好的解决方案。注意:这看起来可能有点复杂,但我认为最终它将是一个更灵活的解决方案。另外,这些数据是否在DB中不存在(谁给你的?),它们能让你直接读取底层DB和表吗?如果是,可以直接跳到“输出每个表到JSON”步骤。

using an embeded DB (ie SQLite). Extract the first 'header' row, and create a table in your DB that follows the info there (you should be able to add info regarding units to the 'metadata' ie a description). If all your files are the same you could even import all the data into the same single table, or auto create a new table (assuming the same format) for each new file using the same create table statement.

使用嵌入式DB(即SQLite)。提取第一个“header”行,并在DB中创建一个表,该表遵循其中的信息(您应该能够向“metadata”(即描述)添加关于单元的信息)。如果所有的文件都是相同的,您甚至可以将所有数据导入到同一个表中,或者使用相同的create table语句为每个新文件自动创建一个新表(假设格式相同)。

I'm sure there is a 'csvimport' in SQLite (I haven't checked the docs yet, and haven't done this in a while) or someone has written a library that will do this.

我确信SQLite中有一个“csvimport”(我还没有检查文档,也有一段时间没有这样做过),或者有人已经编写了一个库来实现这个功能。

Output each table to JSON format, again I'm sure someone has written a library for this.

将每个表输出为JSON格式,我确信有人为此编写了一个库。

#3


1  

Using the answer by ExplodingRat this is the final code: (without file creation at the end)

下面是最后的代码:(最后没有创建文件)

QString csvfile = ui->lineEditCSVfile->text();
    QString separator_char = ui->lineEditSeparator->text();
    QRegExp exp("\\[([^\\]]+)\\]");

    QFile file(csvfile);
     if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
         return;

    QString csv = file.readAll();

    QStringList csvRows = csv.split('\n', QString::SkipEmptyParts);

    QStringList csvHeader = csvRows.takeFirst().split(separator_char);

    csvHeader.removeFirst();

    foreach(QString row, csvRows) {
        QStringList values = row.split(separator_char);

        QString rowName = values.takeFirst();

        QVariantList general;
        QVariantList properties;
        for(int i = 0; i < values.size(); i++) {
            QString value = values[i];

            int test = exp.indexIn(csvHeader[i]);
            //qDebug() << test;

            //qDebug() << csvHeader;
            QStringList capturedUnits = exp.capturedTexts();
            QString propName = csvHeader[i];


            if(test==-1){
                //QString propName = csvHeader[i].remove(exp);
                //qDebug() <<"property name" << propName;

                QVariantMap property;
                property[propName] = value;
                general.append(property);
            }else{
                propName.remove(exp);
                //QStringList propParts = csvHeader[i].split(' ');
                //QString propName = csvHeader[i].remove(exp);
                QString propType = capturedUnits[1];

                QVariantMap property;
                property[propName] = value;
                property["unit"] = propType;

                properties.append(property);
            }
        }

        QVariantMap propertyObj;
        propertyObj["properties"] = properties;
        QVariantList propList;
        propList.append(propertyObj);

        QVariantMap generalObj;
        generalObj["general"] = general;
        QVariantList generalList;
        generalList.append(generalObj);

        QVariantList fullList;
        fullList.append(generalObj);
        fullList.append(propertyObj);

        QVariantMap root;
        root[rowName] = fullList;

        QByteArray json = QtJson::serialize(root);

        json.prepend('[');
        json.append(']');

        qDebug() << json;

        // Now you can save json to a file

#1


1  

This should be a good start for you:

这对你来说应该是个好的开始:

QString csv = "name,property1 [unit1],property2 [unit2],property3 [unit3]\n"
              "name1,4.5,2.3\n"
              "name2,3.2,7.4\n"
              "name3,5.5,6.1,4.3\n";

QStringList csvRows = csv.split('\n', QString::SkipEmptyParts);
QStringList csvHeader = csvRows.takeFirst().split(',');
csvHeader.removeFirst();

foreach(QString row, csvRows) {
    QStringList values = row.split(',');
    QString rowName = values.takeFirst();

    QVariantList properties;
    for(int i = 0; i < values.size(); i++) {
        QString value = values[i];
        QStringList propParts = csvHeader[i].split(' ');
        QString propName = propParts[0];
        QString propType = propParts[1].mid(1, propParts[1].size() - 2);

        QVariantMap property;
        property[propName] = value;
        property["unit"] = propType;

        properties.append(property);
    }

    QVariantMap propertyObj;
    propertyObj["properties"] = properties;
    QVariantList propList;
    propList.append(propertyObj);

    QVariantMap root;
    root[rowName] = propList;

    QByteArray json = QtJson::serialize(root);
    qDebug() << json;

    // Now you can save json to a file
}

#2


1  

Joum.

Joum。

Just seen your response to my comment. I don't have much experience with QT either, but a quick outline....

刚刚看到你对我的评论的回应。我没有太多经验,QT,但一个快速概述....

Extract the data one line at a time, and 'split' it into an array. If you are using CSV you need to be sure that there are no data points that have a comma in them, or the split will result in a real mess. Check with whoever extracted the data if they can use another 'less common' separator (eg a '|' is good). if you data is all numeric that is great, but be wary of locations that use the comma as a decimal separator :(

每次只提取一行数据,并将其“拆分”为一个数组。如果您使用的是CSV,那么您需要确保其中没有包含逗号的数据点,否则将导致真正的混乱。如果他们可以使用另一个“不太常见”的分隔符(如“|”是好的),请检查提取数据的人。如果您的数据都是数值的,那就太棒了,但是要注意使用逗号作为小数分隔符的位置:(

I hope that you have 1 'table' per file, if not you need to be able to 'identify' when a new table starts somehow, this could be interesting / fun - depends on your outlook ;).

我希望每个文件都有一个“表”,如果没有的话,您需要能够“识别”新表何时开始,这可能会很有趣/有趣——这取决于您的outlook;

At the end you will have a collection of 'string arrays' (a table of some sort) hopefully the first is your header info. If you have mutliple tables, you will deal with them one at a time

最后,您将拥有“字符串数组”(某种表)的集合,希望第一个是您的头信息。如果您有多组表,您将一次处理一个

You should now be able to 'output' each table in good JSON format.

现在应该能够以良好的JSON格式“输出”每个表。

Getting your 'units' from the header rows: If you know in advance where they are located (ie the index in the array) you can plan for extracting the info (using a regex if you wish) in the correct index locations.

从标题行获取“单位”:如果您事先知道它们的位置(即数组中的索引),您可以计划在正确的索引位置提取信息(如果您愿意,可以使用regex)。

Last point. If your csv file is very long (hundreds of lines), just grab the first few into a new test file for quicker debuging, then once you are happy, enlarge it a bit and check the output format... then again once you are happy that there are no other bugs... for the whole file Likewise if you have multiple tables in your file, start with the first one only, then add the first part of a second... test.... add a third.... test etc etc etc until you are happy

最后一点。如果您的csv文件非常长(数百行),只需将前几行抓取到一个新的测试文件中,以便更快地进行调试,那么一旦您满意了,请放大它并检查输出格式……再一次,当你对没有其他bug感到高兴的时候……同样,对于整个文件,如果您的文件中有多个表,那么只从第一个表开始,然后添加第二个表的第一部分……测试....添加第三个....测试等等,直到你开心为止

David.

大卫。

A possibly better solution, after reading your comment about wanting some form of 'synchronisation'. NOTE: this may seem a little more complex, but I think it would be a more flexible solution in the end. Also does this data not exist in a DB somewhere (who gave it to you?), can they give you direct read access to the underlying DB and tables? if so, you can jump straight to the 'output each table to JSON' step.

在阅读了你关于想要某种形式的“同步”的评论后,一个可能更好的解决方案。注意:这看起来可能有点复杂,但我认为最终它将是一个更灵活的解决方案。另外,这些数据是否在DB中不存在(谁给你的?),它们能让你直接读取底层DB和表吗?如果是,可以直接跳到“输出每个表到JSON”步骤。

using an embeded DB (ie SQLite). Extract the first 'header' row, and create a table in your DB that follows the info there (you should be able to add info regarding units to the 'metadata' ie a description). If all your files are the same you could even import all the data into the same single table, or auto create a new table (assuming the same format) for each new file using the same create table statement.

使用嵌入式DB(即SQLite)。提取第一个“header”行,并在DB中创建一个表,该表遵循其中的信息(您应该能够向“metadata”(即描述)添加关于单元的信息)。如果所有的文件都是相同的,您甚至可以将所有数据导入到同一个表中,或者使用相同的create table语句为每个新文件自动创建一个新表(假设格式相同)。

I'm sure there is a 'csvimport' in SQLite (I haven't checked the docs yet, and haven't done this in a while) or someone has written a library that will do this.

我确信SQLite中有一个“csvimport”(我还没有检查文档,也有一段时间没有这样做过),或者有人已经编写了一个库来实现这个功能。

Output each table to JSON format, again I'm sure someone has written a library for this.

将每个表输出为JSON格式,我确信有人为此编写了一个库。

#3


1  

Using the answer by ExplodingRat this is the final code: (without file creation at the end)

下面是最后的代码:(最后没有创建文件)

QString csvfile = ui->lineEditCSVfile->text();
    QString separator_char = ui->lineEditSeparator->text();
    QRegExp exp("\\[([^\\]]+)\\]");

    QFile file(csvfile);
     if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
         return;

    QString csv = file.readAll();

    QStringList csvRows = csv.split('\n', QString::SkipEmptyParts);

    QStringList csvHeader = csvRows.takeFirst().split(separator_char);

    csvHeader.removeFirst();

    foreach(QString row, csvRows) {
        QStringList values = row.split(separator_char);

        QString rowName = values.takeFirst();

        QVariantList general;
        QVariantList properties;
        for(int i = 0; i < values.size(); i++) {
            QString value = values[i];

            int test = exp.indexIn(csvHeader[i]);
            //qDebug() << test;

            //qDebug() << csvHeader;
            QStringList capturedUnits = exp.capturedTexts();
            QString propName = csvHeader[i];


            if(test==-1){
                //QString propName = csvHeader[i].remove(exp);
                //qDebug() <<"property name" << propName;

                QVariantMap property;
                property[propName] = value;
                general.append(property);
            }else{
                propName.remove(exp);
                //QStringList propParts = csvHeader[i].split(' ');
                //QString propName = csvHeader[i].remove(exp);
                QString propType = capturedUnits[1];

                QVariantMap property;
                property[propName] = value;
                property["unit"] = propType;

                properties.append(property);
            }
        }

        QVariantMap propertyObj;
        propertyObj["properties"] = properties;
        QVariantList propList;
        propList.append(propertyObj);

        QVariantMap generalObj;
        generalObj["general"] = general;
        QVariantList generalList;
        generalList.append(generalObj);

        QVariantList fullList;
        fullList.append(generalObj);
        fullList.append(propertyObj);

        QVariantMap root;
        root[rowName] = fullList;

        QByteArray json = QtJson::serialize(root);

        json.prepend('[');
        json.append(']');

        qDebug() << json;

        // Now you can save json to a file