Splitting a string into key-value pairs in C++

Time: 2022-02-05 18:18:12

I have a string like this:

"CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567"

Here ": " separates each key from its value, and \n separates the pairs. I want to add the key-value pairs to a map in C++.

Is there an efficient way to do this, with optimization in mind?

8 solutions

#1


3  

Well I have two methods here. The first one is the easy, obvious method that I use all the time (performance is rarely an issue). The second method is likely more efficient but I have not done any formal timings.

In my tests the second method is about 3 times faster.

#include <map>
#include <string>
#include <sstream>
#include <iostream>

std::map<std::string, std::string> mappify1(std::string const& s)
{
    std::map<std::string, std::string> m;

    std::string key, val;
    std::istringstream iss(s);

    while(std::getline(std::getline(iss, key, ':') >> std::ws, val))
        m[key] = val;

    return m;
}

std::map<std::string, std::string> mappify2(std::string const& s)
{
    std::map<std::string, std::string> m;

    std::string::size_type key_pos = 0;
    std::string::size_type key_end;
    std::string::size_type val_pos;
    std::string::size_type val_end;

    while((key_end = s.find(':', key_pos)) != std::string::npos)
    {
        if((val_pos = s.find_first_not_of(": ", key_end)) == std::string::npos)
            break;

        val_end = s.find('\n', val_pos);
        m.emplace(s.substr(key_pos, key_end - key_pos), s.substr(val_pos, val_end - val_pos));

        key_pos = val_end;
        if(key_pos != std::string::npos)
            ++key_pos;
    }

    return m;
}

int main()
{
    std::string s = "CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567";

    std::cout << "mappify1: " << '\n';

    auto m = mappify1(s);
    for(auto const& p: m)
        std::cout << '{' << p.first << " => " << p.second << '}' << '\n';

    std::cout << "mappify2: " << '\n';

    m = mappify2(s);
    for(auto const& p: m)
        std::cout << '{' << p.first << " => " << p.second << '}' << '\n';
}

Output:

mappify1: 
{CA => ABCD}
{CB => ABFG}
{CC => AFBV}
{CD => 4567}
mappify2: 
{CA => ABCD}
{CB => ABFG}
{CC => AFBV}
{CD => 4567}
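
The "about 3 times faster" figure comes from the answerer's own tests, which are not shown. A minimal sketch of how such a comparison could be timed, assuming a hypothetical helper named average_microseconds (not part of the answer) built on std::chrono::steady_clock:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs `parse` `iterations` times and returns the
// average wall-clock time per call in microseconds.
template <class F>
double average_microseconds(F&& parse, std::size_t iterations)
{
    assert(iterations > 0); // precondition: at least one run

    using clock = std::chrono::steady_clock;
    auto const start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        parse();
    std::chrono::duration<double, std::micro> const elapsed = clock::now() - start;

    return elapsed.count() / iterations;
}
```

It could be called as average_microseconds([&]{ total += mappify2(s).size(); }, 100000), accumulating the map size into a variable that is printed afterwards so the compiler cannot discard the call. The ratio will depend on the compiler, optimization flags and input size.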

#2


1  

The format is simple enough that parsing it "by hand" is, IMO, the best option, and the result remains quite readable.

This should also be reasonably efficient: the same key and value strings are reused on every iteration (cleared, not recreated), so the reallocations inside the main loop should stop after a few iterations. ret should also qualify for NRVO; if that turns out to be a problem, you can always switch to an output parameter.

Of course std::map may not be the fastest gun in the west, but it is what the question asks for.

#include <map>
#include <string>

std::map<std::string, std::string> parseKV(const std::string &sz) {
    std::map<std::string, std::string> ret;
    std::string key;
    std::string value;
    const char *s=sz.c_str();
    while(*s) {
        // parse the key: stop only at the full ": " separator
        while(*s && !(*s==':' && s[1]==' ')) {
            key.push_back(*s);
            ++s;
        }
        // if we quit due to the end of the string exit now
        if(!*s) break;
        // skip the ": "
        s+=2;
        // parse the value
        while(*s && *s!='\n') {
            value.push_back(*s);
            ++s;
        }
        ret[key]=value;
        key.clear(); value.clear();
        // skip the newline, unless we are already at the end of the string
        if(*s) ++s;
    }
    return ret;
}

#3


1  

This format is called "Tag-Value".

The most performance-critical place where this kind of encoding is used in industry is probably the financial FIX Protocol ('=' as the key-value separator and '\001' as the entry delimiter). So if you are on x86 hardware, your best bet would be to google 'SSE4 FIX protocol parser github' and reuse the open-sourced findings of HFT shops.

If you still want to delegate the vectorization part to the compiler and can spare a few nanoseconds for readability, then the most elegant solution is to store the result in a std::string (data) + boost::flat_map<boost::string_ref, boost::string_ref> (view). How you parse is a matter of taste: a while-loop or strtok would be easiest for the compiler to handle, while a Boost.Spirit based parser would be easiest for a human (familiar with Boost.Spirit) to read.

C++ for-loop based solution

#include <boost/container/flat_map.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/range/iterator_range_io.hpp>

#include <algorithm>
#include <iostream>
#include <stdexcept>
#include <string>

// g++ -std=c++1z ~/aaa.cc
int main()
{
    using range_t = boost::iterator_range<std::string::const_iterator>;
    using map_t = boost::container::flat_map<range_t, range_t>;

    char const sep = ':';
    char const dlm = '\n';

    // this part can be reused for parsing multiple records
    map_t result;
    result.reserve(1024);

    std::string const input {"hello:world\n bye: world"};

    // this part is per-line/per-record
    result.clear();
    for (auto _beg = begin(input), _end = end(input), it = _beg; it != _end;)
    {
        auto sep_it = std::find(it, _end, sep);
        if (sep_it != _end)
        {
            auto dlm_it = std::find(sep_it + 1, _end, dlm);
            result.emplace(range_t {it, sep_it}, range_t {sep_it + 1, dlm_it});
            it = dlm_it + (dlm_it != _end);
        }
        else throw std::runtime_error("cannot parse");
    }

    for (auto& x: result)
        std::cout << x.first << " => " << x.second << '\n';

    return 0;
}

#4


1  

If worried about performance, you should probably rethink the need for the end result to be a map. That could end up being a lot of char buffers in memory. Ideally keeping track of just the char* and length of each sub string will be faster/smaller.
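
One way to sketch the "pointer plus length" idea in standard C++17 is std::string_view, which is exactly a pointer and a length. The names kv_view and parse_views below are hypothetical, and a flat vector of pairs is an assumption rather than something this answer prescribes:

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Each entry is a pair of views into the input, so no per-key or per-value
// buffer is allocated. The caller must keep the viewed string alive (and
// unmodified) for as long as the views are used.
using kv_view = std::pair<std::string_view, std::string_view>;

std::vector<kv_view> parse_views(std::string_view text)
{
    std::vector<kv_view> out;
    while (!text.empty()) {
        // Take one line, up to the next '\n' or the end of the input.
        auto const eol = text.find('\n');
        std::string_view const line = text.substr(0, eol);
        text.remove_prefix(eol == std::string_view::npos ? text.size() : eol + 1);

        auto const sep = line.find(": ");
        if (sep == std::string_view::npos)
            continue; // assumption: lines without ": " are silently skipped
        out.emplace_back(line.substr(0, sep), line.substr(sep + 2));
    }
    return out;
}
```

The result keeps the pairs in input order. Passing a temporary std::string would leave every view dangling, so the input should be a named string that outlives the result.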

#5


0  

Here is a solution that uses strtok to split the string. Note that strtok modifies its input: it writes '\0' over each delimiter it finds.

#include <iostream>
#include <string>
#include <map>
#include <cstring>

using namespace std;

int main (int argc, char *argv[])
{
    char s1[] = "CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567";
    map<string, string> mymap;
    char *token;

    token = strtok(s1, "\n");
    while (token != NULL) {
        string s(token);
        // look for the full ": " separator so the value has no leading space
        size_t pos = s.find(": ");
        // skip lines that do not contain the separator
        if (pos != string::npos)
            mymap[s.substr(0, pos)] = s.substr(pos + 2);
        token = strtok(NULL, "\n");
    }

    for (auto const& keyval : mymap)
        cout << keyval.first << "/" << keyval.second << endl;

    return 0;
}

#6


0  

I doubt you need to worry about optimizing the step that reads this string and converts it to a std::map. If you really want to optimize this fixed-content map, change it to a std::vector<std::pair<>> and sort it once.

That said, the most elegant way of creating the std::map with standard C++ features is the following:

#include <cassert>
#include <map>
#include <string>
#include <string_view>
#include <utility>

std::map<std::string, std::string> deserializeKeyValue(const std::string &sz) {
    // string_view instead of ""s: a constexpr local std::string is not portable
    constexpr std::string_view ELEMENT_SEPARATOR = ": ";
    constexpr std::string_view LINE_SEPARATOR = "\n";

    std::map<std::string, std::string> result;
    std::size_t begin{0};
    std::size_t end{0};
    while (begin < sz.size()) {
        // Search key
        end = sz.find(ELEMENT_SEPARATOR, begin);
        assert(end != std::string::npos); // Replace by error handling
        auto key = sz.substr(begin, /*size=*/ end - begin);
        begin = end + ELEMENT_SEPARATOR.size();

        // Search value
        end = sz.find(LINE_SEPARATOR, begin);
        auto value = sz.substr(begin, end == std::string::npos ? std::string::npos : /*size=*/ end - begin);
        begin = (end == std::string::npos) ? sz.size() : end + LINE_SEPARATOR.size();

        // Store key-value
        [[maybe_unused]] auto emplaceResult = result.emplace(std::move(key), std::move(value));
        assert(emplaceResult.second); // Replace by error handling
    }
    return result;
}

The performance of this might not be ideal, though every c++ programmer understands this code.
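
The sorted std::vector<std::pair<>> suggestion at the top of this answer could look roughly like this. It is a sketch only: kv_list, sort_by_key and lookup are hypothetical names, and it assumes all pairs are parsed before the first lookup.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using kv_list = std::vector<std::pair<std::string, std::string>>;

// Sort once, after all pairs have been appended.
void sort_by_key(kv_list& kv)
{
    std::sort(kv.begin(), kv.end(),
              [](auto const& a, auto const& b) { return a.first < b.first; });
}

// Binary search over the sorted vector; returns nullptr when the key is absent.
std::string const* lookup(kv_list const& kv, std::string const& key)
{
    // Debug-only check of the precondition that sort_by_key was called.
    assert(std::is_sorted(kv.begin(), kv.end(),
        [](auto const& a, auto const& b) { return a.first < b.first; }));

    auto const it = std::lower_bound(kv.begin(), kv.end(), key,
        [](auto const& entry, std::string const& k) { return entry.first < k; });
    return (it != kv.end() && it->first == key) ? &it->second : nullptr;
}
```

Compared with std::map, the pairs sit in one contiguous allocation and lookups stay logarithmic, at the cost of having to re-sort if entries are added later. Duplicate keys are kept rather than rejected, so that check would need to be added separately.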

#7


0  

A very simple solution using Boost follows; it also works with partial tokens (e.g. a key without a value, or empty pairs).

#include <string>
#include <list>
#include <map>
#include <iostream>

#include <boost/foreach.hpp>
#include <boost/algorithm/string.hpp>

using namespace std;
using namespace boost;

int main() {

    string s = "CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567";

    list<string> tokenList;
    split(tokenList,s,is_any_of("\n"),token_compress_on);
    map<string, string> kvMap;

    BOOST_FOREACH(string token, tokenList) {
        size_t sep_pos = token.find(": "); // find the ": " substring, not the first ':' or ' '
        string key = token.substr(0,sep_pos);
        string value = (sep_pos == string::npos ? "" : token.substr(sep_pos+2,string::npos));
        kvMap[key] = value;

        cout << "[" << key << "] => [" << kvMap[key] << "]" << endl;
    }

    return 0;
}

#8


0  

void splitString(std::map<std::string, std::string> &mymap, const std::string &text, char sep)
{
    // size_type rather than int, so comparisons with npos are well defined
    std::string::size_type start = 0, end1 = 0, end2 = 0;
    while ((end1 = text.find(sep, start)) != std::string::npos && (end2 = text.find(sep, end1+1)) != std::string::npos) {
        std::string key = text.substr(start, end1 - start);
        std::string val = text.substr(end1 + 1, end2 - end1 - 1);
        mymap.insert(std::pair<std::string,std::string>(key, val));
        start = end2 + 1;
    }
}

For example:

std::string text = "key1;val1;key2;val2;key3;val3;";
std::map<std::string, std::string> mymap;
splitString(mymap, text, ';');

This results in a map of size 3: { key1="val1", key2="val2", key3="val3" }

More examples:

"key1;val1;key2;" => {key1="val1"} (no 2nd val, so the 2nd key doesn't count)

"key1;val1;key2;val2" => {key1="val1"} (no delimiter at the end of the 2nd val, so it doesn't count)

"key1;val1;key2;;" => {key1="val1",key2=""} (key2 holds an empty string)
