字符串数组的基数排序?

时间:2021-06-18 15:59:06

I've been researching around, and while I've figured out the general idea of using Radix Sort to alphabetize an array of strings, I know I'm going the wrong direction.

我一直在研究,虽然我已经弄明白了用基数排序来按字母顺序排列字符串的基本思想,但我知道我走错了方向。

This is what I have so far:

这是我目前所拥有的:

void radixSort(string* sortMe, int l)
{
    queue<string>* sections = new queue<string>[27];    //Have a-z, and also one for strings that are null terminated.
    for(int i = 0; i < numElements; i++)
    {
        if(!(sortMe[i][l] == 32))
            sections[sortMe[i][l]-96].push(sortMe[i]);      //-96 because the ascii code for a is 97. If, for example a is the character, it will be placed at 1. 0 is left for null characters
    }

    for(int i =0; i < 26; i++)
    {
        while(!sections[i].empty())
        {
            temp.push_back(sections[i].front());
            sections[i].pop();
        }
    }
}

What I have so far sorts all the strings by the first character, and I know that I then have to go through and make subarrays of the remaining characters and sort them, but how can I implement it efficiently? The strings are of variable size and can include spaces, for example:

到目前为止,我用第一个字符对所有的字符串进行排序,我知道我必须对剩下的字符进行排序,但是我怎么才能有效地实现它呢?字符串的大小是可变的,可以包含空格,例如:

  • subdivides
  • 细分
  • main street
  • 主要街道
  • pants
  • 裤子
  • impaled decolonizing
  • 刺穿他们
  • argillaceous
  • 泥质
  • axial satisfactoriness
  • 轴向令人满意
  • temperamental
  • 喜怒无常的
  • hypersensitiveness
  • 过敏
  • bears
  • hairbreadths
  • 距离极短的
  • creams surges
  • 霜激增
  • unlaboured
  • unlaboured
  • hoosier
  • 山地人之
  • buggiest
  • mauritanians
  • 民众
  • emanators
  • 射气投置器
  • acclaiming
  • 乍得
  • zouaves dishpan
  • 义勇军洗碟盆
  • traipse
  • 漫步
  • solarisms
  • 太阳中心说
  • remunerativeness
  • remunerativeness
  • solubilizing
  • 增溶的
  • chiseled
  • 轮廓分明的
  • jugular
  • 颈静脉
  • ooziness
  • ooziness
  • toastier
  • baud
  • 波特
  • suffixed
  • 作为后缀
  • powerless tiding
  • 无能为力、
  • disassimilated
  • 异化问题
  • gasps
  • 喘息声
  • flirtier
  • 轻浮的
  • uh

This is something I found that seems like it will be of use: http://algs4.cs.princeton.edu/lectures/51DemoKeyIndexedCounting.pdf

这是我发现的有用的东西:http://algs4.cs.princeton.edu/lectures/51DemoKeyIndexedCounting.pdf

2 个解决方案

#1


5  

The slides you've found are great! But where did those queues come from in your code?

你找到的幻灯片太棒了!但是这些队列是从哪里来的呢?

Anyway, here you are (live example):

不管怎样,你在这里(活生生的例子):

template <typename E>
size_t bin(const E& elem, size_t digit)
{
    return elem.size() > digit ? size_t(elem[digit]) + 1 : 0;
}

template <size_t R, typename C, typename P>
void radix_sort(P& pos, const C& data, size_t digit)
{
    using A = std::array<size_t, R + 1>;
    A count = {};
    P prev(pos);

    for (auto i : prev)
        ++count[bin(data[i], digit)];

    A done = {}, offset = {{0}};
    std::partial_sum(count.begin(), count.end() - 1, offset.begin() + 1);

    for (auto i : prev)
    {
        size_t b = bin(data[i], digit);
        pos[offset[b] + done[b]++] = i;
    }
}

struct shorter
{
    template <typename A>
    bool operator()(const A& a, const A& b) { return a.size() < b.size(); }
};

template <size_t R, typename C>
std::vector<size_t> radix_sort(const C& data)
{
    std::vector<size_t> pos(data.size());
    std::iota(pos.begin(), pos.end(), 0);

    size_t width = std::max_element(data.begin(), data.end(), shorter())->size();

    for (long digit = long(width) - 1; digit >= 0; --digit)
        radix_sort<R>(pos, data, size_t(digit));

    return pos;
}

which you can use like that

你可以这样用吗

int main()
{
    std::vector<std::string> data = generate();
    std::vector<size_t> pos = radix_sort<128>(data);
    for (auto i : pos)
        std::cout << data[i] << std::endl;
}

where generate() is included in the live example and generates the strings given in your question.

在live示例中包含generate()并生成问题中给出的字符串。

I am not trying to explain how this works here, I assume you can figure out since you are working on the problem. But a few comments are in order.

我并不是要解释它是如何工作的,我假设你可以解决这个问题。但是有几条评论是正确的。

  • We are neither sorting the input sequence in-place, nor returning a sorted copy; we are just returning a sequence of positions of input elements in the sorted sequence.

    我们既没有对输入序列进行排序,也没有返回已排序的副本;我们只是返回排序序列中输入元素的位置序列。

  • We are processing strings from right to left.

    我们从右到左处理字符串。

  • The complexity is O(lw) where l is the input length (number of input strings) and w is the maximum input width (max. length of all input strings). So this algorithm makes sense if the string width does not vary too much.

    复杂度为O(lw),其中l为输入长度(输入字符串数),w为最大输入宽度(max)。所有输入字符串的长度。如果字符串宽度变化不大,这个算法就有意义了。

  • The first template parameter R of radix_sort() is the number of possible values for each digit (letter) in the input. E.g. R = 128 means that possible values are 0..127. This should be fine for your input. I haven't tried to do anything clever with respect to ASCII codes, but you can customize function bin() for that.

    radix_sort()的第一个模板参数R是输入中每个数字(字母)的可能值的数量。R = 128意味着可能的值是0.. .127。这对你的输入应该没问题。我还没有尝试对ASCII码做任何聪明的事情,但是您可以为此定制函数bin()。

  • In the output of bin(), value 0 is reserved to mean "we are past the end of this string". Such strings are placed before others that are still continuing.

    在bin()的输出中,值0被保留为表示“我们已经过此字符串的末尾”。这些字符串被放在仍在继续的其他字符串之前。

  • I have tried to give self-explanatory names to variables and functions, and use standard library calls for common tasks where possible.

    我尝试给变量和函数提供自解释的名称,并且使用标准库在可能的情况下调用普通的任务。

  • The code is generic, e.g. it can sort any random access container containing random access containers, not just vectors of strings.

    代码是通用的,例如,它可以对任何包含随机访问容器的随机访问容器进行排序,而不仅仅是字符串的向量。

  • I am using C++11 features here and there for convenience, but nothing is really necessary: one could easily do the same just with C++03.

    为了方便起见,我到处都在使用c++ 11的特性,但没有什么是真正必要的:仅使用c++ 03就可以很容易地做到这一点。

#2


0  

Very similar to iavr, but sorting in place (benchmarked against iavr's solution with g++ -O3 and takes about 2020ms compared to iavr's 1780ms), enjoying a regular interface and resuable code. The problem with Iavr's implementation is that its logic only works with containers of strings, and is not easily extensible to other types. Obviously his specialized version is more efficient, but it might be worth it to sacrifice some performance for regularity. You can find the rest of the code at radix sort implementation

与iavr非常相似,但是排序的地方(与iavr的g++ -O3的解决方案相比,与iavr的1780ms相比,大约需要2020ms),使用常规接口和可重用代码。Iavr实现的问题是,它的逻辑只适用于字符串容器,不容易扩展到其他类型。显然,他的专门化版本更有效,但是为了规律性而牺牲一些性能是值得的。您可以在radix sort实现中找到其余的代码

General Radix sort:

一般基数排序:

template <typename T> 
using Iter_value = std::iterator_traits<T>::value_type;

// intermediate struct to get partial template specialization
template<typename Iter, typename T, size_t range = 256>
struct rdx_impl {
    static void rdx_sort(Iter begin, Iter end, int bits) { 
        // bits is # bits to consider up to if a max val is known ahead of time
        // most efficent (theoretically) when digits are base n, having lg(n) bits
        constexpr size_t digit_bits {8};        // # bits in digit, 8 works well for 32 and 64 bit vals

            size_t d {0};                   // current digit #
            for (long long mask = (1 << digit_bits) - 1;
                d * digit_bits < bits;) {// ex. 0x000000ff for setting lower 8 bits on 32 bit num
                cnt_sort(begin, end, range, Digit_cmp<T>(mask, digit_bits*d));
                ++d;
            }
        }
    };

// specialization of rdx_sort for strings
struct Shorter {
    template <typename Seq>
    bool operator()(const Seq& a, const Seq& b) { return a.size() < b.size(); }
};
template <typename Iter>    
struct rdx_impl<Iter, std::string> {    // enough to hold ASCII char range
    static void rdx_sort(Iter begin, Iter end, int) {
        // ignore additional int argument
        int len_max = std::max_element(begin, end, Shorter())->size();
        for (int d = len_max - 1; d >= 0; --d)
            cnt_sort(begin, end, 128, Digit_cmp<std::string>(d));
    }
};

// generic call interface for all iterators 
template <typename Iter>   // use intermediate struct for partial specialization
void rdx_sort(Iter begin, Iter end, int bits) {
    rdx_impl<Iter, Iter_value<Iter>>::rdx_sort(begin, end, bits);
}

Counting sort to sort on each digit (in place):

计数排序对每一个数字排序(就位):

template <typename Iter, typename Op>
void cnt_sort(Iter begin, Iter end, size_t range, Op op) {
    using T = typename Iter::value_type;
    std::vector<int> counts(range);   // init to 0
    for (auto i = begin; i != end; ++i) // count # elems == i
        ++counts[op(*i)]; 
    for (size_t i = 1; i < range; ++i)
        counts[i] += counts[i-1];   // turn into # elems <= i
    std::vector<T> res(end - begin);
    for (auto j = end;;) {
        --j;
        res[--counts[op(*j)]] = *j;
        if (j == begin) break;
    }
    // ~18% of time is spent on copying
    std::copy(res.begin(), res.end(), begin);
}

Extract value of digit:

提取数字的价值:

template <typename T>   // overload digit_cmp for non-integral types top provide radix sort with digits
class Digit_cmp {   // functor for comparing a "digit" (particular bits)
    const long long mask; // 0..63 bitfield to test against
    const size_t to_shift;
public:
    Digit_cmp(long long m, size_t ts) : mask{m}, to_shift{ts} {}
    // by default assumes integral, just shifts
    size_t operator()(T n) const {    // char assuming r = 8
        return (n >> to_shift) & mask; // shift then mask for unit digit
    }
};
// specialization for strings
template <>
class Digit_cmp<std::string> {
    const size_t digit;
public:
    Digit_cmp(size_t d) : digit{d} {}
    size_t operator()(const std::string& str) {
        // 0 indicates past the end of the string
        return str.size() > digit ? str[digit] : 0;
    }
};

#1


5  

The slides you've found are great! But where did those queues come from in your code?

你找到的幻灯片太棒了!但是这些队列是从哪里来的呢?

Anyway, here you are (live example):

不管怎样,你在这里(活生生的例子):

template <typename E>
size_t bin(const E& elem, size_t digit)
{
    return elem.size() > digit ? size_t(elem[digit]) + 1 : 0;
}

template <size_t R, typename C, typename P>
void radix_sort(P& pos, const C& data, size_t digit)
{
    using A = std::array<size_t, R + 1>;
    A count = {};
    P prev(pos);

    for (auto i : prev)
        ++count[bin(data[i], digit)];

    A done = {}, offset = {{0}};
    std::partial_sum(count.begin(), count.end() - 1, offset.begin() + 1);

    for (auto i : prev)
    {
        size_t b = bin(data[i], digit);
        pos[offset[b] + done[b]++] = i;
    }
}

struct shorter
{
    template <typename A>
    bool operator()(const A& a, const A& b) { return a.size() < b.size(); }
};

template <size_t R, typename C>
std::vector<size_t> radix_sort(const C& data)
{
    std::vector<size_t> pos(data.size());
    std::iota(pos.begin(), pos.end(), 0);

    size_t width = std::max_element(data.begin(), data.end(), shorter())->size();

    for (long digit = long(width) - 1; digit >= 0; --digit)
        radix_sort<R>(pos, data, size_t(digit));

    return pos;
}

which you can use like that

你可以这样用吗

int main()
{
    std::vector<std::string> data = generate();
    std::vector<size_t> pos = radix_sort<128>(data);
    for (auto i : pos)
        std::cout << data[i] << std::endl;
}

where generate() is included in the live example and generates the strings given in your question.

在live示例中包含generate()并生成问题中给出的字符串。

I am not trying to explain how this works here, I assume you can figure out since you are working on the problem. But a few comments are in order.

我并不是要解释它是如何工作的,我假设你可以解决这个问题。但是有几条评论是正确的。

  • We are neither sorting the input sequence in-place, nor returning a sorted copy; we are just returning a sequence of positions of input elements in the sorted sequence.

    我们既没有对输入序列进行排序,也没有返回已排序的副本;我们只是返回排序序列中输入元素的位置序列。

  • We are processing strings from right to left.

    我们从右到左处理字符串。

  • The complexity is O(lw) where l is the input length (number of input strings) and w is the maximum input width (max. length of all input strings). So this algorithm makes sense if the string width does not vary too much.

    复杂度为O(lw),其中l为输入长度(输入字符串数),w为最大输入宽度(max)。所有输入字符串的长度。如果字符串宽度变化不大,这个算法就有意义了。

  • The first template parameter R of radix_sort() is the number of possible values for each digit (letter) in the input. E.g. R = 128 means that possible values are 0..127. This should be fine for your input. I haven't tried to do anything clever with respect to ASCII codes, but you can customize function bin() for that.

    radix_sort()的第一个模板参数R是输入中每个数字(字母)的可能值的数量。R = 128意味着可能的值是0.. .127。这对你的输入应该没问题。我还没有尝试对ASCII码做任何聪明的事情,但是您可以为此定制函数bin()。

  • In the output of bin(), value 0 is reserved to mean "we are past the end of this string". Such strings are placed before others that are still continuing.

    在bin()的输出中,值0被保留为表示“我们已经过此字符串的末尾”。这些字符串被放在仍在继续的其他字符串之前。

  • I have tried to give self-explanatory names to variables and functions, and use standard library calls for common tasks where possible.

    我尝试给变量和函数提供自解释的名称,并且使用标准库在可能的情况下调用普通的任务。

  • The code is generic, e.g. it can sort any random access container containing random access containers, not just vectors of strings.

    代码是通用的,例如,它可以对任何包含随机访问容器的随机访问容器进行排序,而不仅仅是字符串的向量。

  • I am using C++11 features here and there for convenience, but nothing is really necessary: one could easily do the same just with C++03.

    为了方便起见,我到处都在使用c++ 11的特性,但没有什么是真正必要的:仅使用c++ 03就可以很容易地做到这一点。

#2


0  

Very similar to iavr, but sorting in place (benchmarked against iavr's solution with g++ -O3 and takes about 2020ms compared to iavr's 1780ms), enjoying a regular interface and resuable code. The problem with Iavr's implementation is that its logic only works with containers of strings, and is not easily extensible to other types. Obviously his specialized version is more efficient, but it might be worth it to sacrifice some performance for regularity. You can find the rest of the code at radix sort implementation

与iavr非常相似,但是排序的地方(与iavr的g++ -O3的解决方案相比,与iavr的1780ms相比,大约需要2020ms),使用常规接口和可重用代码。Iavr实现的问题是,它的逻辑只适用于字符串容器,不容易扩展到其他类型。显然,他的专门化版本更有效,但是为了规律性而牺牲一些性能是值得的。您可以在radix sort实现中找到其余的代码

General Radix sort:

一般基数排序:

template <typename T> 
using Iter_value = std::iterator_traits<T>::value_type;

// intermediate struct to get partial template specialization
template<typename Iter, typename T, size_t range = 256>
struct rdx_impl {
    static void rdx_sort(Iter begin, Iter end, int bits) { 
        // bits is # bits to consider up to if a max val is known ahead of time
        // most efficent (theoretically) when digits are base n, having lg(n) bits
        constexpr size_t digit_bits {8};        // # bits in digit, 8 works well for 32 and 64 bit vals

            size_t d {0};                   // current digit #
            for (long long mask = (1 << digit_bits) - 1;
                d * digit_bits < bits;) {// ex. 0x000000ff for setting lower 8 bits on 32 bit num
                cnt_sort(begin, end, range, Digit_cmp<T>(mask, digit_bits*d));
                ++d;
            }
        }
    };

// specialization of rdx_sort for strings
struct Shorter {
    template <typename Seq>
    bool operator()(const Seq& a, const Seq& b) { return a.size() < b.size(); }
};
template <typename Iter>    
struct rdx_impl<Iter, std::string> {    // enough to hold ASCII char range
    static void rdx_sort(Iter begin, Iter end, int) {
        // ignore additional int argument
        int len_max = std::max_element(begin, end, Shorter())->size();
        for (int d = len_max - 1; d >= 0; --d)
            cnt_sort(begin, end, 128, Digit_cmp<std::string>(d));
    }
};

// generic call interface for all iterators 
template <typename Iter>   // use intermediate struct for partial specialization
void rdx_sort(Iter begin, Iter end, int bits) {
    rdx_impl<Iter, Iter_value<Iter>>::rdx_sort(begin, end, bits);
}

Counting sort to sort on each digit (in place):

计数排序对每一个数字排序(就位):

template <typename Iter, typename Op>
void cnt_sort(Iter begin, Iter end, size_t range, Op op) {
    using T = typename Iter::value_type;
    std::vector<int> counts(range);   // init to 0
    for (auto i = begin; i != end; ++i) // count # elems == i
        ++counts[op(*i)]; 
    for (size_t i = 1; i < range; ++i)
        counts[i] += counts[i-1];   // turn into # elems <= i
    std::vector<T> res(end - begin);
    for (auto j = end;;) {
        --j;
        res[--counts[op(*j)]] = *j;
        if (j == begin) break;
    }
    // ~18% of time is spent on copying
    std::copy(res.begin(), res.end(), begin);
}

Extract value of digit:

提取数字的价值:

template <typename T>   // overload digit_cmp for non-integral types top provide radix sort with digits
class Digit_cmp {   // functor for comparing a "digit" (particular bits)
    const long long mask; // 0..63 bitfield to test against
    const size_t to_shift;
public:
    Digit_cmp(long long m, size_t ts) : mask{m}, to_shift{ts} {}
    // by default assumes integral, just shifts
    size_t operator()(T n) const {    // char assuming r = 8
        return (n >> to_shift) & mask; // shift then mask for unit digit
    }
};
// specialization for strings
template <>
class Digit_cmp<std::string> {
    const size_t digit;
public:
    Digit_cmp(size_t d) : digit{d} {}
    size_t operator()(const std::string& str) {
        // 0 indicates past the end of the string
        return str.size() > digit ? str[digit] : 0;
    }
};