确定与数组中的间隔匹配的值的最快方法

时间:2022-12-27 21:21:45

I have a sorted array of int going from x to y (the values of the elements are random but sorted in ascending order using qsort()). The program receives various intervals like <10;50>, or <50;100>. I have the following simple for loop to determine if the values in the array are in the set interval, if yes add one to the counter.

我有一个从x到y的整数排序数组(元素的值是随机的,但是使用qsort()按升序排序)。该程序接收的时间间隔为<10;50>,或<50;100>。我有以下简单的for循环来确定数组中的值是否在set interval中,如果yes向计数器添加一个。

 for(int i = 0; i < arraySize ;i++ )  {        
       if (points[i] >= interval1 && points[i] <= interval2){
            counter++;               
        }
    }

I need a faster way than O(n) to search through the array, and determine if the value in points[i] is in the set interval or not. The values can be in the millions, hence slowing down dramatically.

我需要一种比O(n)更快的方法来搜索数组,并确定点[I]中的值是否在设置的区间内。价值可以是数以百万计,因此急剧放缓。

The elements in the array can range from 0 to 1000000000 (1e9). The intervals respectively.

数组中的元素可以从0到1000000000 (1e9)。分别的时间间隔。

4 个解决方案

#1


2  

Use binary search - for the input interval [i, j], find the index of the smallest integer that is larger than i, find the index of the largest integer that is smaller than j, and then return the distance between them.

使用二进搜索——对于输入区间[i, j],找到小于i的最小整数的索引,找到小于j的最大整数的索引,然后返回它们之间的距离。

ssize_t bin_search_first_larger(int arr[], size_t arr_sz, int val) {
    ssize_t l = -1;
    ssize_t r = arr_sz;
    /* invariant: arr[l] < val && val <= arr[r] */
    while (l+1 != r) {
        ssize_t m = l+(r-l)/2;
        if (arr[m] < val) {
            l = m;
        } else {
            r = m;
        }
    }
    /* l+1 == r && arr[l] < val && val <= arr[r] */
    return r;
}

ssize_t bin_search_last_smaller(int arr[], size_t arr_sz, int val) {
    ssize_t l = -1;
    ssize_t r = arr_sz;
    /* invariant: arr[l] <= val && val < arr[r] */
    while (l+1 != r) {
        ssize_t m = l+(r-l)/2;
        if (arr[m] <= val) {
            l = m;
        } else {
            r = m;
        }
    }
    /* l+1 == r && arr[l] <= val && val < arr[r] */
    return l;
}

ssize_t values_in(int arr[], size_t arr_sz, int x, int y) {
    ssize_t i = bin_search_first_larger(arr, arr_sz, x);
    ssize_t j = bin_search_last_smaller(arr, arr_sz, y);
    return j-i+1;
}

The binary search code is adapted from Jon Bentley's Programming Pearls (which is well worth a read), where it is shown how binary search can be modified to return either the first occurrence or the last occurrence of a value in a sorted array with duplicates (rather than returning an arbitrary occurrence of a duplicate value). The process is similar for your use case, the difference is subtle.

二分查找的代码是改编自乔恩·本特利的编程珍珠(这是很值得一读),它显示如何修改二进制搜索返回第一次出现或去年发生的价值排序数组副本(而不是返回任意发生的一个重复的值)。过程对于您的用例是相似的,差异是微妙的。

Note that conceptually, it is assumed that arr[-1] is negative infinity, and arr[N] is positive infinity (where N is the size of the array), but obviously, the code never attempts to access such elements.

注意,在概念上,假设arr[-1]是负无穷,arr[N]是正无穷(其中N是数组的大小),但显然,代码从不尝试访问这些元素。

Time complexity is O(log(N)) where N is the size of the array, it's hard (impossible?) to get any better than that.

时间复杂度是O(log(N)),其中N是数组的大小,很难(不可能?)得到更好的结果。

I ran some tests and it appears to work fine for the general case and also for edge cases (no elements in the range, or y larger than every element, or x smaller than every element, or both x smaller than every element and y larger than every element), but as you probably know this does not prove the absence of bugs.

我进行了一些测试,它似乎工作好一般情况下和边界情况(没有元素范围,或y大于每个元素,或者x小于每个元素,或者两个x小于每个元素和y大于每个元素),但你们可能知道这并不证明没有bug。

#2


1  

Late to the party, trying to take up the challenge to do better than O(log n), here is a O(1) (time) solution to get the number of values within a given range [a,b].

晚到的一方,试图接受挑战做得比O(log n)更好,这里有一个O(1) (time)解决方案,以获得给定范围内的值的数量[a,b]。

Initialization itself, to only do once, is O(MAXVALUE+NVALUES). MAXVALUE being the highest value that may appear in the set of data, NVALUES is the number of values in the set of data. And according to the question

初始化本身,只做一次,是O(MAXVALUE+NVALUES)。MAXVALUE是数据集中可能出现的最高值,NVALUES是数据集中值的数量。根据这个问题。

  • 0 is the lowest value
  • 0是最小值
  • 1,000,000,000 is the highest value
  • 1000,000,000是最高的值
  • set of data is in the millions
  • 数以百万计的数据

O(1) requires to allow the program to allocate an array of MAXVALUE+1 int. Basically for 1bn values, an array of 1GB x sizeof(int) (gcc on Linux x86_64 that would take typically 4 GB of RAM, or partly swap). Meaning the program has to run on a at least 64 bits machine.

O(1)要求允许程序分配MAXVALUE+1 int的数组,对于10亿个值,一个1GB x sizeof(int)的数组(Linux x86_64上的gcc,通常需要4 GB的RAM或部分交换)。意味着程序必须运行在至少64位机器上。

The initial set of data is to be ordered.

初始数据集将被排序。

Principle

原则

  • Initialization (once): the m[0, 1bn] array at index i gets the number of values greater or equal than i

    初始化(once):索引i处的m[0,10亿]数组获得大于或等于i的值数

  • Number of values in the range [a, b] is simply m[a] - m[b+1]
    (if b+1 > MAXVALUE, use 0 instead)

    范围内的值数量[a, b]只是m[a] - m[b+1](如果b+1 > MAXVALUE,则使用0)

Initialization:

初始化:

#define MAXVALUE 1000000000
#define NVALUES     1000000

int *m; // big array

void initialization(int *values)
{
   m = malloc((MAXVALUE+1) * sizeof(*m)); // check if NULL!

   int i,j;

   for(j=0,i=0 ; i<=MAXVALUE ; ) {
      if (j >= NVALUES) m[i++] = 0;
      else if (values[j] >= i) m[i++] = NVALUES-j;
      else j++;
   }
}

Get number of values within range [a, b] a<=b:

取值范围[a, b] a<=b:

int count_in_range(int a, int b) {
   int ma = m[a];
   int mb = b >= MAXVALUE ? 0 : m[b+1];
   return ma-mb;
}

m has to be freed after all ranges have been counted.

m必须在所有射程都计算完毕后被释放。

#3


0  

The needed distance is equal to :

所需要的距离等于:

// position of first element greater than interval2
auto lb = std::upper_bound(array.begin(), array.end(), interval2);
// position of first element greater or equal than interval1
auto ub = std::lower_bound(array.begin(), array.end(), interval1);
// their difference is the number of elements in the needed range
return (ub - lb);

The resulting complexity is O(log N) since the lower/upper bounds on the sorted array are O(log N).

由此产生的复杂度是O(log N),因为排序数组的下界/上界是O(log N)。

Edit: sorry, didn't notice the C tag. In C you need to implement lower/upper bound operations yourself, then. To make your life simplier - you can only impement lower_bound and then use upper_bound as lower_bound(interval2 + 1).

编辑:对不起,没有注意到C标签。在C语言中,您需要自己实现下界/上界操作。为了让你的生活更简单——你只能用下界,然后用上界作为下界(interval2 + 1)。

#4


-1  

Here you have other version of BinSearch, with complexity O(logN) too.

这里还有另一个版本的BinSearch,它的复杂度也是O(logN)。

    int BinSearch(int *array, int first, int last, int value){

          int m;
          /* Optional Error control */
          if (!array || first<0 || last<first)  return -1;

          while (first <= last){

                  m = (first + last)/2;

                  if(array[m] == value) return m;

                  if(value < array[m]) last = m-1;

                  else
                       first = m+1;
            }
           /* Failure search */
           return -1;
     }

The function returns -1 if the value is not in the array or the index where the values is.

如果值不在数组或值所在的索引中,则函数返回-1。

You can do a variant that returns 1 if find the value or 0, then you can do

如果找到值或0,就可以执行返回1的变量

      counter += BinSearch_variant(array,first,last,value);

#1


2  

Use binary search - for the input interval [i, j], find the index of the smallest integer that is larger than i, find the index of the largest integer that is smaller than j, and then return the distance between them.

使用二进搜索——对于输入区间[i, j],找到小于i的最小整数的索引,找到小于j的最大整数的索引,然后返回它们之间的距离。

ssize_t bin_search_first_larger(int arr[], size_t arr_sz, int val) {
    ssize_t l = -1;
    ssize_t r = arr_sz;
    /* invariant: arr[l] < val && val <= arr[r] */
    while (l+1 != r) {
        ssize_t m = l+(r-l)/2;
        if (arr[m] < val) {
            l = m;
        } else {
            r = m;
        }
    }
    /* l+1 == r && arr[l] < val && val <= arr[r] */
    return r;
}

ssize_t bin_search_last_smaller(int arr[], size_t arr_sz, int val) {
    ssize_t l = -1;
    ssize_t r = arr_sz;
    /* invariant: arr[l] <= val && val < arr[r] */
    while (l+1 != r) {
        ssize_t m = l+(r-l)/2;
        if (arr[m] <= val) {
            l = m;
        } else {
            r = m;
        }
    }
    /* l+1 == r && arr[l] <= val && val < arr[r] */
    return l;
}

ssize_t values_in(int arr[], size_t arr_sz, int x, int y) {
    ssize_t i = bin_search_first_larger(arr, arr_sz, x);
    ssize_t j = bin_search_last_smaller(arr, arr_sz, y);
    return j-i+1;
}

The binary search code is adapted from Jon Bentley's Programming Pearls (which is well worth a read), where it is shown how binary search can be modified to return either the first occurrence or the last occurrence of a value in a sorted array with duplicates (rather than returning an arbitrary occurrence of a duplicate value). The process is similar for your use case, the difference is subtle.

二分查找的代码是改编自乔恩·本特利的编程珍珠(这是很值得一读),它显示如何修改二进制搜索返回第一次出现或去年发生的价值排序数组副本(而不是返回任意发生的一个重复的值)。过程对于您的用例是相似的,差异是微妙的。

Note that conceptually, it is assumed that arr[-1] is negative infinity, and arr[N] is positive infinity (where N is the size of the array), but obviously, the code never attempts to access such elements.

注意,在概念上,假设arr[-1]是负无穷,arr[N]是正无穷(其中N是数组的大小),但显然,代码从不尝试访问这些元素。

Time complexity is O(log(N)) where N is the size of the array, it's hard (impossible?) to get any better than that.

时间复杂度是O(log(N)),其中N是数组的大小,很难(不可能?)得到更好的结果。

I ran some tests and it appears to work fine for the general case and also for edge cases (no elements in the range, or y larger than every element, or x smaller than every element, or both x smaller than every element and y larger than every element), but as you probably know this does not prove the absence of bugs.

我进行了一些测试,它似乎工作好一般情况下和边界情况(没有元素范围,或y大于每个元素,或者x小于每个元素,或者两个x小于每个元素和y大于每个元素),但你们可能知道这并不证明没有bug。

#2


1  

Late to the party, trying to take up the challenge to do better than O(log n), here is a O(1) (time) solution to get the number of values within a given range [a,b].

晚到的一方,试图接受挑战做得比O(log n)更好,这里有一个O(1) (time)解决方案,以获得给定范围内的值的数量[a,b]。

Initialization itself, to only do once, is O(MAXVALUE+NVALUES). MAXVALUE being the highest value that may appear in the set of data, NVALUES is the number of values in the set of data. And according to the question

初始化本身,只做一次,是O(MAXVALUE+NVALUES)。MAXVALUE是数据集中可能出现的最高值,NVALUES是数据集中值的数量。根据这个问题。

  • 0 is the lowest value
  • 0是最小值
  • 1,000,000,000 is the highest value
  • 1000,000,000是最高的值
  • set of data is in the millions
  • 数以百万计的数据

O(1) requires to allow the program to allocate an array of MAXVALUE+1 int. Basically for 1bn values, an array of 1GB x sizeof(int) (gcc on Linux x86_64 that would take typically 4 GB of RAM, or partly swap). Meaning the program has to run on a at least 64 bits machine.

O(1)要求允许程序分配MAXVALUE+1 int的数组,对于10亿个值,一个1GB x sizeof(int)的数组(Linux x86_64上的gcc,通常需要4 GB的RAM或部分交换)。意味着程序必须运行在至少64位机器上。

The initial set of data is to be ordered.

初始数据集将被排序。

Principle

原则

  • Initialization (once): the m[0, 1bn] array at index i gets the number of values greater or equal than i

    初始化(once):索引i处的m[0,10亿]数组获得大于或等于i的值数

  • Number of values in the range [a, b] is simply m[a] - m[b+1]
    (if b+1 > MAXVALUE, use 0 instead)

    范围内的值数量[a, b]只是m[a] - m[b+1](如果b+1 > MAXVALUE,则使用0)

Initialization:

初始化:

#define MAXVALUE 1000000000
#define NVALUES     1000000

int *m; // big array

void initialization(int *values)
{
   m = malloc((MAXVALUE+1) * sizeof(*m)); // check if NULL!

   int i,j;

   for(j=0,i=0 ; i<=MAXVALUE ; ) {
      if (j >= NVALUES) m[i++] = 0;
      else if (values[j] >= i) m[i++] = NVALUES-j;
      else j++;
   }
}

Get number of values within range [a, b] a<=b:

取值范围[a, b] a<=b:

int count_in_range(int a, int b) {
   int ma = m[a];
   int mb = b >= MAXVALUE ? 0 : m[b+1];
   return ma-mb;
}

m has to be freed after all ranges have been counted.

m必须在所有射程都计算完毕后被释放。

#3


0  

The needed distance is equal to :

所需要的距离等于:

// position of first element greater than interval2
auto lb = std::upper_bound(array.begin(), array.end(), interval2);
// position of first element greater or equal than interval1
auto ub = std::lower_bound(array.begin(), array.end(), interval1);
// their difference is the number of elements in the needed range
return (ub - lb);

The resulting complexity is O(log N) since the lower/upper bounds on the sorted array are O(log N).

由此产生的复杂度是O(log N),因为排序数组的下界/上界是O(log N)。

Edit: sorry, didn't notice the C tag. In C you need to implement lower/upper bound operations yourself, then. To make your life simplier - you can only impement lower_bound and then use upper_bound as lower_bound(interval2 + 1).

编辑:对不起,没有注意到C标签。在C语言中,您需要自己实现下界/上界操作。为了让你的生活更简单——你只能用下界,然后用上界作为下界(interval2 + 1)。

#4


-1  

Here you have other version of BinSearch, with complexity O(logN) too.

这里还有另一个版本的BinSearch,它的复杂度也是O(logN)。

    int BinSearch(int *array, int first, int last, int value){

          int m;
          /* Optional Error control */
          if (!array || first<0 || last<first)  return -1;

          while (first <= last){

                  m = (first + last)/2;

                  if(array[m] == value) return m;

                  if(value < array[m]) last = m-1;

                  else
                       first = m+1;
            }
           /* Failure search */
           return -1;
     }

The function returns -1 if the value is not in the array or the index where the values is.

如果值不在数组或值所在的索引中,则函数返回-1。

You can do a variant that returns 1 if find the value or 0, then you can do

如果找到值或0,就可以执行返回1的变量

      counter += BinSearch_variant(array,first,last,value);