Spark SQL filter (WHERE clause) with multiple conditions

Time: 2021-05-20 23:10:34

Hi, I have the following issue:

numeric.registerTempTable("numeric")

All the values that I want to filter on are literal 'null' strings, not N/A or actual NULL values.
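
This distinction matters: a real SQL NULL matches isNull(), while a literal 'null' string only matches a plain string comparison. A minimal sketch to check which kind a column holds (assuming the numeric DataFrame referenced above):

numeric.filter(numeric['LOW'].isNull()).show()    # rows where LOW is a real NULL
numeric.filter(numeric['LOW'] == 'null').show()   # rows where LOW is the string 'null'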

I tried these three options:

  1. numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')

  2. numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null') (see the syntax note after this list)

  3. sqlContext.sql("SELECT * from numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")
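
As an aside, option 2 as written is not valid PySpark at all: Python has no AND operator, and its and keyword cannot be applied to Column expressions; conjunctions must use & with each comparison parenthesized. A minimal sketch of the corrected syntax (which, as the answer below explains, still returns no rows):

numeric_filtered = numeric.filter(
    (numeric['LOW'] != 'null') &
    (numeric['HIGH'] != 'null') &
    (numeric['NORMAL'] != 'null'))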

Unfortunately, numeric_filtered is always empty. I checked, and numeric does contain rows that should pass these conditions.

Here are some sample values:

Low   High  Normal
3.5   5.0   null
2.0   14.0  null
null  38.0  null
null  null  null
1.0   null  4.0

1 solution

#1

You are using logical conjunction (AND). It means that all columns have to be different from 'null' for a row to be included. Let's illustrate that using the filter version as an example:

numeric = sqlContext.createDataFrame([
    ('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'),  ('null', '38.0', 'null'),
    ('null', 'null', 'null'),  ('1.0', 'null', '4.0')],
    ('low', 'high', 'normal'))

numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

numeric_filtered_2 = numeric_filtered_1.where(
    numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null|   4.0|
## +---+----+------+

numeric_filtered_3 = numeric_filtered_2.where(
    numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+

All the remaining methods you've tried follow exactly the same pattern. What you need here is a logical disjunction (OR).

from pyspark.sql.functions import col 

numeric_filtered = numeric.where(
    (col('LOW')    != 'null') | 
    (col('NORMAL') != 'null') |
    (col('HIGH')   != 'null'))
numeric_filtered.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
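
Note the parentheses around each comparison: in Python, | binds more tightly than !=, so each condition must be wrapped before the columns are combined. With many columns, the same disjunction can also be built programmatically; a minimal sketch using functools.reduce (an illustration, not part of the original answer):

from functools import reduce
from operator import or_
from pyspark.sql.functions import col

# Build (low != 'null') | (high != 'null') | (normal != 'null') over all columns.
condition = reduce(or_, [col(c) != 'null' for c in numeric.columns])
numeric.where(condition).show()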

or with raw SQL:

numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
    WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
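
Note that registerTempTable is deprecated as of Spark 2.0; the equivalent there (assuming a SparkSession named spark) would be:

numeric.createOrReplaceTempView("numeric")
spark.sql("""SELECT * FROM numeric
    WHERE low != 'null' OR normal != 'null' OR high != 'null'""").show()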

See also: Pyspark: multiple conditions in when clause
