How can I efficiently select an exact number of random rows from a DataFrame? The data contains an index column that can be used. If I have to use the maximum size, which is more efficient on the index column: count() or max()?
1 solution
#1
2
A possible approach is to calculate the number of rows using .count(), then use sample() from Python's random library to draw a random sequence of the requested length from this range. Lastly, use the resulting list of numbers, vals, to subset your index column. This assumes the index column holds contiguous integers starting at 1, as in the example.
import random

def sampler(df, col, records):
    # Calculate number of rows
    colmax = df.count()
    # Create random sample from the index range 1..colmax;
    # range() excludes its upper bound, hence colmax + 1
    vals = random.sample(range(1, colmax + 1), records)
    # Use 'vals' to filter DataFrame using 'isin'
    return df.filter(df[col].isin(vals))
Example:
df = sc.parallelize([(1,1),(2,1),
                     (3,1),(4,0),
                     (5,0),(6,1),
                     (7,1),(8,0),
                     (9,0),(10,1)]).toDF(["a","b"])
sampler(df,"a",3).show()
+---+---+
|  a|  b|
+---+---+
|  3|  1|
|  4|  0|
|  6|  1|
+---+---+
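The index-selection step can be tried on its own, without Spark. Below is a minimal sketch, assuming the index column holds contiguous integers from 1 to the row count; `pick_index_values` and its `seed` parameter are hypothetical names introduced here for illustration, not part of the answer. Because `range()` excludes its upper bound, the sketch uses `n_rows + 1` so that the last row stays eligible.

```python
import random

def pick_index_values(n_rows, k, seed=None):
    """Return k distinct index values drawn uniformly from 1..n_rows.

    Assumes the index is contiguous and starts at 1 (an assumption,
    matching the example data above).
    """
    if not 0 <= k <= n_rows:
        raise ValueError("k must be between 0 and n_rows")
    rng = random.Random(seed)  # local generator, so a seed gives repeatable draws
    # range() excludes its upper bound, so n_rows + 1 keeps the last index eligible
    return rng.sample(range(1, n_rows + 1), k)

# Draw 3 index values for a 10-row table
vals = pick_index_values(10, 3, seed=42)
print(sorted(vals))
```

The returned list can be passed to `isin` in the same way `vals` is used in `sampler`. If the index has gaps, some drawn values would match no row and fewer than `k` rows would come back, so this only yields an exact count under the contiguity assumption.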