scala - Spark sample is too slow

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to take a simple random sample in Scala from an existing table containing around 100e6 records.

import org.apache.spark.sql.SaveMode

val nSamples = 3e5.toInt
val frac = 1e-5
val table = spark.table("db_name.table_name").sample(false, frac).limit(nSamples)
(table
  .write
  .mode(SaveMode.Overwrite)
  .saveAsTable("db_name.new_name")
)

But it is taking too long (~5h by my estimates).

Useful information:

I have ~6 workers. By analyzing the number of partitions of the table I get: 11433.
I'm not sure if the partitions/workers ratio is reasonable.
I'm running Spark 2.1.0 using Scala.

I have tried:

Removing the .limit() part.
Changing frac to 1.0, 0.1, etc.

Question: how can I make it faster?

1 Answer

深蓝 · Answer 1 · 2021-10-23T20:07:03+0000

Removing the limit is definitely worthwhile, but the real problem is that sampling requires a full data scan. No matter how low the fraction is, the time complexity is still O(N)*.
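To make the O(N) point concrete, here is a conceptual sketch of Bernoulli (without replacement) sampling in plain Scala. It is not Spark's actual code; the object name `BernoulliSketch` and the `visited` counter are made up for illustration. The point it shows is that a coin is flipped for every row, so the whole input is read regardless of the fraction.

```scala
import scala.util.Random

object BernoulliSketch {
  // Conceptual model only: keep each row independently with probability `fraction`.
  // Returns the kept rows together with the number of rows that had to be inspected.
  def sample[A](rows: Iterator[A], fraction: Double, seed: Long): (Vector[A], Long) = {
    val rng = new Random(seed)
    var visited = 0L
    val kept = Vector.newBuilder[A]
    rows.foreach { row =>
      visited += 1                                // every row is touched once
      if (rng.nextDouble() < fraction) kept += row
    }
    (kept.result(), visited)
  }
}
```

Calling `BernoulliSketch.sample(rows, 1e-5, seed)` and `BernoulliSketch.sample(rows, 0.1, seed)` on the same input would report the same `visited` count; only the size of the kept set changes. That matches the observation in the question that changing frac did not help.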

If you don't require good statistical properties, you can try to limit the amount of data you load in the first place by sampling data files first, and then subsampling from the reduced dataset. This might work reasonably well if the data is uniformly distributed across the files.
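A minimal sketch of the first stage (choosing a subset of data files) might look like the following. The helper `FileSampling.pickFiles` and its parameters are hypothetical names introduced here, not part of Spark; it only selects paths and does not read any data.

```scala
import scala.util.Random

object FileSampling {
  // Hypothetical helper: keeps roughly `fraction` of the given file paths,
  // but never fewer than one when the input is non-empty, so the second
  // stage always has something to read.
  def pickFiles(files: Seq[String], fraction: Double, seed: Long): Seq[String] = {
    require(fraction > 0.0 && fraction <= 1.0, "fraction must be in (0, 1]")
    if (files.isEmpty) Seq.empty
    else {
      val n = math.max(1, math.ceil(files.size * fraction).toInt)
      // Shuffle with a fixed seed so the choice is reproducible, then take the first n.
      new Random(seed).shuffle(files).take(n)
    }
  }
}
```

In spark-shell, the file list could come from `spark.table("db_name.table_name").inputFiles`, which Spark documents as a best-effort snapshot and which may return directories rather than individual files for some table types. The picked paths could then be loaded with a reader that matches the table's storage format, for example `spark.read.parquet(picked: _*)` if the table happens to be Parquet (an assumption; the question does not say), followed by `.sample(false, frac)` on the reduced dataset. Note that reading files directly may drop partition columns that only exist in the directory layout.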

Otherwise, there is not much you can do about it other than scaling your cluster.

* How do simple random sampling and dataframe SAMPLE function work in Apache Spark (Scala)?
