0 votes
1.0k views
in Technique by (71.8m points)

pyspark - How to specify file size using repartition() in Spark

I'm using PySpark and I have a large data source that I want to repartition, specifying the file size per partition explicitly.

I know that calling repartition(500) will split my Parquet output into 500 files of almost equal size. The problem is that new data gets added to this data source every day. Some days bring a large input, others a smaller one, so when I look at the partition file size distribution over time, it varies between 200 KB and 700 KB per file.
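For context, this is roughly what I do today, where df is the daily DataFrame and output_path is an illustrative placeholder:

# Fixed partition count: always writes 500 files, regardless of input size.
output_path = "..."  # placeholder for my storage location
df.repartition(500).write.parquet(output_path)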

I was thinking of specifying a maximum size per partition so that I get roughly the same file size per day, irrespective of the number of files. That would help me later when running jobs over this large dataset, by avoiding skewed executor and shuffle times.
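One direction I've considered is deriving the partition count from an estimate of the total input size and a target file size; a rough sketch, where all numbers are illustrative placeholders and df/output_path are as above:

target_file_size_bytes = 500 * 1024                  # aim for roughly 500 KB per file
estimated_total_size_bytes = 250 * 1024 * 1024       # rough estimate of today's input
num_partitions = max(1, estimated_total_size_bytes // target_file_size_bytes)
df.repartition(num_partitions).write.parquet(output_path)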

Is there a way to specify this with the repartition() function, or while writing the DataFrame to Parquet?

question from:https://stackoverflow.com/questions/65912908/how-to-specify-file-size-using-repartition-in-spark


1 Answer

0 votes
by (71.8m points)

You could consider writing your result with the maxRecordsPerFile write option, which caps the number of records placed in each output file.

storage_location = "..."  # output path placeholder
estimated_records_with_desired_size = 2000

# Cap each output file at roughly this many records; Spark rolls over
# to a new file once the limit is reached.
(result_df.write
    .option("maxRecordsPerFile", estimated_records_with_desired_size)
    .parquet(storage_location, compression="snappy"))
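If you would rather not pass the option on every write, recent Spark versions also expose the same cap as a session-level SQL configuration. A minimal sketch, assuming spark is the active SparkSession and that you estimate the record count from your average row size:

# Session-wide alternative: applies to subsequent file writes in this session.
spark.conf.set("spark.sql.files.maxRecordsPerFile", estimated_records_with_desired_size)
result_df.write.parquet(storage_location, compression="snappy")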
