
0 votes
787 views
in Technique [Technology] by (71.8m points)

pyspark - When is it appropriate to use a UDF vs using spark functionality?



1 Answer

0 votes
by (71.8m points)

It is quite simple: rely as much as possible on Spark's built-in functions, and only use a UDF when your transformation cannot be expressed with them.
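To make this concrete, here is a minimal sketch of the same transformation done both ways (the DataFrame, the column name "name", and the uppercasing logic are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # Preferred: a built-in function, fully visible to the optimizer.
    df.select(F.upper(F.col("name")).alias("name_upper")).show()

    # The same result via a Python UDF -- it works, but it is opaque to
    # Catalyst and ships every value to a Python worker and back.
    to_upper = F.udf(lambda s: s.upper() if s is not None else None, StringType())
    df.select(to_upper(F.col("name")).alias("name_upper")).show()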

UDFs cannot be optimized by Spark's Catalyst optimizer, so there is always a potential decrease in performance. UDFs are also expensive because they force Spark to represent data as objects in the JVM instead of its optimized internal format; for Python UDFs, every row must additionally be serialized to a Python worker process and back.
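You can see this in the physical plan: a Python UDF introduces a BatchEvalPython step, while the built-in version does not. A small sketch, reusing the same illustrative column as above:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
    to_upper = F.udf(lambda s: s.upper(), StringType())

    # Built-in: the whole plan stays inside Catalyst/Tungsten.
    df.select(F.upper(F.col("name"))).explain()

    # Python UDF: the plan gains a BatchEvalPython step, i.e. rows are
    # serialized out of the JVM to a Python worker and back.
    df.select(to_upper(F.col("name"))).explain()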

Since you have also used the tag [pyspark], and as noted in the comments, it may be of interest that "Pandas UDFs" (a.k.a. vectorized UDFs) greatly reduce the cost of that data movement between the JVM and Python: they use Apache Arrow to transfer data in batches and pandas to process it. You can create one with pandas_udf, and read more in the Databricks blog post Introducing Pandas UDF for PySpark, which has a dedicated section on Performance Comparison.
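As a rough sketch of what that looks like (Spark 3.x style with type hints; PyArrow must be installed, and the column name "v" and the doubling logic are purely illustrative):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["v"])

    @pandas_udf("double")
    def times_two(v: pd.Series) -> pd.Series:
        # Receives a whole batch as a pandas Series, transferred via
        # Apache Arrow, instead of one Python object per row.
        return v * 2

    df.select(times_two("v")).show()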

Your peers might have used many UDFs because the built-in functions they needed were not available in earlier versions of Spark; more built-in functions are added with every release.

