0 votes
3.8k views
in Technique by (71.8m points)

scala - How to apply a function on each row of a Spark DataFrame after groupBy using Java

I'm using Spark 3.0.x with Java, and we have a transformation requirement that goes as follows:

A sample (hypothetical) Dataframe:

+==============+===========+
| UserCategory | Location  |
+==============+===========+
| Premium      | Mumbai    |
+--------------+-----------+
| Premium      | Bangalore |
+--------------+-----------+
| Gold         | Kolkata   |
+--------------+-----------+
| Premium      | Moscow    |
+--------------+-----------+
| Silver       | Tokyo     |
+--------------+-----------+
| Gold         | Sydney    |
+--------------+-----------+
| Bronze       | Dubai     |
+--------------+-----------+
| ...          | ...       |
+--------------+-----------+
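For anyone who wants to reproduce this, a local version of the sample frame can be built as follows (Spark 3.0.x, Java; the values are just the hypothetical sample from the table above):

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CouponRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("coupon-codes")
                .master("local[*]")   // local master for this repro only
                .getOrCreate();

        StructType schema = new StructType()
                .add("UserCategory", DataTypes.StringType)
                .add("Location", DataTypes.StringType);

        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                RowFactory.create("Premium", "Mumbai"),
                RowFactory.create("Premium", "Bangalore"),
                RowFactory.create("Gold", "Kolkata"),
                RowFactory.create("Premium", "Moscow"),
                RowFactory.create("Silver", "Tokyo"),
                RowFactory.create("Gold", "Sydney"),
                RowFactory.create("Bronze", "Dubai")), schema);

        df.show();
    }
}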

Let's say we need to perform a groupBy() on the UserCategory column. Then, on each row of the grouped data, we need to apply a certain function. The function depends on both columns of the dataframe: for example, we might generate a CouponCode based on how many users there are in a specific UserCategory as well as on each user's Location. The point is that the groupBy() has to come before the row-by-row rule kicks in. The expected output might be something like the following:

+==============+===========+============+
| UserCategory | Location  | CouponCode |
+==============+===========+============+
| Premium      | Mumbai    | IN01P      |
+--------------+-----------+------------+
| Premium      | Bangalore | IN02P      |
+--------------+-----------+------------+
| Premium      | Moscow    | RU03P      |
+--------------+-----------+------------+
| Gold         | Kolkata   | IN01G      |
+--------------+-----------+------------+
| Gold         | Sydney    | AU01G      |
+--------------+-----------+------------+
| Silver       | Tokyo     | JP01S      |
+--------------+-----------+------------+
| Bronze       | Dubai     | AE01B      |
+--------------+-----------+------------+
| ...          | ...       | ...        |
+--------------+-----------+------------+

Now, if we were using PySpark, I could have leveraged a grouped-map pandas UDF to achieve this. But our stack is built on Java, so that's not a possibility. I was wondering how the same thing could be achieved in Java.
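One partial idea: when the per-row rule boils down to a sequence number within each group, as the coupon example does, a window function gets there without any grouped-map machinery. A rough sketch, assuming the country prefix is already available as a CountryCode column (e.g. from a join against a city-to-country reference table):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lpad;
import static org.apache.spark.sql.functions.row_number;
import static org.apache.spark.sql.functions.substring;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Number rows within each UserCategory; the orderBy makes the
// numbering deterministic for a given input.
WindowSpec byCategory = Window.partitionBy("UserCategory").orderBy("Location");

// Assumes a CountryCode column already exists on df, e.g. from a
// join against a city-to-country reference table.
Dataset<Row> withCoupons = df.withColumn("CouponCode",
        concat(col("CountryCode"),
               lpad(row_number().over(byCategory).cast("string"), 2, "0"),
               substring(col("UserCategory"), 1, 1)));

That would cover the toy example, but our real per-group function is more involved, so I'm still after a general groupBy-then-apply pattern.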

I was looking at the Javadoc here, and it seems a groupBy() on a dataframe returns a RelationalGroupedDataset. I browsed through the methods available on the RelationalGroupedDataset class, and the only one that looks promising is the apply() function. Unfortunately, it doesn't have any documentation, and the method signature looks a bit intimidating:

public static RelationalGroupedDataset apply(Dataset<Row> df,
                                             scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs,
                                             RelationalGroupedDataset.GroupType groupType)

So, the question is: is that the way to go? If so, a dummy example of the apply() method would be much appreciated. Otherwise, if you could point me towards a different approach to achieve the same thing, that would also be fine.
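For completeness, the closest I've gotten on my own is the typed Dataset API: groupByKey() followed by flatMapGroups(), which hands over each group as an iterator of rows and lets me emit one output row per input row. The coupon rule and the hard-coded city-to-country map below are hypothetical stand-ins for our real logic:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.function.FlatMapGroupsFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

import scala.Tuple3;

// Hypothetical city-to-country lookup; the real job would probably
// join against a reference table instead of hard-coding this.
Map<String, String> country = new HashMap<>();
country.put("Mumbai", "IN");
country.put("Bangalore", "IN");
country.put("Moscow", "RU");

Dataset<Row> withCoupons = df
        // Typed counterpart of groupBy(): key each row by UserCategory.
        .groupByKey(
                (MapFunction<Row, String>) r -> r.<String>getAs("UserCategory"),
                Encoders.STRING())
        // Each group arrives as an iterator of rows; unlike agg(), this
        // lets us emit one output row per input row.
        .flatMapGroups(
                (FlatMapGroupsFunction<String, Row, Tuple3<String, String, String>>)
                (category, rows) -> {
                    List<Tuple3<String, String, String>> out = new ArrayList<>();
                    int seq = 0; // NB: row order within a group isn't guaranteed
                    while (rows.hasNext()) {
                        String location = rows.next().<String>getAs("Location");
                        seq++;
                        String coupon = String.format("%s%02d%s",
                                country.getOrDefault(location, "XX"),
                                seq,
                                category.substring(0, 1));
                        out.add(new Tuple3<>(category, location, coupon));
                    }
                    return out.iterator();
                },
                Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING()))
        .toDF("UserCategory", "Location", "CouponCode");

If this typed-API route is the idiomatic Java replacement for a pandas grouped-map UDF, an answer confirming that would work for me too.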



1 Answer

0 votes
by (71.8m points)
Waiting for an expert to reply.
