Welcome to Vigges Developer Community: Open, Learning, Share
Welcome To Ask or Share your Answers For Others


dynamically join two spark-scala dataframes on multiple columns without hardcoding join conditions

I would like to join two Spark (Scala) DataFrames on multiple columns dynamically. I want to avoid hardcoding the column-name comparisons, as in the following statement:

val joinRes = df1.join(df2, df1("col1") === df2("col1") && df1("col2") === df2("col2"))

A solution already exists for PySpark, in the question "PySpark DataFrame - Join on multiple columns dynamically".

I would like to do the same thing in Spark with Scala.


1 Answer


In Scala you can do this much as in Python: map each pair of column names to an equality condition, then reduce the conditions into a single join expression:

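import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._

// Each row must be a tuple so that toDF can produce two columns
val df1 = List(("a", "b"), ("b", "c"), ("c", "d")).toDF("col1", "col2")
val df2 = List(("1", "2"), ("2", "c"), ("3", "4")).toDF("col1", "col2")

val columnsdf1 = df1.columns
val columnsdf2 = df2.columns

// Pair the columns by position, build one equality per pair, then AND them together
val joinExprs = columnsdf1
   .zip(columnsdf2)
   .map { case (c1, c2) => df1(c1) === df2(c2) }
   .reduce(_ && _)

val dfJoinRes = df1.join(df2, joinExprs)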
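
Note that zipping `df1.columns` with `df2.columns` pairs the columns by position, so it only works when both DataFrames list their join columns in the same order. The sketch below shows two variations that may be safer in practice. The object name `DynamicJoinSketch`, the helper `joinOn`, the sample data, and the `local[*]` master are assumptions made for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

object DynamicJoinSketch {

  // Hypothetical helper: takes explicit (leftColumn, rightColumn) name pairs,
  // so the join does not depend on column order and the names may differ.
  def joinOn(left: DataFrame, right: DataFrame, pairs: Seq[(String, String)]): DataFrame = {
    require(pairs.nonEmpty, "at least one column pair is required")
    val condition: Column = pairs
      .map { case (l, r) => left(l) === right(r) } // one equality per pair
      .reduce(_ && _)                              // AND all equalities together
    left.join(right, condition)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to make the sketch runnable
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dynamic-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = List(("a", "b"), ("b", "c"), ("c", "d")).toDF("col1", "col2")
    val df2 = List(("1", "2"), ("b", "c"), ("3", "4")).toDF("col1", "col2")

    // Variation 1: when the join columns share names, pass the names directly.
    // This is an inner equi-join and keeps each join column only once.
    val sharedColumns = df1.columns.intersect(df2.columns).toSeq
    val byName = df1.join(df2, sharedColumns)
    byName.show() // only the row (b, c) appears in both inputs

    // Variation 2: explicit pairs, useful when the names differ between sides.
    val renamed = df2.toDF("k1", "k2")
    val byPairs = joinOn(df1, renamed, Seq("col1" -> "k1", "col2" -> "k2"))
    byPairs.show()

    spark.stop()
  }
}
```

Variation 1 avoids duplicate `col1`/`col2` columns in the result, which the expression-based join keeps from both sides. Variation 2 keeps the map-and-reduce approach from the answer but makes the column pairing explicit.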
