Welcome toVigges Developer Community-Open, Learning,Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.3k views
in Technique[技术] by (71.8m points)

pyspark - Split Spark Dataframe string column into multiple columns

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very simple:

rdd.map(lambda row: row + [row.my_str_col.split('-')])

which takes something looking like:

col1 | my_str_col
-----+-----------
  18 |  856-yygrm
 201 |  777-psgdg

and converts it to this:

col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
  18 |  856-yygrm |   856 | yygrm
 201 |  777-psgdg |   777 | psgdg

I am aware of pyspark.sql.functions.split(), but it results in a nested array column instead of two top-level columns like I want.

Ideally, I want these new columns to be named as well.

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

pyspark.sql.functions.split() is the right approach here - you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it's very easy. You simply use Column.getItem() to retrieve each part of the array as a column itself:

split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))

The result will be:

col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
  18 |  856-yygrm |   856 | yygrm
 201 |  777-psgdg |   777 | psgdg

I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to Vigges Developer Community for programmer and developer-Open, Learning and Share
...