   host                              count
0  xsi12.komaba.ecc.u-tokyo.ac.jp    401
1  sunspot.eds.ecip.nagoya-u.ac.jp   387
2  rungw002.ritsumei.ac.jp           343
I want to get the university name from the host column of this data frame: the part right in front of .ac.jp, back to the next dot when reading from back to front. For example, for host pc021133.shef.ac.jp, the university is shef.ac.jp.
Can someone help me with this? I couldn't find a method to get this output.
Answer
Try using regexp_extract:
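import pyspark.sql.functions as F
# Raw string so the backslashes reach the regex engine unchanged; group 1 is the last label before .ac.jp plus the suffix
df2 = df.withColumn('name', F.regexp_extract('host', r'([^.]*\.ac\.jp$)', 1))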
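If you want to check the pattern without a Spark session, here is a minimal sketch using Python's built-in re module on the hosts from the question. Spark evaluates the pattern with Java's regex engine, but this simple pattern should behave the same in both. The helper name extract_university is made up for illustration and is not part of PySpark.

```python
import re

# Same pattern as in the answer: the last label before ".ac.jp" plus the suffix.
PATTERN = re.compile(r'([^.]*\.ac\.jp$)')

def extract_university(host):
    """Hypothetical helper that mimics regexp_extract(host, pattern, 1)."""
    match = PATTERN.search(host)
    # regexp_extract returns an empty string when the pattern does not match.
    return match.group(1) if match else ''

hosts = [
    'xsi12.komaba.ecc.u-tokyo.ac.jp',
    'sunspot.eds.ecip.nagoya-u.ac.jp',
    'rungw002.ritsumei.ac.jp',
    'pc021133.shef.ac.jp',
]

for host in hosts:
    print(host, '->', extract_university(host))
```

Hosts that do not end in .ac.jp give an empty string, so you may want to filter those rows or replace the empty value afterwards.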