Is there any method in PySpark to get the name of the university from a URL?

    host                             count
0   xsi12.komaba.ecc.u-tokyo.ac.jp   401
1   sunspot.eds.ecip.nagoya-u.ac.jp  387
2   rungw002.ritsumei.ac.jp          343

I want to get the university name from the host column of the data frame. Reading the host from right to left, it is the part directly in front of .ac.jp, up to the next dot. For example, for the host pc021133.shef.ac.jp, the university is shef.ac.jp.

Can someone help me with this? I couldn't find a method that produces this output.

Answer

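Try using regexp_extract. The pattern captures the last dot-free label before .ac.jp, together with the suffix, anchored at the end of the host:

import pyspark.sql.functions as F

# Raw string, so the regex backslashes are not treated as Python escape sequences.
# Group 1 is the label directly before ".ac.jp" plus the suffix itself.
df2 = df.withColumn('name', F.regexp_extract('host', r'([^.]*\.ac\.jp)$', 1))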
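To check the pattern without starting a Spark session, here is a minimal sketch that uses Python's re module on the hosts from the question. The helper name university_domain is made up for this illustration. Spark evaluates the pattern with Java's regex engine rather than Python's, so this is only a sanity check, although a pattern this simple should behave the same in both. Returning an empty string on no match mirrors what regexp_extract does.

```python
import re

# Same pattern as in the regexp_extract call: the last dot-free label
# before ".ac.jp", plus the suffix, anchored at the end of the host.
PATTERN = re.compile(r'([^.]*\.ac\.jp)$')


def university_domain(host):
    """Hypothetical helper: return e.g. 'shef.ac.jp', or '' if nothing matches."""
    match = PATTERN.search(host)
    return match.group(1) if match else ''


hosts = [
    'xsi12.komaba.ecc.u-tokyo.ac.jp',
    'sunspot.eds.ecip.nagoya-u.ac.jp',
    'rungw002.ritsumei.ac.jp',
    'pc021133.shef.ac.jp',
]

for host in hosts:
    print(host, '->', university_domain(host))
```

Hosts that do not end in .ac.jp produce an empty string, so they can be filtered out afterwards if needed.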
