Is there a method in PySpark to get the name of a university from a URL?

   host    count
0          401
1          387
2          343

I want to get the university name from the host column of the data frame. Reading the host from back to front, the name is the part right in front of the domain suffix, up to the next dot, e.g. for host, the university is

Can someone help me with this? I couldn’t find a method that produces this output.


Try using regexp_extract:

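import pyspark.sql.functions as F

# Capture the last label before ".ac.jp" together with the suffix ("<name>.ac.jp").
# The raw string keeps the regex backslashes intact.
df2 = df.withColumn('name', F.regexp_extract('host', r'([^.]*\.ac\.jp$)', 1))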
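
If only the label in front of ".ac.jp" is wanted, the capture group can be narrowed so that it excludes the suffix. The following is a minimal sketch in plain Python, so the pattern can be checked without a Spark session. It assumes the hosts end in ".ac.jp"; the host values and the helper name extract_name are made up for illustration and are not from the question.

```python
import re

# Same idea as the pattern in the answer, but the capture group covers only
# the label in front of ".ac.jp" (assumption: that label is the university name).
NAME_PATTERN = r'([^.]+)\.ac\.jp$'


def extract_name(host):
    """Mimic F.regexp_extract(host, NAME_PATTERN, 1): return '' when nothing matches."""
    match = re.search(NAME_PATTERN, host)
    return match.group(1) if match else ''


# Hypothetical host values, invented for illustration only.
hosts = ['www.example-u.ac.jp', 'lab.cs.sample-univ.ac.jp', 'www.example.com']
names = [extract_name(h) for h in hosts]
print(names)
```

The same pattern should work unchanged in PySpark, since it uses only constructs that Python and Java regular expressions share: df.withColumn('name', F.regexp_extract('host', r'([^.]+)\.ac\.jp$', 1)). Rows whose host does not end in ".ac.jp" get an empty string rather than null.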
