I am trying to read from and write to HBase on EMR from a Java application that runs outside the EMR cluster nodes, i.e. from a Docker application running on an ECS cluster/EC2 instance. The HBase root directory is on S3, in the form
s3://<bucketname>/. I need to build the Hadoop and HBase configuration objects from the core-site.xml and hbase-site.xml files in order to read and write the HBase data. This works when the HBase data is stored in HDFS.
But when HBase is on S3 and I try the same thing, I get the exception below.
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638
The core-site.xml file contains the properties below.
<property>
  <name>fs.s3.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
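Hadoop resolves the class named in fs.s3.impl by name at run time, which is why a missing EMRFS jar shows up as a ClassNotFoundException rather than as a compile error. A minimal sketch, using only the JDK, for checking whether a given class is loadable from the application's classpath. The ClasspathCheck class and isLoadable method are hypothetical names, and the check uses the thread context class loader, which may not be the exact loader Hadoop's Configuration uses:

```java
/** Hypothetical diagnostic helper; not part of Hadoop, HBase, or EMR. */
final class ClasspathCheck {

    /** Returns true if the named class can be resolved by the application's class loader. */
    static boolean isLoadable(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the loader that loaded this helper class.
            loader = ClasspathCheck.class.getClassLoader();
        }
        try {
            // initialize=false: only resolve the class, do not run its static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String emrfs = "com.amazon.ws.emr.hadoop.fs.EmrFileSystem";
        System.out.println(emrfs + " loadable: " + isLoadable(emrfs));
    }
}
```

Running this inside the container shows whether the class referenced by the configuration is actually visible to the application before any Hadoop code is involved.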
Below is the jar containing the “com.amazon.ws.emr.hadoop.fs.EmrFileSystem” class:
This jar is present only on the EMR nodes and is not published to a public Maven repository, so it cannot be declared as a Maven dependency in a Java project. For MapReduce and Spark jobs, adding the jar location to the classpath is enough. For a Java application running outside the EMR cluster nodes that does not work, because the jar is not available on the ECS instances. Adding the jar to the classpath manually leads to the error below.
2021-03-26 10:02:39.420 INFO 1 --- [ main] c.a.ws.emr.hadoop.fs.util.PlatformInfo : Unable to read clusterId from http://localhost:8321/configuration , trying extra instance data file: /var/lib/instance-controller/extraInstanceData.json
2021-03-26 10:02:39.421 INFO 1 --- [ main] c.a.ws.emr.hadoop.fs.util.PlatformInfo : Unable to read clusterId from /var/lib/instance-controller/extraInstanceData.json, trying EMR job-flow data file: /var/lib/info/job-flow.json
2021-03-26 10:02:39.421 INFO 1 --- [ main] c.a.ws.emr.hadoop.fs.util.PlatformInfo : Unable to read clusterId from /var/lib/info/job-flow.json, out of places to look
2021-03-26 10:02:45.578 WARN 1 --- [ main] c.a.w.e.h.fs.util.ConfigurationUtils : Cannot create temp dir with proper permission: /mnt/s3
We are using EMR version 5.29. Is there any workaround for this issue?
I was able to solve the issue by using S3A. The EMRFS libraries used on EMR are not public and cannot be used outside EMR, so I used S3AFileSystem to access HBase on S3 from my ECS cluster. Add the hadoop-aws (which provides S3AFileSystem) and
aws-java-sdk-bundle Maven dependencies corresponding to your Hadoop version.
Then add the property below to core-site.xml.
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class of the S3A Filesystem</description>
</property>
Then change the HBase root directory URL in hbase-site.xml as follows.
<property>
  <name>hbase.rootdir</name>
  <value>s3a://ncsdevhbase/</value>
</property>
You can also set other S3A-related properties. See the Hadoop documentation for more details on S3A: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
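If you would rather not maintain a second copy of the XML files for the external client, the same two changes (the fs.s3a.impl property and the s3a:// root directory) can be derived in code from the root directory the cluster uses. This is only a sketch under assumptions: S3aOverrides, toS3a and clientOverrides are hypothetical names, and the code uses only the JDK, so it builds the key/value pairs without touching Hadoop itself.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: derives client-side S3A overrides from an EMRFS-style root directory. */
final class S3aOverrides {

    /** Rewrites an s3:// or s3n:// URI to s3a://, leaving the bucket and path untouched. */
    static String toS3a(String rootDir) {
        int sep = rootDir.indexOf("://");
        if (sep < 0) {
            throw new IllegalArgumentException("Not a URI with a scheme: " + rootDir);
        }
        String scheme = rootDir.substring(0, sep).toLowerCase(Locale.ROOT);
        if (scheme.equals("s3a")) {
            return rootDir; // already an S3A URI, nothing to do
        }
        if (!scheme.equals("s3") && !scheme.equals("s3n")) {
            throw new IllegalArgumentException("Unsupported scheme: " + scheme);
        }
        return "s3a" + rootDir.substring(sep);
    }

    /** Builds the two properties described above, in insertion order. */
    static Map<String, String> clientOverrides(String emrRootDir) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        props.put("hbase.rootdir", toS3a(emrRootDir));
        return props;
    }

    public static void main(String[] args) {
        clientOverrides("s3://ncsdevhbase/")
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Each entry could then be applied with Configuration.set(key, value) on the object returned by HBaseConfiguration.create(), assuming the S3A classes are on the application's classpath.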