How to access HBase on S3 from a non-EMR node

I am trying to read and write HBase on EMR from a Java application that runs outside the EMR cluster nodes, i.e. from a Docker application running on an ECS cluster/EC2 instance. The HBase root folder is of the form s3://&lt;bucketname&gt;/. I need to build the Hadoop and HBase configuration objects for reading and writing HBase data using the core-site.xml and hbase-site.xml files. This works fine when the HBase data is stored in HDFS.
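For context, the read path described above amounts to something like the following sketch (the resource paths, table name, and row key are hypothetical; it assumes hbase-client is on the classpath and a reachable cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class HBaseOnS3Client {
    public static void main(String[] args) throws Exception {
        // Build a configuration from the cluster's XML files
        Configuration conf = HBaseConfiguration.create();
        conf.addResource(new Path("/etc/hbase/conf/core-site.xml"));   // hypothetical paths
        conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));

        // Open a connection and read one row (table and row key are hypothetical)
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
            Result result = table.get(new Get("row1".getBytes()));
            System.out.println("Row found: " + !result.isEmpty());
        }
    }
}
```

When the root directory is on HDFS this works as-is; the exception below appears only once hbase.rootdir points at S3, because resolving the filesystem for the root directory requires the S3 filesystem implementation class named in core-site.xml.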

But when HBase is on S3 and I try the same approach, I get the exception below.

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)

The core-site.xml file contains the properties below.

<property>
  <name>fs.s3.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>

<property>
  <name>fs.s3n.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>

The jar containing the com.amazon.ws.emr.hadoop.fs.EmrFileSystem class is /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.44.0.jar. This jar is present only on EMR nodes and is not available as a Maven dependency in the public Maven repositories. For MapReduce and Spark jobs, adding the jar's location to the classpath is enough. For a Java application running outside the EMR cluster nodes, that does not work because the jar is not available on the ECS instances, and manually copying the jar onto the classpath leads to the errors below.

2021-03-26 10:02:39.420  INFO 1 --- [           main] c.a.ws.emr.hadoop.fs.util.PlatformInfo   : Unable to read clusterId from http://localhost:8321/configuration , trying extra instance data file: /var/lib/instance-controller/extraInstanceData.json
2021-03-26 10:02:39.421  INFO 1 --- [           main] c.a.ws.emr.hadoop.fs.util.PlatformInfo   : Unable to read clusterId from /var/lib/instance-controller/extraInstanceData.json, trying EMR job-flow data file: /var/lib/info/job-flow.json
2021-03-26 10:02:39.421  INFO 1 --- [           main] c.a.ws.emr.hadoop.fs.util.PlatformInfo   : Unable to read clusterId from /var/lib/info/job-flow.json, out of places to look
2021-03-26 10:02:45.578  WARN 1 --- [           main] c.a.w.e.h.fs.util.ConfigurationUtils     : Cannot create temp dir with proper permission: /mnt/s3

We are using EMR version 5.29. Is there any workaround for this issue?

Answer

I was able to solve the issue by using s3a. The EMRFS libraries used on EMR are not public and cannot be used outside EMR, so I used S3AFileSystem to access HBase on S3 from my ECS cluster. Add the hadoop-aws and aws-java-sdk-bundle Maven dependencies corresponding to your Hadoop version, and add the below property to core-site.xml.

<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class of the S3A Filesystem</description>
</property>

Then change the HBase root directory URL in hbase-site.xml as follows.

  <property>
    <name>hbase.rootdir</name>
    <value>s3a://ncsdevhbase/</value>
  </property>
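The S3A filesystem also needs AWS credentials. A minimal sketch, assuming static keys in core-site.xml (the values are placeholders; on an EC2/ECS instance with an IAM role you would instead point fs.s3a.aws.credentials.provider at an instance-profile/default provider rather than embedding keys):

```xml
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```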

You can also set other s3a-related properties. See the Hadoop S3A documentation for more details: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
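The Maven dependencies mentioned above could be declared roughly like this. The versions are assumptions: hadoop-aws must match the Hadoop version your cluster runs (EMR 5.29 ships Hadoop 2.8.x), and the aws-java-sdk-bundle version should match the SDK version your hadoop-aws release was built against.

```xml
<!-- Versions are illustrative; align them with your Hadoop distribution -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>2.8.5</version>
</dependency>
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-bundle</artifactId>
  <version>1.11.271</version>
</dependency>
```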