Apache Spark worker executor EXITED with exit status 1

I have a Spark standalone setup (v 1.4.1) with 3 workers.

I have an application that reads a stream from a Kafka topic, processes the data, and stores the result in another Kafka topic.

Last night the application crashed and all the workers went down.

The workers’ logs look like the following:

16/02/04 21:02:10 INFO ExecutorRunner: Launch command: "/opt/jdk1.8.0_45/bin/java" "-cp" "/dati/spark-1.4.1-bin-hadoop2.4/sbin/../conf/:/dati/spark-1.4.1-bin-hadoop2.4/lib/spark-assembly-1.4.1-hadoop2.4.0.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" "-Dspark.driver.port=52180" "-DenabledWorkerLog=false" "-Dcom.sun.management.jmxremote.port=54330" "-Dcom.sun.management.jmxremote.ssl=false" "-Dcom.sun.management.jmxremote.authenticate=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://[email protected]:52180/user/CoarseGrainedScheduler" "--executor-id" "24279" "--hostname" "worker2" "--cores" "1" "--app-id" "app-20160201182749-0007" "--worker-url" "akka.tcp://[email protected]:57853/user/Worker"
16/02/04 21:02:10 INFO FileAppender: Rolling executor logs enabled for /dati/spark-1.4.1-bin-hadoop2.4/work/app-20160201182749-0007/24279/stdout with daily rolling
16/02/04 21:02:10 INFO FileAppender: Rolling executor logs enabled for /dati/spark-1.4.1-bin-hadoop2.4/work/app-20160201182749-0007/24279/stderr with daily rolling
16/02/04 21:02:10 INFO Worker: Executor app-20160129184621-0001/1430 finished with state EXITED message Command exited with code 1 exitStatus 1
16/02/04 21:02:10 INFO Worker: Asked to launch executor app-20160129184621-0001/1431 for stream-elaboration
16/02/04 21:02:10 INFO ExecutorRunner: Launch command: "/opt/jdk1.8.0_45/bin/java" "-cp" "/dati/spark-1.4.1-bin-hadoop2.4/sbin/../conf/:/dati/spark-1.4.1-bin-hadoop2.4/lib/spark-assembly-1.4.1-hadoop2.4.0.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" "-Dspark.driver.port=57297" "-DenabledWorkerLog=false" "-Dcom.sun.management.jmxremote.port=54326" "-Dcom.sun.management.jmxremote.ssl=false" "-Dcom.sun.management.jmxremote.authenticate=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://[email protected]:57297/user/CoarseGrainedScheduler" "--executor-id" "1431" "--hostname" "worker2" "--cores" "1" "--app-id" "app-20160129184621-0001" "--worker-url" "akka.tcp://[email protected]:57853/user/Worker"
16/02/04 21:02:10 INFO FileAppender: Rolling executor logs enabled for /dati/spark-1.4.1-bin-hadoop2.4/work/app-20160129184621-0001/1431/stdout with daily rolling
16/02/04 21:02:10 INFO FileAppender: Rolling executor logs enabled for /dati/spark-1.4.1-bin-hadoop2.4/work/app-20160129184621-0001/1431/stderr with daily rolling
16/02/04 21:02:11 INFO Worker: Executor app-20160201182749-0007/24279 finished with state EXITED message Command exited with code 1 exitStatus 1
16/02/04 21:02:11 INFO Worker: Asked to launch executor app-20160201182749-0007/24280 for stream-elaboration
16/02/04 21:02:11 INFO ExecutorRunner: Launch command: "/opt/jdk1.8.0_45/bin/java" "-cp" "/dati/spark-1.4.1-bin-hadoop2.4/sbin/../conf/:/dati/spark-1.4.1-bin-hadoop2.4/lib/spark-assembly-1.4.1-hadoop2.4.0.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" "-Dspark.driver.port=52180" "-DenabledWorkerLog=false" "-Dcom.sun.management.jmxremote.port=54330" "-Dcom.sun.management.jmxremote.ssl=false" "-Dcom.sun.management.jmxremote.authenticate=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://[email protected]:52180/user/CoarseGrainedScheduler" "--executor-id" "24280" "--hostname" "worker2" "--cores" "1" "--app-id" "app-20160201182749-0007" "--worker-url" "akka.tcp://[email protected]:57853/user/Worker"
16/02/04 21:02:11 INFO FileAppender: Rolling executor logs enabled for /dati/spark-1.4.1-bin-hadoop2.4/work/app-20160201182749-0007/24280/stdout with daily rolling
16/02/04 21:02:11 INFO FileAppender: Rolling executor logs enabled for /dati/spark-1.4.1-bin-hadoop2.4/work/app-20160201182749-0007/24280/stderr with daily rolling
16/02/04 21:02:11 INFO Worker: Executor app-20160129184621-0001/1431 finished with state EXITED message Command exited with code 1 exitStatus 1
16/02/04 21:02:11 INFO Worker: Asked to launch executor app-20160129184621-0001/1432 for stream-elaboration
16/02/04 21:02:11 INFO ExecutorRunner: Launch command: "/opt/jdk1.8.0_45/bin/java" "-cp" "/dati/spark-1.4.1-bin-hadoop2.4/sbin/../conf/:/dati/spark-1.4.1-bin-hadoop2.4/lib/spark-assembly-1.4.1-hadoop2.4.0.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/dati/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" "-Dspark.driver.port=57297" "-DenabledWorkerLog=false" "-Dcom.sun.management.jmxremote.port=54326" "-Dcom.sun.management.jmxremote.ssl=false" "-Dcom.sun.management.jmxremote.authenticate=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://[email protected]:57297/user/CoarseGrainedScheduler" "--executor-id" "1432" "--hostname" "worker2" "--cores" "1" "--app-id" "app-20160129184621-0001" "--worker-url" "akka.tcp://[email protected]:57853/user/Worker"
16/02/04 21:02:11 INFO FileAppender: Rolling executor logs enabled for /dati/spark-1.4.1-bin-hadoop2.4/work/app-20160129184621-0001/1432/stdout with daily rolling
16/02/04 21:02:11 INFO FileAppender: Rolling executor logs enabled for /dati/spark-1.4.1-bin-hadoop2.4/work/app-20160129184621-0001/1432/stderr with daily rolling
16/02/04 21:02:11 INFO Worker: Executor app-20160201182749-0007/24280 finished with state EXITED message Command exited with code 1 exitStatus 1
16/02/04 21:02:11 INFO Worker: Asked to launch executor app-20160201182749-0007/24281 for stream-elaboration

And at the end of the log:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp291507283-42"
Exception in thread "qtp291507283-37" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "ExecutorRunner for app-20160201182749-0007/29488" java.lang.OutOfMemoryError: GC overhead limit exceeded

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "sparkWorker-scheduler-1"
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "qtp291507283-38" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "JMX server connection timeout 81" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "JMX server connection timeout 81"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "sparkWorker-10"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp291507283-40"
Exception in thread "qtp291507283-35" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception in thread "qtp291507283-39" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "qtp291507283-41" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Exception in thread "RMI TCP Connection(idle)" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "qtp291507283-36" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "RMI TCP Connection(idle)" 

....

Running:

ps aux | grep "worker"

shows that the worker process is still running, but the worker no longer appears in the Spark UI.

Why are the worker’s executors restarting so frequently?

Answer

The logs show many java.lang.OutOfMemoryError: GC overhead limit exceeded messages, which means your executors are hitting out-of-memory errors that cause them to exit.
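To see how often the executors are dying and which error dominates, it can help to tally the worker log rather than read it by eye. The following is a minimal Python sketch, assuming the log line formats shown in the question. The function name summarize_worker_log and both regular expressions are illustrative and not part of Spark.

```python
import re
from collections import Counter

# Worker lines such as:
# "... INFO Worker: Executor app-20160129184621-0001/1430 finished with state
#  EXITED message Command exited with code 1 exitStatus 1"
EXIT_RE = re.compile(
    r"Executor (?P<app>app-\d+-\d+)/(?P<executor>\d+) finished with state "
    r"(?P<state>\w+).*exitStatus (?P<status>-?\d+)"
)
# Any line mentioning OutOfMemoryError, with the reason after ": " if present.
OOM_RE = re.compile(r"java\.lang\.OutOfMemoryError(?:: (?P<reason>.+))?")


def summarize_worker_log(lines):
    """Count executor exits per (app, state, status) and OOM lines per reason."""
    exits = Counter()
    ooms = Counter()
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            key = (m.group("app"), m.group("state"), int(m.group("status")))
            exits[key] += 1
            continue
        m = OOM_RE.search(line)
        if m:
            reason = (m.group("reason") or "unspecified").strip()
            ooms[reason] += 1
    return exits, ooms


if __name__ == "__main__":
    # Lines copied from the log in the question.
    sample = [
        "16/02/04 21:02:10 INFO Worker: Executor app-20160129184621-0001/1430 "
        "finished with state EXITED message Command exited with code 1 exitStatus 1",
        "16/02/04 21:02:11 INFO Worker: Executor app-20160201182749-0007/24279 "
        "finished with state EXITED message Command exited with code 1 exitStatus 1",
        'Exception in thread "qtp291507283-37" java.lang.OutOfMemoryError: '
        "GC overhead limit exceeded",
    ]
    exits, ooms = summarize_worker_log(sample)
    for (app, state, status), count in sorted(exits.items()):
        print(f"{app}: {count} executor(s) {state} with status {status}")
    for reason, count in ooms.items():
        print(f"OutOfMemoryError ({reason}): {count}")
```

To run it against a real log, pass an open file object (for example open(path) on the worker's log file) instead of the sample list, since the function only needs an iterable of lines.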

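This error means the JVM is spending too much of its time in garbage collection while reclaiming very little heap (by default, HotSpot raises it when more than 98% of the time goes to GC and less than 2% of the heap is recovered). To resolve this, you can try one of the following: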

  • The brute-force way is to disable this safety check by adding -XX:-UseGCOverheadLimit to your executors’ JVM options, but that may leave your application spending most of its time in GC and therefore running very slowly
  • Analyze your job’s memory usage and optimize it: your code might be consuming more memory than it needs, forcing the GC to work too hard
  • Tune your memory settings: for example, if you can increase the executors’ heap space (the launch commands above show -Xmx1024M), pressure on the GC may be reduced
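The first and third options above can be sketched as a spark-submit command line. This is a minimal Python sketch that only assembles the arguments. The jar path, main class, and master URL are hypothetical placeholders, and 2g is an arbitrary example value, not a recommendation. The flags --master, --class, --executor-memory, and --conf are standard spark-submit options, and spark.executor.extraJavaOptions is the Spark property for passing extra JVM flags to executors.

```python
import shlex


def build_spark_submit(app_jar, main_class, master_url,
                       executor_memory="2g", disable_gc_overhead_limit=False):
    """Assemble spark-submit arguments that raise the executor heap.

    All argument values are supplied by the caller; nothing here is
    specific to the application in the question.
    """
    cmd = [
        "spark-submit",
        "--master", master_url,
        "--class", main_class,
        # Sets the executor heap (the -Xms/-Xmx seen in the launch command).
        "--executor-memory", executor_memory,
    ]
    if disable_gc_overhead_limit:
        # Brute-force option: turn off the GC overhead check in executors.
        cmd += ["--conf",
                "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit"]
    cmd.append(app_jar)
    return cmd


if __name__ == "__main__":
    # Hypothetical jar, class, and master URL, for illustration only.
    cmd = build_spark_submit(
        app_jar="/path/to/stream-elaboration.jar",
        main_class="com.example.StreamElaboration",
        master_url="spark://master:7077",
        executor_memory="2g",
        disable_gc_overhead_limit=True,
    )
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

In standalone mode, each worker must have enough memory available to grant the requested executor size, so the value passed as executor_memory has to fit within what the workers offer.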
