Quick Guide to bootstrapping Java 11 on EMR

Steps to configure EMR to run jars compiled using Java 11/bytecode version 55.0 and some “gotchas”!

Caution: We tested this bootstrapping only with Apache Spark 3.0.2. I am not sure how it affects other applications such as Pig, Hive and Tez. That said, I haven’t needed any of these applications so far, and I believe the same is true for most of you out there.

Thanks to my team and AWS support, I was able to spin up my EMR cluster with Java 11. We needed this because some upstream dependencies were compiled with Java 11. By default, EMR runs on Java 8 and throws an exception when it encounters bytecode compiled by a newer version of Java.

With no changes to JVM home, if you try running Java 11 compiled libraries, you will hit this error:

Caused by: java.lang.UnsupportedClassVersionError: com/group/project/YourClass has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
...
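If you want to check ahead of time which Java version a dependency targets, you can read the class file header directly. Below is a minimal sketch, assuming a POSIX shell with `od` available; the function name `class_major_version` is made up for illustration. Every .class file starts with a 4-byte magic number and a 2-byte minor version, followed by the 2-byte big-endian major version: 52 for Java 8, 55 for Java 11.

```shell
# class_major_version FILE: print the class file major version
# (52 = Java 8, 55 = Java 11). Hypothetical helper, for illustration only.
class_major_version() {
  # Skip the 4-byte magic number and the 2-byte minor version,
  # then read the next 2 bytes as unsigned decimal values.
  set -- $(od -An -t u1 -j 6 -N 2 "$1")
  # Combine the two bytes as a big-endian 16-bit number.
  echo $(( $1 * 256 + $2 ))
}
```

To use it, extract a class from the jar first (for example with `unzip -p your.jar com/group/project/YourClass.class > YourClass.class`) and pass the extracted file to the function. `javap -v` reports the same number as "major version".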

It is frustrating to rewrite these dependencies using Java 8 and give up all the nicer add-ons that Java 11 and other libraries bring (local type inference and Kryo serialization are my favorites).

Therefore, to get my newer libraries working, my team and I decided to tackle the problem. We hit some gotchas along the way, and I decided to be a good Samaritan and publish my findings to save readers some time.

Step 1: Add GC ignore options to the Spark configuration

When creating the EMR cluster, add these options to spark-defaults in the configuration section:

[
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.driver.defaultJavaOptions": "-XX:OnOutOfMemoryError='kill -9 %p' -XX:MaxHeapFreeRatio=70",
      "spark.executor.defaultJavaOptions": "-verbose:gc -Xlog:gc*::time -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -XX:MaxHeapFreeRatio=70 -XX:+IgnoreUnrecognizedVMOptions"
    },
    "configurations": []
  }
]

The idea here is simple: ensure that Java 11 ignores the Java 8-era GC options it no longer recognizes. Without this config, the app terminates with the following error:

Unrecognized VM option 'UseGCLogFileRotation'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Step 2: Write and Store bootstrap.sh in S3

Add a bootstrap action that runs bootstrap.sh from S3 once each node is provisioned.

Please note that simply setting JAVA_HOME to point to Java 11 won’t work: cluster provisioning fails. According to https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions, Apache Hadoop 3.3 and later supports Java 8 and Java 11 (runtime only), whereas emr-6.4.0 ships Hadoop 3.2.1.

Understanding how JAVA_HOME is used was a big gotcha for my team and me.
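For readers who want a starting point, here is a rough sketch of what such a bootstrap.sh could look like. It is not the exact script we ran, and it rests on several assumptions: the nodes run Amazon Linux 2 on x86_64, Java 11 comes from the `java-11-amazon-corretto` yum package, Corretto installs under the path stored in `JAVA11_HOME`, and the system-wide `java` command is repointed with `alternatives` rather than by exporting JAVA_HOME. The `--apply` flag and the `run` and `bootstrap_java11` names are made up for this example.

```shell
#!/bin/bash
# Hypothetical bootstrap.sh sketch; adjust to your own cluster before use.
set -e

# Assumption: Amazon Corretto 11 installs here on Amazon Linux 2 (x86_64).
JAVA11_HOME=/usr/lib/jvm/java-11-amazon-corretto.x86_64

# Commands are only printed unless the script is called with --apply,
# so a trial run never modifies the node.
APPLY=0
if [ "${1:-}" = "--apply" ]; then
  APPLY=1
fi

# run CMD...: log the command, and execute it only in --apply mode.
run() {
  echo "+ $*"
  if [ "$APPLY" = "1" ]; then
    "$@"
  fi
}

bootstrap_java11() {
  # Install Java 11 next to the Java 8 that EMR ships with.
  run sudo yum install -y java-11-amazon-corretto
  # Repoint the system-wide java command instead of exporting JAVA_HOME.
  run sudo alternatives --set java "$JAVA11_HOME/bin/java"
  # Record the active version in the bootstrap action logs.
  run java -version
}

bootstrap_java11
```

Upload the script to S3 and reference it as a bootstrap action, passing `--apply` as its argument. Without the flag it only prints the commands, which makes it easy to review on a test node first.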

Okay, that’s all you really need to run Apache Spark on EMR using Java 11. We will keep updating the blog as we learn more about EMR and supported Java versions! Ciao!