1. What is jmethodID and why do I care
    1. jmethodID and profilers
  2. Investigation
    1. What is that assert doing
    2. Hygiene is important
    3. Asserts and guarantees
    4. Down the rabbit hole
    5. Let’s fix it!
    6. What is the real cause?
    7. There is still more to it
      1. Diagram time
  3. Next steps

What is jmethodID and why do I care

Let me start with a short introduction to the concept of jmethodID.

When working in the realm of JNI (Java Native Interface), we can obtain references to Java methods which can later be used to, for example, invoke the method. The reference to the method is represented by a jmethodID type, which is an opaque pointer to a structure containing the method’s metadata.

Why do we need an opaque pointer instead of just using the method reference in the form of jmethod directly? The answer is simple - the method reference is not stable. With the ability to retransform classes, the method reference can change at any time, invalidating any previously obtained references. Therefore, JNI uses the jmethodID type to represent the method reference, which is stable and can be used to invoke the method even after retransformations. The runtime will ensure the jmethodID is updated to point to the correct method metadata after retransformation.

However, there is one caveat: the jmethodID is only valid as long as there is a strong reference to the class containing the method that the jmethodID is pointing to. Once the class is unloaded, the jmethodID becomes invalid, and when it is used to resolve to a jmethod instance, it will return NULL. Although this is Hotspot-specific behavior, it is not something required by the JNI specification. As such, crashes may occur in other JVM (Java Virtual Machine) implementations when using invalid jmethodID values.

Later in this article, I will focus on the Hotspot implementation of jmethodID and how invalid jmethodID values can still pose problems.

jmethodID and profilers

Unsurprisingly, the stack traces obtained via the GetStackTrace or GetAllStackTraces JNI functions have their stack frames represented by jmethodID values. However, for all threads except the current one, it is impossible to guarantee that the classes referenced in the stack trace via jmethodID will be strongly held, making the behavior more or less undefined.

In Hotspot, the implementation is hardened and is supposed to handle invalid jmethodID values gracefully, compensating for this hole in the specification.

However, there is a bug - JDK-8313816 - causing spurious JVM crashes when trying to resolve the jmethodID values captured as a stack trace sometime later after obtaining the stack trace. The bug is not easy to reproduce, and it took a significant amount of time to find a reliable reproducer. The reproducer is not perfect, but it is good enough to trigger the crashes reliably. Unfortunately, it is a test suite of an internal project that heavily uses Mockito and will trigger the crashes within a few tens of seconds.

The sad reality is that this crash makes any profiler implementation using the JVMTI functions to obtain stack traces unusable in an always-on, in-production manner.

Investigation

I must admit, this particular crash has preoccupied me for many months. Not completely understanding the underlying implementation, I took several wrong turns and even proposed a patch, which turned out to solve nothing. But all that time, I was convinced there was a subtle bug in the Hotspot implementation of jmethodid, while many people remained skeptical, attributing the crash to incorrect usage of jmethodid values.

What is that assert doing

The vast majority of the crashes were happening in Method::checked_resolve_jmethod_id() call. While examining the code I skimmed over this statement

assert(is_valid_method(o), "should be valid jmethodid");

just assuming that under normal circumstances this assert should never trigger. Well, that is until I ran the reproducer with fastdebug OpenJDK build which enabled all those asserts. And suddenly, I saw the root cause for all those various crashes - the captured jmethodid values are becoming invalid and when they are used there is a good chance of crashing JVM.

So, I set out on checking the source code of the Method::is_valid_method() function.

// Check that this pointer is valid by checking that the vtbl pointer matches
bool Method::is_valid_method(const Method* m) {
  if (m == nullptr) {
    return false;
  } else if ((intptr_t(m) & (wordSize-1)) != 0) {
    // Quick sanity check on pointer.
    return false;
  } else if (!os::is_readable_range(m, m + 1)) {
    return false;
  } else if (m->is_shared()) {
    return CppVtables::is_valid_shared_method(m);
  } else if (Metaspace::contains_non_shared(m)) {
    return has_method_vptr((const void*)m);
  } else {
    return false;
  }
}

Well, that’s a few cases where jmethodid can be invalid! But which ones are triggering when I am getting the follow-up crash? In order to do that, I resorted to ‘printf debugging’, inserting printf statements into each branch with a ‘unique’ id, so it would be easy to match the failed precondition with the code path leading to the crash.

Long story short - when has_method_vptr((const void*)m) returns false and such an invalid jmethodid is then used, it is almost certain that usage will crash the JVM. Well, we are trying to use a memory blob as an object instance, but it is missing the valid object header. It only makes sense that this attempt can lead to very bad things.

Hygiene is important

At this point, I formed a hypothesis that the affected jmethodID is pointing to a Method object which was reclaimed, and part or all of the memory originally occupied by that instance was overwritten.

The 100-point question is: when can a Method object be reclaimed? Well, it turns out the JVM is very strict about that. A Method can be reclaimed only when its containing class is unloaded. And when is a class allowed to be unloaded? Only when its associated classloader is not referenced anymore. A bit of code searching and following call sites led me to ClassLoaderData::unload(). And indeed, the jmethodIDs associated with the methods from that class loader are wiped in order to make sure they will not be pointing to deallocated objects.

  // Method::clear_jmethod_ids only sets the jmethod_ids to null without
  // releasing the memory for related JNIMethodBlocks and JNIMethodBlockNodes.
  // This is done intentionally because native code (e.g. JVMTI agent) holding
  // jmethod_ids may access them after the associated classes and class loader
  // are unloaded. The Java Native Interface Specification says "method ID
  // does not prevent the VM from unloading the class from which the ID has
  // been derived. After the class is unloaded, the method or field ID becomes
  // invalid". In real world usages, the native code may rely on jmethod_ids
  // being null after class unloading. Hence, it is unsafe to free the memory
  // from the VM side without knowing when native code is going to stop using
  // them.
  if (_jmethod_ids != nullptr) {
    Method::clear_jmethod_ids(this);
  }

So, this should work, right? Hotspot takes care of nulling the references to ensure no invalid memory can be accessed via dangling pointers, and unloading is conducted in a VM operation, guaranteeing that the native code and the unloading process never race. All parts are in place, and yet, it is still possible to trigger a JVM crash quite reliably. There must be something else at play.

Asserts and guarantees

You are probably familiar with the concept of code asserts - they should capture invariants and fail if these are broken. Since the invariants are supposed to hold at all times, usually the asserts are enabled only in ‘debug’ builds. This makes them extremely cheap in non-debug builds, because they are physically removed from the resulting machine code.

In the JVM codebase, in addition to assert, there is also guarantee. Guarantee works pretty much like an assert, with the difference that guarantee will trigger also in a non-debug build.

‘What are you getting at?’, you might ask now. The answer is simple - I needed a way to ‘annotate’ the JVM code dealing with jmethodIDs, and particularly, deallocating the Method instances with various invariant checks. But, since my reproducer is quite a large application, running it with a debug build became quite problematic because it became really, really slow. So, I used temporary guarantees instead.

In the InstanceKlass class, which represents Java class metadata (very simply put), there is an interestingly looking function deallocate_contents().

// This function deallocates the metadata and C heap pointers that the
// InstanceKlass points to.
void InstanceKlass::deallocate_contents(ClassLoaderData* loader_data)

This function, in turn, calls InstanceKlass::deallocate_methods(), which takes care of physically deallocating all the Method instances contained in the current InstanceKlass. That looks promising - the assumption is that once a Method instance is deallocated, the jmethodid pointing to that method should be updated to nullptr instead. This should be quite easy to express as an invariant via guarantee -

void InstanceKlass::deallocate_methods(ClassLoaderData* loader_data,
                                       Array<Method*>* methods) {
  if (methods != nullptr && methods != Universe::the_empty_method_array() &&
      !methods->is_shared()) {
    for (int i = 0; i < methods->length(); i++) {
      Method* method = methods->at(i);
      if (method == nullptr) continue;  // maybe null if error processing
      // Only want to delete methods that are not executing for RedefineClasses.
      // The previous version will point to them so they're not totally dangling
      assert (!method->on_stack(), "shouldn't be called with methods on stack");
      
      // >>> inserted code
      // try and resolve the jmethodid from the `Method` instance
      jmethodID jmid = method->find_jmethod_id_or_null();
      // <<<
      
      MetadataFactory::free_metadata(loader_data, method);
      
      // >>> inserted code
      // The method that was deallocated must not have a jmethodid pointing to it
      // Because class redefinitions/retransformations can introduce new method versions
      // a jmethodid pointing to the original version will be updated to point to newer version
      // making it necessary to also assert that the jmethodid is actually pointing to the method
      // being deallocated  
      guarantee(jmid == nullptr || *((Method**)jmid) == nullptr || *((Method**)jmid) != method, "jmethodid was not cleaned up: %p", (void*)jmid);
      // <<<
    }
    MetadataFactory::free_array<Method*>(loader_data, methods);
  }
}

Rebuilding the OpenJDK with the added guarantee and rerunning the reproducer quickly demonstrates that there is indeed an invariant violation. As expected, the guarantee fails, resulting in the following stack trace:

Stack: [0x000000016f708000,0x000000016f90b000],  sp=0x000000016f90a7b0,  free space=2057k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0xa5def0]  VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x52c  (instanceKlass.cpp:551)
V  [libjvm.dylib+0xa5e64c]  VMError::report_and_die(Thread*, char const*, int, unsigned long, VMErrorType, char const*, char*)+0x0
V  [libjvm.dylib+0x2d9bb4]  print_error_for_unit_test(char const*, char const*, char*)+0x0
V  [libjvm.dylib+0x47da14]  InstanceKlass::deallocate_methods(ClassLoaderData*, Array<Method*>*)+0x14c
V  [libjvm.dylib+0x47dcfc]  InstanceKlass::deallocate_contents(ClassLoaderData*)+0x94
V  [libjvm.dylib+0x487b70]  InstanceKlass::purge_previous_version_list()+0xc4
V  [libjvm.dylib+0x25b08c]  ClassLoaderData::classes_do(void (*)(InstanceKlass*))+0x3c
V  [libjvm.dylib+0x25ed24]  ClassLoaderDataGraph::clean_deallocate_lists(bool)+0x7c
V  [libjvm.dylib+0x25ee08]  ClassLoaderDataGraph::walk_metadata_and_clean_metaspaces()+0x38
V  [libjvm.dylib+0xa64ad4]  VM_Operation::evaluate()+0xe4
V  [libjvm.dylib+0xa71458]  VMThread::evaluate_operation(VM_Operation*)+0xe4
V  [libjvm.dylib+0xa71c90]  VMThread::inner_execute(VM_Operation*)+0x28c
V  [libjvm.dylib+0xa7112c]  VMThread::run()+0xcc
V  [libjvm.dylib+0x9e7b5c]  Thread::call_run()+0xc8
V  [libjvm.dylib+0x8180e0]  thread_native_entry(Thread*)+0x118
C  [libsystem_pthread.dylib+0x6fa8]  _pthread_start+0x94
VM_Operation (0x0000000170252d60): CleanClassLoaderDataMetaspaces, mode: safepoint, requested by thread 0x000000011480a400

Down the rabbit hole

Hm, apparently, there is another code path that deallocates InstanceKlass contents (and transitively the contained methods) in addition to ClassLoaderData::unload(). It starts in ClassLoaderDataGraph and the method of interest is

void ClassLoaderDataGraph::clean_deallocate_lists(bool walk_previous_versions) {
  assert(SafepointSynchronize::is_at_safepoint(), "must only be called at safepoint");
  uint loaders_processed = 0;
  for (ClassLoaderData* cld = _head; cld != nullptr; cld = cld->next()) {
    // is_alive check will be necessary for concurrent class unloading.
>>> if (cld->is_alive()) { <<<<
      // clean metaspace
      if (walk_previous_versions) {
        cld->classes_do(InstanceKlass::purge_previous_versions);
      }
      cld->free_deallocate_list();
      loaders_processed++;
    }
  }
  log_debug(class, loader, data)("clean_deallocate_lists: loaders processed %u %s",
                                 loaders_processed, walk_previous_versions ? "walk_previous_versions" : "");
}

Yes! The methods are being deallocated for a class loader that is still alive. Therefore, the code taking care of all jmmethodids associated with this class loader is not called, and any deallocated method has the potential to leave a dangling pointer in the corresponding jmethodid value.

Let’s fix it!

Good, it seems like the fix is trivial. Let’s just do this

void InstanceKlass::deallocate_methods(ClassLoaderData* loader_data,
                                       Array<Method*>* methods) {
  if (methods != nullptr && methods != Universe::the_empty_method_array() &&
      !methods->is_shared()) {
    for (int i = 0; i < methods->length(); i++) {
      Method* method = methods->at(i);
      if (method == nullptr) continue;  // maybe null if error processing
      // Only want to delete methods that are not executing for RedefineClasses.
      // The previous version will point to them so they're not totally dangling
      assert (!method->on_stack(), "shouldn't be called with methods on stack");

      jmethodID jmid = method->find_jmethod_id_or_null();
      // Do the pointer maintenance before releasing the metadata, just in case
      // We need to make sure that jmethodID actually resolves to this method
      // - multiple redefined versions may share jmethodID slots and if a method
      //   has already been rewired to a newer version we could be removing reference
      //   to a still existing method instance
      if (jmid != nullptr && *((Method**)jmid) == method) {
        // dangling pointer; needs to be cleaned up
        *((Method**)jmid) = nullptr;
      }
      MetadataFactory::free_metadata(loader_data, method);
    }
    MetadataFactory::free_array<Method*>(loader_data, methods);
  }
}

And we are done. Honestly, I can’t believe the fix was so easy. I just had to update the jmethodID value to not point to Method instances that are known to be deallocated shortly.

What is the real cause?

As you probably expect, I left out a lot of wrong turns I took while investigating. Although not leading to the fix, they helped me understand the underlying cause, which is the class redefinition/retransformation. Do you remember InstanceKlass::purge_previous_versions from the failed guarantee stack trace? The ‘previous versions’ are older forms of redefined classes when some of the methods of those classes were running at the time they were redefined, and because of that, the JVM had to keep them around for a while. Then, from time to time, the JVM will check if it is possible to unload those class versions.

The tricky part is that the JVM will keep the jmethodID cache only in the ‘main’ class version, dynamically extending the cache to accommodate jmethodID values from obsolete but yet still active class versions. This can happen, e.g., when a class is redefined, the previous version is marked obsolete, but there are still some of its methods somewhere on the stack, and at that moment, we call GetStackTrace. New jmethodID values will be generated as required, and they will be placed in the jmethodID cache in the main version.

And here come the giggles - after an older class version has been purged, the jmethodID values are still present in the main class version cache but pointing to deallocated Method instances!

Ok, now the loop is closed, and I am pretty sure the proposed fix is correct and actually addresses the root cause.

There is still more to it

The way jmethodID values are stored for previous versions of a retransformed class also opens a whole new can of worms - requesting a jmethodID for an older method version may lead to cache resizing, which in turn will cause memory allocation. And remember, doing memory allocations in signal handlers is bad, mmkay? How is CPU and wallclock profiling generally done? Well, glad you asked - it’s by sending a signal to the JVM and letting the signal handler call AsyncGetCallTrace, which returns a stack trace comprised of, drumroll please, jmethodID values.

Sensible JVM profiler implementations try to avoid jmethodID cache resizes by eagerly requesting all the jmethodID values for all methods in the class upon class load. However, this strategy won’t work if the cache is resized because a previous method version jmethodID is requested.

Honestly, I don’t have a good solution for this yet. It might take some time to figure out how to address this issue.

Diagram time

For posterity, I am adding the schematic diagrams about how the jmethodID cache is organized for a redefined class.

Fig1: Overview of the jmethodID storage jmethodID cache overview

Fig2: How the jmethodIDs are stored for a redefined class jmethodID for redefined classes

Next steps

The next steps would be trying to create a JTREG test for the fix. I will have to look around for tests that involve class redefinitions with several versions of a redefined class. Additionally, I need to find other tests that trigger metaspace cleanup. This way, I can combine them to trigger the problematic behavior. However, if this turns out to be too complicated, I would rather propose the fix and aim to get it into the upcoming JDK updates as soon as possible.