Resilience in a Microservices World

Financial Engines TechBlog


Jul 30, 2016

By Craig LaSalle

Introduction

Building software is often compared to building homes, and in building a home you don’t let a faulty circuit burn down the house. Decades of learnings in the home building industry have shown that one small device, the circuit breaker, can protect an entire house.

Financial Engines is on the microservices journey for the typical business and architectural reasons. Basically, the cost of owning a monolithic system keeps growing because code, releases and teams are coupled together. One solution is to decompose the system into smaller pieces following the principles of microservices.

There are many great blogs and books explaining microservices architectures (we’ve been using Building Microservices by Sam Newman). A consistent theme across these is that loosely coupled distributed systems, while having many benefits, will experience failure across the service boundaries, and such systems need to gracefully handle those failures. Netflix has some interesting back-of-the-envelope stats on failure rates: Network Failure Rates.

Even without an architectural shift to microservices, the modern SaaS application is simply more distributed because it integrates with third-party functionality. At Financial Engines we have a growing number of distributed connections to external services, even within the code that remains a monolith.

While we don’t want to prematurely engineer or over-engineer solutions, we don’t want to ignore sage advice either. So, to keep the system running smoothly and to recover from issues without an all-hands-on-deck emergency each time, we decided to investigate technologies that add resilience and runtime insight to the system.

That is where Hystrix comes into the picture.

Hystrix 101

Netflix has a lot of great documentation on Hystrix, and if you haven’t already visited the project wikis, definitely make it your next stop, after this blog. You can start at Netflix on Hystrix.

At a high level, Hystrix gives client code better control over how network service calls, or actually any functionality, can affect the client system. Hystrix is about the client. To afford a client better control, Hystrix strives for:

latency tolerance
fault tolerance
prevention of cascading failures

In sum, these attributes describe a resilient system. Again, the Netflix wikis have a lot of great detail.

In terms of code, this resilience is accomplished by wrapping any chunk of functionality in a HystrixCommand. While “any functionality” can include work other than a network service call, we’re focused on network calls below.

The HystrixCommand gives better control primarily through two patterns: the bulkhead pattern and the circuit breaker pattern. The bulkhead pattern isolates the system resources (thread pools, for example) that service a network call, so that one slow dependency cannot exhaust resources the rest of the client needs. The circuit breaker pattern keeps a client from hanging on a failing network service call and from repeating the failing action.
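To make the circuit breaker idea concrete, here is a minimal, JDK-only sketch. It is not how Hystrix implements its breaker (Hystrix trips on an error percentage over a rolling window), and the class name SimpleCircuitBreaker, its parameters, and the consecutive-failure rule are all assumptions made to keep the example short.

```java
import java.util.function.Supplier;

/**
 * Illustrative circuit breaker (hypothetical class, not part of Hystrix).
 * Opens after a fixed number of consecutive failures and allows a trial
 * call once the sleep window has elapsed.
 */
class SimpleCircuitBreaker {
  private final int failureThreshold;
  private final long sleepWindowMs;
  private int consecutiveFailures = 0;
  private long openedAtMs = -1; // -1 means the circuit has not been tripped

  SimpleCircuitBreaker(int failureThreshold, long sleepWindowMs) {
    this.failureThreshold = failureThreshold;
    this.sleepWindowMs = sleepWindowMs;
  }

  /** True while calls are being short-circuited to the fallback. */
  synchronized boolean isOpen() {
    return openedAtMs >= 0 && System.currentTimeMillis() - openedAtMs < sleepWindowMs;
  }

  /**
   * Runs the action unless the circuit is open; any failure yields the fallback.
   * The lock is held during the call for simplicity, which a real breaker would avoid.
   */
  synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
    if (isOpen()) {
      return fallback.get(); // short-circuit: do not touch the failing service
    }
    try {
      T result = action.get();
      consecutiveFailures = 0;
      openedAtMs = -1; // a success closes the circuit
      return result;
    } catch (RuntimeException e) {
      consecutiveFailures++;
      if (consecutiveFailures >= failureThreshold) {
        openedAtMs = System.currentTimeMillis(); // trip (or re-trip) the breaker
      }
      return fallback.get();
    }
  }

  public static void main(String[] args) {
    SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 60_000);
    for (int i = 0; i < 3; i++) {
      String result = breaker.call(() -> {
        throw new IllegalStateException("service down");
      }, () -> "fallback");
      System.out.println(result + " (open=" + breaker.isOpen() + ")");
    }
  }
}
```

Once the breaker is open, the failing action is no longer invoked at all, which is what protects both the client and the struggling service.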

In addition to resiliency, as services require more network calls to complete a single request, we want to execute those calls in parallel so we don’t pay the cost of sequential synchronous network requests. Fortunately, Hystrix integrates well with parallel, asynchronous processing.

Use Case

Financial Engines builds financial planning and portfolio management software that helps customers plan for retirement and manage their 401(k)s, IRAs and other investment accounts. There’s much more going on there, but that’s the high-level view. The nature of the fiduciary business is that there are legal concerns which require legal documents to be presented to the user of the system. In our use case, the legal documents are provided by a microservice that can be used by any number of client applications or services.

Implementation

The core task of using Hystrix is implementing command classes derived from HystrixCommand. It is a basic command pattern and the derived class implements a run() and a getFallback() method. Just to note, in the sample below, quite a bit of code was omitted for brevity and clarity, leaving those portions that should be of interest.

public class LegalDocRequestCommand extends HystrixCommand<LegalDocResponseDto> {
  private static final int DEFAULT_TIMEOUT_MS = 5000; // millisecond timeout
  private static final String HYSTRIX_GROUP_LEGAL_DOC_SERVICE = "LegalDocServiceGroup";
  LegalDocConfigurator configurator = null;
  /**
   * Parameters to the legal documents service.
   */
  String context;
  String sponsorId;
  String rkId;
  boolean includeIraManagement;

  protected LegalDocRequestCommand(LegalDocConfigurator configurator) {
    super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey(HYSTRIX_GROUP_LEGAL_DOC_SERVICE))
        .andCommandPropertiesDefaults(
            HystrixCommandProperties.Setter().withExecutionTimeoutInMilliseconds(configurator.getTimeout())));
    this.configurator = configurator;
  }

  /**
   * HystrixCommand run() API.
   */
  @Override
  protected LegalDocResponseDto run() {
    String responseStr = null;
    WebResource resource = createWebResource();
    // try to get the legal documents
    responseStr = resource.get(String.class);
    LegalDocResponseDto responseDto = mapResponseJsonToPojo(responseStr);
    return responseDto;
  }

  /**
   * HystrixCommand getFallback() API.
   */
  @Override
  protected LegalDocResponseDto getFallback() {
    // the failed execution exception is passed as the throwable so the root cause is logged
    logger.error("LegalDocRequestCommand failed: " + constructLegalDocServiceUrl(),
        getFailedExecutionException());
    LegalDocResponseDto emptyResponse = new LegalDocResponseDto();
    return emptyResponse;
  }

  /**
   * Builder pattern for creating the LegalDocRequestCommand since REST endpoint requires
   * several intrinsic type parameters. Enforces that all properties are set appropriately.
   */
  public static class Builder {
    private String context = null;
    private String sponsorId = null;
    private String rkId = null;
    private boolean includeIraManagement = true;

    public Builder() {
    }

    protected Environment getEnvironment() {
      return Environment.getInstance();
    }

    public Builder withContext(String context) {
      this.context = context;
      return this;
    }

    public Builder withSponsorId(String sponsorId) {
      this.sponsorId = sponsorId;
      return this;
    }

    public Builder withRkId(String rkId) {
      this.rkId = rkId;
      return this;
    }

    public Builder withIncludeIraManagement(boolean includeIraManagement) {
      this.includeIraManagement = includeIraManagement;
      return this;
    }

    /**
     * Build the LegalDocRequestCommand.
     */
    public LegalDocRequestCommand build() {
      // Create a LegalDocConfigurator from the system environment.
      LegalDocRequestCommand command = new LegalDocRequestCommand(new LegalDocConfigurator() {
        @Override
        public int getTimeout() {
          return getEnvironment().getLegalDocServiceTimeout(DEFAULT_TIMEOUT_MS);
        }
      });
      // transfer properties to the LegalDocRequestCommand
      assignProperties(command);
      // ensure the command properties are set appropriately
      validateProperties(command);
      return command;
    }

    /**
     * Validate the properties of the command. Perform all your input/property level validation here.
     * Exceptions thrown during construction will NOT count against the circuit breaker logic since
     * they happen before the invocation of the run() method.
     */
    private void validateProperties(LegalDocRequestCommand command) {
      if (StringUtils.isEmpty(command.context)) {
        throw new IllegalArgumentException("Missing context in LegalDocRequestCommand.");
      }
      if (StringUtils.isEmpty(command.sponsorId)) {
        throw new IllegalArgumentException("Missing sponsorId in LegalDocRequestCommand.");
      }
      if (StringUtils.isEmpty(command.rkId)) {
        throw new IllegalArgumentException("Missing rkId in LegalDocRequestCommand.");
      }
    }
  }
}

In the run() method, you code the functionality that exposes your client to failure. In our examples, this is essentially the network call to fetch the legal documents. Above, we are using the Jersey client library to make REST calls, so at its core the run() method makes a WebResource get() call.

The second method to implement is getFallback(). In the event of a failure calling the distributed service, this method gets invoked to return some set of data that’ll ideally allow the client code to complete its operation, albeit with potentially degraded functionality. This forces the discussion of some key distributed system design details, namely, what happens when a network call fails. And network calls will fail.

Historically, in most cases we just throw exceptions that bubble up to an “internal server error” and let the caller deal with that, which usually results in a suboptimal client experience. But, can we do better than that? Can the client code handle an empty response or some default data, and still have a functioning scenario? Either way, in an application built on an increasingly distributed system, we want to start thinking about how the customer or client experience can continue in the event of network service failures.

Getting back to the code, we want to be sure to log the invocation of getFallback() with the relevant root cause information so it is easier to diagnose a problem. Even though getFallback() is making your system look like everything is OK, there could still be a problem that you need to resolve.
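Stripped of Hystrix, the run()/getFallback() contract boils down to "try the call within a time budget, log the cause, and return a default." The sketch below illustrates that contract with only the JDK. The class and method names are hypothetical, and creating an executor per call is a simplification that real code would replace with a shared pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper illustrating "call with a time budget, else fall back". */
class TimeoutFallbackSketch {

  static <T> T callWithFallback(Callable<T> call, long timeoutMs, T fallback) {
    // One executor per call keeps the sketch short; share a pool in real code.
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<T> future = executor.submit(call);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException | ExecutionException e) {
      // Log the root cause: the fallback hides the failure from the caller,
      // but someone still needs to diagnose it.
      System.err.println("Call failed, using fallback. Cause: " + e);
      return fallback;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return fallback;
    } finally {
      future.cancel(true); // interrupt a call that is still running
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    String docs = callWithFallback(() -> {
      Thread.sleep(2_000); // stand-in for a slow remote service
      return "legal documents";
    }, 100, "no documents available");
    System.out.println(docs);
  }
}
```

The interesting design question is not the plumbing but the third argument: what default value lets the caller carry on in a degraded but functioning state.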

If you peruse the Hystrix wiki pages, you’ll also notice some advice on throwing exceptions in run(), and how to build the getFallback(). In the run() method, if there’s an error that you don’t want to count against the circuit breaker logic, then you need to throw a HystrixBadRequestException. All other exceptions factor into the circuit breaker logic to determine if the breaker should be opened, thus impeding calls to the service. Given this, perform as much validation up front as possible outside of the run() method.

Above you’ll notice that we added the builder pattern to the HystrixCommand. The LegalDocRequestCommand takes several String and boolean parameters that become query parameters to the REST call. The builder pattern adds clarity to the parameters and provides a consistent place to implement a validation step, so we can catch any errors before the run() operation.

Now the fun part: here’s what it looks like to use the LegalDocRequestCommand. Of course, the tests below require an environment setup that isn’t included.

/**
 * Run the command as a single blocking command.
 */
@Test
public void testSynchronousLegalDocRequestCommand() {
  LegalDocRequestCommand command =
      new LegalDocRequestCommand.Builder().withContext("FOO").withSponsorId("BAR").withRkId("BAZ")
          .withIncludeIraManagement(true).build();
  LegalDocResponseDto response = command.execute();
}

/**
 * Run the command as a single asynchronous command via a Future.
 */
@Test
public void testAsynchronousLegalDocRequestCommand() {
  LegalDocRequestCommand command =
      new LegalDocRequestCommand.Builder().withContext("FOO").withSponsorId("BAR").withRkId("BAZ")
          .withIncludeIraManagement(true).build();
  LegalDocResponseDto response = null;
  // processing starts but thread doesn't block
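  Future<LegalDocResponseDto> future = command.queue();
  try {
    // thread blocks waiting for response
    response = future.get();
  } catch (Exception e) {
    // don't swallow the failure silently; surface it to the test runner
    throw new AssertionError("Command did not complete", e);
  }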
}

/**
 * Run 2 commands in parallel, asynchronously, using Observables and combine the responses
 * into a single assembled response.
 */
@Test
public void testObservableLegalDocRequestCommand() {
  LegalDocRequestCommand command1 =
      new LegalDocRequestCommand.Builder().withContext("FOO1").withSponsorId("BAR1").withRkId("BAZ1")
          .withIncludeIraManagement(true).build();
  LegalDocRequestCommand command2 =
      new LegalDocRequestCommand.Builder().withContext("FOO2").withSponsorId("BAR2").withRkId("BAZ2")
          .withIncludeIraManagement(true).build();
  // typed Observables so the zipped result needs no casts
  Observable<LegalDocResponseDto> observable1 = command1.observe();
  Observable<LegalDocResponseDto> observable2 = command2.observe();
  Observable<LegalDocResponseDto> aggregateObservable = Observable
      .zip(observable1, observable2, new Func2<LegalDocResponseDto, LegalDocResponseDto, LegalDocResponseDto>() {
        @Override
        public LegalDocResponseDto call(LegalDocResponseDto docs1, LegalDocResponseDto docs2) {
          return LegalDocsAssembler.assemble(docs1, docs2);
        }
      });
  BlockingObservable<LegalDocResponseDto> blockingObservable = aggregateObservable.toBlocking();
  LegalDocResponseDto combinedDocs = blockingObservable.last();
}

The last test above is not actually part of the first production use case, but it demonstrates one of the benefits of building on HystrixCommands. As the system of microservices needs to call more microservices, we don’t want the response time to become the sum of all the individual network calls. There are probably many cases where we can make the network calls in parallel, asynchronously, and process the responses as required. This is somewhat analogous to the asynchronous behavior that is intrinsic to the JavaScript world. The RxJava GitHub project has lots of great links and information on this: RxJava.
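For readers without RxJava at hand, the same fan-out-and-combine shape can be sketched with the JDK’s CompletableFuture. This is a swapped-in technique, not what the production code uses, and the class, the slowCall stand-in for a remote service, and the string-joining combiner are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical sketch: run two slow calls in parallel and combine the results. */
class ParallelCallsSketch {

  /** Stand-in for a remote call that takes delayMs to answer. */
  static Supplier<String> slowCall(String result, long delayMs) {
    return () -> {
      try {
        Thread.sleep(delayMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      return result;
    };
  }

  /** Starts both calls at once and combines their results when both are done. */
  static String fetchBoth(Supplier<String> first, Supplier<String> second) {
    CompletableFuture<String> f1 = CompletableFuture.supplyAsync(first);
    CompletableFuture<String> f2 = CompletableFuture.supplyAsync(second);
    // thenCombine plays the role of Observable.zip in the RxJava test
    return f1.thenCombine(f2, (a, b) -> a + "+" + b).join();
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    String combined = fetchBoth(slowCall("docs1", 200), slowCall("docs2", 200));
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    // Elapsed time should be close to one call's latency rather than the sum of both.
    System.out.println(combined + " in about " + elapsedMs + " ms");
  }
}
```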

Resilience Without Hystrix

Of course, as software engineers, before introducing another framework into the tech stack, we need to understand whether it provides enough value to warrant the additional learning it demands. To get our heads around this, let’s take a look at what resilience would look like without Hystrix. The functionality we’d like to support would be:

latency tolerance (timeout control)
fault tolerance (exception handling and fallback data)
circuit breaker logic
bulkhead logic (thread pools with rejection)
parallel async requests

We are currently using the Jersey client (v1.19) to construct and execute calls to REST services. With Jersey, it is very straightforward to configure the connection timeout and read timeout. And, of course, the WebResource network operation can be wrapped in a try/catch block to handle any of the exceptions and return an appropriate fallback response. Regarding thread pools with rejection, the Jersey client supports thread pools for asynchronous requests, but not for synchronous requests. The thread pools have standard Java thread pool behavior regarding the queueing of requests, so you’d have to supply your own ThreadPoolExecutor with a bounded queue to enable the rejection of queued requests at some level.
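As a rough idea of what "your own ThreadPoolExecutor" involves, here is a JDK-only sketch of a fixed-size pool with a bounded queue that rejects work once both are full. The class and method names are hypothetical, the blocking task merely simulates a slow service call, and wiring such a pool into the Jersey client is not shown.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of bulkhead-style rejection with a bounded thread pool. */
class BulkheadPoolSketch {

  /** Fixed-size pool with a bounded queue; AbortPolicy rejects work once both are full. */
  static ThreadPoolExecutor newBulkheadPool(int threads, int queueLength) {
    return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<Runnable>(queueLength),
        new ThreadPoolExecutor.AbortPolicy());
  }

  /** Submits the given number of blocking tasks and returns how many were rejected. */
  static int countRejections(int threads, int queueLength, int attempts) {
    ThreadPoolExecutor pool = newBulkheadPool(threads, queueLength);
    CountDownLatch release = new CountDownLatch(1);
    int rejected = 0;
    try {
      for (int i = 0; i < attempts; i++) {
        try {
          pool.execute(() -> {
            try {
              release.await(); // simulate a service call that does not return
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          });
        } catch (RejectedExecutionException e) {
          rejected++; // a real caller would return its fallback here
        }
      }
    } finally {
      release.countDown(); // let the blocked tasks finish
      pool.shutdown();
    }
    return rejected;
  }

  public static void main(String[] args) {
    // One worker thread plus one queue slot: the third submission is rejected.
    System.out.println("rejected: " + countRejections(1, 1, 3));
  }
}
```

Rejecting quickly is the point of the bulkhead: callers get an immediate failure they can handle instead of piling up behind a dependency that has stopped responding.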

What about processing network service requests in parallel, asynchronously? This is functionality we feel will become important as we start to rely on more service calls to fulfill any single request. Jersey client 1.19 supports integration with Java Futures. As demonstrated below, it is straightforward to create an AsyncWebResource and interact via the returned Future.

Let’s see what the code would look like.

@Test
public void testResilientJerseyClient() {
  String url = "http://example.com/hello";
  Integer connectionTimeout = 5000;
  Integer readTimeout = 5000;
  ClientConfig clientConfig = new DefaultClientConfig();
  // initialize clientConfig
  clientConfig.getProperties().put(ClientConfig.PROPERTY_CONNECT_TIMEOUT, connectionTimeout);
  clientConfig.getProperties().put(ClientConfig.PROPERTY_READ_TIMEOUT, readTimeout);
  Client client = Client.create(clientConfig);
  WebResource resource = client.resource(url);
  String response = null;
  try {
    response = resource.accept("application/json").get(String.class);
  } catch (ClientHandlerException clientHandlerException) {
    response = getFallbackResponse();
  } catch (Exception e) {
    // handle other exceptions
  }
  System.out.println(response);
}

protected String getFallbackResponse() {
  String fallbackResponse = "sample fallback response";
  // determine appropriate response, perhaps based on a particular exception
  return fallbackResponse;
}

@Test
public void testResilientJerseyClientAsync() {
  String url = "http://example.com/hello";
  Integer connectionTimeout = 5000;
  Integer readTimeout = 5000;
  Integer threadPoolSize = 5; // limit async requests to pool
  ClientConfig clientConfig = new DefaultClientConfig();
  // initialize clientConfig
  clientConfig.getProperties().put(ClientConfig.PROPERTY_CONNECT_TIMEOUT, connectionTimeout);
  clientConfig.getProperties().put(ClientConfig.PROPERTY_READ_TIMEOUT, readTimeout);
  clientConfig.getProperties().put(ClientConfig.PROPERTY_THREADPOOL_SIZE,
      threadPoolSize); // use thread pool for async, and limit pool size
  Client client = Client.create(clientConfig);
  /**
   * Use the AsyncWebResource API for parallel async operations.
   */
  AsyncWebResource asyncResource = client.asyncResource(url);
  String response = null;
  Future<String> future = null;
  try {
    /**
     * Fire off this future, and any others that need to execute in parallel, asynchronously.
     */
    future = asyncResource.accept("application/json").get(String.class);
    /**
     * Get the responses from the fired off futures.
     */
    response = future.get(); // can take timeout params
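  } catch (ClientHandlerException clientHandlerException) {
    response = getFallbackResponse();
  } catch (ExecutionException executionException) {
    // failures on the async thread surface here, wrapped by Future.get()
    response = getFallbackResponse();
  } catch (Exception e) {
    // handle other exceptions, e.g. InterruptedException
  }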
  System.out.println(response);
}

This is not an exhaustive analysis, but clearly parts of the resilience requirements can be implemented without Hystrix. The resulting solutions, however, are not as extensive, or would require coding on your part to replicate some of what Hystrix provides, such as circuit breaker logic and bulkhead logic. And if I start cleaning up the above code to make it reusable across different service calls, it starts to look a lot like a HystrixCommand.

Oh, and we still have to make some SOAP and Spring Remoting calls. Remember SOAP? Since the above solution relies on the Jersey client, we’d also have to implement all of the above for SOAP requests. Without Hystrix, we don’t get the benefit of having the fundamentals of resilience abstracted away from any particular service protocol implementation, be it REST, SOAP, Spring Remoting, etc.

A couple areas we didn’t touch on are configuration and monitoring. Hystrix has an extensive configuration system built on Archaius, and supports monitoring the health of the client calls via a stream of monitoring data that can feed into the Hystrix dashboard. Details in these areas are left for another blog, but they are high value for a microservices system.

Key Takeaways

After coding up some actual cases, my engineering takeaway is that with Hystrix, we get resilience in a holistic code texture, with very complete resilience functionality. Because of the holistic code texture, resilience can be easily applied across a heterogeneous set of client network technologies. Don’t underestimate how ease leads to better adoption.

Getting started with Hystrix was very easy, and the defaults were adequate to get the first use case into production. Now we must tune the timeouts, thread pool sizes, etc., based on what we learn about load and response times. I’d like to emphasize that you shouldn’t just “bump up” these values to mask other problems in the system. Investigate and solve the root causes.

From an architectural perspective, introducing Hystrix has forced the detailed consideration of resilience. In fact, whether you use Hystrix or not, do not let your system default to client library or network timeouts. Consider, for example, that the default timeout for the Jersey client is infinity. Furthermore, examine what needs to happen if a service call fails. These considerations were definitely an underdeveloped muscle coming from the monolith world, where nearly everything is available and in-proc.
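The same advice applies to the JDK’s own HTTP client, where a timeout of 0 also means "wait forever." As a small sketch of setting timeouts explicitly, with a hypothetical helper name and example values that are not recommendations:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper that never leaves HTTP timeouts at their defaults. */
class ExplicitTimeoutsSketch {

  /** Prepares (but does not open) a connection with explicit timeouts. */
  static HttpURLConnection openWithTimeouts(String url, int connectMs, int readMs) {
    try {
      // openConnection() only creates the object; no network I/O happens yet
      HttpURLConnection connection =
          (HttpURLConnection) URI.create(url).toURL().openConnection();
      connection.setConnectTimeout(connectMs); // 0 would mean wait indefinitely
      connection.setReadTimeout(readMs);       // 0 would mean wait indefinitely
      return connection;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    HttpURLConnection connection = openWithTimeouts("http://example.com/hello", 2_000, 5_000);
    System.out.println("connect timeout ms: " + connection.getConnectTimeout());
    System.out.println("read timeout ms: " + connection.getReadTimeout());
  }
}
```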

Building out a microservices architecture shifts the dev/ops costs from managing a monolith, with all of its issues, to managing a suite of services, with all of those issues. Be proactive and head off the costs of maintaining a non-resilient system. An engineering organization can spend lots of time (=money), not meet business requirements, and foment frustration within the organization if a new architectural initiative is plagued with reliability issues.

While this is a very humble first usage of Hystrix, it reveals that even in a simple case, the benefits will outweigh the costs for us. Remember, don’t let one faulty circuit burn down the house.