
Resilience in a Microservices World

Introduction

[Figure: circuit breaker operation schematic]

Building software is often compared to building homes, and in building a home you don’t let a faulty circuit burn down the house. Decades of experience in the home building industry have shown that one small device, the circuit breaker, can protect an entire house.

Financial Engines is on the microservices journey for the typical business and architectural motivations: the cost of ownership of a monolithic system keeps growing because of the monolithic nature of the stack, with code, releases and teams all coupled together. One solution is to decompose the system into smaller pieces following the principles of microservices.

There are many great blogs and books explaining microservices architectures (we’ve been using Building Microservices by Sam Newman). A consistent theme across these is that loosely coupled distributed systems, while having many benefits, will experience failure across the service boundaries, and such systems need to gracefully handle those failures. Netflix has some interesting back-of-the-envelope stats on failure rates: Network Failure Rates.

Even without an architectural shift to microservices, the modern SaaS application is simply more distributed in terms of integrating with 3rd party functionality. At Financial Engines we have a growing number of distributed connections to external services, even within the code that remains a monolith.

While we don’t want to prematurely engineer or over engineer solutions, we don’t want to ignore sage advice either. So, to keep the system running smoothly, and recover from issues without an all-hands-on-deck emergency each time, we decided to investigate technologies to add resilience and runtime insights into the system.

That is where Hystrix comes into the picture.

Hystrix 101

Netflix has a lot of great documentation on Hystrix, and if you haven’t already visited the project wikis, definitely make it your next stop, after this blog. You can start at Netflix on Hystrix.

At a high level, Hystrix gives client code better control over how network service calls, or actually any functionality, can affect the client system. Hystrix is about the client. To afford a client better control, Hystrix strives for:

  • latency tolerance
  • fault tolerance
  • cascading failure prevention

In sum, these attributes describe a resilient system. Again, the Netflix wikis have a lot of great detail.

In terms of code, this resilience is accomplished by wrapping any chunk of functionality in a HystrixCommand. While “any functionality” can include work other than a network service call, we’re focused on network calls below.

The HystrixCommand gives better control primarily through 2 patterns: the bulkhead pattern and the circuit breaker pattern. The bulkhead pattern isolates system resources that are servicing a network call. The circuit breaker pattern keeps a client from hanging on a failing network service call and repeating the failing action.
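
To build intuition for the circuit breaker half of that, here is a minimal, hypothetical sketch of the state machine in plain Java. This is an illustration of the pattern only; Hystrix’s actual breaker uses rolling statistical windows and error percentages, not a simple consecutive-failure count.

```java
/**
 * A minimal, hypothetical circuit breaker: CLOSED until a run of failures
 * trips it OPEN, then after a sleep window a single trial call is allowed
 * (half-open). Not Hystrix's implementation; an illustration of the pattern.
 */
public class SimpleCircuitBreaker {
  private final int failureThreshold;
  private final long sleepWindowMs;
  private int consecutiveFailures = 0;
  private long openedAtMs = -1;

  public SimpleCircuitBreaker(int failureThreshold, long sleepWindowMs) {
    this.failureThreshold = failureThreshold;
    this.sleepWindowMs = sleepWindowMs;
  }

  /** Returns true if the protected call should be attempted. */
  public synchronized boolean allowRequest() {
    if (consecutiveFailures < failureThreshold) {
      return true; // breaker is closed
    }
    // breaker is open; allow a single trial call once the sleep window has elapsed
    return System.currentTimeMillis() - openedAtMs >= sleepWindowMs;
  }

  public synchronized void recordSuccess() {
    consecutiveFailures = 0; // close the breaker
    openedAtMs = -1;
  }

  public synchronized void recordFailure() {
    consecutiveFailures++;
    if (consecutiveFailures >= failureThreshold) {
      openedAtMs = System.currentTimeMillis(); // trip the breaker
    }
  }

  public static void main(String[] args) {
    SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(3, 5000);
    breaker.recordFailure();
    breaker.recordFailure();
    System.out.println(breaker.allowRequest()); // still closed: prints true
    breaker.recordFailure();                    // third failure trips the breaker
    System.out.println(breaker.allowRequest()); // open, inside sleep window: prints false
    breaker.recordSuccess();
    System.out.println(breaker.allowRequest()); // closed again: prints true
  }
}
```

The point is that once the breaker opens, the failing dependency stops being hammered, and the client fails fast to its fallback instead of hanging.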

In addition to resiliency, as services require more network calls to complete a single request, we want to execute those calls in parallel so we don’t pay the cost of sequential synchronous network requests. Fortunately, Hystrix integrates well with parallel, asynchronous processing.
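
The payoff of that parallelism can be sketched even in plain Java with CompletableFuture; the two fetch methods below are hypothetical stand-ins for network calls:

```java
import java.util.concurrent.CompletableFuture;

public class ParallelCallsSketch {

  // Hypothetical stand-ins for two network calls; each would normally block on I/O.
  static String fetchLegalDocs() { sleep(100); return "docs"; }
  static String fetchDisclosures() { sleep(100); return "disclosures"; }

  static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }

  public static void main(String[] args) {
    long start = System.currentTimeMillis();

    // Fire both calls in parallel...
    CompletableFuture<String> docs =
        CompletableFuture.supplyAsync(ParallelCallsSketch::fetchLegalDocs);
    CompletableFuture<String> disclosures =
        CompletableFuture.supplyAsync(ParallelCallsSketch::fetchDisclosures);

    // ...and combine the responses once both complete.
    String combined = docs.thenCombine(disclosures, (d1, d2) -> d1 + " + " + d2).join();

    long elapsed = System.currentTimeMillis() - start;
    System.out.println(combined); // prints "docs + disclosures"
    System.out.println("elapsed ms: " + elapsed); // roughly one call's latency, not two
  }
}
```

Hystrix’s queue() and observe() APIs, shown later, give the same shape of parallelism with resilience built in.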

Use Case

Financial Engines builds financial planning and portfolio management software that helps customers plan for retirement and manage their 401Ks, IRAs and other investment accounts. There’s much more going on there, but that’s a high level view. The nature of the fiduciary business is that there are legal concerns which require legal documents to be presented to the user of the system. In our use case, the legal documents are provided by a microservice that can be used by any number of client applications or services.

Implementation

The core task of using Hystrix is implementing command classes derived from HystrixCommand. It is a basic command pattern: the derived class implements a run() and a getFallback() method. Note that in the sample below quite a bit of code has been omitted for brevity and clarity, leaving the portions that should be of interest.

import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.sun.jersey.api.client.WebResource;

public class LegalDocRequestCommand extends HystrixCommand<LegalDocResponseDto> {

  private static final Logger logger = LoggerFactory.getLogger(LegalDocRequestCommand.class);

  private static final int DEFAULT_TIMEOUT_MS = 5000; // millisecond timeout
  private static final String HYSTRIX_GROUP_LEGAL_DOC_SERVICE = "LegalDocServiceGroup";

  LegalDocConfigurator configurator = null;
  /**
   * Parameters to the legal documents service.
   */
  String context;
  String sponsorId;
  String rkId;
  boolean includeIraManagement;

  protected LegalDocRequestCommand(LegalDocConfigurator configurator) {
    super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey(HYSTRIX_GROUP_LEGAL_DOC_SERVICE))
        .andCommandPropertiesDefaults(
            HystrixCommandProperties.Setter().withExecutionTimeoutInMilliseconds(configurator.getTimeout())));
    this.configurator = configurator;
  }

  /**
   * HystrixCommand run() API.
   */
  @Override
  protected LegalDocResponseDto run() {
    String responseStr = null;
    WebResource resource = createWebResource();
    // try to get the legal documents
    responseStr = resource.get(String.class);
    LegalDocResponseDto responseDto = mapResponseJsonToPojo(responseStr);
    return responseDto;
  }

  /**
   * HystrixCommand getFallback() API.
   */
  @Override
  protected LegalDocResponseDto getFallback() {
    logger.error("LegalDocRequestCommand failed: " + constructLegalDocServiceUrl() + " Root cause: ",
        getFailedExecutionException());
    LegalDocResponseDto emptyResponse = new LegalDocResponseDto();
    return emptyResponse;
  }

  /**
   * Builder pattern for creating the LegalDocRequestCommand since REST endpoint requires
   * several intrinsic type parameters. Enforces that all properties are set appropriately.
   */
  public static class Builder {
    private String context = null;
    private String sponsorId = null;
    private String rkId = null;
    private boolean includeIraManagement = true;

    public Builder() {}
    protected Environment getEnvironment() {
      return Environment.getInstance();
    }

    public Builder withContext(String context) {
      this.context = context;
      return this;
    }
    public Builder withSponsorId(String sponsorId) {
      this.sponsorId = sponsorId;
      return this;
    }
    public Builder withRkId(String rkId) {
      this.rkId = rkId;
      return this;
    }
    public Builder withIncludeIraManagement(boolean includeIraManagement) {
      this.includeIraManagement = includeIraManagement;
      return this;
    }

    /**
     * Build the LegalDocRequestCommand.
     */
    public LegalDocRequestCommand build() {

      // Create a LegalDocConfigurator from the system environment.
      LegalDocRequestCommand command = new LegalDocRequestCommand(new LegalDocConfigurator() {
        @Override
        public int getTimeout() {
          return getEnvironment().getLegalDocServiceTimeout(DEFAULT_TIMEOUT_MS);
        }
      });

      // transfer properties to the LegalDocRequestCommand
      assignProperties(command);
      // ensure the command properties are set appropriately
      validateProperties(command);

      return command;
    }

    /**
     * Validate the properties of the command. Perform all your input/property level validation here.
     * Exceptions thrown during construction will NOT count against the circuit breaker logic since
     * they happen before the invocation of the run() method.
     *
     */
    private void validateProperties(LegalDocRequestCommand command) {
      if (StringUtils.isEmpty(command.context)) {
        throw new IllegalArgumentException("Missing context in LegalDocRequestCommand.");
      }
      if (StringUtils.isEmpty(command.sponsorId)) {
        throw new IllegalArgumentException("Missing sponsorId in LegalDocRequestCommand.");
      }
      if (StringUtils.isEmpty(command.rkId)) {
        throw new IllegalArgumentException("Missing rkId in LegalDocRequestCommand.");
      }
    }
  }
}

In the run() method, you code the functionality that exposes your client to failure. In our examples, this is essentially the network call to fetch the legal documents. Above, we are using the Jersey client library to make REST calls, so at its core the run() method makes a WebResource get() call.

The second method to implement is getFallback(). In the event of a failure calling the distributed service, this method gets invoked to return some set of data that’ll ideally allow the client code to complete its operation, albeit with potentially degraded functionality. This forces the discussion of some key distributed system design details, namely, what happens when a network call fails. And they will fail.

Historically, in most cases we have simply thrown exceptions that bubble up to an “internal server error” and let the caller deal with that, which usually results in a suboptimal client experience. But can we do better? Can the client code handle an empty response or some default data, and still have a functioning scenario? Either way, in an application built on an increasingly distributed system, we want to start thinking about how the customer or client experience can continue in the event of network service failures.

Getting back to the code, we want to be sure to log the invocation of getFallback() with the relevant root cause information so it is easier to diagnose a problem. Even though getFallback() is making your system look like everything is OK, there could still be a problem that you need to resolve.

If you peruse the Hystrix wiki pages, you’ll also notice some advice on throwing exceptions in run(), and how to build the getFallback(). In the run() method, if there’s an error that you don’t want to count against the circuit breaker logic, then you need to throw a HystrixBadRequestException. All other exceptions factor into the circuit breaker logic to determine if the breaker should be opened, thus impeding calls to the service. Given this, perform as much validation up front as possible outside of the run() method.

Above you’ll notice that we added the builder pattern to the HystrixCommand. The LegalDocRequestCommand takes many primitive data type parameters that are query parameters to the REST call. The builder pattern adds clarity to the parameters and provides a consistent pattern for implementing a validation step so we can catch any errors before the run() operation.

Now the fun part, here’s what it looks like to use the LegalDocRequestCommand. Of course, the tests below require an environment setup that isn’t included.

/**
 * Run the command as a single blocking command.
 */
@Test
public void testSynchronousLegalDocRequestCommand() {
  LegalDocRequestCommand command = new LegalDocRequestCommand.Builder()
    .withContext("FOO")
    .withSponsorId("BAR")
    .withRkId("BAZ")
    .withIncludeIraManagement(true).build();

  LegalDocResponseDto response = command.execute();
}

/**
 * Run the command as a single asynchronous command via Future<>.
 */
@Test
public void testAsynchronousLegalDocRequestCommand() {
  LegalDocRequestCommand command = new LegalDocRequestCommand.Builder()
    .withContext("FOO")
    .withSponsorId("BAR")
    .withRkId("BAZ")
    .withIncludeIraManagement(true).build();

  LegalDocResponseDto response = null;

  // processing starts but the calling thread doesn't block
  Future<LegalDocResponseDto> future = command.queue();

  try {
    // thread blocks here waiting for the response
    response = future.get();
  } catch (Exception e) {
    throw new AssertionError("command failed", e);
  }

}

/**
 * Run 2 commands in parallel, asynchronously, using an Observable<> and combine the responses
 * into a single assembled response.
 */
@Test
public void testObservableLegalDocRequestCommand() {
  LegalDocRequestCommand command1 = new LegalDocRequestCommand.Builder()
    .withContext("FOO1")
    .withSponsorId("BAR1")
    .withRkId("BAZ1")
    .withIncludeIraManagement(true).build();

  LegalDocRequestCommand command2 = new LegalDocRequestCommand.Builder()
    .withContext("FOO2")
    .withSponsorId("BAR2")
    .withRkId("BAZ2")
    .withIncludeIraManagement(true).build();

  Observable<LegalDocResponseDto> observable1 = command1.observe();
  Observable<LegalDocResponseDto> observable2 = command2.observe();

  Observable<LegalDocResponseDto> aggregateObservable =
    Observable.zip(observable1, observable2, new Func2<LegalDocResponseDto, LegalDocResponseDto, LegalDocResponseDto>() {
      @Override
      public LegalDocResponseDto call(LegalDocResponseDto docs1, LegalDocResponseDto docs2) {
        return LegalDocsAssembler.assemble(docs1, docs2);
      }
    });

  BlockingObservable<LegalDocResponseDto> blockingObservable = aggregateObservable.toBlocking();

  LegalDocResponseDto combinedDocs = blockingObservable.last();

}

The last test above is actually not part of the first use case in production, but demonstrates one of the benefits of building on HystrixCommands. As the system of microservices needs to call more microservices, we don’t want the response time to become an accumulation of all the individual network calls. There are probably many cases where we can make the network calls in parallel, asynchronously, and process the responses as required. This is somewhat analogous to the asynchronous behavior that is intrinsic to the JavaScript world. The RxJava GitHub has lots of great links and information on this: RxJava.

Resilience Without Hystrix

Of course, as software engineers, before introducing another framework into the tech stack, we need to understand if the new framework provides enough value to warrant the additional learning that comes with a new framework. To get our heads around this, let’s take a look at what resilience would look like without Hystrix. The functionality we’d like to support would be:

  • latency tolerance (timeout control)
  • fault tolerance (exception handling and fallback data)
  • circuit breaker logic
  • bulkhead logic (thread pools with rejection)
  • parallel async requests

We are currently using the Jersey client (v1.19) to construct and execute calls to REST services. With Jersey, it is very straightforward to configure the connection timeout and read timeout. And, of course, the WebResource network operation can be wrapped in a try/catch block to handle any of the exceptions and return an appropriate fallback response. Regarding thread pools with rejection, the Jersey client supports thread pools for async requests, but not for sync requests. The thread pools have standard Java thread pool behavior regarding the queueing of requests, so you would have to supply your own ThreadPoolExecutor with a bounded queue to enable the rejection of requests at some level.
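
To illustrate that do-it-yourself bulkhead, here is a sketch of a bounded pool that rejects overflow rather than queueing it indefinitely (the pool and queue sizes are arbitrary, for illustration only):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BulkheadSketch {
  public static void main(String[] args) {
    // 2 worker threads and a queue of 2: at most 4 requests in flight,
    // everything beyond that is rejected instead of piling up.
    ThreadPoolExecutor bulkhead = new ThreadPoolExecutor(
        2, 2, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<Runnable>(2),
        new ThreadPoolExecutor.AbortPolicy()); // reject, don't block or discard silently

    int rejected = 0;
    for (int i = 0; i < 10; i++) {
      try {
        bulkhead.execute(() -> {
          // stand-in for a slow network call
          try { Thread.sleep(500); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
      } catch (RejectedExecutionException e) {
        rejected++; // this is where fallback logic would go
      }
    }
    System.out.println("rejected: " + rejected); // 10 submitted, capacity 4, so 6 rejected
    bulkhead.shutdownNow();
  }
}
```

This is the essence of the bulkhead: a slow dependency can saturate only its own small pool, not every request-handling thread in the process.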

What about processing network service requests in parallel, asynchronously? This is functionality we feel will become important as we start to rely on more service calls to fulfill any single request. Jersey client 1.19 supports integration with Java Futures. As demonstrated below, it is straightforward to create an AsyncWebResource and interact via the returned Future.

Let’s see what the code would look like.

@Test
public void testResilientJerseyClient() {
  String url = "http://example.com/hello";

  Integer connectionTimeout = 5000;
  Integer readTimeout = 5000;

  ClientConfig clientConfig = new DefaultClientConfig();

  // initialize clientConfig
  clientConfig.getProperties().put(ClientConfig.PROPERTY_CONNECT_TIMEOUT, connectionTimeout);
  clientConfig.getProperties().put(ClientConfig.PROPERTY_READ_TIMEOUT, readTimeout);

  Client client = Client.create(clientConfig);
  WebResource resource = client.resource(url);

  String response = null;

  try {
    response = resource.accept("application/json").get(String.class);
  } catch (ClientHandlerException clientHandlerException) {
    response = getFallbackResponse();
  } catch (Exception e) {
    // handle other exceptions
  }
  System.out.println(response);
}

protected String getFallbackResponse() {
  String fallbackResponse = "sample fallback response";
  // determine appropriate response, perhaps based on a particular exception
  return fallbackResponse;
}

@Test
public void testResilientJerseyClientAsync() {

  String url = "http://example.com/hello";
  Integer connectionTimeout = 5000;
  Integer readTimeout = 5000;
  Integer threadPoolSize = 5; // limit async requests to pool

  ClientConfig clientConfig = new DefaultClientConfig();

  // initialize clientConfig
  clientConfig.getProperties().put(ClientConfig.PROPERTY_CONNECT_TIMEOUT, connectionTimeout);
  clientConfig.getProperties().put(ClientConfig.PROPERTY_READ_TIMEOUT, readTimeout);
  clientConfig.getProperties().put(ClientConfig.PROPERTY_THREADPOOL_SIZE, threadPoolSize); // use thread pool for async, and limit pool size

  Client client = Client.create(clientConfig);

  /**
   * Use the AsyncWebResource API for parallel async operations.
   */
  AsyncWebResource asyncResource = client.asyncResource(url);

  String response = null;
  Future<String> future = null;

  try {
    /**
     * Fire off this future, and any others that need to execute in parallel, asynchronously.
     */
    future = asyncResource.accept("application/json").get(String.class);
    /**
     * Get the responses from the fired-off futures.
     */
    response = future.get(); // can take timeout params
  } catch (ClientHandlerException clientHandlerException) {
    response = getFallbackResponse();
  } catch (Exception e) {
    response = getFallbackResponse(); // don't swallow failures silently
  }
  System.out.println(response);
}

This is not an exhaustive analysis, but clearly parts of the resilience requirements can be implemented without Hystrix. The resulting solutions, however, are not as extensive, and would require coding on your part to replicate some of what Hystrix provides, such as the circuit breaker and bulkhead logic. And if I start cleaning up the above code to make it reusable across different service calls, it starts to look a lot like a HystrixCommand.

Oh, and we still have to make some SOAP and Spring Remoting calls. Remember SOAP? Since the above solution is reliant on the Jersey client, we’d have to also implement all of the above for SOAP requests. We don’t get the benefit of Hystrix abstracting the fundamentals of resilience away from any particular service protocol implementation, be it REST, SOAP, Spring Remoting, etc.

A couple areas we didn’t touch on are configuration and monitoring. Hystrix has an extensive configuration system built on Archaius, and supports monitoring the health of the client calls via a stream of monitoring data that can feed into the Hystrix dashboard. Details in these areas are left for another blog, but they are high value for a microservices system.

Key Takeaways

After coding up some actual cases, my engineering takeaway is that Hystrix gives us very complete resilience functionality in a consistent, holistic code texture. Because of that consistency, resilience can be easily applied across a heterogeneous set of client network technologies. Don’t underestimate how that ease leads to better adoption.

Getting started with Hystrix was very easy, and the defaults were adequate to get the first use case into production. Now we must tune the timeouts, thread pool sizes, etc., based on learnings about load and response times. I’d like to emphasize that you shouldn’t just “bump up” these values to mask other problems in the system. Investigate and solve the root causes.

From an architectural perspective, introducing Hystrix has forced a detailed consideration of resilience. In fact, whether you use Hystrix or not, do not let your system default to client library or network timeouts; the Jersey client default, for example, is an infinite timeout. Furthermore, examine what needs to happen if a service call fails. These considerations were definitely an underdeveloped muscle coming from the monolith world, where nearly everything is available and in-proc.

Building out a microservices architecture shifts the dev/ops costs from managing a monolith, with all of its issues, to managing a suite of services, with all of those issues. Be proactive and head off the costs of maintaining a non-resilient system. An engineering organization can spend lots of time (=money), not meet business requirements, and foment frustration within the organization if a new architectural initiative is plagued with reliability issues.

While this is a very humble first usage of Hystrix, it reveals that even in a simple case, the benefits will outweigh the costs for us. Remember, don’t let one faulty circuit burn down the house.


How we improved our EV charging station sharing with HipChat, AWS and ChargePoint API

Have you ever noticed that there are never enough EV charging stations at work?

You probably have if you own an electric car or a hybrid. Electric vehicles and plug-in hybrids have come a long way and are steadily growing in popularity. A combination of government incentives and exemptions from carpool lane rules makes them a great choice for commuters. As a result, it is a common employee perk in Silicon Valley for a company to offer free or subsidized charging at work. However, the number of available charging stations always seems to be dwarfed by the growing number of electric car drivers. Even if there are several chargers, the process for sharing them between employees is never perfect, and someone inevitably gets stuck without enough juice to drive home.

At Financial Engines’ Sunnyvale, CA headquarters, we have four charging stations managed by the ChargePoint network. At the time of this writing, there are roughly 35 EV drivers trying to get a charge on a daily basis. At first, our “sharing” process didn’t quite work. First, there was no visibility into when people plugged in and out. Our offices in Sunnyvale are far enough from the charging stations that you cannot see whether one is available at any given point. Second, not everyone bought into the whole “sharing” idea, and sometimes cars would continue to occupy the precious charging spots long after they were fully charged.

EV Concierge to the rescue

By experimenting with various options, and with a little bit of coding and iterating, we developed a nifty system that is fun to use and works much better – so much so that on most days, by late afternoon we have chargers available and nobody is questioning that “sharing” concept anymore. We call it – EV Concierge.

EV Concierge is an automated assistant that monitors our charger availability, manages an orderly queue of drivers who need a charge and occasionally shames you into moving your car in time for others to get a charge.

How it works

[Figure: EV Concierge system overview]

At the center of the system, we have HipChat, which is already used by nearly everyone in the company for daily collaboration. We set up a dedicated room, called “EV Charging” and that is where all the magic takes place.

It all starts with an employee typing an “ev add” command to put themselves in the queue. He (let’s call him Dave) does it right there in the HipChat window, the same way he would post a short message. In response, the system will put Dave at the bottom of the queue and display a complete listing of the queue in the same HipChat window. Now Dave knows where he is on the list.

In the background, we have a monitoring process running (EV Concierge), which communicates with the ChargePoint network via the XMPP protocol and listens as people plug in and out. In response to these events, EV Concierge notifies everyone in the room about a change in charger availability and also informs the next person in the queue that it is his or her turn. That is how Dave knows it is his turn. EV Concierge will automatically remove Dave from the queue as soon as it detects that Dave has actually plugged in.

Growing feature list

Adding new features to EV Concierge and using it daily has been equally fun. There is something to be said for being able to interact with your customers and iterate on the solution on a daily basis. Iterative development does not get better than that! Since the basic functionality was put in place, we have enhanced EV Concierge with a number of useful features. Here is the complete list:

  1. ev add/remove/list – driver queue management
  2. ev suspend/resume – keeps your place in the queue but lets the person behind you go ahead. This is useful when you are stuck in a meeting.
  3. ev next – calls out the next person in the queue. This happens automatically when a charger becomes available.
  4. Reminders:
    • When a charger is available for more than 10 minutes, EV Concierge will notify the room about precious time being wasted.
    • When a user has been charging for more than 2.5 hours, EV Concierge will (politely) ask them to move their car to let others partake in the experience.
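
The queue semantics above can be sketched with a small in-memory model (a hypothetical simplification of our persisted queue; names and method shapes are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of the EV Concierge driver queue. */
public class DriverQueue {
  private static class Entry {
    final String driver;
    boolean suspended = false;
    Entry(String driver) { this.driver = driver; }
  }

  private final List<Entry> entries = new ArrayList<>();

  /** "ev add": join the back of the queue. */
  public void add(String driver) { entries.add(new Entry(driver)); }

  /** "ev remove": leave the queue (also triggered when a plug-in is detected). */
  public void remove(String driver) { entries.removeIf(e -> e.driver.equals(driver)); }

  /** "ev suspend": keep your place but let the people behind you go first. */
  public void suspend(String driver) { setSuspended(driver, true); }

  /** "ev resume": become eligible again. */
  public void resume(String driver) { setSuspended(driver, false); }

  private void setSuspended(String driver, boolean suspended) {
    for (Entry e : entries) {
      if (e.driver.equals(driver)) { e.suspended = suspended; }
    }
  }

  /** "ev next": first non-suspended driver, or null if nobody is waiting. */
  public String next() {
    for (Entry e : entries) {
      if (!e.suspended) { return e.driver; }
    }
    return null;
  }

  public static void main(String[] args) {
    DriverQueue queue = new DriverQueue();
    queue.add("Dave");
    queue.add("Alice");
    queue.suspend("Dave");            // Dave is stuck in a meeting
    System.out.println(queue.next()); // prints Alice
    queue.resume("Dave");
    System.out.println(queue.next()); // prints Dave again: he kept his place
  }
}
```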

In fact, EV Concierge has been getting so smart, people sometimes talk back to it, not realizing it is a robot. 🙂

How we built it

If you are intrigued by this creative solution to a common workplace problem, you can build your own EV Concierge relatively easily. We are working to open source our implementation, but in the meantime, here is how to do it.

ChargePoint API

ChargePoint publishes their Web Services API, and it can be found here: https://na.chargepoint.com/UI/downloads/en/ChargePoint_Web_Services_API_Guide_Ver4.1_Rev4.pdf

If your company owns the chargers, most likely you have a support agreement with ChargePoint, and your facilities manager has the API key that you will need to integrate with the API. There are two ways to talk to ChargePoint: the SOAP API and the XMPP protocol. We use a combination of both because they expose different levels of information about drivers and their sessions. In fact, the level of detail you get in the API is highly dependent on your corporate support agreement with ChargePoint and will determine how sophisticated a system you can build. If you are a Java developer, the Smack API is an easy-to-use library for building an XMPP listener.

HipChat Integration

There are many ways to integrate with HipChat. For on-premise installations, you can extend HipChat with Hubot, a tiny process that listens for special words and executes commands. It uses CoffeeScript and is relatively easy to use. For cloud versions, you can use a feature called “slash commands”, which lets you map anything that starts with a slash (duh!) to a REST API call. Finally, to programmatically post messages to HipChat from EV Concierge, you can use the simple REST interface that HipChat supports out of the box. For extra fun, you can extend HipChat with custom icons and give your EV Concierge messages some branding (or character).

If your company uses Slack, Yammer, or Google Hangouts instead of HipChat – no worries. The same type of integration can be achieved with any of these systems. The key is to try not to introduce yet another communications interface in your workplace. If you want high adoption, stick with existing tools.

Queue Persistence

This can be done in a thousand different ways, but if you have access to the AWS ecosystem, it is an obvious choice. You can use a document storage DB like AWS Dynamo, put the persistence logic in an AWS Lambda function, slap a REST interface on top of it through AWS API Gateway and be done in a day or less. This was a hackathon project, so I know for a fact that it can be done in less than a day, even if you have never heard of Lambda or Dynamo before.

EV Concierge

Finally, the heart of the logic is implemented as a standalone Java program. It loads the initial charger status via ChargePoint SOAP API on start up and then goes into listening mode for events. With every event, it keeps track of who is plugged in, who is finished and who should be notified. To make HipChat messaging personal, you will need to map ChargePoint user ids that you get through the API to your internal HipChat user ids. This will allow you to automatically manage the queue when a driver is recognized and also send direct messages using HipChat “@” feature.

EV Concierge is deployed on the Amazon cloud via Elastic Beanstalk, which simplifies deployment and management of the process. One caveat: we noticed that the XMPP connection gets stale after prolonged use (12+ hours), so we built an automated restart at 6AM every morning to renew it. This works fine for us because EV Concierge can sleep at night ahead of a busy working day.
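
That daily restart boils down to computing the delay until the next occurrence of a fixed local time. A minimal sketch (the class and method names are hypothetical):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

/** Sketch of scheduling a daily restart: compute the delay until the next 6AM. */
public class RestartScheduler {

  static Duration delayUntilNextSixAm(LocalDateTime now) {
    LocalDateTime next = now.toLocalDate().atTime(LocalTime.of(6, 0));
    if (!next.isAfter(now)) {
      next = next.plusDays(1); // already at or past 6AM today, so restart tomorrow
    }
    return Duration.between(now, next);
  }

  public static void main(String[] args) {
    LocalDateTime now = LocalDateTime.of(2016, 6, 1, 14, 30);
    System.out.println(delayUntilNextSixAm(now)); // 15h30m until tomorrow's 6AM
    // In the real process, this delay would seed a ScheduledExecutorService
    // that tears down and re-creates the XMPP connection.
  }
}
```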

Conclusion

EV Concierge has been a fun little project to work on. It’s great to have a problem that is so well defined and actionable that it begs to be solved. It’s also a great feeling to build something that your fellow colleagues can use and enjoy on a daily basis. As I mentioned, since we adopted EV Concierge, many more drivers get a chance to charge and the process is much smoother and more fun for everyone. The list of proposed features is also steadily growing. Oh, if one can only find enough time…

As part of the hackathon, we put together a little presentation that describes EV Concierge in 3 minutes and shows it in action. Here is the video for you to enjoy.