How we improved our EV charging station sharing with HipChat, AWS and ChargePoint API

Have you ever noticed that there are never enough EV charging stations at work?

You probably have if you own an electric car or a hybrid. Electric vehicles and plug-in hybrids have come a long way and are steadily growing in popularity. A combination of government incentives and exemptions from carpool lane rules makes them a great choice for commuters. As a result, it is a common employee perk in Silicon Valley for a company to offer free or subsidized charging at work. However, the number of available charging stations always seems to be dwarfed by the growing number of electric car drivers. Even if there are several chargers, the process for sharing them between employees is never perfect, and someone inevitably gets stuck without enough juice to drive home.

At Financial Engines’ Sunnyvale, CA headquarters, we have four charging stations managed by the ChargePoint network. At the time of this writing, there are roughly 35 EV drivers trying to get a charge on a daily basis. At first, our “sharing” process didn’t quite work. First, there was no visibility into when people plugged in and out. Our offices in Sunnyvale are far enough from the charging stations that you cannot see whether one is available at any given point. Second, not everyone bought into the whole “sharing” idea, and sometimes cars would continue to occupy the precious charging spots long after they were fully charged.

EV Concierge to the rescue

By experimenting with various options, and with a little bit of coding and iterating, we developed a nifty system that is fun to use and works much better – so much so that on most days, by late afternoon we have chargers available and nobody is questioning that “sharing” concept anymore. We call it – EV Concierge.

EV Concierge is an automated assistant that monitors charger availability, manages an orderly queue of drivers who need a charge and occasionally shames you into moving your car in time for others to get a charge.

How it works


At the center of the system, we have HipChat, which is already used by nearly everyone in the company for daily collaboration. We set up a dedicated room, called “EV Charging” and that is where all the magic takes place.

It all starts with an employee typing an “ev add” command to put themselves in the queue. He (let’s call him Dave) does it right there in the HipChat window, the same way he would post a short message. In response, the system will put Dave at the bottom of the queue and display a complete listing of the queue in the same HipChat window. Now Dave knows where he is on the list.

In the background, we have a monitoring process running (EV Concierge), which communicates with the ChargePoint network via the XMPP protocol and listens as people plug in and unplug. In response to these events, EV Concierge notifies everyone in the room about a change in charger availability and also informs the next person in the queue that it is his or her turn. That is how Dave knows that it is his turn. EV Concierge will automatically remove Dave from the queue as soon as it detects that Dave has actually plugged in.

Growing feature list

Adding new features to EV Concierge and using it daily has been equally fun. There is something to be said about being able to interact with your customers and iterate on the solution on a daily basis. Iterative development does not get better than that! Since basic functionality was put in place, we enhanced EV Concierge with a number of useful features. Here is a complete list:

  1. ev add/remove/list – driver queue management
  2. ev suspend/resume – keeps your place in the queue but lets the person behind you go ahead of you. This is useful when you are stuck in a meeting.
  3. ev next – calls out the next person in the queue. This happens automatically when a charger becomes available.
  4. Reminders:
    • When a charger has been available for more than 10 minutes, EV Concierge will notify the room about precious time being wasted.
    • When a user has been charging for more than 2.5 hours, EV Concierge will (politely) ask him or her to move their car to let others partake in the experience.
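
To make the commands above concrete, here is a minimal sketch of how the queue behind them could be modeled in Java. It is an illustration and not our actual implementation: the EvQueue class, its method names and the in-memory list are all assumptions, and a real deployment would persist the queue somewhere durable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical in-memory model of the driver queue behind the "ev" commands. */
class EvQueue {
    private static final class Entry {
        final String driver;
        boolean suspended;
        Entry(String driver) { this.driver = driver; }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** "ev add": append the driver to the bottom of the queue unless already queued. */
    public synchronized boolean add(String driver) {
        if (find(driver).isPresent()) return false;
        entries.add(new Entry(driver));
        return true;
    }

    /** "ev remove": drop the driver from the queue. */
    public synchronized boolean remove(String driver) {
        return entries.removeIf(e -> e.driver.equals(driver));
    }

    /** "ev suspend" / "ev resume": keep the position but skip or re-include the driver. */
    public synchronized void setSuspended(String driver, boolean suspended) {
        find(driver).ifPresent(e -> e.suspended = suspended);
    }

    /** "ev next": the first driver in the queue who is not suspended, if any. */
    public synchronized Optional<String> next() {
        return entries.stream().filter(e -> !e.suspended).map(e -> e.driver).findFirst();
    }

    /** "ev list": the queue in order, marking suspended drivers. */
    public synchronized List<String> list() {
        List<String> out = new ArrayList<>();
        for (Entry e : entries) out.add(e.suspended ? e.driver + " (suspended)" : e.driver);
        return out;
    }

    private Optional<Entry> find(String driver) {
        return entries.stream().filter(e -> e.driver.equals(driver)).findFirst();
    }
}
```

Modeling suspend as a flag on the entry, instead of removing and re-adding the driver, is what lets a suspended driver keep their place while the person behind them is called first.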

In fact, EV Concierge has been getting so smart, people sometimes talk back to it, not realizing it is a robot. 🙂

How we built it

If you are intrigued by this creative solution to a common workplace problem, you can build your own EV Concierge relatively easily. We are working to open source our implementation, but in the meantime, here is how to do it.

ChargePoint API

ChargePoint publishes its Web Services API guide, which can be found here: https://na.chargepoint.com/UI/downloads/en/ChargePoint_Web_Services_API_Guide_Ver4.1_Rev4.pdf

If your company owns the chargers, most likely you have a support agreement with ChargePoint and your facilities manager has the API key that you will need to integrate with the API. There are two ways to talk to ChargePoint: the SOAP API and the XMPP protocol. We use a combination of both because they expose different levels of information about drivers and their sessions. In fact, the level of detail you get in the API is highly dependent on your corporate support agreement with ChargePoint and will control how sophisticated a system you can build. If you are a Java developer, Smack is an easy-to-use library for building an XMPP listener.

HipChat Integration

There are many ways to integrate with HipChat. For on-premise installations, you can extend HipChat with Hubot. It is a small bot process that can listen for special words and execute commands. It is scripted in CoffeeScript and is relatively easy to use. For cloud versions, you can use a feature called “slash” commands, which lets you map anything that starts with a “slash” (duh!) to a REST API call. Finally, to programmatically post messages to HipChat from EV Concierge, you can use the simple REST interface that HipChat supports out of the box. For extra fun, you can extend HipChat with custom icons and make your EV Concierge messages include some branding (or character).
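
As a rough sketch of the message-posting piece, sending a room notification comes down to a single authenticated POST. The example assumes HipChat's v2 room notification endpoint; the HipChatNotifier class, the base URL, the room name and the token are placeholders, and the hand-rolled JSON escaping only handles quotes and backslashes.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds a HipChat room notification request. */
class HipChatNotifier {
    /** Builds (but does not send) the POST request; baseUrl, room and token are placeholders. */
    static HttpRequest buildNotification(String baseUrl, String room, String token, String message) {
        // Room names such as "EV Charging" contain spaces, so encode them for the URL path.
        String encodedRoom = URLEncoder.encode(room, StandardCharsets.UTF_8).replace("+", "%20");
        String json = "{\"message_format\":\"text\",\"notify\":true,\"message\":\""
                + escape(message) + "\"}";
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v2/room/" + encodedRoom + "/notification"))
                .header("Authorization", "Bearer " + token)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json, StandardCharsets.UTF_8))
                .build();
    }

    /** Minimal JSON string escaping: backslashes and double quotes only. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

The request can then be sent with java.net.http.HttpClient, for example HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding()).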

If your company uses Slack, Yammer, or Google Hangouts instead of HipChat – no worries. The same type of integration can be achieved with any of these systems. The key is to try not to introduce yet another communications interface in your workplace. If you want high adoption, stick with existing tools.

Queue Persistence

This can be done in a thousand different ways, but if you have access to the AWS ecosystem, it is an obvious choice. You can use a document store like Amazon DynamoDB, put the persistence logic in an AWS Lambda function, slap a REST interface on top of it through Amazon API Gateway and be done in a day or less. This was a hackathon project, so I know for a fact that it can be done in less than a day, even if you have never heard of Lambda or DynamoDB before.

EV Concierge

Finally, the heart of the logic is implemented as a standalone Java program. It loads the initial charger status via ChargePoint SOAP API on start up and then goes into listening mode for events. With every event, it keeps track of who is plugged in, who is finished and who should be notified. To make HipChat messaging personal, you will need to map ChargePoint user ids that you get through the API to your internal HipChat user ids. This will allow you to automatically manage the queue when a driver is recognized and also send direct messages using HipChat “@” feature.
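
The bookkeeping described above can be sketched roughly as follows. This is a simplified illustration and not our production code: the SessionTracker class, the shape of the plug-in and plug-out events, and the reminder wording are assumptions; only the 2.5 hour limit comes from the reminder rule listed earlier.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical tracker of who is plugged in, keyed by HipChat mention name. */
class SessionTracker {
    static final Duration MAX_CHARGE = Duration.ofMinutes(150); // 2.5 hours

    // ChargePoint user id -> HipChat mention name (the mapping you maintain yourself)
    private final Map<String, String> mentionByChargePointId;
    // mention name -> time the driver plugged in; TreeMap keeps reminder order stable
    private final Map<String, Instant> pluggedInAt = new TreeMap<>();

    SessionTracker(Map<String, String> mentionByChargePointId) {
        this.mentionByChargePointId = new HashMap<>(mentionByChargePointId);
    }

    /** Plug-in event: start tracking the driver; empty if the ChargePoint id is unknown. */
    Optional<String> onPlugIn(String chargePointUserId, Instant when) {
        Optional<String> mention = Optional.ofNullable(mentionByChargePointId.get(chargePointUserId));
        mention.ifPresent(m -> pluggedInAt.put(m, when));
        return mention;
    }

    /** Plug-out event: stop tracking the driver. */
    void onPlugOut(String chargePointUserId) {
        String mention = mentionByChargePointId.get(chargePointUserId);
        if (mention != null) pluggedInAt.remove(mention);
    }

    /** Reminder messages for every driver who has been charging longer than MAX_CHARGE. */
    List<String> overdueReminders(Instant now) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Instant> e : pluggedInAt.entrySet()) {
            if (Duration.between(e.getValue(), now).compareTo(MAX_CHARGE) > 0) {
                out.add("@" + e.getKey() + " you have been charging for over 2.5 hours, please move your car");
            }
        }
        return out;
    }
}
```

The mention name returned by onPlugIn is also what you would use to remove a recognized driver from the queue automatically.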

EV Concierge is deployed on the Amazon cloud via Elastic Beanstalk, which simplifies deployment and management of the process. One caveat: we noticed that the XMPP connection gets stale after prolonged use (12+ hours), so we had to build an automated restart at 6 AM every morning to renew it. It works fine for us because EV Concierge can sleep at night ahead of a busy working day.
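
One way to schedule such a daily renewal from inside the Java process is sketched below. It is an assumption-laden example and not how our restart is actually wired up: the DailyRestart class and the reconnect callback are hypothetical, and it reconnects in-process where our setup restarts the application.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that triggers a reconnect at 6 AM every day. */
class DailyRestart {
    /** Time from now until the next occurrence of restartAt (later today or tomorrow). */
    static Duration untilNext(LocalTime restartAt, LocalDateTime now) {
        LocalDateTime next = now.toLocalDate().atTime(restartAt);
        if (!next.isAfter(now)) {
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }

    /** Runs reconnect at the next 6 AM (local time) and every 24 hours after that. */
    static ScheduledFuture<?> scheduleAtSix(ScheduledExecutorService scheduler, Runnable reconnect) {
        long initialDelayMs = untilNext(LocalTime.of(6, 0), LocalDateTime.now()).toMillis();
        return scheduler.scheduleAtFixedRate(
                reconnect, initialDelayMs, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }
}
```

A fixed 24 hour period drifts by an hour across daylight saving changes, which is harmless for an early-morning renewal like this one.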


EV Concierge has been a fun little project to work on. It’s great to have a problem that is so well defined and actionable that it begs to be solved. It’s also a great feeling to build something that your colleagues can use and enjoy on a daily basis. As I mentioned, since we adopted EV Concierge, many more drivers get a chance to charge and the process is much smoother and more fun for everyone. The list of proposed features is also steadily growing. Oh, if only one could find enough time…

As part of the hackathon, we put together a little presentation that describes EV Concierge in 3 minutes and shows it in action. Here is the video for you to enjoy.



Enabling AWS X-Ray on AWS Lambda

As you have probably noticed, debugging and getting latency data for your microservices can be painful if they interact with multiple distributed services. For these types of microservices, you are usually forced to build your own performance testing application, add an inordinate amount of log statements, or simply cross your fingers and hope for the best. From one of AWS’s posts on the subject:

“Traditional debugging methods don’t work so well for microservice based applications, in which there are multiple, independent components running on different services.” – AWS Lambda Support For AWS X-Ray

As a result, AWS built AWS X-Ray which, according to them, solves this problem:

“AWS X-Ray makes it easy for developers to analyze the behavior of their distributed applications by providing request tracing, exception collection, and profiling capabilities. ” – AWS X-Ray Documentation

Back in December, AWS announced a preview release of AWS X-Ray. While this was great and awesome, if you were a serverless shop and used AWS Lambda you were still out of luck. Fortunately, in May ’17, AWS Lambda support for AWS X-Ray was released. Instrumenting your app has never been easier.

Below, I will go through the steps to update your CloudFormation template and instrument a Java application. We will then take a quick tour of the reporting and search features of the X-Ray dashboard.

Update CloudFormation

While we could enable X-Ray via the AWS console, it is always better to have your application be fully deployable with a push of a button and a stack definition. On June 6, 2017, AWS CloudFormation released the TracingConfig property which, along with a permissions change, enables AWS X-Ray on your Lambda function.

Step 1: Enable TracingConfig

In your Lambda resource, you will add a new property called TracingConfig with the mode set to Active.

You will also add a DependsOn field for the execution role as the Lambda service checks permissions as soon as CloudFormation creates the Lambda function.

*Note: The default TracingConfig mode is PassThrough. This means that if your function is invoked by another service that has active tracing enabled, your Lambda function will send tracing information to X-Ray. But if you invoke your Lambda function directly, or through a service that does not have X-Ray enabled, it will not send tracing information.

  # The logical ID below is a placeholder; use your own resource name.
  demoXRayLambda:
    Type: "AWS::Lambda::Function"
    Properties:
      Handler: "demo.XRayLambda::handleRequest"
      Role: !Join ["", ["arn:aws:iam::", !Ref "AWS::AccountId", ":role/", !Ref roleLambdaExecutionPolicy ] ]
      Description: Cloud formation created lambda for demo-xray-lambda
      FunctionName: demo-xray-lambda
      MemorySize: 128
      Timeout: 140
      Code:
        S3Bucket: my.awesome.bucket.lambda.us-west-1
        S3Key: demo-xray-lambda/demo-xray-lambda-1.3.2.zip
      Runtime: java8
      TracingConfig:
        Mode: Active
    DependsOn:
    - roleLambdaExecutionPolicy

Step 2: Add AWS X-Ray permissions

Next we need to give our Lambda function permission for the xray:PutTraceSegments and xray:PutTelemetryRecords capabilities. Here I have added a new statement to my inline policy in the execution role.

  roleLambdaExecutionPolicy:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
        - Action: "sts:AssumeRole"
          Principal:
            Service: lambda.amazonaws.com
          Effect: Allow
      Policies:
      - PolicyName: demo-xray-lambda-policy
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
          - Action:
            - "xray:PutTraceSegments"
            - "xray:PutTelemetryRecords"
            Effect: Allow
            Resource: "*"

There are a couple of issues that can trip you up. First, IAM is a global service. As such, when you create a new role, it needs to be propagated to all regions. There is a possibility that your role has not been propagated to your stack’s region by the time CloudFormation starts to create your Lambda function. The Lambda service will throw an exception, and the stack will fail to create if it doesn’t have xray:PutTraceSegments permission. To get around this, you can either make your policy inline in your role resource, have two separate stacks for execution role/permissions and for your Lambda function, or reference an existing managed policy. I made my policy inline and have yet to run into an issue.

Another issue arises when you have an existing stack/role and want to add the X-Ray permissions and enable TracingConfig in the same change set. This fails 100% of the time. Instead, you will need to rename your role resource so that CloudFormation creates a brand new role instead of updating the existing one. As I mentioned with the previous issue, you need to have your policy inline instead of as a separate resource. You should also add a DependsOn attribute to your Lambda function to avoid parallel updates and ensure the role is fully created before the Lambda resource.

Instrument The Application

We will now start instrumenting our application by adding the necessary X-Ray libraries as well as adding a few lines of code to add more color to the traces. These libraries give you the mechanism to create your own custom segments to measure the performance of a subsection of your code. They allow you to add annotations which are indexed and enable you to search for subsets of your traces. They also allow you to add metadata to your subsegments which you can use for further debugging. For more information on how to instrument your application, please review the developer’s guide found here: http://docs.aws.amazon.com/xray/latest/devguide/xray-sdk-java.html

Step 1: Add the AWS X-Ray SDK to your application

The next thing we need to do is import the AWS X-Ray SDK so that we can start getting traces into our X-Ray Service Map. Update your build.gradle to add aws-xray-recorder-sdk-core and a few other libraries to your dependencies.

dependencies {
  compile 'com.amazonaws:aws-xray-recorder-sdk-core:1.1.2'
  // versions below are assumed to match the core artifact; the original listed none
  compile 'com.amazonaws:aws-xray-recorder-sdk-aws-sdk:1.1.2'
  compile 'com.amazonaws:aws-xray-recorder-sdk-aws-sdk-instrumentor:1.1.2'
}

At this point you could theoretically stop. You can push your code and you will begin to see traces in your X-Ray Service Map of your AWS::Lambda and AWS::Lambda::Function with a subsegment of Initialization. Per AWS documentation, this is because “the AWS SDK will dynamically import the X-Ray SDK to emit subsegments for downstream calls made by your function.” But wait, there is so much more we could be doing here.

Step 2: Add Custom Subsegments

Now let’s say that our function does a few things: downloads an image from S3, does some image manipulation, then pushes it back up to S3.

public void handleRequest(String key, Size size) {
  Image image = downloadImage(key);
  Image thumbnail = resizeImage(image, size);
  uploadImage(key, thumbnail); // hypothetical helper that pushes the result back up to S3
}

Because you imported aws-xray-recorder-sdk-aws-sdk-instrumentor, you will automagically get subsegments for the S3 API calls. You could, however, create your own custom subsegment for the image manipulation portion, like so:

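import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.entities.Subsegment;

  public Image resizeImage(Image image, Size size) throws SessionNotFoundException {
    // wrap the resize in a custom subsegment
    Subsegment subsegment = AWSXRay.beginSubsegment("Resize Image");
    try {
      return image.resizeMagic(size.getWidth(), size.getHeight());
    } catch (Exception e) {
      subsegment.addException(e); // record the failure on the trace
      throw e;
    } finally {
      AWSXRay.endSubsegment(); // always close the subsegment
    }
  }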

You will now see a subsegment for resizing the thumbnail.

Step 3: Add Annotations to your Subsegments

So now that you have custom subsegments, how do you tell them apart, for example large thumbnails versus small thumbnails? Enter annotations, which allow you to filter your traces down to a subset of your traffic.

*Note: you can only add annotations to subsegments, not to the root segment. I have seen some people create one subsegment that spans their entire handler, to which they add annotations and metadata, and then nest further subsegments for the different sections of that handler.

Simply updating our above code to this will give us this ability.

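  public Image resizeImage(Image image, Size size) throws SessionNotFoundException {
    Subsegment subsegment = AWSXRay.beginSubsegment("Resize Image");
    subsegment.putAnnotation("Size", size.toString()); // indexed, so usable in filter expressions
    try { return image.resizeMagic(size.getWidth(), size.getHeight()); }
    finally { AWSXRay.endSubsegment(); }
  }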

Step 4: Add Metadata to your Subsegments

Another useful tool is subsegment metadata. It can help you debug traces that, for example, contain exceptions. In our image resizing example we could add things like the source image size or file type. That way, when looking at traces with exceptions, we can drill down and try to narrow down the root cause.

subsegment.putMetadata("source", "size", image.getSize().toString());
subsegment.putMetadata("source", "fileType", image.getFileType());

Reporting on your application

Ok, now that we have enabled X-Ray and instrumented our application it is time to head over to the AWS UI and start learning about our application.

Service Map

After accessing your application a couple times, head over to the X-Ray dashboard in your AWS console. Make sure you are in the region where you deployed your microservice. You will start off on the Service map page. Here you will see something like the below with all the functions that have had hits in the last 5 minutes:

[Screenshot: X-Ray service map]

There are a couple things to note on this page. There is a search bar that you can use to filter requests either by service name, annotations or trace id. You can also change your time range to anything from the last 1 minute to the last 6 hours. Or you can put a specific day, a start time and the length of time which again can be anything from 1 minute to 6 hours.

You can also click on a given bubble in your service map and see additional details as well as filter by response type, fault or throttling.

[Screenshot: service details panel]

Indexing on annotations

At this point let’s take a look at filtering with annotations. In the search bar let’s type in the below:

service(id(name: "demo-xray-lambda", type: "AWS::Lambda")) { annotation.Size = "small" }

Then let’s change it to large and we will see a slightly higher response time.

service(id(name: "demo-xray-lambda", type: "AWS::Lambda")) { annotation.Size = "large" }

Let’s take that a step further and look for source images that are greater than 2 MB (this assumes the function also records an ImageFileSize annotation, in megabytes).

service(id(name: "demo-xray-lambda", type: "AWS::Lambda")) { annotation.ImageFileSize > 2 }

As I’m sure you are starting to notice, this level of instrumentation and granularity gives you a better understanding of your application’s response times, where some of your pain points are, and what you can improve on.

Drilling deeper

Now that we took a bird’s eye view of performance of our application as a whole, let’s drill down deeper into individual traces. You can get there by clicking on View Traces in your Service Details panel or by clicking in the left navigation on Traces:

[Screenshot: Traces view]

Here you will see all the requests that X-Ray chose to sample. You can click on an individual trace by clicking on its ID. This could look something like the below image depending on your application.

[Screenshot: timeline of an individual trace]

Here you can see each subsegment, its response times, and at what point in your service response time it executed. Also, if you click on a subsegment that, for example, you added annotations or metadata to in your code, you will get a popup panel that will allow you to view that data.

[Screenshot: subsegment annotations and metadata panel]

One use case for this is to filter by error or fault and then click into the subsegments where we added source image data as metadata, to get a better idea of where the problem originates.

service(id(name: "demo-xray-lambda", type: "AWS::Lambda::Function")) { error = true }


As I’m sure you have noticed, with very little investment you can get pretty powerful visibility into your distributed application’s performance. AWS has simplified this process to the point where debugging, tracing requests and viewing the performance of a collection of services in one view can happen with just a few lines of code and a few clicks.

I hope this simple getting started guide gets you up and running. Let us know in the comment section below if you found this helpful, or if you have any suggestions or questions.


Using RestKit with Swift

What is RestKit?

RestKit is an Objective-C framework that simplifies interaction with RESTful web services. It combines a clean, simple HTTP request/response API with a powerful object mapping system that reduces the amount of code you need to write to ‘get stuff done’. RestKit does a lot of the heavy-lifting (integrated network operations, JSON/XML parsing, object-mapping, etc.) for you while allowing you to think more in terms of your application’s data model and worry less about the details of sending requests, parsing responses, and building representations of remote resources. Additionally, it has a powerful wrapper for Core Data.

At the time of this article, it is still an Objective-C framework that requires additional configuration to be used in a Swift project.

This article explains, with an example, how RestKit can be used in a Swift project to make a REST call and map the response to a desired model type. To accomplish this, we will be using an Objective-C bridging header file that exposes the required headers to Swift.

This example uses Xcode 7.3.1 and Swift 2.2.

Set up a Swift project via Xcode

Step 1: Create a Single View Application project in Xcode.


Step 2: Remember to choose Swift for your Language preferences.


Install RestKit

The recommended way to install RestKit is via the CocoaPods package manager as it provides flexible dependency management.

Step 1: Install CocoaPods (if not already available).

NOTE: CocoaPods is built with Ruby and can be installed with the default Ruby available on macOS. If you encounter problems during installation, please visit this guide: https://guides.cocoapods.org/using/troubleshooting#installing-cocoapods

$ [sudo] gem install cocoapods
$ pod setup

This example uses CocoaPods 1.0.1 and Ruby 2.3.1.

Step 2: Go to the directory of your Xcode project and create a Podfile.

$ cd /path/to/my/Project
$ pod init

Step 3: Edit your Podfile to include RestKit.

$ open Podfile -a Xcode
platform :ios, '9.0'
use_frameworks!
target 'MyProject' do   # placeholder: use your app target's name
  pod 'RestKit'
end

Step 4: Install RestKit into your project.

$ pod install

Step 5: Open your project in Xcode using the .xcworkspace file.

Create an Objective-C bridging header

Xcode will automatically configure an Objective-C bridging header if you create an Objective-C file in a Swift project.

Step 1: Create an Objective-C file inside your project folder by right-clicking on the folder and selecting “New File…”.


Step 2: In the dialog that shows up, select Objective-C file under iOS -> Source category. Click Next, provide a name for the file and complete the file creation process.


Step 3: Xcode will automatically offer to create a bridging header file. If you accept, Xcode creates the file and names it as “<Project-module>-Bridging-Header.h”.


Step 4: (Optional) You may now delete the dummy Objective-C file created in Step 1 of this section.

Import RestKit

Step 1: Open the bridging header file and import RestKit.

@import RestKit;

Step 2: Build your project and make sure there are no errors.

Congratulations! You have now integrated RestKit in your Swift project.

Fetch a REST Resource

We are now going to make a simple GET request to JSONPlaceholder (https://jsonplaceholder.typicode.com) using RestKit. JSONPlaceholder has a “posts” API that returns a collection of posts. The structure of the “post” JSON returned is shown below:

{
  "userId": 1,
  "id": 1,
  "title":"sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
  "body":"quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
}

Create a model class

Create a Swift file and name it Post.swift. This file contains the model class which holds the data from the REST API. We will use RestKit’s object mapping capabilities to transform the above JSON into a Post instance.

class Post: NSObject {
    var userId = 0
    var id = 0
    var title = ""
    var body = ""

    // RestKit creates instances through the no-argument initializer
    override init() {
        super.init()
    }

    init(userId: Int, id: Int, title: String, body: String) {
        self.userId = userId
        self.id = id
        self.title = title
        self.body = body
        super.init()
    }
}

Define Object Mappings

NOTE: The purpose of this article is to demonstrate the use of RestKit in a Swift project. To keep things simple, the networking code has been placed inside the ViewController. Ideally, we would have a thin controller and create a separate networking layer that handles REST requests.

Henceforth, we will be placing all of the code below in the viewDidLoad() method inside ViewController.swift.

Let’s define our object mappings first:

let postMapping: RKObjectMapping = RKObjectMapping(forClass: Post.self)
postMapping.addAttributeMappingsFromArray(["userId", "id", "title", "body"])

RKObjectMapping is used to define the rules for transforming data within the parsed payload to attributes on an instance of Post. If the attributes do not precisely match the field names in the payload, addAttributeMappingsFromDictionary() may be used to define the appropriate mappings.

Create a Response Descriptor

The response descriptor verifies that an HTTP response (encountered in the HTTP success code range) matches the object mapping defined above. It also establishes the path pattern.

let statusCodes = RKStatusCodeIndexSetForClass(RKStatusCodeClass.Successful)
let resDescriptor = RKResponseDescriptor(mapping: postMapping, method: RKRequestMethod.GET, pathPattern:"/posts/:id", keyPath: nil, statusCodes: statusCodes)

Initialize an Object Manager

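The object manager gets initialized with a new AFHTTPClient, which in turn is initialized with the given base URL. The response descriptor created above must also be registered with the manager so that matching responses get mapped.

let url = NSURL(string: "https://jsonplaceholder.typicode.com")
let jsonPlaceholderManager = RKObjectManager(baseURL: url)
jsonPlaceholderManager.addResponseDescriptor(resDescriptor)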

GET Objects at Path

The object manager is used to perform a GET request with a URL for the given path. This creates an RKObjectRequestOperation object that manages the transmission of the HTTP request, deserialization of the response and parsing of the response through the object mapping engine.

jsonPlaceholderManager.getObjectsAtPath("/posts/1", parameters: nil, success: { (operation, mappingResult) -> Void in

     let post: Post = mappingResult.firstObject as! Post

     print("ID: \(post.id)")
     print("UserId: \(post.userId)")
     print("Title: \(post.title)")
     print("Body: \(post.body)")

}) { (operation, error) -> Void in
     print("Error: \(error.localizedDescription)")
}

Note that the mappingResult needs to be cast to our model type.


In this article, we learned how to configure a Swift project to use RestKit. We also used RestKit to fetch data from remote APIs and map the parsed payload to a model using object mappings.

Here is the example project with all of the code explained above.


Disaster Recovery Using Hybrid Cloud


Financial Engines recently celebrated its 20th anniversary. Those two decades reflect our growth en route to becoming the largest registered investment advisor in the US.
During those same two decades the technology industry has changed profoundly and we have adjusted along the way. One change we completed earlier in 2016 was moving our disaster recovery footprint to a hybrid cloud solution using AWS. This document describes that effort in more detail and the results we achieved.

Moving to IaaS

Our offerings have been web-based since inception. For hosting these web experiences we utilized top tier colocation providers. That relieved us from building and operating physical datacenters.
Today enterprise-capable IaaS is available from providers such as AWS. We are now on a journey to move “up the stack”, adopting IaaS to reduce the burden we bear for things like:

  • hardware (servers, network gear, storage) procurement
  • physical site design and engineering (space, power, cooling, rack design, cable management, etc)
  • hardware maintenance: replacing failed drives, DIMMs, CPUs, NICs, motherboards, switch blades
  • firmware maintenance: qualifying and applying updates/patches across all hardware devices
  • hypervisor work: licensing, installation, tuning, maintenance, patching, upgrades
  • physical storage: design, engineering, and maintenance for iSCSI boot and data disks and NFS/CIFS network attached storage

Moreover, when we are ready to decommission infrastructure we just call an API to terminate/free those resources. This eliminates physical maintenance at the end of hardware lifecycles.

Peak Colo

Our transition to IaaS marks early 2016 as the point of “Peak Colo” for Financial Engines.

Over the coming quarters we expect to:

  • require fewer racks in colocation facilities
  • buy fewer servers from Cisco, Dell, or IBM
  • consume fewer VMware licenses
  • spend more on AWS for IaaS resources
  • achieve a net savings in our infrastructure total cost of ownership (see chart below for details)

Lift and shift for DR

Our rebuild of the DR environment had a fixed timeline due to a colocation contract ending. We therefore focused our effort on a lift-and-shift approach and moved the Linux compute portion of our stack into a VPC. We connected that VPC using Direct Connect to a reduced colo footprint resulting in a seamless LAN spanning our colo space and AWS:


This hybrid posture converts roughly 80% of our servers from on-prem hosted to cloud-hosted. In doing so we trade capital for expense and ownership for rental.

For disaster recovery this trade is attractive since these resources are rarely needed (our DR utilization is < 10% for testing, drills, etc).

This lift-and-shift hybrid project has a residual footprint in our colo consisting of:

  • backend NetApp storage
  • large database hosts which are more difficult to run on EC2 (due to size, IOPS, and CPU requirements)
  • batch machines which currently run on Windows Server

Future revisions of our hybrid posture should enable more of this infrastructure to run on AWS.

Our previous generation disaster recovery consisted of a colocation-hosted footprint containing:

  • 6 racks
  • VMware compute on IBM blades
  • NetApp storage
  • RHEL subscription fees
  • Load balancers as hardware appliances

The new disaster recovery footprint built on a hybrid cloud consists of:

  • 1 rack of UCS and NetApp (tech refreshed to yield better density and performance)
  • EC2 compute (our upgrade to the latest Xeon E5 v3 hardware was just selecting from the M4/C4 instance families)
  • Ubuntu 14.04 LTS
  • Load balancing on ELBs

In terms of costs here is what the transformation looks like:


Our DR site on AWS uses the pilot light model which incurs a modest monthly expense.

In exchange for that pilot light expense we achieved large reductions in depreciation, engineering time, and colo expense.

Looking Ahead

Following this disaster recovery rebuild we are moving to other re-hosting projects such as:

  • dev and test environments
  • production, starting with CPU-intensive tiers of our footprint

We expect these new environments to utilize the same hybrid cloud architecture with similar results.

Related Work

In addition to our lift-and-shift projects we are also moving to cloud native substrates for net new functionality.

These projects are using high-level primitives in AWS such as:

  • Lambda
  • API Gateway
  • DynamoDB
  • S3
  • Kinesis

Look for future blog posts covering that work.



AWS Lambdas with a static outgoing IP

Take a spin around the technical universe, and you will see that serverless computing is all the rage these days. Serverless computing doesn’t mean that there are no servers running your code. In the most popular use of the word, it simply means that you, the developer, don’t have to worry about it. Someone else has, and will monitor your service and make sure you have the right infrastructure and scalability in place.

Public Cloud providers like AWS and Google are simplifying the process for developers to leverage this architectural design concept. Do a quick search on the “serverless” keyword and the most popular related topics are in fact AWS with IoT being a close second.



For software developers, serverless computing opens a world of possibilities as well as new security concerns. One of those concerns is how to whitelist your service’s IP address with third-party APIs on an infrastructure where IPs change quite frequently. For example, AWS published a post on this very subject, along with a REST endpoint that lists its IP ranges at a given moment, saying:

You can expect it to change several times per week…

So, should we then specify a range of IPs in the API whitelist? Well, that would basically allow all of AWS to hit that third-party API (not to mention that some third-party APIs do not accept a range at all). Not what you want, right?
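To get a feel for how broad such a whitelist would be, here is a minimal sketch that filters a document shaped like the one AWS publishes at https://ip-ranges.amazonaws.com/ip-ranges.json. The sample data uses documentation-reserved placeholder addresses, and the helper name prefixesFor is my own, so treat it as an illustration rather than a finished tool.

```javascript
'use strict';

// Sample shaped like AWS's ip-ranges.json. The prefixes are placeholder
// documentation addresses, not real AWS ranges.
const sampleRanges = {
  syncToken: '0000000000',
  createDate: '2016-07-03-00-00-00',
  prefixes: [
    { ip_prefix: '192.0.2.0/24', region: 'us-west-2', service: 'EC2' },
    { ip_prefix: '198.51.100.0/24', region: 'us-west-2', service: 'AMAZON' },
    { ip_prefix: '203.0.113.0/24', region: 'us-east-1', service: 'EC2' }
  ]
};

// Hypothetical helper: list the CIDR blocks for one region and service.
function prefixesFor(ranges, region, service) {
  return ranges.prefixes
    .filter(p => p.region === region && p.service === service)
    .map(p => p.ip_prefix);
}

console.log(prefixesFor(sampleRanges, 'us-west-2', 'EC2'));
```

Run against the real document, even a single region and service returns many blocks, all of which are shared with every other AWS customer in that region.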

Google’s implementation of serverless computing comes in the form of Google Cloud Functions, which was released in February 2016. At the time of this article, it is still an Alpha release and there is currently no way to define a static outgoing IP address. AWS’s implementation of serverless computing, AWS Lambda, has been in the wild for over a year now. As of February 2016, your Lambda functions can access VPC resources. What does that mean for us? Simply put, we can now put them in a private subnet in our VPC and route their outbound traffic through a NAT Gateway, in essence assigning a static outgoing public IP address to them!

As a POC of this feature I decided to have a little fun with my latest game addiction, Clash of Clans. Over the next couple paragraphs I’ll walk you through how I configured my AWS Lambda behind a static public IP address, to then hit Clash of Clans’ public APIs.

Architectural Design

For this project we will need the following resources:

  • A VPC with:
    • Public Subnet
    • Private Subnet
    • NAT Gateway
    • Elastic IP
    • 2 Routes (public/private)
    • Internet Gateway
  • Lambda
  • API Gateway


Following the diagram, at a high level, this is what we need to do and what these resources will do for us. First we will create a new VPC.

Next we will create 2 subnets. When you first create a subnet it is associated with the VPC’s default (main) route table, so at that point both subnets are effectively private. Since we want one of these subnets to be public, we will create an Internet Gateway and a new route table that points all traffic to this gateway, which we will then assign to our public subnet. Now any resource subsequently created in our public subnet can reach the internet, and as long as it has a public IP it will be reachable from the outside world.

Then we will create a NAT Gateway in the public subnet. Its job is to provide internet access to resources in our private subnet. It will need a public IP, which an EIP (Elastic IP) will provide. At this point we will update our default route table (assigned to our private subnet) to route all internet-bound traffic to our NAT Gateway.
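The routing described above can be modeled in a few lines. This is only a sketch of how the two route tables send traffic, assuming a VPC range of 10.0.0.0/16 and made-up target ids (igw-example, nat-example); it picks the most specific matching route, which is how VPC route tables choose between a local route and a default route.

```javascript
'use strict';

// Convert a dotted IPv4 address to an unsigned 32-bit integer.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// True when the address falls inside the CIDR block.
function cidrContains(cidr, ip) {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipToInt(ip) & mask) >>> 0) === ((ipToInt(base) & mask) >>> 0);
}

// Pick the target of the most specific route that matches the destination.
function resolveTarget(routeTable, destinationIp) {
  let best = null;
  for (const route of routeTable) {
    const bits = Number(route.destination.split('/')[1]);
    if (cidrContains(route.destination, destinationIp) &&
        (best === null || bits > best.bits)) {
      best = { bits: bits, target: route.target };
    }
  }
  return best ? best.target : null;
}

// Assumed VPC range and hypothetical gateway ids, for illustration only.
const publicRoutes = [
  { destination: '10.0.0.0/16', target: 'local' },
  { destination: '0.0.0.0/0', target: 'igw-example' }
];
const privateRoutes = [
  { destination: '10.0.0.0/16', target: 'local' },
  { destination: '0.0.0.0/0', target: 'nat-example' }
];

console.log(resolveTarget(privateRoutes, '8.8.8.8'));
console.log(resolveTarget(publicRoutes, '8.8.8.8'));
```

Traffic from the private subnet to any internet address resolves to the NAT Gateway, which is what gives the Lambda function its single outgoing IP.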

The last service we will need to configure is our AWS Lambda service. By using the microservice-http-endpoint blueprint, we will create a function that is publicly accessible with API Gateway. It will live in the private subnet of our newly minted VPC so that we can leverage the outgoing elastic IP address. The code will be very simple. It will make an authenticated HTTPS call to the Clash of Clans API and return the JSON object of the top ten international clans.

Resource Creation

Step 1: Create a new VPC

Head over to your AWS VPC dashboard and click on over to your list of VPCs. If you have never done anything with VPCs you will see a default VPC that AWS gives you out of the box. Click on the Create VPC link and enter a meaningful name for your VPC. For example I used:


Step 2: Create 2 Subnets

Now we are going to go to the Subnets page and create two subnets: one public and one private. (For availability purposes you would want multiple private subnets in different availability zones for your Lambda to run in. For simplicity’s sake we will stick to one here.) In the Subnet tab click on “Create Subnet”. For the name tag, make sure to include “Private Subnet” in one and “Public Subnet” in the other, choose our newly created VPC, and select an availability zone (us-west-2c for example). For the CIDR block of each new subnet, stay within the VPC’s IP range and increment the 3rd octet by 1 from the highest value already used by a subnet in the same VPC. For example:
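The “increment the 3rd octet” rule can be written down as a tiny helper. This sketch assumes every subnet in the VPC is a /24 carved out of the same /16, and the addresses are illustrative only; the function name nextSubnetCidr is mine.

```javascript
'use strict';

// Given the /24 CIDR blocks already used in a VPC, return the next free one
// by incrementing the highest 3rd octet. Assumes all subnets share one /16.
function nextSubnetCidr(existingCidrs) {
  if (existingCidrs.length === 0) {
    throw new Error('Need at least one existing subnet to derive the range');
  }
  const parsed = existingCidrs.map(cidr => cidr.split('/')[0].split('.').map(Number));
  const highestThird = Math.max(...parsed.map(octets => octets[2]));
  if (highestThird >= 255) {
    throw new Error('No /24 blocks left in this range');
  }
  const [first, second] = parsed[0];
  return `${first}.${second}.${highestThird + 1}.0/24`;
}

console.log(nextSubnetCidr(['10.0.0.0/24']));
```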



Step 3: Create an Internet Gateway

Next we will head over to the Internet Gateway view, click on Create Internet Gateway and tag it with a descriptive tag.


Then, we will click on our new internet gateway, and click on Attach to VPC, to attach it to our newly minted VPC like this:


Step 4: Create a public Route Table and Assign it to our public subnet

Now that that is done we can head over to our Route Tables view and click on Create Route Table, giving it a descriptive tag and linking it to our VPC:


Then we need to edit this route table to point it to our new internet gateway. Click on the new route table, click on the Routes tab, and click Edit. Then add a new route that sends all traffic (0.0.0.0/0) to our internet gateway, and save it:


Now, click on the Subnet Associations tab, click Edit and, by ticking the check box by your public subnet and clicking Save, you will associate this new route table with your public subnet.


Step 5: Create a NAT Gateway

First, take note of your public subnet’s id. You can see in my previous screenshot that it is subnet-8225a8da. Head over to the NAT Gateway view and click on Create NAT Gateway. On the creation screen go ahead and paste in your subnet id and click on “Create New EIP.” For example, here is my new NAT Gateway with its public IP:


On the confirmation screen copy your NAT Gateway id, and let’s go back and edit the default route table that was created along with our VPC. Click on the default route table (you will see the Main column for it says Yes), click on the Routes tab, and click Edit. Then add a new route that sends all traffic (0.0.0.0/0) to our NAT Gateway id, and save it:


Lambda and API Gateway Configuration

Ok, now that our VPC is configured we can head over to Lambda and create/configure our new function.

Step 1: Create a new Lambda Function

On the Lambda Dashboard, click on Create a Lambda Function. On the first page, called “Select blueprint,” select the microservice-http-endpoint. This will then prompt you for API Gateway configuration options as well as Lambda configuration options.

Clicking next, I then configure the trigger (API Gateway options) giving it an API name of TechBlog-Lambda-IP, a resource name of /top10ClashOfClans, set the method type to GET and deployment to prod. Lastly, for the purposes of this demo, I’m setting the Security to Open. (Note: In the real world you wouldn’t want to do this, instead you would want to use either IAM, Open with access key, or implement CORS).


Step 2: Configure our Lambda Function

On the next page we then configure our Lambda function. First, I give my function a name (e.g. topTenClashOfClans), select Node.js as my runtime and after selecting “Edit code inline” for the code entry type, I paste in the below code (NOTE: ideally your key doesn’t reside as clear text in your code, instead you can leverage KMS encryption, but that’s a post for another day):

'use strict';
var https = require('https');
console.log('Loading function');

exports.handler = function(event, context) {
  var options = {
    method: 'GET',
    hostname: 'api.clashofclans.com',
    port: null,
    path: '/v1/locations/32000006/rankings/clans?limit=10',
    headers: {
      // Replace SUPER_SECRET_KEY with the token from the developer portal
      authorization: 'Bearer SUPER_SECRET_KEY',
      'cache-control': 'no-cache'
    }
  };
  var url = 'https://' + options.hostname + options.path;
  console.log('start request to ' + url);

  var req = https.request(options);
  req.on('response', function(res) {
    var chunks = [];

    // Collect the response body as it streams in
    res.on('data', function(chunk) {
      chunks.push(chunk);
    });

    res.on('end', function() {
      var body = Buffer.concat(chunks);
      console.log('Got response: ' + body.toString());
      // Hand the parsed JSON back to API Gateway
      context.succeed(JSON.parse(body.toString()));
    });
  });
  req.on('error', function(e) {
    console.log('Got error: ' + e.message);
    context.done(null, 'FAILURE');
  });
  // Nothing is sent until the request is ended
  req.end();

  console.log('end request to ' + url);
};

Below the code block you will now need to create a role and configure your VPC settings. I selected our newly minted VPC along with our private subnet. For example:


Below that I also selected the default security group. (Note: in production you would want to have this tightened down a bit more, for example by only allowing inbound and outbound traffic via HTTPS.) Finally click Next, verify your details and click on Create function.

So we did quite a lot of configurations between our Lambda service and in our VPC, but it is important to note that this was all done manually to better understand the interconnectivity of each resource. Ideally you would instead use something like AWS CloudFormation, Terraform by HashiCorp, etc. where you can spin up your complete stack or even subsequently destroy it with one click.

Clash Of Clans API configuration

Hopping on over to the Clash of Clans developer portal, I now need to tell them about my new IP address as well as download my auth key.

Step 1: Create a Key

To create a key, I need to give my key a name, and description and tell them the IP address I’ll be using. (At this point you might want to create separate keys for each environment and use your API Gateway configuration to tell your Lambda service what environment it is running in and therefore which key it should use.) So for example their UI looks like this:


Step 2: Get Authentication Token

Upon clicking Create Key I now get my token which I’ll update my lambda code with:


Testing it out

Now that my VPC has been configured, my lambda function is configured, Clash of Clans knows my IP and I have my super secret key, I can now test out my API. Head over to the Triggers tab of your Lambda service and you can see your API Gateway URL. This is the URL you will call from your application:


If I head over to my browser and paste it in, here is my snazzy JSON response from my lambda service, from Clash of Clans:

         "name":"Kings Rock",
         "name":"MEGA EMPIRE",
         "name":"HOUSE of CLOUDS",
         "name":"Come & Take It",
         "name":"GULF KNIGHTS",
         "name":"kurdistan is 1",
         "name":"BRASIL TEAM",
         "name":"FACÇÃO CENTRAL",
         "name":"Los Inmortales",
         "name":"Req and Leave",

In Conclusion

So there you have it. The fact that our API call returned successfully proves that the Clash of Clans APIs were able to verify that 1) we called from the IP we said we would call from, 2) we used the token they created for us, and 3) we made our call via SSL.

Granted, we took quite a few shortcuts in this implementation where security could be tightened up. This is in no way a productized implementation. It is, instead, an oversimplified POC demonstrating the new relationship between AWS Lambda and AWS VPCs. We have shown that we can use AWS VPC infrastructure to configure an AWS Lambda function to use a static outgoing IP. This allows for tighter security when locking down who has rights to access your APIs. In our business case we can now say that a microservice connecting via SSL, using security token X, and calling from IP X.X.X.X is a fully trusted consumer that can access our financial resources, and that any other connection is blocked from accessing those same resources.
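From the API provider’s side, that three-part check might look something like the sketch below. The request and policy shapes and the function names are assumptions made for illustration, not how Clash of Clans or our own services implement it.

```javascript
'use strict';
const crypto = require('crypto');

// Compare two strings in constant time so the token check does not leak
// timing information.
function safeEqual(a, b) {
  const left = Buffer.from(String(a));
  const right = Buffer.from(String(b));
  return left.length === right.length && crypto.timingSafeEqual(left, right);
}

// Hypothetical shapes: request = { secure, token, sourceIp },
// policy = { token, allowedIp }.
function isTrustedConsumer(request, policy) {
  return request.secure === true &&
    safeEqual(request.token, policy.token) &&
    request.sourceIp === policy.allowedIp;
}

// Placeholder values only.
const policy = { token: 'token-x', allowedIp: '203.0.113.10' };
console.log(isTrustedConsumer(
  { secure: true, token: 'token-x', sourceIp: '203.0.113.10' }, policy));
```

All three conditions must hold, so a leaked token alone is not enough to call the API from another address.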

Feel free to take a spin with the above instructions and provide any comments or feedback on this implementation.


Integrate with Ease – AWS + Twilio, Slack and IoT

Amazon Web Services (AWS) is changing the way engineers develop solutions.  It is so easy to prototype and have a scalable architecture with very little hand holding from the Office of the CTO or Systems Engineers.  This, in turn, fosters the DevOps culture within the organization.

One of the prototypes we’ve tried integrates AWS with Twilio, Slack and an Intel Edison + Grove IoT device.  We cannot take all the credit here, because this was inspired by a recent trip to AWS’s San Francisco pop-up loft.  They had a zombie apocalypse themed workshop, where we were mostly just going through the motions, so afterwards we took it a step further: we applied our own company use cases and dissected each step.  We also thought that learning how some of these technology companies work would give us a fresh perspective on what is moving and shaking the software industry.

The discussion below assumes some knowledge of the AWS services used.  It does not explain each service in detail, but here’s a quick refresher taken from AWS themselves:

  1. Lambda – “Lets you run code without provisioning or managing servers. You pay only for the compute time you consume – there is no charge when your code is not running”
  2. API Gateway – “Makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale”
  3. SNS – “Pub-sub Service for Mobile and Enterprise Messaging”
  4. Dynamo DB – “Fast and flexible NoSQL database service”

Let’s get right to it, shall we?  First up…


Twilio is a technology company that allows programmable communications.  They are the interface for sending and receiving global SMS and MMS messaging from any app.

Who uses Twilio?

  1. Box
  2. Nordstrom
  3. OpenTable
  4. Intuit
  5. Uber
  6. EMC2
  7. Zendesk
  8. Coca-Cola

Main competitors:

  1. Nexmo
  2. Plivo

Use Case:

Our use case is to send a client their next scheduled appointment to speak with one of our investment advisors if they send a text message of “schedule”.  For this prototype, the date would have been previously set.

Solution Architecture



User-appt-schedule is the API gateway that fronts the lambda function named user-schedule-get.  The lambda function then fetches the schedule date from DynamoDB.


  1. Create your DynamoDB table with the phone number as the primary key and a column for the appointment date.
  2. Create your Node.js lambda function that gets the date from the DB given the phone number.

console.log('Loading function');
var aws = require('aws-sdk');
var ddb = new aws.DynamoDB({
  region: "us-west-2",
  params: {TableName: "participant-schedule-reminder"}
});

var theContext;

function dynamoCallback(err, response) {
  if (err) {
    console.log('error' + err, err.stack); // an error occurred
    theContext.fail(err); // end the invocation instead of waiting for a timeout
  } else {
    console.log('result: ' + JSON.stringify(response)); // successful response
    var scheduledDate = response.Item.scheduledDate.S;
    console.log('parsed response ' + scheduledDate);
    theContext.succeed("Your next retirement checkup is scheduled on " + scheduledDate);
  }
}

exports.handler = function(event, context, callback) {
  theContext = context;
  console.log("Request received:\n", JSON.stringify(event));
  console.log("Context received:\n", JSON.stringify(context));

  // Determine text
  var textBody = event.body;

  // Get the phone number, only the last 10 characters
  // (skips a two-character country prefix such as "+1")
  var phoneNumber = event.fromNumber.substring(2, 12);
  console.log('phone ' + phoneNumber);

  if (textBody.trim() == "schedule") {
    var params = {
      Key: {
        "phone-number": { N: phoneNumber }
      },
      AttributesToGet: ["scheduledDate"]
    };
    ddb.getItem(params, dynamoCallback);
  } else {
    theContext.succeed("Text 'schedule' to get your retirement check up date");
  }
};

3.  Create your API gateway that invokes the lambda function for each GET request.  The nuance here is that Twilio sends each request as URL-encoded parameters, while our lambda function requires a JSON object, so a conversion has to be done in the integration piece of the API gateway.  In the other direction, Twilio expects a reply in TwiML (an XML format), so every JSON response from the lambda function needs to be converted to TwiML.
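In API Gateway this conversion lives in the integration mapping, but the same two steps are easy to show in plain JavaScript. The sketch below assumes Twilio delivers the incoming message as URL-encoded fields named From and Body and accepts a TwiML reply built from the Response and Message elements; the function names are hypothetical, and the event keys (body, fromNumber) match the lambda function above.

```javascript
'use strict';
const querystring = require('querystring');

// Turn Twilio's URL-encoded parameters into the JSON event the lambda expects.
function toLambdaEvent(encodedParams) {
  const fields = querystring.parse(encodedParams);
  return { body: fields.Body, fromNumber: fields.From };
}

// Escape characters that are not allowed inside XML text.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Wrap the lambda's text reply in a TwiML document.
function toTwiml(message) {
  return '<?xml version="1.0" encoding="UTF-8"?>' +
    '<Response><Message>' + escapeXml(message) + '</Message></Response>';
}

// Placeholder phone number for illustration.
const event = toLambdaEvent('From=%2B14085550100&Body=schedule');
console.log(event);
console.log(toTwiml('Your next retirement checkup is scheduled on <date>'));
```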




  1. You can sign up for a trial account with Twilio, and you will get a phone number from them.
  2. You can then program this phone number with web hooks.  The URL below is the invoke URL from AWS for the API gateway.



Slack is an instant messaging and collaboration system.  It has teams and channels and robust search features.  Commands called “slash commands” can be programmed to forward the message typed in the chat box to an external source with the use of a web hook.


Who uses Slack?

  1. Airbnb
  2. CNN
  3. Buzzfeed
  4. EA Sports
  5. Ebay
  6. Harvard University
  7. Samsung
  8. Expedia
  9. Intuit

Main Competitors:

  1. HipChat
  2. Yammer
  3. Google Hangouts
  4. Facebook at work

Use Case:

Our use case is to send a client their next scheduled appointment to speak with one of our investment advisors if they send a slash command in the appropriate FE channel.  The date would have been previously set.

Solution Architecture:



  1. Create a slash command (“fngn” in this example).  You still wouldn’t have the URL at this point since that will come from AWS.  The token below will be copied over to AWS’ lambda function to verify that the request came from the right channel.



  1. Create a Dynamo DB with the slack handle as the primary key:
  2. Create a lambda function that takes the handle and looks it up in the DB.
  3. Create an API gateway to convert the URL-encoded form data from Slack to JSON for the lambda function, and to pass the function’s response back to Slack.
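Put together, the Slack steps above could be sketched as follows. This assumes the slash command payload arrives as URL-encoded fields that include token and user_name; lookupAppointment is a hypothetical stand-in for the DynamoDB lookup and is synchronous here only to keep the example short.

```javascript
'use strict';
const querystring = require('querystring');

// Verify the slash command token, then look up the caller's appointment.
function handleSlashCommand(formBody, expectedToken, lookupAppointment) {
  const fields = querystring.parse(formBody);
  if (fields.token !== expectedToken) {
    return { accepted: false, text: 'Request did not come from the expected Slack command.' };
  }
  const date = lookupAppointment(fields.user_name);
  const text = date
    ? 'Your next appointment is scheduled on ' + date
    : 'No appointment found for ' + fields.user_name;
  return { accepted: true, text: text };
}

// Placeholder data standing in for the DynamoDB table.
const appointments = { jdoe: '2016-08-01' };
const reply = handleSlashCommand(
  'token=example-token&user_name=jdoe&command=%2Ffngn',
  'example-token',
  handle => appointments[handle]
);
console.log(reply.text);
```

Checking the token first means the lookup only runs for requests that carry the value Slack generated for this command.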


Intel Edison + Grove IoT device

Intel Edison is a small compute module built for wearables and Internet of Things (IoT) devices, to which you can push your own code.  The Grove toolkit contains a variety of widgets (sensors and other modules) that can be attached to the Edison board.


Use case:

When motion is detected from the IoT device, send me an email notification.

Solution Architecture:

Grove has a motion detector widget so that was attached to the Edison board.

  1. Intel XDK IoT Edition is the IDE that provides templates to create Node.js projects and deploy the code for sensors to Edison.
  2. From there, it’s just a matter of setting up the SNS service so that every motion detected triggers a publish to the email.
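The publish step can be sketched as a small function that takes the SNS client as a parameter. The params passed to publish (TopicArn, Subject, Message) follow the AWS SDK for JavaScript; the function name, the subject line and the topic ARN are placeholders, and the fake client below only exists so the example runs without AWS credentials.

```javascript
'use strict';

// Publish one notification per motion event. snsClient is expected to expose
// publish(params, callback), as the AWS SDK's SNS client does.
function notifyMotion(snsClient, topicArn, detectedAt, callback) {
  const params = {
    TopicArn: topicArn,
    Subject: 'Motion detected',
    Message: 'Motion detected at ' + detectedAt.toISOString()
  };
  snsClient.publish(params, callback);
}

// Stand-in client that records what would have been published.
const published = [];
const fakeSns = {
  publish: function(params, callback) {
    published.push(params);
    callback(null, { MessageId: 'example' });
  }
};

notifyMotion(fakeSns, 'arn:aws:sns:us-west-2:123456789012:motion-alerts', new Date(0),
  function(err, data) {
    console.log(err ? 'publish failed' : 'published ' + data.MessageId);
  });
```

On the device you would pass a real client, for example new AWS.SNS({region: 'us-west-2'}), and subscribe an email address to the topic.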


This was a very fun hack.  Not only did I find out more information about Twilio, Slack and IoT but it made me realize how easy it is to prototype and possibly productize solutions using the power of AWS.  We are now only limited by the power of our imagination.


Resilience in a Microservices World



Building software is often compared to building homes, and in building a home you don’t let a faulty circuit burn down the house. Decades of learnings in the home building industry have shown that one small device, the circuit breaker, can protect an entire house.

Financial Engines is on the microservices journey for what are the typical business and architectural motivations. Basically, the cost of ownership of a monolithic system is growing due to the monolithic nature of the stack: code, releases and teams are coupled together. One solution is to decompose the system into smaller pieces following the principles of microservices.

There are many great blogs and books explaining microservices architectures (we’ve been using Building Microservices by Sam Newman). A consistent theme across these is that loosely coupled distributed systems, while having many benefits, will experience failure across the service boundaries, and such systems need to gracefully handle those failures. Netflix has some interesting back-of-the-envelope stats on failure rates: Network Failure Rates.

Even without an architectural shift to microservices, the modern SaaS application is simply more distributed in terms of integrating with 3rd party functionality. At Financial Engines we have a growing number of distributed connections to external services, even within the code that remains a monolith.

While we don’t want to prematurely engineer or over engineer solutions, we don’t want to ignore sage advice either. So, to keep the system running smoothly, and recover from issues without an all-hands-on-deck emergency each time, we decided to investigate technologies to add resilience and runtime insights into the system.

That is where Hystrix comes into the picture.

Hystrix 101

Netflix has a lot of great documentation on Hystrix, and if you haven’t already visited the project wikis, definitely make it your next stop, after this blog. You can start at Netflix on Hystrix.

At a high level, Hystrix gives client code better control over how network service calls, or actually any functionality, can affect the client system. Hystrix is about the client. To afford a client better control, Hystrix strives for:

  • latency tolerance
  • fault tolerance
  • prevention of cascading failures

In sum, these attributes describe a resilient system. Again, the Netflix wikis have a lot of great detail.

In terms of code, this resilience is accomplished by wrapping any chunk of functionality in a HystrixCommand. While “any functionality” can include work other than a network service call, we’re focused on network calls below.

The HystrixCommand gives better control primarily through 2 patterns: the bulkhead pattern and the circuit breaker pattern. The bulkhead pattern isolates system resources that are servicing a network call. The circuit breaker pattern keeps a client from hanging on a failing network service call and repeating the failing action.
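To make the circuit breaker idea concrete, here is a deliberately simplified sketch in plain JavaScript. It is not Hystrix’s implementation: Hystrix trips on an error percentage over a rolling window, while this version just counts consecutive failures. The class, option names and injectable clock are all assumptions made for illustration.

```javascript
'use strict';

class CircuitBreaker {
  constructor(action, fallback, options) {
    this.action = action;
    this.fallback = fallback;
    this.failureThreshold = options.failureThreshold;
    this.resetAfterMs = options.resetAfterMs;
    this.now = options.now || Date.now; // injectable clock for testing
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  call() {
    const args = Array.prototype.slice.call(arguments);
    if (this.openedAt !== null && this.now() - this.openedAt < this.resetAfterMs) {
      // Open: skip the failing service entirely and answer with the fallback.
      return this.fallback.apply(null, args);
    }
    // Closed, or open long enough that one trial call is allowed through.
    try {
      const result = this.action.apply(null, args);
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.openedAt !== null || this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // trip, or re-trip after a failed trial
      }
      return this.fallback.apply(null, args);
    }
  }
}

// A service that is always down, wrapped in a breaker.
const breaker = new CircuitBreaker(
  function() { throw new Error('service unavailable'); },
  function() { return 'fallback response'; },
  { failureThreshold: 2, resetAfterMs: 1000 }
);
console.log(breaker.call());
```

Once the breaker is open, callers get the fallback immediately instead of waiting on a service that is known to be failing.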

In addition to resiliency, as services require more network calls to complete a single request, we want to execute those calls in parallel so we don’t pay the cost of sequential synchronous network requests. Fortunately, Hystrix integrates well with parallel, asynchronous processing.

Use Case

Financial Engines builds financial planning and portfolio management software that helps customers plan for retirement and manage their 401Ks, IRAs and other investment accounts. There’s much more going on there, but that’s a high level view. The nature of the fiduciary business is that there are legal concerns which require legal documents to be presented to the user of the system. In our use case, the legal documents are provided by a microservice that can be used by any number of client applications or services.


The core task of using Hystrix is implementing command classes derived from HystrixCommand. It is a basic command pattern and the derived class implements a run() and a getFallback() method. Just to note, in the sample below, quite a bit of code was omitted for brevity and clarity, leaving those portions that should be of interest.

public class LegalDocRequestCommand extends HystrixCommand<LegalDocResponseDto> {

  private static final int DEFAULT_TIMEOUT_MS = 5000; // millisecond timeout
  private static final String HYSTRIX_GROUP_LEGAL_DOC_SERVICE = "LegalDocServiceGroup";

  LegalDocConfigurator configurator = null;

  /**
   * Parameters to the legal documents service.
   */
  String context;
  String sponsorId;
  String rkId;
  boolean includeIraManagement;

  protected LegalDocRequestCommand(LegalDocConfigurator configurator) {
    // The super(...) call is not shown in the original listing; registering the
    // group key is the minimal form HystrixCommand requires (assumed here).
    super(HystrixCommandGroupKey.Factory.asKey(HYSTRIX_GROUP_LEGAL_DOC_SERVICE));
    this.configurator = configurator;
  }

  /**
   * HystrixCommand run() API.
   */
  @Override
  protected LegalDocResponseDto run() {
    String responseStr = null;
    WebResource resource = createWebResource();
    // try to get the legal documents
    responseStr = resource.get(String.class);
    LegalDocResponseDto responseDto = mapResponseJsonToPojo(responseStr);
    return responseDto;
  }

  /**
   * HystrixCommand getFallback() API.
   */
  @Override
  protected LegalDocResponseDto getFallback() {
    logger.error("LegalDocRequestCommand failed: " + constructLegalDocServiceUrl() + " Root cause: ",
        getFailedExecutionException());
    LegalDocResponseDto emptyResponse = new LegalDocResponseDto();
    return emptyResponse;
  }

  /**
   * Builder pattern for creating the LegalDocRequestCommand since REST endpoint requires
   * several intrinsic type parameters. Enforces that all properties are set appropriately.
   */
  public static class Builder {
    private String context = null;
    private String sponsorId = null;
    private String rkId = null;
    private boolean includeIraManagement = true;

    public Builder() {}

    protected Environment getEnvironment() {
      return Environment.getInstance();
    }

    public Builder withContext(String context) {
      this.context = context;
      return this;
    }

    public Builder withSponsorId(String sponsorId) {
      this.sponsorId = sponsorId;
      return this;
    }

    public Builder withRkId(String rkId) {
      this.rkId = rkId;
      return this;
    }

    public Builder withIncludeIraManagement(boolean includeIraManagement) {
      this.includeIraManagement = includeIraManagement;
      return this;
    }

    /**
     * Build the LegalDocRequestCommand.
     */
    public LegalDocRequestCommand build() {

      // Create a LegalDocConfigurator from the system environment.
      LegalDocRequestCommand command = new LegalDocRequestCommand(new LegalDocConfigurator() {
        public int getTimeout() {
          return getEnvironment().getLegalDocServiceTimeout(DEFAULT_TIMEOUT_MS);
        }
      });

      // transfer properties to the LegalDocRequestCommand
      command.context = context;
      command.sponsorId = sponsorId;
      command.rkId = rkId;
      command.includeIraManagement = includeIraManagement;

      // ensure the command properties are set appropriately
      validateProperties(command);

      return command;
    }

    /**
     * Validate the properties of the command. Perform all your input/property level validation here.
     * Exceptions thrown during construction will NOT count against the circuit breaker logic since
     * they happen before the invocation of the run() method.
     */
    private void validateProperties(LegalDocRequestCommand command) {
      if (StringUtils.isEmpty(command.context)) {
        throw new IllegalArgumentException("Missing context in LegalDocRequestCommand.");
      }
      if (StringUtils.isEmpty(command.sponsorId)) {
        throw new IllegalArgumentException("Missing sponsorId in LegalDocRequestCommand.");
      }
      if (StringUtils.isEmpty(command.rkId)) {
        throw new IllegalArgumentException("Missing rkId in LegalDocRequestCommand.");
      }
    }
  }
}

In the run() method, you code the functionality that exposes your client to failure. In our examples, this is essentially the network call to fetch the legal documents. Above, we are using the Jersey client library to make REST calls, so at its core the run() method makes a WebResource get() call.

The second method to implement is getFallback(). In the event of a failure calling the distributed service, this method gets invoked to return some set of data that’ll ideally allow the client code to complete its operation, albeit with potentially degraded functionality. This forces the discussion of some key distributed system design details, namely, what happens when a network call fails. And they will fail.

Historically, in most cases we just throw exceptions that bubble up to an “internal server error” and let the caller deal with that, which usually results in a suboptimal client experience. But, can we do better than that? Can the client code handle an empty response or some default data, and still have a functioning scenario? Either way, in an application built on an increasingly distributed system, we want to start thinking about how the customer or client experience can continue in the event of network service failures.

Getting back to the code, we want to be sure to log the invocation of getFallback() with the relevant root cause information so it is easier to diagnose a problem. Even though getFallback() is making your system look like everything is OK, there could still be a problem that you need to resolve.

If you peruse the Hystrix wiki pages, you’ll also notice some advice on throwing exceptions in run(), and how to build the getFallback(). In the run() method, if there’s an error that you don’t want to count against the circuit breaker logic, then you need to throw a HystrixBadRequestException. All other exceptions factor into the circuit breaker logic to determine if the breaker should be opened, thus impeding calls to the service. Given this, perform as much validation up front as possible outside of the run() method.

Above you’ll notice that we added the builder pattern to the HystrixCommand. The LegalDocRequestCommand takes many primitive data type parameters that are query parameters to the REST call. The builder pattern adds clarity to the parameters and provides a consistent pattern for implementing a validation step so we can catch any errors before the run() operation.

Now the fun part, here’s what it looks like to use the LegalDocRequestCommand. Of course, the tests below require an environment setup that isn’t included.

/**
 * Run the command as a single blocking command.
 */
@Test
public void testSynchronousLegalDocRequestCommand() {
  // Builder arguments are not shown in the original listing; TEST_* are placeholders.
  LegalDocRequestCommand command = new LegalDocRequestCommand.Builder()
      .withContext(TEST_CONTEXT).withSponsorId(TEST_SPONSOR_ID).withRkId(TEST_RK_ID)
      .build();

  LegalDocResponseDto response = command.execute();
}

/**
 * Run the command as a single asynchronous command via Future<>.
 */
@Test
public void testAsynchronousLegalDocRequestCommand() {
  LegalDocRequestCommand command = new LegalDocRequestCommand.Builder()
      .withContext(TEST_CONTEXT).withSponsorId(TEST_SPONSOR_ID).withRkId(TEST_RK_ID)
      .build();

  LegalDocResponseDto response = null;

  // processing starts but thread doesn't block
  Future<LegalDocResponseDto> future = command.queue();

  try {
    // thread blocks waiting for response
    response = future.get();
  } catch (Exception e) {
    throw new AssertionError("Legal doc request failed", e);
  }
}

/**
 * Run 2 commands in parallel, asynchronously, using an Observable<> and combine the responses
 * into a single assembled response.
 */
@Test
public void testObservableLegalDocRequestCommand() {
  LegalDocRequestCommand command1 = new LegalDocRequestCommand.Builder()
      .withContext(TEST_CONTEXT).withSponsorId(TEST_SPONSOR_ID).withRkId(TEST_RK_ID)
      .build();

  LegalDocRequestCommand command2 = new LegalDocRequestCommand.Builder()
      .withContext(TEST_CONTEXT).withSponsorId(TEST_SPONSOR_ID).withRkId(TEST_RK_ID)
      .build();

  Observable<LegalDocResponseDto> observable1 = command1.observe();
  Observable<LegalDocResponseDto> observable2 = command2.observe();

  Observable<LegalDocResponseDto> aggregateObservable =
    Observable.zip(observable1, observable2, new Func2<LegalDocResponseDto, LegalDocResponseDto, LegalDocResponseDto>() {
      public LegalDocResponseDto call(LegalDocResponseDto docs1, LegalDocResponseDto docs2) {
        return LegalDocsAssembler.assemble(docs1, docs2);
      }
    });

  BlockingObservable<LegalDocResponseDto> blockingObservable = aggregateObservable.toBlocking();

  LegalDocResponseDto combinedDocs = blockingObservable.last();
}

The last test above is actually not part of the first use case in production, but it demonstrates one of the benefits of building on HystrixCommands. As the system of microservices needs to call more microservices, we don’t want the response time to become the sum of all the individual network calls. There are probably many cases where we can make the network calls in parallel, asynchronously, and process the responses as required. This is somewhat analogous to the asynchronous behavior that is intrinsic to the JavaScript world. The RxJava GitHub project has lots of great links and information on this: RxJava.
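For readers more at home in JavaScript, the same zip-style composition can be sketched with promises. This is an analogy only, not part of our Java code: the two fetch functions are hypothetical stand-ins for network calls, and Promise.all plays the role that Observable.zip plays above.

```javascript
'use strict';

// Start both calls immediately, then combine the results once both resolve.
function fetchCombined(fetchFirst, fetchSecond, assemble) {
  return Promise.all([fetchFirst(), fetchSecond()])
    .then(function(results) {
      return assemble(results[0], results[1]);
    });
}

// Hypothetical stand-ins for two network calls that each take about 20 ms.
function fetchTerms() {
  return new Promise(function(resolve) {
    setTimeout(function() { resolve(['terms.pdf']); }, 20);
  });
}
function fetchPrivacy() {
  return new Promise(function(resolve) {
    setTimeout(function() { resolve(['privacy.pdf']); }, 20);
  });
}

fetchCombined(fetchTerms, fetchPrivacy, function(a, b) { return a.concat(b); })
  .then(function(docs) { console.log(docs); });
```

Because both calls are started before either is awaited, the total wait is roughly that of the slower call rather than the sum of the two.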

Resilience Without Hystrix

Of course, as software engineers, before introducing another framework into the tech stack, we need to understand whether it provides enough value to warrant the additional learning it brings. To get our heads around this, let’s take a look at what resilience would look like without Hystrix. The functionality we’d like to support is:

  • latency tolerance (timeout control)
  • fault tolerance (exception handling and fallback data)
  • circuit breaker logic
  • bulkhead logic (thread pools with rejection)
  • parallel async requests

We are currently using the Jersey client (v1.19) to construct and execute calls to REST services. With Jersey, it is very straightforward to configure the connection timeout and read timeout. And, of course, the WebResource network operation can be wrapped in a try/catch block to handle any of the exceptions and return an appropriate fallback response. Regarding thread pools with rejection, the Jersey client supports thread pools for async requests, but not sync requests. The thread pools have standard Java thread pool behavior regarding the queueing of requests, so you’d have to supply your own ThreadPoolExecutor with a bounded queue so that requests are rejected once the pool and queue are full.
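
As a rough illustration of what "thread pools with rejection" means in plain Java, here is a minimal sketch of a bulkhead built on the JDK's ThreadPoolExecutor. The class name BulkheadSketch and its methods are hypothetical, and wiring such an executor into the Jersey client is not shown. The pool has a fixed number of threads and a bounded queue, and AbortPolicy throws RejectedExecutionException once both are full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BulkheadSketch {

  // Fixed-size pool with a bounded queue; AbortPolicy rejects work once both are full.
  static ThreadPoolExecutor newBulkhead(int threads, int queueLength) {
    return new ThreadPoolExecutor(
        threads, threads,
        0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueLength),
        new ThreadPoolExecutor.AbortPolicy());
  }

  // Submits `requests` tasks that block until released and returns how many were rejected.
  static int countRejections(int threads, int queueLength, int requests) {
    ThreadPoolExecutor pool = newBulkhead(threads, queueLength);
    CountDownLatch release = new CountDownLatch(1);
    int rejected = 0;
    try {
      for (int i = 0; i < requests; i++) {
        try {
          pool.execute(() -> {
            try {
              release.await(); // simulate a slow downstream service
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          });
        } catch (RejectedExecutionException e) {
          rejected++; // a real caller would return fallback data here
        }
      }
    } finally {
      release.countDown(); // let the accepted tasks finish
      pool.shutdown();
    }
    return rejected;
  }

  public static void main(String[] args) {
    // 2 threads + 3 queue slots accept 5 of the 10 requests
    System.out.println("rejected: " + countRejections(2, 3, 10)); // prints "rejected: 5"
  }
}
```

The point of rejecting early is that a slow dependency can only tie up its own small pool, and callers get a fast failure to fall back on.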

What about processing network service requests in parallel, asynchronously? This is functionality we feel will become important as we start to rely on more service calls to fulfill any single request. Jersey client 1.19 supports integration with Java Futures. As demonstrated below, it is straightforward to create an AsyncWebResource and interact via the returned Future.

Let’s see what the code would look like.

public void testResilientJerseyClient() {
  String url = "http://example.com/hello";

  Integer connectionTimeout = 5000;
  Integer readTimeout = 5000;

  ClientConfig clientConfig = new DefaultClientConfig();

  // initialize clientConfig
  clientConfig.getProperties().put(ClientConfig.PROPERTY_CONNECT_TIMEOUT, connectionTimeout);
  clientConfig.getProperties().put(ClientConfig.PROPERTY_READ_TIMEOUT, readTimeout);

  Client client = Client.create(clientConfig);
  WebResource resource = client.resource(url);

  String response = null;

  try {
    response = resource.accept("application/json").get(String.class);
  } catch (ClientHandlerException clientHandlerException) {
    response = getFallbackResponse();
  } catch (Exception e) {
    // handle other exceptions
  }
}

protected String getFallbackResponse() {
  String fallbackResponse = "sample fallback response";
  // determine appropriate response, perhaps based on a particular exception
  return fallbackResponse;
}

public void testResilientJerseyClientAsync() {

  String url = "http://example.com/hello";
  Integer connectionTimeout = 5000;
  Integer readTimeout = 5000;
  Integer threadPoolSize = 5; // limit async requests to pool

  ClientConfig clientConfig = new DefaultClientConfig();

  // initialize clientConfig
  clientConfig.getProperties().put(ClientConfig.PROPERTY_CONNECT_TIMEOUT, connectionTimeout);
  clientConfig.getProperties().put(ClientConfig.PROPERTY_READ_TIMEOUT, readTimeout);
  clientConfig.getProperties().put(ClientConfig.PROPERTY_THREADPOOL_SIZE, threadPoolSize); // use thread pool for async, and limit pool size

  Client client = Client.create(clientConfig);

  // Use the AsyncWebResource API for parallel async operations.
  AsyncWebResource asyncResource = client.asyncResource(url);

  String response = null;
  Future<String> future = null;

  try {
    // Fire off this future, and any others that need to execute in parallel, asynchronously.
    future = asyncResource.accept("application/json").get(String.class);
    // Get the responses from the fired off futures.
    response = future.get(); // can take timeout params
  } catch (ClientHandlerException clientHandlerException) {
    response = getFallbackResponse();
  } catch (Exception e) {
    // handle other exceptions; failures inside the async task typically
    // surface here as an ExecutionException thrown by future.get()
  }
}

This is not an exhaustive analysis, but clearly parts of the resilience requirements can be implemented without Hystrix. The resulting solutions, however, are not as complete, and you would have to write your own code to replicate some of what Hystrix provides, such as circuit breaker and bulkhead logic. And, if I start cleaning up the above code to make it reusable across different service calls, it starts to look a lot like a HystrixCommand.
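
To give a feel for the kind of code you would end up writing yourself, here is a deliberately simplified circuit breaker sketch. It is not how Hystrix implements its breaker; this version simply trips after a number of consecutive failures. The class name CircuitBreakerSketch, its parameters, and the injectable clock are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class CircuitBreakerSketch {
  private final int failureThreshold; // consecutive failures that trip the circuit
  private final long openMillis;      // how long to short-circuit before a trial call
  private final LongSupplier clock;   // millisecond clock, injectable for testing
  private int consecutiveFailures = 0;
  private boolean tripped = false;
  private long trippedAt = 0L;

  CircuitBreakerSketch(int failureThreshold, long openMillis, LongSupplier clock) {
    this.failureThreshold = failureThreshold;
    this.openMillis = openMillis;
    this.clock = clock;
  }

  // True while calls should be short-circuited straight to the fallback.
  synchronized boolean isOpen() {
    return tripped && clock.getAsLong() - trippedAt < openMillis;
  }

  private synchronized void recordSuccess() {
    consecutiveFailures = 0;
    tripped = false;
  }

  private synchronized void recordFailure() {
    consecutiveFailures++;
    if (consecutiveFailures >= failureThreshold) {
      tripped = true;
      trippedAt = clock.getAsLong(); // (re)start the open window
    }
  }

  // Runs the remote call unless the circuit is open; any failure yields the fallback.
  <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
    if (isOpen()) {
      return fallback.get(); // don't touch the failing service at all
    }
    try {
      T result = remoteCall.get();
      recordSuccess();
      return result;
    } catch (RuntimeException e) {
      recordFailure();
      return fallback.get();
    }
  }
}
```

A caller might create one breaker per downstream service, for example new CircuitBreakerSketch(5, 10000L, System::currentTimeMillis), and route every request through call(...). Even this small version has to deal with thread safety and a trial request after the open window, and it still lacks metrics, configuration, and timeouts.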

Oh, and we still have to make some SOAP and Spring Remoting calls. Remember SOAP? Since the above solution is reliant on the Jersey client, we’d have to also implement all of the above for SOAP requests. We don’t get the benefit of Hystrix abstracting the fundamentals of resilience away from any particular service protocol implementation, be it REST, SOAP, Spring Remoting, etc.

A couple of areas we didn’t touch on are configuration and monitoring. Hystrix has an extensive configuration system built on Archaius, and supports monitoring the health of the client calls via a stream of monitoring data that can feed into the Hystrix dashboard. Details in these areas are left for another post, but they are of high value for a microservices system.

Key Takeaways

After coding up some actual cases, my engineering takeaway is that with Hystrix, we get resilience in a holistic code texture, with very complete resilience functionality. Because of the holistic code texture, resilience can be easily applied across a heterogeneous set of client network technologies. Don’t underestimate how ease of use leads to better adoption.

Getting started with Hystrix was very easy, and the defaults were adequate to get the first use case into production. Now we must tune the timeouts, threadpool sizes, etc., based on learnings about load and response times. I’d like to emphasize that you shouldn’t just “bump up” these values to mask other problems in the system. Investigate and solve the root causes.

From an architectural perspective, introducing Hystrix has forced the detailed consideration of resilience. In fact, whether you use Hystrix or not, do not let your system default to client library or network timeouts. Consider, for example, that the default timeout for the Jersey client is infinity. Furthermore, examine what needs to happen if a service call fails. These considerations were definitely an underdeveloped muscle coming from the monolith world, where nearly everything is available and in-proc.

Building out a microservices architecture shifts the dev/ops costs from managing a monolith, with all of its issues, to managing a suite of services, with all of those issues. Be proactive and head off the costs of maintaining a non-resilient system. An engineering organization can spend lots of time (=money), not meet business requirements, and foment frustration within the organization if a new architectural initiative is plagued with reliability issues.

While this is a very humble first usage of Hystrix, it reveals that even in a simple case, the benefits will outweigh the costs for us. Remember, don’t let one faulty circuit burn down the house.