Overview

Financial Engines recently celebrated the 20th anniversary of its founding.  Those two decades reflect our growth en route to becoming the largest registered investment advisor in the US.
During those same two decades the technology industry has changed profoundly and we have adjusted along the way.  One change we completed earlier in 2016 was moving our disaster recovery footprint to a hybrid cloud solution using AWS. This document describes that effort in more detail and the results we achieved.

Moving to IaaS

Our offerings have been web-based since inception. To host these web experiences we used top-tier colocation providers, which relieved us from building and operating physical datacenters.
Today enterprise-capable IaaS is available from providers such as AWS.  We are now on a journey to move “up the stack” by adopting IaaS, reducing the burden we bear for things like:

  • hardware (servers, network gear, storage) procurement
  • physical site design and engineering (space, power, cooling, rack design, cable management, etc)
  • hardware maintenance: replacing failed drives, DIMMs, CPUs, NICs, motherboards, switch blades
  • firmware maintenance: qualifying and applying updates/patches across all hardware devices
  • hypervisor work: licensing, installation, tuning, maintenance, patching, upgrades
  • physical storage: design, engineering, and maintenance for iSCSI boot and data disks and NFS/CIFS network attached storage

Moreover, when we are ready to decommission infrastructure we just call an API to terminate and free those resources. This eliminates physical maintenance at the end of hardware lifecycles.
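
To illustrate what "just call an API" can look like, here is a minimal Python sketch. It assumes an EC2 client object such as the one returned by boto3's `boto3.client("ec2")`, whose `terminate_instances` call takes an `InstanceIds` list. The helper name `terminate_instances` and the instance ID in the usage comment are hypothetical and not taken from our tooling.

```python
def terminate_instances(ec2_client, instance_ids):
    """Ask EC2 to terminate the given instances.

    ec2_client is assumed to behave like boto3.client("ec2").
    Returns a dict mapping each instance ID to the state EC2 reports
    for it after the call (typically "shutting-down").
    """
    ids = list(instance_ids)
    if not ids:
        # Nothing to do; avoid sending an empty request.
        return {}
    response = ec2_client.terminate_instances(InstanceIds=ids)
    return {
        item["InstanceId"]: item["CurrentState"]["Name"]
        for item in response.get("TerminatingInstances", [])
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2")
#   terminate_instances(ec2, ["i-0123456789abcdef0"])
```

Passing the client in as a parameter keeps the helper easy to exercise against a stub before pointing it at a real account.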

Peak Colo

Our transition to IaaS marks early 2016 as the point of “Peak Colo” for Financial Engines.

Over the coming quarters we expect to:

  • require fewer racks in colocation facilities
  • buy fewer servers from Cisco, Dell, or IBM
  • consume fewer VMware licenses
  • spend more on AWS for IaaS resources
  • achieve a net savings in our infrastructure total cost of ownership (see chart below for details)

Lift and shift for DR

Our rebuild of the DR environment had a fixed timeline because a colocation contract was ending. We therefore focused our effort on a lift-and-shift approach and moved the Linux compute portion of our stack into a VPC. We connected that VPC using Direct Connect to a reduced colo footprint, resulting in a seamless LAN spanning our colo space and AWS:

HybridDiagram.png

This hybrid posture converts roughly 80% of our servers from on-prem hosted to cloud-hosted.  In doing so we trade capital for expense and ownership for rental.

For disaster recovery this trade is attractive because these resources are rarely needed (our DR utilization is under 10%, covering testing, drills, etc.).
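
The reasoning behind that trade can be sketched as simple arithmetic: owned hardware costs the same whether it is idle or busy, while rented compute is billed only for the hours it runs. The Python sketch below is a simplified model under stated assumptions, not our actual cost analysis. The function names are hypothetical, any rates or counts passed in are placeholders, and the model ignores storage, data transfer, Direct Connect, and the always-on portion of a DR site.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_rental_cost(hourly_rate, instance_count, utilization):
    """Monthly cost of renting instances that run only part of the time.

    utilization is the fraction of the month the instances are running
    (0.10 means 10%).
    """
    return hourly_rate * instance_count * HOURS_PER_MONTH * utilization


def break_even_utilization(owned_monthly_cost, hourly_rate, instance_count):
    """Utilization at which renting costs the same as owning.

    Below this fraction renting is cheaper in this simplified model;
    above it, owning is cheaper.
    """
    full_time_rental = hourly_rate * instance_count * HOURS_PER_MONTH
    return owned_monthly_cost / full_time_rental
```

When actual utilization sits well below the break-even fraction, as it does for an environment used mainly for tests and drills, the rental side of the comparison wins.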

This lift-and-shift hybrid project has a residual footprint in our colo consisting of:

  • backend NetApp storage
  • large database hosts which are more difficult to run on EC2 (due to size, IOPS, and CPU requirements)
  • batch machines which currently run on Windows Server

Future revisions of our hybrid posture should enable more of this infrastructure to run on AWS.

Our previous generation disaster recovery consisted of a colocation-hosted footprint containing:

  • 6 racks
  • VMware compute on IBM blades
  • NetApp storage
  • RHEL subscription fees
  • Load balancers as hardware appliances

The new disaster recovery footprint built on a hybrid cloud consists of:

  • 1 rack of UCS and NetApp (tech refreshed to yield better density and performance)
  • EC2 compute (our upgrade to the latest Xeon E5 v3 hardware was just a matter of selecting from the M4/C4 instance families)
  • Ubuntu 14.04 LTS
  • Load balancing on ELBs

In terms of costs here is what the transformation looks like:

hybrid-waterfall.jpg

Our DR site on AWS uses the pilot light model which incurs a modest monthly expense.

In exchange for that pilot light expense we achieved large reductions in depreciation, engineering time, and colo expense.

Looking Ahead

Following this disaster recovery rebuild we are moving to other re-hosting projects such as:

  • dev and test environments
  • production, starting with CPU-intensive tiers of our footprint

We expect these new environments to utilize the same hybrid cloud architecture with similar results.

Related Work

In addition to our lift-and-shift projects we are also moving to cloud native substrates for net new functionality.

These projects are using high-level primitives in AWS such as:

  • Lambda
  • API Gateway
  • DynamoDB
  • S3
  • Kinesis

Look for future blog posts covering that work.

 
