TechDoc: Load Balancing and portable Hub/Spoke connections of Application SandBoxes

Posted on June 15, 2011


Brief

  1. Application SandBoxes – All applications in a HotForestGreen network run in SandBoxes.
  2. To keep the architecture of HotForestGreen simple I have chosen a Hub/Spoke approach.
  3. To make sure the Real Time applications running on the Framework can continue to work when a Server drops out, I have built in some systems of redundancy and automatic reconnects. So when a Server drops out, all clients will automatically find a new Server to host their Sandbox.
  4. To prevent data loss, as it is a real time system, both Servers and Clients have mechanisms to Cache the data for a specific period of time, making sure that when Clients connect later, they still receive all messages sent in the time between their disconnect and reconnect.
  5. To prevent Server Overload and chaos on who Serves what, the Servers use a simple system of load-balancing where the Server with the lowest Load is presented as a candidate for a new SandBox first.

Goal

The goal of the set of systems and mechanisms described in this article is to create an almost indestructible Application Cluster in which Servers can bomb out, Servers can rewire Clients and SandBoxes to a different Server when they start reaching overload, and Servers can be added during Application Runtime, without creating anything more than short hiccups in the data flow.

The article below describes these mechanisms and systems in brief.

General rules

  1. The Hub is formed by the Server that runs the Sandbox.
  2. There is always only one Server in the ServerCluster running that Sandbox
  3. Each Spoke is a Socket Connection to the Server, specifically created for that SandBox
  4. Servers are connected to each other via one specific / reserved SandBox
  5. If a Server drops out, another Server on a pre-defined and dynamic list will stand in as the new Hub
  6. The distribution / assignment of Servers to a SandBox is based on load balancing principles
  7. This load-balancing is based on:
    1. Number of Application SandBoxes Served
    2. Number of Socket connections to the Server
    3. Relative Performance Indicator of the Server (A number between 1 and 3, based on the amount of RAM and the processor capacity of the machine it runs on)

SandBoxes

SandBoxes help keep applications separated – even if their events and data are similar.

The idea behind Application SandBoxes is that you should be able to run a limitless number of applications through one single HotForestGreen Server, without having to worry about event and data leakage from one application to another.

Servers

Servers serve SandBoxes. Using load-balancing and communication over a reserved SandBox only shareable between Servers, they keep tabs on each other.

Testing: Multiple Servers on Multiple Ports

Separate Servers can run on separate ports on the same machine and connect to each other.

The main purpose is to test scenarios with multiple Servers and Load Balancing.

Returning requested SandBox IP address

When the Client asks a Server for the IP address of a SandBox, the Server will return the IP address and port on which the Server hosting that SandBox can be found.

This IP address and Port are not restricted to the range in which the answering Server itself is running.

Rules regarding returned SandBox IP Addresses

  1. Always the current host for the requested SandBox – The returned IP Address always leads to the SandBox host
  2. Can be in other local IP range – IP Addresses can be in another Scope or Range
  3. Can be online – This scope can be an IP address online
  4. Can be on different port – Port numbers returned can be different from the one the Client connected through
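
The rules above can be sketched as a small lookup table. This is only an illustration: the names `SandBoxAddress`, `SandBoxDirectory`, `assign` and `lookup` are hypothetical and not part of the HotForestGreen API, and the addresses are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandBoxAddress:
    """Hypothetical reply to a 'where is this SandBox hosted?' request."""
    ip: str
    port: int


class SandBoxDirectory:
    """Maps SandBox names to the Server currently hosting them (assumed design)."""

    def __init__(self):
        self._hosts = {}

    def assign(self, sandbox, ip, port):
        # Record (or replace) the current host of a SandBox.
        self._hosts[sandbox] = SandBoxAddress(ip, port)

    def lookup(self, sandbox):
        # Returns the current host, or None if nobody serves the SandBox yet.
        return self._hosts.get(sandbox)


directory = SandBoxDirectory()
directory.assign("chat", "192.168.1.20", 9001)
# The host may move to another IP range and another port at any time.
directory.assign("chat", "10.0.0.7", 9100)
print(directory.lookup("chat"))
```

The Client only ever sees the latest entry, which is what makes the Hub portable.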

Redundancy and load balancing

To make sure that the Application Cluster does not stop when one Server craps out, I have built in a system of redundancy, about which you can read more below.

The bottom line is this:

  1. Replacement Server – When one Server craps out, the Client side Applications will automatically – and usually within 2 seconds – start to scan the network for another replacement Server to host the dropped out SandBox
  2. Servers provide alternative – The Servers keep tabs on each other and when one disappears, they will propose a new Server to host a SandBox the moment that the Client Side application finds a Server and requests a Sandbox Server IP Address
  3. Load Balancing – The Servers do not just randomly put forward a (new replacement) Server to host a SandBox, but do this using a simple Load Balancing system based on a Relative Performance Multiplier Value (RPMV). The lower the performance of a machine, the higher the multiplier factor is.

Portable Hub/Spoke connections

Any Server in the Application Cluster can be, and very likely will be, a hub for one or more Sandboxes.

These roles are assigned to a Server via Load Balancing and based on the following rules:

  1. Served by the Server with the lowest Relative Load – If a client requests a Sandbox connection, the first found Server in the scan will check which Server in the Server Cluster can be assigned. This is usually the Server with the lowest Relative Load.
  2. Reconnect to the next Server with the lowest Relative Load – If one of the Servers bombs out, all Clients will start requesting a connection to the SandBoxes that were served by that Server. Since nobody is serving those SandBoxes, the first found Server will poll the list with the Relative Load of all Servers and choose the ones with the lowest score to assign the Unserved SandBoxes to.

This means that:

  1. There are no fixed hubs – For a specific SandBox
  2. Hubs can drop out and be replaced by another Server – Without destroying the running application
  3. Spokes will not sit down and mope on disconnect – But automatically reconnect to a new Hub as soon as a replacement Server is given
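
A minimal sketch of the two rules above, in Python. It assumes the Relative Load is the number of active Socket connections times the multiplier, and that all Clients follow their SandBox to its new host. The dictionary layout and the function names are hypothetical, not the framework's own.

```python
def relative_load(server):
    # Assumed formula: active socket connections weighted by the multiplier.
    return sum(server["sandboxes"].values()) * server["rpmv"]


def pick_host(servers):
    # The Server with the lowest Relative Load is proposed first.
    return min(servers, key=relative_load)


def reassign_orphans(servers, dead_name):
    """Move every SandBox of a dropped Server to the least loaded survivor."""
    dead = next(s for s in servers if s["name"] == dead_name)
    alive = [s for s in servers if s["name"] != dead_name]
    moved = {}
    for sandbox, sockets in dead["sandboxes"].items():
        host = pick_host(alive)
        # Assumption: every Client of the SandBox reconnects to the new host.
        host["sandboxes"][sandbox] = sockets
        moved[sandbox] = host["name"]
    dead["sandboxes"] = {}
    return moved


# Each SandBox maps to its number of Socket connections (made-up numbers).
cluster = [
    {"name": "A", "rpmv": 1, "sandboxes": {"chat": 100}},
    {"name": "B", "rpmv": 6, "sandboxes": {"game": 10}},
    {"name": "C", "rpmv": 1, "sandboxes": {"map": 40, "votes": 30}},
]
moved = reassign_orphans(cluster, "C")
print(moved)
```

Because the load is recalculated after every assignment, the orphaned SandBoxes are spread over the remaining Servers instead of all landing on the same one.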

A system of Server / Hub redundancy

The more Servers you assign to the Application Cluster, the more redundancy you create. This means that – as long as your network continues to run – one or more Servers can drop out and your application will still continue to function.

Default drop out time of about 5 seconds

By default each SandBox will check every 5 seconds if everything is OK.

If a SandBox connection was lost in that period, the SandBox will attempt to reconnect.

“Connection Lost” Notification by the ServerConnection

Every time you try and send a message through a ServerConnection, the system will first check if the connection still exists. If the connection is broken, the ServerConnection object will notify the SandBox, leading the Application SandBox to start scanning for available Servers again.

Reconnection in 2 seconds

If your system is actively sending messages, usually the connection drop is discovered immediately and the reconnection-process is started immediately as well.

The maximum time for a Client to reconnect all SandBoxes should not be more than 2 seconds and in most cases will be less than one second once a disconnect is discovered.

However: if your system is only occasionally sending messages, for instance because those messages are based on user-interactions, we have two heartbeats taking care of checking the connections:

  1. A heartbeat on the ServerConnection – Checking every N seconds (never less than 1 second though) if the connection is still up
  2. A heartbeat on the Sandbox – Checking every N seconds (default is 5) if the connections are still OK and starting a reconnect when a connection to a SandBox in a specific IP range is broken
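
The heartbeat idea can be sketched as follows. The `Heartbeat` class, its `tick` method and the injected `clock` are assumptions made for this illustration. The only details taken from the text are the minimum interval of 1 second and the reconnect that follows a failed check.

```python
import time


class Heartbeat:
    """Runs `check` at most once per interval; driven by an external loop."""

    MIN_INTERVAL = 1.0  # from the text: never less than 1 second

    def __init__(self, interval, check, on_lost, clock=time.monotonic):
        self.interval = max(self.MIN_INTERVAL, interval)
        self._check = check
        self._on_lost = on_lost
        self._clock = clock
        self._last = clock()

    def tick(self):
        """Run the check if the interval has elapsed. Returns True if it ran."""
        now = self._clock()
        if now - self._last < self.interval:
            return False
        self._last = now
        if not self._check():
            self._on_lost()  # e.g. start scanning for a replacement Server
        return True


# Demonstration with a fake clock, so nothing has to sleep.
fake_now = [0.0]
events = []
hb = Heartbeat(0.2, check=lambda: False,
               on_lost=lambda: events.append("reconnect"),
               clock=lambda: fake_now[0])
fake_now[0] = 0.5
ran_early = hb.tick()   # too early: the interval was raised to 1 second
fake_now[0] = 1.0
ran_late = hb.tick()    # the check fails, so a reconnect is requested
print(events)
```

The same class could serve both heartbeats, with an interval of 5 for the SandBox.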

Data Loss on reconnect? Client side Queue and Server side history of messages

If you have a real-time system where every message counts, you can use two mechanisms:

  1. Client side message queuing – Each message that is attempted to be sent through a non-existing Socket Connection can be queued Client side. The moment the SandBox has established the connection the queue will be sent to the Server.
  2. Server side Message History – Each message that reaches the Server can be kept in a short term history at Server side. This function is by default switched off. The size of this Server Message history can never be lower than 5 seconds and will be enough to prevent potential data loss on a SandBox when Clients are reconnecting to a new Server.

You activate the Message queuing / Message history via the method “mySandBox.cacheMessages(<duration in milliseconds>)” on the Client side Application SandBox.
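
To make the two mechanisms concrete, here is a rough sketch. `ClientQueue` and `ServerHistory` are hypothetical names, the "connection" is a plain list standing in for a socket, and the millisecond timestamps are passed in by hand. This is not how `cacheMessages` is implemented, only one way the described behaviour could work.

```python
from collections import deque


class ClientQueue:
    """Holds messages while there is no Socket Connection."""

    def __init__(self):
        self._pending = deque()

    def send(self, message, connection):
        if connection is None:
            self._pending.append(message)  # no socket: keep for later
        else:
            connection.append(message)

    def flush(self, connection):
        # Called once the SandBox has re-established the connection.
        while self._pending:
            connection.append(self._pending.popleft())


class ServerHistory:
    """Keeps messages for a limited time so late Clients can catch up."""

    MIN_MS = 5000  # from the text: never lower than 5 seconds

    def __init__(self, duration_ms):
        self.duration_ms = max(self.MIN_MS, duration_ms)
        self._items = deque()  # (timestamp_ms, message), oldest first

    def add(self, now_ms, message):
        self._items.append((now_ms, message))
        # Drop everything older than the retention window.
        while self._items and now_ms - self._items[0][0] > self.duration_ms:
            self._items.popleft()

    def since(self, timestamp_ms):
        return [m for t, m in self._items if t > timestamp_ms]


wire = []                       # stand-in for a socket
queue = ClientQueue()
queue.send("a", None)           # disconnected: queued
queue.send("b", None)
queue.flush(wire)               # reconnected: queue is sent in order

history = ServerHistory(1000)   # raised to the 5000 ms minimum
history.add(0, "old")
history.add(6000, "new")        # "old" is now outside the window
print(wire, history.since(-1))
```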

Server Application SandBox

The Server Application SandBox is identical in setup to a SandBox used by Clients in the Application Cluster.

Rules of the Server Application SandBox

  1. There is only one Hub for the Server Application SandBox
  2. There can be many spokes
  3. The Hub can be any Server
  4. Each Server has a FallBack-List of active Servers to connect to when the Hub breaks down
  5. This list is created by the Hub-Server and sent to all Spoke-Servers and is automatically updated when Servers drop out from the Hub
  6. When the Hub-Server drops out (disconnection, break down)
    1. The First Server on the FallBack-List takes over and becomes the Hub
    2. All Servers will automatically reconnect to the new Hub
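
The takeover can be sketched in a few lines. The function names are hypothetical, and skipping entries on the FallBack-List that have dropped out themselves is an assumption of this sketch, not something the rules above spell out.

```python
def build_fallback_list(hub, servers):
    """The Hub lists the other Servers in the order they would take over."""
    return [s for s in servers if s != hub]


def next_hub(fallback_list, alive):
    """Return the first Server on the list that is still alive, else None."""
    for server in fallback_list:
        if server in alive:
            return server
    return None


servers = ["S1", "S2", "S3"]
hub = "S1"
fallback = build_fallback_list(hub, servers)

alive = {"S2", "S3"}                 # S1, the Hub, has dropped out
hub = next_hub(fallback, alive)      # first Server on the list takes over
fallback = build_fallback_list(hub, sorted(alive))
print(hub, fallback)
```

After the takeover the new Hub builds a fresh list and sends it to the remaining Spoke-Servers.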

Load Balancing

Load balancing is always relative and based on the following parameters:

  1. Amount of served SandBoxes
  2. Amount of active Client Connections
  3. Capacity of the Server – Which is a number between 1 and 3
  4. The Server with the least amount of relative load will be presented first

Relative Performance Indication of the Server

When a Server is registered and started, it will be given a Relative Performance Indicator. This Relative Performance Indicator is a number between 1 and 3 put into three static constants:

  1. SERVER_RELATIVEPERFORMANCE_HIGH – Representing a machine with a lot of RAM and a good processor. Containing the multiplier value 1.
  2. SERVER_RELATIVEPERFORMANCE_MEDIOCRE – Representing a low-end machine – like a Netbook which is OK, but not the best. Containing the default multiplier value 6.
  3. SERVER_RELATIVEPERFORMANCE_LOW – Representing a low-consumption machine like Android Tablets used as a Server. Containing the default multiplier value 30.

The Performance Indicator is relative to the machines used in the network.

The relative performance indicator is based on two factors:

  1. Processing speed – Does it have dual core, single core, quad core processors? What is the clock speed? What benchmarks does it make?
  2. Memory – How much RAM is there?

You take the lowest performing machine and the highest performing machine as your starting points.

All machines in the middle are “mediocre”.
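
As an illustration of this rule of thumb: given some benchmark number per machine, the best machine gets the HIGH multiplier, the worst gets LOW, and the rest are mediocre. The `classify` function and the benchmark numbers are invented for this sketch. The multiplier values are the defaults of the three constants listed earlier.

```python
# Default multiplier values of the three constants.
SERVER_RELATIVEPERFORMANCE_HIGH = 1
SERVER_RELATIVEPERFORMANCE_MEDIOCRE = 6
SERVER_RELATIVEPERFORMANCE_LOW = 30


def classify(benchmarks):
    """Map machine name -> multiplier, relative to the machines in the network."""
    best = max(benchmarks.values())
    worst = min(benchmarks.values())
    result = {}
    for name, score in benchmarks.items():
        if score == best:
            result[name] = SERVER_RELATIVEPERFORMANCE_HIGH
        elif score == worst:
            result[name] = SERVER_RELATIVEPERFORMANCE_LOW
        else:
            result[name] = SERVER_RELATIVEPERFORMANCE_MEDIOCRE
    return result


# Made-up benchmark numbers; higher means faster.
multipliers = classify({"laptop": 900, "netbook": 400, "tablet": 150})
print(multipliers)
```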

Examples

  1. Desktops, Laptops, Netbooks – If most machines acting as Servers are Laptops, some machines are Netbooks and you have one or two Desktops, you might consider Netbooks to be relatively Low-performing.
  2. Laptops, Netbooks, Android devices – If your fastest machine is a Laptop and the slowest an Android device, you might consider the Netbooks to be mediocre.
  3. Dual core Netbooks, 1 year old Netbooks with 2 GB RAM, 2 year old Netbooks with one GB RAM – Dual Core Netbooks perform better than single core Netbooks. A 1 year old Netbook might perform better than a 2 year old. 2GB RAM is better than 1 GB.

Server Manifest

Amongst other things, the Server Manifest contains the Relative Performance Multiplier Value.

Calculating relative load

The Relative Load per Server is based on:

  1. The number of active connections
  2. The multiplier value provided by the Relative Performance Index

Examples

  1. Case 1: Assigned to high performance Server – We have two machines and the following outcome:
    1. LPS1 – A Low Performance Server with a Relative Performance Multiplier Value (RPMV) of 20 (low performance), 4 Sandboxes and 200 Active Socket Connections. It will receive a score of 4000.
    2. HPS2 – A High performance Server with an RPMV of 1 (high performance), 50 SandBoxes and 3000 Active Sockets. It will receive a score of 3000.
    3. HPS2 is the first to receive a new connection – As HPS2 has a score of 50/3000 and LPS1 has a score of 80/4000, the next requested SandBox will be assigned to HPS2
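
The arithmetic of Case 1 can be checked with a few lines of Python. The function name is hypothetical. Comparing the Servers on the connection score is an assumption, since the text does not say how the SandBox score and the connection score are combined. In this case both scores point to the same Server.

```python
def relative_scores(rpmv, sandboxes, connections):
    """Return (SandBox score, connection score) for one Server."""
    return rpmv * sandboxes, rpmv * connections


lps1 = relative_scores(rpmv=20, sandboxes=4, connections=200)
hps2 = relative_scores(rpmv=1, sandboxes=50, connections=3000)

# Assumption: the lowest connection score wins.
winner = "HPS2" if hps2[1] < lps1[1] else "LPS1"
print(lps1, hps2, winner)
```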

Actual system monitoring not there yet

The current Load Balancing method is a crude first version. As we do not monitor the actual load on each Server, we might crash a Server when our network really gets flooded with messages and connections.

To avoid these crashes and do better load balancing, the Servers should observe their own resources. Even with only 20 active Socket Connections a low performance machine might already be at the edge of its possibilities.
