Registration Server Failover and Scalability Considerations

Scaling a TeamDrive Registration Server Setup

A first step in increasing a single Registration Server’s performance would be to monitor and review the system’s CPU and RAM utilization, and to adjust the server configuration by adding more RAM or CPUs, if necessary (also called “scale-up strategy”).

Adding more CPUs typically increases the maximum number of possible concurrent connections to the service and reduces the latency. However, the ability to handle more connections also requires more memory, as the system needs to spawn more concurrent Apache instances. So usually both parameters need to be adjusted.
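On Apache 2.4 with the prefork MPM, the number of concurrent connections is governed by the MaxRequestWorkers directive; a rough sizing rule is available RAM divided by the resident memory of one httpd process. The fragment below is a sketch only, with placeholder values that must be sized to the actual host:

```apache
# mpm_prefork tuning sketch -- all values are placeholders and must
# be derived from the host's RAM and the per-process memory footprint.
<IfModule mpm_prefork_module>
    StartServers             10
    MinSpareServers          10
    MaxSpareServers          30
    MaxRequestWorkers       256    # upper bound on concurrent connections
    ServerLimit             256    # must be >= MaxRequestWorkers
    MaxConnectionsPerChild 10000   # recycle workers to bound memory growth
</IfModule>
```

After raising MaxRequestWorkers, verify that peak memory usage (workers times per-process RSS, plus the database if co-located) still fits into physical RAM.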

Adding more RAM can also help to improve database throughput and latency, as it allows the database to keep more of its working set in memory, which enables it to return query results quicker.

If your setup has reached the physical limits of a single server instance, you can further improve the scalability as well as the redundancy of a TeamDrive Registration Server by implementing a “scale out” strategy.

In this setup, you distribute the load across several systems by deploying multiple virtual or physical Apache server instances of the TeamDrive Registration Server behind one or more load balancers.

This configuration also mitigates the risk of a service outage, e.g. if an instance fails or needs to be taken offline for maintenance purposes.

A migration from a single instance setup to such a scaled-out setup can usually be performed with very little downtime, so you can grow your setup as the need arises.

However, you must ensure that in case of a node failover/outage, the remaining nodes can handle the load usually distributed across all server instances.

Note

In a scale-out scenario, the Registration Server’s MySQL database server must be set up as a separate instance, so each Registration Server node has access to the same data set.

To prevent the MySQL database from becoming a single point of failure, we recommend setting up MySQL in a redundant configuration as well (e.g. by using MySQL replication or other clustering technologies like Galera/Percona Cluster).

Note

The TeamDrive Registration Server configuration does not support accessing more than one MySQL Server; you need to use a floating/virtual IP address that gets assigned to the currently active MySQL instance.
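One common way to provide such a floating IP is keepalived (VRRP). The sketch below is illustrative only: the interface name, router ID, virtual IP and health-check command are placeholders to be adapted to the actual environment:

```
# /etc/keepalived/keepalived.conf (sketch) on each MySQL node.
# The node with the highest priority and a passing health check holds
# the virtual IP; on failure, VRRP moves the IP to the peer node.

vrrp_script check_mysqld {
    script "/usr/bin/mysqladmin ping"   # succeeds only while mysqld answers
    interval 2
    fall 3
    rise 2
}

vrrp_instance VI_MYSQL {
    state BACKUP                # let priority and health checks elect the master
    interface eth0
    virtual_router_id 51
    priority 100                # use a lower value (e.g. 90) on the peer
    advert_int 1
    track_script {
        check_mysqld
    }
    virtual_ipaddress {
        192.0.2.10              # the address the Registration Server connects to
    }
}
```

The Registration Server's MySQL host setting then points at the virtual IP (192.0.2.10 in this sketch) rather than at an individual node.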

If you intend to run multiple independent Registration Server instances (e.g. to serve a globally distributed user base), you can assign users to local Registration Servers using different Provider Codes. Use TDNS to facilitate collaboration (e.g. exchanging Space invitations) between these independent TeamDrive Registration Server instances (which can in turn be scaled using the strategies above).

In a single-instance configuration, a re-appearing server can suffer from a “thundering herd” problem, as a large number of TeamDrive Clients will try to synchronize their accumulated pending changes simultaneously. This can lead to a peak in the number of concurrent connections to the server and its MySQL database, as well as a noticeable increase in network and disk I/O.

This effect can be mitigated by temporarily extending the poll interval used by the Clients, by increasing the number of Apache instances, or by temporarily assigning more resources like vCPUs or vRAM to a virtual machine.
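The effect of an extended poll interval can be illustrated with jittered exponential backoff, a standard technique for spreading reconnecting clients over time instead of letting them stampede a recovering server. The function below is a generic sketch of that technique, not part of the TeamDrive code base:

```python
import random

def next_poll_delay(base_interval: float, attempt: int, cap: float = 300.0) -> float:
    """Return the delay (in seconds) before the next poll.

    The delay doubles with every failed attempt, is capped at `cap`,
    and is drawn uniformly from [0, delay] ("full jitter") so that
    clients that failed at the same moment do not all retry at once.
    """
    delay = min(cap, base_interval * (2 ** attempt))
    return random.uniform(0, delay)
```

With a 10-second base interval, a client on its third failed attempt waits a random time up to 80 seconds, and after many failures the wait is bounded by the 300-second cap.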

The MySQL server’s configuration might also need to be reviewed in order to support more concurrent database connections.
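The relevant settings live in the MySQL configuration file (typically /etc/my.cnf); the values below are placeholders to be sized against the host's RAM and the number of Apache workers, not recommendations:

```ini
[mysqld]
# Allow at least as many connections as the Apache tier can open
# concurrently, plus headroom for replication and admin sessions.
max_connections         = 500

# Keep the working set in memory; commonly sized to 50-70% of the
# RAM on a dedicated database host.
innodb_buffer_pool_size = 4G
```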

Registration Server Failure Scenarios

This chapter discusses the most likely outages that can occur on a TeamDrive Registration Server, if no additional redundancy is provided.

Chapter Registration Server Failover Test Plan outlines some possible tests you should perform, and what results to expect.

Entire Registration Server Outage

An outage of the entire TeamDrive Registration Server can be triggered by any of the following events:

  • Failure of the entire Registration Server host system (e.g. a hardware or OS crash/failure)
  • Network failure that renders the Registration Server unavailable
  • Failure of the Registration Server’s Apache HTTP Server
  • Failure of the Registration Server’s MySQL Database

Running Clients will indicate that the Registration Server cannot be reached (for example, the TeamDrive 3 Desktop Client has an LED-like indicator icon in the bottom right corner, which turns from green to red in this case).

The following Client operations will continue to work:

  • Running Clients can still operate on their existing Spaces (e.g. adding/removing files, uploading new versions)
  • Clients can create new Spaces and delete existing Spaces
  • Creating Space invitations to users stored in the Client’s local address book

The following operations will not be possible while the Registration Server is unavailable:

  • Performing a login after having logged out of the TeamDrive Client
  • Registration of a new device/Client
  • Sending out Space Invitations to other users
  • Changing the password or email address, requesting a temporary password
  • Distributing comments on files via email
  • Enabling/disabling the Key Repository

Once the Registration Server is reachable again, the Clients will proceed with sending out any pending invitations. The notification icon will change from red to green.

Except for the MySQL Server outage, this failure scenario can be avoided by setting up multiple instances of the Registration Server behind a load balancer with failover capabilities.

MySQL Database Outage

A failure of the Registration Server’s MySQL Database could be triggered by one of the following events:

  • Failure of the entire MySQL Server host system (e.g. a hardware or OS crash/failure)
  • Network failure that renders the MySQL Server unavailable for the Registration Server
  • Failure of the MySQL Server’s mysqld process

The failure will be indicated by error messages in the following Registration Server log files:

/var/log/pbvm.log:

[ERROR] 11/20/2014 11:46:55: Internal error, use_database: (-12507, 2003)
[ERROR] 11/20/2014 11:46:55 *** ERROR 2003 (0): Can't connect to MySQL
server on '127.0.0.1' (107)

/var/log/pbac_mailer.log:

[Protocol] 11/20/2014 11:47:55.39 TRACE : ---AUTO TASK...
[Protocol]
[ERROR]   1: ERROR: -12507 (2003) : "autotask.pbt"@server line 92: Can't
connect to MySQL server on '127.0.0.1' (107).

/var/log/httpd/error_log:

[notice] Mod_pbt Error: pid: 9524, where: "open TD2REG_WRITE dbms USER
..."@server line 1: Error originated in sx dbms server : Opening and
initializing PBI connection, Alias "td2as", perr: -12507, serr: 2003
[notice] !sessid

A MySQL Database server failure will affect the entire Registration Server functionality as described in chapter Entire Registration Server Outage.

The service will return to normal operations as soon as the MySQL service is reachable again.

To mitigate the risk of a MySQL Server outage, consider setting up a cluster of MySQL Servers, using MySQL replication, DRBD or other replication and HA technologies like Pacemaker/Corosync to provide synchronization and redundancy.

SMTP Server Outage

If the local or remote SMTP server is unavailable for sending out email, the Registration Server will no longer be able to send invitations, registration email notifications or file comment notifications to the TeamDrive users. These messages will be kept in the Registration Server’s internal mail queue until the SMTP service is available again.

Note

From a TeamDrive Client’s perspective, sending messages still works: the Client receives a success notification as soon as the Registration Server has queued the message in its database for delivery.

Failures to connect to the SMTP server will be logged in file /var/log/pbvm.log as follows:

[ERROR] Connect to 'localhost:25' failed, getsockopt(SO_ERROR) returned
(111): Connection refused
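A connection failure like the one above can be detected by an external monitoring check with a simple TCP probe of the SMTP port. The helper below is a generic sketch (the function name and defaults are our own, not a TeamDrive tool):

```python
import socket

def smtp_reachable(host: str = "localhost", port: int = 25,
                   timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the SMTP port succeeds.

    This only verifies that something accepts connections on the port;
    a full check would also read the SMTP greeting (a "220" reply).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timeout, unreachable host, ...
        return False
```

Such a probe can be wired into the monitoring system so that an SMTP outage is noticed before the mail queue grows large.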

The pending messages can also be viewed from the Registration Server Administration Console by clicking Manage Emails -> View mail queue.

Once the SMTP service is back online again, pending messages can be rescheduled for delivery by clicking Reset Status in the mail queue overview page.

Currently, there is no automatic method for rescheduling all pending messages in a bulk operation.

Outage of the teamdrive Background Service

The teamdrive background service is responsible for running a number of tasks, see the chapter Autotasks in the TeamDrive Registration Server Reference Guide.

If the teamdrive background service (process name pbac) has failed or was not started at boot time, the following operations are most notably affected:

  • Invitation and notification emails won’t be delivered anymore. For example, users that requested a new temporary password won’t receive that information and won’t be able to log in.
  • Client licenses with an expiration date won’t be invalidated and reminder email notifications won’t be sent out.
  • Messages not retrieved by Clients from the Registration Server’s message queues within the periods defined in InvitationStoragePeriod and InvitationStoragePeriodFD won’t be deleted.
  • Old entries won’t be purged from the API log table.

Restarting the teamdrive background service will resume processing where the previous process stopped.

For increased redundancy, it is possible to run this service on each TeamDrive Registration Server instance in a multi-server installation. In this setup, each instance needs to have a functional SMTP configuration, to ensure that email messages can be delivered.

Registration Server Failover Test Plan

Based on the failover scenarios described in chapter Registration Server Failover and Scalability Considerations, the following tests should be performed to verify the correct behaviour and recovery from failures of individual TeamDrive Registration Server components.

This test plan assumes an environment consisting of two virtualized TeamDrive Registration Server instances (regserv01 and regserv02), located behind a load balancer and using a dedicated single MySQL Server instance (td-mysql). Other setups/configurations may require additional tests, depending on the environment.

Note

This configuration contains several components for which no redundancy is provided; these components are therefore considered single points of failure (SPOF). In particular, the following components can become a SPOF:

  • The MySQL database instance (td-mysql). If this instance becomes unavailable, the entire TeamDrive service will be affected and rendered unavailable until the service is restored.
  • The load balancer/firewall. If the public-facing load balancer/firewall fails, the TeamDrive service will be unavailable.
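The two-instance setup described above could be fronted, for example, by an HAProxy configuration along the following lines. This is a sketch under assumed hostnames, addresses and ports (192.0.2.x are documentation addresses), not a tested production configuration:

```haproxy
frontend regserver_http
    bind *:80
    default_backend regservers

backend regservers
    balance roundrobin
    # Periodic HTTP health check; a failing instance is taken out of
    # rotation after 3 failed checks and re-added after 2 good ones.
    option httpchk GET /
    server regserv01 192.0.2.11:80 check inter 2000 fall 3 rise 2
    server regserv02 192.0.2.12:80 check inter 2000 fall 3 rise 2
```

With no healthy backend left, HAProxy answers incoming requests with HTTP 503, which matches the expected behaviour in the multi-instance failure test below.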

Single Registration Server Instance Failure

An outage of one of the TeamDrive Registration Server instances (regserv01 or regserv02) should be simulated/triggered in the following ways:

  • Shutting down the Apache httpd Server by running service httpd stop.
  • Shutting down the network connection, e.g. by running service network stop, ifconfig eth0 down or by disconnecting the virtual network interface via the virtual machine management console.
  • Shutting down the entire virtual machine, e.g. via the virtual machine management console or by running poweroff.

Expected results:

  • The load balancer should detect that the Registration Server instance is no longer available and redirect any incoming traffic to the remaining instance instead. If configured, a notification about the outage should be sent out to the monitoring software.
  • The monitoring software should raise an alert about the Registration Server instance being unavailable, specifying the nature of the outage (e.g. httpd process missing, network unavailable, etc.).
  • The remaining Registration Server instance should handle all incoming Client requests. The TeamDrive Service should not be impacted/affected in any way.

Once the outage has been resolved and the instance has recovered, the following is expected to happen:

  • The load balancer should detect that the Registration Server instance is available again. Incoming traffic should be spread across both instances again.
  • The monitoring software should detect the service recovery and perform the respective actions (e.g. resetting the alert, sending an update notification).
  • The TeamDrive Service should continue unaffected throughout this process.

Multiple Registration Server Failures

An outage of both TeamDrive Registration Server instances (regserv01 and regserv02) should be simulated/triggered in the following ways:

  • Shutting down the Apache httpd Servers by running service httpd stop on both instances.
  • Shutting down the network connections, e.g. by running service network stop, ifconfig eth0 down on both instances, or by disconnecting the virtual network interfaces via the virtual machine management console.
  • Shutting down both virtual machines entirely, e.g. via the virtual machine management console or by running poweroff.

Expected results:

  • The load balancer should detect that the Registration Server instances are no longer available and stop redirecting any incoming traffic to the instances. Incoming requests should be answered with an appropriate error code (HTTP error code 503 - Service Unavailable). If configured, a notification about the outage should be sent out to the monitoring software.
  • The monitoring software should raise an alert about the Registration Server instances being unavailable, specifying the nature of the outage (e.g. httpd process missing, network unavailable, etc.).
  • The TeamDrive Service will be impacted/affected as outlined in chapter Entire Registration Server Outage.

Once the outage has been resolved and at least one of the Registration Server instances has been recovered, the following is expected to happen:

  • The load balancer should detect that a Registration Server instance is available again. Incoming traffic should be redirected to this instance and incoming requests should no longer result in HTTP errors.
  • The monitoring software should detect the service recovery and perform the respective actions (e.g. resetting the alert, sending an update notification).
  • Once the TeamDrive Clients have noticed the service being available again, operations should proceed as before.

Testing MySQL Server Failures

An outage of the MySQL Server instance (td-mysql) should be simulated/triggered in the following ways:

  • Shutting down the MySQL Server by running service mysqld stop.
  • Shutting down the network connection, e.g. by running service network stop, ifconfig eth0 down or by disconnecting the virtual network interface via the virtual machine management console.
  • Shutting down the entire virtual machine, e.g. via the virtual machine management console or by running poweroff.

Expected results:

  • The Registration Server instances will no longer be able to handle incoming Client requests as outlined in chapter MySQL Database Outage.
  • The monitoring software should raise an alert about the MySQL Server instance being unavailable, specifying the nature of the outage (e.g. mysqld process missing, network unavailable, etc.).

Once the outage has been resolved and the MySQL Server is available again, the following is expected to happen:

  • The TeamDrive Registration Server instances will continue to operate where they were interrupted by the MySQL Server outage. The TeamDrive Clients will pick up where they left off, synchronizing all accumulated pending changes.
  • The monitoring software should detect the service recovery and perform the respective actions (e.g. resetting the alert, sending an update notification).

Testing Load Balancer Failure

Since all TeamDrive instances are accessed through the load balancer, an outage of this component should be tested as well:

  • Shutting down the load balancer
  • Removing the network connections to the TeamDrive Server components

Expected results:

  • The Registration Server instances will no longer be able to handle incoming Client requests as outlined in chapter Entire Registration Server Outage.
  • The monitoring software should raise an alert about the load balancer instance being unavailable, specifying the nature of the outage.

Once the outage has been resolved and the load balancer is available again, the following is expected to happen:

  • The TeamDrive Registration Server instances will continue to operate as soon as they receive incoming Client requests again. The TeamDrive Clients will pick up where they left off, synchronizing all pending changes that have accumulated in the meantime.
  • The monitoring software should detect the service recovery and perform the respective actions (e.g. resetting the alert, sending an update notification).