Host Server Failover Considerations and Scenarios

Scaling a TeamDrive Host Server Setup

A first step in increasing a single Host Server’s performance is to monitor and review the system’s CPU, RAM and Disk I/O utilization, and to adjust the server configuration by adding more RAM/CPUs or increasing the storage bandwidth, if necessary (also called a “scale-up” strategy).

Adding more CPUs typically increases the maximum number of possible concurrent connections to the service and reduces the latency. However, the ability to handle more connections also requires more memory, as the system needs to spawn more concurrent Apache instances. So usually both parameters need to be adjusted.
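
For example, assuming Apache with the prefork MPM, the maximum number of concurrent connections is governed by the MaxClients directive (named MaxRequestWorkers in Apache 2.4). The values below are purely illustrative and must be sized against the available RAM:

    <IfModule prefork.c>
        StartServers          8
        MinSpareServers       5
        MaxSpareServers      20
        ServerLimit         256
        MaxClients          256
        MaxRequestsPerChild 4000
    </IfModule>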

Adding more RAM can also improve database and file system throughput and latency, as it allows the operating system and the database to keep more of their working sets and caches in memory, enabling them to return data more quickly.
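
For instance, on a dedicated MySQL server the InnoDB buffer pool determines how much of the database working set is cached in RAM. A sketch of the relevant my.cnf setting; the value is a placeholder to be adapted to the available memory:

    [mysqld]
    # Rule of thumb: 50-70% of RAM on a dedicated database server
    innodb_buffer_pool_size = 4G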

If your setup has reached the physical limits of a single server instance, you can further improve the scalability as well as the redundancy of a TeamDrive Host Server by implementing a “scale out” strategy.

In this setup, you distribute the load across several independent systems by deploying multiple virtual or physical Apache instances of the TeamDrive Host Server behind one or more load balancers.

This configuration also mitigates the risk of a service outage, e.g. if an instance fails or needs to be taken offline for maintenance purposes.

Note

The Host Server’s Space Volumes must be placed on a shared storage medium like an NFSv4 server or shared disk file systems like OCFS2 or GFS2, as each Host Server instance requires concurrent access to the same Space Volume(s). As an alternative to shared storage, an S3-compatible object store (e.g. Amazon S3 or OpenStack Swift) or the TeamDrive Scalable Hosting Storage (TSHS) can be used.
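
As an illustration, a shared Space Volume could be mounted on each Host Server node via an /etc/fstab entry like the following (the server name nfssrv01 and the export path are hypothetical):

    # Shared Space Volume, mounted identically on all Host Server nodes
    nfssrv01:/exports/spacedata/vol01  /spacedata/vol01  nfs4  hard,_netdev  0 0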

A migration from a single instance setup to such a scaled-out configuration can usually be performed with very little downtime, so you can start small and grow your setup as the need arises.

However, you must ensure that in case of a node failover/outage, the remaining nodes can handle the load that is usually distributed across all server instances.

Note

In a scale-out scenario, the Host Server’s MySQL database server must be set up as a separate instance, so each Host Server node has access to the same data set.

To prevent the MySQL database from becoming a single point of failure, we recommend setting up MySQL in a redundant configuration as well (e.g. by using MySQL replication or other clustering technologies like Galera/Percona Cluster).

Note

The TeamDrive Host Server configuration does not support accessing more than one MySQL Server; you need to use a floating/virtual IP address that gets assigned to the currently active MySQL instance.
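
For illustration, the floating IP address would then be configured as the MySQL host in the Host Server’s MySQL option file /etc/td-hostserver.my.cnf. A sketch with placeholder values (the [p1db] group name appears in the log excerpts later in this chapter; the IP address and credentials here are assumptions):

    [p1db]
    # Floating/virtual IP address of the currently active MySQL node
    host     = 192.168.1.100
    user     = teamdrive
    password = <secret>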

If you intend to run multiple independent Host Server instances (e.g. to serve a globally distributed user base), you can assign users to different Host Servers, e.g. by registering more than one Host Server to a given Provider, or using multiple Provider Codes on different Registration Servers.

These independent TeamDrive Host Server instances can then be scaled using the strategies above.

In a single instance configuration, a re-appearing server can suffer from a “thundering herd problem”, as a large number of TeamDrive Clients will try to synchronize their accumulated pending changes simultaneously. This can lead to a peak in the number of concurrent connections to this server, its MySQL database and storage subsystem, resulting in a noticeable increase in network and disk I/O.

This effect can be mitigated by temporarily extending the poll interval used by the Clients, increasing the number of Apache instances, or by temporarily assigning more resources like vCPUs or vRAM to a virtual machine.

The MySQL server’s configuration might also need to be reviewed in order to support more concurrent database connections.
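
For example, the current connection count and limit can be inspected, and the limit raised at runtime, using standard MySQL statements (assuming credentials are available, e.g. via ~/.my.cnf; the value 500 is an arbitrary example and should be persisted in my.cnf to survive a restart):

    # mysql -e "SHOW STATUS LIKE 'Threads_connected';"
    # mysql -e "SHOW VARIABLES LIKE 'max_connections';"
    # mysql -e "SET GLOBAL max_connections = 500;"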

Host Server Failure Scenarios

This chapter discusses the most likely outages that can occur on a TeamDrive Host Server if no additional redundancy is provided.

Chapter Host Server Failover Test Plan outlines some possible tests you should perform, and what results to expect.

Entire Host Server Outage

An outage of the entire TeamDrive Host Server can be triggered by any of the following events:

  • Failure of the entire Host Server host system (e.g. a hardware or OS crash/failure)
  • Network failure that renders the Host Server unavailable
  • Failure of the Host Server’s Apache HTTP Server
  • Failure of the Host Server’s Space Volume storage system
  • Failure of the Host Server’s MySQL Database
  • Failure of the S3 compatible object store
  • Failure of the TeamDrive Scalable Hosting Storage (TSHS)

In case of an outage of the entire TeamDrive Host Server, the TeamDrive Clients will mark any existing Spaces on that Host Server as “Offline”.

However, it is still possible to work with these Spaces locally; the Clients will record any local changes (e.g. adding or removing files, making modifications) and queue these events for later submission once the Host Server is available again.

The following Client operations cannot be performed while the Host Server is unavailable:

  • Creating new Spaces
  • Publishing files
  • Inviting users to existing Spaces

In addition, the following administrative operations via the Host Server’s API or Administration Console will not be possible:

  • Creating new Space Depots via the API (e.g. using the Registration Server’s Administration Console)
  • Changing the limits of existing Space Depots
  • Assigning a default depot to a newly registered user upon first login

Except for the MySQL Server outage, this failure scenario can be avoided by setting up multiple instances of the Host Server behind a load balancer with failover capabilities and using a shared/scalable and redundant storage system for all nodes.

Space Volume Outage

The storage volume hosting the Space Volumes might become unavailable, e.g. because the mount point /spacedata/vol01 is missing or empty due to a failed mount after a reboot, an outage of the NFS server, or a network connection failure between the Host Server and the storage subsystem. The Host Server will notice the missing volume or Space data and log error messages to the Apache error log /var/log/httpd/error_log, e.g.:

[error] [client x.x.x.x] Unable to create space path: Possible Mount Error:
Volume Global ID required: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", but Space not
found: xx, path: /spacedata/vol01

or:

[error] [client 10.0.3.1] Space data root path missing: /spacedata/vol01

The Clients will receive a notification for Spaces hosted on that volume, indicating that the Space has been disabled for maintenance.

The Server will return to normal operation automatically, as soon as the missing volume is available again, re-enabling the affected Spaces on the Clients.
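
A minimal check-and-remount sketch that could be run manually or from the monitoring system, assuming the volume is defined in /etc/fstab (mountpoint is part of util-linux):

    # mountpoint -q /spacedata/vol01 || mount /spacedata/vol01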

Increasing the availability of the storage subsystem can be achieved in numerous ways and is highly dependent on the technology or vendor used. Consult the documentation of your storage technology for details/options.

MySQL Database Outage

A failure of the Host Server’s MySQL Database could be triggered by one of the following events:

  • Failure of the entire MySQL Server host system (e.g. a hardware or OS crash/failure)
  • Network failure that renders the MySQL Server unavailable for the Host Server
  • Failure of the MySQL Server’s mysqld process

The failure will be indicated by error messages in the following Host Server log files:

/var/log/td-hostserver.log:

[Error] -12036 (2002): Can't connect to local MySQL server through socket
'/var/lib/mysql/mysql.sock' (2)
[Error] "startup.yv" (80)

/var/log/mod_pspace.log:

[Error] db_connect(pspace_mdb.c:318) Failed to connect to default group:
[p1db]
[Error] db_connect(pspace_mdb.c:318) MySQL Config file:
/etc/td-hostserver.my.cnf
[Error] p1_send_xml_response(mod_pspace.c:986) Space x: [HTTP 503] Status:
0 Error code: 0

To mitigate the risk of a MySQL Server outage, consider setting up a cluster of MySQL Servers, using MySQL replication, DRBD or other replication and HA technologies to provide synchronization and redundancy.
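
A simple availability probe that can be wired into the monitoring software is mysqladmin ping; the host name and credentials below are placeholders:

    # mysqladmin -h 192.168.1.100 -u teamdrive -p ping
    mysqld is alive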

Outage of the td-hostserver Background Service

If the td-hostserver background service (process name yvvad) has failed or was not started at boot time, regular Client operations are not affected immediately.

However, the following background tasks are no longer performed, which may lead to unwanted consequences over time:

  • Spaces marked for deletion are no longer physically removed from the Space Volume’s file system, which could cause the file system to fill up over time.
  • The disk usage of Volumes and Spaces is no longer calculated. Clients will not be notified if they have reached their storage limits.
  • Monthly traffic statistics are not reset at the end of the month. Clients that have exceeded their traffic limit in the meantime might be blocked from synchronizing Space data once the td-hostserver service has been re-enabled.

See chapter Background Tasks Performed by td-hostserver in the TeamDrive Host Server Installation Guide for further details on the individual tasks performed by this service.

Restarting the td-hostserver background service will pick up where the previous process left off.

For increased redundancy, it is possible to run this service on each TeamDrive Host Server instance in a multi-server installation.
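
Since restarting the service is safe (it picks up where the previous process stopped), a simple watchdog can automate recovery. A hedged sketch as an /etc/cron.d entry that restarts the service if the yvvad process has disappeared (the five-minute interval is an arbitrary choice):

    */5 * * * * root pgrep -x yvvad > /dev/null || service td-hostserver restart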

Host Server Failover Test Plan

Based on the failover scenarios described in chapter Host Server Failover Considerations and Scenarios, the following tests should be performed to verify the correct behaviour and recovery from failures of individual TeamDrive Host Server components.

This test plan assumes an environment consisting of two virtualized TeamDrive Host Server instances (hostsrv01 and hostsrv02), located behind a load balancer, using a shared NFSv4 share for storing the Space data and using a dedicated MySQL Server instance (td-mysql) for storing Space management information. Other setups/configurations may require additional tests, depending on the environment.
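
For illustration, a minimal load balancer configuration with active health checks for this environment could look like the following HAProxy fragment (the choice of HAProxy, the IP addresses and the health check request are assumptions of this sketch, not requirements of the test plan, and a complete configuration would also need global/defaults sections with timeouts):

    frontend teamdrive_http
        bind *:80
        mode http
        default_backend teamdrive_hostservers

    backend teamdrive_hostservers
        mode http
        balance roundrobin
        option httpchk GET /
        server hostsrv01 10.0.3.11:80 check
        server hostsrv02 10.0.3.12:80 check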

Note

Note that the configuration described above contains several components for which no redundancy is provided; these components are therefore considered single points of failure (SPOF). In particular, the following components can become a SPOF:

  • The Host Server’s Space Volume. In this scenario, the Space data coming from the TeamDrive Clients is stored on an NFS server. If the NFS share becomes corrupted or unavailable, the TeamDrive Clients will be unable to synchronize Space data with their peers until the file system has been restored or made available again.
  • The MySQL database instance (td-mysql). If this instance becomes unavailable, the entire TeamDrive service will be affected and rendered unavailable until the service is restored.
  • The load balancer/firewall. If the public-facing load balancer/firewall fails, the TeamDrive service will be unavailable.

Single Host Server Instance Failure

An outage of one of the TeamDrive Host Server instances (hostsrv01 or hostsrv02) should be simulated/triggered in the following ways:

  • Shutting down the Apache HTTP Server by running service httpd stop.
  • Shutting down the network connection, e.g. by running service network stop, ifconfig eth0 down or by disconnecting the virtual network interface via the virtual machine management console.
  • Shutting down the entire virtual machine, e.g. via the virtual machine management console or by running poweroff.

Expected results:

  • The load balancer should detect that the Host Server instance is no longer available and redirect any incoming traffic to the remaining instance instead. If configured, a notification about the outage should be sent out to the monitoring software.
  • The monitoring software should raise an alert about the Host Server instance being unavailable, specifying the nature of the outage (e.g. httpd process missing, network unavailable, etc.).
  • The remaining Host Server instance should handle all incoming Client requests. The TeamDrive Service should not be impacted/affected in any way.

Once the outage has been resolved and the instance has recovered, the following is expected to happen:

  • The load balancer should detect that the Host Server instance is available again. Incoming traffic should be spread across both instances again.
  • The monitoring software should detect the service recovery and perform the respective actions (e.g. resetting the alert, sending an update notification).
  • The TeamDrive Service should continue unaffected throughout this process.

Multiple Host Server Failures

An outage of both TeamDrive Host Server instances (hostsrv01 and hostsrv02) should be simulated/triggered in the following ways:

  • Shutting down the Apache HTTP Servers by running service httpd stop on both instances.
  • Shutting down the network connections, e.g. by running service network stop, ifconfig eth0 down on both instances, or by disconnecting the virtual network interfaces via the virtual machine management console.
  • Shutting down both virtual machines, e.g. via the virtual machine management console or by running poweroff.

Expected results:

  • The load balancer should detect that the Host Server instances are no longer available and stop redirecting any incoming traffic to the instances. Incoming requests should be answered with an appropriate error code (HTTP error code 503 - Service Unavailable). If configured, a notification about the outage should be sent out to the monitoring software.
  • The monitoring software should raise an alert about the Host Server instances being unavailable, specifying the nature of the outage (e.g. httpd process missing, network unavailable, etc.).
  • The TeamDrive Service will be impacted/affected as outlined in chapter Entire Host Server Outage.

Once the outage has been resolved and at least one of the Host Server instances has been recovered, the following is expected to happen:

  • The load balancer should detect that the Host Server instance is available again. Incoming traffic should be redirected to the instance and incoming requests should no longer result in HTTP errors.
  • The monitoring software should detect the service recovery and perform the respective actions (e.g. resetting the alert, sending an update notification).
  • Once the TeamDrive Clients have noticed the service being available again, operations should proceed as before.

Testing Space Volume Outage

An outage of the TeamDrive Host Server instance’s Space Volume (NFS share) should be simulated/triggered in the following ways:

  • Detaching/unmounting the NFS share from the Space Volume mount point, for example by temporarily shutting down the Apache HTTP Server and the Host Server background tasks and unmounting the volume, e.g. by running the following commands:

    # service httpd stop
    # service td-hostserver stop
    # umount /spacedata/vol01
    # service td-hostserver start
    # service httpd start
    
  • If technically possible, disconnecting the NFS share from the virtual machine at run time, e.g. by detaching the network connection (by running ifconfig <device> down for the respective network interface, or by detaching the virtual network card from the virtual machine) or by shutting down the NFS server. Note that this operation may lead to data inconsistencies or file system corruption and should only be performed on non-critical test data.

Expected results:

  • The TeamDrive Host Server should detect the missing volume and react as outlined in chapter Space Volume Outage.
  • The monitoring software should raise an alert about the missing volume.
  • Optionally, the load balancer could be instructed to return an error (e.g. HTTP error code 503 - Service Unavailable), to fend off incoming Client requests until the outage has been resolved.

Once the outage has been resolved and the Space Volume has been mounted again, the following is expected to happen:

  • The Clients should continue the Space synchronization at the point where they were interrupted by the outage. Incomplete Spaces will be restarted. In case of a severe corruption of the Space Volume (e.g. file system errors), a restore from backup and a Space/Volume recovery might be required, as documented in the TeamDrive Host Server Administration Guide.
  • The monitoring software should detect the service recovery and perform the respective actions (e.g. resetting the alert, sending an update notification).
  • If configured, the load balancer should be instructed (e.g. by the administrator) to stop returning 503 errors to Client requests.

Testing MySQL Server Failures

An outage of the MySQL Server instance (td-mysql) should be simulated/triggered in the following ways:

  • Shutting down the MySQL Server by running service mysqld stop.
  • Shutting down the network connection, e.g. by running service network stop, ifconfig eth0 down or by disconnecting the virtual network interface via the virtual machine management console.
  • Shutting down the entire virtual machine, e.g. via the virtual machine management console or by running poweroff.

Expected results:

  • The TeamDrive Host Server instances will no longer be able to handle incoming Client requests as outlined in chapter MySQL Database Outage.
  • The monitoring software should raise an alert about the MySQL Server instance being unavailable, specifying the nature of the outage (e.g. mysqld process missing, network unavailable, etc.).

Once the outage has been resolved and the MySQL Server is available again, the following is expected to happen:

  • The TeamDrive Host Server instances will continue to operate where they were interrupted by the MySQL Server outage. The TeamDrive Clients will pick up where they left off, synchronizing all accumulated/pending changes.
  • The monitoring software should detect the service recovery and perform the respective actions (e.g. resetting the alert, sending an update notification).

Testing Load Balancer Failure

Since all TeamDrive instances are accessed through a load balancer, an outage of this component should be tested as well:

  • Shutting down the load balancer
  • Removing the network connections to the TeamDrive Server components

Expected results:

  • The TeamDrive Host Server instances will no longer be able to handle incoming Client or API requests as outlined in chapter Entire Host Server Outage.
  • The monitoring software should raise an alert about the load balancer instance being unavailable, specifying the nature of the outage.

Once the outage has been resolved and the load balancer is available again, the following is expected to happen:

  • The TeamDrive Host Server instances will continue to operate as soon as they receive incoming Client requests again. The TeamDrive Clients will pick up where they left off, synchronizing all pending changes that have accumulated in the meantime.
  • The monitoring software should detect the service recovery and perform the respective actions (e.g. resetting the alert, sending an update notification).