A Successful Backup Strategy in Practice: Insight into the WoltLab Cloud
In our first article on the technology of WoltLab Cloud, we described how we take advantage of the properties of the ZFS file system to best prevent data loss as a result of technical problems. However, ZFS cannot protect against all problems: Sometimes you are just unlucky and all disks in a redundancy group fail at the same time. ZFS also cannot protect against human error, such as when an administrator mistakenly deletes the wrong thread or user account. Confirmation prompts prevent most of these mistakes, but nobody's attention is always fully on the task, and then the damage is done.
To protect against such situations, proper data backups must be used. Unlike the automatic mirroring of all data in real time, data backups must meet different requirements in order to guard against the problems described above. Data backups worthy of the name must fulfill several criteria:
- Immutability
Once a backup has been created, it is not changed until it is deleted, as this is the only way to guarantee that the backup contains a consistent state and is functional.
- Regularity
When everything goes south, data loss must be a calculated risk that matches the individual requirements. A data backup that is only made irregularly every few weeks or even months offers no added value.
- Physical Separation
A data backup on the same device or the same server may fulfill the requirement of immutability and regularity, but is also lost in case of a failure.
- Explicit Versioning
Problems are often only discovered after a delay. A physically separate data backup that is overwritten daily by a new status may already contain faulty data from the live system after a weekend and is therefore useless. Explicit versioning must be performed with a retention period during which existing backups are not deleted.
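The interplay of versioning and retention can be illustrated with a minimal sketch. The function name `expired_backups` and the 14-day retention period are assumptions chosen for illustration; they are not values used in the WoltLab Cloud.

```python
from datetime import datetime, timedelta

def expired_backups(backup_times, now, retention):
    """Return the backup timestamps that are older than the retention period."""
    cutoff = now - retention
    return sorted(t for t in backup_times if t < cutoff)

# Hypothetical example: one backup per day for 20 days, 14-day retention.
now = datetime(2024, 1, 21)
backups = [now - timedelta(days=d) for d in range(20)]
to_delete = expired_backups(backups, now, timedelta(days=14))
to_keep = [t for t in backups if t not in to_delete]
```

Every backup inside the retention window remains available as a separate version, so a fault discovered after a weekend can still be rolled back to a state from before it occurred.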
In addition to the requirements for reliable data recovery, however, there are other requirements arising from the applicable laws and regulations or contracts. For example, data protection must be guaranteed. This means that data backups must typically be stored in encrypted form and the retention time must be limited to a period that allows recovery in the event of errors discovered late, without the backup effectively becoming a long-term archive of data that has already been deleted.
It should also be possible to restore data in an efficient manner. If the form of backup used does not allow efficient recovery, then the loss of data since the last backup is compounded by the downtime until the backup is fully restored. A company might not be able to access essential company data for a long period of time and would have to accept a loss of revenue as a result.
Backups in the WoltLab Cloud
For data backups in the WoltLab Cloud, we rely on BorgBackup (Borg for short), a proven industry-standard tool. Out of the box, Borg includes functions for transferring backups to a backup server, encrypting them, and conveniently versioning them. In addition, backups are immutable once created: a change is only possible by creating a modified backup and then deleting the original.
Borg can store backups in compressed form, which reduces the storage space required on the backup server and speeds up restores: smaller backups mean that less data has to be transferred from the backup server, so the backup is available more quickly. In sum, Borg meets all the necessary requirements for the backup process itself and is well tested and reliable as a standard program. The only remaining building blocks for a robust data backup are the physical separation of the backup server and the regularity of the backups.
Backups created using Borg on the application servers and encrypted locally are transferred to dedicated backup servers within the same data center. Like all servers in the WoltLab Cloud, the backup servers use ZFS as their file system. However, the ZFS configuration used here is specifically tailored to the requirements of Borg to ensure the best possible performance, and by storing the backups in the same data center, recovery is fast in the event of a failure.
Strict Separation of Key and Encrypted Data
A separate Borg repository with a separate key is used for each customer, and by encrypting the backups directly on the application servers, the backups are always encrypted on the backup server. The key used to decrypt the data is never transferred to the backup server. A potential compromise of the backup server therefore does not allow an attacker to access the unencrypted data. Even the compromise of a single key has no effect on other customers.
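As a rough sketch of what one repository and one key per customer can look like, the following helper builds a `borg create` command line and the matching environment. The repository path layout, the user name `borg`, the key directory, the use of Borg's keyfile mode, and the `zstd` compression setting are all assumptions for illustration; the article does not state which settings are used.

```python
def borg_create_command(customer_id, backup_host, source_dir, archive_name):
    """Build a `borg create` command for one customer's own repository."""
    # Hypothetical layout: one repository per customer on the backup server.
    repo = f"ssh://borg@{backup_host}/srv/borg/{customer_id}"
    return [
        "borg", "create",
        "--compression", "zstd",      # assumed compression algorithm
        f"{repo}::{archive_name}",    # REPOSITORY::ARCHIVE
        source_dir,
    ]

def borg_key_environment(customer_id, key_dir="/etc/borg-keys"):
    """Environment pointing Borg at a per-customer key file.

    Assumes keyfile mode, where the key is stored on the client
    (here: the application server) and not inside the repository.
    """
    return {"BORG_KEY_FILE": f"{key_dir}/{customer_id}.key"}

cmd = borg_create_command("customer-42", "backup1.example",
                          "/data/customer-42", "2024-01-21")
env = borg_key_environment("customer-42")
```

Because both the repository URL and the key file are derived from the customer identifier, a leaked key file only ever matches one repository.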
Conversely, we also protect existing backups from application server compromise: Access to the appropriate backup server is granted to the application servers via temporary SSH certificates. These certificates are only valid for a few minutes and limit access to making a new backup for a single customer. Deletion of existing data or access to another customer's Borg repository is thus reliably prevented. Once the SSH certificate has expired, access is no longer possible until the time of the next backup.
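One way to issue such a short-lived, tightly scoped certificate is OpenSSH's certificate authority support combined with a forced `borg serve` command. The sketch below only assembles the `ssh-keygen` command line. The file names, the principal `borg`, the five-minute validity, and the choice of `--append-only` with `--restrict-to-repository` are assumptions; the actual mechanism used in the WoltLab Cloud is not described in detail.

```python
def ssh_certificate_command(ca_key, public_key, customer_id, repo_path,
                            validity="+5m"):
    """Build an `ssh-keygen` call that signs a short-lived user certificate."""
    # The forced command confines the session to one repository and to
    # append-only access, whatever the client asks for.
    forced = f"borg serve --append-only --restrict-to-repository {repo_path}"
    return [
        "ssh-keygen",
        "-s", ca_key,                    # CA private key used for signing
        "-I", f"backup-{customer_id}",   # certificate identity (shows up in logs)
        "-n", "borg",                    # principal: assumed login user
        "-V", validity,                  # valid from now until now + 5 minutes
        "-O", "clear",                   # drop all default permissions
        "-O", f"force-command={forced}",
        public_key,                      # public key of the application server
    ]

cert_cmd = ssh_certificate_command(
    "backup_ca", "app1.pub", "customer-42", "/srv/borg/customer-42")
```

Since the restrictions are embedded in the signed certificate, the application server cannot widen its own access, and the certificate becomes useless once the validity window has passed.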
While storage on dedicated servers within the same data center protects against technical problems affecting an entire server, it cannot protect against the total loss of an entire data center in the event of a disaster, such as an unexpected flood. For this reason, we create replicas of the data in a second data center.
Once a night, all Borg repositories are transferred to a second data center that meets the requirements of the German Federal Office for Information Security (BSI) for geo-redundant data centers, such as being more than 200 km away from the primary data center and located on a different main river. The transferred data is persisted via a snapshot that is kept for 7 days and cannot be deleted or modified after the fact. In this way, the backups in the second data center are also protected from a compromise of the backup server.
Our “manager”, mentioned in the first article, keeps track of the last time each client was backed up and initiates a new backup every 19 to 20 hours. To do this, the manager connects to the application servers and starts the Borg-based process to create a new backup. As part of this process, the manager creates a new temporary SSH certificate to allow the application server to access the backup server.
At the end of the process, the application server marks as deleted all backups that have exceeded the maximum retention time. The actual release of storage space (“Borg Compaction”) is done in a separate process to ensure that the application server does not have access to actually delete backups. Compaction can work directly with the encrypted data, so this operation can be performed directly by the backup servers without knowledge of the key.
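The two-step deletion described here maps onto two separate Borg commands, `borg prune` and `borg compact` (the latter is available as a separate command since Borg 1.2). The sketch below builds both command lines; the 30-day retention value and the paths are placeholders, since the article does not state the actual retention time.

```python
def prune_command(repo_url, retention="30d"):
    """Run on the application server: marks expired archives as deleted.

    Needs the key, but against an append-only repository it frees no space.
    """
    return ["borg", "prune", "--keep-within", retention, repo_url]

def compact_command(repo_path):
    """Run on the backup server: actually releases the storage space.

    Works on the encrypted repository data, so no key is required.
    """
    return ["borg", "compact", repo_path]

prune = prune_command("ssh://borg@backup1.example/srv/borg/customer-42")
compact = compact_command("/srv/borg/customer-42")
```

Splitting the work this way means the party holding the key can never destroy data, and the party able to destroy data never holds the key.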
Backups are performed at a random interval of between 19 and 20 hours to distribute the backups of all customers evenly throughout the day and night. Each backup can then complete quickly at any point in time, instead of dozens of backups running at once and dragging out the process.
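The scheduling idea can be sketched in a few lines. The function name `next_backup_time` is hypothetical, and a uniform distribution over the interval is an assumption; the article only states that the interval is random and lies between 19 and 20 hours.

```python
import random
from datetime import datetime, timedelta

def next_backup_time(last_backup, rng=random):
    """Schedule the next backup 19 to 20 hours after the previous one."""
    delay_hours = rng.uniform(19, 20)  # assumed: uniformly distributed
    return last_backup + timedelta(hours=delay_hours)

# Simulate one customer's backup times over ten consecutive runs.
rng = random.Random(0)  # fixed seed only to make the example repeatable
times = [datetime(2024, 1, 1, 0, 0)]
for _ in range(10):
    times.append(next_backup_time(times[-1], rng))
```

Because the interval is shorter than 24 hours, each customer's backup drifts to an earlier time of day with every run, and the random component keeps customers from drifting in lockstep.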
To ensure that the backups are actually made, we log the execution of each backup. If the manager notices an error during creation, we are informed immediately so that we can check the situation. In addition, we perform regular spot checks to ensure that neither the backup nor the error notification is faulty. This is the only way we can ensure that we not only have backups, but that they actually work.
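A check for missed backups that does not depend on an error being reported could look like the following minimal sketch. The function name, the data structure, and the 24-hour threshold are assumptions for illustration and do not describe the actual monitoring in the WoltLab Cloud.

```python
from datetime import datetime, timedelta

def overdue_customers(last_success, now, max_age=timedelta(hours=24)):
    """Return customers whose last successful backup is older than max_age."""
    return sorted(c for c, t in last_success.items() if now - t > max_age)

# Hypothetical log of the last successful backup per customer.
now = datetime(2024, 1, 21, 12, 0)
last_success = {
    "customer-1": now - timedelta(hours=5),
    "customer-2": now - timedelta(hours=30),
    "customer-3": now - timedelta(hours=19),
}
overdue = overdue_customers(last_success, now)
```

Checking the age of the last success, rather than waiting for a failure message, also catches the case where the notification itself silently fails.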
As with the selection of the right file system in the form of ZFS, we leave nothing to chance when creating backups to ensure the best possible data security for the customer data entrusted to us. By using well-tested standard software in the form of Borg and our carefully planned infrastructure around Borg as the core component, we – and hopefully WoltLab Cloud customers – can sleep soundly at night without fear of serious data loss.
For customers with special requirements in business continuity management, we are happy to offer the option of additional automated replication of backups on customer-provided systems as part of our enterprise offering. Please feel free to contact us.