Data can be the currency, intellectual property, and lifeblood of many a company. One technique to make sure that your data is readily available is data replication. It is not quite the same as data backup, but it can be equally important.
Data is the Foundation
Recently there has been a transition from physical products being the most critical aspect of many companies’ businesses to data being the key driver. This transformation started some time ago and has steadily progressed over time. While one can argue over the subtleties of whether a company actually makes a physical product or not, it is fairly clear that for almost all companies data has become, if not the key to their success, then very close to it.
An even more subtle, but perhaps more important, result of this transformation is the impact that these original “non-physical-product” companies have had on other companies. For example, the creation of spreadsheet applications has allowed companies to better manage themselves even if they make a physical product. In essence, these companies have become the tool-makers in the age of data. Using these non-physical tools, virtually all companies have become more efficient, more transparent, and better managed. Along the way, they had to create their own data, perhaps shifting the star of the show from the physical product to the data. I think one can safely argue that virtually all companies today rely on data to operate. The degree to which they rely on data varies, but for many, not having access to the data is equivalent to having the raw materials for their products taken away, or having factories shut down (a condition that seems to happen just about weekly in France these days).
Consequently, data has become the lifeblood of just about every company. Without it, operations come to a screeching halt, with the requisite injury to the company that results from any massive deceleration. Therefore, availability of and access to data are extraordinarily important for just about everyone, companies included.
There are many techniques for ensuring that data doesn’t disappear and is accessible in a timely manner. Techniques such as backups, off-site copies, disaster recovery sites, and replication are all used to make sure that data is available and “safe” at all times. The simplest concept for ensuring that data is available is to have multiple copies of the data in case something happens to the original copy. This can be accomplished in a number of ways, but the fundamental goal is to ensure that a mistake, accident, disaster, or other occurrence does not cause the complete loss of data.
In an effort to stave off disaster, let’s investigate data replication. While the terms “replication” and “backup” are sometimes used interchangeably, we’ll see that they are, in fact, very different from each other, but they can be used together and often are.
Typically the word replication, in a data context, is used to mean the process of sharing data between resources (storage) to make sure that they are consistent. In essence, it’s making sure that a copy of the data on one storage pool is mirrored on a secondary storage pool. Many times this means redundant storage resources but this isn’t always the case depending upon your definition of redundancy.
However, replication is different from a backup. A backup can keep some historical information about the data, allowing you to get to a previous version of the data (such as an earlier version of a document or of an application). Replication of the data means that the copy is an exact duplicate, or as close as possible, of the current data. Consequently, no historical information is kept. Simply put, backups keep records of past versions and replication is just a mirror of the current state of the data.
Therefore replication is not a replacement for a backup, but it can be used as a complement to backups. Replication allows immediate restoration of the data as it was when the primary storage went off-line. This can happen by using fail-over storage or by taking the secondary storage pool and using it as the primary storage pool (depending upon how the storage and servers are configured, this could involve rebooting the servers).
In contrast, restoring the current state of data from backups can take a great deal of time. Moreover, restoring from a backup will only restore the data to the point at which the backup was taken. This means that data created between the last backup and when the primary storage went off-line is lost. But backups can be very useful in restoring previous versions of data or restoring data that has been erased. The classic example is a user who just did “rm -rf” in their /home directory. Replication can’t restore any of the missing data since the recursive remove also removed the data from the secondary storage pool, but a backup can at least restore a version of the data from the time when the backup was created.
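To make the “rm -rf” example concrete, here is a minimal Python sketch of the difference. It is only a toy model: the class name MirroredStore, the in-memory dictionaries standing in for storage pools, and the file paths are all my own assumptions for illustration, not any real replication or backup tool.

```python
import copy


class MirroredStore:
    """Toy model (hypothetical): a primary pool, a mirrored replica, and backups."""

    def __init__(self):
        self.primary = {}   # path -> contents on the primary storage pool
        self.replica = {}   # mirror of the primary on the secondary pool
        self.backups = []   # historical point-in-time copies of the primary

    def write(self, path, data):
        self.primary[path] = data
        self.replica[path] = data        # replication mirrors every write

    def delete(self, path):
        self.primary.pop(path, None)
        self.replica.pop(path, None)     # ...and mirrors every delete too

    def take_backup(self):
        # A backup freezes the state of the primary at this moment.
        self.backups.append(copy.deepcopy(self.primary))


store = MirroredStore()
store.write("/home/user/report.txt", "draft 1")
store.take_backup()
store.write("/home/user/notes.txt", "written after the backup")

# The user removes everything; the replica faithfully follows along.
store.delete("/home/user/report.txt")
store.delete("/home/user/notes.txt")

print(store.replica)        # {}
print(store.backups[-1])    # {'/home/user/report.txt': 'draft 1'}
```

The replica ends up empty, while the backup still holds the older file. Note that notes.txt, written after the backup was taken, is gone from both.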
A common desire in using replication is to keep a copy of the data at a remote site. Exactly what “remote” constitutes depends upon your situation and requirements but the concept is that if the primary storage pool is lost due to an accident or a disaster such as a fire or a tornado, the second storage pool is in a different location and can be used in place of the primary storage. Consequently, people will sometimes refer to replication as a “disaster recovery” mechanism.
For non-database data storage there are typically two approaches to replication – (1) real-time replication and (2) point-in-time replication. The first option means that a write operation happens on the primary storage and also happens at the same time, or very shortly thereafter, on the replicated storage. The second option means that something like a snapshot is made on the primary storage and then replicated to the secondary storage. The point-in-time replication means that the secondary storage is not necessarily up to date and again you have a “gap” in data states on the storage pools similar to a backup. But this gap is typically fairly small.
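The “gap” in point-in-time replication can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: the two dictionaries stand in for storage pools, and the helper names replicate_snapshot and stale_keys are hypothetical, not part of any real tool.

```python
import copy

primary = {}     # stand-in for the primary storage pool
secondary = {}   # stand-in for the secondary storage pool


def replicate_snapshot():
    """Take a point-in-time snapshot of the primary and copy it to the secondary."""
    snapshot = copy.deepcopy(primary)   # frozen image of the primary
    secondary.clear()
    secondary.update(snapshot)


def stale_keys():
    """Return the keys whose contents differ between the two pools."""
    all_keys = primary.keys() | secondary.keys()
    return sorted(k for k in all_keys if primary.get(k) != secondary.get(k))


primary["a.txt"] = "v1"
replicate_snapshot()

# Writes that land after the snapshot are not yet on the secondary.
primary["a.txt"] = "v2"
primary["b.txt"] = "new"
gap_before = stale_keys()
print(gap_before)    # ['a.txt', 'b.txt']

# The next snapshot cycle closes the gap.
replicate_snapshot()
gap_after = stale_keys()
print(gap_after)     # []
```

How large the gap gets depends on how often snapshots are taken and shipped, which is why it is usually much smaller than the gap between backups.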
The rest of this article will focus on real-time replication. The phrase “real-time” has a special meaning in IT, but I will be using it more loosely, meaning that things aren’t strictly happening in real time. There are two techniques for real-time replication: synchronous and asynchronous.
In the case of synchronous replication, a write on the primary storage also takes place on the secondary storage at the same time. Both writes, the one on the primary storage and the one on the secondary storage, must complete for the “write” to complete. So if the write on either the primary or the secondary storage is slow, it blocks the completion of the write operation to the application. This can have an enormous impact on performance, which is why synchronous replication typically happens over very short distances (to reduce latency and improve overall performance) and over very reliable networks. Since the replication happens synchronously, there is no difference in the data between the two pools of storage. Typically synchronous replication is used when zero difference in the data between the two storage pools is desired (or required).
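A rough Python sketch of the synchronous case shows why the slower of the two pools sets the pace. Everything here is an assumption for illustration: the Store class, the sync_write function, and the sleep calls that stand in for disk and network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class Store:
    """Hypothetical storage pool whose writes take a fixed amount of time."""

    def __init__(self, latency_s):
        self.latency_s = latency_s
        self.blocks = {}

    def write(self, key, data):
        time.sleep(self.latency_s)   # stand-in for disk plus network time
        self.blocks[key] = data


def sync_write(primary, secondary, key, data):
    """Return only after BOTH pools have completed the write."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(s.write, key, data) for s in (primary, secondary)]
        for future in futures:
            future.result()          # blocks; re-raises if either write failed


primary = Store(latency_s=0.01)      # fast local pool
secondary = Store(latency_s=0.05)    # slower pool at the other end of a network

start = time.perf_counter()
sync_write(primary, secondary, "block0", b"payload")
elapsed = time.perf_counter() - start

# The application waited at least as long as the slower pool took.
print(f"write took {elapsed:.3f} s")
print(primary.blocks == secondary.blocks)   # True
```

The application sees the latency of the slower pool on every single write, which is the price paid for the two pools never differing.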
Asynchronous replication is more common because it relaxes the need for the writes on both the primary and the secondary to complete before the application’s write is reported as successful. Asynchronous replication allows the write operation to be completed on the primary storage so that the application can continue. The data is then copied, or replicated, from the primary storage to the secondary storage some time later. This means that at any instant in time the data on the primary and the secondary storage may not be exactly the same. However, the delay between them is usually fairly small, depending upon a number of factors such as the amount of data to be replicated, the network between the two storage pools, and the replication mechanism.
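The asynchronous case can be sketched in Python with a queue and a background thread. Again, this is only an illustration under my own assumptions: the class AsyncReplicatedStore and its method names are hypothetical, and the dictionaries stand in for real storage pools.

```python
import queue
import threading


class AsyncReplicatedStore:
    """Toy model (hypothetical): writes finish on the primary, replicate later."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self._pending = queue.Queue()   # writes not yet applied to the secondary

    def write(self, key, data):
        self.primary[key] = data
        self._pending.put((key, data))  # queued for later; the caller does not wait

    def lag(self):
        """Number of writes the secondary is behind by."""
        return self._pending.qsize()

    def start_replicator(self):
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        while True:
            key, data = self._pending.get()
            self.secondary[key] = data
            self._pending.task_done()

    def wait_until_caught_up(self):
        self._pending.join()            # blocks until every queued write is applied


store = AsyncReplicatedStore()
for i in range(3):
    store.write(f"block{i}", f"data{i}")

# The replicator is not running yet, so the secondary is behind.
lag_before = store.lag()
secondary_before = dict(store.secondary)
print(lag_before, secondary_before)         # 3 {}

store.start_replicator()
store.wait_until_caught_up()
print(store.secondary == store.primary)     # True
```

In this sketch the replicator is started late on purpose so the lag is visible. In a real system it would run continuously, and the lag would depend on the factors listed above.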
Asynchronous replication is the most common type of replication because it allows slower networks or longer distances between storage pools to be used. Let’s examine two common options in Linux for asynchronous replication: rsync and DRBD.