A practice associated with the web giants, Zero Downtime Deployment (ZDD) shares a characteristic with auto-scaling: it is talked about all the more because it is so rarely implemented.
ZDD is the victim of a vicious circle: it seems very complex because it is rarely practiced, and because it is rarely practiced, it is assumed to be very complex.
To break out of this impasse, this article presents a practical case, with supporting code.
The goal is not to convert everyone to ZDD, since it adds work and complexity, but to give you a clear picture of what achieving it involves, so that you can decide whether or not it is worth doing.
Our example is close to the one described in the first article: an application exposed via a REST API.
Initially, this application manages people, each person having zero or one address. This version is exposed under the /v1 prefix, with a single resource type, /person.
The next version of the application associates multiple addresses with a person, and is exposed under the /v2 prefix with two resources, /person and /address.
The application's selling point is its high availability, so it is essential that all operations are carried out without any interruption of service.
The API is public, so we do not control how it is used. It is therefore impossible to switch from /v1 to /v2 in a single step. The two versions will have to coexist to let all users migrate, and depending on the case this period may be very long.
Customers consume the APIs through intermediaries, so during this period they may use both /v1 and /v2.
In the rest of the article, /v1 and /v2 refer to the versions of the services, while v1 and v2 refer to the two ways in which the data is stored in the database.
It is possible to do this without a migration script in the traditional sense: you expose the /v2 services, and when they are called for data still in v1 format, you migrate that data on the fly. Data is thus migrated gradually as users call the /v2 APIs. This is the approach often taken with NoSQL databases.
Unfortunately, by doing so, the migration may never end, or else it may take a very long time (if you purge older data). During this time, you must maintain the additional code to support this case.
The other approach is to use a script, which ensures that the migration completes quickly. It is the same type of script you use for your usual migrations, except that it must take into account the fact that it runs at the same time as the code. Any operation that holds locks for more than a few milliseconds must therefore be avoided, and you must make sure you do not create deadlocks.
The migration must be done in a transactional way, to avoid inconsistencies in the data. In case of problems, the script must also be able to be interrupted and restarted without disrupting the running application.
This is the approach we will use in this article.
Without ZDD, the migration process is conventionally the following: stop the application, migrate the database, then deploy and restart the new version.
With ZDD, there is no longer any question of shutting down the service: the data migration takes place "hot", while the system is running.
There are two possible ways to do this, depending on whether the migration takes place before or after the /v2 services are opened:
Step | Code version | Migration after opening /v2 | Migration before opening /v2 |
---|---|---|---|
1 | 1 | Application serving /v1 with the v1 data model | Same |
2 | 1 | Change the database schema so it can store the data needed for /v2 | Same |
3 | 2 | Deploy a new version of the application exposing /v1 and /v2, working on the v1 or v2 data model | Deploy a new version of the application exposing /v1, working on the v1 or v2 data model |
4 | 2 | Migrate the data to v2 without disruption, taking into account the code serving /v1 and /v2 | Migrate the data to v2 without disruption, taking into account the code serving /v1 |
5 | 3 | Deploy a new version of the application exposing /v1 and /v2, working on the v2 data model | Same |
6 | 3 | Clean the database schema of v1 artifacts | Same |
7 | 4 | Deploy a new version of the application serving only /v2, working on the v2 data model | Same |
The first approach opens the /v2 services faster, because there is no need to wait for the data migration.
The second approach is simpler:

* Code versions 1 and 2 expose only the /v1 services, so you do not need to worry about what happens when a /v2 call accesses v1 data;
* since the /v2 services are not yet exposed, there are fewer data access patterns to consider when designing a migration that avoids data inconsistencies and deadlocks.

Unless your migration process is extremely long, the second approach is preferable, and it is the one used in the rest of the article.
/v1 and /v2 are in the same boat…
Migrations of open APIs pose two business problems and one technical problem.
The first problem, which also applies to closed APIs, is how to migrate the data from /v1 to /v2. I am not speaking from a technical point of view but from a business point of view: the semantics change between the two versions, so we must determine how to transform the data from /v1 to /v2 in a way that is logical and does not surprise the users of the API.
In our case the solution is immediate: /v1 has at most one address, and /v2 can have several, so the /v1 address simply becomes one of the /v2 addresses.
The other problem is how to interpret v2 data through /v1. If the API is open, your users can still call your /v1 services while the data is already in the v2 model.
This is often more complicated than the first problem because, as they evolve, APIs tend to become richer. Accessing the richer v2 data through the narrower prism of the /v1 API can be a real headache.
To keep this transition manageable, it is sometimes necessary to adapt the design of the /v2 API.
It is a balance to be struck between the ease of the transition, the restrictions you may have to impose on API callers, and the time you are willing to invest.
The technical problem is to implement the various services, including the compatibility layer, while always keeping the data consistent and without penalizing performance (too much). If the data is no longer structured in the same way between the two versions, handling compatibility may require combining data from several tables.
So in our example, in v1 the addresses are stored in the person table, while in v2 they are in a separate address table. During the compatibility period, a /v1 call that updates both the person's name and address must change the two tables in a transactional manner, to prevent a concurrent /v1 read from returning inconsistent data. In addition, this has to be done without taking too many locks in the database, otherwise data access becomes too slow.
The best strategy is to focus on an approach that you have mastered well and that gives acceptable results rather than a more efficient or faster but more complex solution.
In all cases, tests are absolutely essential.
To serve both versions of the API, you can use a single application or choose to split your code into two applications, one per service version. This choice has no bearing on ZDD, so we will not discuss it here. In our example, we chose to have a single application.
Without ZDD the situation is clear: we stop the application, the data is migrated, and we restart the application in the new version. So there is a before and an after.
With ZDD the migration takes place while the services are available, so there is an intermediate situation.
During this period, a given piece of data may still be stored in v1 format or already migrated to v2.
It is then necessary to determine which state the data is in: to know which code path to take, you have to know whether the data has been migrated or not. Moreover, the piece of code in charge of this check will run very often, so it must be very efficient.
In case of difficulty, the solution that should work in all cases is to add to the tables involved a number indicating the "schema version" of the corresponding data, incremented when the data is migrated. The check then becomes very simple and fast. The column has to be added in an extra upfront step, however, which increases the work required for the migration.
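As an illustration of this alternative (it is not the one used later in this article), here is a minimal sketch of the check, assuming a hypothetical schema_version column on the person table and plain JDBC access:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SchemaVersionCheck {

    // Returns the schema version stored alongside the person row.
    // Assumes a hypothetical "schema_version" column, incremented by the migration script.
    static int schemaVersionOf(Connection connection, long personId) throws SQLException {
        try (PreparedStatement stmt = connection.prepareStatement(
                "SELECT schema_version FROM person WHERE id = ?")) {
            stmt.setLong(1, personId);
            try (ResultSet rs = stmt.executeQuery()) {
                if (!rs.next()) {
                    throw new IllegalArgumentException("Unknown person " + personId);
                }
                return rs.getInt("schema_version");
            }
        }
    }
}
```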
If you choose to migrate the data after opening /v2, there is an additional case: a /v2 API is called while the data is still stored in v1 format. The data must then be migrated on the fly, in a transactional manner, while limiting the slowdown this induces.
To summarize, there are four situations:
 | Call to /v1 | Call to /v2 |
---|---|---|
Data stored in v1 format | Respond as before | (Only if migrating after opening /v2) Migrate the data on the fly |
Data stored in v2 format | v1 compatibility | Respond with the new semantics |
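To make these four situations concrete, here is a minimal sketch of the dispatch on the read path. It is not the code of the example project: the repository abstraction, the record types, and the method names are hypothetical.

```java
// Sketch of the dispatch between the four situations, on the read path.
// All names below are hypothetical illustrations, not taken from the example project.
public class PersonReadDispatcher {

    /** Minimal repository abstraction assumed for the sketch. */
    interface PersonRepository {
        boolean isMigrated(long id);       // is the data already in v2 format?
        PersonV1 readV1(long id);          // read a person stored in v1 format
        PersonV2 readV2(long id);          // read a person stored in v2 format
        void migrateToV2(long id);         // on-the-fly, transactional migration
    }

    record PersonV1(String name, String address) {}
    record PersonV2(String name, java.util.List<String> addresses) {}

    private final PersonRepository repository;

    public PersonReadDispatcher(PersonRepository repository) {
        this.repository = repository;
    }

    /** Call to /v1: respond as before, or apply the v1 compatibility on v2 data. */
    PersonV1 getV1(long id) {
        if (!repository.isMigrated(id)) {
            return repository.readV1(id);
        }
        PersonV2 person = repository.readV2(id);
        // v1 compatibility: pick one address from the list (see the discussion further down).
        String address = person.addresses().isEmpty() ? null : person.addresses().get(0);
        return new PersonV1(person.name(), address);
    }

    /** Call to /v2: migrate on the fly only if /v2 was opened before the migration. */
    PersonV2 getV2(long id) {
        if (!repository.isMigrated(id)) {
            repository.migrateToV2(id);
        }
        return repository.readV2(id);
    }
}
```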
Open /v2, and properly decommission /v1
When you open /v2 for the first time, pay attention to how the switch to the new version is made.
Before making the new endpoints accessible, make sure that all servers are running the latest version of the application. Otherwise, if a /v1 call reaches the old code while the corresponding data has already been migrated to v2, that code will not be able to read it correctly and may crash or return incorrect information.
Another point is how you implement data changes when a /v1 API is called.
The first option is to always save the data in v2 format, but this means that the previous versions of the application will no longer be able to read it. The easiest solution is then to use feature flipping to switch the code over.
Otherwise, your code must detect in which format the data is stored and re-save it in the same format: a piece of data in v1 stays in v1, and a piece of data in v2 stays in v2. Feature flipping is avoided, but in exchange the code is more complex.
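Here is a minimal sketch of the first option, the write path of a /v1 call guarded by a feature flag. The flag, the store abstraction and the method names are hypothetical, not taken from the example project.

```java
// Sketch of the write path for PUT /v1/people/{id}, guarded by a feature flag.
public class PersonWriteHandler {

    /** Minimal persistence abstraction assumed for the sketch. */
    interface PersonStore {
        void saveV1(long id, String name, String address);                    // person.address column
        void saveV2(long id, String name, java.util.List<String> addresses);  // separate address table
    }

    private final PersonStore store;
    private final boolean writesInV2Format; // feature flag, flipped once all servers run the new code

    public PersonWriteHandler(PersonStore store, boolean writesInV2Format) {
        this.store = store;
        this.writesInV2Format = writesInV2Format;
    }

    void putV1(long id, String name, String address) {
        if (writesInV2Format) {
            // New format: the single /v1 address becomes the only entry of the v2 address list.
            store.saveV2(id, name, address == null
                    ? java.util.List.of()
                    : java.util.List.of(address));
        } else {
            // Old format: keep writing data that the previous code version can still read.
            store.saveV1(id, name, address);
        }
    }
}
```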
To decommission /v1, it is enough to make its endpoints inaccessible; the deletion of the code can be done later.
As we have seen, ZDD relies heavily on the database, and in particular on its concurrent access features. If your business requirements are simple, you use an ORM, and you have automated performance tests, this is an area you rarely need to think about. But if you get it wrong, it is easy to block the database, trigger errors (in case of deadlocks), or return inconsistent results.
Our advice is to do your research upfront, or even build POCs, to avoid having to redo a design because your database does not behave the way you imagined. Do not trust memories or rumours: read the documentation for the exact version of the tool you are using in detail, and above all, test!
If you have never dug into these topics, or are rusty, the first migration will surely require a lot of work and give you a few cold sweats when you run it. But keep in mind that all the subsequent operations will involve the same concepts, and will therefore go much more smoothly.
REST has two characteristics that make it an ideal candidate for ZDD:
If your services are exposed in some other way, you will have to look into these issues. Sessions, like caches of all kinds, may require special attention if the data they contain changes structure between versions.
We assume that the data model directly follows the resources to be exposed: the address is initially a field of the person table, and is migrated to a separate address table.
We do not use a specific column to store the "schema version" of the objects. Instead, we check in the database how the data is stored: if the person table contains an address, the data is in v1 format; otherwise, we check for the existence of an address in the dedicated table. This avoids adding anything to the SQL schema, but increases the number of queries executed.
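A minimal sketch of this check, using plain JDBC for brevity (the example project goes through Hibernate) and the table layout described above:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Optional;

public class AddressLookup {

    /**
     * Reads a person's address, wherever it is stored: first the legacy
     * person.address column (v1 format), then the dedicated address table (v2 format).
     * Returns empty if the person has no address at all.
     */
    static Optional<String> findAddress(Connection connection, long personId) throws SQLException {
        try (PreparedStatement stmt = connection.prepareStatement(
                "SELECT address FROM person WHERE id = ?")) {
            stmt.setLong(1, personId);
            try (ResultSet rs = stmt.executeQuery()) {
                if (!rs.next()) {
                    throw new IllegalArgumentException("Unknown person " + personId);
                }
                String legacyAddress = rs.getString("address");
                if (legacyAddress != null) {
                    return Optional.of(legacyAddress); // still stored in v1 format
                }
            }
        }
        // No legacy address: look for one in the dedicated table (v2 format).
        try (PreparedStatement stmt = connection.prepareStatement(
                "SELECT address FROM address WHERE id_person = ? LIMIT 1")) {
            stmt.setLong(1, personId);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? Optional.ofNullable(rs.getString("address")) : Optional.empty();
            }
        }
    }
}
```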
Steps to follow for the migration:

1. The address is stored in the address column of the person table; the code only works this way.
2. Create the address table in the database; at this stage the code does not yet know about this table.
3. Deploy a new version of the application that still exposes only /v1 and is compatible with both ways of storing the address.
4. Migrate the data from the person table to the address table.
5. Deploy a new version of the application that exposes /v1 and /v2 and is compatible only with the new way of storing the address; the address column of the person table is no longer used by the code.
6. Delete the address column of the person table.

ZDD therefore adds code versions and intermediate schema migrations. In an environment where deployments are not automated, this means more work and therefore more risk of error, so it is better to get properly tooled up and have a reliable delivery pipeline before you start.
In our example, the compatibility problem is the following: once a person has been migrated, they can have several addresses. What do you return when that same person is retrieved through the /v1 API?
Here there is no obvious answer: there is no notion of preferred address, or of last address used, that would provide a way to discriminate between the different possibilities. Since the answer affects the behaviour of the API, it is a decision for the business experts.
The solution chosen here is to return one address from the list. It is not perfect, but it can be acceptable depending on how the API is used: again, it is up to the business experts to decide.
To solve the question of transactionality, we chose the simplest solution: take a lock on the corresponding rows of the person table.
If all the operations follow the same principle, this lock acts as a mutex, ensuring that the calls are executed one after the other: when an operation poses a risk, it starts by requesting this lock, and must therefore wait its turn.
Example with a call to PUT /v1/people/127 while the corresponding person is stored in v2 format but does not yet have an address.
Example without a lock:
Execution thread 1 | Execution thread 2 |
---|---|
PUT /v1/people/127/addresses | PUT /v1/people/127/addresses |
BEGIN | BEGIN |
SELECT * from person where id = 127 to retrieve the person, verifies that there is no address and that the other fields are not to be modified | SELECT * from person where id = 127 to retrieve the person, verifies that there is no address and that the other fields are not to be modified |
SELECT * from address where id_person = 127 to retrieve an address to be updated, does not find it and deduces therefore that it is necessary to insert one | SELECT * from address where id_person = 127 to retrieve an address to be updated, does not find it and deduces therefore that it is necessary to insert one |
INSERT INTO address … to insert the address | INSERT INTO address … to insert the address |
commit | commit |
Result: the person ends up with two addresses!
Example with a lock:
Execution thread 1 | Execution thread 2 |
---|---|
PUT /v1/people/127/addresses | PUT /v1/people/127/addresses |
BEGIN | BEGIN |
SELECT address from person where id = 127 FOR UPDATE to retrieve the person, verifies that there is no address and that the other fields are not to be modified, and locks the row | SELECT address from person where id = 127 FOR UPDATE has to wait for the lock on person to be released |
SELECT * from address where id_person = 127 to retrieve an address to update, finds none and therefore deduces that one must be inserted | (waiting) |
INSERT INTO address … to insert the address | (waiting) |
commit, which releases the lock on person | (waiting) |
 | The SELECT … FOR UPDATE returns: it retrieves the person, verifies that there is no address in person and that the other fields are not to be modified, and locks the row |
 | SELECT id, address FROM address WHERE id_person = 127 to retrieve an address to update, finds the address inserted by the other thread |
 | UPDATE address set address = … where id = 4758 updates that address |
 | commit, which releases the lock on person |
Result: a single address.
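As an illustration, here is a minimal sketch of this locked write path in plain JDBC (the example project goes through Hibernate); it assumes the address table has an auto-generated id and an id_person foreign key, as in the queries above:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class LockedAddressUpdate {

    /**
     * Inserts or updates the single /v1 address of a person stored in v2 format.
     * The SELECT ... FOR UPDATE on the person row acts as the mutex described above.
     */
    static void putV1Address(Connection connection, long personId, String newAddress) throws SQLException {
        connection.setAutoCommit(false); // BEGIN
        try {
            // Take the lock on the person row; concurrent callers wait here.
            try (PreparedStatement lock = connection.prepareStatement(
                    "SELECT address FROM person WHERE id = ? FOR UPDATE")) {
                lock.setLong(1, personId);
                try (ResultSet rs = lock.executeQuery()) {
                    if (!rs.next()) {
                        throw new IllegalArgumentException("Unknown person " + personId);
                    }
                }
            }
            // Holding the lock, check whether an address row already exists.
            Long existingAddressId = null;
            try (PreparedStatement select = connection.prepareStatement(
                    "SELECT id FROM address WHERE id_person = ?")) {
                select.setLong(1, personId);
                try (ResultSet rs = select.executeQuery()) {
                    if (rs.next()) {
                        existingAddressId = rs.getLong("id");
                    }
                }
            }
            if (existingAddressId == null) {
                try (PreparedStatement insert = connection.prepareStatement(
                        "INSERT INTO address (id_person, address) VALUES (?, ?)")) {
                    insert.setLong(1, personId);
                    insert.setString(2, newAddress);
                    insert.executeUpdate();
                }
            } else {
                try (PreparedStatement update = connection.prepareStatement(
                        "UPDATE address SET address = ? WHERE id = ?")) {
                    update.setString(1, newAddress);
                    update.setLong(2, existingAddressId);
                    update.executeUpdate();
                }
            }
            connection.commit(); // releases the lock on person
        } catch (SQLException e) {
            connection.rollback();
            throw e;
        }
    }
}
```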
The migration script moves the data in blocks from the person table to the address table.
In our example, once the code has been switched to the new version, all data is written in v2 format, whether it is a creation or a modification. The migration is therefore irreversible, and we know that migrating all the data once is enough for the job to be done.
The script starts by retrieving the highest id in the person table. As the script is launched after the new version is deployed, every person created after that point has their address stored in the address table, so the script can stop at this value.
The script then iterates over the person table in blocks, from 0 up to the id it has just retrieved. The block size is to be determined experimentally: a larger block means fewer queries and therefore a shorter total migration time, at the cost of a longer unit time for each iteration, and thus a longer time during which locks are held in the database.
Each iteration of the script:

1. retrieves the ids of the people in the block who have an address, and locks them;
2. copies their addresses into the address table with an INSERT … SELECT …;
3. clears the address field of these entries in the person table.

If the script is stopped, the data already migrated is not lost, and restarting the script does not cause any problem, since migrated data is not reprocessed.
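Here is a minimal sketch of one such iteration, again in plain JDBC (the example project drives its schema changes with Liquibase and its data access with Hibernate); the block boundaries and column names follow the description above:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class AddressMigrationScript {

    /**
     * Migrates one block of people, with ids in [fromId, toId), from the
     * person.address column to the address table. Safe to interrupt and re-run:
     * rows already migrated have a NULL person.address and are simply skipped.
     */
    static void migrateBlock(Connection connection, long fromId, long toId) throws SQLException {
        connection.setAutoCommit(false); // one transaction per block
        try {
            // 1. Lock the people of the block that still have a legacy address.
            try (PreparedStatement lock = connection.prepareStatement(
                    "SELECT id FROM person WHERE id >= ? AND id < ? AND address IS NOT NULL FOR UPDATE")) {
                lock.setLong(1, fromId);
                lock.setLong(2, toId);
                lock.executeQuery();
            }
            // 2. Copy their addresses into the address table.
            try (PreparedStatement copy = connection.prepareStatement(
                    "INSERT INTO address (id_person, address) "
                  + "SELECT id, address FROM person WHERE id >= ? AND id < ? AND address IS NOT NULL")) {
                copy.setLong(1, fromId);
                copy.setLong(2, toId);
                copy.executeUpdate();
            }
            // 3. Clear the legacy column so these rows are now read in v2 format.
            try (PreparedStatement clear = connection.prepareStatement(
                    "UPDATE person SET address = NULL WHERE id >= ? AND id < ? AND address IS NOT NULL")) {
                clear.setLong(1, fromId);
                clear.setLong(2, toId);
                clear.executeUpdate();
            }
            connection.commit();
        } catch (SQLException e) {
            connection.rollback();
            throw e;
        }
    }
}
```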
In detail, the migration proceeds as follows:

1. Initial situation: the application exposes /v1, and the address is stored in the address column of the person table.
2. Creation of the address table, not yet used by the code. Creating a table does not normally have any impact on the database, but this must be verified.
3. Deployment of a version that still exposes only /v1, stores the address in the address table, and knows how to read it in both places. During a /v1 read of a piece of data in v1 format, the data is not migrated to v2, to keep the code simpler.
4. Migration of the data to the address table.
5. Deployment of a version that exposes /v1 and /v2 and reads the address only in v2 format; the address column of the person table is removed from the code, but the column still exists in the database.
6. Deletion of the address column of the person table. In some databases, deleting a column triggers a rewrite of the entire table and therefore cannot be done in ZDD. We then settle for a logical deletion, for example by prefixing the column name with an underscore and "recycling" it when a new column is needed.

The implementation is on GitHub. The code is open source under the MIT license, so you can use it.
Each step of the migration is in a separate module, which makes it easy to examine what happens at each stage without having to dig through the git history.
The code is in Java and uses the Dropwizard framework. The database is PostgreSQL, data access goes through Hibernate, and the SQL migrations use Liquibase.
Some highlights:

* The address column could in fact have been dropped for real, because PostgreSQL simply makes it invisible and reclaims the space later.

Doing ZDD is not magic: it takes work and rigour. If you can do without it, so much the better for you; but if you need it, you should now have a somewhat more precise idea of what it involves. Remember that the example developed here is a simple case: use it to get an idea of what needs to be done, not as a yardstick for estimating effort.
The first migration will surely be a bit of a challenge, but the next ones will be easier. In any case, do not forget to test, test, and test again!