Strengths and weaknesses of a cloud-hosted Android CI Server

Using a CI server is a well-established programming practice that is no longer open to debate. In some companies, it is even an area over which the IT department has regained control, managing and rationalizing the servers. Yet, as is often the case, mobile goes its own way: the technologies used can be considered non-standard within the company, the ecosystem is updated far more frequently than in other areas of computing, and the need to build iOS apps on a specific OS can be the final blow that leaves mobile developers on their own. As a result, they frequently end up installing a Mac mini in the open-plan office to run their builds.

While the problems of such installations are mostly shared between Android and iOS, the solutions differ. This article focuses on Android.

Limitations of an on-premise CI Server

When setting up a Jenkins server, the mobile developer has to deal with the following issues:

How do you install Jenkins on the machine? How do you make it start at boot, alongside the OS?
How do you allow or forbid network access to the machine from the company’s intranet or from the whole Internet?
With which privileges should Jenkins run (root or not)? Under which user account are the required binaries and the Android SDK installed? Are the read/write permissions managed correctly?
How do you handle Jenkins and OS updates? And, more frequently, Android SDK updates?

A mobile developer is not a DevOps engineer! With some luck, they will manage to deal with these issues and end up with a decent installation. Occasionally, a weakness of this setup will surface and disrupt the team, but overall it will be livable.

Regardless of the configuration and quality of the installation, this in-house CI server will have one serious limitation: it needs an Android simulator, emulator or connected device to run automated tests.

Why this need for an emulator? Android does not run on a standard JVM, but on Dalvik (or, more recently, ART). Until a relatively recent update to the tooling (support is still incomplete to this day), tests written by developers had to run on a Dalvik/ART VM, and thus on an emulator or a device. (Note 1)

Using a device is hard to industrialize: it has to stay on and plugged in at all times. Data transfer over the USB cable is slow but, more importantly, the computer sometimes loses its connection to the device, requiring someone to physically unplug and replug it. This is unthinkable for a CI server. The Android emulator is therefore the only viable solution. As it is quite slow to boot (it can take several minutes), developers often leave an emulator running continuously on the server. While this saves a few minutes on each build, it also brings a few limitations:

All the projects built on the CI server must work on the emulator's Android version.
The server can only run one job at a time. Otherwise, jobs may end up fighting over the emulator during the test execution stage.
The CI server can only run the tests against that one emulator's Android version.
All the builds break when the emulator crashes...
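The last limitation can at least be detected early. The following is a minimal sketch, not taken from the original setup, of a pre-build check that fails fast when the shared emulator is no longer online. It relies on the standard `adb devices` command; the function names `online_emulators` and `shared_emulator_is_up` are hypothetical, and the check assumes that `adb` is on the PATH of the CI user.

```python
import subprocess


def online_emulators(adb_devices_output):
    """Return the serials of emulators that `adb devices` reports as online."""
    serials = []
    for line in adb_devices_output.splitlines():
        parts = line.split()
        # Device entries look like "<serial> <state>"; the state "device" means online.
        # The header line and blank lines never match this pattern.
        if len(parts) >= 2 and parts[0].startswith("emulator-") and parts[1] == "device":
            serials.append(parts[0])
    return serials


def shared_emulator_is_up():
    """Hypothetical pre-build check; assumes adb is installed and on the PATH."""
    result = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    )
    return bool(online_emulators(result.stdout))
```

A job could call `shared_emulator_is_up()` as its first step and abort with a clear message instead of reporting misleading test failures. It does not remove the other limitations: the jobs still share a single emulator and a single Android version.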

Over the last few years, we have made it a habit to move to the cloud all the vital services that we use but find painful to maintain. Does the cloud have anything to offer from an "Android CI Server" perspective?

What the cloud offers

Mobile-specialized CI solutions can be found alongside more standard platforms that support Android. The following table was assembled by running a sample project (available on GitHub) on different platforms. This project contains a non-instrumented (JVM) test and an instrumented Espresso test, which checks that a TextView contains the text "Hello world!".

In the following table, the "Git" row tells whether the platform can read from any Git repository accessible over SSH/HTTPS, or whether it is tied to GitHub. "Job count" states the number of different jobs that can be defined and launched separately. "Distinction CI / Deploy" indicates whether it is possible to differentiate CI jobs from jobs that go as far as deploying the APK. "Emulator configuration" lists the available Android emulator versions, and whether the configuration has to be done manually (with code) or through a web interface. "Deploy" lists the plugins available from a web interface to deploy the app. "Manually" indicates that deploying is possible, but requires a Gradle task or a shell script.

Note: Configuring the emulator on SnapCI must be done manually through code. I could not manage to start an emulator within an hour of trying. The interface test could therefore not be run on this platform.

The first striking point is the lack of maturity of the different players, even though they all advertise Android support as one of their features.

GreenHouseCI and Travis might be sufficient for "basic" projects. Their main limitation is the inability to differentiate a CI job from a delivery job. It is also impossible to track any code quality metrics on those platforms.

CloudBees, which simply offers a Jenkins instance "as a service", is the only platform that seems mature enough today to host a professional project.

To compare the players on price, the example project used above is no longer relevant. To assemble the following table, I considered an existing project of about 140k LOC, on which every build takes 15 minutes and is run 10 times per working day. The plans I selected are the cheapest ones that allow at least 5 users and at least 2 builds in parallel.

Note: Prices are in USD.

Every player but CloudBees offers fixed-price plans, which depend on the number of users and the number of builds that can be run in parallel.

On CloudBees, the monthly plan starts at $60. You then have to add $1.32 per build hour. The price thus varies with the load. With the same parameters as above, the total monthly price would drop to $104 with 10-minute builds, and climb to $148 with 20-minute builds.
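The arithmetic behind these figures can be sketched as follows. The base price, hourly rate and build frequency come from the text; the figure of 20 working days per month is an assumption, inferred because it is the value that reproduces the $104 and $148 totals. The function name `cloudbees_monthly_cost` is purely illustrative.

```python
def cloudbees_monthly_cost(build_minutes, builds_per_day=10, working_days=20,
                           base_usd=60.0, usd_per_build_hour=1.32):
    """Estimate the monthly price in USD for a given build duration.

    working_days=20 is an assumption inferred from the totals quoted in the
    article; the other defaults are taken from the text.
    """
    build_hours = builds_per_day * working_days * build_minutes / 60
    return round(base_usd + usd_per_build_hour * build_hours, 2)


for minutes in (10, 15, 20):
    print(minutes, cloudbees_monthly_cost(minutes))
```

Under that assumption, the 15-minute build of the reference project works out to $126 per month, with 10-minute and 20-minute builds at $104 and $148 respectively.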

An Android CI Server on CloudBees

It is not really a surprise to find that Jenkins, the reference solution for continuous integration, is still the most powerful option when hosted in the cloud. With its "Dev@Cloud" offer, CloudBees provides a hosted Jenkins master instance. It launches each of your builds on a dedicated virtual machine instantiated specifically for the occasion. As you are probably already using a Jenkins server today, migrating your jobs and plugins is easy.

It's important to ensure that every instance booted to run one of your builds has an up-to-date Android SDK installed. The use of Jake Wharton's "sdk-manager" Gradle plugin is thus mandatory. The plugin checks that the SDK, build tools, support repositories and Google libraries are up to date before letting you compile your project.

All the "ops" issues that we enumerated in the first half of this article disappear on CloudBees. The Jenkins master never crashes (unless, of course, the whole platform experiences downtime). Everything is siloed: each build leads to a slave instance being booted. This instance can be as fast as you want. Finally, you can choose, for each and every job, which Android version the emulator runs when executing your test suite.

This last point is a strength and a weakness at the same time.

Each project can now run its instrumented and interface tests on the lowest Android version that your app supports. But you can also choose to run those tests on a configuration matrix that combines, for example, the Android version, the device's locale and the screen size.
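To give an idea of how quickly such a matrix grows, here is a small sketch that enumerates every combination of three axes. The axis values and the `build_matrix` helper are hypothetical and chosen only for illustration; they do not describe how CloudBees or any Jenkins plugin represents a matrix job.

```python
from itertools import product

# Hypothetical axis values, chosen only for illustration.
API_LEVELS = [16, 19]
LOCALES = ["en_US", "fr_FR"]
SCREEN_SIZES = ["480x800", "1080x1920"]


def build_matrix(api_levels, locales, screen_sizes):
    """Return one configuration dict per combination of the three axes."""
    return [
        {"api_level": api, "locale": locale, "screen": screen}
        for api, locale, screen in product(api_levels, locales, screen_sizes)
    ]


matrix = build_matrix(API_LEVELS, LOCALES, SCREEN_SIZES)
print(len(matrix))  # 2 * 2 * 2 = 8 configurations
```

Each configuration means one emulator boot plus one run of the test suite, so every extra axis value multiplies the build time, and on a pay-per-build-hour plan, the bill.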

Playing with the emulator versions exposes one of the limitations of CloudBees: the difficulty of booting emulators that run the latest Android versions. Official Android emulators are known to be "heavy", slow to boot and laggy. These flaws worsen as the Android version increases. Additionally, it is impossible to benefit from hardware acceleration (whether GPU or HAXM) or from x86/x86_64 system images on CloudBees. It is thus very difficult to reliably boot an emulator from version 4.4.4 onwards: the success rate is lower than 25%. (Note 2)

This limitation is frankly annoying, but is mitigated by various factors:

It is recommended to run your automated tests on the lowest Android version your project supports, because it is "easy" to unintentionally call methods that were added in a later version of the SDK. The current state of the market forces us to support at least Jelly Bean (API level 16), on which a third of Android devices are still running. On our projects, the Jelly Bean emulator boots reliably in less than 90 seconds.
Solutions are emerging to execute instrumented tests "as-a-service" on emulators or real devices. We can for example mention Amazon's "Device Farm", or Google's upcoming "Cloud Test Lab". Those tools will probably be mature enough by the time existing apps with large instrumented test harnesses target KitKat or Lollipop as their lowest supported version.
Projects that are being launched now, or were launched recently, should not run into the same problems. New application architectures (MVP or even MVVM), combined with improvements in development tools, allow developers to write mainly non-instrumented tests, leaving the use of emulators to the most complex scenarios and to interface tests.
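Given the unreliable boots described above, it can help to bound the emulator start-up with an explicit timeout rather than letting a job hang. The sketch below polls the `sys.boot_completed` system property through `adb`; the 90-second default mirrors the boot time quoted above, while the function names and the polling interval are assumptions made for illustration.

```python
import subprocess
import time


def boot_completed(serial):
    """Return True once the emulator reports sys.boot_completed == 1.

    Assumes adb is installed and on the PATH.
    """
    result = subprocess.run(
        ["adb", "-s", serial, "shell", "getprop", "sys.boot_completed"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() == "1"


def wait_for_boot(serial, timeout_s=90.0, poll_s=2.0, probe=boot_completed):
    """Poll `probe` until it succeeds or `timeout_s` elapses.

    `probe` is injectable so that the loop can be exercised without an emulator.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(serial):
            return True
        time.sleep(poll_s)
    return False
```

A job can then fail, or retry the boot, as soon as `wait_for_boot("emulator-5554")` returns False, instead of waiting for the test stage to time out on its own.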

Conclusion

On my previous Android project (140k lines of code, 3 years of development), the migration from an on-premise Jenkins server to CloudBees halved the build time of the CI job, and divided the build time of the delivery job by 4. Despite the emulator troubles discussed above, the migration has clearly been beneficial to the project.

The caveats around emulators should only slow down projects that are currently in production, support "very recent" Android versions (such as KitKat) as their minimum, and rely on large instrumented test harnesses. I hope that the rise of better Android emulators, or of more mature solutions to access emulators and devices "as-a-service", will allow me to reconsider this limit soon.

Notes:

Note 1: Robolectric, an unofficial tool that allows tests to run on the local JVM, has been around for a few years. This tool has trouble keeping up with the pace of developer-tool updates imposed by Google. As a result, your whole test suite can be broken by a simple build tools update. It is thus a solution that is debated in the community, and one that I decided to ignore in this article.
Note 2: On this topic, CloudBees support is powerless and recommends using custom slaves hosted on-premise, or on a cloud that supports hardware acceleration. However, it is precisely to avoid having to maintain custom slaves or VMs that we started this discussion about migrating to the cloud.