Knowledge of Kubernetes is a prerequisite. Disclaimer: some of these features are alpha, meaning that backward-incompatible changes may be released in upcoming versions. Always refer to the release notes when upgrading.
Taints and tolerations are Kubernetes features that ensure Pods are assigned to appropriate nodes. A taint is a node property composed of a key or a key/value pair (the value is optional) and an effect. Three effects are available: NoSchedule (new Pods that do not tolerate the taint are not scheduled on the node), PreferNoSchedule (the scheduler avoids the node but may still use it) and NoExecute (new Pods are not scheduled and running Pods that do not tolerate the taint are evicted).
In some cases, the cluster administrator needs to tolerate some taints. A toleration is a Pod property composed of a tolerated taint's key, an optional value, a toleration delay (infinite by default) and an operator. There are 2 types of operators: Equal, which matches a taint when both the key and the value are equal, and Exists, which matches as soon as the key is present, whatever its value. The sketch below shows both sides.
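As a minimal sketch, assuming a hypothetical node node-1 and an illustrative dedicated=gpu taint: the node is tainted first, then a Pod declaring a matching toleration remains schedulable on it.

```yaml
# Taint the node first (illustrative key/value):
#   kubectl taint nodes node-1 dedicated=gpu:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: app
    image: nginx
  tolerations:
  - key: "dedicated"
    operator: "Equal"       # the key and the value must match the taint
    value: "gpu"
    effect: "NoSchedule"
    # tolerationSeconds (the toleration delay) only applies to the
    # NoExecute effect; omitted, the toleration never expires
```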
Taints and tolerations introduce new behaviours in Kubernetes, such as tainting nodes according to their NodeConditions. This makes it possible to tolerate some states and to schedule Pods even though the node is tainted with NetworkUnavailable, for instance to diagnose the node or to prepare its deployment.
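For instance, a diagnostic Pod could tolerate the taint applied when the node's NetworkUnavailable condition is true; note that the exact taint key used below is an assumption and may vary across versions of the feature.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: network-diagnostic
spec:
  containers:
  - name: debug
    image: busybox
    command: ["sleep", "3600"]
  tolerations:
  - key: "node.kubernetes.io/network-unavailable"   # version-dependent key
    operator: "Exists"                              # tolerate the taint whatever its value
    effect: "NoSchedule"
```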
The scheduler is getting richer and richer, starting with the priority and preemption APIs that are moving towards beta support. Critical Pods will integrate directly with the priority API instead of relying on the rescheduler and taints. Preemption will become more complete in the next version, handling Pod preemption to place DaemonSet Pods when resources would otherwise forbid it. To get there, the assignment will no longer be handled by the DaemonSet controller but directly by the kube-scheduler. This makes the kube-scheduler a critical component, as early as the next version, when starting a cluster with tools that deploy the masters as DaemonSets. Preemption management when several schedulers are in use will arrive in version 1.11.

Performance-wise, version 1.10 will embed the predicates-ordering design. As a short reminder, the scheduling algorithm runs a suite of predicates, a set of functions, to determine whether a Pod can be hosted on a node. The eligible nodes are then ranked by a set of prioritisation functions to elect the best-fitted node. The scheduler will define the predicates' execution order so that the most restrictive and least complex predicates are executed first. This optimises execution time: as soon as a predicate fails, the remaining predicates are not executed. This last part will be detailed further in a following article, to present the work done on it by Googler Bobby Salamat.
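To give an idea of the priority API, here is a hedged sketch (names and values are illustrative, and the scheduling.k8s.io group version will change as the feature graduates from alpha): a PriorityClass is created, then referenced by a Pod through priorityClassName.

```yaml
apiVersion: scheduling.k8s.io/v1alpha1   # group version changes as the API graduates
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000                  # higher value means higher priority
globalDefault: false
description: "Pods allowed to preempt lower-priority Pods when resources are scarce."
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: high-priority   # resolved by the scheduler to the priority value above
  containers:
  - name: app
    image: nginx
```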
Kubernetes is becoming more modular and easier to extend through external contributions, as the CNI (Container Network Interface) and CRI (Container Runtime Interface) already illustrate. It is now becoming important to split Kubernetes' core when it comes to:
Ever since Kubernetes 1.8, the Cloud Controller Manager (CCM) makes it possible for any cloud provider to integrate with Kubernetes. This is a great improvement, as providers no longer have to contribute directly to Kubernetes' code. Thus, the release pace is set by the provider and not by the Kubernetes community, which improves velocity and the variety of features proposed by cloud providers. Some providers are already developing their own implementation: Rancher, Oracle, Alibaba, DigitalOcean or Cloudify.
This works through a plugin mechanism: any cloud provider implementing cloudprovider.Interface and registering with Kubernetes can be linked into the CCM binary. In the next releases, every cloud provider (including providers already supported in kubernetes/kubernetes) will implement this interface outside of the Kubernetes repository, making the project more modular. For instance, here is the roadmap discussed by the Azure community.
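As a rough, hypothetical sketch of what running an out-of-tree provider can look like (the provider name, image and labels are assumptions): kubelets are started with --cloud-provider=external, and the provider's CCM runs in-cluster, for instance as a DaemonSet on the masters.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: mycloud-cloud-controller-manager   # hypothetical provider
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: mycloud-ccm
  template:
    metadata:
      labels:
        app: mycloud-ccm
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""   # run on the masters only
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: cloud-controller-manager
        image: mycloud/cloud-controller-manager:v0.1.0   # hypothetical image
        command:
        - /cloud-controller-manager
        - --cloud-provider=mycloud     # name registered against cloudprovider.Interface
        - --leader-elect=true
```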
Launched in alpha in version 1.9, CSI (Container Storage Interface) is a spec that allows any provider implementing it to ship a driver through which Kubernetes can manage storage. CSI will go beta in version 1.10. The pain point addressed by CSI is twofold. Like the CCM, it externalises storage drivers and lets providers define their own release pace. Secondly, CSI solves the painful installation process of FlexVolume plugins. If you want to write your own driver, see here how to deploy it.
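Once a CSI driver is deployed, it is consumed through the usual storage objects. Here is a hedged sketch assuming a hypothetical driver named csi.example.com: a StorageClass points at the driver, and a PersistentVolumeClaim uses that class.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-csi
provisioner: csi.example.com      # must match the deployed CSI driver's name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: example-csi   # volumes are provisioned by the CSI driver
  resources:
    requests:
      storage: 10Gi
```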
Events have always been an issue in Kubernetes, both semantics-wise and performance-wise.
Currently, the whole semantics is embedded in a message, which makes extracting information and querying events considerably harder. Performance-wise, there is room for improvement. For now, Kubernetes has a deduplication process that removes identical entries, which decreases the memory footprint of etcd, Kubernetes' distributed key-value store. However, the number of requests to etcd remains an issue. Version 1.10 will introduce a new event-processing logic that should relieve this pressure. The principle is simple: event deduplication and event updates will happen periodically, which will greatly decrease the number of api-server write requests to etcd. This part will be further detailed in a following article, to present the related work by Googler Marek Grabowski and Heptio's Timothy St. Clair.
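To illustrate the current behaviour, here is a sketch of what a deduplicated Event object looks like today (the names, timestamps and message are made up): identical occurrences are folded into a single entry whose count field is patched, and it is precisely these repeated write requests that the new logic will batch.

```yaml
apiVersion: v1
kind: Event
metadata:
  name: my-pod.150a1b2c3d4e5f         # made-up name
  namespace: default
involvedObject:
  kind: Pod
  name: my-pod
  namespace: default
reason: FailedScheduling
message: "0/3 nodes are available: 3 Insufficient cpu."
type: Warning
count: 12                             # identical occurrences folded into one object
firstTimestamp: "2018-03-01T10:00:00Z"
lastTimestamp: "2018-03-01T10:05:00Z"
source:
  component: default-scheduler
```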
Kubernetes will make it possible to scale your Pods based on custom metrics. This feature is now beta. An aggregation layer extends the api-server beyond the native APIs provided by Kubernetes. It does so by adding 2 APIs:
There are various use cases, and this makes it possible to address each of their specific needs. This part will be detailed in a next article.
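As a hedged sketch of scaling on a custom metric (the Deployment name and the http_requests_per_second metric are assumptions, and the metric must be served by a metrics adapter registered behind the aggregation layer):

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                          # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods                         # per-Pod custom metric
    pods:
      metricName: http_requests_per_second
      targetAverageValue: "100"        # target average across all Pods
```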
The need to deploy multiple clusters is becoming more and more mainstream. Kube-fed makes it possible to manage them across on-premise sites, various cloud providers or different regions of the same hosting service, which brings considerable flexibility. Resources are relatively mature, and several features were added to federation:
The Kubernetes ecosystem is reaching a satisfying maturity level and keeps improving to address the many needs of its users. The possibility to extend Kubernetes offers users a way to address very specific use cases. The next articles will detail more thoroughly some of the features introduced here, such as the HPA principles, event management, etc.