What is a Service Mesh?

A Service Mesh is a system that carries the requests and responses that microservices send to each other. This traffic ultimately travels from Pod to Pod the same way it always has, but by also passing through a Service Mesh layer it becomes subject to much more advanced observability and control. Think of a Service Mesh as a smarter network.


Service Meshes are at the forefront of Cloud Native infrastructure. As Kubernetes revolutionised compute—the execution of services—Service Meshes are a massive value-add in networking. They’re also replacing a lot of boilerplate that used to happen in application code, but which is better done by the infrastructure.


Much as projects like Terraform made the infrastructure team’s lives better, a Service Mesh is something that microservice developers and owners can use to better operate their applications. Service Meshes also augment the features of these applications, offering real top-line value, not just saving cost.


Operation


The internal workings of a Service Mesh are conceptually fairly simple: every microservice is accompanied by its own local HTTP proxy. These proxies perform all the advanced functions that define a Service Mesh (think about the kind of features offered by a reverse proxy or API Gateway). However, with a Service Mesh this is distributed between the microservices—in their individual proxies—rather than being centralised.


In a Kubernetes environment these proxies can be automatically injected into Pods, and can transparently intercept all of the microservices’ traffic; no changes to the applications or their Deployment YAMLs (in the Kubernetes sense of the term) are needed. These proxies, running alongside the application code, are called sidecars.
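In Istio, for instance, injection can be switched on per-Namespace with a label; the sketch below assumes a Namespace called demo (the name is illustrative, and the exact label mechanism varies between meshes):

```yaml
# Namespace opting in to Istio's automatic sidecar injection.
# Any Pod subsequently created in it has an Envoy sidecar added
# at admission time, with no change to the workload's own YAML.
apiVersion: v1
kind: Namespace
metadata:
  name: demo            # illustrative name
  labels:
    istio-injection: enabled
```

Removing the label (and restarting the workloads) removes the sidecars again, which is what makes incremental adoption straightforward.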


These proxies form the data plane of the Service Mesh, the layer through which the data—the HTTP requests and responses—flows. This is only half of the puzzle though: for these proxies to do what we want, they each need complex, individual configuration. Hence a Service Mesh has a second part, a control plane. This is one (logical) component which computes and applies config to all the proxies in the data plane. It presents a single, fairly simple API to us, the end users. We use that to configure the Service Mesh as a logical whole (e.g. service A can talk to service B, but only on path /foo and only at a rate of 10qps) and it takes care of the details of configuring the individual sidecars (in this case identifying the sidecar alongside service B, configuring it to recognise which requests are from service A, and applying its rate-limiting and ACL features).
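To make that concrete, here is a rough sketch of how the ACL half of that example might look in Istio. The names, Namespace, and service account are all assumed for illustration, and the rate-limiting half needs separate configuration so it's omitted here:

```yaml
# Allow requests to service-b's /foo path only when they carry
# service-a's identity; requests matching no rule are denied.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: service-b-allow-a   # illustrative name
  namespace: default
spec:
  selector:
    matchLabels:
      app: service-b        # applies to service-b's sidecar
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/default/sa/service-a"]
      to:
        - operation:
            paths: ["/foo"]
```

The point is that we declare the intent once, against the control plane's API, and it works out which sidecar needs which Envoy configuration.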





Any HTTP proxy implementation could be used for the sidecars (although they all have different features). However modern ones like Envoy are well-suited as they can be configured and reconfigured by a network API rather than consuming config files. They’re also designed so that they don’t drop traffic during reconfiguration, meaning uninterrupted service.


Features


Observability


One immediate benefit of implementing a Service Mesh is increased observability of the traffic in the cluster.


With all of the requests to our services running through sidecars, they’re in a perfect position to produce logs, metrics, and trace spans documenting what they see. This is nothing services can’t do of course, but now they don’t have to. Consistently-labelled statistics are generated for all workloads, with no code needed in the services, and all are sent to the same central collection point (such as a Prometheus server). This means, for example, that there is now one place to go to see p99 latency for all services, meaning you can easily spot outliers. Because these are HTTP proxies, they produce these statistics at Layer 7—e.g. latency broken down by HTTP path and method. It’s worth saying that although HTTP is by far the most common Layer 7 protocol in use, and very well supported by Service Meshes, it’s not the only one. Depending on the Service Mesh and hence its proxy implementation, most can see “inside” HTTP to give stats on gRPC, as well as understanding other TCP-based protocols like PostgreSQL and Kafka.


This first feature comes for “free”: just inject the sidecars to get the traffic flowing through them, and there’s the observability. That said there are a couple of caveats to be aware of.

The first is that trace events are automatically produced at the entrance and exit of each service, but for them to be stitched into useful spans, each service must forward the set of tracing headers it receives. This does require cooperation from the application code.


The second is that the sidecars do have some impact on performance since each extra hop in and out of the sidecar for request and response will inevitably add some latency. I won’t quote any numbers here because they’re both context-specific and can quickly become out-of-date, but good independent articles on the subject do exist, and your best bet as ever is to test your use cases in your environment, and compare the results to your performance requirements.


We should note as well that there is a theoretical cap on throughput (qps) imposed by the processing rate of the proxies, but this will likely be a non-issue as your services will be slower, unless they’re highly optimised (or written in Rust 😉 ).


Routing


If the previous Observability features were “passive”, we’re now onto “active” Service Mesh functions. These actions aren’t taken by default because they interfere with traffic flow, but if you choose to configure them they can offer great power and flexibility when building systems of microservices.


Since every traffic hop passes through its data plane, a Service Mesh can route traffic between microservices in sophisticated ways. For example, the following YAML configures the Istio Service Mesh to intercept all traffic heading for service-b, sending 90% to the incumbent version and 10% to a new test version (the destination.host entries are Kubernetes Services).


apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
    - service-b
  http:
    - route:
      - destination:
          host: service-b-v1
        weight: 90
      - destination:
          host: service-b-v2
        weight: 10

These percentages can be as precise as you want: the raw-Kubernetes “hack” of running 1 v2 Pod and 99 v1 Pods to get a 1%/99% split isn’t needed, for example. With Layer 7 routing the possibilities are myriad. Other examples include:

  • Inspecting the HTTP headers and sending, say, only users with Accept-Language: en-GB to v2, because perhaps we haven’t yet had a chance to translate our new version and want to release it to English speakers first.

  • Routing GET requests for a certain HTTP path to a caching service layer.
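The first of those could look something like this in Istio (a sketch, reusing the service-b hosts from the example above):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
    - service-b
  http:
    - match:
        - headers:
            accept-language:
              prefix: en-GB     # British-English users get v2
      route:
        - destination:
            host: service-b-v2
    - route:                    # everyone else stays on v1
        - destination:
            host: service-b-v1
```

Routes are matched in order, so the unconditional rule at the end acts as the catch-all default.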

Conclusions


Service Meshes add powerful features to an often unconsidered part of Kubernetes clusters: the network. Layer 7 observability, sophisticated traffic routing, and solid security can all be factored out of application microservices and instead configured as network functions. Or, perhaps more realistically, a Service Mesh can easily add them to apps that never had them in the first place. New innovations, like WASM Envoy plugins and sidecar-less eBPF meshes, are moving the space further forwards all the time. However, there are already several well-established Service Mesh projects which offer an easy route to all these benefits today. Deploying a Service Mesh also unlocks further value in the projects now built on top of them, like the Flagger progressive delivery operator. And they can be incrementally adopted (Kubernetes labels are used to opt in one Namespace, or even just one workload), meaning you can start safely experimenting today.