linux, mac, windows

I used to think setting up your PATH for your shell - whichever shell you like - was easy. But then I got into a situation where I started using more than one shell on a regular basis (both PowerShell and Bash) and things started to break down quickly.

Specifically, I have some tools that are installed in my home directory. For example, .NET global tools get installed at ~/.dotnet/tools and I want that in my path. I would like this to happen for any shell I use, and I have multiple user accounts on my machine for testing scenarios so I’d like it to ideally be a global setting, not something I have to configure for every user.

This is really hard.

I’ll gather some of my notes here on various tools and strategies I use to set paths. It’s (naturally) different based on OS and shell.

This probably won’t be 100% complete, but if you have an update, I’d totally take a PR on this blog entry.

Shell Tips

Each shell has its own mechanism for setting up profile-specific values. In most cases this is the place you’ll end up setting user-specific paths - paths that require a reference to the user’s home directory. On Mac and Linux, the big takeaway is to use /etc/profile. Most shells appear to interact with that file on some level.


PowerShell has a series of profiles that range from system level (all users, all hosts) through user/host specific (current user, current host). The one I use the most is “current user, current host” because I store my profile in a Git repo and pull it into the correct spot on my local machine. I don’t currently modify the path from my PowerShell profile.

  • On Windows, PowerShell will use the system/user path setup on launch and then you can modify it from your profile.
  • On Mac and Linux, PowerShell appears to evaluate the /etc/profile and ~/.profile, then subsequently use its own profiles for the path. On Mac this includes evaluation of the path_helper output. (See the Mac section below for more on path_helper.) I say “appears to evaluate” because I can’t find any documentation on it, yet that’s the behavior I’m seeing. I gather this is likely due to something like a login shell (say zsh) executing first and then having that launch pwsh, which inherits the variables. I’d love a PR on this entry if you have more info.

If you want to use PowerShell as a login shell, on Mac and Linux you can provide the -Login switch (as the first switch when running pwsh!) and it will execute sh to include /etc/profile and ~/.profile execution before launching the PowerShell process. See Get-Help pwsh for more info on that.


Bash has a lot of profiles and rules about when each one gets read. Honestly, it’s pretty complex and seems to have a lot to do with backwards compatibility with sh along with a need for more flexibility and override support.

/etc/profile seems to be the way to globally set user-specific paths. After /etc/profile, things start getting complex, like if you have a .bash_profile then your .profile will get ignored.
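To make the "your .profile will get ignored" rule concrete, here is a minimal sketch of the lookup a bash login shell does for per-user files: it reads the first readable file among ~/.bash_profile, ~/.bash_login, and ~/.profile, and skips the rest. The function name bash_login_file is made up for this illustration.

```shell
# Print the per-user startup file a bash login shell would read from the given
# home directory: the first readable one of .bash_profile, .bash_login, .profile.
bash_login_file() {
  home_dir="$1"
  for candidate in .bash_profile .bash_login .profile; do
    if [ -r "$home_dir/$candidate" ]; then
      printf '%s\n' "$home_dir/$candidate"
      return 0
    fi
  done
  return 1
}

bash_login_file "$HOME" || echo "no per-user login file found; only /etc/profile applies"
```

This only covers login shells. Interactive non-login bash shells read ~/.bashrc instead, which is part of why the global /etc/profile route is attractive.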


zsh is the default login shell on Mac. It has profiles at:

  • /etc/zshrc and ~/.zshrc
  • /etc/zshenv and ~/.zshenv
  • /etc/zprofile and ~/.zprofile

It may instead use /etc/profile and ~/.profile if it’s invoked in a compatibility mode. In this case, it won’t execute the zsh profile files and will use the sh files instead. See the manpage under “Compatibility” for details or this nice Stack Overflow answer.

I’ve set user-specific paths in /etc/profile and /etc/zprofile, which seems to cover all the bases depending on how the command gets invoked.

Operating System Tips


Windows sets all paths in the System => Advanced System Settings => Environment Variables control panel. You can set system or user level environment variables there.

The Windows path separator is ;, which is different than Mac and Linux. If you’re building a path with string concatenation, be sure to use the right separator.

Mac and Linux

I’ve lumped these together because, with respect to shells and setting paths, things are largely the same. The only significant difference is that Mac has a tool called path_helper that is used to generate paths from a file at /etc/paths and files inside the folder /etc/paths.d. Linux doesn’t have path_helper.

The file format for /etc/paths and files in /etc/paths.d is plain text where each line contains a single path.


Unfortunately, path_helper doesn’t respect the use of variables - it will escape any $ it finds. This is a good place to put global paths, but not great for user-specific paths.
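As a rough illustration of the one-path-per-line format, this sketch joins the lines of a few files with : the way a PATH gets built. It only imitates the file format and is not how path_helper is actually implemented; the function name join_path_files and the sample directories are assumptions for the demo, and it runs against throwaway files rather than the real /etc/paths.

```shell
# Join the non-empty lines of the given files with ':' to form a PATH-style string.
# This imitates the /etc/paths file format; it is not a replacement for path_helper.
join_path_files() {
  result=""
  for file in "$@"; do
    [ -r "$file" ] || continue
    while IFS= read -r line || [ -n "$line" ]; do
      [ -n "$line" ] || continue
      result="${result:+$result:}$line"
    done < "$file"
  done
  printf '%s\n' "$result"
}

# Demo against temporary files instead of the real /etc/paths and /etc/paths.d.
demo_dir="$(mktemp -d)"
printf '%s\n' /usr/local/bin /usr/bin /bin > "$demo_dir/paths"
printf '%s\n' /opt/example/bin > "$demo_dir/10-example"
join_path_files "$demo_dir/paths" "$demo_dir/10-example"
rm -rf "$demo_dir"
```

Note that a line like $HOME/.dotnet/tools in one of these files would be treated as literal text, which is why user-specific paths need to go somewhere a shell will expand them.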

In /etc/profile there is a call to path_helper to evaluate the set of paths across these files and set the path. I’ve found that just after that call is a good place to put “global” user-specific paths.

if [ -x /usr/libexec/path_helper ]; then
  eval `/usr/libexec/path_helper -s`
fi
PATH="$PATH:$HOME/.dotnet/tools"


Regardless of whether you’re on Mac or Linux, /etc/profile seems to be the most common place to put these settings. Make sure to use $HOME instead of ~ to indicate the home directory. The ~ won’t get expanded and can cause issues down the road.
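A quick way to see the difference, at least for the quoted case: a ~ inside quotes is stored as a literal character, while $HOME expands like any other variable. The variable names here are just for the demo.

```shell
# A quoted ~ is stored literally, so PATH would end up containing the text "~/".
tilde_entry="~/.dotnet/tools"
# $HOME is an ordinary variable, so it expands inside double quotes.
home_entry="$HOME/.dotnet/tools"

printf 'tilde: %s\n' "$tilde_entry"
printf 'home:  %s\n' "$home_entry"
```

Programs that search PATH generally look for a directory literally named ~, which is why the tilde version can fail in confusing ways later.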

If you want to use zsh, you’ll want the PATH set block in both /etc/profile and /etc/zprofile so it handles any invocation.

The Mac and Linux path separator is :, which is different than Windows. If you’re building a path with string concatenation, be sure to use the right separator.
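If the same block ends up in both /etc/profile and /etc/zprofile, it can help to append idempotently so a directory doesn't show up twice. This is a minimal sketch for sh-compatible shells; path_append is a hypothetical helper name, not something the shells provide.

```shell
# Append a directory to PATH only if it is not already present.
path_append() {
  case ":$PATH:" in
    *":$1:"*) ;;                     # already on the PATH, nothing to do
    *) PATH="${PATH:+$PATH:}$1" ;;   # ':' is the Mac/Linux separator
  esac
}

path_append "$HOME/.dotnet/tools"
export PATH
```

Wrapping PATH in colons before matching avoids false positives on partial matches, such as /usr/bin matching /usr/bin/extra.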

kubernetes

I have a situation that is possibly kind of niche, but it was a real challenge to figure out so I thought I’d share the solution in case it helps you.

I have a Kubernetes cluster with Istio installed. My Istio ingress gateway is connected to an Apigee API management front-end via mTLS. Requests come in to Apigee then get routed to a secured public IP address where only Apigee is authorized to connect.

Unfortunately, this results in all requests coming in with the same Host header:

  1. A client sends a request to Apigee.
  2. Apigee gets that request and routes it to the cluster through the Istio ingress gateway over mTLS.
  3. An Istio VirtualService answers to hosts: "*" (any host header at all) and matches entirely on URL path - if it’s /v1/resource/operation it routes to mysvc.myns.svc.cluster.local/resource/operation.

This is how the ingress tutorial on the Istio site works, too. No hostname-per-service.

However, there are a couple of wrenches in the works, as expected:

  • There are some API endpoints on the service that aren’t exposed through Apigee. They’re internal-only operations that allow for service-to-service communications in the cluster but aren’t for outside callers.
  • I want to do canary deployments and route traffic slowly from an existing version of the service to a new, canary version. I need both the external and internal traffic routed this way to get accurate results.

The combination of these things is a problem. I can’t assume that the match-on-path-regex setting will work for internal traffic - I need any internal service to route properly based on host name. However, you also can’t match on hosts: "*" for internal traffic that doesn’t come through an ingress. That means I would need two different VirtualService instances - one for internal traffic, one for external.

But if I have two different VirtualService objects to manage, it means I need to keep them in sync over the canary, which kind of sucks. I’d like to set the traffic balancing in one spot and have it work for both internal and external traffic.

I asked how to do this on the Istio discussion forum and thought for a while that a VirtualService delegate would be the answer - have one VirtualService with the load balancing information, a second VirtualService for internal traffic (delegating to the load balancing one), and a third VirtualService for external traffic (also delegating to the load balancing one). It’s more complex, but I’d get the ability to control traffic in one spot.

Unfortunately (the word “unfortunately” shows up a lot here, doesn’t it?), you can’t use delegates on a VirtualService that doesn’t also connect to a gateway. That is, if it’s internal/mesh traffic, you don’t get the delegate support. This issue in the Istio repo touches on that.

Here’s where I landed.

First, I updated Apigee so it takes care of two things for me:

  1. It adds a Service-Host header with the internal host name of the target service, like Service-Host: mysvc.myns.svc.cluster.local. It more tightly couples the Apigee part of things to the service internal structure, but it frees me up from having to route entirely by regex in the cluster. (You’ll see why in a second.) I did try to set the Host header directly, but Apigee overwrites this when it issues the request on the back end.
  2. It does all the path manipulation before issuing the request. If the internal service wants /v1/resource/operation to be /resource/operation, that path update happens in Apigee so the inbound request will have the right path to start.

I did the Service-Host header with an “AssignMessage” policy.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<AssignMessage async="false" continueOnError="false" enabled="true" name="Add-Service-Host-Header">
    <DisplayName>Add Service Host Header</DisplayName>
    <Set>
        <Headers>
            <Header name="Service-Host">mysvc.myns.svc.cluster.local</Header>
        </Headers>
    </Set>
    <AssignTo createNew="false" transport="http" type="request"/>
</AssignMessage>

Next, I added an Envoy filter to the Istio ingress gateway so it knows to look for the Service-Host header and update the Host header accordingly. Again, I used Service-Host because I couldn’t get Apigee to properly set Host directly. If you can figure that out and get the Host header coming in correctly the first time, you can skip the Envoy filter.

The filter needs to run first thing in the pipeline, before Istio tries to route traffic. I found that pinning it just before the istio.metadata_exchange stage got the job done.

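apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: propagate-host-header-from-apigee
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
      app: istio-ingressgateway
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          filterChain:
            filter:
              name: "envoy.http_connection_manager"
              subFilter:
                # istio.metadata_exchange is the first filter in the connection
                # manager, at least in Istio 1.6.14.
                name: "istio.metadata_exchange"
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.lua
          typed_config:
            # Assumed type URL for the Envoy v2 Lua filter; adjust for your Envoy version.
            "@type": "type.googleapis.com/envoy.config.filter.http.lua.v2.Lua"
            inline_code: |
              function envoy_on_request(request_handle)
                local service_host = request_handle:headers():get("service-host")
                if service_host ~= nil then
                  request_handle:headers():replace("host", service_host)
                end
              end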

Finally, the VirtualService that handles the traffic routing needs to be tied both to the ingress and to the mesh gateway. The hosts setting can just be the internal service name, though, since that’s what the ingress will use now.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: mysvc
  namespace: myns
spec:
  gateways:
    - istio-system/apigee-mtls
    - mesh
  hosts:
    - mysvc
  http:
    - route:
        - destination:
            host: mysvc-stable
          weight: 50
        - destination:
            host: mysvc-baseline
          weight: 25
        - destination:
            host: mysvc-canary
          weight: 25

Once all these things are complete, both internal and external traffic will be routed by the single VirtualService. Now I can control canary load balancing in a single location and be sure that I’m getting correct overall test results and statistics with as few moving pieces as possible.
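Since the weights now live in one place, shifting traffic during a canary can be a single patch. Here is a hedged sketch that builds a JSON Patch for the three weights, assuming the routes sit at spec.http[0].route in the order stable, baseline, canary. The function name canary_weights_patch is hypothetical, and the sum-to-100 check mirrors the validation Istio applies to route weights.

```shell
# Print a JSON Patch that sets the three route weights on the VirtualService.
# Assumes the route order is stable, baseline, canary under spec.http[0].route.
canary_weights_patch() {
  stable="$1"; baseline="$2"; canary="$3"
  if [ $((stable + baseline + canary)) -ne 100 ]; then
    echo "weights must add up to 100" >&2
    return 1
  fi
  printf '[{"op":"replace","path":"/spec/http/0/route/0/weight","value":%d},' "$stable"
  printf '{"op":"replace","path":"/spec/http/0/route/1/weight","value":%d},' "$baseline"
  printf '{"op":"replace","path":"/spec/http/0/route/2/weight","value":%d}]\n' "$canary"
}

# Example: shift more traffic to the canary (not run here, needs a cluster):
# kubectl patch virtualservice mysvc -n myns --type=json -p "$(canary_weights_patch 40 30 30)"
canary_weights_patch 40 30 30
```

Generating the patch separately from applying it makes it easy to review or log what the pipeline is about to change.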

Disclaimer: There may be reasons you don’t want to treat external traffic the same as internal, like if you have different DestinationRule settings for traffic management inside vs. outside, or if you need to pass things through different authentication filters or whatever. Everything I’m working with is super locked down so I treat internal and external traffic with the same high levels of distrust and ensure that both types of traffic are scrutinized equally. YMMV.

mac, network

I have a Mac and my user account is attached to a Windows domain. The benefit of this is actually pretty minimal in that I can change my domain password and it propagates to the local Mac user account, but that’s about it. It seems to cause more trouble than it’s worth.

I recently had an issue where something got out of sync and I couldn’t log into my Mac using my domain account. This is sort of a bunch of tips and things I did to recover that.

First, have a separate local admin account. Make it a super complex password and never use it for anything else. This is sort of your escape hatch to try to recover your regular user account. Even if you want to have a local admin account so your regular user account can stay a user and not an admin… have a dedicated “escape hatch” admin account that’s separate from the “I use this sometimes for sudo purposes” admin account. I have this, and if I hadn’t, that’d have been the end of it.

It’s good to remember for a domain-joined account there are three security tokens that all need to be kept in sync: Your domain user password, your local machine OS password, and your disk encryption token. When you reboot the computer, the first password you’ll be asked for should unlock the disk encryption. Usually the token for disk encryption is tied nicely to the machine account password so you enter the one password and it both unlocks the disk and logs you in. The problem I was running into was those got out of sync. For a domain-joined account, the domain password usually is also tied to these things.

Next, keep your disk encryption recovery code handy. Store it in a password manager or something. If things get out of sync, you can use the recovery code to unlock the disk and then your OS password to log in.

For me, I was able to log in as my separate local admin account but my machine password wasn’t working unless I was connected to the domain. Only way to connect to the domain was over a VPN. That meant I needed to enable fast user switching so I could connect to the VPN under the separate local admin and then switch - without logging out - to my domain account.

Once I got to my own account I could use the Users & Groups app to change my domain password and have the domain and machine accounts re-synchronized. ALWAYS ALWAYS ALWAYS USE USERS & GROUPS TO CHANGE YOUR DOMAIN ACCOUNT PASSWORD. I have not found a way otherwise to ensure everything is in sync. Don’t change it from some other workstation, don’t change it from Azure Active Directory. This is the road to ruin. Stay with Users & Groups.

The last step was that my disk encryption token wasn’t in sync - OS and domain connection was good, but I couldn’t log in after a reboot. I found the answer in a Reddit thread:

su local_admin
sysadminctl -secureTokenStatus domain_account_username
sysadminctl -secureTokenOff domain_account_username \
  -password domain_account_password \
  interactive
sysadminctl -secureTokenOn domain_account_username \
  -password domain_account_password \
  interactive

Basically, as the standalone local admin, turn off and back on again the connection to the drive encryption. This refreshes the token and gets it back in sync.

Reboot, and you should be able to log in with your domain account again.

To test it out, you may want to try changing your password from Users & Groups to see that the sync works. If you get a “password complexity” error, it could be the sign of an issue… or it could be the sign that your domain has a “you can’t change the password more than once every X days” sort of policy and since you changed it earlier you are changing it again too soon. YMMV.

And, again, always change your password from Users & Groups.

kubernetes

I have a Kubernetes 1.19.11 cluster deployed along with Istio 1.6.14. I have a central instance of Prometheus for scraping metrics, and based on the documentation, I have a manually-injected sidecar so Prometheus can make use of the Istio certificates for mTLS during scraping. Under Prometheus v2.20.1 this worked great. However, I was trying to update some of the infrastructure components to take advantage of new features and Prometheus v2.21.0 and later just would not scrape.

These are my adventures in trying to debug this issue. Some of it is to remind me of what I did. Some of it is to save you some trouble if you run into the issue. Some of it is to help you see what I did so you can apply some of the techniques yourself.

TL;DR: The problem is that Prometheus v2.21.0 disabled HTTP/2 and that needs to be re-enabled for things to work. There should be a Prometheus release soon that allows you to re-enable HTTP/2 with environment variables.

I created a repro repository with a minimal amount of setup to show how things work. It can get you from a bare Kubernetes cluster up to Istio 1.6.14 and Prometheus using the same values I am. You’ll have to supply your own microservice/app to demonstrate scraping, but the prometheus-example-app may be a start.

I deploy Prometheus using the Helm chart. As part of that, I have an Istio sidecar manually injected just like they do in the official 1.6 Istio release manifests. By doing this, the sidecar will download and share the certificates but it won’t proxy any of the Prometheus traffic.

I then have a Prometheus scrape configuration that uses the certificates mounted in the container. If it finds a pod that has the Istio sidecar annotations (indicating it’s got the sidecar injected), it’ll use the certificates for authentication and communication.

- job_name: "kubernetes-pods-istio-secure"
  scheme: https
  tls_config:
    ca_file: /etc/istio-certs/root-cert.pem
    cert_file: /etc/istio-certs/cert-chain.pem
    key_file: /etc/istio-certs/key.pem
    insecure_skip_verify: true

If I deploy Prometheus v2.20.1, I see that my services are being scraped by the kubernetes-pods-istio-secure job, they’re using HTTPS, and everything is good to go. Under v2.21.0, I see the error connection reset by peer. I tried asking about this in the Prometheus newsgroup to no avail, so… I dove in.

My first step was to update the Helm chart extraArgs to turn on Prometheus debug logging.

extraArgs:
  log.level: debug

I was hoping to see more information about what was happening. Unfortunately, I got basically the same thing.

level=debug ts=2021-07-06T20:58:32.984Z caller=scrape.go:1236 component="scrape manager" scrape_pool=kubernetes-pods-istio-secure target= msg="Scrape failed" err="Get \"\": read tcp> read: connection reset by peer"

This got me thinking one of two things may have happened in v2.21.0:

  • Something changed in Prometheus; OR
  • Something changed in the OS configuration of the Prometheus container

I had recently fought with a dotnet CLI problem where certain TLS cipher suites were disabled by default and some OS configuration settings on our build agents affected what was seen as allowed vs. not allowed. This was stuck in my mind so I couldn’t immediately rule out the container OS configuration.

To validate the OS issue I was going to try using curl and/or openssl to connect to the microservice and see what the cipher suites were. Did I need an Istio upgrade? Was there some configuration setting I was missing? Unfortunately, it turns out the Prometheus Docker image is based on a custom busybox image where there are no package managers or tools. I mean, this is actually a very good thing from a security perspective but it’s a pain for debugging.

What I ended up doing was getting a recent Ubuntu image and connecting using that, just to see. I figured if there was anything obvious going on that I could take the extra steps of creating a custom Prometheus image with curl and openssl to investigate further. I mounted a manual sidecar just like I did for Prometheus so I could get to the certificates without proxying traffic, then I ran some commands:

curl \
  --cacert /etc/istio-certs/root-cert.pem \
  --cert /etc/istio-certs/cert-chain.pem \
  --key /etc/istio-certs/key.pem \
  --insecure -v

openssl s_client \
  -connect \
  -cert /etc/istio-certs/cert-chain.pem  \
  -key /etc/istio-certs/key.pem \
  -CAfile /etc/istio-certs/root-cert.pem \
  -alpn "istio"

Here’s some example output from curl to show what I was seeing:

root@sleep-5f98748557-s4wh5:/# curl --cacert /etc/istio-certs/root-cert.pem --cert /etc/istio-certs/cert-chain.pem --key /etc/istio-certs/key.pem --insecure -v
*   Trying
* Connected to ( port 9102 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/istio-certs/root-cert.pem
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: [NONE]
*  start date: Jul  7 20:21:33 2021 GMT
*  expire date: Jul  8 20:21:33 2021 GMT
*  issuer: O=cluster.local
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x564d80d81e10)
> GET /metrics HTTP/2
> Host:
> user-agent: curl/7.68.0
> accept: */*
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 2147483647)!
< HTTP/2 200

A few things in particular:

  1. I found the -alpn "istio" thing for openssl while looking through Istio issues to see if there were any pointers there. It’s always good to read through issues lists to get ideas and see if other folks are running into the same problems.
  2. Both openssl and curl were able to connect to the microservice using the certificates from Istio.
  3. The cipher suite shown in the openssl output was one that was considered “recommended.” I forgot to capture that output for the blog article, sorry about that.
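One detail in that curl output turned out to matter later: the protocol the server agreed to during ALPN. Here is a small sketch for pulling that out of curl -v output. It matches the "ALPN, server accepted to use" wording from the curl 7.68 output above, and other curl versions may word the line differently. The function name negotiated_alpn is made up for this example.

```shell
# Read `curl -v` output on stdin and print the protocol the server accepted
# during ALPN. Relies on the message wording used by curl 7.68.
negotiated_alpn() {
  sed -n 's/^\* ALPN, server accepted to use \(.*\)$/\1/p'
}

# Demo with two lines copied from the verbose output.
printf '%s\n' '* ALPN, offering h2' '* ALPN, server accepted to use h2' | negotiated_alpn
```

Against a live endpoint, curl writes its verbose log to stderr, so the usage would look like curl -v ... 2>&1 | negotiated_alpn.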

At this point I went to the release notes for Prometheus v2.21.0 to see what had changed. I noticed two things that I thought may affect my situation:

  1. This release is built with Go 1.15, which deprecates X.509 CommonName in TLS certificates validation.
  2. [CHANGE] Disable HTTP/2 because of concerns with the Go HTTP/2 client. #7588 #7701

I did see in that curl output that it was using HTTP/2 but… is it required? Unclear. However, looking at the Go docs about the X.509 CommonName thing, that’s easy enough to test. I just needed to add an environment variable to the Helm chart for Prometheus:

  - name: GODEBUG
    value: x509ignoreCN=0

After redeploying… it didn’t fix anything. That wasn’t the problem. That left the HTTP/2 thing. However, what I found was it’s hardcoded off, not disabled through some configuration mechanism so there isn’t a way to just turn it back on to test. The only way to test it is to do a fully custom build.

The Prometheus build for a Docker image is really complicated. They have this custom build tool promu that runs the build in a custom build container and all this is baked into layers of make and yarn and such. As it turns out, not all of it happens in the container, either, because if you try to build on a Mac you’ll get an error like this:

... [truncated huge list of downloads] ...
go: downloading v0.0.0-20170810143723-de5bf2ad4578
go: downloading v0.3.1
go: downloading v0.4.0
go build /usr/local/go/pkg/tool/linux_amd64/compile: signal: killed
!! command failed: build -o .build/linux-amd64/prometheus -ldflags -X -X -X -X -X  -extldflags '-static' -a -tags netgo,builtinassets exit status 1
make: *** [Makefile.common:227: common-build] Error 1
!! The base builder docker image exited unexpectedly: exit status 2

You can only build on Linux even though it’s happening in a container. At least right now. Maybe that’ll change in the future. Anyway, this meant I needed to create a Linux VM and set up an environment there that could build Prometheus… or figure out how to force a build system to do it, say by creating a fake PR to the Prometheus project. I went the Linux VM route.

I changed the two lines where the HTTP/2 was disabled, I pushed that to a temporary Docker Hub location, and I got it deployed in my cluster.

Success! Once HTTP/2 was re-enabled, Prometheus was able to scrape my Istio pods again.

I worked through this all with the Prometheus team and they were able to replicate the issue using my repro repo. They are now working through how to re-enable HTTP/2 using environment variables or configuration.

All of this took close to a week to get through.

It’s easy to read these blog articles and think the writer just blasted through all this and it was all super easy, that I already knew the steps I was going to take and flew through it. I didn’t. There was a lot of reading issues. There was a lot of trying things and then retrying those same things because I forgot what I’d just tried, or maybe I discovered I forgot to change a configuration value. I totally deleted and re-created my test Kubernetes cluster like five times because I also tried updating Istio and… well, you can’t really “roll back Istio.” It got messy. Not to mention, debugging things at the protocol level is a spectacular combination of “not interesting” and “not my strong suit.”

My point is, don’t give up. Pushing through these things and reading and banging your head on it is how you get the experience so that next time you will have been through it.

azure, kubernetes, linux, spinnaker

Kayenta is the subcomponent of Spinnaker that handles automated canary analysis during a deployment. It reads from your metric sources and compares the stats from an existing deployed service against a new version of the service to see if there are anomalies or problems, indicating the rollout should be aborted if the new service fails to meet specified tolerances.

I’m a huge fan of Spinnaker, but sometimes you already have a full CI/CD system in place and you really don’t want to replace all of that with Spinnaker. You really just want the canary part of Spinnaker. Luckily, you can totally use Kayenta as a standalone service. They even have some light documentation on it!

In my specific case, I also want to use Azure Storage as the place where I store the data for Kayenta - canary configuration, that sort of thing. It’s totally possible to do that, but, at least at the time of this writing, the hal config canary Halyard command does not have Azure listed and the docs don’t cover it.

So there are a couple of things that come together here, and maybe all of it’s interesting to you or maybe only one piece. In any case, here’s what we’re going to build:

Standalone Kayenta diagram

  • A Kubernetes ingress to allow access to Kayenta from your CI/CD pipeline.
  • A deployment of the Kayenta microservice.
  • Kayenta configured to use an Azure Storage Account to hold its configuration and such.

Things I’m not going to cover:

  • How exactly your CI/CD canary stage needs to work.
  • How long a canary stage should last.
  • How exactly you should configure Kayenta (other than the Azure part).
  • Which statistics you should monitor for your services to determine if they “pass” or “fail.”
  • Securing the Kayenta ingress so only authenticated/authorized access is allowed.

This stuff is hard and it gets pretty deep pretty quickly. I can’t cover it all in one go. I don’t honestly have answers to all of it anyway, since a lot of it depends on how your build pipeline is set up, how your app is set up, and what your app does. There’s no “one-size-fits-all.”

Let’s do it.


First, provision an Azure Storage account. Make sure you enable HTTP access because right now Kayenta requires HTTP and not HTTPS.

You also need to provision a container in the Azure Storage account to hold the Kayenta contents.

# I love me some PowerShell, so examples/scripts will be PowerShell.
# Swap in your preferred names as needed.
$ResourceGroup = "myresourcegroup"
$StorageAccountName = "kayentastorage"
$StorageContainerName = "kayenta"
$Location = "westus2"

# Create the storage account with HTTP enabled.
az storage account create `
  --name $StorageAccountName `
  --resource-group $ResourceGroup `
  --location $Location `
  --https-only false `
  --sku Standard_GRS

# Get the storage key so you can create a container.
$StorageKey = az storage account keys list `
  --account-name $StorageAccountName `
  --query '[0].value' `
  -o tsv

# Create the container that will hold Kayenta stuff.
az storage container create `
  --name $StorageContainerName `
  --account-name $StorageAccountName `
  --account-key $StorageKey

Let’s make a namespace in Kubernetes for Kayenta so we can put everything we’re deploying in there.

# We'll use the namespace a lot, so a variable
# for that in our scripting will help.
$Namespace = "kayenta"
kubectl create namespace $Namespace

Kayenta needs Redis. We can use the Helm chart to deploy a simple Redis instance. Redis must not be in clustered mode, and there’s no option for providing credentials.

helm repo add bitnami

# The name of the deployment will dictate the name of the
# Redis master service that gets deployed. In this example,
# 'kayenta-redis' as the deployment name will create a
# 'kayenta-redis-master' service. We'll need that later for
# Kayenta configuration.
helm install kayenta-redis bitnami/redis `
  -n $Namespace `
  --set cluster.enabled=false `
  --set usePassword=false `
  --set master.persistence.enabled=false
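Because the Redis service name is derived from the Helm release name, it can be handy to compute the connection URL rather than hard-code it. This is a small sketch in sh, assuming the "<release>-master" naming described in the comments above and the default Redis port of 6379. The function name redis_url_for_release is hypothetical.

```shell
# Build the Redis connection URL from the Helm release name.
# Assumes the chart names the master service "<release>-master" and that
# Redis listens on its default port, 6379.
redis_url_for_release() {
  printf 'redis://%s-master:6379\n' "$1"
}

redis_url_for_release kayenta-redis
```

For the kayenta-redis release used here, that prints redis://kayenta-redis-master:6379, which is the value the Kayenta configuration needs for its Redis connection.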

Now let’s get Kayenta configured. This is a full, commented version of a Kayenta configuration file. There’s also a little doc on Kayenta configuration that might help. What we’re going to do here is put the kayenta.yml configuration into a Kubernetes ConfigMap so it can be used in our service.

Here’s a ConfigMap YAML file based on the fully commented version, but with the extra stuff taken out. This is also where you’ll configure the location of Prometheus (or whatever) where Kayenta will read stats. For this example, I’m using Prometheus with some basic placeholder config.

apiVersion: v1
kind: ConfigMap
metadata:
  name: kayenta
  namespace: kayenta
data:
  kayenta.yml: |-
    server:
      port: 8090

    # This should match the name of the master service from when
    # you deployed the Redis Helm chart earlier.
    redis:
      connection: redis://kayenta-redis-master:6379

        enabled: false

        enabled: false

    # This is the big one! Here's where you configure your Azure Storage
    # account and container details.
      azure:
        enabled: true
        accounts:
          - name: canary-storage
            storageAccountName: kayentastorage
            # azure.storageKey is provided via environment AZURE_STORAGEKEY
            # so it can be stored in a secret. You'll see that in a bit.
            # Don't check in credentials!
            accountAccessKey: ${azure.storageKey}
            container: kayenta
            rootFolder: kayenta
            supportedTypes:
              - OBJECT_STORE

        enabled: false

        enabled: false

        enabled: false

        enabled: false

    # Configure your Prometheus here. Or if you're using something else, disable
    # Prometheus and configure your own metrics store. The important part is you
    # MUST have a metrics store configured!
      prometheus:
        enabled: true
        accounts:
        - name: canary-prometheus
          endpoint:
            baseUrl: http://prometheus:9090
          supportedTypes:
            - METRICS_STORE

        enabled: true

        enabled: false

        enabled: false

        enabled: true

        enabled: false

        enabled: false

        enabled: false

        enabled: false

        enabled: false

    # Enable the SCAPE endpoint that has the same user experience that the Canary StageExecution in Deck/Orca has.
    # By default this is disabled - in standalone we enable it!
      standaloneCanaryAnalysis:
        enabled: true

      metrics:
        retry:
          series: SERVER_ERROR
          attempts: 10
          backoffPeriodMultiplierMs: 1000

      serialization:
        writeDatesAsTimestamps: false
        writeDurationsAsTimestamps: false

    management.endpoints.web.exposure.include: '*'
    management.endpoint.health.show-details: always

    keiko:
      queue:
        redis:
          queueName: kayenta.keiko.queue
          deadLetterQueueName: kayenta.keiko.queue.deadLetters

    spectator:
      applicationName: ${}
      webEndpoint:
        enabled: true

    swagger:
      enabled: true
      title: Kayenta API
      patterns:
        - /admin.*
        - /canary.*
        - /canaryConfig.*
        - /canaryJudgeResult.*
        - /credentials.*
        - /fetch.*
        - /health
        - /judges.*
        - /metadata.*
        - /metricSetList.*
        - /metricSetPairList.*
        - /metricServices.*
        - /pipeline.*
        - /standalone.*

Save that and deploy it to the cluster.

kubectl apply -f kayenta-configmap.yml

You’ll notice in the config we just put down that we did not include the Azure Storage account key. Assuming we want to commit that YAML to a source control system at some point, we definitely don’t want credentials in there. Instead, let’s use a Kubernetes secret for the Azure Storage account key.

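# Remember earlier we got the storage account key for creating
# the container? We're going to use that again.
# $StorageKey is a placeholder for whatever variable holds that key.
# The key name storage-key matches what the deployment reads later.
kubectl create secret generic azure-storage `
  -n $Namespace `
  --from-literal=storage-key="$StorageKey"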

It’s deployment time! Let’s get a Kayenta container into the cluster! Obviously you can tweak all the tolerances and affinities and node selectors and all that to your heart’s content. I’m keeping the example simple.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kayenta
  namespace: kayenta
  labels: kayenta
spec:
  replicas: 1
  selector:
    matchLabels: kayenta
  template:
    metadata:
      labels: kayenta
    spec:
      containers:
        - name: kayenta
          # Find the list of tags here:
          # This is just the tag I've been using for a while. I use one of the images NOT tagged
          # with Spinnaker because the Spinnaker releases are far slower.
          image: ""
          env:
            # If you need to troubleshoot, you can set the logging level by adding
            # -Dlogging.level.root=TRACE
            # Without the log at DEBUG level, very little logging comes out at all and
            # it's really hard to see if something goes wrong. If you don't want that
            # much logging, go ahead and remove the log level option here.
            - name: JAVA_OPTS
              value: "-XX:+UnlockExperimentalVMOptions -Dlogging.level.root=DEBUG"
            # We can store secrets outside config and provide them via the environment.
            # Insert them into the config file using ${dot.delimited} versions of the
            # variables, like ${azure.storageKey} which we saw in the ConfigMap.
            - name: AZURE_STORAGEKEY
              valueFrom:
                secretKeyRef:
                  name: azure-storage
                  key: storage-key
          ports:
            - name: http
              containerPort: 8090
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: http
          readinessProbe:
            httpGet:
              path: /health
              port: http
          volumeMounts:
            - name: config-volume
              mountPath: /opt/kayenta/config
      volumes:
        - name: config-volume
          configMap:
            name: kayenta

And let’s save and apply.

kubectl apply -f kayenta-deployment.yml
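
The ConfigMap refers to the storage key as ${azure.storageKey}, while the deployment supplies it as the environment variable AZURE_STORAGEKEY. Here is a minimal Python sketch of that naming convention, assuming the usual Spring behavior of swapping dots for underscores and upper-casing the result. The helper name `env_var_for` is made up for illustration and is not part of Kayenta.

```python
def env_var_for(property_name: str) -> str:
    """Map a dot-delimited property name to the environment variable
    name it can be supplied through (assumed convention: dots become
    underscores, then everything is upper-cased)."""
    return property_name.replace(".", "_").upper()


# The placeholder used in the ConfigMap and the variable set in the deployment.
print(env_var_for("azure.storageKey"))  # AZURE_STORAGEKEY
```

If you add another secret the same way, running the property name through a helper like this is a quick check that the placeholder and the variable name line up.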

If you have everything wired up right, the Kayenta instance should start. But we want to see something happen, right? Without kubectl port-forward?

Let’s put a LoadBalancer service in here so we can access it. I’m going to show the simplest Kubernetes LoadBalancer here, but in your situation you might have, say, an nginx ingress in play or something else. You’ll have to adjust as needed.

apiVersion: v1
kind: Service
metadata:
  name: kayenta
  namespace: kayenta
  labels: kayenta
spec:
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector: kayenta
  type: LoadBalancer

Let’s see it do something. You should be able to get the public IP address for that LoadBalancer service by doing:

kubectl get service/kayenta -n $Namespace

You’ll see something like this:

NAME         TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)    AGE
kayenta      LoadBalancer   80/TCP     54s
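
If you want to grab that address in a script instead of reading the table, here is a rough Python sketch. It assumes you feed it the parsed output of kubectl get service/kayenta -o json; the helper name `external_ip` and the sample address (from a documentation-only range) are made up for illustration.

```python
import json


def external_ip(service):
    """Return the first load balancer IP or hostname from a Service
    object, or None while the address is still pending."""
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    if not ingress:
        return None
    return ingress[0].get("ip") or ingress[0].get("hostname")


# Sample document shaped like `kubectl get service -o json` output.
sample = json.loads("""
{
  "kind": "Service",
  "status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.10"}]}}
}
""")
print(external_ip(sample))  # 203.0.113.10
```

A None result usually just means the load balancer hasn't been provisioned yet, so a script would wait and try again.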

Take note of that external IP and you can visit the Swagger docs in a browser:

If it’s all wired up, you should get some Swagger docs!

The first operation you should try is under credentials-controller - GET /credentials. This will tell you what metrics and object stores Kayenta thinks it’s talking to. The result should look something like this:

[
  {
    "name": "canary-prometheus",
    "supportedTypes": [
      "METRICS_STORE"
    ],
    "endpoint": {
      "baseUrl": "http://prometheus"
    },
    "type": "prometheus",
    "locations": [],
    "recommendedLocations": []
  },
  {
    "name": "canary-storage",
    "supportedTypes": [
      "OBJECT_STORE"
    ],
    "rootFolder": "kayenta",
    "type": "azure",
    "locations": [],
    "recommendedLocations": []
  }
]

If the canary-storage account pointing to azure is missing, that means Kayenta can’t access the storage account or it’s otherwise misconfigured. I found the biggest gotcha here was that the connection is HTTP-only, and HTTP access is not enabled by default for a storage account you create through the Azure portal. You have to turn that on.
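
That check is easy to script. Here is a small Python sketch that takes the parsed GET /credentials response and reports which accounts are missing. The function name `missing_accounts` is hypothetical, and it assumes the response is a list of objects with name and type fields, as in the sample output.

```python
import json


def missing_accounts(credentials, expected):
    """Return the expected (name, type) pairs that do not appear in a
    /credentials response (a list of account objects)."""
    found = {(account.get("name"), account.get("type")) for account in credentials}
    return [pair for pair in expected if pair not in found]


# A response where the Azure account failed to load.
response = json.loads('[{"name": "canary-prometheus", "type": "prometheus"}]')
expected = [("canary-prometheus", "prometheus"), ("canary-storage", "azure")]
print(missing_accounts(response, expected))  # [('canary-storage', 'azure')]
```

An empty list means both stores showed up; anything else is your cue to go look at the logs.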


What do you do if you can’t figure out why Kayenta isn’t connecting to stuff?

Up in the Kubernetes deployment, you’ll see the logging is set up at the DEBUG level. The logging is pretty good at this level. You can use kubectl logs to get the logs from the Kayenta pods or, better, use stern for that. Those logs are going to be your secret weapon. You’ll see errors that pretty clearly indicate whether there’s a DNS problem or a bad password or something similar.

If you still aren’t getting enough info, turn the log level up to TRACE. It can get noisy, but you’ll only need it for troubleshooting.

Next Steps

There’s a lot you can do from here.

Canary configuration: Actually configuring a canary is hard. For me, it took deploying a full Spinnaker instance and doing some canary stuff to figure it out. There’s a bit more doc on it now, but it’s definitely tricky. Here’s a pretty basic configuration where we just look for errors by ASP.NET microservice controller. No, I cannot help or support you in configuring a canary. I’ll give you this example with no warranties, expressed or implied.

  "canaryConfig": {
    "applications": [
    "classifier": {
      "groupWeights": {
        "StatusCodes": 100
      "scoreThresholds": {
        "marginal": 75,
        "pass": 75
    "configVersion": "1",
    "description": "App Canary Configuration",
    "judge": {
      "judgeConfigurations": {
      "name": "NetflixACAJudge-v1.0"
    "metrics": [
        "analysisConfigurations": {
          "canary": {
            "direction": "increase",
            "nanStrategy": "replace"
        "groups": [
        "name": "Errors By Controller",
        "query": {
          "customInlineTemplate": "PromQL:sum(increase(http_requests_received_total{app='my-app',azure_pipelines_version='${location}',code=~'5\\\\d\\\\d|4\\\\d\\\\d'}[120m])) by (action)",
          "scopeName": "default",
          "serviceType": "prometheus",
          "type": "prometheus"
        "scopeName": "default"
    "name": "app-config",
    "templates": {
  "executionRequest": {
    "scopes": {
      "default": {
        "controlScope": {
          "end": "2020-11-20T23:01:09.3NZ",
          "location": "baseline",
          "scope": "control",
          "start": "2020-11-20T21:01:09.3NZ",
          "step": 2
        "experimentScope": {
          "end": "2020-11-20T23:01:09.3NZ",
          "location": "canary",
          "scope": "experiment",
          "start": "2020-11-20T21:01:09.3NZ",
          "step": 2
    "siteLocal": {
    "thresholds": {
      "marginal": 75,
      "pass": 95

Integrate with your CI/CD pipeline: Your deployment is going to need to know how to track the currently deployed vs. new/canary deployment. Statistics are going to need to be tracked that way, too. (That’s the same as if you were using Spinnaker.) I’ve been using the KubernetesManifest@0 task in Azure DevOps, setting trafficSplitMethod: smi and making use of the canary control there. A shell script polls Kayenta to see how the analysis is going.
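
The control and experiment scopes in the execution request are the part a pipeline has to generate on every run. Here is a minimal Python sketch of building them, assuming Kayenta wants ISO-8601 UTC instants like 2020-11-20T23:01:09Z and that the control/experiment names and baseline/canary locations from the example configuration are what you use. `iso_instant` and `build_scopes` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def iso_instant(moment):
    """Format a timezone-aware datetime as an ISO-8601 UTC instant."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_scopes(end, window_minutes=120, step=2):
    """Build matching control and experiment scopes covering the
    `window_minutes` leading up to `end`."""
    start = end - timedelta(minutes=window_minutes)

    def scope(name, location):
        return {
            "scope": name,
            "location": location,
            "start": iso_instant(start),
            "end": iso_instant(end),
            "step": step,
        }

    return {
        "default": {
            "controlScope": scope("control", "baseline"),
            "experimentScope": scope("experiment", "canary"),
        }
    }


scopes = build_scopes(datetime(2020, 11, 20, 23, 1, 9, tzinfo=timezone.utc))
print(scopes["default"]["controlScope"]["start"])  # 2020-11-20T21:01:09Z
```

Both scopes share the same time window on purpose, so the baseline and the canary get compared over the same period.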

How you do this for your template is very subjective. Pipelines at this level are really complex. I’d recommend working with Postman or some other HTTP debugging tool to get things working before trying to automate it.
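
As a starting point for the polling piece, here is a rough Python sketch. It assumes the status response has a boolean complete field and that you fetch it from something like GET /canary/{id}; confirm both against the Swagger docs for your Kayenta version. All function names here are hypothetical, and the threshold logic is a simplified reading of marginal/pass, not Kayenta's own code.

```python
import json
import time
import urllib.request


def http_fetcher(url):
    """Build a zero-argument function that GETs `url` and parses the JSON body."""
    def fetch():
        with urllib.request.urlopen(url) as response:
            return json.load(response)
    return fetch


def poll_until_complete(fetch_status, interval_seconds=30, max_attempts=60,
                        sleep=time.sleep):
    """Call `fetch_status` until the response reports complete, or give up."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status.get("complete"):
            return status
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError("canary analysis did not complete in time")


def classify(score, marginal, passing):
    """Simplified threshold check: at or above `passing` passes,
    below `marginal` fails, anything in between is marginal."""
    if score >= passing:
        return "pass"
    if score >= marginal:
        return "marginal"
    return "fail"


# Demo with canned responses instead of a live Kayenta instance.
canned = iter([{"complete": False}, {"complete": True, "status": "succeeded"}])
result = poll_until_complete(lambda: next(canned), sleep=lambda seconds: None)
print(result["status"])
print(classify(80, marginal=75, passing=95))
```

Passing the fetch function in keeps the loop testable without a cluster. In a real pipeline you would hand it `http_fetcher(...)` pointed at your own Kayenta address.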

Secure it!: You probably don’t want public anonymous access to the Kayenta API. I locked mine down with oauth2-proxy and Istio but you could do it with nginx ingress and oauth2-proxy or some other mechanism.

Put a UI on it!: As you can see, configuring Kayenta canaries without a UI is actually pretty hard. Nike has a UI for standalone Kayenta called “Referee”. At the time of this writing there’s no Docker container for it so it’s not as easy to deploy as you might like. However, there is a Dockerfile gist that might be helpful. I have not personally got this working, but it’s on my list of things to do.

Huge props to my buddy Chris who figured a lot of this out, especially the canary configuration and Azure DevOps integration pieces.