Architecture on network-notes

Stop Picking Tools, Start Picking Functions: The NAF Framework

brett@network-notes.com (Brett Lykins) — Wed, 20 May 2026 10:00:00 -0500

The Tool-First Trap

Every network automation conversation I’ve been part of starts the same way: “Should we use Ansible or Nornir? NetBox or Nautobot? Terraform or Pulumi?”

These are the wrong first questions. They’re implementation details masquerading as architecture decisions. You end up picking a tool, building around it, then discovering six months later that you’ve solved 30% of the problem and created three new ones.

The result is what Damien Garros describes as the “Frankenstack”: a pile of point tools stitched together with glue scripts, each solving a narrow problem but none composing into a coherent system. I built these early in my career. I spent years at Network to Code and OpsMill helping customers untangle them. You’ve probably built or inherited one yourself. They work until they don’t, and when they break, nobody can reason about the whole thing because there was never a whole thing to reason about.

What’s missing isn’t better tools. It’s a shared vocabulary for the functions your automation system needs to perform. The NAF Reference Framework provides exactly that.

The Six Building Blocks

The Network Automation Forum (NAF) published a reference architecture that breaks network automation into six functional building blocks. It’s not a product. It’s not a standard. It’s a blueprint, a way to think about what your automation system needs to do before you decide how to do it.

The six blocks answer four questions:

Question	Building Block(s)
What do I want the network to look like?	Intent
What does the network actually look like?	Observability
How do I read from and write to the network?	Collector (read) / Executor (write)
How do I coordinate all of this?	Orchestrator / Presentation

Here’s what each block does:

Intent stores and manages the desired state of your network. This is your source of truth: IP addressing, topology, service definitions, configuration templates, validation rules. It exposes an API, supports CRUD operations, and should provide versioning and validation.

Observability stores and processes the actual state. It persists what the Collector retrieves, runs analytics against it, and generates events when actual state diverges from intended state.

Orchestrator coordinates workflows across the other blocks. It doesn’t touch the network directly. It responds to events, schedules tasks, chains operations together, and handles rollback when something fails.

Executor pushes changes to the network. Configuration deploys, software upgrades, device reboots. It speaks SSH, NETCONF, gNMI, REST, whatever the device supports. Operations should be idempotent and support dry-run.

Collector pulls state from the network: show commands, SNMP polls, streaming telemetry, syslog, flow data. It normalizes vendor-specific output into structured data that Observability can consume.

Presentation is how humans (and external systems) interact with everything else. Dashboards, CLIs, ChatOps, ITSM integrations, API gateways.

The architecture has a deliberate symmetry: the left side is the read path (Observability and Collector reading state from infrastructure), the right side is the write path (Intent and Executor pushing state to infrastructure), and the Orchestrator sits in the middle coordinating both.

Why Functions Before Tools

Thinking in building blocks separates what you need from how you implement it.

One tool can fill multiple blocks. Nautobot, for example, covers Intent (it’s a source of truth), Orchestrator (its Jobs framework coordinates workflows), and Presentation (it has a web UI and API). That’s fine. The framework doesn’t prescribe how many tools you use. What matters is that the functions are covered.

Conversely, one block might need multiple tools. Your Collector might be Telegraf for metrics, a streaming telemetry receiver for gNMI, and a custom script for legacy SNMP devices. Three tools, one function.

This is why starting with tools fails. If you pick Ansible because someone on the team knows it, you’ve filled part of the Executor block and maybe part of the Orchestrator block. But you haven’t thought about Intent, Observability, or how the pieces connect. Six months later you’re writing a wrapper script that queries a spreadsheet (your accidental Intent block) and pipes it into an Ansible playbook (your Executor), with no Observability, no Collector feeding back actual state, and no Orchestrator handling failures.

The framework makes these gaps visible before you start building.

Mapping Real Tools

Here’s how I map tools I’ve used to the framework. This isn’t exhaustive. It’s meant to show how the mental model works in practice.

Intent

The Intent block is where most network automation projects should start, because everything downstream depends on having reliable desired-state data.

Three platforms dominate this space right now: NetBox, Nautobot, and Infrahub. All three serve as network sources of truth. All three expose APIs for automation to consume. They differ in architecture, schema flexibility, and how far they extend beyond pure data storage, but any of them can fill the Intent block effectively. I’ll have a detailed comparison of all three in a forthcoming post.

My preference is Infrahub, and I should be transparent about why: I was Director of Product at OpsMill during its development, and I worked with several of the NAF Framework contributors at Network to Code before that. It’s schema-first with a versioned graph database, and it reaches into Orchestrator territory (CI/CD integration, proposed changes with approval workflows) and Presentation (web UI, GraphQL API). Having built automation systems with all three platforms, I think Infrahub covers more of the Intent block’s requirements in a single platform than the alternatives do today.

That said, NetBox has the largest community and plugin ecosystem by a wide margin, and Nautobot extends further into Orchestrator (its Jobs framework) and Presentation than NetBox does. Both are proven at scale in ways that Infrahub, being newer, is still proving. Your team’s requirements should drive the decision.

At Amazon, we use internal platforms that serve the same Intent function. They store desired state for network infrastructure and expose it through APIs that automation consumes. The tools are different, but the function is identical.

Executor and Collector

These are the transport layer, the blocks that actually touch your network devices.

NAAS (Netmiko-as-a-Service) is a project I maintain that wraps Netmiko behind a REST API. It handles both read and write operations: you POST a command or configuration payload, NAAS manages the SSH session, and returns structured results. It fills both the Executor block (config pushes, device operations) and the Collector block (show commands, state retrieval).

Wrapping device interaction in a service means your Orchestrator doesn’t need to know how to SSH into a Juniper vs. an Arista. It just calls an API. The transport complexity is encapsulated in one place.

Other tools that fill these blocks: Nornir (Python-native, multi-threaded), Ansible (push-mode config management), NAPALM (multi-vendor abstraction), and for collection specifically, Telegraf, streaming telemetry receivers, or SNMP-based collectors.

Orchestrator

This is where workflows live. CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins) are the most common Orchestrator in practice. They respond to events (a merge to main), coordinate steps (validate, generate config, deploy, verify), and handle failures.

Nautobot Jobs serve this function within the Nautobot ecosystem. Tools like Prefect, Temporal, or Apache Airflow work here too, especially for complex multi-step workflows with retry logic and rollback.

In my experience, you almost always end up with a dedicated orchestration platform at scale because the coordination logic gets complex enough to warrant its own system.

Observability

Prometheus and Grafana are the default answer for metrics. Elasticsearch or Loki for logs. For network-specific observability, tools like SuzieQ or Batfish provide deeper network state analysis.

The key requirement from the framework: Observability should generate events when actual state diverges from intended state. That feedback loop (Collector reads state, Observability detects drift, Orchestrator triggers remediation via Executor) is where automation becomes closed-loop rather than fire-and-forget.
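
A minimal sketch of the drift-detection step in that loop, assuming desired and actual state are both available as plain dictionaries keyed by device. The DriftEvent type, the detect_drift function, and the sample settings are hypothetical names for illustration; they are not part of the NAF framework or of any tool mentioned here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftEvent:
    """One setting whose collected value differs from the intended value."""
    device: str
    key: str
    intended: object
    actual: object


def detect_drift(intended, actual):
    """Compare intended state (Intent) with collected state (Collector).

    Both arguments map device name -> {setting: value}. A setting that is
    missing from the collected state is reported with actual=None.
    """
    events = []
    for device, settings in sorted(intended.items()):
        observed = actual.get(device, {})
        for key, want in sorted(settings.items()):
            have = observed.get(key)
            if have != want:
                events.append(DriftEvent(device, key, want, have))
    return events


# Hypothetical sample data: the MTU on leaf1 has drifted.
intended = {"leaf1": {"mtu": 9214, "ntp": "10.0.0.1"}}
actual = {"leaf1": {"mtu": 1500, "ntp": "10.0.0.1"}}

events = detect_drift(intended, actual)
for event in events:
    print(f"{event.device}: {event.key} is {event.actual}, expected {event.intended}")
```

In a real system the events would go to the Orchestrator rather than to stdout, and the comparison would need to handle vendor-specific normalization that this sketch ignores.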

Presentation

ServiceNow, Slack bots, custom dashboards, CLI tools, API gateways. This block is the most organization-specific because it depends entirely on how your users (network engineers, NOC staff, other teams requesting changes) prefer to interact with the system.

How to Use This

If you’re starting a new automation project, use the framework as a checklist:

Identify which blocks you need first. Not every project needs all six on day one. If your immediate pain is “we don’t know what the network should look like,” start with Intent. If it’s “we can’t deploy changes reliably,” start with Executor.
Audit your existing stack. Map every tool you currently use to a block. You’ll likely find gaps (no Observability feeding back to Intent) and overlaps (three different tools all partially filling Orchestrator, none of them well).
Evaluate new tools against the framework. When someone pitches you a product, ask: “Which block does this fill? Do I already have something there? Does it compose with what I have in adjacent blocks?”
Look for the missing feedback loops. The framework’s read/write symmetry implies closed loops: Intent defines desired state, Executor pushes it, Collector reads actual state, Observability compares them, Orchestrator remediates drift. If any link in that chain is missing, your automation is open-loop. It pushes changes but can’t verify or correct them.
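
The audit step can be sketched in a few lines, assuming you can write your stack down as a mapping from tool to the blocks it fills. The audit and loop_is_closed functions are hypothetical helpers, and the sample stack is the spreadsheet-plus-Ansible example from earlier in this post.

```python
BLOCKS = (
    "Intent", "Observability", "Orchestrator",
    "Executor", "Collector", "Presentation",
)

# The blocks that take part in the closed loop described above.
CLOSED_LOOP = ("Intent", "Executor", "Collector", "Observability", "Orchestrator")


def audit(stack):
    """Return (coverage, gaps, overlaps) for a {tool: [blocks]} mapping."""
    coverage = {block: [] for block in BLOCKS}
    for tool, blocks in stack.items():
        for block in blocks:
            if block not in coverage:
                raise ValueError(f"{tool}: unknown block {block!r}")
            coverage[block].append(tool)
    gaps = [block for block in BLOCKS if not coverage[block]]
    overlaps = {b: tools for b, tools in coverage.items() if len(tools) > 1}
    return coverage, gaps, overlaps


def loop_is_closed(gaps):
    """True only when every block in the feedback loop has at least one tool."""
    return not any(block in gaps for block in CLOSED_LOOP)


# The accidental stack: a spreadsheet feeding an Ansible playbook.
stack = {"spreadsheet": ["Intent"], "Ansible": ["Executor"]}
coverage, gaps, overlaps = audit(stack)
print("gaps:", gaps)
print("overlaps:", overlaps)
print("closed loop:", loop_is_closed(gaps))
```

An overlap is not automatically a problem (one block often needs several tools), so treat that part of the output as a prompt for a conversation rather than a verdict.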

The framework is a thinking tool, not a solution. But it gives you a shared language for talking about what you’re building and a way to spot what’s missing before you’re six months into a project that can’t close the loop.

The full framework documentation is at reference.networkautomation.forum. The community discussion happens in the NAF Slack.

CLI Over HTTPS Part 4: Where Do We Go from Here?

brett@network-notes.com (Brett Lykins) — Thu, 07 May 2026 09:00:00 +0000

This is the last post in the series. Part 1 explained why SSH is slow for automation. Part 2 measured it: HTTPS batch is up to 17x faster at real-world latencies. Part 3 showed that an edge proxy and a transparent tunnel capture most of that improvement even when devices don’t speak HTTPS natively.

This post is the practical takeaway: when to use what, how to deploy it, and what the industry should build next.

The Decision Framework

Not every network needs a proxy. Not every automation run is latency-sensitive. Here’s how to think about it:

SSH Direct Is Fine When

Your automation server is co-located with the devices (same DC, same site)
Round-trip time to the devices is under ~5ms
You’re managing fewer than a few hundred devices
You’re already using SSH multiplexing (ControlMaster) or persistent connections

At local latency, SSH overhead is measured in single-digit milliseconds. The protocol tax from Part 1 is real but negligible. Don’t add architectural complexity to save 3ms per device.

Deploy a Proxy When

Your automation runs over a WAN (30ms+ RTT to the devices)
You manage devices across multiple sites, regions, or continents
Automation run time is operationally painful (the Rackspace problem from Part 1)
You’re already running regional infrastructure (jump hosts, bastion servers, Ansible Tower nodes)
You can modify your automation to speak HTTPS instead of SSH

The proxy pattern from Part 3 showed a 5.3x improvement with a new connection per request and 14.7x with connection reuse, both at 150ms RTT. If you already have bastion hosts in each region, you’re halfway there. The proxy is a bastion that speaks HTTPS instead of (or in addition to) SSH.

Deploy a Tunnel When

You need WAN optimization but can’t change your automation tooling
Your team has years of Ansible playbooks, Nornir scripts, or Netmiko wrappers that must speak SSH
You want a migration path: tunnel first, proxy later

The tunnel from Part 3 is transparent to both sides: automation speaks SSH to the headend, the device sees SSH from the site proxy. With command batching, it hits 12.7x speedup at intercontinental latency. Without batching, it’s still 3.0x faster than SSH direct.

Deploy NAAS When

You want a production-ready proxy with multi-vendor support out of the box
You need connection pooling, async jobs, and circuit breakers
Your devices span many platforms (NAAS supports the 100+ that Netmiko does)

NAAS (Netmiko as a Service) implements the proxy pattern with production concerns handled. Deploy an instance per region, point your automation at it, and the SSH stays local. More on NAAS below.

Push for Native HTTPS When

You’re evaluating new platforms or vendors
You have influence over vendor roadmaps (large deployments, design partners)
You’re building internal tooling that could expose CLI over HTTPS

Native HTTPS eliminates the proxy entirely. The ~17x batch improvement from Part 2 is the ceiling. No intermediate hop, no backend SSH overhead. If a vendor offers it, use it.
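
The checklists above collapse into a small decision function. This is a sketch, not a rule: the recommend_transport name and its return strings are hypothetical, the thresholds are the ones quoted in this post, and the 5-30ms band (which the checklists don’t cover) is treated as “measure first” purely as an assumption.

```python
def recommend_transport(rtt_ms, can_change_client, native_https=False):
    """Map the checklists above onto a single recommendation.

    rtt_ms:            round-trip time from the automation server to the devices
    can_change_client: True if the automation can speak HTTPS instead of SSH
    native_https:      True if the device exposes CLI over HTTPS itself
    """
    if native_https:
        return "native-https"   # no intermediate hop at all
    if rtt_ms < 5:
        return "ssh-direct"     # local latency: SSH overhead is negligible
    if rtt_ms < 30:
        return "measure-first"  # assumption: the post gives no guidance here
    # WAN latency: proxy if the client can change, tunnel if it cannot.
    return "proxy" if can_change_client else "tunnel"


for case in [(2, True), (150, True), (150, False)]:
    print(case, "->", recommend_transport(*case))
```

NAAS doesn’t appear as a separate branch because it is one implementation of the proxy answer rather than a different transport.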

NAAS: The Production Proxy

NAAS (Netmiko as a Service) is what the proxy pattern looks like when you build it for production. Written in Python, it wraps Netmiko behind a REST API, which means it supports 100+ device platforms: Cisco IOS, NX-OS, ASA, Juniper Junos, Arista EOS, Palo Alto, and everything else Netmiko handles.

You POST a JSON payload with the device address, platform, credentials, and commands. NAAS opens the SSH session, runs the commands, and returns the output in the HTTP response:

curl -k -X POST https://naas.dc1.example.com:8443/v1/send_command \
 -u "automation:token" \
 -H "Content-Type: application/json" \
 -d '{"host": "10.1.1.1", "platform": "cisco_ios", "commands": ["show version", "show ip route"]}'
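
The same request from Python, as a sketch using only the standard library. It mirrors the curl example above and assumes nothing about the API beyond what that example shows (the /v1/send_command path, basic auth, and the host, platform, and commands fields). The build_send_command helper is a hypothetical name.

```python
import base64
import json
import urllib.request


def build_send_command(base_url, user, token, host, platform, commands):
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"host": host, "platform": platform, "commands": commands})
    credentials = base64.b64encode(f"{user}:{token}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/v1/send_command",
        data=body.encode(),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
    )


request = build_send_command(
    "https://naas.dc1.example.com:8443", "automation", "token",
    "10.1.1.1", "cisco_ios", ["show version", "show ip route"],
)
print(request.get_method(), request.full_url)

# To send it: urllib.request.urlopen(request, timeout=30)
# Unlike curl -k, urlopen verifies the server certificate by default, so a
# self-signed lab certificate needs an explicit ssl context.
```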

What NAAS handles that a minimal proxy doesn’t:

Multi-vendor connection pooling. Persistent SSH connections with health checks and automatic reconnection.
Async job queue. Long-running commands (show tech-support, bulk config pushes) run in a Redis-backed queue. Your automation gets a job ID back immediately and polls for results.
Circuit breaker and observability. Stops hammering unreachable devices, exposes Prometheus metrics for connection pool health and per-device latency.

Deploy a NAAS instance in each data center or region, and your automation talks HTTPS to the nearest one. The SSH sessions stay local. The architecture is the same as what the benchmarks in Part 3 measured: HTTPS over the WAN, SSH on the last hop.

git clone https://github.com/lykinsbd/naas.git && cd naas
docker compose up -d
curl -k -X POST https://localhost:8443/v1/send_command \
 -u "username:password" \
 -H "Content-Type: application/json" \
 -d '{"host": "10.1.1.1", "platform": "cisco_ios", "commands": ["show version"]}'

See the NAAS getting started guide for full setup and configuration.

What Exists and What’s Missing

Some of the pieces are already in place.

Arista’s eAPI accepts CLI commands via JSON-RPC over HTTPS. It wraps everything in JSON, but the core pattern is there: send commands over HTTPS, get output back. The Cisco ASA’s HTTPS CLI interface and eAPI have both been in production for years. NAAS (described above) brings the proxy pattern to the 100+ platforms Netmiko supports. The clibench tunnel mode demonstrates the transparent SSH-to-HTTP approach for teams that can’t change their automation tooling.

What’s missing:

A standard CLI-over-HTTPS interface. Not RESTCONF, not gNMI. Those are structured data interfaces for a different use case. A simple, standardized way to send CLI commands over HTTPS and get text output back. The ASA pattern is a reasonable starting point: GET /cli/exec/{command} for show commands, POST /cli/config for configuration. Basic auth or token auth over TLS. Content-Type: text/plain. No JSON wrapping unless the client asks for it. Arista’s eAPI is the closest thing to this, but it’s vendor-specific and JSON-only.

Proxy support in the automation ecosystem. Ansible could ship a connection plugin that talks HTTPS to a proxy like NAAS instead of SSH to the device. Nornir could support an HTTP transport alongside Paramiko and Netmiko. NAAS works today as a standalone API, and a native connection plugin would make adoption even easier: a configuration option instead of a code change.

Broader vendor adoption. Every network OS already has an HTTPS server for its web UI. Exposing the CLI through that same server is not a large engineering effort. The ASA proves the concept. A plain-text CLI endpoint alongside the structured API would cover both use cases.
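
To show how small the first item on that list is from the client side, here is a sketch of request builders for the proposed interface. The two paths and the text/plain content type are the ones suggested above; the function names and the example hostname are hypothetical, and authentication is left out. No vendor ships this exact interface today.

```python
import urllib.parse
import urllib.request


def exec_request(base_url, command):
    """GET /cli/exec/{command}: the command is percent-encoded into the path."""
    path = urllib.parse.quote(command, safe="")
    return urllib.request.Request(
        f"{base_url}/cli/exec/{path}",
        method="GET",
        headers={"Accept": "text/plain"},
    )


def config_request(base_url, lines):
    """POST /cli/config: configuration lines sent as a plain-text body."""
    body = "\n".join(lines).encode()
    return urllib.request.Request(
        f"{base_url}/cli/config",
        data=body,
        method="POST",
        headers={"Content-Type": "text/plain"},
    )


show = exec_request("https://sw1.example.com", "show ip route")
push = config_request(
    "https://sw1.example.com",
    ["interface Ethernet1", "description uplink"],
)
print(show.get_method(), show.full_url)
print(push.get_method(), push.full_url)
```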

None of this requires abandoning SSH. SSH remains the right tool for interactive sessions, for out-of-band recovery, for environments where HTTPS infrastructure doesn’t exist. The argument isn’t “replace SSH.” It’s “stop using SSH for the thing it’s worst at.”

The Numbers, One More Time

For reference, here’s the speedup picture from the series. Most automation tools (Netmiko, Ansible, Scrapli) use PTY mode, so SSH PTY is the realistic baseline:

Scenario	Speedup vs SSH PTY
HTTPS batch (native)	~17x
Proxy (reused connection)	~14.7x
Tunnel batch	~12.7x
Proxy (new connection)	~5.3x
HTTPS keep-alive (native)	~3.4x
Tunnel per-command	~3.0x
SSH with ControlMaster	~1.7x

The proxy with connection reuse gets you most of the native HTTPS improvement without requiring any changes to the devices. The tunnel with batching is close behind, and requires zero changes to your automation tooling either.

Try It Yourself

The benchmark tool supports all scenarios:

# All transports (SSH, HTTPS, HTTP/3, proxy, tunnel)
sudo ./bin/clibench bench --latency regional --iterations 20 --commands 5

# Proxy pattern (HTTPS + HTTP/3 variants)
sudo ./bin/clibench bench --latency regional --iterations 20 --commands 5 --transport proxy

# Tunnel (transparent SSH-to-HTTP WAN optimization)
sudo ./bin/clibench bench --latency intercontinental --iterations 20 --commands 5 --transport tunnel-https

The code is MIT licensed. Run it on your own infrastructure, with your own latency profiles, and see what the numbers look like for your network.

My take: The network automation community has treated SSH as a given for fifteen years. It was the right default when automation meant one engineer scripting against a handful of devices. At the scale most organizations operate today, SSH’s protocol overhead is a measurable, avoidable cost. Native HTTPS CLI is the right long-term direction. The proxy and tunnel patterns are deployable today. I built NAAS so you can start today. Contributions welcome.

CLI Over HTTPS Part 3: The Proxy Pattern

brett@network-notes.com (Brett Lykins) — Tue, 05 May 2026 09:00:00 +0000

In Part 1 I showed that SSH burns 10-15 round trips before delivering a single byte of command output. In Part 2 I proved it: HTTPS batch is ~17x faster than SSH at real-world latencies when the device supports it natively. Even HTTPS keep-alive, with no batching, is 3.4x faster.

The obvious objection: most devices don’t support it natively. Your Cisco IOS switches, your Juniper routers, your Arista leaf nodes: they all speak SSH. Some of them have other interfaces, but SSH isn’t going away anytime soon.

So the question isn’t “how do I get my switches to speak HTTPS.” The question is: where does the SSH happen?

There are two answers. The proxy requires your automation to speak HTTPS, but it’s architecturally simple: one hop, one translation. The tunnel keeps SSH on both ends and optimizes only the WAN segment, so existing tooling works unchanged. Both relocate the expensive SSH round trips to a local link where they cost almost nothing.

The Proxy: Replace SSH on the WAN

SSH is slow because of round trips. Round trips are slow because of distance. If you move the SSH session closer to the device, the round trips get cheap.

A proxy co-located with the devices, in the same data center or local network, talks SSH to the devices over a 1-2ms link where the protocol overhead is negligible. Your automation platform talks HTTPS to the proxy over the WAN, where the round-trip savings from Part 1 actually matter.

The device never knows the difference: it sees an SSH session from a local IP. Your automation never touches SSH directly: it sends an HTTP request and gets CLI output back in the response body.

The Architecture

The proxy is the only component that touches SSH. Everything upstream is HTTPS: connection pooling, TLS 1.3, request batching, proper Content-Length framing. Everything downstream is SSH, but over a link where it doesn’t matter.

Proving It

I added a proxy mode to the benchmark tool from Part 2. The proxy is an HTTPS server that receives commands via the same ASA-style endpoints (/admin/exec/, /admin/config), then opens an SSH session to a backend device and returns the output.

The test setup:

Backend device: SSH listener with 2ms RTT (local latency)
Proxy: HTTPS frontend with WAN latency, SSH client to backend
Benchmark client: Talks HTTPS to the proxy, same as it would to a native HTTPS device

Four proxy modes tested with a new WAN connection per request (cold start, first request of an automation run):

fresh-ssh: New WAN connection + new SSH to backend per request
pooled-ssh: New WAN connection, reuses one SSH connection on the backend
h3-fresh-ssh: Same as fresh-ssh, but the WAN leg uses HTTP/3 (QUIC)
h3-pooled-ssh: QUIC on WAN, pooled SSH on backend

Plus two connection-reuse modes (steady state, what a running automation platform does):

keep-alive: Persistent HTTPS connection to proxy, pooled SSH on backend
h3-keep-alive: Persistent QUIC connection to proxy, pooled SSH on backend

The proxy’s WAN-facing listener gets the same latency injection as the direct SSH and HTTPS tests from Part 2. The backend SSH link gets a fixed 2ms RTT. All transports experience the same WAN conditions. The only difference is what happens on the last hop.

Results

All runs: 20 iterations, 5 commands per iteration (batched in one POST). SSH direct numbers from Part 2 for comparison. The “SSH direct” column uses PTY/shell mode, what Netmiko and Ansible actually do.

WAN RTT	SSH direct (PTY)	Proxy (new conn)	Proxy (reused conn)	Speedup (reused vs SSH PTY)
30ms	528ms	124ms	56ms	9.4x
70ms	1,208ms	248ms	95ms	12.7x
150ms	2,571ms	489ms	175ms	14.7x

The new-connection proxy (full TLS handshake per request) is 4.3-5.3x faster than SSH direct. With connection reuse, the proxy hits 9.4-14.7x, and the advantage grows with latency because reusing the connection eliminates the TLS handshake entirely, paying only 1 round trip per request.

At 150ms RTT (a US NOC managing devices in Hong Kong), SSH direct (PTY) takes 2.6 seconds per device. The proxy with a persistent connection does it in 175ms.
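
The speedup figures follow directly from the table. A short check, with the latencies copied from the table above (the dictionary keys and the speedup helper are just illustrative names):

```python
# Latencies in milliseconds, copied from the results table, keyed by WAN RTT.
results = {
    30: {"ssh_pty": 528, "proxy_new": 124, "proxy_reused": 56},
    70: {"ssh_pty": 1208, "proxy_new": 248, "proxy_reused": 95},
    150: {"ssh_pty": 2571, "proxy_new": 489, "proxy_reused": 175},
}


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return round(baseline_ms / candidate_ms, 1)


for rtt, row in results.items():
    new = speedup(row["ssh_pty"], row["proxy_new"])
    reused = speedup(row["ssh_pty"], row["proxy_reused"])
    print(f"{rtt}ms RTT: new connection {new}x, reused connection {reused}x")
```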

Why It Works

SSH direct (PTY mode) at 150ms RTT pays the full protocol tax on every round trip over the WAN:

TCP handshake: 1 RT × 150ms
SSH version exchange: 1 RT × 150ms
Key exchange: 2 RT × 150ms
Auth + channel + PTY + shell: 4 RT × 150ms
Session prep: 2 RT × 150ms
5 commands with echo verification: 5 RT × 150ms

That’s ~15 round trips × 150ms = ~2,250ms of protocol overhead, plus processing time.

The proxy with a new connection splits that cost across two links:

WAN leg (HTTPS, new connection): TCP + TLS 1.3 + HTTP request = ~3 RT × 150ms = ~450ms
Local leg (SSH): The same ~15 SSH round trips, but at 2ms = ~30ms

Total: ~480ms. That’s the cold-start cost when your automation opens a new connection to the proxy.

The proxy with a reused connection eliminates the handshake entirely:

WAN leg (HTTPS, reused connection): HTTP request/response = ~1 RT × 150ms = ~150ms
Local leg (SSH): Same ~30ms

Total: ~180ms. Measured: 175ms.

This is the steady-state performance. Once your automation has an open connection to the proxy (which most HTTP client libraries maintain by default when you reuse a client or session object), every subsequent request costs roughly one WAN round trip plus the local SSH work. The SSH overhead is still there. It’s just happening on a link where 15 round trips cost 30ms instead of 2,250ms.
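
The arithmetic above fits in a small round-trip model. This sketch uses only the round-trip counts listed in this section; it ignores device processing time and bandwidth, so it will undershoot real measurements. The function and constant names are hypothetical.

```python
# Round trips per phase for SSH in PTY mode, as listed above.
SSH_ROUND_TRIPS = {
    "tcp_handshake": 1,
    "version_exchange": 1,
    "key_exchange": 2,
    "auth_channel_pty_shell": 4,
    "session_prep": 2,
    "five_commands": 5,
}
SSH_TOTAL = sum(SSH_ROUND_TRIPS.values())  # 15 round trips

HTTPS_NEW_CONNECTION = 3     # TCP + TLS 1.3 + HTTP request
HTTPS_REUSED_CONNECTION = 1  # HTTP request/response only


def ssh_direct_ms(wan_rtt_ms):
    """Every SSH round trip crosses the WAN."""
    return SSH_TOTAL * wan_rtt_ms


def proxy_ms(wan_rtt_ms, local_rtt_ms=2, reused=False):
    """HTTPS round trips cross the WAN; SSH round trips stay on the local link."""
    wan_round_trips = HTTPS_REUSED_CONNECTION if reused else HTTPS_NEW_CONNECTION
    return wan_round_trips * wan_rtt_ms + SSH_TOTAL * local_rtt_ms


for rtt in (30, 70, 150):
    print(rtt, ssh_direct_ms(rtt), proxy_ms(rtt), proxy_ms(rtt, reused=True))
```

At 150ms the model predicts 2,250ms, 480ms, and 180ms, against measured values of 2,571ms, 489ms, and 175ms in the results table.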

Fresh vs Pooled: Does It Matter?

At local latency, not much. The gap between fresh-ssh (132ms) and pooled-ssh (119ms) at 30ms WAN RTT is 13ms, the cost of one SSH handshake at 2ms RTT. In production you’d pool connections anyway for resource efficiency, but the performance argument for pooling is modest when the backend latency is low.

The operational argument matters more. A pooled connection means fewer SSH sessions on the device, and network devices have finite session limits. An ASA might handle 5 concurrent SSH sessions; a Catalyst might allow 16. If your proxy is serving 50 requests per second, fresh connections will exhaust those limits instantly. Pooling keeps one session open per device and multiplexes commands through it.

The pooling logic in clibench is simple. getSSH() returns an existing connection if one is pooled, or dials a new one:

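// getSSH returns an SSH client for the backend device. The bool reports
// whether the returned client belongs to the pool. No locking is shown here.
func (s *Server) getSSH() (*ssh.Client, bool, error) {
	// Pooling disabled: dial a fresh connection for every request.
	if !s.pooled {
		c, err := ssh.Dial("tcp", s.backendAddr, s.sshCfg)
		return c, false, err
	}
	// Reuse the pooled connection when one exists.
	if s.pool != nil {
		return s.pool, true, nil
	}
	// First pooled request: dial once and remember the client.
	c, err := ssh.Dial("tcp", s.backendAddr, s.sshCfg)
	if err != nil {
		return nil, false, err
	}
	s.pool = c
	return c, true, nil
}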

The tradeoff is stale connections: devices reboot, sessions time out, firewalls drop idle flows. The proxy needs to detect dead connections and reconnect, the same problem as HTTP connection pooling or database connection pooling. In clibench, a failed session operation clears the pool so the next request gets a fresh connection. In production, you’d add periodic health checks and a circuit breaker for unreachable devices, which is what NAAS does.
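
The clear-on-failure idea, sketched in Python with a stand-in connection object instead of a real SSH library. PooledBackend and FakeConn are hypothetical names. The single automatic retry goes one step beyond what clibench is described as doing here (it only clears the pool), so treat it as an assumption rather than as the tool’s behavior.

```python
import threading


class PooledBackend:
    """One cached connection per backend, dropped when an operation fails."""

    def __init__(self, dial):
        self._dial = dial  # callable that returns a new connection
        self._conn = None
        self._lock = threading.Lock()  # serialize commands on the shared session

    def run(self, command):
        """Run a command, redialing once if the pooled connection is stale."""
        with self._lock:
            for attempt in range(2):
                if self._conn is None:
                    self._conn = self._dial()
                try:
                    return self._conn.run(command)
                except ConnectionError:
                    self._conn = None  # clear the pool; next attempt redials
                    if attempt == 1:
                        raise


class FakeConn:
    """Stand-in for an SSH session so the sketch runs without a device."""

    def __init__(self):
        self.alive = True

    def run(self, command):
        if not self.alive:
            raise ConnectionError("session closed")
        return f"output of {command}"


dialed = []


def dial():
    conn = FakeConn()
    dialed.append(conn)
    return conn


backend = PooledBackend(dial)
first = backend.run("show version")   # dials once
second = backend.run("show ip route")  # reuses the pooled connection
dialed[0].alive = False                # simulate a device reboot
third = backend.run("show version")   # detects the dead session and redials
print(len(dialed), third)
```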

The Tunnel: Keep SSH on Both Ends

The proxy requires changing your automation client. What if you can’t?

Many teams have years of Ansible playbooks, Nornir scripts, and Netmiko wrappers that all speak SSH. Rewriting them to speak HTTPS is a project, not a config change. The tunnel solves this: both your automation and the device speak SSH. The WAN segment in between uses HTTPS or HTTP/3, but neither endpoint knows or cares.

Architecture

The headend sits near your automation server. It accepts SSH connections, parses the exec command, and forwards it as an HTTP request over the WAN to the site proxy. The site proxy is the same component from the proxy pattern above. It receives the HTTP request and talks SSH to the device on a local link.

Your automation runs ssh headend "show version" and gets back the device output. Under the hood, the WAN segment used HTTPS with 2-3 round trips instead of SSH’s 15+.

Results

WAN RTT	SSH direct (PTY)	Tunnel (per-cmd)	Tunnel (batch)	Speedup (batch vs SSH PTY)
30ms	528ms	228ms	82ms	6.4x
70ms	1,208ms	429ms	121ms	10.0x
150ms	2,571ms	856ms	202ms	12.7x

Two things jump out.

Without batching, the tunnel is slower than the proxy. The per-command tunnel mode (ssh-https-ssh) pays SSH overhead on both ends: the automation-to-headend SSH handshake, plus the site proxy-to-device SSH handshake. That’s two sets of SSH round trips at campus latency (~2ms each), plus the WAN HTTP request per command. At 150ms, 856ms is still 3.0x faster than SSH direct, but much worse than the proxy’s 175ms.

With batching, the tunnel approaches proxy performance. The batch mode sends all 5 commands in a single SSH exec payload to the headend. The headend forwards them as one HTTP POST. The site proxy runs them all in one SSH session. At 150ms, that’s 202ms vs the proxy’s 175ms. The tunnel pays a small penalty for the extra SSH hop on the automation side, but it’s close.

The tunnel’s value isn’t raw speed. It’s that you get 12.7x improvement with zero changes to your automation code or your devices.
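
The batch path can be sketched as two small steps on the headend: split one exec payload into commands, then encode them as a single POST body. This assumes commands are joined with && in the exec string; the JSON body shape and the function names are hypothetical and not the clibench wire format.

```python
import json


def split_exec_payload(payload, separator="&&"):
    """Split one SSH exec string into individual CLI commands."""
    return [part.strip() for part in payload.split(separator) if part.strip()]


def build_batch_body(payload):
    """Encode every command from one exec call as a single HTTP POST body."""
    return json.dumps({"commands": split_exec_payload(payload)}).encode()


exec_payload = "show version && show ip route && show interfaces"
commands = split_exec_payload(exec_payload)
body = build_batch_body(exec_payload)
print(len(commands), "commands in one WAN request:", body.decode())
```

The point of the sketch is the ratio: five commands in one exec call become one WAN request, where per-command mode would send five.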

Proxy vs Tunnel: When to Use Which

Use the proxy when you can modify your automation to speak HTTPS. It’s faster (14.7x with connection reuse vs 12.7x for the tunnel), simpler (one hop instead of two), and has lower per-request overhead.

Use the tunnel when you can’t change the automation client. If your tooling must speak SSH (because of connection plugins, credential management, or organizational inertia), the tunnel gives you WAN optimization transparently. Batch mode requires that your SSH client send multiple commands in one exec call (an invocation like ssh host "cmd1 && cmd2" does this naturally), but even per-command mode is 3.0x faster than SSH direct at high latency.

Use both in a migration. Deploy the tunnel first for immediate wins with no code changes, then migrate automation to speak HTTPS to the proxy directly as you refactor.

What This Looks Like in Practice

If you have an internal API that accepts “run this command on this device” requests and returns the output, you’re already running a version of this.

Examples in the ecosystem:

Salt proxy minions with NAPALM behind the Salt REST API
AWX execution environments co-located with devices
Oxidized’s web interface
The Rackspace Go microservices from Part 1
NAAS (Netmiko as a Service): wraps Netmiko behind a REST API with connection pooling, async jobs, and circuit breakers

Most of these co-locate SSH with the devices (good), but don’t expose a clean HTTPS interface upstream, or they bury it under job queues, inventory sync, and YAML sprawl. The core pattern is simpler than any of those tools. The proxy in clibench proves the concept in ~180 lines of Go. A production deployment adds multi-vendor support, health checks, and credential management on top.

Security

The proxy doesn’t make things more or less secure. It changes the trust model.

With SSH direct, your automation server holds the SSH keys and authenticates directly to every device. With the proxy pattern, the trust boundary splits in two: your automation authenticates to the proxy (over HTTPS, using API tokens, mTLS, or whatever your org uses for service-to-service auth), and the proxy authenticates to the devices (over SSH, using keys that are available to the proxy itself).

What actually changes:

Where the SSH keys live. They move from the automation server to the proxy. The private keys never cross the WAN in either model (SSH public key auth sends a signature, not the key), but the proxy pattern puts the keys physically closer to the devices they unlock.
The WAN-side auth mechanism. Your automation no longer speaks SSH to devices. It speaks HTTPS to the proxy. That’s not inherently better or worse. It’s a different credential type (API token or client cert vs SSH key) managed through whatever system your organization already runs for service authentication.
The blast radius of a compromised proxy. The proxy has access to the SSH keys for every device it manages. Compromise the proxy, and you have access to the fleet. This is the same risk profile as an SSH bastion host, which most organizations already operate and already know how to harden: minimal attack surface, restricted network access, key rotation, session logging, and monitoring. The proxy deserves the same care you’d give a bastion.

When the Proxy Doesn’t Help

The proxy pattern assumes the WAN latency between your automation and the devices is the bottleneck. If your automation server is already co-located with the devices (same rack, same DC), there’s no WAN leg to optimize. SSH at 1-2ms RTT is fast enough.

It also doesn’t help if your bottleneck is device processing time rather than transport overhead. If a show tech-support takes 30 seconds to generate on the device, the transport saves you a few hundred milliseconds on a 30-second operation. Still worth it at scale, but the relative improvement is smaller.

And the proxy adds operational complexity. It’s another service to deploy, monitor, and maintain. For a team managing 50 devices in one location, the overhead isn’t justified. For a team managing thousands of devices across multiple continents, which is where SSH overhead actually hurts, the proxy pays for itself on the first automation run.

Try It

The benchmark code includes all modes. Run it yourself:

# All transports at 150ms WAN
sudo ./bin/clibench bench --latency intercontinental --iterations 20 --commands 5

# Proxy only (HTTPS + HTTP/3 variants)
sudo ./bin/clibench bench --latency intercontinental --iterations 20 --commands 5 --transport proxy

# Tunnel mode (SSH-to-HTTPS transparent WAN tunnel)
sudo ./bin/clibench bench --latency intercontinental --iterations 20 --commands 5 --transport tunnel-https

In Part 4, I’ll lay out a decision framework for choosing between SSH direct, a proxy, a tunnel, and native HTTPS, and dig into NAAS as a production deployment of the proxy pattern.

My take: The proxy pattern isn’t a workaround. It’s the right architecture for managing geographically distributed network infrastructure. SSH is fine for the last hop. HTTPS (or QUIC) is better for everything upstream.