FlowProxy: Approaching Optimality


Tags: Rust, Networking
Author: Lorenzo Feroleto
Published on: 12/29/2025
Read time: 10 minutes
Theme: Engineering

By Jonas (@mempirate) and Lorenzo (@thedevbirb) from Chainbound.


Table of Contents

  • Table of Contents
  • Background
  • Network Analysis
  • Bundle Loss
  • Latency
  • Improvements
  • Thread modelling
  • HTTP/2
  • TCP + bitcode
  • Kernel Parameter Tuning
  • mTLS
  • Next Steps

Background

In Introducing FlowProxy, Chainbound collaborated with Flashbots to deprecate its original orderflow proxy implementation in Go in favor of a new one built in Rust, created with the goal of reducing end-to-end latency, improving efficiency, and increasing observability in BuilderNet’s networking layer. That first collaboration focused on bringing the new proxy to production-ready quality, ready for deployment, and already showed good improvements in networking and processing latency.

This second collaboration successfully expanded on the Next Steps section outlined in the previous report. It includes a quantitative analysis of order losses and latencies across BuilderNet, along with extensive work to improve the transport used for communication between instances and to reduce both CPU and memory usage. We encountered a couple of non-obvious learnings that we outline below, and we hope they will be useful to the community.

Network Analysis

FlowProxy features a Clickhouse integration which makes it possible to record bundle receipts. A bundle receipt is a short summary that includes the timestamps at which the bundle was sent and received by instances, along with its raw size in bytes (useful for tracking the size → latency correlation). We used this data to better understand FlowProxy’s network performance in its entirety, and to discover which improvements to focus on.
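
To make this concrete, a receipt row could look roughly like the sketch below. Field names and types are hypothetical, chosen for illustration rather than taken from the actual Clickhouse schema.

```rust
// Hypothetical sketch of a bundle receipt row; names and types are illustrative,
// not FlowProxy's actual Clickhouse schema.
struct BundleReceipt {
    bundle_hash: [u8; 32],   // identifies the bundle across instances
    src_instance: String,    // sending instance, e.g. "beaver_eastus_07"
    dst_instance: String,    // receiving instance
    sent_at_ms: u64,         // unix timestamp (ms) at which the bundle was sent
    received_at_ms: u64,     // unix timestamp (ms) at which it was received
    payload_size_bytes: u64, // raw size in bytes, for the size -> latency correlation
}
```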

Bundle Loss

We started by analyzing bundle loss between instances. The table below reports the percentage of bundle loss on individual links during a short window of congestion:

| src | dst | lost | total | loss_rate_pct |
| --- | --- | --- | --- | --- |
| beaver_eastus_07 | nethermind_we_08 | 7857 | 35848 | 21.92% |
| beaver_eastus_07 | flashbots_we_09 | 1969 | 96597 | 2.04% |
| beaver_eastus_07 | beaver_we_08 | 1354 | 94000 | 1.44% |
| beaver_eastus_07 | flashbots_eastus_10 | 2 | 140801 | 0.00% |
| beaver_eastus_07 | nethermind_eastus_07 | 1 | 141534 | 0.00% |
| nethermind_eastus_07 | flashbots_eastus_10 | 10 | 35315 | 0.03% |
| nethermind_eastus_07 | beaver_eastus_07 | 10 | 35377 | 0.03% |
| nethermind_eastus_07 | flashbots_we_09 | 6 | 34560 | 0.02% |
| nethermind_eastus_07 | beaver_we_08 | 6 | 35216 | 0.02% |
| nethermind_eastus_07 | nethermind_we_08 | 4 | 34490 | 0.01% |
| flashbots_eastus_10 | beaver_we_08 | 4 | 34877 | 0.01% |
| flashbots_eastus_10 | flashbots_we_09 | 4 | 34266 | 0.01% |
| flashbots_eastus_10 | nethermind_we_08 | 3 | 34156 | 0.01% |
| flashbots_eastus_10 | beaver_eastus_07 | 1 | 35054 | 0.00% |
Full Table

We can see that inter-continental links are the most affected by bundle loss, especially those originating from the instance beaver_eastus_07, which receives the most flow in that region.

Bundles are marked as “lost” mainly when they time out, or when buffers overflow and backpressure is applied. This confirms what we explored in the last post: HTTP/1.1 and its inability to multiplex requests make it highly inadequate for BuilderNet’s workload.

Latency

The table below shows the latency between individual BuilderNet links during high congestion.

| src | dst | p50_ms | p99_ms | corr_payload_size | observations |
| --- | --- | --- | --- | --- | --- |
| beaver_eastus_07 | nethermind_we_08 | 39.619 | 1947.045 | -0.06 | 27991 |
| beaver_eastus_07 | flashbots_we_09 | 38.022 | 1789.051 | -0.06 | 94628 |
| beaver_eastus_07 | beaver_we_08 | 39.440 | 1521.599 | -0.04 | 92646 |
| beaver_eastus_07 | nethermind_eastus_07 | 62.639 | 292.451 | -0.02 | 141533 |
| nethermind_eastus_07 | beaver_we_08 | 33.381 | 194.660 | 0.59 | 35210 |
| nethermind_eastus_07 | nethermind_we_08 | 31.541 | 193.107 | 0.58 | 34486 |
| nethermind_eastus_07 | flashbots_we_09 | 30.992 | 192.134 | 0.60 | 34554 |
| nethermind_eastus_07 | flashbots_eastus_10 | -0.662 | 58.648 | 0.06 | 35305 |
| nethermind_eastus_07 | beaver_eastus_07 | -4.514 | 49.629 | 0.02 | 35367 |
| flashbots_eastus_10 | beaver_we_08 | 40.197 | 201.535 | 0.60 | 34873 |
| flashbots_eastus_10 | nethermind_we_08 | 38.467 | 200.267 | 0.59 | 34153 |
| flashbots_eastus_10 | flashbots_we_09 | 37.945 | 199.814 | 0.61 | 34262 |
| flashbots_eastus_10 | beaver_eastus_07 | 2.506 | 56.186 | 0.02 | 35053 |
| flashbots_eastus_10 | nethermind_eastus_07 | -7.746 | 26.184 | 0.04 | 35048 |
Full Table

As with the previous table, the highest latencies appear on the most active inter-region links, peaking at nearly 2 s at the p99. Another interesting result is the correlation with payload size (corr_payload_size): in normal working conditions (low p99s) it is more pronounced. This could mean a couple of things that will be useful to know later:

  • We’re not able to send large payloads in a single transmission (related to BDP, which we’ll talk about below).
  • Preparing the order for transmission, or processing it on the receiver side, takes noticeably longer the bigger the message. Some correlation is expected here, but it should be minimal. The main processing steps on the hot path are JSON encoding/decoding and signing/signature verification.

Improvements

Thread modelling

FlowProxy runs on the Tokio asynchronous runtime. The initial implementation of the proxy indiscriminately used Tokio tasks for all kinds of workloads, including CPU-intensive operations like signature recovery, signing, and decoding transactions. This approach is not ideal because the runtime and its tasks are fundamentally optimized for non-blocking, I/O-bound work, and using them for blocking or CPU-bound work will increase tail latencies.

Tokio schedules many lightweight tasks onto a small number of OS threads. If a task performs CPU-heavy or blocking work, it can monopolise a worker thread, preventing other tasks from making progress. We suspected this could partly be causing some of the high tail latencies we were seeing.

The Tokio authors recommend using tokio::task::spawn_blocking, which we tried initially. This spawns (or reuses) a thread managed by the runtime that is used purely for CPU-bound and blocking operations. However, this resulted in many more threads being spawned than we knew were necessary, and also had more overhead than expected. We didn’t dive into this too deeply, but intuitively it seemed like the blocking thread scheduler was not reusing threads effectively (Github issue).

To mitigate this, we introduced a configurable pool of specialised rayon threads for compute-heavy operations, and tweaked the number of Tokio worker threads to match the production environment requirements. This setup also gives us a healthy environment in which we can control how many resources a given service uses, since the same machine also runs other processes like the block builder.
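
A hedged sketch of this thread model is shown below; the pool sizes, thread names, and the work being offloaded are illustrative, not FlowProxy’s actual configuration.

```rust
use tokio::sync::oneshot;

/// A dedicated rayon pool for CPU-bound work (e.g. signature recovery or
/// transaction decoding), sized independently of the Tokio I/O workers.
fn build_cpu_pool(num_threads: usize) -> rayon::ThreadPool {
    rayon::ThreadPoolBuilder::new()
        .num_threads(num_threads)
        .thread_name(|i| format!("cpu-worker-{i}"))
        .build()
        .expect("failed to build rayon pool")
}

/// Run a CPU-heavy closure on the rayon pool and await its result from async
/// code without blocking a Tokio worker thread.
async fn run_cpu_bound<F, T>(pool: &rayon::ThreadPool, f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = oneshot::channel();
    pool.spawn(move || {
        // The receiver may have been dropped if the caller was cancelled.
        let _ = tx.send(f());
    });
    rx.await.expect("CPU-bound task panicked before sending a result")
}

fn main() {
    // Cap the Tokio I/O workers explicitly so they coexist with the rayon pool
    // and with other processes (e.g. the block builder) on the same machine.
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .build()
        .expect("failed to build Tokio runtime");

    let cpu_pool = build_cpu_pool(4);
    let sum = runtime.block_on(run_cpu_bound(&cpu_pool, || (0u64..1_000_000).sum::<u64>()));
    println!("result: {sum}");
}
```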

Side Note

While experimenting with the parameters, we observed that moving some specific compute-heavy operations off the current thread resulted in worse order processing latency. An interesting learning was that on a busy machine with many threads, sending tasks to a thread other than the currently executing one can increase latency significantly. The cost of scheduling onto another thread (context switching, waiting) should be taken into account, and is very context dependent!

HTTP/2

HTTP/2 was designed to address various performance limitations of HTTP/1.1 while keeping the same semantics. Among its improvements, the most impactful for FlowProxy is multiplexing: with HTTP/1.1 only one request/response can be in flight per TCP connection (leading to head-of-line blocking), while HTTP/2 allows multiplexing multiple requests and responses over a single TCP connection using streams.

Streams are logical, bidirectional channels within one connection. They are managed by windows: credit-based flow-control mechanisms that limit how much data can be sent on a single stream, ensuring it doesn’t starve the connection. This multiplexing allowed us to greatly reduce the number of open connections and improve connection reuse, which we already hinted was a source of message loss.

Upgrading to HTTP/2 was the first improvement we rolled out because of its complete backwards compatibility: communication with instances running a previous version of FlowProxy simply falls back to HTTP/1.1.

Below, you can see how the number of failures (read: lost messages) was reduced to essentially zero after its deployment.

HTTP failures after the deployment of HTTP/2.

While request failures dropped, latency didn’t significantly improve. In particular, we observed some improvement for small requests (body size under 32 KiB) over inter-regional links, as shown below. However, for same-region requests and bigger messages the situation remained identical or slightly worsened.

RPC call duration latency (p99) before and after the deployment of HTTP/2.

This was a very different result compared to our staging environment, which consists of four nodes distributed between East US and West Europe. We think the main culprit is an overall different topology and network load compared to the production environment, which would be hard to emulate completely. After this result, we started looking into tuning configurations.

FlowProxy instances operate behind an HAProxy reverse proxy that sits in front of the user and system endpoints, the latter being used for internal orderflow sharing. The proxy exposes some HTTP/2 tuning configurations that could help further reduce latency and spikiness.

The dimensions we could tune were the following (a hedged configuration sketch follows the list):

  • The maximum number of open streams;
  • The size of the window buffers;
  • Creating dedicated clients for small and big requests.
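
To make these knobs concrete, below is a hedged sketch of the kind of client-side HTTP/2 tuning involved, using a reqwest-based client purely for illustration (FlowProxy’s actual HTTP stack, and the values shown, may differ). The maximum number of concurrent streams is typically advertised by the peer, e.g. via HAProxy’s tune.h2.max-concurrent-streams.

```rust
use std::time::Duration;

// Hedged sketch of HTTP/2 client tuning; window sizes and intervals are
// illustrative values, not FlowProxy's production configuration.
fn build_h2_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Skip the HTTP/1.1 -> HTTP/2 upgrade dance on internal links.
        .http2_prior_knowledge()
        // Per-stream and per-connection flow-control windows.
        .http2_initial_stream_window_size(4u32 * 1024 * 1024)
        .http2_initial_connection_window_size(16u32 * 1024 * 1024)
        // Keep the connection warm so it is not torn down between bursts.
        .http2_keep_alive_interval(Duration::from_secs(10))
        .http2_keep_alive_timeout(Duration::from_secs(5))
        .build()
}
```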

While tuning those resulted in marginal improvements, we were still working with high-level abstractions, without much control over the metal. Moreover, HTTP/2 windows play a similar role to TCP congestion control and window scaling, resulting in some overhead and confusion about how the two interoperate. Because of this, we decided to pause the HTTP/2 tuning efforts and focus on a full migration to raw TCP (with TLS) using the msg-rs messaging library.

Bundle loss after HTTP/2

After deploying this improvement, we analyzed bundle loss once again (read the full analysis here). The table below contains a day’s worth of data, covering both periods of low activity and high activity. We can see that bundle loss has essentially disappeared. Sample for 2025-11-14:

| src | dst | lost | total |
| --- | --- | --- | --- |
| nethermind_eastus_07 | beaver_we_08 | 28 | 8894080 |
| beaver_eastus_07 | flashbots_eastus_10 | 8 | 3001624 |
| beaver_eastus_07 | flashbots_we_09 | 8 | 11168780 |
| beaver_eastus_07 | nethermind_eastus_07 | 8 | 30111251 |
| beaver_eastus_07 | beaver_we_08 | 8 | 10864297 |
| beaver_eastus_07 | nethermind_we_08 | 8 | 10069603 |
| nethermind_eastus_07 | flashbots_we_09 | 10 | 9028245 |
| nethermind_eastus_07 | nethermind_we_08 | 8 | 9052880 |
| nethermind_eastus_07 | beaver_eastus_07 | 6 | 10309013 |
| nethermind_eastus_07 | flashbots_eastus_10 | 5 | 10302474 |
| flashbots_eastus_10 | beaver_we_08 | 14 | 9024652 |
| flashbots_eastus_10 | nethermind_eastus_07 | 6 | 10443797 |
| flashbots_eastus_10 | beaver_eastus_07 | 6 | 10458639 |
| flashbots_eastus_10 | nethermind_we_08 | 4 | 9087691 |
| flashbots_eastus_10 | flashbots_we_09 | 2 | 9115212 |
Full Table

HTTP/2 latency

Below is a more complete before/after comparison of latency between links. After HTTP/2, the p99 and p999 are more bounded, while the p50 and p90 stayed almost the same. While picking a single day from each deployment may not be completely representative, the behaviour remained quite consistent over the following days. Latency comparison before and after HTTP/2 for a day’s worth of data (all numbers in milliseconds):

| src | dst | p50 (before) | p50 (after) | Δp50 | p99 (before) | p99 (after) | Δp99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nethermind_we_08 | beaver_eastus_07 | 42 | 42 | ~0 | 458 | 155 | -303 |
| nethermind_we_08 | flashbots_eastus_10 | 42 | 41 | -1 | 455 | 142 | -313 |
| beaver_eastus_07 | nethermind_we_08 | 44 | 41 | -3 | 357 | 125 | -232 |
| beaver_eastus_07 | flashbots_we_09 | 45 | 46 | +1 | 297 | 122 | -175 |
| nethermind_eastus_07 | flashbots_we_09 | 48 | 45 | -3 | 299 | 120 | -179 |
| flashbots_eastus_10 | nethermind_we_08 | 42 | 43 | +1 | 206 | 119 | -87 |
| flashbots_eastus_10 | flashbots_we_09 | 43 | 44 | +1 | 206 | 106 | -100 |
| flashbots_eastus_10 | beaver_eastus_07 | 43 | 44 | +1 | 284 | 131 | -153 |
| nethermind_eastus_07 | beaver_eastus_07 | 39 | 41 | +2 | 200 | 96 | -104 |
| beaver_eastus_07 | flashbots_eastus_10 | 40 | 41 | +1 | 197 | 103 | -94 |
Full Table

TCP + bitcode

To preserve backwards compatibility, we offered a new TCP-only version of the system API alongside the existing HTTP/2 one. Using a new, temporary endpoint allowed us to experiment with a binary encoding format, so we could finally drop JSON, which, besides its high decoding latency, produced payloads roughly 50% bigger than necessary due to encoding hex data as strings.

First, we looked into serde-compatible binary encoding formats, like bincode and MessagePack, but neither has full feature compatibility with serde_derive. Because of that, we dropped serde compatibility and went with bitcode instead, which currently sits at the top of the Rust Serialization Benchmark leaderboard.
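
The switch itself is mostly mechanical. Below is a minimal sketch of what encoding and decoding with bitcode’s derive API looks like; the Bundle type and its fields are illustrative, not FlowProxy’s real message schema.

```rust
// Requires bitcode with the "derive" feature enabled.
#[derive(bitcode::Encode, bitcode::Decode, Debug, PartialEq)]
struct Bundle {
    block_number: u64,
    // Raw transaction bytes: JSON would encode these as hex strings, roughly
    // doubling their size on the wire, while bitcode keeps them as raw bytes.
    txs: Vec<Vec<u8>>,
}

fn main() {
    let bundle = Bundle {
        block_number: 21_000_000,
        txs: vec![vec![0x02, 0xf8, 0x6f]],
    };

    // Encode to a compact binary payload and decode it back.
    let bytes: Vec<u8> = bitcode::encode(&bundle);
    let decoded: Bundle = bitcode::decode(&bytes).expect("valid bitcode payload");
    assert_eq!(bundle, decoded);
}
```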

This was the result of both the switch to TCP (without any additional tuning yet) and the migration from JSON to bitcode:

Order processing latency before and after the switch to TCP and bitcode.

However, there’s a big caveat here: we only saw these levels of improvement on very specific links. The plot above depicts latency from a particularly broadcast-heavy instance, one that ingests and forwards a lot of orders. We barely saw any improvements on other instances, which led us down the rabbit hole of TCP kernel parameters.

Kernel Parameter Tuning

To get the most out of raw TCP sockets, tuning kernel parameters was fundamental to achieving optimised connections. To better understand the changes we made, let’s brush up on some preliminaries.

The bandwidth-delay product, or BDP, is a property of a network path, computed as the product of the bandwidth, expressed in bytes per second, and the round-trip time (“delay”), expressed in seconds. In the context of TCP, it represents how much data the network can hold “in flight” before the sender must wait for acknowledgments.

Let’s make it concrete: a FlowProxy node has an upload bandwidth of 1 Gbps, and the RTT of an Azure link between East US and West Europe is around 85 ms (source). This gives a BDP of $1\,\text{Gbps} \cdot 85\,\text{ms} = 85\,\text{Mbit} = 10.625\,\text{MB}$, meaning this is the theoretical maximum amount of data we can have unacknowledged in flight over TCP, assuming ideal network conditions.

Related to the BDP is the congestion window (referred to as cwnd in code). It is a core TCP mechanism that controls how much data a sender is allowed to have “in flight” (sent, but not yet acknowledged) at any given time. On the receiver side there is rwnd (the receive window), which determines how much unacknowledged data can accumulate at the receiver before packets are dropped. The minimum of cwnd and rwnd determines your maximum throughput.

One direct consequence of BDP is the following: if min(cwnd, rwnd) < size(message), that message will need an additional round trip to fully transmit!

In theory, the maximum size of the congestion window over a link matches the BDP, which would mean there is no congestion at all. In practice, however, the network is not always stable: packet loss happens and the RTT varies over time. The congestion window therefore dynamically adapts how much unacknowledged data is allowed in flight, so that no resources are wasted. The mechanisms governing this are called TCP congestion control algorithms.

All of these parameters can be modified in the Linux kernel with sysctl. The following are the most important:

  • net.ipv4.tcp_congestion_control: the default (cubic) worked for us.
  • net.ipv4.tcp_window_scaling: ensure this is turned on.
  • net.ipv4.tcp_rmem: sets the bounds for rwnd. The default and max had to be significantly increased; some multiple of your expected maximum message size is a good guideline. Linux will take care of autoscaling this if window scaling is turned on.

Besides the congestion control algorithm, there are other settings that can impact the congestion window. One of them is “slow start after idle”, whose sysctl setting is net.ipv4.tcp_slow_start_after_idle. A connection is considered idle if no packet has been sent for a certain amount of time, computed as a function of the RTT but with a default minimum of 200 ms, which is what would apply with an 85 ms RTT. When a connection goes idle, its congestion window is greatly reduced, diminishing throughput and increasing latency.

Given that connections between FlowProxy instances are long-lived and can be bursty in periods of high traffic, this setting should be disabled. In the picture below you can see the tremendous impact of this switch on low-volume instances:

Call duration latency (p99) before and after net.ipv4.tcp_slow_start_after_idle was disabled.

This setting was exactly why the high-message-volume instances displayed really good latency after the switch to TCP while the others didn’t: their TCP connections were never idle, so their congestion windows were never reset.

The general takeaway here is that Linux TCP defaults are very conservative and always try to save resources; they should almost always be adjusted to your workload.
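
Putting it together, the settings discussed above look roughly like this in sysctl.conf form. The values are illustrative (the buffers are sized around a ~10 MB BDP), and the send-side tcp_wmem line is added here for completeness rather than taken from our deployment.

```
# Congestion control: the default (cubic) worked for us.
net.ipv4.tcp_congestion_control = cubic
# Make sure window scaling is enabled so windows can exceed 64 KiB.
net.ipv4.tcp_window_scaling = 1
# Receive buffer bounds (min / default / max, in bytes); default and max raised
# well above the BDP so large bundles fit in a single window.
net.ipv4.tcp_rmem = 4096 1048576 33554432
# Send-side counterpart (illustrative).
net.ipv4.tcp_wmem = 4096 1048576 33554432
# Keep the congestion window across idle periods on long-lived, bursty links.
net.ipv4.tcp_slow_start_after_idle = 0
```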

After all these modifications, we were finally approaching optimality in terms of order propagation: the variance for big and small messages, over long and short links, was reduced significantly. However, there was one more change that we knew would be impactful, not on latency but on CPU usage.

mTLS

The last improvement we worked on during this collaboration is mutual TLS (mTLS). If required by the server, the client must present a valid X.509 TLS certificate to authenticate itself. This works because all instances know each other’s TLS certificates through a central service registry.

Currently, FlowProxy authenticates other instances on a message-by-message basis: every message it receives must be signed by a known ECDSA public key. This is secure, but wasteful: every message needs to be (a) signed by the sender and (b) verified by the receiver. mTLS changes authentication from a per-message basis to a per-connection basis: authentication occurs during connection setup, and after a successful handshake, every message sent on that connection is transitively authenticated too.
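
As a rough illustration of what connection-level authentication looks like, here is a minimal sketch using a rustls 0.23-style API; the certificate plumbing is assumed to come from the service registry, and this is not FlowProxy’s actual implementation.

```rust
use std::sync::Arc;

use rustls::pki_types::{CertificateDer, PrivateKeyDer};
use rustls::server::WebPkiClientVerifier;
use rustls::{ClientConfig, RootCertStore, ServerConfig};

/// Server side: require a valid client certificate on every connection,
/// verified against the peer certificates known from the service registry.
fn mtls_server_config(
    peer_roots: RootCertStore,
    cert_chain: Vec<CertificateDer<'static>>,
    key: PrivateKeyDer<'static>,
) -> Result<ServerConfig, Box<dyn std::error::Error>> {
    let verifier = WebPkiClientVerifier::builder(Arc::new(peer_roots)).build()?;
    let config = ServerConfig::builder()
        .with_client_cert_verifier(verifier)
        .with_single_cert(cert_chain, key)?;
    Ok(config)
}

/// Client side: present our own certificate so the server can authenticate us
/// during the handshake instead of verifying every message.
fn mtls_client_config(
    peer_roots: RootCertStore,
    cert_chain: Vec<CertificateDer<'static>>,
    key: PrivateKeyDer<'static>,
) -> Result<ClientConfig, Box<dyn std::error::Error>> {
    let config = ClientConfig::builder()
        .with_root_certificates(peer_roots)
        .with_client_auth_cert(cert_chain, key)?;
    Ok(config)
}
```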

At the time of writing, this feature hasn’t been deployed to a production environment yet, but from the picture below we can already appreciate its effects in the staging environment.

Global CPU usage and thread usage after the deployment of mTLS in a staging environment.

This preliminary result shows a roughly 50% reduction in CPU and thread usage, lowering FlowProxy’s total CPU usage from ~10% to ~5%. While this may seem like a minor improvement, freeing up CPU for other critical services such as rbuilder can make the difference between winning and losing a relay auction.

Next Steps

At the moment we’re pretty satisfied with the performance of FlowProxy, but there is still room for improvement. In particular, we haven’t yet worked on the communication between a FlowProxy instance and its local builder, which still uses JSON-RPC over HTTP, albeit via localhost. Assuming these two services keep running on the same machine, we could implement a shared-memory-based transport. While this would clearly be a benefit over the status quo, we are approaching a point of diminishing returns compared to other improvements.

As such, the next steps may move away from FlowProxy and concentrate on a different part of the BuilderNet stack and its architecture.