Engineering the xx network alpha

By Richard Carback and Benjamin Wenger, October 22, 2019.

The xx network public alpha has now been running for a month. It has been an exciting challenge for our team to maintain a test network in the public eye (see a live dashboard of the xx network here), and we want to update you on some of the lessons we have learned so far.

In its current configuration, the xx network public alpha is a network of eight server nodes. At any given time, five of these nodes are part of an active network, running the Elixxir privacy-protecting decentralized transaction platform. The five-node team allows users to send messages to each other while shredding metadata. This page explains how the AlphaNet fits within our larger roadmap.

The “plumbing” of the internet is far less orderly than it appears from a web browser. Connections die and bottlenecks happen, and our network needs to be flexible enough to tolerate these events and recover. Early network failures have helped us improve the AlphaNet services, raising server uptime from a couple of days to two weeks and counting. We fixed these issues by tuning our reconnection logic to match real-world observations: we lengthened both the time nodes wait between reconnection attempts and the time they wait before giving up. We also fixed a race condition in our metrics reporting, where metrics data was being sent before the server thread could populate it.
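As a rough illustration of the two knobs mentioned above (the wait between reconnection attempts and the time before giving up), here is a minimal sketch in Python. It is not the xx network code: the function name, the parameter names and the default values are all hypothetical, and the sleep and clock functions are injectable only so the logic can be exercised without real delays.

```python
import time
from typing import Callable


def connect_with_retry(
    connect: Callable[[], bool],
    retry_wait_s: float = 5.0,        # hypothetical wait between attempts
    give_up_after_s: float = 120.0,   # hypothetical total time before giving up
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call connect() until it succeeds or the give-up deadline would be passed."""
    deadline = clock() + give_up_after_s
    while True:
        if connect():
            return True
        # Stop if waiting once more would take us past the deadline.
        if clock() + retry_wait_s > deadline:
            return False
        sleep(retry_wait_s)
```

Lengthening `retry_wait_s` makes a node less aggressive while a peer is briefly unreachable, and lengthening `give_up_after_s` lets it ride out longer outages before declaring failure.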

Our backend team was also able to vastly improve network performance over this short period. When we first deployed, the network was spending a very large share of its time on network operations. Transferring 4 megabits (a batch of 1,000 payloads) from one node to the next to be mixed took roughly 600ms. Given the locations of the nodes, we were expecting 40-80ms of average communications latency, and since the nodes have 1 gigabit connections, even a conservative estimate of the transfer time itself does not exceed 8ms[1].
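The conservative estimate can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and the `efficiency` parameter are illustrative, and the 50% figure is the assumption stated in the footnote.

```python
def transfer_time_ms(payload_bits: int, link_bits_per_s: int, efficiency: float = 1.0) -> float:
    """Time to push payload_bits through a link, ignoring latency and protocol overhead."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return payload_bits * 1000 / (link_bits_per_s * efficiency)


BATCH_BITS = 4_000_000            # 4 megabits, the batch size quoted above
LINK_BITS_PER_S = 1_000_000_000   # 1 gigabit per second
OBSERVED_MS = 600                 # roughly what the network actually took

# Assume the link delivers only half of its advertised speed.
conservative_ms = transfer_time_ms(BATCH_BITS, LINK_BITS_PER_S, efficiency=0.5)
print(f"estimated {conservative_ms:.1f} ms vs observed {OBSERVED_MS} ms")
```

Even after adding the expected 40-80ms of latency to this estimate, the total stays far below the observed 600ms, which is what prompted the investigation.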

Even more curious, as we added more payloads to be transferred simultaneously, the network performed relatively faster. Reducing the number of payloads to 100 only brought the transfer time down to 200ms, while increasing it to 10,000 payloads took 2s, so the time spent per payload fell as the batches grew.
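Dividing each observed transfer time by its payload count makes the trend easier to see. The sketch below uses only the three measurements quoted in this post; the names are illustrative.

```python
# Observed node-to-node transfer times from this post: payload count -> milliseconds.
OBSERVED_MS = {100: 200.0, 1_000: 600.0, 10_000: 2_000.0}


def per_payload_ms(observed: dict) -> dict:
    """Return the average transfer time per payload for each batch size."""
    return {count: total_ms / count for count, total_ms in observed.items()}


for count, cost in per_payload_ms(OBSERVED_MS).items():
    print(f"{count:>6} payloads: {cost:.1f} ms per payload")
```

The per-payload cost drops from about 2ms at 100 payloads to about 0.2ms at 10,000, which is not what a transfer limited purely by bandwidth would look like.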

To investigate, we looked at the latency independently. We added back a “Round Trip Ping” which we had used previously to monitor the system. This sends a ping around the network loop, from the first node to the second, third, fourth, and fifth, and then back to the first. Interestingly, it showed that the actual latency we were encountering was exactly as expected. The full network loop took 227ms on average, or about 45ms per connection. From this we concluded that the issue was in the actual data transfer.

The communications library we are using is gRPC-Go, which uses HTTP/2 underneath. Through research we found a very similar issue from 2018, since solved, related to gRPC's use of HTTP/2 windowing. Essentially, the issue was that HTTP/2 windows default to a size of 64K, and the sender must hear back from the receiver after it sends 64K of data before it sends more. 64K over a 1 gigabit connection takes just 0.064ms to send, meaning the sender spends the vast majority of its time waiting for an ACK. Given the average of 45ms of latency, it was clear that the issue was not occurring exactly in the manner described, because then transfers would take upwards of 2.8 seconds[2].
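The 2.8 second figure comes from a simple model in which the sender stalls for one latency period after every window of data. A minimal sketch of that arithmetic follows; the function name is illustrative, and the 64,000-bit window and 45ms wait are the values used in the footnote rather than measured gRPC settings.

```python
def window_limited_transfer_ms(payload_bits: float, window_bits: float, wait_per_window_ms: float) -> float:
    """Estimate transfer time when the sender waits once per window of data.

    The time to put the bits on the wire is ignored because, at these sizes,
    it is tiny compared with the wait for each acknowledgement.
    """
    if window_bits <= 0:
        raise ValueError("window_bits must be positive")
    windows = payload_bits / window_bits
    return windows * wait_per_window_ms


# Values from the footnote: 4 megabits, 64,000-bit windows, 45 ms per wait.
worst_case_ms = window_limited_transfer_ms(4_000_000, 64_000, 45)
print(f"{worst_case_ms} ms")
```

The window size is a parameter so that other assumptions can be tried. For example, the HTTP/2 specification defines the default initial window as 65,535 octets, so passing `window_bits=65_535 * 8` shows how sensitive the estimate is to how "64K" is read.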

It turns out that the problem was solved in gRPC by doing two things: adding a dynamic windowing algorithm, and allowing the initial window properties to be set manually. Given that we were not seeing transfers of nearly 3 seconds, we concluded that the dynamic windowing algorithm must be helping, but not as much as we would like. A cMix node communicates relatively large chunks of data relatively infrequently: it sends 4 megabits every 500ms to 2 seconds, depending on its position in the pipeline.

Given this knowledge, we decided to fast-track streaming communications. gRPC allows larger messages to be sent piecemeal. We had wanted this capability so that nodes could start operating on a batch before the entire block was received, but it turns out that gRPC streaming communicates differently and does not suffer from the same windowing issues. So we fast-tracked the work and on October 16 pushed streaming comms to the live network, resulting in a speedup of more than 3x and an apparent near-elimination of communications latency.

Looking at the final results, we seem to have almost completely eliminated communications latency, which is obviously not correct. It turns out that when multiple nodes are operating on the same dataset simultaneously, the latency tracking built into the dashboard fails. We will fix this in the coming weeks. We will also research the issues we have encountered with gRPC, with the hope of relaying any findings back to the project.

The Next Steps

The backend team has one big performance upgrade left for the AlphaNet, which we call pipelining. In cMix, nodes work sequentially on a batch of payloads to anonymize them. As a result, only one node is operating at a time:

So in this diagram, with only three nodes, the network is operating at 33% efficiency, because only one node is processing data at a time.

With pipelining, the network operates on the same number of batches as there are nodes in a team, so in this three-node setup it would operate on three batches simultaneously.

In the AlphaNet, which has five nodes in its team, this will result in roughly a 5x improvement in throughput. We hope to push this update in the next three to four weeks.
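The expected gain can be illustrated with a small throughput model. This is a simplified sketch, not the cMix scheduler: it assumes every node takes exactly one time slot per batch and ignores communication time, and the function names are hypothetical.

```python
def batches_completed(num_nodes: int, time_slots: int, pipelined: bool) -> int:
    """Count batches that finish within time_slots when each node takes one slot per batch."""
    if num_nodes < 1 or time_slots < 0:
        raise ValueError("num_nodes must be >= 1 and time_slots >= 0")
    if time_slots < num_nodes:
        return 0  # not even the first batch has passed through every node
    if pipelined:
        # The first batch finishes after num_nodes slots, then one more per slot.
        return time_slots - num_nodes + 1
    # Without pipelining a new batch starts only when the previous one is done.
    return time_slots // num_nodes


def pipelining_speedup(num_nodes: int, time_slots: int) -> float:
    """Ratio of pipelined to sequential throughput over the same number of slots."""
    sequential = batches_completed(num_nodes, time_slots, pipelined=False)
    if sequential == 0:
        raise ValueError("time_slots too small to complete a batch")
    return batches_completed(num_nodes, time_slots, pipelined=True) / sequential


print(pipelining_speedup(num_nodes=5, time_slots=1000))
```

Over a long run the ratio approaches the number of nodes in the team, which matches the roughly 5x figure for a five-node team; the shortfall over short runs is the time needed to fill the pipeline.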

Into The Future

We have also set our sights on BetaNet. Our team is working hard on finalizing the GPU optimizations as well as implementing scheduling for many teams. BetaNet will drastically improve system latencies, because it will allow us to take full advantage of precomputation. While one team is operating in real time, the rest of the teams will be precomputing. We are excited to bring BetaNet into operation, and work with all our BetaNet node operators.

Footnotes

[1] 4,000,000 bits / 500,000,000 bits per second = 8ms (assuming our partners are getting 50% of the advertised speed).
[2] (4,000,000 bits / 64,000 bits) * 45ms = 2,812.5ms.