Since the rise of large cloud hosting services like Amazon’s EC2, x86 processors have dominated the market. Other server CPU architectures exist, such as IBM’s POWER and Oracle’s SPARC, but they have had minimal influence on commodity cloud hosting.
That dynamic changed when Amazon introduced Arm-based chips on its Elastic Compute Cloud (EC2) service. Arm processors, which grew out of much smaller devices like mobile phones, can now handle your most demanding workloads.
We are obsessed with data. Our customers use Redpanda to build data-intensive applications that ingest terabytes of data every day. So it only made sense that we would want to find out how these Arm instances stack up against x86 instances when hosting data-intensive applications.
In this post, I explore that question in depth.
The Graviton processor from Amazon Web Services, introduced at the end of 2018, was notable as one of the first non-x86 CPU options offered by a major cloud vendor. For workloads that could be ported to Arm, the original Graviton 1 chips (in the a1 instance types) offered a decent price-to-performance ratio. This was true even though the maximum machine size was limited (to an a1.4xlarge with 16 CPUs and 32 GiB of RAM), the only available instance shape was the a1 type (roughly equivalent to “General Purpose” x86 instances), and single-threaded performance was significantly worse than what was then available on x86. This release was only a taste of what the Graviton family had to offer, however, as the second-generation Graviton 2 chips would follow in 2020.
If the first Graviton was Amazon dipping its toes into the CPU pool, the second was a cannonball from the highest diving board. Available by mid-2020, Graviton 2 was a substantial improvement over its predecessor. It offered more than 50% better per-core performance and, more crucially, far larger and more varied instance types, with up to 64 cores and 512 GiB of RAM. Following the introduction of Graviton 2, several reviews found that these Arm-powered instance types provided a considerable price/performance advantage over x86 instances.
Amazon recently launched two new Graviton 2 instance types with attached NVMe SSD storage: Is4gen and Im4gn. An Is4gen instance has twice the storage and 50% more memory than an Im4gn instance of the same size, but the same number of CPUs. These instance types target workloads similar to those of the storage-optimized I3 and I3en x86 instance types.
For data-intensive applications like Redpanda, we are curious how these new Arm instance types perform in comparison to x86-based instances. To answer that question, we compare the storage-optimized Arm and x86 instances on EC2.
Redpanda is a modern, developer-focused streaming data platform that is compatible with the Apache Kafka® API. We made it a priority to give our users the freedom to run Redpanda on any platform of their choice. We released Redpanda for Arm in 2021, and you can install it from packages or follow the instructions on GitHub to build it from source.
Redpanda is an excellent case study for examining the different Graviton instance types since it depends heavily on disk performance, notably disk throughput and persistent write latency.
Adapting your program for Arm
This may be challenging if your program is like Redpanda: a C++ program that is compiled to a native binary and uses several native third-party libraries. To run on Arm, our code and all of its dependencies have to be recompiled and possibly ported.
Nevertheless, and to my surprise, porting Redpanda wasn’t as challenging as you might assume. The seasoned engineer who did the work summarised the modest effort as updating the packaging and changing a few compilation flags for a small number of dependencies.
- A typical change was making x86-specific settings in the CMake build scripts conditional on the detected processor.
- After a number of these changes, plus an update to the packaging process to produce Arm packages, the job was done.
There was no need to change the C++ source code.
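To make the idea of processor-dependent build settings concrete, here is a minimal Python sketch of the selection logic. It is only an illustration: in the real project this logic lives in the CMake build scripts, and the flag values, the `ARCH_FLAGS` table, and the function names below are hypothetical rather than Redpanda’s actual settings.

```python
import platform

# Hypothetical flag table: illustrative values only, not Redpanda's real build settings.
ARCH_FLAGS = {
    "x86_64": ["-march=westmere"],
    "aarch64": ["-march=armv8-a+crc+crypto"],
}

# Different operating systems report the same architecture under different names.
ALIASES = {"amd64": "x86_64", "arm64": "aarch64"}


def normalize_arch(machine):
    """Map a reported machine name to a canonical architecture name."""
    name = machine.strip().lower()
    return ALIASES.get(name, name)


def flags_for(machine):
    """Return the compiler flags for the given machine name."""
    arch = normalize_arch(machine)
    if arch not in ARCH_FLAGS:
        raise ValueError(f"unsupported architecture: {machine!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(ARCH_FLAGS[arch])


def host_flags():
    """Return the flags for the machine this script is running on."""
    return flags_for(platform.machine())
```

Calling `flags_for("arm64")` and `flags_for("aarch64")` returns the same list, which is the point of the exercise: the build description stays the same and only the architecture-specific settings vary with the detected processor.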
One feature of our development approach that makes this simple is that we “vendor” all of our third-party dependencies. In other words, we build each dependency from source using a pinned version of its source code (usually the upstream version, or a fork if we have pending modifications). This lets us modify our temporary fork of any project without waiting for upstream to accept a pull request (though we do try to send PRs upstream, too, so that we stay as closely aligned with upstream as possible). In this case, it meant we could modify the build process to generate Arm binaries as needed.
Benchmarking x86 against Arm
Systems that process a lot of data put a lot of strain on the disk and the network. The storage-optimized instances we examine today have high bandwidth in both areas, so where the bottleneck falls depends on the ratio of network to disk traffic in your application.
The precise ratio varies with the use case, but data streaming workloads, like those powered by Redpanda, often have roughly balanced network and disk demands. Because the instance types in question have greater network bandwidth than disk bandwidth, the disk becomes the limiting factor for a pure ingestion workload with an approximately 1:1 network to disk ratio.
Because disk speed is the deciding factor, we’ll concentrate on it in our initial assessment of these instances. In particular, we focus on three disk-related benchmarks:
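The bottleneck reasoning above can be written as a small calculation. The following Python sketch is an illustration only: the function name, its parameters, and the bandwidth figures in the usage note are hypothetical and are not measurements of any EC2 instance type.

```python
def ingest_bottleneck(network_bw, disk_bw, disk_bytes_per_network_byte=1.0):
    """Return (limiting resource, maximum ingest rate).

    network_bw and disk_bw use the same unit (for example GB/s).
    disk_bytes_per_network_byte is how many bytes hit the disk for every
    byte received over the network; 1.0 models a 1:1 ingestion workload.
    The returned rate is expressed in network-side bytes per second.
    """
    if min(network_bw, disk_bw, disk_bytes_per_network_byte) <= 0:
        raise ValueError("bandwidths and ratio must be positive")

    # Ingest rate at which the disk saturates, in network-side units.
    disk_limited_rate = disk_bw / disk_bytes_per_network_byte

    if disk_limited_rate < network_bw:
        return "disk", disk_limited_rate
    return "network", network_bw
```

With made-up figures of 3.0 for the network and 2.0 for the disk, `ingest_bottleneck(3.0, 2.0)` returns `("disk", 2.0)`: at a 1:1 ratio the disk saturates first. If only half of the incoming bytes reached the disk, `ingest_bottleneck(3.0, 2.0, 0.5)` would return `("network", 3.0)` instead.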
- Volume bandwidth
- Consumption rate
- Bandwidth management techniques
In this comparison, we concentrate on the new Arm Graviton 2 based Is4gen family and the Intel Skylake-SP based I3en family, since each is a storage-optimized instance family for its architecture.
Since we expect disk bandwidth to be the limiting factor, we first assess the raw disk write bandwidth provided by the Intel (I3en) and Graviton 2 (Is4gen) instance types.
To profile the bandwidth, we use fio (the Flexible I/O Tester), created by Jens Axboe and contributors. This sophisticated tool includes several I/O engines that exercise various system-level I/O APIs, and it lets users specify an I/O test pattern in a configuration file. In particular, fio lets us construct a close approximation of the I/O patterns that data-intensive programs like Redpanda impose on the disk.
For our test, we use a sequential write workload with four jobs (roughly speaking, threads) writing directly to disk (bypassing the page cache) with a 16 KiB block size. To be precise, we use the fio configuration listed below.