Written by 0xIchigo
Published on August 20, 2024

All You Need to Know About Solana's v1.18 Update

A big thank you to Rex St. John and Mike MacCana for reviewing this article.

Introduction

The super-majority adoption of Solana’s 1.18 update is a significant milestone. It ushers in a host of improvements and new features aimed at enhancing the network’s performance, reliability, and efficiency. One of the most notable changes is the introduction of a central scheduler. This new scheduler aims to streamline transaction handling and ensure more accurate and efficient priority calculations. Other improvements to the runtime environment and program deployments, for example, help provide more reliable performance even during times of peak network load.

This article explores the updates and improvements brought by the 1.18 release. We’ll cover the motivations behind these changes, the specifics of the new features, and their expected impact on the network. Whether you’re a validator operator, a developer, or the average Solana user, this comprehensive overview of the 1.18 update will give you the information needed to understand and leverage these improvements.

We must first discuss Anza, a newly established development firm driving these changes, and its role in the ongoing development of Solana.

What’s Anza?

Anza is a newly established software development firm created by former executives and core engineers from Solana Labs. Its creation represents a strategic move to bolster Solana’s ecosystem, aiming to improve its reliability, decentralization, and network strength by developing critical infrastructure, contributing to key protocols, and fostering the innovation of new tools.

The founding team includes Jeff Washington, Stephen Akridge, Jed Halfon, Amber Christiansen, Pankaj Garg, Jon Cinque, and several core engineers from Solana Labs.

Anza is focused on developing and refining Solana’s validator clients with the creation of Agave — a fork of the Solana Labs validator client. Anza’s ambitions extend beyond the development of its validator client; the team is committed to ecosystem-wide improvements, including the development of Token Extensions and a customized Rust / Clang toolchain. By fostering a collaborative and open approach to development, Anza is dedicated to accelerating and improving the Solana ecosystem.

What’s Agave?

As mentioned briefly in the previous section, Agave is a fork of the Solana Labs validator client spearheaded by Anza. In this context, the term “fork” refers to Anza’s development team taking the existing code from the Solana Labs repository and starting a new development path separate from the original codebase. This allows Anza to implement its own improvements, features, and optimizations to the Solana Labs client.

The Migration Process

The migration of the client to Anza’s GitHub organization started on March 1st. Initially, Agave will mirror the Solana Labs repository to give the community time to adjust. During this period, Anza will handle closing pull requests (PRs) and migrating relevant issues to Agave’s repository. Agave and the Solana Labs client versions 1.17 and 1.18 will be identical in terms of functionality. Anza aims to release Agave v2.0 this summer, which includes archiving the Solana Labs client and recommending that 100% of the network migrate over to the new Agave client.

The Solana Labs to Agave migration process is publicly tracked on their GitHub.

The Agave Runtime

The Agave Runtime inherits its foundational architecture from the Solana Virtual Machine (SVM) and serves as the backbone for executing the core functionalities defined by the Sealevel runtime.

The Solana protocol delineates the runtime as a critical component for processing transactions and updating the state within the accounts database. This specification has been adopted and further refined by the Agave and Firedancer clients. The essence of the SVM is its capability to execute all Solana programs and modify account states in parallel.

The concept of a bank is key to processing transactions and understanding the changes coming in 1.18. A bank is both a piece of logic and a representation of the ledger state at a specific point in time. It acts as a sophisticated controller managing the accounts database, tracking client accounts, overseeing program execution, and maintaining the integrity and progression of Solana’s ledger. A bank encapsulates the state resulting from the transactions included in a given block, serving as a snapshot of the ledger at that point in time.

Each bank is equipped with caches and references necessary for transaction execution, allowing them to be initialized from a previous snapshot or the genesis block. During the Banking Stage, where the validator processes transactions, banks are used to assemble blocks and later verify their integrity. This lifecycle includes loading accounts, processing transactions, freezing the bank to finalize state, and eventually making it rooted, ensuring its permanence.
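To make this lifecycle more concrete, here is a minimal sketch of how a bank might move through its states. The types and method names are purely illustrative and are not the actual Agave implementation:

```rust
// A minimal, simplified sketch of a bank's lifecycle -- not the actual Agave types.
#[derive(Debug, PartialEq)]
enum BankState {
    Active, // accepting and processing transactions for its slot
    Frozen, // state finalized; no further transactions may be applied
    Rooted, // confirmed by the cluster; permanently part of the ledger
}

struct Bank {
    slot: u64,
    state: BankState,
}

impl Bank {
    // A new bank for the next slot is derived from its (frozen) parent, inheriting
    // the caches and account references it needs to execute transactions.
    fn new_from_parent(parent: &Bank) -> Bank {
        Bank { slot: parent.slot + 1, state: BankState::Active }
    }

    fn freeze(&mut self) {
        self.state = BankState::Frozen; // finalize the state for this slot
    }

    fn root(&mut self) {
        assert_eq!(self.state, BankState::Frozen);
        self.state = BankState::Rooted; // the bank is now permanent
    }
}

fn main() {
    let genesis = Bank { slot: 0, state: BankState::Frozen };
    let mut bank = Bank::new_from_parent(&genesis);
    // ... load accounts and process transactions for this slot ...
    bank.freeze();
    bank.root();
    println!("bank for slot {} is now {:?}", bank.slot, bank.state);
}
```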

As a general overview, the transaction processing engine within the Agave Runtime is tasked with loading, compiling, and executing programs. It uses Just-In-Time (JIT) compilation, caching compiled programs to optimize execution efficiency and reduce unnecessary recompiling. Programs are compiled to eBPF format before deployment. The runtime then uses the rBPF toolkit to create an eBPF virtual machine, which performs JIT compilation from eBPF to x86_64 machine code instructions, taking full advantage of the available hardware. This ensures the programs are executed efficiently.
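The caching behavior described above can be sketched roughly as follows. The types and the jit_compile helper are hypothetical placeholders rather than the actual Agave or rBPF APIs:

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the real types -- not the actual Agave or rBPF APIs.
type Pubkey = [u8; 32];

struct CompiledProgram {
    // In the real runtime this would hold JIT-compiled x86_64 code produced by rBPF.
    machine_code: Vec<u8>,
}

struct ProgramCache {
    compiled: HashMap<Pubkey, CompiledProgram>,
}

impl ProgramCache {
    // Return the cached compilation if one exists; otherwise JIT-compile the program's
    // eBPF bytecode once and cache the result so later transactions skip recompilation.
    fn get_or_compile(&mut self, program_id: Pubkey, ebpf_bytecode: &[u8]) -> &CompiledProgram {
        self.compiled
            .entry(program_id)
            .or_insert_with(|| jit_compile(ebpf_bytecode))
    }
}

// Placeholder for the eBPF -> x86_64 JIT step performed by the rBPF virtual machine.
fn jit_compile(ebpf_bytecode: &[u8]) -> CompiledProgram {
    CompiledProgram { machine_code: ebpf_bytecode.to_vec() }
}

fn main() {
    let mut cache = ProgramCache { compiled: HashMap::new() };
    let program_id: Pubkey = [0u8; 32];
    let bytecode = [0x95u8]; // a single eBPF "exit" instruction, for illustration
    let compiled = cache.get_or_compile(program_id, &bytecode);
    println!("cached {} byte(s) of compiled code", compiled.machine_code.len());
}
```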

The 1.18 update introduces a central transaction scheduler, which is deeply intertwined with the operational efficiencies introduced by the Agave Runtime. By improving how transactions are compiled, executed, and managed via banks, the 1.18 update enables a more streamlined and efficient scheduling process. In turn, this leads to faster transaction processing times and enhanced throughput. The new Agave Runtime and its client serve as the bedrock for these enhancements, so it’s crucial to have a general understanding of them before we dive into the intricacies of the new scheduler.

If you want to learn more about the Agave Runtime, I recommend reading Joe Caulfield's article on the topic. It goes into considerable detail and provides helpful code snippets throughout.

A More Efficient Transaction Scheduler

The Current Implementation

Source: Adapted from Andrew Fitzgerald’s article Solana Banking Stage and Scheduler

In the transaction processing pipeline, packets of transactions first enter the system through packet ingress. These packets then undergo signature verification during the SigVerify stage. This step ensures each transaction is valid and authorized by the sender.

Following signature verification, transactions are sent to the Banking Stage. The Banking Stage has six threads: two dedicated to processing vote transactions from either the Transaction Processing Unit (TPU) or Gossip, and four focused on non-vote transactions. The threads operate independently of one another and receive packets from a shared channel. That is, SigVerify sends packets over in batches, and each thread pulls transactions from that shared channel and stores them in a local buffer.

The local buffer receives the transactions, determines their priority, and sorts them accordingly. This queue is dynamic, constantly updating to reflect real-time changes in transaction status and network demands. As transactions are added to the queue, their order is reassessed to ensure the highest-priority transactions are ready to be processed first.
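As a rough sketch, each thread’s local buffer behaves like a max-heap keyed on priority. The types below are illustrative, not the actual Banking Stage structures:

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Illustrative sketch of a Banking Stage thread's local buffer -- not the actual Agave types.
#[derive(Eq, PartialEq)]
struct BufferedTransaction {
    priority: u64,
    signature: [u8; 64],
}

// Order by priority so the BinaryHeap pops the highest-priority transaction first.
impl Ord for BufferedTransaction {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority.cmp(&other.priority)
    }
}

impl PartialOrd for BufferedTransaction {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    let mut buffer: BinaryHeap<BufferedTransaction> = BinaryHeap::new();
    // Packets pulled from the shared SigVerify channel are inserted as they arrive...
    buffer.push(BufferedTransaction { priority: 10, signature: [0; 64] });
    buffer.push(BufferedTransaction { priority: 500, signature: [1; 64] });
    buffer.push(BufferedTransaction { priority: 42, signature: [2; 64] });
    // ...and popped in descending priority order when the thread builds a batch.
    while let Some(tx) = buffer.pop() {
        println!("processing transaction with priority {}", tx.priority);
    }
}
```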

This process happens continuously, and what happens to these packets of transactions depends on the validator’s position in the leader schedule. If the validator is not scheduled to be the leader in the near future, it will forward packets to the upcoming leader and drop them. As the validator gets closer to its scheduled leadership slot (~20 slots away), it will continue forwarding packets but will no longer drop them. This ensures these packets can be included in one of its own blocks if the other leaders don’t process them. When a validator is 2 slots away from becoming the leader, it starts holding packets — accepting them and doing nothing so they can be processed once the validator becomes leader.
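A simplified sketch of that decision logic, using the ~20-slot and 2-slot thresholds described above (the names and exact cut-offs are illustrative):

```rust
// Illustrative sketch of how a validator might treat incoming packets based on how
// far away its next leader slot is -- not the actual Agave implementation.
#[derive(Debug, PartialEq)]
enum PacketDecision {
    ForwardAndDrop, // leadership is far off: forward to the upcoming leader, then drop
    ForwardAndHold, // leadership is near (~20 slots): forward but keep a copy
    Hold,           // leadership is imminent (<= 2 slots): just hold for our own block
}

fn decide(current_slot: u64, next_leader_slot: u64) -> PacketDecision {
    let slots_until_leader = next_leader_slot.saturating_sub(current_slot);
    if slots_until_leader <= 2 {
        PacketDecision::Hold
    } else if slots_until_leader <= 20 {
        PacketDecision::ForwardAndHold
    } else {
        PacketDecision::ForwardAndDrop
    }
}

fn main() {
    assert_eq!(decide(100, 200), PacketDecision::ForwardAndDrop);
    assert_eq!(decide(100, 115), PacketDecision::ForwardAndHold);
    assert_eq!(decide(100, 101), PacketDecision::Hold);
    println!("packet handling matches the leader-schedule thresholds described above");
}
```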

During block production, each thread takes the top 128 transactions from its local queue, attempts to grab locks, and then checks, loads, executes, records, and commits the transactions. If the lock grab fails, the transaction is retried later. Let’s expand upon each step (a simplified sketch of this loop follows the list):

  • Lock: This step checks which transactions the thread can grab locks for. Each of these transactions will read and write some number of accounts, so the validator needs to make sure there aren’t any conflicts
  • Checks: This step checks whether the transaction is too old or has already been processed. Note the banks have a status cache that keeps track of the last 150-300 slots’ transactions
  • Loads: This step loads the accounts necessary for executing a given transaction. This step also checks whether the fee payer can actually afford to pay the fees, and whether the program invoked is a valid program. Basically, this step loads the accounts and performs some initial setup
  • Execute: This step executes each transaction
  • Record: The results of the executed transactions are sent to the Proof of History Service to be hashed. This is where the transaction signature is sent out
  • Commit: If the record step succeeds, the transactions are committed. This step also propagates the changes back into the account system so future transactions in this, or subsequent, slots will have the updated view of each account
  • Unlock: The locks for each account placed in the first step are lifted
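Putting these steps together, a heavily simplified sketch of the per-batch loop might look like the following. Every type and helper here is a hypothetical placeholder rather than the actual Agave code:

```rust
// A heavily simplified sketch of the per-batch processing loop described above.
// All types and helpers are hypothetical placeholders, not the actual Agave code.
struct Transaction;
struct ExecutionResult;

fn try_lock_accounts(_tx: &Transaction) -> bool { true }
fn check_age_and_status(_tx: &Transaction) -> bool { true }
fn load_accounts(_tx: &Transaction) -> bool { true }
fn execute(_tx: &Transaction) -> ExecutionResult { ExecutionResult }
fn record_to_poh(_results: &[ExecutionResult]) -> bool { true }
fn commit(_results: &[ExecutionResult]) {}
fn unlock_accounts(_tx: &Transaction) {}

fn process_batch(batch: Vec<Transaction>, retry_queue: &mut Vec<Transaction>) {
    let mut results = Vec::new();
    let mut locked = Vec::new();

    for tx in batch {
        // Lock: skip (and retry later) any transaction whose accounts conflict.
        if !try_lock_accounts(&tx) {
            retry_queue.push(tx);
            continue;
        }
        // Checks + Loads: drop transactions that are too old, already processed,
        // or whose accounts or fee payer cannot be loaded.
        if check_age_and_status(&tx) && load_accounts(&tx) {
            // Execute the transaction and keep its result for recording.
            results.push(execute(&tx));
        }
        locked.push(tx);
    }

    // Record: send results to the Proof of History service; Commit only on success.
    if record_to_poh(&results) {
        commit(&results);
    }

    // Unlock: release the account locks taken at the start.
    for tx in &locked {
        unlock_accounts(tx);
    }
}

fn main() {
    let mut retry_queue = Vec::new();
    process_batch(vec![Transaction, Transaction], &mut retry_queue);
    println!("{} transaction(s) queued for retry", retry_queue.len());
}
```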

The Banking Stage uses a multi-iterator approach to create these batches of transactions. A multi-iterator is a programming pattern that allows simultaneous traversal over a dataset in multiple sequences. Think of it as having several readers going through a single book, each starting at different chapters, coordinating to ensure they don’t read the same page at the same time if their understanding of the content might interfere with one another. In the Banking Stage, these “readers” are iterators, and the “book” is the collection of transactions waiting to be processed. The goal of the multi-iterator is to efficiently sift through transactions, grouping them into batches that can be processed without any lock conflicts.

Initially, the transactions are serialized into a vector based on priority. This gives the multi-iterator a structured sequence to segment these transactions into non-conflicting batches. The multi-iterator begins at the start of the serialized vector, placing iterators at junctures where transactions don’t conflict with one another. In doing so, it creates batches of 128 transactions without any read-write or write-write conflicts. If a transaction conflicts with the currently forming batch, it’s skipped and left unmarked, allowing it to be included in a subsequent batch where the conflict no longer exists. This iterative process adjusts dynamically as transactions continue to be processed. 

After successfully forming a batch, the transactions are executed, and if successful, they are recorded in the Proof of History Service and broadcast to the network.
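As a rough illustration of how a non-conflicting batch could be formed from a priority-sorted vector, the sketch below greedily builds one batch and defers anything that conflicts. It is a simplification of the multi-iterator idea (it only tracks write locks, and the types are illustrative), not the actual implementation:

```rust
use std::collections::HashSet;

// Illustrative sketch of forming a non-conflicting batch from a priority-sorted
// vector of transactions -- a simplification of the multi-iterator, not the real code.
#[derive(Clone)]
struct Tx {
    id: &'static str,
    writable_accounts: Vec<&'static str>,
}

// Walk the priority-sorted transactions, adding each one to the batch unless it
// write-locks an account already locked by the batch; conflicting transactions
// are left for a later batch.
fn form_batch(sorted: &[Tx], max_batch_size: usize) -> (Vec<Tx>, Vec<Tx>) {
    let mut locked: HashSet<&str> = HashSet::new();
    let mut batch = Vec::new();
    let mut deferred = Vec::new();

    for tx in sorted {
        let conflicts = tx.writable_accounts.iter().any(|a| locked.contains(a));
        if conflicts || batch.len() >= max_batch_size {
            deferred.push(tx.clone());
        } else {
            locked.extend(tx.writable_accounts.iter().copied());
            batch.push(tx.clone());
        }
    }
    (batch, deferred)
}

fn main() {
    let sorted = vec![
        Tx { id: "high", writable_accounts: vec!["alice"] },
        Tx { id: "mid", writable_accounts: vec!["alice", "bob"] }, // conflicts with "high"
        Tx { id: "low", writable_accounts: vec!["carol"] },
    ];
    let (batch, deferred) = form_batch(&sorted, 128);
    println!("batch: {:?}", batch.iter().map(|t| t.id).collect::<Vec<_>>());
    println!("deferred: {:?}", deferred.iter().map(|t| t.id).collect::<Vec<_>>());
}
```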

The Problems with the Current Implementation

The current implementation has several areas where performance can be adversely affected, leading to potential bottlenecks in transaction processing and inconsistent prioritization. These challenges primarily stem from the architecture of the Banking Stage and the nature of transaction handling within the system.

A fundamental issue is that the four independent threads processing non-vote transactions each have their own view of transaction priority. This discrepancy can cause jitter or inconsistency in transaction ordering, and it becomes more pronounced when all high-priority transactions conflict. Since each thread pulls packets essentially at random from the shared channel fed by SigVerify, each thread ends up with a random subset of all transactions. During competitive events, such as a popular NFT mint, many high-priority transactions are likely to land in multiple Banking Stage threads. This is problematic because it can cause inter-thread locking conflicts: the threads, working with different sets of priorities, may race against each other to process these high-priority transactions, inadvertently wasting processing time on unsuccessful lock attempts.

Think of the Banking Stage as an orchestra where each thread is a different section — strings, brass, woodwinds, and percussion. Ideally, a conductor would coordinate these sections to ensure a harmonious performance. However, the current system resembles an orchestra trying to perform a complex piece without a conductor. Each section plays its own tune, regularly clashing with one another. High-priority transactions are the solo parts all sections attempt to play simultaneously, causing confusion. This lack of coordination highlights the need for a centralized “conductor” to ensure efficiency and harmony in Solana’s transaction processing, much like a conductor leading an orchestra.

The New Transaction Scheduler

Note the vote threads pull from the channel and votes therefore do not go through the central scheduler

The 1.18 update introduces a central scheduling thread, which replaces the previous model of having four independent banking threads, each managing its own transaction prioritization and processing. In this revised structure, the central scheduler is the sole recipient of transactions from the SigVerify stage. It builds a priority queue and deploys a dependency graph to manage transaction prioritization and processing.

This is a transaction dependency graph. The arrows mean ‘is depended on.’ For example, Transaction A is depended on by Transaction B, and Transaction B is depended on by both Transaction C and Transaction D

This dependency graph is known as a prio-graph. It is a directed acyclic graph that is lazily evaluated as new transactions are added. Transactions are inserted into the graph to create chains of execution and are then popped in time-priority order. When dealing with conflicting transactions, the first to be inserted will always have higher priority. In the example above, we have transactions A through H. Note that transactions A and E have the highest priority within their respective chains and do not conflict. The scheduler moves from left to right, processing the transactions in batches:

Transactions A and E are processed as the first batch; then B and F; then C, D, and G; and finally H as the last batch. As you can see, the highest-priority transactions are at the top of the graph (i.e., to the far left). As the scheduler examines transactions in descending order, it identifies conflicts. If a transaction conflicts with a higher-priority one, an edge is created in the graph to represent this dependency (e.g., C and D conflict with B).
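The batching behavior in this example can be reproduced with a toy scheduler. The sketch below is only an approximation of the idea, not the actual prio-graph crate: each transaction lists the accounts it write-locks, a transaction gains a dependency on the most recent earlier transaction touching the same account, and batches are popped layer by layer:

```rust
use std::collections::{HashMap, HashSet};

// A toy reconstruction of the batching behavior in this example -- not the actual
// prio-graph crate.
fn schedule(txs: &[(&'static str, Vec<&'static str>)]) -> Vec<Vec<&'static str>> {
    let mut deps: HashMap<&str, HashSet<&str>> = HashMap::new(); // tx -> txs it depends on
    let mut last_user: HashMap<&str, &str> = HashMap::new();     // account -> latest tx

    for (id, accounts) in txs.iter().cloned() {
        deps.entry(id).or_default();
        for account in accounts {
            // A conflict with an earlier (higher-priority) transaction becomes an edge.
            if let Some(&prev) = last_user.get(account) {
                if prev != id {
                    deps.get_mut(id).unwrap().insert(prev);
                }
            }
            last_user.insert(account, id);
        }
    }

    // Pop transactions in layers: every transaction whose dependencies have all been
    // scheduled joins the next batch.
    let mut batches = Vec::new();
    let mut scheduled: HashSet<&str> = HashSet::new();
    while scheduled.len() < txs.len() {
        let batch: Vec<&'static str> = txs
            .iter()
            .map(|(id, _)| *id)
            .filter(|id| !scheduled.contains(id) && deps[id].iter().all(|d| scheduled.contains(d)))
            .collect();
        scheduled.extend(batch.iter().copied());
        batches.push(batch);
    }
    batches
}

fn main() {
    // Two non-conflicting chains: A <- B <- {C, D} and E <- F <- G <- H.
    let txs = vec![
        ("A", vec!["x"]),
        ("B", vec!["x", "y", "z"]),
        ("C", vec!["y"]),
        ("D", vec!["z"]),
        ("E", vec!["m"]),
        ("F", vec!["m", "n"]),
        ("G", vec!["n", "o"]),
        ("H", vec!["o"]),
    ];
    // Prints [["A", "E"], ["B", "F"], ["C", "D", "G"], ["H"]]
    println!("{:?}", schedule(&txs));
}
```

The layer-by-layer pop is what lets non-conflicting chains (here A through D and E through H) interleave into the same batches without any lock contention.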

The new scheduler model addresses several key issues inherent to the multi-iterator approach:

  • Consistency in Priority Handling: The new system ensures that all transactions are processed in a consistent priority order by centralizing the transaction intake and scheduling. This eliminates the jitter previously caused by multiple threads having different views of transaction priorities
  • Reduction in Processing Delays: The prio-graph ensures that batches prepared for execution are highly likely to succeed without lock conflicts, streamlining processing and avoiding delays caused by lock contention. Note the use of the phrase “highly likely to succeed” — it isn’t strictly true that the prio-graph produces batches that can’t fail locks, as they could conflict with voting threads, although this is a very rare edge case
  • Scalability and Flexibility: This new scheduler design allows the number of threads to be increased without the previous concerns of increased lock conflicts. This is thanks to the centralized view of locks and the more controlled transaction distribution among workers 

The introduction of the central scheduler in 1.18 is expected to improve transaction handling significantly, reducing the complexity and overhead associated with the previous system. This will likely lead to faster transaction processing times, increased throughput, and a more stable network. Due to the delays in releasing 1.18, the scheduler has improved since its inception. For example, precompile verification for transactions has been moved to worker threads to improve efficiency. Additionally, CU limits are now more reasonable with much lower estimated/actual ratios than the old scheduler. The new scheduler can now use CUs to throttle the scheduled work queues, preventing excessive work from being queued up due to account conflicts.

Note the central scheduler is not enabled by default and must be enabled using the new --block-production-method central-scheduler flag when starting a validator. It is currently opt-in only but will become the default scheduler in future releases. The old scheduler corresponds to the --block-production-method thread-local-multi-iterator flag, which is the current default. Once the central scheduler becomes the default, there is little reason to opt back into the old one: the central scheduler is much more efficient and addresses the issues described above.

More Effective Priority Calculation 

1.18 also refines how transaction priority is determined, making the process more equitable and efficient regarding resource usage and cost recovery. Previously, transaction prioritization was primarily based on compute budget priority, sometimes leading to suboptimal compute unit pricing. This was because the prioritization did not adequately consider the base fees collected, leading to situations where resources could be underpriced and affect the network’s operational efficiency.

The new approach adjusts the transaction priority calculation to consider the transaction fees and the associated costs using the formula Priority = Fees / (Cost + 1). Here, the fees represent the transaction fees associated with a given transaction, while the cost represents the compute and resource consumption determined by Solana’s cost model. Adding “1” in the denominator is a safety measure to prevent division by zero. 

We can break down the formula further to make Fees and Cost more explicit:

\(\text{Priority} = \frac{\text{Priority Fee} × \text{Compute Units Requested} + \text{Base Fee}}{1 + \text{Requested Execution CUs} + \text{Signature CUs} + \text{Write Lock CUs}}\)

The cost of a transaction is now calculated comprehensively, considering all associated compute and operational costs. This ensures that priority calculations reflect the true resource consumption of a transaction. This means developers and users will receive higher priority if they request fewer compute units. This also means that simple transfers, without any priority fees, will have some priority in the queue.
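As a worked illustration of the formula, the sketch below compares two hypothetical transactions that pay the same total priority fee but request different compute unit limits. The struct, field names, and numbers are all illustrative rather than the actual Agave implementation:

```rust
// Illustrative sketch of the 1.18 priority formula described above -- not the actual
// Agave implementation, and all numbers are hypothetical example values.
struct TxCosts {
    priority_fee_per_cu: u64,     // per-CU priority fee (illustrative units)
    compute_units_requested: u64, // CU limit requested via the compute budget program
    base_fee: u64,                // base fee in lamports
    signature_cus: u64,           // cost attributed to signature verification
    write_lock_cus: u64,          // cost attributed to write locks
}

// Priority = Fees / (Cost + 1), with Fees and Cost expanded as in the formula above.
fn priority(tx: &TxCosts) -> u64 {
    let fees = tx.priority_fee_per_cu * tx.compute_units_requested + tx.base_fee;
    let cost = tx.compute_units_requested + tx.signature_cus + tx.write_lock_cus;
    fees / (cost + 1)
}

fn main() {
    // Two transactions paying the same total priority fee (100,000,000 units)...
    let lean = TxCosts {
        priority_fee_per_cu: 2_000,
        compute_units_requested: 50_000,
        base_fee: 5_000,
        signature_cus: 720,
        write_lock_cus: 300,
    };
    let greedy = TxCosts {
        priority_fee_per_cu: 100,
        compute_units_requested: 1_000_000,
        ..lean
    };
    // ...but the one requesting fewer compute units gets a much higher priority.
    println!("lean tx priority:   {}", priority(&lean));   // ~1960
    println!("greedy tx priority: {}", priority(&greedy)); // ~99
}
```

The transaction that requests only what it needs ends up with a far higher priority, which is exactly the incentive the new formula is designed to create.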

Improved Program Deployment

1.18 also significantly improves program deployments with respect to deployment reliability and execution efficiency.

The new update addresses an issue where programs deployed in the last slot of an epoch did not correctly apply the runtime environment changes planned for the subsequent epoch. Thus, a program deployed during this transition period would erroneously use the old runtime environment. 1.18 adjusts the deployment process to ensure that the runtime environment for any program deployed at the end of an epoch is aligned with the environment of the upcoming epoch.

1.18 also addresses the inability to set a compute unit price or limit on deploy transactions by adding the --with-compute-unit-price flag to the CLI program deploy commands. This flag can be used with the solana program deploy and solana program write-buffer commands. The compute unit limit is determined by simulating each type of deploy transaction and using the number of compute units consumed.

Another important improvement involves how blockhashes for large program deployments are handled. Before 1.18, transactions sent with sign_all_messages_and_send were throttled to 100 TPS. For larger programs, the number of deploy transactions can run into the thousands, meaning many of them are delayed for more than 10 seconds and risk using expired blockhashes. 1.18 delays signing deploy transactions with a recent blockhash until after the throttling delay. Blockhashes now refresh every 5 seconds, so deployments with over 500 transactions benefit from using a more recent blockhash.
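The general idea can be sketched as follows; the helper functions and the 100 TPS throttle below are placeholders standing in for the actual CLI logic:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the idea: sign each deploy transaction just before sending,
// with a blockhash refreshed every ~5 seconds, instead of signing the whole batch up
// front. The helpers below are placeholders, not the actual Solana CLI/SDK functions.
struct Blockhash(u64);
struct DeployMessage;

fn fetch_latest_blockhash() -> Blockhash { Blockhash(0) }
fn sign_and_send(_msg: &DeployMessage, _recent: &Blockhash) {}

fn send_deploy_transactions(messages: &[DeployMessage], tps_limit: u32) {
    let refresh_interval = Duration::from_secs(5);
    let mut blockhash = fetch_latest_blockhash();
    let mut last_refresh = Instant::now();

    for msg in messages {
        // Refresh the blockhash periodically so throttled transactions never sign
        // against a hash that is about to expire.
        if last_refresh.elapsed() >= refresh_interval {
            blockhash = fetch_latest_blockhash();
            last_refresh = Instant::now();
        }
        // Sign only at send time, after any throttling delay has already elapsed.
        sign_and_send(msg, &blockhash);
        // Naive throttle to stay under the TPS limit.
        std::thread::sleep(Duration::from_micros(1_000_000 / tps_limit as u64));
    }
}

fn main() {
    let messages = vec![DeployMessage, DeployMessage, DeployMessage];
    send_deploy_transactions(&messages, 100);
}
```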

Additionally, 1.18 introduces improvements to how the network handles program deployments and verifies transactions. Previously, some programs were incorrectly marked as FailedVerification due to errors in identifying account statuses. This could mislabel programs that hadn’t actually failed any checks. These programs are now correctly identified as Closed if they’re not supposed to be active. This change ensures that only problematic programs are flagged for rechecking and helps prevent unnecessary re-verifications.

The process for updating program states has also been refined. Programs can now transition from a Closed state to an active state within the same time slot they are deployed. This means that programs become operational faster and more reliably, which is crucial during times of high demand. However, it's important to note this improvement is still subject to the one-slot un-/re-/deployment-cooldown and one-slot visibility delay. As a result, while these adjustments help manage network load more effectively and prevent certain types of congestion, they do not significantly change the workflow for dApp developers.

“The Congestion Patch” — Handling Congestion Better

Testnet version 1.18.11, heralded as “The Congestion Patch,” proposed changes to address Solana’s recent congestion. Note that this release isn’t specific to 1.18, and it has been backported to 1.17.31. Regardless, it’s crucial that we talk about it.

The big change is that QUIC now treats peers with very little stake as unstaked peers in Stake-Weighted Quality of Service (SWQoS). This addresses the fact that staked nodes with a very small amount of stake could abuse the system to get disproportionate bandwidth. In addition, the existing metrics could not show what proportion of packets was passed through or throttled for staked versus non-staked nodes, so new metrics were added for greater visibility. Packet chunk handling was also optimized by replacing instances of Vec with SmallVec, saving an allocation per packet. This is possible because streams are packet-sized, so only a few chunks are expected.
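The stake-based classification can be sketched as follows. The threshold, types, and numbers are purely illustrative; the actual cutoff is determined by the client’s SWQoS logic:

```rust
// Purely illustrative sketch of treating very low-staked peers as unstaked for
// stake-weighted QoS. The threshold and types are hypothetical, not the Agave values.
#[derive(Debug, PartialEq)]
enum PeerClass {
    Staked { stake: u64 },
    Unstaked,
}

fn classify_peer(stake: u64, total_stake: u64, min_stake_fraction: f64) -> PeerClass {
    // Peers whose stake is a negligible fraction of total stake no longer get the
    // bandwidth advantages reserved for staked connections.
    if (stake as f64) / (total_stake as f64) < min_stake_fraction {
        PeerClass::Unstaked
    } else {
        PeerClass::Staked { stake }
    }
}

fn main() {
    let total_stake = 400_000_000;
    // A peer with a meaningful amount of stake keeps its stake-weighted bandwidth...
    assert_eq!(
        classify_peer(4_000_000, total_stake, 1e-5),
        PeerClass::Staked { stake: 4_000_000 }
    );
    // ...while a dust-staked peer is treated like any unstaked connection.
    assert_eq!(classify_peer(10, total_stake, 1e-5), PeerClass::Unstaked);
    println!("SWQoS classification behaves as described");
}
```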

Previously, the Banking Stage forwarded all packets to the next node. 1.18 changes this so that only packets from staked nodes are forwarded. This update makes staked connections more important than ever, as they carry more weight in calculating priority and forwarding transactions.

Improved Documentation

The 1.18 update also significantly improves the translation support for the official Solana documentation to ensure greater accessibility for a global audience. Updates include upgrading the Crowdin CLI and configuration (which streamlines the synchronization of documents across languages) and introducing a new serve command for better local testing via Docusaurus. The documentation also improves how static content is handled by linking PDF files directly to GitHub blobs to avoid issues with relative paths in translated builds.

For developers, the process of contributing to translations is clarified with an updated README on handling common issues such as necessary environment variables and typical build errors. This is complemented by improvements in the continuous integration flow, which now includes translations only in stable channel builds. This ensures that only vetted and stable documentation reaches end-users. These changes aim to simplify contributions, enhance the official documentation's quality, and give all users access to reliable and accurate information.

Conclusion

Driven by Anza, the 1.18 update substantially improves transaction handling, priority calculations, program deployments, official documentation, and overall network performance. With the introduction of a central scheduler and the various fixes aimed at addressing recent congestion, Solana is better equipped to handle peak loads and ensure efficient and reliable network behavior. Solana remains the best chance at a truly scalable blockchain, and this update affirms that potential.

If you’ve read this far, thank you, anon! Be sure to enter your email address below so you’ll never miss an update about what’s new on Solana. Ready to dive deeper? Explore the latest articles on the Helius blog and continue your Solana journey today.

Additional Resources