Blockchain Bloat: How Ethereum Is Tackling Storage Issues

18 January 2018

24,270 tokens. 27,358 pending transactions. 463,713 digital kittens.

Ethereum has hosted a lot of activity recently, and while many crypto enthusiasts see that as a positive sign, as the network’s usage soars, its history gets longer and its blockchain more unruly.

And although network congestion leading to transaction backlogs and rising fees has taken the spotlight, there’s another issue this scale causes – a growing database that puts significant storage costs on users wanting to run a full node.

That database, called the ethereum state, holds all the computational results that need to be remembered by the computers supporting the platform, alongside the ethereum blockchain itself. And with the costs (both in time and money) of storing the state increasing, fewer and fewer people are choosing to run full nodes, which many worry will centralize the network into the hands of only a few arbitrators.
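To make the idea concrete, the state can be pictured as a key-value database mapping every account address to its current record. The Go sketch below is purely illustrative: the field names loosely mirror the account structure described in ethereum's specification, while the real clients store this data in a Merkle Patricia trie on disk rather than a plain in-memory map.

```go
package main

import (
	"fmt"
	"math/big"
)

// Account captures the per-address data a full node must keep available.
// Illustrative model only, not geth's or Parity's on-disk layout.
type Account struct {
	Nonce       uint64   // transactions sent from this account so far
	Balance     *big.Int // balance in wei
	StorageRoot [32]byte // root hash of the contract's storage trie
	CodeHash    [32]byte // hash of the contract code (empty for plain accounts)
}

// State is the conceptual view of the ethereum state: address -> account.
type State map[[20]byte]Account

func main() {
	state := make(State)

	var addr [20]byte
	copy(addr[:], "hypothetical-address") // hypothetical 20-byte address

	state[addr] = Account{Nonce: 1, Balance: big.NewInt(1000000000000000000)} // 1 ether in wei
	fmt.Printf("accounts tracked by this toy state: %d\n", len(state))
}
```

The bigger that mapping grows, the more disk space and lookup time a full node needs, which is exactly the cost the client teams are trying to contain.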

And developers recognize the problem.

For one thing, ethereum developers are already well underway on protocol-level changes, such as sharding, aimed at shrinking the database.

But since these technologies are still in development, other stakeholders, namely those running ethereum clients – the software needed for users to communicate with the blockchain – have been under fresh pressure to cope with the growth of the state database.

“The fact that improving this stuff is critical has been known since late 2016, the ideas have been floating around for half a year to over a year. Where are the implementations?” said ethereum creator Vitalik Buterin on a developer channel recently.

The frustration is palpable with both Buterin and Afri Schoedon, who manages technical communications at ethereum software client provider Parity. Schoedon told CoinDesk:

“At the current growth rate it’s predictable that the state is going to grow very fast this year, to a point where it would be hardly manageable on small devices.”

In an effort to limit the effects of the unwieldy state, then, the two most popular ethereum clients – Geth and Parity – have recently released updates that attempt to improve the situation.

Turbocharged

The first update, released last week by Parity, reduced storage requirements by eliminating unnecessary temporary files produced as the software processes ethereum’s history.

With storage requirements vastly reduced, users spinning up full nodes also see faster synchronization times. And with that, the company said its ethereum software can now be run on a hard drive instead of a solid-state drive (SSD), a notable feat since long sync times had made running ethereum on a hard drive impractical since last summer.

The update even got an excited response from Buterin, who said on a developer channel,  “Wow. How did you guys accomplish that?”

As a result of the update, users have been reporting a vastly improved experience.

At the same time, independent developer Alexey Akhunov has been working on a rewrite of the geth client, called “turbo geth.” Described by Akhunov as an “obsession,” the project aims to remove a lot of unnecessary repetition in how ethereum’s clients process the overall state.

While it’s nowhere near ready, it has opened up some interesting avenues of “speculative optimization,” Akhunov said in a recent developer chat.

For example, Akhunov suggests “hard coding” certain information about the ethereum state into the clients themselves. Ultimately, the goal is to adapt the software to simply run using random access memory, or RAM, which could make the clients much faster – allowing them to potentially synchronize with the network instantly.
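In rough terms, that would mean serving state lookups from memory and touching the disk only on a miss. The sketch below shows the general shape of such a cache; the RAMState and Backend names are hypothetical stand-ins for illustration, not turbo geth’s actual design.

```go
package ramstate

// Backend is whatever slower store (a disk database) ultimately holds the
// state. Both this interface and RAMState are hypothetical illustration names.
type Backend interface {
	Get(key string) ([]byte, bool)
}

// RAMState keeps state entries in memory so lookups avoid the disk.
type RAMState struct {
	mem  map[string][]byte
	disk Backend
}

// NewRAMState preloads entries into memory, e.g. at client startup.
func NewRAMState(disk Backend, preload map[string][]byte) *RAMState {
	return &RAMState{mem: preload, disk: disk}
}

// Get serves reads from memory first, falling back to disk only on a miss.
func (s *RAMState) Get(key string) ([]byte, bool) {
	if v, ok := s.mem[key]; ok {
		return v, true
	}
	return s.disk.Get(key)
}
```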

The developers of Geth itself are also working on optimizations, one of which aims to correct a quirk in how information is stored when a client syncs with the network in what is called “fast” mode. Described by Geth core developer Péter Szilágyi as “really horrible,” the existing code is likely to be replaced as part of a batch of updates that make synchronization much faster and less storage-intensive.

The limits

There’s also research being done into a client type called “stateless clients,” which store only a compressed representation of the overall state.

Even Buterin is interested in the idea, recently undertaking a study that describes a scenario where “miners and full nodes in general no longer need to store any state.” Plus, Buterin said later in a developer channel, stateless clients would alleviate the need to clean up the state through other measures, such as pruning old, irrelevant data like empty or long-inactive accounts.

“I’m now in favor of the stateless client approach,” Buterin wrote.

And there is even speculation that stateless clients might be possible without making protocol-level changes.

Touting such clients as a possible solution to the scaling hurdles faced by ethereum following the success of CryptoKitties, Akhunov wrote in a recent blog post: “I believe (stateless clients) can be implemented already now, without any hard fork, ‘simply’ by changing the ethereum clients … This means that nodes do not need to access storage from files and block validation times should drop significantly.”
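In outline, a stateless client validates a block against a “witness” shipped alongside it: the handful of accounts the block touches, plus proofs tying them to the prior state root. The Go sketch below is a simplified illustration of that flow under those assumptions; the Witness format and verifyProof stub are hypothetical, not the design proposed by Buterin or Akhunov.

```go
package stateless

import (
	"bytes"
	"errors"
)

// Witness carries just the state fragments a block needs, plus proofs tying
// them to the pre-state root. Simplified, hypothetical format.
type Witness struct {
	PreStateRoot []byte            // state root the block builds on
	Accounts     map[string][]byte // touched address -> encoded account record
	Proofs       map[string][]byte // touched address -> proof against PreStateRoot
}

// verifyProof stands in for a real Merkle (Patricia trie) proof check, which
// would re-hash the proof nodes up to the claimed root.
func verifyProof(root, key, value, proof []byte) bool {
	return len(proof) > 0 // placeholder only
}

// ValidateStateless checks that every account a block touches is backed by a
// valid proof, so the node never needs its own copy of the full state.
func ValidateStateless(blockStateRoot []byte, w Witness, touched []string) error {
	if !bytes.Equal(blockStateRoot, w.PreStateRoot) {
		return errors.New("witness built against a different state root")
	}
	for _, addr := range touched {
		acct, ok := w.Accounts[addr]
		if !ok {
			return errors.New("witness missing account: " + addr)
		}
		if !verifyProof(w.PreStateRoot, []byte(addr), acct, w.Proofs[addr]) {
			return errors.New("invalid proof for account: " + addr)
		}
	}
	return nil
}
```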

However, client optimizations can’t be the only thing the network relies on to decrease state concerns.

According to Szilágyi, client optimizations will eventually reach their limit. At that point, developers will have to turn their attention to in-progress technologies such as sharding, which splits the ethereum database into smaller pieces stored at different nodes, relieving individual clients of the pressure of storing the full database.
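As a toy illustration of that idea, imagine splitting the account space across a fixed number of shards so that each node keeps only its slice. The assignment rule below, the first byte of an address modulo a hypothetical shard count, is chosen only to convey the concept and is not the scheme in the ethereum sharding specification.

```go
package sharding

// NumShards is a hypothetical shard count chosen only for this illustration.
const NumShards = 100

// ShardFor maps a 20-byte account address to the shard responsible for it.
// A node would then persist only the accounts belonging to its assigned shards.
func ShardFor(address [20]byte) int {
	return int(address[0]) % NumShards
}
```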

Perhaps in response to the recent strains on the network, sharding development has advanced in recent months, with an early stage specification sketched out on Github.

“We can optimize the database and make it ten times faster and more optimal, which gives us room to grow to ten times our current size,” Szilágyi said, adding:

“But eventually, we will get to the point where we won’t be able to do database optimizations anymore, and by that time we need to be able to shard our data.”

Hard drive image via Shutterstock