The Best Things in Life Don’t Scale

CUUUPID
12 min read · May 15, 2022

The biggest buzzwords in tech will fail because they just don’t scale.

As we rapidly innovate with new technology, we have to keep in mind whether or not our innovations are practical and feasible. In a world where we count users in thousands and runtime in milliseconds, it’s more important than ever to ensure new tech scales.

That’s why it’s shocking that the three biggest buzzwords in tech completely fail to scale.

Of course, I’m talking about the three categories that seemingly every bit of tech news falls into nowadays:

  1. Artificial Intelligence
  2. Blockchain
  3. The Metaverse (specifically XR, AR/VR)

It seems that nearly every product is either using or adding these, and yet it’s insane how painful (and in some cases flatly impossible) it is to scale these services.

We need more hardware

The first and foremost shortcoming remains hardware. It’s no secret that the hardware requirements for these technologies climb year after year, in both price and specification. This is becoming a blocking factor in widespread adoption, and in the case of blockchain it’s a killer.

Let’s start with blockchain.

Blockchain usually relies on “miners” that “mine” blocks and allow the network to execute the most basic of tasks (i.e. verifying transactions). The process of “mining” is something I won’t go into deeply, but it involves repeatedly executing cryptographic hash functions, a process known as “hashing.”

The problem with hashing is that it requires a significant amount of computing power; in fact, such a large amount that efficiently mining on the blockchain requires special hardware, either a GPU or an ASIC. To be even the smallest miner, you’ll need a high-spec GPU (let’s be real, Intel HD Graphics is consumer standard right now so basically every GPU above that is high-spec) or a dedicated ASIC. This racks up a pretty large bill, with most devices priced at hundreds to thousands of dollars per unit, not counting the high electricity usage that becomes an ongoing operating cost.
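To make “hashing” concrete, here’s a toy proof-of-work loop in Python. The block contents and difficulty are made up for illustration; real miners run this loop trillions of times per second on dedicated silicon, which is exactly why the hardware bill gets so large.

```python
import hashlib

def mine(block_data: str, difficulty: int):
    """Toy proof-of-work: find a nonce such that SHA-256(data + nonce)
    starts with `difficulty` hex zeros. Each extra zero multiplies the
    expected number of hashes by 16."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest
        nonce += 1

# ~65k hashes on average at difficulty 4; real networks demand astronomically more.
nonce, digest = mine("toy block: Alice pays Bob 1 coin", difficulty=4)
print(nonce, digest)
```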

the great GPU shortage of 2018–2022

This has turned out to be a major blocking factor in blockchain’s adoption. For this very reason, GPU prices skyrocketed and stock all but disappeared, leaving prospective miners without any tools, hammering the PC gaming market and cutting researchers off from high-end hardware. Even GPU manufacturers like Nvidia, which profited from the rise, ultimately acknowledged the intense shortage as a direct consequence of blockchain mining.

A year later, most of these blockchain technologies have vanished and the prices of major cryptocurrencies have crashed to a quarter of their peak valuation. The most widely stated concern with blockchain is its failure to scale, and hardware remains (and will continue to be) a blocking factor in the scaling of blockchain technology.

This issue isn’t limited to blockchain, however. Artificial Intelligence relies on the same hardware (GPUs) and this is becoming a blocking factor in that field as well; most upcoming startups can’t afford to build an enormous GPU farm like Google’s, and buying computing power from cloud providers like AWS comes at roughly a 4x markup that racks up an enormous server bill.

And what about the Metaverse?

With XR, the issue is even more exacerbated. In order to run these experiences, the consumer’s device has to both render the visuals and process sensory input. This puts the burden mostly on the consumer, which sounds great for businesses looking to get into the XR space but causes a major issue with scaling.

For example, let’s take Oculus VR. Oculus was one of the first VR headsets to launch and remains a popular name in the field, having since been acquired by Facebook, which solidified its place in the market.

the ridiculously high “minimum spec” for Oculus; USB 3.0, a good Nvidia GPU, 8GB of RAM… this is from their first model, in 2018 — in 2022 the minimum spec is exponentially larger and you’re going to need a dedicated rig

Unfortunately, Oculus operates almost exclusively on desktop PCs, and demands high GPU specs on top of that. It faces the same pitfalls as blockchain and AI, but with much larger concerns: whereas with AI a business could just throw money at its infrastructure to temporarily resolve the issue, with XR the burden falls on consumers. The adoption cycle of AR/VR relies on consumer GPUs becoming more widespread and higher spec, and on prices falling to levels the average user can afford. It’s a process that could take decades: while Nvidia CEO Jensen Huang proclaims that Moore’s law favors GPUs, the development needed to raise specs while lowering prices still amounts to several years. Case in point: it’s been 4 years and GPU availability has barely improved.

Let’s go back to AI. My first draft of this article was written in 2018, 4 years ago today. Since then, AI has gotten aggressively compute-hungry. Nowadays all the buzz is about large language models that span hundreds of millions or, in 2022, hundreds of billions of parameters. A “parameter” is usually a numerical weight in the neural network; it’s common for these to be stored in FP16 precision, i.e. a floating point number that’s 2 bytes long.

Let’s say you have a 1 billion parameter language model (that’s small btw — GPT-3’s popular form is 175x that size, with T5, PaLM, etc far exceeding that). Let’s say that you’re using FP16. Then, the memory required becomes:

1 billion * 2 bytes = 2 billion bytes = 2 Gigabytes

While you can optimize this to be much smaller during inference, during training you need to keep all 2 billion bytes around (plus a gradient for every weight) so you can backpropagate. Moreover, this is with a batch size of 1; if you want to run a batch size of 16, 32, etc., the activation memory grows roughly that many times over.
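As a back-of-the-envelope sketch of that arithmetic (the per-sample activation figure below is a placeholder, not a measured number, and real training also carries optimizer state):

```python
def param_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory just to hold the weights (FP16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

def rough_training_memory_gb(n_params: float,
                             bytes_per_param: int = 2,
                             batch_size: int = 1,
                             activation_gb_per_sample: float = 0.5) -> float:
    """Very rough sketch: weights + gradients (one extra copy per weight)
    plus activations that grow with batch size. Real numbers depend on the
    optimizer, sequence length, and checkpointing tricks."""
    weights = param_memory_gb(n_params, bytes_per_param)
    gradients = weights                      # backprop needs a gradient per weight
    activations = activation_gb_per_sample * batch_size
    return weights + gradients + activations

print(param_memory_gb(1e9))                        # ~2.0 GB, as in the example above
print(rough_training_memory_gb(1e9, batch_size=16))
```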

the storage space you need to house a GPU farm for an AI that writes very badly structured jokes in 2024

And let’s not forget that reality hits hard: the big language models you see in the news are hundreds of billions of parameters large (100x or more the bytes), and often use FP32 (2x the bytes per parameter). Add that up and you have an absolutely ridiculous, mind-boggling amount of memory just to store the model’s weights (let alone the other computations involved).

As an example, we built a medium-sized language model at Aiko Mail that’s ~400 million parameters and shaped like a Transformer. All optimizations considered, it requires a whopping 24GB of memory just to train it with a batch size of 16 samples.

This is ludicrous for one simple reason: this memory is not RAM, but VRAM. This is GPU memory we’re talking about, and accessible GPUs do NOT have anywhere near this much of it. Your GeForce 1060, 3070, 3080, etc. top out at roughly 6–10GB of VRAM, and will struggle just to run inference with large transformers.

To actually train even a mid-sized language model, you’re going to need 24GB of VRAM, commercially available only in the GeForce RTX 3090, which is literally the top of the line, best commercial GPU on the planet and will set you back a cool $3–4k per unit.
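If you want to check where your own card lands, here’s a minimal sketch using PyTorch’s CUDA queries; the 24GB threshold is the figure from our model above, and the sketch assumes PyTorch is installed.

```python
import torch

REQUIRED_GB = 24  # what the ~400M-parameter model above needed at batch size 16

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1e9
    print(f"{props.name}: {total_gb:.1f} GB VRAM")
    if total_gb < REQUIRED_GB:
        print("Not enough VRAM to train; inference-only (or heavy gradient tricks) at best.")
else:
    print("No CUDA GPU visible; training locally is off the table.")
```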

As a result, modern AI is largely inaccessible to most researchers and engineers. Compute costs to train in the cloud are astronomical, and estimates show it could cost millions to train a single large language model (+ the carbon emissions of tens of thousands of cars over their lifetimes).

There are companies trying to change this, like HuggingFace, but we’re a long way from democratizing access to machine learning.

Runtime

When we talk about execution time in computing, we generally mean milliseconds, as users have grown to expect instant interactions.

Unfortunately, this is not the case at all with all three of the above technologies.

With blockchain, mining a single block takes more and more work as time goes on. It’s gotten to the point where it’s no longer profitable for individuals to mine large cryptocurrencies, and mining pools have started calling the shots on the network. For a “decentralized” technology, it favors an awful lot of centralization.

It also spells doom for blockchain’s (arguably) largest use case: cryptocurrency. Cryptocurrency thrives on fast transactions, without centralized authorities, across borders and without large fees.


However, with the failure of blockchain to scale and the amount of time (time = electricity) required to mine a block, transaction fees rose to record highs; at its peak, Bitcoin reached a point where the fees on most small transactions were larger than the amount being sent, making it utterly useless and a total failure for everyday consumers. Transactions began to take hours, and regulation spurred by Bitcoin’s runaway price made it difficult to move Bitcoin across borders.

And don’t even get me started on Ethereum. Holy sh*t, the only thing more expensive than gas for my car right now is the gas I need to put a picture of a monkey on a blockchain. The very idea that we need Layer 2 blockchains is insane — and with scaling blockchains like Avalanche the issue becomes getting widespread adoption.

Cryptocurrency became everything it promised to destroy.

With artificial intelligence, a very different issue arises. Inference for many large-scale models takes several seconds, which sounds like a small amount of time but racks up quickly and becomes a blocking factor once your user base numbers in the thousands.

Moreover, the figures quoted for inference time from most models are tricky: you’ve got to read between the lines, or rather to the next line, which usually reads “as found on our XYZ GPU stack,” where “XYZ GPU stack” costs several thousand dollars and must be fully dedicated to that single inference task.

Google’s TPU stack that’s often used for training their models

Of course, research is being done here, but it focuses almost entirely on training time, which I would argue isn’t as important for businesses (though it matters for research). For training, a week is not a big deal; any startup can spare a week to train a model that will become the cornerstone of its business.

The larger problem lies in inference. At a root level, training requires inference: inference is just the “forward” pass of a network, and it must occur during training before backprop runs. The difference is that in training, all the data that needs to be inferred is available up front.

In other words, the entire batch can be processed at once (i.e. hundreds of images at the same time) thanks to how the math scales: one big batched matrix multiplication is more efficient than many small ones done separately (it’s more efficient to do more at once). This holds up to a point, similar to diminishing returns in economics, but batched training lets us scale inference over large amounts of data (given you have enough compute). A rough illustration is sketched below.
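Here’s a minimal NumPy timing sketch of that point; exact numbers depend on your machine, but one batched multiply should comfortably beat 256 one-at-a-time multiplies.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)        # a single "layer"
x_batch = rng.standard_normal((256, 1024)).astype(np.float32)   # 256 inputs

# One at a time: 256 separate matrix-vector products.
t0 = time.perf_counter()
for x in x_batch:
    _ = x @ W
one_by_one = time.perf_counter() - t0

# Batched: a single matrix-matrix product over all 256 inputs.
t0 = time.perf_counter()
_ = x_batch @ W
batched = time.perf_counter() - t0

print(f"one-by-one: {one_by_one*1e3:.1f} ms, batched: {batched*1e3:.1f} ms")
```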

Unfortunately, in practice, batch processing is almost never an option at serving time.

It’s rare that a model has to run inference on 200 images all at once; it’s more likely that 200 images are submitted for inference over, say, a minute. There will be a hard bottleneck in inference time per image; even an inference time of, say, a third of a second (which is insanely fast, the kind of figure boasted by the fastest models, such as Gmail’s super-optimized sentence prediction model) caps a single instance at 180 images per minute. Even at the small load of 200/min, the model fails to scale, and to address the growing backlog a second instance must be spawned to balance the load.
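That capacity math, written out (the numbers come straight from the example above, and the sketch ignores batching and queueing variance):

```python
import math

def replicas_needed(requests_per_min: float, latency_s: float) -> int:
    """How many single-threaded model instances you need so the backlog
    doesn't grow without bound."""
    capacity_per_min = 60.0 / latency_s      # items one instance clears per minute
    return math.ceil(requests_per_min / capacity_per_min)

# ~0.33 s per image caps one instance at ~180/min,
# so 200 images/min already forces a second replica.
print(replicas_needed(200, 1 / 3))   # -> 2
```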

It’s a hard pill to swallow in a world where large loads are a given; in fact, Node’s Express has come under fire before for only supporting a couple thousand connections/second, similar to popular NoSQL databases being criticized for having a bottleneck at several thousand transactions/second.

a dam that’s guaranteed to burst

Those are figures unheard of in the world of AI, where a single instance of a model hits a hard bottleneck at, at best, a couple hundred separate inferences per second, even with every optimization you can make.

With XR, there’s a very, very different problem at hand. The issue lies in immersion: to stay immersive and avoid the uncanny valley, interactions have to register and the render has to adjust faster than humans are able to perceive. In other words, a couple hundred milliseconds isn’t fast enough.

With XR, we’re measuring things in single-digit or low double-digit milliseconds. Acceptable latency is generally put below ~20ms (a figure many gamers would call generous, since 100+ fps and under 10ms ping is considered normal for most rigs). The arithmetic is sketched below.
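The frame-budget arithmetic behind those numbers, assuming common headset refresh rates:

```python
# Motion-to-photon budget: at a given refresh rate, each frame has to be
# fully rendered (and sensor input processed) within 1000/Hz milliseconds.
for hz in (60, 72, 90, 120, 144):
    print(f"{hz:>3} Hz -> {1000 / hz:.1f} ms per frame")
# At 90 Hz, a common headset target, that is ~11 ms for everything:
# tracking, physics, rendering, and display, well under the ~20 ms ceiling.
```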

It’s a figure that remains a bottleneck, and one we’re nowhere near hitting consistently; while XR tech focuses intensively on this and numerous breakthroughs have been made in sensory input and rendering, we still face issues with interaction. Specifically, XR opens up a whole new, practically infinite range of possible interactions that most engines just can’t process quickly enough.

ARCore’s Augmented Image engine as showcased at Google I/O 2018

Moreover, to run image augmentation with anything more complicated than homography, the latency is too high to be considered “acceptable” by most experiences. The fastest we’ve tested so far is Google’s ARCore Augmented Images, which uses homographies (there’s also an Augmented Faces demo which is a little slower and has some noticeable latency/lag).

ARCore Augmented Faces demo
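For a sense of what the homography-only case involves, here’s a hedged sketch using OpenCV feature matching; this is not ARCore’s actual implementation, just the standard planar-target recipe (detect features, match them, fit a homography with RANSAC).

```python
import cv2
import numpy as np

def find_target(reference_path: str, frame_path: str):
    """Locate a known planar target (e.g. a poster) in a camera frame by
    matching ORB features and fitting a homography. Anything fancier than
    this (full 3D reconstruction, faces) costs far more latency."""
    ref = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)
    frame = cv2.imread(frame_path, cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(ref, None)
    kp2, des2 = orb.detectAndCompute(frame, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:100]

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC-fitted 3x3 homography mapping the reference onto the frame;
    # this is what lets you overlay content "stuck" to the target.
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H
```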

It’s a big problem

In the last two years alone, the advancements in these three fields have been staggering.

In AI, we saw natural language generation leap forward with PaLM and GPT-3, models earning their reputation as the ImageNet moment of NLP. It’s opening a new world of possibilities in text processing. We’ve also seen convolutional networks and GANs scale up to 4K images, and we’re entering an era where AI can generate HD content that looks believable to the human eye.

With blockchain, the field has become less muddied, with several cash grabs dying out and making more room for the platforms that remain. There’s been a lot of progress with the rise of NFTs and Web3, but the market is currently in a total crash, so we’re waiting for recovery.

In XR, we’ve seen phone-level XR and AR become a reality with many improvements to ARCore. It’s becoming possible to integrate AI into XR and develop experiences that allow for some level of human interaction (4 years ago when I drafted this, the highest level of interaction was AI-based, like Fiddler AR; now even physical touch is possible with reconstructed topologies).

These are all fields that bear a significant impact on the future of technology, and on where human civilization is going as a whole. These technologies don’t just impact the economy, they redefine it; they don’t just touch society and interaction, they revitalize them. They make an overhaul of our entire culture possible, and at a root level they’re being called another Industrial Revolution.

For a movement with such a large bearing on the general public, it’s paramount that it can scale to support the public with ease. Until it can, poor scaling impedes adoption and forms a hard-stop barrier to entry for innovation outside of large corporations.

I hope to see more innovation in scaling these technologies. At Aiko Mail, we’re rolling out language models that outperform GPT-3 and other large language models with a fraction (2/1000th right now) of the compute. It’s not impossible to scale, just very difficult.

--

CUUUPID

Highly opinionated on avocado rolls, seals, and tech. Once made pancakes on a GPU.