The Artificial Intelligence Infrastructure Workshop

Designing for Scale – Choosing the Right Architecture and Hardware

In the examples we looked at, we used relatively small datasets, and we could do all of our analysis on a single commodity machine without any specialized hardware. If we were to use a larger dataset, such as the entire collection of English articles on Wikipedia, which comes to many gigabytes of text data, we would need to pay careful attention to exactly what hardware we used, how we combined different components of specialized hardware, and how we optimized data flow throughout our system.

By the end of this section, you will be able to make calculated trade-offs in setting up machine learning solutions with specialized hardware. You will be able to do the following:

  • Optimize hardware in terms of processing, volatile storage, and persistent storage.
  • Reduce cloud costs by using long-running reserved instances and short-running spot instances as appropriate.

You will especially gain hands-on experience with running vectorized operations, seeing how much faster code can run on modern processors using these specialized operations compared to a traditional for loop.

Optimizing Hardware – Processing Power, Volatile Memory, and Persistent Storage

We usually think of a computer processor as a central processing unit or CPU. This is circuitry that can perform fundamental calculations such as basic arithmetic and logic. All general-purpose computers such as laptops and desktops come with a standard CPU, and this is what we used in training the model for our text classifier. Our normal CPU was able to execute the billions of operations we needed to analyze the text and train the model in a few seconds.

At scale, when we need to process more data in even shorter time frames, one of the first optimizations we can look to is specialized hardware. Modern CPUs are general-purpose, and we can use them for many different tasks. If we are willing to sacrifice some of the flexibility that CPUs provide, we can look to alternative hardware components to perform calculations on data. We have already seen how CPUs can perform specialized operations on matrices, as they are optimized for vectorized processing; taking this same concept further leads us to hardware components such as Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), and Field-Programmable Gate Arrays (FPGAs). GPUs are widely used for gaming and video processing – and now, increasingly, for machine learning too. TPUs are rarer: developed by Google, they are only available by renting infrastructure through Google's cloud. FPGAs are the least generalizable and are therefore not as widely used outside of specialized use cases.

GPUs were designed to carry out calculations on images and graphics. When processing graphical data, it is very common to need to perform the same operation in parallel on many blocks of data (for example, to simultaneously move all of the pixels that make up an image or a piece of video into a frame buffer). Although GPUs were originally designed only for rendering graphics, advances from the early 2000s onward made it practical to use this hardware for non-rendering tasks too. Because graphical data is also usually represented using matrices and processed with fundamental structures and algorithms from linear algebra, there is substantial overlap between machine learning and graphics rendering, even though at first they might seem like very different fields. General-Purpose computing on Graphics Processing Units, or GPGPU, which is the practice of doing non-graphics-related calculations on GPUs, is an important advance in being able to train machine learning models efficiently.

Nearly all modern machine learning frameworks provide some level of support for optimizing machine learning algorithms by accelerating some or all of the processing of vectorized data on a GPU.
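For example, a minimal sketch using PyTorch (one popular framework; the torch package, a CUDA-capable GPU, and the matrix sizes here are assumptions for illustration, not part of this chapter's exercises) shows how the same high-level operation can run on either a CPU or a GPU:

import torch

# Fall back to the CPU if no CUDA-capable GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Create two large random matrices directly on the chosen device.
a = torch.rand(4096, 4096, device=device)
b = torch.rand(4096, 4096, device=device)

# The same call runs on either device; on a GPU, the multiplication
# is spread across thousands of cores working in parallel.
c = a @ b
print(f"Computed a {tuple(c.shape)} product on: {device}")

On a machine with a suitable GPU, this multiplication typically completes many times faster than on the CPU alone, without any change to the code.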

As an extension of this concept, Google released TPUs in 2016. These chips are designed specifically for neural network workloads and can, in many cases, be even more efficient than GPUs.

In general, we notice a trade-off. We can use specialized hardware to execute specific algorithms and specific data types more efficiently but at the cost of flexibility. While a CPU can be used to solve a wide variety of problems by running a wide variety of algorithms on a wide variety of data structures, GPUs and TPUs are more restricted in exactly what they can do.

A further extension of this idea is the Field-Programmable Gate Array (FPGA), a chip that can be specialized for a specific use case at the hardware level. FPGAs can again deliver large gains in efficiency, but it is not always practical to build specialized, customized hardware to solve one specific problem.

Optimizing how calculations are carried out is important, but memory and storage can also become a bottleneck in a system. Let's take a look at some hardware options relating to data storage.

Optimizing Volatile Memory

There are fewer hardware specializations for volatile memory: RAM is used in nearly all cases. It is nonetheless important to optimize this component by ensuring the correct amount of RAM and the correct caching setup.

Especially with the advent of solid-state drives (SSDs), explored in more detail later, virtual memory is a vital component in optimizing data flow. Because the processing units examined previously can hold only very small amounts of data at any given time, it is important that the next chunks of data queued for processing are waiting in RAM, ready to be transferred over the bus to the processing unit once the previous chunks have been processed. Since RAM is more expensive than the flash memory and other media usually associated with persistent storage, it is common to extend it with virtual memory backed by a page file (also called swap space). This is a portion of the hard disk that is used in the same way as RAM once the physical RAM has been fully allocated.

When training machine learning models, it is common for RAM to be a bottleneck. As we saw in Exercise 1.01, Training a Machine Learning Model to Identify Clickbait Headlines, matrices can grow in size very quickly as we multiply them together and carry out other operations. Because of this, we often need to rely on virtual memory, and if you examine your system's metrics while training neural networks or other machine learning models, you will probably notice that your RAM, and possibly your hard disk, is used at or near full capacity.
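If you would like to watch this happen, the following sketch (assuming the third-party psutil package is installed) prints how much physical RAM and swap space are in use; run it in a separate terminal while a model is training:

import time
import psutil

# Report physical RAM and swap (disk-backed virtual memory) usage once per second.
for _ in range(10):
    ram = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM used: {ram.percent:5.1f}%   Swap used: {swap.percent:5.1f}%")
    time.sleep(1)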

The easiest way to relieve this bottleneck is often simply to add more physical RAM. If that is not possible, allocating more virtual memory (a larger page file or swap partition) can also help.

Volatile storage is useful while data is actively being used, but it's also important to optimize how we store data on longer time frames using persistent storage. Let's look at that next.

Optimizing Persistent Storage

We have now discussed optimizing volatile data flow. In the cases of volatile memory and processor optimization, we usually consider storing data for seconds, minutes, or hours. But for machine learning solutions, we need longer-term storage too. First, our training datasets are usually large and need somewhere to live. Similarly, for large models, such as Google Translate or a model that can detect cancer in X-rays, it is inefficient to train a new model every time we want to generate predictions. Therefore, it's important to save these trained models persistently.
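As a minimal sketch of what this looks like in practice (assuming scikit-learn and joblib are installed, and using a tiny toy model and a hypothetical filename rather than one of this book's exercises):

import joblib
from sklearn.linear_model import LogisticRegression

# Train a tiny example model on toy data.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# Save the trained model to persistent storage...
joblib.dump(model, "model.joblib")

# ...and later load it back to generate predictions without retraining.
restored = joblib.load("model.joblib")
print(restored.predict([[1.5]]))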

As with processing chips, there are many different ways to persistently store data. SSDs have become a standard way to store large and small amounts of data. These drives contain fast flash memory and offer many advantages over older hard disk drives (HDDs), which have spinning magnetic disks and are generally slower.

No matter what kind of hardware is used, storing very large amounts of data persistently becomes challenging. A single hard drive can usually store no more than a few terabytes (TB) of data, so to store more than this we need to be able to treat many hard drives as a single storage unit. There are many databases and filesystems that aim to solve the problem of storing large amounts of data consistently across many drives or machines, each with its own advantages and disadvantages.

Figure 1.8: Linking units of hardware to simulate a larger storage capacity

As you work with larger and larger datasets, you will come across both horizontal and vertical scaling solutions, and it is important to understand when each is appropriate. Vertical scaling refers to adding more or better hardware to a single machine, and this is often the first way that scaling is attempted. If you find that you do not have enough RAM to run a particular algorithm on a particular dataset, it's often easy to try a machine that has more RAM. Similarly, for constraints in storage or processing capacity, it is often simple enough to add a bigger hard drive or a more powerful processor.

At some point, you will be using the most powerful hardware that money can buy, and it will be important to look at horizontal scaling instead. This refers to adding more machines of the same type and using them in conjunction with one another, either working in parallel or sharing work and load in more sophisticated ways.

Figure 1.9: Vertical and horizontal scaling

Once again, cloud services can help us abstract away many of these problems, and most cloud services offer both virtual databases and so-called Binary Large Object (BLOB) storage. You will gain hands-on experience with both in later chapters of this book.
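As a small taste of BLOB storage, uploading a saved model file to Amazon S3 might look like the following sketch (assuming the boto3 package is installed, AWS credentials are configured locally, and the bucket and file names are placeholders):

import boto3

# Create an S3 client using credentials from the local AWS configuration.
s3 = boto3.client("s3")

# Upload a locally saved model file to a bucket (placeholder names).
s3.upload_file("model.joblib", "my-example-bucket", "models/model.joblib")

# Download it again on another machine, for example one that serves predictions.
s3.download_file("my-example-bucket", "models/model.joblib", "model.joblib")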

Optimizing hardware to be as powerful as possible is often important, but powerful hardware comes at a price. Managing cost is therefore another important factor when designing systems.

Optimizing Cloud Costs – Spot Instances and Reserved Instances

Cloud services have made it much easier to rent specialized hardware for short periods, instead of spending large amounts of capital upfront during research and development phases. Companies such as Amazon (with AWS), Google (with GCP), and Microsoft (with Azure) allow you to rent virtual hardware and pay by the hour, so it is feasible to spin up a powerful machine and train your machine learning models in several hours, instead of waiting days or weeks for your laptop to crunch the numbers.

There are two important cost optimizations to be aware of when renting hardware from popular cloud providers: renting hardware for a very short time, or committing to rent it for a long time. Specifically, because most cloud providers have some amount of unused hardware at any given moment, they usually auction it off for short-term use.

For example, Amazon Web Services (AWS), currently the largest cloud provider, offers spot instances. If AWS has virtual machines attached to GPUs that no one has bought, you can take part in a live auction to use these machines temporarily at a fraction of the usual cost. This is often very useful for training machine learning models, as the training can take place in a few hours or days, and it does not matter if there is a small delay at the beginning while you wait for an optimal price in the auction.
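As an illustration, you can inspect recent spot prices programmatically before deciding whether to bid. The following sketch uses boto3 (assuming it is installed and AWS credentials are configured; the region and GPU instance type are placeholder choices):

import boto3

# Query the most recent spot prices for a GPU instance type.
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.describe_spot_price_history(
    InstanceTypes=["p3.2xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    MaxResults=5,
)

# Each record shows the going rate in one Availability Zone.
for record in response["SpotPriceHistory"]:
    print(record["AvailabilityZone"], record["SpotPrice"])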

On the other side of the optimization scale, if you know that you are going to be using a specific kind of hardware for several years, you can usually optimize costs by making an upfront commitment about how long you will rent it. On AWS, these are termed reserved instances, and if you commit to renting a machine for one or three years, you will pay less per hour than the standard on-demand rate (though in most cases still more than the spot rate described previously).

In cases when you know you will run your system for many years, a reserved instance often makes sense. If you are training a model over a few hours or even days, spot instances can be very useful.

That's enough theory for now. Let's take a look at how we can practically use some of these optimizations. Because it is difficult to go out and buy expensive hardware just to learn about optimizations, we will focus on the optimizations offered by modern processors: vectorized operations.
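To give a flavor of what that looks like, here is a minimal sketch using NumPy (assumed to be installed; the array size is arbitrary) that times the same element-wise multiplication written as a plain Python for loop and as a single vectorized call:

import time
import numpy as np

# Build two large arrays of random numbers.
size = 1_000_000
a = np.random.rand(size)
b = np.random.rand(size)

# Element-wise multiplication with a plain Python loop.
start = time.time()
result_loop = [a[i] * b[i] for i in range(size)]
loop_seconds = time.time() - start

# The same operation as a single vectorized NumPy expression.
start = time.time()
result_vec = a * b
vec_seconds = time.time() - start

print(f"for loop:   {loop_seconds:.4f} s")
print(f"vectorized: {vec_seconds:.4f} s")

Exact timings depend on your machine, but the vectorized version is typically one to two orders of magnitude faster, because the processor applies the same instruction to many elements at once.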