
What Are MMX, SSE And AVX?


Computer technology is no stranger to abbreviations: CPU, GPU, RAM, SSD, BIOS, CD-ROM, to name a few. New terms regularly come to the fore as part of the endless effort to improve the features and capabilities of our computing devices.

Our focus today is on explaining the popular MMX, SSE, and AVX processor instruction set extensions, and on examining whether they are genuinely useful features to have or just pointless marketing tricks.

Back to the early days

Let’s start this explanatory journey with a flashback to the mid-1980s. The processor market was quite similar to today’s, with Intel holding the largest share of sales but facing stiff competition from AMD. Home computers, such as the Commodore 64, used 8-bit processors, while desktops were switching from 16-bit to 32-bit chips.

Those numbers refer to the size of the data values the chip can process mathematically, with larger sizes providing better precision and capability. The figure also sets the size of the general-purpose registers in the chip: the small banks of memory used to hold working data.

Such processors were scalar and integer in nature – but what exactly does that mean? Scalar means each math operation is performed on just one piece of data at a time: an approach typically described as SISD (single instruction, single data).

So an instruction to add two numbers together is processed for just those two numbers. If you want to add the same value to a group of 16 numbers, you have to issue 16 instructions. Not great, but such were the limitations of processors at the time.
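To make that concrete, here is a minimal sketch in C (the function name is hypothetical, chosen just for illustration) of what scalar processing means: one addition per element, every time.

```c
/* Scalar (SISD) addition: each add instruction processes exactly one
   pair of values, so adding the same constant to 16 numbers requires
   16 separate additions issued one after another. */
void add_scalar(int *values, int count, int addend)
{
    for (int i = 0; i < count; i++)
        values[i] += addend;   /* one add per element */
}
```

A compiler today might auto-vectorize a loop like this, but on a scalar-only chip the hardware genuinely executes it one element at a time.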

Intel’s 16 MHz 80386DX from 1985. Image: Wikipedia

An integer is the mathematical term for a whole number, such as 8 or -12. Intel’s 80386SX processors had no inherent ability to add, say, 3.80 and 7.26 – numbers like these are called floats (short for floating point). Crunching them required another chip, such as the 80387SX, and a separate instruction set – the list of commands that tell a CPU what to do.

For processors of that era, x86 instructions covered integer calculations and x87 covered floats; today we use the term x86 for both, because everything is done by the same chip.

This use of separate chips – commonly known as coprocessors – to handle integer and floating-point operations lasted until Intel released the 80486: their first desktop processor with an integrated floating-point unit (FPU).


The 80486’s FPU, highlighted in yellow. Image: Wikipedia

As the image above shows, the FPU took up a sizeable portion of the processor’s die, but the benefits of packaging it inside were enormous.

The whole setup was still scalar, though, and it remained so in the 80486’s successor: the original Intel Pentium.

Well, that remained true until 3 years after that processor line’s release. In October 1996, Intel launched the Pentium with MMX technology.

V is for vectors, MMX is for …?

In mathematics, numbers can be grouped into sets of various shapes and sizes – one particular grouping is called a vector. The easiest way to think of one is as a list of values arranged horizontally or vertically. What MMX technology introduced to the processor world was the ability to do vector math.

It was limited to start with, because it only worked with integers. The system actually reused the FPU’s registers for this purpose, so programmers wanting to issue MMX instructions had to bear in mind that no floating-point calculations could take place at the same time.

Intel’s brilliant 1997 advert for MMX.

Each 64-bit MMX register could hold two 32-bit, four 16-bit, or eight 8-bit integers. Such a group of numbers is a vector, and any instruction issued against it is carried out on every value in the group at once.

This type of system is known as SIMD (single instruction, multiple data), and it marked a big step forward for the processors used in desktop computers.
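MMX intrinsics are effectively obsolete now, but its modern descendants expose the same idea. The sketch below (x86-only, using SSE2 intrinsics, which handle eight 16-bit integers per 128-bit register where MMX handled four per 64-bit register; the function name is invented for illustration) shows a whole vector of integers added with a single instruction:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86/x86-64 only) */
#include <stdint.h>

/* SIMD addition: one paddw instruction adds eight 16-bit integers
   at once, instead of looping eight times. */
void add_vectors_epi16(int16_t *a, const int16_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* a[i] += b[i] for all eight lanes, in a single operation */
    _mm_storeu_si128((__m128i *)a, _mm_add_epi16(va, vb));
}
```

One instruction here does the work of eight scalar adds – that is the entire pitch of SIMD.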

So what programs benefit from such a system? Almost anything that applies the same calculation to groups of numbers, but especially certain tasks in 3D graphics, multimedia, and general signal processing.

For example, MMX could be used to speed up matrix multiplication in vertex processing, or to blend two video streams together for chroma keying or alpha compositing.

AMD’s K6-2 processor – 3DNow! is in there somewhere. Image: Fritzchens Fritz

Unfortunately, uptake of MMX was slow due to its negative impact on floating-point performance. AMD solved part of that problem by creating its own version, called 3DNow!, about two years after MMX appeared – it offered more SIMD instructions and could handle floats too, but it also suffered from a lack of programmer adoption.

Oh, and the name? The letters don’t officially stand for anything: Intel wanted to trademark the name, and you can’t do that with an initialism. That said, readers of a certain age with sharp memories may well recall Intel referring to “Matrix Math Extensions” in early marketing documents and articles.

SSE makes it easy

Things improved in 1999 with the launch of Intel’s Pentium III. Its shiny new vector feature came in the form of SSE (Streaming SIMD Extensions): an additional set of eight 128-bit registers, separate from those in the FPU, plus a batch of extra instructions to handle vectors.

Using separate registers meant the FPU was no longer tied up, although the Pentium III couldn’t issue SSE instructions at the same time as FP ones. The new feature also supported just one data type in its registers: four 32-bit floats.

But providing vector instructions for floats opened up a range of performance gains in programs such as video encoding/decoding, image and audio processing, file compression, and many more.
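The four-floats-per-register model can be sketched like this (x86-only, using SSE intrinsics; the function name is invented for illustration): one addps instruction sums all four lanes in parallel.

```c
#include <xmmintrin.h>  /* SSE intrinsics (x86/x86-64 only) */

/* SSE's 128-bit registers each hold four 32-bit floats;
   a single addps instruction adds all four pairs at once. */
void add_four_floats(float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(a, _mm_add_ps(va, vb));  /* a[i] += b[i], four lanes */
}
```

This is exactly the kind of inner-loop operation that audio mixing or image filtering performs millions of times per second.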

SSE2 registers in the Pentium 4, highlighted in yellow. Image: Fritzchens Fritz

An updated version, SSE2, appeared in 2001 with the Pentium 4, and this time data type support was much better: four 32-bit or two 64-bit floats, as well as sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integers. The MMX registers remained in the processor, but all MMX and SSE operations could now be performed using the separate 128-bit SSE registers.

SSE3 came to life in 2003, with more instructions and the ability to do some math between values within the same register.

The Intel Core architecture arrived 3 years later, bringing another revision of the SIMD technology (SSSE3 – Supplemental SSE3), with the final version, SSE4, saying hello later that year.

AMD’s 2007 Barcelona architecture had its own version of SSE4 – the SSE4a extension. Image: Fritzchens Fritz

A minor update was released in 2008 with the range of Core Nehalem processors, which Intel called SSE4.2 (the original thus became known as SSE4.1). Neither update changed the registers, but more instructions were added to the table, widening the range of math and logic operations that could be performed.

AMD had planned its own SSE5, but instead chose to split it into three separate extensions, one of which would cause several headaches – more on that in a moment.

By the end of 2008, both Intel and AMD were shipping processors that could handle everything from MMX through SSE4.2, and many applications (mostly games) were starting to require these features.

Time for new letters

By 2008, Intel was already working on a significant upgrade to its SIMD setup, and in 2011, Sandy Bridge processors launched with AVX (Advanced Vector Extensions).

Everything doubled: twice as many vector registers, each twice the size.

The sixteen 256-bit registers could hold only eight 32-bit or four 64-bit floats, so AVX was somewhat more limited than SSE in terms of data formats – but that instruction set was still available. By this time, software support for CPU vector operations was well established, from the fundamental level of compilers through to complex applications.

 

And for good reason: the likes of the 3.8 GHz Core i7-2600K can potentially crunch through more than 230 GFLOPS (billion floating-point operations per second) when executing AVX instructions – not bad for a relatively small addition to the overall processor.
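That headline figure checks out with back-of-the-envelope arithmetic, assuming (hypothetically, as a common rule of thumb for Sandy Bridge-era chips) that each core can issue one 8-wide AVX float add and one 8-wide AVX float multiply per cycle:

```c
/* Peak throughput = cores x clock (GHz) x FLOPs per core per cycle.
   With AVX: 8 adds + 8 multiplies = 16 FLOPs per core per cycle
   gives 4 x 3.8 x 16 = 243.2 GFLOPS for a 4-core, 3.8 GHz chip. */
double peak_gflops(int cores, double clock_ghz, int flops_per_cycle)
{
    return cores * clock_ghz * flops_per_cycle;
}
```

Real workloads never sustain the theoretical peak, which is why the article's figure of "more than 230" is quoted as a potential rather than a guarantee.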

That’s if it actually runs at 3.8 GHz, though. Part of the problem with AVX was that the load on the chip was so high that Intel had the processor automatically reduce its clock speed in this mode by about 20% to keep power consumption and heat levels in check. Unfortunately, that’s the price you pay for doing any SIMD work on a modern processor.

Another improvement with AVX was the ability to work with three values at a time. In all versions of SSE, operations are performed between two values, with the result then overwriting one of them. AVX keeps the original values intact when performing SIMD instructions, saving the result into a separate register.

Finally, AVX2 launched in 2013 with the quad-core Haswell architecture, and it was a significant upgrade thanks to the introduction of another extension: FMA (fused multiply-add).

Although a separate feature from AVX2 itself, the ability to issue a single instruction that performs two operations (a multiply followed by an add) was highly beneficial for applications doing matrix math. It also worked on scalar operations – but Intel’s and AMD’s implementations are completely incompatible. That’s because Intel’s FMA is a three-operand system: it works with 3 separate values, which can be 2 sources and a separate destination, or 3 sources with the result replacing one of them.

AMD’s version, FMA4, uses four operands, so it can do the math on 3 numbers without needing to write the answer over one of them. While FMA4 is mathematically more flexible than FMA3, its implementation is more complex, both to program and to integrate into the processor.

AVX-512: Is it a step too far?

Even as AVX2 was just getting started in the processor market, Intel had already laid out plans for its successor, AVX-512, and the overall theme was “more, much more.” The number of registers doubled yet again, as did their size, alongside a raft of new instructions and support for the old ones.

The first batch of chips to sport the AVX-512 feature set was the Xeon Phi 7200 series – the second generation of Intel’s many-core processors, aimed at the supercomputer world.


The 72-core, 288-thread Knights Landing Xeon Phi. Image: Fritzchens Fritz

Unlike previous iterations, the new instruction set consists of 19 subsets: a core foundation, AVX-512F, which every compatible processor must provide, plus various specialized sets covering additional operations such as reciprocal math, integer FMA, or convolution algorithms for neural networks.

Initially, AVX-512 was reserved for Intel’s largest chips, aimed at workstations and servers, but the recent Ice Lake and Tiger Lake architectures now offer it too. Yes, that’s right: you can buy a lightweight laptop with 512-bit vector units inside.

That may sound like a good thing, but depending on the use case, it probably isn’t. You see, the registers in a CPU are usually grouped together in a block known as a register file, as shown in the image of a dual-core Intel Skylake chip below.

Image: Wikichip

The yellow box highlights the vector register file; the red box marks the most likely location of the integer register file. Notice how much larger the vector one is? Skylake uses 256-bit registers for AVX2, so at the same die scale, AVX-512 registers would take up four times the area – one doubling from doubling the bits, and another from doubling the register count.

Does a chip designed to be as small as possible for the mobile market really need to give up that much space to vector registers? While they’re not a large part of each core’s footprint, every square millimeter matters when trying to make the best use of the available space.

And since using AVX in any form triggers an automatic clock reduction, using AVX-512 in such systems is arguably worse than using any of the earlier versions, because it demands even more power than anything else when running.


All four cores in this laptop processor contain 512-bit vector registers.

And it’s not just on small mobile processors that AVX-512 presents problems. Developers writing code for the workstations and servers that actually benefit from vector extensions need to create multiple versions of it. This is because not all AVX-512 processors offer the same set of instructions.

For example, the IFMA (integer fused multiply-add) subset is only available on Cannon Lake, Ice Lake, and Tiger Lake processors – chips using the Cooper Lake and Cascade Lake architectures, despite being workstation/server products, don’t offer it.

It should be noted that AMD doesn’t offer support for AVX-512 and has no intention of doing so.

What’s next?

The growth in processors’ ability to handle vector math over all those years has been an important step forward. Today’s CPUs are highly capable, offering instruction sets for integer and floating-point operations on scalar, vector, and matrix data.

For the last two data types, processors now compete directly with GPUs – the world of 3D graphics is all about SIMD, vectors, floats, and so on, and graphics accelerators have developed at a meteoric pace. By the start of the last decade, you could buy a GPU capable of roughly 800 billion SIMD instructions per second for less than $500.

That’s more than even the best desktop processors can manage now – but CPUs aren’t designed to excel in any one role: they have to handle very general code, which is often not very repetitive and doesn’t parallelize easily. So it’s better to think of a CPU’s SIMD capability as a useful extra feature rather than something fundamental.

For raw SIMD performance, you want one of these – the graphics card, not the motherboard!

But the rise of the GPU means processors don’t have to carry enormous vector units. That’s almost certainly why AMD hasn’t sought to develop a replacement for AVX2 (an extension its chips have supported since 2015). Let’s also not forget that next-generation processors may look more like mobile SoCs, with dedicated silicon for specific kinds of tasks. Intel, on the other hand, seems keen to get AVX-512 into as many products as possible.

So will we see AVX-1024? Probably not, or at least not for several years. Instead, Intel is more likely to add further subsets to AVX-512, improving its flexibility, and leave raw SIMD performance to its new GPU line.

SSE and AVX are now an integral part of the software landscape: Adobe Photoshop requires CPUs to support SSE4.2; the machine learning API TensorFlow demands AVX support; Microsoft Teams can only create background video effects if AVX2 is available.
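Applications like these typically check for the extensions at startup. A minimal sketch of how that can be done (GCC/Clang on x86 only, via the compilers' built-in `__builtin_cpu_supports` check; the wrapper names are invented for illustration):

```c
/* Runtime CPU feature detection, GCC/Clang on x86/x86-64:
   an application can test for SSE4.2 or AVX2 before choosing
   a vectorized code path, instead of crashing on older chips. */
int has_sse42(void) { return !!__builtin_cpu_supports("sse4.2"); }
int has_avx2(void)  { return !!__builtin_cpu_supports("avx2"); }
```

A program would call these once, then dispatch to the fastest code path the CPU actually supports – which is exactly what the multiple-build-versions problem described earlier for AVX-512 makes painful.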

It all means these extensions won’t be going away any time soon, despite the SIMD power of GPUs – so here’s hoping that when the next generation of vector extensions appears, we get another memorable ad campaign to go with it.