1 Jan, 2008
The newspaper ad offers a computer system with a "2.2 GHz CPU" and "512 Megabytes of RAM". We are interested in the speed of the CPU and the size of the memory.
The CPU keeps active data in registers. The most recently accessed memory is in the Level 1 Cache, while slightly less recently accessed memory is in the Level 2 Cache, both of which are in the CPU chip. Some chips even have a Level 3 Cache. However, eventually the program will need data that is not in the CPU chip, and then it will have to go to the main memory plugged into the mainboard.
The CPU "waits" for data coming from the network or disk. If you look at a graph of CPU activity, during such periods the CPU appears inactive. If another program is ready to run, the system will allow it to use the processor. Unfortunately, when the CPU needs data from memory it does not appear to be idle. It is regarded as "busy" even though it is not accomplishing anything. Memory is so slow from the point of view of the CPU that it can miss the opportunity to execute hundreds of instructions while waiting for one byte of data, but memory is also so fast compared to other devices that there is no alternative except to wait for the data.
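To put a rough number on "hundreds of instructions", here is a minimal Python sketch. The 2.2 GHz clock is the one from the newspaper ad; the 100 nanosecond memory delay is only an assumed, illustrative figure, and the function name is made up for this example.

```python
def stalled_cycles(cpu_ghz: float, memory_latency_ns: float) -> float:
    """Clock cycles that tick by while the CPU waits for memory."""
    # A clock running at N GHz ticks N times every nanosecond.
    return cpu_ghz * memory_latency_ns


# 2.2 GHz is from the ad; 100 ns is an assumption, not a measured value.
cycles = stalled_cycles(2.2, 100.0)
print(f"about {cycles:.0f} cycles pass during one trip to main memory")
```

If the processor could otherwise complete about one instruction per cycle, a delay of that size costs a couple of hundred instructions.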
There are many memory performance numbers quoted by the vendors, and almost all of them are misleading. As with all the other subjects discussed in PCLT, the best way to understand the subject is to walk through the process step by step and explain what is going on in each step.
Remember that the CPU internally saves the most recently referenced memory in various types of Cache. Like any storage mechanism, Cache has to have some unit of storage. It could save individual bytes, but that would be horribly inefficient. The interface between an Intel CPU and the mainboard chipset is 64 bits (8 bytes). This means that the CPU cannot request less than 8 bytes of data at a time, so that is the next obvious possible storage unit. However, in modern CPU chips even 8 bytes is too small. Instead, CPUs store in cache a line of 32 or 64 contiguous bytes read from memory.
If the CPU transfers 8 bytes of data at a time through its connection from the mainboard, then it takes 4 or 8 consecutive data transfers across the memory bus to complete the line of cache. This is called a burst.
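The arithmetic of a burst can be sketched in a few lines of Python. The line sizes and the 8 byte bus width come from the text; the function name is hypothetical.

```python
def transfers_per_line(line_bytes: int, bus_bytes: int = 8) -> int:
    """Number of consecutive bus transfers needed to fill one cache line."""
    if line_bytes % bus_bytes != 0:
        raise ValueError("cache line must be a whole number of bus transfers")
    return line_bytes // bus_bytes


for line_bytes in (32, 64):
    print(f"{line_bytes} byte line: {transfers_per_line(line_bytes)} transfers")
```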
Memory is sold in DIMM modules, called a "stick of memory" in computer slang. The DIMM plugs into a memory bus that is also 8 bytes wide. Modern mainboards, however, have two parallel memory buses and therefore can transfer 16 bytes from memory in every data transfer cycle.
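As a rough sketch, the peak transfer rate of such a pair of buses is just the bytes moved per transfer multiplied by the number of transfers per second. The rate of 800 million transfers per second used below is an assumed example, and the result is a theoretical ceiling that ignores the delays between bursts.

```python
def peak_bandwidth_gb_per_s(million_transfers_per_s: float,
                            bus_bytes: int = 8,
                            buses: int = 2) -> float:
    """Theoretical peak bandwidth in gigabytes (10**9 bytes) per second."""
    bytes_per_second = million_transfers_per_s * 1e6 * bus_bytes * buses
    return bytes_per_second / 1e9


# 800 million transfers per second is an illustrative figure only.
print(peak_bandwidth_gb_per_s(800))            # two 8 byte buses
print(peak_bandwidth_gb_per_s(800, buses=1))   # a single bus
```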
An AMD CPU chip connects the memory directly to the CPU chip. The CPU chip generates the two parallel memory buses, and transfers 16 bytes of data in a cycle across the buses. The AMD CPU also has entirely different buses (HyperTransport) to connect to the video card, I/O devices, and other CPU chips.
An Intel CPU chip has only one external 64 bit bus that connects it to the Northbridge chip on the mainboard. Through the Northbridge, it accesses memory, I/O devices, and other processors. Even in a Core 2 Duo chip with two processors on the same chip, the two processors communicate through this one 64 bit bus and the services of the Northbridge.
The memory controller that is built into the AMD CPU chip or the Intel Northbridge chip is responsible for managing the memory bus and fetching data from memory. In the Intel case, the Northbridge buffers data and adapts between the speed of the CPU "Front Side bus" and the speed of the memory bus. For example, memory delivers 16 bytes at once to the Northbridge, but the CPU can only accept 8 bytes per cycle. However, the CPU can transfer data on every cycle, while memory can have delays between bursts. In the long run, things balance out, but in the short run the memory controller buffers and matches speeds.
A memory DIMM has at least one bank of memory (some have two or more banks). Each bank operates as an independent device. It has its own buffers and maintains a logical "position" in memory. The bank is typically described in terms of "rows" and "columns", and I guess we are stuck with that terminology. The "row" is a chunk of contiguous memory (perhaps 2K or 4K) that can be held in the buffer at any one time. The memory controller selects a row by sending the high order half of the address, and the bank locates the row with this address and moves all the data in the row to the buffer. This is by far the slowest and most expensive part of the memory access.
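The idea of splitting an address into a row part and a column part can be sketched as follows. This is a simplification under stated assumptions: the 4096 byte row size is a guess, and a real memory controller also carves out bits to select the bus, the DIMM, and the bank, which this sketch ignores.

```python
ROW_BYTES = 4096  # assumed row size; actual row sizes vary by device


def split_address(addr: int, row_bytes: int = ROW_BYTES) -> tuple[int, int]:
    """Split a byte address into (row number, offset within the row)."""
    # The high order part picks the row, the low order part picks the column.
    row, column = divmod(addr, row_bytes)
    return row, column


row, column = split_address(0x12345)
print(f"row {row}, column offset {column:#x}")
```

Two addresses with the same row number can be served from the same buffer without loading a new row.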
Once a row has been loaded into a buffer, however, the memory can jump from one address to another address in the same row in a delay known as the "CAS latency". CAS latency is expressed as a certain number of cycles on the memory bus, so a CAS latency of 2.5 on a memory bus running at 400 MHz is the same as a CAS latency of 5 on a memory bus running at 800 MHz. Of course, memory performance would suffer if you had to wait 5 cycles between every burst. Fortunately, access within a row can be "pipelined". You don't have to wait for a previous memory request to end before presenting the next address for the next chunk of data. The memory can be locating the next data in the row while it is transferring the previous burst of data.
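The equivalence of those two CAS figures is easy to check by converting cycles into nanoseconds. The sketch below takes the numbers in the text at face value and counts the latency against whatever clock rate is quoted; the function name is made up for this example.

```python
def cas_latency_ns(cas_cycles: float, bus_mhz: float) -> float:
    """Convert a CAS latency in bus cycles to nanoseconds."""
    # One cycle at N MHz lasts 1000 / N nanoseconds.
    return cas_cycles * 1000 / bus_mhz


print(cas_latency_ns(2.5, 400))  # 6.25
print(cas_latency_ns(5, 800))    # 6.25
```

Doubling the bus speed and doubling the cycle count leaves the actual delay unchanged.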
So the real memory delay occurs when the program needs data from a new row. This operation cannot be pipelined, and there is a lot of work saving changes back to the old row, locating the new row, and moving the data into the buffer. Fortunately, programs tend to "localize" their access to data. This means that the next memory location the program needs is very likely to be close to one of the locations it recently referenced. That is why cache memory improves performance and it suggests, but does not guarantee, that once memory goes to the trouble of loading a row of data into its buffers it will get several requests for different data in the row before it has to go to the trouble of getting a new row.
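A toy model shows why locality matters so much. The sketch below assumes a single bank with one row buffer and a 4096 byte row (both simplifications), and counts how often a stream of addresses finds its row already loaded.

```python
def count_row_hits(addresses, row_bytes: int = 4096) -> tuple[int, int]:
    """Return (hits, misses) for one bank that keeps a single row buffered."""
    open_row = None
    hits = misses = 0
    for addr in addresses:
        row = addr // row_bytes
        if row == open_row:
            hits += 1          # data is already in the buffer
        else:
            misses += 1        # expensive: a new row must be loaded
            open_row = row
    return hits, misses


# 256 cache lines of 64 bytes read back to back: good locality.
sequential = range(0, 256 * 64, 64)
# 256 reads that each jump 8192 bytes ahead: every read lands in a new row.
strided = range(0, 256 * 8192, 8192)

print(count_row_hits(sequential))  # (252, 4)
print(count_row_hits(strided))     # (0, 256)
```

The same number of requests costs 4 row loads in one case and 256 in the other.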
Now for a very fine point of mainboard design. There is no signal that the memory can send to ask the memory controller to slow down or wait a minute. Every individual memory device has its own timings for transfer speed, CAS latency, row access, and so on. At power up, the memory controller obtains all these timings and parameters from the individual DIMM. After that, the memory controller is responsible for doing its own calculations about row access, CAS latency, pipelining, and all the rest. It must not send data before the memory is ready to receive it, and it must not expect memory sooner than the memory is ready to deliver it. If you run all the parts of the computer at their standard bus speeds, then this is handled automatically. However, if you "overclock" the mainboard and run the memory bus at a faster than standard speed, then you may have to override all the memory timing parameters that the DIMM provides with other more lax values.
The good news is that, while you must use the particular type of memory (DDR, DDR-2, DDR-3) that your mainboard requires, you can always substitute faster memory than your system expects. When memory is new, faster memory sells at a premium. However, if you are upgrading a system you have had for a year or two, you will often find no price difference between different speeds of older technology. Memory rated at the highest speed will have been tested to be more stable and reliable, but it will still run at slower speeds. A DIMM can tell the memory controller to use shorter timing values (like CAS latency) on a slower bus speed. If it doesn't, however, then the memory controller might pick up the higher latency numbers intended for use on the faster bus, use them on the slower bus and degrade performance. You can override them with manual settings in the startup BIOS windows, or you can just buy upgrade memory that is identical to the original memory, even if it costs more.
While database servers can use seemingly unlimited amounts of memory, a modern desktop processor (even one running Vista) may have difficulty using more than 2 gigabytes of main memory. There simply are no desktop applications that require more memory. For a while vendors have used new generations of chip technology to reduce the number of chips on a DIMM and therefore the cost of memory. However, at some point the memory vendors may apply the extra transistors to modest performance gains.
In the first generation IBM PC, DRAM memory transferred one unit of memory for every CPU request. The CPU presented an address, the memory responded with data. The CPU presented another address, the memory responded with another unit of data. This worked well because the CPU and memory ran at essentially the same speed.
An operation that can only proceed when the sender and receiver both indicate that they are ready is said to be "asynchronous". It runs at the speed of the slower of the two ends. Baseball is an asynchronous game. The pitcher can take his time, look at the runner on first, get signals from the catcher. If the batter needs more time he can step back out of the box, stretch, and rub something on his hands. Only when the batter is in the box and the pitcher starts his windup can we really expect a pitch.
There is another mode of operation represented by the pitching machine in a batting cage. The machine delivers balls regularly and mechanically, whether the batter is ready or not. When something happens at a regular rate, driven by a clock, then computer experts call it a "synchronous" operation.
We talk about modern memory as Synchronous DRAM (SDRAM), although strictly speaking this applies only to the transfer of consecutive blocks of 8 bytes during the burst. Remember, the CPU is actually reading a 32 or 64 byte line of cache from some address in memory. Between bursts, the CPU still decides when it needs the next chunk of data from memory and the memory waits patiently until it gets that signal.
Dynamic Random Access Memory (DRAM) stores data in an electronic component called a capacitor. A capacitor holds a certain amount of electric charge. It is commonly compared to a bucket of water. If the bucket is full, this represents a 1. If the bucket is empty, it is 0. The problem is that capacitors leak, like a bucket with a small hole in it. Over time, the full bucket becomes 3/4 full, then half full. So periodically the memory chip has to read all the data in a row and then rewrite it, thus refilling all the buckets that represent a 1. This is called a refresh cycle and it slows down memory performance.
The alternative is Static Memory which doesn't require a refresh cycle. The problem is that Static Memory takes up 4 times the chip space, or alternately you can get only 1/4 as much memory for the same amount of money. This has never seemed like a good tradeoff, and there doesn't appear to be any change in this expected any time soon.
The current standard is DDR-2 memory. DDR-2 memory is DDR-1 memory with a few slight tweaks. Now superficially it sounds like it runs a lot faster. DDR-2 memory can run at 800 MHz while DDR-1 typically stops at 400 MHz. However, if you look carefully you will often see (at affordable prices) that the CAS latency and other timing values for DDR-2 800 MHz are exactly twice the values for DDR-1 400 MHz. So it may be the exact same memory connected to a faster bus. There are also some changes to the bus structure to run more efficiently. Each new generation is faster, but not a whole lot faster.
A desktop mainboard has 4 memory slots representing two memory buses with two slots each. Some boards can accept 4 DIMMs with 2 gigabytes of memory per DIMM for a total of 8 gigabytes, but other boards max out at 4 gigabytes. If you try to run with the very highest memory speed the board supports, the memory controller may only be able to handle one DIMM per bus and max out at 2 gigabytes.
Meanwhile, there is no limit to the amount of memory that a serious Database Server can use. Mainboards designed to be used as servers may have more memory slots. Typically this memory will run one or two speeds slower than the fastest desktop memory and it will have some additional electronics (FB, registered) to allow reliable operation with more than just one or two DIMMs per bus.
If you plan on running Vista, you want a computer with 2 Gigabytes of RAM. You can buy 2 gigs of medium speed memory for $50. Getting the very fastest memory (highest clock speed and lowest latency) can cost noticeably more.
The memory vendors quote the fastest speed at which the memory operates, but they only test it with certain board configurations. The mainboard vendors quote the fastest clock speed at which their board operates, but they only test it with some memory. Sometimes there is fine print, where the board supports the fastest memory speed, but only if you use only two of the four memory slots.
If you insist on trying to get the very fastest memory speed possible, then you have to make sure the memory you want works with the board you want and read all the fine print. That is a lot of work to pay for the privilege of spending a lot more money to get what turns out to be a very small performance benefit.
Or you can buy a lot of decent but cheap memory and use the money and time you save to do something else.
The most important memory feature is one that most desktop users are unable to select. Memory can come with ECC error checking. This will detect a problem in the memory itself, but it will also detect sporadic problems caused by mismatches with the board. If you don't have ECC memory, then memory problems show up as corrupted data and cause your programs and OS to crash in all sorts of random ways. To use ECC, you not only have to get the feature in the memory stick but it also has to be supported by the mainboard, and most mainboard vendors only support it on server configurations. ECC costs a few bucks more, but that is a lot cheaper than the hours or days that you spend trying to track down a problem that initially appears to be a software problem but is ultimately resolved as a memory problem.
Copyright 1998, 2008 PCLT -- Introduction to PC Hardware -- H. Gilbert