Submitted by Jeremy on November 15, 2007 - 10:06am.
"Memory is getting relatively cheap these days --- we're talking maybe US$30 to US$40 per megabyte if your machine can take SIMMS. Upgrading a machine from 2 meg to 4 meg doesn't cost *that* much money."
God, I remember paying $1000
God, I remember paying $1000 for one megabyte of memory.
Now you can get 2 gigabytes for $30.
they ripped you off ;)
they ripped you off, certainly ;)
Cheap, but slow
How things have changed. Memory is cheaper than ever, but it's been outstripped by processor speeds. That means it's become more expensive, in terms of processor cycles, to actually access that memory. So you introduce various levels of caches to try to reduce the cost of memory accesses, but they're still not enough--you have to change the way you write code to take full advantage of the speed of today's processors. That means, for instance, not using lookup tables of precomputed values, because it has become faster to recompute those values whenever needed, than to access the memory containing the tables.
Bandwidth is pretty good; latency sucks.
It really depends on what you're looking up. Modern L1s on AMD and Intel parts still have a 3-cycle load-to-use latency, so plenty of table lookups will still beat recomputing the value, especially if you're doing many of them back to back. I use lookup tables for bit-reversal and bit-expansion to great effect in my code. Oftentimes the latency involved can get hidden behind other work. Pointer chasing, on the other hand, is painful.
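To make the bit-reversal case concrete, here is a minimal sketch in C, not the actual tables from my code. The names bitrev8, bitrev8_init, reverse32_lut and reverse32_calc are made up for illustration. It reverses a 32-bit word two ways: with a 256-entry byte table, and with shifts and masks only. Which one wins depends on the machine and on whether the table stays in L1, so measure before picking one.

```c
#include <stdint.h>

/* 256-entry table: bitrev8[i] is i with its 8 bits in reverse order. */
static uint8_t bitrev8[256];

/* Fill the table once at startup (hypothetical helper name). */
static void bitrev8_init(void)
{
    for (unsigned i = 0; i < 256; i++) {
        uint8_t r = 0;
        for (unsigned b = 0; b < 8; b++) {
            if (i & (1u << b))
                r |= (uint8_t)(0x80u >> b);
        }
        bitrev8[i] = r;
    }
}

/* Table version: four byte lookups that do not depend on each other,
   with the bytes placed in swapped order. */
static uint32_t reverse32_lut(uint32_t x)
{
    return ((uint32_t)bitrev8[x & 0xFF] << 24) |
           ((uint32_t)bitrev8[(x >> 8) & 0xFF] << 16) |
           ((uint32_t)bitrev8[(x >> 16) & 0xFF] << 8) |
            (uint32_t)bitrev8[(x >> 24) & 0xFF];
}

/* Compute-only version: swap progressively larger groups of bits. */
static uint32_t reverse32_calc(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}
```

Call bitrev8_init() once before the first reverse32_lut(). The four lookups in the table version are independent of one another, which is what gives the hardware a chance to overlap them; the compute version is a chain where each step needs the previous one.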
I have to deal with this at work. The DSP I work with has a 5-cycle latency on its load instructions, and that's when it *hits* L1 memory. So a loop that walks a linked list can run no faster than 5 cycles per iteration, because each load needs the pointer returned by the one before it. (It's an 8-issue VLIW DSP with predication, though, so you can at least do something with each list element in that time.) It's painful, though, to watch the DSP sit there twiddling its thumbs when someone writes a->b->c->d->e.
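A rough sketch of the difference, in plain C with a made-up node type (struct node, sum_list and sum_array are illustration names, not anything from a real codebase). The code itself doesn't measure anything; it just shows where the dependency sits.

```c
#include <stddef.h>

/* Hypothetical node type, for illustration only. */
struct node {
    int value;
    struct node *next;
};

/* Pointer chasing: the address for each load comes out of the previous
   load, so the load latency is paid once per node, one after another. */
static long sum_list(const struct node *n)
{
    long total = 0;
    while (n != NULL) {
        total += n->value;
        n = n->next;   /* the next iteration needs this load's result */
    }
    return total;
}

/* Array walk: every address is base + i and known up front, so the
   loads do not depend on each other and can be overlapped. */
static long sum_array(const int *v, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += v[i];
    return total;
}
```

Both functions produce the same sum for the same values. The only structural difference is that sum_list has a load feeding the address of the next load, which is the a->b->c->d->e pattern in loop form.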
Going to what you said about recomputing things, though... one positive side is that algorithms HAVE gotten more computationally intense. That is, the amount of math you have to do on each bit has gone up, such that the ratio of compute to bandwidth helps hide the growing gap between CPU speed and memory speed.
--
Program Intellivision and play Space Patrol!
Blackfin
There is no latency on the Blackfin for L1 reads. Carefully relocating code and data to it can produce spectacular results.
Not really
Blackfin has a 3-cycle load latency. Their assembly syntax hides it, but it's there. Address generation is in stage 5 of the pipeline and loaded data is available for use in stage 8. If you try to chase down a linked list without sufficient delay between your load instructions, you'll incur a bunch of stalls. Same with IIR computation. Take a look at slide 12:
http://www.analog.com/processors/pdf/bold/Prog_Opt_C_Code_on_Blackfin_slides.pdf
Judging from the pipeline, it looks like things could get really ugly if you had a store with a subsequent dependent load. I don't have one so I can't measure it.
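For the IIR case, the problem is the same shape as the linked list: each output needs the previous output. Here's a minimal first-order sketch in integer C. The function name iir1 and the divide-by-two coefficient are picked purely for illustration, and it says nothing about actual Blackfin cycle counts.

```c
#include <stddef.h>

/* First-order recursive filter in integer arithmetic:
   y[n] = x[n] + y[n-1] / 2, with y[-1] assumed to be zero.
   The coefficient is arbitrary, chosen only for illustration. */
static void iir1(const int *x, int *y, size_t count)
{
    int prev = 0;                 /* running state, y[n-1] */
    for (size_t n = 0; n < count; n++) {
        prev = x[n] + prev / 2;   /* needs the previous iteration's result */
        y[n] = prev;
    }
}
```

The state lives in a local here instead of being reread from y[n-1]. That at least avoids writing a store followed by a dependent load into the source, though the recurrence itself is still there and a compiler may well make the same choice on its own.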