Consider the above example once again. Of the five instructions, two are branches, and one of those is an unconditional branch. If it were possible to somehow tag the mov instructions to tell them to execute only under some conditions, the code could be simplified…
Figure: The instruction flow of a VLIW processor.
SIMD vector instructions (MMX/SSE/AVX, AltiVec, NEON)
At the beginning of each clock cycle, the data and control information for a partially processed instruction is held in a pipeline latch, and this information forms the inputs to the logic circuits of the next pipeline stage. During the clock cycle, the signals propagate through the combinatorial logic of the stage, producing an output just in time to be captured by the next pipeline latch at the end of the clock cycle…
Older generations of processors were even worse, because their memory controllers weren't on-chip, but rather were part of the chipset on the motherboard, adding further bus cycles for the transfer of the address and data between the processor and that motherboard chipset. This was also at a time when the memory bus ran at a much lower clock speed, so those bus cycles often added many more CPU cycles to the total. Some processors attempted to mitigate this issue by increasing the speed of their front-side bus (FSB) between the processor and the chipset, using QDR or DDR signalling. A far better approach, used by all modern processors, is to integrate the memory controller directly into the processor chip, which allows those bus cycles to be converted into much faster CPU cycles instead. UltraSPARC IIi and Athlon were the first mainstream processors to do this, while Intel was late to the party and only integrated the memory controller into their CPU chips starting with the Core i series.
Unfortunately, both DDR SDRAM memory and on-chip memory controllers are only able to do so much, and memory latency continues to be a major problem. This problem of the large, and slowly growing, gap between the processor and main memory is called the memory wall. It was, at one time, the single most important problem facing processor architects, although today the problem has eased considerably because processor clock speeds are no longer climbing at the rate they previously did, due to power and heat constraints (the power wall).
Up to a point this increase in power is okay, but at a certain point the power and heat problems become unmanageable, because it's simply not possible to provide that much power and cooling to a silicon chip in any practical fashion, even if the circuits could, in fact, operate at higher clock speeds. This is called the power wall.
Designing an Alpha Microprocessor: a fascinating look at what really goes on in the various stages of a project to make a new processor.
Since the result from each instruction is available after the execute stage has completed, the next instruction ought to be able to use that value immediately, rather than waiting for that result to be committed to its destination register in the writeback stage. To allow this, forwarding lines called bypasses are added, going backwards along the pipeline…
Modern processors devote ever more hardware to branch prediction in an attempt to raise the prediction accuracy even further, and reduce this cost. Many record each branch's direction not just in isolation, but in the context of the couple of branches leading up to it, which is called a two-level adaptive predictor. Some keep a more global branch history, rather than a separate history for each individual branch, in an attempt to detect any correlations between branches even if they're relatively far away in the code. That's called a gshare or gselect predictor. The most advanced modern processors often implement several branch predictors and select between them based on which one seems to be working best for each individual branch!
If additional independent instructions aren't available within the program being executed, there is another potential source of independent instructions: other running programs, or other threads within the same program. Simultaneous multithreading (SMT) is a processor design technique which exploits exactly this type of thread-level parallelism.
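As a rough illustration of the global-history idea, here is a minimal sketch of a gshare-style predictor. The table size, the 2-bit saturating counters and the branch address `0x40` are assumptions chosen for the example, not details taken from any particular processor.

```python
class GsharePredictor:
    """Toy gshare predictor: the global branch history is XORed with the
    branch address to pick one of a table of 2-bit saturating counters."""

    def __init__(self, index_bits=10):  # table size is an assumption
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # start at "weakly not taken"
        self.history = 0  # most recent branch outcomes, one bit each

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, trace):
    """Fraction of branches in the trace that were predicted correctly."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# A hypothetical branch that strictly alternates taken / not taken.
alternating = [(0x40, i % 2 == 0) for i in range(1000)]
gshare_accuracy = accuracy(GsharePredictor(), alternating)
```

A single 2-bit counter with no history cannot learn a strictly alternating branch, whereas here the history bits steer the two cases to different counters, so the predictor settles quickly after a short warm-up.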
AMD's Bulldozer Microarchitecture: the novel resource-sharing approach used in AMD's latest processor designs, blurring the line between SMT and
WARNING: This article is meant to be informal and fun!
For modern L1 data caches, the load latency is usually just a few cycles, depending on the processor's general clock speed, but occasionally shorter (on UltraSPARC III/IV thanks to clockless wave pipelining, and in earlier processors due to their slower clock speeds and shorter pipelines, where a clock cycle was more real time). Increasing the load latency by a single cycle can seem like a minor change but is actually a serious hit to performance, and is something rarely noticed or understood by end users. For normal, everyday pointer-chasing code, a processor's load latency is a major factor in real-world performance.
Today, with several billion transistors now available thanks to Moore's Law, even aggressively brainiac designs can have quite a lot of cores. Intel's Xeon Haswell-EP, the server version of Core i* Haswell, uses billions of transistors to provide more cores than Xeon Sandy Bridge-EP, each a very aggressively brainiac design that issues more instructions per cycle than Sandy Bridge and still has SMT. IBM's POWER line likewise uses billions of transistors to move to a considerably more brainiac core design than its predecessor, and at the same time provide more cores, each with more SMT threads than before. Of course, whether such large, brainiac core designs are an efficient use of all those transistors is a separate question.
Intel's Sandy Bridge Microarchitecture: the previous Intel x86 processor design,
From a compiler's point of view, typical latencies in modern processors range from a single cycle for integer operations, to a few cycles for floating-point addition and the same or perhaps slightly longer for multiplication, through to over a dozen cycles for integer division.
Figure: The instruction flow of a sequential processor.
Things CPU Architects Need To Think About (video): an interesting
In many situations, especially in imaging, video and similar applications, a program needs to execute the same instruction for a small group of related values, usually a short vector (a simple structure or small array). For example, an image-processing application might want to add groups of fixed-width numbers, where each number represents one of the red, green, blue or alpha (transparency) values of a pixel…
Haswell, largely based on the previous Sandy Bridge design.
When it comes to memory systems, there are often subtle tradeoffs between latency and bandwidth. Lower-latency designs will be better for pointer-chasing code, such as compilers and database systems, whereas bandwidth-oriented systems have the advantage for programs with simple, linear access patterns, such as image processing and scientific code. Of course, it's reasonably easy to increase bandwidth: simply adding more memory banks and making the busses wider can easily double or quadruple bandwidth. In fact, many high-end systems do this to increase their performance, but it comes with downsides as well. In particular, wider busses mean a more expensive motherboard, restrictions on the way RAM can be added to a system (install in pairs or groups of four) and a higher minimum RAM configuration.
Given the multi-core performance-per-area efficiency of small cores, but the maximum outright single-threaded performance of large cores, perhaps in the future we might see asymmetric designs, with one or two big, wide, brainiac cores plus a large number of smaller, narrower, simpler cores. In many ways, such a design makes the most sense: highly parallel programs would benefit from the many small cores more than a few large ones, but single-threaded, sequential programs want the might of at least one large, wide, brainiac core, even if it does take four times the area to provide only twice the single-threaded performance.
The first issue that must be cleared up is the difference between clock speed and a processor's performance. They are not the same thing. Look at the results for processors of a few years ago…
What's happening here is exactly the same operation as a full-width addition, except that the carry out of each element is not being propagated into the next. Also, it might be desirable for the values not to wrap to zero once all bits are full, and instead to hold at the maximum value in those cases (called saturation arithmetic). In other words, each such carry is not carried across but instead triggers an all-ones result. So, the vector addition operation shown above is really just a modified full-width add.
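The following sketch shows the "modified add" idea in software. The lane width (8 bits) and lane count (four lanes in one 32-bit word) are assumptions for illustration; real SIMD hardware does this in a single instruction rather than a loop.

```python
LANE_BITS = 8                      # assumed width of each packed element
LANES = 4                          # assumed number of elements per word
LANE_MAX = (1 << LANE_BITS) - 1    # all-ones value for one lane


def packed_add_wrap(a, b):
    """Add lane by lane; the carry out of each lane is simply dropped."""
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        x = (a >> shift) & LANE_MAX
        y = (b >> shift) & LANE_MAX
        result |= ((x + y) & LANE_MAX) << shift
    return result


def packed_add_saturate(a, b):
    """Add lane by lane; a carry out of a lane clamps that lane to all-ones."""
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        x = (a >> shift) & LANE_MAX
        y = (b >> shift) & LANE_MAX
        result |= min(x + y, LANE_MAX) << shift
    return result
```

For the inputs `0x10F0FF01` and `0x01200102`, the two middle lanes overflow: the wrapping version gives `0x11100003`, while the saturating version holds those lanes at the maximum and gives `0x11FFFF03`.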
As already mentioned, the approach of exploiting instruction-level parallelism through superscalar execution is seriously weakened by the fact that most normal programs just don't have a lot of fine-grained parallelism in them. Because of this, even the most aggressively brainiac OOO superscalar processor, coupled with a smart and aggressive compiler to spoon feed it, will still almost never exceed an average of just a few instructions per cycle when running most mainstream, real-world software, due to a combination of load latencies, cache misses, branching and dependencies between instructions. Issuing many instructions in the same cycle only ever happens for short bursts of a few cycles at most, separated by many cycles of executing low-ILP code, so peak performance is not even close to being achieved.
The early RISC processors, such as IBM's research prototype, the original MIPS (based on the Stanford MIPS machine) and the original SPARC (derived from the Berkeley RISC project), all implemented a simple multi-stage pipeline not unlike the one shown above. At the same time, the mainstream CISC processors, such as VAX, worked largely sequentially: it's much easier to pipeline a RISC because its reduced instruction set means the instructions are mostly simple register-to-register operations, unlike the complex instruction sets of x86, m68k or VAX. As a result, a pipelined SPARC was way faster than a contemporary sequential CISC processor. Every processor since then has been pipelined, at least to some extent. A good summary of the original RISC research projects can be found in the CACM article by David Patterson.
The x86 processors generally have deeper pipelines than the RISCs of comparable era because they need to do extra work to decode the complex x86 instructions (more on this later). The UltraSPARC T-series (Niagara) processors are a recent exception to the deep-pipeline trend, with short pipelines to keep those cores as small as possible (more on this later, too).
In addition to instruction-level parallelism and thread-level parallelism, there is yet another source of parallelism in many programs: data parallelism. Rather than looking for ways to execute groups of instructions in parallel, the idea is to look for ways to make one instruction apply to a group of data values in parallel.
Of course, there's no reason to stop at that width. If there happen to be some wider registers, which architectures usually have for floating-point at least, they could be used to provide wider vectors, thereby doubling the parallelism (SPARC VIS and x86 MMX did this). If it is possible to define entirely new registers, then they might as well be even wider: x86 SSE added new, wider registers, later increased in number and then widened again with AVX, while POWER/PowerPC AltiVec provided a full set of new vector registers from the start (in keeping with POWER/PowerPC's more separated design, where even the branch instructions have their own registers). An alternative to widening the registers is to use pairing, where each pair of registers is treated as a single operand by the SIMD vector instructions. ARM NEON does this, with its registers usable either individually or as wider pairs.
On a superscalar processor, the instruction flow looks something like…
The decoupling of x86 instruction fetch and decode from internal RISC-like μop instruction dispatch and execution also makes defining the width of a modern x86 processor a bit tricky, and it gets even more unclear because internally such processors often group or "fuse" μops into common pairs where possible for ease of tracking (such as load-and-add or compare-and-branch). A processor such as Core i*/i* Haswell/Broadwell, for example, decodes several x86 instructions per cycle into fused μops, which are then stored in a μop cache, from which fused μops are fetched, then register-renamed and placed into a reorder buffer, from which unfused individual μops are issued to the functional units, where they proceed down the various pipelines until they complete, whereupon fused μops can be committed and retired. Each of these steps has its own maximum rate per cycle. So what does that make the width of Haswell/Broadwell? Even experts disagree on exactly what to call the width of such a design. It could be counted in unfused μops, since an unfused μop is the most direct equivalent of a simple RISC instruction, and the maximum number can be fetched, issued and completed per cycle if they're paired/fused in just the right way. It could equally be counted in fused μops, which is what the processor mostly thinks in terms of for tracking purposes, or in original x86 instructions. Of course, this width-labelling conundrum is largely academic, since no processor is likely to actually sustain such high levels of ILP when running real-world code anyway.
Since the execute stage of the pipeline is really a bunch of different functional units, each doing its own task, it seems tempting to try to execute multiple instructions in parallel, each in its own functional unit. To do this, the fetch and decode/dispatch stages must be enhanced so they can decode multiple instructions in parallel and send them out to the execution resources…
The sizes and speeds of the various levels of cache in modern processors are absolutely crucial to performance. The most important by far are the primary L1 data cache (D-cache) and L1 instruction cache (I-cache). Some processors go for small L1 caches (the Prescott-generation Pentium, Scorpion and Krait, with earlier Pentium models and the UltraSPARC T-series smaller still), most have settled on a mid-range size as the sweet spot, and a few are larger (Athlon, Athlon/Phenom, UltraSPARC III/IV, Apple A-series) or occasionally larger again (the I-cache of Denver, which is paired with a smaller D-cache).
Consider how an instruction is executed: first it is fetched, then decoded, then executed by the appropriate functional unit, and finally the result is written into place. With this scheme, a simple processor might take several cycles per instruction (CPI)…
Figure: The instruction flow of a pipelined processor.
Now, given that both instruction-level parallelism and thread-level parallelism suffer from diminishing returns in different ways, and remembering that SMT is essentially a way to convert TLP into ILP, but also remembering that wide superscalar designs scale very non-linearly in terms of chip area and design complexity, and power usage, the obvious question is: where is the sweet spot? How wide should the cores be made to reach a good balance between ILP and TLP? Right now, many different approaches are being explored…
From the hardware point of view, a cache works like a two-column table: one column is the memory address and the other is the block of data values (remember that each cache line is a whole block of data, not just a single value). Of course, in reality the cache need only store the necessary higher-end part of the address, since lookups work by using the lower part of the address to index the cache. When the higher part, called the tag, matches the tag stored in the table, this is a hit and the appropriate piece of data can be sent to the processor core…
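The index-and-tag lookup described here can be sketched as a tiny direct-mapped cache model. The line size and number of lines are assumptions for the example, and only the tags are modelled (no data is stored).

```python
LINE_BYTES = 64    # assumed cache line size
NUM_LINES = 256    # assumed number of lines in the cache


class DirectMappedCache:
    """Toy direct-mapped cache that tracks tags only."""

    def __init__(self):
        self.tags = [None] * NUM_LINES   # one stored tag per table row

    def access(self, address):
        """Return True on a hit; on a miss, fill the line and return False."""
        block = address // LINE_BYTES    # which memory block holds the address
        index = block % NUM_LINES        # lower part of the address picks the row
        tag = block // NUM_LINES         # higher part is what gets compared
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag           # replace whatever was in that row
        return False
```

Two addresses whose blocks are exactly `NUM_LINES` apart share an index but differ in tag, so in this model they keep evicting each other even when the rest of the cache is empty.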
Which is the better approach? Alas, there's no simple answer here; once again it's going to depend very much on the applications. For applications with lots of active but memory-latency-limited threads (database systems, 3D graphics rendering), more simple cores would be better because big/wide cores would spend most of their time waiting for memory anyway. For most applications, however, there simply are not enough threads active to make this viable, and the performance of just a single thread is much more important, so a design with fewer but bigger, wider, more brainiac cores is more appropriate (at least for today's applications).
From the hardware point of view, each pipeline stage consists of some combinatorial logic and possibly access to a register set and/or some form of high-speed cache memory. The pipeline stages are separated by latches. A common clock signal synchronizes the latches between each stage, so that all the latches capture the results produced by the pipeline stages at the same time. In effect, the clock pumps instructions down the pipeline.
Figure: A typical modern SoC (NVIDIA Tegra).
Since memory is transferred in blocks, and since cache misses are an urgent "show stopper" type of event with the potential to halt the processor in its tracks (or at least severely hamper its progress), the speed of those block transfers from memory is critical. The transfer rate of a memory system is called its bandwidth. But how is that different from latency?
A cache which allows data to occupy one of several locations based on its address is called set-associative, and is described by the number of possible locations (ways) it allows for any given piece of data. Set-associative caches work much like direct-mapped ones, except there are several tables, all indexed in parallel, and the tags from each table are compared to see whether there is a match for any one of them…
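A minimal sketch of a set-associative lookup follows. The geometry defaults and the least-recently-used replacement policy are assumptions made for the example; the text above does not say which replacement policy real caches use.

```python
class SetAssociativeCache:
    """Toy set-associative cache that tracks tags only, with LRU replacement."""

    def __init__(self, num_sets=128, ways=2, line_bytes=64):  # assumed geometry
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One short list of tags per set, ordered least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, address):
        """Return True on a hit; on a miss, fill a way and return False."""
        block = address // self.line_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:             # hardware compares all ways in parallel
            lines.remove(tag)
            lines.append(tag)        # mark as most recently used
            return True
        if len(lines) == self.ways:
            lines.pop(0)             # evict the least recently used tag
        lines.append(tag)
        return False
```

With two ways, two addresses that map to the same set can both stay resident; only a third conflicting address forces an eviction.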
A brief, pulls-no-punches, fast-paced introduction to the main design aspects of modern processor microarchitecture.
Table: Pipeline depths of common processors.
In cases where backward compatibility is not an issue, it is possible for the instruction set itself to be designed to explicitly group instructions to be executed in parallel. This approach eliminates the need for complex dependency-checking logic in the dispatch stage, which should make the processor easier to design, smaller, and easier to ramp up the clock speed over time (at least in theory).
Modern processors overlap these stages in a pipeline, like an assembly line. While one instruction is executing, the next instruction is being decoded, and the one after that is being fetched…
The Alpha architects in particular liked this idea, which is why the early Alphas had deep pipelines and ran at such high clock speeds for their era. Today, modern processors strive to keep the number of gate delays down to just a handful for each pipeline stage (that is gates deep, not total!), plus a few more for the latch itself, and most have quite deep pipelines…
Some of the more recent x86 processors even store the translated μops in a small buffer, or even a dedicated μop instruction cache, to avoid having to re-translate the same x86 instructions over and over again during loops, saving both time and power. That's why, for example, the pipeline depth of Core i*/i* Sandy/Ivy Bridge and Core i*/i* Haswell/Broadwell was shown as a pair of figures in the earlier section on superpipelining: the shorter depth applies when the processor is running from its μop cache (which is the common case), and the longer depth applies when running from the L1 instruction cache and having to translate the x86 instructions into μops.
Given this new predicated move instruction, two instructions have been eliminated from the code, and both were costly branches. In addition, by being clever and always doing the first mov then overwriting it if necessary, the parallelism of the code has also been increased: two of the lines can now be executed in parallel, resulting in a speedup (fewer cycles overall). Most importantly, though, the possibility of getting the branch prediction wrong and suffering a large mispredict penalty has been eliminated.
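The trick of "always do the first move, then overwrite it if necessary" can be modelled in a few lines. This is only a sketch of the semantics: the function and variable names are hypothetical, and Python itself still branches internally, so it says nothing about real timing.

```python
def select_with_branches(a, c, d, threshold):
    """Branchy form: a compare, a conditional branch, and two alternative moves."""
    if a > threshold:
        b = c
    else:
        b = d
    return b


def cmov(condition, old_value, new_value):
    """Model of a predicated move: it always executes, but only
    replaces the destination when the condition holds."""
    return new_value if condition else old_value


def select_with_predication(a, c, d, threshold):
    """Predicated form: no control flow in the instruction stream."""
    flag = a > threshold       # compare
    b = d                      # unconditional move, independent of the compare
    b = cmov(flag, b, c)       # predicated move overwrites b only if flag is set
    return b
```

The compare and the unconditional move do not depend on each other, which is what lets a superscalar processor execute them in the same cycle.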
If the first instruction was a simple integer addition then this might still be okay in a pipelined single-issue processor, because integer addition is quick and the result of the first instruction would be available just in time to feed it back into the next instruction (using bypasses). However in the case of a multiply, which will take several cycles to complete, there is no way the result of the first instruction will be available when the second instruction reaches the execute stage just one cycle later. So, the processor will need to stall the execution of the second instruction until its data is available, inserting a bubble into the pipeline where no work gets done.
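To see where bubbles come from, here is a small sketch of an in-order, single-issue timing model with bypassing. The register names and the 3-cycle multiply latency are assumptions for illustration only.

```python
def issue_cycles(instructions):
    """Return the cycle in which each instruction enters the execute stage.

    Each instruction is (dest, sources, latency). The model is in-order and
    single-issue, and a result becomes usable `latency` cycles after its
    instruction starts executing (bypasses are assumed).
    """
    ready = {}        # register name -> cycle at which its value is available
    next_free = 0     # earliest cycle the next instruction could start
    issued = []
    for dest, sources, latency in instructions:
        start = max([next_free] + [ready.get(s, 0) for s in sources])
        issued.append(start)
        ready[dest] = start + latency
        next_free = start + 1   # in-order: later instructions cannot start earlier
    return issued


# Dependent pair where the first instruction is a 1-cycle add.
add_then_use = [("a", ("b", "c"), 1), ("d", ("a", "e"), 1)]

# Same pair, but the first instruction is a multiply (latency assumed to be 3).
mul_then_use = [("a", ("b", "c"), 3), ("d", ("a", "e"), 1)]
```

With the add, the dependent instruction starts in the very next cycle. With the multiply it starts in cycle 3 instead of cycle 1, so two cycles of the execute stage are bubbles.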
This is great! There are now multiple instructions completing every cycle, so CPI falls below one (equivalently, IPC rises above one, also written as ILP for instruction-level parallelism). The number of instructions able to be issued, executed or completed per cycle is called a processor's width.
Processors which focused too much on clock speed, such as certain Pentium and IBM POWER designs and most recently AMD's Bulldozer/Piledriver, quickly hit the power wall and found themselves unable to push the clock speed as high as originally hoped, resulting in them being beaten by slower-clocked but smarter processors which exploited more instruction-level parallelism.
As mentioned earlier, latency is a big problem for pipelined processors, and latency is especially bad for loads from memory, which make up about a quarter of all instructions.
Thus, going purely for clock speed is not the best strategy. And of course, this is even more true for portable, mobile devices, such as laptops, tablets and phones, where the power wall hits much sooner, with progressively smaller power budgets for laptops, tablets and phones, due to the constraints of battery capacity and limited, often fanless cooling.
In particular, you might not be aware of some key topics that developed rapidly in recent times…
Table: The memory hierarchy of a modern phone (an Apple A-series chip in an iPhone).
Once again, the idea is to fill those empty bubbles in the pipelines with useful instructions, but this time rather than using instructions from further down in the same code (which are hard to come by), the instructions come from multiple threads running at the same time, all on the one processor core. So, an SMT processor appears to the rest of the system as if it were multiple independent processors, just like a true multiprocessor system.
To address this problem, more sophisticated caches are able to place data in a small number of different places within the cache, rather than just a single place. The number of places a piece of data can be stored in a cache is called its associativity. The word "associativity" comes from the fact that cache lookups work by association; that is, a particular address in memory is associated with a particular location in the cache (or set of locations for a set-associative cache).
So the processor must make a guess. The processor will then fetch down the path it guessed and speculatively begin executing those instructions. Of course, it won't be able to actually commit (writeback) those instructions until the outcome of the branch is known. Worse, if the guess is wrong the instructions will have to be cancelled, and those cycles will have been wasted. But if the guess is correct, the processor will be able to continue on at full speed.
Of course, there are also a whole range of options between these two extremes that have yet to be fully explored. IBM's POWER, for example, was of the same generation, with a comparable transistor budget, and used it to take the middle ground with a multi-core, SMT design with moderately but not overly aggressive OOO execution hardware. AMD's Bulldozer design used a more innovative approach, with a shared, SMT-style front-end for each pair of cores, feeding a back-end with unshared, multi-core-style integer execution units but shared, SMT-style floating-point units, blurring the line between SMT and multi-core.
Of course, the processor must also keep track of which instructions and which rename registers belong to which threads at any given point in time, but it turns out this only adds a small amount to the complexity of the core logic. So, for the relatively cheap design cost of a modest amount more logic in the core, and an almost negligible increase in total transistor count and final production cost, the processor can execute several threads simultaneously, hopefully resulting in a substantial increase in functional-unit utilization and instructions per cycle, and thus overall performance.
A VLIW processor's instruction flow is much like a superscalar, except the decode/dispatch stage is much simpler and only occurs for each group of sub-instructions…
In most modern processors, the instruction cache can afford to be highly set-associative because its latency is hidden somewhat by the fetching and buffering of the early stages of the processor's pipeline. The data cache, on the other hand, is usually set-associative to some degree, but often not overly so, to minimize the all-important load latency. Most processors have settled on a moderate degree of associativity as the sweet spot, but a few are less associative (Athlon, Athlon/Phenom, some PowerPC G-series and Cortex-A designs), and a handful are more associative (some PowerPC G-series designs, Pentium M and its Core descendants). As the last resort before heading off to far-away main memory, the large lower-level cache (sometimes called LLC for "last-level cache") is also usually highly associative, although external E-cache is sometimes direct-mapped for flexibility of size and implementation.
There are two ways to do this. One approach is to do the reordering in hardware at runtime. Doing dynamic instruction scheduling (reordering) in the processor means the dispatch logic must be enhanced to look at groups of instructions and dispatch them out of order as best it can to use the processor's functional units. Not surprisingly, this is called out-of-order execution, or just OOO for short (sometimes written OoO or OOE).
The number of cycles between when an instruction reaches the execute stage and when its result is available for use by other instructions is called the instruction's latency. The deeper the pipeline, the more stages and thus the longer the latency. So a very deep pipeline is not much more effective than a short one, because a deep one just gets filled up with bubbles thanks to all those nasty instructions depending on each other.
Given SMT's ability to convert thread-level parallelism into instruction-level parallelism, coupled with the advantage of better single-thread performance for particularly ILP-friendly code, you might now be asking why anyone would ever build a multi-core processor when an equally wide (in total) SMT design would be superior.
The instruction flow of an SMT processor looks something like…
If the processor is going to execute instructions out of order, it will need to keep in mind the dependencies between those instructions. This can be made easier by not dealing with the raw architecturally-defined registers, but instead using a set of renamed registers. For example, a store of a register into memory, followed by a load of some other piece of memory into the same register, represent different values and need not go into the same physical register. Furthermore, if these different instructions are mapped to different physical registers they can be executed in parallel, which is the whole point of OOO execution. So, the processor must keep a mapping of the instructions in flight at any moment and the physical registers they use. This process is called register renaming. As an added bonus, it becomes possible to work with a potentially larger set of real registers in an attempt to extract even more parallelism out of the code.
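A minimal sketch of the renaming idea: every write to an architectural register gets a fresh physical register, and every read uses whichever physical register currently holds that architectural name. The register names and the unlimited supply of physical registers are simplifying assumptions; real hardware has a finite pool and recycles registers as instructions retire.

```python
from itertools import count


def rename(instructions):
    """Rename a list of (dest, sources) instructions.

    Returns a list of (physical_dest, physical_sources) tuples.
    """
    fresh = (f"p{i}" for i in count())   # assumed unlimited physical registers
    mapping = {}                          # architectural name -> physical name
    renamed = []
    for dest, sources in instructions:
        phys_sources = []
        for s in sources:
            if s not in mapping:          # value that existed before this code
                mapping[s] = next(fresh)
            phys_sources.append(mapping[s])
        mapping[dest] = next(fresh)       # each write gets a new physical register
        renamed.append((mapping[dest], tuple(phys_sources)))
    return renamed


# Hypothetical sequence that reuses r1 for two unrelated values.
program = [
    ("r1", ("r2", "r3")),   # r1 = r2 op r3
    ("r4", ("r1",)),        # r4 = f(r1)
    ("r1", ("r5", "r6")),   # r1 reused for an unrelated value
    ("r7", ("r1",)),        # r7 = f(new r1)
]
renamed_program = rename(program)
```

After renaming, the two writes to `r1` land in different physical registers, so the second pair of instructions no longer has to wait for the first pair merely because the register name was reused.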
It can be confusing when the word latency is used for related, but different, meanings. Here, I'm talking about the latency as seen by a compiler. Some hardware engineers may think of latency as the number of cycles required for execution (the number of pipeline stages). So a hardware engineer might say the instructions in a simple integer pipeline have a latency equal to the pipeline depth but a throughput of one per cycle, whereas from a compiler's point of view they have a latency of one cycle because their results are available for use in the very next cycle. The compiler view is the more common, and is generally used even in hardware manuals.
If you want more detail on the specifics of recent processor designs, and something more insightful than the raw technical manuals, here are a few good articles…
SMT performance is a tricky business. First, the whole idea of SMT is built around the assumption that either lots of programs are simultaneously executing (not just sitting idle), or if just one program is running, it has lots of threads all executing at the same time. Experience with existing multiprocessor systems shows this isn't always true. In practice, at least for desktops, laptops, tablets, phones and small servers, it is rarely the case that several different programs are actively executing at the same time, so it usually comes down to just the one task the machine is currently being used for.
Other than the simplification of the dispatch logic, VLIW processors are much like superscalar processors. This is especially so from a compiler's point of view (more on this later).
Conditional branches are so problematic that it would be nice to eliminate them altogether. Clearly, if statements cannot be eliminated from programming languages, so how can the resulting branches possibly be eliminated? The answer lies in the way some branches are used.
Note that although a DDR SDRAM memory system transfers data on both the rising and falling edges of the clock signal (ie: at double data rate), the true clock speed of the memory system bus is only half that, and it is the bus clock speed which applies for control signals. So the latency of a DDR memory system is the same as a non-DDR system, even though the bandwidth is doubled (more on the difference between bandwidth and latency later).
The second instruction depends on the first: the processor can't execute the second instruction until after the first has completed calculating its result. This is a serious problem, because instructions that depend on each other cannot be executed in parallel. Thus, multiple issue is impossible in this case.
For applications where this type of data parallelism is available and easy to extract, SIMD vector instructions can produce amazing speedups. The original target applications were primarily in the area of image and video processing, however suitable applications also include audio processing, speech recognition, some parts of 3D graphics rendering and many types of scientific code. For other types of software, such as compilers and database systems, the speedup is generally much smaller, perhaps even nothing at all.
Since the clock speed is limited by (among other things) the length of the longest, slowest stage in the pipeline, the logic gates that make up each stage can be subdivided, especially the longer ones, converting the pipeline into a deeper superpipeline with a larger number of shorter stages. Then the whole processor can be run at a higher clock speed! Of course, each instruction will now take more cycles to complete (latency), but the processor will still be completing one instruction per cycle (throughput), and there will be more cycles per second, so the processor will complete more instructions per second (actual performance)…
Todays robots are very primitive, capable of understanding only a
Of course, a true multiprocessor system also executes multiple threads simultaneously but only one in each processor. This is also true formulticoreprocessors, which place two or more processor cores onto a single chip, but are otherwise no different from traditional multiprocessor systems. In contrast, an SMT processor uses just onephysicalprocessor core to present two or morelogicalprocessors to the system. This makes SMT much more efficient than amulticoreprocessor in terms of chip space, brication cost, power usage and heat dissipation. And of course theres nothing preventing amulticoreimplementation where each core is an SMT design.
Modern processors solve the problem of the memory wall withcaches. A cache is a small but st of memory located on or near the processor chip. Its role is to keep copies of small pieces of main memory. When the processor asks for a particular piece of main memory, the cache can supply it much more quickly than main memory would be able to if the data is in the cache.
As described above, the st and stest caches allow for only one place in the cache for each address in memory each piece of data is simply mapped toaddress within the cache by simply looking at the lower bits of the address as in the above diagram. This is called adirectmappedcache. Any two locations in memory whose addresses are the same for the lower address bits will map to the same cache line in a directmapped cache, causing a cache conflict.
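To make the direct-mapped lookup concrete, here is a small Python sketch of how an address might be split into tag, index and offset. The line size and number of lines are assumed values chosen for illustration only, and `split_address` is a hypothetical helper name.

```python
LINE_SIZE = 64    # bytes per cache line (assumed for illustration)
NUM_LINES = 512   # lines in the cache (assumed), giving a 32 KB cache


def split_address(addr):
    """Split an address into (tag, index, offset) for a direct-mapped cache."""
    offset = addr % LINE_SIZE                 # byte within the line
    index = (addr // LINE_SIZE) % NUM_LINES   # lower bits pick the cache line
    tag = addr // (LINE_SIZE * NUM_LINES)     # remaining upper bits
    return tag, index, offset


a = 0x12340
b = a + LINE_SIZE * NUM_LINES   # exactly one cache-size further on

print(split_address(a))
print(split_address(b))
```

The two addresses differ only in their upper (tag) bits, so they land on the same index and would evict each other in a direct-mapped cache.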
Another key problem for pipelining is branches. Consider the following code sequence…
The widths of modern processors vary considerably…
When it comes to the brainiac debate, many vendors have gone down one path then changed their mind and switched to the other side…
The exact number and type of functional units in each processor depends on its target market. Some processors have more floating-point execution resources (IBM's POWER line), others are more integer-biased (Pentium Pro/II/III/M), some devote much of their resources to SIMD vector instructions (PowerPC G4/G4e), while most try to take the balanced middle ground.
Figure A pipelined microarchitecture in more detail.
the only competitor to ever really challenge Intel's dominance in the world of x86 processors.
In addition to the reduction in effective latency, there is also a substantial increase in bandwidth, because in an SDRAM memory system, multiple memory requests can be outstanding at any one time, all being processed in a highly efficient, fully pipelined fashion. Pipelining of the memory system has dramatic effects for memory bandwidth: an SDRAM memory system generally provided double or triple the sustained memory bandwidth of an asynchronous memory system of the same era, even though the latency of the SDRAM system was only slightly lower, and the same underlying memory-cell technology was in use (and still is).
Unfortunately, even the best branch prediction techniques are sometimes wrong, and with a deep pipeline many instructions might need to be cancelled. This is called the mispredict penalty. The Pentium Pro/II/III was a good example: it had a deep pipeline and thus a high mispredict penalty. Even with a clever dynamic branch predictor that predicted correctly an impressive proportion of the time, this high mispredict penalty meant a large share of the Pentium Pro/II/III's performance was lost due to mispredictions. Put another way, one third of the time the Pentium Pro/II/III was not doing useful work, but instead was saying "oops, wrong way".
Fear not! This article will get you up to speed fast. In no time, you'll be discussing the finer points of in-order vs out-of-order, hyper-threading, multi-core and cache organization like a pro.
All of this dependency analysis, register renaming and OOO execution adds a lot of complex logic to the processor, making it harder to design, larger in terms of chip area, and more power-hungry. The extra logic is particularly power-hungry because those transistors are always working, unlike the functional units which spend at least some of their time idle (possibly even powered down). On the other hand, out-of-order execution offers the advantage that software need not be recompiled to get at least some of the benefits of the new processor's design, though typically not all.
multi-core and simultaneous multi-threading (SMT),
In the above example, the processor could potentially issue several different instructions per cycle, for example one integer, one floating-point and one memory instruction. Even more functional units could be added, so that the processor might be able to execute multiple integer instructions per cycle, or multiple floating-point instructions, or whatever the target applications could best use.
It is possible to use either the physical address or the virtual address to do the cache lookup. Each has pros and cons (like everything else in computing). Using the virtual address might cause problems because different programs use the same virtual addresses to map to different physical addresses, so the cache might need to be flushed on every context switch. On the other hand, using the physical address means the virtual-to-physical mapping must be performed as part of the cache lookup, making every lookup slower. A common trick is to use virtual addresses for the cache indexing but physical addresses for the tags. The virtual-to-physical mapping (TLB lookup) can then be performed in parallel with the cache indexing so that it will be ready in time for the tag comparison. Such a scheme is called a virtually-indexed physically-tagged cache.
IBM's Cell processor (used in the Sony PlayStation 3) was arguably the first such design, but unfortunately it suffered from severe programmability problems because the small, simple cores in Cell were not instruction-set compatible with the large main core, and only had limited, awkward access to main memory, making them more like special-purpose coprocessors than general-purpose CPU cores. Some modern ARM designs also use an asymmetric approach, with several large cores paired with one or a few smaller, simpler companion cores, not for maximum multi-core performance, but so the large, power-hungry cores can be powered down if the phone or tablet is only being lightly used, in order to increase battery life, a strategy ARM calls big.LITTLE.
Unfortunately, latency is much harder to improve than bandwidth. As the saying goes, "you can't bribe god". Even so, there have been some good improvements in effective memory latency in past years, chiefly in the form of synchronously clocked DRAM (SDRAM), which uses the same clock as the memory bus. The main benefit of SDRAM was that it allowed pipelining of the memory system, because the internal timing aspects and interleaved structure of SDRAM chip operation are exposed to the system and can thus be taken advantage of. This reduces effective latency because it allows a new memory access to be started before the current one has completed, thereby eliminating the small amounts of waiting time found in older asynchronous DRAM systems, which had to wait for the current access to complete before starting the next. On average, an asynchronous memory system had to wait for the transfer of half a cache line from the previous access before starting a new request, which was often several bus cycles (and we know how slow those are!).
Cache conflicts can cause pathological worst-case performance problems, because when a program repeatedly accesses two memory locations which happen to map to the same cache line, the cache must keep storing and loading from main memory and thus suffering the long main-memory latency on each access. This type of situation is called thrashing, since the cache is not achieving anything and is simply getting in the way. Despite obvious temporal locality and reuse of data, the cache is unable to exploit the locality offered by this particular access pattern due to limitations of its simplistic mapping between memory locations and cache lines.
On top of this, the fact that the threads in an SMT design are all sharing just one processor core, and just one set of caches, has major performance downsides compared to a true multiprocessor (or multi-core). Within the pipelines of an SMT processor, if one thread saturates just one functional unit which the other threads need, it effectively stalls all of the other threads, even if they only need relatively little use of that unit. Thus, balancing the progress of the threads becomes critical, and the most effective use of SMT is for applications with highly variable code mixtures, so the threads don't constantly compete for the same hardware resources. Also, competition between the threads for cache space may produce worse results than letting just one thread have all the cache space available, particularly for software where the critical working set is highly cache sensitive, such as hardware simulators/emulators, virtual machines and high-quality video encoding (with a large motion-estimation window).
The other alternative is to have the processor make the guess at runtime. Normally, this is done by using an on-chip branch prediction table containing the addresses of recent branches and a bit indicating whether each branch was taken or not last time. In reality, most processors actually use two bits, so that a single not-taken occurrence doesn't reverse a generally taken prediction (important for loop back edges). Of course, this dynamic branch prediction table takes up valuable space on the processor chip, but branch prediction is so important that it's well worth it.
The net result is that today, increasing a modern processor's clock speed by a relatively modest amount can take as much as double the power, and produce double the heat…
Figure A RISCy x86 decoupled microarchitecture.
Latencies for memory loads are particularly troublesome, in part because they tend to occur early within code sequences, which makes it difficult to fill their delays with useful instructions, and equally importantly because they are somewhat unpredictable: the load latency varies a lot depending on whether the access is a cache hit or not (we'll get to caches later).
Note that the issue width is less than the number of functional units; this is typical. There must be more functional units because different code sequences have different mixes of instructions. The idea is to fill every issue slot each cycle, but those instructions are not always going to be exactly one integer, one floating-point and one memory operation, so more functional units than issue slots are required.
While the original Pentium, a superscalar x86, was an amazing piece of engineering, it was clear the big problem was the complex and messy x86 instruction set. Complex addressing modes and a minimal number of registers meant few instructions could be executed in parallel due to potential dependencies. For the x86 camp to compete with the RISC architectures, they needed to find a way to "get around" the x86 instruction set.
Usually, set-associative caches are able to avoid the problems that occasionally occur with direct-mapped caches due to unfortunate cache conflicts. Adding even more ways allows even more conflicts to be avoided. Unfortunately, the more highly associative a cache is, the slower it is to access, because there are more operations to perform during each access. Even though the comparisons themselves are performed in parallel, additional logic is required to select the appropriate hit, if any, and the cache also needs to update the marker bits appropriately within each way. More chip area is also required, because relatively more of the cache's data is consumed by tag information rather than data blocks, and extra datapaths are needed to access each individual way of the cache in parallel. Any and all of these factors may negatively affect access time. Thus, a 2-way set-associative cache is slower but smarter than a direct-mapped cache, with 4-way and 8-way being slower but smarter again.
Instead, a cache usually only allows data from any particular address in memory to occupy one, or at most a handful, of locations within the cache. Thus, only one or a handful of checks are required during access, so access can be kept fast (which is the whole point of having a cache in the first place). This approach does have a downside, however: it means the cache doesn't store the absolutely best set of recently accessed data, because several different locations in memory will all map to the same one location in the cache. When two such memory locations are wanted at the same time, such a scenario is called a cache conflict.
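The two-bit scheme can be sketched as a saturating counter per branch. The following Python model is only an illustration under simplifying assumptions: the table is an unbounded dictionary keyed by branch address, new branches start in a "weakly taken" state, and the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: 0-1 predict not taken, 2-3 taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter value in 0..3

    def predict(self, branch_addr):
        # Unknown branches start at 2 ("weakly taken"), an assumed default.
        return self.counters.get(branch_addr, 2) >= 2

    def update(self, branch_addr, taken):
        c = self.counters.get(branch_addr, 2)
        # Move one step towards the actual outcome, saturating at 0 and 3.
        c = min(c + 1, 3) if taken else max(c - 1, 0)
        self.counters[branch_addr] = c


def count_correct(outcomes, branch_addr=0x400):
    """Run one branch through the predictor and count correct predictions."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) == taken:
            correct += 1
        predictor.update(branch_addr, taken)
    return correct


# A loop back edge: taken 9 times, then not taken once on exit, run 3 times.
loop_branch = ([True] * 9 + [False]) * 3
hits = count_correct(loop_branch)
print(hits, "of", len(loop_branch))
```

In this run only the three loop exits are mispredicted. A one-bit scheme would also mispredict the first iteration each time the loop is re-entered, which is exactly the case the second bit is there to avoid.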
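The effect of associativity on conflicts can be seen in a tiny simulation. This Python sketch assumes true least-recently-used replacement (real caches often approximate it), an assumed 64-byte line size, and a hypothetical `count_misses` helper. With one way it behaves as a direct-mapped cache.

```python
from collections import OrderedDict


def count_misses(addresses, num_sets, ways, line_size=64):
    """Count misses for an access pattern in a set-associative LRU cache."""
    # One OrderedDict per set, ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets
        tag = line // num_sets
        current = sets[index]
        if tag in current:
            current.move_to_end(tag)          # hit: now most recently used
        else:
            misses += 1
            if len(current) == ways:
                current.popitem(last=False)   # evict least recently used
            current[tag] = True
    return misses


CACHE_LINES = 512
a = 0
b = 64 * CACHE_LINES          # same index bits as a in the direct-mapped case
pattern = [a, b] * 100        # alternate between the two locations

direct_mapped = count_misses(pattern, num_sets=CACHE_LINES, ways=1)
two_way = count_misses(pattern, num_sets=CACHE_LINES // 2, ways=2)
print(direct_mapped, two_way)
```

Both configurations hold the same total number of lines, yet the one-way cache misses on every access while the two-way cache misses only on the first touch of each location.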
Sandy Bridge, representing a blending of the
For example, access latency for main memory using modern SDRAM is made up of several components, each measured in cycles of the memory system bus: the time to send the address to the DIMM (memory module), the RAS-to-CAS delay for the row access, the CAS latency for the column access, and a final step to send the first piece of data up to the processor (or E-cache), with the remaining data block following over the next few bus cycles. On a multiprocessor system, even more bus cycles may be required to support cache coherency between the processors. And then there are the cycles within the processor itself, checking the various on-chip caches before the address even gets sent to the memory controller, and then when the data arrives from RAM to the memory controller and is sent to the relevant processor core. Luckily, those are faster internal CPU cycles, not memory bus cycles, but they still add noticeably to the total in most modern processors.
Each table, or way, may also have marker bits so that only the line of the least recently used way is evicted when a new line is brought in, or perhaps some faster approximation of that ideal.
One of the most interesting members of the RISCy x86 group was the Transmeta Crusoe processor, which translated x86 instructions into an internal VLIW form, rather than internal superscalar, and used software to do the translation at runtime, much like a Java virtual machine. This approach allowed the processor itself to be a simple VLIW, without the complex x86 decoding and register-renaming hardware of decoupled x86 designs, and without any superscalar dispatch or OOO logic either. The software-based x86 translation did reduce the system's performance compared to hardware translation (which occurs as additional pipeline stages and thus is almost free in performance terms), but the result was a very lean chip which ran fast and cool and used very little power. A Crusoe processor could match a then-current Pentium III running in its low-power mode while using only a fraction of the power and generating only a fraction of the heat. This made it ideal for laptops and handheld computers, where battery life is crucial. Today, of course, x86 processor variants designed specifically for low power use, such as the Pentium M and its Core descendants, have made the Transmeta software-based approach unnecessary, although a very similar approach is currently being used in NVIDIA's Denver ARM processors, again in the quest for high performance at very low power.
At one extreme we have processors like Intel's Core i* Sandy Bridge (above left), consisting of large, wide, out-of-order, aggressively brainiac cores (along the top, with shared cache below), each running multiple threads, for a modest total number of "fast" threads. At the other end of the spectrum, Sun/Oracle's UltraSPARC T-series Niagara (above right) contains many much smaller, simpler, in-order cores (top and bottom, with shared cache towards the center), each running many threads, for a massive number of threads in total, although these threads are considerably slower than those of Sandy Bridge. Both chips are of the same era. Both contained around a billion transistors and are drawn approximately to scale above (assuming similar transistor density). Note just how much smaller the simple, in-order cores really are!
Crusoe Explored: the Transmeta Crusoe processor and its software-based approach to x86 compatibility.
Ideally, a cache should keep the data that is most likely to be needed in the future. Since caches aren't psychic, a good approximation of this is to keep the most recently used data.
A typical modern memory hierarchy looks something like…
The Pentium 4 was the first processor to use SMT, which Intel calls hyper-threading. Its design allowed for 2 simultaneous threads (although earlier revisions of the Pentium 4 had the SMT feature disabled due to bugs). Speedups from SMT on the Pentium 4 varied considerably depending on the applications. Subsequent Intel designs then eschewed SMT during the transition back to the brainiac designs of the Pentium M and Core 2, along with the transition to multi-core. Many other SMT designs were also cancelled around the same time (Alpha 21464, UltraSPARC V), and for a while it almost seemed as if SMT was out of favor, before it finally made a comeback with POWER5, a 2-thread SMT design as well as being multi-core (2 threads per core times 2 cores per chip equals 4 threads per chip). Intel's Core i series are also 2-thread SMT, so a typical quad-core Core i processor is thus an 8-thread chip. Sun was the most aggressive of all on the thread-level parallelism front, with UltraSPARC T1 Niagara providing 8 simple in-order cores each with 4-thread SMT, for a total of 32 threads on a single chip. This was subsequently increased to 8 threads per core in UltraSPARC T2, and then 16 cores per chip in UltraSPARC T3, for a whopping 128 threads!
pipelining (superscalar, OOO, VLIW, branch prediction, predication)
Nonetheless, even the very best modern processors with the best, smartest branch predictors still fall short of perfect prediction accuracy, and still lose quite a lot of performance due to branch mispredictions. The bottom line is very deep pipelines naturally suffer from diminishing returns, because the deeper the pipeline, the further into the future you must try to predict, the more likely you'll be wrong, and the greater the mispredict penalty when you are.
Will further improvements in memory technology, along with even more levels of caching, be able to continue to hold off the memory wall, while at the same time scaling up to the ever higher bandwidth demanded by more and more processor cores? Or will we soon end up constantly bottlenecked by memory, both bandwidth and latency, with neither the processor microarchitecture nor the number of cores making much difference, and the memory system being all that matters? It will be interesting to watch, and while predicting the future is never easy, there are good reasons to be optimistic…
How far can pipelining and multiple issue be taken? If a deeper pipeline is faster, why not build an even deeper superpipeline? If multiple-issue superscalar is good, why not go for even wider issue? For that matter, why not build a processor with an extremely deep pipeline which issues a great many instructions per cycle?
Instructions are executed one after the other inside the processor, right? Well, that makes it easy to understand, but that's not really what happens. In fact, that hasn't happened since the middle of the 1980s. Instead, several instructions are all partially executing at the same time.
Figure The instruction flow of a superpipelined-superscalar processor.
And here are some articles not specifically related to any particular processor, but still very interesting…
Most of the early superscalars were in-order designs (SuperSPARC, hyperSPARC, UltraSPARC, Alpha 21064/21164, the original Pentium). Examples of early OOO designs included the MIPS R10000, Alpha 21264 and to some extent the entire POWER/PowerPC line (with their reservation stations). Today, almost all high-performance processors are out-of-order designs, with the notable exceptions of UltraSPARC III/IV, POWER6 and Denver. Most low-power, low-performance processors, such as Cortex-A7/A53 and Atom, are in-order designs because OOO logic consumes a lot of power for a relatively small performance gain.
Naturally, the data in the registers can also be divided up in other ways, not just as 8-bit bytes: for example as wider integers for high-quality image processing, or as floating-point values for scientific number crunching. With AltiVec, NEON and recent versions of SSE/AVX, for example, it is possible to execute a multi-way parallel floating-point multiply-add as a single, fully pipelined instruction.
It is worth noting, however, that most VLIW designs are not interlocked. This means they do not check for dependencies between instructions, and often have no way of stalling instructions other than to stall the whole processor on a cache miss. As a result, the compiler needs to insert the appropriate number of cycles between dependent instructions, even if there are no instructions to fill the gap, by using nops (no-operations, pronounced "no ops") if necessary. This complicates the compiler somewhat, because it is doing something that a superscalar processor normally does at runtime, however the extra code in the compiler is minimal and it saves precious resources on the processor chip.
Clearly, OOO hardware should make it possible for more instruction-level parallelism to be extracted, because things will be known at runtime that cannot be predicted in advance (cache misses, in particular). On the other hand, a simpler in-order design will be smaller and use less power, which means you can place more small in-order cores onto the same chip as fewer, larger out-of-order cores. Which would you rather have: a few powerful brainiac cores, or many simpler in-order cores?
Table The memory hierarchy of a modern desktop/laptop: Core i* Haswell.
Okay, so you're a CS graduate and you did a hardware course as part of your degree, but perhaps that was a few years ago now and you haven't really kept up with the details of processor designs since then.
The core problem with memory access is that building a fast memory system is very difficult, in part because of fixed limits like the speed of light, which impose delays while a signal is transferred out to RAM and back, and more importantly because of the relatively slow speed of charging and draining the tiny capacitors which make up the memory cells. Nothing can change these facts of nature; we must learn to work around them.
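The nop-padding job a compiler does for a non-interlocked machine can be sketched roughly as follows. This Python model is a deliberate simplification: it assumes a single-issue machine, uses the compiler's view of latency (a latency of 1 means the result is usable in the very next cycle), and the instruction tuples and `insert_nops` name are hypothetical.

```python
def insert_nops(instructions):
    """Pad a straight-line instruction sequence with nops.

    instructions: list of (text, dest_register, source_registers, latency).
    Returns the list of instruction texts with "nop" entries inserted so that
    no instruction issues before all of its source values are ready.
    """
    ready_at = {}   # register -> first cycle at which its value may be read
    schedule = []
    cycle = 0
    for text, dest, sources, latency in instructions:
        earliest = max([ready_at.get(r, 0) for r in sources], default=0)
        while cycle < earliest:
            schedule.append("nop")   # nothing useful to issue this cycle
            cycle += 1
        schedule.append(text)
        if dest is not None:
            ready_at[dest] = cycle + latency
        cycle += 1
    return schedule


program = [
    ("mul r1, r2, r3", "r1", ["r2", "r3"], 3),   # assumed 3-cycle multiply
    ("add r4, r1, r5", "r4", ["r1", "r5"], 1),
    ("sub r6, r4, r2", "r6", ["r4", "r2"], 1),
]
scheduled = insert_nops(program)
for line in scheduled:
    print(line)
```

A real compiler would first try to move independent instructions into those gaps and only fall back to nops when nothing else is available.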
few instructions such as go left, go right and build car.
Niagara processor, revised for a second generation and taking thread-level parallelism to the extreme.
Unfortunately, however, the effectiveness of OOO execution in dynamically extracting additional instruction-level parallelism has been disappointing, with only a relatively small improvement being seen over an equivalent in-order design. To quote Andy Glew, a pioneer of out-of-order execution and one of the chief architects of the Pentium Pro/II/III: "The dirty little secret of OOO is that we are often not very much OOO at all." Out-of-order execution has also been unable to deliver the degree of recompile independence originally hoped for, with recompilation still producing large speedups even on aggressive OOO processors.
DEC, for example, went primarily speed-demon with the first two generations of Alpha, then changed to brainiac for the third generation. MIPS did similarly. Sun, on the other hand, went brainiac with their first superscalar SPARC, then switched to speed-demon for more recent designs. The POWER/PowerPC camp also gradually moved away from brainiac designs over the years until recently, although the reservation stations in all POWER/PowerPC designs do offer a degree of OOO execution between different functional units even if the instructions within each functional unit's queue are executed strictly in order. ARM processors, in contrast, have shown a consistent move towards more brainiac designs, coming up from the low-power, low-performance embedded world as they have, but still remaining mobile-centric and thus unable to push the clock speed too high.
The compiler approach also has some other advantages over OOO hardware: it can see further down the program than the hardware, and it can speculate down multiple paths rather than just one, which is a big issue if branches are unpredictable. On the other hand, a compiler can't be expected to be psychic, so it can't necessarily get everything perfect all the time. Without OOO hardware, the pipeline will stall when the compiler fails to predict something like a cache miss.
So ever-wider issue, here we come, right? Unfortunately, the answer is no.
Some applications, such as database systems, image and video processing, audio processing, 3D graphics rendering and scientific code, do have obvious high-level (coarse-grained) parallelism available and easy to exploit, but unfortunately even many of these applications have not been written to make use of multiple threads in order to exploit multiple processors. In addition, many of the applications which are easy to parallelize, because they're inherently "embarrassingly parallel" in nature, are primarily limited by memory bandwidth, not by the processor (image processing, audio processing, scientific code), so adding a second thread or processor won't help them much unless memory bandwidth is also dramatically increased (we'll get to the memory system soon). Worse yet, many other types of software, such as web browsers, design tools, language interpreters, hardware simulations and so on, are currently not written in a way which is parallel at all, or certainly not enough to make effective use of multiple processors.
by Jason Robert Carey Patterson, last updated May (orig Feb)
Brainiac designs are at the smart-machine end of the spectrum, with lots of OOO hardware trying to squeeze every last drop of instruction-level parallelism out of the code, even if it costs millions of logic transistors and years of design effort to do it. In contrast, speed-demon designs are simpler and smaller, relying on a smart compiler and willing to sacrifice a little bit of instruction-level parallelism for the other benefits that simplicity brings. Historically, the speed-demon designs tended to run at higher clock speeds, precisely because they were simpler, hence the "speed-demon" name, but today that's no longer the case because clock speed is limited mainly by power and thermal issues.
So where does x86 fit into all this, and how have Intel and AMD been able to remain competitive through all of these developments in spite of an architecture that's now decades old?
Nonetheless, the memory wall is still a big problem.
An N-issue superscalar processor wants N independent instructions to be available, with all their dependencies and latencies met, at every cycle. In reality this is virtually never possible, especially with load latencies of several cycles. Currently, real-world instruction-level parallelism for mainstream, single-threaded applications is limited to only a few instructions per cycle at best. In fact, the average ILP of a modern processor running the SPECint benchmarks is lower still, and the SPEC benchmarks are somewhat easier than most large, real-world applications. Certain types of applications do exhibit more parallelism, such as scientific code, but these are generally not representative of mainstream applications. There are also some types of code, such as pointer chasing, where even sustaining one instruction per cycle is extremely difficult. For those programs, the key problem is the memory system, and yet another wall, the memory wall (which we'll get to later).
From a hardware point of view, implementing SMT requires duplicating all of the parts of the processor which store the execution state of each thread: things like the program counter, the architecturally-visible registers (but not the rename registers), the memory mappings held in the TLB, and so on. Luckily, these parts only constitute a tiny fraction of the overall processor's hardware. The really large and complex parts, such as the decoders and dispatch logic, the functional units, and the caches, are all shared between the threads.
Intel has been the most interesting of all to watch. Modern x86 processors have no choice but to be at least somewhat brainiac due to limitations of the x86 architecture (more on this soon), and the Pentium Pro embraced that sentiment wholeheartedly. Then the race against AMD to reach 1 GHz ensued, which AMD won by a nose in March 2000. Intel changed their focus to clock speed at all cost, and made the Pentium 4 about as speed-demon as possible for a decoupled x86 microarchitecture, sacrificing some ILP and using a deep 20-stage pipeline to pass 2 and then 3 GHz, and with a later revision featuring a staggering 31-stage pipeline, reach as high as 3.8 GHz. At the same time, with IA-64 Itanium (not shown above), Intel again bet solidly on the smart-compiler approach, with a simple design relying totally on static, compile-time scheduling. Faced with the failure of IA-64, the enormous power and heat issues of the Pentium 4, and the fact that AMD's more slowly clocked Athlon processors, in the 2 GHz range, were actually outperforming the Pentium 4 on real-world code, Intel then reversed its position once again, and revived the older Pentium Pro/II/III brainiac design to produce the Pentium M and its Core successors, which have been a great success.
For these decoupled superscalar x86 processors, register renaming is absolutely critical due to the meager 8 registers of the x86 architecture in 32-bit mode (64-bit mode added an additional 8 registers). This differs strongly from the RISC architectures, where providing more registers via renaming only has a modest effect. Nonetheless, with clever register renaming, the full bag of RISC tricks become available to the x86 world, with the two exceptions of advanced static instruction scheduling (because the micro-ops are hidden behind the x86 layer and thus are less visible to compilers) and the use of a large register set to avoid memory accesses.
Figure The instruction flow of an SMT processor.
Today, virtually every processor is a superpipelined-superscalar, so they're just called superscalar for short. Strictly speaking, superpipelining is just pipelining with a deeper pipe anyway.
The Alpha architecture had a conditional move instruction from the very beginning. MIPS, SPARC and x86 added it later. With IA-64, Intel went all-out and made almost every instruction predicated in the hope of dramatically reducing branching problems in inner loops, especially ones where the branches are unpredictable, such as compilers and OS kernels. Interestingly, the ARM architecture used in many phones and tablets was the first architecture with a fully predicated instruction set. This is even more intriguing given that the early ARM processors only had short pipelines and thus relatively small mispredict penalties.
Figure Xeon Haswell EP with brainiac cores, the best of both worlds?
The amazing thing about caches is that they work really well: they effectively make the memory system seem almost as fast as the L1 cache, yet as large as main memory. A modern primary (L1) cache has a latency of just a few processor cycles, which is dozens of times faster than accessing main memory, and modern primary caches achieve very high hit rates for most software. So most of the time, accessing memory only takes a few cycles!
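The effect of hit rate on the apparent speed of memory can be sketched with a weighted average. This is only an illustration: the function name and every figure passed to it are hypothetical, not measurements from any processor.

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Average cycles per access for one cache level in front of main memory.

    hit_rate    -- fraction of accesses served by the cache (0.0 to 1.0)
    hit_cycles  -- latency of a cache hit, in processor cycles
    miss_cycles -- latency of going to main memory, in processor cycles
    """
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# Hypothetical figures, chosen only to show the shape of the trade-off.
for rate in (0.50, 0.90, 0.99):
    print(rate, average_access_cycles(rate, hit_cycles=3, miss_cycles=100))
```

Even with a large miss penalty, the average stays close to the hit latency once the hit rate is high, which is why a small cache can make a slow memory system look fast.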
Lighterra is the software company of Jason Robert Carey Patterson, an Australian programmer with interests centered around performance and the hardware/software interface, such as the design of new programming languages and compilers, code optimization algorithms to make code run faster, processor chip design and microarchitecture, and parallel programming across many processor cores or many networked computers.
Another approach to the whole problem is to have the compiler optimize the code by rearranging the instructions. This is called static, or compile-time, instruction scheduling. The rearranged instruction stream can then be fed to a processor with simpler in-order multiple-issue logic, relying on the compiler to spoon-feed the processor with the best instruction stream. Avoiding the need for complex OOO logic should make the processor quite a lot easier to design, less power-hungry and smaller, which means more cores, or extra cache, could be placed onto the same amount of chip area (more on this later).
The Pentium 4 and the PowerPC G4e, and Part II: a comparison of the very different designs of two extremely popular and successful, if somewhat maligned, processors.
If branches and long-latency instructions are going to cause bubbles in the pipelines, then perhaps those empty cycles can be used to do other work. To achieve this, the instructions in the program must be reordered so that while one instruction is waiting, other instructions can execute. For example, it might be possible to find a couple of other instructions from further down in the program and put them between the two instructions in the earlier multiply example.
Here, a new instruction has been introduced called cmovle, for conditional move if less than or equal. This instruction works by executing as normal, but only commits itself if its condition is true. This is called a predicated instruction, because its execution is controlled by a predicate (a true/false test).
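The commit-only-if-true behaviour can be modelled in a few lines. This is a software model of the semantics only, under the assumption that an earlier compare has produced a less-than-or-equal flag; the function names are hypothetical, and Python's own conditional expression is of course not branch-free hardware.

```python
def cmovle(flag_le, dest, src):
    """Model of a conditional move: the new value of dest.

    The move is always 'executed', but its result is only committed
    when the less-than-or-equal flag from an earlier compare is set.
    """
    return src if flag_le else dest


def select_with_branch(a, b, x, y):
    # Conventional form: control flow depends on the comparison.
    if a <= b:
        result = x
    else:
        result = y
    return result


def select_predicated(a, b, x, y):
    # Predicated form: a straight-line sequence with no branch.
    flag_le = a <= b                      # compare sets the condition flag
    result = y                            # unconditional move (the else value)
    result = cmovle(flag_le, result, x)   # commits only if a <= b
    return result
```

Both functions compute the same value for every input; the predicated version simply replaces the control dependency with a data dependency on the flag.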
Of course, with all of those transistors available, it might also make sense to integrate other secondary functionality into the main CPU chip, such as I/O and networking (usually part of the motherboard chipset), dedicated video encoding/decoding hardware (usually part of the graphics system), or even an entire low-end GPU (graphics processing unit). This integration is particularly attractive in cases where a reduction in chip count, physical space or cost is more important than the performance advantage of more cores on the main CPU chip and separate, dedicated chips for those other purposes, making it ideal for phones, tablets and small, low-performance laptops. Such a heterogeneous design is called a system-on-chip, or SoC…
NexGen's Nx586 and Intel's Pentium Pro (also known as the P6) were the first processors to adopt a decoupled x86 microarchitecture design, and today all modern x86 processors use this technique. Of course, they all differ in the exact design of their core pipelines, functional units and so on, just like the various RISC processors, but the fundamental idea of translating from x86 to internal μop RISC-like instructions is common to all of them.
Figure A pipelined microarchitecture with bypasses.
It's a bit like working at a desk in a library… You might have two or three books open on the desk itself. Accessing them is fast (you can just look), but you can't fit more than a couple on the desk at the same time, and even if you could, accessing books laid out on a huge desk would take longer because you'd have to walk between them. Instead, in the corner of the desk you might have a pile of a dozen more books. Accessing them is slower, because you have to reach over, grab one and open it up. Each time you open a new one, you also have to put one of the books already on the desk back into the pile to make room. Finally, when you want a book that's not on the desk, and not in the pile, it's very slow to access because you have to get up and walk around the library looking for it. However, the size of the library means you have access to thousands of books, far more than could ever fit on your desk.
Today, a typical SMT design implies both a wide execution core and OOO execution logic, including multiple decoders, the large and complex superscalar dispatch logic and so on. Thus, the size of a typical SMT core is quite large in terms of chip area. With the same amount of chip space, it would be possible to fit several simpler, single-issue, in-order cores (either with or without basic SMT). In fact, it may be the case that as many as half a dozen small, simple cores could fit within the chip area taken by just one modern OOO superscalar SMT design!
Figure The instruction flow of a superscalar processor.
Unfortunately, keeping exactly the most recently used data would mean that data from any memory location could be placed into any cache line. The cache would thus contain exactly the most recently used n KB of data, which would be great for exploiting locality but unfortunately is not suitable for allowing fast access: accessing the cache would require checking every cache line for a possible match, which would be very slow for a modern cache with hundreds of lines.
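A small software model makes the cost of this "any location in any line" policy visible. It is a sketch only: the class name and the comparison counter are hypothetical, and real hardware would perform the comparisons in parallel circuitry rather than in a loop.

```python
class FullyAssociativeCache:
    """Toy cache that keeps exactly the most recently used lines (LRU)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = []        # stored line addresses, least recent first
        self.comparisons = 0   # tag comparisons performed so far

    def access(self, line_addr):
        """Return True on a hit, False on a miss."""
        hit = False
        for tag in self.lines:          # every stored line must be checked
            self.comparisons += 1
            if tag == line_addr:
                hit = True
        if hit:
            self.lines.remove(line_addr)    # will be re-added as most recent
        elif len(self.lines) == self.num_lines:
            self.lines.pop(0)               # evict the least recently used line
        self.lines.append(line_addr)
        return hit


cache = FullyAssociativeCache(num_lines=2)
results = [cache.access(addr) for addr in (1, 2, 1, 3, 2)]
print(results, cache.comparisons)
```

The number of comparisons per access grows with the number of lines held, which is the scaling problem the paragraph above describes.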
Lighterra, Articles & Papers: Modern Microprocessors, A 90-Minute Guide!
No VLIW designs have yet been commercially successful as mainstream CPUs; however, Intel's IA-64 architecture, which is still in production in the form of the Itanium processors, was once intended to be the replacement for x86. Intel chose to call IA-64 an EPIC design, for explicitly parallel instruction computing, but it was essentially a VLIW with clever grouping (to allow long-term compatibility) and predication (see below). The programmable shaders in graphics processors (GPUs) are sometimes VLIW designs, as are many digital signal processors (DSPs), and there was also Transmeta (see the x86 section, coming up soon).
The solution, invented independently at about the same time by engineers at both NexGen and Intel, was to dynamically decode the x86 instructions into simple, RISC-like micro-instructions, which can then be executed by a fast, RISC-style register-renaming OOO superscalar core. The micro-instructions are usually called μops (pronounced "micro-ops"). Most x86 instructions decode into 1, 2 or 3 μops, while the more complex instructions require a larger number.
The overall style of this article, particularly with respect to the style of the processor instruction flow and microarchitecture diagrams, is derived from a combination of a well-known ASPLOS paper by Norman Jouppi and David Wall, the book POWER & PowerPC by Shlomo Weiss and James Smith, and the two very famous Hennessy/Patterson books Computer Architecture: A Quantitative Approach and Computer Organization and Design.
Processors which focused too much on ILP, such as the early POWER processors, SuperSPARC and the MIPS R10000, soon found their ability to extract additional instruction-level parallelism was only modest, while the additional complexity seriously hindered their ability to reach fast clock speeds, resulting in those processors being beaten by dumber but higher-clocked processors which weren't so focused on ILP.
This is sometimes called SIMD parallelism (single instruction, multiple data). More often, it's called vector processing. Supercomputers used to use vector processing a lot, with very long vectors, because the types of scientific programs which are run on supercomputers are quite amenable to vector processing.
Almost every architecture has now added SIMD vector extensions, including SPARC (VIS), x86 (MMX/SSE/AVX), POWER/PowerPC (AltiVec) and ARM (NEON). Only relatively recent processors from each architecture can execute some of these new instructions, however, which raises backward-compatibility issues, especially on x86 where the SIMD vector instructions evolved somewhat haphazardly (MMX, 3DNow!, SSE, SSE2, SSE3, SSE4, AVX, AVX2).
Unfortunately, it's quite difficult for a compiler to automatically make use of vector instructions when working from normal source code, except in trivial cases. The key problem is that the way programmers write programs tends to serialize everything, which makes it difficult for a compiler to prove two given operations are independent and can be done in parallel. Progress is slowly being made in this area, but at the moment programs must basically be rewritten by hand to take advantage of vector instructions (except for array-based loops in scientific code).
Exactly which is the more important factor is currently open to hot debate. In general, it seems both the benefits and the costs of OOO execution have been somewhat overstated in the past. In terms of cost, appropriate pipelining of the dispatch and register-renaming logic allowed OOO processors to achieve clock speeds competitive with simpler designs by the late 1990s, and clever engineering has reduced the power overhead of OOO execution considerably in recent years, leaving mainly the chip area cost. This is a testament to some outstanding engineering by processor architects.
Today, however, vector supercomputers have long since given way to multiprocessor designs where each processing unit is a commodity CPU. So why revive vector processing?
The bottom line is that without care, and even with care for some applications, SMT performance can actually be worse than single-thread performance and traditional context switching between threads. On the other hand, applications which are limited primarily by memory latency (but not memory bandwidth), such as database systems, 3D graphics rendering and a lot of general-purpose code, benefit dramatically from SMT, since it offers an effective way of using the otherwise idle time during load latencies and cache misses (we'll cover caches later). Thus, SMT presents a very complex and application-specific performance picture. This also makes it a difficult challenge for marketing: sometimes almost as fast as two real processors, sometimes more like two really lame processors, sometimes even worse than one processor, huh?
Of course, if the blocks of code in the if and else cases were longer, then using predication would mean executing more instructions than using a branch, because the processor is effectively executing both paths through the code. Whether it's worth executing a few more instructions to avoid a branch is a tricky decision: for very small or very large blocks the decision is simple, but for medium-sized blocks there are complex tradeoffs which the optimizer must consider.
So, if going primarily for clock speed is a problem, is going purely brainiac the right approach then? Sadly, no. Pursuing more and more ILP also has definite limits, because unfortunately, normal programs just don't have a lot of fine-grained parallelism in them, due to a combination of load latencies, cache misses, branches and dependencies between instructions. This limit of available instruction-level parallelism is called the ILP wall.
And if you want to keep up with the latest news in the world of microprocessors…
The word cache is pronounced like "cash"… as in a cache of weapons or a cache of supplies. It means a place for hiding or storing things. It is not pronounced "ca-shay" or "kay-sh".
Figure Design extremes: Core i* Sandy Bridge vs UltraSPARC T Niagara.
A question that must be asked is whether the costly out-of-order logic is really warranted, or whether compilers can do the task of instruction scheduling well enough without it. This is historically called the brainiac vs speed-demon debate. This simple (and fun) classification of design styles first appeared in a Microprocessor Report editorial by Linley Gwennap, and was made widely known by Dileep Bhandarkar's Alpha Implementations & Architecture book.
Although the pipeline stages look simple, it is important to remember the execute stage in particular is really made up of several different groups of logic (several sets of gates), making up different functional units for each type of operation the processor must be able to perform…
Nevertheless, since the benefits of both SMT and multi-core depend so much on the nature of the target applications, a broad spectrum of designs might still make sense with varying degrees of SMT and multi-core. Let's explore some possibilities…
The IBM POWER1 processor, the predecessor of PowerPC, was the first mainstream superscalar processor. Most of the RISCs went superscalar soon after (SuperSPARC, Alpha 21064). Intel even managed to build a superscalar x86 (the original Pentium); however, the complex x86 instruction set was a real problem for them (more on this later).
Well, consider the following two instructions…
Of course, now that there are independent pipelines for each functional unit, they can even have different numbers of stages. This allows the simpler instructions to complete more quickly, reducing latency (which we'll get to soon). Since such processors have many different pipeline depths, it's normal to refer to the depth of a processor's pipeline when executing integer instructions, which is usually the shortest of the possible pipeline paths, with the memory and floating-point pipelines implied as having a few additional stages. Thus, the quoted pipeline depth is the number of stages used for integer instructions, with memory instructions taking a couple more stages and floating-point a few more again. There are also a bunch of bypasses within and between the various pipelines, but these have been left out of the diagram for simplicity.
Now the processor is completing 1 instruction every cycle (CPI = 1). This is a four-fold speedup without changing the clock speed at all. Not bad, huh?
In this type of processor, the instructions are really groups of little sub-instructions, and thus the instructions themselves are very long, often 128 bits or more, hence the name VLIW (very long instruction word). Each instruction contains information for multiple parallel operations.
Intel's Next Generation Microarchitecture Unveiled: Intel's revival of the venerable P6 core from the
Loads tend to occur near the beginning of code sequences (basic blocks), with most of the other instructions depending on the data being loaded. This causes all the other instructions to stall, and makes it difficult to obtain large amounts of instruction-level parallelism. Things are even worse than they might first seem, because in practice most superscalar processors can still only issue one, or at most two, load instructions per cycle.
Well, unfortunately it's not quite as simple as that. As it turns out, very wide superscalar designs scale very badly in terms of both chip area and clock speed. One key problem is that the complex multiple-issue dispatch logic scales up as roughly the square of the issue width, because all n candidate instructions need to be compared against every other candidate. Applying ordering restrictions or issue rules can reduce this, as can some clever engineering, but it's still in the order of n². That is, the dispatch logic grows much faster than the issue width: under an n² law, tripling the width makes the dispatch logic roughly nine times larger. In addition, a very wide superscalar design requires highly multi-ported register files and caches, to service all those simultaneous accesses. Both of these factors conspire to not only increase size, but also to massively increase the amount of longer-distance wiring at the circuit-design level, placing serious limits on the clock speed. So a single very wide core would actually be both larger and slower than two cores of half the width, and our dream of an extremely wide SMT design isn't really viable due to circuit-design limitations.
It gets even worse, because in addition to the normal switching power, there is also a small amount of leakage power, since even when a transistor is off, the current flowing through it isn't completely reduced to zero. And just like the good, useful current, this leakage current also goes up as the voltage is increased. If that wasn't bad enough, leakage generally goes up as the temperature increases as well, due to the increased movement of the hotter, more energetic electrons within the silicon.
Intel's Haswell CPU Microarchitecture: the latest and greatest Intel x86 processor design,
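The quadratic growth is easy to tabulate. The sketch below assumes a pure n² cost model, which the text describes only as a rough order of growth; the function names and the widths printed are illustrative choices, not figures from any real design.

```python
def relative_dispatch_cost(width, baseline_width):
    """Dispatch-logic size relative to a baseline, assuming cost ~ width**2."""
    return (width / baseline_width) ** 2


def one_wide_vs_two_narrow(width):
    """Cost of one core of the given width divided by the cost of two
    cores of half that width, under the same n**2 assumption."""
    wide = width ** 2
    two_narrow = 2 * (width / 2) ** 2
    return wide / two_narrow


# Illustrative widths only.
for w in (4, 6, 8, 12):
    print(w, relative_dispatch_cost(w, baseline_width=4))
print(one_wide_vs_two_narrow(8))
```

Under this model a single wide core always costs twice as much dispatch logic as two half-width cores, regardless of the width chosen.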
There have, of course, been many other presentations of this same material, and naturally they are all somewhat similar, however the above four are exceptionally good in my opinion. To learn more about these topics, those books are an excellent place to start.
How can this be? Obviously, there's more to it than just clock speed: it's all about how much work gets done in each clock cycle. Which leads to…
A 200 MHz MIPS R10000, a 300 MHz UltraSPARC and a 400 MHz Alpha 21164 were all about the same speed at running most programs, yet they differed by a factor of two in clock speed. A 300 MHz Pentium II was also about the same speed for many things, yet it was about half that speed for floating-point code such as scientific number crunching. A PowerPC G3 at that same 300 MHz was somewhat faster than the others for normal integer code, but still far slower than the top three for floating-point. At the other extreme, an IBM POWER2 processor at just 135 MHz matched the 400 MHz Alpha 21164 in floating-point speed, yet was only half as fast for normal integer programs.
Most modern processors have a large second or third level of on-chip cache, usually shared between all cores. This cache is also very important, but its sweet spot depends heavily on the type of application being run and the size of that application's active working set: the difference between a smaller and a much larger shared cache will be barely measurable for some applications, while for others it will be enormous. Given that the relatively small L1 caches already take up a significant percentage of the chip area for many modern processor cores, you can imagine how much area a large L2 or L3 cache would take, yet this is still absolutely essential to combat the memory wall. Often, the large L2/L3 cache takes as much as half the total chip area, so much that it's clearly visible in chip photographs, standing out as a relatively clean, repetitive structure against the more messy logic transistors of the cores and memory controller.
Luckily, however, rewriting just a small amount of code in key places within the graphics and video/audio libraries of your favorite operating system has a widespread effect across many applications. Today, most OSs have enhanced their key library functions in this way, so virtually all multimedia and 3D graphics applications do make use of these highly effective vector instructions. Chalk up yet another win for abstraction!
Figure The heatsink of a modern desktop processor, with front fan removed.
Caches can achieve these seemingly amazing hit rates because of the way programs work. Most programs exhibit locality in both time and space: when a program accesses a piece of memory, there's a good chance it will need to re-access the same piece of memory in the near future (temporal locality), and there's also a good chance it will need to access other nearby memory in the future as well (spatial locality). Temporal locality is exploited by merely keeping recently accessed data in the cache. To take advantage of spatial locality, data is transferred from main memory up into the cache in blocks of a few dozen bytes at a time, called a cache line.
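Spatial locality can be illustrated by counting how many distinct cache lines an access pattern touches. The 64-byte line size and the helper names below are assumptions made for this sketch; the text only says a line is a few dozen bytes.

```python
LINE_BYTES = 64  # assumed line size, for illustration only


def line_address(byte_address):
    """Which cache line a byte address falls in."""
    return byte_address // LINE_BYTES


def line_offset(byte_address):
    """Position of the byte within its cache line."""
    return byte_address % LINE_BYTES


def lines_touched(addresses):
    """Number of distinct cache lines needed to serve the given accesses."""
    return len({line_address(a) for a in addresses})


# 256 consecutive 4-byte words versus 256 words spaced one line apart.
sequential = range(0, 256 * 4, 4)
strided = range(0, 256 * LINE_BYTES, LINE_BYTES)
print(lines_touched(sequential), lines_touched(strided))
```

The sequential walk needs far fewer line transfers than the strided one for the same number of accesses, because each transferred line serves several neighbouring words.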
Niagara II: The Hydra Returns: Sun's innovative
Assuming a typical SDRAM memory system and a given processor clock speed, the memory latency translates into a large number of cycles of the CPU clock to access main memory! Yikes, you say! And it gets worse: the faster the processor's clock, the more cycles it spends waiting for the very same memory access.
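The conversion from memory latency to processor cycles is a single multiplication. The latency and clock speeds used below are hypothetical round numbers, not the figures from the original text.

```python
def stall_cycles(latency_ns, clock_ghz):
    """Processor cycles that elapse during a memory access.

    A clock of f GHz ticks f times per nanosecond, so the number of
    cycles is simply latency (ns) multiplied by frequency (GHz).
    """
    return latency_ns * clock_ghz


# Hypothetical 50 ns main-memory latency at several clock speeds.
for ghz in (1.0, 2.0, 4.0):
    print(ghz, stall_cycles(50, ghz))
```

The memory itself is no slower in the last case than in the first; the processor simply counts more of its own cycles while it waits.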
Figure The instruction flow of a superpipelined processor.
But be prepared: this article is brief and to-the-point. It pulls no punches and the pace is pretty fierce (really). Let's get into it…
From the hardware point of view, adding these types of vector instructions is not terribly difficult: existing registers can be used and in many cases the functional units can be shared with existing integer or floating-point units. Other useful packing and unpacking instructions can also be added, for byte shuffling and so on, and a few predicate-like instructions for bit masking etc. With some thought, a small set of vector instructions can enable some impressive speedups.
Now consider a pipelined processor executing this code sequence. By the time the conditional branch reaches the execute stage in the pipeline, the processor must have already fetched and decoded the next couple of instructions. But which instructions? Should it fetch and decode the if branch or the else branch? It won't really know until the conditional branch gets to the execute stage, but in a deeply pipelined processor that might be several cycles away. And it can't afford to just wait: the processor encounters a branch every six instructions on average, and if it was to wait several cycles at every branch then most of the performance gained by using pipelining in the first place would be lost.
Typically, there are small but fast primary level-1 (L1) caches on the processor chip itself, inside each core, usually just tens of KB in size, with a larger level-2 (L2) cache further away but still on-chip (a few hundred KB to a few MB), and possibly an even larger and slower L3 cache etc. The combination of the on-chip caches, any off-chip external cache (E-cache) and main memory (RAM) together form a memory hierarchy, with each successive level being larger but slower than the one before it. At the bottom of the memory hierarchy, of course, is virtual memory (paging/swapping), which provides the illusion of an almost infinite amount of main memory by moving pages of RAM to and from file-system storage (which is slower again, by a large margin).
A good analogy is a highway… Suppose you want to drive in to the city from many miles away. By doubling the number of lanes, the total number of cars that can travel per hour (the bandwidth) is doubled, but your own travel time (the latency) is not reduced. If all you want to do is increase cars-per-second, then adding more lanes (wider bus) is the answer, but if you want to reduce the time for a specific car to get from A to B then you need to do something else: usually either raise the speed limit (bus and RAM speed), or reduce the distance, or perhaps build a regional mall so that people don't need to go to the city as often (a cache).
This is really great! Now that we can fill those bubbles by running multiple threads, we can justify adding more functional units than would normally be viable in a single-threaded processor, and really go to town with multiple instruction issue. In some cases, this may even have the side effect of improving single-thread performance (for particularly ILP-friendly code, for example).
Of course, there's nothing stopping a processor from having both a deep pipeline and multiple instruction issue, so it can be both superpipelined and superscalar at the same time…
The key question is how the processor should make the guess. Two alternatives spring to mind. First, the compiler might be able to mark the branch to tell the processor which way to go. This is called static branch prediction. It would be ideal if there was a bit in the instruction format in which to encode the prediction, but for older architectures this is not an option, so a convention can be used instead, such as backward branches are predicted to be taken while forward branches are predicted not-taken. More importantly, however, this approach requires the compiler to be quite smart in order for it to make the correct guess, which is easy for loops but might be difficult for other branches.
The Pentium 4's severe power and heat issues demonstrated there are limits to clock speed. It turns out power usage goes up even faster than clock speed does: for any given level of chip technology, increasing the clock speed of a processor by some percentage will typically increase its power usage by an even larger percentage, because not only are the transistors switching more often, but the voltage also generally needs to be increased, in order to drive the signals through the circuits faster to reliably meet the shorter timing requirements (assuming the circuits work at all at the increased speed, of course). And while power increases linearly with clock frequency, it increases as the square of voltage, for a kind of triple whammy at very high clock speeds (f*V*V).
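The backward-taken, forward-not-taken convention can be written down directly. This is a minimal sketch with hypothetical addresses and function names; it models only the convention described above, not any particular processor's predictor.

```python
def predict_taken(branch_pc, target_pc):
    """Static convention: backward branches (target at a lower address)
    are predicted taken, forward branches are predicted not-taken."""
    return target_pc < branch_pc


def loop_branch_accuracy(iterations):
    """Accuracy of the convention on a loop-closing backward branch,
    which is taken on every iteration except the last."""
    outcomes = [True] * (iterations - 1) + [False]
    prediction = predict_taken(branch_pc=0x120, target_pc=0x100)  # hypothetical addresses
    correct = sum(1 for taken in outcomes if taken == prediction)
    return correct / iterations


print(predict_taken(0x120, 0x100), predict_taken(0x120, 0x140))
print(loop_branch_accuracy(10))
```

The convention does well on loops because the closing branch is mispredicted only once, on exit; it says nothing useful about forward branches whose direction depends on the data.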
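The f*V*V relationship can be explored numerically. The scaling factors below are hypothetical, and the function considers switching power only, ignoring leakage.

```python
def relative_power(freq_scale, voltage_scale):
    """Switching power relative to a baseline, using power ~ f * V * V.

    freq_scale    -- new clock frequency divided by the baseline frequency
    voltage_scale -- new supply voltage divided by the baseline voltage
    """
    return freq_scale * voltage_scale ** 2


# Hypothetical example: a modest clock increase that also needs more voltage.
print(relative_power(1.0, 1.0))   # baseline
print(relative_power(1.2, 1.0))   # frequency alone
print(relative_power(1.2, 1.1))   # frequency plus the voltage needed to sustain it
```

Because voltage enters squared, even a small voltage increase adds more to power than the frequency increase that motivated it.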
Earn a reduction of up to on your windstorm insurance by having one of our professionals perform a Wind Mitigation Assessment.
A thorough, professional inspection of the structure will give you a firm understanding of the buildings condition. Significantly reduce your risk and ease the decision process.
Project planning, organizing, motivating, and control to achieve your goals. Project efficiency affects your bottom line, and project management can save you.
From Wind Mitigation to ASHRAE Audits, we are your one-stop shop for energy-efficient engineering and project management.
See how Energy Assessments will optimize your building operations, cut nominal operating expenses, and reduce the cost of necessary repairs.
Competitive Bid Specification documents for upcoming repairs will pay for themselves and create clear expectations from all parties.
Specialists in design, repairs, and conservation. With over years of combined professional experience, no job is too small or too massive.
Green Engineering Project Management Green Building Design, Inc
Best WordPress Page Builders Compared and Reviewed
Creating an Equal Height Pricing Table using CSS Flexbox
See the top WordPress and web design trends coming your way this year.
Amazing Hamburger Menu Trends to Emulate
My Horrible Experience with Shared Hosting, and why VPS is the ONLY Way To Go
See how easy it is to create an equal height, responsive CSS pricing table using the power of CSS Flexbox.
Creating the Perfect Ergonomic Workspace
An elegant jQuery plugin that takes a regular UL list and transforms it into either an expanding or standard drop down menu
HighResolution clear shot or rendering, showing the architectural project in a preferably white or lighter setting
The Aim of the Architecture, Building and Structure Design Award is to attract the attention of architecture media, urban magazines, and building industry to your business by means of creating publicity and dissemination and also to separate you from the rest of the actors in the architecture sector by honoring your institution with a prestigious award.
The Architecture, Building and Structure Design award considers your submission on evaluation criteria such as Innovation, uniqueness of the project, social impacts, environment friendliness, energy utilization, and project specific criteria.
Showing the project in the setting where the design is realized, or showing some further details from the architectural project, birdeye view etc.
IADA, The International Architecture Design Awards, is a major design award category and part of the A Design Awards Competitions. Enter your spatial designs for fame, prestige, publicity and international recognition.
The winners of the A Architecture, Building and Structure Design are provided extensive and exclusive marketing and communication services to promote the success of winning the A Award. Furthermore, the winning designs appear on the A best designs book which is available worldwide, this book is furthermore distributed to the highprofile magazine editors, design oriented companies, top level managers and relevant parties. The winning designs are also exhibited in Italy as a Poster or Scale Model, and the best designs will be picked for the permanent exhibition.
Each winner design receives the trophy, published online and at our best designs book, receives a certificate and sticker templates to be attached to the products, the A seal of design excellence is also included in the winners package, this seal is valid for the entire product lifecycle without yearly fees.
A Shot or rendering showing a detailed view of the architectural project or focusing on a unique feature or planning feature
The winners of Architecture, Building and Structure Design appear in magazines, newspapers, webzines and many other publishing mediums. For concept stage projects, the A Award is an early indicator of success. This lets young architects and offices continue and improve the winning projects for added value generation. Furthermore, the A Award connects architects to a large industry base and sets up the links between the architects, architecture offices and the real estate developers.
When submitting to the Architecture, Building and Structure Design competition, always prepare a visually stunning presentation of your project; high-resolution images, renderings and mockup photos can be submitted. It is important to attach a contextual presentation and textual descriptions as a PDF document.
Call for Entries to Architectural Design Awards Press Release
The A Architecture, Building and Structure Design Competition is a specialized design competition open to both concept stage and realized architectural projects, urban design projects and buildings designed by architects, architecture offices, real estate developers and construction companies worldwide.
Award winners will be able to use a title that matches their nomination category Such as Architecture Design Award Winner.
An action shot, where users interacts with the architectural project, illustrations with people or intended users
Explaining technical aspects, blue prints, details, commercial presentation or other notes about the architectural project, explosion, layered or sectional views if present.
Unlike some other design awards and competitions, you are not obliged to make any further fees for winning the award and everything listed in the winners benefits will be provided free of charge.
A Shot or rendering of the project showing a different view, focus on details or interiors, or a view from the exterior or birdeye.
D Rotation / Exploration View, Conceptual Video, Short Feature or Advertisement of the architectural project.
The A Design Award for Architecture, Building and Structure Design is not just an award, it is the indicator of quality and perfection in design. The award is recognized worldwide and draws the attention of design-oriented companies, professionals and interest groups. The A Award attracts the eyes of contractors, key decision makers and builders worldwide; winners will be able to find better and higher profile leads and get a step ahead in their commercial life.
For realized projects, having the A Award gives added value to your designs and separates them from the rest of architecture; it is an excuse to communicate your design to the media, a reason for press releases.
The Collection: Louise Lawler, Virginia Rutledge, Kenneth Pietrobono
Mutter skeleton of the house under construction
, The Museum of Modern Art, New York, April July ,
TOGETHER A, January, February, March,
, The Museum of Modern Art, New York, April July ,
Various Artists, Cao Fei, Omer Fast, Adrian Ghenie, Lynette YiadomBoakye
Speculum From the series Homage to Luis Barragn
Ourevolving collectioncontains almost , works of modern and contemporary art. More than , works are currently available online.
Eukleides From the series Homage to Luis Barragn
X Poster Untitled, , Epson UltraChrome inkjet on linen, x inches, WG
Various Artists, Johanna Calle, Matas Duville, Maria Laet, Mateo Lpez, Nicols Paris, Rosângela Renn, Christian Vinck Henriquez
Tate Modern
The two parts of the Tate Modern building (Boiler House and Blavatnik Building) are connected on Level through Turbine Hall, Level Bridge and Level Bridge
Delve deeper into Modigliani's life and work by producing a painted portrait of your own
There is a card and cash pay phone on Level in the Boiler House, under the staircase in the lift lobby
Eor call option .., daily
The cloakroom is located on level Boiler House. The service is free of charge
Signing Art is supported by the Skills Funding Agency
Ask a member of staff at the Information desk on Level for a free token
There are seats and benches throughout Tate Modern galleries, Turbine Hall and the concourses
Our in-gallery team can help you get the most out of your visit. Whether it is how to plan your day or to share stories, ideas and opinions about the art on display, do ask us.
There are no parking facilities at Tate Modern or in the surrounding streets. Public transport is the easiest way of getting to the gallery.
Touch Tours introduce visually impaired visitors to the thematic arrangement of the displays. Tours engage with the ideas, materials and techniques of the art on display.
The Members Bar is located on Level Boiler House. Lifts are available on every floor
Ever wondered how Picasso painted his masterpieces? Find out how to paint a cubist portrait inspired by Picasso's Bust of …
Blackfriars District and Circle Line, metres approx
Tate Modern offers events and workshops for a wide range of community groups. Attendees currently include mental health service users, homeless people, adults with learning difficulties and ESOL refugee groups.
A drop off / pick up point is situated on Holland Street, near the Turbine Hall entrance (step-free access).
Audio described tours explore works on display in the permanent collection and the special exhibitions.
Once you have parked, please enter via the South Entrance across the level terrace, approximately meters from the parking spaces. This entrance offers two lifts on the right as you enter, which go to level ticket desk, cloakroom, lockers, toilets, shop, Turbine Hall.
BSL Talks at Tate are free. No booking required.
All entrances are accessible for wheelchairs, prams and buggies
or call option .., daily
Large print captions are currently not available for the permanent collection. Visitors can borrow a Magnifier from one of the Information Desks.
From here, there is escalator, lift and stair access to all levels
Join our BSL talks to discover works on display in the permanent collection and the special exhibitions.
Additional nappy-changing facilities are available in all toilets on each floor
Signing Art video Serena Cant on research skills
The Baby Care Room is located on Level Boiler House
Explore artworks from Tate's collection that respond to their social and political context
There are twelve parking spaces for disabled visitors to Tate Modern, accessed via Park Street.
Check the list of tours available on the What's on pages to learn about dates, times and meeting points.
Lead your own journey and discover more about art, artists and Tate
From here there are lifts and staircase to all floors in the Boiler House
Six wheelchairs, three mobility scooters and one walker are available at the gallery.
We are closed December, but open as usual on all other days of the year, including Bank Holidays and New Year's Day.
If you want to borrow a wheelchair you can either book in advance or ask a member of staff on arrival (subject to availability). The reservation is free.
There are eight lifts in the Blavatnik Building, situated by the main stairs on each floor
Audio described tours at Tate Modern occur times a year. Each tour lasts around one hour.
There are no parking spaces for Blue Badge Holders in the immediate surroundings of Tate Modern.
Tate trained guides deliver tours at Tate Modern in spoken English with British Sign Language interpretation. Hearing loops and amplifiers are available.
The Kitchen and Bar on Level Boiler House and the Restaurant on Level Blavatnik Building are both accessible via lifts. Both have waiter service
They are equipped with volume control and inductive couplers, and are accessible to wheelchair users
Large bags and rucksacks must be placed in the cloakroom
Please join us in the Community Room in the Blavatnik Building, next to Terrace Café, for tea and coffee from . before the tour.
Information .., daily; Membership and ticketing services .., daily
There is an entrance on the south side of the building on Sumner Street with direct access to the Level Bridge and to accessible lifts, which go to level ticket desk, cloakroom, lockers, toilets, shop, Turbine Hall
If you have any other access needs that you would like to let us know about, please contact Public Programmes and use the subject line AESS, or call .
Audio described tours at Tate are free. No booking required.
Large print gallery plans are available on request from the Information desks.
Visit this vast, iconic space for large-scale sculpture and site-specific installation art
The Tate Boat runs every forty minutes along the Thames between Tate Britain and Tate Modern. Other river services run between Millbank Pier and Bankside Pier.
Call option daily ..
Description is followed by information about the artist and the context in which the work was made.
The nearest accessible parking bays are on Emerson Street (long-stay bay), Hopton Street (long-stay and short-stay bays) and New Globe Walk (long-stay bay).
Eat and drink while enjoying panoramic views of London's skyline
We welcome guide dogs. A drinking bowl is available at the cloakroom on level just ask a member of staff.
Staff at the Information desks will call the Dial-a-Ride service for visitors
Multimedia guides are available from Level guide desk. The guide is free for deaf and hearing impaired visitors.
Chinese artist Zhou Tao joins us to present the UK premiere of this stunningly surreal film
Last admission and ticket sales for special exhibitions Sunday Thursday is at . and . on Friday Saturday. Ticket desks close at this time.
This introduces you to some of the best-loved artworks in the Tate collection
Two lifts between levels and link the Blavatnik Building entrance and the Tickets and Information desks in the Turbine Hall
A flight of steps with a handrail runs alongside the length of the ramp
Routes , and stop on Blackfriars Bridge Road
Folding seats are available just ask a member of staff or pick up a stool from the racks located on the concourses on levels , and
Playfully subversive artists SUPERFLEX have filled Tate Modern's Turbine Hall with swings
Discover the imaginative films of Canadian artist Tamara Henderson in this short film programme
There are water fountains by the toilets on levels , and in the Boiler House
Tours take place seated in the gallery and are delivered by visually impaired artists. Hearing loops and amplifiers are available.
To book call option .., daily
The audioguide is free of charge for blind and partially sighted visitors.
Buggies can also be stored there, subject to space availability
The handheld computer plays video clips of interpreters signing a tour of highlights of the displays, as well as presenting visual interactive content such as games and opinion polls.
A Changing Places toilet is available on Level Boiler House. The key is available from the cloakroom on Level . It offers
Admission to Tate Modern is free, but there is a charge for special exhibitions.
Find out whether an event at Tate Modern has a hearing loop on the What's on pages.
We welcome guide dogs, hearing dogs and assistance dogs in the gallery. Drinking bowls are available from the cloakroom; just ask a member of staff.
Signing Art video Marcus Dickey Horley on professionalism in museums and galleries
the Terrace shop on Level , Blavatnik Building
The following areas of Tate Modern are fitted with a hearing loop
See performances and interactive art and video installations; these galleries celebrate new art
There are no water fountains in the Blavatnik Building
Touch tours typically include a sculpture that can be explored through direct handling and a number of other two- and three-dimensional works that are explored using a combination of raised images, handling objects, description and discussion.
Cycle Hire Docking Stations are located on New Globe Street and Southwark Street.
Admission to Tate Modern is free, except for special exhibitions. Become a Tate Member or a Patron and get free entry to all special exhibitions. Visitors with a disability pay a concessionary rate, and carers' entrance is free. Under s go free (up to four per parent or guardian), and family tickets are available (two adults and two children years); see individual exhibitions for more information.
See how artists in Tate's collection have responded to the impact of mass media
The tour provides on-demand interpretation for deaf and hearing impaired visitors in their preferred language.
Signing Art Project In a Box A guide on how to implement your own training programme
Signing Art was a training programme for Deaf British Sign Language (BSL) users interested in developing the skills needed to become gallery guides. The programme included sessions on research skills for art, professionalism, and how best to present to a Deaf audience in a gallery setting. These three sessions have been replicated as videos here, in BSL or with BSL translation throughout, as a reference for those interested in learning more.
Investigate the processes artists use to make artworks, and how our responses are integral to the piece
Please note your name, contact details, date and time of visit are required to make the booking.
Signing Art video Signing Art John Wilson on how to present to a Deaf audience
You can pick up the audioguide from the Multimedia desk on Level .
St Paul's Central Line, , metres approx.
Is there more to life modelling than posing nude?
Discover how artists working between the s and the s opened up new spaces for participation
Coloured overlays and magnifiers are available at the Information desks and at the special exhibition entrances
A drop off / pick up point is situated on Southwark Street, a short walk from the main entrance.
Artists around the world have examined the modern city in a range of works
Call daily ..
Check the list of tours available on the What's on pages.
This international symposium will explore the role that gender has played in the development of Chinese contemporary art
The Members Room is located on Level Blavatnik Building. Lifts are available on every floor
From here there are lifts to all levels in the Blavatnik Building and access to the Boiler House via Level Bridge
Information desks are located near the River entrance, in the Turbine Hall and in the Clore Welcome Room on Level . Staff are happy to help with any questions you have.
Please note your name, contact details, date and time of visit are required to make the booking.
See works that create a dialogue with the materials and spaces of everyday buildings
Sighted companions and guide dogs are welcome.
If you have booked a scooter or a wheelchair, these can also be collected from the South Entrance (step-free access).
The audioguide is designed to give you an insight into the artworks featured once you are in front of them. It is not designed to navigate you through the gallery to each work. A separate guide will help you with finding each artwork.
The Tate Modern audioguide talks you through a selection of the most iconic works in the collection displays, including international paintings and sculptures spanning a century.
View recently acquired works from Tate's collection
See some of the most memorable art of the th century at Tate Modern's comprehensive retrospective of Modigliani's work
Go behind the scenes and discover how we recreated the artist's final Parisian studio in virtual reality
Supported by the Lord Leonard and Lady Estelle Wolfson Foundation
Large print exhibition guides are available at all special exhibitions at Tate Modern. Please ask at the exhibition entrance.
Magnifiers and coloured overlays are available from the Information desks and the exhibition entrances.
Discover how the world's largest Soviet art and design collection came to be
A month-by-month journey through Picasso's year of wonders
A drop off / pick up point is situated on Holland Street, just outside the main entrance.
Self-service lockers are located in the Tanks lobby on Level Blavatnik Building
The entrance is located on the south side of the building and it provides access directly to the Terrace Bar and Terrace shop
Blackfriars metres from the South exit; metres from the North exit.
There are four lifts in the Boiler House part of the building, situated by the main stairs on each floor. You can reach the lifts from the River entrance, the Turbine Hall entrance and the Café entrance
For out of hours emergencies call
Visitors with a disability pay a concessionary rate, and a companion's or carer's entrance is free. Details are available on each special exhibition page.
You may be requested to leave briefcases, bags or umbrellas in the cloakroom
There is an additional entrance to the Café on Level next to the Turbine Hall entrance on Holland Street
Fully accessible toilets are located on every floor, on the concourses.
Please book these spaces in advance, giving at least hours notice.
Each work is thoroughly described as the narrator draws your attention to each feature of the painting, sculpture or installation.
Behold awe-inspiring ᵒ views of the London skyline, from high above the River Thames
Serving bespoke drinks as well as a delicious selection of cakes and pastries
If you want to borrow a mobility scooter please book in advance giving at least hours notice. The reservation is free. You must have driven a mobility scooter before to use one at Tate Modern.
Discover a range of artist books and specially selected products including designer collaborations, jewellery and prints
Book a group visit at Tate Modern, get a group discount on exhibitions, or hire group guiding equipment
Large print captions are available in all special exhibitions. Ask at the exhibition entrance
Discover the confidence and creativity needed to undress in the name of art
The river entrance is situated either side of the chimney base. It provides access directly to level
The Café on Level Boiler House and the Terrace Bar on Level Blavatnik Building are accessible from street level. The Espresso Bar on Level Boiler House is accessible via lifts in the Boiler House. All are self-service, but staff are happy to help if you need assistance
We ask that visitors book at least a week in advance, although it may be possible to arrange tours at shorter notice. Tours can be arranged for any time during normal gallery opening hours.
Discover artists from Tate's collection who have embraced new and unusual materials and methods
Welcome to Lake Country Building Designs, your fully insured Building Design specialist with years of experience in commercial and residential construction.
Steve Boughton holds licences for both Building and Building Design in NSW, Qld and SA, so you can be assured of the best technical advice available in the industry when dealing with Lake Country Building Design regarding the planning of your new home.