3D-stacked DRAMs can significantly increase cell density and bandwidth while also lowering power consumption. However, 3D structures experience significant thermomechanical stress because their constituent materials, which have different coefficients of thermal expansion, contract at different rates; this stress degrades circuit performance. This paper develops a procedure for the performance analysis of 3D DRAMs, capturing the impact of both layout-aware and layout-independent stress on parameters such as latency, leakage power, refresh power, area, and bus delay. The approach first proposes a semianalytical stress analysis method for the entire 3D DRAM structure, capturing the stress induced by TSVs, microbumps, package bumps, and warpage. Next, this stress is translated into variations in device mobility and threshold voltage, after which analytical models for latency, leakage power, and refresh power are derived. Finally, a complete analysis of performance variations is performed for various 3D DRAM layout configurations to assess the impact of layout-dependent stress. To mitigate the performance impact of stress, we explore alternative flexible package substrate options; specifically, we show that a bendable package substrate made of polyimide reduces warpage-induced stress, lowers stress-induced variations, and improves the performance metrics of stacked 3D DRAMs.
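The stress-to-performance translation described above can be sketched to first order as follows. This is a minimal illustration, not the paper's model: the piezoresistive coefficient, Vth sensitivity, and nominal device parameters below are all assumed placeholder values.

```python
import math

# Hypothetical first-order coefficients (placeholders, not from the paper):
PI_L = 45e-11        # piezoresistive coefficient [1/Pa] (assumed)
K_VTH = 0.05e-9      # threshold-voltage sensitivity to stress [V/Pa] (assumed)
N_VT = 1.5 * 0.0259  # subthreshold slope factor times thermal voltage [V]

def mobility_scaling(stress_pa):
    """Fractional change in carrier mobility under mechanical stress."""
    return 1.0 + PI_L * stress_pa

def vth_shift(stress_pa):
    """Linear threshold-voltage shift induced by stress."""
    return K_VTH * stress_pa

def leakage_scaling(stress_pa):
    """Subthreshold leakage scales with mobility and exp(-dVth / (n*Vt))."""
    return mobility_scaling(stress_pa) * math.exp(-vth_shift(stress_pa) / N_VT)

# Example: 100 MPa tensile stress near a TSV
s = 100e6
print(mobility_scaling(s))  # mobility up by 4.5% under these assumed constants
print(leakage_scaling(s))
```

A model of this shape makes the chain in the abstract concrete: a stress map feeds the mobility and Vth terms, which in turn feed the latency and leakage expressions.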
This paper presents a comprehensive survey of time-multiplexed (TM) FPGA overlays from the research literature. These overlays are categorized by implementation into two groups: processor-based overlays, whose implementation follows that of conventional silicon-based microprocessors, and CGRA-like overlays, built from either an array of interconnected processor-based functional units or medium-grained arithmetic functional units. Time-multiplexing the overlay allows it to change its behavior on a cycle-by-cycle basis during execution of the application kernel, enabling better sharing of the limited FPGA hardware resources. However, most TM overlays suffer from large resource overheads, due either to the underlying processor-like architecture (for processor-based overlays) or to the routing array and instruction storage requirements (for CGRA-like overlays). Reducing the area overhead of CGRA-like overlays, particularly that of the routing network, and better utilizing the hard macros in the target FPGA are active areas of research.
Heterogeneous multichip architectures have gained significant interest in high-performance computing clusters to cater to a wide range of applications. In particular, heterogeneous systems with multiple multicore CPUs, GPUs, and memory modules have become commonplace to meet application requirements. Shared resources such as the interconnection network in such systems pose significant challenges due to the diverse traffic requirements of CPUs and GPUs. Notably, the performance and energy consumption of inter-chip communication have remained a major bottleneck due to the limitations of off-chip wired links. To overcome these challenges, we propose a wireless interconnection network that provides energy-efficient, high-performance communication in heterogeneous multichip systems. Interference-free communication between GPUs and memory modules is achieved through directional wireless links, while omnidirectional wireless interfaces connect the CPU cores with the other components in the system. Besides providing low-energy, high-bandwidth inter-chip communication, the wireless interconnect scales efficiently with system size to provide high performance across multiple chips. The proposed inter-chip wireless interconnect is evaluated on two system sizes with multiple CPU and GPU chips along with main memory modules. On a system with 4 CPU and 4 GPU chips, application runtime is sped up by 3.94×, packet energy is reduced by 94.4%, and packet latency is reduced by 58.34% compared to a baseline system with a wired inter-chip interconnect.
In the era of short channel lengths, Dynamic Thermal Management (DTM) has become a challenging task for the architects and designers of modern Chip Multi-Processors (CMPs). The ever-increasing demand for processing power, combined with advanced integration technology, produces power-dense CMPs, which in turn increases effective chip temperature. This elevated temperature raises reliability concerns for the chip circuitry and significantly increases leakage power consumption. Recent DTM techniques apply DVFS or task migration to reduce the temperature of the cores, the hottest on-chip components. To meet the high data demand of these cores, most modern CMPs are equipped with large multi-level on-chip caches, of which the on-chip Last Level Caches (LLCs) occupy the largest area. These LLCs account for significant leakage power, a principal component of total on-chip power consumption. Since power consumption is the main driver of heat dissipation, this work dynamically shrinks the cache, subject to a performance constraint, to reduce LLC leakage. The turned-off cache portions act as on-chip thermal buffers that reduce the average and peak temperature of the CMP without affecting computation. Simulation results show that, at a minimal performance penalty, the proposed cache-based thermal management with an 8MB centralized multi-banked shared LLC yields more than a 5°C reduction in peak and average chip temperature, comparable with a greedy DVFS policy.
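The shrink-while-meeting-a-performance-constraint idea can be sketched as a simple per-interval controller. This is a hypothetical illustration, not the paper's mechanism: the way granularity, miss-rate bound, and control policy below are assumed.

```python
# Assumed parameters for illustration only:
TOTAL_WAYS = 16          # ways in the shared LLC
MISS_RATE_BOUND = 0.02   # max tolerated miss-rate increase over baseline

def resize_llc(active_ways, baseline_miss, current_miss):
    """One control interval: turn a way off when there is performance
    slack, turn one back on when the constraint is violated."""
    degradation = current_miss - baseline_miss
    if degradation <= MISS_RATE_BOUND and active_ways > 1:
        return active_ways - 1   # slack: power off one more way (thermal buffer)
    if degradation > MISS_RATE_BOUND and active_ways < TOTAL_WAYS:
        return active_ways + 1   # constraint violated: re-enable a way
    return active_ways

print(resize_llc(16, 0.10, 0.105))  # slack -> shrink to 15 ways
print(resize_llc(8, 0.10, 0.14))    # too many misses -> grow to 9 ways
```

The powered-down ways correspond to the "thermal buffer" regions the abstract describes: silicon that dissipates no dynamic or leakage power and can absorb heat from neighboring hot cores.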
As the device capacity of Dynamic Random Access Memory (DRAM) increases, the refresh operation becomes a significant contributor to the total power consumption and a drag on the memory throughput of the device. To reduce the problems associated with refresh, multi-rate refresh techniques that change the refresh period based on the retention time of DRAM cells have been proposed. Unfortunately, multi-rate refresh has a scalability issue because the additional storage and logic overhead on the memory controller grows with device capacity. In this paper, we propose a novel redundancy repair technique that increases the refresh period of DRAM by using universal hashing. Our redundancy repair technique efficiently repairs both hard faults, which occur during the manufacturing process, and weak cells with short retention times, using the spare elements remaining after manufacturing repair. Our technique also addresses the Variable Retention Time (VRT) problem by repairing weak cells at boot time, exploiting Built-In Self-Repair (BISR) and Error Correction Codes (ECC). It outperforms conventional BISR redundancy repair with very little hardware overhead and ensures reliability at an extended refresh period across the entire system. In particular, our experimental results show that our BISR technique achieves a 100% repair rate at a 384ms refresh period under a 1.0e-6 hard-fault BER configuration, and reduces refresh energy consumption by 83.9% compared to a 64ms refresh and by 12% compared to conventional multi-rate refresh on a state-of-the-art 4Gb device.
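The core idea of mapping weak or faulty rows to spare rows with a universal hash can be sketched as follows. This is an assumed illustration of the general technique, not the paper's circuit: the hash family h(x) = ((a·x + b) mod p) mod n and the boot-time parameter search below are standard universal-hashing constructions, with all sizes chosen arbitrarily.

```python
import random

P = 2_147_483_647  # a large prime for the universal hash family

def find_hash(weak_rows, n_spares, trials=1000, seed=0):
    """Search at boot time for hash parameters (a, b) under which every
    weak/faulty row maps to a distinct spare row (no collisions)."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.randrange(1, P)
        b = rng.randrange(0, P)
        slots = {((a * r + b) % P) % n_spares for r in weak_rows}
        if len(slots) == len(weak_rows):  # collision-free: all rows repairable
            return a, b
    return None  # no collision-free mapping found within the trial budget

weak = [17, 4093, 70001, 123456]  # hypothetical weak/faulty row addresses
params = find_hash(weak, n_spares=8)
print(params is not None)
```

Storing only the pair (a, b) instead of a full remapping table is what keeps the controller-side overhead small as device capacity grows.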
On-Chip Reconfigurable CMOS Analog Circuit Design and Automation Against Aging Phenomena: Sense and React
As process variations increase and devices become more diverse in their behavior, using the same test list for all devices is increasingly inefficient. Methodologies that adapt the test sequence to the lot, the wafer, or even each device's own behavior help contain test cost while maintaining test quality. In adaptive test selection approaches, the initial test list, the set of tests applied to all devices to gather information, plays a crucial role in the quality of the outcome. Most adaptive test approaches select this initial list based on the fail probability of each test individually. Such a selection does not take into account the correlations that exist among various measurements and can lead to the selection of correlated tests. In this work, we propose a new adaptive test algorithm that includes a mathematical model for initial test ordering that takes correlations among measurements into account. The proposed method can be integrated within an existing test flow, running in the background, to improve not only test quality but also test time. Experimental results using four distinct industry circuits and large amounts of measurement data show that the proposed technique considerably outperforms prior approaches.
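One way to make the correlation-aware ordering concrete is a greedy selection that rewards fail probability but penalizes redundancy with already-selected tests. This sketch is an assumption for illustration, not the paper's mathematical model; the penalty weight and scoring rule are invented.

```python
def order_tests(fail_prob, corr, k, penalty=0.5):
    """Greedily pick k initial tests: highest fail probability, penalized
    by the maximum absolute correlation with tests already selected."""
    selected = []
    remaining = set(range(len(fail_prob)))
    while len(selected) < k and remaining:
        def score(t):
            redundancy = max((abs(corr[t][s]) for s in selected), default=0.0)
            return fail_prob[t] - penalty * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

fail_p = [0.30, 0.28, 0.10]
# Tests 0 and 1 are nearly redundant; test 2 is independent of both.
corr = [[1.0, 0.9, 0.1],
        [0.9, 1.0, 0.1],
        [0.1, 0.1, 1.0]]
print(order_tests(fail_p, corr, k=2))  # picks 0, then 2 over the correlated 1
```

A purely fail-probability-based selection would pick tests 0 and 1 here; the correlation penalty instead yields a list that observes more independent information per applied test.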
Emerging computational resistive memory is a promising candidate to overcome DRAM challenges and the memory wall bottleneck. However, its cell-level and array-level nonideal properties significantly degrade the reliability, performance, accuracy, and energy efficiency of memory access and analog computation. Cell-level nonidealities include nonlinearity, asymmetry, and variability; array-level nonidealities include interconnect resistance, parasitic capacitance, and sneak paths. This review summarizes solutions that mitigate the impact of these nonideal properties. Firstly, we introduce several typical resistive memory devices, focusing on their switching modes and characteristics. Secondly, we review resistive memory cells and memory array structures, including 1T1R, 1R, 1S1R, 1TnR, and CMOL, and overview 3D cross-point arrays and their structural properties. Thirdly, we analyze the impact of cell-level and array-level nonideal properties during memory access and analog arithmetic, focusing on dot-product operations and matrix-vector multiplication. Fourthly, we discuss how to mitigate these nonidealities through static physical and geometric parameter optimization and dynamic runtime optimization, from the viewpoint of cell-array interaction and codesign. Dynamic runtime schemes include line connection, voltage bias, logical-to-physical mapping, state partition, read-reference setting, and switching-mode reconfiguration. We also highlight challenges for multilevel-cell cross-point arrays and 3D cross-point arrays during these operations. Finally, we survey peripheral circuit design considerations and portray a unified reconfigurable computational memory architecture.
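The crossbar matrix-vector multiplication and the effect of interconnect resistance can be illustrated with a crude first-order sketch. This is an assumed simplification, not a full circuit solve: each cell's conductance is degraded by a wire resistance proportional to its position in the array, and the per-segment resistance is an invented value.

```python
def crossbar_mvm(voltages, conductances, r_wire=0.0):
    """Column currents I_j = sum_i V_i * G_eff(i, j), where each cell sees
    an assumed series wire resistance growing with its row/column index."""
    rows, cols = len(conductances), len(conductances[0])
    currents = [0.0] * cols
    for j in range(cols):
        for i in range(rows):
            # crude position-dependent line-resistance model (assumption)
            g_eff = 1.0 / (1.0 / conductances[i][j] + r_wire * (i + j + 2))
            currents[j] += voltages[i] * g_eff
    return currents

V = [0.2, 0.2]                       # input voltages (one per row)
G = [[1e-4, 2e-4],                   # cell conductances in siemens
     [3e-4, 4e-4]]
ideal = crossbar_mvm(V, G)           # no wire resistance
degraded = crossbar_mvm(V, G, r_wire=2.0)
print(ideal, degraded)               # degraded currents fall below ideal
```

Even this toy model shows why the review's array-level mitigations (line connection, voltage bias, logical-to-physical mapping) matter: cells far from the drivers see more series resistance and contribute systematically smaller currents to the analog dot product.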