3D-stacked DRAMs can significantly increase cell density and bandwidth while lowering power consumption. However, 3D structures experience significant thermomechanical stress because their constituent materials, which have different coefficients of thermal expansion, contract at different rates; this stress impacts circuit performance. This paper develops a procedure for performance analysis of 3D DRAMs that captures the impact of both layout-aware and layout-independent stress on parameters such as latency, leakage power, refresh power, area, and bus delay. The approach first proposes a semianalytical stress analysis method for the entire 3D DRAM structure, capturing the stress induced by TSVs, microbumps, package bumps, and warpage. Next, this stress is translated into variations in device mobility and threshold voltage, after which analytical models for latency, leakage power, and refresh power are derived. Finally, a complete analysis of performance variations is performed for various 3D DRAM layout configurations to assess the impact of layout-dependent stress. To mitigate the performance impact of stress, we explore an alternative flexible package substrate: a bendable substrate made of polyimide that reduces warpage-induced stress. We show that it reduces stress-induced variations and improves the performance metrics of stacked 3D DRAMs.
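The stress-to-performance translation step can be sketched generically with the well-known alpha-power delay model; the function name, parameter values, and shift magnitudes below are illustrative assumptions, not the paper's calibrated models.

```python
# Illustrative sketch: how stress-induced mobility (dmu_frac) and
# threshold-voltage (dvth) shifts perturb gate delay, using the generic
# alpha-power law t_d ~ Vdd / (mu * (Vdd - Vth)^alpha). All constants
# are hypothetical placeholders.

def delay_ratio(dmu_frac, dvth, vdd=1.0, vth0=0.3, alpha=1.3):
    """Return Delay(stressed) / Delay(nominal), normalized to 1.0."""
    nominal = vdd / (1.0 * (vdd - vth0) ** alpha)
    stressed = vdd / ((1.0 + dmu_frac) * (vdd - (vth0 + dvth)) ** alpha)
    return stressed / nominal

# Example: tensile stress that boosts mobility by 5% with a 10 mV
# threshold-voltage increase; the mobility gain dominates here,
# so the ratio drops below 1 (the gate speeds up).
print(delay_ratio(dmu_frac=0.05, dvth=0.01))
```

In a full flow such a per-device model would be evaluated over the spatial stress map produced by the semianalytical analysis, since stress (and hence the shifts) varies with position relative to TSVs and bumps.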
Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) is a promising candidate for large on-chip memories as a zero-leakage, high-density, and non-volatile alternative to present SRAM technology. Since memories are the dominant component of a system-on-chip, the overall performance of the system depends heavily on them. However, the high write energy and latency of emerging STT-MRAM are among the most challenging design issues in modern computing systems. By relaxing the non-volatility of these devices, the write energy and latency can be reduced at the expense of a shorter retention time, which in turn may lead to data loss. In this paper, we propose a hybrid STT-MRAM cache design with regions of different retention capabilities. Based on the application requirements (i.e., execution time and memory access rate), the program data layout is then rearranged at compilation time to achieve a fast and energy-efficient hybrid STT-MRAM on-chip memory with no reliability degradation. The application requirements are defined at function granularity using profiling and static code analysis, which estimate the required retention time and the memory access rate, respectively. Experimental results show that the proposed hybrid STT-MRAM cache, combined with profiling-based and compiler-level analysis for data rearrangement, reduces the write energy per access by 49.7% on average. At the system level, the overall static and dynamic energy of the cache are reduced by 8.1% and 44%, respectively, while system performance improves by up to 8.1%.
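The placement decision described above can be sketched as a simple compile-time rule; the function names, profile fields, and retention threshold below are hypothetical, and the real framework's estimates come from profiling and static analysis rather than hand-written numbers.

```python
# Minimal sketch (not the paper's algorithm) of retention-aware data
# placement for a hybrid STT-MRAM cache: data whose estimated lifetime
# fits within the relaxed retention time goes to the low-write-energy
# region; everything else stays in the fully non-volatile region, so
# no data is ever lost to retention failures.

def place_data(functions, short_retention_ms):
    """functions: list of dicts with hypothetical profiling fields
    'name', 'lifetime_ms' (estimated data lifetime), and
    'writes_per_kilo_insn' (estimated write intensity)."""
    placement = {}
    for f in functions:
        if f["lifetime_ms"] <= short_retention_ms:
            placement[f["name"]] = "short-retention"   # fast, cheap writes
        else:
            placement[f["name"]] = "long-retention"    # standard STT-MRAM
    return placement

profile = [
    {"name": "fft",    "lifetime_ms": 2.0,   "writes_per_kilo_insn": 40},
    {"name": "logger", "lifetime_ms": 900.0, "writes_per_kilo_insn": 3},
]
print(place_data(profile, short_retention_ms=10.0))
```

A more faithful version would also weigh the write rate, since short-lived *and* write-heavy data benefits most from the relaxed-retention region.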
Design-for-manufacturability (DFM) guidelines are recommended layout design practices intended to capture layout features that are difficult to manufacture correctly. Avoiding such features prevents the occurrence of potential systematic defects. However, layout features that cause DFM guideline violations cannot always be avoided completely because of design constraints on chip area, performance, and power consumption. A framework for translating DFM guideline violations into potential systematic defects, and faults, was described earlier. In a cell-based design, the translated faults may be internal or external to cells. In this article we focus on undetectable faults that are external to cells. Using a resynthesis procedure that makes fine changes to the layout while maintaining the design constraints, we target areas of the design where large numbers of external faults related to DFM guideline violations are undetectable. By eliminating the corresponding DFM guideline violations, we ensure that the circuit does not contain low-coverage areas that may allow systematic defects to escape detection and fail the circuit in the field. The layout resynthesis procedure is applied to benchmark circuits and logic blocks of the OpenSPARC T1 microprocessor. Experimental results indicate a significant improvement in the coverage of potential systematic defects.
Machine learning is a powerful lever for developing, improving, and optimizing test methodologies to cope with the demands of advanced nodes. Ensemble methods are a learning paradigm that uses multiple models to boost performance. In this work, ensemble reduction and learning are explored for integrated circuit test and diagnosis. For testing, the proposed method reduces the number of system-level tests without incurring a substantial increase in defect escapes or yield loss, thereby saving significant test-execution and setup-preparation cost. Experiments are performed on two designs of commercially fabricated chips, for an overall population of more than 264,000 chips. The results demonstrate that our method reduces the number of tests by 29.27% and 21.74% for the two chips, respectively, at the cost of very few defect escapes. For failure diagnosis, the framework predicts the amount of test data necessary for accurate failure diagnosis. Experiments performed on five standard benchmarks demonstrate that our method outperforms a state-of-the-art approach in terms of data-volume reduction. The proposed ensemble-based methodology creates opportunities for improving test and diagnosis efficiency.
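The ensemble-reduction idea can be sketched with a toy backward-elimination loop; this is an illustrative simplification under invented data, not the paper's method. Each ensemble member here is a trivial one-measurement pass/fail rule, and members whose removal does not hurt validation accuracy are pruned, meaning the tests they rely on need not be applied.

```python
# Toy ensemble reduction for test selection (illustrative only): drop
# ensemble members greedily as long as validation accuracy does not
# degrade. Redundant (correlated) or uninformative tests get pruned.

def majority_vote(members, sample):
    votes = sum(1 if rule(sample) else -1 for rule in members)
    return votes >= 0           # True = predicted pass

def accuracy(members, samples, labels):
    correct = sum(majority_vote(members, s) == y
                  for s, y in zip(samples, labels))
    return correct / len(samples)

def reduce_ensemble(members, samples, labels):
    base = accuracy(members, samples, labels)
    kept = list(members)
    for m in list(kept):        # greedy backward elimination
        trial = [x for x in kept if x is not m]
        if trial and accuracy(trial, samples, labels) >= base:
            kept = trial
    return kept

rules = [
    lambda s: s["t0"] < 1.0,    # t0 never fails in this validation data
    lambda s: s["t1"] < 2.0,
    lambda s: s["t2"] < 2.0,    # t2 duplicates t1 in this data
]
samples = [
    {"t0": 0.5, "t1": 1.0, "t2": 1.0},   # pass
    {"t0": 0.4, "t1": 3.0, "t2": 3.0},   # fail on t1/t2
    {"t0": 0.6, "t1": 1.5, "t2": 1.5},   # pass
    {"t0": 0.7, "t1": 2.5, "t2": 2.5},   # fail
]
labels = [True, False, True, False]
kept = reduce_ensemble(rules, samples, labels)
print(len(rules), "->", len(kept))       # prints "3 -> 1"
```

On this toy validation set the rule for t0 is also pruned because t0 never fails there; a production flow would guard against this with larger data and a defect-escape budget.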
MEMS-based sensor circuits are traditionally designed separately using CAD tools specific to each energy domain (electrical and mechanical). This paper presents a complete approach to combined MEMS-IC robustness optimization. Advanced methods for robustness analysis and optimization considering design, operating, and process parameters, originally developed for integrated circuits, are transferred to MEMS-IC systems. Both electrical and mechanical design and process parameters are included in the optimization. The methodology is exemplified on two demonstrators: a MEMS microphone and a MEMS accelerometer, each with an integrated readout circuit. A successful optimization requires the simultaneous inclusion of design parameters and process tolerances from both energy domains. To save CPU time, a reduced-order, circuit-level model is used for the MEMS part, and this model is regenerated only when necessary. To integrate the generation of the reduced model into the optimization flow, a simulation-in-a-loop flow based on commercial tools for both the electrical and the mechanical domains has been implemented.
Real-time systems continuously interact with the physical environment and often have to satisfy stringent timing constraints imposed by those interactions. Such systems must provide two main properties: reactivity and predictability. Reactivity allows the system to continuously react to a non-deterministic external environment, while predictability guarantees the deterministic execution of safety-critical parts of applications. However, as software complexity increases, traditional approaches to developing real-time systems make temporal behavior difficult to infer, especially when the system must handle non-deterministic aperiodic events from the physical environment. In this paper, we propose a reactive and predictable programming framework, Distributed Clockwerk (DCW), for distributed real-time systems. DCW introduces the Servant, a non-preemptible execution entity, to implement periodic tasks based on the Logical Execution Time (LET) model. Furthermore, a joint scheduling policy based on the slack-stealing algorithm is proposed to handle aperiodic events efficiently without violating hard timing constraints. To support predictable communication among distributed nodes, DCW implements the Time-Triggered Controller Area Network (TTCAN) to avoid collisions when accessing the shared communication medium. Moreover, the programming framework provides a set of APIs for defining the timing and functional behaviors of concurrent tasks. An example is implemented to illustrate the DCW design flow. The evaluation results demonstrate that our proposal improves both periodic and aperiodic reactivity compared with existing work, and that the implemented DCW ensures system predictability while incurring extremely low overheads.
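The slack-stealing admission idea can be illustrated with a toy check; this is not DCW's actual scheduler, and for simplicity it assumes no periodic work remains pending before the next release (a real slack-stealing algorithm also accounts for committed periodic demand).

```python
# Toy slack-stealing admission test (illustrative only): an aperiodic
# job is admitted only if it can finish before the next release of any
# periodic servant, so hard periodic deadlines are never endangered.

def next_release(now, periods):
    """Earliest upcoming release instant across all periodic servants."""
    return min(((now // p) + 1) * p for p in periods)

def can_steal_slack(now, aperiodic_wcet, periods):
    return now + aperiodic_wcet <= next_release(now, periods)

periods = [10, 25]                                   # servant periods (ms)
print(can_steal_slack(now=3, aperiodic_wcet=5, periods=periods))  # True: fits before t=10
print(can_steal_slack(now=8, aperiodic_wcet=5, periods=periods))  # False: would overrun
```

Under the LET model the periodic servants read inputs and publish outputs only at their logical boundaries, which is what makes an admission test like this sufficient to preserve deterministic periodic behavior.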
Fault-Tolerance Technique for Offlining Faulty Blocks by Heap Memory Management
On-Chip Reconfigurable CMOS Analog Circuit Design and Automation Against Aging Phenomena: Sense and React
As process variations increase and devices become more diverse in their behavior, using the same test list for all devices is increasingly inefficient. Methodologies that adapt the test sequence to the lot, the wafer, or even the device's own behavior help contain test cost while maintaining test quality. In adaptive test-selection approaches, the initial test list, a set of tests applied to all devices to gather information, plays a crucial role in the quality outcome. Most adaptive test approaches select this initial list based on the fail probability of each test individually. Such a selection does not take into account the correlations that exist among the various measurements and can therefore lead to the selection of correlated tests. In this work, we propose a new adaptive test algorithm that includes a mathematical model for initial test ordering that takes correlations among measurements into account. The proposed method can be integrated into an existing test flow, running in the background, to improve not only test quality but also test time. Experimental results using four distinct industrial circuits and large amounts of measurement data show that the proposed technique considerably outperforms prior approaches.
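The difference between ranking tests individually and accounting for correlations can be sketched with a toy greedy ordering; this is a simplification under invented data, not the paper's mathematical model. Picking, at each step, the test that fails the most devices not yet caught makes a perfectly correlated duplicate test worthless, whereas a per-test fail-probability ranking would place it near the top.

```python
# Illustrative correlation-aware initial test ordering (not the paper's
# model): greedily pick the test that catches the most failing devices
# not yet covered by the tests already selected.

def order_initial_tests(fail_sets, k):
    """fail_sets: {test_name: set of device ids failing that test}."""
    covered, order = set(), []
    remaining = dict(fail_sets)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order

fail_sets = {                       # hypothetical historical fail data
    "leakage":  {1, 2, 3},
    "leakage2": {1, 2, 3},          # fully correlated with "leakage"
    "speed":    {4, 5},
    "contact":  {3, 6},
}
print(order_initial_tests(fail_sets, k=3))
# -> ['leakage', 'speed', 'contact']; the redundant "leakage2" is never picked
```

A fail-probability-only ranking would have chosen "leakage" and "leakage2" first, wasting one of the three initial-list slots on information already collected.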
Packet classification is the enabling function in commodity switches for providing services such as access control, intrusion detection, and load balancing. Ternary Content Addressable Memories (TCAMs) are the de facto standard for performing packet classification at high speed. However, TCAMs are expensive in terms of both cost and power consumption, forcing switch vendors to invest considerable effort in power management. Hence, power-efficient solutions for TCAM-based packet classification remain highly relevant. In this paper, we propose a novel rule-placement algorithm based on the unique field values present in the rule databases. We evaluate the total search space that must be inspected under the traditional placement approach and under the proposed placement approach, which is based on the information content of the fields. Simulation results show an average reduction of 30.55% in the search space with the proposed placement approach, resulting in an average reduction of 18.85% in per-search energy over the TCAM. With typical TCAM clock speeds ranging between 200 and 400 MHz, this reduction in per-search energy translates into a large reduction in the total energy consumed by TCAM-based network switches. The proposed solution is plug-and-play, requiring only minimal preprocessing within the Network Processing Unit (NPU) of switches and edge routers.
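The search-space reduction can be sketched with a toy value-based placement; the partitioning field, rule format, and matching logic below are illustrative assumptions, not the paper's algorithm. Rules are grouped into blocks keyed by a field with few distinct values (here the protocol field), so a lookup activates only the matching block plus a wildcard block instead of the whole TCAM.

```python
# Toy value-based rule placement (illustrative only): bucket rules by
# the protocol field; a lookup searches one bucket plus the wildcard
# bucket rather than every rule in the TCAM.

def build_blocks(rules):
    blocks = {}
    for rule in rules:
        blocks.setdefault(rule["proto"], []).append(rule)  # '*' = wildcard
    return blocks

def classify(packet, blocks):
    searched = blocks.get(packet["proto"], []) + blocks.get("*", [])
    for rule in searched:                   # first match wins
        if rule["dport"] in ("*", packet["dport"]):
            return rule["action"], len(searched)
    return "default", len(searched)

rules = [
    {"proto": "tcp", "dport": 80,  "action": "permit"},
    {"proto": "tcp", "dport": 22,  "action": "deny"},
    {"proto": "udp", "dport": 53,  "action": "permit"},
    {"proto": "*",   "dport": "*", "action": "deny"},
]
blocks = build_blocks(rules)
print(classify({"proto": "udp", "dport": 53}, blocks))
# -> ('permit', 2): only 2 of the 4 rules are inspected
```

In a real TCAM the rule priority order must be preserved within and across the activated blocks; the sketch relies on the rule list already being in priority order.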
Emerging computational resistive memory is a promising candidate for overcoming DRAM scaling challenges and the memory-wall bottleneck. However, its cell-level and array-level nonideal properties significantly degrade reliability, performance, accuracy, and energy efficiency during memory access and analog computation. Cell-level nonidealities include nonlinearity, asymmetry, and variability; array-level nonidealities include interconnect resistance, parasitic capacitance, and sneak paths. This review summarizes solutions that mitigate the impact of these nonideal properties. First, we introduce several typical resistive memory devices, focusing on their switching modes and characteristics. Second, we review resistive memory cells and memory array structures, including 1T1R, 1R, 1S1R, 1TnR, and CMOL, and give an overview of 3D cross-point arrays and their structural properties. Third, we analyze the impact of cell-level and array-level nonideal properties during memory access and analog arithmetic operations, focusing on dot-product operations and matrix-vector multiplication. Fourth, we discuss how to mitigate these nonideal properties through static physical and geometric parameter optimization and dynamic runtime optimization, from the viewpoint of cell-array interaction and codesign. Dynamic runtime operation schemes include line connection, voltage bias, logical-to-physical mapping, state partition, read-reference setting, and switching-mode reconfiguration. We also highlight the challenges that multilevel-cell cross-point arrays and 3D cross-point arrays pose for these operations. Finally, we survey peripheral circuit design considerations and present a unified reconfigurable computational memory architecture.
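The effect of one array-level nonideality, interconnect resistance, on an analog dot product can be sketched with a crude first-order model; the degradation formula below is an illustrative approximation (lumping cumulative series wire resistance into each cell's effective conductance), not a circuit-accurate IR-drop solver from the review.

```python
# First-order sketch (illustrative only) of line-resistance degradation
# in a crossbar matrix-vector multiply: each cell's conductance G is
# reduced by the series wire resistance along its row and column path.

def crossbar_mvm(voltages, G, r_wire=0.0):
    rows, cols = len(G), len(G[0])
    currents = []
    for j in range(cols):
        i_out = 0.0
        for i in range(rows):
            # crude count of wire segments in series with cell (i, j)
            r_series = r_wire * (j + 1) + r_wire * (rows - i)
            g_eff = 1.0 / (1.0 / G[i][j] + r_series)
            i_out += voltages[i] * g_eff        # column current sums I = V*G
        currents.append(i_out)
    return currents

V = [0.2, 0.2]                                   # read voltages (V)
G = [[1e-4, 2e-4], [2e-4, 1e-4]]                 # cell conductances (S)
ideal = crossbar_mvm(V, G, r_wire=0.0)
lossy = crossbar_mvm(V, G, r_wire=5.0)           # 5 ohm per wire segment
print(ideal, lossy)    # output currents shrink as wire resistance grows
```

Even this toy model shows why the review's mitigation schemes (voltage bias, logical-to-physical mapping, etc.) matter: the error is position-dependent, so a naive mapping of weights to cells skews the computed dot products.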
Contemporary integrated circuits (ICs) are increasingly constructed using intellectual property blocks (IPs) obtained from third parties in a globalized supply chain. The increased vulnerability to adversarial changes during this untrusted supply chain raises concerns about the integrity of the end product. The difference in the levels of abstraction between the initial specification and the final circuit design makes it challenging to analyze the final circuit for malicious insertions. Reverse engineering presents one way to reduce the difficulty of circuit analysis and inspection. In this work, we provide a framework that, given (i) a gate-level netlist of a design and (ii) a block diagram for the design with the relative sizes of the blocks, outputs a matching between partitions of the circuit and blocks in the block diagram. We first compute a geometric embedding for each node in the circuit and then apply a clustering algorithm to the embedding features to obtain circuit partitions. Each partition is then mapped to a high-level block in the block diagram. These partitions can be further analyzed for malicious insertions with much lower complexity than analyzing the full chip. We tested our algorithm on designs of varying sizes, including the multi-core OpenSPARC T1 processor, to evaluate its efficacy, and showed that we can successfully match over 90% of gates to their corresponding blocks.
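The embed-cluster-match pipeline can be sketched end to end on a toy netlist; the feature choice (degree statistics as a stand-in for the geometric embedding), the plain k-means, and the size-based block matching are all illustrative simplifications of the paper's method, and every identifier below is hypothetical.

```python
# Toy version of the pipeline (illustrative only): embed gates by simple
# structural features, cluster the embeddings, then match clusters to
# block-diagram blocks by relative size.

def embed(netlist):
    """netlist: {gate: set of neighbor gates}. Feature vector =
    (degree, mean neighbor degree), a stand-in for a real embedding."""
    feats = {}
    for g, nbrs in netlist.items():
        mean_nbr = sum(len(netlist[n]) for n in nbrs) / len(nbrs)
        feats[g] = (len(nbrs), mean_nbr)
    return feats

def cluster(feats, k, iters=10):
    """Plain k-means on the embedding features (first k gates seed it)."""
    gates = list(feats)
    cents = [feats[g] for g in gates[:k]]
    assign = {}
    for _ in range(iters):
        assign = {g: min(range(k),
                         key=lambda c: sum((a - b) ** 2
                                           for a, b in zip(feats[g], cents[c])))
                  for g in gates}
        for c in range(k):
            members = [feats[g] for g in gates if assign[g] == c]
            if members:
                cents[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assign

def match_blocks(assign, blocks):
    """Match the largest cluster to the largest block, and so on."""
    sizes = {}
    for c in assign.values():
        sizes[c] = sizes.get(c, 0) + 1
    order_c = sorted(sizes, key=lambda c: -sizes[c])
    order_b = sorted(blocks, key=lambda b: -blocks[b])
    return dict(zip(order_c, order_b))

netlist = {                       # tiny hypothetical gate graph
    "a1": {"a2", "a3"}, "b1": {"b2"},
    "a2": {"a1", "a3"}, "a3": {"a1", "a2"}, "b2": {"b1"},
}
assign = cluster(embed(netlist), k=2)
mapping = match_blocks(assign, {"datapath": 0.6, "control": 0.4})
print(mapping)                    # larger cluster maps to "datapath"
```

The paper's size-aware matching is what lets the analyst name each anonymous partition, after which only the suspicious block, not the whole chip, needs detailed inspection.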