Security Analysis of Arbiter PUF and Its Lightweight Compositions Under Predictability Test

Unpredictability is an important security property of Physically Unclonable Function (PUF) in the context of statistical attacks, where the...


With the emergence of many-core multiprocessor system-on-chips (MPSoCs), on-chip networks are facing serious challenges in providing fast communication among various tasks and cores. One promising on-chip network design approach shown in recent studies is to add express channels to traditional mesh network as shortcuts to bypass intermediate...

Scalable SMT-Based Equivalence Checking of Nested Loop Pipelining in Behavioral Synthesis

In this article, we present a novel methodology based on SMT-solvers to verify equality of a high-level described specification and a pipelined RTL...

Optimized Implementation of Multirate Mixed-Criticality Synchronous Reactive Models

Model-based design using Synchronous Reactive (SR) models enables early design and verification of application functionality in a platform-independent...

Reducing the Complexity of Dataflow Graphs Using Slack-Based Merging

There exist many dataflow applications with timing constraints that require real-time guarantees on safe execution without violating their deadlines....

On the Restore Time Variations of Future DRAM Memory

As the de facto main memory standard, DRAM (Dynamic Random Access Memory) has achieved dramatic density improvement in the past four decades, along...

A Hybrid DRAM/PCM Buffer Cache Architecture for Smartphones with QoS Consideration

Flash memory is widely used in mobile phones to store contact information, application files, and other types of data. In an operating system, the...

An Elastic Mixed-Criticality Task Model and Early-Release EDF Scheduling Algorithms

Many algorithms have recently been studied for scheduling mixed-criticality (MC) tasks. However, most existing MC scheduling algorithms guarantee the...

Computation of Seeds for LFSR-Based n-Detection Test Generation

This article describes a new procedure that generates seeds for LFSR-based test generation when the goal is to produce an n-detection test set. The...
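
As background for the abstract above, the following is a minimal sketch of how a single seed expands into a sequence of test patterns in LFSR-based test generation. It does not show the seed-computation procedure of the article, which the truncated abstract does not describe. The function name `lfsr_patterns`, the 4-bit width, and the tap positions are illustrative assumptions.

```python
def lfsr_patterns(seed, count, width=4, taps=(0, 1)):
    """Yield `count` successive states of a Fibonacci LFSR started from `seed`.

    Assumed configuration: a right-shifting register whose feedback bit is the
    XOR of the tapped bits; taps (0, 1) give a maximal-length 4-bit sequence.
    """
    if seed == 0:
        raise ValueError("the all-zero seed locks the LFSR in the zero state")
    state = seed
    for _ in range(count):
        yield state
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        # Shift right and insert the feedback bit at the most significant position.
        state = (state >> 1) | (feedback << (width - 1))


# Each state would be applied to the circuit as one test pattern.
patterns = list(lfsr_patterns(seed=0b0001, count=15))
```

With this configuration, every nonzero seed walks through all 15 nonzero 4-bit states before repeating, so the choice of seed determines only where in the cycle the pattern sequence starts.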

Scale & Cap

As the number of cores per server node increases, designing multi-threaded applications has become essential to efficiently utilize the available hardware parallelism. Many application domains have started to adopt multi-threaded programming; thus, efficient management of multi-threaded applications has become a significant research problem....

Secure and Flexible Trace-Based Debugging of Systems-on-Chip

This work tackles the conflict between enforcing security of a system-on-chip (SoC) and providing observability during trace-based debugging. On one...

A MATLAB Vectorizing Compiler Targeting Application-Specific Instruction Set Processors

This article discusses a MATLAB-to-C vectorizing compiler that exploits custom instructions, for...


Forthcoming Articles
HoPE: Hot-cacheline Prediction for Dynamic Early Decompression in Compressed LLCs

Data compression plays a pivotal role in improving system performance and reducing energy consumption, because it increases the logical effective capacity of a compressed memory system without physically increasing the memory size. However, data compression techniques incur some cost, such as non-negligible compression and decompression overhead. This overhead becomes more severe if compression is used in the cache. In this paper, we aim to lower the read-hit decompression penalty, which significantly increases the cache memory access latency. We demonstrate that the speculative decompression of frequently used cachelines can significantly reduce the read-hit decompression penalty. We hereby propose a Hot-cacheline Prediction and Early decompression (HoPE) mechanism to determine when a compressed cacheline should be decompressed, in order to minimize the total execution time of the system. Additionally, we show that cachelines of similar compressibility often have a high correlation in their hit rates, as data with similar structures is often used in similar ways. Building on this insight, we also propose a compressed cacheline Hit-history-Based Insertion (HBI) policy to take advantage of this correlation, by predicting how often a cacheline will be hit, based on its compressibility. To evaluate the effectiveness of the proposed HoPE mechanism, we run extensive simulations on memory traces obtained from multi-threaded benchmarks running on a full-system simulation framework. We observe significant performance improvements over compressed cache schemes employing the conventional Least-Recently Used (LRU) replacement policy, the Dynamic Re-Reference Interval Prediction (DRRIP) scheme, and the ECM compressed cache management mechanism. Specifically, HoPE exhibits system performance improvements of approximately 11%, on average, over LRU, 8% over DRRIP, and 7% over ECM, by reducing the read-hit decompression penalty by around 65%, over a wide range of applications.

Approximate Energy-Efficient Encoding for Serial Interfaces

Serial buses are ubiquitous interconnections in embedded computing systems that are used to interface processing elements with peripherals, such as sensors, actuators and I/O controllers. In spite of their limited wiring, as off-chip connections they can account for a significant amount of the total power consumption of a system-on-chip device. Encoding the information sent on these buses is the most intuitive and affordable way to reduce their power contribution; moreover, the encoding can be made even more effective by exploiting the fact that many embedded applications can tolerate intermediate approximations without a significant impact on the final quality of results, thus trading off accuracy for power consumption. We propose a simple yet very effective approximate encoding for reducing dynamic energy in serial buses. Our approach uses differential encoding as a baseline scheme, and extends it with bounded approximations to overcome the intrinsic limitations of differential encoding for data with low temporal correlation. We show that the proposed scheme, besides yielding extremely compact codecs, is superior to all state-of-the-art approximate serial encodings over a wide set of traces representing data received from or sent to sensors or actuators.
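
To illustrate the baseline the abstract refers to, here is a minimal sketch of exact differential encoding on a serial line, assuming that dynamic energy scales with the number of line transitions. The bounded-approximation extension proposed in the article is not specified in the abstract and is therefore not modeled. The helper names, the 8-bit word width, the LSB-first bit order, and the sample values are all illustrative assumptions.

```python
def serialize(words, width=8):
    """Flatten words into the bit stream sent on the serial line, LSB first."""
    return [(w >> i) & 1 for w in words for i in range(width)]


def count_transitions(bits, idle=0):
    """Count 0->1 and 1->0 toggles on the line, starting from the idle level."""
    toggles, prev = 0, idle
    for b in bits:
        if b != prev:
            toggles += 1
        prev = b
    return toggles


def differential_encode(words):
    """Replace each word by its XOR with the previous word (first word XOR 0)."""
    out, prev = [], 0
    for w in words:
        out.append(w ^ prev)
        prev = w
    return out


def differential_decode(codes):
    """Invert differential_encode by accumulating the XOR differences."""
    out, prev = [], 0
    for c in codes:
        prev ^= c
        out.append(prev)
    return out


# Hypothetical, slowly varying sensor samples (high temporal correlation).
samples = [0b10101010, 0b10101011, 0b10101010, 0b10101110]
encoded = differential_encode(samples)
raw_toggles = count_transitions(serialize(samples))
enc_toggles = count_transitions(serialize(encoded))
```

For correlated samples like these, the XOR differences are mostly zero bits, so the encoded stream toggles the line far less often than the raw stream. When consecutive samples differ in many bits, the differences contain many ones and the benefit shrinks, which is the limitation the abstract says the approximate scheme targets.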

SSAGA: SMs Synthesized for Asymmetric GPGPU Applications

The emergence of GPGPU applications, bolstered by flexible GPU programming platforms, has created a tremendous challenge in maintaining high energy efficiency in modern GPUs. In this paper, we demonstrate that customizing a Streaming Multiprocessor (SM) of a GPU for operation at a lower frequency is significantly more energy efficient than employing DVFS on an SM designed for high-frequency operation. Using a system-level CAD technique, we propose SSAGA (Streaming Multiprocessors Sculpted for Asymmetric GPGPU Applications), an energy-efficient GPU design paradigm. SSAGA creates architecturally identical SM cores, customized for different voltage-frequency domains. Our rigorous cross-layer methodology demonstrates an average 20% improvement in energy efficiency over a spatially multitasking GPU, across a range of GPGPU applications.

Layer Assignment of Escape buses with Consecutive Constraints in PCB Designs

For cost and reliability reasons, it is important to minimize the number of layers used and to assign the escape buses onto the available layers in a PCB design. In this paper, given a set of n escape buses between two adjacent components and a set of m consecutive constraints on the escape buses, the problem of assigning the given escape buses onto the available layers is formulated in bus-oriented escape routing. Furthermore, an efficient approach is proposed to minimize the number of layers used for the given escape buses under the consecutive constraints and to assign the escape buses onto the available layers. Compared with Yan's approach [J. T. Yan et al. 2012] for the layer assignment of linear escape buses with no consecutive constraint and Ma's approach [Q. Ma, F. Y. Young et al. 2011] for the layer assignment of circular escape buses with consecutive constraints, the experimental results show that our proposed approach obtains better results on the number of layers used and reduces CPU time on the tested examples by 43.6% and 90.5% on average, respectively.

Generation of Transparent-Scan Sequences for Diagnosis of Scan Chain Faults

Diagnosis of scan chain faults is important for yield learning and improvement. Procedures that generate tests for diagnosis of scan chain faults produce scan-based tests with one or more functional capture cycles between a scan-in and a scan-out operation. The approach to test generation referred to as transparent-scan has several advantages in this context. (1) It allows functional capture cycles and scan shift cycles to be interleaved arbitrarily. This increases the flexibility to assign to the scan cells values that are needed for diagnosis. (2) Test generation under transparent-scan considers a circuit model where the scan logic is included explicitly. Consequently, the test generation procedure takes into consideration the full effect of a scan chain fault. It thus produces accurate tests. (3) For the same reason it can also target faults inside the scan logic. (4) Transparent-scan results in compact test sequences. Compaction is important because of the large volumes of fail data that scan chain faults create. The cost of transparent-scan is that it requires simulation procedures for sequential circuits, and that arbitrary sequences would be applicable to the scan select input. Motivated by the advantages of transparent-scan, and the importance of diagnosing scan chain faults, this paper describes a procedure for generating transparent-scan sequences for diagnosis of scan chain faults. The procedure is also applied to produce transparent-scan sequences for diagnosis of faults inside the scan logic.

Topological Approach to Automatic Symbolic Macromodel Generation for Analog ICs

In the field of analog integrated circuit design, small-signal macromodels play indispensable roles. However, the subject of automatically generating symbolic low-order macromodels in human-readable circuit form has not been well studied. Traditionally, work has been published on reducing full-scale symbolic transfer functions to simpler forms, but without the guarantee of interpretability. In this work, a topological reduction method is introduced which is able to automatically generate interpretable macromodel circuits in symbolic form; that is, the circuit elements in the compact model maintain analytical relations to the parameters of the original full circuit. This type of symbolic macromodel has two benefits that traditional modeling methods do not offer: first, reusability, namely, the designer need not repeatedly generate macromodels for the same analog integrated circuit when it is either resized or rebiased; second, interpretability, namely, the designer may directly identify circuit parameters in the original integrated circuit that are closely related to the dominant frequency characteristics, such as dc gain, gain/phase margins, and dominant poles/zeros. The effectiveness and computational efficiency of the proposed method have been validated by several operational amplifier (opamp) circuit examples.

Generating Current Constraints to Guarantee RLC Power Grid Safety

A critical task during early chip design is the efficient verification of the chip power distribution network. Vectorless verification, developed over the last decade as an alternative to traditional simulation-based methods, requires the user to specify current constraints (budgets) for the underlying circuitry and checks if the corresponding voltage variations on all grid nodes are within a user-specified margin. This framework is extremely powerful as it allows for efficient and early verification, but specifying/obtaining current constraints remains a burdensome task for users and a hurdle to adoption of this framework by the industry. Recently, the inverse problem has been introduced: generate circuit current constraints that, if satisfied by the underlying logic circuitry, would guarantee grid safety from excessive voltage variations. This approach has many potential applications, including various grid quality metrics, as well as voltage drop aware placement and floorplanning. So far, this framework has been developed assuming an RC model of the power grid. Inductive effects are becoming a significant component of the power supply noise and can no longer be ignored. In this paper, we extend the constraints generation approach to allow for inductance. We give a rigorous problem definition and develop some key theoretical results related to maximality of the current space defined by the constraints. Based on this, we then develop three constraints generation algorithms that target the peak total chip power that is allowed by the grid, the uniformity of current distribution across the die area, and a combination of both metrics.

Leak Stopper: An Actively Revitalized Snoop Filter Architecture with Effective Generation Control

To alleviate the high energy dissipation of unnecessary snooping accesses, snoop filter designs have been proposed to reduce snoop lookups. These filters suffer from decreasing filtering efficiency, and thus usually rely on partial or whole filter reset triggered by detecting block evictions. Unfortunately, the reset conditions occur infrequently or unevenly (referred to as passive filter deletion). This work proposes the concept of revitalized snoop filter (RSF) design, which can actively renew the destination filter by employing a generation wrap-around scheme for various reference behaviors. We further utilize a sampling mechanism for RSF to trigger precise filter revitalizations in a timely manner, so that unnecessary RSF flushing can be minimized. The proposed RSF can be integrated into various inclusive snoop filters and needs only minor changes to their designs. We evaluate our proposed design and demonstrate that RSF eliminates 58.6% of snoop energy compared to JETTY on average while inducing only 6.5% of revitalization energy overhead. In addition, RSF eliminates 45.5% of snoop energy compared to stream registers on average and only induces 2.5% of revitalization energy overhead. Overall, these RSFs reduce the total L2 cache energy consumption by 52.1% (58.6%-6.5%) as compared to JETTY and by 43% (45.5%-2.5%) as compared to stream registers. Furthermore, RSF improves the overall performance by 1% to 1.4% on average compared to JETTY and stream registers for various benchmark suites.

A Single-Tier Virtual Queuing Memory Controller Architecture for Heterogeneous MPSoCs

Heterogeneous MPSoCs typically integrate diverse cores, including application CPUs, GPUs, and HD coders. These cores commonly share an off-chip memory to save cost and energy, but their memory accesses often interfere with each other, leading to undesirable consequences like a slowdown of application performance or a failure to sustain real-time performance. The memory controller plays a central role in meeting the QoS needs of real-time cores while maximizing the CPU performance. Previous QoS-aware memory controllers are based on a classic two-tier queuing architecture that buffers memory transactions at the first tier, followed by a second tier that buffers translated DRAM commands. In these designs, QoS-aware policies are used to schedule competing transactions at the first tier, but the translated DRAM commands are serviced in FIFO order at the second tier. Unfortunately, once the scheduled transactions have been forwarded to the command tier, newly arriving transactions that may be more critical cannot be serviced ahead of those translated commands that are already queued there. To address this, we propose a scalable memory controller architecture based on Single-Tier Virtual Queuing (STVQ) that maintains a single tier of request queues and employs an efficient scheduler that considers both QoS requirements and DRAM bank states. In comparison with previous QoS-aware memory controllers, the proposed STVQ memory controller reduces CPU slowdown by up to 13.9% while satisfying all frame rate requirements. We propose further optimizations that can increase row-buffer hits by up to 66.2% and reduce memory latency by up to 19.8%.

A List of Fundamental Challenges Towards Making IoT a Reachable Reality: A Model-centric Investigation

The constantly advancing integration capability is paving the way to the construction of an extremely large-scale continuum of internet where entities, or things, from vastly varied domains are uniquely addressable and interact seamlessly to form a giant networked system of systems, known as the Internet-of-Things (IoT). In contrast to such a visionary networked system paradigm, prior research efforts on IoT are still very fragmented and confined to disjoint explorations in different application, architectural, security, services, protocol and economical domains, thus preventing design exploration and optimization from a unified and global perspective. In this context, this survey article first proposes a mathematical modeling framework that is rich in expressivity to capture the IoT characteristics from a global perspective. Then a list of fundamental challenges in i) sensing, ii) decentralized computation, iii) energy-efficiency and iv) hardware security is identified and formulated based on the proposed modeling framework. The solutions are discussed to shed light on future IoT system paradigm development.

Proof-Carrying Hardware via IC3

Proof-carrying hardware is a principle for achieving safety for dynamically reconfigurable hardware systems. The producer of a hardware module invests the considerable effort required to create a proof for a safety policy. The proof is then transferred as a certificate together with the configuration bitstream to the consumer of the hardware module, who can quickly verify the given proof. Previous work utilized SAT solvers and resolution traces to set up a proof-carrying hardware technology and corresponding tool flows. In this paper, we present a novel technology for proof-carrying hardware based on inductive invariants. For sequential circuits, our approach is fundamentally stronger than the previous SAT-based one since we avoid the limitations of bounded unrolling. We contrast our technology with existing ones and show that it fits into previously proposed tool flows. We conduct experiments with three categories of benchmark circuits and report consumer and producer runtime and peak memory consumption, as well as the size of the certificates and the distribution of the workload between producer and consumer. Experiments clearly show that our new induction-based technology is superior for sequential circuits, while the previous SAT-based technology is the better choice for combinational circuits.

Low-Power Clock Tree Synthesis for 3D-ICs

We propose efficient algorithms to construct low-power clock trees for through-silicon-via (TSV) based 3D-ICs. We use shutdown gates to save the clock tree's dynamic power; these gates selectively turn off certain clock tree branches to avoid unnecessary clock activity when the modules in these tree branches are inactive. While this clock gating technique has been extensively studied in 2D circuits, its application in 3D-ICs is unclear. In 3D-ICs, a shutdown gate is connected to the control signal unit through control TSVs, which may cause placement conflicts with existing clock TSVs in the layout due to the TSVs' large physical dimensions. We develop a two-phase clock tree synthesis design flow for 3D-ICs: (1) 3D abstract clock tree generation based on K-means clustering. (2) Clock tree embedding with simultaneous shutdown gate insertion based on simulated annealing (SA) and a force-directed TSV placer. Experimental results indicate that: (1) The K-means clustering heuristic significantly reduces the clock power by clustering modules with similar switching behavior and close proximity. (2) The SA algorithm effectively inserts the shutdown gates into a 3D clock tree, while considering the placement of control TSVs. Compared with previous 3D clock tree synthesis techniques, our K-means clustering based approach achieves a larger reduction in clock tree power consumption while ensuring zero clock skew.

Optimal Scheduling and Allocation for IC Design Management and Cost Reduction

A large semiconductor product company spends hundreds of millions of dollars each year on design infrastructure to meet tapeout schedules for multiple concurrent projects. Resources (servers, EDA tool licenses, engineers, etc.) are limited and must be shared, and the cost per day of schedule slip can be enormous. Co-constraints between resource types (e.g., one license per every two cores (threads)) and dedicated versus shareable resource pools make scheduling and allocation hard. In this paper, we formulate two mixed integer-linear programs for optimal multi-project, multi-resource allocation with task precedence and resource co-constraints. Application to a real-world three-project scheduling problem extracted from a leading-edge design center of anonymized Company X shows substantial compute and license cost savings. Compared to the product company, our solution shows that the makespan of the schedule of all projects can be reduced by seven days, which not only saves ~2.7% of annual labor and infrastructure costs, but also enhances market competitiveness. We also demonstrate the capability of scheduling over two dozen chip development projects at the design center level, subject to resource and datacenter capacity limits as well as per-project penalty functions for schedule slips. The design center ended up purchasing 600 additional servers, whereas our solution demonstrates that the schedule can be met without having to purchase any additional servers. Application to a four-project scheduling problem extracted from a leading-edge design center in a non-U.S. location shows availability of up to ~37% headcount reduction during a half-year schedule for just one type of chip design activity.

A Fast Hierarchical Adaptive Analog Routing Algorithm based on Integer Linear Programming

The shrinking design window and high parasitic sensitivity in the advanced technology have imposed special challenges to the analog and RF integrated circuit designers. The state-of-the-art analog routing research tends to favor linear programming to achieve various analog constraints, which, although effective, fails to offer high routing efficiency on its own. In this paper, we propose a new methodology to address such a deficiency based on integer linear programming (ILP) but without compromising the capability of handling any special constraints for the analog routing problems. Our proposed method supports hierarchical routing, which can divide the entire routing area into multiple small heterogeneous regions where the ILP can efficiently derive routing solutions. Distinct from the conventional methods, our algorithm utilizes adaptive resolutions for various routing regions. For a more congested region, a routing grid with higher resolution is employed, whereas a lower-resolution grid is adopted to a less crowded routing region. For a large empty space, routing efficiency can be even boosted by creating more routing hierarchy levels. This scheme is especially beneficial to the analog and RF layouts, which are far sparser than the digital counterpart. The experimental results show that our proposed adaptive ILP-based router is much faster than the conventional ones since it spends much less time for the areas that need no accurate routing anyway. The high efficiency is demonstrated for large circuits and especially sparse layouts along with promising routing quality in terms of analog constraints.

PeaPaw: Performance and Energy Aware Partitioning of Workload on Heterogeneous Platforms

Performance and energy are two major concerns for application development on heterogeneous platforms. It is challenging for application developers to fully exploit the performance/energy potential of heterogeneous platforms. One reason is the lack of reliable prediction of the system's performance/energy before application implementation. Another reason is that a heterogeneous platform presents a large design space for workload partitioning between different processors. To reduce such development cost, this paper proposes a framework, PeaPaw, to help application developers identify, before actual implementation, a workload partition (WP) that has a high potential to deliver high performance or energy efficiency. The PeaPaw framework includes both analytical performance/energy models and two sets of workload partitioning guidelines. Based on the design goal, application developers can obtain a workload partitioning guideline from PeaPaw for a given platform and then follow it to design one or multiple WPs for a given workload. Then PeaPaw can be used to estimate the performance or energy of the designed WPs, and the WP with the best estimated performance or energy will be selected for further implementation. To demonstrate the effectiveness of PeaPaw, we have conducted three case studies. Results from these case studies show that PeaPaw can faithfully estimate the performance/energy relationships of WPs and provide effective workload partitioning guidelines.

Application-Specific Residential Microgrid Design Methodology

In the power system industry, the traditional, non-interactive, and manually-controlled power grid has been transformed into a cyber-dominated smart grid. This cyber-physical integration has provided the smart grid with communication, monitoring, computation, and controlling capabilities to improve its reliability, energy efficiency, and flexibility. A microgrid, as a localized and semi-autonomous group of smart energy systems, utilizes the above-mentioned capabilities to drive modern technologies such as electric vehicle charging, home energy management, and smart appliances. Designing, upgrading, testing, and verifying these microgrids can get too complicated to handle manually. The complexity is due to the wide range of solutions and components that are intended to address the microgrid problems. This paper presents a novel Model-Based Design (MBD) methodology to model, co-simulate, design, and optimize a microgrid and its multi-level controllers. It helps us to design, optimize, and validate a microgrid for a specific application. The application rules, requirements, and design-time constraints are met in the designed/optimized microgrid while the implementation cost is minimized. Based on our novel methodology, a design automation, co-simulation, and analysis tool, called GridMAT, is implemented. Our experiments have illustrated that implementing a hierarchical controller reduces the average power consumption by 8% and shifts the peak load for cost saving. Moreover, using our MBD methodology with smart controllers, the total implementation cost when upgrading a microgrid decreases by 14% compared to the conventional methodology and by 5% compared to the case where smart controllers are not considered.

Using CoreSight PTM to integrate CRA monitoring IPs in an ARM-based SoC

ARM CoreSight PTM has been widely deployed in recent ARM processors for real-time debugging and tracing of software. Using PTM, the external debugger can extract execution behaviors of applications running on an ARM processor. Recently, some researchers have begun to use this feature for other purposes such as fault-tolerant computation and security monitoring. This motivated us to develop an external security monitor that can detect control hijacking attacks, whose goal is to maliciously manipulate the control flow of victim applications at the attacker's disposal. In particular, this paper focuses on detecting a special type of attack, called code reuse attacks (CRAs), which use a recently introduced technique that allows attackers to perform arbitrary computation without injecting their own code by reusing only existing code fragments. Our external monitor is attached to the outside of the host system via the system bus and ARM CoreSight PTM, and is fed with execution traces of a victim application running on the host. As a majority of CRAs violate the normal execution behaviors of a program, our monitor constantly watches and analyzes the execution traces of the victim application and detects a symptom of attack when the execution behaviors violate certain rules that normal applications are known to adhere to. We present two different implementations for this purpose: a hardware-based solution in which all the CRA detection components are implemented in hardware, and a hardware/software mixed solution that can be employed in more resource-constrained environments where the deployment of full hardware-level CRA detection is burdensome.

Accelerated Soft-Error-Rate (SER) Estimation for Combinational and Sequential Circuits

Radiation-induced soft errors have posed an increasing reliability challenge to combinational and sequential circuits in advanced CMOS technologies. Therefore, it is imperative to devise fast, accurate and scalable soft error rate (SER) estimation methods as part of cost-effective robust circuit design. This paper presents an efficient SER estimation framework for combinational and sequential circuits, which considers single-event transients (SETs) in combinational logic and multiple cell upsets (MCUs) in sequential elements. A novel top-down memoization algorithm is proposed to accelerate the propagation of SETs, and a general schematic and layout co-simulation approach is proposed to model the MCUs for redundant sequential storage structures. The feedback in sequential logic is analyzed with an efficient time frame expansion method. Experimental results on various ISCAS85 combinational benchmark circuits demonstrate that the proposed approach achieves up to 560.2x speedup with less than 3% difference in terms of SER results compared with the baseline algorithm. The average runtime of the proposed framework on a variety of ISCAS89 sequential benchmark circuits is 2.535s, and the runtime is 30.429s for the largest benchmark circuit with more than 1,000 flip-flops and 20,000 gates.

Efficient Mapping of Applications for Future Chip-Multiprocessors in Dark-Silicon Era

The failure of Dennard scaling has led to the utilization wall that is the source of dark silicon and limits the percentage of a chip that can actively switch within its power budget. To address this issue, a structure is needed to guarantee the limited power budget along with providing sufficient flexibility and performance for different applications with various communication requirements. In this line, we present a general-purpose platform for future many-core Chip-Multiprocessors (CMPs) which benefits from the advantages of clustering, Network-on-Chip (NoC) resource sharing among cores, and power gating the unused components of clusters. We also propose two task mapping methods for the proposed platform in which active and dark cores are dispersed appropriately so that an excess of power budget can be obtained. Our evaluations reveal that the first and second proposed mapping mechanisms respectively reduce the execution time by up to 28.6% and 39.2%, the NoC power consumption by up to 11.1% and 10%, and gain an excess power budget of up to 7.6% and 13.4%, over the baseline architecture.

Content-Aware Bit Shuffling for Maximizing PCM Endurance

Phase change memory (PCM) has recently emerged as a strong replacement for DRAM owing to advantages such as non-volatility and high scalability. However, the use of PCM as main memory is still restricted by its limited write endurance. Many methods have been introduced to resolve the problem by reducing bit flips. Although they have significantly contributed to bit flip reduction, they still have the drawback that the lower bits are flipped more often than the higher bits. The reason is that these methods do not consider the fact that, in general, the lower bits are updated much more frequently than the higher bits. In this paper, we propose a novel content-aware bit shuffling (CABS) technique that minimizes bit flips and evenly distributes them to maximize the lifetime of PCM at the bit level. We also introduce two additional optimizations, namely, addition of an inversion bit and use of an XOR key, to further reduce bit flips. Moreover, CABS is capable of recovering from stuck-at faults by restricting the change in values of stuck-at cells. Experimental results showed that CABS outperformed the existing state-of-the-art methods in terms of PCM lifetime extension with minimal overhead. Specifically, CABS achieved up to 48.5% enhanced lifetime compared to the data comparison write (DCW) method, while consuming a few extra resources for metadata. We have also confirmed that CABS is fully applicable to BCH codes, as it was able to reduce the maximum number of bit flips in metadata cells by 32.1%.
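To make the bit-flip accounting concrete, here is a minimal sketch of a data comparison write combined with an inversion bit, two of the ingredients the abstract mentions. It is not the CABS algorithm itself (the shuffling and XOR-key steps are not described in the abstract and are omitted). The function names, the 8-bit word width, and the choice to count the flag cell as one extra flip are assumptions made for illustration.

```python
def bit_flips(old: int, new: int) -> int:
    """Cells that change when `old` is overwritten by `new`.

    A data comparison write only programs cells whose value differs,
    so the cost is the Hamming distance between the two words.
    """
    return bin(old ^ new).count("1")


def write_with_inversion_bit(old_word: int, old_flag: int,
                             new_data: int, width: int = 8):
    """Store `new_data` plain or inverted, whichever flips fewer cells.

    Returns (stored_word, flag, flips). The flag cell is counted as a
    flip whenever its value has to change (assumption of this sketch).
    """
    mask = (1 << width) - 1
    plain = new_data & mask
    inverted = ~new_data & mask
    cost_plain = bit_flips(old_word, plain) + int(old_flag != 0)
    cost_inverted = bit_flips(old_word, inverted) + int(old_flag != 1)
    if cost_inverted < cost_plain:
        return inverted, 1, cost_inverted
    return plain, 0, cost_plain


def read_word(stored_word: int, flag: int, width: int = 8) -> int:
    """Recover the original data by undoing the inversion if flagged."""
    mask = (1 << width) - 1
    return (~stored_word & mask) if flag else stored_word


if __name__ == "__main__":
    # Overwriting 0b00000000 with 0b11111110: 7 flips plain,
    # but only 2 (one data cell plus the flag) when stored inverted.
    print(write_with_inversion_bit(0b00000000, 0, 0b11111110))
```

Note that the inversion bit lowers the total number of flips but does nothing about where they land; the uneven wear of lower bits that the abstract describes is the part the shuffling step is meant to address.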

CDTA: A Comprehensive Solution for Counterfeit Detection, Traceability and Authentication in IoT Supply Chain

The Internet of Things (IoT) is transforming the way we live and work by increasing the connectedness of people and things on a scale that was once unimaginable. However, the vulnerabilities in the IoT supply chain have raised serious concerns about the security and trustworthiness of IoT devices and their components. Testing for device provenance, detection of counterfeit integrated circuits (ICs) and systems, and traceability of IoT devices are challenging issues to address. In this paper, we develop a novel RFID-based system, called CDTA, suitable for Counterfeit Detection, Traceability and Authentication in the IoT supply chain. CDTA is composed of different types of on-chip sensors and in-system structures that collect the information needed to detect multiple counterfeit IC types (recycled, cloned, etc.), track and trace IoT devices, and verify the overall system authenticity. Central to CDTA is an RFID tag employed as storage and as a channel to read the information from different types of chips on the printed circuit board (PCB) in both power-on and power-off scenarios. A novel board ID generator is implemented by combining outputs of physical unclonable functions (PUFs) embedded in the RFID tag and in different chips on the PCB. A light-weight RFID protocol is proposed to enable mutual authentication between RFID readers and tags. We also implement secure inter-chip communication on the PCB. Simulations and experimental results using Spartan 3E FPGAs demonstrate the effectiveness of this system. The efficiency of the radio frequency (RF) communication has also been verified via a PCB prototype with a printed slot antenna.

Temperature Effect Inversion Aware Dynamic Thermal Management for FinFET Circuits

Due to their superb characteristics, FinFETs have emerged as a promising replacement for planar CMOS devices in sub-20nm CMOS technology nodes. However, based on extensive simulations, we have observed that the gate delay vs. temperature characteristics of FinFET circuits may be fundamentally different from those of conventional bulk CMOS circuits, i.e., the delay of a FinFET circuit decreases with increasing temperature even in the super-threshold supply voltage regime. Furthermore, with optimal buffer insertion, we have observed that the interconnect delay of FinFET circuits may follow the same trend with temperature. Unfortunately, the leakage power dissipation of FinFET-based circuits increases exponentially with temperature. These two trends give rise to a tradeoff between delay and leakage power as a function of the chip temperature, and hence lead to the definition of an optimum chip temperature operating point (i.e., one that balances concerns about circuit speed and power efficiency). This paper presents the results of our investigations into the aforesaid temperature effect inversion (TEI) and proposes a novel dynamic thermal management (DTM) algorithm, which exploits this phenomenon to minimize the energy consumption of FinFET circuits without any appreciable performance penalty. Experimental results demonstrate that significant energy saving (as high as 36%, with no performance penalty) can be achieved by the proposed TEI-aware DTM approach compared to the best-in-class DTMs that are unaware of this phenomenon.

Design Methodology of Fault-Tolerant Custom 3D Network-on-Chip

A systematic design methodology is presented for custom Network-on-Chip (NoC) in three-dimensional integrated circuits (3D-ICs). In addition, fault tolerance is supported in the NoC if extra links are included in the NoC topology. In the proposed method, processors and the communication architecture are synthesized simultaneously in the 3D floorplanning process. 3D-IC technology enables ICs to be implemented in smaller size with higher performance; on the flip side, 3D-ICs suffer yield loss due to multiple dies in a 3D stack and the lower manufacturing yield of through-silicon vias (TSVs). To alleviate this problem, a known-good-die (KGD) test can be applied to ensure that every die to be packaged into a 3D-IC is fault-free. However, faulty TSVs cannot be tested in the KGD test. In this paper, the proposed method deals with the problem by providing fault tolerance in the NoC topology. The efficiency of the proposed method is evaluated using several benchmark circuits, and the experimental results show that the proposed method produces 3D NoCs with performance comparable to that of previous methods when fault-tolerant features are not realized. With fault tolerance in NoCs, higher yield can be achieved at the cost of a performance penalty and an elevated power level.

Parallel High-Level Synthesis Design Space Exploration for Behavioral IPs of Exact Latencies

This work presents a Design Space Exploration (DSE) method for Behavioral IPs (BIPs) given in ANSI-C or SystemC in order to find the smallest micro-architecture for a specific target latency. Previous work on High-Level Synthesis (HLS) DSE mainly focused on finding a trade-off curve with Pareto-optimal designs. HLS is, however, a single process (component) synthesis method. Very often, a component must have a specific fixed latency when inserted within a larger system. This work presents a fast multi-threaded method to find the smallest micro-architecture for a given BIP and target latency, by discriminating between all different exploration knobs and exploring these concurrently. Experimental results show that our proposed method is very effective, and comprehensive results compare the quality of results vs. the speedup of our proposed explorer.

Automated Integration of Dual-Edge Clocking for Low-Power Operation in Nanometer Nodes

Clocking power, including both clock distribution and registers, has long been one of the primary factors in the total power consumption of many digital systems. One straightforward approach to reducing this power consumption is to apply dual-edge triggered (DET) clocking, since sequential elements then operate at half the clock frequency while maintaining the same throughput as with conventional single-edge triggered (SET) clocking. However, the DET approach is rarely taken in modern integrated circuits, primarily due to the perceived complexity of integrating such a clocking scheme. In this paper, we first identify the most promising conditions for achieving low-power operation with DET clocking, and then introduce a fully automated design flow for applying DET to a conventional SET design. The proposed design flow is demonstrated on three benchmark circuits in a 40 nm CMOS technology, providing as much as a 50% reduction in clock distribution and register power consumption.
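The "as much as 50%" figure follows from the standard first-order model of dynamic power for a net that toggles every cycle, P = C * Vdd^2 * f: halving the clock frequency at the same throughput halves the clock power, provided nothing else changes. The sketch below works through that arithmetic. The capacitance, voltage, and frequency values are made up, and the `cap_overhead` parameter is an assumption added here to reflect that DET registers typically load the clock more heavily than SET ones, which is why practical savings fall below the ideal bound.

```python
def clock_power(c_eff: float, vdd: float, f_clk: float) -> float:
    """First-order dynamic power of a clock net: P = C * Vdd^2 * f."""
    return c_eff * vdd ** 2 * f_clk


def det_saving(c_set: float, vdd: float, f_set: float,
               cap_overhead: float = 0.0) -> float:
    """Fractional clock-power saving of DET over SET at equal throughput.

    DET captures data on both edges, so it runs at f_set / 2.
    `cap_overhead` is the hypothetical relative increase in switched
    clock capacitance caused by the more complex DET registers.
    """
    p_set = clock_power(c_set, vdd, f_set)
    p_det = clock_power(c_set * (1.0 + cap_overhead), vdd, f_set / 2.0)
    return 1.0 - p_det / p_set


if __name__ == "__main__":
    # Illustrative numbers only: 100 pF clock load, 1.1 V, 1 GHz SET clock.
    print(f"ideal saving:      {det_saving(100e-12, 1.1, 1e9):.1%}")
    print(f"with 20% overhead: {det_saving(100e-12, 1.1, 1e9, 0.2):.1%}")
```

In this model the saving depends only on the frequency ratio and the capacitance overhead; supply voltage and absolute capacitance cancel out, so the 50% bound holds regardless of the technology node.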


