Answers to comments from Reviewer 1

Question A-1: Though I have one correction to the authors' response to my Question A-1, "... parity protection needs a correction mechanism (e.g., checkpoints and roll-backward recovery), which can incur significant overheads in terms of runtime and energy consumption due to the recovery mechanism once an error is detected" is not accurate. I suggest the authors make it clear that the reliability mechanism supporting error recovery may incur significant performance and energy overheads, not the error recovery itself, since the actual soft error rate is extremely low and the error recovery is rarely invoked during execution.

Answer A-1: Yes, the reviewer is right. What we intended to point out in the last response was that parity protection needs a correction mechanism such as checkpoints and roll-backward recovery, and that it is this mechanism supporting error recovery, not the recovery itself, that may incur high overheads in terms of performance and energy consumption, as the reviewer noted.

Question A-2: One more comment: the comparison results in Section 7.2 and Section 7.3 show that PPC has difficulty competing with ECCs in terms of vulnerability reduction (around 50% for PPC vs. 99% for ECC), even though the incurred performance and energy overheads are much less. Will this kind of design trade-off be acceptable in real system design? Please justify the applicability of your proposed PPCs with potential applications.

Answer A-2: One potential application for PPCs in real system design is portable video surveillance in hazardous areas. Mobile embedded systems such as smart phones, PDAs, and other portable devices demand energy efficiency mainly because they run on limited battery power. Reliability is also becoming important, since these mobile devices operate close to humans and a functional failure may cause catastrophic results. Thus, designers of mobile embedded systems need a design space in which reliability can be traded for performance and energy costs, or vice versa. A portable video surveillance system installed in a hazardous area is one such example: since it is almost impossible to physically replace the system and it must keep running as long as possible, designers can trade reliability (in particular, by using a less reliable design for the parts of the architecture handling the video data itself) for lower performance and power overheads.

Answers to comments from Reviewer 2

Comment B-1: The authors have addressed most of my prior concerns satisfactorily, and thus I would like to recommend acceptance of this manuscript.

Answer B-1: Thank you for your comments and concerns.

Answers to comments from Reviewer 3

Question C-1: In my opinion, you emphasized not the methodological but the algorithmic aspect in this paper too much. The algorithmic part, however, is not novel, because your page assignment problem falls into the classical combinatorial assignment problem, for which many algorithms have been studied. Please try to emphasize the methodological aspect of your research, or compare your algorithm with the others comprehensively if you still want to emphasize the algorithmic aspect. More detail of the PPC architecture should be explained in this paper; this is more important than discussing the ad-hoc algorithms. Please show some quantitative evaluations of the cycle time of the PPC. More concretely, please show graphs of cache size vs. cycle time for both the protected part and the unprotected part of the PPC. I think your PPC architecture causes some performance overhead in the TLB; a quantitative evaluation of the TLB should also be shown in this paper. Without a quantitative discussion of the PPC, your paper would not be helpful to designers, because they could not decide the optimal sizes of the unprotected and protected caches for their design. Figure 53 in the following paper would be helpful in plotting graphs of cache size vs. cycle time: Steven J.E. Wilton and Norman P. Jouppi, "An Enhanced Access and Cycle Time Model for On-Chip Caches," WRL Research Report 93/5.

Answer C-1: Thank you for your suggestion. The main focus of this paper is to propose approaches that find interesting partitions of general applications between the unprotected cache and the protected cache of a PPC architecture; the PPC architecture itself was already published in our previous work [Lee et al. CASES 06 and TVLSI 09]. Our previous work focused on the architectural novelty and the tradeoffs among performance, energy consumption, and reliability for multimedia applications, where an obvious partitioning exists for PPC architectures, namely multimedia data vs. non-multimedia data. This paper, however, shows that the PPC architecture is effective not only for multimedia applications but also for general applications, using the proposed algorithms for data and instruction partitioning in PPCs. Since we have published comprehensive experimental results in terms of performance (runtime), power consumption overhead, reliability (failure rate), and area penalty in our previous papers and technical report, this article emphasizes the effectiveness of our proposed algorithms for partitioning data and instructions onto our previously proposed PPC architecture for general applications, including multimedia applications. Further, this article extends the effectiveness of PPC architectures to instruction caches as well. In this answering document, we provide all our experimental results and comparisons of the ECC (Error Correction Code)-protected cache, the unprotected (plain or normal) cache, and our PPC architecture in terms of cache access time, power consumption, and area for different cache sizes, as the reviewer suggested. Figure 1 shows the cache area in cm² for unprotected caches and ECC-protected caches. All results are from Cacti 3.2 with the following parameters: (i) the block size is set to 32 Byte for the unprotected cache and 38 Byte for the ECC-protected cache, (ii) the set-associativity is set to 4, (iii) the technology is set to 0.18 micrometer, and (iv) VDD is set to 1.7 Volt.

We considered a Hamming (32, 38) code as our ECC coding; it demands 6 extra check bits for every 32 data bits. This ratio is why we set the block size to 38 Byte for the ECC-protected cache, i.e., 6 Byte for check bits and 32 Byte for data bits. This 6 Byte overhead causes a 22% area increase for ECC-protected caches on average (up to 59%) compared to the area of unprotected caches, as shown in Figure 1. Note that these overheads result from storing the 6 Byte check codes, not from the encoder and decoder logic.

Figure 1 Cache areas of unprotected caches and protected caches

To estimate the area overhead of the ECC logic itself, we implemented a Hamming (32, 38) code in VHDL, synthesized it with the Synopsys Design Compiler using the lsi10k library in 0.5 micrometer technology, and scaled the result to 0.55 mm² for 0.18 micrometer technology. For the PPC architecture, we considered a 16:1 ratio between the unprotected cache and the protected cache so that its overhead is equal to or less than that of an ECC-only protected cache; for example, we selected a 32 KB unprotected cache and a 2 KB protected (ECC) cache. Figure 2 shows the overall comparison of cache areas for the unprotected cache, the ECC-protected cache, and the PPC architecture when the PPC and the ECC-protected cache include this area overhead. The ECC-protected caches incur high overheads, while the area of the PPC architecture lies between those of the ECC-protected and unprotected caches. For example, compared to a 32 KB unprotected cache, a 32 KB ECC-protected cache incurs about 22% area overhead, while the PPC (a 32 KB unprotected cache plus a 2 KB ECC-protected cache) incurs just 7%, which is about a 12% reduction relative to the area of the 32 KB ECC-protected cache.

Figure 2 Cache areas among unprotected caches, ECC-protected caches, and PPCs
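As a rough cross-check of the storage-overhead figures above, the following Python sketch computes the number of Hamming check bits required per 32-bit word and the resulting ECC block size; it is an illustrative calculation only, not the Cacti or VHDL flow used for the reported numbers.

```python
def hamming_check_bits(data_bits):
    """Smallest r such that 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r

data_bits = 32
check_bits = hamming_check_bits(data_bits)   # 6 check bits for 32 data bits
block_data_bytes = 32                        # unprotected block size in Byte
block_ecc_bytes = block_data_bytes * (data_bits + check_bits) // data_bits  # 38 Byte

print(f"check bits per 32-bit word : {check_bits}")
print(f"ECC block size             : {block_ecc_bytes} Byte "
      f"({100 * check_bits / data_bits:.2f}% storage overhead)")
```

The raw check-bit storage overhead is 6/32 = 18.75% per word; the larger area increase reported by Cacti (22% on average, up to 59%) reflects that cache array area does not scale exactly linearly with the number of stored bits.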

Figure 3 shows the access latencies, in ns, of the unprotected caches and the ECC-protected caches over various cache sizes. All parameters and configurations are the same as in the cache area evaluation. The larger block size of the ECC-protected caches (38 Byte instead of 32 Byte) causes about a 5% overhead on average in cache access latency compared to the unprotected caches. Note that the cache access latency of the ECC-protected caches (with the larger block size) is actually smaller than that of the unprotected caches at several sizes, such as 128 KB, 64 KB, 32 KB, and 2 KB. Also, a speculative implementation of the ECC operation does not increase the overall cache access latency. Thus, we consider the cache access latency of ECC-protected caches identical to that of unprotected caches, namely 1 cycle. Note that we assume the processor runs at 400 MHz with an Instructions Per Cycle (IPC) of 1.

Figure 3 Cache access latencies of the unprotected caches and ECC-protected caches

Figure 4 shows the power consumption (energy per access, in nJ) of the unprotected caches and the ECC-protected caches over various cache sizes. All parameters and configurations are the same as in the cache area evaluation. As in the cache access latency evaluation, the comparison with Cacti 3.2 does not show high overheads from the 6 Byte storage overhead in ECC-protected caches. However, ECC protection does incur high overheads in terms of energy consumption for encoding and decoding, even though the latency of the ECC coding and decoding can be optimized or hidden. Our experimental estimation of the ECC logic with the Synopsys Design Compiler shows 0.39 nJ for decoding and 0.22 nJ for encoding, as presented in the first-revision answering sheet. The resulting average energy overhead for the page partitions discovered by the approaches proposed in the article is about 7% for data PPCs and 13% for instruction PPCs, compared to unprotected caches, across the benchmarks.

Figure 4 Power consumption of unprotected caches and ECC-protected caches
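To make the ECC energy accounting concrete (and, relatedly, the point in Answer C-8 below that writes to a clean line still require a decode to detect soft errors), here is a minimal per-access energy sketch. The 0.39 nJ / 0.22 nJ decode/encode costs are the measured values above; the array-access energy and the read/write mix are placeholders for illustration, not numbers from the paper.

```python
# Illustrative per-access energy model for an ECC-protected cache
# (a sketch, not the exact model used in the paper).
E_DECODE = 0.39   # nJ, measured cost of ECC decoding (check + correct)
E_ENCODE = 0.22   # nJ, measured cost of ECC encoding (check-bit generation)

def access_energy(kind, e_array):
    """Energy of one cache access in nJ.

    kind    : 'read' or 'write'
    e_array : energy of the SRAM array access itself (placeholder value)
    """
    if kind == 'read':
        # every read is decoded to detect/correct a possible soft error
        return e_array + E_DECODE
    if kind == 'write':
        # a write re-encodes the new data; the existing line is also decoded,
        # even if it is clean, because a dirty bit cannot reveal a soft error
        return e_array + E_DECODE + E_ENCODE
    raise ValueError(kind)

# example: 70% reads, 30% writes, assumed 0.5 nJ array access energy
avg = 0.7 * access_energy('read', 0.5) + 0.3 * access_energy('write', 0.5)
print(f"average energy per protected-cache access: {avg:.2f} nJ")
```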

Question C-2: The reason why I asked Question C-3 is that the cache access time (ps/cycle) affects both the entire execution time and the vulnerability. I cannot understand the execution time exactly if you don't show any concrete values of cache access time (ps/cycle). As an answer to my question (C-3), you have shown Figure 2, in which the cache access time should be shown because it affects the entire runtime. It is an interesting result that Figure 2 in the reply letter shows that sizing the unprotected cache reduces vulnerability drastically; as a matter of fact, vulnerability seems to be linear in the size of the unprotected cache. Sizing the unprotected cache is an effective approach to reduce vulnerability, especially in a large-cache configuration. How do you justify your approach compared with sizing the unprotected cache?

Answer C-2: The reviewer is right. Figure 2 in the previous reply letter shows the interesting result that sizing the unprotected cache is an effective approach to reduce vulnerability. Indeed, Kim et al. [KimDATE06] presented the impacts of cache size on performance, energy consumption, and reliability. Cache sizing is an effective tradeoff technique: decreasing the (unprotected) cache size reduces the vulnerability but raises the cache miss rate, incurring high performance and energy overheads due to frequent off-chip memory accesses. Conversely, increasing the cache size raises the vulnerability while improving performance, with higher cache energy but lower memory-access energy. Our approach is orthogonal to sizing the unprotected cache, i.e., PPC and cache sizing can be combined to further trade off performance, energy consumption, and vulnerability.

Question C-3: Several definitions and lemmas are written on Page 50. These lemmas were introduced and proved not by you but by others, I think; they are actually found in statistics textbooks. If this is true, please refer to an appropriate reference. Even if these lemmas were proved by you, it is unclear what you want to discuss with them. You implicitly assumed that a binomial distribution is approximated by a normal distribution. Though it is well known that a normal distribution can approximate a binomial distribution for a large number of samples, you should mention that if you include the first lemma in your paper. In my opinion, the first lemma is unnecessary, though. More importantly, the masking probability becomes a normal distribution, which is quite important from the aspect of IC reliability. You must explain more about probability, convexity, and skewness if you discuss statistics on the masking effect. It is quite important what the real distribution of masking probability looks like, and how large the masking probability is for typical microprocessors and benchmark programs. If what you wanted to do was to estimate how many simulations are required, it would be meaningless without a real value of "p", the masking probability. The real distribution of masking upsets is much more interesting to readers than the well-known lemmas.

Answer C-3: The reviewer is right; we simply applied statistics lemmas from the textbook. The only reason for the existence of Section 4.1 was to give readers a sense of how computation-intensive simulation-based techniques for estimating reliability are. This motivates the use of architecture-level metrics for reliability estimation, even though these metrics are not ours; indeed, they were proposed at MICRO 03 by Mukherjee et al. We simply use the existing architecture-level reliability metric to evaluate page partitions on the previously proposed PPC (Partially Protected Caches) architecture. We understand the reviewer's concern that this is an easy extension of textbook material and is also intuitively clear. Thus, we have removed Section 4.1 from this article.
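For reference, the kind of back-of-the-envelope estimate that the removed Section 4.1 was meant to convey can be reproduced with the standard normal-approximation confidence interval for a binomial proportion; the masking probability and the target precision below are illustrative placeholders, not values from the paper.

```python
from math import ceil

def required_injections(p, eps, z=1.96):
    """Fault-injection runs needed to estimate a masking probability p
    within +/- eps at ~95% confidence (normal approximation to the binomial)."""
    return ceil(z * z * p * (1.0 - p) / (eps * eps))

# example: assumed masking probability 0.9, estimated to within +/- 1%
print(required_injections(p=0.9, eps=0.01))   # about 3,458 simulation runs
```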

Question C-4: From Pages 50 to 52, you have shown a vulnerability estimation algorithm which is basically the same as Asadi's work, except that you estimate byte-wise vulnerability. If you insist on the novelty of byte-wise vulnerability estimation, please compare byte-wise and word-wise vulnerability estimations with some quantitative evaluation; the extension from word-wise to byte-wise is not so novel. I wonder whether byte-wise estimation is better than word-wise, because an entire cache line, whose size is typically not a byte but four or eight bytes, is written out to the lower level of the memory hierarchy. Byte-wise estimation is effective under a write-through policy, but you seem not to assume write-through because you mentioned "dirty". I'm not sure whether or not you should include this text in the paper as if the work were done by you. If I were you, I would just summarize Asadi's work as their work (NOT MINE) in less space.

Answer C-4: We do not claim that byte-level vulnerability is novel. We added Fig. 3 and the full description of how vulnerability is estimated because we wanted to make our vulnerability metric self-contained, as one of the reviewers suggested in the prior revision. However, our byte-level vulnerability metric produces results closer to the failure rate than Asadi's word-level critical time. Figure 5 shows these results: the X-axis is the cache size, and the Y-axis shows Asadi's critical time, our vulnerability, and the failure rates. For small caches, word-level critical time tracks the failure rate well, but it loses accuracy for larger caches; for example, word-level critical time estimates only about half of the failure rate, whereas our byte-level vulnerability stays closer to it. This is because our byte-level vulnerability captures the vulnerable time of each individual byte and considers more comprehensive cases, bringing it closer to the observed failure rate.

Figure 5 Comparison of Vulnerability, Critical Time, and Failure Rate
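To illustrate the difference discussed in Answer C-4, the sketch below accumulates vulnerable time at byte granularity rather than word granularity; the event model (per-byte fill/read/write-back timestamps) is a simplified assumption for illustration, not the exact algorithm in the paper.

```python
# Minimal sketch of byte-level vulnerability accumulation (illustrative only).
# A byte is vulnerable from the moment it holds live data until it is either
# read by the CPU or written back to memory while dirty; a clean eviction
# discards the accumulated interval for that byte.

def byte_vulnerability(events, line_size=32):
    """events: list of (time, kind, offset, length) with kind in
    {'fill', 'read', 'write', 'writeback_dirty', 'evict_clean'}."""
    last_update = [None] * line_size   # when each byte last became live
    vulnerable = 0.0
    for time, kind, offset, length in sorted(events):
        for b in range(offset, offset + length):
            if kind in ('fill', 'write'):
                last_update[b] = time            # byte becomes live (and vulnerable)
            elif kind in ('read', 'writeback_dirty'):
                if last_update[b] is not None:   # consumed: corruption would matter
                    vulnerable += time - last_update[b]
                    last_update[b] = time if kind == 'read' else None
            elif kind == 'evict_clean':
                last_update[b] = None            # clean eviction: bytes were not critical
    return vulnerable

# example: fill a line at t=0, read 4 bytes at t=100, evict it clean at t=150
events = [(0, 'fill', 0, 32), (100, 'read', 0, 4), (150, 'evict_clean', 0, 32)]
print(byte_vulnerability(events))   # only the 4 read bytes contribute: 4 * 100
```

A word-level variant would charge the whole word (or line) for any access to it, which is one way to see why it can drift from the measured failure rate as cache size grows.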

9 Decryption" in order to justify your experiments. Assuming that the number of pages is N, there are 2^N page assignments for the PPC. There are about 50 pages in the example and 2^50 assignments theoretically. I can understand that you wanted to avoid the impossible number of simulations. However, you should mention the limitation of your experiments if you take approximate vulnerabilities. Answer C-5: Figures 6 to 14 below show profiling results with other benchmarks. As shown in Fig. 7. in the paper, the following figures present two observations: (i) the vulnerability decreases when each page is mapped from the unprotected cache to the protected cache in a PPC in the order of page vulnerability and (ii) page partitions significantly affect the performance. We believe that we mentioned the limitation of simulations for all possible combinations. This limitation applies for our approach as well. We did not claim that ours is the best out of all possible combinations of pages. The goal of our approaches is to efficiently figure out the interesting page assignments to two caches in a PPC in terms of vulnerability with least overheads of performance and energy consumption than those of random simulations and genetic algorithms. Figure 6 Tradeoffs among vulnerability and runtime when moving pages from unprotected cache to protected cache in a PPC (Blowfish Encryption)

Figure 7 Tradeoffs among vulnerability and runtime when moving pages from unprotected cache to protected cache in a PPC (CRC)

Figure 8 Tradeoffs among vulnerability and runtime when moving pages from unprotected cache to protected cache in a PPC (djpeg)

Figure 9 Tradeoffs among vulnerability and runtime when moving pages from unprotected cache to protected cache in a PPC (susan edges)

Figure 10 Tradeoffs among vulnerability and runtime when moving pages from unprotected cache to protected cache in a PPC (FFT)

Figure 11 Tradeoffs among vulnerability and runtime when moving pages from unprotected cache to protected cache in a PPC (Rijndael Decryption)

Figure 12 Tradeoffs among vulnerability and runtime when moving pages from unprotected cache to protected cache in a PPC (Rijndael Encryption)

Figure 13 Tradeoffs among vulnerability and runtime when moving pages from unprotected cache to protected cache in a PPC (SHA)

Figure 14 Tradeoffs among vulnerability and runtime when moving pages from unprotected cache to protected cache in a PPC (Stringsearch)

Question C-6: You showed several experimental results in which you compared your algorithms with the Monte Carlo method and a genetic algorithm. I think that is unfair and intentional, because it is obvious that the Monte Carlo method does not converge quickly; experimental results from the MC method are unnecessary. It is common to compare a newly proposed algorithm with simulated annealing and a genetic algorithm. To compare your algorithms fairly with the others, you should show results for simulated annealing, which is a common optimization metaheuristic. This kind of assignment problem can also be solved by integer linear programming. You should try to solve your problem as an ILP using a commercial solver such as ILOG CPLEX, Dash Optimization Xpress-MP, or LINDO Systems LINDO; your problem might even be solvable with the evaluation version of LINDO, which is available for free. GLPK and LPSOLVE are also available for free, although they are not so fast, so you might not be able to use them as a baseline.

Answer C-6: Thank you for your suggestion. However, we think the MC method and its evaluation should remain in the paper, since the MC method is a typical random approach and we show that such random exploration does not guarantee effective partitions for our PPC architecture. We also compare our approach with a genetic algorithm for finding the best partitions in terms of vulnerability with minimal performance and energy overheads, and our approach is more effective and efficient than the genetic algorithm. We are definitely interested in an ILP approach for finding interesting partitions in terms of vulnerability, performance, and energy consumption in future work; we already have some preliminary experimental results, for the instruction PPC architecture in particular, and an ILP approach will be evaluated as a main point of comparison there.
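As a concrete reference point for the comparison discussed in Answers C-5 and C-6, the following is a minimal sketch of a greedy, vulnerability-ordered page assignment of the kind described above: pages are moved from the unprotected to the protected cache in decreasing order of profiled vulnerability until a runtime-overhead budget is exhausted. The per-page profile numbers and the budget are hypothetical inputs for illustration; this is not the exact algorithm or data from the paper.

```python
# Greedy, vulnerability-ordered page partitioning for a PPC (illustrative sketch).
# Each page carries its profiled vulnerability and the runtime penalty expected
# when it is mapped to the small protected cache.

def partition_pages(pages, runtime_budget):
    """pages: list of (page_id, vulnerability, runtime_delta).
    Returns (protected_ids, unprotected_ids)."""
    protected, unprotected = [], []
    spent = 0.0
    # consider the most vulnerable pages first
    for page_id, vuln, dt in sorted(pages, key=lambda p: p[1], reverse=True):
        if spent + dt <= runtime_budget:
            protected.append(page_id)
            spent += dt
        else:
            unprotected.append(page_id)
    return protected, unprotected

# hypothetical profile: (page, vulnerability in byte-cycles, runtime delta in %)
profile = [(0, 9.1e6, 0.8), (1, 7.4e6, 2.5), (2, 1.2e6, 0.1), (3, 0.3e6, 3.0)]
print(partition_pages(profile, runtime_budget=3.0))   # -> ([0, 2], [1, 3])
```

A genetic algorithm or ILP explores the same 2^N assignment space; the point of the greedy ordering is that the profiled per-page vulnerability makes a small, targeted subset of that space sufficient in practice.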

Question C-7: You examined both vulnerability and runtime in a PPC structure. I think that you should show some circuit delay, i.e., cache access time in ps, as the cache parameters change; this is essential to justify the effectiveness of the PPC. I also think you should examine the vulnerability, runtime, and chip area of the other cache structures: (i) a SEC-DED-protected, non-hybrid cache for various cache sizes, (ii) a parity-protected, non-hybrid cache for various cache sizes, and (iii) a plain, non-hybrid cache for various cache sizes. It is quite important to justify the advantages of the PPC by showing quantitative values for the non-hybrid caches and comparing the vulnerability, runtime, and chip area of the PPC with those of the other cache structures.

Answer C-7: Thank you for your suggestion. The main focus of this paper is to propose approaches that find interesting partitions between the unprotected cache and the protected cache of a PPC architecture; the PPC architecture itself was already published in our previous work [Lee et al. CASES 06 and TVLSI 09]. We provide all our experimental results and comparisons of the ECC-protected cache, the unprotected (plain or normal) cache, and our PPC architecture in terms of cache access time, power consumption, and area for different cache sizes in Answer C-1. We excluded parity-protected caches mainly because parity only detects errors and cannot correct them.

Question C-8: Is the energy consumption model on Page 59 correct? Consider a write miss on which a clean line is overwritten with a datum. Writing a datum onto a clean cache line does not cause any eviction. Is decoding necessary when writing a datum onto a clean cache line? Only checking a dirty bit would seem to be enough. If the cost of writing a datum onto a clean cache line is negligible, please mention that.

Answer C-8: It is necessary to decode the data when writing a datum onto a clean cache line. Checking the dirty bit does not tell us whether a soft error has occurred in that cache line, whereas decoding the line with the Hamming code reveals whether a soft error has occurred whenever a write operation happens.

Question C-9: The following are minor comments. - "projected" in line 2 on page 42 should be followed by "to". - The period after "signal interference" on page 42 is unnecessary. - You should mention parity coding around the description of SEC-DED. - "Luccetti" should be "Lucchetti" on Page 47.

Answer C-9: We thank the reviewer for these corrections and have updated the manuscript accordingly.
