The Gap Between the Virtual Machine and the Real Machine. Charles Forgy Production Systems Tech


How to Improve Performance Use better algorithms. Use parallelism. Make better use of the hardware.

Argument The Java JVM is too far removed from the actual processor architecture to allow the hardware to be used efficiently.

Outline Modern processor architecture. Prefetching data. Making data structures cache-friendly. Using special instruction set features.

Moving a Block of Integers
MOV EAX, DWORD PTR [ESI]
MOV DWORD PTR [EDI], EAX
LEA ESI, [ESI+4]
LEA EDI, [EDI+4]
MOV EAX, DWORD PTR [ESI]
MOV DWORD PTR [EDI], EAX
LEA ESI, [ESI+4]
LEA EDI, [EDI+4]
MOV EAX, DWORD PTR [ESI]
...

Artificial Dependencies Many instructions could proceed in parallel, except that they reuse the same registers.

Moving With Extra Registers
MOV TEMP1, DWORD PTR [ESI]
MOV DWORD PTR [EDI], TEMP1
LEA TEMP2, [ESI+4]
LEA TEMP3, [EDI+4]
MOV TEMP4, DWORD PTR [TEMP2]
MOV DWORD PTR [TEMP3], TEMP4
LEA TEMP5, [TEMP2+4]
LEA TEMP6, [TEMP3+4]
MOV TEMP7, DWORD PTR [TEMP5]
...

Modern Processors Modern processors perform essentially this kind of renaming internally, allowing multiple instructions to be executed opportunistically (out-of-order execution). Instructions are retired (i.e., their results are written back to the visible registers) in order.

Abstract Model (diagram): Instruction Fetch -> Decode and Register Rename -> Reservation Station -> Multiple Functional Units / Load-Store Unit -> Reorder Unit -> Visible Registers

Pipelined Memory Access Memory access is pipelined. So memory latency can be overlapped. For instance, the cost of two loads is much less than twice the cost of one load. When a cache miss is processed, a full cache line is fetched. So, access to the rest of the data in that cache line is free.
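The overlap claim on this slide can be illustrated with a small Java microbenchmark (a sketch, not from the talk; class and method names are illustrative). A dependent pointer chain pays each cache-miss latency serially, while a sequence of independent loads lets the processor keep several misses in flight at once.

```java
import java.util.Random;

// Illustrative sketch: dependent loads serialize their miss latency,
// while independent loads can overlap it.
public class LoadOverlap {
    // Follow a chain of indices: each load depends on the previous one.
    static int chase(int[] next, int start, int steps) {
        int p = start;
        for (int i = 0; i < steps; i++) p = next[p];
        return p;
    }

    // Visit the same cells via a precomputed index array: the loads are
    // independent, so their cache misses can overlap in time.
    static long gather(int[] next, int[] order) {
        long sum = 0;
        for (int idx : order) sum += next[idx];
        return sum;
    }

    public static void main(String[] args) {
        final int n = 1 << 22;
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Random rnd = new Random(1);
        for (int i = n - 1; i > 0; i--) {      // shuffle to defeat the prefetcher
            int j = rnd.nextInt(i + 1);
            int t = order[i]; order[i] = order[j]; order[j] = t;
        }
        int[] next = new int[n];               // next[order[i]] = order[i+1]: one big ring
        for (int i = 0; i < n; i++) next[order[i]] = order[(i + 1) % n];

        long t0 = System.nanoTime();
        int end = chase(next, order[0], n);    // dependent: misses serialize
        long chained = System.nanoTime() - t0;

        t0 = System.nanoTime();
        long sum = gather(next, order);        // independent: misses overlap
        long overlapped = System.nanoTime() - t0;

        System.out.println("dependent chain: " + chained / 1e6 + " ms (end=" + end + ")");
        System.out.println("independent:     " + overlapped / 1e6 + " ms (sum=" + sum + ")");
    }
}
```

On typical hardware the independent version runs several times faster over the same memory; exact numbers vary with the machine.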

Example: Nehalem Reservation station buffers: 38 Reorder buffers: 128 Load buffers: 48 Store buffers: 32 Cache line: 64 bytes (If Hyperthreading is enabled, the buffers are shared between two threads.)

Stalls If the input data for an instruction are unavailable, or if there are not sufficient free resources, the instruction stalls.

Stalls and Out-of-Order Execution Out-of-order execution allows the processor to perform useful work on other instructions while the stalled instruction waits. Also, very importantly, it allows stalls to overlap in time.

Cost of Cache Misses As a rough guide, a level 2 cache miss on a Core 2 processor takes around 165 cycles (desktop) 300 cycles (server) Source: Cycle Accounting Analysis on Intel Core 2 Processors, Dr. David Levinthal, Intel Corporation.

Outline Modern processor architecture. Prefetching data. Making data structures cache-friendly. Using special instruction set features.

Pointer-Chasing Problem Many programs (including rule engines) work on large numbers of objects linked by pointers, and perform only a small amount of processing on each object after it is fetched. Problem: when an object is fetched, it is unlikely to be in the cache.

Impact of Cache Misses One simulation study of a hash join operation showed the processor spending from 73% to 82% of its time stalled on data cache misses. Source: Improving Hash Join Performance through Prefetching, by Chen, Ailamaki, Gibbons, and Mowry.

Cache Misses and Rule Engines Measurements using VTune indicate that join nodes in Rete II matchers spend more than 50% of their time stalled on data cache misses.

Abstract Example
while (needtoprocess(curobj)) {
    process(curobj);
    curobj = curobj.successor;
}

Reducing the Impact of Cache Misses If data is fetched ahead of time, the cache miss latency can overlap with useful work:
while (needtoprocess(nextobj)) {
    curobj = nextobj;
    nextobj = curobj.successor;
    dummy = nextobj.successor;
    process(curobj);
}
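The dummy-read trick on this slide can be made into a runnable Java sketch. Node, needToProcess, and process are hypothetical stand-ins for the engine's real types, and a null check is added that the slide's pseudocode elides.

```java
// Runnable sketch of the dummy-read trick: touching the next node early
// starts its cache miss while the current node is still being processed.
public class EarlyFetch {
    static class Node {
        int payload;
        Node successor;
        Node(int payload) { this.payload = payload; }
    }

    static Node sink;   // keeps the dummy load from being optimized away
    static int work;    // accumulated by process()

    static boolean needToProcess(Node n) { return n != null; }
    static void process(Node n) { work += n.payload; }

    static void traverse(Node head) {
        Node nextObj = head;
        while (needToProcess(nextObj)) {
            Node curObj = nextObj;
            nextObj = curObj.successor;
            if (nextObj != null) {
                sink = nextObj.successor;   // early load: pulls nextObj's cache
            }                               // line in while process() runs
            process(curObj);
        }
    }

    public static void main(String[] args) {
        Node head = null;
        for (int i = 1; i <= 5; i++) {      // build the list 5 -> 4 -> 3 -> 2 -> 1
            Node n = new Node(i);
            n.successor = head;
            head = n;
        }
        traverse(head);
        System.out.println(work);           // prints 15 (sum of payloads 1..5)
    }
}
```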

Problem with this Solution Because instructions are retired in-order, this creates an artificial dependency in the code. Also, it is something of a hack.

A Better Solution The x86 architecture has a group of instructions that are specifically for this situation: the PREFETCH group.
while (needtoprocess(curobj)) {
    PREFETCH(curobj.successor);
    process(curobj);
    curobj = curobj.successor;
}

But There is no way to specify in Java that a PREFETCH should be performed.

Outline Modern processor architecture. Prefetching data. Making data structures cache-friendly. Using special instruction set features.

Reducing the Number of Cache Misses When a processor fetches data into the cache, it fetches more than one word. (Typically 64 bytes.) So, if related data can be grouped properly, one fetch can bring in the data for several operations. In addition, if low-level access to memory is used, sometimes explicit pointers can be eliminated, and address arithmetic used instead.

Example: B+ Trees In-memory databases frequently use some variant of B+ Trees to index the data. Many researchers have worked on making the trees cache-friendly. Among the techniques used are sizing the tree nodes to the cache, aligning the fields on appropriate byte boundaries, using address arithmetic rather than pointers, etc.

JVM Problem These kinds of techniques cannot be used in Java. Java puts objects where it decides to. Everything is an object, even arrays.

Example: B+ Node
class INode extends Node {
    int[] keys;
    Node[] pointers;
}
Problems: The arrays keys and pointers are not in the node; they are separate objects. (More pointer-following means more cache misses.) The arrays may or may not be aligned properly for the cache.
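The address-arithmetic layout can be approximated in Java by simulating raw memory with a single int[] pool, so a node's keys and child "pointers" sit contiguously rather than behind separate array objects. This is a sketch, not from the talk; the names (FANOUT, NODE_INTS, etc.) are illustrative, and even this workaround gives no control over cache-line alignment, which remains the slide's objection.

```java
// Sketch: storing every interior node in one flat int[] "pool".
// A node "pointer" is just an offset into the pool, and one cache-line
// fill covers several key comparisons.
public class FlatBPlusNodes {
    static final int FANOUT = 8;                    // children per node
    static final int KEYS = FANOUT - 1;             // keys per node
    static final int NODE_INTS = KEYS + FANOUT + 1; // keys + child refs + key count

    final int[] pool;                               // all nodes live here
    int top;                                        // next free slot

    FlatBPlusNodes(int maxNodes) { pool = new int[maxNodes * NODE_INTS]; }

    // Allocate a node; the returned "pointer" is an offset into the pool.
    int alloc() { int base = top; top += NODE_INTS; return base; }

    int keyCount(int node) { return pool[node + KEYS + FANOUT]; }

    void setKey(int node, int i, int key) {
        pool[node + i] = key;
        pool[node + KEYS + FANOUT] = Math.max(keyCount(node), i + 1);
    }

    void setChild(int node, int i, int child) { pool[node + KEYS + i] = child; }
    int child(int node, int i)                { return pool[node + KEYS + i]; }

    // Find the child slot to descend into: a linear scan over contiguous keys.
    int childSlot(int node, int key) {
        int n = keyCount(node);
        int i = 0;
        while (i < n && key >= pool[node + i]) i++;
        return i;
    }
}
```

Compared with the INode class above, a lookup touches one contiguous region per node instead of the node object plus two separately allocated arrays.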

Outline Modern processor architecture. Prefetching data. Making data structures cache-friendly. Using special instruction set features.

Advanced Instruction Example: SIMD Instructions The current generation of processors offers SIMD instructions that can process 128 bits at a time. The next generation will support 256-bit SIMD instructions. A number of researchers have shown that SIMD instructions can significantly speed up operations such as join, scan, and aggregate.

SIMD and Java SIMD intrinsics cannot be referred to in a Java program.

Questions?