CPS 303 High Performance Computing. Wensheng Shen Department of Computational Science SUNY Brockport


Chapter 2: Architecture of Parallel Computers Hardware Software

2.1.1 Flynn's taxonomy Michael Flynn classified systems according to the number of instruction streams and the number of data streams: single-instruction single-data (SISD), single-instruction multiple-data (SIMD), multiple-instruction single-data (MISD), and multiple-instruction multiple-data (MIMD).

Instruction streams and data streams Data stream: a sequence of digitally encoded data (signals or packets) that is transmitted to or received by a processor. Instruction stream: a sequence of instructions executed by a processor.

Instruction set architecture Stored-program computer: memory stores programs as well as data, so instructions must travel from memory to the CPU before they can be executed. Programs consist of many instructions, the 0's and 1's that tell the CPU what to do. The format and semantics of the instructions are defined by the ISA (instruction set architecture). The instructions reside in memory because CPUs hold very little memory on chip; the more memory the CPU holds, the slower it runs. Thus memory and CPU are separate chips.

2.1.2 SISD --- the classic von Neumann machine [Figure: a single processor (control unit, ALU, memory, input/output devices, external storage) with one instruction pool and one data pool.] A single processor executes a single instruction stream to operate on data stored in a single memory. During any CPU cycle, only one data stream is used. The performance of a von Neumann machine can be improved by caching.

Steps to run a single instruction
IF (instruction fetch): the instruction is fetched from memory; the address of the instruction comes from the program counter (PC), and the instruction is copied from memory into the instruction register (IR).
ID (instruction decode): decode the instruction and fetch the operands.
EX (execute): perform the operation in the ALU (arithmetic logic unit).
MEM (memory access): access memory; this normally happens only during load and store instructions.
WB (write back): write the result of the EX step to a register in the register file.
Update the program counter: normally PC <- PC + 4.

[Figure: three instructions processed one after another, each passing through IF ID EX MEM WB.] Subscalar CPUs: since only one instruction is executed at a time, the entire CPU must wait for that instruction to complete before proceeding to the next one. As a result, the subscalar CPU gets "hung up" on instructions that take more than one clock cycle to complete. This is inherently inefficient: it takes 15 cycles to complete three instructions.

2.1.3 Pipeline and vector architecture [Figure: five overlapping instructions in a 5-stage pipeline, each shifted by one cycle: IF ID EX MEM WB.] Scalar CPUs: with this 5-stage pipeline, the CPU can approach, but not exceed, a throughput of one instruction per clock cycle.

[Figure: two parallel 5-stage pipelines, with two instructions entering IF ID EX MEM WB in every cycle.] Superscalar CPUs: in the simplest superscalar pipeline, two instructions are fetched and dispatched at the same time, so a maximum of two instructions per CPU clock cycle can be achieved.
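To make the throughput comparison above concrete, here is a minimal sketch (mine, not from the slides) that counts cycles for k instructions under the three models, assuming a 5-stage pipeline, no stalls or hazards, and an issue width of 2 for the superscalar case.

    #include <stdio.h>

    #define STAGES 5  /* IF, ID, EX, MEM, WB */

    /* Subscalar: each instruction runs all 5 stages before the next starts. */
    long subscalar_cycles(long k) { return k * STAGES; }

    /* Pipelined scalar: 5 cycles to fill the pipe, then one result per cycle. */
    long pipelined_cycles(long k) { return STAGES + (k - 1); }

    /* Superscalar of width w: up to w instructions complete per cycle once the pipe is full. */
    long superscalar_cycles(long k, long w) {
        long groups = (k + w - 1) / w;        /* ceiling(k / w) */
        return STAGES + (groups - 1);
    }

    int main(void) {
        long k = 3;  /* the three-instruction example from the slide */
        printf("subscalar:   %ld cycles\n", subscalar_cycles(k));       /* 15 */
        printf("pipelined:   %ld cycles\n", pipelined_cycles(k));       /* 7  */
        printf("superscalar: %ld cycles\n", superscalar_cycles(k, 2));  /* 6  */
        return 0;
    }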

Example float x[100], y[100], z[100]; for (i=0; i<100; i++) z[i] = x[i] + y[i]; Each floating-point addition goes through six functional units: fetch the operands from memory, compare exponents, shift one operand, add, normalize the result, store the result in memory. The functional units are arranged in a pipeline: the output of one functional unit is the input to the next. While x[0] and y[0] are being added, one of x[1] and y[1] can be shifted, the exponents of x[2] and y[2] can be compared, and x[3] and y[3] can be fetched. Once the pipeline is full, we produce a result roughly six times faster than without pipelining.

clock   fetch    comp     shift    add      norm     store
1       x0,y0
2       x1,y1    x0,y0
3       x2,y2    x1,y1    x0,y0
4       x3,y3    x2,y2    x1,y1    x0,y0
5       x4,y4    x3,y3    x2,y2    x1,y1    x0,y0
6       x5,y5    x4,y4    x3,y3    x2,y2    x1,y1    x0,y0

Fortran 77:
    do i=1, 100
      z(i) = x(i) + y(i)
    enddo
Fortran 90:
    z(1:100) = x(1:100) + y(1:100)
By adding vector instructions to the basic machine instruction set, we can further improve performance. Without vector instructions, each of the basic instructions has to be issued 100 times; with vector instructions, each has to be issued only once. Another improvement is to use multiple memory banks. Operations that access main memory (fetch and store) are several times slower than CPU-only operations (add). For example, suppose we can execute a CPU operation every cycle, but a memory access only every four cycles. If we use four memory banks and distribute the data so that z[i] lives in memory bank i mod 4, we can execute one store operation per cycle.
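On current hardware the same idea shows up as compiler vectorization. The sketch below is my own illustration, not from the slides: it writes the array addition in C and uses the OpenMP simd directive as a portable hint that the loop may be issued as vector instructions (compile with -fopenmp or -fopenmp-simd; the pragma is optional and the loop is correct without it).

    #include <stdio.h>

    #define N 100

    int main(void) {
        float x[N], y[N], z[N];

        /* initialize x and y */
        for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 2.0f * i; }

        /* Ask the compiler to use vector (SIMD) instructions for this loop:
           one vector add then replaces a group of scalar adds. */
        #pragma omp simd
        for (int i = 0; i < N; i++)
            z[i] = x[i] + y[i];

        printf("z[99] = %f\n", z[99]);  /* expect 297.0 */
        return 0;
    }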

2.1.4 SIMD [Figure: one instruction pool driving several processing units P, each loading, adding, and storing its own elements X[i], Y[i], Z[i] from the data pool.] A type of parallel computer. Single instruction: all processing units execute the same instruction at any given clock cycle. Multiple data: each processing unit can operate on a different data element. A SIMD machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity processing units. It is best suited for specialized problems characterized by a high degree of regularity, e.g., image processing.

A single control CPU and a large collection of subordinate ALUs, each with its own memory. During each instruction cycle, the control processor broadcasts an instruction to all of the subordinate processors, and each subordinate processor either executes the instruction or stays idle.

for (i=0; i<100; i++) if (y[i] != 0.0) z[i] = x[i]/y[i]; else z[i] = x[i];
On a SIMD machine this executes in lockstep:
Time step 1: every unit tests local_y != 0.
Time step 2: units with local_y != 0 compute z[i] = x[i]/y[i]; units with local_y == 0 are idle.
Time step 3: units with local_y != 0 are idle; units with local_y == 0 compute z[i] = x[i].
Disadvantage: in a program with many conditional branches, or long segments of code whose execution depends on conditionals, it is likely that many processing units will remain idle for long periods.

2.1.5 MISD A single data stream is fed into multiple processing units. Each processing unit operates on the data independently via an independent instruction stream. Very few actual machines exist; CMU's C.mmp computer (1971) is one example. [Figure: several processing units P sharing one data pool; each loads the same X[1] but applies its own instruction stream, e.g., Mul Y[1], A, X[1] / Add Z[1], X[1], Y[1] / Store Z[1], with different constants A, B, C.]

2.1.6 MIMD Multiple instruction streams: every processor may execute a different instruction stream. Multiple data streams: every processor may work with a different data stream. Execution can be synchronous or asynchronous, deterministic or nondeterministic. Examples: most current supercomputers, grids, networked parallel computers, and multiprocessor SMP computers. Each processor has both a control unit and an ALU, and is capable of executing its own program at its own pace. [Figure: several processors P, each with its own instruction stream and its own data, e.g., one computes Z[1] = X[1] + Y[1] while another computes Y = A * 10 and B = Y - A.]

2.1.7 shared-memory MIMD Bus-based architecture Switch-based architecture Cache coherence

[Figure: generic shared-memory architecture; several CPUs connected through an interconnection network to several memory modules.] Shared-memory systems are sometimes called multiprocessors.

Bus-based architecture [Figure: several CPUs, each with its own cache, attached to a single bus that connects them to the memory modules.] The interconnection network is bus based. The bus becomes saturated if multiple processors simultaneously attempt to access memory, so each processor is given access to a fairly large cache. These architectures do not scale well to large numbers of processors because of the limited bandwidth of the bus.

Switch-based architecture [Figure: CPUs along one edge and memory modules along the other edge of a crossbar of wires and switches.] The interconnection network is switch based. A crossbar can be visualized as a rectangular mesh of wires with switches at the points of intersection and terminals on its left and top edges. A switch can either allow a signal to pass through in both the vertical and horizontal directions simultaneously, or redirect it from vertical to horizontal or vice versa. Any processor can access any memory module, and while it does so, any other processor can simultaneously access any other memory module.

The crossbar switch-based architecture is very expensive: a total of mn hardware switches is needed for an m x n crossbar. The crossbar system is a NUMA (nonuniform memory access) system, because when a processor accesses memory attached to a different crossbar, the access times will be greater.

Cache coherence The caching of shared variables must preserve cache coherence. Basic idea (bus snooping): each processor has a cache controller that monitors the bus traffic. When a processor updates a shared variable, it also updates the corresponding main memory location. The cache controllers on the other processors detect the write to main memory and mark their copies of the variable as invalid. Because this approach relies on the shared bus, it is not well suited to other types of shared-memory machines.

2.1.8 Distributed-memory MIMD [Figure: several CPU/memory pairs connected by an interconnection network.] In a distributed-memory system, each processor has its own private memory.

[Figure: a static network (mesh) and a dynamic network (crossbar).] A node is a vertex corresponding to a processor/memory pair. In a static network, all vertices are nodes; in a dynamic network, some vertices are nodes and the other vertices are switches.

Fully connected interconnection network The ideal interconnection network is a fully connected network, in which each node is directly connected to every other node. With a fully connected network, each node can communicate directly with every other node, so communication involves no forwarding delay, but the cost is too high to be practical. Question: how many connections are needed for a 10-processor machine?

Crossbar interconnection network Question: for a machine with p processors, how many switches do we need?

Multistage switching network For a machine of p nodes, an omega network uses p log2(p)/2 switches. [Figure: an omega network.]

Static interconnection networks [Figure: a linear array and a ring.] For a system of p processors, a linear array needs p-1 wires and a ring needs p wires. They scale well, but the communication cost is high: in a linear array, a message between two communicating processors may have to be forwarded along as many as p-1 wires, and in a ring along as many as p/2 wires.

[Figure: hypercubes of dimension 1, 2, and 3.] For a hypercube network of dimension d, the number of processors is p = 2^d. The maximum number of wires a message needs to be forwarded along is d = log2(p), which is much better than the linear array or ring. However, a hypercube does not scale well: each time we wish to increase the machine size, we must double the number of nodes and add a new wire to every node.
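A convenient property of the hypercube (standard, though not stated on the slide) is that two nodes are neighbors exactly when their binary labels differ in one bit, so the d neighbors of a node are found by flipping each of its d address bits. The short C sketch below, written purely for illustration, lists the neighbors of one node and confirms that the farthest node is d hops away.

    #include <stdio.h>

    /* Hops between two hypercube nodes = number of differing address bits. */
    static int hops(unsigned a, unsigned b) {
        unsigned x = a ^ b;
        int count = 0;
        while (x) { count += x & 1u; x >>= 1; }
        return count;
    }

    int main(void) {
        int d = 3;                 /* dimension */
        unsigned p = 1u << d;      /* p = 2^d = 8 nodes */
        unsigned node = 5;         /* node 101 in binary */

        printf("neighbors of node %u:", node);
        for (int k = 0; k < d; k++)
            printf(" %u", node ^ (1u << k));   /* flip one bit per dimension */
        printf("\n");

        /* The diameter is d: a node and its bitwise complement differ in all d bits. */
        unsigned farthest = (~node) & (p - 1);
        printf("hops from %u to %u = %d (diameter = %d)\n",
               node, farthest, hops(node, farthest), d);
        return 0;
    }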

[Figure: a two-dimensional mesh and a three-dimensional mesh.] If a mesh has dimensions d1 x d2 x ... x dn, then the maximum number of wires a message has to traverse is (d1 - 1) + (d2 - 1) + ... + (dn - 1). If the mesh is square, d1 = d2 = ... = dn, the maximum is n(p^(1/n) - 1). A mesh becomes a torus if wraparound wires are added; for a torus the maximum is (1/2) n p^(1/n). Meshes and tori scale better than hypercubes: to increase the size of a q x q mesh, we simply add a q x 1 mesh (a column of q nodes) and q connecting wires. In general, we need to add p^((n-1)/n) nodes to increase the size of a square n-dimensional mesh or torus.
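The small C sketch below (illustrative only, using made-up machine sizes) evaluates these worst-case hop counts for a square n-dimensional mesh and torus of p nodes, using the formulas above; compile with -lm for the math library.

    #include <stdio.h>
    #include <math.h>

    /* Worst-case hops in a square n-dimensional mesh of p nodes: n * (p^(1/n) - 1). */
    static double mesh_max_hops(double p, int n)  { return n * (pow(p, 1.0 / n) - 1.0); }

    /* Worst-case hops in a square n-dimensional torus of p nodes: (n/2) * p^(1/n). */
    static double torus_max_hops(double p, int n) { return 0.5 * n * pow(p, 1.0 / n); }

    int main(void) {
        double p = 1024.0;   /* number of nodes */
        for (int n = 1; n <= 3; n++)
            printf("n = %d: mesh max hops = %.1f, torus max hops = %.1f\n",
                   n, mesh_max_hops(p, n), torus_max_hops(p, n));
        return 0;
    }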

Characteristics of static networks
Diameter: the maximum distance between any two nodes in the network.
Arc connectivity: the minimum number of arcs that must be removed from the network to break it into two disconnected networks.
Bisection width: the minimum number of communication links that must be removed to partition the network into two equal halves.
Number of links: the total number of links in the network.

Characteristics of static networks

Network           Diameter               Bisection width   Arc connectivity   Number of links
Fully connected   1                      p^2/4             p-1                p(p-1)/2
Star              2                      1                 1                  p-1
Linear array      p-1                    1                 1                  p-1
Ring (p>2)        floor(p/2)             2                 2                  p
Hypercube         log2(p)                p/2               log2(p)            p log2(p)/2
2D mesh           2(sqrt(p)-1)           sqrt(p)           2                  2(p-sqrt(p))
2D torus          2 floor(sqrt(p)/2)     2 sqrt(p)         4                  2p

2.1.9 Communication and routing If two nodes are not directly connected, or if a processor is not directly connected to a memory module, how is data transmitted between the two? If there are multiple routes joining the two nodes, or the processor and memory, how is the route decided on? Is the chosen route always the shortest?

Store-and-forward routing [Figure: a timing diagram of a four-piece message (w, x, y, z) moving from node A through node B to node C over time steps 0-8.] Store-and-forward routing: A sends a message to C through B; B reads the entire message and then sends it on to C. This takes more time and more memory.

Cut-through routing [Figure: the same four-piece message moving from node A through node B to node C over time steps 0-5; B passes each piece on as soon as it arrives.] Cut-through routing: A sends a message through B to C; B immediately forwards each identifiable piece (packet) of the message to C.

Communication unit A message is a contiguous group of bits that is transferred from the source terminal to the destination terminal. A packet is the basic unit of a message; its size is on the order of hundreds to thousands of bytes, and it consists of header flits and data flits.

Flit: a flit is the smallest unit of information at the link layer; its size is a few words. Phit: a phit is the smallest physical unit of information at the physical layer, the amount transferred across one physical link in one cycle.

Communication cost
Startup time (ts): the time required to handle a message at the sending and receiving nodes, which includes (1) preparing the message (adding header, trailer, and error-correction information), (2) executing the routing algorithm, and (3) establishing an interface between the local node and the router. Note: this latency is incurred only once for a single message transfer.
Per-hop time (th): the time taken by the header of a message to travel between two directly connected nodes in the network. Note: the per-hop time is also called node latency.
Per-word transfer time (tw): the time taken for one word to traverse one link; it is the reciprocal of the channel bandwidth.

When a message traverses a path with multiple links, each intermediate node on the path forwards the message to the next node only after it has received and stored the entire message (store-and-forward routing). The total communication cost for a message of size m words to traverse a path of l links is t_comm = t_s + (m t_w + t_h) l.

Example: communication time for a linear array [Figure: a linear array of nodes 0 1 2 3 4.] (1) Store-and-forward routing: t_comm = t_s + m l t_w, since in modern parallel computers the per-hop time is very small compared with the per-word time. (2) Cut-through routing: t_comm = t_s + l t_h + m t_w; the term containing the product of message size and number of links no longer appears.
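As a quick numeric check, the sketch below (my own, with made-up parameter values) evaluates the two cost models for a message of m words crossing l links; with a long message, the store-and-forward cost grows with m*l while the cut-through cost grows only with m.

    #include <stdio.h>

    /* Store-and-forward cost, neglecting the per-hop time: ts + m*l*tw. */
    static double cost_store_forward(double ts, double tw, double m, double l) {
        return ts + m * l * tw;
    }

    /* Cut-through cost: ts + l*th + m*tw. */
    static double cost_cut_through(double ts, double th, double tw, double m, double l) {
        return ts + l * th + m * tw;
    }

    int main(void) {
        /* Illustrative values (not from any real machine); all times in microseconds. */
        double ts = 50.0, th = 0.5, tw = 0.01;
        double m = 10000.0;   /* message size in words */
        double l = 4.0;       /* number of links, as in the 5-node linear array above */

        printf("store-and-forward: %.1f us\n", cost_store_forward(ts, tw, m, l));    /* 450.0 */
        printf("cut-through:       %.1f us\n", cost_cut_through(ts, th, tw, m, l));  /* 152.0 */
        return 0;
    }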

2.2 Software issues A program is parallel if, at any time during its execution, it can comprise more than one process. We will see how processes can be specified, created, and destroyed.

2.2.1 Shared-memory programming Private and shared variables int private_x; shared int sum = 0; sum = sum + private_x; The statement sum = sum + private_x is actually executed as several machine operations: fetch sum into register A, fetch private_x into register B, add the contents of register B to register A, store the contents of register A in sum. If two processes interleave these operations, the result can be wrong:

Time   Process 0               Process 1
0      Fetch sum = 0           Finish calculation of private_x
1      Fetch private_x = 2     Fetch sum = 0
2      Add 2 + 0               Fetch private_x = 3
3      Store sum = 2           Add 3 + 0
4                              Store sum = 3

The final value of sum is 3, even though the two contributions should have produced 5.

Mutual exclusion, critical section, binary semaphore, barrier

    shared int s = 1;
    while (!s);        /* wait until s == 1 */
    s = 0;
    sum = sum + private_x;
    s = 1;

The wait and signal operations can be packaged as a binary semaphore:

    void P(int *s /* in/out */) { while (!(*s)); *s = 0; }
    void V(int *s /* out */)    { *s = 1; }

Problem: the test and update of s are not atomic; two processes may both see the value 1 and enter the critical section before either has set s to 0.

    int private_x;
    shared int sum = 0;
    shared int s = 1;
    /* compute private_x */
    P(&s);
    sum = sum + private_x;
    V(&s);
    Barrier();
    if (/* I am process 0 */) printf("sum = %d\n", sum);
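Since the busy-wait P/V above is not atomic, a real shared-memory program would use a library-provided lock. Below is a minimal sketch of my own (not from the slides) of the same sum computation using POSIX threads, where a mutex plays the role of the binary semaphore and pthread_barrier_t plays the role of Barrier(); compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static int sum = 0;                          /* shared */
    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t barrier;

    static void *work(void *arg) {
        long rank = (long)arg;
        int private_x = (int)rank + 1;           /* stand-in for a real computation */

        pthread_mutex_lock(&sum_lock);           /* P(&s): enter the critical section */
        sum = sum + private_x;
        pthread_mutex_unlock(&sum_lock);         /* V(&s): leave the critical section */

        pthread_barrier_wait(&barrier);          /* Barrier() */
        if (rank == 0)                           /* "I am process 0" */
            printf("sum = %d\n", sum);           /* 1 + 2 + 3 + 4 = 10 */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long r = 0; r < NTHREADS; r++)
            pthread_create(&tid[r], NULL, work, (void *)r);
        for (int r = 0; r < NTHREADS; r++)
            pthread_join(tid[r], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }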

2.2.2 Message passing The most commonly used method of programming distributed-memory MIMD systems is message passing or one of its variants. We focus on the Message-Passing Interface (MPI).

MPI_Send() and MPI_Recv()

    int MPI_Send(void* buffer /* in */, int count /* in */, MPI_Datatype datatype /* in */,
                 int destination /* in */, int tag /* in */, MPI_Comm communicator /* in */)

    int MPI_Recv(void* buffer /* in */, int count /* in */, MPI_Datatype datatype /* in */,
                 int source /* in */, int tag /* in */, MPI_Comm communicator /* in */,
                 MPI_Status* status /* out */)

Process 0 sends a float x to process 1: MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD); Process 1 receives the float x from process 0: MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status); Different programs or a single program? SPMD (Single-Program Multiple-Data) model:
    if (my_process_rank == 0)
        MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (my_process_rank == 1)
        MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
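For completeness, here is a minimal, self-contained SPMD program built around the two calls above. Everything other than the MPI_Send/MPI_Recv pair (the MPI_Init/MPI_Comm_rank/MPI_Finalize scaffolding and the sample value of x) is boilerplate added here, not part of the slide.

    /* Compile: mpicc send_recv.c -o send_recv    Run: mpirun -np 2 ./send_recv */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int my_process_rank;
        float x = 0.0f;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_process_rank);

        if (my_process_rank == 0) {
            x = 3.14f;                                   /* value to send */
            MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (my_process_rank == 1) {
            MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
            printf("process 1 received x = %f\n", x);
        }

        MPI_Finalize();
        return 0;
    }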

Buffering [Figure: process 0 (A) issues a request to send; process 1 (B) signals it is ready to receive.] We can buffer the message: the content of the message is copied into a system-controlled block of memory (on A, on B, or on both), and process 0 can continue executing. Synchronous communication: process 0 waits until process 1 is ready. Buffered communication: the message is buffered into the appropriate memory location controlled by process 1. Advantage: the sending process can continue to do useful work if the receiving process is not ready, and the system will not crash even if process 1 never executes a receive. Disadvantage: it uses additional memory, and if the receiving process is already ready, the communication actually takes longer because of copying data between the buffer and the user program's memory locations.

Blocking and nonblocking communication Blocking communication: a process remains idle until the message is available, as with MPI_Recv(). With blocking communication, it may not be necessary for process 0 to receive permission to go ahead with the send. Nonblocking receive operation: MPI_Irecv(), which takes an additional parameter, a request. The call notifies the system that process 1 intends to receive a message from process 0 with the properties indicated by the arguments; the system initializes the request argument and the call returns immediately. Process 1 can then perform other useful work and check back later to see whether the message has arrived. Nonblocking communication can provide dramatic improvements in the performance of message-passing programs.
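A short sketch of the nonblocking pattern described above is shown below; it is my own illustration, not from the slides. It overlaps a dummy computation with the receive and then completes it with MPI_Wait (MPI_Test could be used instead for the "check back later" style).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank;
        float x = 0.0f;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            x = 3.14f;
            MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Request request;
            MPI_Status status;

            /* Post the receive and return immediately. */
            MPI_Irecv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &request);

            /* ... other useful work while the message is in flight ... */
            double busy = 0.0;
            for (int i = 0; i < 1000000; i++) busy += i * 1e-9;

            /* Complete the receive before using x. */
            MPI_Wait(&request, &status);
            printf("process 1 received x = %f (busy work = %f)\n", x, busy);
        }

        MPI_Finalize();
        return 0;
    }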

2.2.3 Data-parallel languages

          program add_arrays
    !hpf$ PROCESSORS p(10)
          real x(1000), y(1000), z(1000)
    !hpf$ ALIGN y(:) WITH x(:)
    !hpf$ ALIGN z(:) WITH x(:)
    !hpf$ DISTRIBUTE x(BLOCK) ONTO p
    c     initialize x and y
          z = x + y
          end

(1) The PROCESSORS directive specifies a collection of 10 abstract processors; (2) the real declaration defines the arrays; (3) the first ALIGN specifies that y should be mapped to the abstract processors in the same way that x is; (4) the second ALIGN specifies that z should be mapped to the abstract processors in the same way that x is; (5) DISTRIBUTE specifies which elements of x will be mapped to which abstract processors; (6) BLOCK specifies that x will be mapped by blocks onto the processors: the first 1000/10 = 100 elements are mapped to the first processor.

2.2.4 RPC and active messages RPC (remote procedure call) and active messages are two other methods for programming parallel systems, but we are not going to discuss them in this course.

2.2.5 Data mapping Optimal data mapping is about assigning data elements to processors so that communication is minimized. Suppose our array is A = (a0, a1, a2, ..., an-1) and our processors are P = (q0, q1, q2, ..., qp-1). If the number of processors equals the number of array elements, we simply assign ai to qi. Block mapping: partition the array elements into blocks of consecutive entries and assign the blocks to the processors. If p = 3 and n = 12: a0, a1, a2, a3 -> q0; a4, a5, a6, a7 -> q1; a8, a9, a10, a11 -> q2. Cyclic mapping: assign the first element to the first processor, the second element to the second processor, and so on, wrapping around: a0, a3, a6, a9 -> q0; a1, a4, a7, a10 -> q1; a2, a5, a8, a11 -> q2. Block-cyclic mapping: partition the array into blocks of consecutive elements as in the block mapping, but the blocks are not necessarily of size n/p; the blocks are then mapped to the processors in the same way that the elements are mapped in the cyclic mapping. With block size 2: a0, a1, a6, a7 -> q0; a2, a3, a8, a9 -> q1; a4, a5, a10, a11 -> q2.
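As a compact illustration (not part of the slides), the sketch below computes which processor owns array element i under each of the three mappings, using the p = 3, n = 12, block size 2 example above; the block mapping assumes n is divisible by p.

    #include <stdio.h>

    /* Owner of element i under block, cyclic, and block-cyclic mappings. */
    static int owner_block(int i, int n, int p)        { return i / (n / p); }
    static int owner_cyclic(int i, int p)              { return i % p; }
    static int owner_block_cyclic(int i, int b, int p) { return (i / b) % p; }

    int main(void) {
        int n = 12, p = 3, b = 2;   /* 12 elements, 3 processors, block size 2 */

        printf("element  block  cyclic  block-cyclic\n");
        for (int i = 0; i < n; i++)
            printf("a%-7d q%-6d q%-7d q%d\n",
                   i,
                   owner_block(i, n, p),
                   owner_cyclic(i, p),
                   owner_block_cyclic(i, b, p));
        return 0;
    }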

How about matrices? The same ideas apply in each dimension: a matrix can be distributed by rows, by columns, or by two-dimensional blocks onto a grid of processors. [Figure: a matrix partitioned among processors p0, p1, ... arranged as a grid.]