A Freely Configurable Audio-Mixing Engine. M. Rosenthal, M. Klebl, A. Gunzinger, G. Tröster


A Freely Configurable Audio-Mixing Engine with Automatic Loadbalancing

M. Rosenthal, M. Klebl, A. Gunzinger, G. Tröster
Electronics Laboratory, Swiss Federal Institute of Technology
CH-8092 Zurich, Switzerland
March 7, 1995

Abstract

The most important design issue for digital audio mixing consoles is the communication concept that is used to interconnect an array of signal processors. This paper demonstrates the implementation of a digital mixing console that is capable of routing up to 100 audio signal-paths. The audio algorithms run on a modular DSP network. Signal-paths can be edited with a graphical user interface and are automatically mapped onto the DSP network with optimal load-balancing.

1 Introduction

The typical architecture of a digital audio mixing console is shown in Figure 1. The most obvious difference from an analog mixing console is the clear separation of mixing-desk and mixing-engine. The mixing-engine performs the digital signal processing and is completely controlled by the mixing-desk.

A digital mixing console can be described as a network of signal-paths in which a large number of digital audio functions are combined. Each function processes some input signals and produces some output signals. The output signals of each function are used as input signals for other functions. This network can be described as a signal flow-graph in which every function has at least one input and one output. The sources and drains of the flow-graph are the audio inputs and outputs of the entire mixing console. The mixing-desk controls the functions directly through a set of parameters. For example, a scale function scales a particular audio signal according to the position of a slider on the mixing-desk.

The huge amount of processing power that a digital mixing console demands can only be provided by an array of several processors. Such an array needs high-bandwidth interprocessor communication to carry the data exchanged between audio functions.
The communication network is often designed directly in hardware, because software-controlled communication either does not satisfy the strict speed requirements or uses too much processing power. As a result, the communication network becomes very inflexible and changes in the digital data flow of audio signals are difficult. To tackle the lack of routability and communication speed of the interprocessor network, this paper presents a new, highly flexible communication structure. It has a bandwidth of 800 MBit/s and allows up to 512 internal audio channels to be communicated autonomously without decreasing the processing power.

2 System Requirements

As mentioned above, audio functions need to be spread among an array of processors. To provide an efficient distribution, a homogeneous architecture of processing elements (PEs) is required. In other words, all PEs have the same capabilities and there is no master processor. Under this condition a load-balancing analysis can be performed in which the processing time needed by each function is measured. The goal is to find the best possible way of spreading the functions among the processing elements while using the smallest possible number of PEs. Such a system works with optimal load-balancing.

The next design issue is finding a way of mapping the signal flow-graph of the functions onto the existing hardware. For a fully routable system, which allows any interconnection between two or more functions running on any processor in the system, the communication structure of the interprocessor network must be orthogonal. This means each processor must be able to access all data produced by all other processors.

The last important requirement is that interprocessor communication runs independently of the processors and consumes no processing power.

3 System Architecture

Figure 2 shows the system architecture of the mixing-engine. One major goal of the project was to build a fully scalable system, where the number of PEs can be anything between 1 and 100. This corresponds directly to the fact that the needed processing power varies with the number of functions describing a mixing console. For reasons of scalability, the chosen network topology is a ring. Buses and crossbars are other network examples. However, a bus can establish only one connection at a time and must be arbitrated. A crossbar of order N can establish N connections at a time. This topology offers the best connectivity, but a scalable system with a crossbar network can hardly be realized. Other systems with a ring architecture have been built, such as the WARP [1], the iWarp and the RAP [2].

3.1 Communication Concept

Every PE has its own communication controller (CC), which is responsible for the data flow on the ring bus. The data to be transferred passes every CC in a strictly ordered fashion, value by value. Every controller is programmed to insert data from its PE at a certain position of the data stream. It also copies data coming from other PEs out of the data stream and stores it for its PE. This is a special kind of independent time-division multiplexed (TDM) bus. To reduce the required bandwidth while still meeting the orthogonality requirement of the interprocessor network, every CC communicates only data that is needed by other processors. Before the communication starts, the CCs have to be configured by the processors. After that, the network operates completely autonomously.

Figure 3 illustrates the overlapping of communication and processing. The CCs synchronize themselves and start the communication as soon as the first data values are available. Data transfer and processing are executed simultaneously without slowing down the processors. After the last value has reached the last CC, the communication cycle is finished and the processing can start immediately. Processing is therefore synchronized with the end of the communication cycle and not with the master sampling clock. Let S_C be the synchronization time of the CC and S_P the synchronization time of the PE. The synchronization time over the entire system is then the maximum of S_C and S_P. If no CC is involved, as in other implementations, all synchronization is done by the processors, and the synchronization time is the sum of S_C and S_P. A new cycle begins after the next master sampling clock. Because of the independent communication controller, the communication concept is named "Intelligent Communication".

3.2 Global Audio Channels

The data that is communicated on the ring bus can be described as a set of global channels. Each channel is a digital audio connection between audio functions on different processors. Each processor produces a certain part of these channels, depending on which functions are running on it. If two functions on the same processor need a connection, there is no need to use a global audio channel. The audio data can be transferred within the memory of the processor. This is equivalent to a local audio connection.

The amount of communicated data is limited by the highest possible clock frequency on the ring bus and is currently 25 MWord/s. At a digital audio sampling rate of 48 kHz it is possible to communicate up to 512 global channels at 32 bits/word. Assuming an optimal signal flow-graph of a digital mixing console, in which most of the audio connections can be held locally and no more than 5 global audio channels are used for a full signal-path, it is possible to route up to 100 audio signal-paths on the system.
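
The figures quoted in Section 3.2 can be checked with a few lines of arithmetic, and the slot-based exchange of Section 3.1 can be pictured in the same sketch. The Python below is an illustrative model only, not the logic of the actual communication controllers. The names `exchange_frame`, `writers`, `readers` and `produced` are hypothetical, and collapsing the pass around the ring into one write phase followed by one read phase is a simplifying assumption.

```python
# Figures quoted in the text.
RING_WORDS_PER_SECOND = 25_000_000  # ring bus rate: 25 MWord/s
WORD_BITS = 32                      # bits per word
SAMPLE_RATE_HZ = 48_000             # audio sampling rate
GLOBAL_CHANNELS = 512               # global channels stated in the text
CHANNELS_PER_PATH = 5               # upper bound per signal-path in the text

# Raw ring bandwidth in bit/s: 25e6 words * 32 bits = 800 Mbit/s.
ring_bits_per_second = RING_WORDS_PER_SECOND * WORD_BITS

# Word slots that fit into one sample period: 25e6 // 48e3 = 520,
# which leaves room for the 512 global channels.
slots_per_sample = RING_WORDS_PER_SECOND // SAMPLE_RATE_HZ

# Signal-paths that fit if each needs at most 5 global channels: 102,
# consistent with the "up to 100" quoted in the text.
max_signal_paths = GLOBAL_CHANNELS // CHANNELS_PER_PATH


def exchange_frame(writers, readers, produced):
    """Model one communication cycle as a frame of channel slots.

    writers:  {channel: pe}, which PE inserts data into which slot
    readers:  {pe: [channel, ...]}, slots each controller copies out
    produced: {pe: {channel: value}}, values each PE wrote this sample
    Returns {pe: {channel: value}} as delivered to each consumer.
    """
    frame = {}
    # Each controller inserts its PE's data at its configured positions.
    for channel, pe in writers.items():
        frame[channel] = produced[pe][channel]
    # Each controller copies only the channels its PE actually needs.
    return {pe: {ch: frame[ch] for ch in chans}
            for pe, chans in readers.items()}


if __name__ == "__main__":
    writers = {0: "PE1", 1: "PE2"}
    readers = {"PE1": [1], "PE2": [0], "PE3": [0, 1]}
    produced = {"PE1": {0: 0.25}, "PE2": {1: -0.5}}
    print(ring_bits_per_second, slots_per_sample, max_signal_paths)
    print(exchange_frame(writers, readers, produced))
```

In this model a channel is written once and can be read by any number of controllers, which is one way to picture the orthogonality requirement of Section 2.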

3.3 Data Input and Output

Two interface modules provide the exchange of audio raw-data with external audio resources. One is an AES/EBU interface, the other is a Multi Audio Digital Interface (MADI). With one MADI interface, a maximum of 56 digital audio channels can be connected directly to the mixing-engine. To support an efficient data flow, it is important to connect the interface modules directly to a CC. This way no processing power is lost at all.

Figure 4 shows the topology of the mixing-engine with the I/O features. An interface module can be positioned anywhere between two processors. A system can have several MADI and AES/EBU interfaces. Like the processors, the interface modules produce a certain part of the global channels. The CC of each module works autonomously and is responsible for the data that the module produces and consumes.

3.4 Parameter Processing

The mixing-desk needs to control every function that is currently running on the system with a certain set of parameters. This information also has to be communicated through the interprocessor network. However, parameters do not change as fast as audio raw-data, so it is not necessary to use one global audio channel for each parameter. In the current implementation 512 parameters are multiplexed on one global audio channel. This corresponds to an update rate of more than 100 times per second per parameter. Only a few global audio channels are used for parameters, and no special communication network for parameters has to be implemented. Between two parameter updates, the parameters are interpolated to avoid audible discontinuities.

4 Hardware Implementation

The hardware platform of the mixing-engine is the MUSIC parallel computer built at the Electronics Laboratory of the Swiss Federal Institute of Technology [3] [4]. However, the communication concept and the operating system were completely redesigned. Figure 5 shows a processing board.
Three PEs fit on one board (22 cm by 23 cm) and up to 63 PEs can be connected together in a standard 19-inch rack. A special I/O board makes it possible to connect a MADI or an AES/EBU module directly to the interprocessor network. The modular design allows the system to be scaled according to individual needs. Only the necessary number of PEs and modules are inserted in the system, so hardware overhead can be substantially reduced.

One PE consists of a Motorola DSP96002 floating-point digital signal processor, 1 MByte of static RAM and 2 MBytes of dual-ported DRAM (Video RAM) organized in two blocks called "producer" memory and "consumer" memory. Each PE has its own communication controller, which is responsible for the data flow between the PE and the interprocessor network. The CC is implemented in a Xilinx XC3190 FPGA. It fetches data through the serial port of the producer VRAM and writes arriving data into the serial port of the consumer VRAM. The serial buffers of the producer and consumer VRAM can each store 512 IEEE floating-point values of 32 bits.

5 Software

The software for the mixing-engine consists of three parts: a signal-flow-graph editor, a configuration software and a runtime kernel. Figure 6 shows the three steps for the reconfiguration of the mixing-engine. Each step corresponds to a separate software module.

5.1 Signal-Flow-Graph Editor

Audio functions are programmed in optimized assembler code. They appear as icons in the signal-flow-graph editor. Figure 7 demonstrates how functions can be placed and connected together. Subgraphs can be defined for later use. For example, a complete channel structure can be designed and inserted as a block into the total system. The graphical user interface runs on a UNIX workstation.

5.2 Signal-Path Router

After the audio functions have been placed and connected with the signal-flow-graph editor, the signal-path router configures the mixing-engine according to the designed signal network. In a load-balancing analysis the functions are placed on the processors. For parallel programs with asynchronous data exchange this is a known problem [5] [6]. However, a synchronous system like a mixing-engine already has well-partitioned functions, and the processing time for each processor is fixed. What matters is an optimal load on the processors.

In the next step the routing of the signal-flow-graph is performed and mapped onto the mixing-engine. If a connection between two processors is needed, the interprocessor network is used through one of the global audio channels. If two functions that run on the same processor are linked together, a local connection is established.

5.3 Runtime Kernel

After the system has booted, a runtime kernel runs on each PE. It synchronizes all running functions with the communication controller.
At a master sampling frequency of 48 kHz, new audio signals arrive about every 20 µs. This time also corresponds to the processing time available on each processor when no pipelining of functions is involved. The kernel uses less than 5 % of the processing time on each processor. The remaining time is reserved exclusively for the audio functions.
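
Sections 5.2 and 5.3 together suggest a simple way to picture placement and routing: each PE has one sample period minus the kernel overhead available, functions are packed onto as few PEs as possible, and each connection then becomes local or global depending on where its endpoints land. The sketch below uses first-fit decreasing, a standard bin-packing heuristic chosen here only for illustration. The text does not say which algorithm the signal-path router uses, and first-fit decreasing does not guarantee an optimal result. The function names, the per-function costs and the example graph are hypothetical.

```python
SAMPLE_RATE_HZ = 48_000    # master sampling frequency from the text
KERNEL_SHARE = 0.05        # kernel uses less than 5 %; upper bound assumed

# Time per sample period in microseconds (about 20.8) and the part
# left for audio functions after the kernel's share (about 19.8).
PERIOD_US = 1e6 / SAMPLE_RATE_HZ
BUDGET_US = PERIOD_US * (1.0 - KERNEL_SHARE)


def place_functions(costs_us, budget_us=BUDGET_US):
    """First-fit decreasing placement of functions onto PEs.

    costs_us: {function name: measured processing time in microseconds}
    Returns {function name: PE index}.
    """
    loads = []        # accumulated load per PE
    placement = {}
    for name, cost in sorted(costs_us.items(), key=lambda kv: -kv[1]):
        if cost > budget_us:
            raise ValueError(f"{name} does not fit on a single PE")
        for pe, load in enumerate(loads):
            if load + cost <= budget_us:
                loads[pe] += cost
                placement[name] = pe
                break
        else:
            # No existing PE has room, so add another one.
            loads.append(cost)
            placement[name] = len(loads) - 1
    return placement


def route_connections(edges, placement):
    """Split connections into local ones and global audio channels.

    edges: list of (source function, destination function)
    Returns (local edges, {source function: global channel number}).
    Assumption: a source feeding several remote functions needs only
    one channel, because every controller can copy the same slot.
    """
    local, channels = [], {}
    for src, dst in edges:
        if placement[src] == placement[dst]:
            local.append((src, dst))
        elif src not in channels:
            channels[src] = len(channels)
    return local, channels


if __name__ == "__main__":
    costs = {"eq": 9.0, "compressor": 8.0, "scale": 2.0,
             "mix": 6.0, "delay": 5.0}
    edges = [("scale", "eq"), ("eq", "compressor"),
             ("compressor", "mix"), ("delay", "mix")]
    placement = place_functions(costs)
    print(placement)
    print(route_connections(edges, placement))
```

With these made-up costs the five functions fit on two PEs and only the connection from the compressor to the mixer needs a global audio channel. The other three connections stay in local memory.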

5.4 Audio Function Design

New audio functions can be included with a minimum of software effort. Using a well-defined software interface, the user can insert any self-written audio function into the system. The new function can be programmed in C or DSP assembler code. After integration it is visible in the signal-flow-graph editor and can be placed in any audio signal network.

6 Conclusion

This paper describes a communication network that is very flexible and still reaches the speed necessary for multichannel digital audio communication. Reconfiguration of signal-paths is done easily with automatic load-balancing on all processors, which guarantees an optimal usage of processing resources. Therefore any configuration of a digital mixing console described in a signal-flow-graph can be implemented. The presented implementation is the result of research work and is not a cost-effective solution. However, the system is fully operational and serves as the platform for an industrial product.

References

[1] M. Annaratone, E. Arnould, T. Gross, H. T. Kung, M. Lam, O. Menzilicioglu, J. A. Webb. The WARP Computer: Architecture, Implementation and Performance. IEEE Trans. on Computers, Vol. C-36, No. 12, December 1987, pp. 1523-1538.
[2] N. Morgan, J. Beck, P. Kohn, J. Bilmes, E. Allman, J. Beer. The RAP: A Ring Array Processor for Layered Network Calculations. In International Conference on Application Specific Array Processors. IEEE Computer Society Press, 1990.
[3] A. Gunzinger, U. A. Müller, W. Scott, B. Bäumle, P. Kohler, W. Guggenbühl. Architecture and Realization of a Multi Signalprocessor System. In International Conference on Application Specific Array Processors. IEEE Computer Society Press, 1992.
[4] U. A. Müller, B. Bäumle, P. Kohler, A. Gunzinger, W. Guggenbühl. Achieving Supercomputer Performance for Neural Net Simulation with an Array of Digital Signal Processors. IEEE Micro, October 1992.
[5] Ch. W. Kessler (ed). Automatic Parallelization, New Approaches to Code Generation, Data Distribution, and Performance Prediction. Vieweg, Wiesbaden, Germany, 1994.
[6] G. Haring (ed), G. Kotsis (ed). Performance Measurement and Visualization of Parallel Systems. North-Holland, Amsterdam, London, New York, Tokyo, 1993.

Figure 1: Architecture of a Digital Mixing Console. (Diagram labels: Mixing Desk, Parameter, Audio raw-data input, Mixing Engine, Processed audio data output.)

Figure 2: Topology of the scalable mixing-engine. (Diagram labels: Ringbus, Controller 1 to Controller n, PE 1 to PE n.)

Figure 3: Communication and processing on all PEs run in parallel. After the end of a communication cycle, the processing can start immediately. The synchronization of communication (S_C) and processing (S_P) is done separately and can be pipelined. (Diagram labels: Clock Period, Communication, Processing, End of communication, End of processing.)

Figure 4: Input Output features of the mixing-engine. (Diagram labels: Controllers 1 to n, AES/EBU, MADI, PE 1, PE n.)

Figure 5: A processing board with 3 PEs.

Figure 6: Reconfiguration of the mixing-engine is done in 3 steps. (Diagram labels: Signal Flow Graph, Routing, Runtime.)

Figure 7: Signal-flow-graph Editor.