1st HellasGrid User Forum 10-11/1/2008 GridNEWS: A distributed Grid platform for efficient storage, annotating, indexing and searching of large audiovisual news content Ioannis Konstantinou School of ECE Computing Systems Laboratory National Technical University of Athens
Concept Text based search in audiovisual content Search results: Portions of video files containing selected keywords Example User searches for keyword Acropolis Video portions containing the spoken word Acropolis are located and presented in the user YouTube like functionality
Objectives Keyword extraction from video files using automatic speech recognition algorithms (ASR) Efficient and scalable distributed storage of large media content Indexing of extracted metadata for efficient keyword search YouTube like user interface for video searching/downloading Contribution to existing Grid Middleware using GGF standardized components
Addressed Issues Execution Distributed execution of CPU/Data intensive Speech Recognition Algorithms Storage Server load balancing using performance metrics Client transfer time optimization using bittorrent like algorithms Increase data availability Multi-organizational data storage support using Virtual Organizations (VOs)
Video file keyword extraction procedure MetaScheduler Application Software Storage Cluster Cluster Schedule Execution CVSP tool Data Client GET/PUT File Ops SE Export Cluster Import Metadata Parser SE SE Metadata File Update Metadata DB
Distributed Execution Platform Architecture MDS : Managing and Discovery Service PBS : Portable Batch Scheduler WSGRAM : Grid Resource and Allocation Management WN: Worker Node Central MDS Server User Interface GridWay Metascheduler Globus Toolkit 4 MDS Publish aggregate Cluster information SOAP/XML messages Computing Element PBS Scheduler Globus Toolkit 4 MDS + WSGRAM PBS client service Ganglia Tool Local MDS Server Local MDS Server Local MDS Server Ganglia Stats WN WN WN WN WN WN WN WN WN
Distributed Storage Platform Architecture MDS: Managing and Discovery Service GridFTP: Grid File Transfer Protocol DRLS: Distributed Replica Location Service SE: Storage Element PFN: Physical File Name LFN: Logical File Name DRLS Get/Put LFN to PFN mappings SOAP/XML Messages Pastry Peer to Peer Storage Subsystem Client Machine Query Storage Subsystem to obtain candidate storage servers using SOAP/XML Messages GridFTP Get/put SE SE Replication Algorithm using NWS stats SE GridNews DataClient component GridFTP Client Parallel Downloader SE GridFTP Server Network Weather Service (NWS) DRLSCom Service Globus WS Apache Axis Container
Distributed Replica Location Service Contains mappings of LFNs to PFNs DHT used: Pastry P2P Logarithmic routing In a network with n nodes, a query needs only log(n) messages (hops) Plaxton s algorithm minimizes query latencies Redundancy through replication Eliminates single point of failure situations Inherent load balancing capabilities Consistent hashing algorithms
Load Balancing Servers exchange load metrics CPU Bandwidth Free Disk Space Prediction algorithms (e.g. Linear regression) forecast future metrics from history data Weighted Normalized Metric WNM : Wm X(M t / M max ) Servers maintain sorted WNM list : [WNM 1,..WNM n ] WNM list periodically refreshed
Replication Upon a STORE client request: Top k servers are selected from WNM list k: configurable static replication factor Most suitable server is returned to the client Client initiates a single GridFTP file upload Server replicates the new file according to WNM list and factor k DRLS is informed about the new LFN->PFN mappings Client is informed Upon completion
Parallel Downloader Upon a GET client request: Server contacts DRLS and retrieves replica locations Client establishes N GridFTP connections Client initiates N parallel (threaded) small data chunk requests After each successful retrieval, client reinitiates another request Optimum file transfer time: The greater file portion is retrieved from the faster storage nodes
GridTorrent Metadata fields Current Id File size Piece size Hashes Distributed RLS instead of Tracker Partial GridFTP for actual transfer BitTorrent replica selection and tit-for-tat algorithm. Compatible with plain GridFTP servers PFN s prefix determines protocol (gtp://site.fully.qualified.d omain.name/path/to/file) size,piece_size, hashes, currentid Peer that wants to download a file New connection Bittorrent piece Peer1 GridTorrent client Bittorrent Bittorrent New connection piece publish pfns Striped GridFTP Peer2 GridTorrent client New connection Striped GridFTP piece piece piece Peer3 GridFTP Server Striped GridFTP piece Distributed RLS Peer set CurrentID :pfn1 pfn2 pfn3
EXISTING MIDDLEWARE CONTRIBUTION Design and development of a dynamic replica selection/placement algorithm Added support for multiple clusters using Gridway Metascheduler Enhanced bittorrent like file downloading from multiple sources Replace centralized replica location service with a scalable distributed peer to peer solution
Development Testbed Hardware 4X Dual Core AMD Opteron(tm) Processor 875 2.2GHz 8 virtual CPU 16Gb Ram Deployment of 5 Xen virtual machines, 2GB ram Software Globus Toolkit v4.0 Globus WS Core (Apache AXIS WS Container) Rice Pastry P2P (Java) Network Weather Service Torque (OpenPBS) scheduler GridWay Metascheduler
Virtualization Xen Hypervisor Paravirtualization tequnique Guest OS use special xen aware kernel Direct utilization of special CPU instructions Faster than full virtualization (VmWare) Use of Xen Hypervisor Easy prototype management/administration Simple control of the node lifecycle Facilitate prototype deployment in many actual nodes
Currently working Replace ParallelDownloader with GridTorrent Deploy prototype in the PlanetLab testbed Run experiments Fine-tune designed algorithms
Gridnews Portal Users can: Perform keyword search in the autoannotated multimedia content View the video from their browser in a youtube style Download only a fragment of the video where this keyword exists
Screenshots keyword Video URLs time position
Questions