
H.264 Streaming Framework for Virtual Colonoscopy
CSE 523/524: Advanced Project in Computer Science
Project Report

Apurva Kumar
apkumar@cs.stonybrook.edu
August 17, 2016

CONTENTS

Acknowledgements
Overview
Virtual Colonoscopy
Streaming
    JPEG Streaming
    H.264 Streaming
Adding Additional Information
    Supplemental Enhancement Information
        Format and Specification
        Limitations
    Multiple Streams
Omegalib
FFENC
    Adding SEI
    Code
Conclusion & Future Work
References

ACKNOWLEDGEMENTS

I would specifically like to thank Sayedkoosha Mirhosseini for mentoring me through the project and guiding me whenever I hit dead ends. A special thanks to Ping Hu, Alessandro Febretti, and the Stack Overflow community for helping out at certain points along the project. I would also like to thank my advisor, Prof. Arie E. Kaufman, for giving me the opportunity to pursue my interests and for supporting my work on this project.

OVERVIEW

The goal of this project was to implement an architecture for virtual reality (VR) streaming to a mobile device such as an iPad; more specifically, a framework to enable Virtual Colonoscopy (VC) on a tablet. The current setup runs on a desktop system. To make it more accessible and easier to use for doctors at hospitals, the goal is to make it run on simple-to-use handheld devices. However, due to limited memory and processing power, the application cannot reside entirely on the mobile device. Hence, we set up a server-client architecture: the processing is done on the server and the final view is sent to the client for display. Previously, the codebase only supported JPEG streaming. Through this project we implement faster H.264 streaming, and we design the framework with future enhancements in mind, such as low-latency streaming and handling real-time interactions on the client.

VIRTUAL COLONOSCOPY

In recent years, virtual colonoscopy, an alternative to traditional colonoscopy, has emerged as an option for most patients. Virtual colonoscopy is a safe, highly accurate, minimally invasive CT imaging examination of the entire colon and rectum. It is a well-tolerated exam that takes about 10 minutes to complete. Its goal is the same as that of traditional colonoscopy: to identify polyps and cancers in the colon. Polyps have been shown to be the precursor of most colon cancers, and the goal of virtual colonoscopy is to find these potentially dangerous polyps before they become actual cancers.

At the Visualization Lab at Stony Brook University, we employ advanced visualization techniques to achieve virtual imaging and exploration of the human colon. A helical CT scanner is used to obtain a sequence of 2D slices of the human abdomen. These CT slices are then reconstructed into a 3D volume and, subsequently, the human colon is visualized with various visualization techniques implemented in VolVis, a comprehensive volume visualization system intended for scientists and engineers as well as visualization developers. This noninvasive procedure is employed as an alternative to existing procedures for imaging the mucosal surface of the colon. Our current implementation allows the user to perform both planned and guided navigation inside the colon. This is a joint project between the Departments of Computer Science and Radiology: the research activities have been carried out in the Visualization Lab of the Computer Science Department and the Lab for Imaging Research and Informatics (IRIS) of the Radiology Department.

STREAMING

Streaming media is video or audio content sent in compressed form over the Internet and played immediately, rather than being saved to disk first. With streaming media, a user does not have to wait for a file to download before playing it; because the media is sent in a continuous stream of data, it can play as it arrives. Users can pause, rewind, or fast-forward, just as they could with a downloaded file, unless the content is being streamed live. There are two main protocols used for carrying video and audio data over IP networks: HTTP and RTSP. Using these protocols, it is possible to transmit video and audio in various compression formats (JPEG, MPEG-4, H.264, AAC, etc.).

JPEG Streaming

HTTP has long been established as a method of transmitting JPEG video streams.

Motion JPEG (M-JPEG) is a video compression format in which each video frame or interlaced field of a digital video sequence is compressed separately as a JPEG image and then streamed. It has its own set of advantages and disadvantages. It is simple to implement because it uses a mature compression standard (JPEG) with well-developed libraries, and it is an intra-frame method of compression; it enjoys broad client support, and minimal hardware is required because it is not computationally intensive. Its disadvantages include the lack of support for sound, and the lack of inter-frame prediction limits its compression efficiency to about 1:20 or lower, causing it to consume far more bandwidth and storage.

H.264 Streaming

H.264 is a block-oriented, motion-compensation-based video compression standard that is currently one of the most commonly used formats for the recording, compression, and distribution of video content. Unlike M-JPEG, H.264 compresses across frames: only some frames are compressed by themselves, while most frames record only the changes from the previous frame. This is far more efficient and can save a significant amount of bandwidth. As a video codec, H.264 can be incorporated into multiple container formats. It is frequently produced in the MPEG-4 container format, which uses the .mp4 extension, as well as QuickTime (.MOV), Flash (.F4V), 3GP for mobile phones (.3GP), and the MPEG transport stream (.ts). Most of the time, though not always, H.264 video is encoded together with audio compressed with the AAC (Advanced Audio Coding) codec, an ISO/IEC standard (MPEG-4 Part 3).

ADDING ADDITIONAL INFORMATION

There is a lot of useful information that can be streamed to the client along with the frames to be rendered. This could be metadata about the scene, or depth buffers for predicting frames and faster local rendering on the client. In our framework, we use H.264 encoding of the frames streamed to the client. This gives us two options for packaging additional information along with the stream: adding supplemental enhancement information, or adding an extra stream.

Supplemental Enhancement Information

The H.264 compression format supports the addition of user-specified metadata, called supplemental enhancement information (SEI), along with every frame that is encoded. We make use of this provision to pack useful metadata into the stream sent to the client.

Format and Specification

The Network Abstraction Layer (NAL) and the Video Coding Layer (VCL) are the two main concepts in H.264. An H.264 bitstream consists of a number of NAL units (NALUs), and each NALU can be classified as VCL or non-VCL. Video data is processed by the codec and packed into NAL units. In the byte-stream format, a three-byte or four-byte start code, 0x000001 or 0x00000001, is added at the beginning of each NAL unit; these start codes help the decoder find the NALU boundaries easily. Within a NALU, the first byte is a header indicating the type of data it contains, along with other information; the remaining bytes are the payload. A NALU of type 6 indicates that the following bytes represent an SEI payload. The next byte indicates the type of SEI payload: an SEI of type 5 represents unregistered user data, which is what we use. The next byte indicates the size of the SEI data, followed by the data itself. The payload is then terminated with the trailing-bits code 0x80.

The following is an example of a NALU carrying a user SEI message:

\x00\x00\x01\x06\x05\x05\x68\x65\x6c\x6c\x6f\x80

It can be broken down as:

\x00\x00\x01 - NAL unit start code
\x06 - NAL unit of type 6, i.e. SEI
\x05 - SEI of type 5, i.e. user data unregistered
\x05 - size of the SEI payload (5 bytes)
\x68\x65\x6c\x6c\x6f - SEI data: "hello" (in hex)
\x80 - end of the SEI NAL unit

Limitations

Because the SEI payload size is encoded here in a single byte, the amount of information that can be packaged per message is limited to 255 bytes. If only a small amount of data is to be supplied, this is the best way to package it, with minimal overhead for encoding and streaming.
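To make the byte layout concrete, the following C sketch assembles exactly the example NALU above. It mirrors the simplified layout used in this report: a single-byte size field, no emulation-prevention escaping, and no 16-byte UUID prefix (which the standard prescribes for user-data-unregistered SEI). The function name build_user_sei is illustrative and not part of FFENC.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Build a byte-stream SEI NAL unit carrying a "user data
       unregistered" payload, following the simplified layout
       described above. The caller frees the returned buffer. */
    static uint8_t *build_user_sei(const uint8_t *data, uint8_t size,
                                   size_t *out_len)
    {
        /* start code + NAL header + SEI type + size + data + trailing bits */
        size_t len = 3 + 1 + 1 + 1 + (size_t)size + 1;
        uint8_t *nalu = malloc(len);
        if (!nalu) return NULL;

        uint8_t *p = nalu;
        *p++ = 0x00; *p++ = 0x00; *p++ = 0x01; /* NAL unit start code      */
        *p++ = 0x06;                           /* NAL unit type 6: SEI     */
        *p++ = 0x05;                           /* SEI type 5: user data    */
        *p++ = size;                           /* payload size in bytes    */
        memcpy(p, data, size);                 /* the payload itself       */
        p += size;
        *p = 0x80;                             /* rbsp trailing bits       */

        *out_len = len;
        return nalu;
    }

    int main(void)
    {
        size_t n;
        uint8_t *sei = build_user_sei((const uint8_t *)"hello", 5, &n);
        if (!sei) return 1;
        for (size_t i = 0; i < n; i++)
            printf("%02x ", sei[i]);
        putchar('\n');
        free(sei);
        return 0;
    }

Running it prints 00 00 01 06 05 05 68 65 6c 6c 6f 80, byte for byte the NALU broken down above.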

Multiple Streams

Another way to package additional data is to mux multiple streams together into a container (such as MP4) and then stream it to the client. If we want to package more than 255 bytes of data, adding an additional stream containing the desired information on a frame-by-frame basis solves the problem. However, increasing the number of streams increases the required bandwidth. There is also an overhead in encoding every extra stream of data and then muxing all the streams into a container, and an additional overhead of demuxing the streams on the client before each stream can be decoded.

OMEGALIB

Today's visualization, visual analytics, and virtual-reality technologies can significantly facilitate and enhance human insight and knowledge. Multidisciplinary teams rely on a variety of domain-specific and/or special-purpose software libraries that do not interoperate, do not take advantage of the changing landscape of computing platforms, and do not take advantage of new consumer-priced and advanced 2D and 3D display systems. Omegalib is an integrated hybrid framework for scientific visualization that addresses these challenges. Omegalib lets researchers tightly couple multiple libraries to create combined or linked visualizations; utilize a variety of display devices, from smartphones and 3D head-mounted displays to conference-room monitors and room-sized immersive environments; and use cloud computing to render complex graphics and stream them to personal devices through a web browser. Omegalib tightly couples 2D/3D visualizations and virtual environments with computing and display platforms to create an ecosystem that allows scientists to focus more of their time on analysis and discovery. Omegalib is a joint venture of computer science developers at the University of Illinois at Chicago and Stony Brook University, partnering with domain scientists in astrophysics, engineering, geoscience, and molecular modeling to expand Omegalib's features.

Our VC system uses Omegalib as a core component to visualize and render the colon. In this project, we use the Porthole module of Omegalib to stream the visualized data to the browser. Omegalib currently supports only JPEG streaming, and the main goal of this project is to develop a module (FFENC) that lies between the Omegalib core and the Porthole module and is responsible for H.264 encoding of frames on the fly, with support for further future enhancements.

FFENC

The major work of this project focuses on developing an H.264 encoding module for Omegalib with support for adding SEI data and multiple streams. It is built on FFmpeg, and hence the module has been named FFENC (FFmpeg ENCoder). FFmpeg is a free software project that produces libraries and programs for handling multimedia data. FFmpeg includes libavcodec, an audio/video codec library used by several other projects; libavformat, an audio/video container muxing and demuxing library; and the ffmpeg command-line program for transcoding multimedia files.

FFENC is a standalone module in itself, but in our project it is closely coupled with the Porthole module in Omegalib. Porthole is a framework that helps developers of virtual environment applications generate decoupled HTML5 interfaces. The rendered volume is passed to Porthole on a per-frame basis; Porthole uses FFENC to encode every frame as desired and then streams it to the browser.

The workflow of FFENC is as follows: first, the FFmpeg H.264 encoder is initialized with fine-tuned parameters. FFENC then exposes an interface for encoding each frame passed to it. Frames are stored in a specific pixel format within Omegalib; they are first converted into an FFmpeg-accessible RGB24 format and then passed to the encoder. If specified by the application, there is also a provision to add an SEI message, or to add another stream for metadata, prior to encoding. (A sketch of this workflow appears at the end of this section.)

Adding SEI

Even though the H.264 standard defines the SEI NAL syntax structure, there is currently no API in the libavcodec/libavformat libraries for assigning this data. However, a provision for adding SEI data per frame is an integral aspect of the framework. After countless hours spent analyzing the H.264 standard and its implementation in libavcodec/libavformat, I devised a small hack to pack in the SEI data. I was able to pinpoint the location in memory where the SEI NALU is supposed to reside; I explicitly create an SEI NAL unit and copy it to the desired offset at that memory location prior to encoding. Once encoded, by analyzing every frame of the video, I was able to recover the transmitted SEI message and thus verify the correctness of the hack. The provision for adding SEI metadata is available only for the FFmpeg CPU-based H.264 encoder, which uses the x264 encoding library; the hardware-accelerated FFmpeg H.264 encoders currently have no support for encoding SEI data.

Code

The FFENC module is free and open source; its source code is currently available on my GitHub: https://github.com/cruxeon/ffenc
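To make the workflow above concrete, here is a minimal C sketch of the encoder setup and the per-frame path using the libavcodec/libswscale APIs. It is illustrative rather than the actual FFENC code: the function names open_h264_encoder and encode_rgb24_frame are hypothetical, the parameters shown (ultrafast preset, zerolatency tune, no B-frames) are plausible low-latency choices rather than FFENC's exact settings, and error handling is abbreviated.

    #include <libavcodec/avcodec.h>
    #include <libswscale/swscale.h>
    #include <libavutil/opt.h>

    /* Open the libx264 software encoder with low-latency settings. */
    static AVCodecContext *open_h264_encoder(int w, int h, int fps)
    {
        const AVCodec *codec = avcodec_find_encoder(AV_CODEC_ID_H264);
        AVCodecContext *ctx = avcodec_alloc_context3(codec);
        ctx->width = w;
        ctx->height = h;
        ctx->time_base = (AVRational){1, fps};
        ctx->pix_fmt = AV_PIX_FMT_YUV420P;   /* the format x264 consumes  */
        ctx->gop_size = 30;                  /* one I-frame per 30 frames */
        ctx->max_b_frames = 0;               /* no B-frames: less latency */
        av_opt_set(ctx->priv_data, "preset", "ultrafast", 0);
        av_opt_set(ctx->priv_data, "tune", "zerolatency", 0);
        return avcodec_open2(ctx, codec, NULL) < 0 ? NULL : ctx;
    }

    /* Convert one RGB24 frame (as handed over per frame by Porthole)
       to YUV420P and encode it. Returns the encoded packet, or NULL
       if the encoder buffered the frame. Create the scaler once with:
       sws_getContext(w, h, AV_PIX_FMT_RGB24, w, h, AV_PIX_FMT_YUV420P,
                      SWS_BILINEAR, NULL, NULL, NULL)                    */
    static AVPacket *encode_rgb24_frame(AVCodecContext *ctx,
                                        struct SwsContext *sws,
                                        const uint8_t *rgb, int64_t pts)
    {
        AVFrame *frame = av_frame_alloc();
        frame->format = ctx->pix_fmt;
        frame->width  = ctx->width;
        frame->height = ctx->height;
        av_frame_get_buffer(frame, 0);

        const uint8_t *src[1] = { rgb };
        int src_stride[1] = { 3 * ctx->width };  /* RGB24: 3 bytes/pixel */
        sws_scale(sws, src, src_stride, 0, ctx->height,
                  frame->data, frame->linesize); /* RGB24 -> YUV420P     */
        frame->pts = pts;

        AVPacket *pkt = av_packet_alloc();
        avcodec_send_frame(ctx, frame);
        int ret = avcodec_receive_packet(ctx, pkt);
        av_frame_free(&frame);
        if (ret < 0) { av_packet_free(&pkt); return NULL; }
        return pkt;
    }

Since libavcodec exposes no API for user SEI, a portable approximation of the memory-offset hack described above would be to prepend the SEI NAL unit from the earlier sketch to pkt->data before streaming; in the byte-stream format, an SEI NALU placed ahead of the coded slices of the access unit it describes should be accepted by conforming decoders.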

CONCLUSION & FUTURE WORK

With the FFENC module integrated into Omegalib, we now have a framework for efficiently streaming VR content using Omegalib; more specifically, we can now use this module to stream the VC application to mobile devices. The next step involves creating a mobile VC web app for the tablet as an interface for the streamed data. We can devise and implement an intuitive interaction scheme for working with a volume (the colon) on the client. Some inputs, such as measurements, comments, and bookmarking, can be handled directly on the client; navigational interactions will be sent to the server, which processes them and sends back the subsequent frames.

Another possible future extension is to enable low-latency streaming. Our framework already allows packaging of extra information in the form of SEI messages and additional streams. Extra information, such as depth buffers, can be sent to the client for fast local rendering of predicted future frames. This would effectively hide the input latency on the client side, with on-the-fly scene correction for the predicted frames.

REFERENCES

1. https://www.stonybrookmedicine.edu/patientcare/virtual-colonoscopy
2. https://labs.cs.sunysb.edu/labs/vislab/3d-virtual-colonoscopy-home/
3. http://bensoftware.com/blog/comparison-of-streaming-formats/
4. https://blog.angelcam.com/what-is-the-difference-between-mjpeg-and-h-264/
5. https://en.wikipedia.org/wiki/Motion_JPEG
6. https://en.wikipedia.org/wiki/H.264/MPEG-4_AVC
7. https://www.itu.int/rec/T-REC-H.264-201602-I/en
8. http://uic-evl.github.io/omegalib/
9. https://github.com/uic-evl/omegalib
10. https://github.com/omega-hub
11. http://ip.hhi.de/imagecom_G1/assets/pdfs/h264_ISO-IEC_14496-10.pdf
12. http://yumichan.net/video-processing/video-compression/introduction-to-h264-nal-unit/
13. https://ffmpeg.org/
14. https://en.wikipedia.org/wiki/FFmpeg