Bringing Multimedia Contents into MP3 Files

ENTERTAINMENT EVERYWHERE

Lavinia Egidi and Marco Furini, Università del Piemonte Orientale

ABSTRACT

The digital music revolution has improved the availability of music and made it easier to enjoy. The shift from hard support, such as audio CDs, to software leaves some unsatisfied by the loss of additional media contents, such as images, lyrics, and CD cover. We propose to fill the gap by enriching e-music with multimedia contents. Our proposal focuses specifically on MP3 as a widely distributed open format. The challenge we meet in enriching MP3 lies in maintaining audio compatibility and avoiding file size explosion. To this end, we define a description language to express multimedia contents in a textual way, with higher representation efficiency than existing languages. We also address security issues that arise from the use of an open format. Effectiveness is shown with an MP3 player capable of rendering multimedia contents.

INTRODUCTION

The digital music revolution is changing the music industry with a greater impact than the one that took place when compact discs displaced vinyl records in the 1980s. Driven by the combination of the Internet and improvements in technologies for compressing digital audio, the digital revolution is affecting the whole music industry [1].
Consumers have more convenient access to a wider variety of music and can directly buy, download, and store hours of digital music in a simple audio device without having to carry several audio CDs; musicians have increased access to fans and more options for distributing their music; record labels have a better way to promote and distribute music to a wider audience; traditional music stores can provide customers with facilities that cater to the many consumers who have no net access or do not have the right equipment at home; the electronics industry has a new market for devices designed to listen to e-music away from the computer (e.g., iPod, Diamond Rio) and for classic devices (e.g., palmtops, Walkmen, cellular phones, DVDs) redesigned with e-music compatibility.

More and more people are attracted by e-music and participate in some form of music downloading by using P2P systems (e.g., Napster, Gnutella, Kazaa) or online music stores that have recently appeared on the Web (e.g., the Apple iTunes Music Store, Tiscali Music Store, MSN, Napster 2.0). This success is highlighted by recent research [2] that also forecasts greater usage of e-music in the near future, when, thanks to the introduction of broadband/wireless technologies (WiFi, UMTS, fiber optics, DSL) and the continuous reduction in storage costs, users will be able to access digital music stores anytime and anywhere. If this trend continues, e-music is forecast to replace audio CDs within the next few years [2].

However, some in the music industry claim that e-music can never replace audio CDs, as simple audio tracks cannot substitute for media-rich audio CDs that provide users with a complete multimedia product (audio, CD cover, images, and lyrics). In this article we present a simple and novel approach that fills the gap between simple audio tracks and media-rich products. In particular, our approach transforms a simple MP3 file into a media-rich product by introducing multimedia contents (MMCs) into it.
The MP3 format has been selected because it is an open standard with a tremendous amount of grassroots support, which helped it become the de facto standard for e-music. It is also the most used format for high-quality digital music in P2P systems, and millions of people are already using it. The main concern about the MP3 format is that it lacks security measures, so it is quite easy for people to illegally reproduce and distribute copyrighted music. This is why online music distributors are hesitant to release music in MP3 format and instead try to impose proprietary formats (e.g., Windows Media Audio) and/or digital rights management (DRM) systems on the market. (The Apple iTunes Music Store, the most famous online music distributor, uses the AAC format, which is open but wrapped within the Apple DRM system, which allows users to play music only on a limited number of authorized computers.) For this reason, our proposal contains both a mechanism to introduce MMCs into an MP3 file and a security mechanism that allows rendering of the embedded MMCs only to those users who legally obtain the MP3 file.

0163-6804/05/$20.00 © 2005 IEEE. IEEE Communications Magazine, May 2005

The advantage of our proposal is that an MP3 file can be considered a media-rich product just like an audio CD, and may contain more multimedia contents than audio CDs. In fact, it is possible to introduce contents that cannot be provided by audio CDs: for instance, karaoke and updated information about the song or singer (e.g., live tour dates). Note that our approach does not affect the MP3 file structure or audio quality (an enriched MP3 file has the same file structure as a simple MP3 file) and only slightly increases the MP3 file size.

Briefly, to minimize the increase in MP3 file size, MMCs are not directly stored in the MP3 file; only a textual description of them is stored. Hence, MMCs are first described with a markup description language (e.g., SMIL, MPEG7-DDL), and the resulting textual description is stored in the MP3 file using one of the ID3 tags [3] (textual tags that can be attached to MP3 files without any problems), in a Trojan-horse fashion. In this way, the MP3 file structure is not modified, the file size is only slightly affected, and the MP3 file, although provided with MMCs, can still be played by any MP3 player. This contrasts with future ID3 versions (e.g., ID3v4 [3]) that will introduce some MMCs by modifying the current MP3 file structure, resulting in the coexistence of two different MP3 file structures.

In addition, to further limit the increase in MP3 file size, we propose a simple and specific markup description language, MP3 Enhanced Contents Description Language (MECDL), which produces shorter MMC descriptions than those obtainable with SMIL or MPEG7-DDL. Our proposal is completed by security techniques (based on cryptography and watermarking) whose aim is to protect e-music vendors from unauthorized playout and distribution of MMCs, and to guarantee the integrity of the MMCs provided, to the advantage of both users and vendors.
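As a rough illustration of storing a textual description in an ID3 tag without touching the audio data, the following Python sketch builds a minimal ID3v2.3 tag containing a single comment (COMM) frame and prepends it to the audio bytes. It is only a sketch under stated assumptions: the function names (`syncsafe`, `comm_frame`, `enrich`) are hypothetical, the input is assumed to carry no ID3v2 tag yet, and the choice of ID3v2.3 with ISO-8859-1 text is ours, not something prescribed by the article.

```python
import struct


def syncsafe(n: int) -> bytes:
    """Encode n as a 4-byte ID3v2 'syncsafe' integer (7 useful bits per byte)."""
    if not 0 <= n < 1 << 28:
        raise ValueError("size does not fit in 28 bits")
    return bytes((n >> shift) & 0x7F for shift in (21, 14, 7, 0))


def comm_frame(text: str, lang: str = "eng") -> bytes:
    """Build an ID3v2.3 COMM (comment) frame holding `text`."""
    # Frame body: encoding byte (0 = ISO-8859-1), 3-letter language code,
    # empty short description terminated by NUL, then the comment text.
    body = b"\x00" + lang.encode("ascii") + b"\x00" + text.encode("latin-1")
    # ID3v2.3 frame header: 4-character ID, 32-bit big-endian size, 2 flag bytes.
    return b"COMM" + struct.pack(">I", len(body)) + b"\x00\x00" + body


def enrich(audio: bytes, description: str) -> bytes:
    """Prepend an ID3v2.3 tag carrying `description`; audio bytes are untouched.

    Assumption: `audio` does not already start with an ID3v2 tag.
    """
    frame = comm_frame(description)
    # Tag header: "ID3", version 2.3.0, no flags, syncsafe size of what follows.
    header = b"ID3" + b"\x03\x00" + b"\x00" + syncsafe(len(frame))
    return header + frame + audio
```

Because the tag is simply placed in front of the unchanged audio frames, a player that ignores the comment field still plays the file as before.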
We show the effectiveness of our approach with an MP3 player able to read and understand the MMC description stored inside the ID3v2 tag. The player is written in Java to enhance software portability (it can be directly used on several devices that support Java applications, like the recently released cellular phones from several vendors including Nokia, Motorola, and Sony Ericsson) and can manage MMC descriptions written with MECDL, SMIL, and MPEG7-DDL. We believe that the introduction of new contents inside the MP3 file will increase the success of MP3 files. The entire music industry may benefit from our approach, and e-music can become comparable to audio CDs.

The remainder of the article is organized as follows. In the next section we briefly introduce the basic contents provided by current MP3 files and show how markup languages can be used to describe MMCs. Following that we present characteristics of our approach and the design of an MP3 player that exploits our approach. Finally, we present our conclusions.

Figure 1. MP3 evolution: a) the original format; b) ID3v1 introduces a way to store up to 128 characters after the audio data; c) ID3v2 allows storing an unlimited number of characters (organized in tags) before the audio data.

PRELIMINARIES

In this section we briefly introduce the MP3 file format, from the initial release with only audio information and nothing else, to the current MP3 file structure with audio and textual information introduced by the ID3 approach [3]. We also present two of the most important markup languages that can be used to describe MMCs in a textual way.

THE MP3 FILE FORMAT

The MPEG Layer III (MP3) format was released in 1992 and provides high-fidelity audio at low bit rates with little or no perceptual difference between the original and reconstructed signals.
By using psychoacoustic models, which remove the least perceptually relevant portions of the signal, MP3 achieves a high compression ratio (e.g., 11:1 with a bit rate of 128 kb/s), and music quality does not suffer because human ears cannot discern the difference for bit rates beyond 128 kb/s.

Current MP3 files provide users with high-quality audio and some textual information (e.g., song title, author, track, comment, and genre). This basic information is the result of the evolution of the MP3 format (Fig. 1). The original format (Fig. 1a) was designed to store only audio information and nothing else. To increase the attractiveness of this format, the MP3 file structure was first modified with the introduction of the so-called ID3 tag [3] (Fig. 1b). This method allows the insertion of a tag, composed of textual information, at the end of the audio data. The inserted tag has a size of 128 characters and is organized to provide users with song title, artist, album, year, comment, and genre. In this way, users have some textual information associated with the song.

Soon, a second version of ID3 was released under the name ID3v2 (Fig. 1c). This second version removes the length constraint and allows the insertion of some additional tags (e.g., composer, original artist, URL, copyright), each of arbitrary length. Furthermore, ID3v2 puts the textual information before the audio data, allowing MP3 players to have immediate access to the information, even with audio streaming. Despite the initial problems caused by the introduction of ID3v2 (some MP3 players crashed), nowadays the ID3 approach is the most accepted way to insert textual information inside MP3 files, and almost all MP3 players are compatible with ID3v2. ID3 is still in expansion, and future versions

will introduce additional contents such as images [3]. This will cause the MP3 file structure to change. Furthermore, by storing images in MP3 files, the MP3 file size will increase, causing the file download time and file storage disc space to increase too. For this reason, in this article we propose an alternative approach to introduce additional contents inside an MP3 file.

MARKUP LANGUAGES

A markup language allows the spatial layout of different media elements (video, audio, graphics, text) to be described, as well as the temporal order in which these elements will play during presentation. The resulting description is text-based and may also contain some references (e.g., links) to the media elements composing the presentation. In the following, we present the characteristics of two of the most accepted markup languages: SMIL and MPEG7-DDL.

SMIL

Synchronized Multimedia Integration Language (SMIL) [4] defines an XML-based language that can be used to describe the temporal behavior of a multimedia presentation in terms of synchronization of the different media. Hyperlinks with media objects can also be used. These media objects can be of different media types: audio, video, still pictures, still text, text stream, and animations. Furthermore, SMIL performs synchronization of media (e.g., a video track and an audio track) by referencing external players.

An SMIL multimedia description results in a textual file that can be interpreted by media players. The description is composed of tags and attributes (Table 1): tags <layout> and <region> are used to define a window where the MMCs will appear; <img> is used to specify an image, and <text> is used to specify textual information; the attributes begin and dur define temporal properties of the associated tag. By using these tags it is possible to describe several MMCs, as presented in Table 1. When an SMIL media player executes an SMIL file, it shows an image (cover.gif) and plays out the audio file (song.wma), displaying lyrics according to the specified timing descriptions.

  <smil>
    <head>
      <layout width=400 height=200>
        <root-layout width="300" height="200" background-color="white" />
        <region id="region_1" left="75" top="50" width="32" height="32" />
        <region id="region1_1" left="150" top="50" width="100" height="30" />
      </layout>
    </head>
    <body>
      <par>
        <img src="cover.gif" region="region_1" />
        <audio src="song.wma" />
        <seq>
          <text begin="7s" dur="3s" region="region1_1">You and me />
          <text begin="10s" dur="3s" region="region1_1">We used to be together />
          ...
        </seq>
      </par>
    </body>
  </smil>

Table 1. An example of an SMIL file.

MPEG-7 DDL

MPEG7 Description Definition Language (MPEG7-DDL) [5] has been designed to provide detailed formatting information and fine-grained descriptions of the structural and low-level audio, visual, and audiovisual features of MMCs. MPEG7-DDL provides a rich set of standardized tools to enable audiovisual descriptions. An MPEG7-DDL multimedia description results in a textual file composed of tags and attributes (Table 2). <Image>, <Audio>, and <TextAnnotation> are used to link external resources (images, audio, and text, respectively). Each resource has to be located through the <MediaLocator> tag, and timing properties may be specified through the <MediaTime> tag. In Table 2 we show an example of an MPEG7-DDL description for images and temporal synchronization between audio and text. Note that the audio-text timing synchronization requires several tags, since for each word of the song (or group of words) an audio segment has to be specified. Each audio segment contains the word(s) and the time relation of those word(s) (beginning and duration time) with the audio.
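To make concrete how a player might read the timed lyrics out of a description like the one in Table 1, here is a minimal Python sketch using the standard-library XML parser. It assumes a well-formed variant of the Table 1 fragment (quoted attribute values and explicit `</text>` closing tags are our assumption), and the helper names `seconds` and `timed_text` are hypothetical. Only the plain "<number>s" clock form used in Table 1 is handled.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the Table 1 body (an assumption made for this sketch).
SMIL = """
<smil>
  <body>
    <par>
      <img src="cover.gif" region="region_1"/>
      <audio src="song.wma"/>
      <seq>
        <text begin="7s" dur="3s" region="region1_1">You and me</text>
        <text begin="10s" dur="3s" region="region1_1">We used to be together</text>
      </seq>
    </par>
  </body>
</smil>
"""


def seconds(clock: str) -> float:
    """Convert a value such as '7s' to seconds; other SMIL clock forms are not handled."""
    if not clock.endswith("s"):
        raise ValueError(f"unsupported clock value: {clock!r}")
    return float(clock[:-1])


def timed_text(smil_source: str):
    """Return (begin_s, dur_s, words) for every <text> element, in document order."""
    root = ET.fromstring(smil_source)
    return [
        (seconds(t.get("begin")), seconds(t.get("dur")), (t.text or "").strip())
        for t in root.iter("text")
    ]


if __name__ == "__main__":
    for begin, dur, words in timed_text(SMIL):
        print(f"{begin:5.1f}s for {dur:.1f}s: {words}")
```

The same traversal idea would apply to the MPEG7-DDL description of Table 2, at the cost of walking a deeper element hierarchy for each lyric segment.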
THE ENRICHED MP3 APPROACH

In this section we present our approach, aimed at introducing into the MP3 file MMCs protected by a security mechanism to ensure that only legal owners can enjoy MMCs. Without loss of generality, in this article we discuss specific MMCs (CD cover, image, lyrics, karaoke, and updated information) and consider MP3 files with ID3 tags (if not present, ID3 tags can be introduced easily without affecting playout compatibility). Our goal is to add the MMCs while meeting two constraints:
1) Minimize the increase in MP3 file size.
2) Avoid any possible modification to the MP3 file structure.
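Whether a given file already carries ID3 tags can be checked from a few bytes at its two ends. The Python sketch below is a minimal illustration, not part of the authors' tooling: it looks for the "ID3" marker of an ID3v2 header at the start of the file and for the "TAG" marker of the 128-byte ID3v1 block at the end. The function name `id3_info` and the returned dictionary keys are hypothetical.

```python
def id3_info(data: bytes) -> dict:
    """Report which ID3 tags are present in the raw bytes of an MP3 file."""
    info = {"id3v2": False, "id3v2_size": 0, "id3v1": False}

    # ID3v2: 10-byte header = "ID3", 2 version bytes, 1 flag byte,
    # and a 4-byte syncsafe size (7 useful bits per byte).
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        info["id3v2"] = True
        info["id3v2_size"] = 10 + size  # header plus tag body

    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        info["id3v1"] = True

    return info
```

For a file with an ID3v2 tag, `id3v2_size` gives the offset at which the audio frames begin, which is all a tool needs in order to leave the audio data untouched.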

Constraint 1 imposes that large files like images not be directly stored. Hence, MMCs are first described through a markup language, and the resulting textual description is stored in the MP3 file. Thus, external media can be linked to the MP3 file but still live independently, for greater flexibility. Moreover, since a textual description can be stored in the ID3v2 comment-field tag, constraint 2 is easily met. Hence, the presence of additional MMCs will not impair reproduction on currently available MP3 players. The additional contents may simply be displayed (not interpreted), like any other information located in the comment-field tag.

MMCs should be inserted into the MP3 file only on user request. For instance, while buying an e-song, the user should specify the type of contents he/she wants, and the enriched MP3 file is customized accordingly. This way, the user need not pay, in terms of downloading time and costs, for MMCs in which he/she is not interested.

Driven by the size constraint above, in the next subsection we introduce MECDL, a markup description language designed to produce shorter MMC descriptions than those produced by SMIL and MPEG7-DDL. We also show some data to compare the efficiency of the three languages. In the following subsections we show examples of MECDL usage, and then discuss the security mechanisms.

  <Mpeg7>
    <Image>
      <MediaLocator>
        <MediaUri>http://wsite/n_doubt.jpg</MediaUri>
      </MediaLocator>
      <RelatedMaterial>
        <MediaLocator>
          <MediaUri>http://wsite/n_doubt.pdf</MediaUri>
        </MediaLocator>
      </RelatedMaterial>
    </Image>
    <Audio>
      ...
      <RelatedMaterial>
        <MediaLocator>
          <MediaUri>http://wsite/n_doubt.htm</MediaUri>
        </MediaLocator>
      </RelatedMaterial>
      <MediaTime>
        <MediaTimePoint>T00:00:00</MediaTimePoint>
        <MediaDuration>PT4M23S</MediaDuration>
      </MediaTime>
      <TemporalDecomposition gap=false overlap=false>
        <AudioSegment>
          <TextAnnotation>
            <FreeTextAnnotation>You and me </FreeTextAnnotation>
          </TextAnnotation>
          <MediaTime>
            <MediaTimePoint>00:00:06;9</MediaTimePoint>
            <MediaDuration>00:00:03</MediaDuration>
          </MediaTime>
        </AudioSegment>
        ...
      </TemporalDecomposition>
    </Audio>
  </MultimediaContent>
  </Mpeg7>

Table 2. An example of an MPEG7-DDL description.

THE MEC DESCRIPTION LANGUAGE

The MEC Description Language (MECDL), like SMIL and MPEG7-DDL, allows multimedia contents to be described through textual information and external media to be linked with the MP3 file so that they can live independently of each other. MECDL is designed to give a very simple and compact textual representation of MMCs.

MECDL defines the structure of the inserted contents, and adds MMCs and timing capabilities to the MP3 file using tags with attributes and values. The tags are enclosed between angle brackets and are in the form <tag attribute=value>, with the exception of tags that do not have attributes. The meaning of tags, attributes, and values is as follows.

Tag: Defines the purpose of the description. The tag name comes after a left angle bracket. Most of the tags have attributes, although some tags may consist of just the name. If the tag consists of a pair, the second tag (or closing tag) consists of just the tag name preceded by a slash (e.g., </body>). A closing tag never has attributes. In Table 3 we show tags designed to describe the services proposed in this article.

Attribute: Each attribute defines one aspect of the tag; in Table 3 we show the possible attributes: href, to locate a resource on the Internet; and unit and time, for the karaoke service. Attributes are followed by an equal sign (=) and by a value.
Value: All values may be integers or names, depending on what type of value is appropriate for the attribute. Names must be enclosed in double quotation marks.

  Tag or attribute            Meaning
  <body> </body>              Defines the document body.
  <www>                       Defines a resource where updated information is located.
  <img>                       Defines a resource where an image is located.
  <cover>                     Defines a resource where the CD cover is located.
  <kar>                       Defines the audio-lyrics timing relation.
  [missing in source]         Defines a new line in the lyrics/karaoke description.
  <track>                     Used to identify a song in an MP3 file that contains multiple songs.
  <encrypted> </encrypted>    Defines the body of the encrypted data.
  secure                      Switches the player to security mode.
  href="URL"                  Gives a fully qualified HTTP URL.
  unit=value                  Gives the value of the time unit (in ms) used in the timing description.
  time=value                  Gives the timing property of a word of the song, in time units.
  song="title"                Gives the title of the song if multiple songs are encoded in the same MP3 file.

Table 3. MECDL tags and attributes.

In Table 4 we show the number of bytes necessary to describe the MMCs proposed in this article for some of the songs we tested, for all three languages: SMIL, MPEG7-DDL, and MECDL. Note that MECDL is preferable in our setting since it produces shorter descriptions than either SMIL or MPEG7-DDL. Hence, in the next subsection we show how to describe MMCs using MECDL. (The language can be extended to describe other types of contents.)

MULTIMEDIA CONTENTS DESCRIPTION

In the following, we show how images, CD cover, updated information, and a karaoke-like service can be described with MECDL. It is worth noting that the same MMCs may be described with SMIL and MPEG7-DDL, as the approach is similar, but for the sake of conciseness we limit our attention to MECDL. When external resources are linked, the MP3 player's capabilities naturally constrain the kind of resource that can be defined. For maximal flexibility, the language imposes no restriction. In the following we assume an MP3 player that can handle pdf, jpg, and gif images.

CD Cover

The CD cover is mainly composed of images and has a well-defined format (front cover, back cover, internal covers). To avoid any excessive file size increment, MECDL links the MP3 file with an external resource that represents the CD cover. The MECDL syntax is <cover href="url">, where url is the link to access the CD cover. For instance, <cover href="http://somewebsite/cover.pdf"> links the MP3 file with a resource (cover.pdf) located at somewebsite, while <cover href="file://somedirectory/cover.pdf"> links the MP3 file with a local resource.

Image

MECDL links the MP3 file with an external resource that represents an image. The MECDL syntax is <img href="url">, where url is the link to access a particular image. For instance, <img href="http://somewebsite/image.jpg"> links the MP3 file with a jpg file located at somewebsite. If the image is locally available, http should be replaced with file.

Updated Information

Although updated information (e.g., tour dates and singer information) may be directly stored in the MP3 file, MECDL links the MP3 file with an external resource that contains updated information. Storing is avoided because this information needs to be kept up to date; otherwise, it soon becomes obsolete. The MECDL syntax is <www href="url">, where url is the link to access the updated information.
For instance, <www href="http://somewebsite/somepage.html"> links the MP3 file with a Web page, somepage.html, located at somewebsite, where the information is expected to be kept up to date.

Karaoke

Karaoke is a multimedia entertainment service that is receiving much interest; it provides users with on-screen lyrics synchronized with audio playout. One of the main concerns in the design of a karaoke system is the adopted synchronization strategy. Some examples of karaoke systems can be found in [6-8]. To provide this service, it is necessary to define timing properties for the song lyrics. To this end, MECDL uses the <kar> tag and the time attribute. An example of an MECDL timing description is given in Table 5: text following the <kar> tag represents a word of the song (or a group of words) that has to be displayed according to the time attribute. For instance, <kar time=69>you and me means that after 69 time units, the words "You and me" are sung. A dedicated tag is used to start a new line. With this information, the MP3 player can know exactly which word(s) the singer is singing.

  Song (singer)                  MP3 audio track (bytes)  SMIL (bytes)  MPEG7-DDL (bytes)  MECDL (bytes)
  Don't Speak (No Doubt)         4,218,752                20,480        88,326             6,144
  That Day (Natalie Imbruglia)   4,542,464                14,681        62,656             9,224
  Ironic (Alanis Morissette)     3,678,851                14,679        65,436             6,138
  Rock DJ (Robbie Williams)      4,127,347                22,341        100,256            7,385

Table 4. MMC description: comparison between different markup languages. Songs are encoded at 128 kb/s.

It is important to point out that the value of the time attribute is expressed in time units. The time unit is used to define the granularity of the timing description (some songs, like rap songs, may require finer time granularity than other songs, like jazz or opera songs). It is defined through the <body> tag and expressed

in milliseconds (e.g., <body unit=100> means that a single time unit corresponds to 100 ms).

Lyrics

No MECDL tags are designed to describe lyrics: lyrics are the only text not enclosed between angle brackets. Hence, to display lyrics, the MP3 player simply has to skip the MECDL tags. However, since multiple songs may be encoded in the same MP3 file, it may be necessary to identify the lyrics of different songs. For this reason, the <track song="title"> tag is introduced, which specifies the name of the song through the attribute song. For instance, if the tag <track song="don't speak"> is present, it means that the lyrics that follow the tag are related to the song Don't Speak.

  <body unit=100>
  <img href="http://somewebsite/no_doubt.jpg">
  <www href="http://somewebsite/no_doubt.htm">
  <cover href="http://somewebsite/no_doubt.pdf">
  <kar time=69>you and me
  <kar time=99>we used to be together
  <kar time=131>everyday together always
  <kar time=2423>hush, hush darlin' Hush,
  <kar time=2501>hush, hush don't tell me
  </body>

Table 5. An example of multimedia contents described through our MEC description language.

ADDING MULTIMEDIA CONTENTS TO AN MP3 FILE

The description and storage of MMCs in an MP3 file are very simple: once a markup description language (SMIL, MPEG7-DDL, or MECDL) has been chosen, simple text editors and ID3 tools (included in many MP3 players) may be used to complete the description/storing task. This is good news for users, who can easily add contents of their choice to their MP3 files. We show in the following subsection a simple graphical tool to assist users in MMC addition, helping them to avoid mistyping, syntax errors, or wrong lyrics-audio synchronization.
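Since an MECDL description is plain text, it can also be generated programmatically rather than typed by hand. The following Python sketch produces a description in the style of Table 5 from a list of (start time in ms, words) pairs. The function name `build_mecdl` and its parameters are hypothetical, and rounding each start time to the nearest time unit is our assumption about how times would be quantized.

```python
def build_mecdl(unit_ms, events, img=None, www=None, cover=None):
    """Return an MECDL description as text.

    unit_ms -- duration of one time unit in milliseconds (the `unit` attribute)
    events  -- iterable of (start_ms, words) pairs for the karaoke service
    img, www, cover -- optional URLs of linked external resources
    """
    lines = [f"<body unit={unit_ms}>"]
    # Linked resources are written only when a URL is supplied.
    for tag, url in (("img", img), ("www", www), ("cover", cover)):
        if url:
            lines.append(f'<{tag} href="{url}">')
    # Each lyric fragment is stamped with its start time in time units.
    for start_ms, words in events:
        lines.append(f"<kar time={round(start_ms / unit_ms)}>{words}")
    lines.append("</body>")
    return "\n".join(lines)


if __name__ == "__main__":
    text = build_mecdl(
        100,
        [(6900, "you and me"), (9900, "we used to be together")],
        img="http://somewebsite/no_doubt.jpg",
    )
    print(text)
    print(len(text.encode("latin-1")), "bytes")
```

Printing the encoded length gives a quick way to see how much a description adds to the file.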
However, ease of alteration is a threat to music distributors: a malicious user can modify information in order to provide wrong audio-lyrics timing synchronization, wrong lyrics, fake images or CD cover, information not related to the song, and so on. We address this problem when introducing security mechanisms later.

Do-It-Yourself MMC Addition

The graphical tool we developed (Fig. 2) uses windows, buttons, and text boxes to help with including services in MP3 files. In this way, inserting services like CD cover, image, and updated information is as easy as writing into text boxes. Karaoke description is also done through buttons, so the tool generates an MECDL description of this service in an easy way.

To produce the karaoke timing description two files are necessary: the MP3 file and a lyrics file (a simple text file with lyrics typed as they are written in any CD booklet). The user selects the desired karaoke service (single word or group of words) and presses the KARAOKE button. At this point, the MP3 audio playout starts, and words are shown in the lyrics box. From now on, each time the user presses the SET-TIME button, an audio-text timing description is generated. In particular, if word-by-word mode has been selected, the tool shows the first word of the lyrics and the user must press the SET-TIME button when the singer is singing that word. This way, a timing relation between the lyrics and the audio playout is generated. When the singer sings the second word, the user must press SET-TIME again to generate the timing description for the second word. This process goes on until the end of the song. In line-by-line mode, the process differs only in that the tool shows one entire line of the lyrics at a time, and pressing the SET-TIME button generates a timing relation between lyric lines and audio playout.

Through the ADD SERVICES button, the MECDL description is inserted into the ID3v2 comment-field tag of the MP3 file, which is now ready to provide multimedia services through an MECDL-compatible MP3 player.

Figure 2. The visual MECDL interface.

Enriched MP3 Commercial Distribution

In a commercial setting, MMCs should be protected for two main reasons:
They add value; therefore, users should pay for them.
The vendor is responsible for the kind of material it is distributing; therefore, there must be a way to prove the integrity of released enriched MP3 files.

This can be achieved by using DRM systems [9], which either wrap contents in a protective software layer or ensure that contents can only be examined by specific software. We use basically two classes of techniques: standard cryptographic tools such as encryption, hashing, and digital signatures; and watermarking.

The MMC description is encrypted in order to prevent MMCs being illegally shared or distributed. (Therefore, the MECDL description has the secure attribute inside the <body> tag.) We assume that each copy of the enriched MP3 is released for a specific player. The player is bound to the computer on which it runs (with normal copyright protection techniques), and each player uses a unique watermarking key.

Figure 3. The graphical interface of our MP3 player.

Watermarking techniques enable information (the watermark) to be hidden in a data stream. The actual embedding depends on the watermarking key. In audio applications, the watermark must be imperceptible and must depend on the e-music. If, with no knowledge of the key, it is also hard to detect the watermark even by comparing different watermarked copies (statistical invisibility), and difficult to remove or tamper with it (robustness and tamper resistance), the technique can be used to store sensitive information. For instance, a secure technique is described in [10].

Thus, the vendor embeds in the audio stream the encryption key with which MMCs are enciphered. Only the intended player can easily retrieve it. Also, the vendor embeds a hash of the MMCs as a watermark in the audio stream; this serves as a lightweight integrity check that provides a minimal guarantee (watermarks are hard to alter), and can be verified at runtime (the check does not require heavy computation).

Clearly, the player must retrieve the decryption key before playout, and it is also desirable that MMC playout depend on the weak integrity check. Since we require that the whole protection mechanism be transparent to users, we allow the first 30 s of reproduction to be unprotected; weak integrity data and the decryption key are embedded in the first portion of the audio stream (i.e., in the first 30 s of audio), and the player does not need to impose any delay on playout during this initial phase. (The functional description is completed by the overview of the player in the next section.)
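The weak integrity check described above amounts to recomputing a hash of the MMC description and comparing it with the value recovered from the watermark. The Python sketch below shows only that comparison step. It assumes SHA-256 as the hash function (the article does not name one), and it takes the watermark payload as an argument, since watermark extraction itself depends on the technique and key in use; the function names are hypothetical.

```python
import hashlib
import hmac


def mmc_digest(description: bytes) -> bytes:
    """Hash of the MMC description (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(description).digest()


def weak_integrity_ok(description: bytes, watermark_digest: bytes) -> bool:
    """Compare the recomputed hash with the one recovered from the watermark.

    `watermark_digest` stands in for the data a real player would extract
    from the first 30 s of the audio stream.
    """
    return hmac.compare_digest(mmc_digest(description), watermark_digest)


if __name__ == "__main__":
    mmc = b"<body unit=100 secure><kar time=69>you and me</body>"
    embedded = mmc_digest(mmc)  # what the vendor would embed
    print(weak_integrity_ok(mmc, embedded))                 # True
    print(weak_integrity_ok(mmc + b" tampered", embedded))  # False
```

A single hash computation over a few kilobytes of text is cheap, which is consistent with the requirement that the check run at playout time without noticeable delay.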
A digital signature of the vendor, publicly available at the vendor's Web site, binds the MMCs to the audio stream to prove the integrity of the MMCs in distribution. (Because of the perturbations introduced by watermarks in the audio stream, the signature must be based on a perceptual digest of the audio data, i.e., a digest computed with a hash function that is oblivious to alterations of the stream that are indistinguishable to the human ear.)

Moreover, in order to trace illegal distributors, fingerprints are used. These are watermarks that uniquely identify an actor. The choice of a collusion-resistant technique is obviously very important in this setting. In order to make fingerprints useful for forensic analysis, the vendor must never see a fingerprinted copy, and a user must never obtain a copy with no fingerprint on it; therefore, we rely on a trusted third party for the creation and insertion of fingerprints.

The system must obviously be defended against reverse engineering of the player. Techniques in this field change quickly and are widely deployed in much commercial software.

THE MEC MP3 PLAYER

In this section we show that our approach can be used to provide users with MMCs. To this aim, we design an MP3 player that can read and understand the MMC description located in the ID3v2 comment-field tag. We only give a high-level description of our player, since implementation details are irrelevant here.

Our player is written in Java in order to enhance software portability; hence, it can be directly used on several devices that support Java applications, in particular portable devices such as recently released cellular phones. Our player can interpret MECDL, SMIL, and MPEG7-DDL. A snapshot of our player is shown in Fig. 3.
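To give an idea of what interpreting an MECDL karaoke description involves, the following Python sketch extracts the timed lyrics from a description in the style of Table 5 and looks up the words being sung at a given playback position. It is an illustrative sketch, not the authors' Java implementation: the regular expressions, the function names `parse_karaoke` and `current_words`, and the use of binary search are all our assumptions.

```python
import bisect
import re

# <body ... unit=N ...> carries the time unit in milliseconds.
BODY_RE = re.compile(r"<body\b[^>]*\bunit=(\d+)[^>]*>")
# <kar time=N> is followed by the words to display, up to the next tag.
KAR_RE = re.compile(r"<kar\s+time=(\d+)>([^<]*)")


def parse_karaoke(mecdl: str):
    """Return a list of (start_ms, words), sorted by start time."""
    m = BODY_RE.search(mecdl)
    if m is None:
        raise ValueError("no <body unit=...> tag found")
    unit_ms = int(m.group(1))
    events = [(int(t) * unit_ms, words.strip()) for t, words in KAR_RE.findall(mecdl)]
    return sorted(events)


def current_words(events, position_ms):
    """Return the words whose start time is the latest one not after position_ms."""
    starts = [start for start, _ in events]
    i = bisect.bisect_right(starts, position_ms) - 1
    return events[i][1] if i >= 0 else None


if __name__ == "__main__":
    description = (
        "<body unit=100>\n"
        "<kar time=69>you and me\n"
        "<kar time=99>we used to be together\n"
        "</body>"
    )
    events = parse_karaoke(description)
    print(events)
    print(current_words(events, 7000))
```

Converting time units to milliseconds once, at parse time, keeps the lookup during playout down to a single binary search per screen update.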
In the snapshot of Fig. 3, a region of the screen is used to display some information about the MP3 song being played out, another region is used to display one of the services provided to the user (the image), and the bottom part is used to provide the karaoke service. All other services are available through selections located in the Options menu.

Simply put, when the user selects an MP3 file to play out, our player checks the ID3v2 comment field and retrieves the MMC description if present. The player also checks whether the MMC description is encrypted or not (i.e., whether the attribute secure is present or not inside the <body> tag).

The overall architecture of the player is shown in Fig. 4. If the <body> tag does not have the secure attribute, the MMCs can be immediately rendered. Otherwise, as we previously mentioned, we accept that the first 30 s of audio and MMC reproduction are unprotected (a). During that time, the software player extracts the embedded watermarked data, WID, for weak integrity verification (b) and transparently checks the integrity of the MMCs provided (c). Weak integrity verification amounts to lightweight cryptographic computation of a hash function. If the integrity check fails, reproduction is interrupted (d). MMCs that are to be reproduced after the first 30 s are encrypted. Therefore, during this initial reproduction phase, the player also

retrieves the cryptographic key for decryption, embedded as a watermark as well (e). Then the remaining MMCs are decrypted (f), and audio MMCs rendering continues (g). The rendering of MMCs can be done in several ways; the player implementation has to take care of the information located in the MMCs description. For instance, our player can establish an http connection in order to get external media that are linked to the MP3 file via the MMCs description. It can also manage pdf files and htm files, so it can handle resources (e.g., the CD cover or updated information) in pdf or htm format. An image can be displayed only if it is in jpg/gif format and is presented on the graphical interface of Fig. 3. The karaoke description is used to highlight a word or group of words the singer in singing. In our player implementation (Fig. 3), lyrics are presented inside the karaoke-like service box. According to the audio playout, lyrics lines scroll and word(s) currently sung by the singer are presented in red. CONCLUSIONS In this article we propose a novel approach to enrich MP3 files with multimedia entertainment contents. Our approach does not require additional resources or equipment, does not affect the MP3 file structure, and only slightly increases MP3 file size. It requires multimedia contents to be described by a markup description language, and the resulting description has to be stored in one of the ID3 tags commonly present in MP3 files. MECDL, a simple markup description language, was also introduced in order to have simpler MMCs description than that obtained with SMIL and MPEG7-DDL. In addition, MECDL is capable of dealing with security features. A Java MP3 player, able to interpret MMCs description (produced by SMIL, MPEG7-DDL, or MECDL), was developed to show the effectiveness of our approach. 
Since it is expected that in the near future e-music may be considered an alternative to audio CDs, we think that our approach will contribute to increasing the attractiveness of MP3 files. In fact, with our approach the MP3 file is no longer a simple audio track, but a media-rich product that can provide even more multimedia services than audio CDs (e.g., karaoke service and updated information). We think that once people have gotten used to enhanced MP3 files, they will not want to go back to simple MP3 audio track files. We are currently working on introducing MMCs into other e-music open formats and are focusing our attention on the AAC format, with the goal of transforming it into a media-rich product.

Figure 4. The software player architecture. [Diagram: timeline from 0 to 30 s. Reproduction of audio and MMCs: unconditional reproduction (a), then reproduction of protected material (g). WM retrieval: weak integrity data retrieval (b), decryption key retrieval (e). Crypto operations: weak integrity verification (c); on success, MMC decryption (f); on failure, terminate (d).]

REFERENCES

[1] C. K. M. Lam and B. C. Y. Tan, "The Internet is Changing the Music Industry," Commun. ACM, vol. 44, no. 8, Aug. 2001, pp. 62-68.
[2] Forrest Research, "From Discs to Downloads," http://www.forrest.com, 2003.
[3] ID3, http://www.id3.org/
[4] W3 Rec., "Synchronized Multimedia Integration Language (SMIL) 2.0 Specification," http://www.w3.org/tr/smil20/
[5] MPEG7-DDL, http://archive.dstc.edu.au/mpeg7-ddl/
[6] M. Furini and L. Alboresi, "Audio-Text Synchronization inside MRS Files: A New Approach and its Implementation," Proc. IEEE Consumer Commun. and Net. Conf. 2004, Las Vegas, NV, Jan. 2004.
[7] W. H. Tseng and J. H. Huang, "A High Performance Video Server for Karaoke System," IEEE Trans. Consumer Elec., vol. 40, no. 3, Aug. 1994, pp. 609-18.
[8] M. Roccetti et al., "Delivering Music over the Wireless Internet: From Song Distribution to Interactive Karaoke on UMTS Devices," Ch. 24, Wireless Internet Handbook, B. Furth and M. Ilyas, Eds., CRC Press, 2003.
[9] M. Stamp, "Digital Rights Management: The Technology behind the Hype," J. Elect. Commerce Res., vol. 4, no. 3, 2003.
[10] I. J. Cox et al., "Secure Spread Spectrum Watermarking for Multimedia," IEEE Trans. Image Proc., vol. 6, Dec. 1997, pp. 1673-78.

BIOGRAPHIES

LAVINIA EGIDI (lavinia.egidi@mfn.unipmn.it) graduated in mathematics from the Università di Roma La Sapienza, Italy, and has a Ph.D. in computer science from the Universities of Torino and Milano, Italy. She is currently an assistant professor at the Università del Piemonte Orientale. She mainly works on issues of security and privacy, but also has ongoing research on symbolic languages for temporal representation.

MARCO FURINI (marco.furini@mfn.unipmn.it) received a degree and a Ph.D. in computer science from the University of Bologna, Italy, in 1995 and 2001, respectively. He is currently a faculty member of the Computer Science Department of Piemonte Orientale University. From August 1998 to May 1999 he visited the Department of Computer Science, University of Massachusetts, Amherst. His scientific interests include multimedia communication systems, QoS issues over IP networks, and e-music distribution.

IEEE Communications Magazine, May 2005