Analysis and Implementation of Automatic Reassembly of File Fragmented Images Using Greedy Algorithms By Lucas Shinkovich and Nate Jones
Outline Introduction & Problem Method Algorithms Problems & Limitations Experimental Implementation Future Work
Introduction & Problem Files tend to get fragmented (stored in discontinuous blocks) on a regular basis. This usually happens when a file is too large for one block, or when a block is deleted and later reused by a larger/smaller file. Problem: Reassemble the fragmented blocks (image data only) without any prior knowledge of their original ordering. Assume: no missing or corrupted blocks.
Method Put fragments together based on a candidate weight value. Generalizes to the graph problem of finding k vertex-disjoint paths in a complete graph (NP-complete). Our problem is much simpler because we know the first fragment (the header), and the header contains valuable information such as the image resolution, from which we can derive how many fragments the image contains. Assume: each fragment is at least one image row in size, so we only need to compare fragments along their top and bottom rows. Find the concatenation of fragments that minimizes or maximizes the candidate weight values to recover the original image. Three types of candidate weight values: Pixel Matching (PM), Sum of Differences (SoD), and Median Edge Detection (MED).
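As an illustration, here is a minimal sketch of the Sum of Differences (SoD) weight, assuming fragments have been decoded into lists of pixel rows, with each row a list of (R, G, B) tuples. The function name and data layout are ours, not the paper's; a lower weight means a better match.

```python
# Hypothetical SoD candidate weight: compare the bottom pixel row of
# fragment a against the top pixel row of fragment b.

def sod_weight(frag_a, frag_b):
    bottom = frag_a[-1]          # last pixel row of the preceding fragment
    top = frag_b[0]              # first pixel row of the candidate fragment
    return sum(abs(ca - cb)
               for pa, pb in zip(bottom, top)
               for ca, cb in zip(pa, pb))

# Two fragments whose adjoining rows are identical score 0 (best match).
row = [(10, 20, 30), (40, 50, 60)]
assert sod_weight([row], [row]) == 0
```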
Algorithms Eight different reconstruction algorithms, based on vertex/non-vertex-disjoint paths, run in parallel or serial, and using one of two heuristics (greedy or enhanced greedy). Greedy Heuristic: Start with header fragment hi and reconstruct the image by choosing the best available fragment t (based on candidate weight value). Add t to the reconstructed image path and repeat until the image is complete. Time: O(n log n)
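The greedy heuristic above can be sketched as follows. This is an illustrative reading, not the paper's code: `weight` stands in for any of the candidate weight functions (here, absolute differences of adjacent rows), and fragments are assumed to be pre-decoded lists of pixel rows.

```python
# Illustrative candidate weight: difference between the tail's bottom row
# and the candidate's top row.
def weight(a, b):
    return sum(abs(x - y) for x, y in zip(a[-1], b[0]))

def greedy_reconstruct(header, fragments, num_fragments):
    """Greedy heuristic: repeatedly append the best-matching fragment."""
    path = [header]
    available = list(fragments)
    for _ in range(num_fragments - 1):
        # pick the best available fragment for the current tail of the path
        best = min(available, key=lambda f: weight(path[-1], f))
        available.remove(best)
        path.append(best)
    return path
```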
Algorithms (cont.) 1) Greedy sequential unique path (SUP): Uses the greedy heuristic to assemble fragments and creates vertex-disjoint paths (i.e., a fragment can only be used once). Problem: dependent on the order in which images are processed. 2) Greedy non-unique path (NUP): Uses the greedy heuristic to assemble fragments and creates non-vertex-disjoint paths (i.e., a fragment may be used an unlimited number of times).
Algorithms (cont.) 3) Greedy parallel unique path (PUP): Start with all of the image headers and choose the best match for each header. We then pick the best header-fragment pair out of all the header-fragment best matches. 4) Greedy shortest path first (SPF UP): Attempts to gain the benefits of the NUP algorithms while remaining a UP algorithm. Assumes that the best reconstructed image is the one with the lowest average path cost. Runs greedy NUP across all of the images and computes each total cost (sum of candidate weight values); the total cost divided by the number of fragments is the average path cost. We commit the image with the lowest average path cost, remove its fragments, and run the algorithm again until all of the images have been obtained.
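The PUP idea can be sketched like this. It is a simplified reading under our own assumptions: all images grow in parallel, and at each step the single best (image tail, fragment) pair over all unfinished images is committed. `weight`, the data layout, and the per-image fragment counts are illustrative.

```python
def weight(a, b):
    return sum(abs(x - y) for x, y in zip(a[-1], b[0]))

def greedy_pup(headers, fragments, counts):
    """Parallel unique path: commit the globally best pair each round.

    counts[i] is the total number of fragments image i should have
    (derivable from the header resolution, per the slides).
    """
    paths = {i: [h] for i, h in enumerate(headers)}
    pool = list(fragments)
    while pool:
        # best pairing among all unfinished images and unused fragments
        i, f = min(((i, f) for i, p in paths.items() if len(p) < counts[i]
                    for f in pool),
                   key=lambda pair: weight(paths[pair[0]][-1], pair[1]))
        paths[i].append(f)
        pool.remove(f)      # unique path: each fragment is used once
    return [paths[i] for i in sorted(paths)]
```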
Algorithms (cont.) All of the algorithms mentioned (1-4) have a total running time of O(n² log n). Enhanced Greedy Heuristic: Same as the greedy heuristic, except that before committing the best fragment match t to fragment hi, we first check whether t is a better match for another fragment. Algorithms 5-8 are the same as the previous ones discussed except that they use the enhanced greedy heuristic.
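One way to picture the enhanced-greedy check is the sketch below; it is our interpretation, not the paper's code. Before committing candidate fragment t to a tail, we ask whether some other tail matches t even better, and if so, move there first. Each switch strictly lowers the committed weight, so the loop terminates.

```python
def weight(a, b):
    return sum(abs(x - y) for x, y in zip(a[-1], b[0]))

def enhanced_pick(tails, pool):
    """Return (tail_index, fragment) for the next safe greedy assignment."""
    i = 0  # start from the first tail; illustrative ordering
    while True:
        # best available fragment for the current tail
        t = min(pool, key=lambda f: weight(tails[i], f))
        # does some other tail want t even more?
        j = min(range(len(tails)), key=lambda k: weight(tails[k], t))
        if j == i:
            return i, t
        i = j  # yield t to the better-matching tail and retry from there
```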
Problems & Limitations Assumptions are not realistic. Assumption: each fragment is at least one row in size. This is not always the case. Example: a 512x512 image is 512x512x3 + 54 bytes = 786,486 bytes. Dividing by the 4096-byte cluster size gives 192.0132, so we have a leftover fragment of only 54 bytes, while one image row is 512x3 = 1536 bytes. How do we calculate candidate weights with differing width sizes? A similar problem arises when the image width is larger than the cluster size; in that case, we can no longer compare fragments by just their top and bottom rows.
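The slide's arithmetic, spelled out (the 54-byte figure corresponds to a standard BMP file header, which is our reading of where it comes from):

```python
# A 512x512 24-bit image with a 54-byte header does not split evenly
# into 4096-byte clusters: the final fragment is far smaller than one row.
width, height, bytes_per_pixel, header_bytes = 512, 512, 3, 54
file_size = width * height * bytes_per_pixel + header_bytes   # 786,486 bytes
cluster_size = 4096
full_fragments, leftover = divmod(file_size, cluster_size)
row_bytes = width * bytes_per_pixel                           # one image row
assert (full_fragments, leftover, row_bytes) == (192, 54, 1536)
```

The 54-byte tail cannot be compared row-against-row with any other fragment, which is exactly the problem the slide raises.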
Problems & Limitations (cont.) Assumption: No fragments are corrupt or missing. Not realistic; if you have lost the file allocation table, then you have most likely lost some data along with it. Missing regular fragments will only cause minor problems in some algorithms, but missing a header fragment makes that image's reconstruction impossible, because we will be missing vital header information such as the width of the image (needed to calculate candidate weight values). Partial solution: the major problem is finding out which fragment(s) are missing from which image(s). If we can solve this, we might be able to do some type of interpolation for the missing fragment, especially if it's at the end of the image.
Problems & Limitations (cont.) Assumption: Access to raw pixel data (not clearly stated in the paper). We need access to raw pixel data to compute correct candidate weight values. Compressed image formats such as JPEG must be decoded before the pixel data is accessible. Because a single fragment may not contain complete 8x8 blocks, we cannot successfully perform the inverse DCT to recover the pixel data. New ideas in research, such as image encryption, will have the same problem.
Problems & Limitations (cont.) Problems with the algorithm itself: A greedy algorithm without proof of the greedy-choice property or optimal substructure. A greedy choice does not always find the shortest path (cf. Dijkstra's algorithm). Not very efficient: a lot of sorting and tree allocation (linear sort?). What do we do with two fragments that have the same candidate weight value?
Experimental Implementation
Experimental Implementation (Cont'd) Written in C#. Delegates allowed us to easily plug in different weight functions, comparators, etc. We were both familiar with C#. The GUI was important because it allows us to quickly and easily put together new experiments with new image sets and new parameters. We set up our experimental implementation of their experiment to be a place to easily test future work on the subject.
Fragment.cs
ReassembleExperiment.cs
Reporting Results Graph of average color intensity, since color intensity is a major factor in fragment weights. Shows how similar an image is. Doesn't work as well for small images. Red: original image; Blue: reconstruction; Green: difference.
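The reporting idea can be sketched roughly as below. This is our own rendering of the slide, assuming images are lists of pixel rows of (R, G, B) tuples; the red/blue/green curves of the slide correspond to the original, the reconstruction, and their difference.

```python
def avg_intensity_per_row(image):
    """Mean color intensity of each pixel row (all three channels pooled)."""
    return [sum(sum(px) for px in row) / (3 * len(row)) for row in image]

def intensity_difference(original, reconstruction):
    """Per-row gap between the two curves; zero where rows match."""
    return [abs(a - b) for a, b in zip(avg_intensity_per_row(original),
                                       avg_intensity_per_row(reconstruction))]

# Identical images differ by zero everywhere.
img = [[(10, 20, 30)], [(40, 50, 60)]]
assert intensity_difference(img, img) == [0, 0]
```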
Results: 512x266
Results: Reconstruction SoD
Results: Reconstruction PMA
Results: Reconstruction MED
Results: Big Image
Results: Big Image SoD
Jerusalem Tower: Color Shifting
Jerusalem Tower: PMA
Jerusalem Tower: MED
Future Work Graph the candidate weight values. See if spikes in the graph correspond to problems in the resultant image. Can this help us find where the algorithm falls apart? Attempt a mixed approach to the Enhanced Greedy and Greedy Heuristics. Might improve the average-case time bound of the Enhanced Greedy Heuristic; doubtful that it will have a positive effect on the reconstructed image. Attempt reconstruction of images with missing data. How much missing data makes the image irretrievable?
Questions?
References N.Memon and A.Pal, Automated reassembly of file fragmented images using greedy algorithms, IEEE Trans. Image Processing, vol.15, no.2, pp.385-393, 2006.