
A Security Model for Full-Text File System Search in Multi-User Environments. Stefan Büttcher, Charles L. A. Clarke. University of Waterloo, Canada. December 15, 2005.


Introduction and Motivation. Desktop search: virtually all desktop search systems (Copernic, Google, MSN, Yahoo, ...) maintain one index per user. Security perspective: per-user indexing processes run with user privileges. Very secure. Efficiency perspective: a file that can be read by n users is indexed n times. A waste of space and time. Efficiency considerations force us to use a single index that is accessed by all users in the system. But then we must make sure that no file permissions are violated!

The Wumpus Search Engine. What is Wumpus? A multi-user file system search engine and a multi-purpose information retrieval system. What are the main features? Multi-user support based on the UNIX security model: each user can only search files for which she has permission. Fully dynamic: file changes, creations, and deletions are immediately reflected in the index and the search results. Wumpus is Free Software, released under the GPL. http://www.wumpus-search.org/

Index Structure: Inverted Files. An inverted file is a dictionary data structure that tells us, for every term in the text collection (here: the file system), the exact locations of all its occurrences:

Term 1 -> 17, 28, 42, 177, 219
Term 2 -> 3, 38, 59, 131
...
Term N -> 79, 81, 157, 224, 289, 324

Building an inverted file is, mathematically, transposing the term/document matrix, and it can be done very efficiently. All information necessary to process a search query can be found inside the inverted file. The actual files being searched are only accessed at the very end, to produce snippet summaries etc.
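To make the data structure concrete, here is a minimal sketch (Python, not part of the original slides) that builds a positional inverted file in a single pass over the token stream; the in-memory dict and the toy documents are illustrative assumptions, not Wumpus internals.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Positional inverted file: term -> sorted list of token positions.

    'documents' is a list of token lists; positions are offsets into the
    concatenated token stream of the whole collection (here: file system).
    """
    index = defaultdict(list)
    position = 0
    for tokens in documents:
        for term in tokens:
            index[term].append(position)   # postings stay sorted automatically
            position += 1
    return dict(index)

# Example: three tiny "files"
docs = [["mad", "cow", "disease"], ["cow", "milk"], ["mad", "scientist"]]
index = build_inverted_file(docs)
print(index["mad"])   # [0, 5] -- every occurrence of "mad" in the collection
print(index["cow"])   # [1, 3]
```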

The Vector Space Model and TF/IDF (1). Boolean searches (AND, OR) are easy to implement and very helpful, but only if the number of matching documents is small. For large text collections (e.g., file systems): relevance ranking! Three main assumptions behind relevance ranking: All query terms are independent (vector space assumption). A file is more likely to be relevant if it has more occurrences of the query terms (term frequency assumption). A query term that appears in many documents is less important than a query term that appears in few documents (inverse document frequency assumption).

The Vector Space Model and TF/IDF (2). State-of-the-art relevance ranking: Okapi BM25. Given a document collection $\mathcal{D}$, a query $Q := \{T_1, \ldots, T_n\}$, and a document $D \in \mathcal{D}$, the BM25 score of $D$ is

$$S_{\mathrm{BM25}}(D) \;=\; \sum_{T \in Q} \log\!\left(\frac{|\mathcal{D}|}{|\mathcal{D}_T|}\right) \cdot \frac{f_{T,D}\,(k_1 + 1)}{f_{T,D} + k_1\,(1 - b + b\,\frac{dl}{avgdl})},$$

where $\mathcal{D}_T$ is the set of documents containing $T$, $f_{T,D}$ is the number of occurrences of $T$ within $D$, $dl$ is the length of $D$, and $avgdl$ is the average document length. BM25 can be thought of as a Boolean OR: it finds all documents containing at least one query term and ranks all matching documents by their BM25 score.
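As a concrete reading of the formula, here is a small scoring sketch (Python, not from the slides); the parameter values k1 = 1.2 and b = 0.75 and the in-memory dictionaries are assumptions made for illustration.

```python
import math

def bm25_score(query_terms, doc_id, term_freqs, doc_lengths, docs_with_term,
               k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a bag-of-words query.

    term_freqs:     dict doc_id -> {term: f_{T,D}}
    doc_lengths:    dict doc_id -> length of D in tokens (dl)
    docs_with_term: dict term   -> set of doc_ids containing the term (D_T)
    """
    num_docs = len(doc_lengths)                          # |D|
    avgdl = sum(doc_lengths.values()) / num_docs
    dl = doc_lengths[doc_id]
    score = 0.0
    for term in query_terms:
        df = len(docs_with_term.get(term, ()))           # |D_T|
        f = term_freqs[doc_id].get(term, 0)              # f_{T,D}
        if df == 0 or f == 0:
            continue                                     # term contributes nothing
        idf = math.log(num_docs / df)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score
```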

Structural Queries: The GCL Query Language. Textual query: find all documents in which mad and cow occur within a distance of 3 words from each other. As a GCL query (operator tree): documents, i.e. extents delimited by <doc> and </doc>, that contain an extent of length at most 3 which in turn contains both mad and cow:

("<doc>" … "</doc>") ▷ (("mad" ∧ "cow") ◁ [3])

(▷ = "containing", ◁ = "contained in", ∧ = "both of", "…" = "followed by".)
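The same query can be evaluated directly on positional posting lists. The brute-force sketch below (Python, not from the slides; the global token-position space and the (first, last) file extents are assumed representations) mirrors the operator tree: first find extents of length at most 3 containing both terms, then keep the documents containing such an extent.

```python
def within_k(positions_a, positions_b, k):
    """Extents of length <= k containing one occurrence of each term
    (a brute-force stand-in for (("mad" AND "cow") contained-in [k]))."""
    hits = []
    for a in positions_a:
        for b in positions_b:
            start, end = min(a, b), max(a, b)
            if a != b and end - start + 1 <= k:
                hits.append((start, end))
    return hits

def docs_containing(doc_extents, hits):
    """Document extents containing at least one hit
    (the outer ("<doc>" ... "</doc>") "containing" operator)."""
    return [d for d in doc_extents
            if any(d[0] <= s and e <= d[1] for (s, e) in hits)]

mad_positions = [0, 5]                    # postings taken from the inverted file
cow_positions = [1, 3]
doc_extents = [(0, 2), (3, 4), (5, 6)]    # (first, last) token position per file
print(docs_containing(doc_extents, within_k(mad_positions, cow_positions, 3)))
# -> [(0, 2)]  (the only file where "mad" and "cow" occur within 3 words)
```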

Combining GCL and TF/IDF Ranking. Using GCL to compute BM25 relevance scores: assume the search query contains the three terms $T_1$, $T_2$, $T_3$. Find all documents matching the expression

("<doc>" … "</doc>") ▷ (T_1 ∨ T_2 ∨ T_3)

and rank all matching documents using IDF weights

$$w_{T_i} = \log\!\left(\frac{\#(\langle doc\rangle \ldots \langle/doc\rangle)}{\#\big((\langle doc\rangle \ldots \langle/doc\rangle) \rhd T_i\big)}\right),$$

where $\#(E)$ denotes the number of extents matching the GCL expression $E$. The other components of BM25 can be expressed in a similar way.

The UNIX Security Model. The UNIX security model knows 3 different types of file access privileges (R, W, X), granted separately to the file's owner, its group, and all others. Example:

           owner   group   others
Read         x       x
Write        x
Execute      x       x

But where is the search bit??? There is none, so we need to work with implicit search permissions and define the search privilege as a combination of read and execute.

The find/grep/slocate Security Model. In the traditional find/grep UNIX search environment, a user can search a file if (and only if) she has R and X permission for a path leading to the parent directory (in order to find the file) and R permission for the actual file (in order to grep it). The same rules are used by slocate when it decides which search results it may return to the user. Good: it is possible to grant file access to a specific group of users by making the directory X-only (not R) and telling them the file name. Bad: to make a file find/grep-searchable, the entire directory has to be made readable, revealing information about other files.

The Wumpus Security Model. In the security model used by Wumpus, a user can search a file if (and only if) she has X permission for a path leading to the parent directory and R permission for the actual file. Good: individual files can be made searchable without revealing any information about other files in the same directory. Bad: inconsistent with the existing search infrastructure and possibly counter-intuitive. Neither definition of searchable is completely satisfactory. Maybe we really need an explicit search bit.
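Both notions of "searchable" can be written down as a small predicate. The sketch below (Python, not from the slides) uses os.access, which evaluates permissions for the current process; a real multi-user search engine would have to evaluate them for an arbitrary user ID.

```python
import os

def is_searchable(path, model="wumpus"):
    """Can the calling user search 'path' under the given security model?

    model="find_grep": R and X on every directory along the path, R on the file.
    model="wumpus":    X on every directory along the path, R on the file.
    """
    dir_mode = os.X_OK if model == "wumpus" else (os.R_OK | os.X_OK)
    directory = os.path.dirname(os.path.abspath(path))
    while True:
        if not os.access(directory, dir_mode):
            return False
        parent = os.path.dirname(directory)
        if parent == directory:            # reached the file system root
            break
        directory = parent
    return os.access(path, os.R_OK)
```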

Where are we? There are different ways to define what it means for a file to be searchable by a user. Regardless of the actual definition, the result is a user-specific partition of the set of all files in the file system into two subsets: $F_U$ (files that may be searched by user U) and $\overline{F_U}$ (files that may not be searched by user U). We need to make sure that the response to a search query by user U does not reveal any information about files in $\overline{F_U}$. In particular: search results must not contain files from $\overline{F_U}$.

Implementation I: The Postprocessing Approach. The most obvious way to address the security requirements is a postprocessing approach: find all files matching the query and rank them using global, user-independent term weights (IDF values); before the search results are returned to the user, remove all files for which the user does not have search permission. Rationale: in a system with a large number of users, it is infeasible to precompute per-user term weights. This implementation is actually used by several multi-user search systems, so its problems are real problems.
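A minimal sketch of the postprocessing approach (Python; rank_all and may_search are hypothetical hooks standing in for the global ranking function and the permission check, not Wumpus APIs):

```python
def postprocessing_search(query, user, rank_all, may_search):
    """Insecure postprocessing approach, as exploited on the next slides.

    rank_all(query)        -> list of (file, score), ranked with global,
                              user-independent term weights (IDF values)
    may_search(user, file) -> True iff the file belongs to F_U
    """
    ranked = rank_all(query)                              # user-independent ranking
    return [(f, s) for (f, s) in ranked                   # drop forbidden files
            if may_search(user, f)]                       # only afterwards
```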

How to Exploit the Postprocessing Approach (1). Target: analyze the results of a search query in order to determine the number of files, $|\mathcal{F}_T|$, containing a given term T. Is this a real problem? Yes. And it does not stop with single terms: Boolean queries, structural queries, phrases, ... How to proceed? If the search engine returns relevance scores along with the matching files, it is straightforward to compute $|\mathcal{F}_T|$ (see the paper for details). To make it more challenging, assume the search engine does not return relevance scores; then we use relative file ranks to approximate $|\mathcal{F}_T|$.

How to Exploit the Postprocessing Approach (2). Assume we want to approximate $|\mathcal{F}_T|$. Create a single file $F_0$ containing only the term T. Generate a unique, random term $T_2$. Create 1000 files $F_1 \ldots F_{1000}$, each containing the term $T_2$. Submit the search query $\{T, T_2\}$. Since BM25 performs an OR, all files $F_0 \ldots F_{1000}$ match the query and are ranked according to their BM25 scores. If $F_0$ appears before the other files ($F_1 \ldots F_{1000}$), we know that $|\mathcal{F}_T| \le 1000 = |\mathcal{F}_{T_2}|$. Exact value: binary search! ... but what if not???
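The binary-search idea can be sketched against a hypothetical query interface (search, create_file, and delete_file are assumed helpers, not Wumpus APIs); it only looks at the relative order of the returned files, never at scores.

```python
import uuid

def df_at_most(term, n, search, create_file, delete_file):
    """Test whether |F_T| <= n using only relative ranks.

    search(terms)            -> ranked list of file paths (hypothetical)
    create_file(path, text)  -> create an attacker-owned file
    delete_file(path)        -> remove it again
    """
    decoy_term = "t" + uuid.uuid4().hex        # unique, random term T2
    probe = "/tmp/probe_0"                     # F_0, containing only T
    create_file(probe, term)
    decoys = []
    for i in range(1, n + 1):                  # F_1 ... F_n, containing T2
        path = f"/tmp/decoy_{i}"
        create_file(path, decoy_term)
        decoys.append(path)
    ranking = search([term, decoy_term])
    above = ranking.index(probe) < min(ranking.index(d) for d in decoys)
    for path in [probe] + decoys:              # clean up
        delete_file(path)
    return above                               # True => idf(T) >= idf(T2), i.e. |F_T| <= n

def estimate_df(term, search, create_file, delete_file, upper=1_000_000):
    """Binary search for |F_T| (smallest n with df_at_most(...) == True)."""
    lo, hi = 1, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if df_at_most(term, mid, search, create_file, delete_file):
            hi = mid
        else:
            lo = mid + 1
    return lo
```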

How to Exploit the Postprocessing Approach (3). If $F_0$ appears after $F_1 \ldots F_{1000}$, then: delete all files created so far; generate two more random terms, $T_3$ and $T_4$; create 1000 files $F_1 \ldots F_{1000}$, each containing $T_2$ and $T_3$; create 999 files $F'_1 \ldots F'_{999}$, each containing only $T_4$; and create one last file, $F'_{1000}$, containing T and $T_4$. Send the query $\{T, T_2, T_3, T_4\}$ to the search system. All 2000 files match the query. But what is the ranking? What are the scores?

How to Exploit the Postprocessing Approach (4). Remember the BM25 score of a file F (with $\mathcal{F}$ the set of all files and $\mathcal{F}_T$ the files containing T):

$$S_{\mathrm{BM25}}(F) \;=\; \sum_{T \in Q} \log\!\left(\frac{|\mathcal{F}|}{|\mathcal{F}_T|}\right) \cdot \frac{f_{T,F}\,(k_1 + 1)}{f_{T,F} + k_1\,(1 - b + b\,\frac{dl}{avgdl})}.$$

Thus, we have

$$S_{\mathrm{BM25}}(F'_{1000}) \;=\; C \cdot \left( \log\frac{|\mathcal{F}|}{|\mathcal{F}_T|} + \log\frac{|\mathcal{F}|}{|\mathcal{F}_{T_4}|} \right)$$

and

$$S_{\mathrm{BM25}}(F_i) \;=\; C \cdot \left( \log\frac{|\mathcal{F}|}{|\mathcal{F}_{T_2}|} + \log\frac{|\mathcal{F}|}{|\mathcal{F}_{T_3}|} \right) \quad \text{for } 1 \le i \le 1000,$$

where $C = \frac{k_1 + 1}{1 + k_1\,(1 - b + \frac{2b}{avgdl})}$, since each of the created files contains exactly two term occurrences ($f_{T,F} = 1$, $dl = 2$). (The files $F'_1 \ldots F'_{999}$, containing only $T_4$, score $C \cdot \log\frac{|\mathcal{F}|}{|\mathcal{F}_{T_4}|}$ and do not matter for the comparison.)

How to Exploit the Postprocessing Approach (5). BM25 score components: the files $F_1 \ldots F_{1000}$ are scored via $T_2$ and $T_3$; the file $F'_{1000}$ is scored via $T$ and $T_4$. Initially, the document frequencies are $|\mathcal{F}_{T_2}| = |\mathcal{F}_{T_3}| = 1000$, $|\mathcal{F}_{T_4}| = 1000$, and $|\mathcal{F}_T| > 1000$. Now delete decoy files one by one: depending on which files are deleted, $|\mathcal{F}_{T_2}|$, $|\mathcal{F}_{T_3}|$, and $|\mathcal{F}_{T_4}|$ decrease, and the two score components

$$\log\frac{|\mathcal{F}|}{|\mathcal{F}_T|} + \log\frac{|\mathcal{F}|}{|\mathcal{F}_{T_4}|} \quad\text{vs.}\quad \log\frac{|\mathcal{F}|}{|\mathcal{F}_{T_2}|} + \log\frac{|\mathcal{F}|}{|\mathcal{F}_{T_3}|}$$

move relative to each other until the ranking of $F'_{1000}$ relative to $F_1 \ldots F_{1000}$ flips.

How to Exploit the Postprocessing Approach (6). Example: assume T appears in $|\mathcal{F}_T| = 11{,}000$ files. Using the technique just described (deleting $T_4$ files until $F'_{1000}$ overtakes $F_1 \ldots F_{1000}$, which here happens when $|\mathcal{F}_{T_4}|$ drops from 91 to 90), we obtain

$$\log\frac{|\mathcal{F}|}{|\mathcal{F}_T|} + \log\frac{|\mathcal{F}|}{90} \;\ge\; 2\log\frac{|\mathcal{F}|}{1000} \;\ge\; \log\frac{|\mathcal{F}|}{|\mathcal{F}_T|} + \log\frac{|\mathcal{F}|}{91},$$

i.e. $90 \cdot |\mathcal{F}_T| \le 1000^2 \le 91 \cdot |\mathcal{F}_T|$, which yields $10990 \le |\mathcal{F}_T| \le 11111$. The relative error is $\varepsilon \approx 0.5\%$.

Query Integration (1). We have seen how to use GCL to perform relevance ranking. Now: integrate the security restrictions into GCL. At query time, construct a GC-list $F_U$ that represents all files searchable by user U. Then apply the restriction by intersecting the operands of the GCL expression with $F_U$, e.g. replacing <doc> by (<doc> ◁ $F_U$).
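The restriction itself is an intersection of a posting list with the user's visible extents. Here is a linear-merge sketch (Python; the list-of-tuples representation of the GC-list $F_U$ is an assumption for illustration, not the Wumpus data layout):

```python
def restrict_to_visible(postings, visible_extents):
    """Keep only the token positions that fall inside a visible file.

    postings:        sorted list of token positions (the operand's GC-list)
    visible_extents: sorted, non-overlapping (start, end) extents of files in F_U
    """
    result, i = [], 0
    for pos in postings:
        while i < len(visible_extents) and visible_extents[i][1] < pos:
            i += 1                                 # skip extents ending before pos
        if i < len(visible_extents) and visible_extents[i][0] <= pos:
            result.append(pos)                     # pos lies inside a visible file
    return result

# Example: only the file covering positions 3..4 is visible to the user
print(restrict_to_visible([0, 1, 3, 5], [(3, 4)]))   # -> [3]
```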

Query Integration (2). The other components of the BM25 ranking function (term weights, average document length, etc.) are modified in a similar way. Unsecure term weights:

$$w_{T_i} = \log\!\left(\frac{\#(\langle doc\rangle \ldots \langle/doc\rangle)}{\#\big((\langle doc\rangle \ldots \langle/doc\rangle) \rhd T_i\big)}\right)$$

Secure term weights:

$$w_{T_i} = \log\!\left(\frac{\#\big((\langle doc\rangle \lhd F_U) \ldots (\langle/doc\rangle \lhd F_U)\big)}{\#\Big(\big((\langle doc\rangle \lhd F_U) \ldots (\langle/doc\rangle \lhd F_U)\big) \rhd (T_i \lhd F_U)\Big)}\right)$$
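For the term weights, the secured counts can be computed from the restricted lists, for example as in the sketch below (same assumed list-of-tuples representation as above; Wumpus evaluates this inside its GCL engine, not on Python lists):

```python
import math

def secure_idf(term_postings, doc_extents, visible_extents):
    """Secure IDF: count only documents that lie inside a visible file.

    doc_extents:     (start, end) extents matched by <doc> ... </doc>
    visible_extents: (start, end) extents of the files the user may search (F_U)
    Computes  w_T = log(#visible docs / #visible docs containing T).
    """
    def inside(inner, outer):
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    visible_docs = [d for d in doc_extents
                    if any(inside(d, v) for v in visible_extents)]
    with_term = [d for d in visible_docs
                 if any(d[0] <= p <= d[1] for p in term_postings)]
    if not visible_docs or not with_term:
        return 0.0                          # term is invisible to this user
    return math.log(len(visible_docs) / len(with_term))
```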

Experimental Results (1). (a) Without cache effects: all posting lists have to be fetched from disk. (b) With cache effects: all posting lists are fetched from the disk cache. Compared to postprocessing, the unoptimized query integration is between 54% and 74% slower in some settings and between 18% and 36% faster in others, depending on cache effects and the relative number of visible files.

Query Optimization. Consider the GCL expression ((<doc> ◁ F_U) … (</doc> ◁ F_U)) ◁ F_U. It is equivalent to (<doc> … </doc>) ◁ F_U, which is cheaper to evaluate. Analogous equivalences exist for the other GCL operators (▷, ∧, ∨, ...).
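One way to picture the optimization is as a rewrite rule on the operator tree. The sketch below (Python; the tuple encoding of GCL expressions is purely an illustrative assumption) removes the redundant per-operand restrictions when the whole expression is already restricted to F_U:

```python
def pull_up_restriction(expr):
    """Rewrite ((A <| F_U) .. (B <| F_U)) <| F_U  into  (A .. B) <| F_U.

    Expressions are nested tuples, e.g.
      ("contained_in", ("followed_by", ("contained_in", "<doc>", "F_U"),
                                       ("contained_in", "</doc>", "F_U")), "F_U")
    If the whole extent lies inside a visible file, so do its endpoints,
    so the inner restrictions are redundant.
    """
    def strip(e):
        if isinstance(e, tuple) and e[0] == "contained_in" and e[2] == "F_U":
            return e[1]
        return e

    if (isinstance(expr, tuple) and expr[0] == "contained_in" and expr[2] == "F_U"
            and isinstance(expr[1], tuple) and expr[1][0] == "followed_by"):
        _, left, right = expr[1]
        return ("contained_in", ("followed_by", strip(left), strip(right)), "F_U")
    return expr
```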

Experimental Results (2). (a) Without cache effects: all posting lists have to be fetched from disk. (b) With cache effects: all posting lists are fetched from the disk cache. Compared to postprocessing, the optimized query integration is between 12% and 17% slower in some settings and between 20% and 39% faster in others, depending on cache effects and the relative number of visible files.

We have proposed a file system search security model based on the UNIX security model. For one possible implementation (postprocessing), we have shown that an arbitrary user may obtain information about files for which she does not have read permission. We have presented a second, secure implementation and pointed out query optimization opportunities. Using the optimized implementation of the security model, the average time per query is between 61% and 117% of that of the original, security-unaware implementation.

The End. Thank You!