Crawling Rich Internet Applications


1 Crawling Rich Internet Applications. Gregor v. Bochmann (in collaboration with the SSRG group), University of Ottawa, Canada. Oldenburg, December 16, 2013

2 Overview: Background (The evolving web, Why crawling, Our research project); Web Crawling (Traditional web crawling, RIA crawling, Performance objectives and assumptions); Crawling strategies (Breadth-first, Depth-first, Greedy; Model-based strategies: Hypercube, Menu; Probabilistic strategy; Component-based crawling); Distributed crawling (Different architectures, Experimental results); Conclusions

3 Web crawling is exploring web applications automatically: discovering the pages of a web application, and emulating the user behaviour to retrieve the states of a web application. Web crawling is as old as the web itself! From the early days of the web, keeping pace with its expansion has been a challenge.

4 The evolving Web. Traditional Web: static HTML pages stored as separate files, identified by a URL. Deep Web: a server application accesses a database; the user fills in request forms; HTML pages are dynamically created by the server and identified by a URL including the request parameters. Rich Internet Applications (RIA, Web 2.0): pages contain executable code (e.g. JavaScript, Silverlight, Adobe Flex...) that is executed in response to user interactions or timeouts (so-called events); a script may change the displayed page (the state of the application changes) under the same URL. AJAX: a script may interact asynchronously with the server to update the page.

5 Example of a traditional web application: my web site. Simplified model of the web site. [Diagram: pages Bochmann, Pub, DSRG, Hobbies, Painter B as nodes (pages with URL); links (events) labeled publications, hobbies, research group.]

6 RIA examples: TestRIA, Altoro Mutual

7 RIA example: Clipmarks

8 RIA example: Google Mail

9 The graph model of a web application. Graph model: a web page (a client state of the application) is a node; it is encoded in HTML, in a structure called the DOM. An event (click, mouse-over, etc.) is an edge: an event triggers a transition between states. [Diagram: the example site graph with pages (URL) as nodes and links (events) as edges.]

10 RIA vs. Traditional Web (Web-1). Graph model comparison: in Web-1, each web page (state) has a URL, and each event (link) includes the next URL; in a RIA, few pages have a URL, and an event triggers code execution. [Diagram: the example site graph shown twice, once with pages identified by URL and links as events, once with states that have no URL.]

11 Why crawling. Objective A: find all (or all important) pages: for content indexing for search engines, for security testing and vulnerability assessment, for accessibility testing. Objective B: find all links between pages: for ranking pages (e.g. Google ranking in search queries), for automated testing and model checking of the web application, and for assuring that all pages have been found.

12 Software Security Research Group (SSRG), University of Ottawa, in collaboration with IBM. University of Ottawa: Prof. Guy-Vincent Jourdan, Prof. Gregor v. Bochmann, Suryakant Choudhary (Master student), Emre Dincturk (PhD student), Khaled Ben Hafaiedh (PhD student), Seyed M. Mir Taheri (PhD student), Ali Moosavi (Master student). IBM R&D (Ottawa): Iosif Viorel Onut (PhD), AppScan product team.

13 [Product screenshot: view detailed security issue reports. Security issues identified with static analysis (white-box view) and with dynamic analysis (black-box view); aggregated and correlated results; remediation tasks; security risk assessment.]

14 Overview: Background (The evolving web, Why crawling, Our research project); Web Crawling (Traditional web crawling, RIA crawling, Performance objectives and assumptions); Crawling strategies (Breadth-first, Depth-first, Greedy; Model-based strategies: Hypercube, Menu; Probabilistic strategy; Component-based crawling); Distributed crawling (Different architectures, Experimental results); Conclusions

15 Traditional Web Crawling. An HTML page is a tree data structure, called the DOM. It includes information about: the display by the browser; the events that can be activated by the user (for instance, clicking on certain displayed fields); and, for each event, the URL to be requested from the server through an HTTP request (the link to the next page). The page returned by the server for a given URL depends, in general, on the server state and the values of cookies. The displayed page is identified by its URL if we ignore server state and cookies.

16 Traditional web crawling algorithm. Given: an initial seed URL, and a domain (or list of domains) defining the limit of the web space to be explored. Crawler variables (of type set of URLs): exploredURLs = empty; unexploredURLs = {seedURL}. Algorithm: while unexploredURLs is not empty, take a URL from unexploredURLs, add it to exploredURLs, request it from the server, analyse the returned page (according to the purpose of the crawl), extract the links in the page, and add the corresponding URLs (if they are new, and if they are in the domain) to unexploredURLs.
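The loop above translates directly into Python. The sketch below is an illustration only, not the crawler used in this work: link extraction uses a naive regular expression instead of a proper HTML parser, and the `domains` check is a simplified stand-in for the domain limit.

```python
import re
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

def crawl(seed_url, domains):
    # Crawler variables: sets of explored and unexplored URLs.
    explored_urls = set()
    unexplored_urls = {seed_url}
    while unexplored_urls:
        url = unexplored_urls.pop()
        explored_urls.add(url)
        try:
            page = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that cannot be retrieved
        # analyse(page) would go here, depending on the purpose of the crawl.
        for href in re.findall(r'href="([^"]+)"', page):
            link = urljoin(url, href)  # resolve relative links
            if link not in explored_urls and urlparse(link).netloc in domains:
                unexplored_urls.add(link)
    return explored_urls

# Example: crawl("http://example.com/", {"example.com"})
```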

17 RIA Crawling. Difference from the traditional web: most pages have no URL and are therefore not directly accessible. When an event triggers the execution of a script, the script may change the DOM structure, which may lead to a new display and a new set of enabled events, that is, a new state of the application. Crawling means: finding all URLs that are part of the application, plus, for each URL, finding all states reached (from the seed URL) by executing any sequence of events. Important note: only the seed states are directly accessible by a URL. [Diagram: the example site graph, with most states having no URL.]

18 Difficulties for crawling RIAs. State identification: a state cannot be identified by a URL; instead, we consider that a state is identified by the current DOM in the browser. Most links (events) do not contain a URL: an event included in the DOM may not explicitly identify the next state reached when the event is executed; to determine the state reached by such an event, we have to execute it. (In traditional crawling, the event (link) contains the URL, which identifies the next state reached.) Accessibility of states: most states are not directly accessible (no URL), only through a seed URL and a sequence of events (and intermediate states).

19 Important consequence. For a complete crawl (a crawl that ensures that all states of the application are found), the crawler has to execute all events in all states of the application, since for any of these events we do not know, a priori, whether its execution in the current state will lead to a new state or not. Note: in the case of traditional web crawling, it is not necessary to execute all events on all pages; it is sufficient to extract the URLs from these events and get the page for each URL only once.

20 Example. The links publications in the pages Bochmann and DSRG have the same URL: the page Pub will be retrieved only once. The events publications in the pages Bochmann and DSRG have no URL: both events publications must be executed, and the crawler finds out that they both lead to the same client state. [Diagram: the example site graph shown for the Web-1 case and for the RIA case.]

21 AJAX: asynchronous interactions with the server. We ignore the intermediate states in our current work, by simply waiting until a new stable state is reached after each user input.
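One simple way to "wait until a new stable state is reached" is to poll the DOM until it stops changing for a short quiet period. The sketch below assumes a Selenium-style driver that exposes the current DOM as `page_source`; the slides do not prescribe any particular mechanism.

```python
import hashlib
import time

def wait_for_stable_state(driver, quiet_period=0.5, timeout=10.0):
    # Poll the DOM (driver.page_source, as in Selenium WebDriver) until it
    # stays unchanged for quiet_period seconds, or give up after timeout.
    deadline = time.time() + timeout
    last_hash = None
    stable_since = time.time()
    while time.time() < deadline:
        dom_hash = hashlib.sha1(driver.page_source.encode("utf-8")).hexdigest()
        if dom_hash != last_hash:
            last_hash = dom_hash
            stable_since = time.time()
        elif time.time() - stable_since >= quiet_period:
            return True   # DOM unchanged for the whole quiet period
        time.sleep(0.1)
    return False          # the page was still changing when we gave up
```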

22 RIA: need for DOM equivalence. A given page often contains information that changes frequently, e.g. advertising or time-of-day information. This information is usually of no importance for the purpose of crawling. In the traditional web, the page identification (i.e. the URL) does not change when this information changes. In a RIA, states are identified by their DOM; therefore, similar states with different advertising would be identified as different states (which leads to a state space that is too large). We would like a state identifier that is independent of the unimportant changing information. We introduce a DOM equivalence, and all states with equivalent DOMs have the same identifier.

23 DOM equivalence. The DOM equivalence depends on the purpose of the crawl: in the case of security testing, we are not interested in the textual content of the DOM, whereas for content indexing this content is important. The DOM equivalence relation is realized by a DOM reduction algorithm, which produces (from a given DOM) a reduced, canonical representation of the information that is considered relevant for the crawl. If the reduced DOMs obtained from two given DOMs are the same, then the given DOMs are considered equivalent, that is, they represent the same application state (for this purpose of the crawl).

24 Form of the state identifiers. The reduced DOM could itself be used as the state identifier; however, it is quite voluminous, and we have to store the application model in memory during its exploration: each edge in the graph contains the identifiers of the current and next states. This is necessary to check whether a state obtained after the execution of some event is a new state or a known one. Condensed state identifier: a hash of the reduced DOM. The crawler also stores, for each state, the list of events included in the DOM and whether they have been executed or not; this is used to select the next event to be executed during the crawl.
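As an illustration of DOM reduction followed by hashing, the sketch below keeps only the tag structure and a few event-related attributes, dropping all text, which is one plausible reduction for security testing (slide 23); the chosen attribute set (KEEP_ATTRS) is a hypothetical example, since the real reduction depends on the purpose of the crawl.

```python
import hashlib
from html.parser import HTMLParser

class ReducingParser(HTMLParser):
    # Keep only the tag structure and event-related attributes; drop all text.
    KEEP_ATTRS = {"id", "href", "onclick", "onmouseover"}  # hypothetical choice

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        kept = sorted((k, v or "") for k, v in attrs if k in self.KEEP_ATTRS)
        self.tokens.append("<%s %s>" % (tag, kept))

    def handle_endtag(self, tag):
        self.tokens.append("</%s>" % tag)

def state_id(dom_html):
    # Condensed state identifier: a hash of the reduced DOM.
    parser = ReducingParser()
    parser.feed(dom_html)
    return hashlib.sha1("".join(parser.tokens).encode("utf-8")).hexdigest()
```

Two DOMs that differ only in text content (e.g. advertising) then map to the same identifier.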

25 Performance objectives. Execution speed: how many events (state transitions) can be executed per hour? Complete crawl: given enough time, the strategy terminates the crawl when all states of the application have been found. Efficiency of finding states ("finding states fast"): if the crawl is terminated by the user before a complete crawl is attained, the number of discovered states should be as large as possible. For many applications, a complete crawl cannot be obtained within a reasonable length of time; therefore, the third objective is very important.

26 Our working assumptions. Deterministic RIA: the crawled RIA is deterministic from the point of view of the client (e.g. no dependence on updated database content). Given user input: we are provided a set of user inputs for the text fields and build the model that corresponds to these inputs. Reliable reset: we can reliably reset the application by reloading the seed URL (thus the graph is strongly connected).

27 Overview: Background (The evolving web, Why crawling, Our research project); Web Crawling (Traditional web crawling, RIA crawling, Performance objectives); Crawling strategies (Breadth-first, Depth-first, Greedy; Model-based strategies: Hypercube, Menu; Probabilistic strategy; Component-based crawling); Distributed crawling (Different architectures, Experimental results); Conclusions

28 Crawling Strategies. Most work on crawling RIAs does not intend to build a complete model of the application. Some consider standard strategies for the exploration of the graph model, such as Depth-First and Breadth-First. We have developed more efficient strategies based on the assumed structure of the application (model-based strategies, see below).

29 Example of a crawling sequence (Depth-First strategy): getURL(Bochmann); analyse DOM; execute(publications) and find new state Pub; analyse DOM; go back (reset): getURL(Bochmann); execute(research group) and find new state DSRG; analyse DOM; execute(publications) and find known state Pub; go back (reset): getURL(Bochmann); execute(hobbies) and find new state Hobbies; analyse DOM and find new URL PainterB; getURL(PainterB); analyse DOM; etc. [Diagram: the example site graph.] Such a systematic approach will execute all events and eventually find all states.
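For illustration, this Depth-First exploration with resets can be sketched as follows; the `browser` object and its methods (load, execute, enabled_events, dom) are a hypothetical abstraction of the browser automation layer, and `state_id` is the reduced-DOM hash sketched earlier.

```python
def depth_first_crawl(browser, seed_url):
    browser.load(seed_url)
    model = {}   # state id -> list of not-yet-executed events of that state
    path = []    # events executed from the seed to reach the current state
    while True:
        sid = state_id(browser.dom())
        if sid not in model:
            model[sid] = browser.enabled_events()   # new state discovered
        if model[sid]:
            event = model[sid].pop()    # go deeper along an unexecuted event
            browser.execute(event)
            path.append(event)
        elif path:
            path.pop()                  # go back: reset and replay the path
            browser.load(seed_url)      # a reset costs much more than an event
            for event in path:
                browser.execute(event)
        else:
            return model   # all events of all known states have been executed
```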

30 Resets. Each time there is a "go back" in the crawling sequence, the crawler has to go back to the seed URL (which takes more time than executing an event) and possibly execute several events in order to reach the desired state. For instance, in the Breadth-First strategy, the crawler has to later go back to the state DSRG in order to execute the event publications. Resets are much more expensive (in terms of execution time) than event executions; the number of resets should be minimized. [Diagram: the example site graph.]

31 Disadvantages of standard strategies. Breadth-First: no long sequences of event executions; very many resets. Depth-First: advantage: long sequences of event executions; disadvantage: when reaching a known state, the strategy takes a path back to a specific previous state for further event exploration; this path through known edges is often long and may involve a reset (overhead), whereas going back to another state with non-executed events may be much more efficient.

32 Greedy and model-based crawling. The Greedy strategy: forward exploration until a state with no unexecuted events is encountered; then find the closest state with an unexecuted event, and continue. Model-based crawling: a meta-model is an assumed structure of the application; the crawling strategy is optimized for the case that the application follows these assumptions, but it must be able to adapt to applications that do not satisfy the meta-model.
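A minimal sketch of the Greedy step "find the closest state with an unexecuted event": a breadth-first search over the already-explored part of the graph. The data structures (`unexecuted`, `transitions`) are hypothetical names, not taken from the slides.

```python
from collections import deque

def closest_state_with_work(current, unexecuted, transitions):
    # unexecuted:  state id -> list of not-yet-executed events
    # transitions: state id -> list of (already-executed event, next state id)
    queue = deque([(current, [])])
    seen = {current}
    while queue:
        state, path = queue.popleft()
        if unexecuted.get(state):
            return state, path   # path = known events leading to that state
        for event, nxt in transitions.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [event]))
    return None, []              # no unexecuted event left: crawl is complete
```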

33 Model-based crawling: two phases. State exploration phase: finding all states, assuming that the application follows the assumptions of the meta-model. Transition exploration phase: executing, in all known states, all remaining events that have not been executed during the state exploration phase. Order of execution: first state exploration, then transition exploration. Adaptation: if new states are discovered during the transition exploration phase, go back to the state exploration phase, and so on.

34 Comparing efficiency of finding states. [Chart: cost (number of event executions + reset cost, log scale) versus number of states discovered; 129 states in total.] This is for a specific application; such comparisons should be done for many different types of applications. Note: Hypercube gives results similar to Greedy.

35 Comparing efficiency of exploring all edges. [Chart: cost (number of event executions + reset cost) versus number of edges explored; 10364 edges in total.]

36 Model-based crawling: the Hypercube model. Assumptions: the state reached by a sequence of events from the initial state is independent of the order of the events; the enabled events at a state are those of the initial state minus those executed to reach that state. Pro: one can find optimal paths for the state and transition exploration phases. Con: very few applications follow the hypercube model. [Figure: a 4-dimensional hypercube.]

37 Model-based crawling: the Menu model. Example web site: Ikebana-Ottawa (ikebanaottawa.ca). Hypothesis: there are three types of events: menu events (the next state obtained is independent of the state where the event is executed), normal events (the next state depends on the current page), and self-loop events (the next state is equal to the current state). Crawling strategy: explore normal events before menu events, because menu events are not expected to find any new states. To classify the events, they must be executed from two different states.

38 Menu strategy: state exploration. From the current state, choose the next event according to the following event priority: 1. unclassified events not yet executed; 2. unclassified events executed once, from a different state; 3. normal events; 4. menu events (we do not expect to find a new state); 5. self-loop events (we do not expect to find a new state). If all events have already been executed on the current page: find a short path to a page with an event of high priority.
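The priority rule reduces to picking the enabled event with the smallest rank; the classification labels below are hypothetical names for the five categories of slide 38, maintained by the crawler as events get executed.

```python
PRIORITY = {
    "unclassified_unexecuted": 1,  # never executed anywhere
    "unclassified_once": 2,        # executed once, from a different state
    "normal": 3,
    "menu": 4,                     # not expected to find a new state
    "self_loop": 5,                # not expected to find a new state
}

def pick_event(enabled_events):
    # enabled_events: list of (event, classification) pairs.
    # Returns the pair with the best (lowest) rank, or None if the list is empty.
    return min(enabled_events, key=lambda pair: PRIORITY[pair[1]], default=None)
```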

39 Menu model: finding a path to the next event. Find a path on the current application model, based on executed edges and predicted edges (events that are locally non-executed, but globally executed once, are predicted to be of type menu). [Diagram: a model graph with executed and predicted edges.]

40 Probability strategy. This is a variation of the Greedy strategy. Inspired by the Menu strategy, we introduce event priorities. The priority of an event is based on statistical observations (made during the crawl of the application) of the number of new states discovered when executing the given event. The strategy is based on the belief that an event which was often observed to lead to new states in the past is more likely to lead to new states in the future.

41 Probability strategy: event priorities. Priority of events from the current state: the probability of a given event e finding a new state from the current state is estimated as P(e) = (S(e) + pS) / (N(e) + pN), where S(e) is the number of new states found by e and N(e) is the number of times e has been executed. This is a Bayesian formula; with pS = 1 and pN = 2, the initial probability is 0.5. If the current state s has no non-executed event: find a locally non-executed event e at some nearby state s' such that P(e) is high and the path from s to s' is short (note: the path from s to s' goes through events already executed). How to find an optimal combination of a high-priority event and a nearby state is described in our paper at ICWE 2012.
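The estimate is easy to state in code; the parameter names follow the formula on slide 41.

```python
def event_probability(s_e, n_e, p_s=1.0, p_n=2.0):
    # P(e) = (S(e) + pS) / (N(e) + pN), where S(e) is the number of new
    # states found by e and N(e) is the number of times e has been executed.
    return (s_e + p_s) / (n_e + p_n)

print(event_probability(0, 0))  # 0.5: initial probability of a fresh event
print(event_probability(3, 4))  # ~0.67: e found 3 new states in 4 executions
```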

42 Experiments. We did experiments with the different crawling strategies using the following web sites: Periodic table (local version), Clipmarks (local version), TestRIA, and Altoro Mutual.

43 Results: state exploration. [Chart.]

44 Results: transition exploration. Cost for a complete crawl: Cost = number of event executions + R * number of resets, with R = 18 for the Clipmarks web site. [Chart.]

45 Component-based crawling. In many web sites, the number of pages is immense because of the different orderings of elements or the combinations of several components: a complete crawl is not feasible. Revised coverage criterion: cover all components of the pages in the application (but not all combinations or orderings of these components). Assumption: components are independent of one another.

46 Examples of components. [Screenshots.]

47 Assumed structure of a page. [Diagram.]

48 Example: the Bebop application. [Screenshot.]

49 Performance. [Chart.]

50 Scalability. Execution time of the crawl as a function of the number of items stored in the application. As expected, normal crawling has exponential complexity; the component-based crawl appears to have quadratic complexity.

51 Overview: Background (The evolving web, Why crawling, Our research project); Web Crawling (Traditional web crawling, RIA crawling, Performance objectives); Crawling strategies (Breadth-first, Depth-first, Greedy; Model-based strategies: Hypercube, Menu; Probabilistic strategy; Component-based crawling); Distributed crawling (Different architectures, Experimental results); Conclusions

52 Distributed crawling. Observation: on average, executing an event and analysing the next state discovered take about 20 times longer than deciding on the next event to be executed. Question: can the crawling of a complex application be accelerated by distributing the crawling over several computers / cores?

53 Different distributed architectures. Architecture 1: a central coordinator keeps the information about the discovered application model. (1.1) Dynamic event allocation to crawlers: each crawler contacts the coordinator after each execution of an event and obtains the next event to be executed (the coordinator performs the crawling strategy). (1.2) Static event allocation to crawlers: the crawlers obtain the application model from the coordinator and perform the crawling strategy locally, only for their allocated events. Architecture 2: several coordinators share the information about the application model. A distributed hash table is used to allocate the states of the model to the different coordinators; each coordinator is associated with approximately 20 crawlers. The coordinators perform the crawling strategy, but using partial model information; different sharing schemes can be envisioned for exchanging information between the coordinators.
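As a toy illustration of architecture 2, a state can be allocated to a coordinator by hashing its identifier; this is a simple stand-in for the distributed hash table mentioned above, not the actual scheme used.

```python
import hashlib

def coordinator_for(sid, num_coordinators):
    # Map a state identifier to one of the coordinators deterministically,
    # so that every crawler agrees on which coordinator owns each state.
    digest = hashlib.sha1(sid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_coordinators
```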

54 Experimental results (architecture 1.2, BF strategy). Notes: the BF strategy has poor performance, but has the advantage that only the states of the model (not the transitions) must be shared with the crawlers. One sees the expected decrease in crawling time. The delay due to the coordinator is negligible, even for 15 crawlers. The static allocation of events leads to unequal loads; dynamic load sharing among crawlers may be useful.

55 Experimental results (architecture 1.1, Greedy strategy). Notes: the Greedy strategy has good performance. In this architecture, the model information is not shared with the crawlers. Again, one sees the expected decrease in crawling time.

56 Simulation results (architecture 2, Greedy strategy): performance depends on the sharing scheme. In case there is no unexecuted event from the current state, the coordinator has to find another state with an unexecuted event. Reset-only: use a reset to reach a different state. Local knowledge: find the shortest path (SP) to a new state based on local knowledge of the application model. Shared knowledge: use an SP based on knowledge sharing, piggy-backed on other messages. Forward exploration: a distributed algorithm for finding the SP. Notes: fixed number of crawlers, varying number of coordinators (overload ignored).

57 Overview: Background (The evolving web, Why crawling, Our research project); Web Crawling (Traditional web crawling, RIA crawling, Performance objectives); Crawling strategies (Breadth-first, Depth-first, Greedy; Model-based strategies: Hypercube, Menu; Probabilistic strategy; Component-based crawling); Distributed crawling (Different architectures, Experimental results); Conclusions

58 Conclusions. RIA crawling is quite different from traditional web crawling. Different crawling strategies can improve the efficiency of crawling. The crawling of a RIA can be effectively distributed over several crawling engines. We have developed prototypes of our crawling strategies, integrated with the IBM AppScan product.

59 References.
Background: Mesbah, A., van Deursen, A. and Lenselink, S., Crawling Ajax-based Web Applications through Dynamic Analysis of User Interface State Changes, ACM Transactions on the Web (TWEB), 6(1), a23.
Our papers:
Mirtaheri, S.M., Dincturk, M.E., Hooshmand, S., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., A Brief History of Web Crawlers, in Proceedings of CASCON 2013, November 2013.
Mirtaheri, S.M., Zou, D., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., Dist-RIA Crawler: A Distributed Crawler for Rich Internet Applications, in Proceedings of the 8th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2013), Compiegne, France, October 2013.
Choudhary, S., Dincturk, M.E., Mirtaheri, S.M., Jourdan, G.-V., Bochmann, G.v. and Onut, I.V., Building Rich Internet Applications Models: Example of a Better Strategy, in Proceedings of the 13th International Conference on Web Engineering (ICWE 2013), Aalborg, Denmark, July 2013.
Choudhary, S., Dincturk, M.E., Mirtaheri, S.M., Moosavi, A., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., Crawling Rich Internet Applications: The State of the Art, in Proceedings of CASCON 2012, November 2012.
Dincturk, M.E., Choudhary, S., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., A Statistical Approach for Efficient Crawling of Rich Internet Applications, in Proceedings of the 12th International Conference on Web Engineering (ICWE 2012), Berlin, Germany, July 2012.
Choudhary, S., Dincturk, M.E., Bochmann, G.v., Jourdan, G.-V., Onut, I.V. and Ionescu, P., Solving Some Modeling Challenges when Testing Rich Internet Applications for Security, in The Third International Workshop on Security Testing (SECTEST 2012), Montreal, Canada, April 2012.
Benjamin, K., Bochmann, G.v., Dincturk, M.E., Jourdan, G.-V. and Onut, I.V., A Strategy for Efficient Crawling of Rich Internet Applications, in Proceedings of the 11th International Conference on Web Engineering (ICWE 2011), Paphos, Cyprus, July 2011.
Benjamin, K., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., Some Modeling Challenges when Testing Rich Internet Applications for Security, in the First International Workshop on Modeling and Detection of Vulnerabilities (MDV 2010), Paris, France, April 2010.
Dincturk, M.E., Jourdan, G.-V., Bochmann, G.v. and Onut, I.V., A Model-Based Approach for Crawling Rich Internet Applications, submitted to a journal.

60 Questions? Comments? These slides can be downloaded from
