Financial Market Web Mining with the Software Agent PISA


Patrick Bartels and Michael H. Breitner
Institut für Wirtschaftsinformatik, Universität Hannover
Königsworther Platz 1, D-30167 Hannover, Germany
Email: bartels@iwi.uni-hannover.de and breitner@iwi.uni-hannover.de

Abstract: The World Wide Web (WWW) contains millions of hypertext pages presenting nearly all kinds of information. Specific effort is required to use this information as input to subsequent computations or to aggregate values from web pages. To do so, data are extracted from webpages and stored on a computer, either in text files or in databases. These manipulations are better performed by an autonomous software program than by human personnel. Here, we present a platform independent software agent called PISA (Partially Intelligent Software Agent) that autonomously extracts financial data from webpages and stores them on a local computer. Data quality is of the highest significance: PISA generates time series with user defined density, and these time series are of adequate quality for financial market analyses, e.g. forecasts with neural networks.

1 Introduction

Financial market websites contain financial market information from banks, brokers, exchanges and financial service providers: stock quotes, option prices and exchange rates in particular, and other information concerning exchange markets. The internet offers these data in many cases in near real time. Comparable up-to-dateness is usually only offered by commercial finance databases, whose fees are usually high; the internet often offers the same information free of charge. Here, this up-to-dateness and cost-freeness are exploited to build a financial database for free whose quality matches that of commercial financial databases.

Financial market websites usually do not contain static but dynamic content: the presented data change frequently. To minimize work, the presented data are usually stored in databases. When a webpage is requested by a browser, the needed information is queried from the database and the results are inserted into a specific template. Templates are skeletons of the resulting pages containing specific tags that define where specified pieces of information are filled in. This results in highly structured webpages. Since usually one single template is used for many kinds of webpages, once a scheme is recognized, it can be used to identify the demanded information on other pages of the same website.
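To make the scheme idea concrete, here is a minimal sketch in Java (the language PISA itself is implemented in, see Section 3): a regular expression serving as the scheme locates the quote field in a template-generated table row. The HTML snippet and the pattern are invented for illustration; PISA's actual extraction rules are described in Section 3.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Template-generated pages are so regular that one pattern ("scheme")
// locates the same field on every page built from the template.
// The HTML fragment and the pattern are invented examples.
public class SchemeDemo {
    public static void main(String[] args) {
        String page = "<tr><td>Siemens AG</td><td class=\"quote\">47,35</td></tr>";
        // Scheme: the quote always sits in a cell with class "quote".
        Pattern scheme = Pattern.compile("<td class=\"quote\">([0-9.,]+)</td>");
        Matcher m = scheme.matcher(page);
        if (m.find()) {
            System.out.println("Extracted quote: " + m.group(1)); // 47,35
        }
    }
}
```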

Here, such schemes are used to extract specified information and to generate not just a database but dense time series. A time series is a series or function of a variable over time: a particular variable takes a particular discrete value at a sequence of points in time. Here, quotes are used as the variables. Time series can be used to train artificial neural networks; for example, the FAUN¹ neurosimulator project uses neural networks to predict real market option prices, see [2], and to make short term forecasts, see [5]. These neural networks need input data of the highest quality.

Here, we present the platform independent software agent PISA to generate such time series. The resulting time series are suitable for neural network training, and the major problems of web mining that usually affect output data quality are prevented. The primary problem is that the structure of most webpages changes frequently, so the position of a piece of information may change; in this case either no information or wrong information might be extracted. No information is extracted, either, if a website is unavailable for a certain time. When information is distributed over different webpages, data from these webpages must be aggregated to realize added value, e.g. a price difference for arbitrage trades. To enable time series with flexible density, the agent has to extract the wanted information at adequately small intervals. As the internet is an international medium, websites use several international formats for dates, times and numbers; most existing web mining programs extract only the text representing these data, but to assure the highest data quality, dates, times and numbers need to be consistently formatted. Finally, the agent's data output limits further processing possibilities. Therefore, both text files and databases have to be supported to store the extracted data.

¹ FAUN = Fast Approximation with Universal Neural networks

2 Software Agents

2.1 Agent Paradigm

The primary advantage of using an agent is that agents can work very efficiently 24 hours a day and 7 days a week (except for breakdowns and maintenance work). Agents can handle dozens of extraction operations per minute automatically. Able to work for long periods of time at low cost, software agents are well qualified for web mining tasks.

There is no generally accepted definition of the term software agent, although the difference between normal programs and software agents has been discussed intensely for many years [cf. 4, 3]. A common characterization was proposed by Jennings and Wooldridge [cf. 7]; their definition is the origin of almost all current research on agent technology. Accordingly, an agent is a representative who works on behalf of a person and has the following attributes:

- Autonomy: The agent should be able to execute the assigned tasks on its own without any callback.
- Social behavior: Agents interact with other agents and at least with the user.
- Ability to react: Perception of the system environment the agent is "living" in, and the ability to react on the basis of more or less precisely defined decision patterns.
- Consciousness: An agent does not only react to events but can anticipate future incidents.

The proficiency of these four characteristics depends on the agent's aims. Here, the agent has to receive and process webpages automatically. Autonomy is required to work over a long period of time without user interaction; user interaction is, however, necessary for configuration, which requires only little social behavior. The agent has to react to the perceived situation. Consciousness is not necessary, since all necessary decisions can be made using hard coded rules: case differentiations within the program code are sufficient because all possible cases are known a priori. As the structure of the processed webpages is known in advance, the agent just has to filter user specified patterns without "thinking". This does not suffice to call the agent intelligent; the presented agent is called partially intelligent, and its intelligence will be further developed.
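The following sketch illustrates what such hard coded rules might look like: a plain case differentiation over the situations the agent can encounter. The page states and reactions are invented for illustration and are not taken from PISA.

```java
// Sketch of "hard coded rules": every situation the agent can encounter
// is known in advance, so a simple case differentiation suffices.
// The states and reactions below are invented for illustration.
public class ReactionRules {
    enum PageState { OK, NOT_MODIFIED, SERVER_ERROR, PATTERN_MISSING }

    static String react(PageState state) {
        switch (state) {
            case OK:              return "extract and store data";
            case NOT_MODIFIED:    return "skip, decrease polling frequency";
            case SERVER_ERROR:    return "retry later, log a gap";
            case PATTERN_MISSING: return "discard dataset, alert the user";
            default:              return "ignore";
        }
    }

    public static void main(String[] args) {
        System.out.println(react(PageState.SERVER_ERROR));
    }
}
```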

2.2 Agent Requirements

Financial market webpages usually contain several pieces of independent information on a single page, e.g. the stock quotes of a specific index. These information chunks have to be correctly and reliably identified, extracted and saved. Each step requires specific abilities of a web mining agent.

Receiving webpages: To receive a webpage, a request is sent to a webserver using the TCP/IP protocols, which therefore have to be supported. If the URL (Uniform Resource Locator) of the webpage containing the information is not known a priori, crawling methods are mandatory: the agent must recognize whether a webpage contains relevant information or not. For neural network training a continuous data flow is mandatory, so the agent has to be permanently available during trading hours. For efficiency, the agent should only work when it is reasonable to do so: financial websites are only updated when the underlying data change, so in times with only few changes the processing frequency can be decreased. Timer functions enable reasonable system utilization. Extraction intervals must be as small as possible to enable time series with optimal density, while minimal system utilization retains high priority.

Extraction: The structure of HTML (Hypertext Markup Language) documents is not always completely defined and can be irregular. HTML documents can contain errors, missing tags being a common example. Since browsers compensate for these problems, they are often not noticed by users and/or webmasters. The agent has to recognize and handle such problems to assure error tolerant HTML parsing. Once a webpage's source code has been received and parsed, regular expressions are advisable to identify and extract complex patterns from the HTML source code. String tokenizer methods are less powerful than regular expressions but usually faster; both should be supported, as the sketch below illustrates.
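The following sketch contrasts the two mechanisms on an invented delimiter-separated row: the string tokenizer addresses a value by its position, the regular expression by its pattern, which is slower but survives layout changes.

```java
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Contrast of the two extraction mechanisms named above, on invented data:
// StringTokenizer splits quickly at fixed delimiters, a regular expression
// describes the wanted pattern itself and is therefore more robust.
public class TokenizerVsRegex {
    public static void main(String[] args) {
        String row = "Siemens;47,35;47,40;12:05"; // name;bid;ask;time

        // StringTokenizer: position based, fast, breaks if a column moves.
        StringTokenizer st = new StringTokenizer(row, ";");
        st.nextToken();                                  // skip the name
        System.out.println("bid via tokenizer: " + st.nextToken());

        // Regular expression: pattern based, survives reordered columns.
        Matcher m = Pattern.compile("\\d+,\\d{2}").matcher(row);
        if (m.find()) System.out.println("bid via regex: " + m.group());
    }
}
```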

The internet offers financial data from all over the world. Depending on the target group, different international number, time and date formats are used. These have to be recognized and reformatted into user defined formats, which increases flexibility for further processing programs. Some important information, e.g. the date and time of a quote, is often split into multiple pieces, and the agent has to be able to merge them. For this, as well as for adjusting different time zones, date, time and number information has to be transformed into numbers. All extracted values have to be accessible as variables in order to calculate new values and to merge distributed items.

Data storage: File handling methods are mandatory to save the extracted patterns, and the user should be able to choose the output file format. Both plain text files and XML files (XML = Extensible Markup Language) should be supported. Large amounts of data are easier to handle in a database than in text files. The most common protocol for accessing databases is ODBC (Open Database Connectivity), which should be supported by the programming language as well.

Besides the mentioned requirements there are some general ones. The agent should be as platform independent as possible to assure flexible application. The hazard of breakdowns has to be minimized to assure a continuous data flow. Input data have to be correct: accurateness is a very important aim, and the agent has to provide rules against which extracted data can be checked for plausibility.
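A minimal sketch of the two storage paths named above, assuming an ODBC data source called "pisa" reachable through the classic JDBC-ODBC bridge of that era; the file name, table and columns are likewise invented for illustration.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of the two storage paths: plain text file and ODBC database.
// File name, data source name "pisa", table and columns are invented.
public class StorageDemo {
    public static void main(String[] args) throws IOException, SQLException {
        String isin = "DE0007236101";
        double bid = 47.35;
        String timestamp = "2003-08-17 12:05:00";

        // Path 1: append one dataset to a plain text file.
        try (FileWriter out = new FileWriter("quotes.csv", true)) {
            out.write(isin + ";" + bid + ";" + timestamp + "\n");
        }

        // Path 2: insert the same dataset via the JDBC-ODBC bridge
        // (assumes an ODBC data source named "pisa" is configured).
        try (Connection con = DriverManager.getConnection("jdbc:odbc:pisa");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO quotes (isin, bid, ts) VALUES (?, ?, ?)")) {
            ps.setString(1, isin);
            ps.setDouble(2, bid);
            ps.setString(3, timestamp);
            ps.executeUpdate();
        }
    }
}
```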

2.3 Existing Agents

Web mining agents have been developed for several years now, usually for one specific task each. This explains several drawbacks, which are summarized here. All considered programs are able to request and receive webpages. Missing timer functionality prevents extraction tasks from being executed at specified intervals; performance is further hurt by missing multithreading support. Most agents support adequate crawling methods. Usually regular expressions are used to identify and extract pieces of information, but the position of a pattern relative to another pattern is usually not supported: the position within the source code can only be defined by constructing one complex pattern, and the HTML structure is hardly taken into account. The resulting extraction rules are vulnerable to errors. Even if the right value is correctly extracted, often no formatting methods are provided. Also, most wrapper tools can only save the extracted results in plain text files; only few of them support XML, and ODBC support is very rare. The mentioned drawbacks do not all appear in every considered agent, but no program is fully capable of the given task. Upgrading an existing program is infeasible, since the considered programs are usually not well documented and their source code is not accessible. Inadequate extensibility and missing functions are sufficient reasons to develop our own web mining agent. An overview of public domain and commercial web mining agents is available in [6].

3 Software Agent PISA

3.1 Design and Implementation

The mentioned requirements lead to special demands on the language PISA is programmed in. The major considered languages and their attributes are shown in Table 1, which summarizes the results of a detailed consideration [cf. 1]. The languages are applicable to different degrees. Most missing functions could be added using public domain modules, but since such modules are usually not well documented, the language should support the required features inherently. Here, Java is the most suitable programming language to achieve the mentioned goals.

Table 1. Overview of the considered programming languages. The candidates PHP, JavaScript, Perl, C++, C# and Java were compared with respect to internet functions, regular expressions, string tokenizer functions, multithreading, error handling, timer functions, file handling, ODBC database support, remote method invocation and platform independency.

PISA is completely realized in Java and consists of five components, shown in Figure 1. The component PisaMain initiates and starts all user defined extraction tasks; their specifications are taken from a configuration file. The initiated tasks run as independent threads and are executed simultaneously. Each task is represented by a single PisaCrawler object. These objects request the defined webpages and ensure that no webpage is requested multiple times during the crawling process. Each received webpage is processed by the HtmlDocument component, which parses the passed webpage's source code. The source code is further processed by the PisaGrabber component, which identifies and returns the patterns the user wants. The extracted data are saved in a plain text file, an XML file or a database.

Fig. 1. Major modules of PISA.

3.2 Functionality

Due to the dynamic nature of the web, most information extraction systems focus on specific extraction tasks. Here, we concentrate on agent based generation of dense time series and focus on the specific problems of this task.

PisaMain: PISA starts by executing the main module PisaMain, which initiates the extraction tasks. They are defined by the user in a file with a specific syntax. The PisaMain object starts one PisaCrawler object per URL. These PisaCrawler objects are started with a one second time delay each. For dense time series the request interval is very small; if too many pages are requested from a webserver too frequently, the webserver might crash or might not answer some requests. The latter results from a standard mechanism that avoids webserver overload by ignoring requests once a specified number of requests per period is exceeded. Both effects cause information gaps in the time series. The delay time is adjustable.
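The start-up sequence just described can be pictured with the following sketch: one independent thread per configured URL, started with a one second stagger. The class name and the hard-coded URLs are illustrative stand-ins for PISA's configuration file, not its actual code.

```java
import java.util.List;

// Illustrative reconstruction of the described start-up sequence: one
// independent thread per configured URL, started one second apart.
// Class name and URLs are invented; PISA reads them from a config file.
public class PisaMainSketch {
    public static void main(String[] args) throws InterruptedException {
        List<String> urls = List.of(          // stand-in for the config file
                "http://example.com/quotes/siemens",
                "http://example.com/quotes/daimler");
        for (String url : urls) {
            Thread crawler = new Thread(() -> {
                // a real PisaCrawler would request and evaluate the page here
                System.out.println("crawling " + url);
            });
            crawler.start();
            Thread.sleep(1000);               // stagger starts to spare the server
        }
    }
}
```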

PisaCrawler: Each PisaCrawler object is executed at an adjustable interval. The interval is defined in seconds, which is sufficient for dense time series; shorter intervals are possible but not necessary, since financial quotes are updated at most every second. If no interval is given, the PisaCrawler object terminates itself after the first evaluation. The PisaCrawler component crawls websites either at every execution or just once at the first run; in the latter case PISA recognizes interesting webpages by user defined requirements and memorizes their URLs, which all future accesses use. Each requested webpage is represented by an HtmlDocument object and is further processed by the PisaGrabber object.

HtmlDocument: The invoking class passes a URL; the HtmlDocument component requests and analyzes the corresponding webpage. PISA handles common syntactical errors reliably through an error tolerant HTML parsing process. The source code is converted into XHTML compliant text. Afterwards, all needed tags are identified using regular expressions for the start and end position of a tag; only needed tags are processed, to decrease processing time. Once a tag is identified, its attributes such as size, color and content are analyzed and stored in a tag object, which represents the HTML tag. For each kind of tag an array is created that contains the objects of that kind in order of appearance. This enables a successive comparison of the array fields and the search for a specific pattern.

PisaGrabber: The PisaGrabber module extracts the demanded patterns from a passed HtmlDocument object. The easiest way to extract a pattern is successive comparison of each array field with the wanted pattern. This approach is neither convenient nor reliable, because the position of occurrence may change: if a stock's bid price is the second occurrence of a 2-digit decimal number today, it can be the third one tomorrow. Here, the position of a requested pattern is instead defined relative to an anchor, which is itself defined by a pattern. An example clarifies this procedure. A typical table with stock prices is shown in Figure 2. The last given price can be found using a pattern for a 2-digit decimal number, with the current date as the anchor: the current price's position is the fourth table cell after the current date. Both patterns are specified by regular expressions.
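The following sketch reconstructs the anchor technique under simplifying assumptions: the table is reduced to a list of cell contents, and the invented cells follow the example above (date, then open, high, low and last price, so the wanted value is the fourth cell after the anchor).

```java
import java.util.List;
import java.util.regex.Pattern;

// Reconstruction of the anchor technique: the wanted value is addressed
// relative to an anchor pattern, not by an absolute position.
// Cell contents and the offset of 4 are invented to match the example.
public class AnchorDemo {
    public static void main(String[] args) {
        // Contents of all table cells of a page, in order of appearance:
        // date, open, high, low, last price.
        List<String> cells = List.of("08/17/03", "47,10", "47,60", "46,95", "47,35");

        Pattern anchor = Pattern.compile("\\d{2}/\\d{2}/\\d{2}"); // current date
        Pattern wanted = Pattern.compile("\\d+,\\d{2}");          // 2-digit decimal

        for (int i = 0; i < cells.size(); i++) {
            if (anchor.matcher(cells.get(i)).matches() && i + 4 < cells.size()) {
                String candidate = cells.get(i + 4); // 4th cell after the anchor
                if (wanted.matcher(candidate).matches()) {
                    System.out.println("last price: " + candidate); // 47,35
                }
                break;
            }
        }
    }
}
```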

Fig. 2. An example of a webpage presenting the quotes of the Siemens AG at the NYSE.

To use this approach the user has to know four things:

1. the pattern of the requested information;
2. the anchor pattern;
3. the number of tags between the anchor pattern and the wanted information;
4. the kind of tag the patterns are formatted with.

Optionally, a name can be defined for each extracted piece of information. This enables storage in platform independent XML files. The names can also be used like variables, either to calculate new values or to merge several pieces of information, e.g. a complete date that is split into date and time. If the user specifies the time zone PISA is working in, extracted worldwide times are converted into local time.

To assure high data quality, accurateness of the extracted data is very important. Information items can be declared mandatory; if a mandatory item is not available on a webpage, the whole dataset is discarded. In some cases quotes are published with a certain time delay. If the quote date is not available, the current date is extracted, but the time of the quote and the date of the visit might not match. For example, with a delay of 10 minutes, the current time may be 08/17/03 12:05 am while the quote time is 11:55 pm of the previous day; naively merging the current date with the quote time yields a time in the future, i.e. 08/17/03 11:55 pm instead of the correct 08/16/03 11:55 pm. PISA handles this problem. Additionally, rules check whether the extracted data fit certain criteria, e.g. that a value lies within a specified fluctuation margin. Extracting data from several webpages raises the problem that the display formats of numbers and dates most likely differ from each other; PISA formats text, numbers and dates in adjustable formats to facilitate import into further processing programs.
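A minimal sketch of these normalization steps, reusing the numbers from the example above: a German-format decimal is parsed into a number, the quote time is merged with the visit date, and a merged timestamp lying in the future is moved back one day. Formats and values are illustrative; PISA's actual conversion code is not published.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Sketch of the reformatting steps described above: parse a German
// decimal, merge quote time with visit date, and correct a merged
// timestamp that would lie in the future (delayed quote from yesterday).
public class NormalizeDemo {
    public static void main(String[] args) throws ParseException {
        // "47,35" in German format -> 47.35
        Number bid = NumberFormat.getNumberInstance(Locale.GERMANY).parse("47,35");
        System.out.println("bid: " + bid.doubleValue());

        // Merge the extracted quote time with the current (visit) date.
        LocalDate visitDate = LocalDate.of(2003, 8, 17);          // date of visit
        LocalTime quoteTime = LocalTime.parse("11:55 PM",
                DateTimeFormatter.ofPattern("h:mm a", Locale.US));
        LocalDateTime now = LocalDateTime.of(2003, 8, 17, 0, 5);  // 12:05 am
        LocalDateTime merged = LocalDateTime.of(visitDate, quoteTime);

        // A delayed quote whose merged time lies in the future
        // must belong to the previous day.
        if (merged.isAfter(now)) merged = merged.minusDays(1);
        System.out.println("quote timestamp: " + merged);         // 2003-08-16T23:55
    }
}
```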

3.3 Examples

PISA was tested in detail in [1]; here, we summarize the results. We tested the agent in different environments and for different tasks. The underlying website specific schemes were generated manually.

To test the crawling function we deliberately did not use only financial websites, because there the URL of specific information is usually known a priori. Instead, eBay customer profiles were generated: starting from an eBay user's feedback page, the agent followed only the links to auction pages, recognized by a pattern specified by the user, and extracted the auction details. Such user profiles are interesting for cross-selling activities.

To test PISA's performance we extracted prices for 60 German options from three different websites, see [1]. All options were extracted from different pages; the 180 pages were processed at an interval of 2 minutes each. The system worked reliably and stably. Because of the resulting system utilization, a high-capacity computer is recommended to keep processing times adequate. The resulting time series are used to predict real market option prices, see [2].

PISA also extracted currency exchange rates for US Dollar and Euro from 06/03/03 to 07/16/03 from Finanztreff (www.finanztreff.de) at an interval of 10 seconds. Besides the current bid and ask prices, the date, time and German security identification number were reliably extracted. The results are used for short term forecasts. As the results in [5] show, the data quality is very high: at most 2 or 3 consecutive values are missing in the time series. These gaps are filled using linear approximations. The most likely reason for missing values is network failures.

4 Conclusion

Tests showed that PISA generates high quality data sets, e.g. for neural network training. Even a high number of webpages can be processed at short intervals. The intervals can be defined almost arbitrarily by the user; the process is only limited by the performance of the computer the agent is running on. Even though the extraction process works well, some limitations remain that affect data quality: network failures, server side blackouts and maintenance work make page requests impossible. Another source of errors is significant layout changes of a webpage; today, only small changes can be anticipated by PISA.

5 References

1. Bartels, P., Breitner, M. H. (2003): Automatic Extraction of Derivative Market Prices from Webpages Using a Software Agent. IWI Discussion Paper Series No. 4, Institut für Wirtschaftsinformatik, Universität Hannover, pp. 1-32.
2. Breitner, M. H. (1999): Heuristic Option Pricing with Neural Networks and the Neurocomputer Synapse 3. Optimization 81, 319-333.
3. Brenner, W., Zarnekow, R., Wittig, H. (1998): Intelligent Software Agents: Foundations and Applications. Springer, Berlin.
4. Caglayan, A. (1997): Agent Sourcebook: A Complete Guide to Desktop, Internet, and Intranet Agents. John Wiley & Sons, Wien.
5. Mettenheim, H.-J. (2003): Entwicklung der grob granularen Parallelisierung für den Neurosimulator FAUN 1.0 und Anwendungen in der Wechselkursprognose [Development of the coarse-grained parallelization for the neurosimulator FAUN 1.0 and applications in exchange rate forecasting]. Dissertation, Hannover.
6. Tredwell, R., Kuhlins, S. (2003): Wrapper Development Tools. http://www.wifo.uni-mannheim.de/~kuhlins/wrappertools/.
7. Wooldridge, M., Jennings, N. (1995): Intelligent Agents: Theory and Practice. Knowledge Engineering Review 10, 115-152.