Toward the integration of informatic tools and GRID infrastructure for Assyriology text analysis


1 58th Rencontre Assyriologique Internationale (RAI), "Private and State", 16-20 July 2012, Leiden
Toward the integration of informatic tools and GRID infrastructure for Assyriology text analysis
Giovanni Ponti, Ph.D., ENEA UTICT-HPC, giovanni.ponti@enea.it
Joint work with D. Alderuccio, G. Mencuccini, A. Rocchi, S. Migliori, G. Bracco, P. Negri Scafa

2 Outline
- Introduction (data analysis problem)
- Knowledge Discovery in Databases (KDD)
- Data Mining
- Clustering
- Proposal
- Experimental Analysis
- Eshnunna corpus
- Clustering algorithm and settings
- ENEA-GRID/CRESCO Infrastructure
- Results
- Conclusion and Future Works

3 Data Analysis Problem
- Handling large amounts of data is often hard and can easily lead to errors and wrong interpretations
- Importance of the data analysis process:
  - Highlights the main data features
  - Helps in understanding the data
  - Reduces data dimensionality (aggregated results)
- Need for automatic/semi-automatic Computer Science tools to analyse the data

4 Knowledge Discovery in Databases (KDD)
- "Knowledge Discovery in Databases is the non-trivial process of identifying novel, valid, potentially useful, and ultimately understandable patterns in the data" (Fayyad et al., 1996)
- Focus: the clustering task

5 Data Mining
- Data mining comprises various techniques:
  - Decision Trees
  - Neural Networks
  - Association Rules
  - Clustering
- Objective: analyze huge amounts of data and rearrange them into homogeneous schemas and structures that emphasize hidden data features

6 Clustering
- Organizing data in homogeneous groups (i.e., clusters) in such a way that objects within the same group are highly similar, whereas objects in different groups are dissimilar
- Objects in the same group share common hidden features

7 Clustering Key Aspects (1)
- Crucial aspects of data clustering:
  - Data Representation (how data are structured and how data features are represented)
  - Relational Model (Database Systems)

8 Clustering Key Aspects (2)
- Similarity/distance measures (the measure employed to discover data groups)
- Domain-specific solutions:
  - Euclidean Distance (numerical data)
  - Jaccard Similarity (categorical data)
  - Cosine Similarity (text data)
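
The three measures named on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in the study; the function names are hypothetical and the standard textbook definitions are assumed.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_similarity(a, b):
    """Number of shared items divided by the number of distinct items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(a | b)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention assumed here: an empty vector matches nothing
    return dot / (norm_a * norm_b)
```

Cosine similarity ignores vector length, which is one common reason it is preferred for texts of different sizes: a long and a short letter using the same terms in the same proportions score close to 1.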

9 Clustering Algorithms
- Three main algorithm families:
  - Partitional (separate the data space into regions)
  - Hierarchical (build a data hierarchy according to an agglomerative or divisive strategy)
  - Density-based (discover highly dense regions of different shapes)


10 K-Means
- In our study, we resorted to partitional algorithms and, in particular, to the well-known K-means algorithm
- It partitions a dataset into k groups (i.e., clusters), in which each object is assigned to the cluster with the nearest mean
- Iterative process
- Convergence is proven and occurs when no objects have been relocated (i.e., no group changes)
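
The iterative process described on this slide can be sketched as follows. This is a minimal illustration in plain Python, assuming squared Euclidean distance and random initial centroids drawn from the data; the function name `kmeans` and its parameters are hypothetical, and the slides do not say which implementation or initialization was actually used.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Assign each point to the nearest mean until no point is relocated."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points sampled from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the nearest mean.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # convergence: no object changed group
        labels = new_labels
        # Update step: recompute each mean from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Calling `kmeans(points, 2)` on a list of numeric tuples returns one cluster label per point and the final means. The result depends on the initial centroids, so in practice the algorithm is often run several times with different seeds.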

11 Our Proposal and Aim of the Work
- Proposal: define a methodology that exploits informatics to analyze transliterated corpora of cuneiform tablets from Eshnunna
- Settings:
  - Tool: text mining algorithm (data mining on texts)
  - Dataset: corpus of 50 transliterated letters from the Eshnunna kingdom
- Goal: identify groups of texts that are similar to each other and discover non-trivial relations and patterns (i.e., information not clearly expressed in the corpus) to ease and guide the work of assyriologists and linguists

12 Experimental Analysis
- Steps:
  - A (short) description of the Eshnunna corpus
  - Computer Science and Eshnunna
  - ENEA-GRID/CRESCO environment
  - Results

13 Eshnunna Texts
- Corpus of 50 letters from the Old Babylonian kingdom of Eshnunna
- Prose texts: well articulated, homogeneous, suitable for text analysis
- Texts not used:
  - Administrative (too many names)
  - Contracts (too many formulas)

14 Computer Science and Eshnunna
- Computer Science tools help assyriologists by providing a better representation of the data to be analyzed
- Advantages:
  - DBMSs (DataBase Management Systems) can be exploited to store texts
  - Texts are structured, as they can be represented according to their features (i.e., terms)
  - Texts can be analyzed by means of statistical tools that discover data correlations
  - Data can be reorganized, filtered, and manipulated using query languages
  - Data can be easily shared among the assyriology community (e.g., web-based access)

15 Processing of the Eshnunna Corpus
- Raw cuneiform texts have been preprocessed so that they can be represented and analyzed by our tool
- Steps:
  - Cuneiform texts are transliterated in a Unicode-based font
  - Graphical forms in the transliterated texts are lemmatized (i.e., nouns, adjectives, and verbs are reduced to their standard base form)


16 Clustering on the Eshnunna Corpus
- Choices:
  - Algorithm: K-means
  - Number of groups (clusters): from 2 (low specificity) to 20 (high specificity)
  - Data Representation: each term has a relevance that depends on the statistical/correlation measure employed
  - Data Modeling: Vector Space Model, in which a document is seen as a vector of its terms
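
A small sketch may help make the Vector Space Model concrete. The slides do not name the term-relevance measure, so TF-IDF is assumed here purely for illustration; the function name `tfidf_vectors` is hypothetical, and the English tokens in the usage note are placeholders, not lemmas from the corpus.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn tokenized documents into term-weight vectors over a shared vocabulary.

    Assumption: relevance = term frequency * log(N / document frequency).
    """
    vocabulary = sorted({term for doc in documents for term in doc})
    n_docs = len(documents)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append(
            [
                (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
                if doc
                else 0.0
                for term in vocabulary
            ]
        )
    return vocabulary, vectors
```

For example, `tfidf_vectors([["field", "barley", "field"], ["field", "water"]])` yields one vector per document with one component per vocabulary term. A term that occurs in every document receives weight 0 under this scheme, since it does not help tell documents apart.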

17 ENEA-GRID/CRESCO Infrastructure (1)
- GRID and High Performance Computing (HPC) infrastructures provide a powerful integrated environment for:
  - Data Storage
  - Data Visualization
  - Data Analysis
- Suitable environment for managing and analyzing complex and large textual corpora


18 ENEA-GRID/CRESCO Infrastructure (2)
- ENEA-GRID provides a unified and homogeneous environment for ENEA computational resources located in 6 computing centers connected via the GARR network
- It offers:
  - More than 40 Tflops of integrated computational power
  - Multiplatform systems, i.e., Linux x86_64 (5000 cores for CRESCO systems)
  - Unified access to remote resources via SSH, NX, and the FARO web portal
  - A distributed file system (AFS) and a parallel high-performance one (GPFS)
  - Cloud services, Virtual Labs, and resource monitoring systems

19 Results
- We performed a two-stage analysis:
  - Quantitative: exploiting quality-based indexes for clustering evaluation
  - Qualitative: describing data relations and affinities discovered by the clustering algorithm (performed by the domain expert)

20 Quantitative Analysis
- We resorted to a well-known quality index for clustering evaluation, called Q
- It is based on cluster inter-similarity and intra-similarity
- Q ranges within [-1, 1], where -1 is the lowest clustering quality and 1 the highest
- Results: clustering on the Eshnunna texts achieves quality values from 0.2 to 0.6 (depending on the number of clusters), which indicate high-quality clustering solutions
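
The slides do not give the formula for Q. As a hedged illustration only, one index with the stated properties is the mean intra-cluster similarity minus the mean inter-cluster similarity, which stays within [-1, 1] whenever pairwise similarities lie in [0, 1]. The definition below and the name `quality_index` are assumptions, not the index actually used in the study.

```python
from itertools import combinations


def quality_index(similarity, labels):
    """Assumed definition: mean intra-cluster minus mean inter-cluster similarity.

    similarity: square matrix (list of lists) of pairwise similarities in [0, 1].
    labels: cluster identifier of each object, in matrix order.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            intra.append(similarity[i][j])
        else:
            inter.append(similarity[i][j])
    mean_intra = sum(intra) / len(intra) if intra else 0.0
    mean_inter = sum(inter) / len(inter) if inter else 0.0
    return mean_intra - mean_inter
```

Under this assumed definition, a value near 1 means objects resemble their own cluster far more than other clusters, while a negative value means the grouping cuts across the natural similarities.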

21 Qualitative Analysis
- Qualitative analysis aims to explore the clusters and describe common features and relations among the data
- A domain expert (an assyriologist) is needed to discover data affinities in clusters
- Some of the most interesting relations:
  - Cluster 1: same main character (Tutub-magir), same administrative context (šatammu), same theme (fields)
  - Cluster 2: same main character (Tutub-magir), same administrative context (palace, different from cluster 1), same theme (cattle)
  - Cluster 3: related themes (fields, barley, water)
  - Cluster 4: same main theme (religious)

22 TIGRIS Web Portal
- We propose a Web-based Virtual Lab for assyriologists
- TIGRIS: Toward Integration of e-tools in GRId infrastructure for e-assyriology
- Access to: Documentation, Texts, Software, Text Mining tools, ENEA-GRID high performance computation...


23 Conclusion
- We proposed a methodology for analyzing transliterated Old Babylonian texts from Eshnunna
- We exploited text mining tools, in particular clustering, to discover homogeneous groups and hidden relations in the data
- Clustering results are highly effective in discovering high-quality groups and in highlighting interesting relations among the data

24 Thanks! Giovanni Ponti, Ph.D. ENEA UTICT-HPC ENEA-GRID/CRESCO TIGRIS Web Portal
