Medical Records Clustering Based on the Text Fetched from Records

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.

IJCSMC, Vol. 5, Issue 5, May 2016. © 2016, IJCSMC All Rights Reserved.

Ms. Ramya.S.Bhat ¹, Mrs. A.Rafega Beham ²
¹ M.Tech Scholar, Department of Information Science and Engineering, New Horizon College Of Engineering, Bangalore, India
² Asst. Professor, Department of Information Science and Engineering, New Horizon College Of Engineering, Bangalore, India
¹ bhatramyas@gmail.com, ² rafeeka.rifa@gmail.com

Abstract- This paper describes how the rich data available in patients' medical records can be clustered so that hidden information can be retrieved from it. We first collect 49 patients' medical records and use annotators to extract text describing the symptoms that occurred and the names of medical drugs. The fetched text is clustered and stored in a file. When a combination of medical terms taken from the medical documents is given as a query, the search engine shows the clustered documents. We use MetaMap and MedEx as annotators for extracting the symptom names and the pharmaceutical names. For clustering the fetched data we use multi-view NMF, which is a clustering technique.

Keywords:- multi-view NMF, MetaMap, MedEx, document clustering.

I. INTRODUCTION

Data mining is the process of extracting hidden information from a large data set in order to find useful patterns. There are many types of data mining. One of these techniques is clustering. Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters).
It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, document clustering and computer graphics. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

One application of the clustering technique, i.e. document clustering, is explained in this paper. Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users. The applications of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications.

II. RELATED WORK

First, we pre-process the clinical notes to identify their words and sentences. During pre-processing, we use a section annotator to identify the different sections of each clinical note. The section annotator depends on the section header information in the clinical notes. We also use a negation annotator to remove negated symptom and medication names. Pre-negation terms are negation words such as "avoid", "deny", "cannot", "without", and so on. Post-negation terms are negation words such as "free", "was ruled out", and so on. After pre-processing, we use a symptom annotator based on MetaMap to extract symptom names from the clinical notes. Meanwhile, we use a medication annotator based on the MedEx system to extract medication names from the clinical notes.
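The negation step described above can be illustrated with a minimal Python sketch. It is not the annotator used in this paper: the function names `is_negated` and `keep_affirmed`, the five-token window, and the extra trigger "denies" are assumptions made for illustration, and the trigger lists contain only the examples quoted in the text.

```python
import re

# Trigger words taken from the examples in the text; "denies" is an added
# assumption so that the inflected form is also caught.
PRE_NEGATION = ["avoid", "deny", "denies", "cannot", "without"]
POST_NEGATION = ["free", "was ruled out"]


def is_negated(sentence, term, window=5):
    """Return True if a trigger occurs within `window` tokens of `term`."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    term_tokens = term.lower().split()
    n = len(term_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != term_tokens:
            continue
        # Pre-negation triggers are searched before the term,
        # post-negation triggers after it.
        before = " ".join(tokens[max(0, i - window):i])
        after = " ".join(tokens[i + n:i + n + window])
        if any(re.search(rf"\b{re.escape(t)}\b", before) for t in PRE_NEGATION):
            return True
        if any(re.search(rf"\b{re.escape(t)}\b", after) for t in POST_NEGATION):
            return True
    return False


def keep_affirmed(sentence, terms):
    """Drop the extracted terms that are negated in the sentence."""
    return [t for t in terms if not is_negated(sentence, t)]


negated_pre = keep_affirmed("Patient denies chest pain.", ["chest pain"])
negated_post = keep_affirmed("Pneumonia was ruled out.", ["pneumonia"])
affirmed = keep_affirmed(
    "Patient reports nausea and takes metformin.", ["nausea", "metformin"]
)
```

This sketch checks one sentence at a time and does not handle scope terminators such as "but", so a negation early in a long sentence can spill over onto a later term. The example sentences are invented for illustration.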
We use MetaMap to extract symptom names from clinical notes. MetaMap is a program that maps biomedical text to concepts in the UMLS Metathesaurus. Since MetaMap returns all types of concepts, we keep only the concepts related to symptom names, such as concepts labeled sosy, which represents Sign or Symptom. We use the MedEx system to extract medication names from clinical notes.

Advantages
1. By using symptom names and medication names, the clustering performance can be improved.
2. It also indicates that multi-view NMF can achieve better results than NMF.

Unsupervised learning algorithms such as principal components analysis and vector quantization can be understood as factorizing a data matrix subject to different constraints. Depending upon the constraints utilized, the resulting factors can be shown to have very different representational properties. Principal components analysis enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability. On the other hand, vector quantization uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes. It has been shown that non-negativity is a useful constraint for matrix factorization that can learn a parts representation of the data. The non-negative basis vectors that are learned are used in distributed, yet still sparse combinations to generate expressiveness in the reconstructions. In this submission, we analyze in detail two numerical algorithms for learning the optimal non-negative factors from data. We formally consider algorithms for solving the following problem:

Non-negative matrix factorization (NMF): Given a non-negative matrix V, find non-negative matrix factors W and H such that:

V ≈ WH    (1)

NMF can be applied to the statistical analysis of multivariate data in the following manner. Given a set of multivariate n-dimensional data vectors, the vectors are placed in the columns of an n x m matrix V, where m is the number of examples in the data set. This matrix is then approximately factorized into an n x r matrix W and an r x m matrix H. Usually r is chosen to be smaller than n or m, so that W and H are smaller than the original matrix V. This results in a compressed version of the original data matrix.

What is the significance of the approximation in Eq. (1)? It can be rewritten column by column as v ≈ Wh, where v and h are the corresponding columns of V and H. In other words, each data vector v is approximated by a linear combination of the columns of W, weighted by the components of h. Therefore W can be regarded as containing a basis that is optimized for the linear approximation of the data in V. Since relatively few basis vectors are used to represent many data vectors, a good approximation can only be achieved if the basis vectors discover structure that is latent in the data.

Some of the applications of non-negative matrix factorization are:
- Text mining for surveillance
- Electronic mail subcollections
- Term weighting
- Document clustering

Our contributions in this paper are:
1) We present a system for extracting symptom/medication names from clinical notes.
2) We apply multi-view NMF to evaluate the effects of using medication/symptom names to improve the clinical document clustering results.
3) We fetch the data from the search engine by sending a query.
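The factorization in Eq. (1) can be illustrated with a small pure-Python sketch. It assumes the widely used multiplicative update rules for the squared-error objective, since the text does not say which algorithm was run, and it is plain single-view NMF rather than the multi-view variant. The helper names, the toy matrix, and the rule of assigning each document to the basis vector with the largest weight are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(v, r, iterations=500, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix v into w (n x r) and h (r x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(r)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(r)]
             for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def cluster_labels(h):
    """Assign each column (document) to the basis vector with the largest weight."""
    return [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(h[0]))]


# Toy term-by-document matrix (rows: terms, columns: documents), invented
# for illustration: documents 0-1 share two terms, documents 2-3 share two others.
V = [
    [2.0, 4.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 1.0, 1.0],
]
W, H = nmf(V, r=2)
labels = cluster_labels(H)
error = frobenius_error(V, W, H)
```

In this toy case the two groups of documents use disjoint terms, so each column of H ends up dominated by one basis vector and the label assignment separates the groups. Which group receives label 0 or 1 depends on the random initialization.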
Fig 1. Overview of Designed System

The execution of the methodology is shown below. In this paper, a set of 49 patients' records was taken to extract the symptoms that occurred and the medical drug names. The extraction of the data is shown below:
Fig 2. Extraction of medical drug name and symptom (record 0-5)
Fig 3. Extraction of medical drug name and symptom (record 6-11)

Fig 6. Extraction of medical drug name and symptom (record 24-29)
Fig 7.

Fig 8. Extraction of medical drug name and symptom (record 36-41)
Fig 9.

Clustering of the documents
Fig 10. Clustering of documents (cluster 0-31)
It shows that the first record is in the first cluster, the second record is in the third cluster, and so on.
Fig 11. Clustering of documents (cluster 32-48)
After the documents are clustered, a query is sent through a search engine to retrieve the clustered data on the basis of the symptoms that occurred and the medical drug names in the patients' medical records.
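The retrieval step can be sketched as a lookup over the stored cluster assignments. The paper does not specify how the search engine matches terms, so the sketch below assumes that a record matches only when it contains every query term. The record layout, the function name `search`, and the sample data are hypothetical.

```python
from collections import defaultdict


def search(records, query_terms):
    """Group the ids of records that contain every query term by cluster label."""
    wanted = {t.lower() for t in query_terms}
    hits = defaultdict(list)
    for rec in records:
        terms = {t.lower() for t in rec["terms"]}
        if wanted <= terms:  # assumption: all query terms must be present
            hits[rec["cluster"]].append(rec["id"])
    return dict(hits)


# Invented sample: each record holds its extracted terms and its cluster label.
records = [
    {"id": 0, "cluster": 0, "terms": ["diabetes mellitus", "glaucoma", "insulin"]},
    {"id": 1, "cluster": 2, "terms": ["obesity", "insomnia"]},
    {"id": 2, "cluster": 0, "terms": ["diabetes mellitus", "glaucoma"]},
    {"id": 3, "cluster": 1, "terms": ["diabetes mellitus"]},
]
result = search(records, ["diabetes mellitus", "glaucoma"])
```

With this sample data the query matches records 0 and 2, both stored under cluster 0. A looser variant could return records that match any of the terms instead.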

Fig 14. Search terms sent as query
The search terms are medical terms observed in the medical records, which are sent as a query. In the figure above, the search terms given are diabetes mellitus and glaucoma.
Fig 15. Clustered data on the basis of medication and symptom for a given query.

Fig 16. Clustered data on the basis of medication and symptom for the given query obesity and insomnia.
The above screenshot depicts the clustered data for the search terms obesity and insomnia.

III. CONCLUSION

This paper explains document clustering with the multi-view NMF technique, based on the symptoms that occurred and the medical drug names taken from patients' medical records. The medical records, which are rich in text, are used in order to find the hidden information in them. In this approach we first remove negated terms using the negation annotator and then fetch the nouns. The fetched text is then clustered using the NMF clustering technique. The clustered data is stored and can be used later to find similar symptoms and medical drug names when a query is passed through the search engine. This is useful in the healthcare sector, especially in medical transcription.
