Text and Data Mining in Innovation
Joseph Engler

Innovation Typology
  Generational Models
  1. Linear or Push (Baroque)
  2. Pull (Romantic)
  3. Cyclic (Classical)
  4. Strategic (New Age)
  5. Collaborative (Polyphonic)

Collaborative Authoritativeness
  Focused Web Mining for Papers/Articles
  Creation of Authoritativeness Matrix
  Apply Authoritativeness Metric
  Document Clustering

Focused Web Mining
  Standard Web Crawling Methodologies
  Download of Query-Specific Files
  Forms the repository from which Text Mining takes place

Authoritativeness Matrix
  Authors' names are parsed from the documents
  Cited references are parsed from the documents
  The publication date is parsed from the document

Authoritativeness Matrix
  (rows: document authors; columns: referenced authors)

  Author       Howt, P.  Benkler, Y.  Stokic, D.  Von Hippel, E.  Nolan, R.L.  Koza, J.R.  Document Year
  Kusiak, A.   1         1            0           1               1            1           2006
  Lin, G.      0         1            0           0               1            1           2004
  Stokic, D.   1         0            1           0               0            1           1999

Parsing of Names
  Heuristics
    Name should be the first non-empty line after the title
  Regular Expressions
    ^\w+\s\w[.]\s*\w+\s
    \w+\s\w*[.]*\s*\w+\s[a][n][d]
  Names Database with Dice Coefficient

Dice Coefficient
  Create bigrams of the two words being compared.
    Night = {ni, ig, gh, ht} = X
    Nacht = {na, ac, ch, ht} = Y
  Calculate similarity
    $Dice_{Coef} = \frac{2\,|X \cap Y|}{|X| + |Y|}$
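
The bigram comparison above can be sketched in a few lines of Python. This is only a minimal illustration, assuming bigrams are treated as sets and case is ignored; the helper names bigrams, dice_coefficient and best_match are hypothetical and do not come from the slides.

```python
def bigrams(word):
    """Return the set of adjacent character pairs in a lowercased word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice_coefficient(a, b):
    """Dice coefficient 2|X n Y| / (|X| + |Y|) over character bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0  # assumption: two strings without bigrams are scored 0
    return 2 * len(x & y) / (len(x) + len(y))


def best_match(candidate, known_names):
    """Return (name, score) for the known name most similar to the candidate."""
    scored = ((name, dice_coefficient(candidate, name)) for name in known_names)
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Night and Nacht share only the bigram "ht": 2 * 1 / (4 + 4)
    print(dice_coefficient("Night", "Nacht"))
    print(best_match("Kusak, A", ["Kusiak, A.", "Koza, J.R."]))
```

For Night and Nacht the only shared bigram is "ht", so the score is 2 * 1 / (4 + 4) = 0.25. Matching a parsed name against a names database then amounts to picking the entry with the highest score, possibly subject to a cutoff that the slides do not specify.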

Authoritativeness Metric
  Scan Authoritativeness Matrix
    Create Hash of Authors (row)
    Create Hash of Referenced Authors (column)
    Create Hash of Average Age of Document for each author
  Authors Hash (rows) measures Out-Links
  Reference Hash (columns) measures In-Links

Authoritativeness Metric Cont.
  Calculate the initial authoritativeness for each author and referenced author.
    $A = \ln\left(\lambda^{t}\,(out) + in\right)$
  out is the number of out-links
  in is the number of in-links
  λ is a user-defined weight parameter of document age in [0,1]
  t is the average age of the document for the author

Authoritativeness Boosting
  Similar to PageRank Algorithm
  Iterative Approach
  If an authoritative author references a paper, the in-link to that reference is increased
  In-links of less authoritative authors pose no detriment

In-Link Boosting
  Calculate the mean of the in-links
    $e_{in} = \frac{1}{N}\sum_{j=1}^{N} A_j$
  Update the Authoritativeness Metric
    $in' = \sum_{j=1}^{N} \begin{cases} A_j / e_{in} & \text{if } A_j > e_{in} \\ 1 & \text{if } A_j \le e_{in} \end{cases}$
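
The metric and the boosting step can be illustrated with a rough Python sketch. It rests on several assumptions that the slides do not confirm: the initial score is read as A = ln(λ^t · out + in), the boosted in-link count adds A_j / mean for citing authors above the mean and 1 otherwise, and each iteration recomputes every score from the boosted in-links. The function names, the cites mapping and the handling of authors without links are all hypothetical.

```python
import math


def authority(out_links, in_links, avg_age, lam=1.0):
    """Assumed reading of the metric: A = ln(lam**t * out + in)."""
    total = lam ** avg_age * out_links + in_links
    # Assumption: an author with no links at all gets a score of 0.
    return math.log(total) if total > 0 else 0.0


def boosted_in_links(citer_scores):
    """Each citing author counts A_j / mean if above the mean, else 1."""
    if not citer_scores:
        return 0.0
    mean = sum(citer_scores) / len(citer_scores)
    if mean <= 0:
        # Assumption: with a non-positive mean, no in-link is boosted.
        return float(len(citer_scores))
    return sum(a / mean if a > mean else 1.0 for a in citer_scores)


def run_boosting(cites, avg_age, lam=1.0, iterations=30):
    """cites maps each author to the set of authors they reference."""
    authors = set(cites) | {r for refs in cites.values() for r in refs}
    citers = {a: [c for c, refs in cites.items() if a in refs] for a in authors}
    out = {a: len(cites.get(a, ())) for a in authors}
    scores = {
        a: authority(out[a], len(citers[a]), avg_age.get(a, 0.0), lam)
        for a in authors
    }
    for _ in range(iterations):
        # Recompute every score from the previous round's citer scores.
        scores = {
            a: authority(
                out[a],
                boosted_in_links([scores[c] for c in citers[a]]),
                avg_age.get(a, 0.0),
                lam,
            )
            for a in authors
        }
    return scores
```

With λ = 1 the age term drops out, which matches the slide's remark that setting λ to 1 does not discount older authoritativeness. A citing author at or below the mean still contributes 1, so ordinary in-links never lower a score.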

Determining Authoritativeness
  Order the authors by Authoritativeness
  Select Top K authors
  Cluster the documents
  Find cluster closest to current issue
  Find most authoritative k authors for that issue
  Authors that are authoritative overall may not be authoritative on a specific topic
  Possible application of the Apriori Principle

Experimental Results
  945 articles on Genetic Algorithms
  Unclustered to determine overall authority
  30 iterations of In-Link Boosting
  λ set to 1 to not discount older authoritativeness

Experimental Results Cont.

  Author         Original Authoritativeness  Boosted Authoritativeness
  J. H. Holland  5.278                       6.2122
  J. R. Koza     4.8828                      5.8105
  F. H. Bennett  4.5849                      5.1350
  L. Altenberg   4.3438                      4.4306
  D. Andre       3.9512                      4.0105

  Note the minimal boost of the last two authors.

Cyclic Innovation & Data Mining
  Mining of Requirements
  Creation of Requirements Database
  Construction of Requirements Tree

Web Mining For Requirements
  Source of Requirements
    Blogs
    User Reviews
    Expert Reviews
    Patent Databases
    Trade Journals
    Stock Market Analysis (tricky at best)

Filtering Requirements
  Moaners and Praisers
    "I hate this MP3 player and would never buy a product from this company again."
    "I just love Microsoft and every thing they produce. It is all bug free."
  Attempt to assign a measure of success to the requirement
  Identify historic issues versus new requirements

Requirements Database
  Transactional Database of Sorts

  Product           Smaller Footprint  Increased Bandwidth  Interface with Stylus  Increased RPMs  Non-interfering Legs
  Workstation Desk  1                  0                    0                      0               1
  Turret Lathe      1                  0                    0                      1               1
  Smartphone        1                  1                    1                      0               0

Abstraction of Database
  Utilize Multidimensional Cubes
  Ability to Roll-up or Drill Down similar to OLAP
  Increased choices in levels of abstraction

Mining Frequent Requirements
  Select Product/Service type to mine the requirements for
    iPod (includes all MP3 Players)
    On-line Tax Service (includes all on-line)
  Utilize a Market Basket type of Analysis
    Apriori Algorithm
    FP-Growth
  Discover Frequent Itemsets
    Itemset can be considered as a conjunction of items: A ^ B
    Itemset can be considered as a predicate: A => B

Frequent Itemset Metrics
  Support
    $sup(A \Rightarrow B) = \frac{\text{number of tuples containing both } A \text{ and } B}{\text{total number of tuples}}$
  Confidence
    $conf(A \Rightarrow B) = \frac{\text{number of tuples containing both } A \text{ and } B}{\text{number of tuples containing } A}$

Frequent Itemset Generation
  1. Scan the database for frequent single items.
  2. Remove those items that have a support value of less than a given threshold.
  3. Join the remaining frequent items to form 2-item itemsets.
  4. Repeat steps 2 and 3, incrementing the itemset size each time, until there are no itemsets left to join.
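
The two metrics can be made concrete with a small Python sketch. It assumes each tuple is represented as a set of requirement names and uses the three rows of the Requirements Database as sample data; the function names support and confidence and the variable names are illustrative only.

```python
def support(transactions, items):
    """Fraction of tuples that contain every item in `items`."""
    items = set(items)
    if not transactions:
        return 0.0
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Tuples containing both A and B divided by tuples containing A."""
    a = set(antecedent)
    both = a | set(consequent)
    count_a = sum(a <= t for t in transactions)
    if count_a == 0:
        return 0.0  # assumption: an unseen antecedent yields confidence 0
    return sum(both <= t for t in transactions) / count_a


# Rows of the Requirements Database, one set of requirements per product.
desk = {"Smaller Footprint", "Non-interfering Legs"}
lathe = {"Smaller Footprint", "Increased RPMs", "Non-interfering Legs"}
phone = {"Smaller Footprint", "Increased Bandwidth", "Interface with Stylus"}
database = [desk, lathe, phone]

if __name__ == "__main__":
    print(support(database, {"Smaller Footprint", "Non-interfering Legs"}))
    print(confidence(database, {"Non-interfering Legs"}, {"Smaller Footprint"}))
```

On these three rows, Smaller Footprint together with Non-interfering Legs appears in two of three tuples, and every tuple with Non-interfering Legs also has Smaller Footprint, so the confidence of that rule is 1.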

Frequent Itemset Example
  Workbench
    Close to lathe and desk
    Items with legs and a top with stability
  Mine the frequent requirements from our Requirements Database previously shown

  Product           Smaller Footprint  Increased Bandwidth  Interface with Stylus  Increased RPMs  Non-interfering Legs
  Workstation Desk  1                  0                    0                      0               1
  Turret Lathe      1                  0                    0                      1               1
  Smartphone        1                  1                    1                      0               0

  Choose items of similar abstraction to mine from.

  Itemset                Support
  Smaller Footprint      2
  Increased Bandwidth    0
  Interface with Stylus  0
  Increased RPMs         1
  Non-interfering Legs   2

  Set support threshold = 2

  Itemset                                     Support
  Smaller Footprint and Non-interfering Legs  2

  There remain no further itemsets to join.
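
The example can be reproduced with a minimal level-wise search in Python. This is a simplified sketch in the spirit of Apriori, not the full algorithm: it counts, prunes and joins as in the generation steps but skips the subset-based candidate pruning. It assumes only the Workstation Desk and Turret Lathe rows are mined, since they sit at a similar level of abstraction, and the name frequent_itemsets is hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Count candidates, drop those below min_count, join survivors, repeat."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    result = {}
    while level:
        # Support count: number of tuples containing the whole candidate.
        counts = {c: sum(c <= t for t in transactions) for c in level}
        kept = {c: n for c, n in counts.items() if n >= min_count}
        result.update(kept)
        # Join surviving itemsets into candidates one item larger.
        size = len(level[0]) + 1
        level = sorted(
            {a | b for a, b in combinations(kept, 2) if len(a | b) == size},
            key=sorted,
        )
    return result


# The two rows of similar abstraction from the Requirements Database.
desk = {"Smaller Footprint", "Non-interfering Legs"}
lathe = {"Smaller Footprint", "Increased RPMs", "Non-interfering Legs"}

if __name__ == "__main__":
    for itemset, count in frequent_itemsets([desk, lathe], min_count=2).items():
        print(sorted(itemset), count)
```

With a threshold of 2, Increased RPMs is dropped after the first pass, the two remaining items join into a single 2-item itemset with support 2, and the search stops because nothing is left to join.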

Build Requirements Tree
  Built from frequent itemsets