RECORD LINKAGE, A REAL USE CASE WITH SPARK ML. Alexis Seigneurin


1 RECORD LINKAGE, A REAL USE CASE WITH SPARK ML Alexis Seigneurin

2 Who I am
- Software engineer for 15 years
- Consultant at Ippon USA, previously at Ippon France
- Favorite subjects: Spark, Machine Learning, Cassandra

3 200 software engineers in France and the US
- In the US: offices in DC, NYC and Richmond, Virginia
- Digital, Big Data and Cloud applications
- Java & Agile expertise
- Open-source projects: JHipster, Tatami, ...

4 The project
- Record Linkage with Machine Learning
- Use cases:
  - Entity resolution, deduplication, entity disambiguation
  - Find new clients who come from insurance comparison services (commission)
  - Find duplicates in existing files (acquisitions)

5 Overview

6 Purpose
- Find duplicates!

  ID  veh  codptgar_veh  dt_nais_cp  dt_permis_cp  redmaj_cp  formule
  PE TIERS...
  FO VOL_INCENDIE...
  FI TOUS_RISQUES...
  RE TIERS...
  FO TOUS_RISQUES

7 Steps
1. Preprocessing
   1. Find potential duplicates
   2. Feature engineering
2. Manual labeling of a sample
3. Machine Learning to make predictions on the rest of the records

8 Prototype
- Crafted by a Data Scientist
- Not architected, not versioned, not unit tested
- Not ready for production
- Spark, but a lot of Spark SQL (data processing)
- Machine Learning in Python (scikit-learn)
- Objective: industrialization of the code

9 Preprocessing

10 Inputs
- Data (CSV) + Schema (JSON)

  ;Jose;Lester;10/10/
  ;José;Lester;10/10/
  ;Tyler;Hunt;12/12/
  ;Tiler;Hunt;25/12/
  ;Patrick;Andrews;

  { "tableschemabeforeselection": [
      { "name": "ID", "typefield": "StringType", "hardjoin": false },
      { "name": "name", "typefield": "StringType", "hardjoin": true,
        "cleaning": "digitletter",
        "listfeature": [ "scarcity" ],
        "listdistance": [ "equality", "soundlike" ] },
      ...

11 Data loading
- Spark CSV module -> DataFrame
- Don't use type inference

  ID  name     surname  birthdt
      Jose     Lester   10/10/
      José     Lester   10/10/
      Tyler    Hunt     12/12/
      Tiler    Hunt     25/12/
      Patrick  Andrews

12 Data cleansing
- Parsing of dates, numbers
- Cleaning of strings

  ID  name     surname  birthdt
      jose     lester
      jose     lester
      tyler    hunt
      tiler    hunt
      patrick  andrews  null
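The cleansing step can be sketched in plain Scala, outside Spark. This is only an illustration: the names `Cleaning`, `cleanString` and `parseDate` are hypothetical, the `dd/MM/yyyy` date format is an assumption (the dates on the slides are truncated), and the string rule is inferred from the sample rows, where "José" becomes "jose". In the project itself, such functions would presumably be wrapped as UDFs.

```scala
import java.text.Normalizer
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale
import scala.util.Try

object Cleaning {
  // Assumed input format; the slides only show truncated dates such as "10/10/".
  private val dateFormat = DateTimeFormatter.ofPattern("dd/MM/yyyy")

  /** Lowercases, strips accents and drops anything that is not a letter, digit or space. */
  def cleanString(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFD)
      .replaceAll("\\p{M}", "")       // remove the combining accent marks
      .toLowerCase(Locale.ROOT)
      .replaceAll("[^a-z0-9 ]", "")   // keep only digits, letters and spaces
      .trim
      .replaceAll(" +", " ")          // collapse repeated spaces

  /** Returns None instead of throwing when the value is not a valid date. */
  def parseDate(s: String): Option[LocalDate] =
    Try(LocalDate.parse(s.trim, dateFormat)).toOption
}
```

Returning `None` for an unparseable date mirrors the `null` birth date shown for the last record in the table.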

13 Feature calculation
- Convert strings to phonetics (Beider-Morse)

  ID  name     surname  birthdt  BMencoded_name
      jose     lester            ios iosi ioz iozi
      jose     lester            ios iosi ioz iozi
      tyler    hunt              tilir
      tiler    hunt              tqlir tili tilir
      patrick  andrews  null     pytrqk pytrik pat

14 Find potential duplicates
- Self-join (more on that later)

  ID_1  name_1  surname_1  ...  ID_2  name_2  surname_
        jose    lester                jose    lester
        tyler   hunt                  tiler   hunt

15 Distance calculation
- Several distance algorithms: Levenshtein distance, date difference

  ID_1  ...  ID_2  ...  equality_name  soundlike_name  ...  datediff_birthdt
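As an illustration of the two distances named on this slide, here is a minimal plain-Scala sketch. The object and method names are hypothetical, and the talk does not show how its own distance functions were implemented.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object Distances {
  /** Levenshtein edit distance, computed row by row with dynamic programming. */
  def levenshtein(a: String, b: String): Int = {
    // prev(j) = distance between the first i-1 chars of a and the first j chars of b
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(curr(j - 1) + 1, prev(j) + 1), // insertion, deletion
          prev(j - 1) + cost)                     // substitution or match
      }
      prev = curr
    }
    prev(b.length)
  }

  /** Absolute difference in days; None when either date is missing. */
  def dateDiffDays(a: Option[LocalDate], b: Option[LocalDate]): Option[Long] =
    for (x <- a; y <- b) yield math.abs(ChronoUnit.DAYS.between(x, y))
}
```

With the sample names, `Distances.levenshtein("tyler", "tiler")` is 1 (one substitution) and identical strings give 0.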

16 Standardization / vectorization
- Standardization of distances only
- Vectorization (2 vectors)

  ID_1  name_1  surname_1  birthdt_1  ID_2  name_2  surname_2  birthdt_2  distances     other_features
        jose    lester                      jose    lester                [0.0,0.0,...  [2.0,2.0,
        tyler   hunt                        tiler   hunt                  [1.0,1.0,...  [1.0,2.0,
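The slide does not say which scaling was applied to the distances. One common choice is rescaling each distance column to zero mean and unit variance, sketched below in plain Scala; `Standardization.standardize` is a hypothetical helper, not the project's code.

```scala
object Standardization {
  /** Rescales values to zero mean and unit variance (population standard deviation). */
  def standardize(values: Seq[Double]): Seq[Double] = {
    val mean = values.sum / values.size
    val variance = values.map(v => math.pow(v - mean, 2)).sum / values.size
    val std = math.sqrt(variance)
    // A constant column carries no information; map it to zeros instead of dividing by 0.
    if (std == 0.0) values.map(_ => 0.0) else values.map(v => (v - mean) / std)
  }
}
```

Standardizing only the distances, as the slide states, leaves the other features in their original units.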

17 Spark SQL DataFrames

18 From SQL
- Generated SQL queries
- Hard to maintain (especially with regard to UDFs)

  val cleaningRequest = tableSchema.map(x => {
    x.cleaningFunction match {
      case (Some(a), _) => a + "(" + x.name + ") as " + x.name
      case _ => x.name
    }
  }).mkString(", ")
  val cleanedTable = sqlContext.sql("select " + cleaningRequest + " from " + tableName)
  cleanedTable.registerTempTable(schema.tableName + "_cleaned")

19 to DataFrames
- DataFrame primitives
- More work done by the Scala compiler

  val cleanedDF = tableSchema.filter(_.cleaning.isDefined).foldLeft(df) {
    case (df, field) =>
      val udf: UserDefinedFunction = ... // get the cleaning UDF
      df.withColumn(field.name + "_cleaned", udf.apply(df(field.name)))
        .drop(field.name)
        .withColumnRenamed(field.name + "_cleaned", field.name)
  }

20 Unit testing

21 Unit testing
- ScalaTest + Scoverage
- Coverage of all the data processing operations

22 Comparison of Row objects

  ;Jose;Lester;10/10/
  ;Jose =-+;Lester éà;10/10/
  ;Jose;Lester;invalid date

  val resDF = schema.cleanTable(rows)

  "The cleaning process" should "clean text fields" in {
    val res = resDF.select("id", "name", "surname").collect()
    val expected = Array(
      Row("000010", "jose", "lester"),
      Row("000011", "jose", "lester ea"),
      Row("000012", "jose", "lester")
    )
    res should contain theSameElementsAs expected
  }

  "The cleaning process" should "parse dates" in {...

23 Shared SparkContext
- Don'ts:
  - Use one SparkContext per class of tests -> multiple contexts
  - Set up / tear down the SparkContext for each test -> slow tests
- Dos:
  - Use a shared SparkContext

  object SparkTestContext {
    val conf = new SparkConf()
      .setAppName("deduplication-tests")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
  }

24 spark-testing-base
- Holden Karau's spark-testing-base library
- Provides:
  - Shared SparkContext and SQLContext
  - Comparison of RDDs, DataFrames, Datasets
  - Mock data generators

25 Matching potential duplicates

26 Join strategy
- For record linkage, first merge the two sources (prospects and new clients)
- Then self-join to find the duplicates

27 Join - Volume of data
- Input: 1M records
- Cartesian product: 1,000 billion (10^12) records
- Find an appropriate join condition
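The order of magnitude on this slide can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names. It also counts the distinct pairs that remain when each pair is kept only once, as the `id_1 < id_2` condition in the talk's join examples does; that halves the volume but does not change its order of magnitude.

```scala
object PairCounts {
  /** Rows produced by a full Cartesian product of n records with themselves. */
  def cartesian(n: Long): Long = n * n

  /** Distinct unordered pairs: each pair counted once, no record paired with itself. */
  def distinctPairs(n: Long): Long = n * (n - 1) / 2
}
```

For 1M records, `PairCounts.cartesian(1000000L)` is 10^12 and `PairCounts.distinctPairs(1000000L)` is just under 5 x 10^11, which is why a selective join condition is needed.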

28 Join condition
- Multiple joins, each on 2 fields
- Equality of values or custom condition (UDF)
- Union between all the intermediate results
- E.g. with fields name, surname, birth_date:

  df1.join(df2, (df1("id_1") < df2("id_2"))
    && (df1("name_1") === df2("name_2"))
    && (soundlike(df1("surname_1"), df2("surname_2"))))
  UNION
  df1.join(df2, (df1("id_1") < df2("id_2"))
    && (df1("name_1") === df2("name_2"))
    && (df1("birth_date_1") === df2("birth_date_2")))
  UNION
  df1.join(df2, (df1("id_1") < df2("id_2"))
    && (soundlike(df1("surname_1"), df2("surname_2")))
    && (df1("birth_date_1") === df2("birth_date_2")))
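The same idea, candidate pairs that agree on at least two of the three fields, can be sketched with plain Scala collections instead of DataFrames. Everything here is hypothetical: the `Person` type, the `candidates` function, and especially `soundLike`, which is a crude stand-in (same first letter) for the phonetic UDF used in the talk.

```scala
object CandidatePairs {
  // Hypothetical record type; birthDate is kept as a plain string for brevity.
  case class Person(id: String, name: String, surname: String, birthDate: String)

  // Stand-in for the talk's soundlike UDF: here, simply "same first letter".
  def soundLike(a: String, b: String): Boolean =
    a.nonEmpty && b.nonEmpty && a.head == b.head

  /** Pairs (id1, id2) with id1 < id2 that agree on at least 2 of the 3 fields. */
  def candidates(records: Seq[Person]): Set[(String, String)] = {
    // Group on one exactly-matched field so only records sharing it are compared.
    def block(key: Person => String)(matches: (Person, Person) => Boolean): Set[(String, String)] =
      records.groupBy(key).values.flatMap { group =>
        for (a <- group; b <- group if a.id < b.id && matches(a, b)) yield (a.id, b.id)
      }.toSet

    val nameAndSurname  = block(_.name)((a, b) => soundLike(a.surname, b.surname))
    val nameAndBirth    = block(_.name)((a, b) => a.birthDate == b.birthDate)
    val surnameAndBirth = block(_.birthDate)((a, b) => soundLike(a.surname, b.surname))

    // Union of the intermediate results; the Set removes pairs found more than once.
    nameAndSurname ++ nameAndBirth ++ surnameAndBirth
  }
}
```

Each block compares only records that share an exact value, so the full Cartesian product is never materialized. In Spark, the corresponding role is played by the equality part of each join condition.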

29 DataFrames extension

30 DataFrames extension
- 3 types of columns:
  - Data
  - Non-distance features
  - Distances

  ID_1  ...  ID_2  ...  equality_name  soundlike_name  ...  datediff_birthdt

31 DataFrames extension
- DataFrame columns have a name and a data type
- DataFrameExt = DataFrame + metadata over columns

  case class OutputColumn(name: String, columnType: ColumnType)

  class DataFrameExt(val df: DataFrame, val outputColumns: Seq[OutputColumn]) {
    def show() = df.show()
    def drop(colName: String): DataFrameExt = ...
    def withColumn(colName: String, col: Column, columnType: ColumnType): DataFrameExt = ...
    ...
  }

32 Labeling

33 Labeling
- Manual operation
  - Is this a duplicate? Yes / No
- Performed on a sample of the potential duplicates
  - Between 1000 and ... records

34 Labeling

35 Predictions

36 Predictions
- Machine Learning
  - Random Forests (Gradient-Boosted Trees also give good results)
- Training on the potential duplicates labeled by hand
- Predictions on the potential duplicates not labeled by hand

37 Predictions
- Sample: 1000 records
  - Training set: 800 records
  - Test set: 200 records
- Results
  - True positives: 53
  - False positives: 2
  - True negatives: 126
  - False negatives: 5
- Found 53 of the 58 expected duplicates (53+5), with only 2 errors
- Precision 93%, Recall 91%
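For reference, precision and recall follow from the confusion counts by the standard definitions, sketched here with hypothetical helper names. Applied to the counts listed on this slide (TP = 53, FP = 2, FN = 5), the formulas give a recall of about 91%, which matches the slide, and a precision of about 96%, which differs from the 93% reported.

```scala
object Metrics {
  /** Share of predicted duplicates that really are duplicates: TP / (TP + FP). */
  def precision(tp: Int, fp: Int): Double = tp.toDouble / (tp + fp)

  /** Share of real duplicates that were found: TP / (TP + FN). */
  def recall(tp: Int, fn: Int): Double = tp.toDouble / (tp + fn)
}
```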

38 Summary & Conclusion

39 Summary
- Single engine for Record Linkage and Deduplication
- Machine Learning: specific rules for each dataset
- Higher identification of matches
  - Previously ~50%
  - Now ~90%

40 Thank you
