Big Data Appliance in Risk Management
Erste Group Bank, Jozef Zubricky, Group Credit Risk Models and Methods
Digital data have predictive power...
Web scenarios with the highest predictive power
- Currency conversion information (1.3% defaults)
- Loan consolidation information (4.6% defaults)
Simplest method is Naïve Bayes

Text classification: SPAM filter
Pipeline: email messages → term-frequency matrix → class probabilities

Example message: "Your email has won 2.5 million."

Term frequencies per message:
          Msg1  Msg2  Msg3
your       1     2     1
email      1     0     1
has        1     2     1
won        1     0     0
million    1     0     0

Class probabilities per message:
          Msg1  Msg2  Msg3
SPAM      90%    5%   10%
HAM       10%   95%   90%

Messages with a high SPAM probability are classified as SPAM.

Text classification: digital scoring
Pipeline: lists of strings → term-frequency matrix → class probabilities

Term frequencies per client:
           Cli1  Cli2  Cli3
sit         1     1     0
Mozilla     1     0     3
Consolida   2     0     0
Macintosh   1     2     0
193.1.2.0   1     2     0

Class probabilities per client:
           Cli1  Cli2  Cli3
High risk   10%   40%   90%
Low risk    90%   60%   10%

The probability of high risk is used as an additional variable in the scorecard.
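The classifier described above can be sketched in a few lines of Python: a multinomial Naïve Bayes with Laplace smoothing, trained on a tiny illustrative message set (the documents and labels here are invented for the example, not Erste Group's actual data or model).

```python
from collections import Counter
from math import log, exp

# Tiny illustrative training set: tokenised messages with class labels.
train = [
    (["your", "email", "has", "won", "million"], "SPAM"),
    (["your", "meeting", "has", "moved"], "HAM"),
    (["your", "email", "has", "won"], "SPAM"),
]

def fit(docs):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in class_counts}
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_proba(model, tokens):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        logp = log(class_counts[c] / n_docs)  # class prior
        for w in tokens:
            # add-one smoothing avoids zero probability for unseen words
            logp += log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = logp
    # normalise log scores into probabilities
    m = max(scores.values())
    expd = {c: exp(s - m) for c, s in scores.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}

model = fit(train)
probs = predict_proba(model, ["your", "email", "has", "won", "million"])
# probs["SPAM"] is high, so this message would be classified as SPAM
```

The same mechanics apply to digital scoring: swap email tokens for the client's strings (user agents, URLs, IP addresses) and the labels SPAM/HAM for high risk/low risk.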
Not able to implement in our traditional system, due to computational speed constraints and the ever-changing underlying websites
Big Data appliance is built for such tasks

Input:
- Column 1: IP address
- Column 2: Timestamp of click
- Column 3: URL of page visited
- Column 4: Webpage text

Map():
- Key: IP address
- Values: timestamp; URL of page visited; probability of the webpage being good or bad, based on the webpage text and Naïve Bayes

Shuffle and sort

Reduce():
- Key: IP address
- Value: list of probabilities of all the websites visited by the IP address per user session, defined by timestamps

Output: least and most risky pages; least and most risky IP addresses
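The map/shuffle/reduce flow above can be simulated in plain Python. This is a sketch, not the production job: the rows, URLs and risk probabilities are invented, and in the real pipeline the per-page probability would come from the Naïve Bayes classifier applied to the page text.

```python
from collections import defaultdict

# Illustrative clickstream rows: (IP, timestamp, URL, page risk probability).
rows = [
    ("193.1.2.0", 1, "bank.example/fx-conversion", 0.9),
    ("193.1.2.0", 2, "news.example/home", 0.1),
    ("10.0.0.7",  1, "bank.example/consolidation", 0.8),
]

def map_fn(row):
    ip, ts, url, p_risky = row
    # emit key = IP address, value = (timestamp, URL, risk probability)
    yield ip, (ts, url, p_risky)

def shuffle(mapped):
    # group mapped values by key, as the framework's shuffle phase would
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    return grouped

def reduce_fn(ip, values):
    # order one session's pages by timestamp, then aggregate:
    # here session risk = riskiest page visited (one possible choice)
    probs = [p for _, _, p in sorted(values)]
    return ip, max(probs)

mapped = [kv for row in rows for kv in map_fn(row)]
session_risk = dict(reduce_fn(ip, vals) for ip, vals in shuffle(mapped).items())
# session_risk: {"193.1.2.0": 0.9, "10.0.0.7": 0.8}
```

On a real appliance the same map and reduce functions would run distributed over many nodes; the framework handles the shuffle and sort between them.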
No business case... nobody wants to finance this for just one problem
Problem: come up with a data-driven model, using natural experiment data
Why not champion/challenger?
- It is costly
- Reputation risk
We found 2 natural experiments
- In 2007, for almost one month, we granted loans without considering current instalments
- Until 2010 we were granting foreign-exchange loans in some countries
Well, we thought we had found natural experiments...
We had deleted data, and scripts were not working
- For the first experiment, the 2007 data were no longer available
- For the second experiment, scripts were not working across a corridor
We managed to...
- Retrieve the data from old backups
- The modelling itself was quite a success: we found our relationships
So we set up a project called CRANE
- Central place for model development, monitoring and validation
- Unlimited data history, to utilise past crisis data for model development
- Automated data load and post-rollout checks, to reduce operational problems
We needed cheap storage
[Chart: cost vs. storage volume from 10TB to 50TB; the cost of a data warehouse appliance grows far faster than the cost of a Hadoop appliance]
Big Data Technology without Big Data
But this is how it got in: a business case based on cheap storage
But...
- How to connect it to production legacy systems?
- What is the regulatory environment?
Our environment is diverse
- We expanded by buying banks
- Different legacy IT systems
- Central modelling team distributed across locations
- Not enough storage in the enterprise data warehouse
EBA and ESMA published a draft regulation on Big Data
- Type "EBA ESMA big data" into Google
- They seek comments until 17.3.2017
- The people who wrote it have good insight into the industry
Main takeaways: risks
- Transparency: referring to other directives aimed at client ownership of the data and transparency of use
- Security: cybersecurity and data protection of Big Data solutions; risk of outsourcing services
- Reputation: wrong decisions, difficult to control, exclusion of groups of clients
- Conformism: conformist behaviour of people when they know how their data influence decision making, etc.
Thank you for your attention
- What is your experience with data-mart projects?
- What is your experience with Big Data usage?
Jozef Zubricky, jozef.zubricky@erstegroup.com, +43 664 818 2976