USA + 1 650 469 3205 | IND + 91 120 423 9596

Data De-Duplication and Cleaning


  • Our client, one of India’s leading life insurance companies with a growing portfolio, was finding it difficult to maintain good data warehouse practices while the focus was on new acquisitions. Every new policy was enrolled as a new record, irrespective of whether the person, company, or family already held other policies; there was no link to existing accounts.
  • The business wanted to de-duplicate the database and develop a unique identification number at account level, linked to a family (or entity) identification number. This would allow it to identify potential cross-sell opportunities, get a realistic picture of portfolio dynamics, and make better business decisions.


  • As a first step, we compiled a list of all ‘personal identification’ fields across the customer database platforms: Name, Address, Employer Details, Date of Birth, PIN Code, Phone Numbers, Father’s Name, PAN number, Passport Details, and other identification-related fields.
  • Developed a data table with all these fields along with the current Customer ID for the complete portfolio.
  • Next, conducted duplication analyses at various levels. Over 10% of accounts shared a Name and PAN number combination with another account. Over 30% of phone numbers were duplicated. On deeper investigation, it turned out that the same number had been recorded for distinct customers of the same agent, confirming the business's understanding that agents mask their customers’ details from the company and supply their own personal details instead.
  • Similar analyses were conducted for all logical combinations of fields. The conclusion: there was 16-20% duplication in the portfolio. In other words, almost 20% of customers held more than one account, and the company was not aware of it.
  • From these investigations, we identified the best stepwise combinations of fields for eliminating duplicates. Each step involved a progressively tighter set of rules to identify duplicate records.
  • Applied fuzzy logic algorithms with defined confidence intervals to remove duplicates at each step of the process. The confidence intervals were set in consultation with the client to make sure there was no over-fitting (treating different accounts as one).
  • Ran the process iteratively, with diminishing incremental results at each subsequent iteration. The stopping rule was an incremental lift of less than 0.05% in a single iteration; this cut-off was reached after 15 iterations. Approximately 0.3% duplication remained in the resulting dataset.
  • Based on the de-duplication process, we developed a unique ID for all accounts in the portfolio. Shared the mapping from the old ‘customer ID’ to the new ‘unique ID’ with the client, along with the duplicate-removal algorithm (for further use). The unique ID was also implemented in the data warehouse.
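
The steps above can be illustrated with a minimal Python sketch. It is not the algorithm delivered to the client: the field names, the sample records, the similarity thresholds, and the use of the standard library's difflib.SequenceMatcher as the fuzzy matcher are all assumptions made for illustration. The sketch measures duplication for a field combination, merges records pass by pass under a sequence of rules, stops when a pass adds less than a minimum lift, and assigns one unique ID per merged group.

```python
from collections import Counter
from difflib import SequenceMatcher

def duplicate_rate(records, keys):
    """Share of records whose values for `keys` also occur on another record."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] > 1) / len(records)

def similar(a, b, threshold):
    """Fuzzy string match; SequenceMatcher stands in for the real matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def find(parent, i):
    """Union-find root lookup with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def dedupe(records, rules, min_lift=0.0005):
    """Apply rules in order; each rule is (exact_field, fuzzy_field, threshold).

    Stops early once a pass merges less than `min_lift` of all records
    (0.0005 mirrors the 0.05% cut-off described above).
    Returns a mapping from old customer ID to new unique ID.
    """
    n = len(records)
    parent = list(range(n))
    for exact_field, fuzzy_field, threshold in rules:
        merged = 0
        for i in range(n):
            for j in range(i + 1, n):  # pairwise scan: fine for a sketch only
                ri, rj = find(parent, i), find(parent, j)
                a, b = records[i], records[j]
                if (ri != rj and a[exact_field]  # skip blank key values
                        and a[exact_field] == b[exact_field]
                        and similar(a[fuzzy_field], b[fuzzy_field], threshold)):
                    parent[rj] = ri
                    merged += 1
        if merged / n < min_lift:
            break
    unique_ids = {}
    mapping = {}
    for i, rec in enumerate(records):
        root = find(parent, i)
        unique_ids.setdefault(root, f"U{len(unique_ids) + 1:04d}")
        mapping[rec["customer_id"]] = unique_ids[root]
    return mapping

# Hypothetical sample data; 9000000009 plays the role of an agent's own number.
records = [
    {"customer_id": "C1", "name": "Rajesh Kumar", "pan": "ABCDE1234F",
     "dob": "1980-01-01", "phone": "9000000001"},
    {"customer_id": "C2", "name": "Rajesh Kumaar", "pan": "ABCDE1234F",
     "dob": "1980-01-01", "phone": "9000000009"},
    {"customer_id": "C3", "name": "Anita Sharma", "pan": "PQRSX5678Z",
     "dob": "1975-05-05", "phone": "9000000009"},
    {"customer_id": "C4", "name": "Anita Sharma", "pan": "",
     "dob": "1975-05-05", "phone": "9000000002"},
    {"customer_id": "C5", "name": "Vikram Singh", "pan": "LMNOP4321K",
     "dob": "1990-09-09", "phone": "9000000009"},
]

# Illustrative rule sequence: PAN with a fuzzy name, then date of birth with a near-exact name.
rules = [("pan", "name", 0.85), ("dob", "name", 0.95)]
mapping = dedupe(records, rules)
phone_dup = duplicate_rate(records, ["phone"])
```

In this sample the shared phone number appears on three unrelated records, which is why phone is left out of the matching rules. The pairwise comparison is quadratic in the number of records, so a real portfolio would need blocking or indexing on the exact-match field before any fuzzy comparison.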


  • With duplicates removed from the database, the client has a better grip on portfolio dynamics.
  • The business has a deeper understanding of customers with multiple accounts. With these insights, it can formulate better cross-sell plans to leverage the existing customer base to its full potential.