Just over a year ago we launched the Duplicate Detector to help you find and merge those pesky duplicate profiles cluttering your database, and it’s already been used to merge over 500,000 profiles!
This week we launched a new detection algorithm to improve the results of the Duplicate Detector. Our tests have shown the new algorithm to be significantly better at finding duplicates and ignoring similar, but not actually duplicate, profiles. It will save you time by providing better suggestions and keeping your database tidier!
This new feature serves as an example of how we love using new technologies to equip your church, and if you’re like me and you’d like the technical details of how complex problems were solved, read on!
The Nitty Gritty
The Duplicate Detector compares two profiles by looking at several pieces of information from each profile (such as names, birthdays, contact info, et cetera) and uses the information to decide if those two profiles should be flagged as potential duplicates.
But deciding which comparisons are the most reliable predictors of duplicates isn’t easy. For example, if two profiles have matching names, phone numbers, and birthdays, they’re probably duplicates, but what if the names and addresses match but the birthdays are different? Should they be flagged?
When the Duplicate Detector was first built, we made educated guesses about which attributes would most reliably predict duplicate profiles, and then tested those guesses against the limited data available. The results have been pretty good so far, but they've generated more false positives than we'd like to see, particularly for larger churches.
After over a year of watching churches interact with the Duplicate Detector, we now have a huge amount of data about what duplicate profiles actually look like. We used this data, combined with some really interesting tech called a “genetic algorithm,” to completely revamp how the tool works.
From Wikipedia: “In a genetic algorithm, a population of candidate solutions […] to an optimization problem is evolved toward better solutions.”
In non-nerd terms: instead of us guessing which attributes are the most reliable predictors of duplicates, the computer made thousands of guesses for us. The qualities of the most accurate guesses were then combined with the qualities of other effective guesses to make even more effective “guess babies.” This process went on and on, testing tens of thousands of weightings, until we got a solution that fit just right.
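For the curious, here's a toy version of that evolutionary process. The training pairs, fitness measure, and population sizes are all made up for illustration; the real system evolved weightings against real merge data, not this tiny dataset:

```python
import random

# Toy genetic algorithm: evolve attribute weights against labeled profile pairs.
# All data and parameters here are illustrative, not the production setup.

random.seed(0)
ATTRS = ["name", "phone", "birthday", "address"]

# Labeled pairs: (which attributes matched, was it actually a duplicate?)
PAIRS = [
    ({"name": 1, "phone": 1, "birthday": 1, "address": 0}, True),
    ({"name": 1, "phone": 0, "birthday": 0, "address": 1}, False),
    ({"name": 1, "phone": 1, "birthday": 0, "address": 1}, True),
    ({"name": 0, "phone": 0, "birthday": 1, "address": 1}, False),
]

def fitness(weights, threshold=0.5):
    """Count how many labeled pairs this weighting classifies correctly."""
    correct = 0
    for matches, is_dup in PAIRS:
        score = sum(w * matches[a] for a, w in zip(ATTRS, weights))
        if (score >= threshold) == is_dup:
            correct += 1
    return correct

def crossover(a, b):
    """Make a 'guess baby': each weight comes from one parent or the other."""
    return [random.choice(pair) for pair in zip(a, b)]

def mutate(w, rate=0.2):
    """Randomly nudge some weights so the search keeps exploring."""
    return [min(1.0, max(0.0, x + random.uniform(-0.1, 0.1)))
            if random.random() < rate else x for x in w]

# Start with random guesses, then evolve the population for many generations.
population = [[random.random() for _ in ATTRS] for _ in range(20)]
for _ in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]  # keep the fittest half as parents
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

best = max(population, key=fitness)  # the weighting that fit our data best
```

Scaled up to hundreds of thousands of labeled pairs and many more generations, this is how a computer can discover a weighting no human would have guessed.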
The New Duplicate Detector
These new rules have been tested against a dataset consisting of hundreds of thousands of duplicate profiles and false positives, and the new algorithm is now detecting 36% more duplicates than before, while creating 20% fewer false positives.
As time goes on we will continue improving our algorithms with new data to keep making this tool better and better.