I think if I were doing a huge cleanup, I would pick one of those algorithms, like Simil.
Then loop through all your records, compare each one against the others, and save every pair of values that meets a certain score. Maybe anything above .65 if using Simil, or 70 using Leven... These are then your "pretty similar" records. Or maybe...
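
Here is a rough sketch of that loop in Python, just to show the shape of it. I don't have Simil itself here, so this swaps in `difflib.SequenceMatcher` from the standard library, which as far as I know uses the same Ratcliff/Obershelp-style matching and also returns a 0 to 1 score. The function name `similar_pairs`, the sample records, and lowercasing before comparing are all my own assumptions, not anything from your data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(records, threshold=0.65):
    """Return (a, b, score) for every pair scoring at or above threshold.

    Hypothetical helper: SequenceMatcher stands in for Simil here.
    """
    matches = []
    # combinations() yields each unordered pair once, so a record is
    # never compared with itself or compared twice.
    for a, b in combinations(records, 2):
        # Lowercasing is an assumption; drop it if case matters to you.
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            matches.append((a, b, round(score, 2)))
    return matches


# Made-up sample data for illustration only.
records = ["Acme Corporation", "ACME Corp.", "Zenith Widgets"]
pairs = similar_pairs(records)
for a, b, score in pairs:
    print(f"{a!r} ~ {b!r}: {score}")
```

Keep in mind this compares every record against every other one, so on a really big table it gets slow fast. You would probably want to save the matches to a review table and only compare within some grouping (same first letter, same zip code, whatever makes sense).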