Sometimes we end up with duplicate rows in the database due to bad design or a schema change. Let's see how we can fix them.
- The solution is to "group by" rows on the columns that should be unique.
- Annotate each group with a count of its rows.
- Filter for groups where the count is greater than 1.
Let's say we have an `Announcement` model with `company`, `title`,
`broadcast_date` and other fields. We want to delete duplicate announcements for the same company on the same date.
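The model definition is not shown in the post; a minimal sketch consistent with the fields used below might look like this (the `Company` model and field types are assumptions):

```python
from django.db import models


class Company(models.Model):
    # Assumed related model; only here to make the sketch self-contained.
    name = models.CharField(max_length=255)


class Announcement(models.Model):
    company = models.ForeignKey(Company, on_delete=models.CASCADE)
    title = models.CharField(max_length=255)
    broadcast_date = models.DateField()
```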
```python
from django.db.models import Min, Count

from documents.models import Announcement

# Group by putting the group identifiers in values(),
# then annotate a count per group.
dupes = Announcement.objects.values('company_id', 'title', 'broadcast_date') \
    .order_by() \
    .annotate(
        first_update_id=Min('id'),
        records=Count('id'),
    ).filter(
        records__gt=1
    )

# Check the generated SQL
print(dupes.query)

# Handle duplicates: keep the earliest record in each group,
# delete the rest.
for result in dupes:
    Announcement.objects.filter(
        company_id=result['company_id'],
        title=result['title'],
        broadcast_date=result['broadcast_date'],
    ).exclude(id=result['first_update_id']).delete()
```
unique_together database constraints

The best way to handle duplicates is to prevent them with a `unique_together` constraint at the database level. A common mistake is to include a nullable field in `unique_together`. In SQL, `null` is not equal to anything; in fact, `null` is not even equal to `null`. So a unique constraint over a `null` value won't work.
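A quick way to see this behaviour is with plain SQL. The sketch below uses an in-memory SQLite table (the table and column names mirror the example above but are illustrative): two rows that differ only in a `NULL` column both pass a `UNIQUE` constraint, because the database never considers the `NULL`s equal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE announcement (
        company_id INTEGER,
        title TEXT,
        broadcast_date TEXT,
        UNIQUE (company_id, title, broadcast_date)
    )
""")

# Both inserts succeed: NULL is not equal to NULL,
# so the UNIQUE constraint never fires.
conn.execute("INSERT INTO announcement VALUES (1, 'Results', NULL)")
conn.execute("INSERT INTO announcement VALUES (1, 'Results', NULL)")

count = conn.execute("SELECT COUNT(*) FROM announcement").fetchone()[0]
print(count)  # -> 2: duplicate rows despite the constraint
```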