Pelles C forum

C language => Beginner questions => Topic started by: Grincheux on February 12, 2016, 09:56:31 PM

Title: SQLite and duplicates
Post by: Grincheux on February 12, 2016, 09:56:31 PM
I have many thousands of images in a folder.
I built a SQLite database that stores, for each image, the file size, the MD5 of the file, and the MD5 of the image bits.
In C I could find the duplicates by walking through the database record by record, sorted on one of the MD5 columns, but that is slow.
Two JPEG files can have different file MD5s and still have the same MD5 for the image bits.
Is it possible to write a SQLite query that finds such files?
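
To make the question concrete, here is a minimal sketch of the kind of query I am after. It is written in Python only because the sqlite3 module makes it short to test; the SQL itself could be run from C in the same way. The table is a hypothetical miniature of mine: FileName is an assumed column, and the hash values are placeholders rather than real MD5 digests.

```python
import sqlite3

# Hypothetical miniature of the Images table. FileName is an assumed column
# and the hash values are placeholders, not real MD5 digests.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Images (FileName TEXT, FileSha2 TEXT, ImgSha2 TEXT, FileSize INTEGER)"
)
conn.executemany(
    "INSERT INTO Images VALUES (?, ?, ?, ?)",
    [
        ("a.jpg", "f1", "i1", 1000),
        ("b.jpg", "f2", "i1", 1040),  # different file, same image bits as a.jpg
        ("c.jpg", "f3", "i3", 2000),  # unique image
        ("d.jpg", "f4", "i4", 3000),
        ("e.jpg", "f4", "i4", 3000),  # byte-for-byte copy of d.jpg
    ],
)

# Keep only the image hashes that are shared by files whose file hashes differ.
query = """
SELECT ImgSha2, COUNT(*) AS Files, COUNT(DISTINCT FileSha2) AS DistinctFiles
FROM Images
GROUP BY ImgSha2
HAVING COUNT(DISTINCT FileSha2) > 1
ORDER BY ImgSha2
"""
shared_images = conn.execute(query).fetchall()
for row in shared_images:
    print(row)
```

In this sketch only the image hash shared by a.jpg and b.jpg is reported; d.jpg and e.jpg are exact copies, so they have a single distinct file hash and are left out.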

Here is an example of the database.

The files Folder 00000005_00018.jpg and Folder 00000005_00019.jpg are the same.
The same goes for Folder 00000005_00109 and Folder 00000005_00110.

One solution is
Quote
SELECT FileSha2, ImgSha2, FileSize, ImgWidth, ImgHeight, ImgSize, ImgRatio, COUNT(*)
FROM Images GROUP BY FileSha2,ImgSha2,FileSize
HAVING COUNT(*) > 1
ORDER BY FileSha2

But that needs another query for each of the selected items.
If two files have the same file size, that does not mean the images are the same.
To be sure, I must compare the ImgSha2 field, which contains an MD5 of the image bits.
Files with different file MD5s can have equal MD5s for the image bits!
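
One way to avoid the second query might be to let a subquery pick the image hashes that occur more than once and return every matching row in a single pass. This is only a sketch under assumptions: it is in Python for brevity (the SQL is what matters), FileName is a hypothetical column, the hash values are placeholders, and the index on ImgSha2 is my own guess at what would keep the lookup fast on many thousands of rows.

```python
import sqlite3

def find_duplicate_images(conn):
    """Return every row whose ImgSha2 occurs more than once, ordered by hash."""
    query = """
    SELECT FileName, FileSha2, ImgSha2, FileSize
    FROM Images
    WHERE ImgSha2 IN (
        SELECT ImgSha2 FROM Images GROUP BY ImgSha2 HAVING COUNT(*) > 1
    )
    ORDER BY ImgSha2, FileName
    """
    return conn.execute(query).fetchall()

# Hypothetical sample data; FileName is an assumed column and the hashes
# are placeholders, not real MD5 digests.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Images (FileName TEXT, FileSha2 TEXT, ImgSha2 TEXT, FileSize INTEGER)"
)
# Assumed index so that grouping and matching on ImgSha2 need not scan blindly.
conn.execute("CREATE INDEX IdxImgSha2 ON Images (ImgSha2)")
conn.executemany(
    "INSERT INTO Images VALUES (?, ?, ?, ?)",
    [
        ("a.jpg", "f1", "i1", 1000),
        ("b.jpg", "f2", "i1", 1040),  # same image bits as a.jpg, different file
        ("c.jpg", "f3", "i3", 2000),  # unique image, must not be reported
        ("d.jpg", "f4", "i4", 3000),
        ("e.jpg", "f4", "i4", 3000),  # exact copy of d.jpg
    ],
)

duplicates = find_duplicate_images(conn)
for row in duplicates:
    print(row)
```

Rows with the same ImgSha2 come out next to each other, so the calling code only has to watch for the hash changing to know where one group of duplicates ends and the next begins.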

Another problem: two files holding the same image at different compression levels will not give the same MD5, either for the file or for the image bits, even though the images look the same.

There is another problem with images of different sizes but the same ratio, but that is another story.

Thanks for your help