mysql - Rails: A way to check for duplicate item in DB? Affiliate data feeds
I have a problem with affiliate data feeds, for example from Amazon or other partner e-shops.
I'm trying to import product data, but want to avoid duplicates.
For example, Amazon uses the product title: iPhone 5 16 GB black
and another store uses the product title: iPhone 5 16 GB
They should be listed as one product. Now imagine that I have 10 stores selling the iPhone 5.
Of course, the products have many more parameters, but I still need an algorithm to prevent duplicates, something like a matching algorithm based on the similarity of product parameters.
Does anyone have experience with this and can tell me what kind of algorithm is advisable for this scenario?
A detailed list of parameters
Thanks a lot!
This could be done by EAN number, but that number is not always provided.
Before developing an algorithm, you need to define business rules. If you have a situation where all the attributes match except the title, then you can try a sub-string match (one title being part of another) or a fuzzy match on the title.
We are using fuzzy-string-match gem to find duplicate companies.
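To illustrate the sub-string and fuzzy checks on titles, here is a minimal sketch in plain Ruby. It does not use the fuzzy-string-match gem; it swaps in a hand-written Levenshtein edit distance so the example is self-contained. The method names and the 0.85 threshold are assumptions chosen for illustration, not recommended values.

```ruby
# Lowercase, drop punctuation, and collapse whitespace so that
# formatting differences do not affect the comparison.
def normalize_title(title)
  title.downcase.gsub(/[^a-z0-9\s]/, " ").split.join(" ")
end

# Classic Levenshtein edit distance, computed row by row.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Similarity in 0.0..1.0; 1.0 means identical after normalization.
def title_similarity(a, b)
  a = normalize_title(a)
  b = normalize_title(b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end

# Hypothetical rule: treat two titles as probable duplicates if one
# contains the other, or if their similarity reaches the threshold
# (0.85 is an arbitrary value picked for this sketch).
def probable_duplicate?(a, b, threshold: 0.85)
  na = normalize_title(a)
  nb = normalize_title(b)
  na.include?(nb) || nb.include?(na) || title_similarity(a, b) >= threshold
end
```

With this sketch, `probable_duplicate?("iPhone 5 16 GB black", "iPhone 5 16 GB")` is true through the sub-string rule. Note that the sub-string rule alone would also pair a phone with an accessory whose title contains the phone's name, which is why the other attributes should already agree before the title is compared.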
Assuming that the discrepancy is in the title only, you can put more intelligence into the algorithm by analyzing the parts of the title. In your example, the title parts could be model, version, capacity and color. For this example:
required_data = [model, version, capacity]
optional_data = [color]
and define such attributes for each product category. Combined with a fuzzy match, you should be able to get a good match despite spelling errors and match the following:
iPhone 5 16GB Black
iPhone 5 16GB
IPhone 5 16 GB White
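The required/optional idea could be sketched roughly as follows. The parser is hypothetical and only handles titles shaped like the iPhone examples (model word, version word, a capacity in GB, an optional color from a small assumed list); a real feed would need rules per product category. Following the example above, only the required attributes decide whether two titles are the same product, so color is parsed but ignored.

```ruby
REQUIRED_DATA = %i[model version capacity].freeze
OPTIONAL_DATA = %i[color].freeze
KNOWN_COLORS  = %w[black white].freeze # assumed list, for illustration only

# Hypothetical parser for titles shaped like "iPhone 5 16 GB Black".
def parse_title(title)
  text = title.downcase
  capacity = text[/(\d+)\s*gb\b/, 1]        # "16gb" and "16 gb" both give "16"
  rest = text.sub(/\d+\s*gb\b/, " ").split  # remaining words without capacity
  color = KNOWN_COLORS.find { |c| rest.include?(c) }
  words = rest - KNOWN_COLORS
  {
    model: words[0],
    version: words[1],
    capacity: capacity && capacity.to_i,
    color: color
  }
end

# Two titles describe the same product when every required attribute
# is present and equal; optional attributes are not compared.
def same_product?(a, b)
  pa = parse_title(a)
  pb = parse_title(b)
  REQUIRED_DATA.all? { |key| !pa[key].nil? && pa[key] == pb[key] }
end
```

Under these assumptions, all three titles above are treated as the same product, while `iPhone 5 32GB` is not. Whether color variants should instead be kept as separate products is a business rule to settle before choosing which attributes are required.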