Author: Dipalee A. More
Date of Publication: 17th August 2017
Abstract: Databases contain very large data sets in which many duplicate records are present. Duplicate records arise when data entries are stored in a uniform manner in the database, which resolves the structural heterogeneity problem. The goal is to maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Detecting duplicate records is difficult and takes considerable execution time. The authors describe various techniques used to find duplicate records in a database, but these techniques have some issues. To address them, progressive algorithms have been proposed, which significantly increase the efficiency of finding duplicates when execution time is limited and improve the quality of the records. The authors will combine the progressive approaches of the base paper with scalable approaches for duplicate detection to deliver results even faster.
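To make the idea of reporting most results early more concrete, the following is a minimal Python sketch of one possible progressive strategy: records are sorted by a key, and pairs are compared in order of increasing rank distance, so the most promising neighbors are checked first. The abstract does not specify the algorithm, so this sorted-neighborhood variant, the function names `similarity` and `progressive_duplicates`, the sorting key, and the parameters `max_distance` and `threshold` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterator, List, Tuple


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] using difflib (an illustrative choice)."""
    return SequenceMatcher(None, a, b).ratio()


def progressive_duplicates(
    records: List[str],
    sort_key: Callable[[str], str] = str.lower,
    max_distance: int = 3,
    threshold: float = 0.85,
) -> Iterator[Tuple[str, str]]:
    """Yield likely duplicate pairs, closest sorted neighbors first."""
    ordered = sorted(records, key=sort_key)
    # Distance 1 compares directly adjacent records, which are the most
    # promising candidates; larger distances are explored only afterwards.
    for distance in range(1, max_distance + 1):
        for i in range(len(ordered) - distance):
            left, right = ordered[i], ordered[i + distance]
            if similarity(left.lower(), right.lower()) >= threshold:
                yield left, right


# Hypothetical sample data, not taken from the paper.
records = ["John Smith", "Alice Jones", "Jon Smith", "Bob Brown", "Alice Jonse"]

for pair in progressive_duplicates(records):
    print(pair)
```

Because the function is a generator, a caller with a limited time budget can stop iterating at any point and keep the pairs found so far, which is the property the abstract attributes to progressive approaches.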
References:
- Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti, “Eliminating fuzzy duplicates in data warehouses,” In Proceedings of the International Conference on Very Large Databases (VLDB), 2002.
- Rohan Baxter, Peter Christen, and Tim Churches. “A comparison of fast blocking methods for record linkage,” In SIGKDD Workshop on Data Cleaning, Record Linkage and Object Consolidation, 2003.
- Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney, “Adaptive blocking: Learning to scale up record linkage,” In Industrial Conference on Data Mining (ICDM), 2006.
- Peter Christen, “Towards parameter-free blocking for scalable record linkage,” Technical Report TR-CS-07-03, The Australian National University, August 2007.
- S. E. Whang, D. Marmaros, and H. Garcia-Molina, “Pay-as-you-go entity resolution,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 5, pp. 1111–1124, May 2012.
- Ashwini V. Lake, Lithin K, “A study and survey on various progressive duplicate detection mechanisms,” in IJRET: International Journal of Research in Engineering and Technology, vol. 05 pp. 2319-1163, Mar. 2016.
- Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios, “Duplicate record detection: A survey,” IEEE Transactions on Knowledge and Data Engineering (TKDE), 19, 2007.
- Mauricio A. Hernandez and Salvatore J. Stolfo, “The merge/purge problem for large databases,” In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 1995.
- Mauricio A. Hernandez and Salvatore J. Stolfo, “Real-world data is dirty: Data cleansing and the merge/purge problem,” Data Mining and Knowledge Discovery, 2(1), 1998.
- Alvaro E. Monge and Charles Elkan, “An efficient domain-independent algorithm for detecting approximately duplicate database records,” In Proceedings of the Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.
- Sven Puhlmann, Melanie Weis, and Felix Naumann, “XML duplicate detection using sorted neighborhoods,” In Proceedings of the International Conference on Extending Database Technology (EDBT), 2006.