Journal of Wuhan Institute of Technology [ISSN: 1674-2869 / CN: 42-1779/TQ]

Volume:
Issue: 2014, No. 09

Pages: 66-69

Section: Mechanical, Electrical and Information Engineering

Publication date: 2014-09-30

Article Info

Title: Method for detecting approximately duplicate database records in big data environment

Article ID: 1674-2869(2014)09-0066-04

Author(s): YIN Xiuye (殷秀叶); School of Computer Science and Technology, Zhoukou Normal University, Zhoukou 466001, Henan, China

Keywords: approximately duplicated records; big data; MapReduce; synonymous property

CLC number: TP393

DOI: 10.3969/j.issn.1674-2869.2014.09.013

Document code: A

Abstract: In big data environments, approximately duplicated records degrade the accuracy of statistical analysis and therefore need to be filtered out. After reviewing current research on approximately duplicated record detection, this paper proposes an attribute-weighting approach: attributes are assigned weights, and records are sorted and grouped according to those weights. Because some fields have values in one-to-one correspondence and therefore receive the same weight, the concept of synonymous properties is introduced; excluding some synonymous properties from the original dataset shrinks it and makes duplicate detection more efficient. A method for deciding whether two records are approximate duplicates is then given. To handle the challenge that large datasets pose for duplicate detection, the dataset is split into several smaller ones and processed with MapReduce: it is grouped by the values of the attributes with larger weights and divided into a number of map tasks that are processed separately. Experimental results show that the method effectively improves the efficiency of approximately duplicated record detection.
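The pipeline outlined in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's implementation: the function names (`find_synonymous`, `similarity`, `detect_duplicates`), the sample records, the attribute weights, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the per-field similarity measure are all assumptions chosen for illustration. The grouping step only stands in for the partitioning that a real MapReduce job would perform across map tasks.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def find_synonymous(records, fields):
    """Return fields whose values map one-to-one onto an earlier field."""
    redundant = set()
    for a, b in combinations(fields, 2):
        if a in redundant or b in redundant:
            continue
        pairs = {(r[a], r[b]) for r in records}
        a_values = {p[0] for p in pairs}
        b_values = {p[1] for p in pairs}
        # One-to-one: every a-value has exactly one b-value and vice versa.
        if len(pairs) == len(a_values) == len(b_values):
            redundant.add(b)
    return redundant


def similarity(r1, r2, weights):
    """Weighted average of per-field string similarity, in [0, 1]."""
    total = sum(weights.values())
    score = sum(
        w * SequenceMatcher(None, str(r1[f]), str(r2[f])).ratio()
        for f, w in weights.items()
    )
    return score / total


def detect_duplicates(records, weights, threshold=0.85):
    """Return index pairs of records judged to be approximate duplicates."""
    # Shrink the attribute set by dropping synonymous (one-to-one) fields.
    redundant = find_synonymous(records, list(weights))
    kept = {f: w for f, w in weights.items() if f not in redundant}

    # "Map" stand-in: partition records by the highest-weight attribute value.
    key_field = max(kept, key=kept.get)
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec[key_field]].append(idx)

    # "Reduce" stand-in: compare records pairwise only inside each group.
    duplicates = []
    for members in groups.values():
        for i, j in combinations(members, 2):
            if similarity(records[i], records[j], kept) >= threshold:
                duplicates.append((i, j))
    return sorted(duplicates)


# Hypothetical sample data; weights are hand-picked, not learned.
records = [
    {"name": "Zhang Wei", "city": "Wuhan", "zip": "430000", "phone": "13800000001"},
    {"name": "Zhang Wie", "city": "Wuhan", "zip": "430000", "phone": "13800000001"},
    {"name": "Li Na", "city": "Wuhan", "zip": "430000", "phone": "13900000002"},
    {"name": "Wang Fang", "city": "Zhoukou", "zip": "466001", "phone": "13700000003"},
]
weights = {"city": 0.4, "name": 0.3, "phone": 0.2, "zip": 0.1}

print(detect_duplicates(records, weights))  # [(0, 1)]
```

In this toy data `zip` is in one-to-one correspondence with `city`, so it is dropped before comparison, and records are only compared within the same `city` group. That is the source of the efficiency gain the abstract describes: fewer attributes per comparison and far fewer pairs to compare.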

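Memo

Received: 2014-06-12. Funding: Youth Program of the National Natural Science Foundation of China (61103143); Youth Research Fund of Zhoukou Normal University (zknuc0215). Author biography: YIN Xiuye (born 1984), female, from Xinyang, Henan; teaching assistant, master's degree. Research interest: detection efficiency for big data.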

Last Update: 2014-10-10
