
Journal of Wuhan Institute of Technology [ISSN:1674-2869/CN:42-1779/TQ]

Volume:: 39
Issue:: 2017, No. 04

Pages:: 403-408

Section:: Electromechanical and Information Engineering

Publication date:: 2017-10-14

Article Info

Title:: Improved SNM Algorithm Based on Fuzzy Comprehensive Evaluation and Length Filtering

Article number:: 20170415

Author(s):: GUO Wenlong; DONG Jianhuai; College of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108, China

Keywords:: approximately duplicated records; fuzzy comprehensive evaluation; attribute; length filtering; SNM; algorithm

CLC number:: TP311

DOI:: 10.3969/j.issn.1674-2869.2017.04.015

Document code:: A

Abstract:: To improve the data quality of a database, approximately duplicated records need to be cleaned, and the basic sorted-neighborhood method (SNM) is one of the most commonly used cleaning algorithms. To address the excessive subjectivity of attribute-weight calculation during duplicate detection, this article proposes determining the attribute weights through a fuzzy comprehensive evaluation by multiple users, which judges the importance of each attribute more objectively. On this basis, the algorithm combines the attribute weights to compute the length ratio of two records and uses this ratio to exclude record pairs that cannot be approximate duplicates, which reduces the number of comparisons and improves detection efficiency. Experimental results show that the improved algorithm achieves higher recall, precision, and time efficiency.
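The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the evaluation model or the length-ratio formula, so the grade scale (`GRADE_SCORES`), the weighted min/max length ratio, the `difflib`-based field similarity, and both thresholds are assumptions chosen for illustration, and all function names are hypothetical.

```python
from difflib import SequenceMatcher

# Assumed grade scale: index 0 = "very important" ... index 3 = "unimportant".
GRADE_SCORES = [1.0, 0.75, 0.5, 0.25]


def fuzzy_weights(votes, grade_scores=GRADE_SCORES):
    """Derive attribute weights from several users' grade votes.

    votes[u][i] is the grade index that user u assigned to attribute i.
    """
    n_users, n_attrs = len(votes), len(votes[0])
    scores = []
    for i in range(n_attrs):
        # Membership row: fraction of users who chose each grade for attribute i.
        membership = [sum(1 for user in votes if user[i] == k) / n_users
                      for k in range(len(grade_scores))]
        scores.append(sum(m * s for m, s in zip(membership, grade_scores)))
    total = sum(scores)
    return [s / total for s in scores]  # normalise so the weights sum to 1


def weighted_length_ratio(r1, r2, weights):
    """Weighted sum of per-field (shorter length / longer length) ratios."""
    ratio = 0.0
    for a, b, w in zip(r1, r2, weights):
        longer = max(len(a), len(b))
        ratio += w * (min(len(a), len(b)) / longer if longer else 1.0)
    return ratio


def record_similarity(r1, r2, weights):
    """Weighted field similarity; difflib stands in for the paper's measure."""
    return sum(w * SequenceMatcher(None, a, b).ratio()
               for a, b, w in zip(r1, r2, weights))


def snm_with_length_filter(records, weights, key, window=3,
                           len_threshold=0.6, sim_threshold=0.8):
    """Sort by key, slide a window, skip pairs that fail the length filter."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs, compared = [], 0
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            # Length filter: cheap test before the expensive field comparison.
            if weighted_length_ratio(records[i], records[j], weights) < len_threshold:
                continue
            compared += 1
            if record_similarity(records[i], records[j], weights) >= sim_threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs, compared


if __name__ == "__main__":
    votes = [[0, 1], [0, 2], [1, 2]]  # three users, two attributes
    w = fuzzy_weights(votes)
    data = [("John Smith", "New York"), ("Jon Smith", "New York"), ("J", "NY")]
    print(w, snm_with_length_filter(data, w, key=lambda r: r[0]))
```

In this sketch the length filter runs before the similarity computation, so a pair whose weighted length ratio falls below `len_threshold` never reaches the costlier string comparison; the `compared` counter makes that saving visible.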



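Memo

Memo:: Received: 2017-04-08. Funding: Natural Science Foundation of Fujian Province (2015J01653); Fujian Jiangxia University Young Research Talent Cultivation Fund (JXZ2014011). Author biography: GUO Wenlong, master's degree, associate professor. E-mail: [email protected]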

Last Update: 2017-08-04
