This article was first published at:

https://yishuihancheng.blog.csdn.net/article/details/98854135

You are welcome to read it on my blog. I find the CSDN blog easier to read than this page, both in overall layout and in code formatting.


Here is the definition and introduction of "WordNet", taken from Baidu Baike:

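WordNet is an English dictionary based on cognitive linguistics, jointly designed by psychologists, linguists, and computer engineers at Princeton University. Rather than simply listing words in alphabetical order, it organizes them by meaning into a "network of words". It is a broad-coverage semantic network of the English vocabulary. Nouns, verbs, adjectives, and adverbs are each organized into a network of synonym sets; each synonym set represents one basic semantic concept, and the sets are linked to one another by various relations. WordNet covers concept definitions, synonymy (several words for one meaning), polysemy (several meanings for one word), category membership, near-synonyms, and antonyms. Its basic functions can be tried at http://wordnetweb.princeton.edu/perl/webwn and the official WordNet site is https://wordnet.princeton.edu/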

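Given the nature of WordNet, a natural idea is to use its synonym network to compute the similarity between words. The implementation is simple, and the core idea is to rely on the synonym sets (synsets): look up the synsets of each word, score every pair of synsets with path_similarity, and average the scores. The code is as follows: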

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
__Author__: 沂水寒城
Purpose: word similarity calculation and analysis based on WordNet
'''
import numpy as np
import pandas as pd
from scipy import stats
from nltk.corpus import wordnet as wn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler


def loadData(data='data.csv'):
    '''
    Load the dataset
    '''
    data = pd.read_csv(data)
    # Columns 0 and 1 hold the word pair, column 2 the human-assigned score.
    # Note: iloc[1:] also skips the first data row after the header.
    word_list = np.array(data.iloc[1:, [0, 1]])
    self_sim_res = np.array(data.iloc[1:, [2]])
    return word_list, self_sim_res


def calWordSimilarity(word_list, self_sim_res, res_path='wordnetResult.csv'):
    '''
    Compute word similarity
    '''
    self_sim_matrix = np.zeros((len(self_sim_res), 1))
    for i, word_pair in enumerate(word_list):
        word1, word2 = word_pair
        count = 0
        synsets1 = wn.synsets(word1)
        synsets2 = wn.synsets(word2)
        print('synsets1: ', synsets1)
        print('synsets2: ', synsets2)
        # Accumulate the path similarity over every pair of synsets
        for synset1 in synsets1:
            for synset2 in synsets2:
                score = synset1.path_similarity(synset2)
                if score is not None:
                    self_sim_matrix[i, 0] += score
                    count += 1
        # Average the scores; word pairs with no comparable synsets become NaN
        self_sim_matrix[i, 0] = self_sim_matrix[i, 0] / count if count else np.nan
    # Fill NaN entries with the column mean, then rescale to the range [0, 10]
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    imputer_list = imputer.fit_transform(self_sim_matrix)
    scaler = MinMaxScaler(feature_range=(0.0, 10.0))
    imputer_list_scale = scaler.fit_transform(imputer_list)
    # Spearman rank correlation between the human scores and the WordNet scores
    (coefidence, p_value) = stats.spearmanr(self_sim_res, imputer_list_scale)
    print('coefidence: ', coefidence)
    print('p_value: ', p_value)
    submitData = np.hstack((word_list, self_sim_res, imputer_list_scale))
    (pd.DataFrame(submitData)).to_csv(res_path, index=False,
                                      header=["Word1", "Word2", "originalSim", "wordnetSim"])


if __name__ == '__main__':
    word_list, self_sim_res = loadData(data='data.csv')
    calWordSimilarity(word_list, self_sim_res, res_path='wordnetResult.csv')
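To make the scoring step more concrete: path_similarity is based on the shortest path between two synsets in the hypernym hierarchy, scored as 1 / (1 + path length), and it returns None when no path exists. The following is only a minimal sketch of that idea on a tiny hand-made taxonomy. The HYPERNYM table and the helper names are hypothetical and are not real WordNet data or NLTK functions.

```python
from collections import deque
from itertools import product

# Hypothetical toy taxonomy (child -> parent); NOT real WordNet data.
HYPERNYM = {
    "car": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "entity",
    "horse": "animal",
    "animal": "entity",
}


def neighbors(node):
    """Nodes one edge away from `node`: its children and its parent."""
    linked = [child for child, parent in HYPERNYM.items() if parent == node]
    if node in HYPERNYM:
        linked.append(HYPERNYM[node])
    return linked


def path_similarity(a, b):
    """Return 1 / (1 + shortest path length), or None if no path exists."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1.0 / (1 + dist)
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def average_similarity(senses1, senses2):
    """Average the score over all sense pairs; None if nothing is comparable."""
    scores = [path_similarity(s1, s2) for s1, s2 in product(senses1, senses2)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None


print(path_similarity("car", "bicycle"))  # two edges apart: 1 / (1 + 2)
print(path_similarity("car", "horse"))    # four edges apart: 1 / (1 + 4)
print(average_similarity(["car"], ["bicycle", "horse"]))
```

Averaging over all sense pairs mirrors the nested loop in the script: a word with many unrelated senses tends to pull the average down, which is worth keeping in mind when reading the scores.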

Now let's look at a concrete example.

First, the format of the raw data file is as follows:

[Image: screenshot of the raw data file]

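The first two columns hold the words to be compared, and the last column is a manually assigned initial similarity score, which can simply be set by intuition.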
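As an illustration of this three-column layout, here is a small sketch that writes and reads such a file with the standard csv module. The header names and the scores are hypothetical placeholders, not the author's data; only the column order (word, word, human score) follows the description above.

```python
import csv
import io

# Hypothetical header names and scores, for illustration only.
SAMPLE_ROWS = [
    ("Word1", "Word2", "originalSim"),
    ("house", "horse", 2.0),
    ("car", "bicycle", 7.0),
    ("rain", "wind", 5.0),
]


def write_sample(handle):
    """Write the three-column layout: word, word, human-assigned score."""
    csv.writer(handle).writerows(SAMPLE_ROWS)


def read_pairs(handle):
    """Return (word pairs, human scores), skipping the header row."""
    rows = list(csv.reader(handle))[1:]
    pairs = [(w1, w2) for w1, w2, _ in rows]
    scores = [float(score) for _, _, score in rows]
    return pairs, scores


# Use an in-memory buffer here; a real run would open 'data.csv' instead.
buffer = io.StringIO()
write_sample(buffer)
buffer.seek(0)
pairs, scores = read_pairs(buffer)
print(pairs)
print(scores)
```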

The output of a simple test run is as follows:

synsets1: []
synsets2: [Synset('telephone.n.01'), Synset('phone.n.02'), Synset('earphone.n.01'), Synset('call.v.03')]
synsets1: [Synset('house.n.01'), Synset('firm.n.01'), Synset('house.n.03'), Synset('house.n.04'), Synset('house.n.05'), Synset('house.n.06'), Synset('house.n.07'), Synset('sign_of_the_zodiac.n.01'), Synset('house.n.09'), Synset('family.n.01'), Synset('theater.n.01'), Synset('house.n.12'), Synset('house.v.01'), Synset('house.v.02')]
synsets2: [Synset('horse.n.01'), Synset('horse.n.02'), Synset('cavalry.n.01'), Synset('sawhorse.n.01'), Synset('knight.n.02'), Synset('horse.v.01')]
synsets1: [Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
synsets2: [Synset('bicycle.n.01'), Synset('bicycle.v.01')]
synsets1: [Synset('homo.n.02'), Synset('human.a.01'), Synset('human.a.02'), Synset('human.a.03')]
synsets2: [Synset('woman.n.01'), Synset('woman.n.02'), Synset('charwoman.n.01'), Synset('womanhood.n.02')]
synsets1: [Synset('large.a.01'), Synset('big.s.02'), Synset('bad.s.02'), Synset('big.s.04'), Synset('big.s.05'), Synset('big.s.06'), Synset('boastful.s.01'), Synset('big.s.08'), Synset('adult.s.01'), Synset('big.s.10'), Synset('big.s.11'), Synset('big.s.12'), Synset('big.s.13'), Synset('big.r.01'), Synset('boastfully.r.01'), Synset('big.r.03'), Synset('big.r.04')]
synsets2: [Synset('huge.s.01')]
synsets1: [Synset('rain.n.01'), Synset('rain.n.02'), Synset('rain.n.03'), Synset('rain.v.01')]
synsets2: [Synset('wind.n.01'), Synset('wind.n.02'), Synset('wind.n.03'), Synset('wind.n.04'), Synset('tip.n.03'), Synset('wind_instrument.n.01'), Synset('fart.n.01'), Synset('wind.n.08'), Synset('weave.v.04'), Synset('wind.v.02'), Synset('wind.v.03'), Synset('scent.v.02'), Synset('wind.v.05'), Synset('wreathe.v.03'), Synset('hoist.v.01')]
synsets1: [Synset('spider.n.01'), Synset('spider.n.02'), Synset('spider.n.03')]
synsets2: [Synset('crawl.n.01'), Synset('crawl.n.02'), Synset('crawl.n.03'), Synset('crawl.v.01'), Synset('crawl.v.02'), Synset('crawl.v.03'), Synset('fawn.v.01'), Synset('crawl.v.05')]
synsets1: [Synset('fire.n.01'), Synset('fire.n.02'), Synset('fire.n.03'), Synset('fire.n.04'), Synset('fire.n.05'), Synset('ardor.n.03'), Synset('fire.n.07'), Synset('fire.n.08'), Synset('fire.n.09'), Synset('open_fire.v.01'), Synset('fire.v.02'), Synset('fire.v.03'), Synset('displace.v.03'), Synset('fire.v.05'), Synset('fire.v.06'), Synset('arouse.v.01'), Synset('burn.v.01'), Synset('fuel.v.02')]
synsets2: [Synset('fireman.n.01'), Synset('stoker.n.02'), Synset('reliever.n.03'), Synset('fireman.n.04')]
synsets1: [Synset('flood.n.01'), Synset('flood.n.02'), Synset('flood.n.03'), Synset('flood.n.04'), Synset('flood.n.05'), Synset('flood_tide.n.02'), Synset('deluge.v.01'), Synset('flood.v.02'), Synset('flood.v.03'), Synset('flood.v.04')]
synsets2: [Synset('blood.n.01'), Synset('blood.n.02'), Synset('rake.n.01'), Synset('lineage.n.01'), Synset('blood.n.05'), Synset('blood.v.01')]
coefidence: 0.0
p_value: 1.0
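The coefficient printed at the end of the output comes from scipy's stats.spearmanr, which compares the ranking of the human scores with the ranking of the WordNet scores. As a rough sketch of what Spearman's rank correlation measures (not scipy's implementation, and without the p-value), it can be written as the Pearson correlation of the two rank vectors. The function names below are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; None when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # identical ordering
print(spearman([1, 2, 3], [3, 2, 1]))            # reversed ordering
```

A value near 1 means the two score columns order the word pairs the same way, a value near -1 means the opposite order, and a value near 0 means the rankings are unrelated.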

The result data is as follows:

[Image: screenshot of the result data file]

Compared with the raw data file, there is one extra column at the end, which holds the similarity computed from WordNet.
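Before being written out, the WordNet scores in that last column go through two post-processing steps in the script: missing values are filled with the column mean, and the column is rescaled to the range 0 to 10. The sketch below shows both steps in plain Python. It illustrates the idea rather than scikit-learn's code, and the function names and sample values are hypothetical.

```python
import math


def impute_mean(values):
    """Replace NaN entries with the mean of the non-NaN entries."""
    known = [v for v in values if not math.isnan(v)]
    if not known:
        raise ValueError("cannot impute: every value is NaN")
    mean = sum(known) / len(known)
    return [mean if math.isnan(v) else v for v in values]


def scale_to_range(values, low=0.0, high=10.0):
    """Linearly map values so that the minimum goes to low, the maximum to high."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant column has no spread; this sketch maps it to the lower bound
        return [low for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]


# Hypothetical raw averages; NaN marks a word pair with no comparable synsets.
raw = [0.1, float("nan"), 0.3, 0.2]
filled = impute_mean(raw)
print(filled)
print(scale_to_range(filled))
```

Because the scaling depends on the minimum and maximum of the column, the final 0 to 10 scores are relative to the other word pairs in the same file rather than absolute.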

Later I plan to turn this module into a GUI tool. The related code and data files will be uploaded to my GitHub. Feedback and discussion are welcome!
