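pandas: Counting how often each word appears in a column of strings
Background
A column holds strings, and we want to split each string into words and count how often each word occurs across the whole column.
Sample code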
# -*- coding: utf-8 -*-
# @Time : 2022/2/13 4:18 PM
# @Author : JasonLiu
# @FileName: test.py
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[104472, "R.X. Yah & Co"],
     [104873, "Big Building Society"],
     [109986, "St James's Society"],
     [114058, "The Kensington Society Ltd"],
     [113438, "MMV Oil Associates Ltd"]],
    columns=["URN", "Firm_Name"])

# Method 1: split into a wide DataFrame, stack it back into one Series, then count
result1 = df.Firm_Name.str.split(expand=True).stack().value_counts()
print("Method 1:")
print(result1)
# PS: str.split(expand=True).stack() is a clever option on small data, but it can run out of memory
# on larger data. expand=True builds a DataFrame with as many columns as the longest entry has words,
# and every shorter row is padded with missing values, so a few long strings inflate the whole frame.

# Method 2: split each string in plain Python and concatenate the word lists
print("Method 2:")
result2 = pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()
print(result2)

# Method 3: join everything into one long string, then split once
print("Method 3:")
result3 = pd.Series(" ".join(df.Firm_Name).split()).value_counts()
print(result3)

# Method 4: concatenate with str.cat and count with collections.Counter
print("Method 4:")
temp = df["Firm_Name"].str.cat(sep=" ")
word_count = Counter(temp.split())
print(word_count)

# Method 5: feed each row's word list into a shared Counter
print("Method 5:")
results = Counter()
df["Firm_Name"].str.split().apply(results.update)
print(results)

The output of methods 4 and 5 is as follows:
Method 4:
Counter({'Society': 3, 'Ltd': 2, 'R.X.': 1, 'Yah': 1, '&': 1, 'Co': 1, 'Big': 1, 'Building': 1, 'St': 1, "James's": 1, 'The': 1, 'Kensington': 1, 'MMV': 1, 'Oil': 1, 'Associates': 1})
Method 5:
Counter({'Society': 3, 'Ltd': 2, 'R.X.': 1, 'Yah': 1, '&': 1, 'Co': 1, 'Big': 1, 'Building': 1, 'St': 1, "James's": 1, 'The': 1, 'Kensington': 1, 'MMV': 1, 'Oil': 1, 'Associates': 1})
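The memory remark about method 1 is easier to see with the intermediate shape printed out. The sketch below is not part of the original script. It uses firm names modeled on the example's data and compares the padded frame from split(expand=True) with Series.explode(), which is one possible alternative (it assumes pandas 0.25 or later). The variable names wide, words, and counts are made up for illustration.

```python
import pandas as pd

# Firm names modeled on the example's data.
df = pd.DataFrame(
    [[104472, "R.X. Yah & Co"],
     [104873, "Big Building Society"],
     [109986, "St James's Society"],
     [114058, "The Kensington Society Ltd"],
     [113438, "MMV Oil Associates Ltd"]],
    columns=["URN", "Firm_Name"])

# split(expand=True) pads every row out to the longest word count,
# so the intermediate frame has (rows x max words per row) cells.
wide = df["Firm_Name"].str.split(expand=True)
print(wide.shape)  # (5, 4)

# explode() puts each word on its own row instead,
# so no padding cells are created for the shorter names.
words = df["Firm_Name"].str.split().explode()
counts = words.value_counts()
print(counts.head(2))
```

On this tiny sample the difference is only two padding cells, but with one very long string in a large column the padded frame grows for every row, while the exploded Series only grows by the words that actually exist.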