Python统计纯文本文件中英文单词出现个数的方法总结【测试可用】

yipeiwu_com5年前Python基础

本文实例讲述了Python统计纯文本文件中英文单词出现个数的方法。分享给大家供大家参考,具体如下:

第一版: 效率低

# -*- coding:utf-8 -*-
#!python3
path = 'test.txt'
with open(path,encoding='utf-8',newline='') as f:
  word = []
  words_dict= {}
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace(): #空白字符 空格 \t \n
      if word:
        word = ''.join(word).lower() #转小写
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
#处理最后一个单词
if word:
  word = ''.join(word).lower() # 转小写
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k,v in words_dict.items():
  print(k,v)

运行结果:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1

第二版:

缺点:遇到大文件要一次读入内存,性能不好

# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path,'r',encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list] #转小写
  word_set = set(word_list) #避免重复查询
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # 简洁写法
  words_dict = {word: word_list.count(word) for word in word_set}
  for k,v in words_dict.items():
    print(k,v)

运行结果:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1

第三版:

# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
    #line_words = word_reg.findall(line)
    #比上面的正则更加简单
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list) # 避免重复查询
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

运行结果:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1

第四版:使用Counter统计

# -*- coding:utf-8 -*-
#!python3
import collections
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list)) #使用Counter统计
  for k, v in words_dict.items():
    print(k, v)

运行结果:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1

注:这里使用的测试文本test.txt如下:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.

PS:这里再为大家推荐2款相关统计工具供大家参考:

在线字数统计工具:
http://tools.jb51.net/code/zishutongji

在线字符统计与编辑工具:
http://tools.jb51.net/code/char_tongji

更多关于Python相关内容感兴趣的读者可查看本站专题:《Python文件与目录操作技巧汇总》、《Python文本文件操作技巧汇总》、《Python数据结构与算法教程》、《Python函数使用技巧总结》、《Python字符串操作技巧汇总》及《Python入门与进阶经典教程

希望本文所述对大家Python程序设计有所帮助。

相关文章

用Python的Tornado框架结合memcached页面改善博客性能

原因 Blog是一个更新并不很频繁的一套系统,但是每次刷新页面都要更新数据库反而很浪费资源,添加静态页面生成是一个解决办法,同时缓存是一个更好的主意,可以结合Memcached添加少量的...

python开发之list操作实例分析

本文实例分析了python开发之list操作。分享给大家供大家参考,具体如下: 对python中list的操作,大家可以参考《Python list操作用法总结》 以下是我个人的笔记:...

python进程管理工具supervisor使用实例

python进程管理工具supervisor使用实例

平时我们写个脚本,要放到后台执行去,我们怎么做呢? 复制代码 代码如下: nohup python example.py 2>&1 /dev/null & 用tumx或者scre...

Python微信公众号开发平台

Python微信公众号开发平台

  上大学的时候,对微信公众号开发浅尝辄止的玩了一下,感觉还是挺有意思的。 /post/133677.htm后来服务器到期了,也就搁置了。由于发布web程序,使用PHP很顺手,...

python基于隐马尔可夫模型实现中文拼音输入

python基于隐马尔可夫模型实现中文拼音输入

在网上看到一篇关于隐马尔科夫模型的介绍,觉得简直不能再神奇,又在网上找到大神的一篇关于如何用隐马尔可夫模型实现中文拼音输入的博客,无奈大神没给可以运行的代码,只能纯手动网上找到了结巴分词...