I would like to find the most frequently occurring approximate match in a long string, with the condition that the matched word or phrase also comes from a provided list.
Example:
# provided list
>> jobskill = ["scrum", "customer experience improvement", "python"]
# long string
>> jobtext = ["We are looking for Graduates in our Customer Experience department in Swindon, you will be responsible for improving customer experience and will also be working with the digital team. Send in your application by 31st December 2018",
"If you are ScrumMaster at the top of your game with ability to communicate inspire and take people with you then there could not be a better time, we are the pioneer in digital relationship banking, and we are currently lacking talent in our Scrum team, if you are passionate about Scrum, apply to our Scrum team, knowledge with python is a plus!"]
# write a function that returns most frequent approximate match
>> mostfrequent(input = jobtext, lookup = jobskill)
# desired_output: {"customer experience improvement", "scrum"}
Appreciate any form of help, thank you!
2 solutions
#1
0
I am not familiar with the fuzzywuzzy library you mentioned, but you can improvise with regular expressions. Note that this counts exact, case-insensitive occurrences only, not approximate matches.
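import re

result = {}
for text in jobtext:
    for s in jobskill:
        # re.escape treats the skill as literal text, not as a regex pattern
        check = re.findall(re.escape(s), text, re.IGNORECASE)
        if check:
            # add to any count from earlier texts instead of overwriting it
            result[s] = result.get(s, 0) + len(check)
print(result)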
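The same regex idea can be wrapped in a function that returns the counts directly. This is a minimal sketch, assuming exact case-insensitive matches are good enough; the name `count_exact_matches` and the sample inputs are hypothetical and not part of the original answer.

```python
import re
from collections import Counter


def count_exact_matches(texts, skills):
    """Count case-insensitive literal occurrences of each skill across all texts."""
    counts = Counter()
    for text in texts:
        for skill in skills:
            # re.escape keeps characters such as "+" literal instead of regex syntax
            pattern = re.escape(skill)
            counts[skill] += len(re.findall(pattern, text, re.IGNORECASE))
    return counts


# hypothetical sample data, shorter than the job texts in the question
texts = ["Scrum team, ScrumMaster, scrum", "python and C++"]
skills = ["scrum", "python", "c++"]
counts = count_exact_matches(texts, skills)
print(counts.most_common(1))
```

`Counter.most_common(k)` then gives the k most frequent skills. Because the pattern has no word boundaries, "ScrumMaster" also counts as a hit for "scrum", which may or may not be what you want.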
#2
0
Using fuzzywuzzy
from collections import defaultdict

from fuzzywuzzy import fuzz

# provided list
jobskill = ["scrum", "customer experience improvement", "python"]

# long string
jobtext = [
    "We are looking for Graduates in our Customer Experience department in Swindon, you will be responsible for improving customer experience and will also be working with the digital team. Send in your application by 31st December 2018",
    "If you are ScrumMaster at the top of your game with ability to communicate inspire and take people with you then there could not be a better time, we are the pioneer in digital relationship banking, and we are currently lacking talent in our Scrum team, if you are passionate about Scrum, apply to our Scrum team, knowledge with python is a plus!",
]


def k_most_frequent(k, text, queries, threshold=70):
    """Return k most frequent queries using fuzzywuzzy to match."""
    frequency = defaultdict(int)
    # flatten all texts into one list of whitespace-separated words
    text = " ".join(text).split()
    for query in queries:
        # try every window size from 1 word up to the query's word count
        for window in range(1, len(query.split()) + 1):
            # count the windows whose similarity score exceeds the threshold
            frequency[query] += sum(
                fuzz.ratio(query, " ".join(text[i : i + window])) > threshold
                for i in range(len(text))
            )
    return sorted(frequency.keys(), key=frequency.get, reverse=True)[:k]


print(k_most_frequent(2, jobtext, jobskill))
# output: ["customer experience improvement", "scrum"]
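If installing fuzzywuzzy is not an option, the sliding-window idea can be sketched with the standard library's `difflib.SequenceMatcher` instead. This swaps in a different scorer, so its scores (0 to 1 here) and a sensible threshold are not guaranteed to line up with `fuzz.ratio`. The function name `fuzzy_counts`, the 0.8 threshold, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def fuzzy_counts(texts, queries, threshold=0.8):
    """Count word windows whose similarity to each query is at least threshold."""
    # lowercase and drop punctuation so "Scrum," compares equal to "scrum"
    words = re.findall(r"\w+", " ".join(texts).lower())
    counts = Counter()
    for query in queries:
        target = " ".join(re.findall(r"\w+", query.lower()))
        size = len(target.split())
        if size == 0:
            continue  # skip queries with no word characters
        # slide a window with the same number of words as the query
        for i in range(len(words) - size + 1):
            window = " ".join(words[i:i + size])
            if SequenceMatcher(None, target, window).ratio() >= threshold:
                counts[query] += 1
    return counts


# hypothetical sample text; "scrums" is a near match for "scrum"
sample = ["Scrum scrum, scrums and python"]
sample_counts = fuzzy_counts(sample, ["scrum", "python"])
print(sample_counts.most_common(2))
```

Unlike the answer above, this sketch only compares windows with exactly as many words as the query, which avoids counting the same region several times at different window sizes.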