I was looking for a module, regex, or anything else that might apply to this problem.
我正在寻找可能适用于此问题的模块,正则表达式或其他任何内容。
How can I programatically parse the string and create known English &| Spanish words given that I have a dictionary table against which I can check each permutation of the algorithm's randomization for a match?
我怎样才能以编程方式解析字符串并创建已知的英语和|西班牙语单词,我有一个字典表,我可以检查算法的随机化匹配的每个排列?
Given a group of characters: EBLAIDL KDIOIDSI ADHFWB
给出一组字符:EBLAIDL KDIOIDSI ADHFWB
The program should return: BLADE
AID
KID
KIDS
FIDDLE
HOLA
etc....
该计划应该返回:BLADE AID KID KIDS FIDDLE HOLA等....
I also want to be able to define the minimum & maximum word length as well as the number of syllables
我还希望能够定义最小和最大字长以及音节数
The input length doesn't matter, it must be only letters, and punctuation doesn't matter.
输入长度无关紧要,它必须只是字母,标点符号无关紧要。
Thanks for any help
谢谢你的帮助
EDIT
Letters in the input string can be reused.
输入字符串中的EDIT字母可以重复使用。
For example, if the input is: ABLED
then the output may contain: BALL
or BLEED
例如,如果输入为:ABLED,则输出可能包含:BALL或BLEED
4 个解决方案
#1
4
You haven't specified, so I'm assuming each letter in the input can only be used once.
你没有指定,所以我假设输入中的每个字母只能使用一次。
[You have since specified letters in the input can be used more than once, but I'm going to leave this post here in case someone finds it useful.]
[你已经在输入中指定了字母可以多次使用,但是我将在这里留下这篇文章以防有人发现它有用。]
The key to doing this efficiently is to sort the letters in the words.
有效地做到这一点的关键是对单词中的字母进行排序。
abracadabra => AAAAABBCDRR
abroad => AABDOR
drab => ABDR
Then it becomes clear that "drab" is in "abracadabra".
然后很明显“drab”在“abracadabra”中。
abracadabra => AAAAABBCDRR
drab => A B DR
And that "abroad" isn't.
那“国外”不是。
abracadabra => AAAAABBCD RR
abroad => AA B DOR
Let's call the sorted letter the "signature". Word "B" in is in word "A" if you can remove letters from the signature of "A" to get the signature of "B". That's easy to check using a regex pattern.
让我们将排序后的字母称为“签名”。如果您可以从“A”的签名中删除字母以获得“B”的签名,则单词“B”in在单词“A”中。使用正则表达式模式很容易检查。
sig('drab') =~ /^A?A?A?A?A?B?B?C?D?R?R?\z/
Or if if we eliminate needless backtracking for efficiency, we get
或者,如果我们为了提高效率而消除不必要的回溯,我们就会得到
sig('drab') =~ /^A?+A?+A?+A?+A?+B?+B?+C?+D?+R?+R?+\z/
Now that we know what pattern we want, it's just a matter of building it.
既然我们知道我们想要什么样的模式,那只需要构建它。
use strict;
use warnings;
use feature qw( say );
sub sig { join '', sort grep /^\pL\z/, split //, uc $_[0] }
my $key = shift(@ARGV);
my $pat = sig($key);
$pat =~ s/.\K/?+/sg;
my $re = qr/^(?:$pat)\z/s;
my $shortest = 9**9**9;
my $longest = 0;
my $count = 0;
while (my $word = <>) {
chomp($word);
next if !length($word); # My dictionary starts with a blank line!!
next if sig($word) !~ /$re/;
say $word;
++$count;
$shortest = length($word) if length($word) < $shortest;
$longest = length($word) if length($word) > $longest;
}
say "Words: $count";
if ($count) {
say "Shortest: $shortest";
say "Longest: $longest";
}
Example:
$ perl script.pl EBLAIDL /usr/share/dict/words
A
Abe
Abel
Al
...
libel
lid
lie
lied
Words: 117
Shortest: 1
Longest: 6
#2
3
Well, the regexp is fairly easy... Then you just need to iterate through the words in the dictionary. EG, assuming a standard linux:
好吧,正则表达式相当容易......然后你只需要遍历字典中的单词。 EG,假设一个标准的linux:
# perl -n -e 'print if (/^[EBLAIDL]+$/);' /usr/share/dict/words
Will quickly return all the words in that file containing those and only those letters.
将快速返回包含那些字母的文件中的所有单词。
A
AA
AAA
AAAA
AAAAAA
AAAL
AAE
AAEE
AAII
AB
...
As you can see, though, you need a dictionary file that is worth having. In particular, /usr/share/dict/words on my Fedora system contains a bunch of words with all As which may or may not be something you want. So pick your dictionary file carefully.
但是,正如您所看到的,您需要一个值得拥有的字典文件。特别是,我的Fedora系统上的/ usr / share / dict / words包含一堆带有所有As的单词,可能是也可能不是你想要的东西。所以仔细选择你的字典文件。
For min a max length, you can quickly get that as well:
对于最小长度,你也可以很快得到它:
$min = 9999;
$max = -1;
while(<>) {
if (/[EBLAIDL]+$/) {
print;
chomp;
if (length($_) > $max) {
$max = length($_);
$maxword = $_;
}
if (length($_) < $min) {
$min = length($_);
$minword = $_;
}
}
}
print "longest: $maxword\n";
print "shortest: $minword\n";
Will produce:
ZI
ZMRI
ZWEI
longest: TANSTAAFL
shortest: A
For breaking words into pieces and counting the syllables is very language specific, as has been mentioned in the comments above.
将单词分成几部分并计算音节是非常特定于语言的,正如上面的评论中所提到的那样。
#3
1
The only way I can imagine this would work would be to parse through all possible combinations of letters, and compare them against the dictionary. The fastest way to compare them against a dictionary is to turn that dictionary into a hash. That way, you can quickly look up whether the word was a valid word.
我能想象的唯一方法就是解析所有可能的字母组合,并将它们与字典进行比较。将它们与字典进行比较的最快方法是将该字典转换为哈希。这样,您可以快速查找该单词是否是有效单词。
I key my dictionary by lower casing all letters in the dictionary word and then removing any non-alpha characters just to be on the safe side. For the value, I'll store the actual dictionary word. For example:
我通过下限字母词中的所有字母来键入我的字典,然后删除任何非字母字符只是为了安全起见。对于该值,我将存储实际的字典单词。例如:
cant => "can't",
google => "Google",
That way, I can display the correctly spelled word.
这样,我可以显示拼写正确的单词。
I found Math::Combinatorics which looked pretty good, but wasn't quite working the way I hoped. You give it a list of letters, and it will return all combinations of those letters in the number of letters you specify. Thus, I thought all I had to do was convert the letters into a list of individual letters, and simply loop through all possible combinations!
我发现Math :: Combinatorics看起来很不错,但并不像我希望的那样工作。你给它一个字母列表,它将返回你指定的字母数的所有字母组合。因此,我认为我所要做的就是将字母转换成单个字母列表,然后简单地遍历所有可能的组合!
No... That gives me all unordered combinations. What I then had to do was with each combination, list all possible permutations of those letters. Blah! Ptooy! Yech!
不......这给了我所有无序的组合。然后我必须做的是每个组合,列出这些字母的所有可能的排列。胡说! Ptooy! Yech!
So, the infamous looping in a loop. Actually, three loops. * The outer loop simply count down all numbers of combinations from 1 to the number of letters in the word. * The next finds all unordered combinations of each of those letter groups. * Finally, the last one takes all unordered combinations and returns a list of permutations from those combinations.
所以,循环中臭名昭着的循环。实际上,三个循环。 *外部循环简单地倒计数从1到字中字母数的所有组合数。 *下一个查找每个字母组的所有无序组合。 *最后,最后一个采用所有无序组合并返回这些组合的排列列表。
Now, I can finally take those permutations of letters and compare it against my dictionary of words. Surprisingly, the program ran much faster than I expected considering it had to turn a 235,886 word dictionary into a hash, then loop through a triple decker loop to find all permutations of all combinations of all possible number of letters. The whole program ran in less than two seconds.
现在,我终于可以采用那些字母排列,并将其与我的词典进行比较。令人惊讶的是,程序运行速度比我预期的要快得多,因为它必须将235,886字的字典转换为哈希,然后循环通过三层环路来查找所有可能数量字母的所有组合的所有排列。整个程序在不到两秒的时间内完成。
#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);
use autodie;
use Data::Dumper;
use Math::Combinatorics;
use constant {
LETTERS => "EBLAIDL",
DICTIONARY => "/usr/share/dict/words",
};
#
# Create Dictionary Hash
#
open my $dict_fh, "<", DICTIONARY;
my %dictionary;
foreach my $word (<$dict_fh>) {
chomp $word;
(my $key = $word) =~ s/[^[:alpha:]]//;
$dictionary{lc $key} = $word;
}
#
# Now take the letters and create a Perl list of them.
#
my @letter_list = split // => LETTERS;
my %valid_word_hash;
#
# Outer Loop: This is a range from one letter combinations to the
# maximum letters combination
#
foreach my $num_of_letters (1..scalar @letter_list) {
#
# Now we generate a reference to a list of lists of all letter
# combinations of $num_of_letters long. From there, we need to
# take the Permutations of all those letters.
#
foreach my $letter_list_ref (combine($num_of_letters, @letter_list)) {
my @letter_list = @{$letter_list_ref};
# For each combination of letters $num_of_letters long,
# we now generate a permeation of all of those letter
# combinations.
#
foreach my $word_letters_ref (permute(@letter_list)) {
my $word = join "" => @{$word_letters_ref};
#
# This $word is just a possible candidate for a word.
# We now have to compare it to the words in the dictionary
# to verify it's a word
#
$word = lc $word;
if (exists $dictionary{$word}) {
my $dictionary_word = $dictionary{$word};
$valid_word_hash{$word} = $dictionary_word;
}
}
}
}
#
# I got lazy here... Just dumping out the list of actual words.
# You need to go through this list to find your longest and
# shortest words. Number of syllables? That's trickier, you could
# see if you can divide on CVC and CVVC divides where C = consonant
# and V = vowel.
#
say join "\n", sort keys %valid_word_hash;
Running this program produced:
运行此程序产生:
$ ./test.pl | column
a al balei bile del i lai
ab alb bali bill delia iba laid
abdiel albe ball billa dell ibad lea
abe albi balled billed della id lead
abed ale balli blad di ida leal
abel alible be blade dial ide led
abide all bea blae dib idea leda
abie alle bead d die ideal lei
able allie beal da dieb idle leila
ad allied bed dab dill ie lelia
ade b beid dae e ila li
adib ba bel dail ea ill liable
adiel bad bela dal ed l libel
ae bade beld dale el la lid
ai bae belial dali elb lab lida
aid bail bell dalle eld label lide
aide bal bella de eli labile lie
aiel bald bid deal elia lad lied
ail baldie bide deb ell lade lila
aile bale bield debi ella ladle lile
#4
1
Maybe it would help if you create a separate table with the 26 letters of the alphabet. Than, you would build a query that will search on the second database for any letter you defined. It is important that the query assures that each result is unique.
如果您使用字母表中的26个字母创建一个单独的表,这可能会有所帮助。然后,您将构建一个查询,在第二个数据库中搜索您定义的任何字母。重要的是查询确保每个结果都是唯一的。
So, you have a table that contains your words, and you have a relation of many to many to another table that contains all the letters of the alphabets. And you would query on this second table and make the results unique. You could have a similar approach to the number of the letters.
因此,您有一个包含您的单词的表,并且您与包含字母表的所有字母的另一个表具有多对多的关系。您将在第二个表上查询并使结果唯一。您可以对字母数量采用类似的方法。
You could use the same approach for the number of letters and syllables. So you would make one query that would be joining all the information you want. Put the right indexes on the database to help performance, make use of appropriate caching and, if it comes to that, you can parallelize searches.
您可以对字母和音节的数量使用相同的方法。因此,您将创建一个可以加入所需信息的查询。在数据库上放置正确的索引以帮助提高性能,使用适当的缓存,如果是这样,您可以并行化搜索。
#1
4
You haven't specified, so I'm assuming each letter in the input can only be used once.
你没有指定,所以我假设输入中的每个字母只能使用一次。
[You have since specified letters in the input can be used more than once, but I'm going to leave this post here in case someone finds it useful.]
[你已经在输入中指定了字母可以多次使用,但是我将在这里留下这篇文章以防有人发现它有用。]
The key to doing this efficiently is to sort the letters in the words.
有效地做到这一点的关键是对单词中的字母进行排序。
abracadabra => AAAAABBCDRR
abroad => AABDOR
drab => ABDR
Then it becomes clear that "drab" is in "abracadabra".
然后很明显“drab”在“abracadabra”中。
abracadabra => AAAAABBCDRR
drab => A B DR
And that "abroad" isn't.
那“国外”不是。
abracadabra => AAAAABBCD RR
abroad => AA B DOR
Let's call the sorted letter the "signature". Word "B" in is in word "A" if you can remove letters from the signature of "A" to get the signature of "B". That's easy to check using a regex pattern.
让我们将排序后的字母称为“签名”。如果您可以从“A”的签名中删除字母以获得“B”的签名,则单词“B”in在单词“A”中。使用正则表达式模式很容易检查。
sig('drab') =~ /^A?A?A?A?A?B?B?C?D?R?R?\z/
Or if if we eliminate needless backtracking for efficiency, we get
或者,如果我们为了提高效率而消除不必要的回溯,我们就会得到
sig('drab') =~ /^A?+A?+A?+A?+A?+B?+B?+C?+D?+R?+R?+\z/
Now that we know what pattern we want, it's just a matter of building it.
既然我们知道我们想要什么样的模式,那只需要构建它。
use strict;
use warnings;
use feature qw( say );
sub sig { join '', sort grep /^\pL\z/, split //, uc $_[0] }
my $key = shift(@ARGV);
my $pat = sig($key);
$pat =~ s/.\K/?+/sg;
my $re = qr/^(?:$pat)\z/s;
my $shortest = 9**9**9;
my $longest = 0;
my $count = 0;
while (my $word = <>) {
chomp($word);
next if !length($word); # My dictionary starts with a blank line!!
next if sig($word) !~ /$re/;
say $word;
++$count;
$shortest = length($word) if length($word) < $shortest;
$longest = length($word) if length($word) > $longest;
}
say "Words: $count";
if ($count) {
say "Shortest: $shortest";
say "Longest: $longest";
}
Example:
$ perl script.pl EBLAIDL /usr/share/dict/words
A
Abe
Abel
Al
...
libel
lid
lie
lied
Words: 117
Shortest: 1
Longest: 6
#2
3
Well, the regexp is fairly easy... Then you just need to iterate through the words in the dictionary. EG, assuming a standard linux:
好吧,正则表达式相当容易......然后你只需要遍历字典中的单词。 EG,假设一个标准的linux:
# perl -n -e 'print if (/^[EBLAIDL]+$/);' /usr/share/dict/words
Will quickly return all the words in that file containing those and only those letters.
将快速返回包含那些字母的文件中的所有单词。
A
AA
AAA
AAAA
AAAAAA
AAAL
AAE
AAEE
AAII
AB
...
As you can see, though, you need a dictionary file that is worth having. In particular, /usr/share/dict/words on my Fedora system contains a bunch of words with all As which may or may not be something you want. So pick your dictionary file carefully.
但是,正如您所看到的,您需要一个值得拥有的字典文件。特别是,我的Fedora系统上的/ usr / share / dict / words包含一堆带有所有As的单词,可能是也可能不是你想要的东西。所以仔细选择你的字典文件。
For min a max length, you can quickly get that as well:
对于最小长度,你也可以很快得到它:
$min = 9999;
$max = -1;
while(<>) {
if (/[EBLAIDL]+$/) {
print;
chomp;
if (length($_) > $max) {
$max = length($_);
$maxword = $_;
}
if (length($_) < $min) {
$min = length($_);
$minword = $_;
}
}
}
print "longest: $maxword\n";
print "shortest: $minword\n";
Will produce:
ZI
ZMRI
ZWEI
longest: TANSTAAFL
shortest: A
For breaking words into pieces and counting the syllables is very language specific, as has been mentioned in the comments above.
将单词分成几部分并计算音节是非常特定于语言的,正如上面的评论中所提到的那样。
#3
1
The only way I can imagine this would work would be to parse through all possible combinations of letters, and compare them against the dictionary. The fastest way to compare them against a dictionary is to turn that dictionary into a hash. That way, you can quickly look up whether the word was a valid word.
我能想象的唯一方法就是解析所有可能的字母组合,并将它们与字典进行比较。将它们与字典进行比较的最快方法是将该字典转换为哈希。这样,您可以快速查找该单词是否是有效单词。
I key my dictionary by lower casing all letters in the dictionary word and then removing any non-alpha characters just to be on the safe side. For the value, I'll store the actual dictionary word. For example:
我通过下限字母词中的所有字母来键入我的字典,然后删除任何非字母字符只是为了安全起见。对于该值,我将存储实际的字典单词。例如:
cant => "can't",
google => "Google",
That way, I can display the correctly spelled word.
这样,我可以显示拼写正确的单词。
I found Math::Combinatorics which looked pretty good, but wasn't quite working the way I hoped. You give it a list of letters, and it will return all combinations of those letters in the number of letters you specify. Thus, I thought all I had to do was convert the letters into a list of individual letters, and simply loop through all possible combinations!
我发现Math :: Combinatorics看起来很不错,但并不像我希望的那样工作。你给它一个字母列表,它将返回你指定的字母数的所有字母组合。因此,我认为我所要做的就是将字母转换成单个字母列表,然后简单地遍历所有可能的组合!
No... That gives me all unordered combinations. What I then had to do was with each combination, list all possible permutations of those letters. Blah! Ptooy! Yech!
不......这给了我所有无序的组合。然后我必须做的是每个组合,列出这些字母的所有可能的排列。胡说! Ptooy! Yech!
So, the infamous looping in a loop. Actually, three loops. * The outer loop simply count down all numbers of combinations from 1 to the number of letters in the word. * The next finds all unordered combinations of each of those letter groups. * Finally, the last one takes all unordered combinations and returns a list of permutations from those combinations.
所以,循环中臭名昭着的循环。实际上,三个循环。 *外部循环简单地倒计数从1到字中字母数的所有组合数。 *下一个查找每个字母组的所有无序组合。 *最后,最后一个采用所有无序组合并返回这些组合的排列列表。
Now, I can finally take those permutations of letters and compare it against my dictionary of words. Surprisingly, the program ran much faster than I expected considering it had to turn a 235,886 word dictionary into a hash, then loop through a triple decker loop to find all permutations of all combinations of all possible number of letters. The whole program ran in less than two seconds.
现在,我终于可以采用那些字母排列,并将其与我的词典进行比较。令人惊讶的是,程序运行速度比我预期的要快得多,因为它必须将235,886字的字典转换为哈希,然后循环通过三层环路来查找所有可能数量字母的所有组合的所有排列。整个程序在不到两秒的时间内完成。
#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);
use autodie;
use Data::Dumper;
use Math::Combinatorics;
use constant {
LETTERS => "EBLAIDL",
DICTIONARY => "/usr/share/dict/words",
};
#
# Create Dictionary Hash
#
open my $dict_fh, "<", DICTIONARY;
my %dictionary;
foreach my $word (<$dict_fh>) {
chomp $word;
(my $key = $word) =~ s/[^[:alpha:]]//;
$dictionary{lc $key} = $word;
}
#
# Now take the letters and create a Perl list of them.
#
my @letter_list = split // => LETTERS;
my %valid_word_hash;
#
# Outer Loop: This is a range from one letter combinations to the
# maximum letters combination
#
foreach my $num_of_letters (1..scalar @letter_list) {
#
# Now we generate a reference to a list of lists of all letter
# combinations of $num_of_letters long. From there, we need to
# take the Permutations of all those letters.
#
foreach my $letter_list_ref (combine($num_of_letters, @letter_list)) {
my @letter_list = @{$letter_list_ref};
# For each combination of letters $num_of_letters long,
# we now generate a permeation of all of those letter
# combinations.
#
foreach my $word_letters_ref (permute(@letter_list)) {
my $word = join "" => @{$word_letters_ref};
#
# This $word is just a possible candidate for a word.
# We now have to compare it to the words in the dictionary
# to verify it's a word
#
$word = lc $word;
if (exists $dictionary{$word}) {
my $dictionary_word = $dictionary{$word};
$valid_word_hash{$word} = $dictionary_word;
}
}
}
}
#
# I got lazy here... Just dumping out the list of actual words.
# You need to go through this list to find your longest and
# shortest words. Number of syllables? That's trickier, you could
# see if you can divide on CVC and CVVC divides where C = consonant
# and V = vowel.
#
say join "\n", sort keys %valid_word_hash;
Running this program produced:
运行此程序产生:
$ ./test.pl | column
a al balei bile del i lai
ab alb bali bill delia iba laid
abdiel albe ball billa dell ibad lea
abe albi balled billed della id lead
abed ale balli blad di ida leal
abel alible be blade dial ide led
abide all bea blae dib idea leda
abie alle bead d die ideal lei
able allie beal da dieb idle leila
ad allied bed dab dill ie lelia
ade b beid dae e ila li
adib ba bel dail ea ill liable
adiel bad bela dal ed l libel
ae bade beld dale el la lid
ai bae belial dali elb lab lida
aid bail bell dalle eld label lide
aide bal bella de eli labile lie
aiel bald bid deal elia lad lied
ail baldie bide deb ell lade lila
aile bale bield debi ella ladle lile
#4
1
Maybe it would help if you create a separate table with the 26 letters of the alphabet. Than, you would build a query that will search on the second database for any letter you defined. It is important that the query assures that each result is unique.
如果您使用字母表中的26个字母创建一个单独的表,这可能会有所帮助。然后,您将构建一个查询,在第二个数据库中搜索您定义的任何字母。重要的是查询确保每个结果都是唯一的。
So, you have a table that contains your words, and you have a relation of many to many to another table that contains all the letters of the alphabets. And you would query on this second table and make the results unique. You could have a similar approach to the number of the letters.
因此,您有一个包含您的单词的表,并且您与包含字母表的所有字母的另一个表具有多对多的关系。您将在第二个表上查询并使结果唯一。您可以对字母数量采用类似的方法。
You could use the same approach for the number of letters and syllables. So you would make one query that would be joining all the information you want. Put the right indexes on the database to help performance, make use of appropriate caching and, if it comes to that, you can parallelize searches.
您可以对字母和音节的数量使用相同的方法。因此,您将创建一个可以加入所需信息的查询。在数据库上放置正确的索引以帮助提高性能,使用适当的缓存,如果是这样,您可以并行化搜索。