Map string positions to list positions

I begin with a list of words like ["ONE","TWO","THREE","FOUR"].

Later, I join the list to make a string: "ONETWOTHREEFOUR". I do some stuff while looking at this string and get a list of indices, say [6,7,8,0,4] (which maps onto that string to give me the word "THROW", though as pointed out in comments that's irrelevant to my question).

Now I want to know which items from the original list gave me the letters I am using to make my word. I know I used letters [6,7,8,0,4] from the joined string. Based on that list of string indices, I want the output {0,1,2}, because I used letters from every word in the original list except "FOUR".

What I've tried so far:

wordlist = ["ONE","TWO","THREE","FOUR"]
stringpositions = [6,7,8,0,4]
wordlengths = tuple(len(w) for w in wordlist) #->(3, 3, 5, 4)
wordstarts = tuple(sum(wordlengths[:i]) for i in range(len(wordlengths))) #->(0, 3, 6, 11)

words_used = set()
for pos in stringpositions:
    prev = 0
    for wordnumber,wordstart in enumerate(wordstarts):
        if pos < wordstart:
            words_used.add(prev)
            break
        prev = wordnumber
    else:
        # no later word start exceeds pos, so pos falls in the last word
        words_used.add(prev)

It seems awfully long-winded. What's the best (and/or most Pythonic) way for me to do this?

2 solutions

#1

As clarified in the comments, the OP's goal is to figure out which words were used based on which string positions were used, rather than which letters were used -- so the word/substring THROW is basically irrelevant.

Here's a very short version:

from itertools import chain

wordlist = ["ONE","TWO","THREE","FOUR"]
string = ''.join(wordlist) # "ONETWOTHREEFOUR"
stringpositions = [6,7,8,0,4]

# construct a list that maps every position in string to a single source word
which_word = list(chain.from_iterable( [ii]*len(w) for ii, w in enumerate(wordlist) ))

# it's now trivial to use which_word to construct the set of words 
# represented in the list stringpositions
words_used = set( which_word[pos] for pos in stringpositions )

print("which_word=", which_word)
print("words_used=", words_used)

==>

which_word= [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3]
words_used= {0, 1, 2}

EDIT: Updated to use list(itertools.chain(generator)) rather than sum(generator, []) as suggested by @inspectorG4dget in the comments.
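
For reference, here is a small sketch comparing the two flattening approaches named in the edit note. It assumes `itertools.chain.from_iterable` is the intended call: `chain` given a single iterable of lists yields those lists unchanged, while `chain.from_iterable` yields their elements. The name `pieces` is only for illustration.

```python
from itertools import chain

wordlist = ["ONE", "TWO", "THREE", "FOUR"]

# one sublist per word, e.g. [0, 0, 0] for "ONE" (illustrative name)
pieces = [[ii] * len(w) for ii, w in enumerate(wordlist)]

# flattens in a single pass over all elements
flat_chain = list(chain.from_iterable(pieces))

# gives the same result, but builds a new list for every addition
flat_sum = sum(pieces, [])

# chain() with one argument iterates over the sublists themselves
not_flat = list(chain(pieces))

print(flat_chain == flat_sum)
print(not_flat == pieces)
```

Both flattened lists are identical; `sum(..., [])` just does more copying as the number of words grows.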

#2

Here's the easiest way. If you want to be more space-efficient, you might want to use some sort of binary search tree.

wordlist = ["ONE","TWO","THREE","FOUR"]
top = 0
inds = {}
for i,word in enumerate(wordlist):
    for k in range(top, top+len(word)):
        inds[k] = i
    top += len(word)

#do some magic
L = [6,7,8,0,4]
for i in L: print(inds[i])

Output:

2
2
2
0
1

You could of course call set() on the output if you wanted to.
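
One way to read the space-efficient suggestion is sketched below. It is an assumption on my part, not the answer's own code: instead of a tree it keeps a sorted list of word start offsets and binary-searches it with the standard `bisect` module, so it stores one integer per word rather than one dictionary entry per character.

```python
from bisect import bisect_right
from itertools import accumulate

wordlist = ["ONE", "TWO", "THREE", "FOUR"]
stringpositions = [6, 7, 8, 0, 4]

# start offset of each word in the joined string: [0, 3, 6, 11]
wordstarts = [0] + list(accumulate(len(w) for w in wordlist))[:-1]

# bisect_right counts the starts that are <= pos, so subtracting 1
# gives the index of the word containing that position
words_used = {bisect_right(wordstarts, pos) - 1 for pos in stringpositions}

print(words_used)
```

Each lookup costs O(log n) in the number of words rather than O(1), which is the trade-off for the smaller table.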
