How do I run a dictionary search against a large text file?

We're in the final stages of shipping our console game. On the Wii we're having the most problems with memory of course, so we're busy hunting down sloppy coding, packing bits, and so on.

I've done a dump of memory and used strings.exe (from sysinternals) to analyze it, but it's coming up with a lot of gunk like this:

''''$$$$    %%%%
''''$$$$%%%%####&&&&
''''$$$$((((!!!!$$$$''''((((####%%%%$$$$####((((
''))++.-$$%&''))
'')*>BZf8<S]^kgu[faniwkzgukzkzkz
'',,..EDCCEEONNL

I'm more interested in strings like this:

wood_wide_end.bmp
restroom_stonewall.bmp

...which mean we're still embedding some kinds of strings that need to be converted to ID's.

So my question is: what are some good ways of finding the stuff that's likely our debug data that we can eliminate?

I can do some rx's to hack off symbols or just search for certain kinds of strings. But what I'd really like to do is get a hold of a standard dictionary file and search my strings file against that. Seems slow if I were to build a big rx with aardvaark|alimony|archetype etc. Or will that work well enough if I do a .NET compiled rx assembly for it?
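
For reference, the "one big rx" idea can be sketched in Python rather than .NET. This is only an illustration: the function names and the three-word sample list are assumptions, and a real dictionary would have tens of thousands of entries, where a hash-set lookup is usually the simpler choice.

```python
import re

def build_word_regex(words):
    # Put longer words first so they win over their own prefixes
    # in the alternation. Assumes `words` is non-empty.
    ordered = sorted(set(words), key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in ordered), re.IGNORECASE)

def lines_with_words(lines, pattern):
    # Keep every line in which at least one dictionary word occurs.
    return [line for line in lines if pattern.search(line)]

sample_words = ["wood", "stonewall", "restroom"]  # stand-in for a real word list
pattern = build_word_regex(sample_words)
hits = lines_with_words(
    ["wood_wide_end.bmp", "''''$$$$%%%%", "restroom_stonewall.bmp"],
    pattern,
)
print(hits)
```

Because the pattern matches substrings, short dictionary words will also hit inside random junk, so some minimum word length is probably worth adding.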

Looking for other ideas about how to find stuff we want to eliminate as well. Quick and dirty solutions, don't need elegant. Thanks!

2 solutions

#1

First, I'd get a good word list. This NPL page has a good list of word lists of varying sizes and sources. What I would do is build a hash table of all the words in the word list, and then test each word that is output by strings against the word list. This is pretty easy to do in Python:

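import sys

# Load the word list into a set so each membership test is a hash lookup.
with open('your-word-list') as dictfile:
    wordlist = frozenset(word.strip() for word in dictfile)

for line in sys.stdin:
    # if any word in the line is in our list, print out the whole line
    for word in line.split():
        if word in wordlist:
            # line already ends with a newline, so strip it before printing
            print(line.rstrip('\n'))
            break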

Then use it like this:

strings myexecutable.elf | python myscript.py
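
One caveat: splitting on whitespace treats wood_wide_end.bmp as a single token, so it would never match a dictionary entry. A possible variant, sketched below, splits on anything that isn't a letter instead. The function name, the minimum length of 4, and the tiny sample word list are all assumptions for illustration.

```python
import re

MIN_WORD_LEN = 4  # assumption: shorter matches are mostly noise

def find_wordy_lines(lines, wordlist, min_len=MIN_WORD_LEN):
    """Return lines containing at least one dictionary word of min_len or more."""
    hits = []
    for line in lines:
        # Split on runs of non-letters so "wood_wide_end.bmp"
        # yields "wood", "wide", "end", "bmp".
        tokens = re.split(r"[^A-Za-z]+", line.lower())
        if any(len(t) >= min_len and t in wordlist for t in tokens):
            hits.append(line.rstrip("\n"))
    return hits

words = frozenset(["wood", "wide", "restroom", "stonewall"])  # stand-in word list
dump = ["wood_wide_end.bmp\n", "'',,..EDCCEEONNL\n", "restroom_stonewall.bmp\n"]
print(find_wordy_lines(dump, words))
```

The word list is expected to be lowercase here, since each line is lowercased before it is split.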

However, I think you're focusing your attention in the wrong place. Eliminating debug strings has very diminishing returns. Although eliminating debugging data is a Technical Certification Requirement that Nintendo requires you to do, I don't think they'll bounce you for having a couple of extra strings in your ELF.

Use a profiler and try to identify where you're using the most memory. Chances are, there will be a way to save huge amounts of memory with little effort if you focus your energy in the right place.

#2

This sounds like an ideal task for a quick-and-dirty script in something that supports regexes. I'd probably do something in Python real quick if it was me.

Here's how I would proceed: Every time you encounter a string (from the strings.exe output), prompt the user as to whether they'd like to remember it in the dictionary or permanently ignore it. If the user chooses to permanently ignore the string, then the next time it's encountered, throw it away without prompting. You can optionally keep an anti-dictionary file around to remember this for future runs of your script. Build up the dictionary file, and for each string keep a count or any other info about it you'd like. Optionally sort by the number of times the string occurs, so you can focus on the most egregious offenders.
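
A rough sketch of that loop in Python might look like the following. Every name here (triage, load_set, save_set, the ask callback) is made up for illustration; the demo swaps the interactive prompt for a simple rule so it can run unattended.

```python
from collections import Counter

def load_set(path):
    """Read one string per line; a missing file just means an empty set."""
    try:
        with open(path) as f:
            return {line.rstrip("\n") for line in f if line.strip()}
    except FileNotFoundError:
        return set()

def save_set(path, items):
    """Write the set back out, one string per line, for the next run."""
    with open(path, "w") as f:
        f.writelines(item + "\n" for item in sorted(items))

def triage(strings, keep, ignore, ask):
    """Ask once per unseen string, then report kept strings by frequency."""
    counts = Counter(s.strip() for s in strings if s.strip())
    for s, _ in counts.most_common():
        if s in keep or s in ignore:
            continue  # already decided on an earlier run
        (keep if ask(s) else ignore).add(s)
    return [(s, n) for s, n in counts.most_common() if s in keep]

# Demo with a stand-in rule; interactively, ask could be something like
# lambda s: input("keep %r? [y/n] " % s).lower().startswith("y")
keep, ignore = set(), {"''''$$$$%%%%"}
found = ["wood_wide_end.bmp", "''''$$$$%%%%", "wood_wide_end.bmp",
         "restroom_stonewall.bmp", "kzkzkz"]
report = triage(found, keep, ignore, ask=lambda s: s.endswith(".bmp"))
print(report)
```

Between runs, keep and ignore would be loaded with load_set and written back with save_set, so the ignore file plays the role of the anti-dictionary.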

This sounds like an ideal task for learning a scripting language. I wouldn't bother messing with C#/C++ or anything real fancy to implement this.
