C++ - Is std::string suitable for storing large text files, and if not, what is the best data type for doing so?

Time: 2020-12-25 16:56:38

I was just wondering, what is the best data type for storing the contents of a text-based file? Is std::string suitable for keeping the contents of a larger file in memory?

I'm making an editor of sorts right now, so I'd like to know. I can't seem to find a good answer.

Edit: Yeah, this was a very vague question and I didn't expect it to get quite as much attention. Calling it an editor is kind of a bad description. I was just wondering how to store read-only text in memory, and whether std::string is a bad or inefficient way to do so.

3 answers

#1


3  

The "editor of sorts" is probably the important thing: if the text were read-only, you could consider using mmap. I don't have enough experience with memory-mapped files to know if they're appropriate for text editors, however.

There are data structures better suited to modifying large chunks of text. A rope is a binary tree with short text strings at the leaf nodes. An operation such as inserting or appending some text might cause a leaf node to be split and the new text added in a new right-hand node. This has the advantage that existing strings don't always need to be repeatedly moved or grown as the text document is modified.
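
To make the idea concrete, here is a deliberately small rope sketch. All names (`RopeNode`, `leaf`, `concat`, `split`, `rope_insert`, `char_at`, `flatten`) are made up for illustration, and the tree is never rebalanced, which a real implementation would need to do.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

struct RopeNode;
using Rope = std::shared_ptr<RopeNode>;

struct RopeNode {
    Rope left, right;    // both null for a leaf
    std::string text;    // used only by leaves
    std::size_t length;  // total characters under this node
};

Rope leaf(std::string s) {
    std::size_t n = s.size();
    return std::make_shared<RopeNode>(RopeNode{nullptr, nullptr, std::move(s), n});
}

// Joining two ropes only allocates one new internal node; no text is copied.
Rope concat(Rope a, Rope b) {
    if (!a) return b;
    if (!b) return a;
    std::size_t n = a->length + b->length;
    return std::make_shared<RopeNode>(RopeNode{std::move(a), std::move(b), std::string(), n});
}

// Split into [0, pos) and [pos, length). Assumes pos <= r->length.
std::pair<Rope, Rope> split(const Rope& r, std::size_t pos) {
    if (!r) return std::make_pair(Rope(), Rope());
    if (!r->left) {  // leaf: only this one short string is copied
        return std::make_pair(leaf(r->text.substr(0, pos)), leaf(r->text.substr(pos)));
    }
    std::size_t left_len = r->left->length;
    if (pos <= left_len) {
        std::pair<Rope, Rope> parts = split(r->left, pos);
        return std::make_pair(parts.first, concat(parts.second, r->right));
    }
    std::pair<Rope, Rope> parts = split(r->right, pos - left_len);
    return std::make_pair(concat(r->left, parts.first), parts.second);
}

// Insert s at position pos; the original rope is left untouched.
Rope rope_insert(const Rope& r, std::size_t pos, std::string s) {
    std::pair<Rope, Rope> parts = split(r, pos);
    return concat(concat(parts.first, leaf(std::move(s))), parts.second);
}

// Character lookup walks down the tree. Assumes r is non-null and i < r->length.
char char_at(const Rope& r, std::size_t i) {
    const RopeNode* node = r.get();
    while (node->left) {  // internal nodes always have two children
        if (i < node->left->length) {
            node = node->left.get();
        } else {
            i -= node->left->length;
            node = node->right.get();
        }
    }
    return node->text[i];
}

// Append the whole text to out, in order.
void flatten(const Rope& r, std::string& out) {
    if (!r) return;
    if (!r->left) { out += r->text; return; }
    flatten(r->left, out);
    flatten(r->right, out);
}
```

An insertion copies at most one leaf's text and shares every other node with the old tree. That is why edits stay cheap even on very large documents.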

Another alternative is a simpler structure called a gap buffer. This effectively uses three strings to hold your text: a prefix, a postfix and a pre-sized gap. When the user starts work on a section of text, the document is split into the prefix and postfix strings, and a new gap buffer is allocated. The text the user adds is pushed into the gap buffer, which may be expanded as needed. When they move on to a different point in the document, the gap buffer is merged with the other strings and a new gap is created. The assumption here is that most of the document will be static, with most edits occurring around a specific location in the document at any given time, minimising string copies, moves and reallocations.
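
A rough sketch of a gap buffer follows. Note that it uses the common single-array variant (prefix, gap and postfix live in one contiguous `std::vector<char>`) rather than three separate strings. The class name `GapBuffer` and its member names are invented for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Layout of buf_: [ prefix | gap | postfix ]
class GapBuffer {
public:
    explicit GapBuffer(const std::string& text, std::size_t gap = 64)
        : buf_(text.size() + gap), gap_start_(text.size()), gap_end_(text.size() + gap) {
        std::copy(text.begin(), text.end(), buf_.begin());
    }

    // Number of text characters (the gap itself is not counted).
    std::size_t size() const { return buf_.size() - (gap_end_ - gap_start_); }

    // Move the gap so it begins at text position pos. Assumes pos <= size().
    void move_cursor(std::size_t pos) {
        if (pos < gap_start_) {
            // Shift the characters in [pos, gap_start_) to the far side of the gap.
            std::size_t n = gap_start_ - pos;
            std::memmove(buf_.data() + gap_end_ - n, buf_.data() + pos, n);
            gap_start_ -= n;
            gap_end_ -= n;
        } else if (pos > gap_start_) {
            // Shift the characters just after the gap to the near side.
            std::size_t n = pos - gap_start_;
            std::memmove(buf_.data() + gap_start_, buf_.data() + gap_end_, n);
            gap_start_ += n;
            gap_end_ += n;
        }
    }

    // Typing at the cursor just fills the gap: no other text moves.
    void insert(char c) {
        if (gap_start_ == gap_end_) grow();
        buf_[gap_start_++] = c;
    }

    void insert(const std::string& s) {
        for (char c : s) insert(c);
    }

    // Backspace: the deleted character simply becomes part of the gap.
    void erase_before() {
        if (gap_start_ > 0) --gap_start_;
    }

    std::string str() const {
        std::string out(buf_.begin(), buf_.begin() + gap_start_);
        out.append(buf_.begin() + gap_end_, buf_.end());
        return out;
    }

private:
    // Enlarge the buffer and slide the postfix to the new end, reopening the gap.
    void grow() {
        std::size_t tail = buf_.size() - gap_end_;
        std::size_t extra = std::max<std::size_t>(64, buf_.size());
        buf_.resize(buf_.size() + extra);
        std::memmove(buf_.data() + buf_.size() - tail, buf_.data() + gap_end_, tail);
        gap_end_ = buf_.size() - tail;
    }

    std::vector<char> buf_;
    std::size_t gap_start_;
    std::size_t gap_end_;
};
```

Consecutive edits at the same spot cost almost nothing. The price is paid in `move_cursor`, which copies as many characters as the cursor jumps over.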

Emacs uses gap buffers, which suggests they're not a bad place to start. There's plenty of discussion (and comparison) of the two data structures out there, and you may even be able to find perfectly usable implementations already available. Implementing your own gap buffer should be dead easy.

Possibly useful reading: Gap Buffers, or, Don’t Get Tied Up With Ropes? (which includes some profiling information), original SGI C++ library Rope docs

#2


2  

Well, for a vague question, my answer is that std::string will probably suit you well. But there are many ways to store this; it depends on what your development requirements are.

Edit: Complementary answer (to the edited question): No, it's not inefficient at all. It's quite suitable for generic use and excellent for read-only access.
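
As an illustration, one common way to pull a whole file into a std::string is sketched below. The helper name `read_file` is hypothetical, and the sketch assumes the file fits comfortably in memory.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads an entire file into one std::string.
std::string read_file(const std::string& path) {
    // Binary mode avoids any newline translation, so the bytes are kept as-is.
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Once loaded, the string is a single contiguous block, so read-only scans such as `find` or iterating line by line are straightforward.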

#3


1  

This is a vague question, which is why you can't find a good answer. It is more about what you do with this text file. If the text file is small enough to fit in memory, then sure, you can store it in a string. But then how are you going to use it? What does this do for you? Are you going to use regex to find certain words? Then sure, you can do that, but it may be slow.
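
As a rough illustration of searching text held in one string, the sketch below counts matches two ways: with plain `std::string::find` and with `std::regex`. The function names `count_substring` and `count_regex` are invented here. Which approach is faster for a given workload is something to measure rather than assume.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Count non-overlapping occurrences of a fixed substring.
std::size_t count_substring(const std::string& text, const std::string& needle) {
    if (needle.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}

// Count matches of a regular expression (default ECMAScript syntax).
std::size_t count_regex(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::sregex_iterator first(text.begin(), text.end(), re);
    return static_cast<std::size_t>(std::distance(first, std::sregex_iterator()));
}
```

For a fixed word, `find` is usually enough. The regex version earns its cost only when the pattern really needs alternation, character classes or word boundaries.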

Is the text file a webpage (source)? Then sure, you can do that and search for the tags you are looking for. There might be better ways, like putting it into an XML tree and searching for the tags, but the one string should still work.

Anyway this is a tough question to answer because we don't know what you are using the string for in the first place.

If you just need it whole and intact, and you have enough memory to store it in a string, then sure.
