How to safely read a line from an std::istream?

I want to safely read a line from an std::istream. The stream could be anything, e.g., a connection on a Web server or something processing files submitted by unknown sources. There are many answers starting to do the moral equivalent of this code:

void read(std::istream& in) {
    std::string line;
    if (std::getline(in, line)) {
        // process the line
    }
}

Given the possibly dubious source of in, using the above code would lead to a vulnerability: a malicious agent may mount a denial-of-service attack against this code using a huge line. Thus, I would like to limit the line length to some rather high value, say 4 million chars. While a few large lines may be encountered, it isn't viable to allocate a buffer of that size for each file and use std::istream::getline().

How can the maximum size of the line be limited, ideally without distorting the code too badly and without allocating large chunks of memory up front?

4 Solutions

#1

You could write your own version of std::getline that takes the maximum number of characters to read as a parameter, called getline_n or something similar.

#include <string>
#include <iostream>

template<typename CharT, typename Traits, typename Alloc>
auto getline_n(std::basic_istream<CharT, Traits>& in, std::basic_string<CharT, Traits, Alloc>& str, std::streamsize n) -> decltype(in) {
    std::ios_base::iostate state = std::ios_base::goodbit;
    bool extracted = false;
    const typename std::basic_istream<CharT, Traits>::sentry s(in, true);
    if(s) {
        try {
            str.erase();
            typename Traits::int_type ch = in.rdbuf()->sgetc();
            for(; ; ch = in.rdbuf()->snextc()) {
                if(Traits::eq_int_type(ch, Traits::eof())) {
                    // eof spotted, quit
                    state |= std::ios_base::eofbit;
                    break;
                }
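                else if(Traits::eq(Traits::to_char_type(ch), in.widen('\n'))) {
                    // end of line: consume the delimiter without storing it
                    extracted = true;
                    in.rdbuf()->sbumpc();
                    break;
                }
                else if(str.size() == static_cast<std::size_t>(n)) {
                    // maximum number of characters met, quit; the unread
                    // character stays in the stream instead of being dropped
                    extracted = true;
                    break;
                }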
                else if(str.max_size() <= str.size()) {
                    // string too big
                    state |= std::ios_base::failbit;
                    break;
                }
                else {
                    // character valid
                    str += Traits::to_char_type(ch);
                    extracted = true;
                }
            }
        }
        catch(...) {
            in.setstate(std::ios_base::badbit);
        }
    }

    if(!extracted) {
        state |= std::ios_base::failbit;
    }

    in.setstate(state);
    return in;
}

int main() {
    std::string s;
    getline_n(std::cin, s, 10); // maximum of 10 characters
    std::cout << s << '\n';
}

Might be overkill though.

#2

There is already such a getline function as a member function of istream; you just need to wrap it for buffer management.

#include <assert.h>
#include <istream>
#include <stddef.h>         // ptrdiff_t
#include <string>           // std::string, std::char_traits

typedef ptrdiff_t Size;

namespace my {
    using std::istream;
    using std::string;
    using std::char_traits;

    istream& getline(
        istream& stream, string& s, Size const buf_size, char const delimiter = '\n'
        )
    {
        s.resize( buf_size );  assert( s.size() > 1 );
        stream.getline( &s[0], buf_size, delimiter );
        if( !stream.fail() )
        {
            Size const n = char_traits<char>::length( &s[0] );
            s.resize( n );      // Downsizing.
        }
        return stream;
    }
}  // namespace my
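
With a wrapper like this, the caller still has to decide what to do once the member getline() reports an overlong line through failbit. The following is a minimal sketch of one option, assuming overlong lines should simply be counted and skipped; the name read_short_lines and its parameters are made up for this example and are not part of the answer above.

```cpp
#include <cstddef>
#include <istream>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects every line of at most max_len characters
// and counts (but otherwise skips) the lines that are longer.
std::vector<std::string> read_short_lines(std::istream& in, std::size_t max_len,
                                          std::size_t& skipped) {
    std::vector<std::string> lines;
    std::vector<char> buf(max_len + 1);  // room for the terminating '\0'
    skipped = 0;
    for (;;) {
        in.getline(buf.data(), static_cast<std::streamsize>(buf.size()));
        if (in) {
            lines.push_back(buf.data());  // stops at an embedded '\0', if any
        }
        else if (in.eof() || in.bad()) {
            break;                        // nothing left to read, or a hard error
        }
        else {
            // failbit alone: the line did not fit into the buffer
            ++skipped;
            in.clear();
            in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
        }
    }
    return lines;
}
```

Calling in.clear() followed by in.ignore(..., '\n') drops the rest of the overlong line, so the next getline() starts at the following line. Whether skipping is appropriate depends on the application; a server might prefer to close the connection instead.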

#3

Replace std::getline by creating a wrapper around std::istream::getline:

std::istream& my::getline( std::istream& is, std::streamsize n, std::string& str, char delim )
    {
    try
       {
       str.resize(n);
       is.getline(&str[0],n,delim);
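       // gcount() also counts the extracted delimiter, which is not stored
       std::streamsize count = is.gcount();
       if ( count > 0 && !is.eof() && !is.fail() ) --count;
       str.resize(count);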
       return is;
       }
    catch(...) { str.resize(0); throw; }
    }

If you want to avoid excessive temporary memory allocations, you could use a loop which grows the allocation as needed (probably doubling in size on each pass). Don't forget that exceptions might or might not be enabled on the istream object.

Here's a version with the more efficient allocation strategy:

std::istream& my::getline( std::istream& is, std::streamsize n, std::string& str, char delim )
    {
    std::streamsize base=0;
    do {
       try
          {
          is.clear();
          std::streamsize chunk=std::min(n-base,std::max(static_cast<std::streamsize>(2),base));
          if ( chunk == 0 ) break;
          str.resize(base+chunk);
          is.getline(&str[base],chunk,delim);
          }
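       catch( std::ios_base::failure const& ) { if ( !is.gcount () ) str.resize(0), throw; }
       base += is.gcount();
       // a read that ended at the delimiter counted it in gcount(); drop it
       if ( is.gcount() && !is.fail() && !is.eof() ) --base;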
       } while ( is.fail() && is.gcount() );
    str.resize(base);
    return is;
    }
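
The remark about exceptions matters because the same overlong line is reported in two different ways depending on the stream's exception mask. The small sketch below illustrates this; overlong_line_throws is a hypothetical helper written only for this illustration.

```cpp
#include <ios>
#include <sstream>

// Hypothetical helper: returns true if reading an overlong line was
// reported by an exception rather than only by the stream state.
bool overlong_line_throws(bool enable_exceptions) {
    std::istringstream in("0123456789\n");
    if (enable_exceptions) {
        in.exceptions(std::ios_base::failbit);
    }
    char buf[4];
    try {
        // only 3 characters fit, so the member getline() sets failbit
        in.getline(buf, sizeof buf);
    }
    catch (std::ios_base::failure const&) {
        return true;
    }
    return false;
}
```

A wrapper that wants to behave the same in both cases has to handle both the thrown std::ios_base::failure and the plain failbit, which is what the try/catch blocks in the versions above are for.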

#4

Based on the comments and answers, there seem to be three approaches:

1. Write a custom version of getline(), possibly using the std::istream::getline() member internally to get the actual characters.
2. Use a filtering stream buffer to limit the amount of data potentially received.
3. Instead of reading a std::string, use a string instantiation with a custom allocator limiting the amount of memory stored in the string.

Not all of the suggestions came with code. This answer provides code for, and a bit of discussion of, all three approaches. Before going into implementation details, it is worth pointing out that there are multiple choices for what should happen if an excessively long input is received:

1. Reading an overlong line could result in a successful read of a partial line, i.e., the resulting string contains the read content and the stream doesn't have any error flags set. Doing so means, however, that it isn't possible to distinguish between a line hitting exactly the limit and one that is too long. Since the limit is somewhat arbitrary anyway, it probably doesn't really matter, though.
2. Reading an overlong line could be considered a failure (i.e., setting std::ios_base::failbit and/or std::ios_base::badbit) and, since reading failed, yield an empty string. Yielding an empty string, obviously, prevents looking at the string read so far to see what's going on.
3. Reading an overlong line could provide the partial line read and also set error flags on the stream. This seems reasonable behavior: it detects that something is up and also provides the input for potential inspection.
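
To make the difference between these three choices concrete, here is a minimal sketch that implements all of them behind one switch. The enum overlong_policy and the function getline_with_policy are hypothetical names introduced only for this illustration; the sketch is restricted to char streams and '\n' as delimiter.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical names for the three behaviours described above.
enum class overlong_policy { truncate_silently, fail_and_clear, fail_and_keep };

std::istream& getline_with_policy(std::istream& in, std::string& line,
                                  std::size_t limit, overlong_policy policy) {
    typedef std::char_traits<char> traits;
    line.clear();
    std::istream::sentry guard(in, true);  // true: keep leading whitespace
    if (!guard) {
        return in;
    }
    std::streambuf* buf = in.rdbuf();
    for (;;) {
        traits::int_type const c = buf->sgetc();
        if (traits::eq_int_type(c, traits::eof())) {
            // end of input; fail only if nothing was read at all
            in.setstate(line.empty()
                        ? std::ios_base::eofbit | std::ios_base::failbit
                        : std::ios_base::eofbit);
            return in;
        }
        if (traits::to_char_type(c) == '\n') {
            buf->sbumpc();                 // consume the delimiter
            return in;
        }
        if (line.size() == limit) {        // the line is longer than allowed
            if (policy != overlong_policy::truncate_silently) {
                in.setstate(std::ios_base::failbit);
            }
            if (policy == overlong_policy::fail_and_clear) {
                line.clear();
            }
            return in;
        }
        line.push_back(traits::to_char_type(c));
        buf->sbumpc();
    }
}
```

Unlike the first choice as described above, this sketch only treats a line as overlong when a further non-delimiter character follows the limit, so a line of exactly limit characters still reads successfully. In all three cases the rest of an overlong line stays in the stream, so the caller has to skip it before reading the next line.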

Although there are already multiple code examples implementing a limited version of getline(), here is another one! I think it is simpler (albeit possibly slower; performance can be dealt with when necessary), and it also retains std::getline()'s interface: it uses the stream's width() to communicate a limit (maybe taking width() into account is a reasonable extension to std::getline()):

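#include <istream>
#include <iterator>
#include <limits>
#include <string>

template <typename cT, typename Traits, typename Alloc>
std::basic_istream<cT, Traits>&
safe_getline(std::basic_istream<cT, Traits>& in,
             std::basic_string<cT, Traits, Alloc>& value,
             cT delim)
{
    typedef std::basic_string<cT, Traits, Alloc> string_type;
    typedef typename string_type::size_type size_type;

    // noskipws = true: like std::getline(), do not skip leading whitespace
    typename std::basic_istream<cT, Traits>::sentry cerberos(in, true);
    if (cerberos) {
        value.clear();
        // width(0) returns the current limit and resets it for later reads
        size_type width(in.width(0));
        if (width == 0) {
            width = std::numeric_limits<size_type>::max();
        }
        bool found_delim = false;
        std::istreambuf_iterator<cT, Traits> it(in), end;
        for (; value.size() != width && it != end; ++it) {
            if (!Traits::eq(delim, *it)) {
                value.push_back(*it);
            }
            else {
                ++it;  // consume the delimiter without storing it
                found_delim = true;
                break;
            }
        }
        if (!found_delim) {
            if (value.size() == width) {
                // limit reached: report it, but keep the partial line
                in.setstate(std::ios_base::failbit);
            }
            else {
                // end of input; fail only if nothing was read at all
                in.setstate(value.empty()
                            ? std::ios_base::eofbit | std::ios_base::failbit
                            : std::ios_base::eofbit);
            }
        }
    }
    return in;
}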

This version of getline() is used just like std::getline() but when it seems reasonable to limit the amount of data read, the width() is set, e.g.:

std::string line;
if (safe_getline(in >> std::setw(max_characters), line, '\n')) {
    // do something with the input
}
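
The mechanism this relies on is that std::setw() applied to an input stream only stores a number in the stream, and that width(0) hands that number back while resetting it. A minimal sketch of just that hand-off follows; consume_limit is a hypothetical helper standing in for any width()-aware reading function.

```cpp
#include <iomanip>
#include <istream>
#include <sstream>

// Hypothetical reader: picks up the limit left in width(), as a
// width()-aware reading function would, and resets it to 0.
std::streamsize consume_limit(std::istream& in) {
    return in.width(0);  // returns the old width and stores the new one
}
```

Because the width is reset when it is consumed, the limit applies to a single call only and has to be set again before each line is read.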

Another approach is to just use a filtering stream buffer to limit the amount of input: the filter would just count the number of characters processed and limit the amount to a suitable number of characters. This approach is actually easier applied to an entire stream than an individual line: when processing just one line, the filter can't just obtain buffers full of characters from the underlying stream because there is no reliable way to put the characters back. Implementing an unbuffered version is still simple but probably not particularly efficient:

template <typename cT, typename Traits = std::char_traits<cT> >
class basic_limitbuf
    : public std::basic_streambuf <cT, Traits> {
public:
    typedef Traits                    traits_type;
    typedef typename Traits::int_type int_type;

private:
    std::streamsize                   size;
    std::streamsize                   max;
    std::basic_istream<cT, Traits>*   stream;
    std::basic_streambuf<cT, Traits>* sbuf;

    int_type underflow() {
        if (this->size < this->max) {
            return this->sbuf->sgetc();
        }
        else {
            this->stream->setstate(std::ios_base::failbit);
            return traits_type::eof();
        }
    }
    int_type uflow()     {
        if (this->size < this->max) {
            ++this->size;
            return this->sbuf->sbumpc();
        }
        else {
            this->stream->setstate(std::ios_base::failbit);
            return traits_type::eof();
        }
    }
public:
    basic_limitbuf(std::streamsize max,
                   std::basic_istream<cT, Traits>& stream)
        : size()
        , max(max)
        , stream(&stream)
        , sbuf(this->stream->rdbuf(this)) {
    }
    ~basic_limitbuf() {
        std::ios_base::iostate state = this->stream->rdstate();
        this->stream->rdbuf(this->sbuf);
        this->stream->setstate(state);
    }
};

This stream buffer is already set up to insert itself upon construction and remove itself upon destruction. That is, it can be used simply like this:

std::string line;
basic_limitbuf<char> sbuf(max_characters, in);
if (std::getline(in, line)) {
    // do something with the input
}

It would be easy to add a manipulator setting up the limit, too. One advantage of this approach is that none of the reading code needs to be touched if the total size of the stream could be limited: the filter could be set up right after creating the stream. When there is no need to back out the filter, the filter could also use a buffer which would greatly improve the performance.
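
For the whole-stream case, a buffered filter could look roughly like the sketch below. It is an illustration under assumptions, not the code from this answer: the class name limited_streambuf and the 1024-byte buffer are made up, it only handles char streams, and it simply reports end of input once the limit is reached rather than setting failbit.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Sketch of a buffered limiting filter for a whole stream. It never gives
// characters back to the source, so it should stay installed for as long
// as the source is read.
class limited_streambuf : public std::streambuf {
    std::streambuf* source_;
    std::streamsize remaining_;  // characters still allowed through
    char            buffer_[1024];

    int_type underflow() override {
        if (gptr() < egptr()) {
            return traits_type::to_int_type(*gptr());
        }
        if (remaining_ <= 0) {
            return traits_type::eof();  // limit reached: pretend end of input
        }
        std::streamsize const want =
            std::min(remaining_, static_cast<std::streamsize>(sizeof buffer_));
        std::streamsize const got = source_->sgetn(buffer_, want);
        if (got <= 0) {
            return traits_type::eof();  // the source itself is exhausted
        }
        remaining_ -= got;
        setg(buffer_, buffer_, buffer_ + got);
        return traits_type::to_int_type(buffer_[0]);
    }

public:
    limited_streambuf(std::streambuf* source, std::streamsize limit)
        : source_(source), remaining_(limit) {}
};
```

It would be used by wrapping the original buffer, e.g. limited_streambuf limited(file.rdbuf(), max_total); std::istream in(&limited); and then reading from in with the unchanged std::getline() code.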

The third approach suggested is to use a std::basic_string with a custom allocator. There are two aspects which are a bit awkward about the allocator approach:

1. The string being read actually has a type which isn't immediately convertible to std::string (although it also isn't hard to do the conversion).
2. The maximum array size can be easily limited, but the string will have some more or less random size smaller than that: when the string fails to allocate, an exception is thrown and there is no attempt to grow the string by a smaller size.

Here is the necessary code for an allocator limiting the allocated size:

template <typename T>
struct limit_alloc
{
private:
    std::size_t max_;
public:
    typedef T value_type;
    limit_alloc(std::size_t max): max_(max) {}
    template <typename S>
    limit_alloc(limit_alloc<S> const& other): max_(other.max()) {}
    std::size_t max() const { return this->max_; }
    T* allocate(std::size_t size) {
        // size counts objects of type T, not bytes
        return size <= max_
            ? static_cast<T*>(operator new[](size * sizeof(T)))
            : throw std::bad_alloc();
    }
    void  deallocate(T* ptr, std::size_t) {
        operator delete[](ptr);
    }
};

template <typename T0, typename T1>
bool operator== (limit_alloc<T0> const& a0, limit_alloc<T1> const& a1) {
    return a0.max() == a1.max();
}
template <typename T0, typename T1>
bool operator!= (limit_alloc<T0> const& a0, limit_alloc<T1> const& a1) {
    return !(a0 == a1);
}

The allocator would be used something like this (the code compiles OK with a recent version of clang but not with gcc):

std::basic_string<char, std::char_traits<char>, limit_alloc<char> >
    tmp((limit_alloc<char>(max_chars)));  // extra parentheses: otherwise this declares a function
if (std::getline(in, tmp)) {
    std::string line(tmp.begin(), tmp.end());
    // do something with the input
}
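
For reference, here is a self-contained sketch of the same idea that can be compiled on its own. It assumes a C++11 standard library whose std::basic_string accesses the allocator through std::allocator_traits; the names capped_alloc, capped_string and read_capped_line are hypothetical and differ from the code above only to keep the example independent.

```cpp
#include <cstddef>
#include <istream>
#include <new>
#include <sstream>
#include <string>

// Hypothetical allocator refusing any single allocation of more than max_ objects.
template <typename T>
struct capped_alloc {
    typedef T value_type;
    template <typename U> struct rebind { typedef capped_alloc<U> other; };
    std::size_t max_;

    explicit capped_alloc(std::size_t max) : max_(max) {}
    template <typename U>
    capped_alloc(capped_alloc<U> const& other) : max_(other.max_) {}

    T* allocate(std::size_t n) {
        if (n > max_) {
            throw std::bad_alloc();
        }
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};

template <typename T, typename U>
bool operator==(capped_alloc<T> const& a, capped_alloc<U> const& b) {
    return a.max_ == b.max_;
}
template <typename T, typename U>
bool operator!=(capped_alloc<T> const& a, capped_alloc<U> const& b) {
    return !(a == b);
}

typedef std::basic_string<char, std::char_traits<char>, capped_alloc<char> >
    capped_string;

// Reads one line, keeping at most what fits into an allocation of
// max_chars characters, and converts the result to a plain std::string.
std::string read_capped_line(std::istream& in, std::size_t max_chars) {
    capped_string tmp((capped_alloc<char>(max_chars)));
    std::getline(in, tmp);  // a failed allocation leaves the stream in a failed state
    return std::string(tmp.begin(), tmp.end());
}
```

How much of an overlong line ends up in the result depends on the string's growth strategy, which is the "more or less random size" mentioned above; the only guarantee is that it stays below the cap.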

In summary, there are multiple approaches, each with its own small drawback but each reasonably viable for the stated goal of limiting denial-of-service attacks based on overlong lines:

1. Using a custom version of getline() means the reading code needs to be changed.
2. Using a custom stream buffer is slow unless the entire stream's size can be limited.
3. Using a custom allocator gives less control and requires some changes to reading code.
