Adam Pierce’s answer provides a hand-spun tokenizer taking in a const char*. It’s a bit more problematic to do with iterators because incrementing a string’s end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this:
auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };
while (start != cend(str)) {
const auto finish = find(++start, cend(str), ' ');
tokens.push_back(string(start, finish));
start = finish;
}
Live Example
If you’re looking to abstract complexity by using standard functionality, as On Freund suggests, strtok is a simple option:
vector<string> tokens;
for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);
If you don’t have access to C++17, you’ll need to substitute data(str), as in this example: http://ideone.com/8kAGoa
Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage though, there are several drawbacks:

- strtok cannot be used on multiple strings at the same time: either a nullptr must be passed to continue tokenizing the current string, or a new char* to tokenize must be passed (there are some non-standard implementations which do support this, however, such as strtok_s)
- For the same reason, strtok cannot be used on multiple threads simultaneously (this may however be implementation-defined; for example, Visual Studio’s implementation is thread-safe)
- Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings; to tokenize any of these with strtok, or to operate on a string whose contents need to be preserved, str would have to be copied, then the copy could be operated on
C++20 provides us with split_view to tokenize strings in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874
The previous methods cannot generate a tokenized vector in-place, meaning that without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example, given const string str{ "The quick \tbrown \nfox" } we can do this:
istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };
Live Example
The required construction of an istringstream for this option has far greater cost than the previous two options; however, this cost is typically hidden in the expense of string allocation.
If none of the above options is flexible enough for your tokenization needs, the most flexible option is a regex_token_iterator; of course with this flexibility comes greater expense, but again this is likely hidden in the string allocation cost. Say, for example, we want to tokenize based on non-escaped commas, also eating white-space; given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this:
const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };
Live Example