Last Updated, 5 min read

The 1๐Ÿ๐ŸŽ๏ธ Challenge


The Challenge ๐Ÿ‘‰ 1๐Ÿ๐ŸŽ๏ธ

Why?

All my life (read: since I was {current_age_dec_2026 - 2} years old), Iโ€™ve been diving deep into CPU architectures, data oriented design, cache friendly coding practices and general low-level optimizations.

The 1 Billion Row Challenge (1BRC) was the perfect excuse to commit unholy levels of optimization without crashing production.

Attempt 1: The โ€œIt Worksโ€ Approach

Iโ€™ve learned the hard way โ€ฆ really hard way to start simple.

For Attempt 1, I kept it dead simple. Zero optimization. Just me, raw C++, and a desire to understand the problem components. The best way for me to do that is just code out a working solution.

Result: It runs in ~513.62.

main.cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <limits>
#include <map>
#include <sstream>
#include <string>
struct LocationStats {
int32_t min = std::numeric_limits<int32_t>::max();
int32_t max = std::numeric_limits<int32_t>::min();
int64_t sum = 0;
int32_t freq = 0;
};
double round1(double value) { return std::round(value * 10.0) / 10.0; }
void printResults(const std::map<std::string, LocationStats>& m) {
std::string outBuffer;
outBuffer.reserve(8 * 1024 * 1024);
bool first = true;
outBuffer += "{";
for (const auto& [loc, stat] : m) {
char buf[32];
if (!first) outBuffer += ", ";
outBuffer += loc;
outBuffer += "=";
double min_val = stat.min / 10.0;
double max_val = stat.max / 10.0;
double avg = round1(stat.sum / (double)(stat.freq * 10));
std::snprintf(buf, sizeof(buf), "%.1f/%.1f/%.1f", min_val, avg, max_val);
outBuffer += buf;
first = false;
}
outBuffer += "}\n";
std::cout << outBuffer;
}
std::pair<std::string, int32_t> parseLine(const std::string& line) {
std::istringstream ss(line);
std::string location, tempString;
std::getline(ss, location, ';');
std::getline(ss, tempString, ';');
int32_t temperature = static_cast<int32_t>(std::stod(tempString) * 10);
return {location, temperature};
}
void updateStats(const std::string& location, int32_t temperature,
std::map<std::string, LocationStats>& m) {
auto& stat = m[location];
stat.min = std::min(stat.min, temperature);
stat.max = std::max(stat.max, temperature);
stat.freq++;
stat.sum += temperature;
}
std::map<std::string, LocationStats> accumulate(std::ifstream& f) {
std::map<std::string, LocationStats> m;
std::string tempBuffer;
while (std::getline(f, tempBuffer)) {
auto [location, temperature] = parseLine(tempBuffer);
updateStats(location, temperature, m);
}
return m;
}
void oneBrc(const char* filename) {
std::ifstream f(filename);
if (!f.is_open()) {
std::cerr << "Error: Could not open file " << filename << std::endl;
std::exit(EXIT_FAILURE);
}
std::map<std::string, LocationStats> m = accumulate(f);
printResults(m);
}
int main(int argc, char* argv[]) {
const char* filename = argc > 1 ? argv[1] : "../data/measurements.txt";
oneBrc(filename);
return EXIT_SUCCESS;
}

The 2024 version of me would probably just throw standard optimizations at this and hope for the best. But weโ€™re doing science here. I want to know exactly why itโ€™s taking so long and where the bottleneck lies as 2026 version of me prefers targeted efforts.

Time for a Flamegraph.

If you donโ€™t know Brendan Gregg, stop reading this and go read his blog. He is practically the patron saint of systems performance.

I used his legendary flamegraph.pl toolset (from his repo) to visualize the stack traces.

flamegraph_linux.sh
rm -f main out.perf perf.data perf.data.old flamegraph.svg
clang++ -std=c++23 -O2 -g -fno-omit-frame-pointer -Werror -Wall -o main main.cpp
# Record stack samples (requires sudo or perf_event_paranoid relaxed)
sudo perf record -F 997 -g -- ./main
# Convert perf data to folded stacks
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
# Generate the flamegraph
./flamegraph.pl out.folded > flamegraph.svg

the graph is interactive, feel free to poke around.

The biggest compoenets of runtime are parseLine(), libc.so and std::map<โ€ฆ>::operator[] which is what i expected as the code does a lot of parsing, memory allocation/freeing and map operations (a Billion times).

A simple observation is that the process is easily parallelisable map reduce style. this should naturally reduce the runtime and better utilize the resources.

The challenge however would be to minimize any unnecessay intermediate memory allocations as copy is the default semantics of cpp and lot of copying occurs under the hood as cpp does an incredible job at hiding underlying behaviour without any warning.

TODO:

  1. process the file in batches in parallel using threads.
  2. Minimize memory allocations.
main.cpp
#include <algorithm>
#include <cassert>
#include <charconv>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <limits>
#include <map>
#include <optional>
#include <span>
#include <string>
#include <string_view>
#include <thread>
#include <vector>
constexpr char NUM_THREADS = 8;
constexpr int64_t MAX_FILE_READ_BYTES = 1 * 1024 * 1024 * 512; // 512MB
struct LocationStats {
int32_t min = std::numeric_limits<int32_t>::max();
int32_t max = std::numeric_limits<int32_t>::min();
int64_t sum = 0;
int32_t freq = 0;
};
int64_t getFileSize(const std::string& fileName) {
return static_cast<int64_t>(std::filesystem::file_size(fileName));
}
int64_t getBatchSize(const std::string& fileName) {
int64_t fileSize = getFileSize(fileName);
return (fileSize + (NUM_THREADS - 1)) / NUM_THREADS;
}
double round1(double value) { return std::round(value * 10.0) / 10.0; }
void printResults(const std::map<std::string, LocationStats>& m) {
std::string outBuffer;
outBuffer.reserve(8 * 1024 * 1024);
outBuffer += "{";
for (const auto& [location, stat] : m) {
if (outBuffer.size() > 1) outBuffer += ", ";
outBuffer += location;
outBuffer += "=";
double min_val = stat.min / 10.0;
double max_val = stat.max / 10.0;
double avg = round1(stat.sum / static_cast<double>(stat.freq * 10));
size_t pos = outBuffer.size();
outBuffer.resize(pos + 32);
int writtenBytes =
std::snprintf(&outBuffer[pos], 32, "%.1f/%.1f/%.1f", min_val, avg, max_val);
outBuffer.resize(pos + writtenBytes);
}
outBuffer += "}\n";
std::cout << outBuffer;
}
std::optional<std::pair<std::string_view, int32_t>> parseLine(std::string_view line) {
size_t semicolonPos = line.find(';');
if (semicolonPos == std::string_view::npos) {
return std::nullopt;
}
std::string_view locationView = line.substr(0, semicolonPos);
// Find start and end of temperature
size_t tempStart = semicolonPos + 1;
if (tempStart >= line.size()) {
return std::nullopt;
}
// Trim trailing whitespace
size_t tempEnd = line.size();
while (tempEnd > tempStart && (line[tempEnd - 1] == ' ' || line[tempEnd - 1] == '\t' ||
line[tempEnd - 1] == '\n' || line[tempEnd - 1] == '\r')) {
--tempEnd;
}
if (tempEnd <= tempStart) {
return std::nullopt;
}
double temperature;
const char* start = line.data() + tempStart;
const char* end = line.data() + tempEnd;
auto [ptr, ec] = std::from_chars(start, end, temperature);
if (ec != std::errc{}) {
return std::nullopt;
}
int32_t temp_int = static_cast<int32_t>(std::round(temperature * 10));
return std::make_pair(locationView, temp_int);
}
void updateStats(std::string_view location, int32_t temperature,
std::map<std::string, LocationStats>& m) {
auto& stats = m[std::string(location)];
stats.min = std::min(stats.min, temperature);
stats.max = std::max(stats.max, temperature);
stats.freq++;
stats.sum += temperature;
}
int64_t skipTillNextLine(int threadIndex, int64_t startPos, std::ifstream& f) {
if (startPos == 0) return 0;
// No need to skip if alredy at start of a new line
f.seekg(startPos - 1);
char prevChar;
if (f.get(prevChar) && prevChar == '\n') {
return 0;
}
f.seekg(startPos);
int64_t bytesSkipped = 0;
char c;
while (f.get(c) && c != '\n') {
bytesSkipped++;
}
if (f.good()) bytesSkipped++; // skip '\n'
return bytesSkipped;
}
int64_t readTillEndOfLine(int64_t bytesRead, std::string& fileReadBuffer, std::ifstream& f) {
if (bytesRead > 0 && fileReadBuffer[bytesRead - 1] != '\n') {
while (f.get(fileReadBuffer[bytesRead]) && fileReadBuffer[bytesRead] != '\n') {
bytesRead++;
}
}
return bytesRead;
}
void processLines(std::span<const char> buffer, std::map<std::string, LocationStats>& m) {
size_t pos = 0;
size_t validBytes = buffer.size();
while (pos < validBytes) {
size_t newlinePos = pos;
while (newlinePos < validBytes && buffer[newlinePos] != '\n') {
++newlinePos;
}
if (newlinePos > pos) {
std::string_view line(buffer.data() + pos, newlinePos - pos);
if (!line.empty()) {
auto result = parseLine(line);
if (result) {
auto [location, temperature] = *result;
updateStats(location, temperature, m);
}
}
}
pos = newlinePos + 1;
}
}
void accumulateBatch(int threadIndex, int64_t startPos, int64_t batchBytes,
std::map<std::string, LocationStats>& m, const std::string& fileName) {
std::ifstream f(fileName, std::ios::binary);
if (!f.is_open()) {
std::cerr << "Failed to open file: " << fileName << std::endl;
return;
}
// Skip to the next line boundary at the start for non-zero threads
int64_t processedBytes = 0;
if (threadIndex) {
processedBytes = skipTillNextLine(threadIndex, startPos, f);
}
// Read mini-batches of size MAX_FILE_READ_BYTES to avoid Out of Memory Error
while (processedBytes < batchBytes) {
std::string miniBatchBuffer;
// extra 128 bytes to read till next delimiter char even if not part of the batch.
miniBatchBuffer.resize(MAX_FILE_READ_BYTES + 128);
f.read(miniBatchBuffer.data(), std::min(MAX_FILE_READ_BYTES, batchBytes - processedBytes));
std::streamsize bytesRead = f.gcount();
bytesRead = readTillEndOfLine(bytesRead, miniBatchBuffer, f);
if (bytesRead <= 0) break;
processLines(std::span<const char>(miniBatchBuffer.data(), bytesRead), m);
processedBytes += bytesRead;
}
}
void accumulateThreadResults(const std::vector<std::map<std::string, LocationStats>>& maps,
std::map<std::string, LocationStats>& finalMap) {
for (const auto& m : maps) {
for (const auto& [location, stats] : m) {
auto& finalStats = finalMap[location];
finalStats.freq += stats.freq;
finalStats.sum += stats.sum;
finalStats.max = std::max(finalStats.max, stats.max);
finalStats.min = std::min(finalStats.min, stats.min);
}
}
}
void processInBatches(int64_t batchSize, const std::string& fileName,
std::map<std::string, LocationStats>& finalMap) {
std::vector<std::map<std::string, LocationStats>> maps(NUM_THREADS);
std::vector<std::thread> threads;
threads.reserve(NUM_THREADS);
for (int i = 0; i < NUM_THREADS; ++i) {
int64_t startPos = i * batchSize;
threads.emplace_back(accumulateBatch, i, startPos, batchSize, std::ref(maps[i]),
std::cref(fileName));
}
for (auto& t : threads) {
t.join();
}
accumulateThreadResults(maps, finalMap);
}
void processInSingleBatch(const std::string& fileName,
std::map<std::string, LocationStats>& finalMap) {
int64_t fileSize = getFileSize(fileName);
accumulateBatch(0, 0, fileSize, finalMap, fileName);
}
void accumulate(const std::string& fileName, std::map<std::string, LocationStats>& finalMap) {
int64_t batchSize = getBatchSize(fileName);
assert(batchSize > 0);
if (batchSize > 4 * 1024) {
processInBatches(batchSize, fileName, finalMap);
} else {
processInSingleBatch(fileName, finalMap);
}
}
void oneBrc(const char* filename) {
std::map<std::string, LocationStats> finalMap;
accumulate(filename, finalMap);
printResults(finalMap);
}
int main(int argc, char* argv[]) {
const char* filename = argc > 1 ? argv[1] : "../data/measurements.txt";
oneBrc(filename);
return EXIT_SUCCESS;
}

Phewโ€ฆ

For this version I went above and beyond to minimize unnecessary memory allocations especially the ones that are by-product of string operations.

Result ~103s.

Itโ€™s sooo nice to not have to wait 10 minutes.

Letโ€™s look at the ๐Ÿ”ฅ๐Ÿ“Š to get a better idea.

Okay the program spends most of the time in processLines() function and libc.so and std::map<โ€ฆ>::operator[] takes up the majority and almost equal time share.

std::map<โ€ฆ>() is a red-black tree and so we cannot reserve memory before hand and adds memory as elements are inserted which is relatively slower.

And the real culprit O(log n) complexity inserts and lookups donโ€™t help either especially when there are at least a billion of these operations.

Even with a very high constant I think itโ€™ll be far more efficient to use O(1) complexity operations provided by std::unordered_map<โ€ฆ>(). The only problem is that we need to print the output in lexicographical order of location names.

Thats no biggie as weโ€™re only dealing with at most 10,000 unique locations. We can just collect them in a std::vector<std::string>() and sort it to then use it to access elements from the std::unordered_map<โ€ฆ>().

The best part is we can reserve memory for both of those data-structures, so not dynamic allocations and reallocations! we reserve sufficient memory beforehand.

#include <algorithm>
#include <cassert>
#include <charconv>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <limits>
#include <span>
#include <string>
#include <string_view>
#include <thread>
#include <unordered_map>
#include <vector>
constexpr char NUM_THREADS = 4;
constexpr int64_t MAX_FILE_READ_BYTES = 1 * 1024 * 1024 * 512; // 512MB
struct LocationStats {
int32_t min = std::numeric_limits<int32_t>::max();
int32_t max = std::numeric_limits<int32_t>::min();
int64_t sum = 0;
int32_t freq = 0;
};
int64_t getFileSize(const std::string& fileName) {
return static_cast<int64_t>(std::filesystem::file_size(fileName));
}
int64_t getBatchSize(const std::string& fileName) {
int64_t fileSize = getFileSize(fileName);
return (fileSize + (NUM_THREADS - 1)) / NUM_THREADS;
}
double round1(double value) { return std::round(value * 10.0) / 10.0; }
void printResults(const std::unordered_map<std::string, LocationStats>& m) {
std::vector<std::string> keys;
keys.reserve(1 * 1024 * 1024);
for (const auto& it : m) {
keys.emplace_back(it.first);
}
sort(keys.begin(), keys.end());
std::string outBuffer;
outBuffer.reserve(2 * 1024 * 1024);
outBuffer += "{";
for (const auto& location : keys) {
auto& stat = m.at(location);
if (outBuffer.size() > 1) outBuffer += ", ";
outBuffer += location;
outBuffer += "=";
double min_val = stat.min / 10.0;
double max_val = stat.max / 10.0;
double avg = round1(stat.sum / static_cast<double>(stat.freq * 10));
size_t pos = outBuffer.size();
outBuffer.resize(pos + 32);
int writtenBytes =
std::snprintf(&outBuffer[pos], 32, "%.1f/%.1f/%.1f", min_val, avg, max_val);
outBuffer.resize(pos + writtenBytes);
}
outBuffer += "}\n";
std::cout << outBuffer;
}
std::pair<std::string_view, int32_t> parseLine(std::string_view line) {
size_t semicolonPos = line.find(';');
assert(semicolonPos != std::string_view::npos);
std::string_view locationView = line.substr(0, semicolonPos);
// Find start and end of temperature
size_t tempStart = semicolonPos + 1;
assert(tempStart < line.size());
// Trim trailing whitespace
size_t tempEnd = line.size();
while (tempEnd > tempStart && (line[tempEnd - 1] == ' ' || line[tempEnd - 1] == '\t' ||
line[tempEnd - 1] == '\n' || line[tempEnd - 1] == '\r')) {
--tempEnd;
}
assert(tempEnd > tempStart);
double temperature;
const char* start = line.data() + tempStart;
const char* end = line.data() + tempEnd;
auto [ptr, ec] = std::from_chars(start, end, temperature);
assert(ptr == end || ec != std::errc{});
int32_t temp_int = static_cast<int32_t>(std::round(temperature * 10));
return std::make_pair(locationView, temp_int);
}
void updateStats(std::string_view location, int32_t temperature,
std::unordered_map<std::string, LocationStats>& m) {
auto& stats = m[std::string(location)];
stats.min = std::min(stats.min, temperature);
stats.max = std::max(stats.max, temperature);
stats.freq++;
stats.sum += temperature;
}
int64_t skipTillNextLine(int threadIndex, int64_t startPos, std::ifstream& f) {
assert(startPos > 0);
assert(threadIndex > 0);
// No need to skip if alredy at start of a new line
f.seekg(startPos - 1);
char prevChar;
if (f.get(prevChar) && prevChar == '\n') {
return 0;
}
f.seekg(startPos);
int64_t bytesSkipped = 0;
char c;
while (f.get(c) && c != '\n') {
bytesSkipped++;
}
if (f.good()) bytesSkipped++; // skip '\n'
assert(bytesSkipped > 0);
return bytesSkipped;
}
int64_t readTillEndOfLine(std::string& fileReadBuffer, std::ifstream& f) {
int64_t bytesRead = f.gcount();
if (bytesRead > 0 && fileReadBuffer[bytesRead - 1] != '\n') {
while (f.get(fileReadBuffer[bytesRead]) && fileReadBuffer[bytesRead] != '\n') {
bytesRead++;
}
}
return bytesRead;
}
void processLinesInCurrBatch(std::span<const char> buffer,
std::unordered_map<std::string, LocationStats>& m) {
size_t pos = 0;
size_t validBytes = buffer.size();
assert(validBytes > 0);
while (pos < validBytes) {
size_t newlinePos = pos;
while (newlinePos < validBytes && buffer[newlinePos] != '\n') {
++newlinePos;
}
std::string_view line(buffer.data() + pos, newlinePos - pos);
assert(!line.empty());
auto result = parseLine(line);
auto [location, temperature] = result;
updateStats(location, temperature, m);
assert(m.size() > 0);
pos = newlinePos + 1;
}
}
int64_t readBatch(int64_t batchSizeBytes, int64_t processedBytes, std::string& miniBatchBuffer,
std::ifstream& f) {
assert(batchSizeBytes > 0);
assert(f.is_open());
f.read(miniBatchBuffer.data(), std::min(MAX_FILE_READ_BYTES, batchSizeBytes - processedBytes));
int64_t bytesRead = readTillEndOfLine(miniBatchBuffer, f);
return bytesRead;
}
void accumulateBatch(int threadIndex, int64_t startPos, int64_t batchSizeBytes,
std::unordered_map<std::string, LocationStats>& m,
const std::string& fileName) {
assert(fileName.size() > 0);
assert(startPos >= 0);
assert(batchSizeBytes > 0);
std::ifstream f(fileName, std::ios::binary);
if (!f.is_open()) {
std::cerr << "Failed to open file: " << fileName << std::endl;
return;
}
// Skip to the next line boundary at the start for non-zero threads
int64_t processedBytes = 0;
if (threadIndex > 0) {
processedBytes += skipTillNextLine(threadIndex, startPos, f);
}
assert(processedBytes < batchSizeBytes);
// Read mini-batches of size MAX_FILE_READ_BYTES to avoid Out of Memory Error
while (processedBytes < batchSizeBytes) {
std::string miniBatchBuffer;
// extra 128 bytes to read till next delimiter char even if not part of the batch.
miniBatchBuffer.resize(MAX_FILE_READ_BYTES + 128);
int64_t bytesRead = readBatch(batchSizeBytes, processedBytes, miniBatchBuffer, f);
// last batch may have content less than batch size
if (bytesRead <= 0) break;
processedBytes += bytesRead;
processLinesInCurrBatch(std::span<const char>(miniBatchBuffer.data(), bytesRead), m);
}
}
void accumulateThreadResults(
const std::vector<std::unordered_map<std::string, LocationStats>>& maps,
std::unordered_map<std::string, LocationStats>& finalMap) {
for (const auto& m : maps) {
for (const auto& [location, stats] : m) {
auto& finalStats = finalMap[location];
finalStats.freq += stats.freq;
finalStats.sum += stats.sum;
finalStats.max = std::max(finalStats.max, stats.max);
finalStats.min = std::min(finalStats.min, stats.min);
}
}
}
void processInBatches(int64_t batchSize, const std::string& fileName,
std::unordered_map<std::string, LocationStats>& finalMap) {
std::vector<std::unordered_map<std::string, LocationStats>> maps(NUM_THREADS);
for (auto& m : maps) {
m.reserve(2 * 1024 * 1024);
}
std::vector<std::thread> threads;
threads.reserve(NUM_THREADS);
for (int i = 0; i < NUM_THREADS; ++i) {
int64_t startPos = i * batchSize;
threads.emplace_back(accumulateBatch, i, startPos, batchSize, std::ref(maps[i]),
std::cref(fileName));
}
for (auto& t : threads) {
t.join();
}
assert(finalMap.empty());
accumulateThreadResults(maps, finalMap);
}
void processInOneBatch(const std::string& fileName,
std::unordered_map<std::string, LocationStats>& finalMap) {
int64_t fileSize = getFileSize(fileName);
assert(fileSize > 0);
accumulateBatch(0, 0, fileSize, finalMap, fileName);
}
void accumulate(const std::string& fileName,
std::unordered_map<std::string, LocationStats>& finalMap) {
assert(fileName.size() > 0);
int64_t batchSize = getBatchSize(fileName);
assert(batchSize > 0);
if (batchSize > 4 * 1024) {
processInBatches(batchSize, fileName, finalMap);
} else {
processInOneBatch(fileName, finalMap);
}
}
void oneBrc(const char* filename) {
std::unordered_map<std::string, LocationStats> finalMap;
finalMap.reserve(2 * 1024 * 1024);
accumulate(filename, finalMap);
printResults(finalMap);
}
int main(int argc, char* argv[]) {
const char* filename = argc > 1 ? argv[1] : "../data/measurements.txt";
oneBrc(filename);
return EXIT_SUCCESS;
}

Result: ~62s.

It was actually kind stupid to not do this straight away ๐Ÿ˜….

But ๐Ÿ”ฅ๐Ÿ“Š >> ๐Ÿง ๐Ÿค“. letโ€™s look whatโ€™s going on.

The libc.so part has gone down significantly.

the current runtime is dominated by std::unordered_map<โ€ฆ>::operator[] which is fair, no complaints.

But, parsing numbers using std::from_chars takes a significant chunk as well, I think we can parse the numbers faster manually.

/*
Rest of the code ...
*/
// Assumes format: [-]D[D].D where D is digit
int32_t parseInt32(const char* start, const char* end) {
bool neg = (*start == '-');
if (neg) start++;
size_t len = end - start;
int32_t num = 0;
if (len == 3) {
num += (*(start + 0) & 0xF) * 10;
num += (*(start + 2) & 0xF) * 1;
} else if (len == 4) {
num += (*(start + 0) & 0xF) * 100;
num += (*(start + 1) & 0xF) * 10;
num += (*(start + 3) & 0xF) * 1;
}
return neg ? -num : num;
}
std::pair<std::string_view, int32_t> parseLine(std::string_view line) {
size_t semicolonPos = line.find(';');
assert(semicolonPos != std::string_view::npos);
std::string_view locationView = line.substr(0, semicolonPos);
// Find start and end of temperature
size_t tempStart = semicolonPos + 1;
assert(tempStart < line.size());
// Trim trailing whitespace
size_t tempEnd = line.size();
while (tempEnd > tempStart && (line[tempEnd - 1] == ' ' || line[tempEnd - 1] == '\t' ||
line[tempEnd - 1] == '\n' || line[tempEnd - 1] == '\r')) {
--tempEnd;
}
assert(tempEnd > tempStart);
const char* start = line.data() + tempStart;
const char* end = line.data() + tempEnd;
int32_t temp_int = parseInt32(start, end);
return std::make_pair(locationView, temp_int);
}
/*
Rest of the code ...
*/

Result: ~56s

Nice. the parsing is significantly faster now, but weโ€™re hitting bottlenecks the only obvious improvements seems like are going to using custom hash function.

But before that letโ€™s use mmap syscall for significantly faster IO that i learnt while implementing Zue

#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <iostream>
#include <limits>
#include <string>
#include <string_view>
#include <sys/fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <thread>
#include <unistd.h>
#include <unordered_map>
#include <vector>
constexpr size_t NUM_THREADS = 16;
constexpr size_t MAX_CITIES = 10000;
struct LocationStats {
int32_t min = std::numeric_limits<int32_t>::max();
int32_t max = std::numeric_limits<int32_t>::min();
int64_t sum = 0;
int32_t freq = 0;
};
struct MMAPFile {
const char* filePtr;
const size_t fileSize;
};
size_t getFileSize(const std::string& fileName) { return std::filesystem::file_size(fileName); }
MMAPFile getMmappedFile(const char* fileName) {
int fd = open(fileName, O_RDONLY);
if (fd == -1) {
std::cerr << "Could not read input file" << std::endl;
std::exit(EXIT_FAILURE);
}
size_t fileSize = getFileSize(fileName);
char* map = static_cast<char*>(mmap(NULL, fileSize, PROT_READ, MAP_PRIVATE, fd, 0));
close(fd);
if (map == MAP_FAILED) {
std::cerr << "Cound not use mmap on the file" << std::endl;
std::exit(EXIT_FAILURE);
}
return {map, fileSize};
}
void unMapFile(MMAPFile f) { munmap(const_cast<char*>(f.filePtr), f.fileSize); }
size_t getBatchSize(size_t fileSize) { return (fileSize + (NUM_THREADS - 1)) / NUM_THREADS; }
double round1(double value) { return std::round(value * 10.0) / 10.0; }
void printResults(const std::unordered_map<std::string, LocationStats>& m) {
std::vector<std::string> keys;
keys.reserve(MAX_CITIES);
for (const auto& it : m) {
keys.emplace_back(it.first);
}
sort(keys.begin(), keys.end());
std::string outBuffer;
outBuffer.reserve(2 * 1024 * 1024);
outBuffer += "{";
for (const auto& location : keys) {
auto& stat = m.at(location);
if (outBuffer.size() > 1) outBuffer += ", ";
outBuffer += location;
outBuffer += "=";
double min_val = stat.min / 10.0;
double max_val = stat.max / 10.0;
double avg = round1(stat.sum / static_cast<double>(stat.freq * 10));
size_t pos = outBuffer.size();
outBuffer.resize(pos + 32);
int writtenBytes =
std::snprintf(&outBuffer[pos], 32, "%.1f/%.1f/%.1f", min_val, avg, max_val);
outBuffer.resize(pos + writtenBytes);
}
outBuffer += "}\n";
std::cout << outBuffer;
}
// Assumes format: [-]D[D].D where D is digit
int32_t parseInt32(const char* start, const char* end) {
bool neg = (*start == '-');
if (neg) start++;
size_t len = end - start;
int32_t num = 0;
if (len == 3) {
num += (*(start + 0) & 0xF) * 10;
num += (*(start + 2) & 0xF) * 1;
} else if (len == 4) {
num += (*(start + 0) & 0xF) * 100;
num += (*(start + 1) & 0xF) * 10;
num += (*(start + 3) & 0xF) * 1;
}
return neg ? -num : num;
}
std::pair<std::string_view, int32_t> parseLine(std::string_view line) {
size_t semicolonPos = line.find(';');
assert(semicolonPos != std::string_view::npos);
std::string_view locationView = line.substr(0, semicolonPos);
// Find start and end of temperature
size_t tempStart = semicolonPos + 1;
assert(tempStart < line.size());
// Trim trailing whitespace
size_t tempEnd = line.size();
while (tempEnd > tempStart && (line[tempEnd - 1] == ' ' || line[tempEnd - 1] == '\t' ||
line[tempEnd - 1] == '\n' || line[tempEnd - 1] == '\r')) {
--tempEnd;
}
assert(tempEnd > tempStart);
const char* start = line.data() + tempStart;
const char* end = line.data() + tempEnd;
int32_t temp_int = parseInt32(start, end);
return std::make_pair(locationView, temp_int);
}
void updateStats(std::string_view location, int32_t temperature,
std::unordered_map<std::string, LocationStats>& m) {
auto& stats = m[std::string(location)];
stats.min = std::min(stats.min, temperature);
stats.max = std::max(stats.max, temperature);
stats.freq++;
stats.sum += temperature;
}
size_t skipTillNextLine(size_t threadIndex, size_t startPos, MMAPFile f) {
assert(startPos > 0);
assert(threadIndex > 0);
const char* prevChar = f.filePtr + startPos - 1;
if (*prevChar == '\n') {
return 0;
}
size_t bytesSkipped = 0;
while (*(f.filePtr + startPos + bytesSkipped) != '\n') {
bytesSkipped++;
}
bytesSkipped++; // skip '\n'
assert(bytesSkipped > 0);
return bytesSkipped;
}
void processLinesInCurrBatch(size_t startPos, size_t batchEnd, MMAPFile f,
std::unordered_map<std::string, LocationStats>& m) {
size_t pos = startPos;
while (pos < batchEnd && pos < f.fileSize) {
size_t newlinePos = pos;
while (newlinePos < f.fileSize && *(f.filePtr + newlinePos) != '\n') {
++newlinePos;
}
std::string_view line(f.filePtr + pos, newlinePos - pos);
assert(!line.empty());
auto result = parseLine(line);
auto [location, temperature] = result;
updateStats(location, temperature, m);
assert(m.size() > 0);
pos = newlinePos + 1;
}
}
void accumulateBatch(size_t threadIndex, size_t startPos, size_t batchSizeBytes,
std::unordered_map<std::string, LocationStats>& m, MMAPFile f) {
assert(batchSizeBytes > 0);
size_t batchEnd = startPos + batchSizeBytes;
// Skip to the next line boundary at the start for non-zero threads
if (threadIndex > 0) {
size_t bytesSkipped = skipTillNextLine(threadIndex, startPos, f);
startPos += bytesSkipped;
}
processLinesInCurrBatch(startPos, batchEnd, f, m);
}
void accumulateThreadResults(
const std::vector<std::unordered_map<std::string, LocationStats>>& maps,
std::unordered_map<std::string, LocationStats>& finalMap) {
for (const auto& m : maps) {
for (const auto& [location, stats] : m) {
auto& finalStats = finalMap[location];
finalStats.freq += stats.freq;
finalStats.sum += stats.sum;
finalStats.max = std::max(finalStats.max, stats.max);
finalStats.min = std::min(finalStats.min, stats.min);
}
}
}
void processInBatches(MMAPFile f, size_t batchSize,
std::unordered_map<std::string, LocationStats>& finalMap) {
std::vector<std::unordered_map<std::string, LocationStats>> maps(NUM_THREADS);
for (auto& m : maps) {
m.reserve(MAX_CITIES);
}
std::vector<std::thread> threads;
threads.reserve(NUM_THREADS);
for (size_t i = 0; i < NUM_THREADS; ++i) {
size_t startPos = i * batchSize;
threads.emplace_back(accumulateBatch, i, startPos, batchSize, std::ref(maps[i]), f);
}
for (auto& t : threads) {
t.join();
}
assert(finalMap.empty());
accumulateThreadResults(maps, finalMap);
}
void processInOneBatch(MMAPFile f, std::unordered_map<std::string, LocationStats>& finalMap) {
accumulateBatch(0, 0, f.fileSize, finalMap, f);
}
void accumulate(MMAPFile f, std::unordered_map<std::string, LocationStats>& finalMap) {
size_t batchSize = getBatchSize(f.fileSize);
assert(batchSize > 0);
if (batchSize > 4 * 1024) {
processInBatches(f, batchSize, finalMap);
} else {
processInOneBatch(f, finalMap);
}
}
void oneBrc(const char* filename) {
MMAPFile f = getMmappedFile(filename);
std::unordered_map<std::string, LocationStats> finalMap;
finalMap.reserve(MAX_CITIES);
accumulate(f, finalMap);
printResults(finalMap);
unMapFile(f);
}
int main(int argc, char* argv[]) {
const char* filename = argc > 1 ? argv[1] : "../data/measurements.txt";
oneBrc(filename);
return EXIT_SUCCESS;
}
Terminal window
Time (mean ยฑ ฯƒ): 42.432 s ยฑ 0.521 s [User: 140.723 s, System: 14.162 s]
Range (min โ€ฆ max): 41.611 s โ€ฆ 43.481 s 10 runs

mmap is a game changer without any exaggeration. Not only we get rid of the overhead because of reading from streams. We no longer have to allocate 512MB buffers to read from file. We directly use the mapped file and the code is sooo much simpler to read and understand!

I think itโ€™s finally time to do something about std::unordered_map<โ€ฆ>(). std::unordered_map is one big linked-list with an array of pointers to the buckets that enables efficient access to an element without iterating from the beginning. This type of architecture has benefits like it never has to move the elements from one memory address to another as the number of elements increase.

But this leads to pointer chasing which is very bad for the cache

Since we already know the max number of elements we need to support we can simple use a std::vector which are best friends with cache and pre-allocate memory that we need.

For this iteration letโ€™s focus on data-oriented design principles and make the data-structures as cache friendly as possible.

#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <iostream>
#include <limits>
#include <string>
#include <string_view>
#include <sys/cdefs.h>
#include <sys/fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <thread>
#include <unistd.h>
#include <vector>
constexpr size_t NUM_THREADS = 16;
class FastMap {
private:
struct LocationEntry_ {
int32_t min;
int32_t max;
size_t freq;
int64_t sum;
std::string location;
LocationEntry_() {
min = std::numeric_limits<int32_t>::max();
max = std::numeric_limits<int32_t>::min();
freq = 0;
sum = 0;
}
};
std::vector<LocationEntry_> data;
size_t capacity_;
size_t mask_;
size_t fast_hash(std::string_view sv) {
assert(!sv.empty());
uint64_t hash = 0;
for (size_t index = 0; index < sv.size(); ++index)
hash = hash * 1315423911u + static_cast<unsigned char>(sv[index]);
return hash;
}
size_t getIdx(std::string_view queryLocation) {
assert(!queryLocation.empty());
size_t hash = fast_hash(queryLocation);
size_t idx = hash & this->mask_;
while (true) {
const auto& location = this->data[idx].location;
if (location.size() <= 0 || location == queryLocation) {
return idx;
}
idx = (idx + 1) & mask_;
}
}
double roundTowardsINF(double value) const { return std::round(value * 10.0) / 10.0; }
bool sortInplace() {
assert(this->data.size() > 0);
std::sort(this->data.begin(), this->data.end(),
[](const LocationEntry_& a, const LocationEntry_& b) {
return a.location < b.location;
});
return true;
}
public:
FastMap(size_t initCap = 1 << 14) : capacity_(initCap), mask_(initCap - 1) {
this->data.resize(initCap, LocationEntry_());
}
size_t size() const { return this->capacity_; }
__attribute__((hot)) bool update(std::string_view location, int32_t temperature) {
assert(!location.empty());
size_t idx = this->getIdx(location);
assert(idx >= 0 && idx < this->capacity_);
if (this->data[idx].location.size() <= 0) {
this->data[idx].location = std::string(location);
}
this->data[idx].freq += 1;
this->data[idx].sum += temperature;
this->data[idx].min = std::min(this->data[idx].min, temperature);
this->data[idx].max = std::max(this->data[idx].max, temperature);
return true;
}
bool update(const FastMap& other) {
for (size_t i = 0; i < other.size(); ++i) {
const auto& otherLocation = other.data[i].location;
if (otherLocation.size() <= 0) continue;
size_t thisIdx = this->getIdx(otherLocation);
if (this->data[thisIdx].location.size() <= 0) {
this->data[thisIdx].location = otherLocation;
}
this->data[thisIdx].freq += other.data[i].freq;
this->data[thisIdx].sum += other.data[i].sum;
this->data[thisIdx].min = std::min(this->data[thisIdx].min, other.data[i].min);
this->data[thisIdx].max = std::max(this->data[thisIdx].max, other.data[i].max);
}
return true;
}
void printSorted() {
this->sortInplace();
std::string outBuffer;
outBuffer.reserve(2 * 1024 * 1024);
for (const auto& locEntry : this->data) {
const auto& location = locEntry.location;
if (location.size() <= 0) continue;
if (outBuffer.size() > 1) outBuffer += ", ";
outBuffer += location;
outBuffer += "=";
double min_val = locEntry.min / 10.0;
double max_val = locEntry.max / 10.0;
double avg = roundTowardsINF(locEntry.sum / static_cast<double>(locEntry.freq * 10));
size_t pos = outBuffer.size();
outBuffer.resize(pos + 32);
int writtenBytes =
std::snprintf(&outBuffer[pos], 32, "%.1f/%.1f/%.1f", min_val, avg, max_val);
outBuffer.resize(pos + writtenBytes);
}
std::cout << outBuffer;
}
};
struct MMAPFile {
const char* filePtr;
const size_t fileSize;
};
size_t getFileSize(const std::string& fileName) { return std::filesystem::file_size(fileName); }
MMAPFile getMmappedFile(const char* fileName) {
int fd = open(fileName, O_RDONLY);
if (fd == -1) {
std::cerr << "Could not read input file" << std::endl;
std::exit(EXIT_FAILURE);
}
size_t fileSize = getFileSize(fileName);
char* map = static_cast<char*>(mmap(NULL, fileSize, PROT_READ, MAP_PRIVATE, fd, 0));
close(fd);
if (map == MAP_FAILED) {
std::cerr << "Cound not use mmap on the file" << std::endl;
std::exit(EXIT_FAILURE);
}
return {map, fileSize};
}
void unMapFile(MMAPFile f) { munmap(const_cast<char*>(f.filePtr), f.fileSize); }
size_t getBatchSize(size_t fileSize) { return (fileSize + (NUM_THREADS - 1)) / NUM_THREADS; }
// Assumes format: [-]D[D].D where D is digit
int32_t parseInt32(const char* start) {
int32_t value = 0;
bool isNegative = false;
if (start[0] == '-') {
isNegative = true;
start++;
}
value = (start[0] - '0') * 10 + start[2] - '0';
if (start[2] == '.') {
value = (start[0] - '0') * 100 + (start[1] - '0') * 10 + start[3] - '0';
}
return isNegative ? -value : value;
}
std::pair<std::string_view, int32_t> parseLine(std::string_view line) {
assert(!line.empty());
size_t semicolonPos = line.find(';');
assert(semicolonPos != std::string_view::npos);
std::string_view locationView = line.substr(0, semicolonPos);
// Find start and end of temperature
size_t tempStart = semicolonPos + 1;
assert(tempStart < line.size());
// Trim trailing whitespace
size_t tempEnd = line.size();
while (tempEnd > tempStart && (line[tempEnd - 1] == ' ' || line[tempEnd - 1] == '\t' ||
line[tempEnd - 1] == '\n' || line[tempEnd - 1] == '\r')) {
--tempEnd;
}
assert(tempEnd > tempStart);
const char* start = line.data() + tempStart;
int32_t temp_int = parseInt32(start);
return std::make_pair(locationView, temp_int);
}
size_t skipTillNextLine(size_t startPos, MMAPFile f) {
assert(startPos > 0);
const char* prevChar = f.filePtr + startPos - 1;
if (*prevChar == '\n') {
return 0;
}
size_t bytesSkipped = 0;
while (*(f.filePtr + startPos + bytesSkipped) != '\n') {
bytesSkipped++;
}
bytesSkipped++; // skip '\n'
assert(bytesSkipped > 0);
return bytesSkipped;
}
void processLinesInCurrBatch(size_t startPos, size_t batchEndPos, MMAPFile f, FastMap& m) {
size_t pos = startPos;
while (pos < batchEndPos && pos < f.fileSize) {
size_t newlinePos = pos;
while (newlinePos < f.fileSize && *(f.filePtr + newlinePos) != '\n') {
++newlinePos;
}
std::string_view line(f.filePtr + pos, newlinePos - pos);
assert(!line.empty());
auto result = parseLine(line);
auto [location, temperature] = result;
m.update(location, temperature);
pos = newlinePos + 1;
}
}
void accumulateBatch(size_t threadIndex, size_t startPos, size_t batchSizeBytes, FastMap& m,
MMAPFile f) {
assert(batchSizeBytes > 0);
size_t batchEnd = startPos + batchSizeBytes;
// Skip to the next line boundary at the start for non-zero threads
if (threadIndex > 0) {
size_t bytesSkipped = skipTillNextLine(startPos, f);
startPos += bytesSkipped;
}
processLinesInCurrBatch(startPos, batchEnd, f, m);
}
void accumulateThreadResults(const std::vector<FastMap>& maps, FastMap& finalMap) {
for (const auto& m : maps) {
finalMap.update(m);
}
}
void processInBatches(MMAPFile f, size_t batchSize, FastMap& finalMap) {
std::vector<FastMap> maps(NUM_THREADS, FastMap());
std::vector<std::thread> threads;
threads.reserve(NUM_THREADS);
for (size_t i = 0; i < NUM_THREADS; ++i) {
size_t startPos = i * batchSize;
threads.emplace_back(accumulateBatch, i, startPos, batchSize, std::ref(maps[i]), f);
}
for (auto& t : threads) {
t.join();
}
accumulateThreadResults(maps, finalMap);
}
void processLinesInSingleBatch(MMAPFile f, FastMap& finalMap) {
accumulateBatch(0, 0, f.fileSize, finalMap, f);
}
void accumulate(MMAPFile f, FastMap& finalMap) {
size_t batchSize = getBatchSize(f.fileSize);
assert(batchSize > 0);
if (batchSize > 4 * 1024) {
processInBatches(f, batchSize, finalMap);
} else {
processLinesInSingleBatch(f, finalMap);
}
}
void oneBrc(const char* filename) {
MMAPFile f = getMmappedFile(filename);
FastMap finalMap;
accumulate(f, finalMap);
std::cout << '{';
finalMap.printSorted();
std::cout << "}\n";
unMapFile(f);
}
int main(int argc, char* argv[]) {
const char* filename = argc > 1 ? argv[1] : "../data/measurements.txt";
oneBrc(filename);
return EXIT_SUCCESS;
}
Terminal window
Time (mean ยฑ ฯƒ): 32.188 s ยฑ 0.101 s [User: 71.249 s, System: 11.052 s]
Range (min โ€ฆ max): 32.047 s โ€ฆ 32.406 s 10 runs

Nice.

I finally resort to open-addressed hashmap with very simple hash function and weโ€™re now doing slightly better than the most optimal solution.

TODO: Iโ€™d like to try some ideas to make FastMap a bit more cache friendly by using just one 32 bit integer to store both min and max values but i doubt itโ€™ll make much of a difference, still will probably definitely try.

TODO: Summarize current results nicely in a table with benchmarks.

But honestly iโ€™m a little bored and would rather just jump to some rust, zig (my personal favorite) blog.