Problem

If you have a 2GB file with one string per line, which sorting algorithm would you use to sort the file and why?

OR

Sort a given 8 GB log file by timestamp when only 2 GB of RAM is available.

Solution

We have seen this problem for integers as well (External Sorting of integers). Now we do the same for strings.

When an interviewer gives a size limit of 2 GB, it should tell you something: in this case, it suggests that they don't want you to bring all the data into memory. So what do we do? We bring only part of the data into memory at a time.

Short answer (one line): use external (k-way) merge sort — read/sort manageable chunks, write sorted runs to disk, then k-way merge the runs with a min-heap.

Because the data is too large to fit entirely in the memory we are allowed to use, we stream and process the file in pieces. Two practical approaches are:

  1. K-way external merge sort — split the input into chunks that fit in memory, sort each chunk and write it as a sorted run, then merge the runs using a k-way (min-heap) merger.
  2. Partition / bucket approach — partition the input into on-disk buckets by a prefix/hash/range (for example by timestamp prefix or hash of the key), then sort each bucket in memory and concatenate the results (useful when you can partition by key cheaply; a minimal sketch follows this list).
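Method 1 is expanded below; first, here is a minimal sketch of the partition/bucket idea (approach 2), assuming lines can be routed by their leading byte so that buckets come back already in key order. The bucket count, the file names, and the assumption that every bucket fits in memory are illustrative, not requirements.

```cpp
// Sketch of the partition/bucket approach for a line-oriented text file.
// Assumes each bucket fits in memory; names and the bucket count are illustrative.
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

void bucketSort(const std::string& input, const std::string& output) {
    const int kBuckets = 256;              // one bucket per leading byte
    std::vector<std::ofstream> buckets;
    for (int b = 0; b < kBuckets; ++b)
        buckets.emplace_back("bucket_" + std::to_string(b) + ".tmp");

    // Pass 1: route each line to the bucket for its first byte.
    std::ifstream in(input);
    std::string line;
    while (std::getline(in, line)) {
        int b = line.empty() ? 0 : static_cast<unsigned char>(line[0]);
        buckets[b] << line << '\n';
    }
    for (auto& f : buckets) f.close();

    // Pass 2: sort each bucket in memory and append it to the output.
    // Buckets are visited in byte order, so the concatenation is globally sorted.
    std::ofstream out(output);
    for (int b = 0; b < kBuckets; ++b) {
        std::ifstream bf("bucket_" + std::to_string(b) + ".tmp");
        std::vector<std::string> lines;
        while (std::getline(bf, line)) lines.push_back(line);
        std::sort(lines.begin(), lines.end());
        for (const auto& s : lines) out << s << '\n';
    }
}
```

If a single leading byte gives badly skewed buckets (common with log lines that all share a timestamp prefix), partition on a longer prefix or a range table instead; any bucket that still does not fit in memory can be partitioned again or sorted with Method 1.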

Method 1 - K-way External Merge Sort

We can combine internal (in-memory) sorting with external merging. Use these symbols to reason about the costs:

  • S = total number of records (lines) in the file
  • M = number of records we can hold in memory (chunk capacity)
  • K = number of runs/chunks = ceil(S / M)

Algorithm:

  1. Read up to M records (or up to ~memory_limit bytes) from the input, sort them in-memory (e.g., quicksort or std::sort), and write the sorted run to a temporary file. Repeat until the entire input has been processed — this produces K sorted runs.
  2. Open each run for sequential reading and perform a K-way merge using a min-heap of size at most K: push the first record of each run onto the heap, repeatedly pop the smallest record and write it to the output, then read the next record from the run that provided the popped element and push it onto the heap.
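
A minimal sketch of both phases, assuming a newline-delimited text file whose individual lines fit in memory. The run-file names, the line-count chunk limit, and the input/output paths are illustrative; a real implementation would budget by bytes (see the tuning notes below) and clean up the temporary runs.

```cpp
// Sketch of an external K-way merge sort over a line-oriented text file.
#include <algorithm>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Phase 1: read up to maxLines lines at a time, sort them, write a sorted run.
// Returns the names of the run files produced.
std::vector<std::string> createRuns(const std::string& input, size_t maxLines) {
    std::ifstream in(input);
    std::vector<std::string> runs;
    std::vector<std::string> buffer;
    std::string line;
    auto flush = [&]() {
        if (buffer.empty()) return;
        std::sort(buffer.begin(), buffer.end());
        std::string name = "run_" + std::to_string(runs.size()) + ".tmp";
        std::ofstream out(name);
        for (const auto& s : buffer) out << s << '\n';
        runs.push_back(name);
        buffer.clear();
    };
    while (std::getline(in, line)) {
        buffer.push_back(line);
        if (buffer.size() >= maxLines) flush();
    }
    flush();
    return runs;
}

// Phase 2: K-way merge of the sorted runs with a min-heap keyed on the
// current front line of each run.
void mergeRuns(const std::vector<std::string>& runs, const std::string& output) {
    struct Entry {
        std::string line;
        size_t src;                       // which run this line came from
        bool operator>(const Entry& o) const { return line > o.line; }
    };
    std::vector<std::ifstream> readers;
    for (const auto& r : runs) readers.emplace_back(r);

    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::string line;
    for (size_t i = 0; i < readers.size(); ++i)
        if (std::getline(readers[i], line)) heap.push({line, i});

    std::ofstream out(output);
    while (!heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        out << e.line << '\n';
        // Refill the heap from the run that supplied the popped record.
        if (std::getline(readers[e.src], line)) heap.push({line, e.src});
    }
}

int main() {
    // Illustrative sizes only: pick the chunk limit from available RAM.
    auto runs = createRuns("input.txt", 1000000);
    mergeRuns(runs, "sorted.txt");
}
```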

Cost analysis:

  • Sorting runs: each run has at most M records, so sorting all runs costs about K * (M log M) ≈ S * log M.
  • K-way merge: each of the S records is pushed and popped from a heap of size K, cost S * log K.
  • Overall: O(S log M + S log K) — typically M is chosen large (close to available memory) so the dominant cost is S log M.
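
As a rough worked example (the numbers are illustrative): for S = 100 million lines and M = 10 million lines per run, K = ceil(S / M) = 10. Sorting the runs costs roughly S * log2(M) ≈ 100,000,000 * 23 comparisons, while the merge adds only S * log2(K) ≈ 100,000,000 * 3.3, so the in-memory run sorting dominates, as expected.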

Space: you use O(M) memory for in-memory sorting and O(K) additional small buffers for merging (heap + read buffers). Use disk for temporary runs.

The above algorithm is the standard external sort. Tuning points: budget M in bytes rather than records (strings are variable-length), use buffered I/O, and make each run as large as memory comfortably allows so that the number of runs, and hence the merge fan-in, stays small.
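
A small sketch of byte-budgeted chunking, which could replace the fixed line count in the earlier sketch; the per-line overhead estimate and the helper name readChunk are assumptions for illustration.

```cpp
// Accumulate lines until roughly memoryLimitBytes have been buffered,
// then hand the chunk back to the caller for sorting and writing as a run.
#include <fstream>
#include <string>
#include <vector>

bool readChunk(std::ifstream& in, size_t memoryLimitBytes,
               std::vector<std::string>& chunk) {
    chunk.clear();
    size_t used = 0;
    std::string line;
    while (used < memoryLimitBytes && std::getline(in, line)) {
        used += line.size() + sizeof(std::string);  // rough per-line overhead
        chunk.push_back(std::move(line));
    }
    return !chunk.empty();
}
```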

Also take a look at: Internal vs External Sorting

Practical tips

  • Read chunks by bytes (e.g., accumulate lines until ~memory_limit bytes) because strings are variable-length.
  • Use buffered readers/writers (e.g., 64KB–1MB buffers) to reduce syscalls and improve throughput.
  • Parse fixed-format keys (e.g., timestamps) into compact binary keys if parsing or comparison cost is significant; compare keys instead of full strings when possible (see the sketch after this list).
  • Clean up temporary files on normal exit and on errors; write into a temp directory and atomically rename the final output.
  • Tune K and run size to balance number of open files vs heap cost; if K is huge, perform multi-pass merging (merge into intermediate runs first).
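
A minimal sketch of key-based comparison for the log-file variant, assuming each line begins with a fixed-width ISO-8601 timestamp; the field width and function names are assumptions. ISO-8601 timestamps already sort correctly as plain strings, so the key can be the leading field itself; other formats would instead be parsed into an integer such as epoch seconds.

```cpp
// Compare log lines by their leading timestamp field instead of the whole line.
#include <algorithm>
#include <string>
#include <string_view>

std::string_view timestampKey(const std::string& line) {
    constexpr size_t kWidth = 19;          // "YYYY-MM-DDTHH:MM:SS" (assumed format)
    return std::string_view(line).substr(0, std::min(line.size(), kWidth));
}

bool byTimestamp(const std::string& a, const std::string& b) {
    return timestampKey(a) < timestampKey(b);
}
```

A comparator like this can be passed to std::sort when sorting each run and used to order the merge heap, so only the timestamp field is compared on every heap operation.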