Problem
Implement a file syncing algorithm for two computers over a low-bandwidth network. What if we know the files in the two computers are mostly the same?
Solution
Method 1 - Fingerprint comparison
1. Initial Handshake:
- Each computer calculates a fingerprint (hash) for each file it possesses. This could be a checksum (MD5, SHA-1) or a stronger cryptographic hash (SHA-256).
- One computer (initiator) sends the list of file fingerprints to the other computer (responder).
2. Fingerprint Comparison:
- The responder compares its own file fingerprints with the received list.
- For each matching fingerprint:
- The files are considered identical. No transfer is needed.
- For each non-matching fingerprint:
- The responder calculates the difference between the files using a diff algorithm (e.g., rsync). This identifies the changed data blocks.
3. Efficient Transfer:
- The initiator only transmits the actual changed data blocks for the non-matching files, not the entire files.
- The responder uses the received data blocks and its existing file to reconstruct the updated version.
4. Verification and Completion:
- After receiving the data blocks, the responder recalculates the fingerprint for the updated file.
- If the recalculated fingerprint matches the initiator’s fingerprint, the update is successful.
- The responder sends an acknowledgment to the initiator.
Code
Java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class LowBandwidthSync {
public static void synchronize(FileSyncClient client, FileSyncServer server, DiffFunction diffFunction) throws Exception {
""
"
Synchronizes files between two computers over a low - bandwidth network.
Args:
client: The client computer object(can be extended
for network communication).
server: The server computer object(can be extended
for network communication).
diffFunction: A
function that calculates the difference between files.
""
"
// Calculate fingerprints for client files
Map<String, String> clientFingerprints = new HashMap<>();
for (File file: client.listFiles()) {
clientFingerprints.put(file.getName(), hashFile(file));
}
// Send fingerprints to server
server.receiveFingerprints(clientFingerprints);
// Server compares fingerprints with its own
Map<String, String> serverFingerprints = server.getFingerprints();
List<String> changedFiles = compareFingerprints(clientFingerprints, serverFingerprints);
// Server sends diffs for changed files
for (String filename: changedFiles) {
byte[] diffData = diffFunction.calculateDiff(server.getFile(file.getName()), client.getFile(filename));
client.receiveDiff(filename, diffData);
}
// Client applies diffs to update files
for (Map.Entry < String, byte[] > entry: client.getDiffs().entrySet()) {
client.applyDiff(entry.getKey(), entry.getValue());
}
}
private static String hashFile(File file) throws Exception {
MessageDigest digest = MessageDigest.getInstance("SHA-256");
byte[] buffer = new byte[1024];
int bytesRead;
try (FileInputStream inputStream = new FileInputStream(file)) {
while ((bytesRead = inputStream.read(buffer)) != -1) {
digest.update(buffer, 0, bytesRead);
}
}
byte[] hash = digest.digest();
return convertByteArrayToHex(hash); // Convert byte array to hex string
}
private static String convertByteArrayToHex(byte[] a) {
StringBuilder sb = new StringBuilder(a.length * 2);
for (byte b: a)
sb.append(String.format("%02x", b & 0xff));
return sb.toString();
}
// Interface definitions for network communication and file operations (replace with actual implementations)
public interface FileSyncClient {
List<File> listFiles();
String getFile(String filename) throws IOException;
void receiveDiff(String filename, byte[] diffData) throws IOException;
Map < String, byte[] > getDiffs();
void applyDiff(String filename, byte[] diffData) throws IOException;
}
public interface FileSyncServer {
void receiveFingerprints(Map<String, String> fingerprints);
Map<String, String> getFingerprints();
byte[] getFile(String filename) throws IOException;
}
public interface DiffFunction {
byte[] calculateDiff(File file1, File file2) throws IOException;
}
private static List<String> compareFingerprints(Map<String, String> clientFingerprints, Map<String, String> serverFingerprints) {
List<String> changedFiles = new ArrayList<>();
for (Map.Entry<String, String> entry: clientFingerprints.entrySet()) {
if (!serverFingerprints.containsKey(entry.getKey()) || !serverFingerprints.get(entry.getKey()).equals(entry.getValue())) {
changedFiles.add(entry.getKey());
}
}
return changedFiles;
}
}
Python
import hashlib
def low_bandwidth_sync(client, server, file_diff_func):
# Calculate fingerprints for client files
client_fingerprints = {f: hashlib.sha256(open(f, 'rb').read()).hexdigest() for f in client.list_files()}
# Send fingerprints to server
server.receive_fingerprints(client_fingerprints)
# Server compares fingerprints with its own
server_fingerprints = server.get_fingerprints()
changed_files = []
for filename, fingerprint in client_fingerprints.items():
if filename not in server_fingerprints or server_fingerprints[filename] != fingerprint:
changed_files.append(filename)
# Server sends diffs for changed files
for filename in changed_files:
diff_data = file_diff_func(server.get_file(filename), client.get_file(filename))
client.receive_diff(filename, diff_data)
# Client applies diffs to update files
for filename, diff_data in client.get_diffs():
client.apply_diff(filename, diff_data)
# Example usage (replace with actual file system interaction and network communication)
client_files = ["file1.txt", "file2.txt", "file3.txt"]
server_files = ["file1.txt", "file2.txt", "file4.txt"]
def get_file_diff(file1, file2):
# Replace with actual diff generation logic
return "some diff data"
low_bandwidth_sync(client_files, server_files, get_file_diff)
Complexity
- ⏰ Time complexity:
O(n*(F + D)
wheren
is number of files, andF
is average file size, andD
is average diff calculation time.- Calculating the fingerprint (hash) for each file takes time proportional to the file size. In the worst case (assuming all files are processed), this can be O(n * F), where F is the average file size.
- Comparing fingerprints on both client and server involves iterating through the list of files (O(n)) and performing string comparisons for each fingerprint. This is generally faster than calculating fingerprints and is considered O(n) in most cases.
- Fingerprint Comparison: Comparing fingerprints on both client and server involves iterating through the list of files (O(n)) and performing string comparisons for each fingerprint. This is generally faster than calculating fingerprints and is considered O(n) in most cases.
- Diff Generation (if applicable): If a diff function is used to calculate the difference between changed files, its time complexity depends on the specific diff algorithm used. Some diff algorithms have linear complexity (O(n)) in the worst case, while others might be more complex.
- Network Communication: The time for sending and receiving data over the network depends on factors like network bandwidth and latency. It’s difficult to quantify precisely but can be significant for large files on slow networks.
- 🧺 Space complexity:
O(n*S)
whereS
is average fingerprint size