Problem
Implement a file-syncing algorithm for two computers over a low-bandwidth network. What if we know the files on the two computers are mostly the same?
Solution
Method 1 - Fingerprint comparison
1. Initial Handshake:
- Each computer calculates a fingerprint (hash) for each file it possesses. This could be a fast, widely available hash (MD5, SHA-1) or a stronger cryptographic hash (SHA-256); the latter is preferable when collision resistance matters.
- One computer (initiator) sends the list of file fingerprints to the other computer (responder).
2. Fingerprint Comparison:
- The responder compares its own file fingerprints with the received list.
- For each matching fingerprint:
- The files are considered identical. No transfer is needed.
- For each non-matching fingerprint:
- The responder replies with block checksums for its copy, and the initiator computes the difference using a block-level diff algorithm (e.g., the rolling-checksum technique used by rsync). This identifies the changed data blocks.
3. Efficient Transfer:
- The initiator only transmits the actual changed data blocks for the non-matching files, not the entire files.
- The responder uses the received data blocks and its existing file to reconstruct the updated version.
4. Verification and Completion:
- After receiving the data blocks, the responder recalculates the fingerprint for the updated file.
- If the recalculated fingerprint matches the initiator’s fingerprint, the update is successful.
- The responder sends an acknowledgment to the initiator.
Code
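A minimal sketch of the fingerprint-comparison phase in Python. It syncs at whole-file granularity only (no block-level deltas), and the names `fingerprint`, `build_manifest`, and `plan_sync` are illustrative, not part of any standard API:

```python
import hashlib
import os

def fingerprint(path, algo="sha256"):
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map each file path (relative to root) to its fingerprint."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = fingerprint(full)
    return manifest

def plan_sync(initiator_manifest, responder_manifest):
    """Compare manifests: files to send (changed/missing on responder) and stale files to delete."""
    to_send = [p for p, fp in initiator_manifest.items()
               if responder_manifest.get(p) != fp]
    to_delete = [p for p in responder_manifest if p not in initiator_manifest]
    return to_send, to_delete
```

In a real protocol only the manifests cross the network during the handshake; the file contents for `to_send` follow afterwards, ideally as block deltas rather than whole files.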
Complexity
- ⏰ Time complexity: O(n * (F + D)), where n is the number of files, F is the average file size, and D is the average diff-calculation time.
- Fingerprint calculation: hashing a file takes time proportional to its size, so in the worst case (all files processed) this is O(n * F).
- Fingerprint comparison: comparing fingerprints on both client and server means iterating through the file list (O(n)) and doing one string comparison per fingerprint; this is generally much cheaper than computing the fingerprints.
- Diff generation (if applicable): depends on the diff algorithm used; some run in time linear in the file size, while others are more expensive.
- Network communication: depends on bandwidth and latency; it is difficult to quantify precisely but can dominate for large files on slow networks.
- 🧺 Space complexity: O(n * S), where S is the average fingerprint size, to store the list of fingerprints.
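The block-level diff from step 2 can be sketched with fixed-size blocks. This is a simplification: real rsync slides a rolling checksum over every byte offset so it also matches blocks that have shifted, whereas this sketch only compares blocks at the same offsets. The tiny block size is purely for illustration:

```python
import hashlib

BLOCK = 4  # illustrative only; practical block sizes are hundreds of bytes or more

def block_hashes(data, block=BLOCK):
    """Hash each fixed-size block of the responder's existing copy."""
    return [hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

def delta(new_data, old_hashes, block=BLOCK):
    """Build transfer ops: ('copy', block_index) reuses the responder's block,
    ('data', bytes) sends literal bytes for blocks that changed."""
    ops = []
    for i in range(0, len(new_data), block):
        chunk = new_data[i:i + block]
        idx = i // block
        if idx < len(old_hashes) and hashlib.sha256(chunk).hexdigest() == old_hashes[idx]:
            ops.append(("copy", idx))
        else:
            ops.append(("data", chunk))
    return ops

def apply_delta(old_data, ops, block=BLOCK):
    """Reconstruct the updated file from the responder's old copy plus the ops."""
    out = b""
    for op, arg in ops:
        out += old_data[arg * block:(arg + 1) * block] if op == "copy" else arg
    return out
```

The cost of building the delta is linear in the file size, which is where the D term in the time complexity comes from; only the literal `data` ops cross the network.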