Problem
Given a list paths of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.
A group of duplicate files consists of at least two files that have the same content.
A single directory info string in the input list has the following format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"
It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
The output is a list of groups of duplicate file paths. Each group contains all the file paths of the files that share the same content. A file path is a string that has the following format:
"directory_path/file_name.txt"
Examples
Example 1:
Input:
paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
Output:
[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Example 2:
Input:
paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
Output:
[["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Solution
Method 1 - Using Hashmap
To solve this problem, we will use a hash map (dictionary in Python) to store the content of files as keys and their paths as values. This way, all files with the same content will be grouped together.
Here are the steps:
- Create a dictionary to map contents to a list of paths.
- Iterate over each string in the input list paths.
- Split each string to get the directory path and the list of files with their contents.
- For each file, extract the filename and its content, and store the path in the dictionary under the corresponding content key.
- Finally, extract all the groups from the dictionary and return those that contain more than one path.
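To see the grouping in action, this is the content-to-paths map the scan builds for the input of Example 1; both groups contain more than one path, so both survive the final filter:

```python
# Intermediate map for Example 1 (content -> accumulated file paths)
{
    "abcd": ["root/a/1.txt", "root/c/3.txt"],
    "efgh": ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
}
```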
Code
Java
```java
import java.util.*;

class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        Map<String, List<String>> map = new HashMap<>();
        for (String path : paths) {
            String[] parts = path.split(" ");
            String dir = parts[0]; // first token is the directory path
            // Remaining tokens look like "name.txt(content)".
            for (int i = 1; i < parts.length; i++) {
                String part = parts[i];
                String fileName = part.substring(0, part.indexOf('('));
                String content = part.substring(part.indexOf('(') + 1, part.indexOf(')'));
                map.putIfAbsent(content, new ArrayList<>());
                String filePath = dir + "/" + fileName;
                map.get(content).add(filePath);
            }
        }
        // Only contents shared by at least two files are duplicates.
        List<List<String>> ans = new ArrayList<>();
        map.entrySet().stream()
           .filter(e -> e.getValue().size() > 1)
           .forEach(e -> ans.add(e.getValue()));
        return ans;
    }
}
```
Python
```python
from collections import defaultdict
from typing import Dict, List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        content_map: Dict[str, List[str]] = defaultdict(list)
        for path in paths:
            parts = path.split(" ")
            dir_path = parts[0]  # first token is the directory path
            for file in parts[1:]:  # remaining tokens: "name.txt(content)"
                file_name, file_content = file.split("(")
                file_content = file_content[:-1]  # remove the closing ')'
                file_path = f"{dir_path}/{file_name}"
                content_map[file_content].append(file_path)
        # Only contents shared by at least two files are duplicates.
        ans: List[List[str]] = [group for group in content_map.values() if len(group) > 1]
        return ans
```
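A quick sanity check against Example 1 (the driver below is ours, not part of the solution; since groups come from hash-map iteration, any order is acceptable):

```python
paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
         "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
print(Solution().findDuplicate(paths))
# [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]
```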
Complexity
- Time: O(n * m), where n is the number of directories and m is the average length of each directory string. This accounts for iterating over the input and processing each file.
- Space: O(n * m), due to the space required by the dictionary to store file paths and their contents.