Problem
Given a list paths of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.
A group of duplicate files consists of at least two files that have the same content.
A single directory info string in the input list has the following format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"
It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
The output is a list of groups of duplicate file paths. Each group contains all the file paths of the files that share the same content. A file path is a string that has the following format:
"directory_path/file_name.txt"
Examples
Example 1:
Input:
paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
Output:
[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Example 2:
Input:
paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
Output:
[["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Solution
Method 1 - Using Hashmap
To solve this problem, we will use a hash map (dictionary in Python) to store the content of files as keys and their paths as values. This way, all files with the same content will be grouped together.
Here are the steps:
- Create a dictionary to map contents to a list of paths.
- Iterate over each string in the input list paths.
- Split each string to get the directory path and the list of files with their contents.
- For each file, extract the filename and its content, and store the path in the dictionary under the corresponding content key.
- Finally, extract all the groups from the dictionary and return those that contain more than one path.
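To see the grouping in action, this is the content-to-paths map the scan builds for the input of Example 1; both groups contain more than one path, so both survive the final filter:

```python
# Intermediate map for Example 1 (content -> accumulated file paths)
{
    "abcd": ["root/a/1.txt", "root/c/3.txt"],
    "efgh": ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
}
```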
Code
Java
```java
import java.util.*;

class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        Map<String, List<String>> map = new HashMap<>();
        for (String path : paths) {
            String[] parts = path.split(" ");
            String dir = parts[0]; // first token is the directory path
            // Remaining tokens look like "name.txt(content)".
            for (int i = 1; i < parts.length; i++) {
                String part = parts[i];
                String fileName = part.substring(0, part.indexOf('('));
                String content = part.substring(part.indexOf('(') + 1, part.indexOf(')'));
                map.putIfAbsent(content, new ArrayList<>());
                String filePath = dir + "/" + fileName;
                map.get(content).add(filePath);
            }
        }
        // Only contents shared by at least two files are duplicates.
        List<List<String>> ans = new ArrayList<>();
        map.entrySet().stream()
           .filter(e -> e.getValue().size() > 1)
           .forEach(e -> ans.add(e.getValue()));
        return ans;
    }
}
```
Python
```python
from collections import defaultdict
from typing import Dict, List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        content_map: Dict[str, List[str]] = defaultdict(list)
        for path in paths:
            parts = path.split(" ")
            dir_path = parts[0]  # first token is the directory path
            for file in parts[1:]:  # remaining tokens: "name.txt(content)"
                file_name, file_content = file.split("(")
                file_content = file_content[:-1]  # remove the closing ')'
                file_path = f"{dir_path}/{file_name}"
                content_map[file_content].append(file_path)
        # Only contents shared by at least two files are duplicates.
        ans: List[List[str]] = [group for group in content_map.values() if len(group) > 1]
        return ans
```
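A quick sanity check against Example 1 (the driver below is ours, not part of the solution; since groups come from hash-map iteration, any order is acceptable):

```python
paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
         "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
print(Solution().findDuplicate(paths))
# [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]
```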
Complexity
- Time: O(n * m), where n is the number of directories and m is the average length of each directory string. This accounts for iterating over the input and processing each file.
- Space: O(n * m), due to the space required by the dictionary to store file paths and their contents.