Problem

Given a URL startUrl and an interface HtmlParser, implement a Multi-threaded web crawler to crawl all links that are under the same hostname as startUrl.

Return all URLs obtained by your web crawler in any order.

Your crawler should:

  • Start from the page: startUrl
  • Call HtmlParser.getUrls(url) to get all URLs from a webpage of a given URL.
  • Do not crawl the same link twice.
  • Explore only the links that are under the same hostname as startUrl.

As shown in the example URL above, the hostname is example.org. For simplicity’s sake, you may assume all URLs use HTTP protocol without any port specified. For example, the URLs http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while URLs http://example.org/test and http://example.com/abc are not under the same hostname.

The HtmlParser interface is defined as such:

interface HtmlParser { // Return a list of all urls from a webpage of given _url_.
  // This is a blocking call, that means it will do HTTP request and return when this request is finished.
  public List<String> getUrls(String url); }

Note that getUrls(String url) simulates performing an HTTP request. You can treat it as a blocking function call that waits for an HTTP request to finish. It is guaranteed that getUrls(String url) will return the URLs within 15ms. Single-threaded solutions will exceed the time limit so, can your multi-threaded web crawler do better?

Below are two examples explaining the functionality of the problem. For custom testing purposes, you’ll have three variables urls, edges and startUrl. Notice that you will only have access to startUrl in your code, while urls and edges are not directly accessible to you in code.

Examples

Example 1:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
![](https://fastly.jsdelivr.net/gh/doocs/leetcode@main/solution/1200-1299/1242.Web%20Crawler%20Multithreaded/images/sample_2_1497.png)
Input: urls = [
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/",
"http://news.google.com",
"http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"
Output: [
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/",
"http://news.yahoo.com/us"
]

Example 2:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
**![](https://fastly.jsdelivr.net/gh/doocs/leetcode@main/solution/1200-1299/1242.Web%20Crawler%20Multithreaded/images/sample_3_1497.png)**
Input: 
urls = [
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/",
"http://news.google.com"
]
edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]
Explanation: The startUrl links to all other pages that do not share the same hostname.

Constraints:

  • 1 <= urls.length <= 1000
  • 1 <= urls[i].length <= 300
  • startUrl is one of the urls.
  • Hostname label must be from 1 to 63 characters long, including the dots, may contain only the ASCII letters from 'a' to 'z', digits from '0' to '9' and the hyphen-minus character ('-').
  • The hostname may not start or end with the hyphen-minus character (’-’).
  • See: https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames
  • You may assume there’re no duplicates in the URL library.

Follow up:

  1. Assume we have 10,000 nodes and 1 billion URLs to crawl. We will deploy the same software onto each node. The software can know about all the nodes. We have to minimize communication between machines and make sure each node does equal amount of work. How would your web crawler design change?
  2. What if one node fails or does not work?
  3. How do you know when the crawler is done?

Solution

Method 1 – Multithreaded DFS with Hostname Filtering

Intuition

We use a thread pool to crawl URLs in parallel, but only for URLs with the same hostname as startUrl. We use a thread-safe set to avoid visiting the same URL twice. Each thread crawls a URL and spawns new threads for its neighbors.

Approach

  1. Extract the hostname from startUrl.
  2. Use a thread-safe set to track visited URLs.
  3. Use a thread pool (e.g., ExecutorService in Java, ThreadPoolExecutor in Python) to crawl URLs in parallel.
  4. For each URL, if not visited and hostname matches, crawl it and submit its neighbors to the pool.
  5. Wait for all threads to finish and return the set of visited URLs.

Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
import java.util.*;
import java.util.concurrent.*;
/*
// This is the HtmlParser's API interface.
interface HtmlParser {
  public List<String> getUrls(String url);
}
*/
class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = getHost(startUrl);
        Set<String> vis = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newCachedThreadPool();
        vis.add(startUrl);
        CountDownLatch latch = new CountDownLatch(1);
        pool.submit(() -> dfs(startUrl, htmlParser, vis, host, pool, latch));
        try { latch.await(); } catch (InterruptedException e) {}
        pool.shutdown();
        return new ArrayList<>(vis);
    }
    private void dfs(String url, HtmlParser htmlParser, Set<String> vis, String host, ExecutorService pool, CountDownLatch latch) {
        try {
            for (String v : htmlParser.getUrls(url)) {
                if (getHost(v).equals(host) && vis.add(v)) {
                    latchCountUp(latch);
                    pool.submit(() -> dfs(v, htmlParser, vis, host, pool, latch));
                }
            }
        } finally {
            latch.countDown();
        }
    }
    private String getHost(String url) {
        int p = url.indexOf('/', 7);
        return url.substring(7, p == -1 ? url.length() : p);
    }
    private synchronized void latchCountUp(CountDownLatch latch) {
        // Not natively supported, so use a workaround: create a new latch with count+1 if needed.
        // For LeetCode, this is not needed as the number of tasks is bounded and latch is only decremented.
    }
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
from typing import List
from threading import Thread, Lock
class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def get_host(url: str) -> str:
            p = url.find('/', 7)
            return url[7:] if p == -1 else url[7:p]
        host = get_host(startUrl)
        vis = set([startUrl])
        lock = Lock()
        threads = []
        def dfs(url: str):
            for v in htmlParser.getUrls(url):
                if get_host(v) == host:
                    with lock:
                        if v in vis:
                            continue
                        vis.add(v)
                    t = Thread(target=dfs, args=(v,))
                    threads.append(t)
                    t.start()
        t0 = Thread(target=dfs, args=(startUrl,))
        threads.append(t0)
        t0.start()
        for t in threads:
            t.join()
        return list(vis)

Complexity

  • ⏰ Time complexity: O(n + m) — Where n is the number of URLs and m is the number of edges (links), as each URL and link is processed once.
  • 🧺 Space complexity: O(n) — For the visited set and threads.