Web Crawler

Problem

Given a url startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl.

Return all urls obtained by your web crawler in any order.

Your crawler should:

Start from the page: startUrl
Call HtmlParser.getUrls(url) to get all urls from a webpage of given url.
Do not crawl the same link twice.
Explore only the links that are under the same hostname as startUrl.

As shown in the example url above, the hostname is example.org. For simplicity sake, you may assume all urls use http protocol without any port specified. For example, the urls http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while urls http://example.org/test and http://example.com/abc are not under the same hostname.

The HtmlParser interface is defined as such:

interface HtmlParser { // Return a list of all urls from a webpage of given _url_.
  public List<String> getUrls(String url); }

Below are two examples explaining the functionality of the problem, for custom testing purposes you’ll have three variables urls, edges and startUrl. Notice that you will only have access to startUrl in your code, while urls and edges are not directly accessible to you in code.

Note: Consider the same URL with the trailing slash “/” as a different URL. For example, “http://news.yahoo.com”, and “http://news.yahoo.com/" are different urls.

Examples

Example 1:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
Input: urls = [
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/",
"http://news.google.com",
"http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"
Output: [
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/",
"http://news.yahoo.com/us"
]

Example 2:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
**![](https://fastly.jsdelivr.net/gh/doocs/leetcode@main/solution/1200-1299/1236.Web%20Crawler/images/sample_3_1497.png)**
Input: 
urls = [
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/",
"http://news.google.com"
]
edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]
Explanation: The startUrl links to all other pages that do not share the same hostname.

Constraints:

1 <= urls.length <= 1000
1 <= urls[i].length <= 300
startUrl is one of the urls.
Hostname label must be from 1 to 63 characters long, including the dots, may contain only the ASCII letters from ‘a’ to ‘z’, digits from ‘0’ to ‘9’ and the hyphen-minus character (’-’).
The hostname may not start or end with the hyphen-minus character (’-’).
See: https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames
You may assume there’re no duplicates in url library.

Solution

Method 1 – BFS/DFS with Hostname Filtering

Intuition

We need to crawl all URLs reachable from startUrl, but only those with the same hostname. We use BFS or DFS, keeping a set of visited URLs, and only add URLs with the same hostname as startUrl.

Approach

Extract the hostname from startUrl.
Use a queue (BFS) or stack (DFS) to traverse URLs, starting from startUrl.
For each URL, if not visited and hostname matches, add to result and mark as visited.
For each URL, get its neighbors using HtmlParser.getUrls and repeat.
Return the set of visited URLs as the result.

Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
/**
 * // This is the HtmlParser's API interface.
 * class HtmlParser {
 *   public:
 *     vector<string> getUrls(string url);
 * };
 */
class Solution {
public:
    vector<string> crawl(string startUrl, HtmlParser htmlParser) {
        auto getHost = [](const string& url) {
            int p = url.find('/', 7);
            return url.substr(7, p == string::npos ? string::npos : p - 7);
        };
        string host = getHost(startUrl);
        unordered_set<string> vis{startUrl};
        queue<string> q;
        q.push(startUrl);
        while (!q.empty()) {
            string u = q.front(); q.pop();
            for (const string& v : htmlParser.getUrls(u)) {
                if (!vis.count(v) && getHost(v) == host) {
                    vis.insert(v);
                    q.push(v);
                }
            }
        }
        return vector<string>(vis.begin(), vis.end());
    }
};

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// type HtmlParser interface { getUrls(url string) []string }
func crawl(startUrl string, htmlParser HtmlParser) []string {
    getHost := func(url string) string {
        p := strings.Index(url[7:], "/")
        if p == -1 {
            return url[7:]
        }
        return url[7 : 7+p]
    }
    host := getHost(startUrl)
    vis := map[string]bool{startUrl: true}
    q := []string{startUrl}
    for len(q) > 0 {
        u := q[0]
        q = q[1:]
        for _, v := range htmlParser.getUrls(u) {
            if !vis[v] && getHost(v) == host {
                vis[v] = true
                q = append(q, v)
            }
        }
    }
    ans := []string{}
    for k := range vis {
        ans = append(ans, k)
    }
    return ans
}

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
/*
// This is the HtmlParser's API interface.
interface HtmlParser {
  public List<String> getUrls(String url);
}
*/
import java.util.*;
class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = getHost(startUrl);
        Set<String> vis = new HashSet<>();
        Queue<String> q = new LinkedList<>();
        vis.add(startUrl);
        q.add(startUrl);
        while (!q.isEmpty()) {
            String u = q.poll();
            for (String v : htmlParser.getUrls(u)) {
                if (!vis.contains(v) && getHost(v).equals(host)) {
                    vis.add(v);
                    q.add(v);
                }
            }
        }
        return new ArrayList<>(vis);
    }
    private String getHost(String url) {
        int p = url.indexOf('/', 7);
        return url.substring(7, p == -1 ? url.length() : p);
    }
}

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// interface HtmlParser { fun getUrls(url: String): List<String> }
class Solution {
    fun crawl(startUrl: String, htmlParser: HtmlParser): List<String> {
        fun getHost(url: String): String {
            val p = url.indexOf('/', 7)
            return url.substring(7, if (p == -1) url.length else p)
        }
        val host = getHost(startUrl)
        val vis = mutableSetOf(startUrl)
        val q = ArrayDeque<String>()
        q.add(startUrl)
        while (q.isNotEmpty()) {
            val u = q.removeFirst()
            for (v in htmlParser.getUrls(u)) {
                if (v !in vis && getHost(v) == host) {
                    vis.add(v)
                    q.add(v)
                }
            }
        }
        return vis.toList()
    }
}

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
from typing import List
# HtmlParser is provided by the system
class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def get_host(url: str) -> str:
            p = url.find('/', 7)
            return url[7:] if p == -1 else url[7:p]
        host = get_host(startUrl)
        vis = set([startUrl])
        q = [startUrl]
        while q:
            u = q.pop(0)
            for v in htmlParser.getUrls(u):
                if v not in vis and get_host(v) == host:
                    vis.add(v)
                    q.append(v)
        return list(vis)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// trait HtmlParser { fn get_urls(&self, url: String) -> Vec<String>; }
use std::collections::{HashSet, VecDeque};
impl Solution {
    pub fn crawl(start_url: String, html_parser: &impl HtmlParser) -> Vec<String> {
        fn get_host(url: &str) -> &str {
            let p = url[7..].find('/').map(|x| x + 7);
            &url[7..p.unwrap_or(url.len())]
        }
        let host = get_host(&start_url);
        let mut vis = HashSet::new();
        let mut q = VecDeque::new();
        vis.insert(start_url.clone());
        q.push_back(start_url);
        while let Some(u) = q.pop_front() {
            for v in html_parser.get_urls(u.clone()) {
                if !vis.contains(&v) && get_host(&v) == host {
                    vis.insert(v.clone());
                    q.push_back(v);
                }
            }
        }
        vis.into_iter().collect()
    }
}

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
// interface HtmlParser { getUrls(url: string): string[] }
class Solution {
    crawl(startUrl: string, htmlParser: HtmlParser): string[] {
        function getHost(url: string): string {
            const p = url.indexOf('/', 7);
            return url.substring(7, p === -1 ? url.length : p);
        }
        const host = getHost(startUrl);
        const vis = new Set([startUrl]);
        const q = [startUrl];
        while (q.length) {
            const u = q.shift()!;
            for (const v of htmlParser.getUrls(u)) {
                if (!vis.has(v) && getHost(v) === host) {
                    vis.add(v);
                    q.push(v);
                }
            }
        }
        return Array.from(vis);
    }
}

Complexity

⏰ Time complexity: O(n + m) — Where n is the number of URLs and m is the number of edges (links), as each URL and link is processed once.
🧺 Space complexity: O(n) — For the visited set and queue.

Problem#

Examples#

Solution#

Method 1 – BFS/DFS with Hostname Filtering#

Intuition#

Approach#

Code#

Complexity#