题目大意:
多线程网页爬虫
Given a url startUrl
and an interface HtmlParser
, implement a Multi-threaded web crawler to crawl all links that are under the same hostname as startUrl
.
Return all urls obtained by your web crawler in any order.
Your crawler should:
- Start from the page:
startUrl
- Call
HtmlParser.getUrls(url)
to get all urls from a webpage of given url. - Do not crawl the same link twice.
- Explore only the links that are under the same hostname as
startUrl
.
As shown in the example url above, the hostname is example.org
. For simplicity sake, you may assume all urls use http protocol without any port specified. For example, the urls http://leetcode.com/problems
and http://leetcode.com/contest
are under the same hostname, while urls http://example.org/test
and http://example.com/abc
are not under the same hostname.
The HtmlParser
interface is defined as such:
interface HtmlParser { // Return a list of all urls from a webpage of given url. // This is a blocking call, that means it will do HTTP request and return when this request is finished. public List<String> getUrls(String url); }
Note that getUrls(String url)
simulates performing a HTTP request. You can treat it as a blocking function call which waits for a HTTP request to finish. It is guaranteed that getUrls(String url)
will return the urls within 15ms. Single-threaded solutions will exceed the time limit so, can your multi-threaded web crawler do better?
Below are two examples explaining the functionality of the problem, for custom testing purposes you’ll have three variables urls
, edges
and startUrl
. Notice that you will only have access to startUrl
in your code, while urls
and edges
are not directly accessible to you in code.
Follow up:
- Assume we have 10,000 nodes and 1 billion URLs to crawl. We will deploy the same software onto each node. The software can know about all the nodes. We have to minimize communication between machines and make sure each node does equal amount of work. How would your web crawler design change?
- What if one node fails or does not work?
- How do you know when the crawler is done?
Example 1:
Input: urls = [ "http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.google.com", "http://news.yahoo.com/us" ] edges = [[2,0],[2,1],[3,2],[3,1],[0,4]] startUrl = "http://news.yahoo.com/news/topics/" Output: [ "http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.yahoo.com/us" ]
Example 2:
Input: urls = [ "http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.google.com" ] edges = [[0,2],[2,1],[3,2],[3,1],[3,0]] startUrl = "http://news.google.com" Output: ["http://news.google.com"] Explanation: The startUrl links to all other pages that do not share the same hostname.
Constraints:
1 <= urls.length <= 1000
1 <= urls[i].length <= 300
startUrl
is one of theurls
.- Hostname label must be from 1 to 63 characters long, including the dots, may contain only the ASCII letters from ‘a’ to ‘z’, digits from ‘0’ to ‘9’ and the hyphen-minus character (‘-‘).
- The hostname may not start or end with the hyphen-minus character (‘-‘).
- See: https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames
- You may assume there’re no duplicates in url library.
如果想查看本题目是哪家公司的面试题,请参考以下免费链接: https://leetcode.jp/problemdetail.php?id=1242
解题思路分析:
这本是一道dfs题目,但要使用多线程来解决。可以理解为dfs的多线程升级版。标准的dfs解法可以参照 LEETCODE 1236. Web Crawler 解题思路分析 这篇文章。普通做dfs深度优先搜索时,我们需要通过一个起点,不停的dfs递归调用搜索到所有节点,本题也不例外,dfs是核心思路,不同的是,我们需要将dfs方法内的代码放到线程中,在做dfs递归操作时,不是递归调用本身,而是需要再开启一个新的线程而已。
需要注意的一点是,多线程不同于单线程,在我们开启一个线程后,主线程并不会等待子线程执行结束再返回结果,因此我们应该想到,在返回结果之前,必须等待所有子线程执行结束,否则我们无法得到所有结果。等待子线程结束的方法很多,在java中最为简单的便是使用Thread.join()方法,这样主线程会被挂起,当join的子线程执行完毕之后,主线程才会继续向下执行。
看下实现代码:
import java.net.URI; class Solution { public List<String> crawl(String startUrl, HtmlParser htmlParser) { // 取得startUrl的域名 String host = URI.create(startUrl).getHost(); // 新建一个线程,爬取startUrl中的所有链接 Crawler crawler = new Crawler(startUrl, host, htmlParser); // 初始化线程的返回结果 crawler.res = new ArrayList<>(); // 开启线程(相当于从起点开始dfs) crawler.start(); // 等待线程执行结束 Crawler.joinThread(crawler); // 返回线程的执行结果 return crawler.res; } } // 爬虫线程(相当于原始的dfs方法) class Crawler extends Thread { String startUrl; // 当前url String hostname; // 域名 HtmlParser htmlParser; // 爬虫接口 // 返回结果 public static volatile List<String> res = new ArrayList<>(); // 初始化线程 public Crawler(String startUrl, String hostname, HtmlParser htmlParser){ this.startUrl = startUrl; this.hostname = hostname; this.htmlParser = htmlParser; } @Override public void run(){ // 获得当前url的域名 String host=URI.create(startUrl).getHost(); // 如果当前域名不属于目标网站,或者当前域名已经爬过,略过 if(!host.equals(hostname) || res.contains(startUrl)){ return; } // 将当前url加入结果集 res.add(startUrl); // 记录当前url页面包含的链接 // 每个链接启动一个新的线程继续dfs List<Thread> threads = new ArrayList<>(); for(String s: htmlParser.getUrls(startUrl)){ Crawler crawler = new Crawler(s, hostname, htmlParser); crawler.start(); threads.add(crawler); } // 等待每个子线程执行结束后,再结束当前线程 for(Thread t: threads){ joinThread(t); } } public static void joinThread(Thread thread){ try{ thread.join(); } catch(InterruptedException e){ } } }
本题解法执行时间为2ms。
Runtime: 2 ms, faster than 97.83% of Java online submissions for Web Crawler Multithreaded.
Memory Usage: 89.5 MB, less than 100.00% of Java online submissions for Web Crawler Multithreaded.
本网站文章均为原创内容,并可随意转载,但请标明本文链接如有任何疑问可在文章底部留言。为了防止恶意评论,本博客现已开启留言审核功能。但是博主会在后台第一时间看到您的留言,并会在第一时间对您的留言进行回复!欢迎交流!
本文链接: http://leetcode.jp/leetcode-1242-web-crawler-multithreaded-解题思路分析/
View Comments (4)
Thanks for the solution. What is the time and space complexity for this?
Basically, the time and space complexity are same with common dfs, but with using the mutil-thread, it is difficult to calculate the complexity.
Okay. What about the follow-up questions? How to answer those?
Well, I try to answer the follow-up questions, but I dont know whether my answers are correct or not.
1. Using a count array to count the running task of each node. Assign the new task to the node who has the minimum task count
2. Create a blacklist to record the bad nodes. Donot Assign new tasks to their. And dont add the url to the result list when failed.
3. task count of each node is zero -> crawler is done(success)
blacklist size == all nodes count -> crawler is done(failed)