
Why does scrapy crawler only work once in flask app?

I am currently working on a Flask app. The app takes a URL from the user, crawls that website, and returns the links found on it. This is what my code looks like:

from flask import Flask, render_template, request, redirect, url_for, session, make_response
from flask_executor import Executor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from urllib.parse import urlparse
from uuid import uuid4
import smtplib, urllib3, requests, urllib.parse, datetime, sys, os

app = Flask(__name__)
executor = Executor(app)

http = urllib3.PoolManager()
process = CrawlerProcess()

list = set([])
list_validate = set([])
list_final = set([])

@app.route('/', methods=["POST", "GET"])
def index():
    if request.method == "POST":
        url_input = request.form["usr_input"]

        # Modifying URL
        if 'https://' in url_input and url_input[-1] == '/':
            url = str(url_input)
        elif 'https://' in url_input and url_input[-1] != '/':
            url = str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] != '/':
            url = 'https://' + str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] == '/':
            url = 'https://' + str(url_input)

        # Validating URL
        try:
            response = requests.get(url)
            error = http.request("GET", url)
            if error.status == 200:
                parse = urlparse(url).netloc.split('.')
                base_url = parse[-2] + '.' + parse[-1]
                start_url = [str(url)]
                allowed_url = [str(base_url)]

                # Crawling links
                class Crawler(CrawlSpider):
                    name = "crawler"
                    start_urls = start_url
                    allowed_domains = allowed_url
                    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

                    def parse_links(self, response):
                        base_url = url
                        href = response.xpath('//a/@href').getall()
                        list.add(urllib.parse.quote(response.url, safe=':/'))
                        for link in href:
                            if base_url not in link:
                                list.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
                        for link in list:
                            if base_url in link:
                                list_validate.add(link)

                def start():
                    process.crawl(Crawler)
                    process.start()

                    for link in list_validate:
                        error = http.request("GET", link)
                        if error.status == 200:
                            list_final.add(link)

                    original_stdout = sys.stdout
                    with open('templates/file.txt', 'w') as f:
                        sys.stdout = f
                        for link in list_final:
                            print(link)

                unique_id = uuid4().__str__()
                executor.submit_stored(unique_id, start)
                return redirect(url_for('crawling', id=unique_id))
    else:
        return render_template('index.html')

@app.route('/crawling-<string:id>')
def crawling(id):
    if not executor.futures.done(id):
        return render_template('start-crawl.html', refresh=True)
    else:
        executor.futures.pop(id)
        return render_template('finish-crawl.html')

In my start-crawl.html, I have this:

{% if refresh %}
    <meta http-equiv="refresh" content="5">
{% endif %}

This code takes a URL from the user, validates it, and, if it is a working URL, starts crawling and takes the user to the start-crawl.html page. That page refreshes every 5 seconds until the crawling is complete, and once it finishes, finish-crawl.html is rendered. In finish-crawl.html, the user can download a file that has the output (not included here because it isn't necessary).

Everything works as expected. My problem is that once I have crawled a website, the crawl has finished, and I am at finish-crawl.html, I can't crawl another website. If I go back to the home page and enter another URL, it validates the URL and then goes directly to finish-crawl.html. I think this happens because Scrapy can only be run once per process and the Twisted reactor isn't restartable, which is effectively what I am trying to do here. So does anyone know what I can do to fix this? Please ignore the complexity of the code and anything that isn't considered "a programming convention".
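For what it's worth, I think the suspected cause can be reproduced outside Flask with a minimal snippet like the one below (this is not my actual app, just a sketch of what I believe is happening): the second call to process.start() tries to restart the already-stopped Twisted reactor.

import scrapy
from scrapy.crawler import CrawlerProcess

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com"]

    def parse(self, response):
        pass

process = CrawlerProcess()
process.crawl(DemoSpider)
process.start()   # first run works and blocks until the crawl finishes

process.crawl(DemoSpider)
process.start()   # second run raises twisted.internet.error.ReactorNotRestartable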



1 Answer


Scrapy recommends using CrawlerRunner instead of CrawlerProcess:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Spider definition goes here
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)

def finished(e):
    print("finished")

def spider_error(e):
    print("spider error :/")

d.addCallback(finished)
d.addErrback(spider_error)
reactor.run()

More information about the reactor is available here: ReactorBasic
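As a follow-up (not part of the original answer, just a hedged sketch), one common way to combine CrawlerRunner with Flask is to let the crochet library run the Twisted reactor in a background thread that is started once; runner.crawl() can then be called on every request without restarting the reactor. All route, spider, and parameter names below are illustrative assumptions.

from flask import Flask
from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
import scrapy

setup()  # start the Twisted reactor once, in a background thread
app = Flask(__name__)
runner = CrawlerRunner()

class LinkSpider(scrapy.Spider):
    name = "links"

    def __init__(self, start_url, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_url]

    def parse(self, response):
        # Yield every link found on the page as an item
        for href in response.xpath('//a/@href').getall():
            yield {"link": response.urljoin(href)}

@wait_for(timeout=600.0)  # block until the crawl's Deferred fires (or time out)
def run_crawl(start_url):
    return runner.crawl(LinkSpider, start_url=start_url)

@app.route('/crawl')
def crawl():
    run_crawl('https://example.com')  # can be called again on later requests
    return 'done'

Because @wait_for blocks the request until the crawl finishes, a background task manager such as flask_executor (as in the question) would still be useful for large sites; the key point is that the reactor is started exactly once and reused.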

