I am currently working on a Flask app. The app takes a URL from the user, crawls that website, and returns the links found on it. This is what my code looks like:
from flask import Flask, render_template, request, redirect, url_for, session, make_response
from flask_executor import Executor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from urllib.parse import urlparse
from uuid import uuid4
import smtplib, urllib3, requests, urllib.parse, datetime, sys, os
app = Flask(__name__)
executor = Executor(app)
http = urllib3.PoolManager()
process = CrawlerProcess()
list = set([])
list_validate = set([])
list_final = set([])
@app.route('/', methods=["POST", "GET"])
def index():
    if request.method == "POST":
        url_input = request.form["usr_input"]

        # Modifying URL
        if 'https://' in url_input and url_input[-1] == '/':
            url = str(url_input)
        elif 'https://' in url_input and url_input[-1] != '/':
            url = str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] != '/':
            url = 'https://' + str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] == '/':
            url = 'https://' + str(url_input)

        # Validating URL
        try:
            response = requests.get(url)
            error = http.request("GET", url)
            if error.status == 200:
                parse = urlparse(url).netloc.split('.')
                base_url = parse[-2] + '.' + parse[-1]
                start_url = [str(url)]
                allowed_url = [str(base_url)]

                # Crawling links
                class Crawler(CrawlSpider):
                    name = "crawler"
                    start_urls = start_url
                    allowed_domains = allowed_url
                    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

                    def parse_links(self, response):
                        base_url = url
                        href = response.xpath('//a/@href').getall()
                        list.add(urllib.parse.quote(response.url, safe=':/'))
                        for link in href:
                            if base_url not in link:
                                list.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
                        for link in list:
                            if base_url in link:
                                list_validate.add(link)

                def start():
                    process.crawl(Crawler)
                    process.start()
                    for link in list_validate:
                        error = http.request("GET", link)
                        if error.status == 200:
                            list_final.add(link)
                    original_stdout = sys.stdout
                    with open('templates/file.txt', 'w') as f:
                        sys.stdout = f
                        for link in list_final:
                            print(link)
                    sys.stdout = original_stdout

                unique_id = uuid4().__str__()
                executor.submit_stored(unique_id, start)
                return redirect(url_for('crawling', id=unique_id))
        except (requests.exceptions.RequestException, urllib3.exceptions.HTTPError):
            pass  # invalid or unreachable URL; fall through and re-render the form

    return render_template('index.html')

@app.route('/crawling-<string:id>')
def crawling(id):
    if not executor.futures.done(id):
        return render_template('start-crawl.html', refresh=True)
    else:
        executor.futures.pop(id)
        return render_template('finish-crawl.html')
In my start-crawl.html, I have this:
{% if refresh %}
<meta http-equiv="refresh" content="5">
{% endif %}
This code takes a URL from the user, validates it, and, if it is a working URL, starts crawling and takes the user to the start-crawl.html page. That page refreshes every 5 seconds until the crawling is complete; once the crawl finishes, it renders finish-crawl.html. In finish-crawl.html, the user can download a file that has the output (I didn't include it because it isn't necessary).
Everything works as expected, except for one problem: once I crawl a website, the crawl finishes, and I am at finish-crawl.html, I can't crawl another website. If I go back to the home page and enter another URL, it validates the URL and then goes directly to finish-crawl.html. I think this happens because Scrapy can only be run once per process and the Twisted reactor isn't restartable, which is effectively what I am trying to do here. So does anyone know what I can do to fix this? Please ignore the complexity of the code and anything that isn't considered "a programming convention".
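The closest thing I have to a plan is to push each crawl into its own process, since a child process would get a fresh Twisted reactor every time. Below is only a sketch of that idea, not something I have wired into the app yet: `_crawl` and `crawl_once` are names I made up, and `_crawl` just echoes the URL back here to keep the sketch self-contained; in the real app it would build the `Crawler` spider for that URL, run `CrawlerProcess`, and put `list_final` on the queue. Is something like this the right direction?

```python
import multiprocessing


def _crawl(url, queue):
    # Placeholder for the real work: in the app this would construct the
    # Crawler spider for `url`, run CrawlerProcess().start(), and put the
    # collected links on the queue. Here it just echoes the URL.
    queue.put([url])


def crawl_once(url):
    # Each call spawns a fresh process, so Scrapy would get a brand-new
    # reactor instead of trying to restart the one in the Flask process.
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=_crawl, args=(url, queue))
    worker.start()
    links = queue.get()  # block until the child reports its links
    worker.join()
    return links
```

The executor job submitted in `index()` would then call `crawl_once(url)` instead of touching the module-level `process` object.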