Any pythoners in here know how I can speed up this code?
-
2020-09-14 at 3:23 PM UTCThis post has been edited by a bot I made to preserve my privacy.
-
2020-09-14 at 3:26 PM UTCThis post has been edited by a bot I made to preserve my privacy.
-
2020-09-14 at 5:51 PM UTCUse the threading lib, and I don't mean the multiprocessing lib, the actual threading lib. IIRC if you have the requests lib, which you do, you should also have requests_toolbelt.
https://toolbelt.readthedocs.io/en/latest/threading.html -
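To show the idea without depending on the toolbelt, here's a minimal sketch using the stdlib's ThreadPoolExecutor instead (same pattern); `fetch` is a stand-in for `requests.get(url).text` so the example runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # stand-in for requests.get(url).text so this runs offline
    return f"<html>{url}</html>"

urls = [f"https://example.com/page{i}" for i in range(10)]

# run up to 15 downloads concurrently instead of one after another
with ThreadPoolExecutor(max_workers=15) as pool:
    pages = list(pool.map(fetch, urls))
```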
2020-09-15 at 1:11 AM UTCrewrite it in asm
-
2020-09-15 at 1:15 AM UTCFfs aldra I was hoping you already got killed by a drop bear
-
2020-09-15 at 1:42 AM UTCI'll drop bears on your chin
-
2020-09-15 at 2:22 AM UTCIt's hard to tell without being able to run it here myself, but these should be pretty universal
# Get all articles for the day
for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London').tolist():
There's no need to transform the date range to a list. A date range is already iterable, and you're just allocating more memory for no reason.
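Same point with a stdlib stand-in (a hypothetical `date_range` generator, since pandas isn't needed to illustrate it):

```python
from datetime import date, timedelta

def date_range(start, end):
    # lazily yield each day, like iterating a DatetimeIndex directly
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

# no .tolist() needed -- iterate the range itself
days = [d.strftime('%Y%m%d') for d in date_range(date(2011, 1, 1), date(2011, 1, 5))]
```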
articles += f"title: {content['title']}\narticle: {' '.join(content['article'])}\n"
Never ever use += on a Python string in a loop. Strings are immutable, so you're reallocating memory on every iteration. Use join, like you mentioned before. Some info here:
https://waymoot.org/home/python_string/
Using += was about 30 times slower than using join in that benchmark.
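Applied to your loop, the pattern looks roughly like this (inline sample data so it runs standalone):

```python
# collect the pieces in a list, then join once at the end
sample = [
    {"title": "A", "article": ["first", "sentence"]},
    {"title": "B", "article": ["second"]},
]

parts = []
for content in sample:
    parts.append(f"title: {content['title']}\narticle: {' '.join(content['article'])}\n")
articles = "".join(parts)
```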
with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file:
file.write(articles)
The easiest win is here. You're opening and closing a file every loop, just to write to it. Open it before the loop starts, and close it when the loop ends instead.
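Sketch of that change (the tempfile path and dummy batches are stand-ins so it runs anywhere):

```python
import os
import tempfile

batches = ["title: A\narticle: first\n", "title: B\narticle: second\n"]

# open once before the loop, write inside it, close once after
path = os.path.join(tempfile.mkdtemp(), "dailymail.txt")
with open(path, "a") as f:
    for articles in batches:
        f.write(articles)

written = open(path).read()
```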
If you use threads like sophie suggested, I'd have a queue of articles. One thread adds articles to the queue, another thread takes them from the queue and writes them to the file. -
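A minimal sketch of that producer/consumer setup with the stdlib queue module (the `written` list stands in for the open file handle):

```python
import queue
import threading

q = queue.Queue()
SENTINEL = None  # tells the writer thread to stop
written = []     # stand-in for the open output file

def writer():
    # consumer: pull articles off the queue until the sentinel arrives
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        written.append(item)  # would be file.write(item) in the real thing

t = threading.Thread(target=writer)
t.start()
for article in ["a1", "a2", "a3"]:  # producer side: the scraping loop
    q.put(article)
q.put(SENTINEL)
t.join()
```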
2020-09-15 at 2:23 AM UTCAnd most importantly... benchmark now, then benchmark after every optimisation you add. Just because you feel something is faster doesn't mean it actually is.
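For example, timeit makes a quick before/after comparison easy (timings vary by machine and interpreter, so no expected numbers here):

```python
import timeit

pieces = ["article text "] * 1000

def concat():
    # the += pattern: may reallocate the string on every iteration
    s = ""
    for p in pieces:
        s += p
    return s

def joined():
    # the join pattern: one allocation at the end
    return "".join(pieces)

assert concat() == joined()  # same result, different speed
t_concat = timeit.timeit(concat, number=100)
t_join = timeit.timeit(joined, number=100)
```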
-
2020-09-15 at 2:36 AM UTC
import os
import pandas
import grequests
import requests
from selectorlib import Extractor

# Selectors for the archive and the article itself
articles_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/articles.yml")
article_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/article.yml")

# Get all articles for the day
for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London').tolist():
    url = f"https://www.nascar.com.html"
    content = articles_extractor.extract(requests.get(url).text)
    # So I know where I am up to
    print(f"Scraping articles from: {date.strftime('%Y%m%d')}")
    # Get the article content
    urls = []
    articles = ""
    for article in content['links']:
        urls.append(f"https://www.dailymail.co.uk/{article}")
    results = grequests.imap((grequests.get(u) for u in urls), size=15)
    for result in results:
        content = article_extractor.extract(result.text)
        # Check for content
        if content['article'] is None:
            continue
        articles += f"title: {content['title']}\narticle: {' '.join(content['article'])}\n"
    with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file:
        file.write(articles)
[b]*insert racing stripe*[/b] -
2020-09-15 at 2:39 AM UTCI don't code in python because I'm not a homosexual but you were probably right the first time, this is going to be the bottleneck:
results = grequests.imap((grequests.get(u) for u in urls), size=15)
not so much slow code but it'll queue up each page get, you'll want to use threading to run several of them concurrently if at all possible -
2020-09-15 at 2:43 AM UTCI would use this: beer=buzz (drink) more+more=fucked up /don't
-
2020-09-15 at 3:05 AM UTC
Originally posted by aldra I don't code in python because I'm not a homosexual but you were probably right the first time, this is going to be the bottleneck:
not so much slow code but it'll queue up each page get, you'll want to use threading to run several of them concurrently if at all possible
I completely missed that bit, but you're totally right. That's the worst bottleneck in the whole loop. -
2020-09-15 at 3:08 AM UTCNever heard of grequests (I'm straight just like aldra). But I checked out the github, which says:
Note: You should probably use requests-threads or requests-futures instead.
So there we go.
https://github.com/requests/requests-threads
each request is a new thread. -
2020-09-20 at 1:51 PM UTCSince my post already resolved the issue at hand, let me just say: you may think Python is gay, but you haven't seen anything if all you ever see are snippets or quick scripts. I shell-script for quick scripts and use Python for bigger projects; if you hate using a lot of libs for whatever reason, then write pure Python. It will take you ten times as long, but you'll only need to rely on the core libs.
Also, if Python is gay then so is Ruby; Ruby is basically Python with more chances of semantic errors. Perl is objectively more OG, but if we're going to purity-spiral anyway, you'd best start learning ASM and all its dialects. -
2020-09-25 at 9:47 PM UTCI've been using Python lately. It's actually really nice.
-
2020-10-05 at 3:48 PM UTC
Originally posted by Misterigh I've been using Python lately. It's actually really nice.
Hell yeah it is. Coming from a C background gives you a good advantage when leveraging the ctypes lib to do some relatively low-level stuff. Personally I think Python is one of the most versatile langs out there. That said, I have been learning the low-level OG langs as well as NodeJS, so I think I have a nice spread: Node and related for prototyping and web apps, Python for bigger projects, and C and asm for exploit dev. I want to be proficient in that sort of thing. C++ as well. As I've mentioned, one of my long-term goals is to be able to code in most modern programming languages. It's challenging, but it will be worth it in the end.