Any pythoners in here know how I can speed up this code?

  1. #1
    filtration African Astronaut
    This post has been edited by a bot I made to preserve my privacy.
  2. #2
    filtration African Astronaut
    This post has been edited by a bot I made to preserve my privacy.
  3. #3
    Sophie Pedophile Tech Support
    Use the threading lib, and I don't mean the multiprocessing lib, I mean the actual threading lib. IIRC, if you have the requests lib, which you do, you should probably have requests_toolbelt as well.

    https://toolbelt.readthedocs.io/en/latest/threading.html
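A minimal sketch of the threading idea, using only the standard `threading` module rather than requests_toolbelt. The names `fetch_all`, `num_threads`, and the stand-in `fake_fetch` are made up for illustration; in the real scraper the fetch function would presumably be something like `lambda u: requests.get(u).text`.

```python
import threading


def fetch_all(urls, fetch, num_threads=8):
    """Call fetch(url) for every URL using a fixed number of worker threads."""
    results = [None] * len(urls)

    def worker(start):
        # Worker n handles URLs n, n + num_threads, n + 2 * num_threads, ...
        # so no two threads ever write to the same slot and no lock is needed.
        for i in range(start, len(urls), num_threads):
            results[i] = fetch(urls[i])

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before returning
    return results


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP call, so the sketch runs offline.
    return f"<html>{url}</html>"


pages = fetch_all(["a", "b", "c"], fake_fetch, num_threads=2)
```

Results come back in the same order as the input URLs. If `fetch` raises, that worker stops and its remaining slots stay `None`, so a real version would want to catch and record errors inside `worker`.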
  4. #4
    aldra JIDF Controlled Opposition
    rewrite it in asm
  5. #5
    Grylls Cum Looking Faggot [abrade this vocal tread-softly]
    Ffs aldra I was hoping you already got killed by a drop bear
  6. #6
    aldra JIDF Controlled Opposition
    I'll drop bears on your chin
  7. #7
    rabbitweed African Astronaut
    It's hard to tell without being able to run it here myself, but these should be pretty universal


    # Get all articles for the day
    for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London').tolist():

    There's no need to transform the date range to a list. A date range is already iterable, and you're just allocating more memory for no reason.

    articles += f"title: {content['title']}\narticle: {' '.join(content['article'])}\n"


    Never ever use += on a Python string inside a loop. Strings are immutable, so every += can end up allocating a new string and copying the old one into it. Use join, like you mentioned before. Some info here:
    https://waymoot.org/home/python_string/

    It was 30 times slower to use += than to use join in this benchmark.
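A rough sketch of the join approach: collect the pieces in a list and join once at the end. `build_articles` and the `sample` data are hypothetical; the dictionary keys follow the `content['title']` and `content['article']` fields from the original loop.

```python
def build_articles(contents):
    """Format every scraped article, then join the pieces in one go."""
    parts = []
    for content in contents:
        # Skip pages where the extractor found no article body.
        if content["article"] is None:
            continue
        parts.append(
            f"title: {content['title']}\narticle: {' '.join(content['article'])}\n"
        )
    # One join at the end instead of growing a string on every iteration.
    return "".join(parts)


# Hypothetical sample data shaped like the extractor output.
sample = [
    {"title": "A", "article": ["one", "two"]},
    {"title": "B", "article": None},
]
text = build_articles(sample)
```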


    with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file:
        file.write(articles)

    The easiest win is here. You're opening and closing a file every loop, just to write to it. Open it before the loop starts, and close it when the loop ends instead.
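Something along these lines, as a sketch only: the file is opened once before the loop and the `with` block closes it when the loop is done. `write_batches` and the temporary path are assumptions made so the example runs on its own; the real script would keep its own `dailymail.txt` path.

```python
import os
import tempfile


def write_batches(path, batches):
    """Append every batch of article text to one file, opening it only once."""
    # Opened before the loop; the with-block closes it after the last write.
    with open(path, "a", encoding="utf-8") as file:
        for articles in batches:
            file.write(articles)


# Hypothetical output location so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "dailymail.txt")
write_batches(path, ["day one\n", "day two\n"])
```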

    If you use threads like sophie suggested, I'd have a queue of articles. One thread adds articles to the queue, another thread takes them from the queue and writes them to the file.
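The queue idea could look roughly like this. It is a sketch using the standard `queue` and `threading` modules; `writer`, `scrape_and_write`, and the `DONE` sentinel are invented names, and an in-memory buffer stands in for the output file.

```python
import io
import queue
import threading

DONE = object()  # sentinel telling the writer thread to stop


def writer(q, out):
    """Take articles off the queue and write them until DONE arrives."""
    while True:
        item = q.get()
        if item is DONE:
            break
        out.write(item)


def scrape_and_write(articles, out):
    q = queue.Queue(maxsize=100)  # bounded, so the producer cannot run far ahead
    t = threading.Thread(target=writer, args=(q, out))
    t.start()
    for article in articles:  # in the real scraper this loop would do the fetching
        q.put(article)
    q.put(DONE)  # tell the writer there is nothing more to come
    t.join()


buffer = io.StringIO()  # stands in for the opened output file
scrape_and_write(["title: A\n", "title: B\n"], buffer)
```

With a single writer thread the articles land in the file in the order they were queued, and only that one thread ever touches the file.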
  8. #8
    rabbitweed African Astronaut
    And most importantly... benchmark now, then benchmark after every optimisation you add. Just because you feel something is faster doesn't mean it actually is.
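For the benchmarking step, a small sketch with the standard `timeit` module. The two functions being compared and the `benchmark` helper are made up for illustration; the numbers will differ between machines and Python versions, which is the whole reason to measure.

```python
import timeit


def concat_plus(n):
    """Build the text with += in a loop."""
    text = ""
    for i in range(n):
        text += f"line {i}\n"
    return text


def concat_join(n):
    """Build the same text with a single join."""
    return "".join(f"line {i}\n" for i in range(n))


def benchmark(func, n=10_000, repeat=5):
    """Return the best time in seconds for 10 calls of func(n)."""
    # Taking the minimum of several repeats reduces noise from other processes.
    return min(timeit.repeat(lambda: func(n), number=10, repeat=repeat))


if __name__ == "__main__":
    for func in (concat_plus, concat_join):
        print(f"{func.__name__}: {benchmark(func):.4f}s")
```

Run it before and after each change, and only keep the change if the number actually goes down.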
  9. #9
    BeeReBuddy motherfucker [pimp your due marabout]

    import os fast
    import pandas fast
    import grequests fast
    import requests fast
    from selectorlib import Extractor fast


    # Selectors for the archive and the article itself
    articles_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/articles.yml")
    article_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/article.yml")

    # Get all articles for the day fast
    for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London').tolist():
        url = f"https://www.nascar.com.html"
        content = articles_extractor.extract(requests.get(url).text)

        # So I know where I am up to
        print fast(f"Scraping articles from: {date.strftime('%Y%m%d')}")

        # Get the article content fast
        urls = []
        articles = ""
        for article in content['links']:
            urls.append(f"https://www.dailymail.co.uk/{article}")

        results = grequests.imap((grequests.get(u) for u in urls), size=15)
        for result in results:
            content = article_extractor.extract(result.text)

            # Check for content fast
            if content['article'] is None:
                continue

            articles += f"title: {content['title']}\narticle: {' '.join(content['article'])}\n"

        with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file fast:
            file.write(articles)

    *insert racing stripe* fast
  10. #10
    aldra JIDF Controlled Opposition
    I don't code in python because I'm not a homosexual but you were probably right the first time, this is going to be the bottleneck:

    results = grequests.imap((grequests.get(u) for u in urls), size=15)

    not so much slow code but it'll queue up each page get, you'll want to use threading to run several of them concurrently if at all possible
  11. #11
    Erekshun Naturally Camouflaged
    I would use this: beer=buzz (drink) more+more=fucked up /don't
  12. #12
    rabbitweed African Astronaut
    Originally posted by aldra I don't code in python because I'm not a homosexual but you were probably right the first time, this is going to be the bottleneck:



    not so much slow code but it'll queue up each page get, you'll want to use threading to run several of them concurrently if at all possible

    I completely missed that bit, but you're totally right. That's the worst bottleneck in the whole loop.
  13. #13
    rabbitweed African Astronaut
    Never heard of grequests (I'm straight just like aldra). But I checked out the github, which says:

    Note: You should probably use requests-threads or requests-futures instead.

    So there we go.

    https://github.com/requests/requests-threads

    each request is a new thread.
  14. #14
    Sophie Pedophile Tech Support
    Since my post already resolved the issue at hand, let me just say that you may think Python is gay, but you haven't seen anything if all you ever see are snippets or quick scripts. I shell script for quick scripts and use Python for bigger projects. If you hate using a lot of libs for whatever reason, then write pure Python. It will take you ten times as long, but you'll only need to rely on the core libs.

    Also, if Python is gay so is Ruby; Ruby is basically Python with more chances of semantic errors. Perl is objectively more OG, but if we are going to purity spiral anyway, you best start learning ASM and all its dialects.
  15. #15
    Misterigh Houston
    I've been using Python lately. It's actually really nice.
  16. #16
    Sophie Pedophile Tech Support
    Originally posted by Misterigh I've been using Python lately. It's actually really nice.

    Hell yeah it is. Coming from a C background gives you a good advantage when leveraging the ctypes lib to do some relatively low level stuff. Personally I think Python is one of the most versatile langs out there. That said, I have been learning the low level OG langs as well as NodeJS, so I think I have a nice spread: Node and related for prototyping and web apps, Python for bigger projects, and C and Asm for exploit dev. I want to be proficient in that sort of thing. C++ as well. As I've mentioned, one of my long term goals is to be able to code in most modern programming languages. It's challenging but it will be worth it in the end.