
Any pythoners in here know how I can speed up this code?

  1. #1
    filtration African Astronaut
    Sure, it's somewhat 'fast', but I want it to be faster. I know I'm also being bottlenecked by the web requests, but maybe there's other room for improvement:



    import os
    import pandas
    import grequests
    import requests
    from selectorlib import Extractor


    # Selectors for the archive and the article itself
    articles_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/articles.yml")
    article_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/article.yml")

    # Get all articles for the day
    for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London').tolist():
        url = f"https://www.dailymail.co.uk/home/sitemaparchive/day_{date.strftime('%Y%m%d')}.html"
        content = articles_extractor.extract(requests.get(url).text)

        # So I know where I am up to
        print(f"Scraping articles from: {date.strftime('%Y%m%d')}")

        # Get the article content
        urls = []
        articles = ""
        for article in content['links']:
            urls.append(f"https://www.dailymail.co.uk/{article}")

        results = grequests.imap((grequests.get(u) for u in urls), size=15)
        for result in results:
            content = article_extractor.extract(result.text)

            # Check for content
            if content['article'] is None:
                continue

            articles += f"title: {content['title']}\narticle: {' '.join(content['article'])}\n"

        with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file:
            file.write(articles)

  2. #2
    filtration African Astronaut
    I think I can improve the speed by using join instead of +=, but what about something like this:


    articles = []

    articles.append(f"title: {content['title']}\narticle: {' '.join(content['article'])}\n")

    with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file:
        file.write(''.join(articles))

  3. #3
    Sophie Pedophile Tech Support
    Use the threading lib, and I don't mean the multiprocessing lib, the actual threading lib. IIRC, if you have the requests lib, which you do, you probably have requests_toolbelt as well.

    https://toolbelt.readthedocs.io/en/latest/threading.html
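
    If you'd rather not pull in the toolbelt, here's a minimal sketch of the same idea with just the stdlib (concurrent.futures is a thin wrapper over threading; the fetch helper and max_workers=15 are illustrative, mirroring the size=15 in the original grequests call):

    from concurrent.futures import ThreadPoolExecutor
    import requests

    def fetch(url):
        # Each call runs on its own worker thread
        return requests.get(url).text

    # 15 worker threads fetching pages concurrently
    with ThreadPoolExecutor(max_workers=15) as executor:
        pages = list(executor.map(fetch, urls))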
  4. #4
    aldra JIDF Controlled Opposition
    rewrite it in asm
  5. #5
    Grylls motherfucker [abrade this vocal tread-softly]
    Ffs aldra, I was hoping you'd already been killed by a drop bear
  6. #6
    aldra JIDF Controlled Opposition
    I'll drop bears on your chin
  7. #7
    rabbitweed African Astronaut
    It's hard to tell without being able to run it here myself, but these should be pretty universal:


    # Get all articles for the day
    for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London').tolist():

    There's no need to transform the date range to a list. A date range is already iterable, and you're just allocating more memory for no reason.
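
    i.e. just drop the .tolist() and iterate the DatetimeIndex directly:

    # A DatetimeIndex is already iterable, no list needed
    for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London'):
        ...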

    articles += f"title: {content['title']}\narticle: {' '.join(content['article'])}\n"


    Never ever use += on a Python string. Strings are immutable, so you're reallocating memory every loop. Use join, like you mentioned before. Some info here:
    https://waymoot.org/home/python_string/

    In that benchmark, += was 30 times slower than join.
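
    Applied to your inner loop, that looks roughly like this (pieces is an illustrative name):

    pieces = []
    for result in results:
        content = article_extractor.extract(result.text)
        if content['article'] is None:
            continue
        pieces.append(f"title: {content['title']}\narticle: {' '.join(content['article'])}\n")

    # One allocation at the end instead of one per iteration
    articles = ''.join(pieces)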


    with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file:
        file.write(articles)

    The easiest win is here. You're opening and closing the file every loop iteration, just to write to it. Open it before the loop starts and close it when the loop ends instead.
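
    Sketch: the with block now wraps the whole date loop, so the file is opened exactly once:

    with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file:
        for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London'):
            # ... scrape and build `articles` for this day as before ...
            file.write(articles)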

    If you use threads like sophie suggested, I'd have a queue of articles. One thread adds articles to the queue, another thread takes them from the queue and writes them to the file.
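
    A minimal sketch of that producer/consumer setup, assuming the scraping threads call article_queue.put() with each formatted article (the names and the None sentinel are illustrative):

    import queue
    import threading

    article_queue = queue.Queue()

    def writer(path):
        # Consumer: pull formatted articles off the queue and append them to the file
        with open(path, 'a+') as file:
            while True:
                item = article_queue.get()
                if item is None:  # sentinel: the producers are done
                    break
                file.write(item)

    writer_thread = threading.Thread(target=writer, args=(f"{os.getcwd()}/scrape/data/dailymail.txt",))
    writer_thread.start()

    # ... scraping threads call article_queue.put(formatted_article) ...

    article_queue.put(None)  # tell the writer to stop
    writer_thread.join()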
  8. #8
    rabbitweed African Astronaut
    And most importantly... benchmark now, then benchmark after every optimisation you add. Just because you feel something is faster doesn't mean it actually is.
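
    Even something crude beats guessing; scrape_one_day here is a hypothetical stand-in for whatever chunk you're measuring:

    import time

    start = time.perf_counter()
    scrape_one_day(date)  # hypothetical: the code under test
    print(f"took {time.perf_counter() - start:.2f}s")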
  9. #9
    BeeReBuddy African Astronaut [pimp your due marabout]

    import os fast
    import pandas fast
    import grequests fast
    import requests fast
    from selectorlib import Extractor fast


    # Selectors for the archive and the article itself
    articles_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/articles.yml")
    article_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/article.yml")

    # Get all articles for the day fast
    for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London').tolist():
        url = f"https://www.nascar.com.html"
        content = articles_extractor.extract(requests.get(url).text)

        # So I know where I am up to
        print fast(f"Scraping articles from: {date.strftime('%Y%m%d')}")

        # Get the article content fast
        urls = []
        articles = ""
        for article in content['links']:
            urls.append(f"https://www.dailymail.co.uk/{article}")

        results = grequests.imap((grequests.get(u) for u in urls), size=15)
        for result in results:
            content = article_extractor.extract(result.text)

            # Check for content fast
            if content['article'] is None:
                continue

            articles += f"title: {content['title']}\narticle: {' '.join(content['article'])}\n"

        with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file fast:
            file.write(articles)

    [b]*insert racing stripe*[/b] fast
  10. #10
    aldra JIDF Controlled Opposition
    I don't code in python because I'm not a homosexual, but you were probably right the first time; this is going to be the bottleneck:

    results = grequests.imap((grequests.get(u) for u in urls), size=15)

    It's not so much that the code is slow, but it'll queue up each page GET; you'll want to use threading to run several of them concurrently if at all possible
  11. #11
    Erekshun Naturally Camouflaged
    I would use this: beer=buzz (drink) more+more=fucked up /don't
  12. #12
    rabbitweed African Astronaut
    Originally posted by aldra I don't code in python because I'm not a homosexual, but you were probably right the first time; this is going to be the bottleneck:

    results = grequests.imap((grequests.get(u) for u in urls), size=15)

    It's not so much that the code is slow, but it'll queue up each page GET; you'll want to use threading to run several of them concurrently if at all possible

    I completely missed that bit, but you're totally right. That's the worst bottleneck in the whole loop.
  13. #13
    rabbitweed African Astronaut
    Never heard of grequests (I'm straight, just like aldra). But I checked out the GitHub repo, which says:

    Note: You should probably use requests-threads or requests-futures instead.

    So there we go.

    https://github.com/requests/requests-threads

    Each request gets its own thread.
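
    With the requests-futures variant it'd look something along these lines (max_workers=15 mirrors the original size=15; futures resolve to plain requests Response objects):

    from requests_futures.sessions import FuturesSession

    session = FuturesSession(max_workers=15)

    # Fire off all the GETs; each runs on the session's thread pool
    futures = [session.get(u) for u in urls]

    for future in futures:
        response = future.result()  # blocks until that request finishes
        content = article_extractor.extract(response.text)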
  14. #14
    Sophie Pedophile Tech Support
    Since my post already resolved the issue at hand, let me just say: you may think Python is gay, but you haven't seen anything if all you ever see are snippets or quick scripts. I shell-script for quick jobs and use Python for bigger projects; if you hate using a lot of libs for whatever reason, write pure Python. It will take you ten times as long, but you'll only need to rely on the core libs.

    Also, if Python is gay then so is Ruby; Ruby is basically Python with more chances of semantic errors. Perl is objectively more OG, but if we're going to purity-spiral anyway, you'd best start learning ASM and all its dialects.
  15. #15
    Misterigh Houston
    I've been using Python lately. It's actually really nice.