
Any pythoners in here know how I can speed up this code?

  1. #1
    filtration African Astronaut
    Sure, it's somewhat 'fast', but I want it to be faster. I know I'm also being bottlenecked by the web requests, but maybe there's other room for improvement:



    import os
    import pandas
    import grequests
    import requests
    from selectorlib import Extractor


    # Selectors for the archive and the article itself
    articles_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/articles.yml")
    article_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/article.yml")

    # Get all articles for the day
    for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London').tolist():
        url = f"https://www.dailymail.co.uk/home/sitemaparchive/day_{date.strftime('%Y%m%d')}.html"
        content = articles_extractor.extract(requests.get(url).text)

        # So I know where I am up to
        print(f"Scraping articles from: {date.strftime('%Y%m%d')}")

        # Get the article content
        urls = []
        articles = ""
        for article in content['links']:
            urls.append(f"https://www.dailymail.co.uk/{article}")

        results = grequests.imap((grequests.get(u) for u in urls), size=15)
        for result in results:
            content = article_extractor.extract(result.text)

            # Check for content
            if content['article'] is None:
                continue

            articles += f"title: {content['title']}\narticle: {' '.join(content['article'])}\n"

        with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file:
            file.write(articles)

  2. #2
    filtration African Astronaut
    I think I can improve the speed by using join instead of +=, but what about something like this:


    articles = []

    articles.append(f"title: {content['title']}\narticle: {' '.join(content['article'])}\n")

    with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file:
        file.write(''.join(articles))

  3. #3
    Sophie Pedophile Tech Support
    Use the threading lib, and I don't mean the multiprocessing lib, the actual threading lib. IIRC, if you have the requests lib, which you do, you probably have requests_toolbelt as well.

    https://toolbelt.readthedocs.io/en/latest/threading.html
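
    If you'd rather not pull in the toolbelt, here's a minimal sketch of the same idea with just the stdlib (concurrent.futures is a thin wrapper over threading; the fetch helper and max_workers=15 are illustrative, mirroring the size=15 in the original grequests call):

    from concurrent.futures import ThreadPoolExecutor
    import requests

    def fetch(url):
        # Each call runs on its own worker thread
        return requests.get(url).text

    # 15 worker threads fetching pages concurrently
    with ThreadPoolExecutor(max_workers=15) as executor:
        pages = list(executor.map(fetch, urls))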
  4. #4
    aldra JIDF Controlled Opposition
    rewrite it in asm
  5. #5
    Grylls motherfucker [abrade this vocal tread-softly]
    Ffs aldra, I was hoping you'd already been killed by a drop bear
  6. #6
    aldra JIDF Controlled Opposition
    I'll drop bears on your chin
  7. #7
    rabbitweed African Astronaut
    It's hard to tell without being able to run it here myself, but these should be pretty universal:


    # Get all articles for the day
    for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London').tolist():

    There's no need to transform the date range to a list. A date range is already iterable, and you're just allocating more memory for no reason.
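
    i.e. just drop the .tolist() and iterate the DatetimeIndex directly:

    # A DatetimeIndex is already iterable, no list needed
    for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London'):
        ...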

    articles += f"title: {content['title']}\narticle: {' '.join(content['article'])}\n"


    Never ever use += on a Python string. Strings are immutable, so you're reallocating memory every loop. Use join, like you mentioned before. Some info here:
    https://waymoot.org/home/python_string/

    In that benchmark, += was 30 times slower than join.
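
    Applied to your inner loop, that looks roughly like this (pieces is an illustrative name):

    pieces = []
    for result in results:
        content = article_extractor.extract(result.text)
        if content['article'] is None:
            continue
        pieces.append(f"title: {content['title']}\narticle: {' '.join(content['article'])}\n")

    # One allocation at the end instead of one per iteration
    articles = ''.join(pieces)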


    with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file:
        file.write(articles)

    The easiest win is here. You're opening and closing the file every loop iteration, just to write to it. Open it before the loop starts and close it when the loop ends instead.
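
    Sketch: the with block now wraps the whole date loop, so the file is opened exactly once:

    with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file:
        for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London'):
            # ... scrape and build `articles` for this day as before ...
            file.write(articles)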

    If you use threads like sophie suggested, I'd have a queue of articles. One thread adds articles to the queue, another thread takes them from the queue and writes them to the file.
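
    A minimal sketch of that producer/consumer setup, assuming the scraping threads call article_queue.put() with each formatted article (the names and the None sentinel are illustrative):

    import queue
    import threading

    article_queue = queue.Queue()

    def writer(path):
        # Consumer: pull formatted articles off the queue and append them to the file
        with open(path, 'a+') as file:
            while True:
                item = article_queue.get()
                if item is None:  # sentinel: the producers are done
                    break
                file.write(item)

    writer_thread = threading.Thread(target=writer, args=(f"{os.getcwd()}/scrape/data/dailymail.txt",))
    writer_thread.start()

    # ... scraping threads call article_queue.put(formatted_article) ...

    article_queue.put(None)  # tell the writer to stop
    writer_thread.join()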
  8. #8
    rabbitweed African Astronaut
    And most importantly... benchmark now, then benchmark after every optimisation you add. Just because you feel something is faster doesn't mean it actually is.
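
    Even something crude beats guessing; scrape_one_day here is a hypothetical stand-in for whatever chunk you're measuring:

    import time

    start = time.perf_counter()
    scrape_one_day(date)  # hypothetical: the code under test
    print(f"took {time.perf_counter() - start:.2f}s")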
  9. #9
    BeeReBuddy African Astronaut [pimp your due marabout]

    import os fast
    import pandas fast
    import grequests fast
    import requests fast
    from selectorlib import Extractor fast


    # Selectors for the archive and the article itself
    articles_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/articles.yml")
    article_extractor = Extractor.from_yaml_file(f"{os.getcwd()}/scrape/selectors/dailymail/article.yml")

    # Get all articles for the day fast
    for date in pandas.date_range('2011-01-01', '2020-09-01', tz='Europe/London').tolist():
        url = f"https://www.nascar.com.html"
        content = articles_extractor.extract(requests.get(url).text)

        # So I know where I am up to
        print fast(f"Scraping articles from: {date.strftime('%Y%m%d')}")

        # Get the article content fast
        urls = []
        articles = ""
        for article in content['links']:
            urls.append(f"https://www.dailymail.co.uk/{article}")

        results = grequests.imap((grequests.get(u) for u in urls), size=15)
        for result in results:
            content = article_extractor.extract(result.text)

            # Check for content fast
            if content['article'] is None:
                continue

            articles += f"title: {content['title']}\narticle: {' '.join(content['article'])}\n"

        with open(f"{os.getcwd()}/scrape/data/dailymail.txt", 'a+') as file fast:
            file.write(articles)

    [b]*insert racing stripe*[/b] fast
  10. #10
    aldra JIDF Controlled Opposition
    I don't code in python because I'm not a homosexual, but you were probably right the first time; this is going to be the bottleneck:

    results = grequests.imap((grequests.get(u) for u in urls), size=15)

    It's not so much that the code is slow, but it'll queue up each page GET; you'll want to use threading to run several of them concurrently if at all possible
  11. #11
    Erekshun Naturally Camouflaged
    I would use this: beer=buzz (drink) more+more=fucked up /don't
  12. #12
    rabbitweed African Astronaut
    Originally posted by aldra I don't code in python because I'm not a homosexual, but you were probably right the first time; this is going to be the bottleneck:

    results = grequests.imap((grequests.get(u) for u in urls), size=15)

    It's not so much that the code is slow, but it'll queue up each page GET; you'll want to use threading to run several of them concurrently if at all possible

    I completely missed that bit, but you're totally right. That's the worst bottleneck in the whole loop.
  13. #13
    rabbitweed African Astronaut
    Never heard of grequests (I'm straight, just like aldra). But I checked out the GitHub repo, which says:

    Note: You should probably use requests-threads or requests-futures instead.

    So there we go.

    https://github.com/requests/requests-threads

    Each request gets its own thread.
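
    With the requests-futures variant it'd look something along these lines (max_workers=15 mirrors the original size=15; futures resolve to plain requests Response objects):

    from requests_futures.sessions import FuturesSession

    session = FuturesSession(max_workers=15)

    # Fire off all the GETs; each runs on the session's thread pool
    futures = [session.get(u) for u in urls]

    for future in futures:
        response = future.result()  # blocks until that request finishes
        content = article_extractor.extract(response.text)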
  14. #14
    Sophie Pedophile Tech Support
    Since my post already resolved the issue at hand, let me just say: you may think Python is gay, but you haven't seen anything if all you ever see are snippets or quick scripts. I shell-script for quick jobs and use Python for bigger projects; if you hate using a lot of libs for whatever reason, write pure Python. It will take you ten times as long, but you'll only need to rely on the core libs.

    Also, if Python is gay then so is Ruby; Ruby is basically Python with more chances of semantic errors. Perl is objectively more OG, but if we're going to purity-spiral anyway, you'd best start learning ASM and all its dialects.
  15. #15
    Misterigh Houston
    I've been using Python lately. It's actually really nice.