How do I get rid of strings that are (almost) duplicates but not entirely from a list?
-
2017-01-13 at 2:40 AM UTC
First thought would be regex, but I don't really know how, since the strings are so similar. So I would have a list with a bunch of these:
http://url.com/page?id=1
http://url.com/page?id=5
http://url.com/page?id=17
Do I sort on the ID? What if instead of `id` it has `cid`? So, to understand what I am doing here, please see my function below.
# Assumed imports (left out of the original post); argparse args and dork_list
# are defined elsewhere in the script.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def search():
    driver = webdriver.Firefox()
    link_list = []  # moved outside the loops so it isn't reset on every link
    for _ in range(args.pages):
        driver.get("http://google.com")
        assert "Google" in driver.title
        for item in dork_list:
            elem = driver.find_element_by_name("q")
            elem.clear()
            elem.send_keys(item)
            elem.send_keys(Keys.RETURN)
            assert "No results found." not in driver.page_source
            links = driver.find_elements_by_xpath("//a[@href]")
            for link in links:
                link_list.append(link.get_attribute("href"))
    driver.close()
    return link_list
This is untested code so please no bully; also, if there is something wrong with that function, feel free to point it out and give some constructive criticism. For the import statements: I used Selenium, and `args.pages` and such are defined with the argparse library as well.
BUT ANYWAY! I am returning a list of links but I have not sorted the duplicates out yet. I want to have a separate function to do that. Why? I have no idea; I just felt like I probably shouldn't put over 9000 loops inside of each other.
As you can see, once I have the list I need to do some additional processing with it. To be honest, since this code is untested I don't even know if the results will come out with HTML tags. But I figured I would hammer out the program as far as I am able first, then run it and debug from the first problem down.
So any thoughts?
Post last edited by Sophie at 2017-01-13T02:43:03.458331+00:00 -
2017-01-13 at 2:41 AM UTC
I don't speak Chinese.
-
2017-01-13 at 2:48 AM UTC
How do you want to strip them?
Do you want to only keep unique domains or unique pages?
http://url.com/nigger?id=1
http://url.com/nigger?id=5
http://url.com/dicks?id=17
ie. unique domains keeps only
http://url.com/nigger?id=1
unique pages keeps
http://url.com/nigger?id=1
http://url.com/dicks?id=17
At any rate I'd separate the strings into domain, page and arguments. You could use regex or a simple split: for the domain, take 0 to the first '/'; for the page, take the first '/' to the '?' if it exists; the parameter would be '?' to the end (assuming there's a '?').
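Something like this rough, untested Python sketch (plain string methods; the names are made up for illustration):
# rough, untested sketch: split a URL into domain, page and arguments
def split_url(url):
    rest = url.split('://', 1)[-1]          # drop the protocol if present
    domain, _, tail = rest.partition('/')   # 0 to the first '/'
    page, _, args = tail.partition('?')     # first '/' to the '?', then '?' to end
    return domain, page, args

# split_url("http://url.com/page?id=17") -> ('url.com', 'page', 'id=17')
-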
2017-01-13 at 2:52 AM UTC
I just realized `elem.send_keys(items)` is going to break because elem.send_keys doesn't have an attribute inurl:some_dork, fuck me.
-
2017-01-13 at 2:55 AM UTC
Originally posted by aldra How do you want to strip them?
Do you want to only keep unique domains or unique pages?
http://url.com/nigger?id=1
http://url.com/nigger?id=5
http://url.com/dicks?id=17
ie. unique domains keeps only
http://url.com/nigger?id=1
unique pages keeps
http://url.com/nigger?id=1
http://url.com/dicks?id=17
At any rate I'd separate the strings into domain, page and arguments. You could use regex or a simple split: for the domain, take 0 to the first '/'; for the page, take the first '/' to the '?' if it exists; the parameter would be '?' to the end (assuming there's a '?').
Unique pages will do nicely. So you mean I should split all the links up to the '/' and then what? How does my program know I have a duplicate? -
2017-01-13 at 2:56 AM UTC
Originally posted by Sophie I just realized `elem.send_keys(items)` is going to break because elem.send_keys doesn't have an attribute inurl:some_dork, fuck me.
Or is it? I don't fucking know. -
2017-01-13 at 3:05 AM UTC
I don't know much Python so I can't write you proper code, but basically I'd parse URLs into those parts.
First, a structure to hold that data:
structure splitUrl (protocol, domain, page, parameter)
Then split each URL into protocol (http:),
domain (url.com),
page (dicks),
and, if it exists, parameter (id=17).
That way you can quickly search through each of the properties to see if there's already an item with a matching page or parameter, for instance.
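In Python, a rough, untested stab at that structure (a namedtuple is one option; the names are made up for illustration) plus a duplicate check on the page:
# rough, untested sketch of the splitUrl structure plus a duplicate check
from collections import namedtuple

SplitUrl = namedtuple('SplitUrl', ['protocol', 'domain', 'page', 'parameter'])

def is_new_page(candidate, seen):
    # keep it only if nothing stored so far has the same domain and page
    return not any(s.domain == candidate.domain and s.page == candidate.page
                   for s in seen)

seen = []
candidate = SplitUrl('http:', 'url.com', 'dicks', 'id=17')
if is_new_page(candidate, seen):
    seen.append(candidate)
-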
2017-01-13 at 3:14 AM UTC
Originally posted by aldra I don't know much Python so I can't write you proper code, but basically I'd parse URLs into those parts.
First, a structure to hold that data:
structure splitUrl (protocol, domain, page, parameter)
Then split each URL into protocol (http:),
domain (url.com),
page (dicks),
and, if it exists, parameter (id=17).
That way you can quickly search through each of the properties to see if there's already an item with a matching page or parameter, for instance.
What manner of structure holds multiple pieces of data that all have a relation to each other? I am not building a *SQL database. -
2017-01-13 at 3:16 AM UTC
http://stackoverflow.com/questions/35988/c-like-structures-in-python
Though if you want to keep it simple you could just create 4 arrays: protocol, domain, page, args.
That way you'd get a full URL back by concatenating protocol[x], domain[x], page[x] and args[x].
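A rough, untested Python sketch of the four-arrays idea (all names made up for illustration):
# rough, untested sketch: four parallel lists, index x describes one URL
protocols, domains, pages, args = [], [], [], []

def add_url(protocol, domain, page, arg):
    protocols.append(protocol)
    domains.append(domain)
    pages.append(page)
    args.append(arg)

def full_url(x):
    # rebuild the full URL from the parts at index x
    return protocols[x] + '//' + domains[x] + '/' + pages[x] + ('?' + args[x] if args[x] else '')

add_url('http:', 'url.com', 'page', 'id=17')
# full_url(0) -> 'http://url.com/page?id=17'
-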
2017-01-13 at 3:20 AM UTC
String matching can get pretty complex, but basically what aldra said. Once you have it split up you can look for query parameters and use regex to match cid vs id, and really anything. I'm confused about what you are really trying to accomplish and why you care about partial matches.
If you want to save space by having:
site.com/foo?id=
1,2,3,4,etc.
The complexity involved probably isn't worth whatever space you'd save. Ultimately I'm guessing you need to assemble it into the original URL at a later point?
Match-query regex, untested and not for any particular language, so it probably doesn't work:
/((\w)=(\d))&/
Find a way to dump the () groupings into variables and find all matches in a string; it's probably an arg for whatever regex function.
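In Python that idea would look roughly like the untested sketch below; the pattern is loosened to \w+ and \d+ and drops the trailing '&' so it also catches the last parameter:
# rough, untested sketch: pull the () groupings out with re.findall
import re

pattern = re.compile(r'(\w+)=(\d+)')
url = 'http://url.com/page?cid=17&id=5'
print(pattern.findall(url))   # [('cid', '17'), ('id', '5')]
-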
2017-01-13 at 3:20 AM UTC
Yeah, you need to figure out what "almost duplicate" means. Whether two URLs point to the same page or not isn't something you can answer in general, partly because it's not even clear what "same page" means. Let's say you're talking about old vB and you're looking at a thread: if you hit the same path with params "p=1&pp=50" you'll get the first 50 posts of the thread, and if you use "p=1&pp=25" you'll get the first 25 posts. Are these the same pages? You could also use a GET param to set the theme. Is the same content in two different themes the same page? In practice you can't depend on querystrings being or not being part of a page's identity; generally it's better to err on the side of caution and assume they are (potentially producing duplicates).
Now there's the point that the order of elements in the querystring shouldn't alter the resource being requested at all (although there's no reason a service couldn't violate this property of the spec, which could wreak havoc on caches in a lot of places), and in that case you could do some parsing to collapse re-ordered query strings into each other as duplicates. But I would expect Google is already doing this, so it's probably not a problem you need to solve for.
Can you give some examples of the kind of duplicates you're trying to remove? Those three URLs listed in your OP, do you consider them duplicates? -
2017-01-13 at 3:26 AM UTC
Originally posted by aldra http://stackoverflow.com/questions/35988/c-like-structures-in-python
Though if you want to keep it simple you could just create 4 arrays: protocol, domain, page, args.
That way you'd get a full URL back by concatenating protocol[x], domain[x], page[x] and args[x].
I was thinking arrays, yeah, but then I have four arrays full of protocols and domains and such. Do I do:
# Pseudo code
if domain not in string:
    ''.join([protocol, domain, page, args])
-
2017-01-13 at 3:30 AM UTC
There's kind of a lot involved in parsing URIs; the stdlib has something for it: https://docs.python.org/2/library/urlparse.html
I'd suggest using that over regex or ad-hoc parsing.
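A minimal, untested sketch of what "unique pages" could look like with that module (this uses Python 2's urlparse as in the link; on Python 3 the same function lives in urllib.parse):
# rough, untested sketch: keep one URL per unique page (query string ignored)
from urlparse import urlparse   # Python 3: from urllib.parse import urlparse

def unique_pages(links):
    seen = set()
    kept = []
    for link in links:
        parts = urlparse(link)
        key = (parts.netloc, parts.path)   # "same page" = same domain + same path
        if key not in seen:
            seen.add(key)
            kept.append(link)
    return kept

# unique_pages(['http://url.com/page?id=1', 'http://url.com/page?id=5'])
# -> ['http://url.com/page?id=1']
-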
2017-01-13 at 3:30 AM UTC
Originally posted by Merlin String matching can get pretty complex, but basically what aldra said. Once you have it split up you can look for query parameters and use regex to match cid vs id, and really anything. I'm confused about what you are really trying to accomplish and why you care about partial matches.
If you want to save space by having:
site.com/foo?id=
1,2,3,4,etc.
The complexity involved probably isn't worth whatever space you'd save. Ultimately I'm guessing you need to assemble it into the original URL at a later point?
Match-query regex, untested and not for any particular language, so it probably doesn't work:
/((\w)=(\d))&/
Find a way to dump the () groupings into variables and find all matches in a string; it's probably an arg for whatever regex function.
The point of the program, for now, is to collect a list of websites that match whatever dork I throw at Google, and not have any duplicates. But now that I think about it, that's more a thing of aesthetics than anything. I could just say fuck it and dump all results, then do some fuzzing on those with my program for max keks. -
2017-01-13 at 3:33 AM UTC
Originally posted by Lanny Yeah, you need to figure out what "almost duplicate" means. Whether two URLs point to the same page or not isn't something you can answer in general, partly because it's not even clear what "same page" means. Let's say you're talking about old vB and you're looking at a thread: if you hit the same path with params "p=1&pp=50" you'll get the first 50 posts of the thread, and if you use "p=1&pp=25" you'll get the first 25 posts. Are these the same pages? You could also use a GET param to set the theme. Is the same content in two different themes the same page? In practice you can't depend on querystrings being or not being part of a page's identity; generally it's better to err on the side of caution and assume they are (potentially producing duplicates).
Now there's the point that the order of elements in the querystring shouldn't alter the resource being requested at all (although there's no reason a service couldn't violate this property of the spec, which could wreak havoc on caches in a lot of places), and in that case you could do some parsing to collapse re-ordered query strings into each other as duplicates. But I would expect Google is already doing this, so it's probably not a problem you need to solve for.
Can you give some examples of the kind of duplicates you're trying to remove? Those three URLs listed in your OP, do you consider them duplicates?
Yes, I would consider those duplicates; also, in the meantime I will check out the library you posted. -
2017-01-13 at 4:15 AM UTC
Originally posted by Sophie The point of the program, for now, is to collect a list of websites that match whatever dork I throw at Google, and not have any duplicates. But now that I think about it, that's more a thing of aesthetics than anything. I could just say fuck it and dump all results, then do some fuzzing on those with my program for max keks.
Well I'm lazy, so that's what I would do. A 1MB text file can hold thousands of URLs; storage is cheap and CPU is not. Though looking for full duplicates would still be worthwhile and a lot easier.
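For exact duplicates, a rough sketch assuming the `link_list` from the OP and that you want to keep the original order:
# rough sketch: drop exact duplicates while keeping the original order
from collections import OrderedDict
unique_links = list(OrderedDict.fromkeys(link_list))
-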
2017-01-13 at 4:17 AM UTC
Originally posted by Merlin Well I'm lazy, so that's what I would do. A 1MB text file can hold thousands of URLs; storage is cheap and CPU is not. Though looking for full duplicates would still be worthwhile and a lot easier.
Google takes care of that, so yeah, I think I will probably just keep "duplicates". -
2017-01-13 at 4:22 AM UTC
Originally posted by Sophie Google takes care of that, so yeah, I think I will probably just keep "duplicates".
Though if they are jumbled and out of order, it might still be worthwhile to parse and sort them, say by website, and keep all of those in 'website.com.txt'. That would still require all the same work, but then it's a one-time thing. That way you can target a particular website and not just pull URLs out of a hat.
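A rough, untested sketch of that grouping, again assuming the urlparse module Lanny linked earlier and a `link_list` like the one in the OP:
# rough, untested sketch: write one text file of URLs per website
from urlparse import urlparse   # Python 3: from urllib.parse import urlparse
from collections import defaultdict

by_site = defaultdict(list)
for link in link_list:
    by_site[urlparse(link).netloc].append(link)

for site, links in by_site.items():
    with open(site + '.txt', 'w') as f:
        f.write('\n'.join(sorted(links)))
-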
2017-01-13 at 4:28 AM UTC
Originally posted by Merlin Though if they are jumbled and out of order, it might still be worthwhile to parse and sort them, say by website, and keep all of those in 'website.com.txt'. That would still require all the same work, but then it's a one-time thing. That way you can target a particular website and not just pull URLs out of a hat.
Google does the sorting for me as well. I should put in a party-trick argument and have the program randomly pull a URL out of a hat.