How do I get rid of strings that are (almost) duplicates but not entirely from a list?
-
2017-01-13 at 2:40 AM UTC
First thought would be regex, but I don't really know how, since the strings are so similar. So I would have a list with a bunch of these:
http://url.com/page?id=1
http://url.com/page?id=5
http://url.com/page?id=17
Do I sort on the ID? What if instead of `id` it has `cid`? So, to understand what I am doing here, please see my function below.
# Assumed imports (left out of the original post); argparse args and dork_list
# are defined elsewhere in the script.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def search():
    driver = webdriver.Firefox()
    link_list = []  # moved outside the loops so it isn't reset on every link
    for _ in range(args.pages):
        driver.get("http://google.com")
        assert "Google" in driver.title
        for item in dork_list:
            elem = driver.find_element_by_name("q")
            elem.clear()
            elem.send_keys(item)
            elem.send_keys(Keys.RETURN)
            assert "No results found." not in driver.page_source
            links = driver.find_elements_by_xpath("//a[@href]")
            for link in links:
                link_list.append(link.get_attribute("href"))
    driver.close()
    return link_list
This is untested code so please no bully; also, if there is something wrong with that function, feel free to point it out and give some constructive criticism. For the import statements: I used Selenium, and `args.pages` and such are defined with the argparse library as well.
BUT ANYWAY! I am returning a list of links but I have not sorted the duplicates out yet. I want to have a separate function to do that. Why? I have no idea; I just felt like I probably shouldn't put over 9000 loops inside of each other.
As you can see, once I have the list I need to do some additional processing with it. To be honest, since this code is untested I don't even know if the results will come out with HTML tags. But I figured I would hammer out the program as far as I am able first, then run it and debug from the first problem down.
So any thoughts?
Post last edited by Sophie at 2017-01-13T02:43:03.458331+00:00 -
2017-01-13 at 2:41 AM UTC
I don't speak Chinese.
-
2017-01-13 at 2:48 AM UTC
How do you want to strip them?
Do you want to only keep unique domains or unique pages?
http://url.com/nigger?id=1
http://url.com/nigger?id=5
http://url.com/dicks?id=17
ie. unique domains keeps only
http://url.com/nigger?id=1
unique pages keeps
http://url.com/nigger?id=1
http://url.com/dicks?id=17
At any rate I'd separate the strings into domain, page and arguments. You could use regex or a simple split: for the domain, take 0 to the first '/'; for the page, take the first '/' to the '?' if it exists; the parameter would be '?' to the end (assuming there's a '?').
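Something like this rough, untested Python sketch (plain string methods; the names are made up for illustration):
# rough, untested sketch: split a URL into domain, page and arguments
def split_url(url):
    rest = url.split('://', 1)[-1]          # drop the protocol if present
    domain, _, tail = rest.partition('/')   # 0 to the first '/'
    page, _, args = tail.partition('?')     # first '/' to the '?', then '?' to end
    return domain, page, args

# split_url("http://url.com/page?id=17") -> ('url.com', 'page', 'id=17')
-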
2017-01-13 at 2:52 AM UTC
I just realized `elem.send_keys(items)` is going to break because elem.send_keys doesn't have an attribute inurl:some_dork, fuck me.
-
2017-01-13 at 2:55 AM UTC
Originally posted by aldra How do you want to strip them?
Do you want to only keep unique domains or unique pages?
http://url.com/nigger?id=1
http://url.com/nigger?id=5
http://url.com/dicks?id=17
ie. unique domains keeps only
http://url.com/nigger?id=1
unique pages keeps
http://url.com/nigger?id=1
http://url.com/dicks?id=17
At any rate I'd separate the strings into domain, page and arguments. You could use regex or a simple split: for the domain, take 0 to the first '/'; for the page, take the first '/' to the '?' if it exists; the parameter would be '?' to the end (assuming there's a '?').
Unique pages will do nicely. So you mean I should split all the links up to the '/' and then what? How does my program know I have a duplicate? -
2017-01-13 at 2:56 AM UTC
Originally posted by Sophie I just realized `elem.send_keys(items)` is going to break because elem.send_keys doesn't have an attribute inurl:some_dork, fuck me.
Or is it? I don't fucking know. -
2017-01-13 at 3:05 AM UTC
I don't know much Python so I can't write you proper code, but basically I'd parse URLs into those parts.
First, a structure to hold that data:
structure splitUrl (protocol, domain, page, parameter)
Then split each URL into protocol (http:),
domain (url.com),
page (dicks),
and, if it exists, parameter (id=17).
That way you can quickly search through each of the properties to see if there's already an item with a matching page or parameter, for instance.
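In Python, a rough, untested stab at that structure (a namedtuple is one option; the names are made up for illustration) plus a duplicate check on the page:
# rough, untested sketch of the splitUrl structure plus a duplicate check
from collections import namedtuple

SplitUrl = namedtuple('SplitUrl', ['protocol', 'domain', 'page', 'parameter'])

def is_new_page(candidate, seen):
    # keep it only if nothing stored so far has the same domain and page
    return not any(s.domain == candidate.domain and s.page == candidate.page
                   for s in seen)

seen = []
candidate = SplitUrl('http:', 'url.com', 'dicks', 'id=17')
if is_new_page(candidate, seen):
    seen.append(candidate)
-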
2017-01-13 at 3:14 AM UTC
Originally posted by aldra I don't know much Python so I can't write you proper code, but basically I'd parse URLs into those parts.
First, a structure to hold that data:
structure splitUrl (protocol, domain, page, parameter)
Then split each URL into protocol (http:),
domain (url.com),
page (dicks),
and, if it exists, parameter (id=17).
That way you can quickly search through each of the properties to see if there's already an item with a matching page or parameter, for instance.
What manner of structure holds multiple pieces of data that all have a relation to each other? I am not building a *SQL database. -
2017-01-13 at 3:16 AM UTC
http://stackoverflow.com/questions/35988/c-like-structures-in-python
Though if you want to keep it simple you could just create 4 arrays: protocol, domain, page, args.
That way you'd get a full URL back by concatenating protocol[x], domain[x], page[x] and args[x].
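A rough, untested Python sketch of the four-arrays idea (all names made up for illustration):
# rough, untested sketch: four parallel lists, index x describes one URL
protocols, domains, pages, args = [], [], [], []

def add_url(protocol, domain, page, arg):
    protocols.append(protocol)
    domains.append(domain)
    pages.append(page)
    args.append(arg)

def full_url(x):
    # rebuild the full URL from the parts at index x
    return protocols[x] + '//' + domains[x] + '/' + pages[x] + ('?' + args[x] if args[x] else '')

add_url('http:', 'url.com', 'page', 'id=17')
# full_url(0) -> 'http://url.com/page?id=17'
-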
2017-01-13 at 3:20 AM UTC
String matching can get pretty complex, but basically what aldra said. Once you have it split up you can look for query parameters and use regex to match cid vs id, and really anything. I'm confused about what you are really trying to accomplish and why you care about partial matches.
If you want to save space by having:
site.com/foo?id=
1,2,3,4,etc.
The complexity involved probably isn't worth whatever space you'd save. Ultimately I'm guessing you need to assemble it into the original URL at a later point?
Match-query regex, untested and not for any particular language, so it probably doesn't work:
/((\w)=(\d))&/
Find a way to dump the () groupings into variables and find all matches in a string; it's probably an arg for whatever regex function.
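In Python that idea would look roughly like the untested sketch below; the pattern is loosened to \w+ and \d+ and drops the trailing '&' so it also catches the last parameter:
# rough, untested sketch: pull the () groupings out with re.findall
import re

pattern = re.compile(r'(\w+)=(\d+)')
url = 'http://url.com/page?cid=17&id=5'
print(pattern.findall(url))   # [('cid', '17'), ('id', '5')]
-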
2017-01-13 at 3:20 AM UTC
Yeah, you need to figure out what "almost duplicate" means. Whether two URLs point to the same page or not isn't something you can answer in general, partly because it's not even clear what "same page" means. Let's say you're talking about old vB and you're looking at a thread: if you hit the same path with params "p=1&pp=50" you'll get the first 50 posts of the thread, and if you use "p=1&pp=25" you'll get the first 25 posts. Are these the same pages? You could also use a GET param to set the theme. Is the same content in two different themes the same page? In practice you can't depend on querystrings being or not being part of a page's identity; generally it's better to err on the side of caution and assume they are (potentially producing duplicates).
Now there's the point that the order of elements in the querystring shouldn't alter the resource being requested at all (although there's no reason a service couldn't violate this property of the spec, which could wreak havoc on caches in a lot of places), and in that case you could do some parsing to collapse re-ordered query strings into each other as duplicates. But I would expect Google is already doing this, so it's probably not a problem you need to solve for.
Can you give some examples of the kind of duplicates you're trying to remove? Those three URLs listed in your OP, do you consider them duplicates? -
2017-01-13 at 3:26 AM UTC
Originally posted by aldra http://stackoverflow.com/questions/35988/c-like-structures-in-python
Though if you want to keep it simple you could just create 4 arrays: protocol, domain, page, args.
That way you'd get a full URL back by concatenating protocol[x], domain[x], page[x] and args[x].
I was thinking arrays, yeah, but then I have four arrays full of protocols and domains and such. Do I do:
# Pseudo code
if domain not in string:
    ''.join([protocol, domain, page, args])
-
2017-01-13 at 3:30 AM UTC
There's kind of a lot involved in parsing URIs; the stdlib has something for it: https://docs.python.org/2/library/urlparse.html
I'd suggest using that over regex or ad-hoc parsing.
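A minimal, untested sketch of what "unique pages" could look like with that module (this uses Python 2's urlparse as in the link; on Python 3 the same function lives in urllib.parse):
# rough, untested sketch: keep one URL per unique page (query string ignored)
from urlparse import urlparse   # Python 3: from urllib.parse import urlparse

def unique_pages(links):
    seen = set()
    kept = []
    for link in links:
        parts = urlparse(link)
        key = (parts.netloc, parts.path)   # "same page" = same domain + same path
        if key not in seen:
            seen.add(key)
            kept.append(link)
    return kept

# unique_pages(['http://url.com/page?id=1', 'http://url.com/page?id=5'])
# -> ['http://url.com/page?id=1']
-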
2017-01-13 at 3:30 AM UTC
Originally posted by Merlin String matching can get pretty complex, but basically what aldra said. Once you have it split up you can look for query parameters and use regex to match cid vs id, and really anything. I'm confused about what you are really trying to accomplish and why you care about partial matches.
If you want to save space by having:
site.com/foo?id=
1,2,3,4,etc.
The complexity involved probably isn't worth whatever space you'd save. Ultimately I'm guessing you need to assemble it into the original URL at a later point?
Match-query regex, untested and not for any particular language, so it probably doesn't work:
/((\w)=(\d))&/
Find a way to dump the () groupings into variables and find all matches in a string; it's probably an arg for whatever regex function.
The point of the program, for now, is to collect a list of websites that match whatever dork I throw at Google, and not have any duplicates. But now that I think about it, that's more a thing of aesthetics than anything. I could just say fuck it and dump all results, then do some fuzzing on those with my program for max keks. -
2017-01-13 at 3:33 AM UTC
Originally posted by Lanny Yeah, you need to figure out what "almost duplicate" means. Whether two URLs point to the same page or not isn't something you can answer in general, partly because it's not even clear what "same page" means. Let's say you're talking about old vB and you're looking at a thread: if you hit the same path with params "p=1&pp=50" you'll get the first 50 posts of the thread, and if you use "p=1&pp=25" you'll get the first 25 posts. Are these the same pages? You could also use a GET param to set the theme. Is the same content in two different themes the same page? In practice you can't depend on querystrings being or not being part of a page's identity; generally it's better to err on the side of caution and assume they are (potentially producing duplicates).
Now there's the point that the order of elements in the querystring shouldn't alter the resource being requested at all (although there's no reason a service couldn't violate this property of the spec, which could wreak havoc on caches in a lot of places), and in that case you could do some parsing to collapse re-ordered query strings into each other as duplicates. But I would expect Google is already doing this, so it's probably not a problem you need to solve for.
Can you give some examples of the kind of duplicates you're trying to remove? Those three URLs listed in your OP, do you consider them duplicates?
Yes, I would consider those duplicates; also, in the meantime I will check out the library you posted. -
2017-01-13 at 4:15 AM UTC
Originally posted by Sophie The point of the program, for now, is to collect a list of websites that match whatever dork I throw at Google, and not have any duplicates. But now that I think about it, that's more a thing of aesthetics than anything. I could just say fuck it and dump all results, then do some fuzzing on those with my program for max keks.
Well I'm lazy, so that's what I would do. A 1MB text file can hold thousands of URLs; storage is cheap and CPU is not. Though looking for full duplicates would still be worthwhile and a lot easier.
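For exact duplicates, a rough sketch assuming the `link_list` from the OP and that you want to keep the original order:
# rough sketch: drop exact duplicates while keeping the original order
from collections import OrderedDict
unique_links = list(OrderedDict.fromkeys(link_list))
-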
2017-01-13 at 4:17 AM UTC
Originally posted by Merlin Well I'm lazy, so that's what I would do. A 1MB text file can hold thousands of URLs; storage is cheap and CPU is not. Though looking for full duplicates would still be worthwhile and a lot easier.
Google takes care of that, so yeah, I think I will probably just keep "duplicates". -
2017-01-13 at 4:22 AM UTC
Originally posted by Sophie Google takes care of that, so yeah, I think I will probably just keep "duplicates".
Though if they are jumbled and out of order, it might still be worthwhile to parse and sort them, say by website, and keep all of those in 'website.com.txt'. That would still require all the same work, but then it's a one-time thing. That way you can target a particular website and not just pull URLs out of a hat.
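A rough, untested sketch of that grouping, again assuming the urlparse module Lanny linked earlier and a `link_list` like the one in the OP:
# rough, untested sketch: write one text file of URLs per website
from urlparse import urlparse   # Python 3: from urllib.parse import urlparse
from collections import defaultdict

by_site = defaultdict(list)
for link in link_list:
    by_site[urlparse(link).netloc].append(link)

for site, links in by_site.items():
    with open(site + '.txt', 'w') as f:
        f.write('\n'.join(sorted(links)))
-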
2017-01-13 at 4:28 AM UTC
Originally posted by Merlin Though if they are jumbled and out of order, it might still be worthwhile to parse and sort them, say by website, and keep all of those in 'website.com.txt'. That would still require all the same work, but then it's a one-time thing. That way you can target a particular website and not just pull URLs out of a hat.
Google does the sorting for me as well. I should put in a party-trick argument and have the program randomly pull a URL out of a hat.