How do I reformat a row from a CSV file to JSON?
-
2017-08-25 at 9:59 PM UTC
So I have a big CSV file that is formatted like this:
CVE-1999-0026,Entry,"root privileges via buffer overflow in pset command on SGI IRIX systems.","CERT:CA-97.21.sgi_buffer_overflow | AUSCERT:AA-97.20.IRIX.pset.buffer.overflow.vul
I want to use Python's JSON module to make it look more like this:
CVE-1999-0026,
Entry,"root privileges via buffer overflow in pset command on SGI IRIX systems.",
"CERT:CA-97.21.sgi_buffer_overflow | AUSCERT:AA-97.20.IRIX.pset.buffer.overflow.vul
If I can do that without the JSON module, that'd be OK too. As usual, we are talking Python here.
-
2017-08-25 at 10:07 PM UTC
Have you tried turning it off and turning it back on again?
-
2017-08-25 at 10:12 PM UTC
Don't encourage him
-
2017-08-25 at 10:12 PM UTC
pythons belong in the jungle, not in captivity
-
2017-08-25 at 10:39 PM UTC
^泥哥飞鸽
-
2017-08-25 at 11:24 PM UTC
Originally posted by Sophie:
So I have a big CSV file that is formatted like this:
CVE-1999-0026,Entry,"root privileges via buffer overflow in pset command on SGI IRIX systems.","CERT:CA-97.21.sgi_buffer_overflow | AUSCERT:AA-97.20.IRIX.pset.buffer.overflow.vul
I want to use Python's JSON module to make it look more like this:
CVE-1999-0026,
Entry,"root privileges via buffer overflow in pset command on SGI IRIX systems.",
"CERT:CA-97.21.sgi_buffer_overflow | AUSCERT:AA-97.20.IRIX.pset.buffer.overflow.vul
If I can do that without the JSON module, that'd be OK too. As usual, we are talking Python here.
The first thing you want to think about is the most natural way to represent these entries in JSON. Rather than representing each entry as a flat sequence of values, key/value pairs would make more sense. Usually CSV files come with column names in a header row at the top. Assuming that's the case, you could do something like this:
import csv
import json
import StringIO

# Sample CSV with a header row naming each column
sample_csv = """cve_id,type,description,whatever
CVE-1999-0026,Entry,"root privileges via buffer overflow in pset command on SGI IRIX systems.","CERT:CA-97.21.sgi_buffer_overflow | AUSCERT:AA-97.20.IRIX.pset.buffer.overflow.vul"
CVE-1999-0027,Entry,"root privileges via buffer overflow in pset command on SGI IRIX systems.","CERT:CA-97.21.sgi_buffer_overflow | AUSCERT:AA-97.20.IRIX.pset.buffer.overflow.vul"
"""

if __name__ == '__main__':
    # Wrap the string in a file-like object; DictReader wants something file-like
    pseudo_file = StringIO.StringIO(sample_csv)
    # DictReader uses the header row as keys, so each row comes back as a dict
    csv_reader = csv.DictReader(pseudo_file)
    rows = [row for row in csv_reader]
    print json.dumps({'vulns': rows})
which produces this (after formatting):
{
    "vulns": [
        {
            "whatever": "CERT:CA-97.21.sgi_buffer_overflow | AUSCERT:AA-97.20.IRIX.pset.buffer.overflow.vul",
            "type": "Entry",
            "description": "root privileges via buffer overflow in pset command on SGI IRIX systems.",
            "cve_id": "CVE-1999-0026"
        },
        {
            "whatever": "CERT:CA-97.21.sgi_buffer_overflow | AUSCERT:AA-97.20.IRIX.pset.buffer.overflow.vul",
            "type": "Entry",
            "description": "root privileges via buffer overflow in pset command on SGI IRIX systems.",
            "cve_id": "CVE-1999-0027"
        }
    ]
}
You can ignore the StringIO stuff; it's just a way to make a file-like object out of a string, since csv.DictReader expects a file-like object. In your case you'll be reading from a real file.
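For what it's worth, a minimal sketch of the same thing reading from an actual file on disk would look like this. I'm using cve_mitre.csv as a placeholder filename, and assuming the first line of the file is a header row like the sample above; swap in whatever your real file looks like:
import csv
import json

if __name__ == '__main__':
    # Placeholder filename -- use your real CSV here. Assumes the first line
    # is a header row (e.g. cve_id,type,description,whatever).
    with open('cve_mitre.csv', 'rb') as infile:
        csv_reader = csv.DictReader(infile)
        rows = [row for row in csv_reader]
    print json.dumps({'vulns': rows})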
Also worth pointing out that the `rows` list (a list of dictionaries built from the rows of the CSV file) gets wrapped in a dict before being passed to `json.dumps`. You don't _have_ to do this; json.dumps will happily accept a bare list, and the current JSON spec allows any value, including an array, at the top level. Wrapping everything in a single top-level object (a map from keys to values, equivalent to Python's "dictionary" type) is still a common convention though, and it leaves you room to add other keys later without breaking whatever consumes the output.
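Quick illustration in the REPL, this is just stock json module behavior: both of these work, the wrapped version just gives you an object at the top level instead of an array.
>>> import json
>>> json.dumps([{'cve_id': 'CVE-1999-0026'}])
'[{"cve_id": "CVE-1999-0026"}]'
>>> json.dumps({'vulns': [{'cve_id': 'CVE-1999-0026'}]})
'{"vulns": [{"cve_id": "CVE-1999-0026"}]}'
-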
2017-08-26 at 5:41 AM UTC
Thank you Lanny, when I return to coding in a bit I will try this out and report back with results.
-
2017-08-26 at 7:19 AM UTC
I can give you the code I used to do the same in Perl, not sure how helpful it'd be though.
It was just a matter of reading the CSV into an array and then using the JSON CPAN library to convert the array to JSON.
-
2017-08-27 at 5:42 PM UTC
Anyway, I want to search for specific CVEs in my CSV file and print them as JSON. Can I just do something like this?
query = raw_input("wut vuln m8?: ")

with open('cve_mitre.csv', 'rb') as infile:
    csv_reader = csv.DictReader(infile)
    rows = [row for row in csv_reader]
    for item in rows:
        if query in item:
            print json.dumps({'vulns': item})
Or can't I do "if this string is in this dictionary item, print the dictionary item"? Do I need to do some regex hocus pocus? Pls no, I suck at regex.
-
2017-08-27 at 9:00 PM UTC
There's no general way to ask "does my string occur in this dictionary" because it's not really obvious what "occurs in" would mean. It's like asking "what's the sum of this number and this string"; the operation isn't defined over those types. When you write `if 'foo' in my_dict:` what you're asking is "is 'foo' one of the keys in my_dict?".

You probably want to search within a specific field in each row (if someone types in "apache" there's not much point in looking at the CVE column, since the string "apache" is never going to appear there); there's a sketch of that version after the note below. If you really want to search every field you'd have to do something like this instead:
import csv
import json

query = raw_input("wut vuln m8?: ")

with open('cve_mitre.csv', 'rb') as infile:
    csv_reader = csv.DictReader(infile)
    rows = [row for row in csv_reader]

# Check every column of every row for the query string
for row in rows:
    for col_name in row:
        if query in row[col_name]:
            print json.dumps(row)
Note that what gets sent to json.dumps doesn't need to be wrapped in this case, since the vulnerability row is already a dictionary.
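And here's a rough sketch of the single-field version I mentioned, assuming your header row literally names the column "description" like in my sample above (adjust if yours is named something else):
import csv
import json

query = raw_input("wut vuln m8?: ")

with open('cve_mitre.csv', 'rb') as infile:
    for row in csv.DictReader(infile):
        # Only look at the description column, case-insensitively
        if query.lower() in row['description'].lower():
            print json.dumps(row)
-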
2017-08-27 at 11:32 PM UTC
Thanks, but wouldn't that just print the row that had 'apache' in it? So the output would just say:
"description": "something something apache, something something vuln"
Because that doesn't seem particularly useful.
-
2017-08-27 at 11:33 PM UTC
Oh wait, a row is the whole item, right? If so, don't mind me.
-
2017-08-28 at 6:27 AM UTC
Yeah, `row` is the dict with a key for each column. `col_name` would be "description" and `row[col_name]` would be the actual string describing that vulnerability.
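To make that concrete, with the sample data from earlier in the thread, each `row` that DictReader hands back is a dict roughly like this:
row = {
    'cve_id': 'CVE-1999-0026',
    'type': 'Entry',
    'description': 'root privileges via buffer overflow in pset command on SGI IRIX systems.',
    'whatever': 'CERT:CA-97.21.sgi_buffer_overflow | AUSCERT:AA-97.20.IRIX.pset.buffer.overflow.vul',
}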
Oh also as an aside, you can replace
rows = [row for row in csv_reader]
for row in rows:
with
for row in csv_reader:
They do the same thing, but the former reads the whole file into memory at once while the latter reads it one row at a time as you iterate, which can matter if you're working with large files. Just keep the loop inside the `with` block so the file is still open while you're reading from it.
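So the streamed version of the search would look something like this (same placeholder filename and header assumptions as before):
import csv
import json

query = raw_input("wut vuln m8?: ")

with open('cve_mitre.csv', 'rb') as infile:
    # Iterating the reader directly reads rows lazily, one at a time,
    # instead of collecting them all into a list first.
    for row in csv.DictReader(infile):
        for col_name in row:
            if query in row[col_name]:
                print json.dumps(row)
                break  # don't print the same row twice if it matches in more than one column
-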
2017-08-30 at 2:22 AM UTC
I think the whole thing is like 100 MB. I'll try it out and check out the difference.