Someone explain this hexdump function to me.

2015-09-18 at 3:33 PM UTC

#1

Sophie Pedophile Tech Support

So i was going over the code for a tcp proxy that's discussed in the book i'm reading and i came upon a particular piece of code that kind of looks like gibberish to me. It's a hexdump function and it's meant to convert data from network packets into hex so you can inspect it. I have no clue how it works, however, or how the output is computed, and i'd like to know. Here's the code.


def hexdump(src, length=16):
    result = []
    digits = 4 if isinstance(src, unicode) else 2
    
    for i in xrange(0, len(src), length):
        s = src[i:i+length]
        hexa = b' '.join(["%0*X" % (digits, ord(x)) for x in s])
        text = b''.join([x if 0x20 <= ord(x) < 0x7f else b'.' for x in s])
        result.append( b"%04X  %-*s  %s" % (i, length*(digits + 1), hexa, text) )
        
    print b'\n'.join(result)

Also i've never seen an if/else statement formatted like this:


digits = 4 if isinstance(src, unicode) else 2

What's up with that?

Anyway i'll probably get zero replies, i wish we had more programmers here but in that case i'll probably go ask over at stackoverflow. Wanted to post it here because content is good.

2015-09-18 at 4:20 PM UTC

#2

-SpectraL coward [the spuriously bluish-lilac bushman]

https://github.com/Eid010n/Python/blob/master/Black-Hat-Python/BHP-Code/Chapter2/proxy.py

2015-09-18 at 4:24 PM UTC

#3

-SpectraL coward [the spuriously bluish-lilac bushman]

https://www.alertlogic.com/blog/journalctl-terminal-escape-injection/

2015-09-18 at 4:29 PM UTC

#4

Sophie Pedophile Tech Support

https://github.com/Eid010n/Python/bl...pter2/proxy.py

Yup that's the one, and i know how the proxy in general operates. However, while i know what the hexdump function is for, i'd like to know how it works. It's unclear to me how the code actually does what it does and how the data is computed. I even checked the link from the script, which didn't make me any the wiser.

https://code.activestate.com/recipes/142812-hex-dumper/

2015-09-18 at 8 PM UTC

#5

Parker Brother Yung Blood [the valiantly arthrosporous wyatt]

The if/else statement you refer to is Python's version of the Ternary Operator. It is pretty much shorthand for

if isinstance(src,unicode):
    digits=4
else:
    digits=2

In most other programming languages the Ternary Operator is written as

condition ? true_value : false_value

So that line would be written as

digits=isinstance(src,unicode)? 4:2

2015-09-18 at 9:30 PM UTC

#6

Lanny Bird of Courage

There's no deep magic here, a hex dump is pretty straightforward, there's just some cleverness in the implementation here (for shorter code, not really performance. I'm tempted to call it "showy").

My comments on lines that seem non-obvious to me, if I leave something out that's unclear let me know.

digits = 4 if isinstance(src, unicode) else 2

the if/else construct here is python's rendition of the ternary operator. In its general form it can be rewritten

VAR = LEFT if COND else RIGHT

becomes


def f():
  if COND:
    return LEFT
  else:
    return RIGHT

VAR = f()

The interesting difference here is that the ternary form is an expression, that is, it has a value, while a traditional if/else is a statement that can only operate through side effects (changing variable values and such). It's incoherent to ask "what is the value of a (traditional) if/else" in the same way we might ask "what is the value of this function call" or "what is the value of `2 + 2`". In C-like languages the ternary form is viewed with a sort of suspicion: it's considered "tricky", and non-standard compilers have trained programmers to distrust promises of short-circuit semantics (i.e. that only one arm of the conditional will be evaluated). In functional programming circles, though, a language construct that isn't an expression is variously considered poor style or blatantly wrong.
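
To make the expression-versus-statement point concrete, here's a small sketch. It's written for Python 3 so it runs on a current interpreter, and the names `pick_digits` and `pick_digits_long` are made up for the example, they aren't from the proxy script.

```python
def pick_digits(is_wide):
    # Conditional expression: evaluates to 4 when is_wide is true, else 2.
    return 4 if is_wide else 2


def pick_digits_long(is_wide):
    # The same decision written as a statement-style if/else.
    if is_wide:
        digits = 4
    else:
        digits = 2
    return digits


# Because it is an expression, it can sit inside a larger expression.
label = "wide" if pick_digits(True) == 4 else "narrow"

# Only the chosen arm is evaluated, so the division by zero never runs.
safe = 1 if True else 1 / 0

print(pick_digits(True), pick_digits(False), label, safe)
```

Both functions give the same answer; the first one just doesn't need a temporary variable.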

But that's kind of a tangent on style. Ternary operator aside, the point here is that `digits` is the number of nibbles per character of `src`. The character/byte/symbol/glyph/what-the-fuck-ever distinction is subtle, but the idea is this: each hexadecimal digit (0-F) represents a nibble (half a byte, 4 bits, 2^4 = 16 possible values). If the text is a byte string (ASCII, say) then each 'character' (character being what's accessed by python's subscript notation, e.g. 'foo'[0] == 'f') is 1 byte (two nibbles), but if it's a `unicode` string with 16-bit characters (as in UTF-16) then it's 2 bytes or 4 nibbles. This is kind of a strange way of doing things, I don't know why anyone would be receiving something over a network in anything other than a byte-string (represented in python 2.x as type `str` rather than `unicode`) but whatever, that's the idea here.
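
Here's the nibble arithmetic as a quick sketch (Python 3). The two sample values are just ones I picked for illustration: 0x41 is 'A' as a single byte, and 0x20AC is the euro sign's code point, which needs 16 bits.

```python
byte_value = 0x41    # fits in one byte: 8 bits, so 2 hex digits
wide_value = 0x20AC  # needs 16 bits, so 4 hex digits

# Split the byte into its two nibbles (4 bits each).
high_nibble = byte_value >> 4   # top 4 bits
low_nibble = byte_value & 0x0F  # bottom 4 bits

narrow = "%02X" % byte_value
wide = "%04X" % wide_value

print(high_nibble, low_nibble, narrow, wide)
```

Each hex digit in the output is exactly one nibble, which is why the dump uses 2 digits for byte strings and 4 for 16-bit characters.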

for i in xrange(0, len(src), length):

`xrange` is basically the same thing as `range`, it just has to do with when the sequence is generated. In python 2, `range` builds the whole list up front while `xrange` returns a lazy sequence object, similar in spirit to what python calls 'generators' and what we'd call 'lazy sequences' in other languages. It just means that the next number in the range is computed when it's asked for, instead of when the function is first called. For large sequences this is more efficient because we don't need to store the whole sequence in memory, we can just generate numbers as needed and let GC dispose of them when we move on to the next. The third argument just says "increment in steps of `length`", and `length` is the number of characters worth of data to show per line.
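
A small sketch of the stepping, assuming Python 3 (where plain `range` is already lazy, like Python 2's `xrange`). The sizes `data_length` and `line_width` are hypothetical numbers chosen for the example.

```python
data_length = 40  # hypothetical payload size in bytes
line_width = 16   # bytes shown per dump line

# Lazy: no list is built until we ask for one.
offsets = range(0, data_length, line_width)
offset_list = list(offsets)
print(offset_list)

# Even a huge range costs almost nothing to create.
huge = range(0, 10**12, line_width)
print(len(huge))
```

Each offset is where one line of the dump starts, so 40 bytes at 16 per line gives three lines.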

s = src[i:i+length]

Slice notation, s becomes a `length` long substring of src starting at `i`.
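
Putting the stepping and the slice together, roughly like this (Python 3, with a made-up 26-byte `src` and a line length of 10 so the last chunk comes out short):

```python
src = b"abcdefghijklmnopqrstuvwxyz"  # sample data, 26 bytes
length = 10

chunks = []
for i in range(0, len(src), length):
    # Slicing past the end is fine: the last chunk is simply shorter.
    chunks.append(src[i:i + length])

print(chunks)
```

Note that the slice never raises an error when `i + length` runs past the end of the data, which is why the function doesn't need a special case for the final line.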

hexa = b' '.join(["%0*X" % (digits, ord(x)) for x in s])

There's a lot going on in this line. Python has a construct called "list comprehensions" which are a way of defining a list as a function of another list. They look like:

[ITEM_EXPR for VAR_NAME in SRC_LIST]

Which will step over every item in SRC_LIST, assign the item's value to VAR_NAME, and make the item in the result list with the same index the value of ITEM_EXPR where VAR_NAME is bound. That sounds fancy but it's just a shorthand for a for loop. Consider:

src = [39, 40, 41]
dest = [x+1 for x in src]
# dest == [40, 41, 42]

There's special syntax for "enhancement" operations as well, but we don't have to worry about that since it's not used here. The important point is that it defines a transformation of a list. So we know what two thirds of this list comprehension is doing. `s` is the 16 characters of data (it's not actually a list, but it is a sequence and therefore what's called an `iterable` in python, meaning we can loop over it and use it in a list comprehension) and `x` will be each character of that data when the ITEM_EXPR is evaluated. So the question is what

"%0*X" % (digits, ord(x))

does. This syntax (more syntax, sorry) is known as interpolation. The general form is `FORMAT % PARAMS` where FORMAT is a string that's like a "template" and PARAMS is the data that's going to be formatted (according to the template) in the output. You see a lot of this in things like


name = "Sophie"
print "Hello there %s" % name

which will output "Hello there Sophie". `%s` is the marker for "format the corresponding input as a string and stick it here". `%X` is the marker for "format the corresponding input as a hexadecimal number and stick it here". You can specify "zero padding", so `"%X" % 32` would be `"20"` but `"%04X" % 32` would be `"0020"` (the output will always be at least four characters, even if that means including leading zeros, which we don't traditionally write). You can also supply the padding width as a parameter, in the same way we supply the number to be formatted. That's what `%0*X` means: zero-pad to the width given by the corresponding parameter, which in this case is the value of `digits`. So the result of the list comprehension is a list of (up to 16) strings that are the hex representations of the bytes in our length-16 slice of src. `' '.join(LIST)` just returns a string which is each member of LIST concatenated and separated by `' '` (a space).
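
If you want to poke at the formatting yourself, here's a sketch for Python 3. One difference from the Python 2 code in the thread: iterating over a `bytes` object in Python 3 already gives you integers, so there's no `ord()` call. The sample `chunk` is made up.

```python
digits = 2

plain = "%X" % 10              # no padding
padded = "%04X" % 32           # zero-padded to a fixed width of 4
star = "%0*X" % (digits, 10)   # width taken from the first parameter

chunk = b"Hi\n"  # sample data: 'H', 'i' and a newline
hexa = " ".join(["%0*X" % (digits, b) for b in chunk])

print(plain, padded, star)
print(hexa)
```

The `*` consumes one extra item from the parameter tuple, which is why the tuple has two entries for a single `%0*X` marker.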

As an aside, most languages implement join either as a standalone function or as a method on list types, while in python you get this odd inversion of it being a method of strings. Guido has an argument for why this is so and it actually kinda works, but it's interesting that most programmers consider it a "wart" on python.

Oh, and `ord(CHAR)` returns the numeric value of CHAR, so like `ord('a')` is 97 (base 10), or 61 in hex, because under ASCII and UTF-8 'a' is encoded as 0x61 (well, the byte that has the numeric value of 0x61, whatever).
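
A quick check of `ord` in an interpreter (Python 3 here, but these calls behave the same way for ASCII characters in Python 2):

```python
code = ord("a")

as_decimal = str(code)    # the value in base 10
as_hex = "%02X" % code    # the same value in base 16

# chr() goes the other way, from number back to character.
round_trip = chr(code)

print(as_decimal, as_hex, round_trip)
```

Same number, two ways of writing it: the hex form is what ends up in the dump.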

text = b''.join([x if 0x20 <= ord(x) < 0x7f else b'.' for x in s])

Similar thing here, we're just iterating over each character in `s`. The difference is that the list built by the list comprehension is of single characters. If the character is outside the printable range (i.e. less than 32 (0x20), or 127 (0x7f) and up) it's represented in the dump as just a simple dot. That covers things like control characters, which you see frequently in binary data since it's basically random bit patterns (at least when looking at it as a hex dump). This helps because something like a newline has a 1 in 256 chance of appearing in any given byte of a chunk of binary, and if that gets printed in your dump the formatting will be fucked. So yeah, the point of this line is to produce the right column (like you see here) and the LC replaces non-printables with dots.
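
Here's that filter on its own as a Python 3 sketch. The sample `chunk` is invented to include a carriage return, a newline, a NUL byte and 0x7f (DEL), all of which should turn into dots.

```python
chunk = b"OK\r\n\x00\x7fz"  # sample data mixing printable and control bytes

# Keep bytes in the range 0x20..0x7e, replace everything else with '.'.
text = "".join([chr(b) if 0x20 <= b < 0x7f else "." for b in chunk])

print(text)
```

Note that 0x7f itself is filtered out too, since the comparison is a strict less-than.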

result.append( b"%04X  %-*s  %s" % (i, length*(digits + 1), hexa, text) )

More string formatting. Start every line with the "line number", that is the index of the first byte in that line (printed in hex, zero-padded to four digits). Then the hex values (calculated two lines up), left-justified and padded with spaces to a fixed width of `length*(digits + 1)` so the columns still line up when the last line is shorter than the others (this is the `%-*s` part; the `-` means left-justify and the `*` takes the width from the parameters). Then the printable ascii representation, the third column (the last `%s`).
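
To see what the `-` and `*` do in `%-*s`, here's a small Python 3 sketch. The width of 8 and the sample strings are arbitrary; the trailing `|` is only there so the padding is visible.

```python
width = 8

left = "%-*s|" % (width, "AB")   # left-justified, spaces added on the right
right = "%*s|" % (width, "AB")   # default is right-justified

# A miniature dump line: offset 16, a 3-bytes-per-line layout (width 3 * 3),
# and only two bytes actually present on this line.
line = "%04X  %-*s  %s" % (16, 3 * (2 + 1), "41 42", "AB")

print(left)
print(right)
print(line)
```

The width is `length*(digits + 1)` because every byte takes `digits` hex characters plus one separating space.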

And that's it. Maybe it would have been less opaque if the author had just used for loops and stuff but I've found that as you become a bit more experienced thinking in terms of sequence (or in pythonese "iterable") transformations is a really powerful conceptual model. It's kinda surprising how many problems can be expressed/solved in this way and it lends itself to composability/reuse.
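
For comparison, here's roughly what a plain for-loop version might look like. This is my own sketch for Python 3, not the book's code: the name `hexdump_lines` is made up, it only handles `bytes` (so `digits` is fixed at 2), and it returns the lines instead of printing them.

```python
def hexdump_lines(src, length=16):
    """Return hex dump lines for a bytes object (illustrative sketch)."""
    digits = 2  # bytes only: two hex digits per byte
    lines = []
    for offset in range(0, len(src), length):
        chunk = src[offset:offset + length]
        hex_parts = []
        text_chars = []
        for byte in chunk:
            # Hex column: zero-padded hex value of each byte.
            hex_parts.append("%0*X" % (digits, byte))
            # Text column: printable ASCII as-is, everything else as '.'.
            if 0x20 <= byte < 0x7f:
                text_chars.append(chr(byte))
            else:
                text_chars.append(".")
        hexa = " ".join(hex_parts)
        text = "".join(text_chars)
        lines.append("%04X  %-*s  %s" % (offset, length * (digits + 1), hexa, text))
    return lines


for line in hexdump_lines(b"Hello, world!\r\nBye"):
    print(line)
```

It does the same three things per line (offset, hex column, text column), just with the two list comprehensions unrolled into one inner loop.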

2015-09-19 at 12:33 AM UTC

#7

Sophie Pedophile Tech Support

Dang Lan, thanks for the amazing write up. All in all there are some things in your explanation that i don't quite grasp, but that's no fault of yours, it's because i don't really know the fundamentals of programming yet. This helped me along though, so thank you for that.

2015-09-19 at 2:57 AM UTC

#8

Lanny Bird of Courage

Happy to help blood, I feel bad that I leave threads in this forum unreplied to (among other things) either out of laziness or a lack of domain knowledge so it's nice if I can be helpful now and then. Feel free to ask follow ups on anything you don't understand, for any reason. Obviously some things are easier to just google but some questions are hard to google for like "ternary operator". I remember trying to google for the ternary operator the first time I saw it (not knowing it was called that) and "?:" or "X?Y:Z" don't return useful results (interestingly "colon question mark" does actually have some valid results). Similar situation with list comprehensions, string interpolation, tuple unpacking etc. (and just for python). And of course there are some things that are just easier to talk through/ask about than read through.

2015-09-19 at 11:17 AM UTC

#9

Sophie Pedophile Tech Support

Sure thing, probably tonight when i get back from a dinner party(Yes i go to dinner parties, because i am a high society nigga' like that) and i have some time to spare i'll go through your post again and take a good look at the things i don't quite grasp, to articulate some questions related to them, forgive me if it will include some questions that may seem obvious to you but hey if i'm gonna' learn i got to get the fundamentals nahmean?
