There's no deep magic here, a hex dump is pretty straightforward, there's just some cleverness in the implementation here (for shorter code, not really performance. I'm tempted to call it "showy").
My comments on lines that seem non-obvious to me, if I leave something out that's unclear let me know.
digits = 4 if isinstance(src, unicode) else 2
the if/else construct here is python's rendition of the the
ternary operator. In its general form it can be rewritten
VAR = LEFT if COND else RIGHT
becomes
def f():
if COND:
return LEFT:
else:
return RIGHT
VAR = f()
The interesting difference here is that the ternary form is an expression, that is it has a value, while a traditional if/else can only operate through side effects (changing variable values and such), it's incoherent to ask "what is the value of a (traditional) if/else" in the same way we might ask "what is the value of this function call" or "what is the value of `2 + 2`". In C-like languages the ternary form is viewed with a sort of suspicion, it's considered "tricky" and non-standard compilers have trained programmers to fear promises short-circuit semantics (i.e. that only one arm of an if will be executed) but in functional programming circles a language construct that isn't an expression is variously considered poor style or blatantly wrong.
But that's kind of a tangent on style. Ternary operator aside, the point here is that `digits` is the number of nibbles per character of `src`. The character/byte/symbol/glyph/what-the-fuck-ever distinction is subtle the idea is this: each hexadecimal digit (0-F) represents a nibble (half a byte, 4 bits, 2^4 = 16 possible values). If the text is encoded in ASCII then each 'character' (character being what's accessed by python's subscript notation ('foo'[1] == 'f')) is 1 byte (two nibbles) but if it's UTF-16 (16-bit characters) then it's 2 bytes or 4 nibbles. This is kind of a strange way of doing things, I don't know why anyone would be receiving something over a network in anything other than a byte-string (represented in python 2.x as type `str` rather than `unicode`) but whatever, that's the idea here.
for i in xrange(0, len(src), length):
`xrange` is basically the same thing as `range`, it just has to do with when the sequence is generated. Python has what are called 'generators' which are what we'd call 'lazy sequences' in other languages. It just means that the next number in the range is computed when it's asked for, instead of when the function is first called. For large sequences this is more efficient because we don't need to store the whole sequence in memory, we can just generate numbers as needed and let GC dispose of them when we move onto the next. The third argument just says "increments in steps of `length`", and `length` is the number of characters worth of data to show per line.
Slice notation, s becomes a `length` long substring of src starting at `i`.
hexa = b' '.join(["%0*X" % (digits, ord(x)) for x in s])
There's a lot going on in this line. Python has a construct called "list comprehensions" which are a way of defining a list as a function of another list. They look like:
[ITEM_EXPR for VAR_NAME in SRC_LIST]
Which will step over every item in SRC_LIST, assign the item's value to VAR_NAME, and make the item in the result list with the same index the value of ITEM_EXPR where VAR_ NAME is bound. That sounds fancy but it's just a shorthand for a for loop. Consider:
src = [39, 40, 41]
dest = [x+1 for x in src]
# dest === [40, 41, 42]
Although there's special syntax for "enhancement" operations as well but we don't have to worry about that since it's not used here. The important point is that it defines a transformation of a list. So we know what two thirds of this list comprehension is doing. `s` is the 16 characters of data (it's not actually a list, but it is what's called an `iterable` in python, meaning we can use the subscript and list comprehension syntax on it) and `x` will be each character of that data when the ITEM_EXPR is evaluated. So the question is what
"%0*X" % (digits, ord(x))
does. This syntax (more syntax, sorry) is known as interpolation. The general form is `FORMAT % PARAMS` where FORMAT is a string that's like a "template" and some data that's going to be formatted (according to the template) in the output. You see a lot of this in things like
name = "Sophie"
print "Hello there %s" % name
which will output "Hello there Sophie". `%s` is the marker for "format the corresponding input as a string and stick it here". "%X" is the marker for "format the corresponding input as a hexadecimal number and stick it here". You can specify "zero padding", so like "%X" % 32` would be `"20"` but `"%04X" % 32` would be `"0020"` (the output will always be four characters, even if that means including leading zeros which we don't traditionally do). You can also specify the length of the padding as a parameter in the same way we specify the number to be formatted. That's what `%
0*X` means, pad with the number of 0s of the corresponding parameters, in this case that's the value of `length`. So the result of the list comprehension is a list of (presumably 16) strings that are the hex representation of the bytes in our length-16 slice of src. `' '.join(LIST)` just returns a string which is each member of LIST concatenated and separated by `' '` (a space).
As an aside, most languages implements join either as a standalone function or as a method on list types while in python you get this odd inversion of it being a method of strings. Guido has an argument for why this is so and it actually kinda works but it's interesting that most programmers consider it a "wart" on python.
Oh, and `ord(CHAR)` returns the byte value of CHAR, so like `ord('a')` is 61 (base 10) because under ASCII and UTF-8 'a' is encoded as the 61 (well, the byte that has the numeric value of 61, whatever).
text = b''.join([x if 0x20 <= ord(x) < 0x7f else b'.' for x in s])
Similar thing here, we're just iterating over each character in src. The difference is that the list built by the list comprehension is of single characters. If the character is outside the printable range (i.e. less than 32 (0x20) or more than 127 (0x7f)) it's represented in the dump as just a simple dot, so things like control characters, you see these frequently in binary data since it's basically random bit patterns (at least when looking at it as a hex dump. This helps because something like a newline has a 1 in 256 chance of appearing in a byte in a chunk of binary and if that gets printed in your dump the formatting will be fucked. So yeah, the point of this line is to produce the right column (like you see
here) and the LC replaces non-printables with dots.
result.append( b"%04X %-*s %s" % (i, length*(digits + 1), hexa, text) )[/quote]
More string formatting. Start every line with the "line number", that is the index of the first byte in that line (printed in hex). Then the hex values (calculated two lines up) (and padded with '-', in the case the last line isn't the same length as the others (this is the `%-*s` part). Then the printable ascii representation, the third column (the last `%s`).
And that's it. Maybe it would have been less opaque if the author had just used for loops and stuff but I've found that as you become a bit more experienced thinking in terms of sequence (or in pythonese "iterable") transformations is a really powerful conceptual model. It's kinda surprising how many problems can be expressed/solved in this way and it lends itself to composability/reuse.