July 20, 2011
The C programming language lets developers directly access the memory where
variables are stored. Ruby does not allow that. Still, there are times while
working in Ruby when you need to access the underlying bits and bytes. Ruby
provides two methods, pack and unpack, for that.
Here is an example.
> 'A'.unpack('b*')
=> ["10000010"]
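In the above case 'A' is a string and I am using unpack to read its bit
values. The ASCII table says that the ASCII value of 'A' is 65, and the binary
representation of 65 is 01000001. The b* directive lists the bits of each byte
starting from the least significant one, which is why the result reads as the
reverse: 10000010.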
Here is another example.
> 'A'.unpack('B*')
=> ["01000001"]
Notice the difference in result from the first case. What's the difference
between b* and B*? In order to understand the difference, let's first discuss
MSB and LSB.
Not all bits are created equal. 'C' has an ASCII value of 67. The binary value
of 67 is 1000011.
First let's discuss MSB (most significant bit) style. If you are following MSB
style then, going from left to right (and you always go from left to right),
the most significant bit comes first. Because the most significant bit comes
first, we can pad an additional zero to the left to make the number of bits
eight. After adding the zero the binary value looks like 01000011.
If we want to convert this value to LSB (least significant bit) style then we need to store the least significant bit first, going from left to right. Given below is how the bits move when converting from MSB to LSB. Note that position 1 refers to the leftmost bit.
move value 1 from position 8 of MSB to position 1 of LSB
move value 1 from position 7 of MSB to position 2 of LSB
move value 0 from position 6 of MSB to position 3 of LSB
and so on and so forth
After the exercise is over the value will look like 11000010.
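The manual exercise above can be reproduced in a few lines of Ruby. This is only a minimal sketch; the variable names are mine and are used purely for illustration.

```ruby
# Binary digits of 67, most significant bit first, padded to 8 bits.
msb = 67.to_s(2).rjust(8, '0')

# Reading the same eight bits from the other end gives the LSB-first form.
lsb = msb.reverse

puts msb   # 01000011
puts lsb   # 11000010
```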
We did this exercise manually to understand the difference between most
significant bit and least significant bit. However, the unpack method can
directly give the result in both MSB and LSB styles. It accepts both b* and B*
as directives. As per the Ruby documentation, here is the difference.
B | bit string (MSB first)
b | bit string (LSB first)
Now let's take a look at two examples.
> 'C'.unpack('b*')
=> ["11000010"]
> 'C'.unpack('B*')
=> ["01000011"]
Both b* and B* are looking at the same underlying data. They just represent
the data differently.
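One detail worth checking for strings longer than one character: the bit order is flipped within each byte, not across the whole string. The sketch below, which uses the string "hi" and variable names of my own choosing, rebuilds the b* output from the B* output by reversing each group of eight bits.

```ruby
msb_bits = "hi".unpack('B*').first
lsb_bits = "hi".unpack('b*').first

# Split the MSB-first string into 8-bit groups and reverse each group.
rebuilt = msb_bits.scan(/.{8}/).map(&:reverse).join

puts rebuilt == lsb_bits            # true
puts msb_bits.reverse == lsb_bits   # false, reversing the whole string is not the same
```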
Let's say that I want the binary value for the string hello. Based on the
discussion in the last section that should be easy now.
> "hello".unpack('B*')
=> ["0110100001100101011011000110110001101111"]
The same information can also be derived as
> "hello".unpack('C*').map {|e| e.to_s 2}
=> ["1101000", "1100101", "1101100", "1101100", "1101111"]
Let's break down the previous statement in small steps.
> "hello".unpack('C*')
=> [104, 101, 108, 108, 111]
The directive C* gives the 8-bit unsigned integer value of each character.
Note that the ASCII value of h is 104, the ASCII value of e is 101, and so on.
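Note that to_s(2) drops leading zeros, so each piece above has seven digits while B* always emits eight per byte. A small sketch, assuming we simply left-pad every value to eight digits, shows that the two approaches then agree exactly.

```ruby
bytes = "hello".unpack('C*')

# Pad each binary string to 8 digits so it lines up with the B* output.
padded = bytes.map { |e| e.to_s(2).rjust(8, '0') }

puts padded.inspect
puts padded.join == "hello".unpack('B*').first   # true
```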
Using the technique discussed above I can find the hex value of the string.
> "hello".unpack('C*').map {|e| e.to_s 16}
=> ["68", "65", "6c", "6c", "6f"]
The hex value can also be obtained directly.
> "hello".unpack('H*')
=> ["68656c6c6f"]
Notice the difference in the below two cases.
> "hello".unpack('H*')
=> ["68656c6c6f"]
> "hello".unpack('h*')
=> ["8656c6c6f6"]
As per the Ruby documentation for unpack:
H | hex string (high nibble first)
h | hex string (low nibble first)
A byte consists of 8 bits. A nibble consists of 4 bits. So a byte has two
nibbles. The ASCII value of 'h' is 104. The hex value of 104 is 68. This 68 is
stored in two nibbles. The first nibble, meaning 4 bits, contains the value 6
and the second nibble contains the value 8. In general we deal with the high
nibble first, and going from left to right we pick the value 6 and then 8.
However, if you are dealing with low nibble first then the low nibble value 8
will take the first slot and then 6 will come. Hence the result in "low nibble
first" mode will be 86.
This pattern is repeated for each byte. Because of that, a hex value of
68 65 6c 6c 6f looks like 86 56 c6 c6 f6 in low nibble first format.
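The per-byte swap described above can be checked directly. This is a minimal sketch with illustrative variable names: it takes the H* output, swaps the two hex digits of every byte, and compares the result with the h* output.

```ruby
high_first = "hello".unpack('H*').first
low_first  = "hello".unpack('h*').first

# Each byte is two hex digits; swap them to go from high-nibble-first
# to low-nibble-first.
swapped = high_first.scan(/../).map(&:reverse).join

puts swapped                # 8656c6c6f6
puts swapped == low_first   # true
```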
In all the previous examples I used *. A * means to keep going as long as
there is data left to consume. Let's see a few examples.
A single C will get a single byte.
> "hello".unpack('C')
=> [104]
You can add more Cs if you like.
> "hello".unpack('CC')
=> [104, 101]
> "hello".unpack('CCC')
=> [104, 101, 108]
> "hello".unpack('CCCCC')
=> [104, 101, 108, 108, 111]
Rather than repeating all those directives, I can put a number after a directive to denote how many times it should be repeated.
> "hello".unpack('C5')
=> [104, 101, 108, 108, 111]
I can use * to capture all the remaining bytes.
> "hello".unpack('C*')
=> [104, 101, 108, 108, 111]
Below is an example where MSB and LSB are being mixed.
> "aa".unpack('b8B8')
=> ["10000110", "01100001"]
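Mixing works because each directive consumes its own slice of the string, in order. As a further sketch using only directives covered so far, the template below reads the first byte of "hello" as an integer, the second as two hex digits, and the third as eight bits. The variable names are mine.

```ruby
first, second, third = "hello".unpack('CH2B8')

puts first    # 104       ('h' as an 8-bit unsigned integer)
puts second   # 65        ('e' as two hex digits, high nibble first)
puts third    # 01101100  ('l' as eight bits, MSB first)
```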
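The pack method does the reverse of unpack: it takes an array of values and
packs them into a binary string according to the directives. Let's discuss a
few examples.
> [0b1000001].pack('C')
=> "A"
In the above case the binary value 1000001 (decimal 65) is being packed as an
8-bit unsigned integer and the result is 'A'. The 0b prefix marks the literal
as binary; without it Ruby would read 1000001 as a decimal number.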
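Since pack and unpack accept the same directives, a round trip is a handy sanity check. The sketch below unpacks "hello" into bytes and packs them back, and also packs bit strings with the B* and b* directives seen earlier. Variable names are illustrative.

```ruby
bytes    = "hello".unpack('C*')   # [104, 101, 108, 108, 111]
restored = bytes.pack('C*')

puts restored                     # hello

# Bit strings can be packed too, in either bit order.
from_msb = ["01000001"].pack('B*')
from_lsb = ["10000010"].pack('b*')

puts from_msb                     # A
puts from_lsb                     # A
```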
> ['A'].pack('H')
=> "\xA0"
In the above case the input 'A' is not ASCII 'A' but hex 'A'. Why is it hex
'A'? Because the directive 'H' is telling pack to treat the input value as a
hex value. Since 'H' is high nibble first and the input has only one nibble,
the second nibble is zero. So the input changes from ['A'] to ['A0'].
Since hex value A0 does not translate into anything in the ASCII table, the
final output is left as is and hence the result is \xA0. The leading \x
indicates that the value is a hex value.
Notice that in hex notation A is the same as a. So in the above example I can
replace A with a and the result should not change. Let's try that.
> ['a'].pack('H')
=> "\xA0"
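To make the zero padding visible, the sketch below spells out both nibbles and compares the result with the one-nibble form. Here H2 asks pack for exactly two hex digits; the variable names are only for illustration.

```ruby
one_nibble  = ['a'].pack('H')     # missing second nibble is filled with zero
two_nibbles = ['a0'].pack('H2')   # both nibbles given explicitly
upper_case  = ['A0'].pack('H2')   # hex digits are case-insensitive

puts one_nibble == two_nibbles          # true
puts upper_case == two_nibbles          # true
puts two_nibbles.unpack('C*').inspect   # [160], which is 0xA0
```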
Let's discuss another example.
> ['a'].pack('h')
=> "\n"
In the above example notice the change. I changed the directive from H to h.
Since h means low nibble first and the input has only one nibble, that nibble
is taken as the low nibble and the missing high nibble becomes zero. That means
the byte is built from a high nibble of 0 and a low nibble of a, so the output
will be \x0A. If you look at the ASCII table then hex value 0A is ASCII value
10, which is NL (line feed, new line). Hence we see \n as the output because it
represents a new line.
I did a quick grep in the Rails source code and found the following usages of unpack.
email_address_obfuscated.unpack('C*')
'mailto:'.unpack('C*')
email_address.unpack('C*')
char.unpack('H2')
column.class.string_to_binary(value).unpack("H*")
data.unpack("m")
s.unpack("U*")
We have already seen the directives C* and H for unpack. With unpack, the
directive m decodes base64 encoded data and the directive U* gives the Unicode
code point of each UTF-8 character. Here is an example.
> "Hello".unpack('U*')
=> [72, 101, 108, 108, 111]
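For plain ASCII text U* and C* return the same numbers, so the difference only shows up with multi-byte characters. The sketch below uses the character é (written as "\u00e9" so the example does not depend on the file encoding) and also runs a base64 round trip with m. The variable names are mine.

```ruby
# Base64: pack('m') encodes, unpack('m') decodes.
encoded = ["hello"].pack('m')
decoded = encoded.unpack('m').first

puts encoded.inspect       # "aGVsbG8=\n"
puts decoded               # hello

# U* yields one code point per character, C* yields one integer per byte.
e_acute     = "\u00e9"
code_points = e_acute.unpack('U*')
raw_bytes   = e_acute.unpack('C*')

puts code_points.inspect   # [233]
puts raw_bytes.inspect     # [195, 169]
```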
The above code was tested with Ruby 1.9.2.
A French version of this article is available here.