Base64 encoding
In this article I take a quick look at the Base64 binary-to-text encoding system, and see how to encode and decode in Python.
Why binary to text encoding?
Back in the day, there was a need to transmit binary data, such as images, over dial-up communication lines. The problem with sending raw binary data is that the protocol can end up treating certain byte values as control codes. For example, 00000010 might be interpreted as the STX (start of text) control code when it is actually just binary data in an image. The solution was to convert binary data into safe text. Safe text consists of the ASCII characters A-Z, a-z, 0-9, +, / and =. These are standard printable ASCII characters and are never mistaken for control codes.
What is Base64 encoding?
Base64 is a 6-bit code in which each of the 'safe' ASCII characters has a corresponding 6-bit binary value. Six bits means there are 64 entries in the encoding table, where 0, or 000000, is A and 63, or 111111, is /. The = character is used for output padding and does not have a position in the table.
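As a small sketch of the table described above, the 64 entries can be built from Python's string constants. The name B64_TABLE is my own choice for illustration and is not part of the base64 module.

```python
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, then '+' and '/'
# (B64_TABLE is an illustrative name, not a library constant)
B64_TABLE = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

print(len(B64_TABLE))        # 64 entries, one per 6-bit value
print(B64_TABLE[0b000000])   # A
print(B64_TABLE[0b111111])   # /
print(B64_TABLE[0b010000])   # Q
```

Indexing the string with a 6-bit value gives the character for that value, which is all the lookup table does.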
Output padding
Let's imagine I just want to encode A, which has an ASCII binary value of 01000001 (0x41). I first encode the leading six bits, 010000, as Q, which leaves 01 still to encode. I append four 0 bits to get another 010000, which is also Q. So far I've got QQ. But QQ only represents 12 bits, and the encoded string must represent a multiple of 8 bits when it is decoded. This is achieved with output padding. Adding just one padding character, to get QQ=, is not enough, as this would represent 3 * 6 = 18 bits, and 18 is not a multiple of 8. A second padding character gives QQ==, which is 4 * 6 = 24 bits, a multiple of 8. In other words, the output is always padded to a multiple of four characters.
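The steps for encoding a single A can be sketched by hand in Python. This is only an illustration of the bit shuffling, not how the base64 module is implemented, and the variable names are my own.

```python
import base64

# Illustrative lookup table (same alphabet as standard Base64)
B64_TABLE = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

bits = format(ord("A"), "08b")      # '01000001'
bits += "0" * (-len(bits) % 6)      # append zero bits up to a multiple of 6: '010000010000'

# Split into 6-bit groups and look each one up in the table
groups = [bits[i:i + 6] for i in range(0, len(bits), 6)]    # ['010000', '010000']
encoded = "".join(B64_TABLE[int(g, 2)] for g in groups)     # 'QQ'

# Pad the output with '=' up to a multiple of 4 characters
encoded += "=" * (-len(encoded) % 4)

print(encoded)                                  # QQ==
print(base64.b64encode(b"A").decode("ascii"))   # QQ==
```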
By way of another example, let's encode AA. This is 01000001 and 01000001, which regroups as 010000 010100 0001. The first 6 bits are 010000, or Q. The next 6 bits are 010100, or U. The remaining 4 bits are 0001, which are extended with zeros to 000100, or E. This gives a total of 18 bits of encoded characters (3 * 6). Adding one padding character brings the total to 4 * 6 = 24 bits, a multiple of 8, so the output is QUE=.
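The same procedure can be generalised to any input. The function below is a minimal sketch of the manual process, assuming the standard alphabet; encode_manual is a hypothetical helper name, and in real code you would use base64.b64encode instead.

```python
import base64

# Illustrative lookup table (same alphabet as standard Base64)
B64_TABLE = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_manual(data: bytes) -> str:
    """Encode bytes as Base64 by hand, for illustration only."""
    # Concatenate every byte as 8 bits
    bits = "".join(format(byte, "08b") for byte in data)
    # Append zero bits so the length is a multiple of 6
    bits += "0" * (-len(bits) % 6)
    # Look up each 6-bit group in the table
    encoded = "".join(B64_TABLE[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    # Pad the output with '=' to a multiple of 4 characters
    return encoded + "=" * (-len(encoded) % 4)

print(encode_manual(b"AA"))    # QUE=
print(encode_manual(b"AA") == base64.b64encode(b"AA").decode("ascii"))   # True
```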
Further, an input of AAA does not require any padding characters, as it encodes to 24 bits (QUFB is 4 * 6 = 24 bits). AAABBB does not require padding either (QUFBQkJC is 8 * 6 = 48 bits). Basically, if the input length is a multiple of 3 bytes, no padding is required. For example, AAABBBCCC does not require padding. AAABBBCCCD is two characters short of a multiple of 3, so it requires two padding characters, for an output of QUFBQkJCQ0NDRA==.
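The padding rule can be checked quickly against the examples above. The helper padding_needed is a name I have made up for this sketch; it simply counts how many bytes the input is short of a multiple of 3.

```python
import base64

def padding_needed(n_bytes: int) -> int:
    """Number of '=' characters for an input of n_bytes (illustrative helper)."""
    return (3 - n_bytes % 3) % 3

for text in ("A", "AA", "AAA", "AAABBB", "AAABBBCCC", "AAABBBCCCD"):
    encoded = base64.b64encode(text.encode("ascii")).decode("ascii")
    # The predicted padding should match the number of '=' in the real output
    print(text, encoded, padding_needed(len(text)), encoded.count("="))
```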
You can test this encoding (and decoding) and the output padding process here.
How to encode and decode in Python
The following code shows a simple example. You read an image file as binary data. You then encode it, and decode it, and write the decoded data out to a file. You can then compare the output file to the input file to make sure they are the same. You can also optionally print out the encoded string if you want.
import base64

# Base64 encodes binary data as ASCII text
fn1 = "timmy.jpg"
fn2 = "test.jpg"

# Open the input file in binary mode and read its bytes
# (the name 'data' avoids shadowing Python's built-in 'bytes' type)
with open(fn1, "rb") as f:
    data = f.read()

# Encode the picture as Base64 (b64encode returns a bytes object of ASCII characters)
e = base64.b64encode(data)
# print(e)

# Decode the Base64 data back into the original bytes
d = base64.b64decode(e)

# Write the decoded bytes to the test file
with open(fn2, "wb") as f:
    f.write(d)

# Make sure you check the test file now displays correctly
You'll notice how easy Python makes it to both encode and decode.
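If you don't have an image to hand, the round trip can also be checked in memory. The sketch below uses a made-up byte sequence as a stand-in for file contents, and shows that b64encode returns bytes, which can be turned into a str if you need text.

```python
import base64

# Stand-in for file contents (an assumption for this example): every possible byte value
original = bytes(range(256))

encoded = base64.b64encode(original)   # bytes object containing only ASCII characters
text = encoded.decode("ascii")         # convert to str if text is needed
decoded = base64.b64decode(text)       # b64decode accepts ASCII str or bytes

print(type(encoded).__name__)   # bytes
print(decoded == original)      # True
```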
Typical applications
Base64 is common where you need to send binary data over systems designed primarily to handle text. An example of this is an email system. Here, binary attachments have to be encoded as text for transmission. Any attached files such as images will be Base64 encoded, and sent as text.
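As a rough illustration of the email case, Python's standard email package Base64 encodes a binary attachment for you. The attachment bytes, subject, and file name below are made up for the example, and this is only a sketch of building a message, not of sending one.

```python
from email.message import EmailMessage

# Made-up stand-in for image data, including a byte that looks like STX (0x02)
image_bytes = bytes([0x00, 0x02, 0xFF, 0x10])

msg = EmailMessage()
msg["Subject"] = "Picture attached"
msg.set_content("See the attached file.")

# Binary attachments are given a base64 transfer encoding by default
msg.add_attachment(
    image_bytes,
    maintype="application",
    subtype="octet-stream",
    filename="timmy.bin",
)

attachment = next(msg.iter_attachments())
print(attachment["Content-Transfer-Encoding"])    # base64
print(attachment.get_payload())                   # the attachment as Base64 text
print(attachment.get_content() == image_bytes)    # True
```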
Summary
In this article I took a very quick look at Base64 encoding, and saw how you can encode and decode Base64 using Python.