This article covers things you’ll encounter while working with Strings and raw bytes, explained with real-world examples. I tried to design the images to focus on what we are talking about. Hope you like them.
Elixir Version
All the examples used in this article were executed in iex using the following combination of Elixir/Erlang OTP.
Gentle Intro
In one of my projects I had a heavy workout on packet parsing: using header lengths on raw binaries, and decoding and encoding 16-, 32-, and 64-bit strings. So I thought I would share the experience.
I hope you already know the difference between bitstring, binary, bit, and byte. If true, do: skip the following screenshot, else: have a glance at it.
🔥 “Every binary is a bitstring, but not every bitstring is a binary”
In Elixir, a binary is written with <<>>. Of course, everybody knows that.
iex> data = <<"hello">>
"hello"
iex> is_binary data
true
iex> is_bitstring data
true
iex> data2 = <<1,2,3::4>>
<<1, 2, 3::size(4)>>
iex> is_bitstring data2
true
iex> is_binary data2
false
What makes a binary different from a bitstring?
If the number of bits is a multiple of 8, we call it a binary.
Consider the following example.
<<1,2,3::4>>
In the line above, we did not specify the number of bits to use for 1 and 2, but we did for 3. In Elixir, if the size is not given, it defaults to 8 bits. So <<1,2,3::4>> is equal to <<1::8, 2::8, 3::4>>, which is 20 bits of data. We cannot call it a binary because 20 is not a multiple of 8.
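The arithmetic above can be checked directly. This is a small sketch using only Kernel functions; the variable names data and padded are just for illustration.

```elixir
data = <<1, 2, 3::4>>

# Sizes that are left out default to 8 bits, so both spellings build the same value.
IO.inspect(data == <<1::8, 2::8, 3::4>>)   # true

# 8 + 8 + 4 = 20 bits in total.
IO.inspect(bit_size(data))                 # 20

# 20 is not a multiple of 8, so this is a bitstring but not a binary.
IO.inspect(is_bitstring(data))             # true
IO.inspect(is_binary(data))                # false

# Padding with 4 more bits brings it to 24 bits, which is a binary again.
padded = <<data::bitstring, 0::4>>
IO.inspect(is_binary(padded))              # true
```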
Have a look at the following representation.
Raw bytes and Understanding Elixir representation
Strings in Elixir are binaries. Sorry for repeating the same statement again and again, but I have to. Even if someone wakes you up in the middle of the night and asks, you should be able to say it.
Consider the word hello. Each of its letters (graphemes) is an ASCII character and takes 8 bits, so the total byte_size of hello is 5.
iex> byte_size "hello"
5
iex> String.graphemes "hello"
["h", "e", "l", "l", "o"]
iex> String.valid? <<35>>
true
iex> <<35>>
"#"   # a valid string
The ASCII (American Standard Code for Information Interchange) code for # is 35. The binary representation of 35 is 100011, which is 6 bits of data.
Here, <<35>> tells Elixir to use 8 bits for 35, so 00100011 is the stored form of 35. If you write <<35::6>> instead, you get a 6-bit bitstring, which falls under raw data rather than a valid string.
iex> <<35::6>>
<<35::size(6)>>
iex> String.valid?(<<35::6>>)
false
iex> String.valid?(<<35::8>>)
true
Understanding Elixir Representation
Consider the following lines of code
iex> match?("#", <<35>>)
true
iex> match? "#", <<0::1, 0::1, 1::1, 0::1, 0::1, 0::1, 1::1, 1::1>>
true
iex> match? <<35>>, <<0::1, 0::1, 1::1, 0::1, 0::1, 0::1,1::1,1::1>>
true
Here, we are literally spelling out each bit of <<35::8>> as <<0::1, 0::1, 1::1, 0::1, 0::1, 0::1, 1::1, 1::1>>.
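If typing eight 1-bit segments by hand feels tedious, the same breakdown can be produced programmatically. The snippet below is a sketch that uses a bitstring generator; the names digits, bits, and rebuilt are only for illustration.

```elixir
# Binary digits of 35 without padding: only 6 digits.
digits = Integer.to_string(35, 2)
IO.inspect(digits)                               # "100011"

# Padded to the 8 bits that <<35>> actually occupies.
IO.inspect(String.pad_leading(digits, 8, "0"))   # "00100011"

# Walk over "#" one bit at a time.
bits = for <<bit::1 <- "#">>, do: bit
IO.inspect(bits)                                 # [0, 0, 1, 0, 0, 0, 1, 1]

# Put the bits back together, one 1-bit segment at a time.
rebuilt = Enum.reduce(bits, <<>>, fn bit, acc -> <<acc::bitstring, bit::1>> end)
IO.inspect(rebuilt == "#")                       # true
```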
Backstory of Learning
When I was learning the basics of programming in Elixir, I used to turn the page without reading whenever I saw the symbols <<>>. These symbols were a nightmare when I was still a kid in Elixir terms. Learning them felt like hitting a mountain with your head at a speed of 200. Just imagine.
OK, stories aside. Once you have a clear picture in your mind of what raw bytes and valid strings are, you’ll climb the mountain with ease.
Programmers deal with raw bytes far more often than with Strings, especially those who do a lot of parsing.
Programmers count in memory (bytes), not in length (characters).
Remember that line; we will come back to it in depth later in the article.
Extracting Sub-String
This is a real-world situation.
Extracting a String of Known Length
If you know the exact length of the string and the position from which you want to extract, you can go with the following approach.
Using binary_part for raw bytes
When you work on a real-world project, you’ll run into situations that deal with raw bytes of data. I suggest you learn as much as possible before working with them.
iex> binary_part("hello medium", 6, 6)
"medium"
binary_part(binary, start, length) extracts length bytes from binary, beginning at byte offset start. It is used for splitting raw bytes of data.
When length is negative and within bounds, it extracts the bytes that come before start, going from right to left instead of left to right.
Things to remember.
→ The start index cannot be negative.
→ Binaries are zero-indexed, which means binary_part("hello", 1, 1) results in e, not h. To get h you have to use binary_part("hello", 0, 1).
→ The range given by start and length cannot go past the byte_size of the string. Otherwise, it raises an ArgumentError exception.
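The points above can be tried out in a few lines. This is a sketch only; safe_part is a hypothetical helper, not something from Elixir itself, that shows one way to turn the exception into a return value.

```elixir
str = "hello medium"

# Zero-indexed: offset 0 is the first byte.
IO.inspect(binary_part(str, 0, 1))                 # "h"
IO.inspect(binary_part(str, 6, 6))                 # "medium"

# Negative length: take the 6 bytes that end at the given offset.
IO.inspect(binary_part(str, byte_size(str), -6))   # "medium"

# Hypothetical wrapper: out-of-bounds arguments raise ArgumentError,
# so rescue it when the input cannot be trusted.
safe_part = fn bin, start, len ->
  try do
    {:ok, binary_part(bin, start, len)}
  rescue
    ArgumentError -> :error
  end
end

IO.inspect(safe_part.(str, 6, 6))                  # {:ok, "medium"}
IO.inspect(safe_part.(str, 6, 10))                 # :error
```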
Using binary_part in Guard clause
binary_part can be used in guard clauses as well.
Example: Packet Parsing
For example, say you are parsing packets like $admin#medium#worlds#best#blog and $user#blackode#a#medium#writer. You are asked to write a definition that receives a packet and tells each kind of packet apart.
You can do this by splitting the packet with String.split(packet, "#") and using the if macro to do the job, but that takes more code. Instead, you can use binary_part in a guard clause like the following.
defmodule Parser do
  # Matches when the 5 bytes right after the leading "$" spell "admin"
  def parse(packet) when binary_part(packet, 1, 5) == "admin" do
    IO.puts("Admin Packet !")
  end
  ...
end
Check out the execution screenshot
Warning
As I already mentioned in the things to remember section, if either the start or the length value is out of bounds, binary_part raises an ArgumentError exception. Inside a guard clause the error does not propagate; the guard just fails and the clause does not match.
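To see the guard approach handle more than one kind of packet, here is a minimal sketch. The module name PacketKind, the function kind/1, and the returned atoms are all hypothetical and chosen only for this illustration; they are not part of the Parser module above.

```elixir
defmodule PacketKind do
  # Hypothetical module for illustration only.
  # Each clause looks at the bytes right after the leading "$".
  def kind(packet) when binary_part(packet, 1, 5) == "admin", do: :admin
  def kind(packet) when binary_part(packet, 1, 4) == "user", do: :user

  # Reached when no guard matched, including packets that are too short:
  # a failing binary_part inside a guard does not raise.
  def kind(_packet), do: :unknown
end

IO.inspect(PacketKind.kind("$admin#medium#worlds#best#blog"))   # :admin
IO.inspect(PacketKind.kind("$user#blackode#a#medium#writer"))   # :user
IO.inspect(PacketKind.kind("$a"))                               # :unknown
```

Keeping a catch-all clause last means a malformed packet produces a value you can handle instead of a FunctionClauseError.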
Extracting a String of Unknown Length
If you don’t know the length of the substring, binary_part is awkward to use, because you have to work out the length yourself. This is where binary pattern matching with <<>> comes in handy.
Situation
You are asked to extract the string from position 6 to the end of the string.
A String in Elixir always has a bit size that is a multiple of 8, which is why we call it a binary. In other words, if the bit_size is divisible by 8, we call that bitstring a binary.
As we discussed in the intro section, each ASCII letter in a string takes 8 bits, that is 1 byte. So, to skip 6 letters you have to skip 6x8 = 48 bits.
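In code, skipping those 6 bytes could look like the sketch below. It assumes an ASCII prefix, so that 6 letters really are 6 bytes; the variable names are only for illustration.

```elixir
str = "hello medium"

# Skip 6 bytes by size and keep whatever is left as a binary.
<<_skipped::binary-size(6), rest::binary>> = str
IO.inspect(rest)        # "medium"

# The same thing counted in bits: 6 * 8 = 48.
<<_::48, rest_bits::binary>> = str
IO.inspect(rest_bits)   # "medium"
```

The rest::binary segment has no size, so it matches everything up to the end, which is exactly what is needed when the length is unknown.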
Extracting the First Letter of a String
Situation
Extract the currency symbol from string “$500”
This can be achieved in many ways
String.first
iex> string = "$500"
"$500"
iex> string |> String.first
"$"
Pattern Matching
iex> string = "$500"
"$500"
iex> <<first::8,_rest::binary>> = string
"$500"
iex> <<first>>
"$"
iex> first
36   # ASCII code of $
iex> <<35>>
"#"
String.split
Not recommended in this situation, but it is good to know the option exists.
As we know, it splits the string on the given pattern. If the pattern is "", it splits between every grapheme and also yields empty strings at both ends.
iex> string = "$500"
"$500"
iex> string |> String.split("")
["", "$", "5", "0", "0", ""]
🔥 Note that there is no space between the quotes in "".
If you look closely, it added an extra "" at the head and the tail. You have to trim them by passing the option trim: true.
iex> string = "$500"
"$500"
iex> string |> String.split("", trim: true) |> hd
"$"
String.slice
iex> String.slice "$500", 0, 1
"$"
iex> String.slice "$500", -4, 1
"$"
String.slice [ VS ] binary_part
As we know, both take arguments (str, start, len) and return a substring starting at the offset start with length len.
I kept wondering why there would be two functions with such similar functionality, so I started checking out what differentiates them.
Out-of-bounds behavior
When start and len go out of bounds, binary_part raises an ArgumentError, as it is designed to work on raw bytes. String.slice does not: it works in terms of String.length and returns whatever is available.
Let’s check that.
iex> str = "hello medium"
"hello medium"
iex> String.slice str, 6, 10
"medium"
iex> binary_part str, 6, 10
** (ArgumentError) argument error
    :erlang.binary_part("hello medium", 6, 10)
Here, only 6 letters remain after position 6, but we tried to extract a substring of len 10. So binary_part raised an error, whereas String.slice returned the substring from index 6 to the end of the string. Hope you got the point.
Raw Bytes and Graphemes
In String.slice(str, start, len), start is an index into the graphemes, whereas in binary_part it is an index into the bytes.
It will be more clear with the following example.
iex> str = "hełło"
"hełło"
iex> String.length str
5
iex> byte_size str
7
iex> String.graphemes str
["h", "e", "ł", "ł", "o"]
I hope you understand what I mean by graphemes. The grapheme length of str is 5, but its byte_size is 7, and that is where these functions differ from each other.
byte_size/1 counts the underlying raw bytes, and String.length/1 counts graphemes (what a reader sees as characters).
String.slice works in Unicode graphemes and binary_part works in bytes. In general, reach for binary_part when you are dealing with raw bytes.
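A quick way to feel the difference is to give both functions the same numbers. The sketch below also shows the ::utf8 segment type, which matches one whole code point rather than one byte; the "€500" input is an assumption made for illustration and is not one of the article's examples.

```elixir
str = "hełło"

IO.inspect(String.length(str))        # 5 graphemes
IO.inspect(byte_size(str))            # 7 bytes, each "ł" takes 2

# Same start and length, different units.
IO.inspect(String.slice(str, 2, 2))   # "łł" (2 graphemes)
IO.inspect(binary_part(str, 2, 2))    # "ł"  (2 bytes, one grapheme)

# Cutting in the middle of a character leaves bytes that are not valid UTF-8.
IO.inspect(String.valid?(binary_part(str, 2, 3)))   # false

# Pattern matching has the same split: ::8 is one byte, ::utf8 is one code point.
<<symbol::utf8, amount::binary>> = "€500"
IO.inspect(<<symbol::utf8>>)          # "€"
IO.inspect(amount)                    # "500"
```

So a match like <<first::8, _rest::binary>> is only safe for pulling out the first letter when that letter is ASCII.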
Internal Representation of String (Raw Bytes)
iex> str = "hełło"
"hełło"
iex> raw = str <> <<0>>
<<104, 101, 197, 130, 197, 130, 111, 0>>
iex> String.slice raw, 2, 3
"łło"
iex> binary_part raw, 2, 3
<<197, 130, 197>>
Elixir has a Base module which helps you with encoding and decoding binaries. Have a look at its documentation.
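As a small taste of the Base module, here is a sketch of hex and Base64 round trips. The inputs are just the strings and bytes already used in this article.

```elixir
# Hex (base 16) encoding and decoding.
IO.inspect(Base.encode16("hello"))               # "68656C6C6F"
IO.inspect(Base.decode16("68656C6C6F"))          # {:ok, "hello"}

# Base64 encoding and decoding.
IO.inspect(Base.encode64("hello"))               # "aGVsbG8="
IO.inspect(Base.decode64!("aGVsbG8="))           # "hello"

# Raw bytes that do not print nicely are easier to read as hex.
raw = <<104, 101, 197, 130, 197, 130, 111, 0>>
IO.inspect(Base.encode16(raw, case: :lower))     # "6865c582c5826f00"
```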
Hope you enjoyed playing with strings. Practice makes perfect. Try to parse an IPv4 packet based on its header length.
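If you want a starting point for that exercise, here is one possible sketch, not a full parser. It relies on the fact that the first byte of an IPv4 header carries the version in its top 4 bits and the header length (IHL, counted in 32-bit words) in its low 4 bits. The module name IPv4Sketch, the function split/1, the returned tuples, and the sample packet are all made up for illustration.

```elixir
defmodule IPv4Sketch do
  # Hypothetical helper for illustration only.
  # Splits a packet into header and payload using the IHL field.
  def split(<<4::4, ihl::4, _::bitstring>> = packet) when ihl >= 5 do
    # IHL counts 32-bit words, so the header is ihl * 4 bytes long.
    header_bytes = ihl * 4

    case packet do
      <<header::binary-size(header_bytes), payload::binary>> ->
        {:ok, header, payload}

      _ ->
        {:error, :truncated}
    end
  end

  def split(_other), do: {:error, :not_ipv4}
end

# A made-up 20-byte header (IHL = 5, every other field zeroed) plus a payload.
packet = <<4::4, 5::4, 0::152, "payload">>
{:ok, header, payload} = IPv4Sketch.split(packet)
IO.inspect(byte_size(header))   # 20
IO.inspect(payload)             # "payload"
```

The remaining header fields are left for you to pattern match on your own.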
If you found this helpful, please share it so that others can benefit from it too.
Check out the GitHub repository on Killer Elixir Tips
Glad if you can contribute with a ★