# The Byte is dead.



## PMc (Jul 26, 2020)

This is something funny I just ran into.

Computers only process 0 and 1. But since handling single bits is laborious and impractical, there were agreements to group these bits into larger entities. There were some different approaches, but with the spread of usually 8-bit microprocessors, the octet, or byte, has become the generally established grouping scheme.

This was all fine as long as the characters of the commonly used language, i.e. English, would fit into these 8 bits. One would then only need to agree on some mapping table to give numbers to the characters - like ASCII or EBCDIC - and everything is fine. One would not need to care about what the bits might actually represent - deciding that is left entirely to the final addressee.

But now this is different, now we use UTF-8 for the characters of the spoken language, and a byte is no longer a character.

So, what happened - I tried to store away some cipher material. And, in the time of web applications, the best place to store away some lounging-around cipher material is the cookie. Only, this is not so very easy, and this is where the fun starts. Cipher material by its nature is random (or at least it should look as much like random as possible, because that's the whole point in it). Now if you try to find out what would be the allowed character set for a cookie, that might become entertaining[1] - but that's not really the problem, because the application should already care for that.

What the application does is: it encrypts the cookie data, it adds a signature hash, and then most likely it does base64 on it, because of the aforementioned character set issues. So I thought it should be just fine to throw my cipher material in there, and it should be well cared for.

But that wasn't the case; I got a complaint: *UndefinedConversionError "\xAD" from ASCII-8BIT to UTF-8* (so now You know one of the bytes from my cipher material).

As usual, I looked into the code to see what is happening there, and it seems the application does not only encrypt&sign the data; as a safety measure it converts it to JSON beforehand. And JSON, as it seems, has no notion whatsoever of what a byte might be; it only knows data as language text, and language text must have a correct encoding.
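To make the failure concrete, here is a minimal sketch in Python (not the Ruby framework from the post, so the exception names differ). The three-byte `key` is a made-up stand-in for cipher material; only the 0xAD byte comes from the error message above.

```python
import json

# Hypothetical cipher material; 0xAD is the byte from the error message.
key = bytes([0x13, 0xAD, 0x37])

# JSON strings are sequences of Unicode characters, so raw bytes have no
# JSON representation at all.
try:
    json.dumps({"key": key})
    json_accepts_bytes = True
except TypeError:
    json_accepts_bytes = False

# Pretending the bytes are UTF-8 text does not help either: a lone 0xAD
# is a continuation byte and is not valid UTF-8 on its own.
try:
    key.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(json_accepts_bytes, is_valid_utf8)  # False False
```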

The surprising thing is: nobody ever worried about that - nobody seems to need, use or know a byte anymore.

As a proof, I give You this example. Here it is stated: _All stored data, even ASCII, has an encoding._[2]

I would think, ASCII is not "stored data", ASCII is an encoding. And I am very sure cipher material can be "stored data" and does not have a character set encoding. Neither does machine code. Neither does music. Neither do pictures. etc. etc. In fact _everything_, except the special case of language text, does not have an encoding.

So what seems to have happened here is, people have learned from the ASCII times that, for most practical purposes, byte strings and language texts can be treated the same. And now they apply that same wisdom to the UTF-8 world: anything that is not UTF-8 just doesn't exist.

The common workaround is, in the modern world of web applications all cipher material and similar delicate stuff must be treated with base64 before being handed to the application. Which in this case means, do base64, then encrypt&sign, then again do base64 - and repeat as often as necessary. Luckily we have ample bandwidth and get ever more of it - except in a cookie where there is not infinite space.
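A rough sketch in Python of what each extra base64 round costs. It ignores the encryption and signature step in between (which adds further bytes), and the 32-byte `key` is only a stand-in for real cipher material.

```python
import base64

key = bytes(range(32))            # stand-in for 32 bytes of cipher material
once = base64.b64encode(key)      # every 3 octets become 4 ASCII characters
twice = base64.b64encode(once)    # and the same 4/3 growth again

print(len(key), len(once), len(twice))  # 32 44 60
```

So two rounds nearly double the size before any JSON quoting, encryption padding or signature is counted.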

Cheerio.


[1] I don't recommend doing that, but if You do, You may come to a bit more of an understanding why I am saying HTTP is a tremendous misconception; it was never intended and it is utterly unsuited to what we are nowadays doing with it, i.e. run 75% of the world's business thru it.
[2] From here You might get a little bit of a clue about where my occasional comments concerning the developer's ivory towers derive from.


----------



## jmos (Jul 26, 2020)

PMc said:


> I would think, ASCII is not "stored data", ASCII is an encoding.


ASCII itself is 7bit, not 8bit. Using 8bit means you need to take care of an encoding. Code has to be 7bit clean. (It always creeps me out when I find e.g. German special chars in code…) And "\xAD" … is a soft hyphen included in any 8bit codepage?


----------



## unitrunker (Jul 26, 2020)

Lots of REST APIs pass authorization tokens (like SAML) in the HTTP header as base64.


----------



## Crivens (Jul 26, 2020)

Ah, so the _value=""";_ bug has bitten again...


----------



## zirias@ (Jul 26, 2020)

PMc said:


> with the spread of usually 8-bit microprocessors, the octet, or byte, has become the generally established grouping scheme.


This is not entirely true. A _byte_, at least in the sense the C language standard is based on, is the smallest sensible data unit (typically the smallest _addressable_ unit) of a machine. C mandates that a byte must have _at least_ 8 bits, so if there ever were architectures with smaller bytes, C wouldn't be (easily) portable to them. But there were for example architectures with 9-bit bytes. Sizes are measured in multiples of bytes, so if you just assume that a byte always has 8 bits, this can lead to funny effects. It doesn't matter too much in practice any more, as all relevant architectures nowadays indeed use 8-bit bytes. TLDR: what a _byte_ means is defined by the machine's hardware architecture.

An _octet_ on the other hand is a fixed unit of 8 bits.

Speaking of text encodings, these early encodings all used 7 bits (with ASCII being the only "survivor" nowadays), so there was never a direct mapping to bytes, but of course, any character encoded in ASCII will "fit" into bytes, leaving one bit unused. With Unicode, different actual encodings exist, but they have in common that a character will not typically fit in a single byte -- but if you use UTF-8, it's compatible with ASCII, so all ASCII characters will still fit. This doesn't change anything about the meaning of a byte, so I don't see how it should be "dead".
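A small Python illustration of the ASCII compatibility described above; the sample characters are arbitrary choices.

```python
ascii_char = "A".encode("utf-8")   # one byte, identical to its ASCII code
umlaut = "ä".encode("utf-8")       # two bytes
euro = "€".encode("utf-8")         # three bytes

print(ascii_char, umlaut, euro)    # b'A' b'\xc3\xa4' b'\xe2\x82\xac'

# Every byte of a multi-byte sequence has the high bit set, so it can never
# be mistaken for an ASCII character.
print(all(b >= 0x80 for b in umlaut + euro))  # True
```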



PMc said:


> it encrypts the cookie data, it adds a signature hash, and then most likely it does base64 on it [...]
> I got a complaint: *UndefinedConversionError "\xAD" from ASCII-8BIT to UTF-8 *


"Most likely"? Anyways, this sounds more like a problem with libraries/frameworks to me. Either they are used in the wrong way, or they have a problem themselves.
Encryption typically works on "raw data", which is assumed to be _octets_. You will transform one stream of octets to a different, encrypted one. If your original stream represented some text in some encoding, you have to make sure the receiver, after decrypting again, will know how to interpret that data.

To transport this encrypted data in an HTTP header (e.g. a cookie), you must transform it to some ASCII text. That's because the HTTP headers are defined to be text and the encoding assumed is ASCII. Well, this is what base64 will do: it encodes an octet stream as a stream of (7bit) ASCII characters. It doesn't care what the octet stream represents at all.
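As a quick check of that property, a Python sketch that pushes every possible octet value through base64 and back:

```python
import base64

octets = bytes(range(256))            # every possible octet value, once
text = base64.b64encode(octets)       # ASCII-only representation

is_seven_bit = all(b < 0x80 for b in text)
roundtrip = base64.b64decode(text)

print(is_seven_bit, roundtrip == octets)  # True True
```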



PMc said:


> The common workaround is, in the modern world of web applications all cipher material and similar delicate stuff must be treated with base64 before being handed to the application. Which in this case means, do base64, then encrypt&sign, then again do base64 - and repeat as often as necessary.


This sounds _very_ wrong, see above. You won't ever need to apply base64 more than once. IMHO, you shouldn't accept such a "solution", but dig deeper to find out how to do it correctly with the tools you use...


----------



## PMc (Jul 26, 2020)

jmos said:


> ASCII itself is 7bit, not 8bit. Using 8bit means you need to take care of an encoding. Code has to be 7bit clean. (It always creeps me out when I find e.g. German special chars in code…) And "\xAD" … is a soft hyphen included in any 8bit codepage?



What You describe is exactly that developer's thinking - and I don't really understand how You get there.

Now please do me a favor: if You think "code has to be 7bit clean", then go to /usr/bin, make Your binaries 7bit-clean, and then try to run them. Then maybe You will see what is wrong here.

ASCII is 7bit, alright, but a cipher key is not supposed to be ASCII. And if \xAD is contained in a cipher key, it is not a "soft hyphen", and it does not matter in which codepage that would be.
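To illustrate with Python: the same octet gets a different "meaning" under each codepage you apply to it, and none at all if you simply leave it as a number. The two codepages below are just examples.

```python
import unicodedata

raw = b"\xad"

latin1 = raw.decode("latin-1")   # ISO 8859-1 reading of the octet
cp437 = raw.decode("cp437")      # original IBM PC codepage reading

print(unicodedata.name(latin1))  # SOFT HYPHEN
print(unicodedata.name(cp437))   # INVERTED EXCLAMATION MARK
print(raw[0])                    # 173, which is all a cipher key cares about
```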

The point is, all You describe is only true for _language text_. But computers do not deal with language text only, they deal with all kinds of data. And when dealing with data, all the encodings, codepages etc. are irrelevant - the only thing that is relevant is, is the intended storage 8bit-clean? For about 25 years now, almost all storage is.



unitrunker said:


> Lots of REST APIs pass authorization tokens (like SAML) in the HTTP header as base64.



Yes, that is the usual way of doing this - because HTTP headers were designed like e-mail headers, and these were designed to contain language text only. 

This goes back to the beginnings of the Internet - protocols were always designed to be compatible with the legacy stuff, and now it is a tremendous pile of rather ugly stuff - but it can work that way, and so nobody has a real pain, and nobody worries.



Zirias said:


> "Most likely"? Anyways, this sounds more like a problem with libraries/frameworks to me. Either they are used in the wrong way, or they have a problem themselves.
> [...]
> This sounds _very_ wrong, see above. You won't ever need to apply base64 more than once. IMHO, you shouldn't accept such a "solution", but dig deeper to find out how to do it correctly with the tools you use...



That is exactly what I did, and so I found out about it all. There is nothing wrong with the libraries and the application, it all had been successively developed to do the right thing:

- Data had to be stored in the cookie, so it had to be stored with proper encoding.
- The data had to be secure and tamper-proof, so it was encrypted and signed first.
- People might put complex objects into that storage, so it was put into JSON beforehand.
- JSON is not 8bit-clean, so you have to do base64 again.

As I perceive it, the problem is with JSON, which might not be the proper tool for the purpose, as it depends on UTF-8 (which translates to: it can only handle language text, which translates to: it is not 8-bit clean) - probably because it was designed to carry JavaScript payload only?
But this again appears as just a pile of suboptimal concepts, which happen to have evolved in modern web application design.


----------



## zirias@ (Jul 26, 2020)

PMc said:


> JSON is not 8bit-clean, so you have to do base64 again.


I don't see it yet from what you describe, but something is fundamentally wrong here. First of all, 8bit-clean means that a computer (or, more relevant, communication) system can correctly handle encodings that need 8 bits per symbol, like UTF-8 does, and many other encodings do -- contrary to e.g. ASCII which only needs 7 bits. A format cannot be "8bit-clean". JSON mandates strings are encoded as UTF-8, so to transmit JSON, you need 8bit-clean systems, because UTF-8 requires 8bit symbols.

Second, it sounds like you want some JSON as part of the to-be-encrypted payload. Any encryption I know works on octet streams, it doesn't care whether this octet stream represents JSON or whatever else. You just have to make sure the receiver that decrypts the stream again knows how to interpret it. It makes definitely no sense at all to base64-encode some JSON, only to encrypt it afterwards. In that scenario (whatever the exact details are, as that's not totally clear from your descriptions), you need base64 _exactly once_: to encode an arbitrary octet-stream into a (7-bit) ASCII text.
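A minimal sketch of that layering in Python, under stated assumptions: `SECRET`, `pack` and `unpack` are hypothetical names, and the encryption step is only marked by a comment because the standard library ships no cipher. This is not how any particular framework does it; it only shows that base64 is needed once, at the very end.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"hypothetical-signing-key"   # assumption: some server-side secret

def pack(session: dict) -> str:
    payload = json.dumps(session).encode("utf-8")   # text becomes octets
    # A real framework would encrypt `payload` here; encryption maps
    # octets to octets, so nothing below would change.
    tag = hmac.new(SECRET, payload, hashlib.sha256).digest()  # 32 raw octets
    # base64 exactly once, on the final octet stream
    return base64.urlsafe_b64encode(tag + payload).decode("ascii")

def unpack(cookie: str) -> dict:
    blob = base64.urlsafe_b64decode(cookie)
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch")
    return json.loads(payload.decode("utf-8"))

cookie = pack({"user": "pmc", "n": 1})
print(unpack(cookie))  # {'user': 'pmc', 'n': 1}
```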

Oh and .. this topic has very little to do with bytes, which have at least 8 bits, see above. This is more about compatibility with, let's say, historic systems that knew only 7bit symbols for transmission.


----------



## ralphbsz (Jul 26, 2020)

Zirias said:


> This is not entirely true. A _byte_, at least in the sense the C language standard is based on, is the smallest sensible data unit (typically the smallest _addressable_ unit) of a machine.


Computers existed long before C. And many computers existed that C never ran on.

Most of the computers that had non-power-of-two word sizes actually used the term "byte". And the width of the byte was quite variable. On 36-bit machines, byte sizes of 6 and 9 bits were common. On the 60-bit Cyber, all manner of crazy configurations happened, for efficiency lots of code configured it as 6-bit bytes. The 1401 had 6-bit bytes, but actually had 8 bits in hardware; the two extra bits are a parity bit (which is not really controllable, it will be the parity of the remainder), and a word-mark bit, because the machine had a variable word size. There are quite a few machines where individual bytes are not addressable, or need pointers that have a different implementation from byte pointers.

And if you want to hear about "not 8 bit clean": on microprocessors, it was common to use the 8th bit for a separate function. For example, since only 7-bit ASCII characters could appear in strings, it was common to set the high bit to mark the end of a string (more efficient than inserting a nul character). Or for example, CP/M used to store a file's creation time in the 8th bits of the file name.
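The high-bit string terminator can be sketched in a few lines of Python. The function names `encode_hibit` and `decode_hibit` are made up for this illustration.

```python
def encode_hibit(s: str) -> bytes:
    """Store a 7-bit ASCII string, marking its last character with bit 7."""
    data = bytearray(s.encode("ascii"))   # raises on non-ASCII input
    if not data:
        raise ValueError("an empty string has no last character to mark")
    data[-1] |= 0x80
    return bytes(data)

def decode_hibit(buf: bytes) -> str:
    """Read characters until one with the high bit set has been consumed."""
    out = []
    for b in buf:
        out.append(chr(b & 0x7F))
        if b & 0x80:
            break
    return "".join(out)

packed = encode_hibit("HELLO")
print(packed)                          # b'HELL\xcf'
print(decode_hibit(packed + b"XYZ"))   # HELLO
```

It saves one byte per string compared with a nul terminator, at the price of the data no longer being usable on anything that assigns meaning to the 8th bit.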

I think the problems that PMc is having are good problems to have. Our computer culture is finally understanding that even text data has semantics; there is more to text than just saying "a whole bunch of printable ASCII or Unicode characters, terminated by nul, perhaps with a few NL thrown in". We're beginning to require that the binary image identify data objects, for example saying "this is a human-readable string, and it is encoded as follows". This is a step towards self-describing data. Look at the Burroughs machine for an example of how this helped: For every memory word, you knew whether it contained an integer, float, or instruction; self-describing data is a logical extension of that concept.


----------



## Jose (Jul 26, 2020)

I'm really not sure what you're talking about. Cookies were invented by Netscape back in the mid-'90s when the Web was the Wild West. Yes, they abused HTTP headers and created a crappy system that we're all stuck with now, but that's no valid criticism of HTTP in particular or of text-over-sockets protocols in general.

I had to work on cookies at a non-trivial scale and it was a nightmare. One of the first things I did was to turn our custom, crappy text-delimited cookies into a binary blob that I would base64 encode(*) before setting them in the browser. One of the things that caused me to pull my hair out is the fact that the server sets cookies one at a time with individual headers, but the client reports all cookies at the same time in a single header. This caused problems with our legacy Apache servers that could not handle more than 2000 bytes per HTTP header.
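The asymmetry can be sketched in Python with made-up cookie names and sizes. Whether the 2000-byte limit applied to the whole header line or only to its value is an assumption here.

```python
# Hypothetical cookies; names and sizes are invented for the illustration.
cookies = {"session": "x" * 900, "prefs": "y" * 900, "tracking": "z" * 400}
LIMIT = 2000   # per-header limit of the legacy servers mentioned above

# The server sets each cookie with its own header ...
set_cookie_headers = [f"Set-Cookie: {k}={v}" for k, v in cookies.items()]

# ... but the client sends them all back in a single header.
cookie_header = "Cookie: " + "; ".join(f"{k}={v}" for k, v in cookies.items())

print(max(len(h) for h in set_cookie_headers))  # 920
print(len(cookie_header))                       # 2235
```

Each response header stays well under the limit, yet the request header built from the same cookies exceeds it.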

I don't remember if I used JSON or not. I might've. JSON is a perfectly reasonable choice, because then you don't have to write a parser. What we had before was some nightmare delimited format where commas, semicolons, etc. had special meaning. We actually ran out of delimiters. Don't do this. Encodings are hard to design and parsers are hard to write.

In any case, I never had to double-encode anything, and in fact, I made it my mission to remove any such double encoding I found with extreme prejudice.

* There is no such thing as Base64 encoding: https://en.wikipedia.org/wiki/Base64#Implementations_and_history


----------



## zirias@ (Jul 26, 2020)

ralphbsz said:


> Computers existed long before C. And many computers existed that C never ran on.
> 
> Most of the computers that had non-power-of-two word sizes actually used the term "byte"


Sure. C is just a nice "baseline" including architectures of any relevance for decades. My point was simply to point out the difference between _byte_ and _octet_, which is, contrary to today's popular belief, not the same thing. I didn't really get what your point was?



ralphbsz said:


> And if you want to hear about "not 8 bit clean": on microprocessors, it was common to use the 8th bit for a separate function. For example, since only 7-bit ASCII characters could appear in strings, it was common to set the high bit to mark the end of a string (more efficient than inserting a nul character). Or for example, CP/M used to store a file's creation time in the 8th bits of the file name.


Well, the historical reasons for 8bit text encodings causing problems are probably numerous ... and there were also simply systems that were hard-wired (in the very sense of these words) to 7 bits. Anyways, "8bit-safe" is a property of processing and transferring systems to properly handle 8bit symbols. It doesn't make any sense to use that term for an encoding or data format.


----------



## PMc (Jul 27, 2020)

Zirias said:


> I don't see it yet from what you describe, but something is fundamentally wrong here.



If You want to figure it out, just ask what You like to know.



> First of all, 8bit-clean means that a computer (or, more relevant, communication) system can correctly handle encodings that need 8 bits per symbol, like UTF-8 does, and many other encodings do -- contrary to e.g. ASCII which only needs 7 bits. A format cannot be "8bit-clean". JSON mandates strings are encoded as UTF-8, so to transmit JSON, you need 8bit-clean systems, because UTF-8 requires 8bit symbols.



Yes - and no. 
8-bit clean, as I understand it, means that a transport or medium will transfer an arbitrary sequence of octets unaltered - that it will neither strip the 8th bit, nor reject bytes carrying it. This is a bit different from Your definition, but I think it suits better for practical purposes.

Now, if we -logically- build a storage where data objects can be put in, then this logical storage becomes a kind of medium, and the aforementioned question can be asked: can it handle arbitrary sequences of octets unaltered?
But then, if JSON is a part of the design of that logical storage, it will require that only legal UTF-8 data can be stored, and will reject sequences of octets which are not legal UTF-8. The reason for that is not that it is not 8-bit clean by itself, but nevertheless the outcome is the same.
And as a mere consumer of such a logical storage, it is only important for me that I can NOT store arbitrary octets (say, pictures, music, machine code, cipher keys, etc.) into it. I then need to use the same precautions as for any medium that is not 8-bit clean (e.g. to use base64).
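For comparison, this is what the classic case looks like, sketched in Python: a simulated transport that strips the 8th bit (the function `seven_bit_channel` is made up for the purpose), and base64 as the precaution.

```python
import base64

def seven_bit_channel(data: bytes) -> bytes:
    """Simulate a transport that is not 8-bit clean: it strips the top bit."""
    return bytes(b & 0x7F for b in data)

key = bytes([0x13, 0xAD, 0x37])          # hypothetical cipher material

damaged = seven_bit_channel(key)         # 0xAD arrives as 0x2D
safe = base64.b64decode(seven_bit_channel(base64.b64encode(key)))

print(damaged == key, safe == key)       # False True
```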



> Second, it sounds like you want some JSON as part of the to-be-encrypted payload.



No, *I* don't. The application does. It converts the to-be-encrypted data to JSON before encryption. It *might* be that the reason is historical: that the data was converted to JSON in order to store it (the final storage medium is a web cookie), and the encryption was inserted at a later time for security reasons. (I was already thinking that the JSON might have become superfluous at that point, but Jose gives us a nice description about why it could appear here.)


----------



## PMc (Jul 27, 2020)

Jose said:


> I'm really not sure what you're talking about. Cookies were invented by Netscape back in the mid-'90s when the Web was the Wild West. Yes, they abused HTTP headers and created a crappy system that we're all stuck with now, but that's no valid criticism of HTTP in particular or of text-over-sockets protocols in general.



Well, then look at the aches and woes of making HTTP a secure transport. That's again a "crappy system that we're all stuck with now". Then look at the methods of handling the CERTS (and their revocation schemes). And so on. That whole HTTP thing consists of "construction sites" gone wrong.

To make it a bit more clear: when I look at something that has come out of engineering, I get an over-all impression. For instance, when I look at Unix, or rather the Berkeley OS, then at almost any place I come to, I am delighted how well-thought and well-structured the engineering concepts are. With HTTP and web applications, well, it's just the opposite.



> I don't remember if I used JSON or not. I might've. JSON is a perfectly reasonable choice, because then you don't have to write a parser. What we had before was some nightmare delimited format where commas, semicolons, etc. had special meaning. We actually ran out of delimiters. Don't do this. Encodings are hard to design and parsers are hard to write.



Nice story about how these things develop. That might even be similar to the way the application framework I use has come to these habits.



> In any case, I never had to double-encode anything, and in fact, I made it my mission to remove any such double encoding I found with extreme prejudice.



You found them, so they tend to appear.



> * There is no such thing as Base64 encoding: https://en.wikipedia.org/wiki/Base64#Implementations_and_history



Not quite clear what You mean. That there are a couple of different ones? Great. Next "construction site" for the list.


----------



## PMc (Jul 27, 2020)

ralphbsz said:


> This is a step towards self-describing data. Look at the Burroughs machine for an example of how this helped: For every memory word, you knew whether it contained an integer, float, or instruction; self-describing data is a logical extension of that concept.



But this is the exact contrary of the von Neumann principle (code and data are handled in an indistinguishable fashion), and the latter is said to have made modern computing possible at all.


----------



## zirias@ (Jul 27, 2020)

Ok the picture is getting a little bit clearer as it seems this is about storing "binary" data aka an arbitrary octet-stream in JSON, or, to be precise, in a string property of a JSON object. Yes, JSON can't do that, and you would need something like base64 to encode the octet-stream as ASCII text. I just wonder, why would you want to store binary data in JSON in the first place? Or is this already the encrypted stuff? Then I suggest something with the "layering" is wrong -- putting something in a JSON structure should probably be done first and the actual encryption last...
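For the record, this is roughly what that looks like in Python; the field name `key_b64` and the three sample bytes are made up for the illustration.

```python
import base64
import json

key = bytes([0x13, 0xAD, 0x37])   # hypothetical binary data

# Encode the octets as ASCII text first, then put that text into JSON.
doc = json.dumps({"key_b64": base64.b64encode(key).decode("ascii")})
print(doc)  # {"key_b64": "E603"}

# The receiver has to know that this particular field is base64.
restored = base64.b64decode(json.loads(doc)["key_b64"])
print(restored == key)  # True
```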


----------



## PMc (Jul 27, 2020)

Zirias said:


> Ok the picture is getting a little bit clearer as it seems this is about storing "binary" data aka an arbitrary octet-stream in JSON, or, to be precise, in a string property of a JSON object. Yes, JSON can't do that, and you would need something like base64 to encode the octet-stream as ASCII text. I just wonder, why would you want to store binary data in JSON in the first place? Or is this already the encrypted stuff?



It is not really encrypted stuff, it is keys and signatures and such - and these do technically belong to the web session, so I need to store them in some session storage. And the cheapest session storage is the session cookie, and this should already be safe, so I put them there. (The session cookie is maintained by the application framework.)



> Then I suggest something with the "layering" is wrong -- putting something in a JSON structure should probably be done first and the actual encryption last...



This would be the case when using the framework as-is. But I am using my own cipher material for my own stuff, and just need some place to store that.

Now I have figured out how the whole matter came into being:

- The framework provides a session cookie, which is encrypted+signed. (It also enables the developer to add further cookies, where they can decide if they want them encrypted and/or signed.)
- This is using Ruby, which is truly OO and handles arbitrarily complex objects fully transparently. The cookie content must therefore be marshaled.
- If, by some means, the user or an adversary gains the ability to modify the cookie content, then, due to the OO nature of it, they can send back not only arbitrary data, but also procedures, that is, code that will be executed.
- Therefore marshaling was changed to JSON, as a secondary safety measure to reduce attack-surface, because JSON should not be able to contain such executable procedures. Some discussion: https://github.com/presidentbeef/brakeman/issues/1316
- There is a hint in the docs that some other data types might also not work well with JSON (web developers are probably supposed to know the limitations of JSON).
- The marshaler is configurable, so one could easily change it back to the old binary-compatible (but maybe unsafe) one, or even provide an entirely different solution as a plugin.
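The same trade-off exists outside Ruby. As an analogy only (this is Python's `pickle`, not Ruby's marshaler, and the class `Payload` is made up), a native object serializer will run code chosen by whoever produced the blob, while JSON can only ever yield plain data:

```python
import json
import pickle

class Payload:
    # pickle asks __reduce__ how to rebuild the object; the callable it
    # returns is invoked while loading.
    def __reduce__(self):
        return (eval, ("6 * 7",))   # harmless here, arbitrary code in an attack

blob = pickle.dumps(Payload())
result = pickle.loads(blob)         # eval("6 * 7") runs during loading
print(result)                       # 42

# The JSON parser just hands back strings, numbers, lists and dicts.
data = json.loads('{"__reduce__": "6 * 7"}')
print(data)                         # {'__reduce__': '6 * 7'}
```

The price for that safety is the one described above: whatever is not representable as JSON text, raw octets included, no longer passes through.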


----------



## Jose (Jul 28, 2020)

PMc said:


> Yes - and no.
> 8-bit clean, as I understand it, means that a transport or medium will transfer an arbitrary sequence of octets unaltered - that it will neither strip the 8th bit, nor reject bytes carrying it...


This is only relevant when dealing with data that is expected to be text. I'm still not sure what your complaint is.



PMc said:


> Well, then look at the aches and woes of making HTTP a secure transport. That's again a "crappy system that we're all stuck with now". Then look at the methods of handling the CERTS (and their revocation schemes). And so on. That whole HTTP thing consists of "construction sites" gone wrong.


Secure transport (and certificates) are handled by TLS, which is a completely separate, general purpose protocol. I don't see how you can blame HTTP for any perceived shortcomings.



PMc said:


> You found them, so they tend to appear.


They were almost always the result of ignorance and/or confusion. Hence why I could remove them with no problems and get a tiny performance boost to boot. The few times they were not a mistake had to do with systems that would sometimes get data that was already encoded, and sometimes not. You have to encode to be safe in those situations.



PMc said:


> Not quite clear what You mean. That there are a couple of different ones? Great. Next "construction site" for the list.


There are many incompatible encodings that are all called "base64". "Base64" encoding does not describe anything useful, and is therefore meaningless.
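Two of those variants ship in Python's standard library, which is enough for a small demonstration; the two input bytes are chosen so that the alphabets differ.

```python
import base64
import binascii

raw = b"\xfb\xff"

standard = base64.b64encode(raw)           # alphabet ends in '+' and '/'
url_safe = base64.urlsafe_b64encode(raw)   # alphabet ends in '-' and '_'
print(standard, url_safe)                  # b'+/8=' b'-_8='

# A strict decoder for one variant rejects the output of the other.
try:
    base64.b64decode(url_safe, validate=True)
    strict_accepts = True
except binascii.Error:
    strict_accepts = False
print(strict_accepts)                      # False
```

So sender and receiver have to agree on more than the word "base64": at least the alphabet, and in practice also padding and line-wrapping rules.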


----------

