Stas Afanasyev. Juno. Pipelines based on io.Reader / io.Writer. Part 2

In this talk, we discuss the io.Reader / io.Writer concept: why these interfaces are needed, how to implement them correctly and what pitfalls exist in this regard, as well as how to build pipelines on top of standard and custom io.Reader / io.Writer implementations.



Stas Afanasyev. Juno. Pipelines based on io.Reader / io.Writer. Part 1

Bug “on trust”


Another nuance: this implementation contains a bug. The bug has been confirmed by the developers (I wrote to them about it). Maybe someone already knows what this bug is? On the slide it is the penultimate line:



It comes from trusting the wrapped Reader too much: if the wrapped Reader returns a negative number of bytes, then the limit, which should shrink by the number of bytes read, actually grows. In some cases this is a rather serious bug that you cannot immediately understand.
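For reference, here is roughly what LimitedReader.Read in the standard library looks like (reconstructed from memory rather than taken from the slide); the penultimate line is the one in question:

    func (l *LimitedReader) Read(p []byte) (n int, err error) {
        if l.N <= 0 {
            return 0, EOF
        }
        if int64(len(p)) > l.N {
            p = p[0:l.N]
        }
        n, err = l.R.Read(p)
        // The penultimate line: if the wrapped Reader returns a negative n,
        // the remaining limit l.N grows instead of shrinking.
        l.N -= int64(n)
        return
    }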

I wrote in the issue: let's do something, let's fix it! And then a whole layer of problems was revealed... First, they told me that if this check is added here, it will have to be added everywhere, and there are a dozen such places. If we want to shift it to the client side, then we need to define a set of rules by which the client will validate the data (and there may be two or five of those). It turns out that all of this would have to be duplicated.

I agree that this is not optimal. But then let's at least come to some consistent position! Why does one implementation in the standard library not trust anything, while others trust absolutely everything?

In general, while I was writing up my considered opinion and thinking it over, the issue was closed with a comment along the lines of: "We will not do anything. Bye!" They made me look like a bit of a fool... Politely, of course, there is nothing to find fault with.

So we now have a problem: it is not clear who should validate the data coming from the wrapped Reader. Either the client does it, or we trust the contract completely... We do have one solution; if there is time left, I will talk about it.

Let's move on to the next case.

TeeReader


We looked at an example of how to wrap a Reader. The next pipeline example is moving data from a Reader into a Writer. There are two situations.

First situation. We need to read the data from a Reader, transparently copy it into a Writer along the way, and keep working with it as a Reader. For this there is the TeeReader implementation. It is shown in the upper snippet:



It works like the tee command on Unix; I think many of you have heard of it.
Note that this implementation checks the number of bytes it read from the wrapped Reader - see the condition in the second line of Read? When you write such an implementation, it is intuitively clear that with a negative number you would get a panic. So this is yet another place where the question of trusting the wrapped Reader comes up! I remind you that this is all the standard library.
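A rough sketch of what the teeReader from the standard library looks like; the condition in the second line of Read is the check just mentioned:

    func TeeReader(r Reader, w Writer) Reader {
        return &teeReader{r, w}
    }

    type teeReader struct {
        r Reader
        w Writer
    }

    func (t *teeReader) Read(p []byte) (n int, err error) {
        n, err = t.r.Read(p)
        if n > 0 { // without this check, a negative n would make p[:n] panic
            if n, err := t.w.Write(p[:n]); err != nil {
                return n, err
            }
        }
        return
    }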

Let's move on to a case showing how to use it. What do we do in the lower snippet? We download the robots.txt file from golang.org using the standard http client.

As you know, the http client returns a response structure in which the Body field is an implementation of the Reader interface. To be precise, it is an implementation of the ReadCloser interface; but ReadCloser is just an interface composed of Reader and Closer. That is, it is a Reader that can also be closed.

In this example (in the lower snippet) we construct a TeeReader that will read data from this Body and write it to a file. The creation of the file, unfortunately, is left off the slide today because it all did not fit. But, again, if you look at the dendrogram, the file type implements the Writer interface, that is, we can write to it. That much is obvious.

We assembled our TeeReader and read it with ReadAll. Everything works as expected: we read the Body to the end, write it to a file, and also see it on stdout.
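A minimal sketch of that example (the output file name is my own choice, and error handling is reduced to panics):

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "os"
    )

    func main() {
        resp, err := http.Get("https://golang.org/robots.txt")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        // The file creation that was left off the slide.
        f, err := os.Create("robots.txt")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Everything read from Body is transparently written to the file.
        tee := io.TeeReader(resp.Body, f)

        data, err := io.ReadAll(tee)
        if err != nil {
            panic(err)
        }
        fmt.Println(string(data)) // and we also see it on stdout
    }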

Beginner way


The second situation. We just need to read the data from Reader and write it to Writer. The solution is obvious ...

When I was just starting to work with Go, I solved such problems the way shown on the slide:



I allocated a buffer, filled it with data from the Reader, and passed the filled slice to the Writer. Simple.
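Roughly, that "beginner way" looks like the following sketch (not the original slide code):

    package main

    import (
        "io"
        "os"
        "strings"
    )

    func main() {
        r := strings.NewReader("some data")

        // The "beginner way": allocate a buffer, fill it with a single Read,
        // pass the filled part of the slice to the Writer.
        buf := make([]byte, 1024)
        n, err := r.Read(buf) // nothing guarantees this reads everything
        if err != nil && err != io.EOF {
            panic(err)
        }
        if _, err := os.Stdout.Write(buf[:n]); err != nil {
            panic(err)
        }
    }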

Two points. First, there is no guarantee that the entire Reader will be read out in a single call to the Read method - data may be left over (properly, this should be done in a loop).

The second point is that this approach is simply not optimal: it is boilerplate code that has already been written before us.

For this, there is a special family of helpers in the standard library - these are Copy, CopyN and CopyBuffer.

io.Copy. WriterTo and ReaderFrom


io.Copy essentially does what was on the previous slide: it allocates a default 32 KB buffer and writes data from the Reader to the Writer (the signature of Copy is shown in the upper snippet):
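For reference, that signature:

    // Copy copies from src to dst until EOF or until an error occurs.
    func Copy(dst Writer, src Reader) (written int64, err error)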



In addition to this boilerplate routine, it also contains a series of clever optimizations. Before we talk about those optimizations, we need to get acquainted with two more interfaces:

  • WriterTo;
  • ReaderFrom.

A hypothetical situation: your Reader works with an in-memory buffer. It has already allocated it, writes to it and reads from it - that is, memory for it has already been allocated. You want to read from this Reader from the outside.

We have already seen how this happens: a buffer is created and passed to the Read method; the Reader, which works with memory, copies data into it from its already allocated piece... But this is not optimal - memory has already been allocated once. Why allocate it again?



Some 5-6 years ago (there is a link to the change list) two interfaces appeared: WriterTo and ReaderFrom. A Reader implements WriterTo, and a Writer implements ReaderFrom. It turns out that a Reader that already holds a slice with the data can avoid an extra allocation: its WriteTo method accepts a Writer and hands it the buffer it already has inside.

This is how bytes.Buffer and bufio are implemented. And if you look at the dendrogram again, you will see that these two interfaces are not very popular: they are implemented only by those types that work with an internal buffer, where the memory is already allocated. This will not help you avoid an allocation every time - only when you are already working with an allocated piece of memory.

ReaderFrom works in a similar way (it is implemented by Writers). ReadFrom reads the entire Reader that comes in as an argument (until EOF) and writes it somewhere inside the Writer's internal implementation.
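For reference, these two interfaces in the io package look like this:

    // WriterTo is implemented by Readers that can push their data
    // into a Writer themselves, without an intermediate buffer.
    type WriterTo interface {
        WriteTo(w Writer) (n int64, err error)
    }

    // ReaderFrom is implemented by Writers that can pull all the data
    // out of a Reader themselves (until EOF).
    type ReaderFrom interface {
        ReadFrom(r Reader) (n int64, err error)
    }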

CopyBuffer implementation


This snippet shows the implementation of the copyBuffer helper. This unexported copyBuffer is used under the hood of io.Copy, CopyN and CopyBuffer.

There is a small nuance worth mentioning here: CopyN was recently optimized and decoupled from this logic. This is exactly the optimization I spoke about earlier: before allocating an additional 32 KB buffer, a check is made - maybe the data source implements the WriterTo interface and this additional buffer is not needed at all?

If it does not, we check: maybe the Writer implements ReaderFrom, so we can connect the two without this intermediary? If that does not work either, the last hope remains: maybe we were handed an already allocated buffer that we could use?
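A simplified sketch of that decision logic inside the unexported copyBuffer (abridged, not the literal standard-library code):

    func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) {
        // 1. Maybe the source can write itself out: no buffer is needed at all.
        if wt, ok := src.(WriterTo); ok {
            return wt.WriteTo(dst)
        }
        // 2. Maybe the destination can read the source in by itself.
        if rf, ok := dst.(ReaderFrom); ok {
            return rf.ReadFrom(src)
        }
        // 3. Last hope: use the caller's buffer if one was passed in,
        //    otherwise allocate the default 32 KB buffer and copy in a loop.
        if buf == nil {
            buf = make([]byte, 32*1024)
        }
        // ... the usual Read/Write loop follows ...
        return
    }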



That's how io.Copy works.

There is one issue - half proposal, half bug, it is not clear which - that has been open for a year and a half. It says: CopyBuffer is semantically incorrect.

Unfortunately, the signature of CopyBuffer itself is not on the slide, but it looks exactly like this unexported method.

When you call CopyBuffer in the hope of avoiding an extra allocation and pass it an already allocated byte slice, the following logic applies: if the Reader or Writer implements WriterTo or ReaderFrom, there is no guarantee that your buffer will actually be used. This was accepted as a proposal, with a promise to think about it for Go 2.0. For now, you just need to know about it.

Working with io.Pipe. PipeReader and PipeWriter


Another case: you somehow need to get data from a Writer into a Reader. A fairly common real-life case.

Imagine that you already have some data and it implements the Reader interface - everything is clear there. You need to compress this data, gzip it, and send it to S3. What is the nuance?..
Anyone who has worked with the gzip type from the compress package knows that the gzip writer itself is just a proxy: it implements the Writer interface, takes data in, does something with it, and then has to push it on somewhere. In its constructor it takes an implementation of the Writer interface.

Accordingly, we need some intermediate Writer into which we will push the data already compressed in the first stage. Our next move is to upload this data to S3, and the standard AWS client accepts the io.Reader interface as a data source.



The slide shows the pipeline and how it looks: we need to move the data from a Reader into a Writer, and from that Writer into a Reader. How do we do it?

The standard library has a cool thing - io.Pipe. It returns two values: a PipeReader and a PipeWriter. This pair is inextricably linked. Imagine a toy telephone made of two cups and a string: it makes no sense to speak into one cup while no one is listening at the other end...



What does io.Pipe do? Nothing is read until someone writes data, and vice versa: nothing is written until someone reads that data at the other end. Here is an example implementation:



We will do the same here: read the robots.txt file that we fetched before, compress it with gzip and send it to S3.

  • On the first line, a pair is created: pipeReader and pipeWriter. Next, we must start at least one goroutine that will read data from one end of the pipe. In that goroutine we run the uploader with pipeReader as its data source.
  • In the next step we need to compress the data. We create a gzip writer over pipeWriter (the other end of the pipe), and the goroutine we already started receives the data at the far end of the pipe and reads it. When this whole sandwich is assembled, all that remains is to light the fuse...
  • See: io.Copy on the last line copies data from the Body into the gzip writer we created (that is, from Reader to Writer). All of this works as expected (a sketch follows below).
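A minimal sketch of the pipeline just described; uploadToS3 here is a stand-in of my own for the real AWS client call (which takes an io.Reader as its data source):

    package main

    import (
        "compress/gzip"
        "io"
        "net/http"
    )

    // uploadToS3 stands in for the AWS uploader; here it just drains the reader.
    func uploadToS3(r io.Reader) error {
        _, err := io.Copy(io.Discard, r)
        return err
    }

    func main() {
        resp, err := http.Get("https://golang.org/robots.txt")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        pr, pw := io.Pipe()

        // The goroutine reads from one end of the pipe and uploads the data.
        done := make(chan error, 1)
        go func() {
            done <- uploadToS3(pr)
        }()

        // gzip writes its compressed output into the other end of the pipe.
        gz := gzip.NewWriter(pw)

        // Light the fuse: copy the response body into the gzip writer.
        if _, err := io.Copy(gz, resp.Body); err != nil {
            panic(err)
        }
        gz.Close() // flush the gzip stream
        pw.Close() // signal EOF to the reading side
        if err := <-done; err != nil {
            panic(err)
        }
    }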

This example could also be solved another way: with any type that implements both Reader and Writer, you first write the data into it and then read it back out.
That was a clear demonstration of how to work with io.Pipe.

Other implementations


That is basically it from me. We have reached a few interesting implementations that I would like to mention.



I did not say anything about MultiReader or MultiWriter. These are more cool implementations from the standard library that let you combine different implementations: MultiWriter writes to all of its Writers simultaneously, and MultiReader reads its Readers sequentially.
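A minimal sketch of both (the output file name is my own choice):

    package main

    import (
        "io"
        "os"
        "strings"
    )

    func main() {
        // MultiReader reads its sources one after another, as if concatenated.
        r := io.MultiReader(strings.NewReader("hello, "), strings.NewReader("world\n"))

        // MultiWriter duplicates every write to all of its destinations.
        f, err := os.Create("out.txt")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        w := io.MultiWriter(os.Stdout, f)

        // The data ends up both on stdout and in the file.
        if _, err := io.Copy(w, r); err != nil {
            panic(err)
        }
    }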

Another implementation is called limio. It lets you limit reading: you can set the speed, in bytes per second, at which your Reader should be read.

Another interesting implementation simply visualizes reading progress as a progress bar. It is called ioprogress.

Why did I say all this? What did I mean by that?



  • If you suddenly need to implement the Reader and Writer interfaces, do it correctly. There is still no single decision about who is responsible for validation, so we will assume that everyone trusts the contract - which means you need to abide by it impeccably.
  • If your case involves working with an already allocated buffer, do not forget about the ReaderFrom and WriterTo interfaces.
  • If you hit a dead end and need examples, look at the standard library: there are many cool implementations you can rely on, and there is documentation.
  • If something is completely unclear to you, feel free to file issues. The maintainers are reasonable, respond quickly, and help you very politely and competently.



That’s all for me. Thank you for coming!

Questions


Question from the audience (Q): - I have a simple question, I guess. Please tell us about some real-life use cases: what did you use and why? You said that a Reader/Writer returns the number of bytes it read. Have you ever had problems with this - when you requested a read (not just through ReadAll), but something did not work?

SA: - I must honestly admit that I have never had such cases, because I have always worked with standard library implementations. But hypothetically such a situation is, of course, possible. As for specific cases, we often build multilayer pipelines, and if you hypothetically allow such a bug in, the whole pipeline falls apart...

Q: - This is not quite a bug. Then let me tell you about a small experience of mine. I had a problem with Booking.com: they used a driver that I wrote, and something was not working for them. There is a standard binary protocol that we implemented; locally everything worked well, everyone was fine, but it turned out they had a very bad network to the data center. The Reader then really did not return everything (bad network cards, something else).

SA: - But if it did not return everything, then it should not have returned the end-of-data marker (EOF), and the client should come back again. Under the contract as described, the Reader should not... Let's just say the Reader, of course, decides how much it returns on each call, but a client that wants to read everything must wait for EOF.

Q: - But that was precisely because of the connection. This is exactly the problem that occurred in the standard net package.

SA: - And did it return EOF?

Q: - It did not return EOF - it simply did not read everything. I told it: "Read the next 20 bytes." It reads, but does not read all of them.

SA: - Hypothetically this is possible, because it is just an interface that describes a communication protocol. One would have to look at the specific case. Here I can only say that the client, in theory, should come back again if it did not receive everything it wanted. You asked it for a slice of 20 bytes, it read 15 for you, and EOF has not come - you should go again...

Q: - There is io.ReadFull for this situation. It is designed specifically to read until the slice is full.

SA: - Yes. I did not say anything about ReadFull.

Q: - It is a completely normal situation when Read does not fill the entire slice. You need to be prepared for this.

SA: - This is a very expected case!
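A minimal sketch of the io.ReadFull behaviour mentioned above: unlike a single Read call, it keeps reading until the buffer is full or an error/EOF occurs:

    package main

    import (
        "fmt"
        "io"
        "strings"
    )

    func main() {
        r := strings.NewReader("only 15 bytes!!")

        buf := make([]byte, 20)
        n, err := io.ReadFull(r, buf)
        // Fewer bytes than requested: ReadFull reports io.ErrUnexpectedEOF.
        fmt.Println(n, err) // 15 unexpected EOF
    }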

Q: - Thanks for the talk, it was interesting. I use Readers in a small, simple proxy that reads http on one side and writes it out on the other. I use a ReadCloser to solve one problem - to always close what I read. Do I need to blindly trust the contract? You said there could be problems. Or should I add additional checks? It is theoretically possible that something will not arrive completely at this point. Do I need to do these additional checks rather than trust the contract?

SA: - I would say this: if your application is tolerant of these errors (for example, if you fully trust the contract), then maybe not. But if you would not like to get a panic (as I showed with the negative read in bytes.Buffer), then I would still check.
But this is up to you. What can I recommend? I think you should just weigh the pros and cons: what happens if you suddenly get a negative number of bytes?

Q: - Thanks for the talk. Unfortunately, I don't know anything about Go. If a panic has occurred, is there any way to intercept it and get information about what happened and where, to avoid problems on a Friday night?

SA: - Yes. The recover mechanism allows you to "catch" a panic and report it without crashing, so to speak.
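A minimal sketch of catching a panic with recover:

    package main

    import "fmt"

    func safeRead() {
        defer func() {
            if r := recover(); r != nil {
                // The panic is caught here; log it instead of crashing.
                fmt.Println("recovered from panic:", r)
            }
        }()
        var p []byte
        _ = p[5] // out-of-range access on a nil slice panics
    }

    func main() {
        safeRead()
        fmt.Println("still running")
    }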



Q: - How do your recommendations for using Writer and Reader implementations square with the errors returned by WebSocket implementations? I won't give a concrete example, but is end of file always used there? As far as I remember, a message there ends with some other values...

SA: - That is a good question, because I simply have nothing to answer. One would have to look! If EOF does not come, then the client, if it wants to get everything, must keep coming back.

Q: - How long a pipeline have you managed to assemble? Do you have any internal conviction that a pipeline is not worth building with more than five participants, or with branches? How big a tree have you managed to build from these pipes (Read, Write)?

SA: - In my practice, about five consecutive stages is the optimum, because beyond that it becomes harder to debug and to keep in mind what flows where. The structure turns out pretty branchy. But I would say 5-7 at most.

Q: - 5-7 - in which case is that?

SA: - For example, reading some data. You need to log it, and what you log you need to trim. Once it is logged, you read this data and need to send it on to some storage (hypothetically) - any storage that implements the Writer interface. Within such a pipeline there are 5-6 steps, and at one of the steps it also branches off to the side while you continue working with the Reader.

Q: - You had an interesting slide about the Beginner way. Can you point out another 2-3 interesting patterns that used to be written that way, but are better done differently now?

SA: - With that slide I wanted to show exactly how not to read a Reader. Nothing else like the Beginner way comes to mind... That is probably the main mistake, the main pattern that should be avoided when working with Readers.
Presenter: - I would add on my own that it is very important for a beginner to read all the documentation of the io package, on all the interfaces that are there, and to understand them. Because in fact there are a lot of them, and you often start building something of your own even though it already exists there and is implemented correctly ("correctly" meaning with all the edge cases taken into account).
Question from the presenter: - So how do we live with this?

SA: - Good question! I promised to tell you if we had time. As a result of the discussion of the LimitedReader bug, we came to the following decision: make a kind of protective Reader, a "condom" of sorts, that shields you from external threats - you wrap a Reader you do not trust in it and do not let any infection into your system.

In this Reader you implement all the checks that are missing elsewhere: for example, a negative byte count, or games with the number of bytes (say you passed in a slice of 10 bytes and got 15 back - how do you react to that?)... In this Reader you can implement a whole set of such checks. I said: "Maybe let's add it to the standard library, since it would be useful for everyone?"

The answer I got was that there seems to be no point - it is a simple thing you can implement yourself. That's it. We live on. We trust the contract, guys. But I would not trust it.
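A minimal sketch of the protective Reader just described (the names are my own, not from the talk): it wraps an untrusted Reader and validates the values returned from Read before letting them into the rest of the system:

    package main

    import (
        "errors"
        "io"
        "strings"
    )

    var ErrInvalidRead = errors.New("wrapped Reader violated the io.Reader contract")

    type safeReader struct {
        r io.Reader
    }

    func NewSafeReader(r io.Reader) io.Reader {
        return &safeReader{r: r}
    }

    func (s *safeReader) Read(p []byte) (int, error) {
        n, err := s.r.Read(p)
        // A negative count, or more bytes than the buffer can hold: reject it
        // instead of propagating a value that could corrupt limits or cause a panic.
        if n < 0 || n > len(p) {
            return 0, ErrInvalidRead
        }
        return n, err
    }

    func main() {
        // Example: wrap any untrusted Reader before handing it to the rest of the code.
        _ = NewSafeReader(strings.NewReader("data"))
    }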



Q: - When we work with Readers and Writers, there is a chance of running into a gzip "bomb"... How much do we trust ReadAll and WriteAll? Or should we nevertheless implement buffered reading and work only with the buffer?

SA: - ReadAll itself just uses bytes.Buffer under the hood. When you want to use one thing or another, it is worth going in and looking at how its "guts" are implemented. Again, it depends on your requirements: if you cannot tolerate the kind of errors I showed, you need to check whether what comes from the wrapped Reader is validated. If it is not, use, for example, bufio (everything is checked there). Or do what I just described: a kind of proxy Reader that checks the data against your list of requirements and either returns it to the client or returns an error.



