The delay is much shorter if there is a response from a web server or similar. As soon as data comes back from the destination server, "SEND OK" is returned as well. If the response time is 20 ms, "SEND OK" arrives just as quickly, immediately ahead of the received remote data with no gap between them.
The images below show the voltages sampled on the physical serial Tx and Rx pins of the device. The blue trace is the data sent to the ESP8266 and the yellow trace is its response.
If the server response is delayed, "SEND OK" arrives on its own after 290 ms.
If the remote server responds within 20 ms, "SEND OK" returns much sooner, immediately ahead of the response data.
It looks like the ESP8266 "forgets" to flush the input stream. With no remote response, the flush is eventually triggered by an internal timer after 290 ms. When a remote response arrives earlier than 290 ms, the ESP8266 is forced to send data back, starting with "SEND OK".
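To make the suspected behaviour concrete, here is a minimal sketch of it as a model. It only restates what the scope traces show; it is not taken from the firmware. The names `FLUSH_TIMEOUT_MS` and `send_ok_latency_ms` are made up for illustration, and the 290 ms value is my measurement, not a documented constant.

```python
# Model of the observed behaviour only, not of the firmware internals.
FLUSH_TIMEOUT_MS = 290  # assumption: value read off the scope traces


def send_ok_latency_ms(remote_response_ms=None):
    """Return when "SEND OK" appears on the serial line, in ms after the send.

    remote_response_ms is the time until the remote peer answers,
    or None if the peer never answers.
    """
    if remote_response_ms is None:
        # Nothing comes back: only the internal timer releases "SEND OK".
        return FLUSH_TIMEOUT_MS
    # Whichever happens first releases "SEND OK".
    return min(remote_response_ms, FLUSH_TIMEOUT_MS)


if __name__ == "__main__":
    for response in (None, 20, 500):
        print(response, "->", send_ok_latency_ms(response), "ms")
```

If this model is right, "SEND OK" never takes longer than the timer value, and it only arrives sooner when the remote peer happens to answer sooner.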
Because the delay blocks further sending until something arrives on the receive stream, streaming data out in small portions becomes extremely slow for no good reason. I think "SEND OK" should be returned as soon as the data is in the ESP8266 output buffer, or possibly once the buffer has been sent over TCP. It would be much more efficient to rely only on a "buffer full" error, raised when the initial CIPSEND command requests more bytes than the output buffer currently has room for. The ESP8266 should simply stream data out as fast as it can and not care whether anything comes back from the remote peer.
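A rough back-of-the-envelope calculation shows how much this costs. The sketch below assumes that every chunk has to wait for its own "SEND OK" before the next one can be sent, and it ignores serial and TCP transfer time entirely, so the real numbers will be somewhat worse. The function name `send_time_s` and the 1024-byte / 32-byte figures are just illustrative choices.

```python
def send_time_s(total_bytes, chunk_bytes, send_ok_delay_s):
    """Estimate the time to send total_bytes in chunks of chunk_bytes.

    Assumption: each chunk is gated by one "SEND OK" delay and nothing
    else costs time (serial and TCP transfer are ignored).
    """
    chunks = -(-total_bytes // chunk_bytes)  # ceiling division
    return chunks * send_ok_delay_s


if __name__ == "__main__":
    total, chunk = 1024, 32  # illustrative sizes
    for delay_s in (0.290, 0.020):  # silent peer vs. peer answering in 20 ms
        elapsed = send_time_s(total, chunk, delay_s)
        print(f"{delay_s * 1000:.0f} ms per chunk: "
              f"{elapsed:.2f} s total, {total / elapsed:.0f} bytes/s")
```

Under these assumptions, sending 1024 bytes in 32-byte chunks to a peer that stays silent takes over 9 seconds, purely because of the wait for "SEND OK".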