TL;DR: The One-Minute Version
Every single thing you do on the internet, from loading this webpage and sending a Slack message to streaming Netflix and playing Valorant, involves chopping your data into small pieces called packets and shipping them across the network. A packet is a small chunk of data (typically 1,460 bytes of payload for TCP on Ethernet). Each packet travels independently through the network, potentially taking different routes; think of each packet as a separate envelope in a stack of mail. The question is: how carefully do you want those packets delivered?
TCP (Transmission Control Protocol, RFC 793, 1981) says: "I will guarantee every packet arrives, in order, with nothing missing or corrupted. If something gets lost, I'll re-send it. This costs some speed (a 3-packet handshake before any data flows, acknowledgments for every segment, retransmission timers), but you'll get perfect data." RFC 793, the original TCP specification, was published in September 1981 and edited by Jon Postel; it defines the 3-way handshake, sequence numbers, acknowledgments, and flow control, and it is still the foundation of the protocol used today, though updated by many subsequent RFCs. UDP (User Datagram Protocol, RFC 768, 1980) says: "I'll fire packets as fast as I can. No handshake, no acknowledgments, no retransmission. If some get lost, that's your problem. But I'm fast." RFC 768, published in August 1980 by Jon Postel, is just 3 pages long. The entire protocol is: source port, destination port, length, checksum, data. That's it. The shortest RFC you'll ever read.
These aren't abstractions; they're real things running on your machine right now. Open a terminal and try:
# See every active TCP connection on your machine RIGHT NOW
netstat -an | grep ESTABLISHED # macOS/Linux
netstat -an | findstr ESTABLISHED # Windows
# See UDP sockets too
netstat -an | grep udp # macOS/Linux
# Want to see real TCP packets? Wireshark filter:
# tcp.flags.syn==1   <- shows every new TCP handshake
# udp                <- shows how bare-bones UDP packets are
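You can feel the same philosophical split from code. Below is a minimal sketch using Python's standard library socket module: the TCP pair cannot exchange a byte until the kernel completes the 3-way handshake inside connect()/accept(), while the UDP socket happily fires a datagram at a port nobody is listening on and returns immediately. (Addresses and messages here are arbitrary examples, not anything from the text above.)

```python
import socket

# TCP: both sides must complete the 3-way handshake before any data flows.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))         # port 0: let the OS pick a free port
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())  # SYN, SYN-ACK, ACK happen right here
conn, _ = server.accept()
client.sendall(b"hello over tcp")
received = conn.recv(1024)
print(received)                       # reliable, in-order delivery

# UDP: no handshake, no connection state. sendto() returns immediately,
# even though nobody is listening on the destination port.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"hello into the void", ("127.0.0.1", 9))

for s in (client, conn, server, udp):
    s.close()
```

The asymmetry in the API mirrors the protocols: TCP needs listen/connect/accept ceremony; UDP needs only sendto.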
Why You Need This: The Scenario
Right now, on your phone or laptop, two completely different types of network traffic are happening simultaneously. You can literally see them:
$ netstat -an | head -20
Proto Local Address Foreign Address State
tcp 192.168.1.42:52341 104.244.42.65:443 ESTABLISHED # Twitter/X (HTTPS)
tcp 192.168.1.42:52347 140.82.121.3:443 ESTABLISHED # GitHub (HTTPS)
tcp 192.168.1.42:52390 172.217.14.99:443 ESTABLISHED # Gmail (HTTPS)
tcp 192.168.1.42:52401 54.239.28.85:443 ESTABLISHED # AWS S3 download
udp 192.168.1.42:50012 142.250.80.17:443 * # Google Meet (QUIC/UDP)
udp 192.168.1.42:50001 52.202.82.141:8801 * # Zoom media stream
udp 192.168.1.42:61234 8.8.8.8:53 * # DNS query
See the pattern? Your browser connections to Twitter, GitHub, Gmail, and that file download from AWS: all TCP. Your video call on Zoom, your Google Meet session, your DNS lookups: all UDP. Same device, same network cable, same Wi-Fi radio. Two completely different delivery philosophies running side by side.
Scenario A: downloading a bank statement (2.3 MB PDF). This file is 2,359,296 bytes. If even one byte is wrong, the PDF's internal structure is corrupted and your PDF reader shows "Error: file is damaged." If byte 1,048,576 arrives before byte 524,288, the file is scrambled. You need every byte, in order, verified. You're willing to wait an extra 45ms for the TCP handshake and tolerate the occasional 200ms retransmission delay. The alternative is a broken file.
Scenario B: a Zoom call with your team. Zoom sends roughly 2.5 Mbps of video and audio. That's about 260 UDP packets per second, each carrying about 1,200 bytes of encoded video. If packet #4,721 gets lost, what happens? Zoom doesn't wait for it. By the time it could be retransmitted (200ms minimum), the conversation has moved on by 5-6 video frames. That lost packet contained one 33ms frame; the viewer sees a tiny glitch or the codec interpolates. Nobody notices. But if Zoom used TCP and paused the video every time a packet was lost, waiting for retransmission? On a typical network with 0.1% packet loss, that's a freeze every ~4 seconds. Unwatchable.
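The freeze arithmetic is worth checking for yourself. A quick back-of-the-envelope calculation, using the packet rate from the scenario above:

```python
# How often does 0.1% packet loss interrupt a real-time stream?
packets_per_second = 260         # rate from the Zoom scenario
loss_rate = 0.001                # 0.1% packet loss

losses_per_second = packets_per_second * loss_rate
seconds_between_losses = 1 / losses_per_second
print(f"one lost packet roughly every {seconds_between_losses:.1f} seconds")
```

Over UDP each of those is one skipped 33ms frame; over TCP each would be a 200ms+ stall while every later packet waits for the retransmission.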
Classify these five real services: (1) Sending an email with a contract PDF attached, (2) Playing Fortnite online, (3) Loading your bank's website, (4) A Discord voice call with friends, (5) Stripe processing a $500 payment. For each one: does it need every byte to arrive perfectly (TCP), or is speed more important than perfection (UDP)?
Group them before scrolling. Think about what happens if a packet is lost: is it catastrophic (file corrupt, money lost) or barely noticeable (tiny audio glitch, one stale game frame)?

Answer: Email + contract PDF, bank website, Stripe payment are all TCP. A corrupted email attachment, a half-loaded bank page, or a garbled payment amount would be catastrophic. Fortnite player positions and Discord voice are UDP. A 30ms-old position update is useless, and re-requesting a lost audio packet 200ms later means hearing a word after the conversation has moved on. The pattern: if losing a packet means corruption or data loss, use TCP. If losing a packet means a tiny glitch that nobody notices, use UDP.
But here's the question that should be bugging you: why do packets get lost, reordered, or corrupted in the first place? Let's go one layer deeper.
The First Attempt: Just Send Bytes Over Raw IP
Before TCP and UDP existed, there was just IP, the Internet Protocol. IP does exactly two things: it gives every device an address (like 192.168.1.42 or 142.250.80.46), and it routes packets hop-by-hop toward that address. That's it. No reliability, no ordering, no error recovery. IP is called a "best-effort" delivery service: the network will TRY to deliver your packet, but it makes no guarantees. If a router is overloaded, it silently drops your packet. If two routes have different delays, packets arrive out of order. IP doesn't even tell you something went wrong; you just never get the data.
You can see this "best effort" behavior right now. Open a terminal and run traceroute (or tracert on Windows) to see the actual path your packets take:
$ traceroute google.com
1 192.168.1.1 (your router) 1.2 ms
2 10.0.0.1 (ISP local) 5.8 ms
3 72.14.215.85 (ISP backbone) 12.3 ms
4 108.170.252.129 (Google edge) 14.1 ms
5 142.251.49.163 (Google datacenter) 18.7 ms
6 142.250.80.46 (google.com) 19.2 ms
# 6 hops. Each one is an independent router making its own decision.
# Packet A might go through hop 3a, Packet B through hop 3b.
# Different paths = different delays = packets arrive OUT OF ORDER.
Each of those hops is a separate router. Each router independently decides where to forward your packet. If router #3 is congested, it might send packet A on a different path than packet B. This means packet B could arrive before packet A, even though A was sent first. And if any router along the way runs out of buffer space? It silently drops your packet: when a router's incoming queue is full and a new packet arrives, the router discards it with no notification (this is called "tail drop"). The sender and receiver have no idea it happened unless they have their own detection mechanism, which is exactly what TCP adds. No error message, no notification, nothing. Your data just vanishes.
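You can model this best-effort behavior in a few lines. The sketch below (a toy model, not a real network stack) drops each "packet" with some probability and shuffles the survivors, exactly the two failure modes just described; the seeded random generator makes the run repeatable:

```python
import random

def best_effort_send(packets, loss_rate, rng):
    """Model IP's best-effort service: any packet may be silently dropped,
    and the survivors may arrive in a different order. Nobody is notified."""
    delivered = [p for p in packets if rng.random() > loss_rate]
    rng.shuffle(delivered)            # independent routing -> reordering
    return delivered

packets = list(range(10))             # 10 numbered packets, sent in order
arrived = best_effort_send(packets, loss_rate=0.2, rng=random.Random(42))
print("sent:   ", packets)
print("arrived:", arrived)            # some missing, order scrambled
```

Everything TCP adds (sequence numbers, ACKs, retransmission) exists to undo what this function does.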
This "fire and forget" approach is essentially what the earliest internet experiments used: the ARPANET of the late 1960s, the US Department of Defense-funded precursor to the modern internet. Operational from 1969, it initially connected four universities (UCLA, Stanford Research Institute, UC Santa Barbara, University of Utah) with a simple packet-switching protocol, and the unreliability of that early network is what motivated the creation of TCP. And honestly? For some uses, fire-and-forget is fine. A DNS lookup ("what's the IP address for google.com?") is a single tiny packet out and a single tiny packet back. If the answer doesn't come, you just ask again in 2 seconds. Try it yourself:
# This is a real UDP packet. One question out, one answer back.
$ dig @8.8.8.8 google.com
;; ANSWER SECTION:
google.com. 179 IN A 142.250.80.46
;; Query time: 22 msec   <- total round-trip: 22ms
;; SERVER: 8.8.8.8#53    <- Google's public DNS, port 53 (UDP)
;; MSG SIZE rcvd: 55     <- 55 bytes. That's the whole response.
That 55-byte DNS response worked great over raw UDP. But the moment you try to send anything bigger (a 2 MB file, a webpage with 47 resources, a bank transaction), raw IP's "best effort" delivery falls apart in four specific, measurable ways.
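To appreciate just how small a DNS query is, you can hand-build one with nothing but the standard library. This is a sketch of the wire format (12-byte header plus a question section), not sent anywhere; the query ID 0x1234 is an arbitrary example:

```python
import struct

def build_dns_query(name, query_id=0x1234):
    """Build the raw bytes of a DNS A-record query: the entire payload of
    the single UDP datagram that dig fires at port 53."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question,
    # 0 answers, 0 authority records, 0 additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question: each label is length-prefixed, terminated by a zero byte.
    qname = b"".join(bytes([len(label)]) + label.encode()
                     for label in name.split("."))
    question = qname + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

packet = build_dns_query("google.com")
print(len(packet), "bytes")   # the whole question fits in one tiny datagram
```

Twenty-eight bytes out, ~55 bytes back. Wrapping that exchange in TCP's handshake and teardown would cost more packets than the data itself, which is the next section's point.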
Where It Breaks: Four Failure Modes (With Math)
The internet looks smooth on the surface, but underneath it's a maze of independent routers, each making split-second forwarding decisions. Your data doesn't travel on a dedicated wire from A to B. Each packet bounces through 6 to 14 hops (you saw this in the traceroute above; each router your packet passes through is one "hop," and each hop adds 1-50ms of delay depending on distance and congestion), and any of those routers might be overloaded, making different routing decisions, or even failing mid-transfer.
This creates four specific, measurable problems. Every one of them is real β you can observe them with tools on your own machine.
1. Packet Loss: Data Vanishes Silently
Every router has a finite buffer: a queue where incoming packets wait to be forwarded. A typical router buffer is sized to hold about 200ms worth of traffic at the link rate; for a 1 Gbps link, that's about 25 MB of buffer. When a burst of traffic exceeds this buffer, the router has no choice: it drops the newest packets silently. No error message, no retry, no notification. Your packet just ceases to exist.
How common is this? On a well-provisioned wired network (your office LAN, a datacenter), packet loss is typically 0.01-0.1%. On congested Wi-Fi or mobile networks, it can spike to 2-5%. You can measure it yourself:
# Send 100 pings and see how many get lost
$ ping -c 100 google.com
# ... (100 ping results) ...
# 100 packets transmitted, 99 received, 1% packet loss
#
# On a good wired connection: 0% loss
# On congested Wi-Fi: 1-3% loss
# On a bad mobile connection: 5%+ loss
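Those loss percentages look tiny until you multiply them across a whole transfer. A quick stdlib calculation; the 1,616-segment figure assumes the 2.3 MB file from Scenario A split into 1,460-byte TCP segments:

```python
def p_all_survive(n_packets, loss_rate):
    """Probability that every packet of a transfer arrives, assuming
    independent per-packet loss: (1 - p) ** N."""
    return (1 - loss_rate) ** n_packets

segments = 1616          # a 2.3 MB file in 1,460-byte segments
for p in (0.0001, 0.001, 0.01):
    print(f"loss {p:.2%}: P(zero packets lost) = {p_all_survive(segments, p):.1%}")
```

Even at 0.1% loss, a multi-megabyte transfer almost certainly loses at least one packet, which is why "ask again if it fails" works for DNS but a file transfer needs a systematic retransmission mechanism.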
2. Reordering: Packets Arrive Scrambled
You send packets A, B, C, D, E, in that order. They arrive as C, A, E, B, D. How? Because each packet is independently routed. Your traceroute showed 6+ hops, and at each hop the router checks its routing table (a data structure inside every router that maps destination IP prefixes to outgoing network interfaces, updated dynamically by routing protocols such as BGP) and picks the best outgoing link at that instant. If a link goes down or gets congested between packet A and packet B (which can happen in milliseconds), packet B might take a completely different path: a longer one, or a shorter one.
The math: on a path with 14 hops (typical cross-country), each router independently forwards based on current congestion. A 5ms difference at just one hop means packets arrive out of order. On the real internet, reordering rates of 3-5% are common on long paths.
3. Duplication: The Same Packet Arrives Twice
A router forwards your packet, but doesn't get a link-layer acknowledgment from the next hop (maybe due to a brief radio glitch on Wi-Fi). So it retransmits. But the original did make it through β now the receiver gets two copies of the same packet. This seems harmless until you consider: what if that packet was a bank transfer instruction? Two copies = two transfers. Your $500 payment becomes $1,000.
Duplication can also happen when a route changes mid-connection. A router that held a copy of your packet in its buffer might forward it on the new route, while the original is already traveling the old route. Both arrive.
4. Corruption: Bits Flip in Transit
As your packet travels through copper wires, fiber optic cables, and radio waves, electromagnetic interference can flip individual bits. A 0 becomes a 1, or vice versa. The probability of a bit error on fiber is roughly 1 in 10^10 bits. That sounds rare, until you realize that at 1 Gbps, that's one corrupted bit every 10 seconds. On wireless (Wi-Fi, cellular), the error rate is orders of magnitude higher.
Ethernet has its own CRC checksum, a 32-bit Cyclic Redundancy Check computed over each Ethernet frame; if the CRC doesn't match at the receiver, the frame is silently dropped (the Ethernet layer doesn't retransmit). CRC catches most bit errors at the link layer. But "most" isn't "all." The undetected error rate is still roughly 1 in 10^18, and at scale (think Google processing 100+ petabytes per day) that means some corrupted data gets through. This is exactly why TCP has its own checksum on top of Ethernet's.
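The checksum TCP and UDP carry is the classic 16-bit ones'-complement Internet checksum (RFC 1071). A minimal sketch, omitting the pseudo-header that real TCP includes:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum used by IP, TCP, and UDP (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                # pad odd-length input with a zero byte
    # Sum the data as big-endian 16-bit words...
    total = sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
    while total >> 16:                 # ...then fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF             # ones' complement of the folded sum

segment = b"hello, tcp"
checksum = internet_checksum(segment)

# A single flipped bit changes the checksum, so the receiver detects it
# and discards the segment (the sender will retransmit after a timeout).
corrupted = bytes([segment[0] ^ 0x01]) + segment[1:]
print(internet_checksum(corrupted) == checksum)   # False: corruption caught
```

A nice property of this scheme: checksumming the data with the checksum appended yields zero, which is exactly the receiver's verification step.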
Notice the pattern? For the first three services from the exercise above (the email attachment, the bank website, the Stripe payment), these failures are catastrophic: corrupt files, broken pages, double charges. For the last two (Fortnite, Discord voice), they're barely noticeable. This split is the entire reason two protocols exist.
Given these four problems (loss, reordering, duplication, corruption), what mechanisms would YOU add to raw IP to make file transfer reliable? Think about: how does the sender know something was lost? How does the receiver reassemble the right order? How do you detect corruption? How do you prevent duplicates?
Think about how you'd handle this with physical mail. You might number each envelope, ask for a signed receipt, add a checksum total, and tell the post office "if I don't hear back in 3 days, send it again." That's basically TCP.

The Breakthrough: Two Opposite Solutions (1974 and 1980)
In the early 1970s, Vint Cerf and Bob Kahn faced the exact problems from Section 4. Cerf, often called "the father of the internet," co-designed TCP/IP while at Stanford and DARPA, received the Presidential Medal of Freedom in 2005, and later joined Google as VP and "Chief Internet Evangelist." Kahn was a program manager at DARPA, the US Defense Department's research agency. The ARPANET was unreliable: packets got lost, reordered, corrupted. Their solution, published in a 1974 paper titled "A Protocol for Packet Network Intercommunication", became TCP and laid the foundation for the modern internet. Six years later, David P. Reed realized that not every application needs all that reliability machinery, and created a stripped-down alternative: UDP.
Two brilliant, opposite philosophies:
You can see the difference with your own eyes. Open Wireshark, the free, open-source network protocol analyzer (wireshark.org), and capture traffic while browsing and making a Zoom call simultaneously; the contrast is stark. Wireshark captures packets on your network interface and lets you inspect every byte of every protocol. It's the stethoscope of network engineering, used by every network engineer, security analyst, and protocol developer.
Look at the difference. TCP's header is a carefully designed state machine: sequence numbers to track ordering, ACK numbers to confirm delivery, flags for connection management (SYN to start, FIN to end, RST to abort), a window size for flow control. It solves all four problems from Section 4. UDP's header is deliberately bare: two port numbers (so the OS knows which application gets the packet), a length field, and an optional checksum. Everything else (reliability, ordering, flow control) is the application's problem.
We said TCP adds reliability on top of unreliable IP. But all that reliability machinery (handshakes, ACKs, retransmission, reordering buffers) takes time and memory. Can you think of a situation where TCP's reliability actively makes things WORSE?
Hint: imagine a live multiplayer game where player positions update 60 times per second. What happens when TCP retransmits a 200ms-old position update that's already obsolete?

That's the fundamental tension. TCP's reliability is a feature for file downloads and a bug for real-time applications. A retransmitted game position from 200ms ago is worse than no position at all; by the time it arrives, the player has already moved. This is called head-of-line blocking: when TCP detects a lost packet, it holds ALL subsequent packets in a buffer until the missing one is retransmitted and arrives, so even perfectly delivered packets sit waiting, blocked by one lost packet at the head of the line. For real-time applications, this creates artificial latency spikes that are far worse than just skipping the lost data, and it's the single biggest reason real-time applications choose UDP.
Now let's dig into exactly how each protocol works: TCP's six clever mechanisms and UDP's deliberate simplicity.
How It Works: The Mechanics
TCP has six major mechanisms that turn an unreliable network into a reliable data stream. Think of them as a team: each one solves a specific problem we identified in Section 4. Then we'll look at UDP's intentional simplicity.
The 3-Way Handshake: "Let's Agree to Talk"
Before sending a single byte of real data, TCP makes both computers confirm they're ready. It's like calling someone on the phone: you dial (ring ring), they pick up ("hello?"), and you say "hey, it's me" β now you both know the connection is live.
The handshake has three steps, each using a special flag in the TCP header. If you fire up Wireshark and filter tcp.flags.syn==1, you'll see these flying by every time your browser opens a connection:
TCP's 3-way handshake costs one full round trip before any data can flow: the client must send its SYN and wait for the server's SYN-ACK before transmitting. On a 30ms RTT link, that is 30ms of dead air. On a 150ms link (Mumbai to Virginia), it's 150ms. If your app makes 10 sequential API calls to the same server (each on a new TCP connection), how much total time is spent just on handshakes? What technique would you use to avoid this waste?
Hint: Think about connection reuse, keep-alive, and connection pooling.

Those huge numbers (2,837,104,521) aren't made up; they're the Initial Sequence Numbers (ISNs) you'd actually see in Wireshark. Each side picks a random 32-bit number, and from then on, every byte of data gets numbered starting from there. Why random? Two reasons: (1) it prevents old packets from a previous connection being mistaken for new ones, and (2) it makes it harder for attackers to inject fake packets by guessing the numbers.
Sequence Numbers: The Math of Ordering
Remember the reordering problem from Section 4? Packets can arrive out of order because they take different routes through the network. TCP solves this by stamping every byte with a sequence number. The receiver uses these numbers like page numbers in a book: even if page 5 arrives before page 3, you know exactly where each one goes.
Here's the math that makes it work. Suppose the client's ISN is 1000 and it sends 3 segments of 500 bytes each:
The ACK number tells the sender: "I've received everything up to this byte; send me what comes next." So ack=2000 means "I have bytes 1000-1999, please send from byte 2000 onward." If the receiver gets segment 3 (seq=2000) but is missing segment 2 (seq=1500), it keeps sending ack=1500, a duplicate ACK, until the sender gets the hint and retransmits segment 2. After 3 duplicate ACKs, TCP triggers "fast retransmit": it resends the missing segment immediately, without waiting for the full timeout timer.
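The receiver's side of this is a simple computation: the ACK is always the first byte it has NOT yet received. A toy sketch of that cumulative-ACK logic, using the simplified numbering from the example above (ISN 1000, three 500-byte segments):

```python
def cumulative_ack(isn, received_segments):
    """Compute the ACK a TCP receiver would send: the first missing byte.
    received_segments is a set of (seq, length) pairs that have arrived."""
    arrived = dict(received_segments)           # seq -> segment length
    next_expected = isn
    while next_expected in arrived:
        next_expected += arrived[next_expected]  # advance over contiguous bytes
    return next_expected

# All three segments arrived: ACK everything, ask for byte 2500.
print(cumulative_ack(1000, {(1000, 500), (1500, 500), (2000, 500)}))  # 2500

# Segment 2 (seq=1500) is lost: even though seq=2000 arrived, the receiver
# keeps ACKing 1500 -- these repeats are the "duplicate ACKs".
print(cumulative_ack(1000, {(1000, 500), (2000, 500)}))               # 1500
```

Note how the ACK never advances past a gap: that property is what makes ordering reliable, and also what causes head-of-line blocking later on.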
Flow Control: "Slow Down, I Can't Keep Up!"
Imagine pouring water into a glass. If you pour faster than someone can drink, the glass overflows. The same thing happens with data: if a beefy server blasts gigabytes at a phone on a slow Wi-Fi connection, the phone's buffer fills up and data gets dropped.
TCP solves this with a receive window: a 16-bit field in every TCP ACK that tells the sender "I have this many bytes of free buffer space." (With the Window Scale option from RFC 7323, this effectively becomes a 30-bit field, supporting windows up to 1 GB.) In every ACK, the receiver advertises how much buffer space it has left. The sender never sends more unacknowledged data than the receiver's window allows.
The bandwidth-delay product (BDP) determines the ideal window size. The BDP is the amount of data that can be "in transit" on a network link at any moment, calculated as bandwidth x round-trip time; a 1 Gbps link with 50ms RTT has 6.25 MB in flight, and the TCP window needs to be at least this big to keep the pipe full. Here's the math:
This prevents a fast server from overwhelming a slow client. The beauty is that it's completely automatic: neither the sender nor receiver application has to do anything. But there's a subtlety: the default window size on many systems is too small for high-bandwidth, high-latency links. A 10 Gbps link across the Atlantic (100ms RTT) has a BDP of 125 MB, far larger than the default 4 MB window. That's why kernel tuning (net.core.rmem_max, net.core.wmem_max on Linux) matters at scale.
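The BDP formula is simple enough to verify in two lines. The two cases below reproduce the figures quoted in this section:

```python
def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: bytes that must be 'in flight' to keep
    the pipe full. bandwidth is in bits/second, so divide by 8."""
    return bandwidth_bps * rtt_seconds / 8

# 1 Gbps link, 50 ms RTT -> 6.25 MB in flight.
print(bdp_bytes(1e9, 0.050) / 1e6, "MB")
# 10 Gbps trans-Atlantic link, 100 ms RTT -> 125 MB, dwarfing default windows.
print(bdp_bytes(10e9, 0.100) / 1e6, "MB")
```

If the window is smaller than the BDP, the sender stalls waiting for ACKs and the link sits partially idle no matter how fast it is.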
Congestion Control: Slow Start, CUBIC, and BBR
Flow control prevents overwhelming the receiver. Congestion control prevents overwhelming the network itself. Think of it like traffic management on a highway. Even if the destination parking lot has plenty of space (big receive window), the highway between you might be jammed with everyone else's traffic.
TCP starts by sending data slowly, typically 10 segments, and gradually speeds up. (The initial congestion window was 1 segment in the original RFC, raised to 2-4 by RFC 3390, then to 10 by RFC 6928 in 2013 based on Google's research; most modern kernels default to 10.) This ramp-up phase is called slow start. Every time the sender gets an ACK, it increases the congestion window exponentially (1 → 2 → 4 → 8 → 16). The moment TCP detects a packet loss, it interprets that as "the network is full" and dramatically cuts back.
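The exponential ramp is easy to see in a toy model. This sketch tracks the congestion window per RTT (real stacks grow it per ACK, which works out to doubling each RTT), starting from the modern initial window of 10 segments:

```python
def slow_start(initcwnd=10, rtts=6):
    """Model TCP slow start: the congestion window (in segments) doubles
    every RTT until loss or ssthresh intervenes (not modeled here)."""
    cwnd, history = initcwnd, []
    for _ in range(rtts):
        history.append(cwnd)
        cwnd *= 2                     # exponential growth phase
    return history

print(slow_start())   # [10, 20, 40, 80, 160, 320]
```

Six round trips take the sender from 10 segments to 320 (roughly 450 KB per RTT at a 1,460-byte MSS), which is why short-lived connections often finish before ever leaving slow start.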
Over the decades, three major congestion control algorithms have dominated: Reno (the classic loss-based algorithm), CUBIC (the long-time Linux default), and BBR (Google's model-based algorithm that estimates bandwidth and RTT instead of reacting only to loss).
Connection Teardown and TIME_WAIT
When both sides are done exchanging data, TCP cleanly closes the connection with a 4-step process. Each side says "I'm done sending" (FIN) and the other acknowledges (ACK). Think of it like ending a phone call: "OK, I'm done." "Got it." "I'm done too." "Bye!"
The TIME_WAIT state deserves special attention because it causes real production headaches. After the client sends the final ACK, it holds the connection's IP:port pair reserved for 2x the Maximum Segment Lifetime, typically 60-120 seconds, so that any delayed packets from the old connection are absorbed rather than confusing a new connection that reuses the same ports. On a busy load balancer handling 10,000 connections per second, that's up to 1.2 million sockets stuck in TIME_WAIT simultaneously, each consuming kernel memory.
The standard mitigation on Linux: set net.ipv4.tcp_tw_reuse=1 to let new outbound connections reuse TIME_WAIT sockets (safe for clients). Never use the deprecated tcp_tw_recycle; it breaks behind NAT. Also consider SO_REUSEADDR on servers and connection pooling to reduce the number of connections opened/closed.
UDP: Deliberate Simplicity
Now compare all of TCP's machinery to UDP. UDP has... none of it. No handshake. No sequence numbers. No ACKs. No retransmission. No flow control. No congestion control. No teardown. The entire UDP "protocol" adds just 8 bytes of header: source port, destination port, length, and checksum.
UDP is intentionally simple. It gets your data out the door with the absolute minimum overhead. Every feature TCP provides (reliability, ordering, flow control), UDP deliberately omits, because some applications don't want them. A voice call would rather skip a lost audio frame than wait 200ms for a retransmission. A DNS query just resends the whole request if no response comes back in a second.
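The entire UDP programming model fits in a handful of lines. A minimal loopback sketch (addresses and the message are arbitrary examples): bind, sendto, recvfrom, done. There is no connection object because there is no connection.

```python
import socket

# Receiver: a bare UDP socket bound to an OS-chosen loopback port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))          # port 0: OS picks a free port
receiver.settimeout(2)                   # don't block forever if the datagram is lost

# Sender: no handshake, no connect() required. Fire and forget.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"frame 4721", receiver.getsockname())

data, peer = receiver.recvfrom(2048)     # one datagram in, boundaries preserved
print(data)
receiver.close()
sender.close()
```

On loopback this datagram always arrives; across a real network, it might not, and nothing in this code would ever know. That silence is the feature.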
If you need reliability on top of UDP, you build it yourself. And some applications do exactly that β QUIC (which powers HTTP/3) is built on top of UDP but adds its own reliability layer, smarter than TCP's, because Google can iterate on it without waiting for OS kernel updates.
TCP needs a 3-way handshake before sending data and a 4-way teardown to close. That's overhead of 7 packets just for setup and cleanup. If you only need to send one tiny message (like a DNS query of 50 bytes), is TCP's overhead worth it?
A DNS query over TCP = 3 handshake packets + 1 query + 1 response + 4 teardown packets = 9 packets for 50 bytes of data. Over UDP = 1 query + 1 response = 2 packets. This is exactly why DNS uses UDP by default.

Going Deeper: The Clever Details
Now that you understand the big picture, let's explore four TCP mechanisms that come up in system design interviews and real-world debugging. Each one is a clever optimization that solves a subtle problem.
Imagine you're sending letters and you can only send one at a time, waiting for a reply before mailing the next. That would be painfully slow, especially if the recipient is overseas. TCP's sliding window solves this by letting the sender have a whole batch of packets "in flight" simultaneously, without waiting for individual ACKs. The window size determines how many bytes can be outstanding before the sender must pause.
The effective window is the minimum of the receiver's advertised window and the congestion window. Even if the receiver says "I can handle 1 MB," the congestion window might only be at 64 KB during slow start. The bottleneck always wins.
The 3-way handshake adds one full round trip of latency before any data can be sent. (Typical RTTs: 1-5ms in a data center, 20-50ms within a region, 100-300ms cross-continent.) On a mobile connection with 100ms RTT, that's 100ms of dead air before any content starts flowing.
TCP Fast Open (TFO) lets the client embed data in the very first SYN packet. On the first connection, the server gives the client a cryptographic cookie. On subsequent connections, the client presents the cookie in its SYN and includes the HTTP request right there: the server can start processing before the handshake even finishes.
TFO is supported on Linux (kernel 3.7+), macOS (since 10.11), and Windows (since 10). Google measured a 4-7% reduction in HTTP transaction latency by enabling TFO on their servers. For repeat visitors on mobile networks, the savings are even larger.
HTTP/2 multiplexes 10 streams over a single TCP connection. One packet (carrying data for stream 3) gets lost. TCP blocks ALL 10 streams until that packet is retransmitted, roughly 200ms on a typical network. How many of the other 9 streams actually needed that packet? What percentage of the connection's throughput is wasted by this blocking?
Hint: The answer is why HTTP/3 exists. Think about what would happen if each stream managed its own reliability independently.

This is TCP's biggest weakness, and it's the main reason HTTP/3 moved to UDP. TCP guarantees that data arrives in order. That sounds great, until one packet gets lost.
Imagine you're loading a webpage over a single HTTP/2 connection. The browser is downloading images, CSS, and JavaScript as multiplexed streams. Packet 5 contains part of an image. Packet 6 contains CSS. Packet 7 contains JavaScript. If packet 5 gets lost, TCP blocks everything: packets 6 and 7 sit in the buffer, waiting for packet 5's retransmission. The CSS and JavaScript are right there, ready to go, but TCP won't deliver them out of order.
This single problem, head-of-line blocking, motivated the entire development of QUIC and HTTP/3. On a 2% packet loss network (typical for mobile), HTTP/2-over-TCP can be slower than HTTP/1.1 with multiple connections, because all those multiplexed streams share one TCP connection and block each other. QUIC gives each stream independent reliability.
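A toy model makes the blocking visible. The first function mimics TCP's strict in-order delivery from the packet-5/6/7 example above; the second shows the QUIC-style alternative, where a loss only stalls its own stream (stream names here are illustrative):

```python
def tcp_delivery(received_seqs, first=1):
    """TCP hands data to the app strictly in order: delivery stops at the
    first gap, even when later packets already sit in the receive buffer."""
    delivered, expected = [], first
    for seq in sorted(received_seqs):
        if seq == expected:
            delivered.append(seq)
            expected += 1
        # anything beyond the gap stays buffered, blocked by the hole
    return delivered

# Packets 1-7 sent on one connection; packet 5 (image data) is lost.
received = {1, 2, 3, 4, 6, 7}          # 6 carries CSS, 7 carries JavaScript
print(tcp_delivery(received))          # [1, 2, 3, 4] -- 6 and 7 are stuck

# Per-stream reliability (QUIC-style): a loss only blocks its own stream.
streams = {"image": [5], "css": [6], "js": [7]}
lost = {5}
for name, pkts in streams.items():
    status = "blocked" if any(p in lost for p in pkts) else "delivered"
    print(f"{name}: {status}")
```

In the TCP model, one lost packet stalls 100% of the connection; in the per-stream model, only the image stream waits for the retransmission.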
Here's a fun problem that's caused headaches for decades. You're typing in a terminal connected to a remote server. Each keystroke is 1 byte of data. But TCP adds a 20-byte header (plus IP's 20-byte header). So you're sending 41 bytes to deliver 1 byte of actual data: 97.5% overhead.
Nagle's algorithm (RFC 896, 1984, named after John Nagle, who invented it to fix the "small packet problem" on the early internet) fixes this by batching small outgoing messages: if there's already unacknowledged data in flight, TCP holds back small new messages and combines them into one bigger packet, sending when either (a) enough data accumulates to fill a segment, or (b) the ACK for the previous data comes back.
Separately, delayed ACKs are a receiver-side optimization: instead of sending an ACK immediately, the receiver waits up to 40ms, hoping to piggyback the ACK on a data packet going back to the sender. This saves an extra packet but adds latency if no reverse traffic exists.
Each optimization is sensible alone. Together, they create a nasty deadlock:
The sender is waiting for an ACK (Nagle), and the receiver is delaying the ACK (delayed ACK). They stare at each other for 40ms until the delayed ACK timer fires. For interactive applications, this creates mysterious 40ms latency spikes. The fix is simple: set TCP_NODELAY on the socket, which disables Nagle's algorithm. Every serious low-latency application (Redis, game servers, trading systems) does this.
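Applying the fix takes one setsockopt call. A minimal sketch with Python's stdlib; real code would do this right after creating (or accepting) the connection's socket:

```python
import socket

# Disable Nagle's algorithm so small writes go out immediately instead of
# waiting behind the previous segment's ACK (and the peer's delayed ACK).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it stuck.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY enabled:", nodelay != 0)
sock.close()
```

The trade-off: with Nagle off, many tiny writes mean many tiny packets, so latency-sensitive code usually pairs TCP_NODELAY with application-level batching of its own.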
Variations: TCP vs UDP vs The New Kids
TCP and UDP are the classic duo, but the networking world hasn't stood still. New protocols have emerged that borrow ideas from both. Let's compare them all.
TCP vs UDP vs QUIC: The Complete Comparison
| Feature | TCP | UDP | QUIC (HTTP/3) |
|---|---|---|---|
| Connection | 3-way handshake (1 RTT) | None (0 RTT) | 1 RTT first time, 0-RTT repeat |
| Reliability | Full (retransmit all losses) | None | Per-stream (independent) |
| Ordering | Strict byte ordering | None | Per-stream ordering |
| Head-of-line blocking | Yes (major problem) | N/A | No (streams are independent) |
| Encryption | Optional (TLS on top) | Optional (DTLS) | Mandatory TLS 1.3 |
| Congestion control | CUBIC / BBR | None | CUBIC / BBR (configurable) |
| Connection migration | No (tied to IP:port) | N/A | Yes (connection ID) |
| Implementation | OS kernel | OS kernel | Userspace (fast to update) |
| Header size | 20-60 bytes | 8 bytes | Variable (encrypted) |
| NAT/firewall support | Universal | Universal | ~95% (some block UDP) |
SCTP β The Telecom Protocol
SCTP (Stream Control Transmission Protocol, RFC 4960Published in 2007, SCTP was designed for telecom signaling. It combines the best of TCP and UDP: reliable delivery with message boundaries, multi-streaming, and multi-homing. Used extensively in 4G/5G core networks.) is a lesser-known protocol that was designed for telecom signaling. It has features that were ahead of its time:
- Multi-homing β A single connection can span multiple IP addresses. If one network path fails, traffic seamlessly switches to another. Perfect for always-on telecom infrastructure.
- Multi-streaming β Like QUIC, SCTP supports independent streams within one connection, avoiding head-of-line blocking. QUIC essentially rediscovered this idea.
- Message-oriented β Unlike TCP's byte stream, SCTP preserves message boundaries (like UDP) while providing reliability (like TCP). You send a 500-byte message, the receiver gets exactly a 500-byte message.
- 4-way handshake β SCTP uses a 4-step handshake with a cookie to prevent SYN flood attacks. TCP adopted a similar idea later with SYN cookies.
In practice, SCTP is rarely used on the public internet because many NATNetwork Address Translation β what your home router does to share one public IP among multiple devices. NAT devices understand TCP and UDP packet formats, but most don't recognize SCTP, silently dropping it. devices and firewalls don't support it. But it's critical in telecom β 4G/5G core networks, SS7 signaling, and Diameter authentication all use SCTP.
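Python's standard library has no portable SCTP support, but the message-boundary property described above is easy to feel with plain UDP β a minimal sketch on loopback (assuming local delivery; SCTP gives you the same boundaries plus reliability):

```python
import socket

# One process, two UDP sockets on loopback: each sendto() is one
# message, and each recvfrom() returns exactly one whole message.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
recv_sock.settimeout(2.0)
addr = recv_sock.getsockname()

send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.sendto(b"x" * 500, addr)    # one 500-byte message
send_sock.sendto(b"y" * 300, addr)    # a second, independent message

first, _ = recv_sock.recvfrom(2048)   # exactly the 500-byte message
second, _ = recv_sock.recvfrom(2048)  # exactly the 300-byte message
print(len(first), len(second))        # boundaries preserved: 500 300

send_sock.close()
recv_sock.close()
```

Run the same two sends over a TCP socket and the receiver may get 800 bytes in one `recv()` β the stream has no boundaries.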
DCCP β Congestion-Controlled UDP
DCCP (Datagram Congestion Control Protocol, RFC 4340Published in 2006, DCCP was designed for applications that need congestion control but NOT reliability β like streaming media. It adds congestion control to datagram delivery without the overhead of retransmission and ordering.) occupies an interesting middle ground. Think of it as "UDP with congestion control but without reliability." It was designed for streaming media applications that need to be network-friendly (not flood the network) but don't need retransmission.
DCCP's key idea: applications using raw UDP can accidentally (or maliciously) flood the network because UDP has no congestion control. DCCP adds just congestion control β it backs off when the network is congested β while keeping the datagram delivery model. Lost packets are still lost, but the sender adjusts its rate to be fair to other traffic.
QUIC is built on top of UDP, and QUIC adds its own reliability. Why not just improve TCP instead? What does building on UDP give QUIC that TCP couldn't?
Think about who controls the protocol implementation. TCP is in the OS kernel. Changing TCP requires OS updates across billions of devices β years. QUIC runs in userspace (Chrome, nginx) and can be updated with a browser push in weeks.

At Scale β How the Giants Use TCP & UDP

Theory is great, but let's see how real companies make these decisions at massive scale. Each story reinforces the core trade-off: reliability vs speed β and sometimes the surprising answer is "both."
Netflix β How BBR Changed Streaming
Netflix delivers roughly 15% of all internet traffic worldwide. Their servers run on Open ConnectNetflix's custom CDN. Instead of using Akamai or Cloudflare, Netflix places its own servers (called Open Connect Appliances) directly inside ISP data centers worldwide. Each box is a FreeBSD server with 100+ TB of SSDs loaded with popular content for that region., a custom CDN with servers inside ISPs globally. The surprising part? Netflix streams video over TCP, not UDP.
Why TCP for video? Netflix pre-buffers 30-60 seconds ahead. TCP retransmission delays (200ms worst case) are invisible when you have a 30-second cushion. And TCP's reliability means no corrupted frames or glitchy playback.
But Netflix's real innovation was adopting BBR for congestion control. Before BBR, Netflix used CUBIC β a loss-based algorithm. On the last-mile connection to your home (Wi-Fi, shared cable), random packet loss is common even when the link isn't congested. CUBIC interpreted every loss as congestion and slashed the sending rate, causing buffering events.
After switching to BBR, Netflix reported a 4% improvement in video quality and significantly fewer rebuffering events, especially in regions with lossy last-mile connections (India, Southeast Asia, parts of Latin America). The takeaway: the congestion control algorithm matters as much as the protocol choice.
Fortnite β Custom UDP at 30 Million Concurrent Players
Epic Games' Fortnite peaked at 30+ million concurrent players. The game server sends position updates, combat events, and world state to up to 100 players per match, 30 times per second. That's 3,000 packets per second per server β and each server runs multiple matches.
Fortnite uses a custom protocol built on top of UDP, with a clever twist: two reliability channels.
Position updates (90% of traffic) are sent unreliably β if update #47 is lost, update #48 has newer data anyway. But critical events (eliminations, loot, storm changes) go through the reliable channel with custom ACK and retransmit logic. The sequence number on every packet lets the client detect and discard stale position data even if packets arrive out of order.
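The stale-discard logic for the unreliable channel can be sketched in a few lines of Python (the class and method names here are illustrative, not Epic's actual code):

```python
class PositionChannel:
    """Unreliable channel: keep only the newest position by sequence number."""

    def __init__(self) -> None:
        self.latest_seq = -1
        self.latest_pos = None

    def on_packet(self, seq: int, pos: tuple) -> bool:
        """Apply an update only if it is newer than what we already have."""
        if seq <= self.latest_seq:
            return False                 # stale or duplicate: drop it
        self.latest_seq, self.latest_pos = seq, pos
        return True

chan = PositionChannel()
chan.on_packet(46, (10, 20))             # applied
chan.on_packet(48, (12, 22))             # applied
late = chan.on_packet(47, (11, 21))      # delayed packet arrives out of order
print(late, chan.latest_pos)             # False (12, 22) β stale update discarded
```

No retransmission, no blocking β a late update is simply ignored, because a newer one already replaced it.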
Discord β TCP for Text, UDP for Voice
Discord is a textbook example of using both protocols in the same app. Over 200 million monthly active users generate two very different types of traffic:
Discord has 200 million monthly active users. Voice channels use UDP for real-time audio. Text chat uses TCP via WebSocket. What would happen if Discord used TCP for voice calls too? On a typical home Wi-Fi connection with 0.5% packet loss, how often would a voice call freeze?
At 50 packets/sec for voice audio and 0.5% loss, roughly 1 packet is lost every 4 seconds. Each TCP retransmission adds ~200ms. Would you notice a 200ms pause every 4 seconds in a voice call?

Text messages, friend lists, server data, and reactions all go over a persistent WebSocketA protocol that upgrades an HTTP/TCP connection to a full-duplex, persistent channel. The initial HTTP request says "upgrade to WebSocket," and then both sides can send data freely without the overhead of HTTP request/response cycles. Perfect for real-time chat. connection (TCP). Every message must arrive perfectly and in order β you can't have chat messages disappearing or arriving scrambled.
Voice chat runs over UDP with OpusAn open-source audio codec designed for interactive speech and music. It can operate at bitrates from 6 kbps to 510 kbps with latency as low as 2.5ms. Used by Discord, Zoom, WhatsApp, and most modern VoIP applications. codec. When you're in a voice channel, audio packets fly via UDP at ~64 kbps per user. If a packet is lost, you hear a tiny click or millisecond gap β but the conversation keeps flowing. Discord also implements a TCP fallback for networks that block UDP (some corporate firewalls).
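The voice-loss arithmetic above checks out β a quick sketch:

```python
# Back-of-envelope check of the Discord voice numbers:
packet_rate = 50        # voice packets per second
loss_rate = 0.005       # 0.5% packet loss on typical home Wi-Fi

losses_per_second = packet_rate * loss_rate
seconds_between_losses = 1 / losses_per_second
print(seconds_between_losses)   # one lost packet every ~4 seconds
```

Over UDP that's a millisecond click every few seconds; over TCP it would be a ~200ms freeze every few seconds.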
Google QUIC β UDP Powering 35% of Global Web Traffic
Google invented QUIC in 2013 (Jim Roskind) and has been rolling it out ever since. As of 2025, QUIC/HTTP/3 handles roughly 35% of all web traffic β powered by Chrome (3+ billion users) talking to Google servers, Cloudflare, and Meta.
The numbers tell the story of why Google invested years in this:
The deployability advantage is key. TCP is in OS kernels β changing TCP behavior requires Windows/macOS/Linux updates across billions of devices. QUIC runs in userspace (inside Chrome, inside Cloudflare's edge servers). Google can ship QUIC improvements to 3 billion Chrome users in a single browser update. That's why QUIC was built on UDP instead of modifying TCP.
The Anti-Lesson β Myths That Cost Real Money
Knowing when to use TCP vs UDP is important. Knowing the myths that lead people to choose wrong is even more valuable. These are real misconceptions that have caused real production outages.
The myth: TCP is reliable, so it's always the safe choice. Just use TCP for everything.
The reality: A team built a multiplayer mobile game using TCP because "reliability is always better." Every time a player on a flaky mobile connection lost a packet, the game froze for 200-500ms waiting for retransmission. Players on the subway (5-10% packet loss) reported the game was "unplayable." The 200ms-old position data that TCP diligently retransmitted was already obsolete β the player had moved.
The fix: Switch to UDP for game state updates (positions, rotations), keep TCP for chat and leaderboards (data that must be perfect). The team saw player retention jump 23% after the switch.
The lesson: TCP's reliability is a feature for data that must be perfect, and a bug for data that expires faster than it can be retransmitted.
The myth: UDP is only for real-time media. "Serious" applications should always use TCP.
The reality: Some of the most important internet infrastructure runs on UDP. DNS uses UDP for queries (billions per day). NTP (network time synchronization) uses UDP. DHCP (how your device gets an IP address) uses UDP. Service mesh tools like ConsulHashiCorp's service discovery tool. Uses UDP-based gossip protocol for health checking and membership management across a cluster. The gossip protocol can tolerate lost messages because each piece of information is sent multiple times through different paths. use UDP-based gossip protocols for health checking.
And the biggest surprise: HTTP/3 runs on UDP. 35% of all web traffic β your Google searches, YouTube videos, Cloudflare-protected websites β uses UDP underneath, via QUIC. The line between "UDP applications" and "TCP applications" has blurred dramatically.
The lesson: UDP is for any situation where TCP's overhead isn't worth it β whether that's real-time media, tiny request-response exchanges (DNS), high-frequency telemetry (IoT sensors), or modern web protocols (QUIC).
The myth: Need real-time features? Just use WebSockets. They solve everything.
The reality: WebSocket is a great protocol β for what it does. It gives you a persistent, full-duplex TCP connection. Perfect for chat, notifications, collaborative editing, live dashboards. But it's still TCP underneath. That means it still suffers from head-of-line blocking, it still needs a handshake, and it still retransmits lost packets.
A fintech team used WebSockets for their real-time stock price feed. It worked well for most users. But during market volatility (high traffic, network congestion), TCP's congestion control kicked in and delayed price updates by 200-800ms. For a trading platform, stale prices can cause wrong decisions. They switched the price feed to UDP multicast with application-level sequencing, and latencies dropped to single-digit milliseconds.
The lesson: WebSockets are great for bidirectional, reliable real-time data. But they're not a silver bullet. For latency-critical applications where stale data is worse than lost data (gaming, trading, voice), you need UDP-based solutions. And for simple server-to-client streaming (live scores, news feeds), Server-Sent EventsA simple HTTP-based protocol where the server pushes text events to the client over a long-lived connection. Easier than WebSockets (no upgrade handshake, auto-reconnect built in) but only supports the server→client direction. (SSE) are simpler and more reliable than WebSockets.
Common Mistakes β What Even Senior Engineers Get Wrong
These aren't hypothetical misunderstandings. They're mistakes we've seen in real production code, real architecture reviews, and real interview answers. Each one comes with a concrete way to prove it wrong on your own machine.
The mistake: People hear "3-way handshake" and "acknowledgments" and conclude TCP is inherently slow compared to UDP.
Why it's wrong: Only the connection setup is slow β one round-trip (SYN β SYN-ACK β ACK) adds ~1× RTT of delay before any data flows. After that, TCP is just as fast as the link allows. The ACKs fly back asynchronously while data keeps streaming. On a 1 Gbps LAN, TCP saturates the link β 940+ Mbps of actual throughput.
Prove it yourself:
# On server machine:
iperf3 -s
# On client machine:
$ iperf3 -c 192.168.1.50 -t 10
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.00 sec 1.09 GBytes 938 Mbits/sec
# β TCP saturated a 1 Gbps link. "Slow" where?
# Compare with UDP:
$ iperf3 -c 192.168.1.50 -u -b 1G -t 10
[ ID] Interval Transfer Bitrate Jitter Lost/Total
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec 0.015ms 12/765028 (0%)
# β Virtually identical throughput. TCP's overhead is negligible for bulk data.
When this actually matters: The handshake cost adds up if you're opening thousands of short-lived connections. That's why HTTP keep-alive and connection pooling exist β amortize the handshake across many requests.
The mistake: Equating "unreliable" with "unusable" or "low quality."
Why it's wrong: "Unreliable" means the protocol makes no delivery guarantees. It doesn't mean your application can't. QUIC (built on UDP) is arguably more reliable than TCP β it has independent stream-level retransmission, so a lost packet in one stream doesn't block other streams. Google built QUIC on UDP precisely because UDP lets you implement better reliability than TCP offers.
The real distinction is where reliability lives:
Bottom line: UDP is unreliable the way a blank canvas is unpainted. It's a starting point, not a limitation.
The mistake: Stack Overflow answers saying "always set TCP_NODELAY" without understanding what Nagle's algorithmInvented by John Nagle in 1984 (RFC 896). It buffers small outgoing TCP segments and combines them into fewer, larger packets. This reduces the number of tiny packets on the network (the "small packet problem") but adds up to 200ms of delay for small writes. Named after its creator, who was trying to fix congestion on the Ford Motor Company network. actually does.
Why it's wrong: Nagle's algorithm (John Nagle, 1984, RFC 896) buffers small writes and combines them into larger segments. For bulk transfers β file uploads, database replication, log shipping β this is exactly what you want. Sending 1,000 individual 10-byte writes as 1,000 separate packets wastes bandwidth on headers (40 bytes of TCP/IP overhead per packet). Nagle combines them into ~7 full-sized segments.
When to disable it: Only when you need low latency for small messages β like a keystroke in an SSH session, a mouse movement in a game, or a real-time trading order. In those cases, even 200ms of Nagle buffering is unacceptable.
# Linux β show TCP socket options including Nagle status
$ ss -ti dst 142.250.80.46
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 192.168.1.42:54312 142.250.80.46:443
cubic wscale:7,7 rto:204 rtt:18.2/4.5 mss:1360 rcvmss:1360
advmss:1460 cwnd:10 bytes_sent:1842 bytes_acked:1843
# β No "nodelay" flag means Nagle IS enabled (the default)
The mistake: Treating connection reuse as a nice-to-have optimization instead of a hard requirement at scale.
Why it's wrong: Let's do the math. A TCP 3-way handshake takes 1 round-trip β typically 1-3ms on a LAN, 30-80ms cross-region. TLS adds another 1-2 round-trips. At 10,000 requests/second with a fresh connection each time:
TCP handshake: ~1.5ms per connection = 15 seconds of cumulative handshake time per second
TLS on top: ~3ms per connection = 30 more seconds per second
Total: 45 seconds of CPU time spent just shaking hands every second.
That's 45× your available time budget β physically impossible without pooling.
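The arithmetic above, spelled out in Python:

```python
# Connection-churn cost if every request pays a fresh handshake:
requests_per_second = 10_000
tcp_handshake_ms = 1.5      # ~1 RTT for SYN / SYN-ACK / ACK
tls_handshake_ms = 3.0      # 1-2 more RTTs for TLS on top

total_ms = requests_per_second * (tcp_handshake_ms + tls_handshake_ms)
print(total_ms / 1000, "seconds of handshake work per wall-clock second")
# 45.0 β 45x the one second of real time you actually have
```

The only way out is amortization: one handshake, thousands of requests.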
Connection pooling reuses established TCP connections across requests. One handshake serves hundreds or thousands of requests. Every serious HTTP client, database driver, and service mesh does this by default β HttpClient in .NET, pgBouncer for PostgreSQL, Envoy's connection pools.
# Watch connection states β ESTABLISHED = reused, TIME_WAIT = wasted
$ ss -s
Total: 847
TCP: 423 (estab 89, closed 284, orphaned 3, timewait 284)
# β 284 TIME_WAIT = 284 connections opened and closed recently
# That's wasted handshakes. A pool would keep them ESTABLISHED.
The mistake: Ignoring the receive windowThe amount of data a TCP receiver is willing to accept before the sender must pause and wait for an acknowledgment. Think of it as a "speed limit" β the sender can't have more than window-size bytes in flight at once. The default was 64 KB for decades, now dynamically scaled via the window scale option (RFC 7323). when designing systems that transfer large amounts of data across high-latency links.
Why it's wrong: TCP can only have window_size bytes in flight before it must wait for an ACK. The maximum throughput is governed by the bandwidth-delay productBDP = bandwidth Γ round-trip time. It tells you how many bytes are "in the pipe" at any given moment. To keep the pipe full, your TCP window must be at least as large as the BDP. If it's smaller, the sender idles waiting for ACKs, and you waste bandwidth. (BDP):
# Linux β show window scale factor and current window
$ ss -ti dst 142.250.80.46
ESTAB 0 0 192.168.1.42:54312 142.250.80.46:443
cubic wscale:7,7 rto:204 rtt:18.2/4.5 cwnd:10 ssthresh:28
# wscale:7,7 β scale factor 2^7 = 128 for both directions
# Effective max window = 65535 × 128 = 8,388,480 bytes ≈ 8 MB
# At 18ms RTT: max throughput = 8 MB / 0.018s ≈ 3.7 Gbps
Real impact: A team replicating a database across US-East to US-West (80ms RTT) with default 64 KB windows was getting 6 Mbps on a 10 Gbps link. Tuning the window to 4 MB jumped throughput to 400 Mbps β a 67× improvement from one kernel parameter.
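A quick sketch of the window math behind that story (using the link numbers assumed above):

```python
def throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """TCP's window-imposed throughput ceiling: window / RTT, in bits/s."""
    return window_bytes * 8 / rtt_s

rtt = 0.080  # 80 ms US-East -> US-West

print(throughput_bps(64 * 1024, rtt) / 1e6)        # ~6.6 Mbps (default 64 KB window)
print(throughput_bps(4 * 1024 * 1024, rtt) / 1e6)  # ~419 Mbps (tuned 4 MB window)

# Window needed to actually fill a 10 Gbps pipe: BDP = bandwidth x RTT
bdp_bytes = (10e9 / 8) * rtt
print(bdp_bytes / 1e6, "MB window required")       # ~100 MB
```

Even the tuned 4 MB window leaves 96% of that 10 Gbps link idle β the BDP, not the NIC, sets the ceiling.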
The mistake: Using HTTP/REST for all service-to-service communication, even internal microservices doing millions of requests per second.
Why it's wrong: HTTP/1.1 adds significant overhead per request β headers (200-800 bytes of text), chunked transfer encoding, content-type negotiation, cookie handling. For a 50-byte internal RPC payload, HTTP headers can be 10× larger than the actual data.
At scale, this overhead adds up fast.
The right approach: HTTP/REST is perfect for external APIs, browser-facing endpoints, and moderate-traffic internal services. For high-throughput internal communication (1M+ RPS), use gRPC (binary framing, multiplexing, header compression) or a custom binary protocol. That's exactly what Google, Meta, and Netflix do internally β gRPC was built at Google to replace their internal HTTP/JSON services.
Interview Playbook β Choosing TCP vs UDP Under Pressure
In system design interviews, protocol choice often comes up as a side question while you're designing something bigger β a chat app, a video platform, a gaming backend. The interviewer wants to see that you can reason about tradeoffs, not just recite definitions. Here's how to handle it at each career level.
What They'll Ask
"Explain the difference between TCP and UDP."
What to Say
"TCP sets up a connection first β a 3-way handshake β and then guarantees every byte arrives in order. If something gets lost, it retransmits. UDP skips all of that. It just fires packets with no connection, no ordering, no retransmission. TCP is like registered mail with tracking. UDP is like shouting across a room β fast, but no guarantees."
Bonus Point
Mention a real example: "Web browsers use TCP because every byte of HTML matters. Zoom uses UDP because a 200ms-old audio packet is useless by the time it's retransmitted."
What They'll Ask
"You're designing a multiplayer game. Which protocol do you use?"
Sample Answer
"I'd use both. Player position updates are sent 30-60 times per second and each one replaces the last β so losing one is fine. I'd send those over UDP. But for critical game events like scoring, inventory changes, or chat messages, those must arrive perfectly and in order, so I'd use TCP for those. This is actually what Fortnite and Valorant do β a hybrid approach."
The Key Insight to Demonstrate
Show that you don't think in binary (TCP OR UDP) but in terms of data characteristics: does this data expire? Is loss tolerable? Does order matter? Different data in the same system can use different protocols.
What They'll Ask
"You're designing a real-time collaborative document editor (like Google Docs). Walk me through the protocol choices."
Sample Answer
"For the editing session, I'd use a WebSocket connection β that's TCP underneath, which gives us reliable, ordered delivery for document operations. Every keystroke and cursor movement needs to arrive and be applied in the correct order, otherwise the document state diverges between users."
"But I'd consider the scale implications. With 100M concurrent editors, that's 100M persistent TCP connections. Each connection consumes kernel memory (3-10 KB for socket buffers). I'd use connection-aware load balancers and consider QUIC/HTTP/3 for the transport β it gives us reliable delivery but with better multiplexing and connection migration when users switch between Wi-Fi and cellular."
"For presence indicators (who's online, cursor positions of other users), I'd evaluate whether those can tolerate some loss. Cursor positions update 10-30 times per second and are immediately stale β that's a UDP candidate. But the operational transforms for actual text changes must be TCP-reliable."
Hands-On Exercises β Prove It on Your Own Machine
Reading about TCP and UDP is useful. Watching them in action is where it clicks. These exercises use real tools β Wireshark, ss, dig, Python sockets β so you're not just memorizing theory, you're building muscle memory with the same tools network engineers use daily.
Open Wireshark, start a capture, then open any website in your browser. Filter for tcp.flags.syn==1 to see every new TCP connection being established. You'll see the SYN β SYN-ACK β ACK dance happening in real time β dozens of them for a single page load.
- Download and install Wireshark (free, available on all platforms).
- Open Wireshark, select your network interface (usually "Wi-Fi" or "Ethernet"), and click the blue shark fin to start capturing.
- In the display filter bar at the top, type tcp.flags.syn==1 && tcp.flags.ack==0 β this shows only the initial SYN packets (the "hello" that starts every TCP connection).
- Open your browser and go to https://google.com.
- You'll see SYN packets appear. Click one, then look at the "Transmission Control Protocol" section in the packet details β you'll see sequence numbers, window size, and all the TCP options negotiated during the handshake.
- Change the filter to tcp.stream eq 0 (replace 0 with your stream number) to see the full SYN β SYN-ACK β ACK β data flow for one connection.
What to look for: The "Time" column shows how long each step takes. The difference between the SYN and the SYN-ACK is your RTT to that server.
ss -ti β Socket Statistics (Intermediate)
The ss command (socket statistics, Linux) shows you the internal state of every TCP connection β round-trip time, congestion window, retransmission count, and more. It's like an X-ray of your TCP stack.
# Open a connection first (keep a browser tab open to any site)
# Then inspect all TCP connections with internal details:
$ ss -ti
ESTAB 0 0 192.168.1.42:48920 142.250.80.46:443
cubic β congestion control algorithm
wscale:7,7 β window scale factor (2^7 = 128×)
rto:204 β retransmission timeout: 204ms
rtt:18.2/4.5 β RTT: 18.2ms avg, 4.5ms deviation
mss:1360 β max segment size (payload per packet)
cwnd:10 β congestion window: 10 segments
bytes_sent:4821 β total bytes sent on this connection
bytes_acked:4822 β total bytes acknowledged
segs_out:42 β segments sent
segs_in:38 β segments received
send 5.98Mbps β estimated send rate
# Key insight: cwnd Γ mss = bytes in flight
# 10 × 1360 = 13,600 bytes. At 18ms RTT:
# max throughput = 13600 / 0.018 = 755 KB/s β 6 Mbps
Try this: Run ss -ti while downloading a large file. Watch cwnd grow from 10 to hundreds as TCP's slow start ramps up throughput.
DNS normally uses UDP (one question, one answer, fits in a single packet). But you can force it to use TCP with the +tcp flag. Compare the timing to see the handshake overhead in action.
# UDP DNS query (the default)
$ dig @8.8.8.8 google.com | grep "Query time"
;; Query time: 22 msec
# TCP DNS query (forced with +tcp)
$ dig +tcp @8.8.8.8 google.com | grep "Query time"
;; Query time: 68 msec
# The difference (~46ms) is TCP overhead: the SYN β SYN-ACK β ACK
# handshake must complete before the query is even sent β at least
# one extra round-trip (~22ms) β plus connection setup and teardown
# work on both ends.
Why DNS uses UDP by default: A typical DNS response is 50-500 bytes β well under the 512-byte limit that fits in a single UDP packet. No handshake needed, no teardown, just one packet out and one back. At billions of DNS queries per day globally, saving one RTT per query is enormous.
When DNS uses TCP: Zone transfers (copying the entire DNS database between servers) and responses larger than 512 bytes (like DNSSEC signatures) switch to TCP because they need reliability and can exceed one packet.
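To see just how little a DNS query needs, here's a sketch that builds a raw A-record query in wire format (RFC 1035) with nothing but the standard library β the whole question fits comfortably in one UDP datagram:

```python
import struct

def build_dns_query(hostname: str, txid: int = 0x1234) -> bytes:
    """Build a raw DNS A-record query in RFC 1035 wire format."""
    # Header: ID, flags (RD=1 -> 0x0100), QDCOUNT=1, then three zero counts
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode() for label in hostname.split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)

query = build_dns_query("google.com")
print(len(query))   # 28 bytes: 12-byte header + 16-byte question

# To actually send it (requires network access):
#   import socket
#   s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   s.sendto(query, ("8.8.8.8", 53))
#   response, _ = s.recvfrom(512)   # the classic 512-byte UDP DNS limit
```

28 bytes of question, one packet out, one packet back β no handshake could possibly pay for itself here.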
Build both a TCP and UDP echo server, send messages to each, and observe the fundamental difference: TCP requires connect β send β receive β close, while UDP just fires a packet and (hopefully) gets one back.
import socket
# TCP: must listen, accept a connection, THEN exchange data
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM) # SOCK_STREAM = TCP
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('0.0.0.0', 9001))
server.listen(5) # β backlog: queue up to 5 pending connections
print("TCP echo server listening on :9001")
while True:
conn, addr = server.accept() # β BLOCKS until a client connects (handshake)
print(f"TCP connection from {addr}")
data = conn.recv(1024) # β receive up to 1024 bytes
print(f" Received: {data.decode()}")
conn.sendall(data) # β echo it back (TCP guarantees delivery)
conn.close() # β 4-way teardown (FIN β ACK β FIN β ACK)
import socket
# UDP: no listen, no accept, no connection. Just bind and receive.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) # SOCK_DGRAM = UDP
server.bind(('0.0.0.0', 9002))
print("UDP echo server listening on :9002")
while True:
data, addr = server.recvfrom(1024) # β receive a datagram + sender address
print(f"UDP datagram from {addr}: {data.decode()}")
server.sendto(data, addr) # β echo it back (no connection needed)
# No close() β there's no connection to close!
# Terminal 1: start TCP server
$ python tcp_echo_server.py
# Terminal 2: start UDP server
$ python udp_echo_server.py
# Terminal 3: test TCP (nc = netcat)
$ echo "Hello TCP" | nc localhost 9001
Hello TCP β echoed back after handshake
# Terminal 3: test UDP (nc -u for UDP mode)
$ echo "Hello UDP" | nc -u localhost 9002
Hello UDP β echoed back instantly, no handshake
# Compare with timing:
$ time echo "Hello" | nc localhost 9001 # TCP: ~2-5ms (includes handshake)
$ time echo "Hello" | nc -u localhost 9002 # UDP: ~0.5-1ms (no handshake)
Notice the difference: The TCP server needs listen(), accept(), and close(). The UDP server just binds and receives β no connection lifecycle at all. This is the fundamental architectural difference in code.
TCP's biggest weakness β head-of-line blocking β is easy to describe but hard to believe until you see it. In this exercise, you'll send 100 TCP segments, simulate a loss at segment #50, and measure how long segments #51-100 are delayed.
tc netem (Linux) + Python
# Step 1: Add 5% packet loss on the loopback interface
# (requires Linux with root access)
$ sudo tc qdisc add dev lo root netem loss 5%
# Step 2: Run a TCP sender that sends 100 numbered messages
$ python3 -c "
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 9001))
for i in range(100):
msg = f'Segment {i:03d} sent at {time.monotonic():.6f}\n'
s.sendall(msg.encode())
time.sleep(0.01) # 10ms between sends
s.close()
"
# Step 3: On the receiver, log arrival times and look for gaps.
# When segment ~50 is dropped, TCP retransmits it (~200ms RTO).
# Segments 51-100 arrive in the kernel buffer but the APPLICATION
# can't read them until #50 is retransmitted and fills the gap.
# You'll see a ~200ms pause in the arrival log around that point.
# Step 4: Clean up
$ sudo tc qdisc del dev lo root netem
What you'll observe: Most segments arrive 10ms apart (your send interval). But when a loss occurs, you'll see a ~200ms gap in the application-level receive timestamps β even though segments after the lost one had already arrived at the kernel level. That's head-of-line blocking. The application is forced to wait because TCP guarantees in-order delivery.
Why this matters: This is exactly why HTTP/2 over TCP suffers when multiplexing multiple streams β a loss in one stream blocks ALL streams. QUIC fixes this by running each stream independently over UDP.
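If you can't run tc netem, the same head-of-line-blocking effect can be modeled in pure Python β a sketch assuming the 10ms send interval and ~200ms RTO from the experiment above:

```python
# 100 segments sent 10 ms apart; segment 50 is lost and its retransmission
# arrives one RTO (~200 ms) later. The application reads strictly in order,
# so later segments sit in the kernel buffer until #50 fills the gap.
SEND_INTERVAL = 0.010
RTO = 0.200
LOST = 50

# When each segment reaches the receiver's kernel
arrival = {
    i: i * SEND_INTERVAL + (RTO if i == LOST else 0.0)
    for i in range(100)
}

# In-order delivery: segment i reaches the app once all of 0..i have arrived
deliver, ready = {}, 0.0
for i in range(100):
    ready = max(ready, arrival[i])
    deliver[i] = ready

blocked = [i for i in range(100) if deliver[i] - i * SEND_INTERVAL > 1e-9]
print(blocked[0], blocked[-1], len(blocked))  # segments 50-69 stall: 20 of them
```

One lost packet delays 20 already-delivered segments β that's head-of-line blocking in miniature, and exactly what QUIC's independent streams avoid.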
Cheat Sheet β Pin This to Your Wall
Six cards. Everything you need to pick the right protocol, debug connection issues, and impress in interviews. Screenshot these.
| TCP | UDP |
|---|---|
| Connected | Connectionless |
| Ordered | Unordered |
| Reliable | Best-effort |
| Flow control | None |
| 20-60B header | 8B header |
| Streams | Datagrams |
Choose TCP when:
- Every byte must arrive
- Order matters (file, web)
- You need flow control
- Long-lived connections
- Error detection + recovery

Examples: HTTP, SSH, SMTP, DB connections, FTP
Choose UDP when:
- Speed > perfection
- Data expires fast
- Loss is tolerable
- Small, independent msgs
- Multicast/broadcast needed

Examples: DNS, VoIP, gaming, QUIC/HTTP3, NTP, DHCP
Key formulas:
- Handshake cost = 1 RTT (TCP) + 1-2 RTT (TLS)
- Max throughput = window_size ÷ RTT
- BDP = bandwidth × RTT (bytes that fit in the pipe)
- Nagle delay ≤ 200ms

Essential commands:
- ss -ti β TCP internals
- ss -s β socket summary
- netstat -an β all connections
- tcpdump -i any β packet capture
- dig +tcp β force TCP DNS
- iperf3 -c HOST β throughput test
- ping -c 100 β measure loss
"TCP trades latency for correctness." "UDP trades correctness for latency." "Default to TCP. Switch to UDP only when data expires faster than retransmission."
Connected Topics β Where to Go Next
TCP and UDP are the transport layer β the foundation everything else sits on. Here's what builds directly on what you've learned.