Posted on June 28, 2019 by Jon-Eric Eliker
Oracle Cloud Infrastructure (OCI) provides default DNS name resolution that requires no configuration and names OCI resources following the pattern "instancename.subnetlabel.vcnlabel.oraclevcn.com". Even so, it is common to supplement or override that behavior with a custom DNS implementation hosted in OCI or elsewhere. If you intend to host any portion of your DNS solution in OCI (either the DNS resolvers themselves or a proxy tool like dnsmasq), you will experience multiple seconds of latency per DNS name resolution unless you configure stateless rules to manage DNS requests and responses for both your Application Subnets in OCI and the Subnet that contains your DNS servers or proxies.
Consider the advice below that builds on guidance shared in these three related blogs:
As the article "How to Remediate Application Slowness Due To Incomplete DNS Resolutions" explains in detail, DNS request latency (evident as a five-second delay for each DNS request) affects any application that uses the GNU C library ("glibc") to handle name resolution. This library changed in 2008 (starting with the glibc 2.9 release) to send both an IPv4 "A" record request and an IPv6 "AAAA" record request in parallel (i.e. two outbound UDP requests without waiting for a reply in between).
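To make the "two requests in parallel" pattern concrete, here is a minimal Python sketch that hand-builds an "A" and an "AAAA" query and sends them back to back. It is not glibc's code. It assumes both queries leave from the same UDP socket (my understanding of the glibc behavior, not something the referenced article is quoted on here), and the function names, query IDs and timeout are illustrative only.

```python
import socket
import struct

QTYPE_A = 1      # IPv4 address record
QTYPE_AAAA = 28  # IPv6 address record


def build_query(name: str, qtype: int, query_id: int) -> bytes:
    """Encode a minimal DNS query (one question, recursion desired)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN


def send_pair(server: str, name: str, timeout: float = 5.0) -> int:
    """Send A and AAAA queries without waiting in between; count replies."""
    replies = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, QTYPE_A, 0x1001), (server, 53))
        sock.sendto(build_query(name, QTYPE_AAAA, 0x1002), (server, 53))
        try:
            for _ in range(2):
                sock.recvfrom(512)
                replies += 1
        except socket.timeout:
            pass  # a missing reply shows up as the timeout expiring
    return replies
```

Calling `send_pair("172.16.0.23", "web1.yourdomain.com")` from an affected Subnet would be one way to see whether both replies come back; a return value below 2 means at least one reply never arrived within the timeout.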
A simple way to see if you are affected by this latency is to run these commands from an OCI-hosted Linux instance in a Subnet configured for Custom DNS resolution.
$ ssh somehost.yourdomain.com
$ hostname --fqdn
For either command above, if you notice a five-second delay before the result (or before a password prompt if your attempted SSH connection requires a password) you are affected.
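If you would rather measure than watch for a pause, the same check can be scripted. The sketch below times a lookup through Python's `socket.getaddrinfo`, which on Linux calls into the C library resolver. The 4.5-second threshold and the helper names are my own assumptions, not values from the referenced articles.

```python
import socket
import time


def time_lookup(host, family=socket.AF_UNSPEC, resolver=socket.getaddrinfo):
    """Return the seconds spent resolving host; a failed lookup still counts."""
    start = time.perf_counter()
    try:
        # AF_UNSPEC asks for both IPv4 and IPv6 results; AF_INET asks for IPv4 only.
        resolver(host, None, family)
    except socket.gaierror:
        pass
    return time.perf_counter() - start


def looks_affected(elapsed, threshold=4.5):
    """Flag lookups that take roughly the five seconds described above."""
    return elapsed >= threshold


def report(host):
    """Print timings for a dual-stack lookup and an IPv4-only lookup."""
    for label, family in (("A + AAAA", socket.AF_UNSPEC), ("A only", socket.AF_INET)):
        elapsed = time_lookup(host, family)
        flag = "AFFECTED?" if looks_affected(elapsed) else "ok"
        print(f"{label:9s} {elapsed:6.3f}s {flag}")
```

Running `report("somehost.yourdomain.com")` on the test instance prints both timings. If only the dual-stack lookup is slow, that is consistent with the parallel-request behavior described earlier, though it is not proof on its own.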
Avoid latency with stateless Security Rules to/from DNS hosts
The solution is to ensure that stateless rules, not the default stateful Security Rules, govern DNS traffic between the Subnet containing your test Linux instance and your DNS hosts.
This example assumes 172.16.0.23 and 172.16.0.24 are our custom DNS servers. We have added stateless rules governing DNS requests to these servers (egress to UDP 53) and DNS responses from them (ingress from UDP 53) back to our test machine Subnet.
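For readers who manage Security Lists as data rather than through the Console, the rules just described might be expressed roughly as follows. The field names (`isStateless`, `udpOptions`, `destinationPortRange`, `sourcePortRange`) reflect my understanding of the JSON shape OCI uses for security list rules; treat them as an assumption and verify against the current API reference before use. The helper names are hypothetical.

```python
import json

UDP = "17"  # IANA protocol number for UDP, written as a string


def dns_egress(dest_cidr):
    """Stateless egress rule: DNS requests to UDP port 53 on the server."""
    return {
        "destination": dest_cidr,
        "protocol": UDP,
        "isStateless": True,
        "udpOptions": {"destinationPortRange": {"min": 53, "max": 53}},
    }


def dns_ingress(source_cidr):
    """Stateless ingress rule: DNS responses coming from UDP port 53."""
    return {
        "source": source_cidr,
        "protocol": UDP,
        "isStateless": True,
        "udpOptions": {"sourcePortRange": {"min": 53, "max": 53}},
    }


def apps_subnet_rules(dns_servers):
    """Build one egress and one ingress rule per DNS server address."""
    cidrs = [f"{ip}/32" for ip in dns_servers]
    return {
        "egressSecurityRules": [dns_egress(c) for c in cidrs],
        "ingressSecurityRules": [dns_ingress(c) for c in cidrs],
    }


rules = apps_subnet_rules(["172.16.0.23", "172.16.0.24"])
print(json.dumps(rules, indent=2))
```

Because the rules are stateless, the response direction has to be allowed explicitly; that is why each server gets a matching ingress rule keyed on source port 53.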
However, this isn’t the definitive solution. Implementing hybrid DNS to resolve both “oraclevcn.com” and your custom domains poses additional challenges. We explore that below. First, though, let’s review how this latency concern may not be readily evident in your environment.
No delay with VCN Resolver
This latency may exist in your network configuration but may “hide” if you evaluate response times by testing only the VCN Resolver DNS host (as determined by the DHCP Options associated with your Subnet) without also testing response times for your custom DNS hosts.
For example, assuming a Compute instance "web1" in Subnet "apps" and VCN "prod" and a Subnet using the VCN Resolver, the command below causes a DNS query for a host in “oraclevcn.com”—a domain the VCN Resolver recognizes—and reports the time to execute:
$ time ssh -oBatchMode=yes web1.apps.prod.oraclevcn.com
email@example.com: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Here, the "real" result (0.030s) is well under five seconds, suggesting we are (apparently) unaffected by the condition I’ve described above. Even with this promising result, you may still find poor performance in this same Subnet once you switch to Custom DNS.
Delays only with Custom Resolver
Now suppose you have updated the DHCP Options for your test Subnet to use your custom DNS hosts. See the results of a similar test command, assuming web1.yourdomain.com is among your DNS host records:
$ time ssh -oBatchMode=yes web1.yourdomain.com
firstname.lastname@example.org: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Now we are very clearly seeing a five-second delay. What happened? We did not change the Security List and are using the same Subnet so why do we see the delay now? The answer is in how Oracle exposes the VCN Resolver DNS host address to our Subnet. If you examine /etc/resolv.conf on a test server in a Subnet using VCN Resolver DNS, you will see one nameserver configured:
The VCN Resolver DNS host address falls in the "link local" address space 169.254/16 defined in RFC 3927 - Dynamic Configuration of IPv4 Link-Local Addresses. By definition, this address space is local (i.e. non-routed). Thus, DNS queries to this address do not follow the same traffic pattern as DNS requests to name servers outside the Subnet address space (such as to your custom DNS servers). Therefore, Security Rules and other traffic-management policies applied at the edge of the Subnet are not interpreted in the same way when addressing link-local hosts.
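A quick way to tell which kind of nameserver a given instance is using is to classify the entries in its /etc/resolv.conf. The sketch below uses Python's standard `ipaddress` module. The sample text is illustrative: 169.254.169.254 is the address I understand the VCN Resolver to use, and 172.16.0.23 stands in for a custom DNS host. It also assumes plain addresses without IPv6 zone suffixes.

```python
import ipaddress


def nameservers(resolv_conf_text):
    """Extract the addresses from 'nameserver' lines, ignoring comments."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def classify(resolv_conf_text):
    """Map each nameserver to True if it is link-local (169.254/16 for IPv4)."""
    return {
        ns: ipaddress.ip_address(ns).is_link_local
        for ns in nameservers(resolv_conf_text)
    }


sample = "nameserver 169.254.169.254\nnameserver 172.16.0.23\n"  # illustrative only
for server, link_local in classify(sample).items():
    kind = "link-local" if link_local else "routed"
    print(f"{server}: {kind}")
```

On a real instance you would pass in the contents of /etc/resolv.conf; any nameserver reported as routed is one whose traffic is subject to the Security Rules discussed in this post.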
The information shared in this article and in the articles noted at the top of this post helps us understand and overcome complications that arise when deploying a hybrid DNS solution. That is, we want to maintain the behavior of the VCN Resolver (e.g. resolving "oraclevcn.com" hosts) while also supporting name resolution for our custom domains. Consider this example architecture:
Because we will configure DNSMasq1 and DNSMasq2 like this:
…we need to ensure that both inbound and outbound DNS traffic for dnsmasq as well as in/out traffic to the on-premises servers is handled by stateless rules.
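The original dnsmasq listing is not reproduced here, so the following is only a guess at the kind of configuration involved, not the author's actual file. It assumes "oraclevcn.com" queries go to the VCN Resolver at 169.254.169.254 and everything else goes to the on-premises servers, which I am assuming are the 172.16.0.23 and 172.16.0.24 hosts mentioned earlier. `server=` and `no-resolv` are standard dnsmasq options; the Python helper that renders them is hypothetical.

```python
def dnsmasq_config(vcn_resolver, vcn_domain, default_upstreams):
    """Render dnsmasq lines that split DNS forwarding by domain."""
    lines = [
        # Do not read upstream servers from /etc/resolv.conf.
        "no-resolv",
        # Send queries for the VCN domain to the VCN Resolver.
        f"server=/{vcn_domain}/{vcn_resolver}",
    ]
    # Every other query goes to the default upstream servers.
    lines += [f"server={ip}" for ip in default_upstreams]
    return "\n".join(lines) + "\n"


# All addresses below are assumptions made for illustration.
print(dnsmasq_config("169.254.169.254", "oraclevcn.com",
                     ["172.16.0.23", "172.16.0.24"]))
```

Under this split, the dnsmasq hosts both receive queries from the VCN and forward queries off-subnet, so each of those flows needs its own stateless rule.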
Security Rules for DNS Subnet
We achieve that using the Security Rules noted above but applied to our DNS Subnet (not the Apps Subnet) plus a few additions to handle traffic to/from hosts throughout our VCN (10.1.0.0/16):
These rules cover DNS requests from our VCN (ingress to UDP 53 from 10.1/16) and the corresponding responses (egress from UDP 53 to 10.1/16) as well as DNS requests forwarded to our on-premises servers (egress to UDP 53) and the responses (ingress from UDP 53).
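The four flows in that paragraph can be written out as a small table of rules. This is a sketch of the rule set as described in the text, using a made-up `Rule` type rather than any OCI API object, and it assumes the on-premises servers are the 172.16.0.23 and 172.16.0.24 hosts from the earlier example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    direction: str   # "ingress" or "egress"
    peer_cidr: str   # the other end of the flow
    port_side: str   # which side of the UDP flow is port 53
    stateless: bool = True
    protocol: str = "UDP"
    port: int = 53


def dns_subnet_rules(vcn_cidr, upstream_servers):
    """Stateless rules for the Subnet that hosts the DNS servers or proxies."""
    rules = [
        Rule("ingress", vcn_cidr, "destination"),  # requests from the VCN
        Rule("egress", vcn_cidr, "source"),        # responses back to the VCN
    ]
    for ip in upstream_servers:
        rules.append(Rule("egress", f"{ip}/32", "destination"))  # forwarded requests
        rules.append(Rule("ingress", f"{ip}/32", "source"))      # upstream responses
    return rules


for rule in dns_subnet_rules("10.1.0.0/16", ["172.16.0.23", "172.16.0.24"]):
    print(f"{rule.direction:8s} {rule.peer_cidr:16s} "
          f"{rule.protocol} {rule.port_side} port {rule.port} stateless={rule.stateless}")
```

Each request rule has a mirror-image response rule because a stateless rule does not automatically admit the reply traffic.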
Security Rules for Apps Subnet
Then, we create these rules for our Apps Subnet:
The end result is stateless rules governing traffic to/from our dnsmasq servers and stateless rules governing traffic to/from the on-premises DNS servers: each applied to the appropriate Subnet.
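Once both Subnets are configured, it can be worth checking that no DNS rule was left stateful by accident. The following sketch scans a security list held as a Python dict and reports UDP port 53 rules that are not stateless. The dict layout mirrors my understanding of OCI's security list JSON and should be treated as an assumption, and the port check is a simple heuristic, not a complete rule evaluator.

```python
def touches_udp_53(rule):
    """Heuristic: does this rule cover UDP traffic to or from port 53?"""
    if rule.get("protocol") not in ("17", "all"):
        return False
    opts = rule.get("udpOptions")
    if not opts:
        return True  # no port restriction, so port 53 is included
    for key in ("destinationPortRange", "sourcePortRange"):
        rng = opts.get(key)
        if rng and rng["min"] <= 53 <= rng["max"]:
            return True
    return False


def stateful_dns_rules(security_list):
    """Return (direction, peer) pairs for DNS rules that are still stateful."""
    findings = []
    for direction in ("ingressSecurityRules", "egressSecurityRules"):
        for rule in security_list.get(direction, []):
            if touches_udp_53(rule) and not rule.get("isStateless", False):
                peer = rule.get("source") or rule.get("destination")
                findings.append((direction, peer))
    return findings


# Illustrative input: one correct rule and one left stateful.
example = {
    "egressSecurityRules": [
        {"destination": "172.16.0.23/32", "protocol": "17", "isStateless": True,
         "udpOptions": {"destinationPortRange": {"min": 53, "max": 53}}},
        {"destination": "172.16.0.24/32", "protocol": "17", "isStateless": False,
         "udpOptions": {"destinationPortRange": {"min": 53, "max": 53}}},
    ],
}
print(stateful_dns_rules(example))
```

An empty result for both the Apps Subnet and the DNS Subnet lists suggests every DNS flow is governed by a stateless rule, at least as far as this heuristic can see.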
Oracle Cloud Infrastructure provides the versatility of multiple DNS configuration models:
If you understand how stateful rules affect DNS traffic, both within your OCI Subnets and to and from your external (non-OCI) environments, and plan your Security Lists accordingly, you can implement any of these models without introducing unexpected latency.
If you have any questions or need additional assistance, please contact the Mythics Cloud Team.
Jon-Eric Eliker, Cloud Solution Consultant, Mythics