Customers occasionally contact Google Cloud Platform Support to ask for help with troubleshooting latency issues in a Google App Engine application. In this post, I'll discuss how I typically isolate the root cause of this type of problem.

I start by creating a dynamic handler that returns only a short text string, and then add it to the customer's App Engine app so that it can be accessed through a known URL. For an example of such a page in Python, see the hello world tutorial.
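To confirm the test page behaves as expected before you start measuring, fetch it directly; the response should contain nothing but the short string (the app ID and path here are placeholders for your own):

curl https://YOUR-APP-ID.appspot.com/hello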

Then, I run this curl command from a terminal window, pointing it at the test page (again, substitute your own URL):

curl -s -o /dev/null -w "@curl-format.txt" https://YOUR-APP-ID.appspot.com/hello

The curl command uses a format file to define its output. Here are the contents of the format file; you need to create and save it as curl-format.txt before you run curl:

\n
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_appconnect: %{time_appconnect}\n
time_pretransfer: %{time_pretransfer}\n
time_redirect: %{time_redirect}\n
time_starttransfer: %{time_starttransfer}\n
--------\n
time_total: %{time_total}\n
\n
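If you prefer, you can create the file in one step with a shell heredoc (the quoted delimiter keeps the \n sequences literal):

cat > curl-format.txt <<'EOF'
\n
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_appconnect: %{time_appconnect}\n
time_pretransfer: %{time_pretransfer}\n
time_redirect: %{time_redirect}\n
time_starttransfer: %{time_starttransfer}\n
--------\n
time_total: %{time_total}\n
\n
EOF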

The output will look something like this, with each timing reported in seconds:

           time_namelookup:  0.060
              time_connect:  0.098
           time_appconnect:  0.000
          time_pretransfer:  0.099
             time_redirect:  0.000
        time_starttransfer:  0.144
                           ----------
                time_total:  0.144

The value for time_connect generally represents the latency of the client’s connection to the nearest Google datacenter. If this connection is slow, you can troubleshoot further using traceroute to determine which hop on the network causes the delay, as packets traverse your ISP’s network and Google’s production network to reach the Google frontend server.
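For example, to see per-hop latency on the path to the frontend serving your app (the hostname is a placeholder for your own; on Linux, mtr provides similar per-hop statistics):

traceroute YOUR-APP-ID.appspot.com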

You can run tests from clients in different geographical locations. Google Cloud Platform will automatically route requests to the closest data center, which will vary based on the client’s location.

If packets reach the Google frontend server with acceptable latency, then you need to look for the source of the latency within App Engine's serving infrastructure, your application code, or your configuration.

Look at the logs for the corresponding request in the Google Developers Console. It helps to note the time at which you ran the curl command so that you can find the matching log entry.

The key field is the wall clock time for the request. This value doesn't include time spent between the client and the server that's running your application. You can estimate the time that the request spent within App Engine's serving infrastructure before reaching your application: subtract both the wall clock time and the time to reach the Google frontend server from the total time that curl reports.
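As a rough, hypothetical example: if curl reports time_total of 0.250 s and time_connect of 0.050 s, and the log entry shows a wall clock time of 120 ms, then about 250 - 120 = 130 ms was spent outside your application; roughly 50 ms of that (per time_connect) is network latency to the frontend, leaving on the order of 80 ms for the serving infrastructure and response transfer.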
All App Engine applications are hosted in the United States, unless their app ID is prefixed by e~, which signifies that the application is hosted in Europe. If your client is in a different geographical region from your application, you will see a significant delay as packets traverse Google’s internal network between the Google frontend server and the server running your application. You will see this delay, for example, if your application is in the US and your client is in Europe or Asia. One of the advantages of hosting your application on App Engine is that this latency is usually significantly less than if you used the public Internet to route requests to an application in another region.

Assuming that your client is in the same geographical region as your application, you can expect the App Engine serving infrastructure to add negligible latency.

Here are some additional troubleshooting tips for isolating latency problems:
  • Was the latency caused by the time to start up a new instance of your application? You will see these start-ups flagged as loading requests in the logs. Try running your tests with the default scheduler settings. In most cases, the default scheduler settings will provide an optimal tradeoff between cost and latency. If you make changes to these settings, run load tests to determine the impact. Also consider adding resident instances.
  • Do the logs show high pending time for a slow request? This is the time that your request spends in the queue waiting for an instance to become available. You can usually avoid it by reverting to the default scheduler settings. In some cases, you may need to add resident instances.
  • Are you serving a static file or using the Blobstore API to serve the request? Both of these approaches use a serving path that doesn't run any of your application's code, so run separate latency tests for them. Use Google's high-performance image serving infrastructure to reduce latency.
  • Do slow requests have a large response size, according to the logs? If so, determine whether there is a bandwidth limitation between your client and Google.
  • For consistency during tests, ensure that your requests aren't cached; see the example after this list. When running in production, add a Cache-Control HTTP header to your response to improve latency.
  • Does your request make API calls? If so, use Appstats to determine the time taken for API calls.
  • Do you see a high value in the CPU milliseconds field in your logs? If so, your request might be CPU-bound.  Using a higher instance class may reduce latency.
  • Are you using HTTPS or a custom domain? Compare latency with HTTP requests to your appspot.com domain to isolate whether the latency is caused by these factors.
  • If you think the slowdown occurs in your code, add application logging to record timing events in your code.
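For the caching point above, here's one way to keep test requests uncached and to check for a Cache-Control header (the URL is a placeholder; the timestamp query string acts as a cache-buster):

curl -s -D - -o /dev/null "https://YOUR-APP-ID.appspot.com/hello?nocache=$(date +%s)" | grep -i cache-control

The -D - option writes the response headers to standard output, so grep will print the Cache-Control header if one is set.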

If you have purchased a support package, you can contact Google Cloud Platform's support team for further help. Here is information you should have at hand to help us quickly diagnose latency caused by network issues:

  1. Your IP address. You can get that by looking at the Developers Console logs for a request sent to App Engine.
  2. The URL of your App Engine application.
  3. The IP address to which the domain name in the above URL resolves.
  4. The output of ping and traceroute from your client to the above IP address.
  5. The output from running the curl command, shown earlier in this blog post. You may want to run this a few times to ensure you have a representative result.
  6. The Developers Console logs for the above request.

If you’d like to explore this topic further, check out our methodology for YouTube video quality and read about Mobile analysis in PageSpeed Insights.

- Posted by John Lowry, Technical Account Manager

A fellow Technical Solutions Engineer recently found their Google Cloud Platform project in an interesting state. They could create Compute Engine VM instances that would boot, but could not connect via SSH to any of them. While this problem is often due to a misconfigured firewall rule, a quick check of the rules showed this was not the case, as an SSH rule existed and its SRC_RANGES value allowed traffic from any source:

$ gcloud compute firewall-rules list -r .*ssh.*
NAME              NETWORK SRC_RANGES RULES  SRC_TAGS TARGET_TAGS
default-allow-ssh default 0.0.0.0/0  tcp:22

We ruled out a system-level firewall misconfiguration, as new systems from default images would not share that issue. As a sanity check, we used tcptraceroute to ensure traffic was reaching the instance:

$ sudo tcptraceroute 130.211.181.201 22
Selected device en0, address 172.31.130.174, port 22 for outgoing packets
Tracing the path to 130.211.181.201 on TCP port 22 (ssh), 30 hops max
1  172.31.131.252  1.247 ms  0.256 ms  0.250 ms
2  * * *
...
10  * * *
11  201.181.211.130.bc.googleusercontent.com (130.211.181.201) [closed]  38.175 ms  38.918 ms  38.072 ms

We would expect the last hop to report open, not closed.  Typically, this value means that the instance has responded but the port wasn't open for communication.  With no firewall interference, we knew it had to be something else.  The next step was to grep through the serial port output to see if sshd had started:

$ gcloud compute instances get-serial-port-output gcp-rge0-blog --zone us-central1-a | grep sshd
[....] Starting OpenBSD Secure Shell server: sshd
Jan 14 23:19:19 gcp-rge0-blog sshd[1911]: Server listening on 0.0.0.0 port 22.
[ ok ] Starting OpenBSD Secure Shell server: sshd.

Okay, that looked fine.  With the most obvious points of interference ruled out, the network routes were the next best bet:  

$ gcloud compute routes list
NAME                            NETWORK     DEST_RANGE      NEXT_HOP                 PRIORITY
default-route-31a84e4cfff40b29 default     10.240.0.0/16                            1000

Now we’ve found the root cause.  The default route for non-local traffic (0.0.0.0/0) had been inadvertently deleted, which caused all external traffic to be lost on the return path.  Recreating the missing route solved the issue:

$ gcloud compute routes create default-internet --destination-range 0.0.0.0/0 --next-hop-gateway default-internet-gateway
Created [https://www.googleapis.com/compute/v1/projects/PROJECTID/global/routes/default-internet].

$ gcloud compute routes list
NAME                            NETWORK     DEST_RANGE      NEXT_HOP                 PRIORITY
default-route-31a84e4cfff40b29 default     10.240.0.0/16                            1000
default-internet               default     0.0.0.0/0       default-internet-gateway 1000

Now, the instances are once again reachable by SSH and any other external method. Case closed!
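As a final check, you can confirm SSH connectivity with gcloud, using the same instance name and zone as earlier in this post:

$ gcloud compute ssh gcp-rge0-blog --zone us-central1-a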

You can find a lot of help and information in the Google Cloud Platform documentation, including more specific guidance on troubleshooting Compute Engine.

- Posted by Josh Moore, Technical Solutions Engineer

The emergence of affordable high-IOPS storage, such as Google Compute Engine local SSDs, enables a new generation of technologies to re-invent storage. Helium, an embedded key-value store from Levyx, is one such example -- designed to scale with multi-core CPUs, SSDs, and memory-efficient indexing.

At Levyx, we believe in a “scale-in before you scale-out” mantra. Technology vendors often advertise scale-out as a way to achieve high performance. It is a proven approach, but it is often used to mask single-node inefficiencies. Without a system in which CPU, memory, network, and local storage are properly balanced, scaling out is simply what we call “throwing hardware at the problem”: hardware that, virtual or not, customers pay for.

To demonstrate this, we decided to check Helium’s performance on a single node on Google Cloud Platform with a workload similar to the one previously used to showcase Aerospike and Cassandra (200-byte objects and 100 million operations). With Cassandra, the data store contained 3 billion indices; Helium starts with an empty data store. The setup consists of:

  1. Single n1-highcpu-32 instance -- 32 virtual CPUs and 28.8 GB memory.
  2. Four local SSDs (4 x 375 GB) for the Helium datastore. (Note: local SSDs are limited in create-time flexibility and reliability compared to persistent disks, but the goal of this blog post is to test with the highest-performing Cloud Platform I/O devices.) A sample create command follows this list.
  3. OS: Debian 7.7 (kernel 3.16-0.bpo.4-amd64, NVMe drivers).
  4. The gists and tests are on GitHub.
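For reference, an instance like this can be created along the following lines (a sketch: the instance name and zone are arbitrary, and you should substitute an image available to you in place of the Debian 7.7 image used in these tests):

$ gcloud compute instances create helium-bench \
    --zone us-central1-a \
    --machine-type n1-highcpu-32 \
    --local-ssd interface=NVME --local-ssd interface=NVME \
    --local-ssd interface=NVME --local-ssd interface=NVME \
    --image-family debian-9 --image-project debian-cloud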

Scaling and Performance with CPUs

The test first populates an empty datastore, then reads the entire datastore sequentially, then randomly, and finally deletes all objects. The 100 million objects are held in memory with persistence on SSD, which acts as the local storage that every replicated system requires. The total datastore size is kept fixed.
[Chart: operations per second scaling with the number of CPUs]
Takeaways
  • Single node performance of over 4 Million inserts/sec (write path) and over 9 Million gets/sec (read path) with persistence that is as durable as the local SSDs.
  • 99th-percentile (in-memory) latency under 15 usec for updates and under 5 usec for gets.
  • Almost linear scaling helps with the math of provisioning instances.

Scaling with SSDs and Pure SSD Performance

Compute Engine provides high-IOPS, low-latency local SSDs. To demonstrate a case where data is read purely from SSDs (rather than from memory), let’s run the same benchmark with a 4K object size x 5 million objects, and reduce Helium’s cache to a minimal 2% (400 MB) of the total data size (20 GB). Only random-get performance is shown below because it is a better stress test than sequential gets.


Takeaways
  • Single node SSDs capable of updates at 1.6 GB/sec (400K IOPS) and random gets at 1.9 GB/sec (480K IOPS).
  • IOPS scaling with SSDs.
  • Numbers comparable to fio, a pure I/O benchmark (a sample invocation follows this list).
  • With four SSDs and 256 threads, median latency < 600 usec, and 95th-percentile latency < 2 msec.
  • Deterministic memory usage (< 1GB) by not relying on OS page caches.

[Charts: random-get throughput and latency results]
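For the fio comparison mentioned in the takeaways above, an invocation along these lines exercises 4K random reads against one local SSD (a sketch: the device path and job parameters are illustrative, not the exact ones used for these results):

$ sudo fio --name=randread --filename=/dev/nvme0n1 --rw=randread \
    --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --numjobs=8 \
    --runtime=60 --time_based --group_reporting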

Cost Analysis

The cost of this Google Compute Engine instance for one hour is $1.22 (n1-highcpu-32) + $0.452 (4 x local SSD) = $1.67. Based on 200-byte objects and that hourly cost, the throughput above boils down to:


  • 2.5 million updates per second for every dollar of hourly cost
  • 4.6 million gets per second for every dollar of hourly cost


To put this in perspective, New York’s population is ~8.4 million; you could therefore scan through a Helium datastore containing a record for every resident (assuming each record is under 200 bytes, e.g., name, address, and phone number) in one second, on a single Google Cloud Platform instance, for under $2 per hour.

Summary

Helium running on Google Compute Engine commodity VMs enables processing data at near-memory speeds using SSDs. The combination of Cloud Platform and Helium makes high-throughput, low-latency data processing affordable for everyone. Welcome to the era of dollar-store prices for datastores with enterprise-grade reliability!

For details about running Helium on Google Cloud Platform, contact info@levyx.com.

- Posted by Siddharth Choudhuri, Principal Engineer at Levyx