Tuesday, November 27, 2012

Automatically Maintaining Entity Count in Google Appengine Datastore

For a server application, it's always good to keep capacity limit in mind. No resource is unlimited. On Google Appengine, a typical case is the entities in Datastore. This article describes how to maintain entity count within designed limit, in an automated manner. It's achieved by the following two steps in general:
  • Check the capacity in a regular basis.
  • Delete the excessive entities.
Practical tips are also presented, including:
  • Use back end server to do the task - save front end CPU time, and perform time consuming task.
  • Use cron job and task queue to arrange daily task automatically and robustly.

Background
The Robotypo (http://robotypo.appspot.com) on Appengine hosts fights between robots - special Java programs. The result of each fight is stored. However, to keep the result count in a reasonable limit without exceeding the quota limit, the old ones should be deleted.

To achieve this, a back end instance is created for the task, and a cron job is configured to schedule daily maintenance. A specifically configured task queue is used to assure sequential task execution on the back end server instance.


Benefits of using back end server

Using back end server saves CPU time of front end server, and if maintenance task takes long to complete, only the back end server can do that. Here a back end server named "auto" is defined. It's a dynamic instance which is automatically shutdown when idle. This saves CPU time. The definition in backends.xml looks like this:

 <backend name="auto">
   <class>B2</class>
   <instances>1</instances>
   <options>
     <dynamic>true</dynamic>
   </options>
 </backend>


Scheduling maintenance task

There are multiple entity kinds to maintain, and several other maintenance tasks. The first idea came into mind was to create multiple cron jobs, and each of them starts one task. There are several limitations with this approach:
  • There are cases that one maintenance task should happen before others (e.g. reduce the data complexity, or avoid unnecessary computing). It's hard to manage the relationship using only cron jobs, which only control fixed start time points.
  • If you control the sequence by leaving sufficient time between the pre-required task and the successor task, that may work. But you are likely not sure about when the pre-required task actually completes. It may complete within 1 second, or lasts for hours. You can leave big time interval between such tasks to make sure the pre-required task is completed before starting the following ones, but that will cause waste of time (server idle) in most of the case.
A practical approach is by using cron job with task queue. Here a cron job is created to submit a trigger for the daily maintenance task. Only one cron job is created. The cron job targets to the back end server named auto by using the <target> attribute:



  <cron>
    <url>/auto/maint?f=dailyTaskTrigger</url>
    <target>auto</target>
    <description>Daily task trigger</description>
    <schedule>every day 02:30</schedule>
  </cron>


In the daily maintenance task (/auto/maint), multiple maintenance sub tasks are scheduled in sequence, to a specific queue named backend-queue. Here's the source code example in the Java servlet, which schedules tasks to the back end server instance:

    Queue queue = QueueFactory.getQueue("backend-queue");
    String backendAddress = BackendServiceFactory.getBackendService()

        .getBackendAddress("auto");
    queue

        .add(withUrl("/auto/maint")
        .param("f", "deletion")
        .header("Host", backendAddress));
    queue

        .add(withUrl("/auto/maint")
        .param("f", "capacity")
        .header("Host", backendAddress));
    queue

        .add(withUrl("/auto/fight")
        .method(Method.GET)
        .param("backend", "true")
        .header("Host", backendAddress));




In queue.xml, the definition of the queue backend-queue looks like this:

  <queue>
    <name>backend-queue</name>
    <rate>1/s</rate>
    <bucket-size>1</bucket-size>
    <max-concurrent-requests>1</max-concurrent-requests>
    <retry-parameters>
      <task-retry-limit>0</task-retry-limit>
    </retry-parameters>
  </queue>


The queue defines only one bucket-size and max concurrent to 1. So that all tasks in the queue starts in sequence: one will not start until the previous one finishes. It matches what we need for Robotypo.

There are several benefits to separate maintenance tasks into multiple queued tasks:
  • It's robust and simple. You do not need to care too much about exception and recovery. Even if one task fails, others can still be handled. And the errors are logged by GAE automatically. If you handle them together in one task, exception handling can be annoying and error prone.
  • You can do things in parallel if the total maintenance time is important. For example: do_sequential_task1 -> do_sequential_task2 (submit multiple parallel tasks to another parallel queue) -> parallel tasks run in parallel.


Query the entity count of a kind

To maintain the entity count, we must know the number: whether it's out of limit, or not. It's crazy to count all the entities by listing all of them (if they are many). Here are two different approaches to do the task in a fast manner:
  1. Using the Datastore statistic, which is automatically created by DatastTore. This method is probably the most efficient way. The disadvantages are: a). It does not reflect the current size (likely one day later). b). It's only available on production environment and no support in development environment.
  2. Create a counter (e.g. the Shared Counter using shards, from example in the SDK). In this way you always get the precise number. The drawback is that some overhead is added when creating/deleting each entity.
With Robotypo, the disadvantages of method #1 are acceptable. We do not need the precise count, and avoiding additional overhead (drawback of #2) is always good. Here's the sample code snippet to query the entity count:

public static int countEntities(String kind) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
       
    Query q = new Query("__Stat_Kind__");
    List<Entity> kindStat = ds.prepare(q)

        .asList(FetchOptions.Builder.withLimit(1000));
       
    for (Entity en : kindStat) {
        String name = (String) en.getProperty("kind_name");
        if (name.equals(kind)) {
            return ((Number)en.getProperty("count")).intValue();
        }
    }
    return 0;
       
    //you can use the following in development environment for testing purpose
    //q = new Query(kind).setKeysOnly();
    //return ds.prepare(q).countEntities(FetchOptions.Builder.withLimit(1000));
}




Deleting old entities

To query old entities, some ideas come into mind:
  1. Add timestamp when creating each entity. Query them by that later.
  2. Query by default ID.
Option 2 *seems* work, and it's nice that no overhead is added (attribute, index). However no Google document clearly stated that the sorting order (default) of key is assured to return entitie of the creation order (smaller key id is older item).

So here we use #1. The following code queries n oldest ones, and delete them.

public static int deleteOldDateEntities(String kind, int toDelete) {
       
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    Query q = new Query(kind).addSort("date", SortDirection.ASCENDING)

        .setKeysOnly();
       
    List<Key> keys = new ArrayList<Key>();
    int deleted = 0;
           
    for (Entity en : ds.prepare(q).asIterable()) {
        keys.add(en.getKey());
           
        if (++deleted >= toDelete || keys.size() == 1000) {
            DataUtil.tryDelete(ds, keys);

            if (deleted >= toDelete)
                break;
            keys.clear();
        }
    }
    return deleted;
}






Further discussion can be found here:
http://stackoverflow.com/questions/13281777/google-app-engine-whats-the-recommended-way-to-keep-entity-amount-within-limi


Hope this doc helps you, and appreciate comments.

In addition, welcome to Robotypo: http://robotypo.appspot.com


No comments:

Post a Comment