
Reservation

Mesos also provides the ability to reserve resources on specified slaves. This is particularly useful in ensuring that important services get guaranteed resource offers from a particular slave (for example, a database may need resource offers only from a particular slave, which contains the necessary data). In the absence of a reservation mechanism, there is the possibility that an important service or job may need to wait for a long time before it gets a resource offer satisfying its filter criteria, which would have a detrimental impact on performance.

On the other hand, misusing the reservation feature can lead to the very problems, such as resource underutilization, that Mesos sought to resolve in the first place. It is therefore necessary to use reservations judiciously. The Mesos access control mechanism makes sure that the framework requesting a reservation of resources has the appropriate authorization to do so.

Mesos provides two methods of resource reservations:

  1. Static reservation
  2. Dynamic reservation

Static reservation

In this type of reservation, specified resources can be reserved on specific slave nodes for a particular framework or group of frameworks. In order to reserve resources for a framework, it must be assigned to a role. Multiple frameworks can be assigned to a single role if necessary. Only the frameworks assigned to a particular role (say, role X) are entitled to get offers for the resources reserved for role X. Roles need to be defined first, then frameworks need to be assigned to the required roles, and finally, resource policies must be set for these roles.

Role definition

Roles can be defined by starting the master with the following flag:

--roles = "name1, name2, name3"

For example, if we want to define a role called hdfs, then we can start the master using the following:

--roles = "hdfs"

Alternatively, you can do this by running the following:

echo hdfs > /etc/mesos-master/roles

Now, the master needs to be restarted by running the following:

sudo service mesos-master restart

Framework assignment

Now, we need to map the frameworks to specific roles. The method of doing this varies by framework. Some, such as Marathon, can be configured using the --mesos_role flag. In the case of HDFS, this can be done by setting mesos.hdfs.role in mesos-site.xml to the hdfs role defined earlier:

<property>
  <name>mesos.hdfs.role</name>
  <value>hdfs</value>
</property>

Custom roles for frameworks can be specified by setting the role option within FrameworkInfo to the desired value (the default is *).
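For instance, with the C++ scheduler API, the role can be set on FrameworkInfo before registering. The following is a minimal sketch; the framework name and role value here are illustrative:

FrameworkInfo framework;
framework.set_name("HDFS framework");

// Receive offers for resources reserved for the "hdfs" role
// (the default role is "*").
framework.set_role("hdfs");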

Role resource policy setting

Resources on each slave can be reserved for a particular role by leveraging the slave's --resources flag. Slave-level resource policy setting has its drawbacks: the management overhead can quickly become daunting as the cluster size and the number of frameworks being run increase.

If we have eight cores and 24 GB of RAM available on a particular slave (memory is specified in MB in Mesos) and seek to reserve two cores and 6 GB of RAM for the hdfs role, then we can make the following change on the slave:

--resources="cpus:6;mem:18432;cpus(hdfs):2;mem(hdfs):6144"

Once this is done, stop the running mesos-slave process so that the changed settings can take effect by executing the following:

sudo service mesos-slave stop

Next, the older state on the slave can be removed with the following command. Note that any running tasks must be terminated manually, as their task state will also be removed:

rm -f /tmp/mesos/meta/slaves/latest

Now, the slave can be restarted with the following command:

sudo service mesos-slave start

Dynamic reservation

The main drawback of static reservation is that the reserved resources cannot be used by other roles while they sit idle, nor can they be unreserved and returned to the wider pool. This leads to poor resource utilization. To overcome this limitation, support for dynamic reservation was added in version 0.23.0, which allows users to reserve and unreserve resources dynamically as workload requirements change.

For a resource offer, frameworks can send back the following two messages (through the acceptOffers API) as a response:

  • Offer::Operation::Reserve
  • Offer::Operation::Unreserve

These are described in detail in the following sections. Note that the framework's principal is required for authorization, which will be discussed in more detail in Chapter 6, Mesos Frameworks.

Offer::Operation::Reserve

Each framework can reserve resources as part of the offer cycle. As an example, let's say that a resource offer with eight cores and 12 GB RAM unreserved is received by a framework. Take a look at the following code:

{
  "id": <offer_id>,
  "framework_id": <framework_id>,
  "slave_id": <slave_id>,
  "hostname": <hostname>,
  "resources": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 8 },
      "role": "*",
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 12288 },
      "role": "*",
    }
  ]
}

We can reserve four cores and 6 GB RAM for the framework by specifying the quantity of each resource type that needs to be reserved and the framework's role and principal in the following message:

{
  "type": Offer::Operation::RESERVE,
  "reserve": {
    "resources": [
      {
        "name": "cpus",
        "type": "SCALAR",
        "scalar": { "value": 4 },
        "role": <framework_role>,
        "reservation": {
          "principal": <framework_principal>
        }
      },
      {
        "name": "mem",
        "type": "SCALAR",
        "scalar": { "value": 6144 },
        "role": <framework_role>,
        "reservation": {
          "principal": <framework_principal>
        }
      }
    ]
  }
}
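With the C++ scheduler API, such a message can be built as an Offer::Operation and sent back through the driver's acceptOffers() call. The following is a minimal sketch under that assumption; the "hdfs" role and "my-principal" principal are illustrative values:

// Build a RESERVE operation covering four cores and 6 GB of RAM.
Offer::Operation operation;
operation.set_type(Offer::Operation::RESERVE);

Resource cpus;
cpus.set_name("cpus");
cpus.set_type(Value::SCALAR);
cpus.mutable_scalar()->set_value(4);
cpus.set_role("hdfs");  // the framework's role
cpus.mutable_reservation()->set_principal("my-principal");

Resource mem;
mem.set_name("mem");
mem.set_type(Value::SCALAR);
mem.mutable_scalar()->set_value(6144);
mem.set_role("hdfs");
mem.mutable_reservation()->set_principal("my-principal");

operation.mutable_reserve()->add_resources()->CopyFrom(cpus);
operation.mutable_reserve()->add_resources()->CopyFrom(mem);

// Answer the offer with the reserve operation (no tasks are launched).
driver->acceptOffers({offer.id()}, {operation});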

The next resource offer will include the preceding reserved resources, as follows:

{
  "id": <offer_id>,
  "framework_id": <framework_id>,
  "slave_id": <slave_id>,
  "hostname": <hostname>,
  "resources": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 4 },
      "role": <framework_role>,
      "reservation": {
        "principal": <framework_principal>
      }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 6144 },
      "role": <framework_role>,
      "reservation": {
        "principal": <framework_principal>
      }
    }
  ]
}

Offer::Operation::Unreserve

Each framework can also unreserve resources as part of the offer cycle. In the previous example, we reserved four cores and 6 GB RAM for the framework/role, and these will continue to be offered to it until specifically unreserved. The way to unreserve them is explained here.

First, we will receive the reserved resource offer, as follows:

{
  "id": <offer_id>,
  "framework_id": <framework_id>,
  "slave_id": <slave_id>,
  "hostname": <hostname>,
  "resources": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 4 },
      "role": <framework_role>,
      "reservation": {
        "principal": <framework_principal>
      }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 6144 },
      "role": <framework_role>,
      "reservation": {
        "principal": <framework_principal>
      }
    }
  ]
}

We can now unreserve four cores and 6 GB RAM for the framework by specifying the quantity of each resource type that needs to be unreserved and the framework's role and principal in the following message:

{
  "type": Offer::Operation::UNRESERVE,
  "unreserve": {
    "resources": [
      {
        "name": "cpus",
        "type": "SCALAR",
        "scalar": { "value": 4 },
        "role": <framework_role>,
        "reservation": {
          "principal": <framework_principal>
        }
      },
      {
        "name": "mem",
        "type": "SCALAR",
        "scalar": { "value": 6144 },
        "role": <framework_role>,
        "reservation": {
          "principal": <framework_principal>
        }
      }
    ]
  }
}

In subsequent resource offers, these unreserved resources will become part of the wider unreserved pool and start being offered to other frameworks.

The /reserve and /unreserve HTTP endpoints were also introduced in v0.25.0 and can be used for dynamic reservation management from the master.

/reserve

Let's say that we are interested in reserving four cores and 6 GB RAM for a role on a slave whose ID is <slave_id>. An HTTP POST request can be sent to the /reserve HTTP endpoint, as follows:

$ curl -i \
  -u <operator_principal>:<password> \
  -d slaveId=<slave_id> \
  -d resources='[
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 4 },
      "role": <framework_role>,
      "reservation": {
        "principal": <operator_principal>
      }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 6144 },
      "role": <framework_role>,
      "reservation": {
        "principal": <operator_principal>
      }
    }
  ]' \
  -X POST http://<ip>:<port>/master/reserve

The response can be one of the following:

  • 200 OK: Success
  • 400 BadRequest: Invalid arguments (for example, missing parameters)
  • 401 Unauthorized: Unauthorized request
  • 409 Conflict: Insufficient resources to satisfy the reserve operation

/unreserve

Now, if we are interested in unreserving the resources that were reserved before, an HTTP POST request can be sent to the /unreserve HTTP endpoint, as follows:

$ curl -i \
  -u <operator_principal>:<password> \
  -d slaveId=<slave_id> \
  -d resources='[
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 4 },
      "role": <framework_role>,
      "reservation": {
        "principal": <operator_principal>
      }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 6144 },
      "role": <framework_role>,
      "reservation": {
        "principal": <operator_principal>
      }
    }
  ]' \
  -X POST http://<ip>:<port>/master/unreserve

The response can be one of the following:

  • 200 OK: Success
  • 400 BadRequest: Invalid arguments (for example, missing parameters)
  • 401 Unauthorized: Unauthorized request
  • 409 Conflict: Insufficient resources to satisfy the unreserve operation

Oversubscription

Users generally provision frameworks with enough buffer resources to handle unexpected workload surges. This leads to an overall underutilization of the entire cluster, because a sizeable chunk of resources lies idle. Added up across frameworks, this amounts to significant wastage. The concept of oversubscription, introduced in v0.23.0, seeks to address this problem by executing low-priority tasks, such as background processes or ad hoc noncritical analytics, on these idle resources.

To enable this, two additional components are introduced:

  1. Resource estimator: This is used to determine the amount of idle resources that can be used by best-effort processes
  2. Quality of Service (QoS) controller: This is used to terminate these best-effort tasks in case a workload surge or performance degradation in the original tasks is observed

While basic default estimators and controllers are provided, Mesos also gives users the ability to create their own custom ones.

In addition, the existing resource allocator, resource monitor, and Mesos slave are also extended with new flags and options. The following diagram illustrates how the oversubscription concept works (source: http://mesos.apache.org/documentation/latest/oversubscription/):

(Figure: Oversubscription)

Revocable resource offers

The following steps are followed:

  1. The first step involves collecting usage statistics and estimating the amount of oversubscribed resources available for use by low-priority jobs. The resource monitor sends these statistics as ResourceStatistics messages to a component known as the resource estimator.
  2. The estimator identifies the quantity of resources that are oversubscribed by leveraging algorithms that calculate these buffer amounts. Mesos provides the ability to develop custom resource estimators based on user-specified logic.
  3. Each slave polls the resource estimator to get the most recent estimates.
  4. The slave then periodically (whenever the estimate values change) transmits this information to the allocator module in the master.
  5. The allocator marks these oversubscribed resources as "revocable" resources and monitors these separately.
  6. Frameworks that register with the REVOCABLE_RESOURCES capability set in FrameworkInfo receive offers of these revocable resources and can schedule tasks on them using the launchTasks() API. Note that revocable resources cannot be dynamically reserved.

Registering with the revocable resources capability

Run the following code:

FrameworkInfo framework;
framework.set_name("Revocable framework");

framework.add_capabilities()->set_type(
  FrameworkInfo::Capability::REVOCABLE_RESOURCES);

An example offer with a mix of revocable and standard resources

Take a look at the following code:

{
  "id": <offer_id>,
  "framework_id": <framework_id>,
  "slave_id": <slave_id>,
  "hostname": <hostname>,
  "resources": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": {
        "value": 4
      },
      "role": "*"
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": {
        "value": 6144
      },
      "role": "*"
    },
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": {
        "value": 1
      },
      "role": "*",
      "revocable": {}
    }
  ]
}
  • The task is launched on the slave when it receives the runTask request. A container that uses even a single revocable resource is considered a revocable container and can be terminated by the QoS controller.
  • The original task is also monitored continuously; if any performance deterioration or workload spike is observed, the revocable resources are returned to it. This is known as interference detection.

Currently, the Mesos resource estimator is fairly basic, with two default implementations: the fixed and noop resource estimators. With the former, a fixed set of resources can be tagged as oversubscribed, while the latter provides an empty estimate upon being polled by the slave, effectively saying that no resources are available for oversubscription.
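As a sketch of how the fixed resource estimator might be enabled on a slave through the modules mechanism (the library path and the cpus:2 value here are illustrative; the module name follows the upstream documentation):

--resource_estimator="org_apache_mesos_FixedResourceEstimator"

--modules='{
  "libraries": [
    {
      "file": "/usr/local/lib64/libfixed_resource_estimator.so",
      "modules": [
        {
          "name": "org_apache_mesos_FixedResourceEstimator",
          "parameters": [
            { "key": "resources", "value": "cpus:2" }
          ]
        }
      ]
    }
  ]
}'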

Active work is being done on introducing sophisticated and dynamic oversubscribed-resource estimation models (for instance, Project Serenity, a module by Mesosphere and Intel) to maximize resource utilization while ensuring that Quality of Service is not impacted.

Resource estimator

The resource estimator interface is defined as follows:

class ResourceEstimator
{
public:
  virtual Try<Nothing> initialize(
      const lambda::function<process::Future<ResourceUsage>()>& usage) = 0;

  virtual process::Future<Resources> oversubscribable() = 0;
};
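To illustrate the interface, here is a minimal sketch of a custom estimator. The class name FixedCpuEstimator and the cpus:2 value are hypothetical, and error handling is omitted:

#include <mesos/resources.hpp>
#include <mesos/slave/resource_estimator.hpp>

#include <process/future.hpp>

#include <stout/foreach.hpp>
#include <stout/lambda.hpp>
#include <stout/nothing.hpp>
#include <stout/try.hpp>

using namespace mesos;

// A hypothetical estimator that always reports two CPUs as available
// for oversubscription, regardless of observed usage.
class FixedCpuEstimator : public mesos::slave::ResourceEstimator
{
public:
  virtual Try<Nothing> initialize(
      const lambda::function<process::Future<ResourceUsage>()>& usage)
  {
    // Store the callback; a real estimator would query it for
    // ResourceStatistics before producing an estimate.
    this->usage = usage;
    return Nothing();
  }

  virtual process::Future<Resources> oversubscribable()
  {
    Resources fixed = Resources::parse("cpus:2").get();

    // Tag each resource as revocable before handing it back.
    Resources revocable;
    foreach (Resource resource, fixed) {
      resource.mutable_revocable();
      revocable += resource;
    }
    return revocable;
  }

private:
  lambda::function<process::Future<ResourceUsage>()> usage;
};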

The QoS controller

The QoS controller interface is defined as follows:

class QoSController
{
public:
  virtual Try<Nothing> initialize(
      const lambda::function<process::Future<ResourceUsage>()>& usage) = 0;

  virtual process::Future<std::list<QoSCorrection>> corrections() = 0;
};
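Along the same lines, here is a minimal controller sketch. The class name IdleController is hypothetical; it never issues corrections, which it signals by returning a future that is never satisfied rather than repeatedly returning empty lists:

#include <list>

#include <mesos/slave/qos_controller.hpp>

#include <process/future.hpp>

#include <stout/lambda.hpp>
#include <stout/nothing.hpp>
#include <stout/try.hpp>

using namespace mesos;
using mesos::slave::QoSCorrection;

// A hypothetical controller that never asks the slave to correct
// (that is, terminate) any revocable containers.
class IdleController : public mesos::slave::QoSController
{
public:
  virtual Try<Nothing> initialize(
      const lambda::function<process::Future<ResourceUsage>()>& usage)
  {
    this->usage = usage;  // kept for a real interference check
    return Nothing();
  }

  virtual process::Future<std::list<QoSCorrection>> corrections()
  {
    // Return a future that stays pending forever, so no corrections
    // are ever delivered to the slave.
    return promise.future();
  }

private:
  lambda::function<process::Future<ResourceUsage>()> usage;
  process::Promise<std::list<QoSCorrection>> promise;
};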

Configuring oversubscription

The slave now has four new oversubscription-related flags available. Per the Mesos documentation, these are as follows:

  • --oversubscribed_resources_interval: The interval at which the slave sends updated estimates of the total amount of oversubscribed resources to the master (default: 15secs)
  • --resource_estimator: The name of the resource estimator module to use for oversubscription
  • --qos_controller: The name of the QoS controller module to use for oversubscription
  • --qos_correction_interval_min: The minimum interval between the QoS corrections that the slave polls for and carries out (default: 0ns)

Extendibility

Different organizations have different requirements, and even within the same organization, different users run clusters in different ways, with different scale and latency requirements. Users need to deal with application-specific behavior, ensure that their industry-specific security compliances are met, and so on. All this means that Mesos needs to be extremely customizable and extensible if it is to achieve its goal of serving as the OS for the entire datacenter in any organization. What was needed was a feature that could keep the Mesos core small and lightweight while making it powerful enough to allow as much customization and extension as required.

A number of software systems, such as browsers, support libraries to:

  • Extend feature support
  • Abstract complexity
  • Make development configuration-driven