Reservation
Mesos also provides the ability to reserve resources on specified slaves. This is particularly useful in ensuring that important services get guaranteed resource offers from a particular slave (for example, a database may need resource offers only from a particular slave, which contains the necessary data). In the absence of a reservation mechanism, there is the possibility that an important service or job may need to wait for a long time before it gets a resource offer satisfying its filter criteria, which would have a detrimental impact on performance.
On the other hand, misusing the reservation feature can lead to the very problems, such as resource underutilization, that Mesos sought to resolve in the first place. Thus, it must be used judiciously. The Mesos access control mechanism ensures that the framework requesting a reservation has the appropriate authorization to do so.
Mesos provides two methods of resource reservations:
- Static reservation
- Dynamic reservation
Static reservation
In this type of reservation, specified resources can be reserved on specific slave nodes for a particular framework or group of frameworks. In order to reserve resources for a framework, it must be assigned to a role. Multiple frameworks can be assigned to a single role if necessary. Only the frameworks assigned to a particular role (say, role X) are entitled to get offers for the resources reserved for role X. Roles need to be defined first, then frameworks need to be assigned to the required roles, and finally, resource policies must be set for these roles.
Roles can be defined by starting the master with the following flag:
--roles="name1,name2,name3"
For example, if we want to define a role called hdfs, then we can start the master using the following:
--roles="hdfs"
Alternatively, you can do this by running the following:
echo hdfs > /etc/mesos-master/role
Now, the master needs to be restarted by running the following:
sudo service mesos-master restart
Now, we need to map the frameworks to specific roles. The method of doing this varies by framework. Some, such as Marathon, can be configured using the --mesos_role flag. In the case of HDFS, this can be done by setting mesos.hdfs.role in mesos-site.xml to the hdfs role defined earlier:
<property>
  <name>mesos.hdfs.role</name>
  <value>hdfs</value>
</property>
Custom roles for frameworks can be specified by setting the role option within FrameworkInfo to the desired value (the default is *).
Role resource policy setting
Resources on each slave can be reserved for a particular role using the slave's --resources flag. Slave-level resource policy setting has its drawbacks: the management overhead quickly becomes daunting as the cluster size and the number of frameworks grow.
If we have eight cores and 24 GB RAM (memory is specified in MB in Mesos) available on a particular slave and want to reserve two cores and 6 GB RAM for the hdfs role, we can make the following changes on the slave:
--resources="cpus:6;mem:18432;cpus(hdfs):2;mem(hdfs):6144"
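The arithmetic behind this flag can be sanity-checked with a short Python sketch. The helper below is purely illustrative and not part of Mesos: it splits a slave's total capacity into an unreserved remainder and a per-role reservation, so the two parts always add up to the totals.

```python
# Sketch: derive a slave's --resources string from total capacity and a
# per-role reservation. Hypothetical helper for illustration only.

def build_resources_flag(total_cpus, total_mem_mb, role, cpus, mem_mb):
    """Split total capacity into an unreserved part and a part
    reserved for `role`, formatted like the slave's --resources flag."""
    unreserved_cpus = total_cpus - cpus
    unreserved_mem = total_mem_mb - mem_mb
    return (f"cpus:{unreserved_cpus};mem:{unreserved_mem};"
            f"cpus({role}):{cpus};mem({role}):{mem_mb}")

# 8 cores / 24 GB total, reserving 2 cores / 6 GB for the hdfs role:
flag = build_resources_flag(8, 24 * 1024, "hdfs", 2, 6 * 1024)
print(flag)  # cpus:6;mem:18432;cpus(hdfs):2;mem(hdfs):6144
```

Note that the unreserved portion (6 CPUs, 18432 MB) plus the reservation (2 CPUs, 6144 MB) must equal the slave's full capacity.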
Once this is done, the mesos-slave process with these changed settings can be stopped by executing the following:
sudo service mesos-slave stop
The older state on these slaves can be removed with the following command. Note that any running tasks must be terminated manually, as their task states will also be removed:
rm -f /tmp/mesos/meta/slaves/latest
Now, the slave can be restarted with the following command:
sudo service mesos-slave start
Dynamic reservation
The main drawback of static reservation is that the reserved resources cannot be used by other roles while idle, nor can they be unreserved and returned to the wider pool. This leads to poor resource utilization. To overcome this, support for dynamic reservation was added in version 0.23.0, which allows users to reserve and unreserve resources dynamically according to workload requirements.
For a resource offer, frameworks can send back the following two messages (through the acceptOffers API) as a response:
Offer::Operation::Reserve
Offer::Operation::Unreserve
These are described in detail in the following sections. Note that the framework's principal is required for authorization, which will be discussed in more detail in Chapter 6, Mesos Frameworks.
Each framework can reserve resources as part of the offer cycle. As an example, let's say that a framework receives a resource offer with eight cores and 12 GB RAM, all unreserved:
{
  "id": <offer_id>,
  "framework_id": <framework_id>,
  "slave_id": <slave_id>,
  "hostname": <hostname>,
  "resources": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 8 },
      "role": "*"
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 12288 },
      "role": "*"
    }
  ]
}
We can reserve four cores and 6 GB RAM for the framework by specifying the quantity of each resource type that needs to be reserved and the framework's role and principal in the following message:
{
  "type": Offer::Operation::RESERVE,
  "reserve": {
    "resources": [
      {
        "name": "cpus",
        "type": "SCALAR",
        "scalar": { "value": 4 },
        "role": <framework_role>,
        "reservation": {
          "principal": <framework_principal>
        }
      },
      {
        "name": "mem",
        "type": "SCALAR",
        "scalar": { "value": 6144 },
        "role": <framework_role>,
        "reservation": {
          "principal": <framework_principal>
        }
      }
    ]
  }
}
The next resource offer will include the preceding reserved resources, as follows:
{
  "id": <offer_id>,
  "framework_id": <framework_id>,
  "slave_id": <slave_id>,
  "hostname": <hostname>,
  "resources": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 4 },
      "role": <framework_role>,
      "reservation": {
        "principal": <framework_principal>
      }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 6144 },
      "role": <framework_role>,
      "reservation": {
        "principal": <framework_principal>
      }
    }
  ]
}
Each framework can also unreserve resources as part of the offer cycle. In the previous example, we reserved four cores and 6 GB RAM for the framework/role, and these resources will continue to be offered to it until specifically unreserved. The way to unreserve them is explained here.
First, we will receive the reserved resource offer, as follows:
{
  "id": <offer_id>,
  "framework_id": <framework_id>,
  "slave_id": <slave_id>,
  "hostname": <hostname>,
  "resources": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 4 },
      "role": <framework_role>,
      "reservation": {
        "principal": <framework_principal>
      }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 6144 },
      "role": <framework_role>,
      "reservation": {
        "principal": <framework_principal>
      }
    }
  ]
}
We can now unreserve four cores and 6 GB RAM for the framework by specifying the quantity of each resource type that needs to be unreserved and the framework's role and principal in the following message:
{
  "type": Offer::Operation::UNRESERVE,
  "unreserve": {
    "resources": [
      {
        "name": "cpus",
        "type": "SCALAR",
        "scalar": { "value": 4 },
        "role": <framework_role>,
        "reservation": {
          "principal": <framework_principal>
        }
      },
      {
        "name": "mem",
        "type": "SCALAR",
        "scalar": { "value": 6144 },
        "role": <framework_role>,
        "reservation": {
          "principal": <framework_principal>
        }
      }
    ]
  }
}
In subsequent resource offers, these unreserved resources will become part of the wider unreserved pool and start being offered to other frameworks.
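To make the shape of these messages concrete, here is a small Python sketch that builds the RESERVE and UNRESERVE operations as plain dictionaries. The helper names are hypothetical and this is not a Mesos client library; real frameworks send these operations as protobuf messages through the scheduler API.

```python
# Sketch: build Reserve/Unreserve operation payloads as plain dicts,
# mirroring the JSON shapes shown above. Illustrative only.

def scalar_resource(name, value, role, principal):
    """A scalar resource reserved for `role` on behalf of `principal`."""
    return {
        "name": name,
        "type": "SCALAR",
        "scalar": {"value": value},
        "role": role,
        "reservation": {"principal": principal},
    }

def reservation_op(op_type, resources):
    """Build an operation; op_type is 'reserve' or 'unreserve'."""
    return {"type": op_type.upper(), op_type: {"resources": resources}}

# Four cores and 6 GB RAM for a hypothetical role/principal:
resources = [
    scalar_resource("cpus", 4, "hdfs", "ops"),
    scalar_resource("mem", 6144, "hdfs", "ops"),
]
reserve = reservation_op("reserve", resources)
unreserve = reservation_op("unreserve", resources)
print(reserve["type"], unreserve["type"])  # RESERVE UNRESERVE
```

The same resource list appears in both operations: unreserving must name exactly the resources (quantity, role, and principal) that were reserved.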
The /reserve and /unreserve HTTP endpoints were introduced in v0.25.0 and can be used for dynamic reservation management from the master.
Let's say that we are interested in reserving four cores and 6 GB RAM for a role on a slave whose ID is <slave_id>. An HTTP POST request can be sent to the /reserve endpoint, as follows:
$ curl -i \
  -u <operator_principal>:<password> \
  -d slaveId=<slave_id> \
  -d resources='[
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 4 },
      "role": <framework_role>,
      "reservation": {
        "principal": <operator_principal>
      }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 6144 },
      "role": <framework_role>,
      "reservation": {
        "principal": <operator_principal>
      }
    }
  ]' \
  -X POST http://<ip>:<port>/master/reserve
The response can be one of the following:
- 200 OK: Success
- 400 BadRequest: Invalid arguments (for example, missing parameters)
- 401 Unauthorized: Unauthorized request
- 409 Conflict: Insufficient resources to satisfy the reserve operation
Now, if we are interested in unreserving the resources that were reserved earlier, an HTTP POST request can be sent to the /unreserve endpoint, as follows:
$ curl -i \
  -u <operator_principal>:<password> \
  -d slaveId=<slave_id> \
  -d resources='[
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 4 },
      "role": <framework_role>,
      "reservation": {
        "principal": <operator_principal>
      }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 6144 },
      "role": <framework_role>,
      "reservation": {
        "principal": <operator_principal>
      }
    }
  ]' \
  -X POST http://<ip>:<port>/master/unreserve
The response can be one of the following:
- 200 OK: Success
- 400 BadRequest: Invalid arguments (for example, missing parameters)
- 401 Unauthorized: Unauthorized request
- 409 Conflict: Insufficient resources to satisfy the unreserve operation
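These endpoints can also be called programmatically. The following Python sketch only builds (and does not send) the form-encoded body the endpoints expect; the slave ID, role, and principal values are hypothetical, and actually sending the request requires a running master and valid credentials.

```python
# Sketch: build the form body for the master's /reserve and /unreserve
# endpoints. Illustrative only; values below are made up.
import json
from urllib.parse import urlencode

def reservation_body(slave_id, resources):
    """Encode slaveId and a JSON resources array as form data."""
    return urlencode({"slaveId": slave_id, "resources": json.dumps(resources)})

resources = [{
    "name": "cpus",
    "type": "SCALAR",
    "scalar": {"value": 4},
    "role": "hdfs",
    "reservation": {"principal": "ops"},
}]
body = reservation_body("201511-1234-S0", resources)
print(body.startswith("slaveId=201511-1234-S0"))  # True
```

The same body works for both endpoints; only the URL path (/master/reserve versus /master/unreserve) differs.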
Oversubscription
Users generally provision frameworks with enough buffer resources to handle unexpected workload surges. This leads to overall underutilization of the cluster because a sizeable chunk of resources lies idle. Added up across frameworks, this amounts to significant wastage. The concept of oversubscription, introduced in v0.23.0, seeks to address this problem by executing low-priority tasks, such as background processes or ad hoc noncritical analytics, on these idle resources.
To enable this, two additional components are introduced:
- Resource estimator: This is used to determine the amount of idle resources that can be used by best-effort processes
- Quality of Service (QoS) controller: This is used to terminate these best-effort tasks when a workload surge or performance degradation in the original tasks is observed
While basic default estimators and controllers are provided, Mesos also allows users to create their own custom ones.
In addition, the existing resource allocator, resource monitor, and Mesos slave are also extended with new flags and options. The following diagram illustrates how the oversubscription concept works (source: http://mesos.apache.org/documentation/latest/oversubscription/):
The following steps are followed:
- The primary step involves collecting usage statistics and estimating the amount of oversubscribed resources available for use by low-priority jobs. The resource monitor sends these statistics by passing ResourceStatistics messages to the resource estimator.
- The estimator identifies the quantity of oversubscribed resources by leveraging algorithms that calculate these buffer amounts. Mesos provides the ability to develop custom resource estimators based on user-specified logic.
- Each slave polls the resource estimator to get the most recent estimates.
- The slave then periodically (whenever the estimate values change) transmits this information to the allocator module in the master.
- The allocator marks these oversubscribed resources as "revocable" resources and monitors these separately.
- Frameworks that register with the REVOCABLE_RESOURCES capability set in FrameworkInfo receive offers of these revocable resources and can schedule tasks on them using the launchTasks() API. Note that revocable resources cannot be dynamically reserved.
This capability can be set as follows:
FrameworkInfo framework;
framework.set_name("Revocable framework");
framework.add_capabilities()->set_type(
    FrameworkInfo::Capability::REVOCABLE_RESOURCES);
An offer containing revocable resources looks like the following:
{
  "id": <offer_id>,
  "framework_id": <framework_id>,
  "slave_id": <slave_id>,
  "hostname": <hostname>,
  "resources": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 4 },
      "role": "*"
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 6144 },
      "role": "*"
    },
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 1 },
      "role": "*",
      "revocable": {}
    }
  ]
}
- The task is launched on the slave when the runTask request is received. A container that uses even a single revocable resource is considered a revocable container and can be terminated by the QoS controller.
- The original tasks are also monitored continuously, and the revocable resources are returned to them if any performance deterioration or workload spike is observed. This is known as interference detection.
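The revocable-offer handling described above can be sketched as a simple filter over an offer's resources. This is plain Python over the JSON offer shape shown earlier, not a Mesos API: a resource carrying a "revocable" field may be reclaimed by the QoS controller at any time, so a framework would schedule only best-effort work on it.

```python
# Sketch: separate revocable from regular resources in an offer dict,
# mirroring the JSON offer shape shown above. Illustrative only.

def split_revocable(offer):
    """Return (regular, revocable) resource lists from an offer."""
    regular, revocable = [], []
    for resource in offer["resources"]:
        (revocable if "revocable" in resource else regular).append(resource)
    return regular, revocable

offer = {
    "resources": [
        {"name": "cpus", "type": "SCALAR", "scalar": {"value": 4}, "role": "*"},
        {"name": "mem", "type": "SCALAR", "scalar": {"value": 6144}, "role": "*"},
        {"name": "cpus", "type": "SCALAR", "scalar": {"value": 1}, "role": "*",
         "revocable": {}},
    ]
}
regular, revocable = split_revocable(offer)
print(len(regular), len(revocable))  # 2 1
```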
Currently, the Mesos resource estimator is fairly basic, with two default implementations: the fixed and noop resource estimators. The fixed estimator tags a fixed, operator-specified set of resources as oversubscribed, while the noop estimator returns a null estimate when polled by the slave, effectively saying that no resources are available for oversubscription.
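The observable behavior of these two default estimators can be mimicked in a few lines of Python. This is a conceptual sketch only; the real estimators are C++ modules inside Mesos, and the class names below are invented for illustration.

```python
# Conceptual sketch of the two default estimators' behavior.
# The real ones are C++ Mesos modules; these classes are hypothetical.

class NoopResourceEstimator:
    """Always reports that nothing is available for oversubscription."""
    def oversubscribable(self):
        return {}  # empty estimate: no revocable resources

class FixedResourceEstimator:
    """Always reports a fixed, operator-configured set of resources,
    regardless of actual usage on the slave."""
    def __init__(self, resources):
        self.resources = resources
    def oversubscribable(self):
        return self.resources

fixed = FixedResourceEstimator({"cpus": 2, "mem": 4096})
noop = NoopResourceEstimator()
print(fixed.oversubscribable())  # {'cpus': 2, 'mem': 4096}
print(noop.oversubscribable())   # {}
```

Because the fixed estimator ignores actual usage, it is only safe when the operator knows the tagged resources are reliably idle; dynamic estimation is what efforts such as Project Serenity aim to provide.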
Active work is being done on introducing sophisticated and dynamic oversubscribed resource estimation models (a module called Project Serenity by Mesosphere and Intel, for instance) to maximize resource utilization while ensuring no impact on Quality of Service at the same time.
Custom resource estimators implement the following interface:
class ResourceEstimator
{
public:
  virtual Try<Nothing> initialize(
      const lambda::function<process::Future<ResourceUsage>()>& usage) = 0;

  virtual process::Future<Resources> oversubscribable() = 0;
};
Similarly, custom QoS controllers implement the following interface:
class QoSController
{
public:
  virtual Try<Nothing> initialize(
      const lambda::function<process::Future<ResourceUsage>()>& usage) = 0;

  virtual process::Future<std::list<QoSCorrection>> corrections() = 0;
};
Extendibility
Different organizations have different requirements, and even within the same organization, different users run clusters in different ways, with different scale and latency requirements. Users need to deal with application-specific behavior, ensure that industry-specific security compliance requirements are met, and so on. All this means that Mesos needs to be extremely customizable and extensible if it is to achieve its goal of serving as the OS for the entire datacenter for all organizations. What was required was a mechanism that keeps the Mesos core small and lightweight while making it powerful enough to allow as much customization and extension as needed.
A number of software systems, such as browsers, support libraries to:
- Extend feature support
- Abstract complexity
- Make development configuration-driven