.. _deploy-at-scale:

Deploy at Scale
###############

This guide describes deployment considerations and strategies for deploying
|CL-ATTR| at scale in your environment.

.. contents::
   :local:
   :depth: 1

Overview
********

In this guide, the term *endpoint* refers to a system targeted for |CL|
installation, whether that is a datacenter system or a unit deployed in the
field.

.. note::

   This guide is not a replacement or blueprint for designing your own IT
   operating environment. Implementation details for a scale deployment are
   beyond the scope of this guide.

   Your |CL| deployment should complement your existing environment and
   available tools. It is assumed that core IT dependencies of your
   environment, such as your network, are healthy and scaled to suit the
   deployment.

Pick a usage and update strategy
********************************

Different business scenarios call for different deployment methodologies.
|CL| offers the flexibility to continue consuming the upstream |CL|
distribution, or to fork away from the |CL| distribution and act as your own
:abbr:`OSV (Operating System Vendor)`. Below is an overview of some
considerations.

Create your own Linux distribution (mix)
========================================

This approach forks away from the |CL| upstream and has you act as your own
:abbr:`OSV (Operating System Vendor)` by leveraging the :ref:`mixer` process
to create customized images based on |CL|. This level of responsibility
requires adopting additional infrastructure and processes. In return, this
approach *offers you a high degree of control and customization*.

Consider:

* Development systems that generate bundles and updates should have
  sufficient performance for the task and be separate from the swupd update
  web servers that serve update content to production machines.

* swupd update web servers that serve update content to production machines
  should be appropriately scaled. |WEB-SERVER-SCALE| See :ref:`mixer` for
  more information about update servers.

Adopt an agile methodology
==========================

The cloud, and other scaled deployments, are all about flexibility and speed.
It only makes sense that any |CL| deployment strategy should follow suit.
Manually rebuilding your own bundles or mix for every release is not
sustainable at a large scale. A |CL| deployment pipeline should be agile
enough to validate and produce new versions with speed. Whether or not those
updates actually make their way to production can be a separate business
decision. However, this *ability to frequently roll new versions* of software
to your endpoints is an important prerequisite.

You own the validation and lifecycle of the OS and should treat it like any
other software development lifecycle. Below are some pointers:

* Thoroughly understand the custom software packages that you will need to
  integrate with |CL| and maintain, along with their dependencies.

* Set up a path to production for building |CL| based images. At minimum,
  this should include:

  * A development clr-on-clr environment to test building packages and
    bundles for |CL| systems.
  * A pre-production environment to deploy |CL| versions to before
    production.

* Employ a continuous integration and continuous deployment (CI/CD)
  philosophy in order to:

  - Automatically pull custom packages as they are updated from their
    upstream projects or vendors.
  - Generate |CL| bundles, and potentially bootable images, with your
    customizations, if any.
  - Measure against metrics and indicators that are relevant to your
    business (e.g., performance, power) from release to release.
  - Integrate with your organization's governance processes, such as change
    control.
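The tooling that drives such a pipeline is your choice. Purely as an
illustration, the sketch below shows the kind of stage a pipeline might run
to rebuild update content with mixer after new custom packages are produced.
The workspace path, artifact location, and publishing target are
placeholders, and the exact commands and directory layout depend on how your
mix workspace is configured (see :ref:`mixer`).

.. code-block:: bash

   #!/bin/bash
   # Hypothetical CI stage: rebuild mix update content when upstream Clear
   # Linux or a custom package changes. Paths, hostnames, and directory
   # names below are placeholders for your own environment.
   set -euo pipefail

   cd /srv/mixer-workspace        # an existing mixer workspace

   # Stage custom RPMs produced by an earlier pipeline stage and register
   # them with the mix's local repository. This assumes the workspace
   # already tracks the bundles that consume these packages.
   cp /srv/artifacts/rpms/*.rpm local-rpms/
   mixer add-rpms

   # Rebuild bundle and update content for the new mix version.
   mixer build bundles
   mixer build update

   # Publish the generated update content to the swupd update web server.
   rsync -a update/www/ swupd.example.com:/var/www/swupd/

Whether a given build is then promoted from pre-production to production can
remain a separate, human-gated decision in the same pipeline.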
Versioning infrastructure
=========================

|CL| version numbers are very important because they apply to the whole
infrastructure stack, from OS components to libraries and applications. Good
record keeping is important, so keep a detailed registry and history of
previously deployed versions and their contents. With a glance at the |CL|
version numbers deployed, you should be able to tell whether your |CL|
systems are patched against a particular security vulnerability or
incorporate a critical new feature.

Pick an image distribution strategy
***********************************

Once you have decided on a usage and update strategy, you should understand
*how* |CL| will be deployed to your endpoints. In a large scale deployment,
interactive installers should be avoided in favor of automated installations
or prebuilt images.

There are many well-known ways to install an operating system at scale. Each
has its own benefits, and one may lend itself better to your environment
depending on the resources available to you. See the available
:ref:`image-types`.

Below are some common ways to install |CL| on systems at scale:

Bare metal
==========

A Preboot Execution Environment (PXE) or other out-of-band booting option is
one way to distribute |CL| to physical bare metal systems on a LAN. This
option works well if your customizations are fairly small in size and your
infrastructure can be stateless.

The |CL| `Downloads`_ page offers a live image that can be deployed as a PXE
boot server if one doesn't already exist in your environment. Also see
documentation on how to :ref:`bare-metal-install-server`.

Cloud instances or virtual machines
===================================

Image templates in the form of cloneable disks are an effective way to
distribute |CL| for virtual machine environments, whether on premises or
hosted by a Cloud Solution Provider (CSP). When used in concert with cloud VM
migration features, this can be a good option for giving your applications a
degree of high availability and workload mobility; VMs can be restarted on a
cluster of hypervisor hosts or moved between datacenters transparently.

The |CL| `Downloads`_ page offers prebuilt VM images, and |CL| is readily
available on popular CSPs. Also see documentation on how to
:ref:`virtual-machine-install`.

Containers
==========

Containerization platforms allow images to be pulled from a repository and
deployed repeatedly as isolated containers. Containers based on a |CL| image
can be a good option to blueprint and ship your application, including all of
its dependencies, as an artifact, while allowing you or your customers to
dynamically orchestrate and scale applications.

|CL| can run as a Docker host, provides a container image that can be pulled
from Docker Hub, and can be built into a customized container. For more
information, visit the `Containers`_ page.

Considerations with stateless systems
*************************************

An important |CL| concept is statelessness and the partitioning of system
data from user data. This concept can change the way you think about an
at-scale deployment.

Backup strategy
===============

A |CL| system and its infrastructure should be considered a commodity and be
easily reproducible. Avoid focusing on backing up the operating system itself
or default values. Instead, focus on backing up what is important and unique:
the application and its data. In other words, focus only on backing up
critical areas like :file:`/home`, :file:`/etc`, and :file:`/var`.
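Backup tooling is site specific and beyond the scope of this guide. Purely as
a minimal sketch, assuming a reachable backup host (the hostname and paths
below are placeholders), the stateful areas of an endpoint could be archived
like this:

.. code-block:: bash

   #!/bin/bash
   # Minimal sketch: archive only the stateful areas of a Clear Linux
   # endpoint. The backup host and destination path are placeholders.
   set -euo pipefail

   HOST="$(hostname)"
   STAMP="$(date +%Y%m%d)"
   ARCHIVE="/tmp/${HOST}-${STAMP}.tar.gz"

   # swupd state under /var/lib/swupd is reproducible and can be skipped.
   tar --exclude=/var/lib/swupd -czf "${ARCHIVE}" /home /etc /var

   scp "${ARCHIVE}" backup.example.com:/srv/backups/

Whatever tooling you use, the principle is the same: because the OS itself is
reproducible from update content, only the unique configuration and
application data need to leave the endpoint.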
Meaningful logging & telemetry
==============================

Offload logging and telemetry from endpoints to external servers so that the
data is persistent and can be accessed from another system when an issue
occurs.

* Remote syslogging in |CL| is available through the
  `systemd-journal-remote.service`_.

* |CL| offers a :ref:`telem-guide`, which can be a powerful tool for a large
  deployment to quickly crowdsource issues of interest. Take advantage of
  this feature with careful consideration of the target audience and the
  kind of data that would be valuable, and expose events appropriately.

  Like any web server, the telemetry server should be appropriately scaled
  and resilient. |WEB-SERVER-SCALE|

Orchestration and configuration management
==========================================

In cloud environments, where systems can be ephemeral, being able to
configure and maintain generic instances is valuable.

|CL| offers an efficient cloud-init style solution, `micro-config-drive`_,
through the *os-cloudguest* bundles, which allow you to configure many Day 1
tasks, such as setting the hostname, creating users, or placing SSH keys, in
an automated way at boot. For more information on automating configuration
during deployment of |CL| endpoints, see the :ref:`ipxe-install` guide.

A configuration management tool is useful for maintaining consistent system
and application-level configuration. Ansible\* is offered through the
*sysadmin-hostmgmt* bundle as a configuration management and automation tool.

Cloud-native applications
=========================

An infrastructure OS can be designed for good behavior, but it is ultimately
up to applications to make agile design choices.

Applications deployed on |CL| should aim to be host-aware but not depend on
any specific host to run. References should be relative and dynamic when
possible. The application architecture should incorporate an appropriate
tolerance for infrastructure outages.

Don't just note stateless design as a feature; continuously test it. Automate
its use by redeploying |CL| and your applications on new hosts. This
naturally minimizes configuration drift and exercises both your monitoring
systems and your business continuity plans.

.. _`Downloads`: https://clearlinux.org/downloads/

.. _`Containers`: https://clearlinux.org/downloads/containers

.. _`systemd-journal-remote.service`: https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html

.. _`micro-config-drive`: https://github.com/clearlinux/micro-config-drive

.. |WEB-SERVER-SCALE| replace:: There are many well-known ways to achieve a
   scalable and resilient web server for this purpose, however implementation
   details are not in the scope of this document. In general, they should be
   close to your endpoints, highly available, and easy to scale with a load
   balancer when necessary.