cloud-init Demystified - Local Stage

Posted on 2018-08-13 | Category: Technology

Introduction

systemd service: cloud-init-local.service
runs: As soon as possible with / mounted read-write.
blocks: as much of boot as possible, must block network bringup.
modules: none

Purpose:

Find a local datasource
Apply the network configuration to the system

Network configuration can come from:

datasource
fallback
none (disabled)

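Implementation

The service simply runs the command cloud-init init --local, whose entry point is the main_init function in cloudinit.cmd.main. A comment at the top of that function describes what the init stage of cloud-init does: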

# Cloud-init 'init' stage is broken up into the following sub-stages
# 1. Ensure that the init object fetches its config without errors
# 2. Setup logging/output redirections with resultant config (if any)
# 3. Initialize the cloud-init filesystem
# 4. Check if we can stop early by looking for various files
# 5. Fetch the datasource
# 6. Connect to the current instance location + update the cache
# 7. Consume the userdata (handlers get activated here)
# 8. Construct the modules object
# 9. Adjust any subsequent logging/output redirections using the modules
#    objects config as it may be different from init object
# 10. Run the modules for the 'init' stage
# 11. Done!

Following the 10 steps described in the comment, let's take a brief look at the relevant code:

Sub-stages

0. Prepare

First, a stages.Init object is instantiated:

init = stages.Init(ds_deps=deps, reporter=args.reporter)

This stages.Init object is the entity on which the Local stage operates.

1. Ensure that the init object fetches its config without errors

This sub-stage corresponds to the following code:

init.read_cfg(extract_fns(args))

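This is where the init object reads its configuration files. extract_fns(args) returns the configuration files given on the command line with the -f/--file option.

The init.read_cfg() method reads the following kinds of configuration:

additional config (specified on the command line)
base config (the system's base configuration)
- builtin config
- cloud.cfg
- runtime config
- Kernel/cmdline parameters
env config (specified through environment variables)
instance configs (the #cloud-config supplied when the instance was created)
datasource configs (configuration obtained by the datasource object)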
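
To make the layering above more concrete, here is a minimal sketch of how several config sources could be folded into a single dict. It is not cloud-init's actual merging code: merge_layers and all sample values are hypothetical, and the precedence rule (earlier layers win) is an assumption made for illustration.

```python
def merge_layers(layers):
    """Merge dicts so that earlier layers win; nested dicts merge recursively.

    Hypothetical helper for illustration only.
    """
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if key not in merged:
                # First layer to define a key decides its value.
                merged[key] = value
            elif isinstance(merged[key], dict) and isinstance(value, dict):
                # Both sides are dicts: merge them, still preferring the earlier one.
                merged[key] = merge_layers([merged[key], value])
    return merged


# Made-up sample layers, ordered from highest to lowest priority.
additional = {"def_log_file": "/tmp/custom.log"}
base = {"def_log_file": "/var/log/cloud-init.log",
        "system_info": {"distro": "ubuntu"}}
builtin = {"system_info": {"distro": "generic",
                           "paths": {"cloud_dir": "/var/lib/cloud"}}}

cfg = merge_layers([additional, base, builtin])
print(cfg["def_log_file"])
print(cfg["system_info"])
```

In this sketch the command-line layer overrides the log file, while nested keys that only a lower layer defines (such as paths) still survive the merge.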

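2. Setup logging/output redirections with resultant config (if any)

This step redirects log output according to the resulting configuration and performs basic logging setup.

3. Initialize the cloud-init filesystem

This step initializes the cloud-init filesystem. It mainly makes sure that two things exist: the directory in which cloud-init stores the instance-related files it generates (the cloud directory, /var/lib/cloud by default), and the cloud-init log file (/var/log/cloud-init.log by default) together with its ownership. These make up the so-called cloud-init filesystem.

Relevant code: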

    def _initialize_filesystem(self):
        util.ensure_dirs(self._initial_subdirs())
        log_file = util.get_cfg_option_str(self.cfg, 'def_log_file')
        if log_file:
            util.ensure_file(log_file)
            perms = self.cfg.get('syslog_fix_perms')
            if not perms:
                perms = {}
            if not isinstance(perms, list):
                perms = [perms]

            error = None
            for perm in perms:
                u, g = util.extract_usergroup(perm)
                try:
                    util.chownbyname(log_file, u, g)
                    return
                except OSError as e:
                    error = e

            LOG.warning("Failed changing perms on '%s'. tried: %s. %s",
                        log_file, ','.join(perms), error)

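4. Check if we can stop early by looking for various files

This step behaves differently in the Local stage and the Network stage; only the Local-stage behavior is covered here.

In the Local stage, cloud-init checks whether the manual_cache_clean option is True, or whether the marker file looked up through the manual_clean_marker key exists under the instance directory (which is actually a symlink). If either condition holds, the cache is expected to be cleaned manually rather than automatically.

cloud-init then deletes the instance symlink mentioned above along with the no-net file. The no-net file is checked in the Network stage: if it exists, no data needs to be fetched over the network, and cloud-init exits that stage right away.

Relevant code: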

        existing = "check"
        mcfg = util.get_cfg_option_bool(init.cfg, 'manual_cache_clean', False)
        if mcfg:
            LOG.debug("manual cache clean set from config")
            existing = "trust"
        else:
            mfile = path_helper.get_ipath_cur("manual_clean_marker")
            if os.path.exists(mfile):
                LOG.debug("manual cache clean found from marker: %s", mfile)
                existing = "trust"

        init.purge_cache()
        # Delete the non-net file as well
        util.del_file(os.path.join(path_helper.get_cpath("data"), "no-net"))

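Open question: what is existing used for?

5. Fetch the datasource

As the title suggests, this step locates the datasource. In the Local stage that mainly means a local datasource (for example, config drive). According to the code, cloud-init first tries to load the datasource from the cache (the data already generated in the cloud directory): it looks for the obj_pkl file in the directory that the instance symlink points to and reads the datasource from it. Since the previous step already removed the instance symlink, the Local stage cannot obtain a datasource from the cache.

Relevant code: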

    def _restore_from_checked_cache(self, existing):
        if existing not in ("check", "trust"):
            raise ValueError("Unexpected value for existing: %s" % existing)

        ds = self._restore_from_cache()
        if not ds:
            return (None, "no cache found")

        run_iid_fn = self.paths.get_runpath('instance_id')
        if os.path.exists(run_iid_fn):
            run_iid = util.load_file(run_iid_fn).strip()
        else:
            run_iid = None

        if run_iid == ds.get_instance_id():
            return (ds, "restored from cache with run check: %s" % ds)
        elif existing == "trust":
            return (ds, "restored from cache: %s" % ds)
        else:
            if (hasattr(ds, 'check_instance_id') and
                    ds.check_instance_id(self.cfg)):
                return (ds, "restored from checked cache: %s" % ds)
            else:
                return (None, "cache invalid in datasource: %s" % ds)

    def _restore_from_cache(self):
        # We try to restore from a current link and static path
        # by using the instance link, if purge_cache was called
        # the file wont exist.
        return _pkl_load(self.paths.get_ipath_cur('obj_pkl'))
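
One possible reading of the earlier question about existing, based only on the function above: it controls how much a cached datasource is trusted when the instance id recorded under the run path does not match. The sketch below restates that decision as a pure function. cache_decision and its parameters are hypothetical names, and ds_check_ok stands in for the result of the datasource's check_instance_id call.

```python
def cache_decision(existing, cached_iid, run_iid, ds_check_ok):
    """Return True if a cached datasource would be reused.

    existing     -- "check" or "trust", as in the excerpt above
    cached_iid   -- instance id stored in the cache, or None if no cache
    run_iid      -- instance id recorded under the run path, or None
    ds_check_ok  -- stand-in for ds.check_instance_id(cfg)
    """
    if existing not in ("check", "trust"):
        raise ValueError("Unexpected value for existing: %s" % existing)
    if cached_iid is None:
        return False      # nothing cached, nothing to restore
    if run_iid == cached_iid:
        return True       # same boot already confirmed this instance id
    if existing == "trust":
        return True       # manual cache clean: trust the cache blindly
    return ds_check_ok    # "check": let the datasource validate itself


print(cache_decision("trust", "i-abc123", None, False))
print(cache_decision("check", "i-abc123", None, False))
```

Under this reading, "trust" corresponds to the manual cache clean case from step 4, and "check" means the cache is only reused when the datasource can confirm the instance id.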

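Because no datasource can be loaded from the cache, cloud-init falls back to the datasource_list option in the cloud.cfg configuration file.

Every datasource class has a dependency list (filesystem, network, or both). When looking for a datasource, cloud-init first filters the datasource classes by these dependencies. Datasources used in the Local stage depend only on the filesystem, for example the DataSourceConfigDrive class.

After filtering, cloud-init instantiates each class in turn and calls the object's update_metadata method, which tries to refresh the cached metadata. As soon as one datasource object refreshes its cached metadata successfully (including the instance-id and similar information), cloud-init considers the datasource found and returns that object (userdata may already have been populated while the cache was refreshed).

Because every datasource that passes the Local-stage filter depends only on the local filesystem, this step amounts to a search for a local data source.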
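
The filter-then-probe flow described above can be sketched roughly as follows. This is an illustration, not cloud-init's code: the two Fake* classes, the CANDIDATES table, and the argument-free update_metadata are made up, and matching on an exact dependency set is an assumption of this sketch.

```python
# Stand-ins for the two dependency kinds mentioned in the text.
DEP_FILESYSTEM = "FILESYSTEM"
DEP_NETWORK = "NETWORK"


class FakeConfigDrive:
    """Hypothetical local datasource that always finds its data."""

    def update_metadata(self):
        self.metadata = {"instance-id": "i-abc123"}
        return True


class FakeMetadataService:
    """Hypothetical network datasource; never probed in the Local stage."""

    def update_metadata(self):
        return False


# Each candidate class is paired with its dependency list.
CANDIDATES = [
    (FakeMetadataService, (DEP_FILESYSTEM, DEP_NETWORK)),
    (FakeConfigDrive, (DEP_FILESYSTEM,)),
]


def list_from_depends(depends, candidates):
    """Keep only classes whose dependency set equals the requested one."""
    wanted = set(depends)
    return [cls for cls, deps in candidates if set(deps) == wanted]


def find_source(depends, candidates):
    """Instantiate each matching class and return the first that has data."""
    for cls in list_from_depends(depends, candidates):
        ds = cls()
        if ds.update_metadata():
            return ds
    return None


local_ds = find_source((DEP_FILESYSTEM,), CANDIDATES)
print(type(local_ds).__name__, local_ds.metadata)
```

With only the filesystem dependency requested, the network-dependent class is filtered out before it is ever instantiated, which mirrors why the Local stage only discovers local datasources.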

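6. Connect to the current instance location + update the cache

With the instance-id obtained in the previous step, this step creates the directory structure for that instance-id and points the instance symlink at the correct instance-id directory. It then records the datasource used this time, the datasource used last time, the previous instance-id (previous-instance-id), and related information.

Relevant code: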

    def _reflect_cur_instance(self):
        # Remove the old symlink and attach a new one so
        # that further reads/writes connect into the right location
        idir = self._get_ipath()
        util.del_file(self.paths.instance_link)
        util.sym_link(idir, self.paths.instance_link)

        # Ensures these dirs exist
        dir_list = []
        for d in self._get_instance_subdirs():
            dir_list.append(os.path.join(idir, d))
        util.ensure_dirs(dir_list)

        # Write out information on what is being used for the current instance
        # and what may have been used for a previous instance...
        dp = self.paths.get_cpath('data')

        # Write what the datasource was and is..
        ds = "%s: %s" % (type_utils.obj_name(self.datasource), self.datasource)
        previous_ds = None
        ds_fn = os.path.join(idir, 'datasource')
        try:
            previous_ds = util.load_file(ds_fn).strip()
        except Exception:
            pass
        if not previous_ds:
            previous_ds = ds
        util.write_file(ds_fn, "%s\n" % ds)
        util.write_file(os.path.join(dp, 'previous-datasource'),
                        "%s\n" % (previous_ds))

        # What the instance id was and is...
        iid = self.datasource.get_instance_id()
        iid_fn = os.path.join(dp, 'instance-id')

        previous_iid = self.previous_iid()
        util.write_file(iid_fn, "%s\n" % iid)
        util.write_file(self.paths.get_runpath('instance_id'), "%s\n" % iid)
        util.write_file(os.path.join(dp, 'previous-instance-id'),
                        "%s\n" % (previous_iid))

        self._write_to_cache()
        # Ensure needed components are regenerated
        # after change of instance which may cause
        # change of configuration
        self._reset()
        return iid

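Once this is done, cloud-init starts configuring the instance's network. It first has to obtain the network configuration, which takes the following steps:

First check whether the upgraded-network file exists in cloud-init's data directory. If it does, the network is considered already configured and the lookup returns immediately.
Read the kernel command-line configuration (cmdline_cfg), the network configuration of the datasource object (dscfg), and cloud-init's own network configuration (sys_cfg), then examine them in the order cmdline_cfg -> sys_cfg -> dscfg. If a source disables networking, return an empty network configuration; if a source provides a network configuration, return it.
If no network configuration has been found by this point, generate a fallback network config automatically: pick the NIC most likely to need connectivity, bring it up, and let it obtain an IP address through DHCP. The selection rules are in the find_fallback_nic method.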
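
The lookup order above can be sketched as a small function. This is a simplified illustration rather than cloud-init's implementation: find_network_config, is_disabled, and fallback are hypothetical names, the marker-file check is reduced to a boolean argument, and the interface name eth0 in the fallback config is made up.

```python
def is_disabled(cfg):
    """Treat {"config": "disabled"} as an explicit request for no networking."""
    return isinstance(cfg, dict) and cfg.get("config") == "disabled"


def fallback():
    """Hypothetical fallback: one NIC configured through DHCP."""
    return {"version": 1,
            "config": [{"type": "physical", "name": "eth0",
                        "subnets": [{"type": "dhcp"}]}]}


def find_network_config(cmdline_cfg, sys_cfg, ds_cfg, fallback_fn,
                        upgraded_marker_exists=False):
    """Return (config, source); config is None when nothing should be applied."""
    if upgraded_marker_exists:
        # Step 1: the upgraded-network marker short-circuits the lookup.
        return (None, "upgraded-network")
    # Step 2: the first source that disables or provides networking wins.
    for source, cfg in (("cmdline", cmdline_cfg),
                        ("system_cfg", sys_cfg),
                        ("ds", ds_cfg)):
        if is_disabled(cfg):
            return (None, source)
        if cfg:
            return (cfg, source)
    # Step 3: nothing found, so generate a fallback config.
    return (fallback_fn(), "fallback")


ds_cfg = {"version": 1, "config": [{"type": "physical", "name": "eth0"}]}
print(find_network_config(None, None, ds_cfg, fallback))
print(find_network_config(None, {"config": "disabled"}, ds_cfg, fallback))
print(find_network_config(None, None, None, fallback)[1])
```

Note that in this sketch a disabled system config hides the datasource's config entirely, because the sources are examined strictly in order.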

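If a network configuration is found through the steps above, it is applied to the instance. There is one more check at apply time: if the current instance is not a brand-new instance and the datasource does not need its cache refreshed at this stage, cloud-init does not configure the network.

Whether the instance is new is decided by comparing the current instance-id with the previous instance id: if they are equal, the instance is not new.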
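
The new-instance comparison can be sketched as below. This assumes that the id recorded by the previous boot lives in the instance-id file under the data directory, as written by _reflect_cur_instance in the excerpt above. The helper names and the NO_PREVIOUS sentinel are hypothetical.

```python
import os
import tempfile

NO_PREVIOUS = "NO_PREVIOUS_INSTANCE_ID"  # hypothetical sentinel for "never booted"


def read_previous_iid(data_dir):
    """Read the instance id recorded by the previous boot, if any."""
    path = os.path.join(data_dir, "instance-id")
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return NO_PREVIOUS


def is_new_instance(data_dir, current_iid):
    """An instance is new when its id differs from the recorded one."""
    return read_previous_iid(data_dir) != current_iid


with tempfile.TemporaryDirectory() as data_dir:
    # No file yet: this looks like a first boot.
    first_boot = is_new_instance(data_dir, "i-abc123")
    with open(os.path.join(data_dir, "instance-id"), "w") as f:
        f.write("i-abc123\n")
    # Same id as recorded: a plain reboot.
    reboot = is_new_instance(data_dir, "i-abc123")
    # Different id: for example an image cloned into a new instance.
    recloned = is_new_instance(data_dir, "i-def456")

print(first_boot, reboot, recloned)
```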

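After the network has been configured successfully, cloud-init calls the datasource object's setup method, giving the datasource a chance to use the network for further work such as fetching userdata and vendordata.

7. Consume the userdata (handlers get activated here)

Taken literally, this step consumes the userdata, which in practice means executing it (vendordata included).

Relevant code: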

    def consume_data(self, frequency=PER_INSTANCE):
        # Consume the userdata first, because we need want to let the part
        # handlers run first (for merging stuff)
        with events.ReportEventStack("consume-user-data",
                                     "reading and applying user-data",
                                     parent=self.reporter):
                self._consume_userdata(frequency)
        with events.ReportEventStack("consume-vendor-data",
                                     "reading and applying vendor-data",
                                     parent=self.reporter):
                self._consume_vendordata(frequency)

        # Perform post-consumption adjustments so that
        # modules that run during the init stage reflect
        # this consumed set.
        #
        # They will be recreated on future access...
        self._reset()
        # Note(harlowja): the 'active' datasource will have
        # references to the previous config, distro, paths
        # objects before the load of the userdata happened,
        # this is expected.

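Let's look first at how userdata is consumed. In this step cloud-init registers several handlers:

Four default handlers
- CloudConfigPartHandler
- ShellScriptPartHandler
- BootHookPartHandler
- UpstartJobPartHandler
Handlers found in the handlers directory under the cloud directory

cloud-init then uses these handlers to process the userdata.

cloud-init also lets users supply custom part handlers to process userdata; the entry point is the handle_part method.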
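
As an illustration of such a custom handler, the sketch below follows the part-handler layout from the cloud-init user-data documentation: a list_types function plus handle_part(data, ctype, filename, payload). The MIME type text/x-demo and the received list are made up for this example, and a real handler would do something more useful than collect payloads.

```python
#part-handler

# Collected (filename, payload) pairs; only here so the effect is observable.
received = []


def list_types():
    # MIME types this handler wants to receive ("text/x-demo" is made up).
    return ["text/x-demo"]


def handle_part(data, ctype, filename, payload):
    # cloud-init calls this once with "__begin__" before any parts,
    # once per matching part, and once with "__end__" afterwards.
    if ctype == "__begin__":
        received.clear()
        return
    if ctype == "__end__":
        print("handled %d part(s)" % len(received))
        return
    received.append((filename, payload))


# Simulated call sequence, standing in for what cloud-init would drive.
handle_part(None, "__begin__", None, None)
handle_part(None, "text/x-demo", "part-001", "hello")
handle_part(None, "__end__", None, None)
```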

8. Construct the modules object

This step is simple: it just constructs an object of type Modules.

9. Adjust any subsequent logging/output redirections using the modules objects config as it may be different from init object

Reconfigure log output according to the Modules object.

10. Run the modules for the 'init' stage

Run all modules listed under the cloud_init_modules section.
