极客时间 (Geek Time) — 趣谈Linux操作系统 (A Playful Guide to the Linux Operating System)

Date: May 28, 2021

Table of contents:

Overview of the Linux operating system

Opening: why learn the Linux operating system

The course tells the story of the Linux kernel as the story of a company:

Startup phase

  • an open business environment (the x86 architecture)
  • founding the company (system boot)
  • the boss does all the work (real mode)

Growth phase

  • more and more projects (protected mode, multithreading)
  • outsourcing and project management (process management)
  • meeting-room management (memory management)
  • document management (file systems)
  • pre-sales and after-sales support (input/output device management)

Expansion phase

  • fostering collaboration between internal projects (inter-process communication)
  • cooperating with external companies (network communication)

Conglomeration

  • founding subsidiaries (virtualization)
  • internal startups (containerization)
  • the group corporation (Linux clusters)

01 Entrance test

02 Learning path

  1. Become fluent with the command line
  2. Program against system calls or glibc; recommended: Advanced Programming in the UNIX Environment
  3. Understand Linux kernel mechanisms; recommended: Understanding the Linux Kernel
  4. Read the Linux kernel source, focusing on the core logic and scenarios; recommended: 《LINUX 内核源代码情景分析》
  5. Customize Linux components
  6. Develop for real-world scenarios

03 The Linux kernel is the boss of a software outsourcing company

04 A quick tour of a few Linux commands

05 Writing a few system programs

The system call that creates a process is fork

  • The parent gets the child's PID as the return value and continues running; it can call waitpid to collect the child's status
  • The child gets 0 as the return value and typically calls execve to run another program
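The fork/exec/wait pattern above can be sketched in C; this is a minimal illustration, with /bin/true as an arbitrary choice of program for the child to run:

```c
#include <sys/wait.h>
#include <unistd.h>

/* fork a child, run /bin/true in it, and return the child's exit status */
int spawn_and_wait(void)
{
    pid_t pid = fork();            /* parent gets the child's PID, child gets 0 */
    if (pid < 0)
        return -1;
    if (pid == 0) {                /* child: replace this image with /bin/true */
        execl("/bin/true", "true", (char *)NULL);
        _exit(127);                /* reached only if execl fails */
    }
    int status = 0;
    waitpid(pid, &status, 0);      /* parent: collect the child's status */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```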

Memory management

Each process has an independent memory space. Statically allocated data lives in the data segment; data that must survive until explicitly freed lives on the heap. Heap memory is allocated with brk and mmap

File operations

  • open, close
  • creat
  • lseek
  • read, write

Signal handling

kill

Inter-process communication

For small messages, use message queues

  • msgget creates a new queue
  • msgsnd sends a message
  • msgrcv receives a message

For larger data, use shared memory

  • shmget creates a shared memory segment
  • shmat maps the segment into the caller's address space
  • sem_wait acquires a semaphore
  • sem_post releases it
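As a small illustration of the message-queue calls just listed (the message type, text, and helper name are made up for the example):

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct msgbuf_demo {
    long mtype;
    char mtext[32];
};

/* create a private queue, send one message, read it back; 0 on success */
int msg_roundtrip(char *out, int outlen)
{
    int qid = msgget(IPC_PRIVATE, 0600);   /* new queue, owner-only access */
    if (qid < 0)
        return -1;
    struct msgbuf_demo m = { .mtype = 1 };
    strncpy(m.mtext, "ping", sizeof(m.mtext));
    if (msgsnd(qid, &m, sizeof(m.mtext), 0) < 0)
        return -1;
    struct msgbuf_demo r;
    if (msgrcv(qid, &r, sizeof(r.mtext), 1, 0) < 0)
        return -1;
    strncpy(out, r.mtext, outlen);
    msgctl(qid, IPC_RMID, NULL);           /* remove the queue */
    return 0;
}
```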

Glibc is the open-source standard C library used on Linux; it wraps the operating system's system calls

For example, the kernel's sys_open corresponds to the open function in glibc

System initialization

06 The x86 architecture

The CPU works together with memory

The CPU consists of

  • an arithmetic unit
  • a data unit
  • a control unit

The control unit contains the instruction pointer register, which holds the memory address of the next instruction; the control unit keeps fetching instructions from the code segment into the instruction queue

An instruction has two parts

  • what operation to perform
  • what data to operate on

These two parts go to the arithmetic unit and the data unit respectively

The data unit reads data into data registers by address; when the arithmetic unit finishes a computation it stores the result in a data register, and a later instruction writes the result back to the data segment in memory

CPU components

For the 8086 processor

  • The data unit has eight 16-bit general-purpose registers: AX, BX, CX, DX, SP, BP, SI and DI. AX, BX, CX and DX can each be split into two 8-bit registers; for example AX splits into AH and AL, the high and low bytes
  • In the control unit, the IP register is the instruction pointer, pointing at the next instruction in the code segment; the CPU uses it to keep loading instructions from memory into the instruction queue for the arithmetic unit to execute. Every process has its own code and data segments, so there are segment registers pointing into different address spaces: CS, DS, SS and ES. CS is the code segment register, used to locate code in memory. DS is the data segment register, used to locate data in memory. SS is the stack segment register; the stack is accessed at one end only, last in first out: push pushes and pop pops, and function calls save the caller's context on the stack

CS and DS hold segment start addresses; the code-segment offset lives in IP, and data-segment offsets live in general-purpose registers

For 32-bit processors

Compared with the 8086

  • The general-purpose registers grow from eight 16-bit to eight 32-bit registers, remaining 16-bit compatible
  • The instruction pointer register grows to 32 bits, remaining 16-bit compatible
  • The segment registers CS, SS, DS and ES are still 16 bits, but they no longer hold a segment's start address. The start addresses live in a table of segment descriptors in memory, and a segment register holds the index of an entry in that table, called a selector

There are two modes

  • real mode
  • protected mode

The CPU starts in real mode, compatible with the 8086; when more memory is needed it switches to protected mode to use its 32-bit capabilities

07 从BIOS到bootloader

主板上有ROM只读存储器,固化了BIOS

Linux的Grub2配置系统启动

# 启动程序安装到相应的位置
grub2-install /dev/sda

会安装boot.img,由boot.S编译而成,正好512字节,安装到启动盘的第一个扇区,这个扇区为MBR

BIOS在启动的时候会加载启动盘(约定第一个扇区512字节且以0xAA55结束)到内存的0x7c00,boot.img的来加载grub2的另一个镜像core.img

core.img is made up of many modules

  • lzma_decompress.img
  • diskboot.img
  • kernel.img, etc.

The flow is

  1. diskboot.img is loaded first; its job is to load the rest of core.img
  2. then the decompressor lzma_decompress.img, which also switches to protected mode
  3. then kernel.img (grub's own kernel)
  4. finally the images for the individual modules

Switching to protected mode involves more work

  1. enabling segmentation
  2. enabling paging

grub_load_config() parses the grub.conf configuration file. On a normal boot, grub_main calls grub_command_execute("normal", 0, 0), which eventually calls grub_show_menu() inside grub_normal_execute() to display the list of operating systems

When a system is selected, grub_menu_execute_entry parses the chosen entry, loads the specified file, and passes the kernel boot parameters. grub_cmd_linux() is called to read some data structures from the Linux kernel image header into in-memory structures and check them. If the checks pass, the entire Linux kernel image is read into memory

With that, grub_normal_execute() begins actually booting the kernel

08 Kernel initialization

Initializing basic functionality

The kernel entry function is start_kernel() in init/main.c; it initializes all kinds of subsystems

  • INIT_TASK(init_task) creates process 0: set_task_stack_end_magic(&init_task), where init_task is struct task_struct init_task = INIT_TASK(init_task)
  • trap_init() initializes the interrupt gates used to handle interrupts
  • mm_init() initializes the memory-management module; initializing the in-memory file system rootfs goes through mnt_init()->init_rootfs(), which registers rootfs_fs_type with the VFS
  • sched_init() initializes the scheduler
  • rest_init() handles the remaining initialization, e.g. the VFS

rest_init calls kernel_thread(kernel_init, NULL, CLONE_FS) to create process 1.

Once there are employees, core resources must be distinguished from non-core resources, hence the layered privilege mechanism of Ring 0 to Ring 3: 0 is the kernel, 1 and 2 are device drivers, 3 is applications.

  • Code that may access critical resources runs in Ring 0, i.e. kernel mode
  • Ordinary programs run in Ring 3, i.e. user mode

When user-mode code wants to access core resources, it must go through system calls, the uniform interface for obtaining results.

For example, to send a packet through the network card, the current process pauses and makes a system call; in the kernel, the packet handed in by the system call is queued on the card and sent when its turn comes; when sending completes, the system call returns and the process resumes in user mode

During this pause the CPU register values must be stashed elsewhere. The whole sequence is: user mode -> system call -> save registers -> execute the system call in kernel mode -> restore registers -> return to user mode

Initializing process 1

The kernel_thread that starts process 1 still runs in kernel mode; its parameter is the function kernel_init, which in turn calls kernel_init_freeable

  if (ramdisk_execute_command) {
    ret = run_init_process(ramdisk_execute_command);
......
  }
......
  if (!try_to_run_init_process("/sbin/init") ||
      !try_to_run_init_process("/etc/init") ||
      !try_to_run_init_process("/bin/init") ||
      !try_to_run_init_process("/bin/sh"))
    return 0;

At this point process 1 becomes an executable file

run_init_process calls do_execve, i.e. it performs an execve system call

static int run_init_process(const char *init_filename)
{
  argv_init[0] = init_filename;
  return do_execve(getname_kernel(init_filename),
    (const char __user *const __user *)argv_init,
    (const char __user *const __user *)envp_init);
}

So the path from kernel mode to user mode is: execute the system call in kernel mode -> restore registers -> return to user mode

do_execve->do_execveat_common->exec_binprm->search_binary_handler is called


int search_binary_handler(struct linux_binprm *bprm)
{
  ......
  struct linux_binfmt *fmt;
  ......
  retval = fmt->load_binary(bprm);
  ......
}

Running a program means loading a binary file; on Linux its format is ELF


static struct linux_binfmt elf_format = {
.module  = THIS_MODULE,
.load_binary  = load_elf_binary,
.load_shlib  = load_elf_library,
.core_dump  = elf_core_dump,
.min_coredump  = ELF_EXEC_PAGESIZE,
};

That is, load_elf_binary is called first, and finally start_thread

void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
set_user_gs(regs, 0);
regs->fs  = 0;
regs->ds  = __USER_DS;
regs->es  = __USER_DS;
regs->ss  = __USER_DS;
regs->cs  = __USER_CS;
regs->ip  = new_ip;
regs->sp  = new_sp;
regs->flags  = X86_EFLAGS_IF;
force_iret();
}
EXPORT_SYMBOL_GPL(start_thread);

Here struct pt_regs is the register set: the structure in which the kernel saves the user-mode context during a system call. In it

  • the user-mode code segment CS is set to __USER_CS
  • the user-mode data segment DS is set to __USER_DS
  • the instruction pointer register IP
  • the stack pointer register SP

The final iret returns from the system call, restoring the saved registers

ramdisk is a memory-based root file system. Once ramdisk's /init has run, we are already in user mode. It loads drivers according to the type of the storage system, and with the drivers in place sets up the real root file system. The ramdisk /init then starts the real file system's init, after which the various subsystems initialize, system services start, the console starts, and so on

Initializing process 2

rest_init creates process 2 with kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES), again using kernel_thread

In the kernel, threads and processes alike are tasks, using the same data structure and sitting side by side on one linked list

kthreadd is responsible for scheduling and managing all kernel-mode threads; it is the ancestor of every thread that runs in kernel mode

The ppid of both process 1 and process 2 is 0

09 System calls

glibc wraps each system call, for example open

int open(const char *pathname, int flags, mode_t mode)

In the glibc source, syscalls.list records which system call each glibc function corresponds to

# File name Caller  Syscall name    Args    Strong name Weak names
open    -  open    Ci:siv  __libc_open __open open

make-syscall.sh generates a separate file for each system call from this configuration; that file defines macros such as #define SYSCALL_NAME open

glibc also has the file syscall-template.S, which records how these macros are used and defines the shape of a system call

T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
    ret
T_PSEUDO_END (SYSCALL_SYMBOL)

#define T_PSEUDO(SYMBOL, NAME, N)    PSEUDO (SYMBOL, NAME, N)

PSEUDO is also a macro; it invokes the DO_CALL macro, which every system call goes through

#define PSEUDO(name, syscall_name, args)                      \
  .text;                                      \
  ENTRY (name)                                    \
    DO_CALL (syscall_name, args);                         \
    cmpl $-4095, %eax;                               \
    jae SYSCALL_ERROR_LABEL

The 32-bit system-call path

The arguments are placed in registers, the system-call number is looked up from the call's name and placed in eax, and then ENTER_KERNEL is executed


/* Linux takes system call arguments in registers:
  syscall number  %eax       call-clobbered
  arg 1    %ebx       call-saved
  arg 2    %ecx       call-clobbered
  arg 3    %edx       call-clobbered
  arg 4    %esi       call-saved
  arg 5    %edi       call-saved
  arg 6    %ebp       call-saved
......
*/
#define DO_CALL(syscall_name, args)                           \
    PUSHARGS_##args                               \
    DOARGS_##args                                 \
    movl $SYS_ify (syscall_name), %eax;                          \
    ENTER_KERNEL                                  \
    POPARGS_##args


# define ENTER_KERNEL int $0x80

ENTER_KERNEL raises a software interrupt, trapping into the kernel; on receiving the system call, the kernel invokes entry_INT80_32

ENTRY(entry_INT80_32)
        ASM_CLAC
        pushl   %eax                    /* pt_regs->orig_ax */
        SAVE_ALL pt_regs_ax=$-ENOSYS    /* save rest */
        movl    %esp, %eax
        call    do_syscall_32_irqs_on
.Lsyscall_32_done:
......
.Lirq_return:
  INTERRUPT_RETURN

push and SAVE_ALL store the user-mode registers into the pt_regs structure

Having saved all the registers before entering the kernel proper, it then calls do_syscall_32_irqs_on

static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
{
  struct thread_info *ti = current_thread_info();
  unsigned int nr = (unsigned int)regs->orig_ax;
......
  if (likely(nr < IA32_NR_syscalls)) {
    regs->ax = ia32_sys_call_table[nr](
      (unsigned int)regs->bx, (unsigned int)regs->cx,
      (unsigned int)regs->dx, (unsigned int)regs->si,
      (unsigned int)regs->di, (unsigned int)regs->bp);
  }
  syscall_return_slowpath(regs);
}

The system-call number is taken from eax and the corresponding handler is invoked, with the register contents as its arguments. When the call completes, INTERRUPT_RETURN restores the saved user-mode context, including the code segment and the instruction pointer

The 64-bit system-call path

On 64-bit, the system-call number goes into rax. Instead of an interrupt, the syscall instruction is used, and the argument-passing registers change as well. syscall relies on MSRs (model-specific registers), special-purpose registers the CPU uses for particular functions

At system initialization, trap_init not only sets up interrupts but also calls cpu_init->syscall_init

wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

MSRs are read and written with rdmsr and wrmsr. When syscall executes, the CPU fetches the function address from the MSR and calls entry_SYSCALL_64


ENTRY(entry_SYSCALL_64)
        /* Construct struct pt_regs on stack */
        pushq   $__USER_DS                      /* pt_regs->ss */
        pushq   PER_CPU_VAR(rsp_scratch)        /* pt_regs->sp */
        pushq   %r11                            /* pt_regs->flags */
        pushq   $__USER_CS                      /* pt_regs->cs */
        pushq   %rcx                            /* pt_regs->ip */
        pushq   %rax                            /* pt_regs->orig_ax */
        pushq   %rdi                            /* pt_regs->di */
        pushq   %rsi                            /* pt_regs->si */
        pushq   %rdx                            /* pt_regs->dx */
        pushq   %rcx                            /* pt_regs->cx */
        pushq   $-ENOSYS                        /* pt_regs->ax */
        pushq   %r8                             /* pt_regs->r8 */
        pushq   %r9                             /* pt_regs->r9 */
        pushq   %r10                            /* pt_regs->r10 */
        pushq   %r11                            /* pt_regs->r11 */
        sub     $(6*8), %rsp                    /* pt_regs->bp, bx, r12-15 not saved */
        movq    PER_CPU_VAR(current_task), %r11
        testl   $_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
        jnz     entry_SYSCALL64_slow_path
......
entry_SYSCALL64_slow_path:
        /* IRQs are off. */
        SAVE_EXTRA_REGS
        movq    %rsp, %rdi
        call    do_syscall_64           /* returns with IRQs disabled */
return_from_SYSCALL_64:
  RESTORE_EXTRA_REGS
  TRACE_IRQS_IRETQ
  movq  RCX(%rsp), %rcx
  movq  RIP(%rsp), %r11
    movq  R11(%rsp), %r11
......
syscall_return_via_sysret:
  /* rcx and r11 are already restored (see code above) */
  RESTORE_C_REGS_EXCEPT_RCX_R11
  movq  RSP(%rsp), %rsp
  USERGS_SYSRET64

Many registers are saved into pt_regs: the user-mode code segment, data segment, arguments, and so on. Then entry_SYSCALL64_slow_path->do_syscall_64 is called


__visible void do_syscall_64(struct pt_regs *regs)
{
        struct thread_info *ti = current_thread_info();
        unsigned long nr = regs->orig_ax;
......
        if (likely((nr & __SYSCALL_MASK) < NR_syscalls)) {
                regs->ax = sys_call_table[nr & __SYSCALL_MASK](
                        regs->di, regs->si, regs->dx,
                        regs->r10, regs->r8, regs->r9);
        }
        syscall_return_slowpath(regs);
}

In do_syscall_64, the system-call number is taken from rax and used to look up the handler in sys_call_table; the arguments come from the registers

sysretq returns to user mode

Process management

10 Processes

Compiling a program into binary form

In the program above, the #include lines bring in header files; the rest is the source file you write

Compile the code above with

gcc -c -fPIC process.c
gcc -c -fPIC createprocess.c

The resulting .o files are the first kind of ELF file, relocatable files; in the kernel the corresponding data types are elf32_hdr and elf64_hdr

Their sections are

  • .text: the compiled binary executable code
  • .data: initialized global variables
  • .rodata: read-only data, e.g. string constants and const variables
  • .bss: uninitialized global variables, zeroed at run time
  • .symtab: the symbol table, recording functions and variables
  • .strtab: the string table: string constants and variable names

At this stage these are only code fragments, not a program. For example createprocess.o calls the create_process function but does not contain it, so .rel.text marks what needs relocating; hence the name relocatable file

Static libraries

To make create_process reusable as a library, you build a library file. The simplest form is archiving .o files into a static library, a .a file, which can hold several .o files

ar cr libstaticprocess.a process.o

Link it into a program

gcc -o staticcreateprocess createprocess.o -L. -lstaticprocess

Here -L means look for the .a file in the current directory, and -lstaticprocess is auto-expanded into the file name: lib is prefixed and the suffix appended. The linker pulls process.o out of the archive and links it with createprocess.o into the executable binary staticcreateprocess; in the process, the function's address becomes known

The second kind of ELF file: the executable binary

Here the sections are grouped into code and data segments that must be loaded into memory, plus other parts that need not be loaded; small sections are merged into segments recorded in the program header

The kernel data types are elf32_phdr and elf64_phdr; the difference is the added p_vaddr, the virtual address at which a segment is loaded

The ELF header also has e_entry, likewise a virtual address, marking the program's entry point

Dynamic linking

The problem with static linking is that identical code ends up in memory once per user, and any change requires recompiling; hence dynamic libraries

gcc -shared -fPIC -o libdynamicprocess.so process.o

When a dynamic library is linked into a program file, the program does not include the library's code, only a reference to it: the library's name is stored

gcc -o dynamiccreateprocess createprocess.o -L. -ldynamicprocess

At run time the dynamic library must be found and loaded. By default the loader searches /lib and /usr/lib; you can also set the LD_LIBRARY_PATH environment variable

export LD_LIBRARY_PATH=.
./dynamiccreateprocess

A program built against dynamic libraries is the third kind of ELF file, a shared object file. Compared with a plain binary it adds

  • .interp, the interpreter ld-linux.so, which performs dynamic linking
  • .plt, the procedure linkage table
  • .got.plt, the global offset table

When dynamiccreateprocess runs, it does not know where the create_process function is

  1. At compile time a proxy stub PLT[x] is created in the PLT
  2. At run time PLT[x] runs and calls GOT[y]
  3. GOT[y] calls PLT[0], and PLT[0] calls GOT[2], i.e. the entry of ld-linux.so loaded in memory
  4. The function is loaded into memory and its address written into GOT[y]

Running a program as a process

The kernel has a data structure that defines the methods for loading a binary

struct linux_binfmt {
        struct list_head lh;
        struct module *module;
        int (*load_binary)(struct linux_binprm *);
        int (*load_shlib)(struct file *);
        int (*core_dump)(struct coredump_params *cprm);
        unsigned long min_coredump;     /* minimal dump size */
} __randomize_layout;

For ELF there is a corresponding implementation

static struct linux_binfmt elf_format = {
        .module         = THIS_MODULE,
        .load_binary    = load_elf_binary,
        .load_shlib     = load_elf_library,
        .core_dump      = elf_core_dump,
        .min_coredump   = ELF_EXEC_PAGESIZE,
};

load_elf_binary also appeared when loading the kernel image; this time it is called by do_execve. exec denotes a family of functions

  • the functions with p (execvp, execlp) search for the program along PATH
  • the functions without p require the program's full path
  • the functions with v (execv, execvp, execve) take arguments as an array
  • the functions with l (execl, execlp, execle) take arguments as a list
  • the functions with e (execve, execle) take environment variables as an array

11 Threads

Problems with running many processes in parallel

  • creating a process consumes too many resources
  • passing data between processes means crossing memory spaces

Creating threads

Threads are created directly through the thread API

A thread's data falls into three classes

  • local data on the thread's stack, such as the local variables of the running code; the default stack size is 8192KB, adjustable with int pthread_attr_setstacksize(pthread_attr_t *attr, size_t stacksize); a small guard region separates neighboring stacks
  • data shared within the process, such as global variables
  • thread-private data: every thread can create it, but the value stored under the same key may differ from thread to thread
/* create a private-data key */
int pthread_key_create(pthread_key_t *key, void (*destructor)(void*))
/* set the value */
int pthread_setspecific(pthread_key_t key, const void *value)
/* get the value */
void *pthread_getspecific(pthread_key_t key)

Protecting data

The mutex

pthread_mutex_init initializes a lock; pthread_mutex_lock() competes for it, blocking if the lock is not won; after using the data, pthread_mutex_unlock releases the lock, and pthread_mutex_destroy destroys it when no longer needed

Since blocking hurts efficiency, pthread_mutex_trylock is often used instead: it attempts to take the lock but does not block on failure

Condition variable plus mutex

The condition variable and mutex are declared at process scope

Dispatching work

  1. acquire the lock with pthread_mutex_trylock
  2. call pthread_cond_broadcast to notify the threads

Worker threads

  1. pthread_cond_wait returns on notification and the thread races for the lock
  2. take the data
  3. release the lock

A thread that finds no work enters pthread_cond_wait, which releases the lock and waits

This is exactly Java's wait() and notify()
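The dispatch/worker handshake above can be sketched like this (the value 42 and the function names are placeholders for the example):

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;
static int result = 0;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)                 /* pthread_cond_wait releases the lock while asleep */
        pthread_cond_wait(&cond, &lock);
    result = 42;                   /* "do the assigned work" */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* hand one work item to a worker thread and wait for completion */
int dispatch_and_join(void)
{
    pthread_t t;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;
    pthread_mutex_lock(&lock);
    ready = 1;                     /* publish the work item under the lock */
    pthread_cond_broadcast(&cond); /* wake any waiting workers */
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return result;
}
```

Note that the worker re-checks ready in a while loop: pthread_cond_wait may wake spuriously, so the condition must be re-tested under the lock.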

12 Process data structures

Whether process or thread, in the kernel it is a task_struct, strung onto a linked list

struct list_head    tasks;

Task IDs

The fields involved in task IDs are

pid_t pid;
pid_t tgid;
struct task_struct *group_leader; 

For a process, all three refer to itself; for a thread

  • pid is its own
  • tgid is the pid of the main process
  • group_leader points to the process's main thread
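This pid/tgid relationship is visible from user space: the raw gettid system call returns the kernel's per-task pid, while getpid returns the tgid. A small check (sketch):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

static pid_t worker_tid;

static void *record_tid(void *arg)
{
    (void)arg;
    worker_tid = (pid_t)syscall(SYS_gettid);  /* this task's pid in kernel terms */
    return NULL;
}

/* returns 1 if the main thread's tid equals the tgid, and the worker
   shares the tgid but has a tid of its own, matching the fields above */
int check_pid_tgid(void)
{
    pid_t main_tid = (pid_t)syscall(SYS_gettid);
    pthread_t t;
    if (pthread_create(&t, NULL, record_tid, NULL) != 0)
        return 0;
    pthread_join(t, NULL);
    return main_tid == getpid() && worker_tid != main_tid;
}
```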

Task signal handling

The signal-handling fields in task_struct

/* Signal handlers: */
struct signal_struct    *signal;
struct sighand_struct    *sighand;
sigset_t      blocked;
sigset_t      real_blocked;
sigset_t      saved_sigmask;
struct sigpending    pending;
unsigned long      sas_ss_sp;
size_t        sas_ss_size;
unsigned int      sas_ss_flags;
  • blocked: signals blocked for the moment
  • pending: signals waiting to be handled
  • sighand: the signal-handler functions used to handle them

By default signal handlers run on the user-mode function stack, but a separate stack can be set up just for signals; that is what the three sas_ss fields are for

Besides the per-task struct sigpending pending, struct signal_struct *signal also contains struct sigpending shared_pending, shared by the whole thread group

Task state

The variables involved in task state

 volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
 int exit_state;
 unsigned int flags;

The states are defined in the header include/linux/sched.h

/* Used in tsk->state: */
#define TASK_RUNNING                    0
#define TASK_INTERRUPTIBLE              1
#define TASK_UNINTERRUPTIBLE            2
#define __TASK_STOPPED                  4
#define __TASK_TRACED                   8
/* Used in tsk->exit_state: */
#define EXIT_DEAD                       16
#define EXIT_ZOMBIE                     32
#define EXIT_TRACE                      (EXIT_ZOMBIE | EXIT_DEAD)
/* Used in tsk->state again: */
#define TASK_DEAD                       64
#define TASK_WAKEKILL                   128
#define TASK_WAKING                     256
#define TASK_PARKED                     512
#define TASK_NOLOAD                     1024
#define TASK_NEW                        2048
#define TASK_STATE_MAX                  4096

TASK_RUNNING does not mean the process is running, but that it is ready to run at any moment

  • with a time slice, it is actually running
  • without one, it has been preempted and waits to be given a slice again

When a running process needs to do I/O, or waits for I/O to finish, it gives up the CPU and sleeps. There are two sleep states

  • TASK_INTERRUPTIBLE is interruptible sleep: while waiting for the I/O, any signal wakes the process so the signal can be handled
  • TASK_UNINTERRUPTIBLE is uninterruptible sleep: signals cannot wake it; it can only wait for the I/O to complete, and if the I/O never completes for some unusual reason, nothing can ever wake the process

Hence TASK_KILLABLE: the same mechanism as TASK_UNINTERRUPTIBLE, but able to respond to fatal signals

  • TASK_STOPPED is the state entered on receiving SIGSTOP, SIGTTIN, SIGTSTP or SIGTTOU
  • TASK_TRACED means the process is being watched by a debugger or similar, which has stopped it
  • EXIT_ZOMBIE is entered when a process is about to exit but its parent has not yet called wait() to collect its termination status, making it a zombie
  • EXIT_DEAD is the process's final state

Besides the states tied to running and scheduling there are also flags, in the flags field, for example

#define PF_EXITING    0x00000004
#define PF_VCPU      0x00000010
#define PF_FORKNOEXEC    0x00000040
  • PF_EXITING means exiting; find_alive_thread skips threads with this flag when looking for a live thread
  • PF_VCPU means running on a virtual CPU; when account_system_time accounts system time, this flag makes it call account_guest_time to account against guest time instead
  • PF_FORKNOEXEC means forked but not yet exec'ed; it is set in copy_process called from _do_fork, and cleared in load_elf_binary during exec

Process scheduling

The fields involved in scheduling


//whether on a run queue
int        on_rq;
//priorities
int        prio;
int        static_prio;
int        normal_prio;
unsigned int      rt_priority;
//scheduler class
const struct sched_class  *sched_class;
//scheduling entities
struct sched_entity    se;
struct sched_rt_entity    rt;
struct sched_dl_entity    dl;
//scheduling policy
unsigned int      policy;
//which CPUs may be used
int        nr_cpus_allowed;
cpumask_t      cpus_allowed;
struct sched_info    sched_info;

13 Process data structures (continued)

Run-time statistics

u64        utime;//CPU time consumed in user mode
u64        stime;//CPU time consumed in kernel mode
unsigned long      nvcsw;//voluntary context-switch count
unsigned long      nivcsw;//involuntary context-switch count
u64        start_time;//process start time, excluding time spent asleep
u64        real_start_time;//process start time, including time spent asleep

Process kinship

struct task_struct __rcu *real_parent; /* real parent process */
struct task_struct __rcu *parent; /* recipient of SIGCHLD, wait4() reports */
struct list_head children;      /* list of my children */
struct list_head sibling;       /* linkage in my parent's children list */
  • parent points to the parent process, which must be signaled on termination
  • children is the head of the list of the process's children
  • sibling links the process into its parent's children list

Normally real_parent and parent are the same; but when debugging a program with GDB from bash, bash is the real_parent and GDB is the parent

Process credentials

Credentials govern a process's access to files and other objects

/* Objective and real subjective task credentials (COW): */
const struct cred __rcu         *real_cred;
/* Effective (overridable) subjective task credentials (COW): */
const struct cred __rcu         *cred;
  • Objective: who may operate on me, corresponding to real_cred
  • Subjective: whom I may operate on, corresponding to cred

The definition of cred

struct cred {
......
        kuid_t          uid;            /* real UID of the task */
        kgid_t          gid;            /* real GID of the task */
        kuid_t          suid;           /* saved UID of the task */
        kgid_t          sgid;           /* saved GID of the task */
        kuid_t          euid;           /* effective UID of the task */
        kgid_t          egid;           /* effective GID of the task */
        kuid_t          fsuid;          /* UID for VFS ops */
        kgid_t          fsgid;          /* GID for VFS ops */
......
        kernel_cap_t    cap_inheritable; /* caps our children can inherit */
        kernel_cap_t    cap_permitted;  /* caps we're permitted */
        kernel_cap_t    cap_effective;  /* caps we can actually use */
        kernel_cap_t    cap_bset;       /* capability bounding set */
        kernel_cap_t    cap_ambient;    /* Ambient capability set */
......
} __randomize_layout;

euid and egid are checked when operating on message queues, shared memory and semaphores; fsuid and fsgid when operating on files

Normally a process's uid, euid and fsuid are the same, and gid, egid and fsgid are the same: access is checked against the user who started the process. But there are exceptions

For example, user A wants to play a game installed by user B with permissions rwxr--r--. B grants A access by setting rwxr-xr-x, so uid, euid and fsuid are all A. But the save files live in another directory with permissions rw-------. Here chmod u+s sets the set-user-ID bit, making the mode rwsr-xr-x; now when A starts the game, uid is still A but, because of set-user-ID, euid and fsuid become B. A process can also change its user ID with setuid, which is why B's ID is kept in suid and sgid

When a user needs elevated rights for just a single function, capabilities can be used instead; they are defined in capability.h

#define CAP_CHOWN            0
#define CAP_KILL             5
#define CAP_NET_BIND_SERVICE 10
#define CAP_NET_RAW          13
#define CAP_SYS_MODULE       16
#define CAP_SYS_RAWIO        17
#define CAP_SYS_BOOT         22
#define CAP_SYS_TIME         25
#define CAP_AUDIT_READ          37
#define CAP_LAST_CAP         CAP_AUDIT_READ
  • cap_permitted is the set of capabilities the process may use; cap_effective is the set actually in use, which may drop some of cap_permitted for safety
  • cap_inheritable: when an executable's extended attributes set the inheritable bit, exec'ing it inherits the caller's inheritable set and adds it to permitted. However, for non-root users exec does not preserve inheritable; the restriction applies only to non-root users
  • cap_bset is the set of capabilities every process in the system may retain; removing some at boot locks parts of the kernel functionality away
  • cap_ambient addresses the cap_inheritable problem: after a non-root exec, cap_ambient is added to both cap_permitted and cap_effective

Memory management

Each process has its own virtual memory space

struct mm_struct                *mm;
struct mm_struct                *active_mm;

Files and file systems

Each process has a file-system structure and a structure describing its open files

/* Filesystem information: */
struct fs_struct                *fs;
/* Open file information: */
struct files_struct             *files;

14 Process data structures (continued)

When a process makes a system call it must enter kernel mode to execute; the fields mainly involved are

struct thread_info    thread_info;
void  *stack;

The user-mode function stack

Function calls are implemented with the stack; in assembly they are jump instructions, involving arguments and a return address

In a process's memory space the stack grows downward from high addresses to low: the bottom of the stack is above, the top below, and pushes and pops happen at the top, below

On a 32-bit system, ESP is the stack-pointer register in the CPU; the push and pop instructions adjust ESP automatically. A second register, EBP, is the base-pointer register, marking the bottom of the current stack frame

For example

A calls B. A's stack contains A's local variables, then the arguments to pass to B, and finally the return address into A (the instruction after the call, resumed at when B returns). Pushing that address completes A's frame (ESP and EBP delimit the function's stack range; to open a new frame, the new EBP is set to ESP).

Then comes B's frame. It first saves the bottom of A's frame, i.e. EBP (frame layout is fixed, so A's first argument is at EBP+8 and the second at EBP+12), used to reach the arguments A passed in; after that come B's local variables (buffer-overflow attacks write more than the allocated space into the stack, overwriting the function's return address).

When B returns, the return value is left in the EAX register, the return address is popped and jumped to, the arguments are popped as well, and A resumes

On a 64-bit system there are more registers: rax holds the return value, rsp is the stack pointer, rbp the base pointer, and rdi, rsi, rdx, rcx, r8, r9 carry the first six arguments of a call; arguments beyond six must go on the stack (so too many arguments hurts call performance)

But when a register-passed argument needs to be addressed, it has no memory address, so it must be spilled to the stack; that is done by the called function

The kernel-mode stack

Linux allocates a kernel stack for every task

On 32-bit it is defined in arch/x86/include/asm/page_32_types.h: PAGE_SIZE is 4K, and shifting left by one doubles it, giving 8K

#define THREAD_SIZE_ORDER  1
#define THREAD_SIZE    (PAGE_SIZE << THREAD_SIZE_ORDER)

On 64-bit it is in arch/x86/include/asm/page_64_types.h: PAGE_SIZE shifted left by two, i.e. 16K, and the start address must be a multiple of 8192

#ifdef CONFIG_KASAN
#define KASAN_STACK_ORDER 1
#else
#define KASAN_STACK_ORDER 0
#endif


#define THREAD_SIZE_ORDER  (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)

Layout of the kernel stack

At the lowest address sits thread_info, a supplement to task_struct: task_struct is large and generic, but different architectures need to keep different things, which go into thread_info

In include/linux/sched.h, a union overlays thread_info and stack

union thread_union {
#ifndef CONFIG_THREAD_INFO_IN_TASK
  struct thread_info thread_info;
#endif
  unsigned long stack[THREAD_SIZE/sizeof(long)];
};

At the highest address sits pt_regs, whose contents are


#ifdef __i386__
struct pt_regs {
  unsigned long bx;
  unsigned long cx;
  unsigned long dx;
  unsigned long si;
  unsigned long di;
  unsigned long bp;
  unsigned long ax;
  unsigned long ds;
  unsigned long es;
  unsigned long fs;
  unsigned long gs;
  unsigned long orig_ax;
  unsigned long ip;
  unsigned long cs;
  unsigned long flags;
  unsigned long sp;
  unsigned long ss;
};
#else 
struct pt_regs {
  unsigned long r15;
  unsigned long r14;
  unsigned long r13;
  unsigned long r12;
  unsigned long bp;
  unsigned long bx;
  unsigned long r11;
  unsigned long r10;
  unsigned long r9;
  unsigned long r8;
  unsigned long ax;
  unsigned long cx;
  unsigned long dx;
  unsigned long si;
  unsigned long di;
  unsigned long orig_ax;
  unsigned long ip;
  unsigned long cs;
  unsigned long flags;
  unsigned long sp;
  unsigned long ss;
/* top of stack page */
};
#endif 

On a system call this structure is filled in, saving the user-mode CPU context so the process can be reloaded and resumed when the system call returns

In the kernel, the CPU's ESP or RSP points to the top of the kernel stack

Finding the kernel stack from task_struct

Given a pointer to task_struct, the kernel stack can be found

static inline void *task_stack_page(const struct task_struct *task)
{
  return task->stack;
}

Getting pt_regs

/*
 * TOP_OF_KERNEL_STACK_PADDING reserves 8 bytes on top of the ring0 stack.
 * This is necessary to guarantee that the entire "struct pt_regs"
 * is accessible even if the CPU haven't stored the SS/ESP registers
 * on the stack (interrupt gate does not save these registers
 * when switching to the same priv ring).
 * Therefore beware: accessing the ss/esp fields of the
 * "struct pt_regs" is possible, but they may contain the
 * completely wrong values.
 */
#define task_pt_regs(task) \
({                  \
  unsigned long __ptr = (unsigned long)task_stack_page(task);  \
  __ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;    \
  ((struct pt_regs *)__ptr) - 1;          \
})

That is: find the start of the kernel stack, add THREAD_SIZE to reach the end, cast to struct pt_regs, and subtract 1. On 32-bit, TOP_OF_KERNEL_STACK_PADDING is subtracted too, because when no privilege change occurs the CPU does not push the 8 bytes for SS and ESP (they are saved only on a user-to-kernel privilege change)
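The pointer arithmetic of the macro can be mimicked in user space (the 16K THREAD_SIZE and the 21 unsigned-long fields are the 64-bit values from the structures above, where the padding is 0; the _sketch names are made up for the example):

```c
#include <stdint.h>

#define THREAD_SIZE_SKETCH (4096UL << 2)   /* 16K: 64-bit, no KASAN */

/* stand-in for the 64-bit struct pt_regs: 21 unsigned-long fields */
struct pt_regs_sketch { unsigned long r[21]; };

/* same arithmetic as task_pt_regs: go to the top of the stack area,
   then step back by exactly one pt_regs */
struct pt_regs_sketch *task_pt_regs_sketch(void *stack)
{
    unsigned long p = (unsigned long)stack + THREAD_SIZE_SKETCH;
    return (struct pt_regs_sketch *)p - 1;
}
```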

Finding task_struct from the kernel stack

On 32-bit systems

A process on a CPU that wants its own task_struct goes through thread_info

struct thread_info {
  struct task_struct  *task;    /* main task structure */
  __u32      flags;    /* low level flags */
  __u32      status;    /* thread synchronous flags */
  __u32      cpu;    /* current CPU */
  mm_segment_t    addr_limit;
  unsigned int    sig_on_uaccess_error:1;
  unsigned int    uaccess_err:1;  /* uaccess failed */
};

thread_info has a field pointing at the task_struct, so current_thread_info()->task yields the task_struct

static inline struct thread_info *current_thread_info(void)
{
  return (struct thread_info *)(current_top_of_stack() - THREAD_SIZE);
}

Subtracting THREAD_SIZE from the top of the stack gives the start address, where thread_info lives

On 64-bit systems thread_info holds only a flags field

struct thread_info {
        unsigned long           flags;          /* low level flags */
};

include/linux/thread_info.h defines current_thread_info

#include <asm/current.h>
#define current_thread_info() ((struct thread_info *)current)
#endif

current is defined in arch/x86/include/asm/current.h

struct task_struct;


DECLARE_PER_CPU(struct task_struct *, current_task);


static __always_inline struct task_struct *get_current(void)
{
  return this_cpu_read_stable(current_task);
}


#define current get_current 

On 64-bit, the task_struct running on each CPU is reached not through thread_info but through a per-CPU variable.

On a multi-core system the CPUs run simultaneously and must synchronize when using shared resources; per-CPU variables solve this by giving each CPU its own copy of a variable to operate on. The current task's current_task is such a per-CPU variable

The variable is declared in arch/x86/include/asm/current.h

DECLARE_PER_CPU(struct task_struct *, current_task);

and defined in arch/x86/kernel/cpu/common.c

DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;

At initialization every CPU's current_task points to init_task

When a CPU switches processes, current_task is changed to the process being switched to; __switch_to does this


__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
......
this_cpu_write(current_task, next_p);
......
return prev_p;
}

To get the currently running task_struct, call this_cpu_read_stable

#define this_cpu_read_stable(var)       percpu_stable_op("mov", var)

15 Scheduling

Scheduling policies and scheduler classes

Processes fall into two classes

  • real-time processes, which must run and return as soon as possible
  • normal processes, which just need to complete normally

Different processes get different scheduling policies, held in task_struct's

unsigned int policy;
#define SCHED_NORMAL    0
#define SCHED_FIFO    1
#define SCHED_RR    2
#define SCHED_BATCH    3
#define SCHED_IDLE    5
#define SCHED_DEADLINE    6

There are priorities

int prio, static_prio, normal_prio;
unsigned int rt_priority;

For real-time processes the priority range is 0~99; for normal processes it is 100~139

The real-time policies are

  • SCHED_FIFO: first in, first out among equal priorities; higher priority preempts lower
  • SCHED_RR: when its time slice runs out, a task goes to the back of its priority's queue; higher priority preempts lower
  • SCHED_DEADLINE: the task nearest its deadline is completed first

The normal policies are

  • SCHED_NORMAL: ordinary processes
  • SCHED_BATCH: background processes
  • SCHED_IDLE: run only when idle

The work is done by the scheduler class in task_struct

const struct sched_class *sched_class;

There are several implementations

  • stop_sched_class: the highest-priority tasks use this policy; they interrupt other threads and are never interrupted themselves
  • dl_sched_class: implements deadline scheduling
  • rt_sched_class: implements the RR or FIFO algorithm, chosen by task_struct->policy
  • fair_sched_class: the policy for normal processes
  • idle_sched_class: the policy for idle processes

The completely fair scheduling algorithm

For normal processes Linux implements a scheduler based on CFS

First the process's runtime must be tracked. The CPU provides a clock that fires an interrupt every so often, called a tick. CFS gives every process a virtual runtime, vruntime: a running process's vruntime grows with every tick, while a non-running process's stays put

The flow that updates the runtime statistics

/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
  struct sched_entity *curr = cfs_rq->curr;
  u64 now = rq_clock_task(rq_of(cfs_rq));
  u64 delta_exec;
......
  delta_exec = now - curr->exec_start;
......
  curr->exec_start = now;
......
  curr->sum_exec_runtime += delta_exec;
......
  curr->vruntime += calc_delta_fair(delta_exec, curr);
  update_min_vruntime(cfs_rq);
......
}


/*
 * delta /= w
 */
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
  if (unlikely(se->load.weight != NICE_0_LOAD))
        /* delta_exec * weight / lw.weight */
    delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
  return delta;
}

current time - exec_start = actual runtime delta_exec

This is then converted to virtual runtime by the formula

vruntime += delta_exec * NICE_0_LOAD / weight

The next task to schedule is the one with the smallest vruntime
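The formula can be checked numerically; this sketch mirrors the effect of calc_delta_fair without the kernel's fixed-point inverse-weight machinery (NICE_0_LOAD = 1024 matches the kernel's weight for a nice-0 task; the function name is made up):

```c
#include <stdint.h>

#define NICE_0_LOAD 1024  /* kernel weight of a nice-0 task */

/* vruntime advance for delta_exec ns of real runtime at a given load weight;
   heavier (higher-weight) tasks accumulate virtual time more slowly */
uint64_t vruntime_delta(uint64_t delta_exec, uint64_t weight)
{
    if (weight == NICE_0_LOAD)
        return delta_exec;  /* nice 0: virtual time advances at real time */
    return delta_exec * NICE_0_LOAD / weight;
}
```

So a task with twice the nice-0 weight ages at half speed in virtual time, which is how CFS biases CPU share toward higher-weight tasks while still picking the minimum vruntime.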

Run queues and scheduling entities

CFS needs a data structure that keeps processes sorted by vruntime in real time; but vruntime changes constantly, so entries keep having to be removed, re-inserted and re-sorted

The structure that balances lookup and update cost is a tree; here a red-black tree is used. In struct task_struct

struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
  • the real-time scheduling entity sched_rt_entity
  • the deadline scheduling entity sched_dl_entity
  • the completely fair scheduling entity sched_entity

Which entity type runs first is also decided between the different entities

A process hangs off the scheduling entity matching its scheduling type and waits to be scheduled. As an example, the definition of the sched_entity entity

struct sched_entity {
  struct load_weight    load;
  struct rb_node      run_node;
  struct list_head    group_node;
  unsigned int      on_rq;
  u64        exec_start;
  u64        sum_exec_runtime;
  u64        vruntime;
  u64        prev_sum_exec_runtime;
  u64        nr_migrations;
  struct sched_statistics    statistics;
......
};

An example of the red-black tree the entities hang on

CFS picks the leftmost leaf node as the next task to get the CPU

Each CPU has its own structure describing all processes running on it, including a real-time run queue rt_rq and a CFS run queue cfs_rq

struct rq {
  /* runqueue lock: */
  raw_spinlock_t lock;
  unsigned int nr_running;
  unsigned long cpu_load[CPU_LOAD_IDX_MAX];
......
  struct load_weight load;
  unsigned long nr_load_updates;
  u64 nr_switches;


  struct cfs_rq cfs;
  struct rt_rq rt;
  struct dl_rq dl;
......
  struct task_struct *curr, *idle, *stop;
......
};

When scheduling, the scheduler first checks the real-time queue for a runnable real-time process, and only if there is none does it look in the CFS run queue

The definition of the CFS queue

/* CFS-related fields in a runqueue */
struct cfs_rq {
  struct load_weight load;
  unsigned int nr_running, h_nr_running;


  u64 exec_clock;
  u64 min_vruntime;
#ifndef CONFIG_64BIT
  u64 min_vruntime_copy;
#endif
  struct rb_root tasks_timeline;
  struct rb_node *rb_leftmost;


  struct sched_entity *curr, *next, *last, *skip;
......
};
  • the rb_root (tasks_timeline) points to the root of the red-black tree
  • rb_leftmost points to the leftmost node

How scheduler classes work

The definition of a scheduler class

struct sched_class {
  const struct sched_class *next;


  void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
  void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
  void (*yield_task) (struct rq *rq);
  bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);


  void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);


  struct task_struct * (*pick_next_task) (struct rq *rq,
            struct task_struct *prev,
            struct rq_flags *rf);
  void (*put_prev_task) (struct rq *rq, struct task_struct *p);


  void (*set_curr_task) (struct rq *rq);
  void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
  void (*task_fork) (struct task_struct *p);
  void (*task_dead) (struct task_struct *p);


  void (*switched_from) (struct rq *this_rq, struct task_struct *task);
  void (*switched_to) (struct rq *this_rq, struct task_struct *task);
  void (*prio_changed) (struct rq *this_rq, struct task_struct *task, int oldprio);
  unsigned int (*get_rr_interval) (struct rq *rq,
           struct task_struct *task);
  void (*update_curr) (struct rq *rq)

The first member points to the next scheduler class; the classes are

extern const struct sched_class stop_sched_class;
extern const struct sched_class dl_sched_class;
extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;

These scheduler classes actually sit on a single linked list

As an example, picking the next task runs a for_each_class loop that tries each scheduler class's method in turn

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
  const struct sched_class *class;
  struct task_struct *p;
......
  for_each_class(class) {
    p = class->pick_next_task(rq, prev, rf);
    if (p) {
      if (unlikely(p == RETRY_TASK))
        goto again;
      return p;
    }
  }
}

Scheduling proceeds from the highest-priority scheduler class down to the lowest, and each class has its own implementation, e.g. fair_sched_class

const struct sched_class fair_sched_class = {
  .next      = &idle_sched_class,
  .enqueue_task    = enqueue_task_fair,
  .dequeue_task    = dequeue_task_fair,
  .yield_task    = yield_task_fair,
  .yield_to_task    = yield_to_task_fair,
  .check_preempt_curr  = check_preempt_wakeup,
  .pick_next_task    = pick_next_task_fair,
  .put_prev_task    = put_prev_task_fair,
  .set_curr_task          = set_curr_task_fair,
  .task_tick    = task_tick_fair,
  .task_fork    = task_fork_fair,
  .prio_changed    = prio_changed_fair,
  .switched_from    = switched_from_fair,
  .switched_to    = switched_to_fair,
  .get_rr_interval  = get_rr_interval_fair,
  .update_curr    = update_curr_fair,
};

对于pick_next_task选取下一个运行的任务,每个调度类会有不同的实现,例如

  • fair_sched_class的实现为pick_next_task_fair
  • rt_sched_class的实现为pick_next_task_rt

每个函数操作不同的队列

sched_class定义了很多与调度有关的函数

  • enqueue_task在队列中添加进程,进程进入可运行状态时,调用的这个函数
  • dequeue_task在队列中删除进程
  • pick_next_task选择接下来需要运行的进程
  • put_prev_task将前一个正在运行的进程放回队列
  • set_curr_task修改调度策略
  • task_tick每次周期时钟达到,函数就会被调用

fair_sched_class的pick_next_task_fair获取下一个进程,调用链路为pick_next_task_fair->pick_next_entity->__pick_first_entity

struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
  struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);


  if (!left)
    return NULL;


  return rb_entry(left, struct sched_entity, run_node);
}

16 主动调度

主动调度场景

写入块设备的场景

写需要等待一段时间,这段时间用不到CPU

static void btrfs_wait_for_no_snapshoting_writes(struct btrfs_root *root)
{
......
  do {
    prepare_to_wait(&root->subv_writers->wait, &wait,
        TASK_UNINTERRUPTIBLE);
    writers = percpu_counter_sum(&root->subv_writers->counter);
    if (writers)
      schedule();
    finish_wait(&root->subv_writers->wait, &wait);
  } while (writers);
}

等待网络设备读取

当数据没有到达网卡,就会等待

static ssize_t tap_do_read(struct tap_queue *q,
         struct iov_iter *to,
         int noblock, struct sk_buff *skb)
{
......
  while (1) {
    if (!noblock)
      prepare_to_wait(sk_sleep(&q->sk), &wait,
          TASK_INTERRUPTIBLE);
......
    /* Nothing to read, let's sleep */
    schedule();
  }
......
}

让出CPU都是通过schedule()函数完成

asmlinkage __visible void __sched schedule(void)
{
  struct task_struct *tsk = current;


  sched_submit_work(tsk);
  do {
    preempt_disable();
    __schedule(false);
    sched_preempt_enable_no_resched();
  } while (need_resched());
}

主要逻辑在__schedule中实现

static void __sched notrace __schedule(bool preempt)
{
  struct task_struct *prev, *next;
  unsigned long *switch_count;
  struct rq_flags rf;
  struct rq *rq;
  int cpu;


  cpu = smp_processor_id();
  rq = cpu_rq(cpu);
  prev = rq->curr;
......

首先在当前CPU上取出任务队列rq,task_struct *prev指向这个CPU任务队列上正在运行的进程curr,因为一旦切换完成,它就成了前一个进程

next = pick_next_task(rq, prev, &rf);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();

第二步,获取下一个任务,task_struct *next指向下一个任务

static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
  const struct sched_class *class;
  struct task_struct *p;
  /*
   * Optimization: we know that if all tasks are in the fair class we can call that function directly, but only if the @prev task wasn't of a higher scheduling class, because otherwise those loose the opportunity to pull in more work from other CPUs.
   */
  if (likely((prev->sched_class == &idle_sched_class ||
        prev->sched_class == &fair_sched_class) &&
       rq->nr_running == rq->cfs.h_nr_running)) {
    p = fair_sched_class.pick_next_task(rq, prev, rf);
    if (unlikely(p == RETRY_TASK))
      goto again;
    /* Assumes fair_sched_class->next == idle_sched_class */
    if (unlikely(!p))
      p = idle_sched_class.pick_next_task(rq, prev, rf);
    return p;
  }
again:
  for_each_class(class) {
    p = class->pick_next_task(rq, prev, rf);
    if (p) {
      if (unlikely(p == RETRY_TASK))
        goto again;
      return p;
    }
  }
}

cfs调度就会调用fair_sched_class中的pick_next_task_fair

static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
  struct cfs_rq *cfs_rq = &rq->cfs;
  struct sched_entity *se;
  struct task_struct *p;
  int new_tasks;

取出对应的队列

    struct sched_entity *curr = cfs_rq->curr;
    if (curr) {
      if (curr->on_rq)
        update_curr(cfs_rq);
      else
        curr = NULL;
......
    }
    se = pick_next_entity(cfs_rq, curr);

pick_next_entity取出红黑树最左边节点

  p = task_of(se);


  if (prev != p) {
    struct sched_entity *pse = &prev->se;
......
    put_prev_entity(cfs_rq, pse);
    set_next_entity(cfs_rq, se);
  }


  return p;

取出当前运行的任务curr,调用update_curr更新vruntime

前任和继任不是一个,就需要更新红黑树了

set_next_entity将继任者设为当前任务

第三步,上下文切换,继任进程运行

if (likely(prev != next)) {
    rq->nr_switches++;
    rq->curr = next;
    ++*switch_count;
......
    rq = context_switch(rq, prev, next, &rf);

上下文切换

上下文切换主要做两件事

  • 切换进程空间
  • 切换寄存器和CPU上下文

context_switch实现

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
         struct task_struct *next, struct rq_flags *rf)
{
  struct mm_struct *mm, *oldmm;
......
  mm = next->mm;
  oldmm = prev->active_mm;
......
  switch_mm_irqs_off(oldmm, mm, next);
......
  /* Here we just switch the register state and the stack. */
  switch_to(prev, next, prev);
  barrier();
  return finish_task_switch(prev);
}

switch_to用于寄存器和栈的切换,调用了__switch_to_asm,是一段汇编代码

对于32位系统,切换的是栈顶指针esp

/*
 * %eax: prev task
 * %edx: next task
 */
ENTRY(__switch_to_asm)
......
  /* switch stack */
  movl  %esp, TASK_threadsp(%eax)
  movl  TASK_threadsp(%edx), %esp
......
  jmp  __switch_to
END(__switch_to_asm)

对于64位系统,切换的是栈顶指针rsp

/*
 * %rdi: prev task
 * %rsi: next task
 */
ENTRY(__switch_to_asm)
......
  /* switch stack */
  movq  %rsp, TASK_threadsp(%rdi)
  movq  TASK_threadsp(%rsi), %rsp
......
  jmp  __switch_to
END(__switch_to_asm)

最后都返回到了__switch_to

例如64位系统__switch_to的流程

__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
  struct thread_struct *prev = &prev_p->thread;
  struct thread_struct *next = &next_p->thread;
......
  int cpu = smp_processor_id();
  struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
......
  load_TLS(next, cpu);
......
  this_cpu_write(current_task, next_p);


  /* Reload esp0 and ss1.  This changes current_thread_info(). */
  load_sp0(tss, next);
......
  return prev_p;
}

x86架构提供了一种以硬件方式进行进程切换的机制

  • 每个进程在内存中维护一个TSS(任务状态段)结构,保存了所有寄存器的值
  • TR是任务寄存器,指向某个进程的TSS

切换TR指向的TSS时,会触发硬件将CPU所有寄存器的值保存到当前进程的TSS中,然后从新进程的TSS读出所有寄存器的值,加载到CPU的寄存器

示例32位的TSS结构

但是进程切换的时候没必要切换每个寄存器,全量切换TSS代价太大,所以在Linux中每个CPU只有一个TSS对象,TR指向这个TSS,操作系统运行期间TR不再切换,也就不会触发硬件的TSS切换了

在x86_hw_tss中有TSS对应的结构

void cpu_init(void)
{
  int cpu = smp_processor_id();
  struct task_struct *curr = current;
  struct tss_struct *t = &per_cpu(cpu_tss, cpu);
    ......
    load_sp0(t, thread);
  set_tss_desc(cpu, t);
  load_TR_desc();
    ......
}


struct tss_struct {
  /*
   * The hardware state:
   */
  struct x86_hw_tss  x86_tss;
  unsigned long    io_bitmap[IO_BITMAP_LONGS + 1];
} 

在Linux中需要切换的主要是栈顶寄存器,在task_struct有成员变量thread,记录了进程切换需要修改的寄存器

/* CPU-specific state of this task: */
  struct thread_struct    thread;

进程切换的时候,会将某个进程的thread_struct里寄存器的值写到CPU的TR指向的tss_struct,对于CPU就完成了切换

例如__switch_to中的load_sp0就是将下一个进程的thread_struct的sp0加载到tss_struct

指令指针的保存与恢复

在进程切换的时候

  • 用户栈,跟随进程空间一起切换的
  • 内核栈,在__switch_to中切换,current_task指向当前的task_struct,里边的void *stack指针就是当前的内核栈
  • 内核栈栈顶指针,__switch_to_asm中切换
  • 用户栈栈顶指针,在内核栈的pt_regs

当发生切换的时候,A进程在内核态的指令指针指向__schedule,内核栈上保留着这个调用链,例如从btrfs_wait_for_no_snapshoting_writes经过层层调用到达context_switch

switch_to(prev, next, prev);
barrier();
return finish_task_switch(prev);

当A进程执行到switch_to时,内核态的指令指针就指向这一行,执行过程中寄存器和栈都被换成了B的,没有变的是指令指针寄存器,所以switch_to返回后,执行finish_task_switch时已经处在下一个进程B的上下文里了

当其他进程切换到A进程的时候,也是在finish_task_switch已经指向了A进程继续执行

finish_task_switch执行完毕后,返回__schedule的调用处,但此时已经不在A进程的btrfs_wait_for_no_snapshoting_writes里了,而是回到B进程当年调用schedule()之后的位置继续执行,内核态执行完后返回到用户态

在返回用户态的时候,通过内核栈的pt_regs的用户态的指令指针寄存器运行下一条指令

switch_to有三个参数

#define switch_to(prev, next, last)          \
do {                  \
  prepare_switch_to(prev, next);          \
                  \
  ((last) = __switch_to_asm((prev), (next)));      \
} while (0)

这里switch_to(prev = A, next = B, last = C)记录的是:A是在切换到B时让出CPU的,但当A再次获得CPU时,是从某个进程C切换回来的

这么设计finish_task_switch是因为需要对C做一些善后工作

总结一下各类指针保存位置和时刻

  • 用户栈, 切换进程内存空间时切换
  • 用户栈顶指针, 内核栈pt_regs中弹出
  • 用户指令指针, 内核栈pt_regs中弹出
  • 内核栈, 由切换的task_struct中的stack指针指向
  • 内核栈顶指针, __switch_to_asm函数切换(保存在thread中)
  • 内核指令指针寄存器, 进程调度最终都会调用__schedule, 因此在让出(从当前进程)和取得(从其他进程) CPU时, 该指针都指向同一个代码位置

17 抢占调度

抢占式调度

产生的原因是一个进程执行时间过长,需要切换到另一个进程。计算机中有一个时钟,每过一段时间触发一次时钟中断

时钟中断处理函数会调用scheduler_tick()

void scheduler_tick(void)
{
  int cpu = smp_processor_id();
  struct rq *rq = cpu_rq(cpu);
  struct task_struct *curr = rq->curr;
......
  curr->sched_class->task_tick(rq, curr, 0);
  cpu_load_update_active(rq);
  calc_global_load_tick(rq);
......
}

这个函数会取出当前CPU的运行队列,拿到队列上正在运行进程的task_struct,然后调用其调度类的task_tick函数

例如普通进程,调度类为fair_sched_class,调度的处理时钟函数为task_tick_fair


static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
  struct cfs_rq *cfs_rq;
  struct sched_entity *se = &curr->se;


  for_each_sched_entity(se) {
    cfs_rq = cfs_rq_of(se);
    entity_tick(cfs_rq, se, queued);
  }
......
}

根据当前进程的task_struct,找到对应的调度实体sched_entity和cfs_rq队列,调用entity_tick

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
  update_curr(cfs_rq);
  update_load_avg(curr, UPDATE_TG);
  update_cfs_shares(curr);
.....
  if (cfs_rq->nr_running > 1)
    check_preempt_tick(cfs_rq, curr);
}

entity_tick里,通过update_curr进行进程的vruntime更新,并调用check_preempt_tick判断是否是时候被抢占

static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
  unsigned long ideal_runtime, delta_exec;
  struct sched_entity *se;
  s64 delta;


  ideal_runtime = sched_slice(cfs_rq, curr);
  delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
  if (delta_exec > ideal_runtime) {
    resched_curr(rq_of(cfs_rq));
    return;
  }
......
  se = __pick_first_entity(cfs_rq);
  delta = curr->vruntime - se->vruntime;
  if (delta < 0)
    return;
  if (delta > ideal_runtime)
    resched_curr(rq_of(cfs_rq));
}

check_preempt_tick

  • 调用sched_slice计算出ideal_runtime(一个调度周期中进程应该运行的实际时间)
  • sum_exec_runtime(进程总执行时间)减去prev_sum_exec_runtime(上次被调度时的总执行时间)得到本轮执行时间,大于ideal_runtime就会被抢占
  • 调用__pick_first_entity取出树中vruntime最小的进程,如果当前进程的vruntime大于它的vruntime,并且差值大于ideal_runtime,也会被抢占

把进程标记为可被抢占后,并不会立即抢占,需要正在运行的进程走到调用__schedule的时机才会真正切换。标记是通过resched_curr中的set_tsk_need_resched完成的,标志位为TIF_NEED_RESCHED

static inline void set_tsk_need_resched(struct task_struct *tsk)
{
  set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

还有一种抢占情况是,当IO到来的时候,进程会被唤醒,当被唤醒的进程优先级高于CPU上的当前进程,就会触发抢占。

  • try_to_wake_up()调用ttwu_queue将这个唤醒的任务添加到队列中
  • ttwu_queue再调用ttwu_do_activate激活这个任务
  • ttwu_do_activate调用ttwu_do_wakeup,这里面调用了check_preempt_curr检查是否应该发生抢占
static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
         struct rq_flags *rf)
{
  check_preempt_curr(rq, p, wake_flags);
  p->state = TASK_RUNNING;
  trace_sched_wakeup(p);

抢占时机

抢占的时机有很多种

  • 用户态系统调用返回
  • 用户态中断返回
  • 内核态允许调度
  • 内核态中断返回

用户态抢占时机是在系统调用结束,调用链路:do_syscall_64->syscall_return_slowpath->prepare_exit_to_usermode->exit_to_usermode_loop

exit_to_usermode_loop中有

static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
{
  while (true) {
    /* We have work to do. */
    local_irq_enable();


    if (cached_flags & _TIF_NEED_RESCHED)
      schedule();
......
  }
}

exit_to_usermode_loop函数上会判断是否打了_TIF_NEED_RESCHED,如果有就调用schedule进行调度

用户态中断返回,在arch/x86/entry/entry_64.S

common_interrupt:
        ASM_CLAC
        addq    $-0x80, (%rsp) 
        interrupt do_IRQ
ret_from_intr:
        popq    %rsp
        testb   $3, CS(%rsp)
        jz      retint_kernel
/* Interrupt came from user space */
GLOBAL(retint_user)
        mov     %rsp,%rdi
        call    prepare_exit_to_usermode
        TRACE_IRQS_IRETQ
        SWAPGS
        jmp     restore_regs_and_iret
/* Returning to kernel space */
retint_kernel:
#ifdef CONFIG_PREEMPT
        bt      $9, EFLAGS(%rsp)  
        jnc     1f
0:      cmpl    $0, PER_CPU_VAR(__preempt_count)
        jnz     1f
        call    preempt_schedule_irq
        jmp     0b

中断调用使用do_IRQ函数,返回可能是用户态,也可能是内核态

  • 在返回用户态的时候,retint_user会调用prepare_exit_to_usermode,最终在exit_to_usermode_loop里调用schedule
  • 在返回内核态的时候,会调用preempt_schedule_irq进行schedule
asmlinkage __visible void __sched preempt_schedule_irq(void)
{
......
  do {
    preempt_disable();
    local_irq_enable();
    __schedule(true);
    local_irq_disable();
    sched_preempt_enable_no_resched();
  } while (need_resched());
......
}

在内核态,抢占一般发生在preempt_enable中。内核中很多操作是不能被中断的,所以在这些操作之前会调用preempt_disable关闭抢占,操作完成后调用preempt_enable重新打开,此时会通过preempt_count_dec_and_test判断是否需要调度,进而触发抢占

#define preempt_enable() \
do { \
  if (unlikely(preempt_count_dec_and_test())) \
    __preempt_schedule(); \
} while (0)


#define preempt_count_dec_and_test() \
  ({ preempt_count_sub(1); should_resched(0); })


static __always_inline bool should_resched(int preempt_offset)
{
  return unlikely(preempt_count() == preempt_offset &&
      tif_need_resched());
}


#define tif_need_resched() test_thread_flag(TIF_NEED_RESCHED)


static void __sched notrace preempt_schedule_common(void)
{
  do {
......
    __schedule(true);
......
  } while (need_resched());
}

18 进程的创建

创建进程是通过fork来完成,对应系统调用sys_fork

SYSCALL_DEFINE0(fork)
{
......
  return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
}

sys_fork调用_do_fork

long _do_fork(unsigned long clone_flags,
        unsigned long stack_start,
        unsigned long stack_size,
        int __user *parent_tidptr,
        int __user *child_tidptr,
        unsigned long tls)
{
  struct task_struct *p;
  int trace = 0;
  long nr;


......
  p = copy_process(clone_flags, stack_start, stack_size,
       child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
......
  if (!IS_ERR(p)) {
    struct pid *pid;
    pid = get_task_pid(p, PIDTYPE_PID);
    nr = pid_vnr(pid);


    if (clone_flags & CLONE_PARENT_SETTID)
      put_user(nr, parent_tidptr);


......
    wake_up_new_task(p);
......
    put_pid(pid);
  } 
......

fork的第一件大事:复制结构

复制结构通过copy_process实现

static __latent_entropy struct task_struct *copy_process(
          unsigned long clone_flags,
          unsigned long stack_start,
          unsigned long stack_size,
          int __user *child_tidptr,
          struct pid *pid,
          int trace,
          unsigned long tls,
          int node)
{
  int retval;
  struct task_struct *p;
......
  p = dup_task_struct(current, node);

dup_task_struct做了几件事

  • 调用alloc_task_struct_node分配一个task_struct
  • 调用alloc_thread_stack_node创建内核栈,即调用__vmalloc_node_range分配一块连续的THREAD_SIZE大小的内存空间,赋给task_struct的void *stack成员变量
  • 调用arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)复制task_struct,实际是调用memcpy
  • 调用setup_thread_stack设置thread_info
retval = copy_creds(p, clone_flags);

copy_creds做的是权限相关的事

  • 调用prepare_creds准备一个新的cred,从内存中分配struct cred,memcpy从父进程复制cred
  • 执行p->cred = p->real_cred = get_cred(new),将新进程的“我能操作谁”和“谁能操作我”两个权限都指向新的cred

重新设置进程运行的统计量

p->utime = p->stime = p->gtime = 0;
p->start_time = ktime_get_ns();
p->real_start_time = ktime_get_boot_ns();

设置调度相关的变量

retval = sched_fork(clone_flags, p);

sched_fork的主要工作

  • 调用__sched_fork,将on_rq设置为0,初始化sched_entity,将exec_start、sum_exec_runtime、prev_sum_exec_runtime、vruntime设置为0
  • 设置进程状态p->state = TASK_NEW
  • 初始化优先级prio、normal_prio、static_prio
  • 设置调度类,例如普通进程设置p->sched_class = &fair_sched_class
  • 调用调度类的task_fork函数。对于CFS来讲就是task_fork_fair:通过update_curr更新当前进程的统计量,将子进程的vruntime设置为与父进程一致,再调用place_entity初始化sched_entity;如果sysctl_sched_child_runs_first设置了子进程先运行,就把子进程的sched_entity放在前边,并调用resched_curr,给当前(父)进程打上TIF_NEED_RESCHED标记

copy_process初始化与文件系统相关的变量

retval = copy_files(clone_flags, p);
retval = copy_fs(clone_flags, p);
  • copy_files复制进程打开的文件信息,调用dup_fd,创建一个新的files_struct,然后将所有的文件描述符数组fdtable拷贝一份
  • copy_fs复制进程目录信息,调用copy_fs_struct,创建一个新的fs_struct,复制进程的根目录,根文件系统root,当前目录pwd和当前文件系统

copy_process初始化信号相关变量

init_sigpending(&p->pending);
retval = copy_sighand(clone_flags, p);
retval = copy_signal(clone_flags, p);

  • init_sigpending用于初始化,并且复制用于维护发给这个进程的信号的数据结构
  • copy_sighand会分配一个新的sighand_struct维护信号处理函数,调用memcpy,将信号处理函数sighand->action从父进程复制到子进程
  • copy_signal会分配一个新的signal_struct,并进行初始化

copy_process复制进程内存空间

retval = copy_mm(clone_flags, p);

进程空间通过mm_struct来维护,copy_mm函数调用dup_mm分配新的mm_struct,调用memcpy复制内存空间中内存映射的部分

copy_process分配Pid,设置tid,group_leader


  INIT_LIST_HEAD(&p->children);
  INIT_LIST_HEAD(&p->sibling);
......
    p->pid = pid_nr(pid);
  if (clone_flags & CLONE_THREAD) {
    p->exit_signal = -1;
    p->group_leader = current->group_leader;
    p->tgid = current->tgid;
  } else {
    if (clone_flags & CLONE_PARENT)
      p->exit_signal = current->group_leader->exit_signal;
    else
      p->exit_signal = (clone_flags & CSIGNAL);
    p->group_leader = p;
    p->tgid = p->pid;
  }
......
  if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
    p->real_parent = current->real_parent;
    p->parent_exec_id = current->parent_exec_id;
  } else {
    p->real_parent = current;
    p->parent_exec_id = current->self_exec_id;
  }

fork的第二件大事:唤醒进程

通过wake_up_new_task,将任务状态设置为TASK_RUNNING

void wake_up_new_task(struct task_struct *p)
{
  struct rq_flags rf;
  struct rq *rq;
......
  p->state = TASK_RUNNING;
......
  activate_task(rq, p, ENQUEUE_NOCLOCK);
  p->on_rq = TASK_ON_RQ_QUEUED;
  trace_sched_wakeup_new(p);
  check_preempt_curr(rq, p, WF_FORK);
......
}

activate_task会调用enqueue_task

static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
.....
  p->sched_class->enqueue_task(rq, p, flags);
}

如果是CFS调度类,执行对应的enqueue_task_fair

static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
  struct cfs_rq *cfs_rq;
  struct sched_entity *se = &p->se;
......
  cfs_rq = cfs_rq_of(se);
  enqueue_entity(cfs_rq, se, flags);
......
  cfs_rq->h_nr_running++;
......
}
  • enqueue_task_fair中取出的为cfs_rq
  • 然后调用enqueue_entity,在其中调用update_curr更新运行的统计数据,调用__enqueue_entity将sched_entity加入红黑树,并设置se->on_rq = 1表示已在队列上
  • 返回enqueue_task_fair后将队列上的进程数加1

wake_up_new_task中会调用check_preempt_curr判断是否能抢占,会调用对应的调度类rq->curr->sched_class->check_preempt_curr(rq, p, flags),对于cfs就是check_preempt_wakeup

static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
  struct task_struct *curr = rq->curr;
  struct sched_entity *se = &curr->se, *pse = &p->se;
  struct cfs_rq *cfs_rq = task_cfs_rq(curr);
......
  if (test_tsk_need_resched(curr))
    return;
......
  find_matching_se(&se, &pse);
  update_curr(cfs_rq_of(se));
  if (wakeup_preempt_entity(se, pse) == 1) {
    goto preempt;
  }
  return;
preempt:
  resched_curr(rq);
......
}

check_preempt_wakeup会先看task_fork_fair是否已经根据sysctl_sched_child_runs_first给父进程设置了TIF_NEED_RESCHED

  • 如果设置了就直接返回
  • 如果没有,调用update_curr更新统计量,根据wakeup_preempt_entity判断是否抢占,如果抢占,将父进程设置为TIF_NEED_RESCHED

当父进程从fork返回的时候,如果可以抢占就进行抢占了

19 线程的创建

线程创建通过glibc的pthread_create函数完成

用户态创建线程

线程不是一个完全由内核态实现的机制,是内核态和用户态共同完成的,而pthread_create也并非系统调用

在nptl/pthread_create.c中

int __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr, void *(*start_routine) (void *), void *arg)
{
......
}
versioned_symbol (libpthread, __pthread_create_2_1, pthread_create, GLIBC_2_1);

首先处理线程的参数,例如线程栈的大小,没有的话使用默认大小

const struct pthread_attr *iattr = (struct pthread_attr *) attr;
struct pthread_attr default_attr;
if (iattr == NULL)
{
  ......
  iattr = &default_attr;
}

用户态有个用于维护线程的结构pthread

struct pthread *pd = NULL;

创建线程栈用于函数调用

int err = ALLOCATE_STACK (iattr, &pd);

ALLOCATE_STACK宏对应的函数

# define ALLOCATE_STACK(attr, pd) allocate_stack (attr, pd, &stackaddr)


static int
allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
                ALLOCATE_STACK_PARMS)
{
  struct pthread *pd;
  size_t size;
  size_t pagesize_m1 = __getpagesize () - 1;
......
  size = attr->stacksize;
......
  /* Allocate some anonymous memory.  If possible use the cache.  */
  size_t guardsize;
  void *mem;
  const int prot = (PROT_READ | PROT_WRITE
                   | ((GL(dl_stack_flags) & PF_X) ? PROT_EXEC : 0));
  /* Adjust the stack size for alignment.  */
  size &= ~__static_tls_align_m1;
  /* Make sure the size of the stack is enough for the guard and
  eventually the thread descriptor.  */
  guardsize = (attr->guardsize + pagesize_m1) & ~pagesize_m1;
  size += guardsize;
  pd = get_cached_stack (&size, &mem);
  if (pd == NULL)
  {
    /* If a guard page is required, avoid committing memory by first
    allocate with PROT_NONE and then reserve with required permission
    excluding the guard page.  */
  mem = __mmap (NULL, size, (guardsize == 0) ? prot : PROT_NONE,
      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    /* Place the thread descriptor at the end of the stack.  */
#if TLS_TCB_AT_TP
    pd = (struct pthread *) ((char *) mem + size) - 1;
#elif TLS_DTV_AT_TP
    pd = (struct pthread *) ((((uintptr_t) mem + size - __static_tls_size) & ~__static_tls_align_m1) - TLS_PRE_TCB_SIZE);
#endif
    /* Now mprotect the required region excluding the guard area. */
    char *guard = guard_position (mem, size, guardsize, pd, pagesize_m1);
    setup_stack_prot (mem, size, guard, guardsize, prot);
    pd->stackblock = mem;
    pd->stackblock_size = size;
    pd->guardsize = guardsize;
    pd->specific[0] = pd->specific_1stblock;
    /* And add to the list of stacks in use.  */
    stack_list_add (&pd->list, &stack_used);
  }

  *pdp = pd;
  void *stacktop;
# if TLS_TCB_AT_TP
  /* The stack begins before the TCB and the static TLS block.  */
  stacktop = ((char *) (pd + 1) - __static_tls_size);
# elif TLS_DTV_AT_TP
  stacktop = (char *) (pd - 1);
# endif
  *stack = stacktop;
...... 
}

主要做了以下事情

  • 获取栈大小
  • 为了防止栈越界,栈尾留有一块guardsize的保护空间,访问到就会报错
  • 线程栈是从进程的堆里创建的,如果不断地创建和删除线程,栈的分配释放会很频繁,所以先调用get_cached_stack看看缓存里有没有大小合适的栈
  • 缓存里没有合适的,就用__mmap创建一块新的
  • 线程的pthread也是放在栈中
  • 计算guard内存位置,调用setup_stack_prot设置这块内存为受保护
  • 填充pthread结构,包括stackblock,stackblock_size,guardsize,specific(线程全局变量)
  • 线程放到stack_used链表,表示栈正在使用

线程栈的管理有两个链表,分别是stack_used和stack_cache。线程结束时它的栈不立即释放,而是缓存到stack_cache中,供以后创建的线程使用

内核态创建任务

有了用户态的栈,还需要解决程序运行的问题

pd->start_routine = start_routine;
pd->arg = arg;
pd->schedpolicy = self->schedpolicy;
pd->schedparam = self->schedparam;
/* Pass the descriptor to the caller.  */
*newthread = (pthread_t) pd;
atomic_increment (&__nptl_nthreads);
retval = create_thread (pd, iattr, &stopped_start, STACK_VARIABLES_ARGS, &thread_ran);

start_routine即线程要执行的函数,它的参数arg,以及调度策略schedpolicy,都需要传递给pthread结构

然后__nptl_nthreads加一代表多一个线程

真正创建线程是调用create_thread函数

static int
create_thread (struct pthread *pd, const struct pthread_attr *attr,
bool *stopped_start, STACK_VARIABLES_PARMS, bool *thread_ran)
{
  const int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM | CLONE_SIGHAND | CLONE_THREAD | CLONE_SETTLS | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | 0);
  ARCH_CLONE (&start_thread, STACK_VARIABLES_ARGS, clone_flags, pd, &pd->tid, tp, &pd->tid);
  /* It's started now, so if we fail below, we'll have to cancel it
and let it clean itself up.  */
  *thread_ran = true;
}

ARCH_CLONE调用了__clone

# define ARCH_CLONE __clone


/* The userland implementation is:
   int clone (int (*fn)(void *arg), void *child_stack, int flags, void *arg),
   the kernel entry is:
   int clone (long flags, void *child_stack).


   The parameters are passed in register and on the stack from userland:
   rdi: fn
   rsi: child_stack
   rdx: flags
   rcx: arg
   r8d: TID field in parent
   r9d: thread pointer
%esp+8: TID field in child


   The kernel expects:
   rax: system call number
   rdi: flags
   rsi: child_stack
   rdx: TID field in parent
   r10: TID field in child
   r8:  thread pointer  */

        .text
ENTRY (__clone)
        movq    $-EINVAL,%rax
......
        /* Insert the argument onto the new stack.  */
        subq    $16,%rsi
        movq    %rcx,8(%rsi)


        /* Save the function pointer.  It will be popped off in the
           child in the ebx frobbing below.  */
        movq    %rdi,0(%rsi)


        /* Do the system call.  */
        movq    %rdx, %rdi
        movq    %r8, %rdx
        movq    %r9, %r8
        mov     8(%rsp), %R10_LP
        movl    $SYS_ify(clone),%eax
......
        syscall
......
PSEUDO_END (__clone)

在进程的主线程中调用其他系统调用时,用户态的栈就是整个进程的栈,栈顶指针也指向进程的栈,指令指针指向主线程的代码。所以主线程调用clone的时候,用户态栈、栈顶指针、指令指针也都还是主线程的。

但是对于新线程来说这些都需要变:除了创建线程对应的task_struct之外,clone之后,用户态的栈应该换成线程自己的栈,栈顶指针指向线程栈,指令指针指向线程要运行的函数

SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
     int __user *, parent_tidptr,
     int __user *, child_tidptr,
     unsigned long, tls)
{
  return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
}

调用的也是_do_fork,但是标志位有一些区别,会造成一些影响

  • copy_files因为CLONE_FILES,将原来files_struct的引用计数加一
  • copy_fs因为CLONE_FS,将原来fs_struct的引用计数加一
  • copy_sighand因为CLONE_SIGHAND,将原来sighand_struct的引用计数加一
  • copy_mm因为CLONE_VM,直接指向了原来的mm_struct

static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
{
  struct files_struct *oldf, *newf;
  oldf = current->files;
  if (clone_flags & CLONE_FILES) {
    atomic_inc(&oldf->count);
    goto out;
  }
  newf = dup_fd(oldf, &error);
  tsk->files = newf;
out:
  return error;
}


static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
{
  struct fs_struct *fs = current->fs;
  if (clone_flags & CLONE_FS) {
    fs->users++;
    return 0;
  }
  tsk->fs = copy_fs_struct(fs);
  return 0;
}

static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
  struct signal_struct *sig;
  if (clone_flags & CLONE_THREAD)
    return 0;
  sig = kmem_cache_zalloc(signal_cachep, GFP_KERNEL);
  tsk->signal = sig;
    init_sigpending(&sig->shared_pending);
......
}

static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
  struct mm_struct *mm, *oldmm;
  oldmm = current->mm;
  if (clone_flags & CLONE_VM) {
    mmget(oldmm);
    mm = oldmm;
    goto good_mm;
  }
  mm = dup_mm(tsk);
good_mm:
  tsk->mm = mm;
  tsk->active_mm = mm;
  return 0;
}

然后是亲缘关系

p->pid = pid_nr(pid);
if (clone_flags & CLONE_THREAD) {
  p->exit_signal = -1;
  p->group_leader = current->group_leader;
  p->tgid = current->tgid;
} else {
  if (clone_flags & CLONE_PARENT)
    p->exit_signal = current->group_leader->exit_signal;
  else
    p->exit_signal = (clone_flags & CSIGNAL);
  p->group_leader = p;
  p->tgid = p->pid;
}
  /* CLONE_PARENT re-uses the old parent */
if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
  p->real_parent = current->real_parent;
  p->parent_exec_id = current->parent_exec_id;
} else {
  p->real_parent = current;
  p->parent_exec_id = current->self_exec_id;
}

因为CLONE_THREAD的标识

  • 创建新进程时:group_leader是自己,tgid是自己的pid,real_parent是当前进程
  • 创建新线程时:group_leader是当前进程的group_leader,tgid是当前进程的tgid,real_parent与当前进程的相同(也就是当前进程的父进程)

对于信号的处理,需要区分发给进程还是发给线程:例如kill发给的是整个进程,任意一个线程都可以处理;而pthread_kill发给的是某个特定线程

copy_process中,会初始化struct sigpending pending的成员变量,用于记录信号列表,对于进程和线程的列表是不一样的

init_sigpending(&p->pending);

copy_signal中,会初始化struct sigpending shared_pending的成员变量,是进程和线程共同需要处理的

init_sigpending(&sig->shared_pending);

clone在内核的调用完毕,要返回系统调用,回到用户态

用户态执行线程

根据__clone的第一个参数,回到用户态执行的函数为start_thread,为线程在用户态的统一入口

#define START_THREAD_DEFN \
  static int __attribute__ ((noreturn)) start_thread (void *arg)


START_THREAD_DEFN
{
    struct pthread *pd = START_THREAD_SELF;
    /* Run the code the user provided.  */
    THREAD_SETMEM (pd, result, pd->start_routine (pd->arg));
    /* Call destructors for the thread_local TLS variables.  */
    /* Run the destructor for the thread-local data.  */
    __nptl_deallocate_tsd ();
    if (__glibc_unlikely (atomic_decrement_and_test (&__nptl_nthreads)))
        /* This was the last thread.  */
        exit (0);
    __free_tcb (pd);
    __exit_thread ();
}

start_thread入口函数中,调用提供的运行函数,在函数执行完之后会释放线程的相关数据,例如线程本地数据thread_local variables,线程数减一,__free_tcb释放pthread

void
internal_function
__free_tcb (struct pthread *pd)
{
  ......
  __deallocate_stack (pd);
}


void
internal_function
__deallocate_stack (struct pthread *pd)
{
  /* Remove the thread from the list of threads with user defined
     stacks.  */
  stack_list_del (&pd->list);
  /* Not much to do.  Just free the mmap()ed memory.  Note that we do
     not reset the 'used' flag in the 'tid' field.  This is done by
     the kernel.  If no thread has been created yet this field is
     still zero.  */
  if (__glibc_likely (! pd->user_stack))
    (void) queue_stack (pd);
}

__free_tcb会调用__deallocate_stack来释放整个线程栈,这个线程栈要从当前使用线程栈的列表stack_used中拿下来,放到缓存的线程栈列表stack_cache

内存管理

20 内存管理

物理内存对进程是不可见的,进程不能直接访问,需要通过操作系统分配的虚拟地址进行访问。每个进程看到的虚拟地址空间布局都是一样的

操作系统提供了虚拟地址和物理地址之间的转换

操作系统对内存的管理主要分为了三个部分

  • 物理内存管理
  • 虚拟地址管理
  • 虚拟地址和物理地址映射

进程在使用内存的时候有几种形式

用户态

  • 代码
  • 全局变量
  • 常量
  • 函数栈
  • 对glibc的调用so文件

内核态

  • 内核代码
  • 内核全局变量
  • 进程的task_struct
  • 进程的内核栈
  • 内核动态分配的内存
  • 内存映射表

对于用户态和内核态都需要通过虚拟地址来访问内存

所以,虚拟内存空间分两部分

  • 高地址的内核空间
  • 低地址的用户空间

用户空间从低地址开始分别是

  • Text Segment(代码)
  • Data Segment(已初始化的静态变量和全局变量)
  • BSS Segment(未初始化的静态变量和全局变量)
  • Heap 堆
  • Memory Mapping Segment(文件映射到内存的区域,例如依赖的动态链接库)
  • Stack 栈

所有的进程看到的内核空间都是相同的

21 内存管理

内存地址可以通过分段机制进行管理。在分段机制下,虚拟地址由两部分组成

  • 段选择子,保存在段寄存器,内容为段号,是段表的索引,段表保存的是这个段的基地址,段的界限和特权等级
  • 段内偏移量

在Linux内,段表全称为段描述符表,放在全局描述符表GDT中

#define GDT_ENTRY_INIT(flags, base, limit) { { { \
    .a = ((limit) & 0xffff) | (((base) & 0xffff) << 16), \
    .b = (((base) & 0xff0000) >> 16) | (((flags) & 0xf0ff) << 8) | \
      ((limit) & 0xf0000) | ((base) & 0xff000000), \
  } } }

一个段表项由段基地址base和段界限limit,还有一些标识符组成

DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#ifdef CONFIG_X86_64
  [GDT_ENTRY_KERNEL32_CS]    = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
  [GDT_ENTRY_KERNEL_CS]    = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
  [GDT_ENTRY_KERNEL_DS]    = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
  [GDT_ENTRY_DEFAULT_USER32_CS]  = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
  [GDT_ENTRY_DEFAULT_USER_DS]  = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
  [GDT_ENTRY_DEFAULT_USER_CS]  = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
#else
  [GDT_ENTRY_KERNEL_CS]    = GDT_ENTRY_INIT(0xc09a, 0, 0xfffff),
  [GDT_ENTRY_KERNEL_DS]    = GDT_ENTRY_INIT(0xc092, 0, 0xfffff),
  [GDT_ENTRY_DEFAULT_USER_CS]  = GDT_ENTRY_INIT(0xc0fa, 0, 0xfffff),
  [GDT_ENTRY_DEFAULT_USER_DS]  = GDT_ENTRY_INIT(0xc0f2, 0, 0xfffff),
......
#endif
} };
EXPORT_PER_CPU_SYMBOL_GPL(gdt_page);

对于64位和32位,都定义了内核代码段、内核数据段、用户代码段和用户数据段,也会定义4个段选择子,指向段描述符表项

#define __KERNEL_CS      (GDT_ENTRY_KERNEL_CS*8)
#define __KERNEL_DS      (GDT_ENTRY_KERNEL_DS*8)
#define __USER_DS      (GDT_ENTRY_DEFAULT_USER_DS*8 + 3)
#define __USER_CS      (GDT_ENTRY_DEFAULT_USER_CS*8 + 3)

这4个值是存储在段寄存器中的

所有的段起始位置都是0,但是用户态的DPL为3,内核态为0,用于权限审核

Linux的虚拟内存和物理内存的转换方式为分页

  • 内存页面不使用,就可以写到硬盘,为换出
  • 需要的时候加载进来,为换入

为了定位和访问页,需要有页表,保存页的起始地址和页内偏移量,组成线性地址,就可以访问了

对于32位系统,虚拟内存为4GB,如果分为4KB一个页,就是1M个,每个页表项需要4个字节进行存储,就是需要4MB内存来存储映射表,但是如果每个进程都有映射表,100个进程就是400MB。

这里有页表的再分页,4M分成1024个4KB,每个4KB可以在一个页中,这样就是1024个页,再通过页目录表来管理

页目录有1024项,用10位就可以找到对应的页表,每个页表也是1024项,用10位来找到对应的页表项,然后用12位可以定位4KB页内的任何一个位置,加起来正好32位

需要的大小为4MB+4KB才能映射完整的4GB,但这是针对用满4GB地址空间的进程。对于只申请了一个数据页的进程,只需要页目录加上真正用到的那一个页表,页表只在对应地址需要分配时才创建

对于64位,2级不够用了,需要4级目录

  • 9位全局页目录项PGD
  • 9位上层页目录项PUD
  • 9位中间页目录项PMD
  • 9位页表项PTE
  • 12位页偏移地址

x86_64处理器的地址线只有48条,故而硬件要求传入地址的48位到63位必须与第47位相同(即规范地址)。4K页面下各级宽度为9、9、9、9和12位

22 进程空间管理

用户态和内核态的划分

进程的虚拟内存空间在task_struct中

struct mm_struct    *mm;

struct mm_struct中有成员变量

unsigned long task_size;    /* size of task vm space */

内核态和用户态的分界线就是task_size

在32位系统中,内核定义TASK_SIZE

#ifdef CONFIG_X86_32
/*
 * User space process size: 3GB (default).
 */
#define TASK_SIZE    PAGE_OFFSET
#define TASK_SIZE_MAX    TASK_SIZE
/*
config PAGE_OFFSET
        hex
        default 0xC0000000
        depends on X86_32
*/
#else
/*
 * User space process size. 47bits minus one guard page.
*/
#define TASK_SIZE_MAX  ((1UL << 47) - PAGE_SIZE)
#define TASK_SIZE    (test_thread_flag(TIF_ADDR32) ? \
          IA32_PAGE_OFFSET : TASK_SIZE_MAX)
......

当执行一个新的进程的时候会设置task_size

current->mm->task_size = TASK_SIZE;

对于32位系统,最大寻址4G,用户态为3G,内核态为1G,而对于64位系统,虚拟地址只使用了48位,一半是内核态,一半是用户态,都是128T大小

用户态布局

在struct mm_struct

unsigned long mmap_base;  /* base of mmap area */
unsigned long total_vm;    /* Total pages mapped */
unsigned long locked_vm;  /* Pages that have PG_mlocked set */
unsigned long pinned_vm;  /* Refcount permanently increased */
unsigned long data_vm;    /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
unsigned long exec_vm;    /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
unsigned long stack_vm;    /* VM_STACK */
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
  • total_vm是总共映射的页数
  • locked_vm是锁定不能换出的
  • pinned_vm是不能换出也不能移动的
  • data_vm是存放数据页的数目
  • exec_vm是存放可执行文件的页的数目
  • stack_vm是栈所占的页的数目
  • start_code和end_code是可执行代码的开始和结束位置
  • start_data和end_data是已初始化数据的开始和结束位置
  • start_brk为堆的起始位置,brk为堆的当前顶部位置,malloc申请小内存就是通过改变brk位置完成的
  • start_stack是栈的起始位置,栈顶位置在栈顶指针寄存器ESP中
  • arg_start和arg_end是参数列表的位置
  • env_start和env_end是环境变量的位置
  • mmap_base表示虚拟地址空间中内存映射区的起始地址,该区域从高地址向低地址增长,malloc申请大内存时底层通过mmap分配的就是这个区域

除了位置信息,struct mm_struct还有vm_area_struct用来描述区域属性

struct vm_area_struct *mmap;    /* list of VMAs */
struct rb_root mm_rb;

mmap是一个链表,将各个区域(vm_area_struct)串起来,vm_area_struct之间通过vm_next、vm_prev双向链接;mm_rb是一棵红黑树,可以快速查找并修改内存区域

struct vm_area_struct {
  /* The first cache line has the info for VMA tree walking. */
  unsigned long vm_start;    /* Our start address within vm_mm. */
  unsigned long vm_end;    /* The first byte after our end address within vm_mm. */
  /* linked list of VM areas per task, sorted by address */
  struct vm_area_struct *vm_next, *vm_prev;
  struct rb_node vm_rb;
  struct mm_struct *vm_mm;  /* The address space we belong to. */
  struct list_head anon_vma_chain; /* Serialized by mmap_sem &
            * page_table_lock */
  struct anon_vma *anon_vma;  /* Serialized by page_table_lock */
  /* Function pointers to deal with this struct. */
  const struct vm_operations_struct *vm_ops;
  struct file * vm_file;    /* File we map to (can be NULL). */
  void * vm_private_data;    /* was vm_pte (shared mem) */
} __randomize_layout;
  • vm_start和vm_end指定了该区域在用户空间的起始和结束位置
  • vm_next和vm_prev将区域串联在链表上
  • vm_rb将区域挂在红黑树上,用于快速定位
  • vm_ops定义了对这个内存区域可以进行的操作

虚拟内存可以映射到物理内存,也可以映射到文件

  • anon_vma映射到物理内存
  • vm_file映射到文件

vm_area_struct是通过load_elf_binary关联到进程上的

exec运行一个二进制程序,除了解析ELF文件,另外就是建立内存映射

static int load_elf_binary(struct linux_binprm *bprm)
{
......
  setup_new_exec(bprm);
......
  retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
         executable_stack);
......
  error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
        elf_prot, elf_flags, total_size);
......
  retval = set_brk(elf_bss, elf_brk, bss_prot);
......
  elf_entry = load_elf_interp(&loc->interp_elf_ex,
              interpreter,
              &interp_map_addr,
              load_bias, interp_elf_phdata);
......
  current->mm->end_code = end_code;
  current->mm->start_code = start_code;
  current->mm->start_data = start_data;
  current->mm->end_data = end_data;
  current->mm->start_stack = bprm->p;
......

load_elf_binary的工作有

  • 调用setup_new_exec设置内存映射区mmap_base
  • 调用setup_arg_pages设置栈的vm_area_struct,这里设置了mm->arg_start指向栈底,current->mm->start_stack就是栈底
  • 调用elf_map将ELF文件的代码部分映射到内存
  • 调用set_brk设置堆的vm_area_struct,这里设置了current->mm->start_brk = current->mm->brk,堆里还是空的
  • 调用load_elf_interp将依赖的so文件映射到内存中的内存映射位置

映射完成之后,需要修改映射的情况有

  • 函数调用,涉及函数栈的改变,主要是改变栈顶指针
  • 申请内存,malloc申请堆内存时,底层会执行brk或者mmap

brk系统调用的入口为sys_brk

SYSCALL_DEFINE1(brk, unsigned long, brk)
{
  unsigned long retval;
  unsigned long newbrk, oldbrk;
  struct mm_struct *mm = current->mm;
  struct vm_area_struct *next;
......
  newbrk = PAGE_ALIGN(brk);
  oldbrk = PAGE_ALIGN(mm->brk);
  if (oldbrk == newbrk)
    goto set_brk;


  /* Always allow shrinking brk. */
  if (brk <= mm->brk) {
    if (!do_munmap(mm, newbrk, oldbrk-newbrk, &uf))
      goto set_brk;
    goto out;
  }


  /* Check against existing mmap mappings. */
  next = find_vma(mm, oldbrk);
  if (next && newbrk + PAGE_SIZE > vm_start_gap(next))
    goto out;


  /* Ok, looks good - let it rip. */
  if (do_brk(oldbrk, newbrk-oldbrk, &uf) < 0)
    goto out;


set_brk:
  mm->brk = brk;
......
  return brk;
out:
  retval = mm->brk;
  return retval;
}

sys_brk的参数brk为新的堆顶位置,当前堆顶位置mm->brk

  1. 首先将原来的堆顶和新的堆顶都按页对齐
  2. 比较对齐后的地址,如果相同,说明堆的变化没有跨页,不需要分配新页,直接跳到set_brk,把mm->brk设置为新的brk
  3. 如果不在同一个页,且新堆顶小于旧堆顶,说明要释放内存,调用do_munmap;这只是在虚拟空间层面删除映射,物理页到下次分配时才清空
  4. 如果新堆顶大于旧堆顶,先调用find_vma,检查旧堆顶所在的vm_area_struct和它的下一个vm_area_struct之间的空间是否足够分配一个完整的页,不够就直接返回失败;够的话调用do_brk,按新旧堆顶之间的页数进一步分配堆内存
static int do_brk(unsigned long addr, unsigned long len, struct list_head *uf)
{
  return do_brk_flags(addr, len, 0, uf);
}


static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long flags, struct list_head *uf)
{
  struct mm_struct *mm = current->mm;
  struct vm_area_struct *vma, *prev;
  unsigned long len;
  struct rb_node **rb_link, *rb_parent;
  pgoff_t pgoff = addr >> PAGE_SHIFT;
  int error;


  len = PAGE_ALIGN(request);
......
  find_vma_links(mm, addr, addr + len, &prev, &rb_link,
            &rb_parent);
......
  vma = vma_merge(mm, prev, addr, addr + len, flags,
      NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
  if (vma)
    goto out;
......
  vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
  INIT_LIST_HEAD(&vma->anon_vma_chain);
  vma->vm_mm = mm;
  vma->vm_start = addr;
  vma->vm_end = addr + len;
  vma->vm_pgoff = pgoff;
  vma->vm_flags = flags;
  vma->vm_page_prot = vm_get_page_prot(flags);
  vma_link(mm, vma, prev, rb_link, rb_parent);
out:
  perf_event_mmap(vma);
  mm->total_vm += len >> PAGE_SHIFT;
  mm->data_vm += len >> PAGE_SHIFT;
  if (flags & VM_LOCKED)
    mm->locked_vm += (len >> PAGE_SHIFT);
  vma->vm_flags |= VM_SOFTDIRTY;
  return 0;

do_brk调用find_vma_links找到vm_area_struct在红黑树中的位置,拿到父节点和前序节点。然后调用vma_merge看新区域能否与旧区域合并:如果地址相连就合并,最后更新统计值;不能合并就创建新的vm_area_struct,追加到anon_vma_chain,也加入红黑树,并更新统计值

不能合并的原因是堆不是连续的

内核态的布局

内核态的虚拟地址空间和进程无关,所有进程通过内核态看到的都是同一份

内核中有两个宏

  • __pa(vaddr)返回与虚拟地址vaddr相关的物理地址
  • __va(paddr)返回与物理地址paddr对应的虚拟地址
    #define __va(x)      ((void *)((unsigned long)(x)+PAGE_OFFSET))
    #define __pa(x)    __phys_addr((unsigned long)(x))
    #define __phys_addr(x)    __phys_addr_nodebug(x)
    #define __phys_addr_nodebug(x)  ((x) - PAGE_OFFSET)

32位布局

前896M为直接映射区,是一段连续的内存

  1. 在系统启动的时候,前1M物理内存已经被使用
  2. 然后加载内核代码、内核全局变量、BSS等
  3. 对于创建的进程,task_struct也会放入,相应的页表也会被创建
  4. 内核栈进行分配,相应的页表也会被创建

对于64位

  • 0xffff800000000000开始内核部分,有8T空挡
  • __PAGE_OFFSET_BASE(0xffff880000000000)开始的64T内存为直接映射区,虚拟内存和物理内存之间的映射在大部分情况下都是建立页表映射的
  • VMALLOC_START(0xffffc90000000000)到VMALLOC_END(0xffffe90000000000)之间的32T给vmalloc使用,内核通过vmalloc申请内存
  • VMEMMAP_START(0xffffea0000000000)开始的1T空间用于存放物理页面的描述结构struct page
  • __START_KERNEL_map(0xffffffff80000000)开始的512M用于存放内核代码段、全局变量和BSS

对于64位的对应关系

23 物理内存分配

物理内存的组织方式

多个CPU访问内存都要经过总线,且到内存的距离相同,这种结构被称为对称多处理器SMP(Symmetric Multiprocessing),总线会成为瓶颈

为了提高扩展性,有了更高级模式NUMA非一致内存访问

在NUMA模式下,内存不是一整块,每个CPU都有自己的本地内存,CPU访问本地内存不过总线,速度会快很多。当本地内存不足的时候,就会去其他的NUMA节点申请内存,访问的延迟就会增加

内存虽然被分到多个节点,每个节点再分成一个个页面,但页面需要全局唯一定位,所以有全局唯一的页号;由于物理内存不连续,页号也不连续,这被称为非连续内存模型

节点

NUMA节点的数据结构为typedef struct pglist_data pg_data_t,成员变量有

typedef struct pglist_data {
  struct zone node_zones[MAX_NR_ZONES];
  struct zonelist node_zonelists[MAX_ZONELISTS];
  int nr_zones;
  struct page *node_mem_map;
  unsigned long node_start_pfn;
  unsigned long node_present_pages; /* total number of physical pages */
  unsigned long node_spanned_pages; /* total size of physical page range, including holes */
  int node_id;
......
} pg_data_t;
  • node_id: 节点ID
  • node_mem_map: 节点的struct page数组
  • node_start_pfn: 节点的起始页号
  • node_spanned_pages: 节点中包含不连续内存地址的页面数
  • node_present_pages: 节点真正可用的物理页面数

例如两段64M的内存之间隔着4M的空洞,换算成4K页,就是两段16k页之间隔着1k页,node_spanned_pages为33k,而node_present_pages为32k

每个节点分成一个个区域zone,放在node_zones,数组大小为MAX_NR_ZONES,区域定义

enum zone_type {
#ifdef CONFIG_ZONE_DMA
  ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
  ZONE_DMA32,
#endif
  ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
  ZONE_HIGHMEM,
#endif
  ZONE_MOVABLE,
  __MAX_NR_ZONES
};
  • ZONE_DMA,DMA(直接内存存取)区域,外设通过DMA控制器直接读写内存,不需要CPU参与搬运数据
  • ZONE_NORMAL,直接映射区,按固定偏移线性映射到内核虚拟地址空间
  • ZONE_HIGHMEM,高端内存区,对于32位是物理内存超过896M的部分,64位没有这个区域
  • ZONE_MOVABLE,可移动分配区域

其他成员变量

  • nr_zones当前节点区域数量
  • node_zonelists是备用节点列表,本节点内存不足时按这个列表去其他节点申请内存

多个NUMA节点放到了一个数组

struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;

区域

区域通过数据结构zone组织

struct zone {
......
  struct pglist_data  *zone_pgdat;
  struct per_cpu_pageset __percpu *pageset;


  unsigned long    zone_start_pfn;


  /*
   * spanned_pages is the total pages spanned by the zone, including
   * holes, which is calculated as:
   *   spanned_pages = zone_end_pfn - zone_start_pfn;
   *
   * present_pages is physical pages existing within the zone, which
   * is calculated as:
   *  present_pages = spanned_pages - absent_pages(pages in holes);
   *
   * managed_pages is present pages managed by the buddy system, which
   * is calculated as (reserved_pages includes pages allocated by the
   * bootmem allocator):
   *  managed_pages = present_pages - reserved_pages;
   *
   */
  unsigned long    managed_pages;
  unsigned long    spanned_pages;
  unsigned long    present_pages;


  const char    *name;
......
  /* free areas of different sizes */
  struct free_area  free_area[MAX_ORDER];


  /* zone flags, see below */
  unsigned long    flags;


  /* Primarily protects free_area */
  spinlock_t    lock;
......
} ____cacheline_internodealigned_in_smp;
  • zone_start_pfn是zone的第一个页的页号
  • spanned_pages是zone跨越的页数,不管中间是否有空洞,等于结束页号减去起始页号
  • present_pages是实际存在的page数目,即spanned_pages减去空洞里的页
  • managed_pages是被伙伴系统管理的page数目,即present_pages减去预留的页
  • per_cpu_pageset用于区分冷热页,被CPU缓存过的就是热页

物理页使用有多种的模式

  1. 使用一整页内存,或者直接和虚拟地址空间建立映射关系,被称为匿名页,或者关联一个文件,然后再和虚拟地址空间建立映射关系,被称为内存映射文件
  2. 仅需要分配小块内存,例如一个struct结构,分配小块内存使用了slab allocator技术,原理是申请一个完整的页,然后划分为多个小的存储池,用队列维护状态。还有slob用于嵌入式

第一种模式,union结构中的变量

  • struct address_space *mapping用于内存映射,最低位为1表示匿名页,为0表示映射文件
  • pgoff_t index为映射区的偏移量
  • atomic_t _mapcount有多少个页表指向了这个页
  • struct list_head lru表示这个页在一个链表上,例如页面被换出,就在换出页链表
  • compound相关字段用于复合页,将物理上连续的两个或更多页当作一个独立的大页来用

第二种模式,union结构中的变量

  • s_mem正在使用slab的第一个对象
  • freelist资源池空闲对象
  • rcu_head需要释放的列表
    struct page {
      unsigned long flags;
      union {
        struct address_space *mapping;  
        void *s_mem;      /* slab first object */
        atomic_t compound_mapcount;  /* first tail page */
      };
      union {
        pgoff_t index;    /* Our offset within mapping. */
        void *freelist;    /* sl[aou]b first free object */
      };
      union {
        unsigned counters;
        struct {
          union {
            atomic_t _mapcount;
            unsigned int active;    /* SLAB */
            struct {      /* SLUB */
              unsigned inuse:16;
              unsigned objects:15;
              unsigned frozen:1;
            };
            int units;      /* SLOB */
          };
          atomic_t _refcount;
        };
      };
      union {
        struct list_head lru;  /* Pageout list   */
        struct dev_pagemap *pgmap; 
        struct {    /* slub per cpu partial pages */
          struct page *next;  /* Next partial slab */
          int pages;  /* Nr of partial slabs left */
          int pobjects;  /* Approximate # of objects */
        };
        struct rcu_head rcu_head;
        struct {
          unsigned long compound_head; /* If bit zero is set */
          unsigned int compound_dtor;
          unsigned int compound_order;
        };
      };
      union {
        unsigned long private;
        struct kmem_cache *slab_cache;  /* SL[AU]B: Pointer to slab */
      };
    ......
    }

页的分配

Linux中内存页为4KB,所有空闲页被分组为11个页块链表,每个链表中的页块大小相同,分别包含1、2、4、8……1024个连续页。最大可以申请1024个连续页的页块,对应4MB大小的连续内存

第i个页块链表的页块中页数量为2^i,定义为

struct free_area  free_area[MAX_ORDER];

MAX_ORDER为指数,默认最大为11

示例:请求一个128个页的页块

  • 先检查128个页的页块链表是否有空闲块,如果没有就查256个页的页块链表,有空闲就把页块拆分为两个128页的,一个分配出去,另一个插入128页的页块链表
  • 如果256页的链表也不满足,就查询512页的页块链表,有空闲就拆分为两个128页和一个256页的页块,分别插入对应的链表

这个过程在分配页的过程中

static inline struct page *
alloc_pages(gfp_t gfp_mask, unsigned int order)
{
  return alloc_pages_current(gfp_mask, order);
}


/**
 *   alloc_pages_current - Allocate pages.
 *
 *  @gfp:
 *    %GFP_USER   user allocation,
 *        %GFP_KERNEL kernel allocation,
 *        %GFP_HIGHMEM highmem allocation,
 *        %GFP_FS     don't call back into a file system.
 *        %GFP_ATOMIC don't sleep.
 *  @order: Power of two of allocation size in pages. 0 is a single page.
 *
 *  Allocate a page from the kernel page pool.  When not in
 *  interrupt context and apply the current process NUMA policy.
 *  Returns NULL when no page can be allocated.
 */
struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
  struct mempolicy *pol = &default_policy;
  struct page *page;
......
  page = __alloc_pages_nodemask(gfp, order,
        policy_node(gfp, pol, numa_node_id()),
        policy_nodemask(gfp, pol));
......
  return page;
}

alloc_pages会调用alloc_pages_current,gfp为希望在哪个区域分配内存

  • GFP_USER用于分配映射到用户进程虚拟地址空间的页
  • GFP_KERNEL用于内核内部的分配,可以使用直接映射的ZONE_NORMAL区域
  • GFP_HIGHMEM用于分配高端区域的内存

order代表分配2^order个连续页

__alloc_pages_nodemask,会调用get_page_from_freelist先看自己zone的空闲页,没有的话去备用节点的zone找空闲页

static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
            const struct alloc_context *ac)
{
......
  for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
    struct page *page;
......
    page = rmqueue(ac->preferred_zoneref->zone, zone, order,
        gfp_mask, alloc_flags, ac->migratetype);
......
}

整体就是rmqueue->__rmqueue->__rmqueue_smallest的调用链

static inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
            int migratetype)
{
  unsigned int current_order;
  struct free_area *area;
  struct page *page;


  /* Find a page of the appropriate size in the preferred list */
  for (current_order = order; current_order < MAX_ORDER; ++current_order) {
    area = &(zone->free_area[current_order]);
    page = list_first_entry_or_null(&area->free_list[migratetype],
              struct page, lru);
    if (!page)
      continue;
    list_del(&page->lru);
    rmv_page_order(page);
    area->nr_free--;
    expand(zone, page, order, current_order, area, migratetype);
    set_pcppage_migratetype(page, migratetype);
    return page;
  }


  return NULL;

从当前的order开始查找,找不到就到更大一阶的链表查找;找到后调用expand逐级对半拆分,list_add把拆出的一半加回低一阶的链表,nr_free++计数加一

static inline void expand(struct zone *zone, struct page *page,
  int low, int high, struct free_area *area,
  int migratetype)
{
  unsigned long size = 1 << high;


  while (high > low) {
    area--;
    high--;
    size >>= 1;
......
    list_add(&page[size].lru, &area->free_list[migratetype]);
    area->nr_free++;
    set_page_order(&page[size], high);
  }
}

总结一下,就是空闲页通过free_area管理,每个页是一个page

24 物理内存管理

小内存分配

小内存分配采用slub分配器

在创建进程的时候,会调用dup_task_struct复制一个task_struct对象,内存通过alloc_task_struct_node方法分配

static struct kmem_cache *task_struct_cachep;

task_struct_cachep = kmem_cache_create("task_struct",
      arch_task_struct_size, align,
      SLAB_PANIC|SLAB_NOTRACK|SLAB_ACCOUNT, NULL);

static inline struct task_struct *alloc_task_struct_node(int node)
{
  return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node);
}

static inline void free_task_struct(struct task_struct *tsk)
{
  kmem_cache_free(task_struct_cachep, tsk);
}

调用kmem_cache_alloc_node从task_struct_cachep中分配了内存

task_struct_cachep在系统初始化的时候由kmem_cache_create函数创建,专门用于分配task_struct对象的缓存,缓存区每块的大小正好等于task_struct大小,也就是arch_task_struct_size

每次创建task_struct不需要去内存分配,直接在其缓存区分配即可,当进程结束的时候,task_struct也不用销毁,而是直接放回到缓存,新进程直接使用缓存的task_struct

kmem_cache的代码

struct kmem_cache {
  struct kmem_cache_cpu __percpu *cpu_slab;
  /* Used for retriving partial slabs etc */
  unsigned long flags;
  unsigned long min_partial;
  int size;    /* The size of an object including meta data */
  int object_size;  /* The size of an object without meta data */
  int offset;    /* Free pointer offset. */
#ifdef CONFIG_SLUB_CPU_PARTIAL
  int cpu_partial;  /* Number of per cpu partial objects to keep around */
#endif
  struct kmem_cache_order_objects oo;
  /* Allocation and freeing of slabs */
  struct kmem_cache_order_objects max;
  struct kmem_cache_order_objects min;
  gfp_t allocflags;  /* gfp flags to use on each alloc */
  int refcount;    /* Refcount for slab cache destroy */
  void (*ctor)(void *);
......
  const char *name;  /* Name (only for display!) */
  struct list_head list;  /* List of slab caches */
......
  struct kmem_cache_node *node[MAX_NUMNODES];
};

kmem_cache中有struct list_head list。对于操作系统,需要创建和管理很多缓存,这些缓存就放在list_head,就是LIST_HEAD(slab_caches)

大内存拆分为小内存

每一项结构都缓存着下一个空闲对象的指针,进而形成链,可以随时插入和删除

这里有三个变量

  • size是对象加上元数据后的大小
  • object_size是纯对象的大小
  • offset是“下一个空闲对象”指针在对象内存放位置的偏移量

在每个NUMA节点上都有kmem_cache_cpu和kmem_cache_node,分配缓存块时有两条路径

  • fast path快速通道,用于kmem_cache_cpu
  • slow path普通通道,用于kmem_cache_node

每次分配先从kmem_cache_cpu分配,kmem_cache_cpu没有才从kmem_cache_node分配,再没有才去其他节点

kmem_cache_cpu中存放缓存块

struct kmem_cache_cpu {
  void **freelist;  /* Pointer to next available object */
  unsigned long tid;  /* Globally unique transaction id */
  struct page *page;  /* The slab from which we are allocating */
#ifdef CONFIG_SLUB_CPU_PARTIAL
  struct page *partial;  /* Partially allocated frozen slabs */
#endif
......
};
  • page指向大块内存的第一个页
  • freelist指向大块内存中第一个空闲项
  • partial指向另一块大块内存的第一个页,其中部分对象已被分配出去,page用满后从这里接着分配

kmem_cache_node中存放缓存块

struct kmem_cache_node {
  spinlock_t list_lock;
......
#ifdef CONFIG_SLUB
  unsigned long nr_partial;
  struct list_head partial;
......
#endif
};

partial是一个链表,存放空闲的内存块

分配过程

kmem_cache_alloc_node会调用slab_alloc_node

/*
 * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
 * have the fastpath folded into their functions. So no function call
 * overhead for requests that can be satisfied on the fastpath.
 *
 * The fastpath works by first checking if the lockless freelist can be used.
 * If not then __slab_alloc is called for slow processing.
 *
 * Otherwise we can simply pick the next object from the lockless free list.
 */
static __always_inline void *slab_alloc_node(struct kmem_cache *s,
    gfp_t gfpflags, int node, unsigned long addr)
{
  void *object;
  struct kmem_cache_cpu *c;
  struct page *page;
  unsigned long tid;
......
  tid = this_cpu_read(s->cpu_slab->tid);
  c = raw_cpu_ptr(s->cpu_slab);
......
  object = c->freelist;
  page = c->page;
  if (unlikely(!object || !node_match(page, node))) {
    object = __slab_alloc(s, gfpflags, node, addr, c);
    stat(s, ALLOC_SLOWPATH);
  } 
......
  return object;
}

取出cpu_slab也就是kmem_cache_cpu的freelist,有空闲对象就直接返回;没有就进入普通通道,调用__slab_alloc

static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
        unsigned long addr, struct kmem_cache_cpu *c)
{
  void *freelist;
  struct page *page;
......
redo:
......
  /* must check again c->freelist in case of cpu migration or IRQ */
  freelist = c->freelist;
  if (freelist)
    goto load_freelist;


  freelist = get_freelist(s, page);


  if (!freelist) {
    c->page = NULL;
    stat(s, DEACTIVATE_BYPASS);
    goto new_slab;
  }


load_freelist:
  c->freelist = get_freepointer(s, freelist);
  c->tid = next_tid(c->tid);
  return freelist;


new_slab:


  if (slub_percpu_partial(c)) {
    page = c->page = slub_percpu_partial(c);
    slub_set_percpu_partial(c, page);
    stat(s, CPU_PARTIAL_ALLOC);
    goto redo;
  }


  freelist = new_slab_objects(s, gfpflags, node, &c);
......
  return freelist;
}

  • 会再尝试一下kmem_cache_cpu的freelist,因为中断返回后可能有其他进程释放了缓存,有的话经load_freelist返回
  • 如果没有,就去看kmem_cache_cpu的partial,如果partial不为空,就将page替换为partial,redo重试
  • 如果还不行就调用new_slab_objects

static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
      int node, struct kmem_cache_cpu **pc)
{
  void *freelist;
  struct kmem_cache_cpu *c = *pc;
  struct page *page;


  freelist = get_partial(s, flags, node, c);


  if (freelist)
    return freelist;


  page = new_slab(s, flags, node);
  if (page) {
    c = raw_cpu_ptr(s->cpu_slab);
    if (c->page)
      flush_slab(s, c);


    freelist = page->freelist;
    page->freelist = NULL;


    stat(s, ALLOC_SLAB);
    c->page = page;
    *pc = c;
  } else
    freelist = NULL;


  return freelist;
}

get_partial会根据nodeid找到对应的kmem_cache_node,调用get_partial_node进行分配


/*
 * Try to allocate a partial slab from a specific node.
 */
static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
        struct kmem_cache_cpu *c, gfp_t flags)
{
  struct page *page, *page2;
  void *object = NULL;
  int available = 0;
  int objects;
......
  list_for_each_entry_safe(page, page2, &n->partial, lru) {
    void *t;


    t = acquire_slab(s, n, page, object == NULL, &objects);
    if (!t)
      break;


    available += objects;
    if (!object) {
      c->page = page;
      stat(s, ALLOC_FROM_PARTIAL);
      object = t;
    } else {
      put_cpu_partial(s, page, 0);
      stat(s, CPU_PARTIAL_NODE);
    }
    if (!kmem_cache_has_cpu_partial(s)
      || available > slub_cpu_partial(s) / 2)
      break;
  }
......
  return object;
}
  • acquire_slab会从kmem_cache_node的partial上取下大块内存,将其freelist赋值给t
  • 第一轮循环时,将kmem_cache_cpu的page指向取下的大块内存,返回的object就是t;如果kmem_cache_cpu还有partial的余量,就进行第二轮,再取下一块大块内存,调用put_cpu_partial放到kmem_cache_cpu的partial
  • 如果kmem_cache_node也没有空闲空间,就回到new_slab_objects函数,通过new_slab调用allocate_slab
static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
  struct page *page;
  struct kmem_cache_order_objects oo = s->oo;
  gfp_t alloc_gfp;
  void *start, *p;
  int idx, order;
  bool shuffle;


  flags &= gfp_allowed_mask;
......
  page = alloc_slab_page(s, alloc_gfp, node, oo);
  if (unlikely(!page)) {
    oo = s->min;
    alloc_gfp = flags;
    /*
     * Allocation may have failed due to fragmentation.
     * Try a lower order alloc if possible
     */
    page = alloc_slab_page(s, alloc_gfp, node, oo);
    if (unlikely(!page))
      goto out;
    stat(s, ORDER_FALLBACK);
  }
......
  return page;
}

alloc_slab_page分配页面,也是按照kmem_cache_order_objects的order来,第一次分配不成功说明内存紧张,就使用min版本的kmem_cache_order_objects

页面换出

每个进程的虚拟内存都非常大,但实际物理内存没有那么多,一般只有页面被使用的时候才会放到物理内存,长时间不用的页面会被换出到硬盘上

触发页面换出的情况

  1. 分配内存,发现没有内存,申请页通过get_page_from_freelist,调用链get_page_from_freelist->node_reclaim->__node_reclaim->shrink_node
  2. 内存管理线程kswapd,在系统初始化的时候创建,进入循环,在内存紧张的时候看看哪些页可以换出
/*
 * The background pageout daemon, started as a kernel thread
 * from the init process.
 *
 * This basically trickles out pages so that we have _some_
 * free memory available even if there is no other activity
 * that frees anything up. This is needed for things like routing
 * etc, where we otherwise might have all activity going on in
 * asynchronous contexts that cannot page things out.
 *
 * If there are applications that are active memory-allocators
 * (most normal use), this basically shouldn't matter.
 */
static int kswapd(void *p)
{
  unsigned int alloc_order, reclaim_order;
  unsigned int classzone_idx = MAX_NR_ZONES - 1;
  pg_data_t *pgdat = (pg_data_t*)p;
  struct task_struct *tsk = current;


    for ( ; ; ) {
......
        kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
          classzone_idx);
......
        reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
......
    }
}

kswapd的调用链balance_pgdat->kswapd_shrink_node->shrink_node,两种方式都是以内存节点为单位

shrink_node会调用shrink_node_memcg,有一个循环处理页面的列表

/*
 * This is a basic per-node page freer.  Used by both kswapd and direct reclaim.
 */
static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
            struct scan_control *sc, unsigned long *lru_pages)
{
......
  unsigned long nr[NR_LRU_LISTS];
  enum lru_list lru;
......
  while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
          nr[LRU_INACTIVE_FILE]) {
    unsigned long nr_anon, nr_file, percentage;
    unsigned long nr_scanned;


    for_each_evictable_lru(lru) {
      if (nr[lru]) {
        nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
        nr[lru] -= nr_to_scan;


        nr_reclaimed += shrink_list(lru, nr_to_scan,
                  lruvec, memcg, sc);
      }
    }
......
  }
......

所有的页面都挂在LRU列表,按照页面活跃度进行排序

内存页主要有匿名页和内存映射两类,每类又分active和inactive,会定期更新

enum lru_list {
  LRU_INACTIVE_ANON = LRU_BASE,
  LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
  LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
  LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
  LRU_UNEVICTABLE,
  NR_LRU_LISTS
};


#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)


static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
         struct lruvec *lruvec, struct mem_cgroup *memcg,
         struct scan_control *sc)
{
  if (is_active_lru(lru)) {
    if (inactive_list_is_low(lruvec, is_file_lru(lru),
           memcg, sc, true))
      shrink_active_list(nr_to_scan, lruvec, sc, lru);
    return 0;
  }


  return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);

shrink_list会先缩减活跃页面列表,再缩减不活跃页面列表;对不活跃列表的缩减通过shrink_inactive_list对页面进行回收

  • 对于匿名页,需要分配swap空间,将页的内容写入swap
  • 对于文件映射的页,需要将对页的修改写回文件

25 用户态内存映射

mmap的原理

进程用vm_area_struct描述虚拟地址空间中不同的内存区域,mm_struct中这个链表头变量的名字就叫mmap

struct mm_struct {
  struct vm_area_struct *mmap;    /* list of VMAs */
......
}


struct vm_area_struct {
  /*
   * For areas with an address space and backing store,
   * linkage into the address_space->i_mmap interval tree.
   */
  struct {
    struct rb_node rb;
    unsigned long rb_subtree_last;
  } shared;




  /*
   * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
   * list, after a COW of one of the file pages.  A MAP_SHARED vma
   * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
   * or brk vma (with NULL file) can only be in an anon_vma list.
   */
  struct list_head anon_vma_chain; /* Serialized by mmap_sem &
            * page_table_lock */
  struct anon_vma *anon_vma;  /* Serialized by page_table_lock */




  /* Function pointers to deal with this struct. */
  const struct vm_operations_struct *vm_ops;
  /* Information about our backing store: */
  unsigned long vm_pgoff;    /* Offset (within vm_file) in PAGE_SIZE
             units */
  struct file * vm_file;    /* File we map to (can be NULL). */
  void * vm_private_data;    /* was vm_pte (shared mem) */

内存映射有

  • 物理内存和虚拟内存之间的映射
  • 文件中内容映射到虚拟内存

申请小块内存用brk,大块内存用mmap

  • 堆申请大块内存时,mmap把虚拟内存空间映射到物理内存
  • 进程加载文件时,mmap把虚拟内存空间映射到物理内存,再映射到文件
SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
                unsigned long, prot, unsigned long, flags,
                unsigned long, fd, unsigned long, off)
{
......
        error = sys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
......
}


SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
    unsigned long, prot, unsigned long, flags,
    unsigned long, fd, unsigned long, pgoff)
{
  struct file *file = NULL;
......
  file = fget(fd);
......
  retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
  return retval;
}

如果需要映射文件,fd会传进来一个文件描述符,mmap_pgoff中通过fget函数根据文件描述符获得代表打开文件的struct file

之后调用链为vm_mmap_pgoff->do_mmap_pgoff->do_mmap,工作有

  • get_unmapped_area找到一个没有映射的区域
  • mmap_region映射到这个区域
unsigned long
get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
    unsigned long pgoff, unsigned long flags)
{
  unsigned long (*get_area)(struct file *, unsigned long,
          unsigned long, unsigned long, unsigned long);
......
  get_area = current->mm->get_unmapped_area;
  if (file) {
    if (file->f_op->get_unmapped_area)
      get_area = file->f_op->get_unmapped_area;
  } 
......
}

get_unmapped_area会调用find_vma_prev,在虚拟内存的红黑树vm_area_struct找到对应的位置,因为虚拟内存还没建立,所以是找到前一个vm_area_struct

如果不是匿名映射而是映射文件,就根据struct file中file_operations定义的操作,调用其get_unmapped_area(例如thp_get_unmapped_area),最终还是调用mm_struct的get_unmapped_area

mmap_region中映射的过程

unsigned long mmap_region(struct file *file, unsigned long addr,
    unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
    struct list_head *uf)
{
  struct mm_struct *mm = current->mm;
  struct vm_area_struct *vma, *prev;
  struct rb_node **rb_link, *rb_parent;


  /*
   * Can we just expand an old mapping?
   */
  vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
      NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
  if (vma)
    goto out;


  /*
   * Determine the object being mapped and call the appropriate
   * specific mapper. the address has already been validated, but
   * not unmapped, but the maps are removed from the list.
   */
  vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
  if (!vma) {
    error = -ENOMEM;
    goto unacct_error;
  }


  vma->vm_mm = mm;
  vma->vm_start = addr;
  vma->vm_end = addr + len;
  vma->vm_flags = vm_flags;
  vma->vm_page_prot = vm_get_page_prot(vm_flags);
  vma->vm_pgoff = pgoff;
  INIT_LIST_HEAD(&vma->anon_vma_chain);


  if (file) {
    vma->vm_file = get_file(file);
    error = call_mmap(file, vma);
    addr = vma->vm_start;
    vm_flags = vma->vm_flags;
  } 
......
  vma_link(mm, vma, prev, rb_link, rb_parent);
  return addr;
.....

因为已经找到了前一个vm_area_struct,流程为

  • 首先看能否在原有vm_area_struct基础上扩展,调用vma_merge尝试合并
  • 不能合并就调用kmem_cache_zalloc在Slub中新建一个vm_area_struct对象,设置起始和结束位置,将其加入队列
  • 如果是映射文件,则设置vm_file,调用call_mmap(也就是调用file_operations的mmap,ext4对应ext4_file_mmap),将vm_area_struct的内存操作设置为文件系统的操作,这样读写内存就是读写文件
  • 新建的vm_area_struct通过vma_link挂到mm_struct,并调用__vma_link_file建立文件到内存的反向关系,把vm_area_struct挂到struct file的address_space结构的i_mmap红黑树上

用户态缺页异常

每个进程都有独立的进程页表,页表的最顶级pgd存放在task_struct中的mm_struct的pgd中

当创建新的进程的时候,会调用fork,对内存部分调用copy_mm,其中会调用dup_mm

/*
 * Allocate a new mm structure and copy contents from the
 * mm structure of the passed in task structure.
 */
static struct mm_struct *dup_mm(struct task_struct *tsk)
{
  struct mm_struct *mm, *oldmm = current->mm;
  mm = allocate_mm();
  memcpy(mm, oldmm, sizeof(*mm));
  if (!mm_init(mm, tsk, mm->user_ns))
    goto fail_nomem;
  err = dup_mmap(mm, oldmm);
  return mm;
}

创建mm_struct,调用mm_init初始化,会调用mm_alloc_pgd分配全局目录项

static inline int mm_alloc_pgd(struct mm_struct *mm)
{
  mm->pgd = pgd_alloc(mm);
  return 0;
}

pgd_alloc除了分配pgd,还调用了pgd_ctor

static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
  /* If the pgd points to a shared pagetable level (either the
     ptes in non-PAE, or shared PMD in PAE), then just copy the
     references from swapper_pg_dir. */
  if (CONFIG_PGTABLE_LEVELS == 2 ||
      (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
      CONFIG_PGTABLE_LEVELS >= 4) {
    clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
        swapper_pg_dir + KERNEL_PGD_BOUNDARY,
        KERNEL_PGD_PTRS);
  }
......
}

这里拷贝了swapper_pg_dir内核页表的最顶级全局页目录

进程fork完,得到自己的pgd和拷贝来的内核页表,但此时还没有建立用户态的映射,要等进程被调度到CPU上运行、真正访问内存时才会建立

调用context_switch进行上下文切换,调用switch_mm_irqs_off完成内存方面切换,会调用load_new_mm_cr3

cr3是一个CPU寄存器,指向当前进程的顶级pgd。CPU访问进程的虚拟内存时,会从cr3获取pgd的物理地址,再根据页表把虚拟地址解析为物理地址,进而访问内存数据。cr3中存放的必须是物理地址,所以load_new_mm_cr3会用__pa将mm_struct中pgd的虚拟地址转为物理地址

虚拟内存在使用的时候,没有对应的物理页,就会触发缺页中断调用do_page_fault

dotraplinkage void notrace
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
  unsigned long address = read_cr2(); /* Get the faulting address */
......
  __do_page_fault(regs, error_code, address);
......
}


/*
 * This routine handles page faults.  It determines the address,
 * and the problem, and then passes it off to one of the appropriate
 * routines.
 */
static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long error_code,
    unsigned long address)
{
  struct vm_area_struct *vma;
  struct task_struct *tsk;
  struct mm_struct *mm;
  tsk = current;
  mm = tsk->mm;


  if (unlikely(fault_in_kernel_space(address))) {
    if (vmalloc_fault(address) >= 0)
      return;
  }
......
  vma = find_vma(mm, address);
......
  fault = handle_mm_fault(vma, address, flags);
......

__do_page_fault会判断页中断发生的位置

  • 内核调用vmalloc_fault
  • 用户空间调用handle_mm_fault
static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
    unsigned int flags)
{
  struct vm_fault vmf = {
    .vma = vma,
    .address = address & PAGE_MASK,
    .flags = flags,
    .pgoff = linear_page_index(vma, address),
    .gfp_mask = __get_fault_gfp_mask(vma),
  };
  struct mm_struct *mm = vma->vm_mm;
  pgd_t *pgd;
  p4d_t *p4d;
  int ret;


  pgd = pgd_offset(mm, address);
  p4d = p4d_alloc(mm, pgd, address);
......
  vmf.pud = pud_alloc(mm, p4d, address);
......
  vmf.pmd = pmd_alloc(mm, vmf.pud, address);
......
  return handle_pte_fault(&vmf);
}

会调用pud_alloc和pmd_alloc来创建对应的页目录项,最后调用handle_pte_fault创建页表项

获取物理内存使用handle_pte_fault

static int handle_pte_fault(struct vm_fault *vmf)
{
  pte_t entry;
......
  vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
  vmf->orig_pte = *vmf->pte;
......
  if (!vmf->pte) {
    if (vma_is_anonymous(vmf->vma))
      return do_anonymous_page(vmf);
    else
      return do_fault(vmf);
  }


  if (!pte_present(vmf->orig_pte))
    return do_swap_page(vmf);
......
}

这里区分了3种情况

PTE页表项不存在有两种

  1. 匿名页,映射物理内存,调用do_anonymous_page
  2. 映射文件,调用do_fault

PTE页表项存在但标记为不在内存(pte_present为假),说明该页曾被换出到硬盘,需要换回来,调用do_swap_page

第一种情况,do_anonymous_page会先通过pte_alloc分配页表项,再通过alloc_zeroed_user_highpage_movable分配一个清零的物理页(内部调用alloc_pages_vma,最终走到伙伴系统的__alloc_pages_nodemask),然后调用mk_pte生成指向新物理页的页表项,最后用set_pte_at将页表项放入页表


static int do_anonymous_page(struct vm_fault *vmf)
{
  struct vm_area_struct *vma = vmf->vma;
  struct mem_cgroup *memcg;
  struct page *page;
  int ret = 0;
  pte_t entry;
......
  if (pte_alloc(vma->vm_mm, vmf->pmd, vmf->address))
    return VM_FAULT_OOM;
......
  page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
......
  entry = mk_pte(page, vma->vm_page_prot);
  if (vma->vm_flags & VM_WRITE)
    entry = pte_mkwrite(pte_mkdirty(entry));


  vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
      &vmf->ptl);
......
  set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
......
}

第二种情况,do_fault最后调用__do_fault

static int __do_fault(struct vm_fault *vmf)
{
  struct vm_area_struct *vma = vmf->vma;
  int ret;
......
  ret = vma->vm_ops->fault(vmf);
......
  return ret;
}

这里调用struct vm_operations_struct vm_ops的fault函数,对于ext4文件系统,vm_ops指向了ext4_file_vm_ops,调用了ext4_filemap_fault

static const struct vm_operations_struct ext4_file_vm_ops = {
  .fault    = ext4_filemap_fault,
  .map_pages  = filemap_map_pages,
  .page_mkwrite   = ext4_page_mkwrite,
};


int ext4_filemap_fault(struct vm_fault *vmf)
{
  struct inode *inode = file_inode(vmf->vma->vm_file);
......
  err = filemap_fault(vmf);
......
  return err;
}

filemap_fault先通过find_get_page在页缓存中查找vm_file文件对应的物理页

  • 如果找到了,调用do_async_mmap_readahead,顺带预读一些数据到内存
  • 如果没有找到,跳到no_cached_page,调用page_cache_read从文件读取
int filemap_fault(struct vm_fault *vmf)
{
  int error;
  struct file *file = vmf->vma->vm_file;
  struct address_space *mapping = file->f_mapping;
  struct inode *inode = mapping->host;
  pgoff_t offset = vmf->pgoff;
  struct page *page;
  int ret = 0;
......
  page = find_get_page(mapping, offset);
  if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
    do_async_mmap_readahead(vmf->vma, ra, file, page, offset);
  } else if (!page) {
    goto no_cached_page;
  }
......
  vmf->page = page;
  return ret | VM_FAULT_LOCKED;
no_cached_page:
  error = page_cache_read(file, offset, vmf->gfp_mask);
......
}

page_cache_read会分配一个缓存页,将其加入lru,然后调用address_space_operations的readpage将文件内容读到内存

static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
{
  struct address_space *mapping = file->f_mapping;
  struct page *page;
......
  page = __page_cache_alloc(gfp_mask|__GFP_COLD);
......
  ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & GFP_KERNEL);
......
  ret = mapping->a_ops->readpage(file, page);
......
}

对于ext4来说,address_space_operations中定义的readpage为ext4_readpage

static const struct address_space_operations ext4_aops = {
  .readpage    = ext4_readpage,
  .readpages    = ext4_readpages,
......
};


static int ext4_read_inline_page(struct inode *inode, struct page *page)
{
  void *kaddr;
......
  kaddr = kmap_atomic(page);
  ret = ext4_read_inline_data(inode, kaddr, len, &iloc);
  flush_dcache_page(page);
  kunmap_atomic(kaddr);
......
}

ext4_readpage会调用ext4_read_inline_page进行内存映射

  • ext4_read_inline_page先调用kmap_atomic,将物理页临时映射到内核虚拟地址空间,得到内核中的地址kaddr
  • 通过kaddr这个虚拟地址,调用ext4_read_inline_data把文件内容读入该物理页
  • 读完后调用kunmap_atomic取消临时映射

第三种情况,do_swap_page

int do_swap_page(struct vm_fault *vmf)
{
  struct vm_area_struct *vma = vmf->vma;
  struct page *page, *swapcache;
  struct mem_cgroup *memcg;
  swp_entry_t entry;
  pte_t pte;
......
  entry = pte_to_swp_entry(vmf->orig_pte);
......
  page = lookup_swap_cache(entry);
  if (!page) {
    page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vma,
          vmf->address);
......
  } 
......
  swapcache = page;
......
  pte = mk_pte(page, vma->vm_page_prot);
......
  set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
  vmf->orig_pte = pte;
......
  swap_free(entry);
......
}

do_swap_page先调用lookup_swap_cache查找swap缓存中是否已有该页;如果没有,调用swapin_readahead将swap文件中的内容读到内存,形成内存页。然后通过mk_pte生成页表项,用set_pte_at插入页表,最后调用swap_free释放对应的swap空间

swapin_readahead会调用swap_readpage。当swap以文件形式存在(SWP_FILE)时,同样调用address_space的readpage,说明读取swap文件和读取普通文件的过程是一样的,也会用kmap_atomic做临时映射

int swap_readpage(struct page *page, bool do_poll)
{
  struct bio *bio;
  int ret = 0;
  struct swap_info_struct *sis = page_swap_info(page);
  blk_qc_t qc;
  struct block_device *bdev;
......
  if (sis->flags & SWP_FILE) {
    struct file *swap_file = sis->swap_file;
    struct address_space *mapping = swap_file->f_mapping;
    ret = mapping->a_ops->readpage(swap_file, page);
    return ret;
  }
......
}

用户态缺页异常处理完毕,页表建立映射,就能在虚拟内存空间访问物理内存了

页表一般都很大,而且逐级查询需要多次访存。为了加快虚拟地址到物理地址的转换,硬件在MMU中引入了TLB(快表),专门缓存近期用过的地址映射,访问速度比内存快得多。

有了TLB,流程就变为了先查快表,有映射关系就直接转换,在TLB没有就在内存中查询页表

26 内核态内存映射

内核页表

在系统初始化的时候,就会创建内核页表,在arch/x86/include/asm/pgtable_64.h中有定义swapper_pg_dir

extern pud_t level3_kernel_pgt[512];
extern pud_t level3_ident_pgt[512];
extern pmd_t level2_kernel_pgt[512];
extern pmd_t level2_fixmap_pgt[512];
extern pmd_t level2_ident_pgt[512];
extern pte_t level1_fixmap_pgt[512];
extern pgd_t init_top_pgt[];


#define swapper_pg_dir init_top_pgt

swapper_pg_dir指向内存最顶级目录pgd,同时还有几个页表XXX_ident_pgt对应的是直接映射区,XXX_kernel_pgt对应的是内核代码区,XXX_fixmap_pgt对应的是固定映射区

初始化在arch\x86\kernel\head_64.S

__INITDATA


NEXT_PAGE(init_top_pgt)
  .quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
  .org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
  .quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
  .org    init_top_pgt + PGD_START_KERNEL*8, 0
  /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
  .quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE


NEXT_PAGE(level3_ident_pgt)
  .quad  level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
  .fill  511, 8, 0
NEXT_PAGE(level2_ident_pgt)
  /* Since I easily can, map the first 1G.
   * Don't set NX because code runs from these pages.
   */
  PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)


NEXT_PAGE(level3_kernel_pgt)
  .fill  L3_START_KERNEL,8,0
  /* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
  .quad  level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
  .quad  level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE


NEXT_PAGE(level2_kernel_pgt)
  /*
   * 512 MB kernel mapping. We spend a full page on this pagetable
   * anyway.
   *
   * The kernel code+data+bss must not be bigger than that.
   *
   * (NOTE: at +512MB starts the module area, see MODULES_VADDR.
   *  If you want to increase this then increase MODULES_VADDR
   *  too.)
   */
  PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
    KERNEL_IMAGE_SIZE/PMD_SIZE)


NEXT_PAGE(level2_fixmap_pgt)
  .fill  506,8,0
  .quad  level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
  /* 8MB reserved for vsyscalls + a 2MB hole = 4 + 1 entries */
  .fill  5,8,0


NEXT_PAGE(level1_fixmap_pgt)
  .fill  512,8,0

页表顶级目录init_top_pgt定义在__INITDATA,为一个全局变量,在内存管理没有初始化就能定位到。

init_top_pgt中有三个quad项

  • 第一项指向level3_ident_pgt,直接映射区页表的三级目录,减去__START_KERNEL_map为物理地址
  • 第二项指向level3_ident_pgt
  • 第三项指向level3_kernel_pgt

这样形成的初始内核页表只映射了一小部分内存,内容包括内核代码和相关数据结构

如果是用户态进程页表,会有mm_struct指向进程的顶级pgd,对于内核也定义了mm_struct指向了swapper_pg_dir

struct mm_struct init_mm = {
  .mm_rb    = RB_ROOT,
  .pgd    = swapper_pg_dir,
  .mm_users  = ATOMIC_INIT(2),
  .mm_count  = ATOMIC_INIT(1),
  .mmap_sem  = __RWSEM_INITIALIZER(init_mm.mmap_sem),
  .page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
  .mmlist    = LIST_HEAD_INIT(init_mm.mmlist),
  .user_ns  = &init_user_ns,
  INIT_MM_CONTEXT(init_mm)
};

定义完内核页表,就是初始化内核页表,系统启动的时候通过start_kernel会调用setup_arch

void __init setup_arch(char **cmdline_p)
{
  /*
   * copy kernel address range established so far and switch
   * to the proper swapper page table
   */
  clone_pgd_range(swapper_pg_dir     + KERNEL_PGD_BOUNDARY,
      initial_page_table + KERNEL_PGD_BOUNDARY,
      KERNEL_PGD_PTRS);


  load_cr3(swapper_pg_dir);
  __flush_tlb_all();
......
  init_mm.start_code = (unsigned long) _text;
  init_mm.end_code = (unsigned long) _etext;
  init_mm.end_data = (unsigned long) _edata;
  init_mm.brk = _brk_end;
......
  init_mem_mapping();
......
}

setup_arch中,load_cr3(swapper_pg_dir)说明内核页表要开始起作用了,并且刷新了TLB,初始化init_mm的成员变量,最重要的就是init_mem_mapping最终它会调用kernel_physical_mapping_init

/*
 * Create page table mapping for the physical memory for specific physical
 * addresses. The virtual and physical addresses have to be aligned on PMD level
 * down. It returns the last physical address mapped.
 */
unsigned long __meminit
kernel_physical_mapping_init(unsigned long paddr_start,
           unsigned long paddr_end,
           unsigned long page_size_mask)
{
  unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;


  paddr_last = paddr_end;
  vaddr = (unsigned long)__va(paddr_start);
  vaddr_end = (unsigned long)__va(paddr_end);
  vaddr_start = vaddr;


  for (; vaddr < vaddr_end; vaddr = vaddr_next) {
    pgd_t *pgd = pgd_offset_k(vaddr);
    p4d_t *p4d;


    vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;


    if (pgd_val(*pgd)) {
      p4d = (p4d_t *)pgd_page_vaddr(*pgd);
      paddr_last = phys_p4d_init(p4d, __pa(vaddr),
               __pa(vaddr_end),
               page_size_mask);
      continue;
    }


    p4d = alloc_low_page();
    paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
             page_size_mask);


    p4d_populate(&init_mm, p4d_offset(pgd, vaddr), (pud_t *) p4d);
  }
  __flush_tlb_all();


  return paddr_last;
}

kernel_physical_mapping_init中会通过__va将物理地址转换为虚拟地址,然后创建虚拟地址和物理地址的映射页表

理论上,直接映射区里虚拟地址和物理地址只差一个固定偏移,用__va和__pa做加减法就能完成转换,似乎不需要页表。但CPU硬件要求:在保护模式下访问内存必须经过CR3指向的页表,所以这部分映射也要真实地建立在页表里

而在这个过程之前,使用的early_top_pgt页表

vmalloc和kmap_atomic原理

用户态可以通过malloc分配内存,分配大内存时底层调用的是mmap;在内核态分配一段连续的虚拟内存,则可以使用vmalloc

在虚拟内存有vmalloc区域从VMALLOC_START开始到VMALLOC_END,可以用于映射一段物理内存

/**
 *  vmalloc  -  allocate virtually contiguous memory
 *  @size:    allocation size
 *  Allocate enough pages to cover @size from the page level
 *  allocator and map them into contiguous kernel virtual space.
 *
 *  For tight control over page level allocator and protection flags
 *  use __vmalloc() instead.
 */
void *vmalloc(unsigned long size)
{
  return __vmalloc_node_flags(size, NUMA_NO_NODE,
            GFP_KERNEL);
}


static void *__vmalloc_node(unsigned long size, unsigned long align,
          gfp_t gfp_mask, pgprot_t prot,
          int node, const void *caller)
{
  return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
        gfp_mask, prot, 0, node, caller);
}

内核的临时映射通过kmap_atomic实现

  • 32位下,如果是高端内存页,调用set_pte通过内核页表进行临时映射
  • 64位下没有高端内存,kmap_atomic直接调用page_address
  • 低端内存的page_address最终就是__va直接映射,并不需要真正建立临时页表

内核态缺页异常

kmap_atomic发现没有页表的时候,会直接创建页表进行映射;而vmalloc只分配了内核的虚拟地址,并没有建立映射,所以访问的时候会发生缺页异常,同样会调用do_page_fault


/*
 * 32-bit:
 *
 *   Handle a fault on the vmalloc or module mapping area
 */
static noinline int vmalloc_fault(unsigned long address)
{
  unsigned long pgd_paddr;
  pmd_t *pmd_k;
  pte_t *pte_k;


  /* Make sure we are in vmalloc area: */
  if (!(address >= VMALLOC_START && address < VMALLOC_END))
    return -1;


  /*
   * Synchronize this task's top level page-table
   * with the 'reference' page table.
   *
   * Do _not_ use "current" here. We might be inside
   * an interrupt in the middle of a task switch..
   */
  pgd_paddr = read_cr3_pa();
  pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
  if (!pmd_k)
    return -1;


  pte_k = pte_offset_kernel(pmd_k, address);
  if (!pte_present(*pte_k))
    return -1;


  return 0;
}

对于内存的分配需求,可能来自内核态,也可能来自用户态

  • 对于内核态,kmalloc在分配大内存的时候,以及vmalloc分配不连续物理页的时候,直接使用伙伴系统,分配后转换为虚拟地址,访问的时候需要通过内核页表进行映射。对于kmem_cache以及kmalloc分配小内存,则使用slub分配器,将伙伴系统分配出来的大块内存切成一小块一小块进行分配。kmem_cache和kmalloc分配的部分不会被换出,因为用这两个函数分配的内存多用于保存内核关键的数据结构。内核态中vmalloc分配的部分会被换出,因而当访问的时候,发现不在内存中,就会调用do_page_fault
  • 对于用户态的内存分配,或者直接调用mmap系统调用分配,或者调用malloc。调用malloc的时候,如果分配小的内存,就用sys_brk系统调用;如果分配大的内存,还是用sys_mmap系统调用。正常情况下,用户态的内存都是可以换出的,因而一旦发现内存中不存在,就会调用do_page_fault

27 文件系统

文件系统功能规划

  1. 文件有组织形式,按块存储
  2. 有索引,查找文件分成了多少块,都在什么位置
  3. 经常读取的文件有缓存
  4. 文件以文件夹形式组织
  5. 有哪些文件被哪些进程打开和使用

文件系统相关系统调用

操作文件的例子

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>


int main(int argc, char *argv[])
{


  int fd = -1;
  int ret = 1;
  int buffer = 1024;
  int num = 0;


  if((fd=open("./test", O_RDWR|O_CREAT|O_TRUNC, 0644))==-1)
  {
    printf("Open Error\n");
    exit(1);
  }


  ret = write(fd, &buffer, sizeof(int));
  if( ret < 0)
  {
    printf("write Error\n");
    exit(1);
  }
  printf("write %d byte(s)\n",ret);


  lseek(fd, 0L, SEEK_SET);
  ret= read(fd, &num, sizeof(int));
  if(ret==-1)
  {
    printf("read Error\n");
    exit(1);
  }
  printf("read %d byte(s),the number is %d\n", ret, num);


  close(fd);


  return 0;
}
  • open打开文件,操作系统会创建一个数据结构来表示打开的文件,并且会分配一个文件描述符,用于区分进程打开的多个文件。文件的操作也基于文件描述符
  • write写入数据,第二个参数表示要写入的数据存放位置,第三个参数表示希望写入的字节数
  • lseek重新定位读写位置,第二个参数为重新定位的位置,第三个参数为SEEK_SET起始位置
  • read读取数据,第二个参数为读取来的数据存到指向的空间,第三个参数是希望读取的字节数
  • close关闭文件

open的参数

  • O_CREAT表示当文件不存在,创建一个新文件
  • O_RDWR表示以读写方式打开
  • O_TRUNC表示打开文件后,将文件的长度截断为0

查看文件状态的函数会返回文件状态信息

int stat(const char *pathname, struct stat *statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *pathname, struct stat *statbuf);


struct stat {
  dev_t     st_dev;         /* ID of device containing file */
  ino_t     st_ino;         /* Inode number */
  mode_t    st_mode;        /* File type and mode */
  nlink_t   st_nlink;       /* Number of hard links */
  uid_t     st_uid;         /* User ID of owner */
  gid_t     st_gid;         /* Group ID of owner */
  dev_t     st_rdev;        /* Device ID (if special file) */
  off_t     st_size;        /* Total size, in bytes */
  blksize_t st_blksize;     /* Block size for filesystem I/O */
  blkcnt_t  st_blocks;      /* Number of 512B blocks allocated */
  struct timespec st_atim;  /* Time of last access */
  struct timespec st_mtim;  /* Time of last modification */
  struct timespec st_ctim;  /* Time of last status change */
};
  • stat和lstat是通过文件名查到的状态信息
  • fstat是通过文件描述符

列出一个文件夹下的文件和文件属性的代码

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>


int main(int argc, char *argv[])
{
  struct stat sb;
  DIR *dirp;
  struct dirent *direntp;
  char filename[128];
  if ((dirp = opendir("/root")) == NULL) {
    printf("Open Directory Error\n");
    exit(1);
  }
  while ((direntp = readdir(dirp)) != NULL){
    sprintf(filename, "/root/%s", direntp->d_name);
    if (lstat(filename, &sb) == -1)
    {
      printf("lstat Error\n");
      exit(1);
    }


    printf("name : %s, mode : %o, size : %ld, user id : %d\n", direntp->d_name, sb.st_mode, (long)sb.st_size, sb.st_uid);


  }
  closedir(dirp);


  return 0;
}
  • opendir打开一个目录名所对应的dir目录流,并返回dir目录流的指针,流的定位为dir目录流的第一个条目
  • readdir从dir目录流中读取一个项目,返回的是一个指针,指向的dirent结构体,并且目录流自动指向下一个目录流条目,如果是最后一个则返回NULL
  • closedir关闭dir目录流

总结

  1. 在文件系统上,需要维护文件的严格的格式,要通过mkfs.ext4命令来格式化为严格的格式
  2. 每一个硬盘上保存的文件都要有一个索引,来维护这个文件上的数据块都保存在哪里
  3. 文件通过文件夹组织起来,可以方便用户使用
  4. 为了能够更快读取文件,内存里会分配一块空间作为缓存,让一些数据块放在缓存里面
  5. 在内核中,要有一整套的数据结构来表示打开的文件
  6. 在用户态,每个打开的文件都有一个文件描述符,可以通过各种文件相关的系统调用,操作这个文件描述符

28 磁盘文件系统

inode和块的存储

硬盘也被分成相同大小的单元,称为块(block),大小是扇区的整数倍,默认为4K,格式化的时候可以设定

文件分块存储,就可以是不连续的,文件的存储信息,名称和权限等元数据通过inode来保存

每个文件和文件夹都对应一个inode

struct ext4_inode {
  __le16  i_mode;    /* File mode */
  __le16  i_uid;    /* Low 16 bits of Owner Uid */
  __le32  i_size_lo;  /* Size in bytes */
  __le32  i_atime;  /* Access time */
  __le32  i_ctime;  /* Inode Change time */
  __le32  i_mtime;  /* Modification time */
  __le32  i_dtime;  /* Deletion Time */
  __le16  i_gid;    /* Low 16 bits of Group Id */
  __le16  i_links_count;  /* Links count */
  __le32  i_blocks_lo;  /* Blocks count */
  __le32  i_flags;  /* File flags */
......
  __le32  i_block[EXT4_N_BLOCKS];/* Pointers to blocks */
  __le32  i_generation;  /* File version (for NFS) */
  __le32  i_file_acl_lo;  /* File ACL */
  __le32  i_size_high;
......
};

有文件的

  • i_mode:读写权限
  • i_uid:所属用户
  • i_gid:所属组
  • i_size_lo:大小
  • i_blocks_lo:占用块数量
  • i_atime:最后一次访问时间
  • i_ctime:最近一次修改inode的时间
  • i_mtime:最后一次更改文件的时间

文件在硬盘的存储位置,例如EXT4_N_BLOCKS中定义

#define  EXT4_NDIR_BLOCKS    12
#define  EXT4_IND_BLOCK      EXT4_NDIR_BLOCKS
#define  EXT4_DIND_BLOCK      (EXT4_IND_BLOCK + 1)
#define  EXT4_TIND_BLOCK      (EXT4_DIND_BLOCK + 1)
#define  EXT4_N_BLOCKS      (EXT4_TIND_BLOCK + 1)

在ext2和ext3中,

  • 其中前12项直接保存了块的位置,可以通过i_block[0-11],直接得到保存文件内容的块
  • 如果装不下可以用i_block[12]指向一个数据块的位置,被称为间接块,通过其找到对应的数据块位置
  • 如果文件太大,i_block[13]也会指向一个块,在这个块中保存间接块,同理i_block[14]等等

但是问题是读取效率很慢,需要读取多次磁盘

ext4引入了Extents,通过树状结构存储

通过ext4_extent_header来描述树的节点

struct ext4_extent_header {
  __le16  eh_magic;  /* probably will support different formats */
  __le16  eh_entries;  /* number of valid entries */
  __le16  eh_max;    /* capacity of store in entries */
  __le16  eh_depth;  /* has tree real underlying blocks? */
  __le32  eh_generation;  /* generation of the tree */
};

eh_entries表示这个节点有多少项,项有两种

  • 叶子节点,会指向硬盘上的连续块的地址,对应着ext4_extent
  • 分支节点(索引节点),会指向下一层的分支节点或者叶子节点,对应着ext4_extent_idx

这两项的大小都是12byte

/*
 * This is the extent on-disk structure.
 * It's used at the bottom of the tree.
 */
struct ext4_extent {
  __le32  ee_block;  /* first logical block extent covers */
  __le16  ee_len;    /* number of blocks covered by extent */
  __le16  ee_start_hi;  /* high 16 bits of physical block */
  __le32  ee_start_lo;  /* low 32 bits of physical block */
};
/*
 * This is index on-disk structure.
 * It's used at all the levels except the bottom.
 */
struct ext4_extent_idx {
  __le32  ei_block;  /* index covers logical blocks from 'block' */
  __le32  ei_leaf_lo;  /* pointer to the physical block of the next *
         * level. leaf or next index could be there */
  __le16  ei_leaf_hi;  /* high 16 bits of physical block */
  __u16  ei_unused;
};
  • 如果文件不大,inode里的i_block中可以放下一个ext4_extent_header和4项ext4_extent,此时eh_depth为0,也即inode里面的就是叶子节点,树高度为0
  • 如果文件较大,4个extent放不下,就需要分裂为一棵树,eh_depth>0的为索引节点
  • 除了根节点,其他节点都保存在4k的块中,4k扣除ext4_extent_header的12byte,还有340项的位置,每个extent最大能表示128MB数据,340个为42.5GB,如果再不够就增加树的深度

inode位图和block位图

在文件系统中,有专门的一个4k块来保存inode的位图,每个位为一个inode,如果为1代表被使用

inode中记录了文件的元数据信息,可以根据元数据来找到文件对应使用的block,链路为file->inode->block->data

但要新建文件,首先得知道哪些inode和哪些block是可用的,所以需要引入inode和block的位图(bitmap)机制。大概每1KB或2KB磁盘空间对应一个inode,据此可以算出inode位图的大小;block位图则根据磁盘大小来计算。

当创建文件的时候会调用open,对应系统调用为sys_open

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
  if (force_o_largefile())
    flags |= O_LARGEFILE;


  return do_sys_open(AT_FDCWD, filename, flags, mode);
}

调用链为do_sys_open->do_filp_open->path_openat->do_last->lookup_open,逻辑为根据路径找到所在的文件夹,在该文件夹下创建文件,需要分配一个新的inode

static int lookup_open(struct nameidata *nd, struct path *path,
      struct file *file,
      const struct open_flags *op,
      bool got_write, int *opened)
{
......
  if (!dentry->d_inode && (open_flag & O_CREAT)) {
......
    error = dir_inode->i_op->create(dir_inode, dentry, mode,
            open_flag & O_EXCL);
......
  }
......
}

创建新的inode调用dir_inode,也就是文件夹的inode的create函数


const struct inode_operations ext4_dir_inode_operations = {
  .create    = ext4_create,
  .lookup    = ext4_lookup,
  .link    = ext4_link,
  .unlink    = ext4_unlink,
  .symlink  = ext4_symlink,
  .mkdir    = ext4_mkdir,
  .rmdir    = ext4_rmdir,
  .mknod    = ext4_mknod,
  .tmpfile  = ext4_tmpfile,
  .rename    = ext4_rename2,
  .setattr  = ext4_setattr,
  .getattr  = ext4_getattr,
  .listxattr  = ext4_listxattr,
  .get_acl  = ext4_get_acl,
  .set_acl  = ext4_set_acl,
  .fiemap         = ext4_fiemap,
};

create对应的ext4_create,接下来的调用链为ext4_create->ext4_new_inode_start_handle->__ext4_new_inode

struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
             umode_t mode, const struct qstr *qstr,
             __u32 goal, uid_t *owner, __u32 i_flags,
             int handle_type, unsigned int line_no,
             int nblocks)
{
......
inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
......
ino = ext4_find_next_zero_bit((unsigned long *)
                inode_bitmap_bh->b_data,
                EXT4_INODES_PER_GROUP(sb), ino);
......
}

需要做的就是读取inode位图,调用ext4_find_next_zero_bit找到下一个为0的位,即空闲的inode

inode位图是文件系统mount时从磁盘读取的,和超级块一起读入到内存

文件系统的格式

数据块的位图保存在一个4K的块中,共4*1024*8=2^15位,每位代表一个数据块。按每个数据块也是4K计算,一个位图能表示的空间为2^15*4K=2^27字节,即128MB

所以引入了块组(block group),把若干块合并归一个位图管理。块组的数据结构为ext4_group_desc,包含inode位图bg_inode_bitmap_lo、块位图bg_block_bitmap_lo、inode列表bg_inode_table_lo等;块组描述符也会组成一个列表,称为块组描述符表

对整个文件系统描述的为超级块ext4_super_block,存储文件系统有多少inode的s_inodes_count、一共多少块的s_blocks_count_lo,每个块组有多少inode的s_inodes_per_group,每个块组有多少块的s_blocks_per_group

超级块和块组描述符都是全局信息,如果数据丢失,整个文件系统都打不开了,所以这两个部分需要进行备份

  • 默认情况下,超级块和整个块组描述符表在每个块组中都有副本
  • 开启了sparse_super特性后,副本只保存在指定的块组中

Meta Block Groups特性中,块组描述符表不会保存所有块组的,而是将块组分成多个组,被称为元块组,一个元块组包含64个块组

每个元块组包含64个块组,块组描述符表也是64项,备份三份,在元块组的第一个,第二个和最后一个块组的开始处

这样就能发挥ext4的48位块寻址优势了:在超级块ext4_super_block中,块数的低位和高位各占32位,实际有效的是48位,2^48个4K的块共可寻址1EB

目录存储格式

目录本身也是文件,也有inode,inode也会指向一些块。普通文件的块保存文件数据,目录文件的块保存的是目录中的文件信息,即ext4_dir_entry,它有两个版本,但差别不大


struct ext4_dir_entry {
  __le32  inode;      /* Inode number */
  __le16  rec_len;    /* Directory entry length */
  __le16  name_len;    /* Name length */
  char  name[EXT4_NAME_LEN];  /* File name */
};
struct ext4_dir_entry_2 {
  __le32  inode;      /* Inode number */
  __le16  rec_len;    /* Directory entry length */
  __u8  name_len;    /* Name length */
  __u8  file_type;
  char  name[EXT4_NAME_LEN];  /* File name */
};

每个项会保存目录的下一级文件的文件名和对应inode,第一项为".",第二项为"..",然后是一项项文件。

如果文件很多的话查找就很慢了,所以有了索引模式,如果设置了inode的EXT4_INDEX_FL,文件的块的组织形式就变为


struct dx_root
{
  struct fake_dirent dot;
  char dot_name[4];
  struct fake_dirent dotdot;
  char dotdot_name[4];
  struct dx_root_info
  {
    __le32 reserved_zero;
    u8 hash_version;
    u8 info_length; /* 8 */
    u8 indirect_levels;
    u8 unused_flags;
  }
  info;
  struct dx_entry  entries[0];
};

第一第二项不变,接下来就是dx_root_info结构,indirect_levels为索引层数,dx_entry为文件名的hash值和数据块的映射关系

struct dx_entry
{
  __le32 hash;
  __le32 block;
};

如果要查找一个目录下的文件名,可以通过名称的hash,如果hash可以匹配就打开这个块,如果块不是索引,而是索引树的叶子节点,就是ext4_dir_entry_2,在其中一项一项找即可

软链接和硬链接的存储格式

  • 硬链接和原文件使用相同的inode,但是inode不跨文件系统,所以硬链接不能跨文件系统
  • 软链接是一个新的文件,有自己的inode,文件内容保存的是目标文件的路径,打开时沿路径找到目标文件,因此可以跨文件系统

总结一下文件系统存储

29 虚拟文件系统

进程要往文件系统写数据需要多个组件合作

  • 进程通过系统调用sys_open、sys_read和sys_write
  • 内核中为每个进程打开需要的文件,并维护一定数据结构
  • 内核中系统打开的文件,也需要维护一定数据结构
  • Linux支持多种文件系统,统一提供了通用的虚拟文件系统接口
  • 对接真正文件系统
  • 块设备的IO层
  • 块设备的读写缓存层
  • 块设备的驱动程序

文件挂载

文件系统需要挂载到操作系统才能操作。

只有在内核注册过的文件系统才能被挂载,以ext4的注册为例

register_filesystem(&ext4_fs_type);


static struct file_system_type ext4_fs_type = {
  .owner    = THIS_MODULE,
  .name    = "ext4",
  .mount    = ext4_mount,
  .kill_sb  = kill_block_super,
  .fs_flags  = FS_REQUIRES_DEV,
};

mount的系统调用定义

SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
    char __user *, type, unsigned long, flags, void __user *, data)
{
......
  ret = do_mount(kernel_dev, dir_name, kernel_type, flags, options);
......
}

接下来的调用链为do_mount->do_new_mount->vfs_kern_mount

struct vfsmount *
vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
{
......
  mnt = alloc_vfsmnt(name);
......
  root = mount_fs(type, flags, name, data);
......
  mnt->mnt.mnt_root = root;
  mnt->mnt.mnt_sb = root->d_sb;
  mnt->mnt_mountpoint = mnt->mnt.mnt_root;
  mnt->mnt_parent = mnt;
  list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
  return &mnt->mnt;
}

vfs_kern_mount先创建struct mount结构

struct mount {
  struct hlist_node mnt_hash;
  struct mount *mnt_parent;
  struct dentry *mnt_mountpoint;
  struct vfsmount mnt;
  union {
    struct rcu_head mnt_rcu;
    struct llist_node mnt_llist;
  };
  struct list_head mnt_mounts;  /* list of children, anchored here */
  struct list_head mnt_child;  /* and going through their mnt_child */
  struct list_head mnt_instance;  /* mount instance on sb->s_mounts */
  const char *mnt_devname;  /* Name of device e.g. /dev/dsk/hda1 */
  struct list_head mnt_list;
......
} __randomize_layout;


struct vfsmount {
  struct dentry *mnt_root;  /* root of the mounted tree */
  struct super_block *mnt_sb;  /* pointer to superblock */
  int mnt_flags;
} __randomize_layout;
  • mnt_parent为装载点所在的父文件系统
  • mnt_mountpoint为装载点在父文件系统的dentry,即和目录的inode关联
  • mnt_root为当前文件系统的根目录的dentry
  • mnt_sb是指向超级块的指针

调用mount_fs挂载文件系统

struct dentry *
mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
{
  struct dentry *root;
  struct super_block *sb;
......
  root = type->mount(type, flags, name, data);
......
  sb = root->d_sb;
......
}

这里调用ext4_fs_type的mount函数,也就是ext4_mount,把超级块、inode位图、块位图、inode列表等从文件系统读取并加载到内存的数据结构中,之后操作这些数据结构就可以操作文件系统了

对于每个文件和目录都有

  • dentry,用于和inode关联
  • vfsmount,用于和mount关联

下面以根文件系统和/home的挂载为例,展示mount、dentry等结构之间的关系

部分解释

  • 对于/根目录挂载,mount的mount point和mount root都指向了根目录的dentry
  • 对于/home有两个dentry,一个是它在根文件系统中作为目录的dentry(即挂载点),一个是被挂载文件系统自己根目录的dentry;这次挂载的mount结构中,mount point指向挂载点的dentry,mount root指向被挂载文件系统根目录的dentry

打开文件

打开文件使用的系统调用sys_open

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
......
  return do_sys_open(AT_FDCWD, filename, flags, mode);
}


long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
......
  fd = get_unused_fd_flags(flags);
  if (fd >= 0) {
    struct file *f = do_filp_open(dfd, tmp, &op);
    if (IS_ERR(f)) {
      put_unused_fd(fd);
      fd = PTR_ERR(f);
    } else {
      fsnotify_open(f);
      fd_install(fd, f);
    }
  }
  putname(tmp);
  return fd;
}

打开文件先调用get_unused_fd_flags得到一个没有使用的文件描述符。文件描述符保存在进程task_struct的files_struct成员中

struct files_struct    *files;

其中有文件描述符数组fd_array

struct files_struct {
......
  struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};

每个进程默认打开三个文件:0为stdin,1为stdout,2为stderr。再打开文件就会从数组中分配空闲的位置,每打开一个文件都对应一个struct file

do_sys_open中调用do_filp_open,就会创建struct file,然后通过fd_install(fd, f)将文件描述符和其关联起来

struct file *do_filp_open(int dfd, struct filename *pathname,
    const struct open_flags *op)
{
......
  set_nameidata(&nd, dfd, pathname);
  filp = path_openat(&nd, op, flags | LOOKUP_RCU);
......
  restore_nameidata();
  return filp;
}

do_filp_open里边会首先初始化struct nameidata,会有关键的成员变量path

struct path {
  struct vfsmount *mnt;
  struct dentry *dentry;
} __randomize_layout;

接下来调用path_openat

  • get_empty_filp生成struct file结构
  • path_init初始化struct nameidata结构
  • link_path_walk对路径名逐层查找
  • do_last获取文件的inode,初始化file对象
static struct file *path_openat(struct nameidata *nd,
      const struct open_flags *op, unsigned flags)
{
......
  file = get_empty_filp();
......
  s = path_init(nd, flags);
......
  while (!(error = link_path_walk(s, nd)) &&
    (error = do_last(nd, file, op, &opened)) > 0) {
......
  }
  terminate_walk(nd);
......
  return file;
}

例如文件"/root/hello/world/data",link_path_walk会解析前面的路径部分"/root/hello/world",解析完毕的时候nameidata的dentry为路径名的最后一部分的父目录"/root/hello/world",而"nameidata"的"filename"为路径名的最后一部分"data",最后交由do_last

static int do_last(struct nameidata *nd,
       struct file *file, const struct open_flags *op,
       int *opened)
{
......
  error = lookup_fast(nd, &path, &inode, &seq);
......
    error = lookup_open(nd, &path, file, op, got_write, opened);
......
  error = vfs_open(&nd->path, file, current_cred());
......
}

为了查找最后一部分对应的dentry,Linux实现了目录项的高速缓存dentry cache,简称dcache,主要有两部分

  • dentry_hashtable,哈希表,dcache的所有dentry通过d_hash连接到对应的dentry哈希链表
  • s_dentry_lru,未使用的dentry对象链表,dentry通过d_lru指针链入LRU链表

两个列表之间的关系

  • 引用为0,一个在散列表的dentry变成没有引用,就会加入到LRU表
  • 再次被引用,一个在LRU表的dentry再次被引用,则从LRU表取出
  • 分配,当dentry在散列表中没有找到,则从Slub分配一个
  • 过期归还,当LRU表中最长时间没有使用的dentry会释放到Slub分配器
  • 文件删除,对应的dentry也会释放掉到Slub
  • 结构复用,当需要分配一个dentry而无法分配新的时,会从LRU表中取出最久未使用的dentry复用

do_last在查找的时候会先从缓存中查找,调用的为lookup_fast,如果缓存没有找到,就需要真的到文件系统里去找,lookup_open会创建一个新的dentry,并调用上一级目录的inode的inode_operations的lookup函数,对于ext4来讲,调用的是ext4_lookup,找到对应的inode最后生成新的dentry赋给path变量

static int lookup_open(struct nameidata *nd, struct path *path,
      struct file *file,
      const struct open_flags *op,
      bool got_write, int *opened)
{
    ......
    dentry = d_alloc_parallel(dir, &nd->last, &wq);
    ......
    struct dentry *res = dir_inode->i_op->lookup(dir_inode, dentry,
                   nd->flags);
    ......
    path->dentry = dentry;
  path->mnt = nd->path.mnt;
}




const struct inode_operations ext4_dir_inode_operations = {
  .create    = ext4_create,
  .lookup    = ext4_lookup,
...

do_last()的最后一步是调用vfs_open打开文件


int vfs_open(const struct path *path, struct file *file,
       const struct cred *cred)
{
  struct dentry *dentry = d_real(path->dentry, NULL, file->f_flags, 0);
......
  file->f_path = *path;
  return do_dentry_open(file, d_backing_inode(dentry), NULL, cred);
}


static int do_dentry_open(struct file *f,
        struct inode *inode,
        int (*open)(struct inode *, struct file *),
        const struct cred *cred)
{
......
  f->f_mode = OPEN_FMODE(f->f_flags) | FMODE_LSEEK |
        FMODE_PREAD | FMODE_PWRITE;
  path_get(&f->f_path);
  f->f_inode = inode;
  f->f_mapping = inode->i_mapping;
......
  f->f_op = fops_get(inode->i_fop);
......
  open = f->f_op->open;
......
  error = open(inode, f);
......
  f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
  file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
  return 0;
......
}


const struct file_operations ext4_file_operations = {
......
  .open    = ext4_file_open,
......
};

vfs_open里边核心调用

  • f_op->open,针对ext4就是ext4_file_open
  • 将打开文件的信息读入struct file

整体流程

对于每个进程,打开的文件都有一个文件描述符,files_struct里有文件描述符数组,文件描述符是数组的下标,数组内容指向file结构,file结构中有对应的inode,对文件的操作通过file_operations进行

30 文件缓存

系统调用层和虚拟文件系统层

文件的读写使用的是系统调用read和write,在内核中的定义

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
  struct fd f = fdget_pos(fd);
......
  loff_t pos = file_pos_read(f.file);
  ret = vfs_read(f.file, buf, count, &pos);
......
}


SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
    size_t, count)
{
  struct fd f = fdget_pos(fd);
......
  loff_t pos = file_pos_read(f.file);
    ret = vfs_write(f.file, buf, count, &pos);
......
}
  • read调用vfs_read->__vfs_read
  • write调用vfs_write->__vfs_write

对应代码

ssize_t __vfs_read(struct file *file, char __user *buf, size_t count,
       loff_t *pos)
{
  if (file->f_op->read)
    return file->f_op->read(file, buf, count, pos);
  else if (file->f_op->read_iter)
    return new_sync_read(file, buf, count, pos);
  else
    return -EINVAL;
}


ssize_t __vfs_write(struct file *file, const char __user *p, size_t count,
        loff_t *pos)
{
  if (file->f_op->write)
    return file->f_op->write(file, p, count, pos);
  else if (file->f_op->write_iter)
    return new_sync_write(file, p, count, pos);
  else
    return -EINVAL;
}

每个文件都有struct file结构,file中的file_operations f_op定义了对这种文件的操作,__vfs_read会调用对应文件系统file_operations的read操作,write同理

ext4文件系统层

内核定义了ext4_file_operations

const struct file_operations ext4_file_operations = {
......
  .read_iter  = ext4_file_read_iter,
  .write_iter  = ext4_file_write_iter,
......
}

ext4_file_read_iter会调用generic_file_read_iter,ext4_file_write_iter会调用__generic_file_write_iter


ssize_t
generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
......
    if (iocb->ki_flags & IOCB_DIRECT) {
......
        struct address_space *mapping = file->f_mapping;
......
        retval = mapping->a_ops->direct_IO(iocb, iter);
    }
......
    retval = generic_file_buffered_read(iocb, iter, retval);
}


ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
......
    if (iocb->ki_flags & IOCB_DIRECT) {
......
        written = generic_file_direct_write(iocb, from);
......
    } else {
......
    written = generic_perform_write(file, from, iocb->ki_pos);
......
    }
}

generic_file_read_iter和__generic_file_write_iter逻辑类似,都先区分是否使用缓存

Linux为了改善性能,读写并不总是直接访问磁盘,而是先在内存中的缓存上进行,再批量地从磁盘读取或者写入,由此IO被分为两类

  • 缓存IO,大部分文件系统默认的IO操作都是缓存IO。读操作先检查内核缓冲区中是否有缓存数据,有就直接返回,没有就从磁盘读取,并缓存到页缓存中;写操作则将数据从用户空间复制到内核空间的缓存,何时真正写到磁盘由操作系统决定,除非显式调用sync
  • 直接IO,直接访问磁盘数据

在读的时候,generic_file_read_iter发现设置了IOCB_DIRECT,就会调用address_space的direct_IO函数,直接从磁盘读取数据。address_space在mmap中用于将文件和内存页产生关联

对于缓存IO,同样需要将文件和内存页关联起来。address_space的操作定义在ext4_aops中,direct_IO对应的函数是ext4_direct_IO

static const struct address_space_operations ext4_aops = {
......
  .direct_IO    = ext4_direct_IO,
......
};

同理,在__generic_file_write_iter中设置了IOCB_DIRECT就会调用generic_file_direct_write,它也会调用address_space的direct_IO,将数据直接写入磁盘。ext4_direct_IO会调用__blockdev_direct_IO->do_blockdev_direct_IO,跨过缓存层,直接进入通用块设备层,最后到达设备驱动层。文件系统都是基于块设备的,所以使用的是blockdev相关函数

/*
 * This is a library function for use by filesystem drivers.
 */
static inline ssize_t
do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
          struct block_device *bdev, struct iov_iter *iter,
          get_block_t get_block, dio_iodone_t end_io,
          dio_submit_t submit_io, int flags)
{......}

带缓存如何读写

缓存写入函数

ssize_t generic_perform_write(struct file *file,
        struct iov_iter *i, loff_t pos)
{
  struct address_space *mapping = file->f_mapping;
  const struct address_space_operations *a_ops = mapping->a_ops;
  do {
    struct page *page;
    unsigned long offset;  /* Offset into pagecache page */
    unsigned long bytes;  /* Bytes to write to page */
    status = a_ops->write_begin(file, mapping, pos, bytes, flags,
            &page, &fsdata);
    copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
    flush_dcache_page(page);
    status = a_ops->write_end(file, mapping, pos, bytes, copied,
            page, fsdata);
    pos += copied;
    written += copied;


    balance_dirty_pages_ratelimited(mapping);
  } while (iov_iter_count(i));
}

里边有一个do-while循环,找出这次写入影响的所有页,然后依次写入,每一页的流程为

  • 对每个页,调用address_space的write_begin做准备工作
  • 调用iov_iter_copy_from_user_atomic,将要写入的内容从用户态拷贝到内核态的页中
  • 调用address_space的write_end完成写操作
  • 调用balance_dirty_pages_ratelimited,看脏页是否太多,是否需要写回磁盘

static const struct address_space_operations ext4_aops = {
......
  .write_begin    = ext4_write_begin,
  .write_end    = ext4_write_end,
......
}

write_begin调用的是ext4_write_begin

ext4是日志文件系统,为了防止断电数据丢失,引入了日志(journal)。日志文件系统比非日志文件系统多了一个journal区域。ext4中分两部分存储,一部分是文件元数据,另一部分是数据,两者的操作日志也是分开管理的,所以挂载ext4时可以选择日志模式

  • journal模式,必须等元数据和数据的日志都落盘之后才能写数据,性能最差
  • ordered模式,默认模式,不记录数据的日志,只记录元数据的日志,但在写元数据日志之前必须确保数据已经落盘
  • writeback模式,不记录数据的日志,也不保证数据比元数据先落盘,性能最好

ext4_write_begin主要做两件事

  • ext4_journal_start是在做日志相关的准备工作
  • grab_cache_page_write_begin获取应该写入的缓存页
struct page *grab_cache_page_write_begin(struct address_space *mapping,
          pgoff_t index, unsigned flags)
{
  struct page *page;
  int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT;
  page = pagecache_get_page(mapping, index, fgp_flags,
      mapping_gfp_mask(mapping));
  if (page)
    wait_for_stable_page(page);
  return page;
}

在内核中,缓存以页为单位保存在内存里。被打开文件的struct file中有struct address_space,用于关联文件和内存,保存着与该文件相关的缓存页。查找时根据文件中的偏移量找到对应的页面,内部使用radix tree根据长整型的页号快速定位对应的页对象

struct address_space {
  struct inode    *host;    /* owner: inode, block_device */
  struct radix_tree_root  page_tree;  /* radix tree of all pages */
  spinlock_t    tree_lock;  /* and lock protecting it */
......
}

第一步,pagecache_get_page根据长整型的页号pgoff_t index查找缓存页,找不到时由于FGP_CREAT标志会创建一个新页

第二步,iov_iter_copy_from_user_atomic先调用kmap_atomic把分配好的页面映射到内核虚拟地址,再将用户态数据拷贝到内核态的页面中,最后调用kunmap_atomic解除映射

size_t iov_iter_copy_from_user_atomic(struct page *page,
    struct iov_iter *i, unsigned long offset, size_t bytes)
{
  char *kaddr = kmap_atomic(page), *p = kaddr + offset;
  iterate_all_kinds(i, bytes, v,
    copyin((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
    memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
         v.bv_offset, v.bv_len),
    memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
  )
  kunmap_atomic(kaddr);
  return bytes;
}

第三步,调用ext4_write_end完成写入,其中会调用ext4_journal_stop完成日志的写入,并通过block_write_end->__block_commit_write->mark_buffer_dirty将修改过的缓存标记为脏页

第四步,调用balance_dirty_pages_ratelimited回写脏页


/**
 * balance_dirty_pages_ratelimited - balance dirty memory state
 * @mapping: address_space which was dirtied
 *
 * Processes which are dirtying memory should call in here once for each page
 * which was newly dirtied.  The function will periodically check the system's
 * dirty state and will initiate writeback if needed.
  */
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
  struct inode *inode = mapping->host;
  struct backing_dev_info *bdi = inode_to_bdi(inode);
  struct bdi_writeback *wb = NULL;
  int ratelimit;
......
  if (unlikely(current->nr_dirtied >= ratelimit))
    balance_dirty_pages(mapping, wb, current->nr_dirtied);
......
}

balance_dirty_pages_ratelimited会检测脏页的数量是否超过规定,如果超过就调用balance_dirty_pages->wb_start_background_writeback启动线程回写

void wb_start_background_writeback(struct bdi_writeback *wb)
{
  /*
   * We just wake up the flusher thread. It will perform background
   * writeback as soon as there is no other work to do.
   */
  wb_wakeup(wb);
}


static void wb_wakeup(struct bdi_writeback *wb)
{
  spin_lock_bh(&wb->work_lock);
  if (test_bit(WB_registered, &wb->state))
    mod_delayed_work(bdi_wq, &wb->dwork, 0);
  spin_unlock_bh(&wb->work_lock);
}




/* bdi_wq serves all asynchronous writeback tasks */
struct workqueue_struct *bdi_wq;


/**
 * mod_delayed_work - modify delay of or queue a delayed work
 * @wq: workqueue to use
 * @dwork: work to queue
 * @delay: number of jiffies to wait before queueing
 *
 * mod_delayed_work_on() on local CPU.
 */
static inline bool mod_delayed_work(struct workqueue_struct *wq,
            struct delayed_work *dwork,
            unsigned long delay)
{
  return mod_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
}

bdi_wq是全局变量,所有回写任务都挂在这个队列上。mod_delayed_work函数负责将回写任务bdi_writeback挂到队列上,bdi_writeback中有成员变量struct delayed_work dwork,如果设置的delay为0,就会立刻执行

执行回写的是bdi,即backing device info。块设备初始化的时候会调用bdi_init初始化backing_dev_info结构,这个过程中会调用wb_init初始化bdi_writeback

static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
       int blkcg_id, gfp_t gfp)
{
  wb->bdi = bdi;
  wb->last_old_flush = jiffies;
  INIT_LIST_HEAD(&wb->b_dirty);
  INIT_LIST_HEAD(&wb->b_io);
  INIT_LIST_HEAD(&wb->b_more_io);
  INIT_LIST_HEAD(&wb->b_dirty_time);
  wb->bw_time_stamp = jiffies;
  wb->balanced_dirty_ratelimit = INIT_BW;
  wb->dirty_ratelimit = INIT_BW;
  wb->write_bandwidth = INIT_BW;
  wb->avg_write_bandwidth = INIT_BW;
  spin_lock_init(&wb->work_lock);
  INIT_LIST_HEAD(&wb->work_list);
  INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
  wb->dirty_sleep = jiffies;
......
}


#define __INIT_DELAYED_WORK(_work, _func, _tflags)      \
  do {                \
    INIT_WORK(&(_work)->work, (_func));      \
    __setup_timer(&(_work)->timer, delayed_work_timer_fn,  \
            (unsigned long)(_work),      \
            (_tflags) | TIMER_IRQSAFE);    \
  } while (0)

这里INIT_DELAYED_WORK会初始化一个timer定时器,定时执行wb_workfn,调用链为wb_workfn->wb_do_writeback->wb_writeback->writeback_sb_inodes->__writeback_single_inode->do_writepages

触发回写的场景有以下几种

  • write的时候脏页超过规定
  • 用户主动调用sync,会调用wakeup_flusher_threads
  • 内存紧张无法分配内存页,会调用free_more_memory最后调用到wakeup_flusher_threads
  • timer定时回写

带缓存的读操作

带缓存的读调用的是generic_file_buffered_read

static ssize_t generic_file_buffered_read(struct kiocb *iocb,
    struct iov_iter *iter, ssize_t written)
{
  struct file *filp = iocb->ki_filp;
  struct address_space *mapping = filp->f_mapping;
  struct inode *inode = mapping->host;
  for (;;) {
    struct page *page;
    pgoff_t end_index;
    loff_t isize;
    page = find_get_page(mapping, index);
    if (!page) {
      if (iocb->ki_flags & IOCB_NOWAIT)
        goto would_block;
      page_cache_sync_readahead(mapping,
          ra, filp,
          index, last_index - index);
      page = find_get_page(mapping, index);
      if (unlikely(page == NULL))
        goto no_cached_page;
    }
    if (PageReadahead(page)) {
      page_cache_async_readahead(mapping,
          ra, filp, page,
          index, last_index - index);
    }
    /*
     * Ok, we have the page, and it's up-to-date, so
     * now we can copy it to user space...
     */
    ret = copy_page_to_iter(page, offset, nr, iter);
    }
}

读取的逻辑比较简单,主要涉及预读的问题:先在page cache中查找是否已有缓存页,没有就触发预读,在page_cache_sync_readahead中完成。

拿到缓存页后,调用copy_page_to_iter将数据从内核的缓存页拷贝到用户空间

总结

整体流程

  • 系统调用层的read和write
  • vfs层的vfs_read和vfs_write
  • ext4层的ext4_file_read_iter和ext4_file_write_iter
  • 区分直接IO和缓存IO
  • 通用块设备层

第六部分 输入输出系统

31 输入与输出

输入输出系统有很多,鼠标、键盘、显示器、网卡、硬盘、打印机、CD等

用设备控制器屏蔽设备差异

操作系统中,CPU并不直接操作设备,操作的是设备控制器,比如磁盘控制器、视频控制器等等

输入输出设备被分为两类:

  • 块设备,将数据存储在固定大小的块中,每个块都有自己的地址,例如硬盘
  • 字符设备,发送和接收字节流

块设备传输的数据量一般比较大,控制器中可能有缓冲区

CPU与控制器寄存器和数据缓冲区进行通信的方式

  • 每个控制器寄存器都分配一个IO端口,可以通过特殊的汇编指令操作寄存器
  • 数据缓冲区可以使用内存映射IO,分配一段内存空间给它,就可以像读写内存一样读写数据缓冲区,内核中的ioremap就是做这个的

控制器的寄存器会有状态标志位,通过检测状态标志位来判断输入和输出是否完成,有两种检测方式

  • 轮询等待,CPU不停地查询状态位,直到操作完成为止
  • 中断,通知系统的输入输出操作完成

为了响应中断,就有一个硬件中断控制器,设备将中断发给中断控制器,由中断控制器通知CPU,CPU接收到中断停下当前操作,通过do_IRQ()来处理中断

但是对于磁盘这类设备,有DMA功能,允许设备在CPU不参与的情况下完成内存读写。CPU只需要对DMA控制器下指令,说明读取多少数据、放到内存的哪个位置,DMA控制器就会把命令发给磁盘控制器,将磁盘数据读取到内存指定位置;传输完成后DMA控制器发中断通知CPU,CPU就可以直接操作数据了

用驱动程序屏蔽设备控制器差异

驱动程序是操作系统的一部分,操作系统操作驱动就像调用本地代码一样

用驱动程序屏蔽设备控制器差异,所有的驱动程序都会按照相同的规则,实现同样的方法

处理中断的一般流程是,一个设备驱动程序初始化的时候,要先注册一个该设备的中断处理函数,中断的时候,触发的函数是do_IRQ中断处理的统一入口。函数里面,可以找到设备驱动程序注册的中断处理函数 Handler,然后执行它进行中断处理

当中断返回的那一刻是进程切换的时机

对于块设备,在驱动程序上,文件系统之下,有通用设备层

用文件系统接口屏蔽驱动程序的差异

设备会统一在/dev/下创建特殊设备文件,也有对应的inode,但是不会关联到硬盘或者其他存储介质,而是关联到设备驱动

$ ls -l
crw------- 1 root root      5,   1 Dec 14 19:53 console
crw-r----- 1 root kmem      1,   1 Dec 14 19:53 mem
crw-rw-rw- 1 root root      1,   3 Dec 14 19:53 null
crw-r----- 1 root kmem      1,   4 Dec 14 19:53 port
crw-rw-rw- 1 root root      1,   8 Dec 14 19:53 random
crw--w---- 1 root tty       4,   0 Dec 14 19:53 tty0
crw--w---- 1 root tty       4,   1 Dec 14 19:53 tty1
crw-rw-rw- 1 root root      1,   9 Dec 14 19:53 urandom
brw-rw---- 1 root disk    253,   0 Dec 31 19:18 vda
brw-rw---- 1 root disk    253,   1 Dec 31 19:19 vda1
brw-rw---- 1 root disk    253,  16 Dec 14 19:53 vdb
brw-rw---- 1 root disk    253,  32 Jan  2 11:24 vdc
crw-rw-rw- 1 root root      1,   5 Dec 14 19:53 zero

c开头为字符设备,b开头为块设备。中间的两个号,前面是主设备号,后面是次设备号。主设备号用于定位设备驱动程序,次设备号作为参数传给驱动程序,用于选择相应的单元

可以直接通过文件操作命令来操作设备

$ cat /dev/urandom | od -x

驱动程序以内核模块的形式加载进Linux,模块代码需要按照操作系统规定的标准接口编写

可以通过lsmod查看

# lsmod
Module                  Size  Used by
iptable_filter         12810  1
bridge                146976  1 br_netfilter
vfat                   17461  0
fat                    65950  1 vfat
ext4                  571716  1
cirrus                 24383  1
crct10dif_pclmul       14307  0
crct10dif_common       12595  1 crct10dif_pclmul

如果模块没有加载,可以使用insmod命令安装,内核模块文件一般以.ko为后缀

insmod openvswitch.ko

有了驱动,就可以通过命令mknod在/dev目录下创建设备文件

mknod filename type major minor

type为c和b,major为主设备号,minor为次设备号

设备管理在/sys下

  • /sys/devices是内核对系统中所有设备的分层次的表示
  • /sys/dev目录下一个char文件夹,一个block文件夹,分别维护一个按字符设备和块设备的主次号码 (major:minor) 链接到真实的设备 (/sys/devices下) 的符号链接文件
  • /sys/block 是系统中当前所有的块设备
  • /sys/module 有系统中所有模块的信息

有了sysfs,还有一个用于维护设备文件的守护进程udev。当新设备插入系统时,内核会检测到并创建内核对象kobject,这个对象通过sysfs文件系统展示到用户层;内核同时会向用户空间发送热插拔消息,udevd监听到这些消息后,在/dev下创建对应的设备文件

操作设备有几种方式

  • 命令行
  • read和write调用
  • ioctl

总结

32 字符设备(上)

简单的字符驱动解析——鼠标,代码在drivers/input/mouse/logibm.c

/*
 * Logitech Bus Mouse Driver for Linux
 */
module_init(logibm_init);
module_exit(logibm_exit);

简单的字符驱动解析——打印机,代码在drivers/char/lp.c

/*
 * Generic parallel printer driver
 */
module_init(lp_init_module);
module_exit(lp_cleanup_module);

内核模块

设备驱动是一个内核模块,一般包含以下几个部分

第一部分,头文件。一般内核模块都需要include下面两个头文件

#include <linux/module.h>
#include <linux/init.h>

第二部分,定义一些函数,用于处理内核模块的主要逻辑,例如打开、关闭、读取和写入设备的函数,或者响应中断的函数

例如logibm.c定义了logibm_openlogibm_close来处理打开和关闭,logibm_interrupt来响应中断

第三部分,定义一个file_operations,因为设备可以通过文件来访问

lp.c中定义

static const struct file_operations lp_fops = {
  .owner    = THIS_MODULE,
  .write    = lp_write,
  .unlocked_ioctl  = lp_ioctl,
#ifdef CONFIG_COMPAT
  .compat_ioctl  = lp_compat_ioctl,
#endif
  .open    = lp_open,
  .release  = lp_release,
#ifdef CONFIG_PARPORT_1284
  .read    = lp_read,
#endif
  .llseek    = noop_llseek,
};

而logibm.c中没有定义这样的结构,因为鼠标属于输入设备的一种,file_operations被统一定义在drivers/input/input.c中,logibm.c只定义自己独有的部分

static const struct file_operations input_devices_fileops = {
  .owner    = THIS_MODULE,
  .open    = input_proc_devices_open,
  .poll    = input_proc_devices_poll,
  .read    = seq_read,
  .llseek    = seq_lseek,
  .release  = seq_release,
};

第四部分,定义整个模块的初始化和退出函数,用于加载和卸载ko使用

例如lp.c就定义了lp_init_modulelp_cleanup_modulelogibm.c就定义了logibm_initlogibm_exit

第五部分,调用module_initmodule_exit,分别指向上面两个初始化函数和退出函数

第六部分,声明license,调用MODULE_LICENSE

打开字符设备

看看打开字符设备的时候,做了哪些特殊的事情

insmod加载内核模块时,会通过module_init调用模块的初始化函数,lp.c的初始化函数lp_init对应的代码如下

static int __init lp_init (void)
{
......
  if (register_chrdev (LP_MAJOR, "lp", &lp_fops)) {
    printk (KERN_ERR "lp: unable to get major %d\n", LP_MAJOR);
    return -EIO;
  }
......
}


int __register_chrdev(unsigned int major, unsigned int baseminor,
          unsigned int count, const char *name,
          const struct file_operations *fops)
{
  struct char_device_struct *cd;
  struct cdev *cdev;
  int err = -ENOMEM;
......
  cd = __register_chrdev_region(major, baseminor, count, name);
  cdev = cdev_alloc();
  cdev->owner = fops->owner;
  cdev->ops = fops;
  kobject_set_name(&cdev->kobj, "%s", name);
  err = cdev_add(cdev, MKDEV(cd->major, baseminor), count);
  cd->cdev = cdev;
  return major ? 0 : cd->major;
}

在字符设备驱动内核模块加载的时候,最重要的就是注册这个字符设备,注册方式是

  • 调用__register_chrdev_region,注册字符设备的主次设备号和名称
  • 调用cdev_alloc分配struct cdev结构,并将字符设备cdev的ops成员变量指向具体设备的ops
  • 调用cdev_add将这个字符设备cdev添加到内核的struct kobj_map *cdev_map中,由这个结构统一管理内核中所有的字符设备;并使用MKDEV(cd->major, baseminor)将设备号dev_t与cdev关联起来,这样通过设备号dev_t就可以在cdev_map中找到设备cdev了

/**
 * cdev_add() - add a char device to the system
 * @p: the cdev structure for the device
 * @dev: the first device number for which this device is responsible
 * @count: the number of consecutive minor numbers corresponding to this
 *         device
 *
 * cdev_add() adds the device represented by @p to the system, making it
 * live immediately.  A negative error code is returned on failure.
 */
int cdev_add(struct cdev *p, dev_t dev, unsigned count)
{
  int error;


  p->dev = dev;
  p->count = count;


  error = kobj_map(cdev_map, dev, count, NULL,
       exact_match, exact_lock, p);
  kobject_get(p->kobj.parent);


  return 0;
}

logibm.c中没有直接注册字符设备,输入设备统一由input.c管理:input.c的初始化函数input_init通过register_chrdev_region注册字符设备,而logibm_init调用input_register_device,将鼠标注册为输入设备

内核模块加载完成后,需要通过mknod在/dev下创建设备文件,之后就可以通过文件系统接口操作设备了

mknod系统调用如下

SYSCALL_DEFINE3(mknod, const char __user *, filename, umode_t, mode, unsigned, dev)
{
  return sys_mknodat(AT_FDCWD, filename, mode, dev);
}


SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
    unsigned, dev)
{
  struct dentry *dentry;
  struct path path;
......
  dentry = user_path_create(dfd, filename, &path, lookup_flags);
......
  switch (mode & S_IFMT) {
......
    case S_IFCHR: case S_IFBLK:
      error = vfs_mknod(path.dentry->d_inode,dentry,mode,
          new_decode_dev(dev));
      break;
......
  }
}

mknodat会在文件系统中为新设备文件创建一个dentry,用于维护文件名和inode之间的关联

如果是字符设备文件S_IFCHR或者块设备文件S_IFBLK,就调用vfs_mknod

int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
{
......
  error = dir->i_op->mknod(dir, dentry, mode, dev);
......
}

会调用文件系统的inode_operations,而/dev目录使用的文件系统是devtmpfs

static struct dentry *dev_mount(struct file_system_type *fs_type, int flags,
          const char *dev_name, void *data)
{
#ifdef CONFIG_TMPFS
  return mount_single(fs_type, flags, data, shmem_fill_super);
#else
  return mount_single(fs_type, flags, data, ramfs_fill_super);
#endif
}


static struct file_system_type dev_fs_type = {
  .name = "devtmpfs",
  .mount = dev_mount,
  .kill_sb = kill_litter_super,
};

devtmpfs挂载的时候有两种方式:ramfs和shmem,都是基于内存的文件系统

static const struct inode_operations ramfs_dir_inode_operations = {
......
  .mknod    = ramfs_mknod,
};


static const struct inode_operations shmem_dir_inode_operations = {
#ifdef CONFIG_TMPFS
......
  .mknod    = shmem_mknod,
};

两个mknod的实现最后都会指向init_special_inode

void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
{
  inode->i_mode = mode;
  if (S_ISCHR(mode)) {
    inode->i_fop = &def_chr_fops;
    inode->i_rdev = rdev;
  } else if (S_ISBLK(mode)) {
    inode->i_fop = &def_blk_fops;
    inode->i_rdev = rdev;
  } else if (S_ISFIFO(mode))
    inode->i_fop = &pipefifo_fops;
  else if (S_ISSOCK(mode))
    ;  /* leave it no_open_fops */
}

特殊文件会有特殊的inode,可以关联字符设备、块设备、FIFO文件、Socket等

字符设备的inode的file_operations指向了def_chr_fops,这里只有一个open

const struct file_operations def_chr_fops = {
  .open = chrdev_open,
};


static int chrdev_open(struct inode *inode, struct file *filp)
{
  const struct file_operations *fops;
  struct cdev *p;
  struct cdev *new = NULL;
  int ret = 0;


  p = inode->i_cdev;
  if (!p) {
    struct kobject *kobj;
    int idx;
    kobj = kobj_lookup(cdev_map, inode->i_rdev, &idx);
    new = container_of(kobj, struct cdev, kobj);
    p = inode->i_cdev;
    if (!p) {
      inode->i_cdev = p = new;
      list_add(&inode->i_devices, &p->list);
      new = NULL;
    } 
  } 
......
  fops = fops_get(p->ops);
......
  replace_fops(filp, fops);
  if (filp->f_op->open) {
    ret = filp->f_op->open(inode, filp);
......
  }
......
}

chrdev_open会先判断inode的i_cdev是否已经关联到cdev。第一次打开时还没有关联,需要根据inode中的设备号i_rdev(即dev_t),在cdev_map中找到cdev,然后将cdev中的file_operations(驱动设备自己定义的那些)赋给struct file的file_operations

最后调用驱动程序file_operations中的open函数即可打开设备。对于打印机,调用的是lp_open;对于鼠标,调用的是input_proc_devices_open,最后调用到logibm_open

写入字符设备

写入字符设备,用的就是文件系统的标准接口write,参数为文件描述符fd。在内核中对应sys_write,根据fd得到struct file,再调用vfs_write

ssize_t __vfs_write(struct file *file, const char __user *p, size_t count, loff_t *pos)
{
  if (file->f_op->write)
    return file->f_op->write(file, p, count, pos);
  else if (file->f_op->write_iter)
    return new_sync_write(file, p, count, pos);
  else
    return -EINVAL;
}

__vfs_write会调用struct file中file_operations的write函数,这个函数在打开设备的时候已经指向了驱动程序file_operations的write,最终调用的是lp_write

static ssize_t lp_write(struct file * file, const char __user * buf,
            size_t count, loff_t *ppos)
{
  unsigned int minor = iminor(file_inode(file));
  struct parport *port = lp_table[minor].dev->port;
  char *kbuf = lp_table[minor].lp_buffer;
  ssize_t retv = 0;
  ssize_t written;
  size_t copy_size = count;
......
  /* Need to copy the data from user-space. */
  if (copy_size > LP_BUFFER_SIZE)
    copy_size = LP_BUFFER_SIZE;
......
  if (copy_from_user (kbuf, buf, copy_size)) {
    retv = -EFAULT;
    goto out_unlock;
  }
......
  do {
    /* Write the data. */
    written = parport_write (port, kbuf, copy_size);
    if (written > 0) {
      copy_size -= written;
      count -= written;
      buf  += written;
      retv += written;
    }
......
        if (need_resched())
      schedule ();


    if (count) {
      copy_size = count;
      if (copy_size > LP_BUFFER_SIZE)
        copy_size = LP_BUFFER_SIZE;


      if (copy_from_user(kbuf, buf, copy_size)) {
        if (retv == 0)
          retv = -EFAULT;
        break;
      }
    }  
  } while (count > 0);
......
}

这是个经典的驱动程序写入函数

  • 调用copy_from_user将用户态数据拷贝到内核态缓存
  • 调用parport_write写入外部设备
  • 调用schedule在写入过程中给其他线程抢占CPU的机会

使用IOCTL控制设备

ioctl是一个系统调用

SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
{
  int error;
  struct fd f = fdget(fd);
......
  error = do_vfs_ioctl(f.file, fd, cmd, arg);
  fdput(f);
  return error;
}
  • fd是这个设备文件描述符
  • cmd是传给设备的命令
  • arg是命令的参数

cmd和arg只要驱动开发人员和用户约定好即可

cmd是一个int,其位域组成为

  • 最低八位为NR,是命令号
  • 然后八位是TYPE,是类型
  • 然后十四位是参数的大小
  • 最高两位是DIR,是方向,表示写入、读出,还是读写

还有单独的宏用于组成cmd的值

/*
 * Used to create numbers.
 */
#define _IO(type,nr)    _IOC(_IOC_NONE,(type),(nr),0)
#define _IOR(type,nr,size)  _IOC(_IOC_READ,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOW(type,nr,size)  _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOWR(type,nr,size)  _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))


/* used to decode ioctl numbers.. */
#define _IOC_DIR(nr)    (((nr) >> _IOC_DIRSHIFT) & _IOC_DIRMASK)
#define _IOC_TYPE(nr)    (((nr) >> _IOC_TYPESHIFT) & _IOC_TYPEMASK)
#define _IOC_NR(nr)    (((nr) >> _IOC_NRSHIFT) & _IOC_NRMASK)
#define _IOC_SIZE(nr)    (((nr) >> _IOC_SIZESHIFT) & _IOC_SIZEMASK)
  • 用户程序中,可以通过上面的"Used to create numbers"这些宏,根据参数生成cmd
  • 驱动程序中,可以通过下面的"used to decode ioctl numbers"这些宏,解析cmd后,执行指令

ioctl中会调用do_vfs_ioctl,先处理一批内核预定义的cmd;如果不是预定义的,就执行默认操作:对于普通文件调用file_ioctl,其他文件调用vfs_ioctl

int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
       unsigned long arg)
{
  int error = 0;
  int __user *argp = (int __user *)arg;
  struct inode *inode = file_inode(filp);


  switch (cmd) {
......
  case FIONBIO:
    error = ioctl_fionbio(filp, argp);
    break;


  case FIOASYNC:
    error = ioctl_fioasync(fd, filp, argp);
    break;
......
  case FICLONE:
    return ioctl_file_clone(filp, arg, 0, 0, 0);


  default:
    if (S_ISREG(inode->i_mode))
      error = file_ioctl(filp, cmd, arg);
    else
      error = vfs_ioctl(filp, cmd, arg);
    break;
  }
  return error;
}

vfs_ioctl会调用file_operations中的unlocked_ioctl,对于打印机来说调用的是lp_ioctl

这里通过switch的方式根据cmd执行对应操作


static long lp_ioctl(struct file *file, unsigned int cmd,
      unsigned long arg)
{
  unsigned int minor;
  struct timeval par_timeout;
  int ret;


  minor = iminor(file_inode(file));
  mutex_lock(&lp_mutex);
  switch (cmd) {
......
  default:
    ret = lp_do_ioctl(minor, cmd, arg, (void __user *)arg);
    break;
  }
  mutex_unlock(&lp_mutex);
  return ret;
}


static int lp_do_ioctl(unsigned int minor, unsigned int cmd,
  unsigned long arg, void __user *argp)
{
  int status;
  int retval = 0;


  switch ( cmd ) {
    case LPTIME:
      if (arg > UINT_MAX / HZ)
        return -EINVAL;
      LP_TIME(minor) = arg * HZ/100;
      break;
    case LPCHAR:
      LP_CHAR(minor) = arg;
      break;
    case LPABORT:
      if (arg)
        LP_F(minor) |= LP_ABORT;
      else
        LP_F(minor) &= ~LP_ABORT;
      break;
    case LPABORTOPEN:
      if (arg)
        LP_F(minor) |= LP_ABORTOPEN;
      else
        LP_F(minor) &= ~LP_ABORTOPEN;
      break;
    case LPCAREFUL:
      if (arg)
        LP_F(minor) |= LP_CAREFUL;
      else
        LP_F(minor) &= ~LP_CAREFUL;
      break;
    case LPWAIT:
      LP_WAIT(minor) = arg;
      break;
    case LPSETIRQ: 
      return -EINVAL;
      break;
    case LPGETIRQ:
      if (copy_to_user(argp, &LP_IRQ(minor),
          sizeof(int)))
        return -EFAULT;
      break;
    case LPGETSTATUS:
      if (mutex_lock_interruptible(&lp_table[minor].port_mutex))
        return -EINTR;
      lp_claim_parport_or_block (&lp_table[minor]);
      status = r_str(minor);
      lp_release_parport (&lp_table[minor]);
      mutex_unlock(&lp_table[minor].port_mutex);


      if (copy_to_user(argp, &status, sizeof(int)))
        return -EFAULT;
      break;
    case LPRESET:
      lp_reset(minor);
      break;
     case LPGETFLAGS:
       status = LP_F(minor);
      if (copy_to_user(argp, &status, sizeof(int)))
        return -EFAULT;
      break;
    default:
      retval = -EINVAL;
  }
  return retval;
}

33 字符设备(下)

中断处理函数

鼠标就是通过中断的方式将位置和按键信息传递给驱动程序,下边为打开设备和响应中断

static int logibm_open(struct input_dev *dev)
{
  if (request_irq(logibm_irq, logibm_interrupt, 0, "logibm", NULL)) {
    printk(KERN_ERR "logibm.c: Can't allocate irq %d\n", logibm_irq);
    return -EBUSY;
  }
  outb(LOGIBM_ENABLE_IRQ, LOGIBM_CONTROL_PORT);
  return 0;
}


static irqreturn_t logibm_interrupt(int irq, void *dev_id)
{
  char dx, dy;
  unsigned char buttons;


  outb(LOGIBM_READ_X_LOW, LOGIBM_CONTROL_PORT);
  dx = (inb(LOGIBM_DATA_PORT) & 0xf);
  outb(LOGIBM_READ_X_HIGH, LOGIBM_CONTROL_PORT);
  dx |= (inb(LOGIBM_DATA_PORT) & 0xf) << 4;
  outb(LOGIBM_READ_Y_LOW, LOGIBM_CONTROL_PORT);
  dy = (inb(LOGIBM_DATA_PORT) & 0xf);
  outb(LOGIBM_READ_Y_HIGH, LOGIBM_CONTROL_PORT);
  buttons = inb(LOGIBM_DATA_PORT);
  dy |= (buttons & 0xf) << 4;
  buttons = ~buttons >> 5;


  input_report_rel(logibm_dev, REL_X, dx);
  input_report_rel(logibm_dev, REL_Y, dy);
  input_report_key(logibm_dev, BTN_RIGHT,  buttons & 1);
  input_report_key(logibm_dev, BTN_MIDDLE, buttons & 2);
  input_report_key(logibm_dev, BTN_LEFT,   buttons & 4);
  input_sync(logibm_dev);


  outb(LOGIBM_ENABLE_IRQ, LOGIBM_CONTROL_PORT);
  return IRQ_HANDLED;
}

处理中断也需要一个处理函数

typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);


/**
 * enum irqreturn
 * @IRQ_NONE    interrupt was not from this device or was not handled
 * @IRQ_HANDLED    interrupt was handled by this device
 * @IRQ_WAKE_THREAD  handler requests to wake the handler thread
 */
enum irqreturn {
  IRQ_NONE    = (0 << 0),
  IRQ_HANDLED    = (1 << 0),
  IRQ_WAKE_THREAD    = (1 << 1),
};

  • irq是一个整数,表示中断信号
  • dev_id是一个void *通用指针,主要用于区分同一个中断处理函数服务的不同设备

返回值有三种

  • IRQ_NONE,不是我的中断不归我管
  • IRQ_HANDLED,中断处理完了
  • IRQ_WAKE_THREAD,中断处理完了,还有个进程等待中断,需要进行唤醒

logibm_interrupt这个中断处理函数,先获取了x和y的移动坐标,以及左中右的按键,上报上去,然后返回IRQ_HANDLED表示处理完成

中断信号A正在处理的时候,应该暂时屏蔽中断信号A,防止再来一个同样的中断影响当前的处理。但屏蔽时间不能太长,否则会丢中断,于是中断处理被分为上半部和下半部

  • 上半部只完成最关键的部分,处理完就重新打开中断,允许新的中断进来
  • 下半部处理耗时较长的部分,一般通过工作队列等机制延迟执行

中断注册

有了中断处理函数,需要调用request_irq来进行注册,参数有

  • unsigned int irq是中断信号
  • irq_handler_t handler是中断处理函数
  • unsigned long flags是一些标志位
  • const char *name是设备名称
  • void *dev是通用指针,中断发生时会原样传给中断处理函数的dev_id参数

static inline int __must_check
request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags, const char *name, void *dev)
{
  return request_threaded_irq(irq, handler, NULL, flags, name, dev);
}

request_irq调用的是request_threaded_irq

int request_threaded_irq(unsigned int irq, irq_handler_t handler,
       irq_handler_t thread_fn, unsigned long irqflags,
       const char *devname, void *dev_id)
{
  struct irqaction *action;
  struct irq_desc *desc;
  int retval;
......
  desc = irq_to_desc(irq);
......
  action = kzalloc(sizeof(struct irqaction), GFP_KERNEL);
  action->handler = handler;
  action->thread_fn = thread_fn;
  action->flags = irqflags;
  action->name = devname;
  action->dev_id = dev_id;
......
  retval = __setup_irq(irq, desc, action);
......
}

对于每个中断,都有一个中断描述结构struct irq_desc,其中主要的成员是struct irqaction,表示一个中断处理动作;这个结构有next指针,说明irqaction是挂在链表上的,一个中断可以被多个设备共享

struct irq_desc {
......
  struct irqaction  *action;  /* IRQ action list */
......
  struct module    *owner;
  const char    *name;
};


/**
 * struct irqaction - per interrupt action descriptor
 * @handler:  interrupt handler function
 * @name:  name of the device
 * @dev_id:  cookie to identify the device
 * @percpu_dev_id:  cookie to identify the device
 * @next:  pointer to the next irqaction for shared interrupts
 * @irq:  interrupt number
 * @flags:  flags (see IRQF_* above)
 * @thread_fn:  interrupt handler function for threaded interrupts
 * @thread:  thread pointer for threaded interrupts
 * @secondary:  pointer to secondary irqaction (force threading)
 * @thread_flags:  flags related to @thread
 * @thread_mask:  bitmask for keeping track of @thread activity
 * @dir:  pointer to the proc/irq/NN/name entry
 */
struct irqaction {
  irq_handler_t    handler;
  void      *dev_id;
  void __percpu    *percpu_dev_id;
  struct irqaction  *next;
  irq_handler_t    thread_fn;
  struct task_struct  *thread;
  struct irqaction  *secondary;
  unsigned int    irq;
  unsigned int    flags;
  unsigned long    thread_flags;
  unsigned long    thread_mask;
  const char    *name;
  struct proc_dir_entry  *dir;
};

每一个中断处理动作的结构 struct irqaction,都有以下成员

  • handler中断处理函数
  • void *dev_id为设备id
  • irq为中断信号
  • 如果中断处理函数在单独的线程运行,则有thread_fn是线程的执行函数,thread是线程的task_struct

request_threaded_irq函数中,irq_to_desc根据中断信号查找中断描述结构struct irq_desc。一般情况下所有的struct irq_desc放在一个数组里,直接通过下标查找即可;但如果配置了CONFIG_SPARSE_IRQ,中断号就不是连续的,需要用radix tree来保存

#ifdef CONFIG_SPARSE_IRQ
static RADIX_TREE(irq_desc_tree, GFP_KERNEL);
struct irq_desc *irq_to_desc(unsigned int irq)
{
  return radix_tree_lookup(&irq_desc_tree, irq);
}
#else /* !CONFIG_SPARSE_IRQ */
struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
  [0 ... NR_IRQS-1] = {
  }
};
struct irq_desc *irq_to_desc(unsigned int irq)
{
  return (irq < NR_IRQS) ? irq_desc + irq : NULL;
}
#endif /* !CONFIG_SPARSE_IRQ */

request_threaded_irq分配了一个struct irqaction并初始化,然后调用__setup_irq

static int
__setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
{
  struct irqaction *old, **old_ptr;
  unsigned long flags, thread_mask = 0;
  int ret, nested, shared = 0;
......
  new->irq = irq;
......
  /*
   * Create a handler thread when a thread function is supplied
   * and the interrupt does not nest into another interrupt
   * thread.
   */
  if (new->thread_fn && !nested) {
    ret = setup_irq_thread(new, irq, false);
  }
......
  old_ptr = &desc->action;
  old = *old_ptr;
  if (old) {
    /* add new interrupt at end of irq queue */
    do {
      thread_mask |= old->thread_mask;
      old_ptr = &old->next;
      old = *old_ptr;
    } while (old);
  }
......
  *old_ptr = new;
......
  if (new->thread)
    wake_up_process(new->thread);
......
}


static int
setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
{
  struct task_struct *t;
  struct sched_param param = {
    .sched_priority = MAX_USER_RT_PRIO/2,
  };


  t = kthread_create(irq_thread, new, "irq/%d-%s", irq, new->name);
  sched_setscheduler_nocheck(t, SCHED_FIFO, &param);
  get_task_struct(t);
  new->thread = t;
......
  return 0;
}

新的struct irqaction会被挂到链表的末端。如果设定了在单独的线程中运行中断处理函数,setup_irq_thread就会创建对应的内核线程,wake_up_process会唤醒它

中断处理

硬件中断发生了什么

  1. 外部设备给中断控制器发送物理中断信号
  2. 中断控制器将物理中断信号转换成为中断向量interrupt vector,发给各个CPU
  3. 每个CPU中根据interrupt vector在中断向量表查找IRQ处理函数(不是irq_handler_t
  4. IRQ处理函数中,将interrupt vector转化为抽象中断层的中断信号irq,调用中断信号irq对应的中断描述结构里面的irq_handler_t

CPU收到的中断在arch/x86/include/asm/irq_vectors.h中定义

/*
 * Linux IRQ vector layout.
 *
 * There are 256 IDT entries (per CPU - each entry is 8 bytes) which can
 * be defined by Linux. They are used as a jump table by the CPU when a
 * given vector is triggered - by a CPU-external, CPU-internal or
 * software-triggered event.
 *
 * Linux sets the kernel code address each entry jumps to early during
 * bootup, and never changes them. This is the general layout of the
 * IDT entries:
 *
 *  Vectors   0 ...  31 : system traps and exceptions - hardcoded events
 *  Vectors  32 ... 127 : device interrupts
 *  Vector  128         : legacy int80 syscall interface
 *  Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 except 204 : device interrupts
 *  Vectors INVALIDATE_TLB_VECTOR_START ... 255 : special interrupts
 *
 * 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
 *
 * This file enumerates the exact layout of them:
 */
#define FIRST_EXTERNAL_VECTOR    0x20
#define IA32_SYSCALL_VECTOR    0x80
#define NR_VECTORS       256
#define FIRST_SYSTEM_VECTOR    NR_VECTORS

CPU一共可以处理256个中断向量,用宏NR_VECTORS或者FIRST_SYSTEM_VECTOR表示

CPU硬件要求每个CPU有一个中断向量表,通过load_idt加载,表中记录每个中断对应的处理方法,中断向量表的定义在arch/x86/kernel/traps.c

gate_desc idt_table[NR_VECTORS] __page_aligned_bss;

对于CPU可以处理的中断分为几个部分

第一部分,0~31是系统陷入或者系统异常,这些中断无法屏蔽,必须处理。这些中断的处理函数在系统初始化的时候注册:start_kernel调用trap_init,其中开头调用了大量的set_intr_gate,最终通过_set_gate完成注册


void __init trap_init(void)
{
  int i;
...
  set_intr_gate(X86_TRAP_DE, divide_error);
//各种各样的set_intr_gate,不都贴在这里了,只贴一头一尾
...
  set_intr_gate(X86_TRAP_XF, simd_coprocessor_error);


  /* Reserve all the builtin and the syscall vector: */
  for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
    set_bit(i, used_vectors);


#ifdef CONFIG_X86_32
  set_system_intr_gate(IA32_SYSCALL_VECTOR, entry_INT80_32);
  set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif


  /*
   * Set the IDT descriptor to a fixed read-only location, so that the
   * "sidt" instruction will not leak the location of the kernel, and
   * to defend the IDT against arbitrary memory write vulnerabilities.
   * It will be reloaded in cpu_init() */
  __set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
  idt_descr.address = fix_to_virt(FIX_RO_IDT);
......


static inline void _set_gate(int gate, unsigned type, void *addr,
           unsigned dpl, unsigned ist, unsigned seg)
{
  gate_desc s;
  pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
  write_idt_entry(idt_table, gate, &s);
}

set_intr_gate为每个中断设置了处理函数,放入了中断向量表idt_table

这32个中断在arch/x86/include/asm/traps.h中定义

/* Interrupts/Exceptions */
enum {
  X86_TRAP_DE = 0,  /*  0, Divide-by-zero */
  X86_TRAP_DB,    /*  1, Debug */
  X86_TRAP_NMI,    /*  2, Non-maskable Interrupt */
  X86_TRAP_BP,    /*  3, Breakpoint */
  X86_TRAP_OF,    /*  4, Overflow */
  X86_TRAP_BR,    /*  5, Bound Range Exceeded */
  X86_TRAP_UD,    /*  6, Invalid Opcode */
  X86_TRAP_NM,    /*  7, Device Not Available */
  X86_TRAP_DF,    /*  8, Double Fault */
  X86_TRAP_OLD_MF,  /*  9, Coprocessor Segment Overrun */
  X86_TRAP_TS,    /* 10, Invalid TSS */
  X86_TRAP_NP,    /* 11, Segment Not Present */
  X86_TRAP_SS,    /* 12, Stack Segment Fault */
  X86_TRAP_GP,    /* 13, General Protection Fault */
  X86_TRAP_PF,    /* 14, Page Fault */
  X86_TRAP_SPURIOUS,  /* 15, Spurious Interrupt */
  X86_TRAP_MF,    /* 16, x87 Floating-Point Exception */
  X86_TRAP_AC,    /* 17, Alignment Check */
  X86_TRAP_MC,    /* 18, Machine Check */
  X86_TRAP_XF,    /* 19, SIMD Floating-Point Exception */
  X86_TRAP_IRET = 32,  /* 32, IRET Exception */
};

这32个中断设置完,会有for循环for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++),将前32个中断都在used_vectors中标记为1,表示这些都设置过中断处理函数了
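used_vectors本质上是一个位图,每个向量占一位。下面用用户态C代码示意这段标记逻辑,set_bit_sim、test_bit_sim、mark_reserved均为示意实现,并非内核函数。

```c
#include <assert.h>
#include <stddef.h>

/* used_vectors位图的用户态示意:每个向量占1位 */
#define NR_VECTORS 256
#define FIRST_EXTERNAL_VECTOR 0x20
#define BITS_PER_LONG_SIM (8 * sizeof(unsigned long))

static unsigned long used_vectors_sim[NR_VECTORS / BITS_PER_LONG_SIM + 1];

static void set_bit_sim(int nr, unsigned long *addr)
{
    addr[nr / BITS_PER_LONG_SIM] |= 1UL << (nr % BITS_PER_LONG_SIM);
}

static int test_bit_sim(int nr, const unsigned long *addr)
{
    return (addr[nr / BITS_PER_LONG_SIM] >> (nr % BITS_PER_LONG_SIM)) & 1;
}

/* 模拟trap_init的标记动作:前32个向量和0x80全部标记为已占用 */
static void mark_reserved(void)
{
    for (int i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
        set_bit_sim(i, used_vectors_sim);
    set_bit_sim(0x80, used_vectors_sim);
}
```

后面native_init_IRQ正是跳过这些被标记的位,把剩余位置填上设备中断的入口。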

然后trap_init单独调用set_system_intr_gate来设置32位系统调用的中断IA32_SYSCALL_VECTOR,也即128,并单独将used_vectors中的第128位标记为1

trap_init最后会将idt_table放在一个固定的虚拟地址上

trap_init结束后,中断向量表中已经填好了前32项,外加一项32位系统调用,其余的都留给设备中断。start_kernel调用完trap_init之后,还会调用init_IRQ()来初始化其他的设备中断,最终会调用到native_init_IRQ

void __init native_init_IRQ(void)
{
  int i;
  i = FIRST_EXTERNAL_VECTOR;
#ifndef CONFIG_X86_LOCAL_APIC
#define first_system_vector NR_VECTORS
#endif
  for_each_clear_bit_from(i, used_vectors, first_system_vector) {
    /* IA32_SYSCALL_VECTOR could be used in trap_init already. */
    set_intr_gate(i, irq_entries_start +
        8 * (i - FIRST_EXTERNAL_VECTOR));
  }
......
}

从第32个中断向量开始,到最后NR_VECTORS为止,凡是used_vectors中没有标记为1的位置,都会调用set_intr_gate设置中断向量表项

其他设备中断的处理函数的设置会从irq_entries_start开始,偏移量为i - FIRST_EXTERNAL_VECTOR

中断处理函数定义在irq_entries_start这个表里面,在arch/x86/entry/entry_32.S和arch/x86/entry/entry_64.S中都能找到这个函数表的定义

ENTRY(irq_entries_start)
    vector=FIRST_EXTERNAL_VECTOR
    .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
  pushl  $(~vector+0x80)      /* Note: always in signed byte range */
    vector=vector+1
  jmp  common_interrupt /* 会调用到do_IRQ */
  .align  8
    .endr
END(irq_entries_start)


common_interrupt:
  ASM_CLAC
  addq  $-0x80, (%rsp)      /* Adjust vector to [-256, -1] range */
  interrupt do_IRQ
  /* 0(%rsp): old RSP */
ret_from_intr:
......
  /* Interrupt came from user space */
GLOBAL(retint_user)
......
/* Returning to kernel space */
retint_kernel:
......

里边定义了FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR项,每一项都是中断处理函数,会跳到common_interrupt去执行,最终调用do_IRQ,调用完成后从中断返回
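入口处pushl $(~vector+0x80)和common_interrupt里addq $-0x80这两步编码看起来绕,其实是为了把向量压成有符号字节,又能在do_IRQ中用一次按位取反还原。下面用用户态的算术示意这个编解码过程,encode_vector、decode_vector为示意命名。

```c
#include <assert.h>

/* 中断向量编码示意:入口push ~vector+0x80(落在有符号字节范围),
 * common_interrupt再减0x80得到~vector,do_IRQ按位取反还原vector */
static long encode_vector(long vector)
{
    return ~vector + 0x80;          /* 对应 pushl $(~vector+0x80) */
}

static long decode_vector(long orig_ax)
{
    orig_ax += -0x80;               /* 对应 addq $-0x80, (%rsp) */
    return ~orig_ax;                /* 对应 vector = ~regs->orig_ax */
}
```

对于设备中断的向量范围[0x20, 0xff],编码结果落在[-128, 95],正好是一个有符号字节,这就是注释里always in signed byte range的含义。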

/*
 * do_IRQ handles all normal device IRQ's (the special
 * SMP cross-CPU interrupts have their own specific
 * handlers).
 */
__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
{
  struct pt_regs *old_regs = set_irq_regs(regs);
  struct irq_desc * desc;
  /* high bit used in ret_from_ code  */
  unsigned vector = ~regs->orig_ax;
......
  desc = __this_cpu_read(vector_irq[vector]);
  if (!handle_irq(desc, regs)) {
......
  }
......
  set_irq_regs(old_regs);
  return 1;
}

do_IRQ从pt_regs中保存的orig_ax(入口处压栈的向量编码)拿到了中断向量vector,但是中断控制器发送给每个CPU的中断向量都是CPU局部的,而抽象中断处理层的虚拟中断信号irq以及它对应的中断描述结构irq_desc是全局的,也即这个CPU的200号中断向量和另一个CPU的200号中断向量,对应的虚拟中断信号irq和中断描述结构irq_desc可能不一样,这就需要一个映射关系。这个映射关系放在Per CPU变量vector_irq里面

DECLARE_PER_CPU(vector_irq_t, vector_irq);

在系统初始化的时候会调用__assign_irq_vector将虚拟信号绑定到某个CPU的中断向量

static int __assign_irq_vector(int irq, struct apic_chip_data *d,
             const struct cpumask *mask,
             struct irq_data *irqdata)
{
  static int current_vector = FIRST_EXTERNAL_VECTOR + VECTOR_OFFSET_START;
  static int current_offset = VECTOR_OFFSET_START % 16;
  int cpu, vector;
......
  while (cpu < nr_cpu_ids) {
    int new_cpu, offset;
......
    vector = current_vector;
    offset = current_offset;
next:
    vector += 16;
    if (vector >= first_system_vector) {
      offset = (offset + 1) % 16;
      vector = FIRST_EXTERNAL_VECTOR + offset;
    }


    /* If the search wrapped around, try the next cpu */
    if (unlikely(current_vector == vector))
      goto next_cpu;




    if (test_bit(vector, used_vectors))
      goto next;


......
    /* Found one! */
    current_vector = vector;
    current_offset = offset;
    /* Schedule the old vector for cleanup on all cpus */
    if (d->cfg.vector)
      cpumask_copy(d->old_domain, d->domain);
    for_each_cpu(new_cpu, vector_searchmask)
      per_cpu(vector_irq, new_cpu)[vector] = irq_to_desc(irq);
    goto update;


next_cpu:
    cpumask_or(searched_cpumask, searched_cpumask, vector_cpumask);
    cpumask_andnot(vector_cpumask, mask, searched_cpumask);
    cpu = cpumask_first_and(vector_cpumask, cpu_online_mask);
    continue;
  }
....

一旦找到某个空闲向量,就将该CPU上此向量对应的中断描述结构设置为虚拟中断信号irq对应的irq_to_desc(irq)
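__assign_irq_vector的搜索策略是每次跳16个向量,越界后换一个偏移从头再来,并跳过used_vectors中已占用的位置。下面是这一策略的简化用户态示意,find_vector_sim和used_sim均为示意命名,省略了多CPU和domain相关的逻辑。

```c
#include <assert.h>

/* __assign_irq_vector搜索策略的简化示意:步长16,越界换偏移重来,
 * 跳过已占用向量;转一圈找不到则失败 */
#define FIRST_EXTERNAL_VECTOR 0x20
#define FIRST_SYSTEM_VECTOR 256

static unsigned char used_sim[FIRST_SYSTEM_VECTOR];

static int find_vector_sim(const unsigned char *used, int start)
{
    int vector = start;
    int offset = (start - FIRST_EXTERNAL_VECTOR) % 16;

    for (;;) {
        vector += 16;
        if (vector >= FIRST_SYSTEM_VECTOR) {
            offset = (offset + 1) % 16;
            vector = FIRST_EXTERNAL_VECTOR + offset;
        }
        if (vector == start)    /* 绕回起点,说明没有空闲向量 */
            return -1;
        if (!used[vector])
            return vector;
    }
}
```

步长取16是为了把不同设备的向量分散在IDT中,降低同一优先级区间的拥挤程度。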

do_IRQ会根据中断向量vector得到irq_desc,调用链为handle_irq->generic_handle_irq_desc->desc->handle_irq


static inline void generic_handle_irq_desc(struct irq_desc *desc)
{
  desc->handle_irq(desc);
}

handle_irq沿着上述调用链,最终会调用到__handle_irq_event_percpu

irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags)
{
  irqreturn_t retval = IRQ_NONE;
  unsigned int irq = desc->irq_data.irq;
  struct irqaction *action;


  record_irq_time(desc);


  for_each_action_of_desc(desc, action) {
    irqreturn_t res;
    res = action->handler(irq, action->dev_id);
    switch (res) {
    case IRQ_WAKE_THREAD:
      __irq_wake_thread(desc, action);
    case IRQ_HANDLED:
      *flags |= action->flags;
      break;
    default:
      break;
    }
    retval |= res;
  }
  return retval;
}

__handle_irq_event_percpu遍历irq_desc上挂的每个irqaction,调用注册的handler;如果返回IRQ_WAKE_THREAD,就先唤醒对应的中断处理线程,再按IRQ_HANDLED的分支继续执行
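上面switch里IRQ_WAKE_THREAD分支没有break,落入IRQ_HANDLED是内核中有意的fallthrough。下面用一段示意代码还原这个返回值处理逻辑,handle_actions_sim、wake_thread_sim等命名均为示意。

```c
#include <assert.h>

/* __handle_irq_event_percpu返回值处理示意:IRQ_WAKE_THREAD唤醒
 * 中断线程后落入IRQ_HANDLED分支,所有返回值按位或汇总 */
enum { IRQ_NONE = 0, IRQ_HANDLED = 1, IRQ_WAKE_THREAD = 2 };

static int woken;
static void wake_thread_sim(void) { woken++; }  /* 对应__irq_wake_thread */

static int handle_actions_sim(const int *results, int n)
{
    int retval = IRQ_NONE;
    for (int i = 0; i < n; i++) {
        int res = results[i];
        switch (res) {
        case IRQ_WAKE_THREAD:
            wake_thread_sim();
            /* fallthrough,与内核写法一致 */
        case IRQ_HANDLED:
            break;
        default:
            break;
        }
        retval |= res;
    }
    return retval;
}

/* 示意的三个handler返回值:未处理、唤醒线程、已处理 */
static const int sample_results[3] = { IRQ_NONE, IRQ_WAKE_THREAD, IRQ_HANDLED };
```

按位或汇总意味着只要链上有一个handler认领了中断,整体就不算spurious interrupt。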

整个中断的处理流程就是:设备发出物理中断信号,中断控制器将其转换为中断向量发给CPU;CPU根据中断向量表找到IRQ处理函数,将向量映射为虚拟中断信号irq;最后调用irq_desc上注册的irq_handler_t完成处理

34 块设备(上)

块设备和字符设备一样,也会在/dev下创建块设备文件,分配特殊的inode,不同的是

  • 字符设备采用S_ISCHR分支,对应的inode的file_operations为def_chr_fops
  • 块设备采用S_ISBLK分支,对应的inode的file_operations为def_blk_fops

i_rdev被设置为了块设备的设备号dev_t,块设备也可以像字符设备一样打开并读写,但是一般情况下都会进行挂载,mount到一个文件夹下

const struct file_operations def_blk_fops = {
        .open           = blkdev_open,
        .release        = blkdev_close,
        .llseek         = block_llseek,
        .read_iter      = blkdev_read_iter,
        .write_iter     = blkdev_write_iter,
        .mmap           = generic_file_mmap,
        .fsync          = blkdev_fsync,
        .unlocked_ioctl = block_ioctl,
        .splice_read    = generic_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .fallocate      = blkdev_fallocate,
};

打开块设备调用的是blkdev_open,最终会调用blkdev_get

mount调用对应文件系统的mount操作,对于ext4调用的就是ext4_mount->mount_bdev

static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data)
{
  return mount_bdev(fs_type, flags, dev_name, data, ext4_fill_super);
}


struct dentry *mount_bdev(struct file_system_type *fs_type,
  int flags, const char *dev_name, void *data,
  int (*fill_super)(struct super_block *, void *, int))
{
  struct block_device *bdev;
  struct super_block *s;
  fmode_t mode = FMODE_READ | FMODE_EXCL;
  int error = 0;


  if (!(flags & MS_RDONLY))
    mode |= FMODE_WRITE;


  bdev = blkdev_get_by_path(dev_name, mode, fs_type);
......
  s = sget(fs_type, test_bdev_super, set_bdev_super, flags | MS_NOSEC, bdev);
......
  return dget(s->s_root);
......
}

mount_bdev做了两件事

  • blkdev_get_by_path根据/dev/xxx文件路径找到对应设备,并打开
  • sget根据打开的设备文件,填充ext4文件系统的super_block建立文件系统体系

blkdev_get_by_path得到的是struct block_device *bdev

/**
 * blkdev_get_by_path - open a block device by name
 * @path: path to the block device to open
 * @mode: FMODE_* mask
 * @holder: exclusive holder identifier
 *
 * Open the blockdevice described by the device file at @path.  @mode
 * and @holder are identical to blkdev_get().
 *
 * On success, the returned block_device has reference count of one.
 */
struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
          void *holder)
{
  struct block_device *bdev;
  int err;


  bdev = lookup_bdev(path);
......
  err = blkdev_get(bdev, mode, holder);
......
  return bdev;
}

并调用blkdev_get打开设备

/**
 * lookup_bdev  - lookup a struct block_device by name
 * @pathname:  special file representing the block device
 *
 * Get a reference to the blockdevice at @pathname in the current
 * namespace if possible and return it.  Return ERR_PTR(error)
 * otherwise.
 */
struct block_device *lookup_bdev(const char *pathname)
{
  struct block_device *bdev;
  struct inode *inode;
  struct path path;
  int error;


  if (!pathname || !*pathname)
    return ERR_PTR(-EINVAL);


  error = kern_path(pathname, LOOKUP_FOLLOW, &path);
  if (error)
    return ERR_PTR(error);


  inode = d_backing_inode(path.dentry);
......
  bdev = bd_acquire(inode);
......
  goto out;
}

pathname为设备名称,例如/dev/xxx,这个文件在devtmpfs文件系统中,所以kern_path可以在这个文件系统获取对应的dentry,再通过d_backing_inode获取inode,这个inode就是init_special_inode生成的特殊inode

bd_acquire通过特殊的inode,找到struct block_device

static struct block_device *bd_acquire(struct inode *inode)
{
  struct block_device *bdev;
......
  bdev = bdget(inode->i_rdev);
  if (bdev) {
    spin_lock(&bdev_lock);
    if (!inode->i_bdev) {
      /*
       * We take an additional reference to bd_inode,
       * and it's released in clear_inode() of inode.
       * So, we can access it via ->i_mapping always
       * without igrab().
       */
      bdgrab(bdev);
      inode->i_bdev = bdev;
      inode->i_mapping = bdev->bd_inode->i_mapping;
    }
  }
  return bdev;
}

bd_acquire调用的是bdget,参数为特殊inode的i_rdev,它在mknod的时候被设置为设备号dev_t
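dev_t把主设备号和次设备号编码在一个整数里:内核内部布局是高12位主设备号、低20位次设备号(对应include/linux/kdev_t.h中的MAJOR/MINOR/MKDEV宏)。下面是这一编解码的示意实现,mkdev_sim等命名为示意。

```c
#include <assert.h>

/* 内核dev_t布局示意:高12位主设备号,低20位次设备号 */
#define MINORBITS_SIM 20
#define MINORMASK_SIM ((1u << MINORBITS_SIM) - 1)

static unsigned int mkdev_sim(unsigned int major, unsigned int minor)
{
    return (major << MINORBITS_SIM) | minor;   /* MKDEV */
}

static unsigned int major_sim(unsigned int dev)
{
    return dev >> MINORBITS_SIM;               /* MAJOR */
}

static unsigned int minor_sim(unsigned int dev)
{
    return dev & MINORMASK_SIM;                /* MINOR */
}
```

例如SCSI磁盘/dev/sda1的主设备号是8、次设备号是1,bdget拿到的就是这样一个编码后的整数。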

struct block_device *bdget(dev_t dev)
{
        struct block_device *bdev;
        struct inode *inode;


        inode = iget5_locked(blockdev_superblock, hash(dev),
                        bdev_test, bdev_set, &dev);

        bdev = &BDEV_I(inode)->bdev;


        if (inode->i_state & I_NEW) {
                bdev->bd_contains = NULL;
                bdev->bd_super = NULL;
                bdev->bd_inode = inode;
                bdev->bd_block_size = i_blocksize(inode);
                bdev->bd_part_count = 0;
                bdev->bd_invalidated = 0;
                inode->i_mode = S_IFBLK;
                inode->i_rdev = dev;
                inode->i_bdev = bdev;
                inode->i_data.a_ops = &def_blk_aops;
                mapping_set_gfp_mask(&inode->i_data, GFP_USER);
                spin_lock(&bdev_lock);
                list_add(&bdev->bd_list, &all_bdevs);
                spin_unlock(&bdev_lock);
                unlock_new_inode(inode);
        }
        return bdev;
}

在bdget中遇到了bdev伪文件系统。bdget函数根据传进来的dev_t,在blockdev_superblock这个文件系统里面找到inode。blockdev_superblock的初始化在系统启动时由bdev_cache_init完成

struct super_block *blockdev_superblock __read_mostly;


static struct file_system_type bd_type = {
        .name           = "bdev",
        .mount          = bd_mount,
        .kill_sb        = kill_anon_super,
};


void __init bdev_cache_init(void)
{
        int err;
        static struct vfsmount *bd_mnt;


        bdev_cachep = kmem_cache_create("bdev_cache", sizeof(struct bdev_inode), 0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD|SLAB_ACCOUNT|SLAB_PANIC), init_once);
        err = register_filesystem(&bd_type);
        if (err)
                panic("Cannot register bdev pseudo-fs");
        bd_mnt = kern_mount(&bd_type);
        if (IS_ERR(bd_mnt))
                panic("Cannot create bdev pseudo-fs");
        blockdev_superblock = bd_mnt->mnt_sb;   /* For writeback */
}

所有块设备的inode都保存在伪文件系统bdev中。块设备的block_device和bdev文件系统中的inode通过struct bdev_inode关联在一起:bdget先根据dev_t找到bdev文件系统中的inode,再通过外层的bdev_inode获得块设备的block_device
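从内嵌的inode反推出外层bdev_inode,靠的是内核惯用的container_of技巧:用成员偏移量把成员指针换算回结构体指针。下面是这一关联方式的自包含示意,结构体字段做了简化,container_of_sim、bdev_i_sim为示意命名(内核中对应BDEV_I)。

```c
#include <assert.h>
#include <stddef.h>

/* bdev_inode关联方式示意:inode内嵌在bdev_inode中,
 * 用container_of从inode指针反推外层结构,进而拿到block_device */
struct inode_sim { int i_rdev; };
struct block_device_sim { int bd_dev; };

struct bdev_inode_sim {
    struct block_device_sim bdev;
    struct inode_sim vfs_inode;
};

#define container_of_sim(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* 对应内核中的 BDEV_I(inode)->bdev */
static struct block_device_sim *bdev_i_sim(struct inode_sim *inode)
{
    return &container_of_sim(inode, struct bdev_inode_sim, vfs_inode)->bdev;
}

static struct bdev_inode_sim sample = { { 0x800001 }, { 0x800001 } };
```

因为二者分配在同一个bdev_inode里,拿到inode就等于拿到了block_device,反之亦然。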

获取到block_device后,blkdev_get_by_path会通过blkdev_get打开设备,最后调用的是__blkdev_get

block_device的结构为

struct block_device {
  dev_t      bd_dev;  /* not a kdev_t - it's a search key */
  int      bd_openers;
  struct super_block *  bd_super;
......
  struct block_device *  bd_contains;
  unsigned    bd_block_size;
  struct hd_struct *  bd_part;
  unsigned    bd_part_count;
  int      bd_invalidated;
  struct gendisk *  bd_disk;
  struct request_queue *  bd_queue;
  struct backing_dev_info *bd_bdi;
  struct list_head  bd_list;
......
} ;

一个磁盘可能有很多分区,每个分区又可以格式化为不同文件系统,所以是一个复杂的结构

struct gendisk用来描述整个设备,在例子中对应/dev/sda

struct gendisk {
  int major;      /* major number of driver */
  int first_minor;
  int minors;                     /* maximum number of minors, =1 for disks that can't be partitioned. */
  char disk_name[DISK_NAME_LEN];  /* name of major driver */
  char *(*devnode)(struct gendisk *gd, umode_t *mode);
......
  struct disk_part_tbl __rcu *part_tbl;
  struct hd_struct part0;


  const struct block_device_operations *fops;
  struct request_queue *queue;
  void *private_data;


  int flags;
  struct kobject *slave_dir;
......
};
  • major是主设备号
  • first_minor表示第一个分区的从设备号
  • minors表示分区的数目
  • disk_name为磁盘块设备的名称
  • struct disk_part_tbl结构里是一个struct hd_struct的数组,用于表示各个分区
  • struct block_device_operations fops指向对于这个块设备的各种操作
  • struct request_queue queue是表示在这个块设备上的请求队列

struct hd_struct是用来表示某个分区的。在例子中有两个hd_struct的实例,分别对应/dev/sda1、/dev/sda2,定义如下

struct hd_struct {
  sector_t start_sect;
  sector_t nr_sects;
......
  struct device __dev;
  struct kobject *holder_dir;
  int policy, partno;
  struct partition_meta_info *info;
......
  struct disk_stats dkstats;
  struct percpu_ref ref;
  struct rcu_head rcu_head;
};

hd_struct中比较重要的信息是分区从磁盘的哪个扇区开始(start_sect)、共有多少个扇区(nr_sects)
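有了start_sect和nr_sects,分区内的相对扇区号加上start_sect就是整块磁盘上的绝对扇区号,块层向分区提交IO时做的正是这个换算。下面是一个最小示意,结构体和函数名均为假设。

```c
#include <assert.h>

/* hd_struct分区范围示意:相对扇区加start_sect得到绝对扇区,
 * 且不能越过nr_sects */
typedef unsigned long long sector_t_sim;

struct hd_struct_sim {
    sector_t_sim start_sect;  /* 分区起始扇区 */
    sector_t_sim nr_sects;    /* 分区扇区总数 */
};

/* 相对扇区转绝对扇区,越界返回-1(示意) */
static long long part_to_abs(const struct hd_struct_sim *p, sector_t_sim rel)
{
    if (rel >= p->nr_sects)
        return -1;
    return (long long)(p->start_sect + rel);
}

/* 假设的分区:从扇区2048开始,共409600个扇区(约200MB) */
static const struct hd_struct_sim sda1_sim = { 2048, 409600 };
```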

block_device既可以表示整个块设备,也可以表示某个分区

  • bd_disk都指向设备的gendisk
  • 表示整个设备时,bd_part指向gendisk中的part0,bd_contains指向自身
  • 表示分区时,bd_part指向对应分区的hd_struct,bd_contains指向整个设备的block_device

static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
  struct gendisk *disk;
  struct module *owner;
  int ret;
  int partno;
  int perm = 0;


  if (mode & FMODE_READ)
    perm |= MAY_READ;
  if (mode & FMODE_WRITE)
    perm |= MAY_WRITE;
......
  disk = get_gendisk(bdev->bd_dev, &partno);
......
  owner = disk->fops->owner;
......
  if (!bdev->bd_openers) {
    bdev->bd_disk = disk;
    bdev->bd_queue = disk->queue;
    bdev->bd_contains = bdev;


    if (!partno) {
      ret = -ENXIO;
      bdev->bd_part = disk_get_part(disk, partno);
......
      if (disk->fops->open) {
        ret = disk->fops->open(bdev, mode);
......
      }


      if (!ret)
        bd_set_size(bdev,(loff_t)get_capacity(disk)<<9);


      if (bdev->bd_invalidated) {
        if (!ret)
          rescan_partitions(disk, bdev);
......
      }
......
    } else {
      struct block_device *whole;
      whole = bdget_disk(disk, 0);
......
      ret = __blkdev_get(whole, mode, 1);
......
      bdev->bd_contains = whole;
      bdev->bd_part = disk_get_part(disk, partno);
......
      bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
    }
  } 
......
  bdev->bd_openers++;
  if (for_part)
    bdev->bd_part_count++;
.....
}

__blkdev_get中调用get_gendisk获取gendisk

/**
 * get_gendisk - get partitioning information for a given device
 * @devt: device to get partitioning information for
 * @partno: returned partition index
 *
 * This function gets the structure containing partitioning
 * information for the given device @devt.
 */
struct gendisk *get_gendisk(dev_t devt, int *partno)
{
  struct gendisk *disk = NULL;


  if (MAJOR(devt) != BLOCK_EXT_MAJOR) {
    struct kobject *kobj;


    kobj = kobj_lookup(bdev_map, devt, partno);
    if (kobj)
      disk = dev_to_disk(kobj_to_dev(kobj));
  } else {
    struct hd_struct *part;
    part = idr_find(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
    if (part && get_disk(part_to_disk(part))) {
      *partno = part->partno;
      disk = part_to_disk(part);
    }
  }
  return disk;
}
  • 第一种情况,如果block_device指向整个磁盘设备,只需要根据dev_t把bdev_map中对应的gendisk拿出来即可
  • 第二种情况,如果block_device指向某个分区,就需要先拿到hd_struct,通过它找到设备的gendisk,并将partno设置为分区号

在字符设备初始化的时候,会调用__register_chrdev_region注册字符设备,对块设备也是一样的,每个块设备初始化的时候都会调用add_disk注册一个gendisk,通过调用链add_disk->device_add_disk->blk_register_region,将dev_t和一个gendisk关联起来,保存在bdev_map

得到gendisk之后

  • 如果partno为0,说明打开的是整个设备而不是分区,就调用disk_get_part获取gendisk中的分区结构,再调用block_device_operations里的open函数打开设备
  • 如果partno不为0,说明打开的是分区,那么就获取整个设备的block_device,赋值给变量struct block_device *whole,然后调用__blkdev_get打开whole代表的整个设备,并将bd_contains设置为whole

block_device_operations在驱动层完成,例如drivers/scsi/sd.c里面,也就是MODULE_DESCRIPTION("SCSI disk (sd) driver")中,就有这样的定义

static const struct block_device_operations sd_fops = {
  .owner      = THIS_MODULE,
  .open      = sd_open,
  .release    = sd_release,
  .ioctl      = sd_ioctl,
  .getgeo      = sd_getgeo,
#ifdef CONFIG_COMPAT
  .compat_ioctl    = sd_compat_ioctl,
#endif
  .check_events    = sd_check_events,
  .revalidate_disk  = sd_revalidate_disk,
  .unlock_native_capacity  = sd_unlock_native_capacity,
  .pr_ops      = &sd_pr_ops,
};


/**
 *  sd_open - open a scsi disk device
 *  @bdev: Block device of the scsi disk to open
 *  @mode: FMODE_* mask
 *
 *  Returns 0 if successful. Returns a negated errno value in case 
 *  of error.
 **/
static int sd_open(struct block_device *bdev, fmode_t mode)
{
......
}

在驱动层打开磁盘设备后,block_device的各个字段都填充完成,mount_bdev中的blkdev_get_by_path至此完成了block_device的获取

接下来sget将block_device添加到super_block。sget的参数里有一个set_bdev_super回调:sget首先分配一个super_block,然后调用set_bdev_super把block_device挂进去,这里的super_block就是ext4文件系统的super_block

static int set_bdev_super(struct super_block *s, void *data)
{
  s->s_bdev = data;
  s->s_dev = s->s_bdev->bd_dev;
  s->s_bdi = bdi_get(s->s_bdev->bd_bdi);
  return 0;
}


/**
 *  sget  -  find or create a superblock
 *  @type:    filesystem type superblock should belong to
 *  @test:    comparison callback
 *  @set:    setup callback
 *  @flags:    mount flags
 *  @data:    argument to each of them
 */
struct super_block *sget(struct file_system_type *type,
      int (*test)(struct super_block *,void *),
      int (*set)(struct super_block *,void *),
      int flags,
      void *data)
{
......
  return sget_userns(type, test, set, flags, user_ns, data);
}


/**
 *  sget_userns -  find or create a superblock
 *  @type:  filesystem type superblock should belong to
 *  @test:  comparison callback
 *  @set:  setup callback
 *  @flags:  mount flags
 *  @user_ns: User namespace for the super_block
 *  @data:  argument to each of them
 */
struct super_block *sget_userns(struct file_system_type *type,
      int (*test)(struct super_block *,void *),
      int (*set)(struct super_block *,void *),
      int flags, struct user_namespace *user_ns,
      void *data)
{
  struct super_block *s = NULL;
  struct super_block *old;
  int err;
......
  if (!s) {
    s = alloc_super(type, (flags & ~MS_SUBMOUNT), user_ns);
......
  }
  err = set(s, data);
......
  s->s_type = type;
  strlcpy(s->s_id, type->name, sizeof(s->s_id));
  list_add_tail(&s->s_list, &super_blocks);
  hlist_add_head(&s->s_instances, &type->fs_supers);
  spin_unlock(&sb_lock);
  get_filesystem(type);
  register_shrinker(&s->s_shrink);
  return s;
}

mount一个块设备就完成了,设备打开形成block_device结构,并且添加到super_block

对于ext4文件系统有了super_block,就可以进行读写了

总结

块设备的整体流程

35 块设备(下)

对于写入ext4文件系统,最后调用的是ext4_file_write_iter

  • 直接IO,调用generic_file_direct_write,链路为mapping->a_ops->direct_IO,实际调用的ext4_direct_IO往设备层写入数据
  • 缓存IO,将数据从用户空间拷贝到内存缓存,当触发脏页刷写的时候,调用wb_workfn写入,链路为wb_workfn->wb_do_writeback->wb_writeback->writeback_sb_inodes->__writeback_single_inode->do_writepages,在do_writepages中,要调用mapping->a_ops->writepages,但实际调用的是ext4_writepages往设备层写入数据

直接IO访问块设备

直接IO调用的ext4_direct_IO

static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
  struct file *file = iocb->ki_filp;
  struct inode *inode = file->f_mapping->host;
  size_t count = iov_iter_count(iter);
  loff_t offset = iocb->ki_pos;
  ssize_t ret;
......
  ret = ext4_direct_IO_write(iocb, iter);
......
}


static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
{
  struct file *file = iocb->ki_filp;
  struct inode *inode = file->f_mapping->host;
  struct ext4_inode_info *ei = EXT4_I(inode);
  ssize_t ret;
  loff_t offset = iocb->ki_pos;
  size_t count = iov_iter_count(iter);
......
  ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
           get_block_func, ext4_end_io_dio, NULL,
           dio_flags);
……
}

ext4_direct_IO会调用__blockdev_direct_IO,有个参数inode->i_sb->s_bdev,通过当前inode得到super_block中的s_bdev,就是block_device

__blockdev_direct_IO会调用do_blockdev_direct_IO,需要一个struct dio结构和一个struct dio_submit结构来描述写入请求

static inline ssize_t
do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
          struct block_device *bdev, struct iov_iter *iter,
          get_block_t get_block, dio_iodone_t end_io,
          dio_submit_t submit_io, int flags)
{
  unsigned i_blkbits = ACCESS_ONCE(inode->i_blkbits);
  unsigned blkbits = i_blkbits;
  unsigned blocksize_mask = (1 << blkbits) - 1;
  ssize_t retval = -EINVAL;
  size_t count = iov_iter_count(iter);
  loff_t offset = iocb->ki_pos;
  loff_t end = offset + count;
  struct dio *dio;
  struct dio_submit sdio = { 0, };
  struct buffer_head map_bh = { 0, };
......
  dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
  dio->flags = flags;
  dio->i_size = i_size_read(inode);
  dio->inode = inode;
  if (iov_iter_rw(iter) == WRITE) {
    dio->op = REQ_OP_WRITE;
    dio->op_flags = REQ_SYNC | REQ_IDLE;
    if (iocb->ki_flags & IOCB_NOWAIT)
      dio->op_flags |= REQ_NOWAIT;
  } else {
    dio->op = REQ_OP_READ;
  }
  sdio.blkbits = blkbits;
  sdio.blkfactor = i_blkbits - blkbits;
  sdio.block_in_file = offset >> blkbits;


  sdio.get_block = get_block;
  dio->end_io = end_io;
  sdio.submit_io = submit_io;
  sdio.final_block_in_bio = -1;
  sdio.next_block_for_io = -1;


  dio->iocb = iocb;
  dio->refcount = 1;


  sdio.iter = iter;
  sdio.final_block_in_request =
    (offset + iov_iter_count(iter)) >> blkbits;
......
  sdio.pages_in_io += iov_iter_npages(iter, INT_MAX);


  retval = do_direct_IO(dio, &sdio, &map_bh);
.....
}

do_direct_IO有两层循环

  • 第一层循环是依次处理这些写入的所有块,对于每个块,取出对应内存中的页page,有写入的起始地址from和终止地址to
  • 第二层循环是处理from到to的数据,调用submit_page_section提交到设备层进行写入
static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
      struct buffer_head *map_bh)
{
  const unsigned blkbits = sdio->blkbits;
  const unsigned i_blkbits = blkbits + sdio->blkfactor;
  int ret = 0;


  while (sdio->block_in_file < sdio->final_block_in_request) {
    struct page *page;
    size_t from, to;


    page = dio_get_page(dio, sdio);
        from = sdio->head ? 0 : sdio->from;
    to = (sdio->head == sdio->tail - 1) ? sdio->to : PAGE_SIZE;
    sdio->head++;


    while (from < to) {
      unsigned this_chunk_bytes;  /* # of bytes mapped */
      unsigned this_chunk_blocks;  /* # of blocks */
......
            ret = submit_page_section(dio, sdio, page,
              from,
              this_chunk_bytes,
              sdio->next_block_for_io,
              map_bh);
......
      sdio->next_block_for_io += this_chunk_blocks;
      sdio->block_in_file += this_chunk_blocks;
      from += this_chunk_bytes;
      dio->result += this_chunk_bytes;
      sdio->blocks_available -= this_chunk_blocks;
      if (sdio->block_in_file == sdio->final_block_in_request)
        break;
......
        }
    }
}

submit_page_section会调用dio_bio_submit进而调用submit_bio向块设备层提交数据,struct bio是数据传给块设备的通用传输对象

/**
 * submit_bio - submit a bio to the block device layer for I/O
 * @bio: The &struct bio which describes the I/O
 */
blk_qc_t submit_bio(struct bio *bio)
{
......
  return generic_make_request(bio);
}

缓存IO如何访问块设备

缓存IO调用的是ext4_writepages

static int ext4_writepages(struct address_space *mapping,
         struct writeback_control *wbc)
{
......
  struct mpage_da_data mpd;
  struct inode *inode = mapping->host;
  struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
......
  mpd.do_map = 0;
  mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
  ret = mpage_prepare_extent_to_map(&mpd);
  /* Submit prepared bio */
  ext4_io_submit(&mpd.io_submit);
......
}

struct mpage_da_data里面有文件的inode、要写入的页的偏移量,还有struct ext4_io_submit,其中包含通用传输对象bio

struct mpage_da_data {
  struct inode *inode;
......
  pgoff_t first_page;  /* The first page to write */
  pgoff_t next_page;  /* Current page to examine */
  pgoff_t last_page;  /* Last page to examine */
  struct ext4_map_blocks map;
  struct ext4_io_submit io_submit;  /* IO submission data */
  unsigned int do_map:1;
};


struct ext4_io_submit {
......
  struct bio    *io_bio;
  ext4_io_end_t    *io_end;
  sector_t    io_next_block;
};

ext4_writepages中,mpage_prepare_extent_to_map用于初始化struct mpage_da_data

之后调用链为mpage_prepare_extent_to_map->mpage_process_page_bufs->mpage_submit_page->ext4_bio_write_page->io_submit_add_bh

io_submit_add_bh中,此时bio中还是空的,需要调用io_submit_init_bio初始化bio

static int io_submit_init_bio(struct ext4_io_submit *io,
            struct buffer_head *bh)
{
  struct bio *bio;


  bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES);
  if (!bio)
    return -ENOMEM;
  wbc_init_bio(io->io_wbc, bio);
  bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
  bio->bi_bdev = bh->b_bdev;
  bio->bi_end_io = ext4_end_bio;
  bio->bi_private = ext4_get_io_end(io->io_end);
  io->io_bio = bio;
  io->io_next_block = bh->b_blocknr;
  return 0;
}

ext4_writepages在bio初始化之后,需要调用ext4_io_submit,通过submit_bio提交IO

void ext4_io_submit(struct ext4_io_submit *io)
{
  struct bio *bio = io->io_bio;


  if (bio) {
    int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
          REQ_SYNC : 0;
    io->io_bio->bi_write_hint = io->io_end->inode->i_write_hint;
    bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
    submit_bio(io->io_bio);
  }
  io->io_bio = NULL;
}

如何向块设备层提交请求

不管是直接IO,还是缓存IO最后都到了submit_bio

submit_bio会调用generic_make_request

blk_qc_t generic_make_request(struct bio *bio)
{
  /*
   * bio_list_on_stack[0] contains bios submitted by the current
   * make_request_fn.
   * bio_list_on_stack[1] contains bios that were submitted before
   * the current make_request_fn, but that haven't been processed
   * yet.
   */
  struct bio_list bio_list_on_stack[2];
  blk_qc_t ret = BLK_QC_T_NONE;
......
  if (current->bio_list) {
    bio_list_add(&current->bio_list[0], bio);
    goto out;
  }


  bio_list_init(&bio_list_on_stack[0]);
  current->bio_list = bio_list_on_stack;
  do {
    struct request_queue *q = bdev_get_queue(bio->bi_bdev);


    if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
      struct bio_list lower, same;


      /* Create a fresh bio_list for all subordinate requests */
      bio_list_on_stack[1] = bio_list_on_stack[0];
      bio_list_init(&bio_list_on_stack[0]);
      ret = q->make_request_fn(q, bio);


      blk_queue_exit(q);


      /* sort new bios into those for a lower level
       * and those for the same level
       */
      bio_list_init(&lower);
      bio_list_init(&same);
      while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
        if (q == bdev_get_queue(bio->bi_bdev))
          bio_list_add(&same, bio);
        else
          bio_list_add(&lower, bio);
      /* now assemble so we handle the lowest level first */
      bio_list_merge(&bio_list_on_stack[0], &lower);
      bio_list_merge(&bio_list_on_stack[0], &same);
      bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
    } 
......
    bio = bio_list_pop(&bio_list_on_stack[0]);
  } while (bio);
  current->bio_list = NULL; /* deactivate */
out:
  return ret;
}

在do while中,先是获取一个请求队列request_queue,调用队列的make_request_fn函数

块设备队列结构

struct block_device和struct gendisk中都有请求队列struct request_queue,用于处理上层发来的请求

在块设备初始化的时候会生成一个request_queue

struct request_queue {
  /*
   * Together with queue_head for cacheline sharing
   */
  struct list_head  queue_head;
  struct request    *last_merge;
  struct elevator_queue  *elevator;
......
  request_fn_proc    *request_fn;
  make_request_fn    *make_request_fn;
......
}

在请求队列request_queue中,首先是一个链表queue_head,用来保存请求request

struct request {
  struct list_head queuelist;
......
  struct request_queue *q;
......
  struct bio *bio;
  struct bio *biotail;
......
}

每个request包含一个struct bio链表,bio和biotail指针分别指向链表的头部和尾部

struct bio {
  struct bio    *bi_next;  /* request queue link */
  struct block_device  *bi_bdev;
  blk_status_t    bi_status;
......
    struct bvec_iter  bi_iter;
  unsigned short    bi_vcnt;  /* how many bio_vec's */
  unsigned short    bi_max_vecs;  /* max bvl_vecs we can hold */
  atomic_t    __bi_cnt;  /* pin count */
  struct bio_vec    *bi_io_vec;  /* the actual vec list */
......
};


struct bio_vec {
  struct page  *bv_page;
  unsigned int  bv_len;
  unsigned int  bv_offset;
};

在bio中,bi_next是链表的下一项,struct bio_vec指向一组页面
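一个bio的总传输量就是它各个bio_vec片段bv_len之和,每个片段是某个page内从bv_offset开始的bv_len字节。下面用简化结构体示意这种组织方式,bio_vec_sim、bio_bytes_sim为示意命名。

```c
#include <assert.h>
#include <stddef.h>

/* bio_vec示意:一个bio指向一组(page, offset, len)片段,
 * 总传输字节数是各片段bv_len之和 */
struct bio_vec_sim {
    void *bv_page;            /* 所在页(示意,不实际分配) */
    unsigned int bv_len;      /* 片段长度 */
    unsigned int bv_offset;   /* 页内偏移 */
};

static unsigned int bio_bytes_sim(const struct bio_vec_sim *vecs, int cnt)
{
    unsigned int total = 0;
    for (int i = 0; i < cnt; i++)
        total += vecs[i].bv_len;
    return total;
}

static const struct bio_vec_sim sample_vecs[2] = {
    { NULL, 4096, 0 },     /* 整页 */
    { NULL, 512, 1024 },   /* 页内偏移1024处的512字节 */
};
```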

在请求队列request_queue上有两个重要函数

  • make_request_fn用于生成request
  • request_fn用于处理request

块设备初始化

以scsi驱动为例,在初始化驱动设备的时候,会调用scsi_alloc_queue,将request_fn设置为scsi_request_fn,然后blk_init_allocated_queue->blk_queue_make_request,把make_request_fn设置为blk_queue_bio

/**
 * scsi_alloc_sdev - allocate and setup a scsi_Device
 * @starget: which target to allocate a &scsi_device for
 * @lun: which lun
 * @hostdata: usually NULL and set by ->slave_alloc instead
 *
 * Description:
 *     Allocate, initialize for io, and return a pointer to a scsi_Device.
 *     Stores the @shost, @channel, @id, and @lun in the scsi_Device, and
 *     adds scsi_Device to the appropriate list.
 *
 * Return value:
 *     scsi_Device pointer, or NULL on failure.
 **/
static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
             u64 lun, void *hostdata)
{
  struct scsi_device *sdev;
  sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
           GFP_ATOMIC);
......
  sdev->request_queue = scsi_alloc_queue(sdev);
......
}


struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
{
  struct Scsi_Host *shost = sdev->host;
  struct request_queue *q;


  q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
  if (!q)
    return NULL;
  q->cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size;
  q->rq_alloc_data = shost;
  q->request_fn = scsi_request_fn;
  q->init_rq_fn = scsi_init_rq;
  q->exit_rq_fn = scsi_exit_rq;
  q->initialize_rq_fn = scsi_initialize_rq;


    //调用blk_queue_make_request(q, blk_queue_bio);
  if (blk_init_allocated_queue(q) < 0) {
    blk_cleanup_queue(q);
    return NULL;
  }


  __scsi_init_queue(shost, q);
......
  return q;
}

blk_init_allocated_queue

  1. 初始化了make_request_fn
  2. 初始化了IO电梯算法
int blk_init_allocated_queue(struct request_queue *q)
{
  q->fq = blk_alloc_flush_queue(q, NUMA_NO_NODE, q->cmd_size);
......
  blk_queue_make_request(q, blk_queue_bio);
......
  /* init elevator */
  if (elevator_init(q, NULL)) {
......
  }
......
}

电梯算法有很多类型,定义为elevator_type

  • struct elevator_type elevator_noopNoop调度算法是最简单的IO调度算法,它将IO请求放入到一个FIFO队列中,然后逐个执行这些IO请求
  • struct elevator_type iosched_deadlineDeadline算法要保证每个IO请求在一定的时间内一定要被服务到,以此来避免某个请求饥饿。为了完成这个目标,算法中引入了两类队列,一类队列用来对请求按起始扇区序号进行排序,通过红黑树来组织,我们称为sort_list,按照此队列传输性能会比较高;另一类队列对请求按它们的生成时间进行排序,由链表来组织,称为fifo_list,并且每一个请求都有一个期限值
  • struct elevator_type iosched_cfq又看到了熟悉的CFQ完全公平调度算法。所有的请求会在多个队列中排序。同一个进程的请求,总是在同一队列中处理。时间片会分配到每个队列,通过轮询算法,我们保证了I/O 带宽,以公平的方式,在不同队列之间进行共享。

elevator_init中会根据名称来指定电梯算法,如果没有选择,那就默认使用iosched_cfq

请求提交与调度

In generic_make_request, the queue's make_request_fn is invoked, which in practice is blk_queue_bio.

static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
{
  struct request *req, *free;
  unsigned int request_count = 0;
......
  switch (elv_merge(q, &req, bio)) {
  case ELEVATOR_BACK_MERGE:
    if (!bio_attempt_back_merge(q, req, bio))
      break;
    elv_bio_merged(q, req, bio);
    free = attempt_back_merge(q, req);
    if (free)
      __blk_put_request(q, free);
    else
      elv_merged_request(q, req, ELEVATOR_BACK_MERGE);
    goto out_unlock;
  case ELEVATOR_FRONT_MERGE:
    if (!bio_attempt_front_merge(q, req, bio))
      break;
    elv_bio_merged(q, req, bio);
    free = attempt_front_merge(q, req);
    if (free)
      __blk_put_request(q, free);
    else
      elv_merged_request(q, req, ELEVATOR_FRONT_MERGE);
    goto out_unlock;
  default:
    break;
  }


get_rq:
  req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
......
  blk_init_request_from_bio(req, bio);
......
  add_acct_request(q, req, where);
  __blk_run_queue(q);
out_unlock:
......
  return BLK_QC_T_NONE;
}

blk_queue_bio calls elv_merge to decide whether the current bio can be merged into an existing request, batching the I/O to improve read/write performance.

The criterion comes from struct bio's member struct bvec_iter, which has two fields:

  • bi_sector, the starting sector on disk
  • bi_size, the size
enum elv_merge elv_merge(struct request_queue *q, struct request **req,
    struct bio *bio)
{
  struct elevator_queue *e = q->elevator;
  struct request *__rq;
......
  if (q->last_merge && elv_bio_merge_ok(q->last_merge, bio)) {
    enum elv_merge ret = blk_try_merge(q->last_merge, bio);


    if (ret != ELEVATOR_NO_MERGE) {
      *req = q->last_merge;
      return ret;
    }
  }
......
  __rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);
  if (__rq && elv_bio_merge_ok(__rq, bio)) {
    *req = __rq;
    return ELEVATOR_BACK_MERGE;
  }


  if (e->uses_mq && e->type->ops.mq.request_merge)
    return e->type->ops.mq.request_merge(q, req, bio);
  else if (!e->uses_mq && e->type->ops.sq.elevator_merge_fn)
    return e->type->ops.sq.elevator_merge_fn(q, req, bio);


  return ELEVATOR_NO_MERGE;
}

elv_merge attempts a merge in three ways:

  1. Check whether the bio can be merged with the request that was merged last time. In blk_try_merge, if blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector, that is, the request's starting address plus its size is contiguous with the bio's starting address, the bio is appended at the end of the request, which is ELEVATOR_BACK_MERGE. Likewise, if blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector, the bio is contiguous in front of the request and is prepended, which is ELEVATOR_FRONT_MERGE.
  2. Call elv_rqhash_find to look up a request by the bio's starting address and see whether one can be merged, appending the bio at the end: ELEVATOR_BACK_MERGE.
  3. Call elevator_merge_fn. For iosched_cfq this goes cfq_merge -> cfq_find_rq_fmerge -> elv_rb_find, using the bio's ending address as the key to find a request the bio can be merged in front of: ELEVATOR_FRONT_MERGE.

The logic of blk_try_merge:

enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
{
......
    if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
    return ELEVATOR_BACK_MERGE;
  else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
    return ELEVATOR_FRONT_MERGE;
  return ELEVATOR_NO_MERGE;
}

static enum elv_merge cfq_merge(struct request_queue *q, struct request **req,
         struct bio *bio)
{
  struct cfq_data *cfqd = q->elevator->elevator_data;
  struct request *__rq;


  __rq = cfq_find_rq_fmerge(cfqd, bio);
  if (__rq && elv_bio_merge_ok(__rq, bio)) {
    *req = __rq;
    return ELEVATOR_FRONT_MERGE;
  }


  return ELEVATOR_NO_MERGE;
}


static struct request *
cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
{
  struct task_struct *tsk = current;
  struct cfq_io_cq *cic;
  struct cfq_queue *cfqq;


  cic = cfq_cic_lookup(cfqd, tsk->io_context);
  if (!cic)
    return NULL;


  cfqq = cic_to_cfqq(cic, op_is_sync(bio->bi_opf));
  if (cfqq)
    return elv_rb_find(&cfqq->sort_list, bio_end_sector(bio));


  return NULL;
}

When elv_merge returns to blk_queue_bio, if a merge is possible the bio is merged according to the returned type; otherwise get_request is called to allocate a new request, blk_init_request_from_bio puts the bio into the new request, and add_acct_request adds the new request to the request_queue.

generic_make_request also manipulates struct bio_list many times, because block devices are layered. For example:

  • Two disks form a RAID, two RAIDs form an LVM, and a block device is created on top of the LVM. The device closest to the user is the high-level block device, and the one closest to the hardware is the low-level block device.

When generic_make_request sends a request to a high-level block device, it calls that device's make_request_fn, which in turn calls generic_make_request to pass the request down to the lower-level device. This is a recursive operation.

When generic_make_request is invoked by the current task, it sets current->bio_list to bio_list_on_stack, and right at its start it checks whether current->bio_list is empty:

  • If it is not empty, there is no need to recurse into make_request_fn; the request is simply appended to the bio_list, so the recursion unwinds immediately.
  • If it is empty, current->bio_list is set to bio_list_on_stack and a do-while loop is entered: the current queue's make_request_fn is called, and new bios may be generated inside it, a process of creating bios while handling them.

Before calling make_request_fn, bio_list_on_stack[1] = bio_list_on_stack[0] saves the bios left unprocessed in the queue; then bio_list_init sets bio_list_on_stack[0] to empty, and make_request_fn is called. Any new bio generated inside make_request_fn is added to the bio_list_on_stack[0] queue.

After make_request_fn finishes, bio_list_on_stack[0] may well have accumulated some new bios. The loop then calls bio_list_pop to take the bios out of bio_list_on_stack[0] and distribute them into two lists, lower and same: lower holds bios destined for lower-level block devices, and same holds bios for devices at the same level.

Finally lower, same, and bio_list_on_stack[1] are all taken out and placed into bio_list_on_stack[0] for unified processing, with lower first, because only after the lower-level devices' I/O is finished can the upper-level devices' I/O complete.

In the end the bios are placed into the bottom-level queue, and that queue processes the requests.

Request processing

Request processing goes through request_fn in the request_queue; for SCSI this is scsi_request_fn.


static void scsi_request_fn(struct request_queue *q)
  __releases(q->queue_lock)
  __acquires(q->queue_lock)
{
  struct scsi_device *sdev = q->queuedata;
  struct Scsi_Host *shost;
  struct scsi_cmnd *cmd;
  struct request *req;


  /*
   * To start with, we keep looping until the queue is empty, or until
   * the host is no longer able to accept any more requests.
   */
  shost = sdev->host;
  for (;;) {
    int rtn;
    /*
     * get next queueable request.  We do this early to make sure
     * that the request is fully prepared even if we cannot
     * accept it.
     */
    req = blk_peek_request(q);
......
    /*
     * Remove the request from the request list.
     */
    if (!(blk_queue_tagged(q) && !blk_queue_start_tag(q, req)))
      blk_start_request(req);
.....
    cmd = req->special;
......
    /*
     * Dispatch the command to the low-level driver.
     */
    cmd->scsi_done = scsi_done;
    rtn = scsi_dispatch_cmd(cmd);
......
  }
  return;
......
}

This is an infinite for loop: it reads requests from the request_queue, wraps them into lower-level commands, and the device controller performs the actual I/O.

Summary

# Check the disk's scheduling algorithm
$ cat /sys/block/xvda/queue/scheduler
# Change it temporarily
$ echo noop > /sys/block/xvda/queue/scheduler
# A permanent change requires modifying the kernel boot parameters and rebooting.
# I/O queue length
$ iostat -d -x
# The avgqu-sz column is the average I/O queue length