先从int 0x80开始讲起

在trap_init函数中初始化异常和中断处理,其中0x80中断门被这样设置:

void __init trap_init(void)
{
    ......
#ifdef CONFIG_IA32_EMULATION
    set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
    set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif

#ifdef CONFIG_X86_32
    set_system_trap_gate(SYSCALL_VECTOR, &system_call);
    set_bit(SYSCALL_VECTOR, used_vectors);
#endif
    ......
}

在i386中处理函数被定义成system_call,这个东西被定义在entry_32.S中,用汇编写成。

    /*
     * syscall stub including irq exit should be protected against kprobes
     */

    .pushsection .kprobes.text, "ax"
    # system call handler stub
ENTRY(system_call)
    RING0_INT_FRAME         # can't unwind into user space anyway
    ASM_CLAC
    pushl_cfi %eax          # save orig_eax
    SAVE_ALL
    GET_THREAD_INFO(%ebp)
                    # system call tracing in operation / emulation
    testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
    jnz syscall_trace_entry
    cmpl $(NR_syscalls), %eax
    jae syscall_badsys
syscall_call:
    call *sys_call_table(,%eax,4)
    movl %eax,PT_EAX(%esp)      # store the return value
syscall_exit:
    LOCKDEP_SYS_EXIT
    DISABLE_INTERRUPTS(CLBR_ANY)    # make sure we don't miss an interrupt
                    # setting need_resched or sigpending
                    # between sampling and the iret
    TRACE_IRQS_OFF
    movl TI_flags(%ebp), %ecx
    testl $_TIF_ALLWORK_MASK, %ecx  # current->work
    jne syscall_exit_work

这段汇编不算晦涩,首先是保存Ring 3的寄存器,然后检查该进程当前是否正在被trace,如果是的话跳去syscall_trace_entry,然后将syscall number与NR_syscalls进行比较,检查是否合法,不合法就跳去syscall_badsys处理错误,然后indirect call跳去sys_call_table中记录的某个地址,执行某个syscall handler,等到syscall处理完后再回到用户态。

sys_call_table的表项定义在syscalls_32.h中。

__SYSCALL_I386(0, sys_restart_syscall, sys_restart_syscall)
__SYSCALL_I386(1, sys_exit, sys_exit)
__SYSCALL_I386(2, sys_fork, stub32_fork)
__SYSCALL_I386(3, sys_read, sys_read)
__SYSCALL_I386(4, sys_write, sys_write)
__SYSCALL_I386(5, sys_open, compat_sys_open)
__SYSCALL_I386(6, sys_close, sys_close)
__SYSCALL_I386(7, sys_waitpid, sys32_waitpid)
__SYSCALL_I386(8, sys_creat, sys_creat)
__SYSCALL_I386(9, sys_link, sys_link)
__SYSCALL_I386(10, sys_unlink, sys_unlink)

还算是比较简洁直观的吧。

sysenter和sysexit

可能是因为中断门最初被设计时没有考虑要将来要成为syscall入口,其实现效率可能不算高(要做Ring级切换和权限检查),intel就搞了一套fast system call,亦即sysenter和sysexit。

在这套机制中增加了三个MSR寄存器,分别是:

IA32_SYSENTER_CS (MSR address 174H) - The lower 16 bits of this MSR are the segment selector for the privilege level 0 code segment. This value is also used to determine the segment selector of the privilege level 0 stack segment (see the Operation section). This value cannot indicate a null selector.

IA32_SYSENTER_EIP (MSR address 175H) - The value of this MSR is loaded into RIP (thus, this value references the first instruction of the selected operating procedure or routine). In protected mode, only bits 31:0 are loaded.

IA32_SYSENTER_ESP (MSR address 176H) - The value of this MSR is loaded into RSP (thus, this value contains the stack pointer for the privilege level 0 stack). This value cannot represent a non-canonical address. In protected mode, only bits 31:0 are loaded.

这三个MSR寄存器的内容在syscall32_cpu_init函数中被设置。

/* May not be __init: called during resume */
void syscall32_cpu_init(void)
{
    /* Load these always in case some future AMD CPU supports
    SYSENTER from compat mode too. */
    wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
    wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
    wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)ia32_sysenter_target);

    wrmsrl(MSR_CSTAR, ia32_cstar_target);
}

这个ia32_sysenter_target同样在entry_32.S中被定义,内容和system_call差不多,这里就不多说了。

syscall和sysret

我的理解是在intel搞sysenter这一套东西的同时,amd也搞了一套syscall的东西,这两样东西虽然功能都是差不多的,但是多多少少还是有一些兼容性上的差别,而且amd也保持了syscall指令在x86_64上面的兼容性。再加上amd先提出了x86_64扩展,所以到了后来,intel也不得不去兼容amd搞出的这一套东西。

syscall和sysenter差不多,只不过是MSR的名字换了一套,这些MSR寄存器在syscall_init函数中被初始化,入口继续被指向了system_call函数。

/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
    /*
     * LSTAR and STAR live in a bit strange symbiosis.
     * They both write to the same internal register. STAR allows to
     * set CS/DS but only a 32bit target. LSTAR sets the 64bit rip.
     */
    wrmsrl(MSR_STAR,  ((u64)__USER32_CS)<<48  | ((u64)__KERNEL_CS)<<32);
    wrmsrl(MSR_LSTAR, system_call);
    wrmsrl(MSR_CSTAR, ignore_sysret);

#ifdef CONFIG_IA32_EMULATION
    syscall32_cpu_init();
#endif

    /* Flags to clear on syscall */
    wrmsrl(MSR_SYSCALL_MASK,
            X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
            X86_EFLAGS_IOPL|X86_EFLAGS_AC);
}

关于strace

strace是一个用来追踪syscall的非常有用的工具,其实现是用到了ptrace(gdb也是用了ptrace),但追根究底靠的还是在前面汇编里面提到的那段桩代码。调用关系如下:

system_call -> syscall_trace_entry -> syscall_trace_enter -> tracehook_report_syscall_entry -> ptrace_report_syscall -> ptrace_notify -> ptrace_do_notify -> ptrace_stop

在ptrace_stop中给父进程即strace发送SIGTRAP信号,通知子进程syscall的发生。




blog comments powered by Disqus

Published

06 December 2013

Tags