Syscall in Linux
先从int 0x80开始讲起
在trap_init函数中初始化异常和中断处理,其中0x80中断门被这样设置:
void __init trap_init(void)
{
......
#ifdef CONFIG_IA32_EMULATION
set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif
#ifdef CONFIG_X86_32
set_system_trap_gate(SYSCALL_VECTOR, &system_call);
set_bit(SYSCALL_VECTOR, used_vectors);
#endif
......
}
在i386中处理函数被定义成system_call,这个东西被定义在entry_32.S中,用汇编写成。
/*
* syscall stub including irq exit should be protected against kprobes
*/
.pushsection .kprobes.text, "ax"
# system call handler stub
ENTRY(system_call)
RING0_INT_FRAME # can't unwind into user space anyway
ASM_CLAC
pushl_cfi %eax # save orig_eax
SAVE_ALL
GET_THREAD_INFO(%ebp)
# system call tracing in operation / emulation
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
jnz syscall_trace_entry
cmpl $(NR_syscalls), %eax
jae syscall_badsys
syscall_call:
call *sys_call_table(,%eax,4)
movl %eax,PT_EAX(%esp) # store the return value
syscall_exit:
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_ANY) # make sure we don't miss an interrupt
# setting need_resched or sigpending
# between sampling and the iret
TRACE_IRQS_OFF
movl TI_flags(%ebp), %ecx
testl $_TIF_ALLWORK_MASK, %ecx # current->work
jne syscall_exit_work
这段汇编不算晦涩,首先是保存Ring 3的寄存器,然后检查该进程当前是否正在被trace,如果是的话跳去syscall_trace_entry,然后将syscall number与NR_syscalls进行比较,检查是否合法,不合法就跳去syscall_badsys处理错误,然后indirect call跳去sys_call_table中记录的某个地址,执行某个syscall handler,等到syscall处理完后再回到用户态。
sys_call_table的表项定义在syscalls_32.h中。
__SYSCALL_I386(0, sys_restart_syscall, sys_restart_syscall)
__SYSCALL_I386(1, sys_exit, sys_exit)
__SYSCALL_I386(2, sys_fork, stub32_fork)
__SYSCALL_I386(3, sys_read, sys_read)
__SYSCALL_I386(4, sys_write, sys_write)
__SYSCALL_I386(5, sys_open, compat_sys_open)
__SYSCALL_I386(6, sys_close, sys_close)
__SYSCALL_I386(7, sys_waitpid, sys32_waitpid)
__SYSCALL_I386(8, sys_creat, sys_creat)
__SYSCALL_I386(9, sys_link, sys_link)
__SYSCALL_I386(10, sys_unlink, sys_unlink)
还算是比较简洁直观的吧。
sysenter和sysexit
可能是因为中断门最初被设计时没有考虑要将来要成为syscall入口,其实现效率可能不算高(要做Ring级切换和权限检查),intel就搞了一套fast system call,亦即sysenter和sysexit。
在这套机制中增加了三个MSR寄存器,分别是:
IA32_SYSENTER_CS (MSR address 174H) - The lower 16 bits of this MSR are the segment selector for the privilege level 0 code segment. This value is also used to determine the segment selector of the privilege level 0 stack segment (see the Operation section). This value cannot indicate a null selector.
IA32_SYSENTER_EIP (MSR address 175H) - The value of this MSR is loaded into RIP (thus, this value references the first instruction of the selected operating procedure or routine). In protected mode, only bits 31:0 are loaded.
IA32_SYSENTER_ESP (MSR address 176H) - The value of this MSR is loaded into RSP (thus, this value contains the stack pointer for the privilege level 0 stack). This value cannot represent a non-canonical address. In protected mode, only bits 31:0 are loaded.
这三个MSR寄存器的内容在syscall32_cpu_init函数中被设置。
/* May not be __init: called during resume */
void syscall32_cpu_init(void)
{
/* Load these always in case some future AMD CPU supports
SYSENTER from compat mode too. */
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)ia32_sysenter_target);
wrmsrl(MSR_CSTAR, ia32_cstar_target);
}
这个ia32_sysenter_target同样在entry_32.S中被定义,内容和system_call差不多,这里就不多说了。
syscall和sysret
我的理解是在intel搞sysenter这一套东西的同时,amd也搞了一套syscall的东西,这两样东西虽然功能都是差不多的,但是多多少少还是有一些兼容性上的差别,而且amd也保持了syscall指令在x86_64上面的兼容性。再加上amd先提出了x86_64扩展,所以到了后来,intel也不得不去兼容amd搞出的这一套东西。
syscall和sysenter差不多,只不过是MSR的名字换了一套,这些MSR寄存器在syscall_init函数中被初始化,入口继续被指向了system_call函数。
/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
/*
* LSTAR and STAR live in a bit strange symbiosis.
* They both write to the same internal register. STAR allows to
* set CS/DS but only a 32bit target. LSTAR sets the 64bit rip.
*/
wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
wrmsrl(MSR_LSTAR, system_call);
wrmsrl(MSR_CSTAR, ignore_sysret);
#ifdef CONFIG_IA32_EMULATION
syscall32_cpu_init();
#endif
/* Flags to clear on syscall */
wrmsrl(MSR_SYSCALL_MASK,
X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
X86_EFLAGS_IOPL|X86_EFLAGS_AC);
}
关于strace
strace是一个用来追踪syscall的非常有用的工具,其实现是用到了ptrace(gdb也是用了ptrace),但追根究底靠的还是在前面汇编里面提到的那段桩代码。调用关系如下:
system_call -> syscall_trace_entry -> syscall_trace_enter -> tracehook_report_syscall_entry -> ptrace_report_syscall -> ptrace_notify -> ptrace_do_notify -> ptrace_stop
在ptrace_stop中给父进程即strace发送SIGTRAP信号,通知子进程syscall的发生。
blog comments powered by Disqus