richard yuwen
620 articles
My Web Markups - richard yuwen
  • Simply send a request to the server. Dynamically insert the Sidecar at deployment time: when we deploy a cloud-native application to the cloud, concretely into a Kubernetes pod, a Sidecar is automatically deployed into the same pod to empower the application. At runtime we change the cloud-native application's behavior: as said before, the client simply sends a request to the server, but here the request is hijacked into the Sidecar, and this change is transparent to the application. Various capabilities are implemented inside the Sidecar: everything the original SDK client implemented, such as service discovery, load balancing, routing, and so on. When implementing these capabilities the Sidecar can also connect to more infrastructure and to other middleware products, in which case the Service Mesh product becomes the bridge between the application and the infrastructure / middleware. The Sidecar's behavior can be controlled through a control plane, and this control is independent of the application. Looking at the Service Mesh pattern again from the application's perspective, treating the cloud and the Service Mesh product sunk into it as a black box: the application is developed in native style; it is deployed in the standard way without caring what happens underneath; the client simply sends a request to the server without caring how that is carried out, knowing only that the request is eventually delivered. The existence and the concrete working mode of the Service Mesh product are transparent to the cloud-native application running on it, yet at runtime these capabilities are dynamically granted to the application, so the application stays lightweight while keeping its original functionality (a minimal sidecar sketch follows this list). The Mesh pattern is not limited to service-to-service communication; it applies to more scenarios: Database Mesh for database access, Message Mesh for messaging systems. Mesh is not the only way to sink middleware into the infrastructure, and not every middleware needs to be turned into a Mesh; as mentioned earlier, some middleware can empower applications indirectly by integrating with the Mesh, typically monitoring, logging, and tracing. We are also exploring patterns beyond Mesh, such as a DNS pattern, which is still under exploration. To sum up the basic working principle of what we currently call Cloud Empower: first strip the capability implementation out of the application, which is the prerequisite for making the application lightweight; then dynamically empower the application at runtime, where the way capabilities are granted must itself be cloud-native, that is, provided dynamically at runtime without the application being aware. That concludes the first half of this talk on cloud native; the second half follows. As said at the beginning, this talk is meant as a starting point, and I look forward to more people sharing and discussing cloud native, more implementation patterns, and more directions for its development.
  • Strengthen and improve the application's runtime environment (i.e., the cloud) to help the application
  • A client for dynamic business configuration
  • Fat client
  • Middleware sunk into the infrastructure
  • Wrapped in a thick layer of non-business requirements
  • Implementations of non-business requirements are usually provided as class libraries and development frameworks
  • The underlying platform provides the basic runtime resources upward, while the application has to satisfy both business and non-business requirements
8 annotations
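To make the sidecar idea in the first highlight concrete, here is a minimal sketch, assuming Python, of a local proxy that the application talks to as if it were the server, while the proxy handles service discovery and load balancing. The discover_endpoints function, the "orders" service, the addresses, and port 15001 are hypothetical placeholders, not the API of any real Service Mesh product.

```python
# Minimal sketch of the sidecar idea: the app sends a plain request to
# localhost; the sidecar resolves the real endpoint and forwards the call.
# discover_endpoints() and all addresses/ports are hypothetical.
import random
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def discover_endpoints(service: str) -> list[str]:
    # Stand-in for service discovery (e.g., a registry or control-plane push).
    return {"orders": ["http://10.0.0.11:8080", "http://10.0.0.12:8080"]}[service]

class Sidecar(BaseHTTPRequestHandler):
    def do_GET(self):
        # Load balancing: pick one endpoint of the target service.
        target = random.choice(discover_endpoints("orders"))
        with urllib.request.urlopen(target + self.path) as upstream:
            body = upstream.read()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # The application keeps calling http://127.0.0.1:15001/... as if it were
    # the server; discovery, routing, and balancing live in the sidecar.
    HTTPServer(("127.0.0.1", 15001), Sidecar).serve_forever()
```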
  • Leverage points, ecosystem, closed loop, alignment, sorting things out, iteration, owner mindset
  • Being able to speak, write, and execute well are the three core requirements for professionals
  • Unpleasant things like infighting, blame-shifting, credit-grabbing, and task-snatching are rarely absent either
  • Skills in PPT, communication, expression, time management, design, documentation, and so on
  • Alert configuration and monitoring cleanup
  • Good planning ability and a clear evolution blueprint
  • System building requires a global perspective
  • Some people can grow a small charter into something much bigger
  • Thinking of things the leader has not thought of
  • Go talk to the relevant person directly; having them explain it once gets you most of the way there, which is far faster than reading docs or code
  • Communicating and reporting upward
  • Owner mindset
  • Proactively take on tasks, communicate, push projects forward, coordinate resources, report upward, and build influence
  • Take things on proactively and communicate feedback promptly
  • Proactively step out of your comfort zone; when you feel struggle and pressure, it is often the darkness before dawn, and that is when you grow fastest
  • Force yourself out of your comfort zone
  • Keep learning so that your technical ability and knowledge base stay proportional to your years of experience; then what is there to be anxious about at 35?
  • Architecture should lead the business
  • How engineers can develop product thinking and guide the product direction
  • System building: core system capabilities, system boundaries, system bottlenecks, layered service decomposition, service governance
  • At the code level there is even more to do: resource pooling, object reuse, lock-free design, splitting large keys, deferred processing, encoding and compression, GC tuning, plus language-specific high-performance practices (a pooling sketch follows this list)
  • At the architecture level you can apply caching, pre-processing, read/write splitting, asynchrony, parallelism, and so on
  • The process of going from technique to principle
  • Knowledge remains a few scattered points rather than a system, which is not only easy to forget but also narrows your vision and limits how you see problems
24 annotations
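For the code-level techniques highlighted above (resource pooling, object reuse), here is a minimal illustrative sketch of an object pool in Python; the BufferPool class, the pool size, and the buffer size are made up for illustration, not taken from any source in these highlights.

```python
# Minimal object-pool sketch: reuse expensive objects (buffers, connections)
# instead of allocating and discarding one per request.
from queue import Empty, Queue

class BufferPool:
    def __init__(self, size: int = 8, buf_bytes: int = 64 * 1024):
        self._buf_bytes = buf_bytes
        self._pool = Queue()
        for _ in range(size):
            self._pool.put(bytearray(buf_bytes))

    def acquire(self) -> bytearray:
        try:
            return self._pool.get_nowait()
        except Empty:
            # Pool exhausted: fall back to a fresh allocation.
            return bytearray(self._buf_bytes)

    def release(self, buf: bytearray) -> None:
        self._pool.put(buf)

pool = BufferPool()
buf = pool.acquire()
# ... fill and use buf while handling one request ...
pool.release(buf)
```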
  • Allocations can't be over-committed
  • Non-root cgroups can distribute domain resources to their children only when they don't have any processes of their own
  • Only one process can be migrated on a single write(2) call
  • use cases where multiple cgroups write to a single inode simultaneously are not supported well
  • cgroup writeback is implemented on ext2, ext4, btrfs, f2fs, and xfs
  • per-cgroup dirty memory states
  • dirty memory ratio
  • how much the workload is being impacted due to lack of memory
  • memory.pressure
  • memory.stat
  • memory.events
  • Memory usage hard limit
  • Memory usage throttle limit
  • Best-effort memory protection
  • Protections can be hard guarantees or best effort soft boundaries
  • Memory is stateful and implements both limit and protection models
  • cgroup is a mechanism to organize processes hierarchically and distribute system resources along the hierarchy in a controlled and configurable manner (a configuration sketch follows this list)
  • "min" and "max"
  • "weight"
  • Limits can be over-committed
  • [0, max] and defaults to "max"
  • [1, 10000] with the default at 100
  • absolute resource guarantee
  • weight based resource distribution
  • The root cgroup should be exempt from resource control and thus shouldn't have resource control interface files
  • Consider cgroup namespaces as delegation boundaries
  • namespace root
  • all non-root "cgroup.subtree_control" files can only contain controllers which are enabled in the parent's "cgroup.subtree_control" file.
  • not subject to the no internal process constraint
  • threaded domain or thread root
  • The io controller, in conjunction with the memory controller, implements control of page cache writeback IOs
  • CPU
  • Memory
  • IO
  • PID
  • Cpuset
  • Device
  • RDMA
  • HugeTLB
  • Misc
  • A read-only flat-keyed file
  • allows limiting the HugeTLB usage per control group
  • controller limit during page fault
  • anon_thp
44 annotations
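A minimal sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and root privileges, of how the interface files highlighted above fit together; the group names demo/web and the concrete values are illustrative.

```python
# Minimal cgroup v2 sketch: enable controllers for children via
# cgroup.subtree_control, then set weight-based and limit-based knobs.
import os

ROOT = "/sys/fs/cgroup"              # assumes a unified (v2) hierarchy
PARENT = os.path.join(ROOT, "demo")  # illustrative group names
CHILD = os.path.join(PARENT, "web")

def write(path: str, value: str) -> None:
    with open(path, "w") as f:
        f.write(value)

os.makedirs(CHILD, exist_ok=True)

# A child's cgroup.subtree_control can only contain controllers that are
# enabled in its parent's cgroup.subtree_control, so enable top-down.
write(os.path.join(ROOT, "cgroup.subtree_control"), "+cpu +memory +io")
write(os.path.join(PARENT, "cgroup.subtree_control"), "+cpu +memory +io")

# Weight-based distribution: range [1, 10000], default 100.
write(os.path.join(CHILD, "cpu.weight"), "200")
# Limits: range [0, max], default "max"; limits can be over-committed.
write(os.path.join(CHILD, "memory.max"), "2G")
# Best-effort protection: memory.low soft-guarantees memory to this group.
write(os.path.join(CHILD, "memory.low"), "512M")

# Only one process can be migrated per write(2) call to cgroup.procs.
write(os.path.join(CHILD, "cgroup.procs"), str(os.getpid()))

# memory.pressure shows how much the workload is impacted by lack of memory.
print(open(os.path.join(CHILD, "memory.pressure")).read())
```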
  • a memory-intensive process
  • out-of-the-box improvement over the kernel OOM killer
  • The kernel OOM handler’s main job is to protect the kernel
  • oomd
  • rejects a few and continues to run
  • Load shedding
  • Load shedding is a technique to avoid overloading and crashing a system by temporarily rejecting new requests. The idea is that all loads will be better served if the system rejects a few and continues to run, instead of accepting all requests and crashing due to lack of resources. In a recent test, a team at Facebook that runs asynchronous jobs, called Async, used memory pressure as part of a load shedding strategy to reduce the frequency of OOMs. The Async tier runs many short-lived jobs in parallel. Because there was previously no way of knowing how close the system was to invoking the OOM handler, Async hosts experienced excessive OOM kills. Using memory pressure as a proactive indicator of general memory health, Async servers can now estimate, before executing each job, whether the system is likely to have enough memory to run the job to completion. When memory pressure exceeds the specified threshold, the system ignores further requests until conditions stabilize. (A minimal load-shedding sketch follows this list.) The chart shows how Async responds to changes in memory pressure: when memory.full (in orange) spikes, Async sheds jobs back to the async dispatcher, shown by the blue async_execution_decision line. The results were significant: load shedding based on memory pressure decreased memory overflows in the Async tier and increased throughput by 25%. This enabled the Async team to replace larger servers with servers using less memory, while keeping OOMs under control.

oomd - memory pressure-based OOM: oomd is a new userspace tool similar to the kernel OOM handler, but it uses memory pressure to provide greater control over when processes start getting killed, and which processes are selected. The kernel OOM handler's main job is to protect the kernel; it is not concerned with ensuring workload progress or health. Consequently, it is less than ideal in terms of when and how it operates: it starts killing processes only after failing at multiple attempts to allocate memory, i.e., after a problem is already underway; it selects processes to kill using primitive heuristics, typically killing whichever one frees the most memory; and it can fail to start at all when the system is thrashing, where memory utilization remains within normal limits but workloads don't make progress, so the OOM killer never gets invoked to clean up the mess. Lacking knowledge of a process's context or purpose, the OOM killer can even kill vital system processes: when this happens, the system is lost, and the only solution is to reboot, losing whatever was running, and taking tens of minutes to restore the host. Using memory pressure to monitor for memory shortages, oomd can deal more proactively and gracefully with increasing pressure by pausing some tasks to ride out the bump, or by performing a graceful app shutdown with a scheduled restart. In recent tests, oomd was an out-of-the-box improvement over the kernel OOM killer and is now deployed in production on a number of Facebook tiers.

Case study: oomd at Facebook. See how oomd was deployed in production at Facebook in this case study looking at Facebook's build system, one of the largest services running at Facebook. oomd in the fbtax2 project: as discussed previously, the fbtax2 project team prioritized protection of the main workload by using memory.low to soft-guarantee memory to workload.slice, the main workload's cgroup. In this work-conserving model, processes in system.slice could use the memory when the main workload didn't need it.
There was a problem though: when a memory-intensive process in system.slice can no longer take memory due to the memory.low protection on workload.slice, the memory contention turns into IO pressure from page faults, which can compromise overall system performance. Because of limits set in system.slice's IO controller (which we'll look at in the next section of this case study), the increased IO pressure causes system.slice to be throttled. The kernel recognizes the slowdown is caused by lack of memory, and memory.pressure rises accordingly. oomd monitors the pressure, and once it exceeds the configured threshold, kills one of the processes, most likely the memory hog in system.slice, and resolves the situation before the excess memory pressure crashes the system.
  • outweigh the overhead of occasional OOM events
  • demand exceeds the total memory available
  • Overcommitting on memory—promising more memory for processes than the total system memory—is a key technique for increasing memory utilization
10 annotations
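A minimal sketch of the pressure-based load-shedding idea described above: check memory pressure before accepting a job and reject new work while pressure is high. The cgroup path, the 10% threshold, and the job function are illustrative assumptions; this is a sketch of the technique, not Facebook's Async or oomd implementation.

```python
# Sketch of pressure-based load shedding: before running a job, read the
# cgroup's memory.pressure and shed new work while "full" pressure is high.
# The path, threshold, and job body are illustrative only.
import time

PRESSURE_FILE = "/sys/fs/cgroup/workload.slice/memory.pressure"
FULL_AVG10_THRESHOLD = 10.0  # percent of time fully stalled on memory

def full_avg10() -> float:
    # File format: "some avg10=... avg60=... ...\nfull avg10=... ..."
    with open(PRESSURE_FILE) as f:
        for line in f:
            if line.startswith("full"):
                return float(line.split()[1].split("=")[1])
    return 0.0

def handle(job) -> bool:
    if full_avg10() > FULL_AVG10_THRESHOLD:
        return False          # shed: tell the dispatcher to retry later
    job()                     # accept: memory looks healthy enough
    return True

while True:
    accepted = handle(lambda: sum(range(1_000_000)))
    time.sleep(1 if accepted else 5)  # back off while shedding
```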
  • io.latency
  • You protect workloads with io.latency by specifying a latency target (e.g., 20ms). If the protected workload experiences average completion latency longer than its latency target value, the controller throttles any peers that have a more relaxed latency target than the protected workload. The delta between the prioritized cgroup's target and the targets of other cgroups is used to determine how hard the other cgroups are throttled: if a cgroup with io.latency set to 20ms is prioritized, cgroups with latency targets <= 20ms will never be throttled, while a cgroup with a 50ms target will get throttled harder than a cgroup with a 30ms target.

Interface: the interface for io.latency is in a format similar to the other controllers: MAJOR:MINOR target=<target time in microseconds>. When io.latency is enabled, you'll see additional stats in io.stat: depth=<integer>, the current queue depth for the group, and avg_lat=<time in microseconds>, the running average IO latency for this group, which gives a general idea of the overall latency you can expect for this workload on the specified disk. Note: all cgroup knobs can be configured through systemd; see the systemd.resource-control documentation for details. (A minimal configuration sketch follows this list.)

Using io.latency: the limits are applied only at the peer level in the hierarchy. This means that in the diagram below, only groups A, B, and C will influence each other, and groups D and F will influence each other. Group G will influence nobody. Thus, a common way to configure this is to set io.latency in groups A, B, and C.

Configuration strategies: generally you don't want to set a value lower than the latency your device supports. Experiment to find the value that works best for your workload: start higher than the expected latency for your device, and watch the avg_lat value in io.stat for your workload group to get an idea of the latency during normal operation. Use this value as a basis for your real setting: try setting it, for example, around 20% higher than the value in io.stat. Experimentation is key here since avg_lat is a running average and subject to statistical anomalies. Setting too tight a control (i.e., too low a latency target) provides greater protection to a workload, but it can come at the expense of overall system IO overhead if other workloads get throttled prematurely. Another important factor is that hard disk IO latency can fluctuate greatly: if the latency target is too low, other workloads can get throttled due to normal latency fluctuations, again leading to sub-optimal IO control. Thus, in most cases you'll want to set the latency target higher than the expected latency to avoid unnecessary throttling; the only question is by how much. Two general approaches have proven most effective:

Setting io.latency 20-25% higher than the usual expected latency. This provides a tighter protection guarantee for the workload. However, the tighter control can sometimes mean the system pays more in IO overhead, which leads to lower system-wide IO utilization. A setting like this can be effective for systems with SSDs.

Setting io.latency several times higher than the usual expected latency, especially for hard disks. A hard disk's usual uncontended completion latencies are between 7 and 20ms, but when contention occurs, the completion latency balloons quickly, easily reaching 10 times normal.
Because the latency is so volatile, workloads running on hard disks are usually not sensitive to small swings in completion latency; things break down only in extreme conditions when latency jumps several times higher (which isn't difficult to trigger). Effective protection can be achieved in cases like this by setting a relaxed target on the protected group (e.g., 50 or 75ms), and a higher setting for lower priority groups (e.g., an additional 25ms over the higher priority group). This way, the workload can have reasonable protection without significantly compromising hard disk utilization by triggering throttling when it isn't necessary.

How throttling works: io.latency is work conserving: as long as everybody is meeting their latency target, the controller doesn't do anything. Once a group starts missing its target, it begins throttling any peer group that has a higher target than itself. This throttling takes two forms: queue depth throttling, the number of outstanding IOs a group is allowed to have, where the controller clamps down relatively quickly, starting at no limit and going all the way down to 1 IO at a time; and artificial delay induction, for certain types of IO that can't be throttled without possibly affecting higher priority groups adversely, such as swapping and metadata IO. These types of IO are allowed to occur normally, but they are "charged" to the originating group. Once the victimized group starts meeting its latency target again, it will start unthrottling any peer groups that were throttled previously. If the victimized group simply stops doing IO, the global counter will unthrottle appropriately.

fbtax2 IO controller configuration: as discussed previously, the goal of the fbtax2 cgroup hierarchy was to protect workload.slice. In addition to the memory controller settings, the team found that IO protections were also necessary to make it all work. When memory pressure increases, it often translates into IO pressure. Memory pressure leads to page evictions: the higher the memory pressure, the more page evictions and re-faults, and therefore more IOs. It isn't hard to generate memory pressure high enough to saturate a disk with IOs, especially the rotating hard disks that were used on the machines in the fbtax2 project. To correct for this, the team used a strategy similar to strategy 2 described above: they prioritized workload.slice by setting its io.latency higher than expected, to 50ms. This provides more protection for workload.slice than for system.slice, whose io.latency is set to 75ms. When workload.slice has been delayed by lack of IO past its 50ms threshold, it gets IO priority: the kernel limits IO from system.slice and reallocates it to workload.slice so the main workload can keep running. hostcritical.slice was given a similar level of protection as workload.slice, since any problems there can also impact the main workload; in this case it used memory.min to guarantee it will have enough to keep running. Though they knew system.slice needed lower IO priority, the team determined the 75ms number through trial and error, modifying it repeatedly until they achieved the right balance between protecting the main workload and ensuring the stability of system.slice. In the final installment of this case study, we'll summarize the strategies used in the fbtax2 project, and look at some of the utilization gains that resulted in Facebook's server farms.
  • This is where you specify IO limits
  • O
  • accounting of all IOs per-cgroup
  • IOPS
  • system has the flexibility to limit IO to low priority workloads
7 annotations
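A minimal sketch of an fbtax2-style io.latency setup, following the MAJOR:MINOR target=<microseconds> format quoted above; the slice names and the 50ms/75ms values echo the case study, while the device 8:0 and the script itself are only illustrative assumptions.

```python
# Sketch of the io.latency setup described above: a tighter latency target
# for the protected main workload, a looser one for system.slice.
# Device "8:0" and the exact paths are illustrative.
import os

ROOT = "/sys/fs/cgroup"
DEVICE = "8:0"

def set_io_latency(cgroup: str, target_us: int) -> None:
    # Interface format: "MAJOR:MINOR target=<target time in microseconds>"
    with open(os.path.join(ROOT, cgroup, "io.latency"), "w") as f:
        f.write(f"{DEVICE} target={target_us}")

set_io_latency("workload.slice", 50_000)  # 50ms: protected main workload
set_io_latency("system.slice", 75_000)    # 75ms: throttled first under contention

# Once io.latency is enabled, io.stat gains depth= and avg_lat=; avg_lat is a
# running average useful for choosing a target (e.g., ~20% above normal).
print(open(os.path.join(ROOT, "workload.slice", "io.stat")).read())
```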
23 annotations
  • Keep a certain resource buffer
  • Unified fill-in scheduling
  • Give guidance on resource usage, so that usage is guaranteed while preventing the waste caused by over-provisioning
  • Assess and warn early, so containers can be automatically scaled out or migrated in time
  • Complementary in time and space
  • Application profiles provide a basis for scheduling
  • Application
  • Profile
  • Applications use resources in fairly regular patterns. The usual practice is to classify them by resource usage as compute-intensive, memory-intensive, storage-intensive, and so on, but such a simple classification cannot describe resource usage across the CPU / memory / storage / network dimensions and the time dimension. We instead apply reinforcement-learning algorithms to an application's historical data to extract its resource-usage features and cluster applications into categories, forming application profiles. Application profiles provide a basis for scheduling: based on the profiles, different classes of applications are scheduled together with affinity / anti-affinity rules, so that their resource needs complement each other in time and space instead of interfering with each other. For the multiple containers of one application, the profile can also be used to assess container health and raise early warnings, so containers are automatically scaled out or migrated before the business is affected. Profiles also give guidance on future resource requests, preventing the waste caused by over-provisioning while still guaranteeing resource usage.

Serverless and delay tolerance: in earlier architectures a large share of business applications were long-running services, whose defining trait is that they serve continuously. In practice many applications do not need to. Take an image-conversion application: product images uploaded by merchants need to be converted into several sizes, watermarked or stamped with a logo, and uploaded to storage. The application only needs to run when a user uploads an image, and it tolerates delays of up to tens of seconds. For such event-driven, delay-tolerant applications we push a move from long-running services to the serverless architecture provided by JDOS. Serverless has played a huge role in turning long-running services into offline computation. Serverless application tasks and big-data offline computation tasks are abstracted into unified batch jobs. When a batch job is submitted to Archimedes, it carries a task description covering the task function, task type, resource description, and delay-tolerance time, and Archimedes schedules its execution. The delay-tolerance time is the longest delay the task can tolerate before execution: the task need not run immediately after submission, which gives Archimedes an important input for planning its pipeline in advance.

Resource fragmentation and time/space reuse: servers purchased in different batches have different resource ratios, and different applications request different ratios as well. A fit-based scheduling algorithm easily leaves a server with its CPU quota fully allocated but tens of GB of memory free, or with its memory allocated but several CPU cores free. We call this resource fragmentation, and it occurs on almost every physical machine. Long-running services, especially user-facing ones, show clear daily peaks and troughs, and different services consume different resources, so uneven utilization across time and space is the norm in the cluster. Fragmentation and uneven time/space distribution together waste an enormous amount of resources. We prefer long-running services to stay in place and migrate as rarely as possible, so Archimedes fills the fragments and the off-peak capacity with batch jobs in a unified fill-in fashion, making full use of fragments and reusing resources across time and space (a fill-in scheduling sketch follows this list). Archimedes does not only schedule the current resources and tasks: combining application profiles with batch-job descriptions, it plans scheduling for the coming period ahead of time, so the business runs normally while resources are used fully, effectively preventing resource competition between batch jobs and long-running services. Archimedes always keeps a certain resource buffer for the resource demands of sudden traffic spikes.

SLA: both long-running services and batch jobs sign an SLA with Archimedes, which guarantees the resource usage and availability of the service or task; long-running services in particular get priority for resources and availability. When resource competition is about to occur between batch jobs and long-running services, or between long-running services, Archimedes sorts by the availability and priority in the SLA and evicts or migrates tasks and services in that order, so that high-priority long-running services get resources first and, unless necessary, are not migrated or affected by other tasks' or services' resource competition.

Cluster autonomy: JDOS provides automated cluster management; Archimedes turns the cluster from automated into autonomous management. The community kube-controller provides a controller example but has many drawbacks; for instance, the controller gets too little information, only node status from the apiserver, so it cannot accurately judge whether a node is offline, which leads to misjudgments and frequent container migrations. Archimedes therefore extends the controller into a separate system, MAGI. MAGI has five nodes spread across different physical PODs in the data center and is responsible for the cluster's autonomous decisions: it uses a voting and consultation scheme to double-check decisions such as whether a node is offline or whether containers need to migrate, and only after MAGI's votes are tallied are nodes actually removed and containers actually migrated.

Beyond scheduling: Archimedes is not only JDOS's scheduling system but also a data-analysis platform for application resource usage, providing direct data support for project management, business, audit, and procurement. A data center's main power consumption is cooling, and cooling mainly serves to cool CPUs; based on application profiles and the scheduling plan, Archimedes adjusts servers' CPU frequencies accordingly to save energy. This has been practiced at scale in two core data centers and cut power consumption by 17%. In 2018 we will further optimize the scheduling algorithms, refine application profiles, and improve scheduling accuracy, doing more in consolidating computation, improving efficiency, and saving energy, and we will then share more production scheduling data and models with the industry.
  • Resource scheduling and eviction in JD's data centers
  • There is a huge gap between the resources business systems request and what they actually use
  • Relying on adding machines to handle instantaneous peak traffic
  • The cluster's average resource utilization improved 3x
  • 30 million core-hours
14 annotations
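A minimal sketch of the fill-in idea described in the long highlight above: delay-tolerant batch jobs are packed into the CPU/memory fragments left by long-running services while a fixed buffer is kept for traffic spikes. The Node and BatchJob structures, the 10% buffer, and the example numbers are illustrative assumptions, not the Archimedes implementation.

```python
# Sketch of fragment "fill-in" scheduling: batch jobs are packed into the
# spare CPU/memory left by long-running services, keeping a safety buffer.
# The node list, job sizes, and 10% buffer are illustrative only.
from dataclasses import dataclass
from typing import Optional

BUFFER = 0.10  # fraction of each node's free capacity reserved for spikes

@dataclass
class Node:
    name: str
    cpu_free: float   # cores
    mem_free: float   # GiB

@dataclass
class BatchJob:
    name: str
    cpu: float
    mem: float
    delay_tolerance_s: int  # how long the job may wait before it must run

def place(job: BatchJob, nodes: list[Node]) -> Optional[str]:
    for node in nodes:
        # Only use capacity above the reserved buffer on each dimension.
        if (job.cpu <= node.cpu_free * (1 - BUFFER)
                and job.mem <= node.mem_free * (1 - BUFFER)):
            node.cpu_free -= job.cpu
            node.mem_free -= job.mem
            return node.name
    return None  # no fragment fits; the job can wait up to its tolerance

nodes = [Node("n1", cpu_free=2.0, mem_free=40.0),
         Node("n2", cpu_free=10.0, mem_free=4.0)]
print(place(BatchJob("img-convert", cpu=1.5, mem=2.0, delay_tolerance_s=30), nodes))
```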
27 annotations