Shanghai, China
June 24–26, 2019


Simultaneous translation will be provided for all keynote and breakout sessions.

KC+CNC - Machine Learning + Data
Tuesday, June 25
 

11:00 CST

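Hyperparameter Tuning Using Kubeflow - Richard Liu, Google; Johnu George, Cisco
In machine learning, hyperparameter tuning is the process of finding the optimal configuration for training a model. Choosing the right hyperparameters can dramatically improve an algorithm's performance, but the search space grows exponentially as new hyperparameters are added.

Neural architecture search (NAS) is a subfield of automated machine learning closely related to hyperparameter tuning. In recent research, networks generated by NAS algorithms have even outperformed hand-designed ones. Like hyperparameter tuning, however, the process can be time-consuming and expensive.

With this in mind, we introduce Katib, a Kubernetes-based, cloud-native automated machine learning platform. As part of the Kubeflow platform, Katib offers a rich set of management APIs in the form of custom resources. We will demonstrate how to configure a hyperparameter tuning study and how to compare experiment results in the UI.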
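
The exponential growth of the search space mentioned in the abstract can be illustrated with a small sketch. It is not Katib code and does not use any Katib API; the hyperparameter names and candidate values below are hypothetical, chosen only to show how an exhaustive grid multiplies in size with each added hyperparameter.

```python
from itertools import product
from math import prod

# Hypothetical search space; names and values are illustrative only.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
    "num_layers": [2, 4, 8],
}


def grid_size(space):
    """Number of configurations an exhaustive grid search would visit."""
    return prod(len(values) for values in space.values())


def grid(space):
    """Yield every configuration in the grid as a dict."""
    names = list(space)
    for combo in product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


size_before = grid_size(search_space)  # 3 * 3 * 3 configurations

# Adding one more hyperparameter with three candidate values
# multiplies the number of configurations by three.
search_space["dropout"] = [0.1, 0.3, 0.5]
size_after = grid_size(search_space)

print(size_before, size_after)
```

With k candidate values per hyperparameter and n hyperparameters, the grid holds k^n configurations, which is why exhaustive search quickly becomes impractical.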


Speakers

Johnu George

Staff Engineer, Nutanix
Johnu George is a staff engineer at Nutanix. His research interests are in the areas of distributed systems and scalable infrastructure for big data applications. He is an active open source contributor and currently a PMC member of Apache Mnemonic. He is actively involved in...

Richard Liu

Senior Software Engineer, Google
Richard Liu is a Senior Software Engineer at Google Cloud. He is currently an owner and maintainer of the TensorFlow operator and Katib projects in Kubeflow. Previously, he worked as a software developer at Microsoft Azure.



Tuesday June 25, 2019 11:00 - 11:35 CST
403/405

16:00 CST

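Large-Scale Distributed Deep Learning on Kubernetes Clusters - Yuan Tang, Ant Financial; Yong Tang, MobileIron
This talk focuses on deploying large-scale distributed deep learning on Kubernetes. It also covers how to use operators to manage and automate machine learning training. We will share our experience and compare two open source Kubernetes operators: tf-operator and mpi-operator. Both manage TensorFlow training jobs, but they use different distribution strategies, which leads to different performance in terms of CPU, GPU, and network utilization.

Deep learning jobs are both network-intensive and GPU-intensive, so it is important to tune orchestration appropriately. Imbalance arises easily and leaves compute capacity idle, which is far more costly on GPU nodes than on CPU nodes. We will share our experience in the hope of offering useful insights that help get better economics from machine learning workloads.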

Speakers

Yuan Tang

Principal Software Engineer, Red Hat
Yuan is a principal software engineer at Red Hat, working on OpenShift AI. Previously, he led teams building AI infrastructure and platforms at various companies, including Alibaba and Akuity. He's a project lead of Argo and Kubeflow, a maintainer of TensorFlow and XGBoost, and...

Yong Tang

Senior Director of Engineering, Ivanti
Yong Tang is Senior Director of Engineering at Ivanti. He is a core maintainer of CoreDNS and contributes to many container, cloud-native, and machine learning projects for the open source community. In addition to CoreDNS, he is a maintainer of Docker/Moby. He is also a maintainer...


Tuesday June 25, 2019 16:00 - 16:35 CST
620

16:45 CST

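Optimizing Microservices by Learning from Traces - Zhang Wentao and Yang Yang, IBM
Tracing plays an increasingly important role in the microservices world, helping with troubleshooting and bottleneck analysis. But what can we learn from the huge volume of traces generated every day, and how do we make the most of this valuable data? In this session, we will discuss how to use Kubeflow to train models on trace data generated by Istio, revealing patterns you may not be able to spot by eye, such as seasonal performance degradation, the biggest contributors to system failures, and correlations between microservices. This helps with root cause analysis and ultimately leads to better-optimized, better-performing microservices.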

Speakers

Emma Yang

Software Engineer, IBM
Yang Yang is an advisory software engineer at IBM. She has worked on monitoring for cloud platforms for over 4 years and has extensive experience with large-scale, dynamic environments. Besides cloud-related work, she is also very interested in front-end technologies. She had delivered the...



Tuesday June 25, 2019 16:45 - 17:20 CST
620

17:30 CST

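Minimizing the GPU Cost of Running Deep Learning on Kubernetes - Kai Zhang and Yang Che, Alibaba
More and more data scientists run Nvidia GPU-based deep learning jobs on Kubernetes. At the same time, they find that idle GPUs in the cluster waste more than 40% of the cost. How Kubernetes can help improve GPU utilization has therefore become an important challenge.
In this talk, we will present a GPU sharing solution built on native Kubernetes and cover its design and implementation in detail. Key topics include:
- How to define the GPU sharing API
- How to schedule GPU sharing in a Kubernetes cluster without changing the core scheduler code
- How to integrate a GPU isolation solution with Kubernetes
We will also demo how TensorFlow users can run different jobs on the same GPU device in a Kubernetes cluster.
Since this solution was adopted, overall GPU utilization has improved significantly, especially for AI model development, debugging, and inference services.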

Speakers

Kai Zhang

Staff Engineer, Alibaba
Kai Zhang is a staff engineer at Alibaba Cloud. He has worked on container service products and enterprise solution development for 3 years. Before that, he worked on deep learning platforms, cloud computing, distributed systems, and SOA for over 10 years. Recently, he is exploring...

Yang Che

Senior Engineer, Alibaba Cloud
Yang Che is a senior engineer at Alibaba Cloud. He works on the Alibaba Cloud container service team and focuses on Kubernetes and container-related product development. Yang also works on building an elastic machine learning platform on those technologies. He is an active contributor...


Tuesday June 25, 2019 17:30 - 18:05 CST
620

18:15 CST

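Multi-Cloud Machine Learning Data and Workflow with Kubernetes - Lei Xue, Momenta; Fei Xue, Google
Autonomous vehicles need hardware-accelerated machine learning to solve key problems such as tracking and classification. Momenta trains ML models both on premises and in public clouds, each with different GPUs and network interfaces (InfiniBand, RoCE).

In this talk, we will discuss how to build a multi-cloud ML platform with Kubernetes, in particular how we manage training data across environments, how we handle multi-tenancy and gang scheduling, and how we support heterogeneous hardware.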

Speakers

Lei Xue

Infrastructure Tech Lead, Momenta
Lei Xue currently works as an AI Infrastructure tech lead at Momenta. He leads a development team that focuses on GPU cluster management for Kubernetes and Docker. Previously, Lei was a member of the KataContainers/Hyper team and a software engineer at Oracle/Sun Microsystems. He is also...

Fei Xue

Product Manager, Ant Financial
Fei Xue is currently a product manager at Ant Financial working on ML and data platforms. Fei was an early member of the Kubeflow team at Google, an open source effort to help developers and enterprises develop and deploy cloud-native machine learning everywhere. Fei comes from a distributed...


Tuesday June 25, 2019 18:15 - 18:50 CST
620
 
Wednesday, June 26
 

11:20 CST

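Anomaly Detection for Cloud Native Storage - Seiya Takei, Yahoo Japan Corporation; Xing Yang, OpenSDS
Integrating with heterogeneous storage is a constant challenge in cloud native environments. Detecting and fixing problems promptly is essential for mission-critical workloads.

In this session, Takei-san and Xing will present a generic volume model designed for retrieving data from heterogeneous storage in a cloud native environment. They will also show an ML module that analyzes the data and detects anomalous behavior, and explain how it helps Yahoo Japan catch problems early and keep its cloud native storage systems running.

Volume metrics such as IOPS, bandwidth, latency, and capacity are collected from the storage backends (which serve workloads running on top of Kubernetes) and sent to a Prometheus server. The ML module retrieves the data from Prometheus and applies algorithms for anomaly detection. Finally, the results are evaluated and alerts are raised when needed.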
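
The abstract does not say which detection algorithm the ML module uses. As a rough illustration of the "retrieve metrics, detect anomalies, alert" flow, here is a minimal sketch that substitutes a simple rolling z-score check. The function name, window size, threshold, and sample latency values are all assumptions, and the Prometheus query step is replaced by an in-memory list.

```python
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard
    deviations. A stand-in technique, not the speakers' algorithm."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: flag any change from the constant value.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical volume latency samples in milliseconds; in the pipeline
# described above these would come from a Prometheus query instead.
latency_ms = [5.0, 5.2, 4.9, 5.1, 5.0, 5.1, 4.8, 25.0, 5.0, 5.2]

alerts = detect_anomalies(latency_ms)
for index in alerts:
    print(f"alert: sample {index} has latency {latency_ms[index]} ms")
```

A trailing window keeps the check cheap, but a single spike inflates the window's variance for the next few samples, so a production detector would likely need something more robust.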

Speakers

Seiya Takei

Storage Engineer, Yahoo Japan Corporation
Seiya Takei is in charge of private cloud compute and storage at Yahoo Japan. Yahoo Japan has participated in the End-User Advisory Committee of OpenSDS, an open source project under the Linux Foundation, since July 2017. Seiya Takei has speaking experience at local conferences...

Xing Yang

Tech Lead, VMware by Broadcom
Xing Yang is a Tech Lead in the Cloud Native Storage team at VMware by Broadcom. She is a co-chair of CNCF Storage TAG, a co-chair of the Kubernetes Storage SIG, a co-chair of the Data Protection WG, and a maintainer in Kubernetes CSI. Before joining VMware, Xing was the Lead Architect...


Wednesday June 26, 2019 11:20 - 11:55 CST
620
 
