Basic monitoring (Prometheus, Grafana) setup using the nvidia/deepops library

Develiberta on Jan 20, 2022

Updated Jun 9, 2022

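Learning objectives

Understand what deepops is and be able to explain it.
Be able to set up a basic environment with deepops.

deepops overview

https://github.com/NVIDIA/deepops
Tools for GPU infrastructure and automation

Basic environment setup with deepops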

Install a supported operating system on all nodes.

Set up your provisioning machine.

 # Install software prerequisites and copy default configuration
 ./scripts/setup.sh

Create and edit the Ansible inventory.

 # Edit inventory
 # Add Slurm controller/login host to `slurm-master` group
 # Add Slurm worker/compute hosts to the `slurm-node` groups
 vi config/inventory

1. Under [all], list every node that Ansible will manage.
2. Under [slurm-master], list the master node.
3. Under [slurm-node], list the worker nodes.
4. Under [all:vars], in the "# SSH User" section, change ansible_user to your SSH user.
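
To illustrate the four inventory edits, the sketch below writes a sample inventory to a temporary file and prints the hosts found in each group. The hostnames (`login01`, `gpu01`, `gpu02`), the IP addresses, and the `ubuntu` user are placeholder assumptions, not values from deepops. The `[slurm-cluster:children]` group is assumed to match the default inventory, and the real `config/inventory` created by `setup.sh` contains more groups and comments than shown here. `hosts_in_group` is a hypothetical helper written for this example.

```shell
#!/bin/sh
set -eu

# Write a sample inventory to a scratch file (hypothetical hosts and user).
inventory="$(mktemp)"
trap 'rm -f "$inventory"' EXIT
cat > "$inventory" <<'EOF'
[all]
login01 ansible_host=10.0.0.1
gpu01   ansible_host=10.0.0.11
gpu02   ansible_host=10.0.0.12

[slurm-master]
login01

[slurm-node]
gpu01
gpu02

# Assumed grouping that a `-l slurm-cluster` limit would refer to.
[slurm-cluster:children]
slurm-master
slurm-node

[all:vars]
# SSH User
ansible_user=ubuntu
EOF

# Print the first field of every non-empty, non-comment line in one group.
# Usage: hosts_in_group GROUP_NAME INVENTORY_FILE
hosts_in_group() {
  awk -v group="[$1]" '
    /^\[/ { in_group = ($0 == group); next }
    in_group && NF && $1 !~ /^#/ { print $1 }
  ' "$2"
}

masters="$(hosts_in_group slurm-master "$inventory")"
workers="$(hosts_in_group slurm-node "$inventory" | xargs)"
echo "master:  $masters"
echo "workers: $workers"
```

This check only reads the file offline. It does not replace the Ansible connectivity test in the verification step.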

(optional) Modify config/group_vars/*.yml to set configuration parameters

 vi config/group_vars/slurm_cluster.yml
 # Review the whole file and adjust the settings as needed

Verify the configuration.

 ansible all -m raw -a "hostname" -k

(optional) Modify playbooks/*.yml to set configuration tasks

 vi playbooks/slurm-cluster.yml
 # Review the whole file and adjust the tasks as needed

Install Slurm.

 # NOTE: If SSH requires a password, add: `-k`
 # NOTE: If sudo on remote machine requires a password, add: `-K`
 # NOTE: If SSH user is different than current user, add: `-u ubuntu`
 ansible-playbook --ask-become-pass -k -l slurm-cluster playbooks/slurm-cluster.yml
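
The three notes map to optional flags on the same command. The sketch below is a dry-run helper that assembles the `ansible-playbook` invocation and prints it instead of running it. `build_slurm_cmd` and its yes/no arguments are hypothetical names introduced for this example and are not part of deepops. `-k`, `-K`, and `-u` are standard Ansible options, and `-K` is the short form of `--ask-become-pass`.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the ansible-playbook command for the chosen options.
# $1: "yes" if SSH needs a password       (adds -k)
# $2: "yes" if remote sudo needs a password (adds -K)
# $3: SSH user, or empty to use the current user (adds -u USER)
build_slurm_cmd() {
  cmd="ansible-playbook"
  if [ "$1" = "yes" ]; then cmd="$cmd -k"; fi
  if [ "$2" = "yes" ]; then cmd="$cmd -K"; fi
  if [ -n "$3" ]; then cmd="$cmd -u $3"; fi
  echo "$cmd -l slurm-cluster playbooks/slurm-cluster.yml"
}

# Password-based SSH and sudo, connecting as the (assumed) user "ubuntu".
cmd_with_passwords="$(build_slurm_cmd yes yes ubuntu)"
# Key-based SSH with passwordless sudo, same user as the current shell.
cmd_key_based="$(build_slurm_cmd no no "")"

echo "$cmd_with_passwords"
echo "$cmd_key_based"
```

Running the printed command from the root of the deepops checkout performs the actual installation.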

References

Slurm Deployment Guide https://github.com/NVIDIA/deepops/tree/master/docs/slurm-cluster

DevOps, Ansible


This post is licensed under CC BY 4.0 by the author.
