[AWS] Logging EC2 vCPU Usage with Boto3 (Python)

장그래 2023. 7. 12. 10:16

Overview

While building and operating a Kubeflow-based MLOps environment, we ran into an issue where new Nodes were no longer being created.

Cause

https://docs.aws.amazon.com/ko_kr/servicequotas/latest/userguide/intro.html

The cause was that the vCPU limit in AWS Service Quotas was set to 2000.
(In AWS, Service Quotas define the allotted amount for each service.)
As the number of MLOps users grew, the number of Nodes grew, and vCPU usage grew along with it until it reached the limit.
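
The applied limit can also be read programmatically. The following is a minimal sketch, assuming the relevant quota is "Running On-Demand G and VT instances" and that its quota code is `L-DB2E81BA`; both are assumptions that should be checked in the Service Quotas console. The helper names `get_vcpu_quota` and `make_client` are hypothetical.

```python
def get_vcpu_quota(client, quota_code="L-DB2E81BA"):
    """Return the applied value of one EC2 quota.

    `client` is a boto3 "service-quotas" client (or any object exposing
    the same get_service_quota method). The default quota code is an
    assumption; confirm it for your account before relying on it.
    """
    response = client.get_service_quota(ServiceCode="ec2", QuotaCode=quota_code)
    return response["Quota"]["Value"]


def make_client(region_name="ap-northeast-2"):
    # boto3 is imported lazily so get_vcpu_quota can be tried with a stub client.
    import boto3

    return boto3.client("service-quotas", region_name=region_name)


# Usage (requires AWS credentials):
#   limit = get_vcpu_quota(make_client())
```

Passing the client in as an argument keeps the lookup easy to check without touching a real AWS account.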

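Solution

Raising the Service Quotas limit solved the immediate problem, but the limit cannot be raised indefinitely, so continuous monitoring was needed.
For continuous monitoring, we decided to ship the vCPU usage counted against the Service Quotas limit to a centralized log system.
(Architecture: Python logging -> centralized log system (Splunk, ELK, etc.) -> Teams or Slack alert)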
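
Centralized log systems are usually easier to query when each event is one JSON object per line. The following is a minimal sketch of the Python logging stage using only the standard library. The names `JsonFormatter` and `build_logger` and the field names `vcpu_used` and `region` are hypothetical, not part of the original setup.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record;
        # copy the ones this sketch expects (hypothetical field names).
        for key in ("vcpu_used", "region"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name="ec2_vcpu", stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# Usage:
#   log = build_logger()
#   log.info("vcpu Utilization: 1536",
#            extra={"vcpu_used": 1536, "region": "ap-northeast-2"})
```

With a numeric field such as `vcpu_used` in the event, the alert rule in the log system can compare against a threshold without parsing the message string.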

This let us monitor Service Quotas continuously and operate Kubeflow more reliably.
(The code is shared below.)

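import logging

import boto3

logger = logging.getLogger(__name__)


def ec2_vcpu():
    # Instance-type keywords to count (the GPU families used by the cluster)
    keywords = ["g3", "g4dn"]
    ec2 = boto3.client('ec2', region_name='ap-northeast-2')
    # Stopped and terminated instances do not consume vCPU quota, so skip them
    filters = [{'Name': 'instance-state-name', 'Values': ['pending', 'running']}]
    vcpus = 0
    vcpu_cache = {}  # instance type name -> default vCPU count
    # Paginate so that accounts with many instances are counted in full
    paginator = ec2.get_paginator('describe_instances')
    for page in paginator.paginate(Filters=filters):
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                instance_type_name = instance['InstanceType']
                if any(keyword in instance_type_name for keyword in keywords):
                    # Look up each instance type only once
                    if instance_type_name not in vcpu_cache:
                        instance_type = ec2.describe_instance_types(InstanceTypes=[instance_type_name])
                        vcpu_cache[instance_type_name] = instance_type['InstanceTypes'][0]['VCpuInfo']['DefaultVCpus']
                    vcpus += vcpu_cache[instance_type_name]
    msg = "vcpu Utilization: " + str(vcpus)
    logger.info(msg)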
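
The function above logs only the raw vCPU count. For an alert to fire before the quota is exhausted, the count has to be compared against the limit somewhere. The following is a minimal sketch of doing that comparison on the Python side. The helper name `quota_status` and the 80% warning ratio are hypothetical choices, not something the original setup specifies.

```python
import logging


def quota_status(used, limit, warn_ratio=0.8):
    """Return (log level, message) for vCPU usage measured against a limit.

    The level is WARNING once usage reaches `warn_ratio` of the limit,
    otherwise INFO. The 0.8 default is an arbitrary illustrative value.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    level = logging.WARNING if ratio >= warn_ratio else logging.INFO
    msg = f"vcpu Utilization: {used}/{limit} ({ratio:.1%})"
    return level, msg


# Usage:
#   level, msg = quota_status(1700, 2000)
#   logging.getLogger(__name__).log(level, msg)
```

Encoding the severity in the log level lets the alert rule in Splunk or ELK stay a simple filter on level rather than an arithmetic expression.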
