Question:
Problem
I am trying to install the CloudWatch agent on Amazon Linux 2 instances at startup, using EC2 user data. For some reason, after cloud-init has finished running, all services get restarted and the configuration file I put in the CloudWatch agent folder is no longer there.
I am using a custom AMI pre-built with Packer. During the build, an Ansible template writes my configuration file to /opt/aws/amazon-cloudwatch-agent/etc/custom/amazon-cloudwatch-agent.json. This is the configuration I want to use; it holds all the metrics and logs I want to send. At startup, after the agent is installed, I copy it to /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json.
Here is my userdata script:
    #!/bin/bash
    yum install amazon-cloudwatch-agent -y
    cp /opt/aws/amazon-cloudwatch-agent/etc/custom/amazon-cloudwatch-agent.json /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
    /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
What is happening
After startup has finished, I can see that the script ran correctly. Running cat /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log shows the following:
    2021/07/16 13:33:46 I! I! Detected the instance is EC2
    2021/07/16 13:33:46 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
    /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
    2021/07/16 13:33:46 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_amazon-cloudwatch-agent.json ...
    Valid Json input schema.
    I! Detecting run_as_user...
    No csm configuration found.
    Configuration validation first phase succeeded
    2021/07/16 13:33:46 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
    2021/07/16 13:33:46 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
    2021/07/16 13:33:46 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_amazon-cloudwatch-agent.json ...
    2021/07/16 13:33:46 I! Detected runAsUser: root
    2021/07/16 13:33:46 I! Changing ownership of [/opt/aws/amazon-cloudwatch-agent/logs /opt/aws/amazon-cloudwatch-agent/etc /opt/aws/amazon-cloudwatch-agent/var] to root:root
    2021-07-16T13:33:46Z I! Starting AmazonCloudWatchAgent 1.247347.4
    2021-07-16T13:33:46Z I! Loaded inputs: netstat diskio logfile mem net processes swap cpu disk
    2021-07-16T13:33:46Z I! Loaded aggregators:
    2021-07-16T13:33:46Z I! Loaded processors: delta ec2tagger
    2021-07-16T13:33:46Z I! Loaded outputs: cloudwatch cloudwatchlogs
    2021-07-16T13:33:46Z I! Tags enabled: host=ip-XX-XX-X-XXX.eu-west-1.compute.internal
    2021-07-16T13:33:46Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-XX-XX-X-XXX.eu-west-1.compute.internal", Flush Interval:1s
    2021-07-16T13:33:46Z I! [logagent] starting
    2021-07-16T13:33:46Z I! [logagent] found plugin cloudwatchlogs is a log backend
    2021-07-16T13:33:46Z I! [logagent] found plugin logfile is a log collection
    2021-07-16T13:33:46Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
    =======> 2021-07-16T13:33:46Z I! cloudwatch: get unique roll up list [[AutoScalingGroupName] [InstanceId InstanceType] []]
    2021-07-16T13:33:46Z I! cloudwatch: publish with ForceFlushInterval: 30s, Publish Jitter: 11s
    2021-07-16T13:33:46Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
    2021-07-16T13:33:46Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
    =======> 2021-07-16T13:33:47Z I! [logagent] piping log from APP-DEV-php-errors-logs/XX.XX.X.XXX(/var/log/php-fpm/error.log) to cloudwatchlogs
    2021-07-16T13:33:54Z I! Profiler is stopped during shutdown
    2021-07-16T13:33:54Z I! [agent] Hang on, flushing any cached metrics before shutdown
    2021/07/16 13:33:55 I! I! Detected the instance is EC2
    2021/07/16 13:33:55 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
    /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
    2021/07/16 13:33:55 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/default ...
    Valid Json input schema.
    I! Detecting run_as_user...
    No csm configuration found.
    No log configuration found.
    Configuration validation first phase succeeded
    2021/07/16 13:33:55 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
    2021/07/16 13:33:55 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
    2021/07/16 13:33:55 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/default ...
    2021/07/16 13:33:55 I! Detected runAsUser: cwagent
    2021/07/16 13:33:55 I! Changing ownership of [/opt/aws/amazon-cloudwatch-agent/logs /opt/aws/amazon-cloudwatch-agent/etc /opt/aws/amazon-cloudwatch-agent/var] to 994:992
    2021/07/16 13:33:55 I! Set HOME: /home/cwagent
    2021-07-16T13:33:55Z I! Starting AmazonCloudWatchAgent 1.247348.0
    2021-07-16T13:33:55Z I! Loaded inputs: disk mem
    2021-07-16T13:33:55Z I! Loaded aggregators:
    2021-07-16T13:33:55Z I! Loaded processors: ec2tagger
    2021-07-16T13:33:55Z I! Loaded outputs: cloudwatch
    2021-07-16T13:33:55Z I! Tags enabled: host=ip-XX-XX-X-XXX.eu-west-1.compute.internal
    2021-07-16T13:33:55Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-XX-XX-X-XXX.eu-west-1.compute.internal", Flush Interval:1s
    2021-07-16T13:33:55Z I! [logagent] starting
    2021-07-16T13:33:55Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
    =======> 2021-07-16T13:33:55Z I! cloudwatch: get unique roll up list []
    2021-07-16T13:33:55Z I! cloudwatch: publish with ForceFlushInterval: 1m0s, Publish Jitter: 26s
    2021-07-16T13:33:55Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
    2021-07-16T13:33:55Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
    2021-07-16T13:39:07Z I! [processors.ec2tagger] ec2tagger: Refresh is no longer needed, stop refreshTicker.
As you can see, the initial command from user data runs fine, and my custom metrics and logs are collected (see the =======> marks before the relevant lines).
However, a few seconds later, after cloud-init is over, the CloudWatch agent is somehow restarted by systemd. At that point the file amazon-cloudwatch-agent.json is absent from the filesystem, so the agent runs with default parameters.
If I rerun the command manually after startup, everything works fine, but I need this automated for when autoscaling launches new instances.
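The two agent starts in the log can be compared more easily with a small helper. This is only a sketch: the function name `summarize_agent_starts` is made up for illustration, and it assumes the log keeps the line format shown above. It prints one line per agent start with the version, the file loaded from amazon-cloudwatch-agent.d/, and the run-as user.

```shell
#!/bin/bash
# Summarize every agent start recorded in the CloudWatch agent log.
# summarize_agent_starts is a hypothetical helper, not part of the agent.
summarize_agent_starts() {
    local log_file="${1:-/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log}"
    awk '
        # Remember the last config file read from amazon-cloudwatch-agent.d/
        /Reading json config file path: .*amazon-cloudwatch-agent\.d\// {
            n = split($0, parts, "/")
            src = parts[n]
            sub(/ \.\.\..*$/, "", src)   # drop the trailing " ..."
        }
        # Remember the user the agent decided to run as
        /Detected runAsUser:/ { user = $NF }
        # Emit one summary line per agent start
        /Starting AmazonCloudWatchAgent/ {
            print "version=" $NF " config=" src " user=" user
        }
    ' "$log_file"
}

# Usage on the instance (the path defaults to the agent's own log):
#   summarize_agent_starts
```

For the log in the question this prints `version=1.247347.4 config=file_amazon-cloudwatch-agent.json user=root` followed by `version=1.247348.0 config=default user=cwagent`. The version change between the two starts may mean the package was reinstalled or upgraded in between, but that is an inference from the log, not something confirmed here.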
What I have tried
I tried launching the CloudWatch agent directly with systemd, trying to make the config file read-only with chown, and only fetching the config and letting the system start the agent itself. The problem persists in every case.
Thank you for your help
Answer:
Workaround
The preinstalled ssm-agent conflicts with the CloudWatch agent. Uninstall ssm-agent during the Packer build:
    sudo yum erase amazon-ssm-agent --assumeyes
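If the removal runs as a Packer shell provisioner, it may be worth making the step idempotent and failing the build when the package is still there afterwards. A minimal sketch, assuming an RPM-based image; `remove_package` is a name invented for this example:

```shell
#!/bin/bash
# Remove a package only if it is installed, then confirm it is gone.
# remove_package is a hypothetical helper for a Packer provisioning script.
remove_package() {
    local pkg="$1"
    if rpm -q "$pkg" >/dev/null 2>&1; then
        sudo yum erase "$pkg" --assumeyes
    fi
    # Fail the build if the package is still present after the erase.
    if rpm -q "$pkg" >/dev/null 2>&1; then
        echo "ERROR: $pkg is still installed" >&2
        return 1
    fi
    echo "$pkg is not installed"
}

# Usage in the provisioning script:
#   remove_package amazon-ssm-agent
```

The second `rpm -q` check is there so that a silent failure of `yum erase` does not produce an AMI that still ships ssm-agent.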
Explanation
I finally found out that the newly installed CloudWatch agent conflicts with the SSM agent that is installed by default in the Amazon Linux 2 image.
I first tried an ugly workaround: replacing the ExecStart line of the amazon-cloudwatch-agent service with sed in the user data:
    sed -i '/ExecStart/c\ExecStart=/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/etc/custom/amazon-cloudwatch-agent.json' /etc/systemd/system/amazon-cloudwatch-agent.service
That way, when the service was restarted after instance startup, it would use my custom configuration.
However, I then found that the service file was also replaced after cloud-init ended.
Reviewing the system messages, I noticed that ssm-agent was reloading some configuration after cloud-init ended, so I suspected it was the culprit.
I ended up uninstalling it in the Packer build that creates my AMI, so that it is not present at instance startup. After that, my configuration was no longer overwritten.
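To confirm that files such as the agent config or the service unit really are replaced after cloud-init, one option is to record checksums at the end of the user data and re-check them once the instance has settled. This is a sketch, not what was actually run here; the function names and the baseline path in the usage comment are assumptions.

```shell
#!/bin/bash
# Record checksums of files that should not change after boot.
# Both helpers are hypothetical; they only wrap sha256sum from coreutils.
snapshot_files() {
    local baseline="$1"
    shift
    sha256sum "$@" > "$baseline"
}

# Re-check the recorded files. Prints one status line per file and returns
# non-zero if any file was modified or has disappeared.
verify_files() {
    local baseline="$1"
    sha256sum --check "$baseline" 2>/dev/null
}

# Usage, with an arbitrary baseline location:
#   snapshot_files /var/tmp/cwagent.sha256 \
#       /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
#       /etc/systemd/system/amazon-cloudwatch-agent.service
#   ...later, after cloud-init has finished...
#   verify_files /var/tmp/cwagent.sha256
```

A failing check only shows that something touched the files, not what did; the system messages are still needed to identify the process responsible.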
Note that I do not have a deep understanding of how ssm-agent works, and there is probably a proper way to set up the CloudWatch agent through some SSM configuration.
Since we do not currently use SSM and I do not have enough time to study this option, I chose this compromise.
If someone can come up with a cleaner solution, using ssm-agent through an automated method, this would be greatly appreciated.
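One direction that might be worth exploring, untested against this setup: the agent's control script also accepts an `ssm:` prefix for `-c`, which reads the JSON from a Systems Manager Parameter Store parameter instead of a local file. The sketch below only builds and optionally runs that command. The parameter name is hypothetical, the instance role would need permission to read it, and whether this survives whatever ssm-agent does after cloud-init is an open question.

```shell
#!/bin/bash
# Load the CloudWatch agent config from SSM Parameter Store instead of a
# local file. The function name and DRY_RUN switch are illustrative only.
CTL=/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl

load_config_from_parameter_store() {
    local parameter_name="$1"   # hypothetical name, chosen when storing the JSON
    if [ -z "$parameter_name" ]; then
        echo "usage: load_config_from_parameter_store <parameter-name>" >&2
        return 1
    fi
    local cmd=("$CTL" -a fetch-config -m ec2 -s -c "ssm:${parameter_name}")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Usage in user data (parameter name is an assumption):
#   load_config_from_parameter_store AmazonCloudWatch-my-config
```

Setting `DRY_RUN=1` prints the command without touching the agent, which makes it easy to check the arguments before baking them into user data.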