- First of all: do "pair or mob problem solving" :-)
- How were you notified about the problem? (monitoring system, email, customer call, etc.)
- Does it happen to every customer? How does it affect them?
- If it impacts your customers, follow company protocol to notify them.
- Depending on the impact on the customers, set a clear deadline for solving it: it might be 0 seconds, 5 minutes, 1 hour, etc. After that time, apply a quick fix if possible (e.g. kill the container and start a new one?).
- Try to verify the problem yourself.
- Does it happen always? Only "sometimes"?
- When does it happen? Accessing where and doing exactly what?
- Look at the logs
- Look at the monitoring systems (Grafana, Kibana, etc.)
- Know the network topology: try pings with different packet sizes (I've seen MTU problems).
- Be very clear about the architecture and technologies involved.
- External dependencies: database, web services... Is everything up and running? Access it "by hand".
- Are there several instances? What is shared among the instances? (e.g. same DB).
- Does it happen only in one instance or all of them?
- It would be nice to have some "system functional tests" prepared, so that you can check each element independently: web server, app server, database, message brokers, etc.
- Go for the tools :-)
- After everything is finished, write a public postmortem!!
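The MTU check mentioned in the list above can be sketched as follows. This is a minimal sketch, not a definitive procedure: `mtu_payload` and `probe_mtu` are hypothetical helper names, and the `-M do` (don't fragment) and `-s` (payload size) options assume the Linux iputils `ping`. The 28 bytes subtracted are the 20-byte IPv4 header (without options) plus the 8-byte ICMP header.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header, no options) minus 8 bytes (ICMP header).
mtu_payload() {
    echo $(( $1 - 28 ))
}

# Ping a host once per candidate MTU with the "don't fragment" bit set.
# Assumes Linux iputils ping; other ping implementations use different flags.
probe_mtu() {
    host=$1
    shift
    for mtu in "$@"; do
        if ping -c 1 -W 2 -M do -s "$(mtu_payload "$mtu")" "$host" >/dev/null 2>&1; then
            echo "MTU $mtu: ok"
        else
            echo "MTU $mtu: failed (packet too big, or host unreachable)"
        fi
    done
}
```

For example, `probe_mtu 192.168.5.3 1500 1492 1400` tries three common sizes. If the smaller sizes pass and 1500 fails, something on the path probably has a lower MTU.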
- sysdig
- = strace + tcpdump + htop + iftop + lsof + ...
- connections, files, containers running, etc.
- https://sysdig.com/opensource/sysdig/install/
docker run -i -t --name sysdig --privileged -v /var/run/docker.sock:/host/var/run/docker.sock -v /dev:/host/dev -v /proc:/host/proc:ro -v /boot:/host/boot:ro -v /lib/modules:/host/lib/modules:ro -v /usr:/host/usr:ro sysdig/sysdig
- Run
csysdig
- https://www.youtube.com/watch?v=UJ4wVrbP-Q8
- htop
- Network sniffers
- ngrep
- http://ngrep.sourceforge.net/usage.html
- I used it for Asterisk debugging: http://jonathanmanning.com/2009/11/17/how-to-sip-capture-using-ngrep-debug-sip-packets/
- https://seguridadyredes.wordpress.com/2010/02/24/esas-pequenas-utilidades-ngrep/
- E.g.
ngrep -W byline -t '192.168.5.151' port 8080
- Wireshark
- tshark
- tcpdump
tcpdump -i any -n host 192.168.5.3 and icmp
>>> prints out a description of the contents of packets on a network interface that match the expression
- Linux:
- General: https://gist.github.com/islomar/7f0c0ccb7172c62526ab
- Kernel version and distribution information:
uname -a
lsb_release -a
cat /etc/lsb-release
- Files:
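lsof
>> list open files
- ps:
ps -e
>> display all processes
ps -m
>> sort by memory (BSD/macOS ps; with Linux procps use: ps -e --sort=-%mem)
ps -r
>> sort by CPU (BSD/macOS ps; with Linux procps use: ps -e --sort=-%cpu)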
- sed (stream editor) and awk (text processing)
- Search
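- grep
grep -R
>> recurse into directories
grep -i
>> ignore case
grep -v
>> invert the search (print the lines that do not match)
zgrep <text> <gzipFileName>
>> search inside gzip-compressed files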
- ag
- Silver searcher, really fast (it uses pthreads).
- https://github.com/ggreer/the_silver_searcher
- E.g.
ag -i foo /bar/
>> search 'foo' ignoring case under the path /bar/
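The grep flags above combine naturally when digging through logs. Here is a minimal sketch, where `search_excluding` is a hypothetical helper name: it searches recursively and case-insensitively for one pattern, then drops the lines that also contain a second, unwanted pattern.

```shell
#!/bin/sh
# Hypothetical helper: recursive, case-insensitive search for a pattern,
# discarding any matching line that also contains the exclude pattern.
search_excluding() {
    pattern=$1
    exclude=$2
    dir=$3
    grep -R -i -- "$pattern" "$dir" | grep -v -i -- "$exclude"
}
```

For example, `search_excluding error healthcheck /var/log/myapp` would list every line mentioning "error" except the noisy health-check ones (the path is only an illustration).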
- Disk space:
- https://www.cyberciti.biz/faq/linux-check-disk-space-command/
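df -h
>> use -a to include all filesystems (including those of size 0)
du
- List the 10 largest entries (files and directories) under /etc/ eating disk space:
du -ha /etc/ | sort -h -r | head -n 10
>> sort -h (GNU sort) is needed to order the human-readable sizes printed by du -h; sort -n would rank 900K above 2G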
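A portable variant of the disk-usage listing can be sketched like this, for systems where `sort` cannot order human-readable sizes. `top_dirs` is a hypothetical helper name. It asks `du` for sizes in kilobytes so that a plain numeric sort is enough, and it lists directories only.

```shell
#!/bin/sh
# Print the N largest directories under a path, biggest first.
# du -k reports kilobytes, so a plain numeric sort orders the sizes correctly.
top_dirs() {
    dir=$1
    count=${2:-10}
    du -k "$dir" 2>/dev/null | sort -n -r | head -n "$count"
}
```

For example, `top_dirs /var 10` prints the ten biggest directories under /var. The first line is always the top-level directory itself, since its total includes everything below it.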
- Memory: https://www.binarytides.com/linux-command-check-memory-usage/
- Ports:
sudo lsof -i
sudo netstat -tulpn | grep :<portNumber>
>> you get the process ID (pid)
sudo ls -la /proc/<pid>/exe
>> you get the exact executable behind that process
grep <port_number> /etc/services
>> it tells you which service is conventionally registered for that port (not necessarily what is actually running on it)
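A plain grep for a port number in /etc/services also matches longer numbers (80 matches 8080), so an exact lookup can help. This is a minimal sketch: `service_for_port` is a hypothetical helper name, and the optional third argument (an alternative services file) is only there to make the function easy to try out.

```shell
#!/bin/sh
# Look up the service name conventionally registered for a port/protocol pair.
# Arguments: port, protocol (default tcp), services file (default /etc/services).
service_for_port() {
    port=$1
    proto=${2:-tcp}
    file=${3:-/etc/services}
    # Match the second column exactly (e.g. "80/tcp"), skipping comment lines.
    awk -v key="$port/$proto" '$1 !~ /^#/ && $2 == key { print $1; exit }' "$file"
}
```

For example, `service_for_port 162 udp` prints the registered name for UDP port 162, or nothing if the port is not listed.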
- Socket investigation:
- Check service status:
/etc/init.d/<service-name> [start | stop | status]
service <service-name> [start | stop | status]
- Network
- netcat
- Test UDP and TCP connections (send data), port scanner
- https://www.digitalocean.com/community/tutorials/how-to-use-netcat-to-establish-and-test-tcp-and-udp-connections-on-a-vps
echo 'hola' | nc -u <ip> 162
>> send a test UDP datagram to port 162 (the SNMP trap port)
nc -u -l -p 1666
>> listen on UDP port 1666 (some netcat variants expect: nc -u -l 1666)
- iptables
- Command-line utility for configuring the Linux kernel firewall, implemented within the Netfilter project.
- https://www.howtogeek.com/177621/the-beginners-guide-to-iptables-the-linux-firewall/
iptables -nvL
- e.g. to protect your machine from unwanted SSH connections
- conntrack
- Conntrack is a table that stores information about all connections to/from a VPS. Here is a good explanation on how it works: http://www.rigacci.org/wiki/lib/exe/fetch.php/doc/appunti/linux/sa/iptables/conntrack.html
- Connection tracking refers to the ability to maintain state information about a connection in memory tables, such as source and destination ip address and port number pairs (known as socket pairs), protocol types, connection state and timeouts. Firewalls that do this are known as stateful.
- When the kernel sends a packet, it stores where it was addressed, so that it does not have to calculate it again.
conntrack -L -p udp | grep =<port_number>
>> it lists the tracked UDP connections whose source or destination port matches
- https://gist.github.com/islomar/4ddc12320de8a9bff31d9774fef5ec9f
- https://blogs.oracle.com/javamagazine/java-flight-recorder-and-jfr-event-streaming-in-java-14
- Java memory: VisualVM, EclipseMAT, Oracle Mission Control >> GC
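- Get the MaxPermSize used by a running JVM (Java 7 and earlier; PermGen was removed in Java 8, where the analogous flag is MaxMetaspaceSize):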
jinfo -flag MaxPermSize <process_id>