Hello everyone,

I hope this message finds you well. I'm reaching out to the community because I'm currently facing an issue with my high-performance computing setup, and I'm seeking your expertise and advice.

Here's a brief overview of my situation:

I have been using a Lenovo high-performance computing (HPC) system for the past 18 months, and it ran smoothly until recently. I primarily use it to run complex simulations through Lenovo Intelligent Computing Orchestration (LiCO). Lately, however, performance has dropped significantly, and I'm struggling to pinpoint the root cause.

Here are some details about my setup:

  • Hardware Configuration:

    • Server: Lenovo ThinkSystem SR650
    • CPU: Intel Xeon Gold 6248R
    • GPU: NVIDIA A100 Tensor Core GPU
    • RAM: 128GB Lenovo TruDDR4
    • Storage: 2TB Lenovo PCIe NVMe SSD
  • Software Environment:

    • Operating System: Lenovo Cloud Enabled Intelligent Computing OS
    • Lenovo Intelligent Computing Orchestration (LiCO)

The symptoms are simulation runtimes that are much longer than usual and occasional system freezes during these tasks. I've already tried basic troubleshooting, such as updating drivers and checking the system for malware and unwanted processes, but the problem persists.
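If it would help with diagnosis, GPU readings could be logged during a slow run along the lines of the sketch below. It is only a sketch under assumptions: it assumes `nvidia-smi` is on the PATH, the function names are hypothetical, and the 80 °C cutoff is an arbitrary illustrative value, not a documented limit for the A100.

```python
import csv
import io
import subprocess

# Properties requested through nvidia-smi's --query-gpu option.
FIELDS = ["timestamp", "temperature.gpu", "utilization.gpu", "clocks.sm", "memory.used"]


def parse_samples(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU row."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def sample_gpu():
    """Take one sample per GPU; requires nvidia-smi to be installed."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_samples(out)


def hot_samples(rows, limit_c=80):
    """Return samples at or above limit_c (an arbitrary, illustrative cutoff)."""
    return [
        r for r in rows
        if r["temperature.gpu"].isdigit() and int(r["temperature.gpu"]) >= limit_c
    ]
```

Calling `sample_gpu()` every few seconds while a simulation runs and comparing SM clock and temperature between fast and slow periods might show whether the GPU is throttling.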

Specifically, I'm interested in hearing about:

  1. Any common pitfalls or issues associated with high-performance computing setups.
  2. Recommended tools or methods for performance monitoring and debugging.
  3. Tips for optimizing the performance of simulation workloads run through LiCO.

If anyone has any insights or suggestions, I would greatly appreciate your input. Your expertise could be immensely helpful in getting my high-performance computing setup back on track.

Thank you in advance for your time and assistance. I look forward to learning from the community's collective knowledge and experience.

Best regards,

Charlie