Categories: Technology

Maximizing AI Performance with Intel Arc A770 GPU on Windows

Summary

This article presents the Intel Arc A770 GPU as a competitive option for AI-intensive workloads, especially for those working in the Windows ecosystem. This segment has traditionally been dominated by NVIDIA GPUs and CUDA, but Intel’s latest offering provides a solid alternative. The article also shows how to work with the Arc A770 GPU directly in Windows, without the need for the Windows Subsystem for Linux (WSL).

With practical steps and detailed information, we will cover how to configure and optimize the Arc A770 GPU for various AI models, including Llama2, Llama3, and Phi3. The article also includes performance metrics and memory usage statistics to give you a full understanding of the GPU’s capabilities. Whether you are a developer or a researcher, this article will give you the knowledge you need to use the Intel GPU effectively and efficiently in your AI projects.

Introduction

Intel recently gave me the opportunity to test its Arc A770 GPU for AI workloads. While detailed specs can be found here, one feature that immediately stands out is the 16GB of VRAM. That’s 4GB more than its natural competitor, the NVIDIA RTX 3060, making it an attractive option for AI computing at a similar price.

Intel Arc A770 GPU was used for testing

Since we work primarily with Microsoft technologies at Plain Concepts, I decided to explore the GPU’s capabilities on Windows. Given my regular work with PyTorch, I started with the Intel Extension for PyTorch to see whether I could run models like Llama2, Llama3, and Phi3 and evaluate their performance.

I initially considered using the Windows Subsystem for Linux (WSL), based on suggestions in several blog posts and videos that native Windows support might not be quite ready. However, I decided to experiment with my own Windows setup first, and after a few tweaks and adjustments, I was pleased to find that everything worked just fine.


In this article, I share my experience and the steps I took to run Llama2, Llama3, and Phi3 models on the Intel Arc A770 GPU directly in Windows. I also present performance metrics, including execution time and memory usage for each model. The goal is to provide a comprehensive overview of how to effectively use the Intel Arc A770 GPU for AI-intensive tasks in Windows.

Setup on Windows

Intel provides a complete guide on how to install the PyTorch extension for Arc GPUs.

Intel Extension for PyTorch Installation Guide

However, setting up the Arc A770 GPU on Windows required some initial tweaks. Here is a quick rundown; for detailed instructions, see the corresponding repository.

  • Since oneAPI requires setting several environment variables from CMD, I recommend installing the Pscx extension for PowerShell, which makes it easy to invoke CMD scripts.
  • When running on Windows with Mamba, the PATH environment variable can become too long, causing problems when setting up oneAPI environment variables. To avoid this problem, I have included a setup_vars.ps1 script that sets up the necessary environment variables for oneAPI.
  • The Phi3 example requires installing a pre-release version of the ipex-llm library, which implements optimizations for all of Phi3’s core operations. After installing this library, you must reinstall the transformers library.

Using the Intel Extension for PyTorch

As stated in their GitHub repository: “The Intel® Extension for PyTorch extends PyTorch capabilities with modern function optimizations to further improve performance on Intel hardware.” Specifically, it “provides easy GPU acceleration for discrete Intel GPUs using the PyTorch xpu device.” This means that with this extension you can take advantage of the Intel Arc A770 GPU for AI tasks without relying on CUDA/NVIDIA, and get even more performance gains by using one of the optimized models.

Luckily, the extension uses the same API as PyTorch, so you typically only need to make a few code changes to get it working on an Intel GPU. Here’s a quick rundown of the changes required:

  1. Check the GPU

Import the Intel Extension for PyTorch and check that the GPU is detected correctly.

This step is not strictly necessary, but it is a good idea to confirm that the GPU is detected before running the model.
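A minimal check might look like the sketch below, assuming torch and the intel_extension_for_pytorch package are installed as described in Intel’s guide:

```python
# Minimal sketch: verify that the Intel XPU device is visible to PyTorch.
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch

print(f"IPEX version: {ipex.__version__}")
print(f"XPU available: {torch.xpu.is_available()}")
if torch.xpu.is_available():
    print(f"Device name: {torch.xpu.get_device_name(0)}")
```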

  2. Move the model to the GPU

Once the model is loaded, move it to the GPU.
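A sketch of what this looks like with a Hugging Face model; the model name and float16 precision are illustrative choices rather than requirements:

```python
# Sketch: load a causal LM and move it to the Intel GPU ("xpu" device).
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)
model = model.eval().to("xpu")                       # same API as .to("cuda"), different device name
model = ipex.optimize(model, dtype=torch.float16)    # optional extra IPEX optimizations
```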

  3. Move the inputs to the GPU

Finally, when using the model, make sure the input data is also on the GPU.
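Continuing the previous sketch (model already loaded and moved to the xpu device), the inputs are moved the same way before generation; the prompt and token count here are just examples:

```python
# Sketch: tokenize a prompt and place the input tensors on the same "xpu" device as the model.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Explain what an XPU device is.", return_tensors="pt").to("xpu")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```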

Additional changes for measuring performance

To accurately measure performance, I also added some extra code to capture the total inference time and the maximum memory allocation. This basically consists of warming up each model before inference, plus some code to wait for the GPU to finish and print the results in a readable form. Visit the examples repository to learn more and reproduce the results on your machine.
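The sketch below illustrates the general idea, assuming the xpu device exposes the usual synchronize and memory-statistics helpers mirroring the CUDA API (exact names may vary between IPEX releases); it is not the exact code from the repository:

```python
# Sketch of the measurement approach: warm up once, then time the real run
# and read the peak memory allocation on the xpu device.
import time
import torch

def timed_generate(model, inputs, max_new_tokens):
    model.generate(**inputs, max_new_tokens=8)   # warm-up pass
    torch.xpu.synchronize()                      # wait for queued kernels to finish
    torch.xpu.reset_peak_memory_stats()

    start = time.time()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.xpu.synchronize()
    elapsed = time.time() - start

    peak_gb = torch.xpu.max_memory_allocated() / 1024**3
    print(f"Inference time: {elapsed:.1f} s, peak memory: {peak_gb:.1f} GB")
    return output
```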

Llama2

Llama2 is the second version of Meta’s popular open-source Llama LLM. After setting up the environment and making the changes described in the previous section to the official Llama2 samples, I was able to run the Llama2 model on the Intel Arc A770 GPU for both plain text completion and chat tasks.

Running Llama2 7B on Intel Arc A770 GPU

The Llama2 7B model takes up about 14GB of memory with float16 precision. Since the GPU has 16GB available, it runs without any problems. Below are the results of a sample run with a maximum of 128 output tokens.
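The ~14GB figure follows from the parameter count: roughly 7 billion float16 weights at 2 bytes each. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope estimate for the weight memory alone (illustrative).
params = 7e9           # ~7 billion parameters in Llama2 7B
bytes_per_param = 2    # float16 = 2 bytes per weight
print(f"~{params * bytes_per_param / 1e9:.0f} GB of weights")  # ~14 GB
```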

Running Llama2 7B Chat on Intel Arc A770 GPU

Likewise, the Llama2 7B chat results were impressive, as the model generated human-like responses in a conversational tone. The chat sample ran smoothly on the Intel Arc A770 GPU, demonstrating its capabilities for chat applications. In this case, the sample is run with 512 tokens in the output to further stress the hardware.

Llama3

Llama3 is the latest version of Meta’s Llama LLM, released a couple of months ago. Fortunately, the Intel team was quick to add optimizations for the model to the extension, so the full power of the Intel Arc A770 GPU can be taken advantage of. The process was very similar to the one used for Llama2, with the same environment and official samples.

Running Llama3 8B on Intel Arc A770 GPU

The Llama3 8B model takes up just over 15GB of memory with float16 precision. Since the GPU has 16GB available, it runs without any issues. Below are the results of a sample run with a maximum of 64 output tokens.

Running Llama3 8B Instruct on Intel Arc A770 GPU

As with the Llama2 examples, I also tested the chat capabilities of the Llama3 8B Instruct model, increasing the number of output tokens to 256.

Phi3

Phi3 is Microsoft’s latest model, released on April 24, and is designed for instruction-following tasks. It’s a smaller model than Llama2 and Llama3 (3.8B parameters for the smallest version), but it’s still quite capable, providing detailed and informative responses.

Although Phi3 optimizations for Intel hardware are not yet included in the Intel Extension for PyTorch, we can use the third-party ipex-llm library to optimize the model. Since Phi3 is fairly new, I had to install a pre-release version, which implements optimizations for all of Phi3’s core operations. Note that ipex-llm is not an official Intel library but a community-driven one, so it is not officially supported by Intel.
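A sketch of how this typically looks, assuming the pre-release ipex-llm package exposes a transformers-compatible AutoModelForCausalLM with a load_in_4bit option (its documented usage pattern); the exact sample in the repository may differ:

```python
# Sketch: load Phi3 through ipex-llm, which quantizes the weights to 4-bit on load,
# then move the quantized model to the Intel GPU.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    load_in_4bit=True,        # 4-bit weight quantization
    trust_remote_code=True,
)
model = model.to("xpu")
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True
)
```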

After optimizing the model, the rest of the code changes remained the same as for Llama2 and Llama3, so I was able to run the Phi3 model on the Intel Arc A770 GPU without any problems.

Running Phi3 Mini 4K Instruct on Intel Arc A770 GPU

The Phi3 Mini 4K Instruct model takes up about 2.5GB of memory with 4-bit precision. Since it has far fewer parameters than the Llama models, it also runs much faster. Below are the results of a sample inference with a maximum of 512 output tokens.

Performance comparison

To give a comprehensive assessment of the Intel Arc A770 GPU’s performance, I compared the runtime and memory usage of each model on both the Intel Arc A770 and an NVIDIA RTX 3080 Ti. The metrics were obtained using identical code samples and environment configurations on both GPUs, ensuring a fair and accurate comparison.

Intel Arc A770

| Model | Output tokens | Completion time | Max memory used |
|---|---|---|---|
| meta-llama/Llama-2-7b-hf | 128 | ~7.7 sec | ~12.8 GB |
| meta-llama/Llama-2-7b-chat-hf | 512 | ~22.1 sec | ~13.3 GB |
| meta-llama/Meta-Llama-3-8B | 64 | ~11.5 sec | ~15.1 GB |
| meta-llama/Meta-Llama-3-8B-Instruct | 256 | ~30.7 sec | ~15.2 GB |
| microsoft/Phi-3-mini-4k-instruct | 512 | ~5.9 sec | ~2.6 GB |

NVIDIA RTX 3080 Ti

| Model | Output tokens | Completion time | Max memory used |
|---|---|---|---|
| meta-llama/Llama-2-7b-hf | 128 | ~15.5 sec | ~12.8 GB |
| meta-llama/Llama-2-7b-chat-hf | 512 | ~51.5 sec | ~13.3 GB |
| meta-llama/Meta-Llama-3-8B | 64 | ~16.9 sec | ~15.1 GB |
| meta-llama/Meta-Llama-3-8B-Instruct | 256 | ~66.7 sec | ~15.2 GB |
| microsoft/Phi-3-mini-4k-instruct | 512 | ~16.7 sec | ~2.6 GB |
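The normalized per-token figures discussed in the next section can be reproduced directly from the two tables above, for example:

```python
# Illustrative recomputation of per-token times from the completion times and
# output token counts reported in the two tables above.
results = {
    "Llama-2-7b-hf":            {"tokens": 128, "arc_a770": 7.7,  "rtx_3080_ti": 15.5},
    "Llama-2-7b-chat-hf":       {"tokens": 512, "arc_a770": 22.1, "rtx_3080_ti": 51.5},
    "Meta-Llama-3-8B":          {"tokens": 64,  "arc_a770": 11.5, "rtx_3080_ti": 16.9},
    "Meta-Llama-3-8B-Instruct": {"tokens": 256, "arc_a770": 30.7, "rtx_3080_ti": 66.7},
    "Phi-3-mini-4k-instruct":   {"tokens": 512, "arc_a770": 5.9,  "rtx_3080_ti": 16.7},
}

for name, r in results.items():
    arc = r["arc_a770"] / r["tokens"] * 1000   # ms per output token on the A770
    nv = r["rtx_3080_ti"] / r["tokens"] * 1000 # ms per output token on the 3080 Ti
    print(f"{name}: {arc:.0f} ms/token (A770) vs {nv:.0f} ms/token (3080 Ti)")
```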

Performance comparison chart

The following graph shows the normalized per-token execution time for each model on the Intel Arc A770 and NVIDIA RTX 3080 Ti GPUs.

*Margin of error: less than 0.1 second

As you can see, the Intel Arc A770 GPU performed exceptionally well across all models, delivering competitive execution times. In particular, it outperformed the NVIDIA RTX 3080 Ti by a factor of two or more in most cases.

Conclusion

The Intel Arc A770 GPU has proven to be a great option for running AI on a local Windows machine, offering an alternative to the CUDA/NVIDIA ecosystem. The GPU’s ability to efficiently run models such as Llama2, Llama3, and Phi3 demonstrates its potential and high performance. Despite initial setup issues, the process was relatively straightforward and the results were impressive.

At its core, the Intel Arc A770 GPU is a powerful tool for AI applications on Windows. With some initial tweaks and code changes, it handled inference, chat, and instruction tasks efficiently. This opens up new possibilities for developers and researchers who prefer or need to work in a Windows environment without relying on NVIDIA GPUs and CUDA. As Intel continues to improve its GPU offerings and software support, the Arc A770 and future models could become major players in the AI community.

Links of interest

Code examples used in this article can be found in the IntelArcA770 GitHub repository.

Additionally, below are some resources that I find essential for learning more about Intel’s hardware and library ecosystem for AI workloads.

Recommendations
