Parallel computing on NVIDIA GPUs, or a supercomputer in every home. GPU computing and GPU mining: a complete guide

Today, news about the use of GPUs for general-purpose computing is everywhere. In just a couple of years, words such as CUDA, Stream and OpenCL have become some of the most quoted terms on the IT web. Yet far from everyone knows what these words mean and what technologies stand behind them. And for Linux users, accustomed to being left out of such novelties, all of this looks like a dark forest.

Birth of GPGPU

We are all used to thinking that the only component of a computer capable of executing whatever code it is given is the central processing unit. For a long time, almost all mainstream PCs were equipped with a single processor that handled every conceivable calculation: operating system code, all of our software, and the viruses too.

Later came multi-core processors and multi-processor systems, in which there were several such components. This allowed machines to perform multiple tasks at the same time, and the overall (theoretical) performance of the system rose in proportion to the number of cores installed in the machine. However, it turned out that designing and manufacturing multi-core processors was too difficult and expensive.

Each core had to host a full-fledged processor with the complex and intricate x86 architecture: its own (rather large) cache, an instruction pipeline, SSE units, many blocks performing optimizations, and so on. The growth in the number of cores therefore slowed down significantly, and the people in white university lab coats, for whom two or four cores were clearly not enough, found a way to harness other computing power for their scientific calculations - the power available in abundance on the video card (as a result, the BrookGPU tool even appeared, emulating an additional processor via DirectX and OpenGL function calls).

GPUs, free of many of the shortcomings of the central processor, turned out to be excellent and very fast calculating machines, and very soon GPU manufacturers themselves began to take a close look at the work of these scientific minds (nVidia simply hired most of the researchers). The result was nVidia's CUDA technology, which defines an interface that makes it possible to offload the computation of complex algorithms onto the GPU without any crutches. It was later followed by ATi (AMD) with its own variant of the technology called Close to Metal (now Stream), and shortly thereafter by OpenCL, a standard originally proposed by Apple.

GPU is our everything?

Despite all its advantages, the GPGPU technique has several problems. The first of these is its very narrow scope. GPUs have stepped far ahead of the central processor in terms of raw computing power and total number of cores (video cards carry computing units consisting of more than a hundred cores), but such a high density is achieved only by simplifying the design of the chip itself as much as possible.

In essence, the main task of the GPU boils down to mathematical calculations using simple algorithms that receive not very large amounts of predictable data as input. For this reason, GPU cores have a very simple design, meager cache sizes and a modest instruction set, which ultimately makes them cheap to produce and possible to pack very densely onto a chip. GPUs are like a Chinese factory with thousands of workers: they do simple things quite well (and, most importantly, quickly and cheaply), but if you entrust them with assembling an aircraft, the best you will get is a hang glider.

Therefore, the first limitation of the GPU is its focus on fast mathematical calculations, which restricts graphics processors to helping multimedia applications and any programs involved in heavy data processing (for example, archivers or encryption systems, as well as software for fluorescence microscopy, molecular dynamics, electrostatics and other things of little interest to Linux users).

The second problem with GPGPU is that not every algorithm can be adapted to run on the GPU. Individual GPU cores are quite slow, and their power only shows when they work together. That means an algorithm will only be as efficient as the programmer can parallelize it, and in most cases only a good mathematician can cope with such work - and there are very few of them among software developers.

And thirdly, GPUs work with the memory installed on the video card itself, so each use of the GPU involves two extra copy operations: the input data is copied from the application's RAM into video memory, and the result is copied from video memory back into application memory. As you might guess, this can wipe out any gain in application run time (as happens with the FlacCL tool, which we will look at later).
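To make the parallelization idea and the copy overhead concrete, here is a minimal sketch in CUDA C (the file name vec_add.cu and all variable names are made up for illustration): each GPU thread computes one element of the result, and the two cudaMemcpy calls are exactly the extra copies described above.

// vec_add.cu - a minimal sketch of offloading a loop to the GPU
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Each GPU thread computes exactly one element of the result.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);

    // The first extra copy: application RAM -> video memory.
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int block = 256;                        // threads per block
    int grid  = (n + block - 1) / block;    // enough blocks to cover all n elements
    vec_add<<<grid, block>>>(da, db, dc, n);

    // The second extra copy: video memory -> application RAM.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[0] = %f\n", hc[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}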

But that's not all. Despite the existence of a generally accepted standard in the form of OpenCL, many programmers still prefer vendor-specific implementations of the GPGPU technique. CUDA turned out to be especially popular: although it provides a more flexible programming interface (by the way, OpenCL in nVidia drivers is implemented on top of CUDA), it tightly ties the application to video cards from a single manufacturer.

KGPU or Linux kernel accelerated by GPU

Researchers at the University of Utah have developed KGPU, a system that allows some Linux kernel functions to run on a GPU using the CUDA framework. It uses a modified Linux kernel and a special daemon that runs in user space, listens for kernel requests and passes them to the video card driver via the CUDA library. Interestingly, despite the significant overhead of such an architecture, the authors of KGPU managed to create an implementation of the AES algorithm that raises the encryption speed of the eCryptfs file system six-fold.

Where are we now?

Because of its youth, and also because of the problems described above, GPGPU has not become a truly widespread technology; nevertheless, useful software that uses its capabilities does exist (albeit in meager amounts). Among the first to appear were crackers of various hashes, whose algorithms are very easy to parallelize.

Multimedia applications were also born, such as the FlacCL encoder, which transcodes audio tracks into the FLAC format. Some pre-existing applications have acquired GPGPU support as well, the most notable being ImageMagick, which now knows how to shift part of its work to the graphics processor using OpenCL. There are also projects porting archivers and other data compression systems to CUDA/OpenCL (ATi-owning Unixoids are not fond of them). We will look at the most interesting of these projects in the following sections of the article, but for now let's figure out what we need for all of this to start up and run stably.

GPUs have long outperformed x86 processors in performance

· Firstly, the system must have a video card that supports one of the GPGPU technologies (CUDA for nVidia, Stream/OpenCL for AMD).

· Secondly, the system must have the latest proprietary drivers for the video card installed; they provide support both for the card's native GPGPU technology and for the open OpenCL.

· And thirdly, since distribution builders have not yet started shipping application packages with GPGPU support, we will have to build the applications ourselves, and for that we need the official SDKs from the manufacturers: the CUDA Toolkit or the ATI Stream SDK. They contain the header files and libraries needed to build the applications.

Install CUDA Toolkit

Follow the link above and download the CUDA Toolkit for Linux (there are several versions to choose from, for the Fedora, RHEL, Ubuntu and SUSE distributions, and for both the x86 and x86_64 architectures). In addition, you need to download the developer driver kit there (Developer Drivers for Linux, first on the list).

Run the SDK installer:

$ sudo sh cudatoolkit_4.0.17_linux_64_ubuntu10.10.run

When the installation is completed, proceed to install the drivers. To do this, shut down the X server:

$ sudo /etc/init.d/gdm stop

Switch to a text console and run the driver installer:

$ sudo sh devdriver_4.0_linux_64_270.41.19.run

After the installation is completed, start X again:

$ sudo /etc/init.d/gdm start

In order for applications to work with CUDA/OpenCL, we write the path to the directory with CUDA libraries in the LD_LIBRARY_PATH variable:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64

Or, if you installed the 32-bit version:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib32

You also need to specify the path to the CUDA header files so that the compiler can find them at the application build stage:

$ export C_INCLUDE_PATH=/usr/local/cuda/include

That's it, now you can start building CUDA/OpenCL software.
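As a quick sanity check that the toolkit and driver see the hardware, you can compile a tiny program with nvcc that simply asks the runtime which CUDA devices are present (a minimal sketch; the file name deviceprobe.cu is made up):

// deviceprobe.cu - list the CUDA devices visible to the runtime
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA devices found - check the driver installation\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}

Compile and run it like this:

$ /usr/local/cuda/bin/nvcc deviceprobe.cu -o deviceprobe
$ ./deviceprobe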

Install ATI Stream SDK

The Stream SDK does not require installation, so the archive downloaded from the AMD website can simply be unpacked into any directory (the best choice is /opt), and its path written into the same LD_LIBRARY_PATH variable:

$ wget http://goo.gl/CNCNo

$ sudo tar -xzf ~/AMD-APP-SDK-v2.4-lnx64.tgz -C /opt

$ export LD_LIBRARY_PATH=/opt/AMD-APP-SDK-v2.4-lnx64/lib/x86_64/

$ export C_INCLUDE_PATH=/opt/AMD-APP-SDK-v2.4-lnx64/include/

As with the CUDA Toolkit, x86_64 needs to be replaced with x86 on 32-bit systems. Now go to the root directory and unpack the icd-registration.tgz archive (it is a kind of free license key):

$ sudo tar -xzf /opt/AMD-APP-SDK-v2.4-lnx64/icd-registration.tgz -C /

We check that the package is installed and working correctly using the clinfo tool:

$ /opt/AMD-APP-SDK-v2.4-lnx64/bin/x86_64/clinfo

ImageMagick and OpenCL

Support for OpenCL appeared in ImageMagick quite a while ago, but it is not enabled by default in any distribution, so we will have to build IM ourselves from source. There is nothing complicated about this: everything you need is already in the SDK, so the build will not require installing any additional libraries from nVidia or AMD. So, download and unpack the archive with the sources:

$ wget http://goo.gl/F6VYV

$ tar -xjf ImageMagick-6.7.0-0.tar.bz2

$ cd ImageMagick-6.7.0-0

$ sudo apt-get install build-essential

Run the configure script and grep its output for OpenCL support:

$ LDFLAGS=-L$LD_LIBRARY_PATH ./configure | grep -e cl.h -e OpenCL

The correct output of the command should look something like this:

checking CL/cl.h usability... yes

checking CL/cl.h presence... yes

checking for CL/cl.h... yes

checking OpenCL/cl.h usability... no

checking OpenCL/cl.h presence... no

checking for OpenCL/cl.h... no

checking for OpenCL library... -lOpenCL

The word "yes" should mark either the first three lines or the second (or both). If this is not the case, then most likely the C_INCLUDE_PATH variable was not initialized correctly. If the word "no" marks the last line, then the matter is in the LD_LIBRARY_PATH variable. If everything is ok, start the build/install process:

$ sudo make install clean

Verify that ImageMagick was indeed compiled with OpenCL support:

$ /usr/local/bin/convert --version | grep Features

Features: OpenMP OpenCL

Now let's measure the resulting gain in speed. The ImageMagick developers recommend using the convolve filter for this:

$ time /usr/bin/convert image.jpg -convolve "-1, -1, -1, -1, 9, -1, -1, -1, -1" image2.jpg

$ time /usr/local/bin/convert image.jpg -convolve "-1, -1, -1, -1, 9, -1, -1, -1, -1" image2.jpg

Some other operations, such as resizing, should now also work much faster, but you should not expect ImageMagick to suddenly start processing graphics at breakneck speed: so far, only a very small part of the package has been optimized with OpenCL.

FlacCL (Flacuda)

FlacCL is a FLAC audio encoder that takes advantage of OpenCL. It is part of the CUETools package for Windows, but thanks to Mono it can also be used on Linux. To get the archive with the encoder, run the following commands:

$ mkdir flaccl && cd flaccl

$ wget www.cuetools.net/install/flaccl03.rar

$ sudo apt-get install unrar mono

$ unrar x flaccl03.rar

So that the program can find the OpenCL library, we make a symbolic link:

$ ln -s $LD_LIBRARY_PATH/libOpenCL.so libopencl.so

Now let's start the encoder:

$ mono CUETools.FLACCL.cmd.exe music.wav

If the error message "Error: Requested compile size is bigger than the required workgroup size of 32" appears on the screen, the video card in the system is too weak, and the number of cores involved should be reduced to the indicated number using the '--group-size XX' flag, where XX is the desired number of cores.

I must say right away that, due to the long OpenCL initialization time, a noticeable gain can only be obtained on sufficiently long tracks. FlacCL processes short sound files at almost the same speed as the traditional encoder.

oclHashcat or quick brute force

As I already said, developers of various crackers and password brute-force systems were among the first to add GPGPU support to their products. For them, the new technology became a real holy grail that made it easy to move naturally parallelizable code onto the shoulders of fast GPU processors. Therefore, it is not surprising that there are now dozens of different implementations of such programs. But in this article I will talk about only one of them - oclHashcat.

oclHashcat is a cracker that recovers passwords from their hashes at extremely high speed, harnessing GPU power via OpenCL. According to the measurements published on the project website, the speed of MD5 password cracking on an nVidia GTX580 reaches 15800 million combinations per second, thanks to which oclHashcat can find an eight-character password of average complexity in just 9 minutes.

The program supports OpenCL and CUDA and the following algorithms: MD5, md5($pass.$salt), md5(md5($pass)), vBulletin < v3.8.5, SHA1, sha1($pass.$salt), MySQL hashes, MD4, NTLM, Domain Cached Credentials and SHA256. It also supports distributed password cracking using the power of several machines.

Download the archive from the project website and unpack it:

$ 7z x oclHashcat-0.25.7z

$ cd oclHashcat-0.25

And run the program (we will use a trial list of hashes and a trial dictionary):

$ ./oclHashcat64.bin example.hash ?l?l?l?l example.dict

oclHashcat will display the text of the user agreement, which you must accept by typing "YES". After that, the enumeration process begins; its progress can be checked by pressing the s key, the process can be paused with p and resumed with r. You can also use plain brute force (for example, from aaaaaaaa to zzzzzzzz):

$ ./oclHashcat64.bin hash.txt ?l?l?l?l ?l?l?l?l

There are also various modifications of the dictionary and mask attacks, as well as combinations of them (you can read about this in the docs/examples.txt file). In my case, going through the entire dictionary took 11 minutes, while a straight brute-force run (from aaaaaaaa to zzzzzzzz) lasted about 40 minutes. The average GPU speed (an RV710 chip) was 88.3 million passwords per second.

Conclusions

Despite the many limitations and the complexity of the software development involved, GPGPU is the future of high-performance desktop computing. And most importantly, you can use the capabilities of this technology right now, and this applies not only to Windows machines but to Linux as well.


There are never too many cores...

Modern GPUs are monstrously fast beasts capable of chewing through gigabytes of data. However, humans are cunning: no matter how much computing power grows, we keep coming up with harder and harder tasks, so a moment always arrives when we sadly have to admit that optimization is needed 🙁

This article describes the basic concepts needed to navigate the theory of GPU optimization, along with the basic rules, so that you have to consult that theory less often.

The reasons why GPUs are effective for dealing with large amounts of data that require processing:

  • they have great opportunities for parallel execution of tasks (many, many processors)
  • high memory bandwidth

Memory bandwidth is how much information - bits or gigabytes - can be transferred per unit of time, be it a second or a processor cycle.

One of the goals of optimization is to use the maximum throughput - to increase the achieved performance throughput (ideally it should equal the memory bandwidth).

To improve bandwidth usage:

  • increase the amount of data per transaction - use the bandwidth to the full (for example, have each thread work with float4)
  • reduce latency - the delay between operations

Latency is the time interval between the moment the controller requests a specific memory cell and the moment the data becomes available to the processor for executing instructions. We cannot influence this delay itself in any way - the restrictions exist at the hardware level. But it is precisely thanks to this delay that the processor can serve several threads simultaneously: while thread A waits for the memory it requested, thread B can calculate something, and thread C can wait for its own data to arrive.

How to reduce latency if synchronization is used:

  • reduce the number of threads in a block
  • increase the number of block groups

Using GPU resources to the full - GPU Occupancy

In highbrow conversations about optimization, the terms gpu occupancy or kernel occupancy often come up - they reflect how efficiently the video card's resources and capacity are used. Note separately that even if you use all the resources, it does not mean you are using them effectively.

The computing power of a GPU is hundreds of processors greedy for calculations, and when creating a program - a kernel - the burden of distributing the load across them falls on the shoulders of the programmer. A mistake can leave most of these precious resources idle for no reason. Now I will explain why. We have to start from afar.

Let me remind you that a warp (warp in NVidia terminology, wavefront in AMD terminology) is a set of threads that execute the same kernel function on the processor at the same time. Threads, grouped by the programmer into blocks, are divided into warps by the thread scheduler (separately for each multiprocessor): while one warp is running, another waits for its memory requests to be serviced, and so on. If some threads of a warp are still performing calculations while others have already finished, the computing resource is being used inefficiently - popularly known as idle power.

Every synchronization point and every branch in the logic can create such an idle situation. The maximum divergence (branching of the execution logic) depends on the warp size: for NVidia GPUs it is 32 threads, for AMD it is 64.

To reduce multiprocessor downtime during warp execution:

  • minimize the time spent waiting at barriers
  • minimize the divergence of execution logic in the kernel function

To solve this problem effectively, it makes sense to understand how warps are formed (for the multi-dimensional case). In fact, the order is simple: first along X, then Y, and last along Z.

If the kernel is launched with 64×16 blocks, the threads are divided into warps in X, Y, Z order: the first 64 elements (one full row along X) make up the first two warps, then the next row, and so on.

If the kernel is launched with 16×64 blocks, the first and second rows of 16 elements go into the first warp, the third and fourth rows go into the second warp, and so on.
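A minimal sketch of the same linearization in CUDA code (illustrative only; it needs a Fermi-class or newer card and -arch=sm_20 for device-side printf): the scheduler numbers the threads of a block linearly, X fastest, and cuts that sequence into 32-thread warps.

// warp_layout.cu - show which warp each row of a 2D block lands in
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void print_warp_layout(void)
{
    // Linear index inside the block: X runs fastest, then Y (then Z).
    int linear = threadIdx.x + threadIdx.y * blockDim.x;
    int warp_id = linear / warpSize;       // warpSize is 32 on NVidia hardware

    if (threadIdx.x == 0)                  // one line per row keeps the output short
        printf("row y=%d starts at linear index %d -> warp %d\n",
               threadIdx.y, linear, warp_id);
}

int main(void)
{
    // A 16x8 block: rows 0-1 share warp 0, rows 2-3 share warp 1, and so on,
    // exactly as in the 16x64 example above.
    print_warp_layout<<<1, dim3(16, 8)>>>();
    cudaDeviceSynchronize();
    return 0;
}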

How to reduce divergence (remember - branching is not always the cause of a critical performance loss)

  • when adjacent threads take different execution paths - many conditions and jumps based on them - look for ways to restructure the code; see the sketch after this list
  • look for an unbalanced thread load and remove it decisively (this is when we not only have conditions, but, because of these conditions, the first thread always calculates something while the fifth never falls into the condition and sits idle)
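As an illustration (a sketch, not taken from any real kernel): in the first function below the condition splits the threads of every warp, so each warp executes both branches one after the other; in the second, the condition is constant within a warp, so no warp diverges.

__global__ void divergent_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Bad: odd and even threads of the same warp take different paths,
    // so the warp runs both branches serially.
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

__global__ void warp_aligned_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Better: the condition is the same for all 32 threads of a warp,
    // so each warp takes exactly one branch and nothing is serialized.
    if ((i / warpSize) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}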

How to get the most out of GPU resources

GPU resources, unfortunately, also have their limitations. And, strictly speaking, before launching the kernel function, it makes sense to define limits and take these limits into account when distributing the load. Why is it important?

Video cards have limits on the total number of threads that one multiprocessor can execute, the maximum number of threads in one block, the maximum number of warps per processor, restrictions on the various types of memory, and so on. All this information can be requested both programmatically, through the corresponding API, and beforehand using the utilities from the SDK (the deviceQuery utility for NVidia devices, CLInfo for AMD video cards).

General practice:

  • the number of thread blocks/workgroups should be a multiple of the number of multiprocessors (compute units)
  • the block/workgroup size should be a multiple of the warp size

At the same time, keep in mind the absolute minimum: 3-4 warps/wavefronts should be in flight on each processor simultaneously, and wise guides advise planning for at least seven wavefronts. And do not forget the hardware restrictions!
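A small sketch of how these rules usually look in launch code (the kernel and the numbers are illustrative; query the real limits of your device first):

#include <cuda_runtime.h>

__global__ void my_kernel(const float *in, float *out, int n);   // hypothetical kernel

void launch(const float *d_in, float *d_out, int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Block size: a multiple of the warp size (32 on NVidia), within the device limit.
    int block = 8 * prop.warpSize;                 // 256 threads = 8 warps per block

    // Grid size: enough blocks to cover all n elements; ideally the total number
    // of blocks is also at least a small multiple of prop.multiProcessorCount,
    // so that no multiprocessor is left without warps to switch between.
    int grid = (n + block - 1) / block;

    my_kernel<<<grid, block>>>(d_in, d_out, n);
}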

Keeping all these details in your head quickly gets tedious, so for calculating gpu occupancy NVidia offered an unexpected tool - an Excel (!) calculator full of macros. There you enter the maximum number of threads per SM, the number of registers and the size of the shared memory available on the stream processor, along with the launch parameters of your functions - and it gives you the resource-usage efficiency as a percentage (and you tear your hair out realizing that you are short of registers to use all the cores).

usage information:
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#calculating-occupancy
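Later CUDA toolkits (6.5 and newer, so possibly newer than the setup described in this article) also expose the same calculation programmatically; a minimal sketch with a hypothetical kernel:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void my_kernel(const float *in, float *out, int n);   // hypothetical kernel

void report_occupancy(void)
{
    int max_blocks_per_sm = 0;
    int block = 256;

    // How many blocks of this size fit on one multiprocessor, given the
    // kernel's register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, my_kernel, block, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    float occupancy = (float)(max_blocks_per_sm * block) /
                      (float)prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
}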

GPU and memory operations

Video cards are optimized for 128-bit memory operations. That is, ideally each memory manipulation should change four 4-byte values at a time. The main annoyance for the programmer is that modern GPU compilers are not able to optimize such things automatically. This has to be done right in the function code and, on average, brings fractions of a percent of performance gain. The frequency of memory requests has a much greater impact on performance.

The problem is as follows: each request returns a piece of data that is a multiple of 128 bits in size, while each thread uses only a quarter of it (in the case of a normal four-byte variable). When adjacent threads simultaneously work with data located sequentially in memory, the total number of memory accesses drops. This is called coalesced access (coalesced reads and writes - good!), and with the right organization of the code (strided access to a contiguous chunk of memory - bad!) it can significantly improve performance. When organizing your kernel, remember: contiguous access works within the elements of one row of memory; working with the elements of a column is no longer as efficient. Want more details? Google for "memory coalescing techniques".
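A sketch of the difference on a row-major float matrix of width "width" (illustrative kernels): in the first, consecutive threads read consecutive addresses, so the 32 reads of a warp coalesce into a few wide transactions; in the second, consecutive threads are a whole row apart, and every read costs its own transaction.

// Coalesced: at each loop step, neighbouring threads touch neighbouring addresses.
__global__ void sum_columns(const float *matrix, float *out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < width) {
        float sum = 0.0f;
        for (int row = 0; row < height; row++)
            sum += matrix[row * width + col];   // consecutive threads -> consecutive addresses
        out[col] = sum;
    }
}

// Strided: at each loop step, neighbouring threads touch addresses a full row apart.
__global__ void sum_rows(const float *matrix, float *out, int width, int height)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < height) {
        float sum = 0.0f;
        for (int col = 0; col < width; col++)
            sum += matrix[row * width + col];   // consecutive threads -> a whole row apart
        out[row] = sum;
    }
}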

The leading contender in the "bottleneck" nomination is another memory operation: copying data from host memory to the GPU. The copying does not happen just anyhow, but from a memory area specially allocated by the driver and the system: when a request is made to copy data, the system first copies the data there, and only then uploads it to the GPU. The data transport speed is limited by the bandwidth of the PCI Express xN bus (where N is the number of data lanes) through which modern video cards communicate with the host.

However, this extra copying through slow host memory is sometimes an unjustified overhead. The way out is to use so-called pinned memory - a specially marked area of memory that the operating system is not allowed to touch (for example, it cannot swap it out or move it at its own discretion). Data transfer from the host to the video card is then carried out without the participation of the operating system - asynchronously, through DMA (direct memory access).
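In CUDA this is a page-locked buffer plus an asynchronous copy (a minimal sketch; error checking omitted):

#include <cuda_runtime.h>

void upload_pinned(float *d_data, size_t n)
{
    float *h_pinned = NULL;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Page-locked (pinned) allocation: the OS will not swap or move it,
    // so the DMA engine can read it directly.
    cudaMallocHost((void **)&h_pinned, n * sizeof(float));

    for (size_t i = 0; i < n; i++)
        h_pinned[i] = (float)i;                 // fill with some data

    // Asynchronous copy: returns immediately, the transfer goes via DMA
    // while the CPU (or another kernel) keeps working.
    cudaMemcpyAsync(d_data, h_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);              // wait before reusing the buffer
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
}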

And finally, a little more about memory. Shared memory on a multiprocessor is usually organized as memory banks containing 32-bit words of data. The number of banks traditionally varies from one GPU generation to another (16 or 32). If each thread requests data from a separate bank, everything is fine. Otherwise several read/write requests land on one bank and we get a conflict (shared memory bank conflict). Such conflicting accesses are serialized and therefore executed sequentially rather than in parallel. If all threads read the same word, a broadcast response is used and there is no conflict. There are several effective ways to deal with access conflicts; descriptions of the main techniques for avoiding bank conflicts are easy to find.
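The classic trick is to pad a shared-memory tile by one element, so that the threads of a warp that walk a column no longer map to the same bank (a sketch assuming 32 banks of 4-byte words, a 32x32 block, and a square matrix whose width is a multiple of 32):

#define TILE 32

__global__ void transpose_tile(const float *in, float *out, int width)
{
    // Without the +1 column, the read tile[threadIdx.x][threadIdx.y] below would make
    // the 32 threads of a warp hit addresses 32 words apart - i.e. the same bank.
    // The extra column shifts each row by one bank and removes the conflict.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];      // coalesced global read
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];   // conflict-free shared read
}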

How to make mathematical operations even faster? Remember that:

  • double-precision calculations are expensive: fp64 operations cost far more than fp32
  • constants of the form 3.14 in the code are interpreted as fp64 by default unless you explicitly write 3.14f
  • to optimize the math, it will not hurt to check the guides for compiler flags (nvcc, for example, has --use_fast_math)
  • vendors ship functions in their SDKs that exploit device features to gain performance (often at the expense of portability)
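A tiny illustration of the constant pitfall and of the vendor intrinsics (the __sinf and __fdividef intrinsics are CUDA-specific single-precision approximations; whether their accuracy is acceptable is up to you):

__global__ void math_example(float *out, const float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float a = x[i] * 3.14;    // 3.14 is a double constant: the multiply is promoted to fp64
    float b = x[i] * 3.14f;   // 3.14f keeps the whole expression in fast fp32

    // Hardware intrinsics: faster but less accurate than sinf() and ordinary division.
    float c = __sinf(b);
    float d = __fdividef(b, c + 1.5f);

    out[i] = a + d;
}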

CUDA developers should also pay close attention to the concept of a cuda stream, which lets you run several kernel functions on one device at once, or overlap asynchronous copying of data between host and device with kernel execution. OpenCL does not yet provide such functionality 🙁
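A sketch of the overlap (two streams, each copying its half of the data and running a hypothetical kernel on it; the copy of one chunk can proceed while the kernel for the other chunk executes, provided the host buffer is pinned):

#include <cuda_runtime.h>

__global__ void process(float *data, int n);    // hypothetical kernel

void process_in_chunks(float *h_pinned, float *d_data, int n)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int half = n / 2;
    for (int k = 0; k < 2; k++) {
        int off = k * half;
        // Operations queued in the same stream run in order,
        // but the two streams overlap with each other.
        cudaMemcpyAsync(d_data + off, h_pinned + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        process<<<(half + 255) / 256, 256, 0, s[k]>>>(d_data + off, half);
        cudaMemcpyAsync(h_pinned + off, d_data + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }

    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}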

Profiling tools:

NVidia Visual Profiler is an interesting utility that analyzes both CUDA and OpenCL kernels.

P.S. As a longer optimization guide, I can recommend googling all sorts of best-practices guides for OpenCL and CUDA.


What software is needed to mine cryptocurrency? What to consider when choosing equipment for mining? How to mine bitcoins and ethereum using a video card on a computer?

It turns out that powerful video cards are needed not only by fans of spectacular computer games. Thousands of users around the world use graphics cards to earn cryptocurrency! From several cards with powerful processors, miners build farms - computing centers that extract digital money almost out of thin air!

With you is Denis Kuderin, the HeatherBober magazine's expert on finance and how to multiply it competently. I will tell you what mining on a video card is in 2017-2018, how to choose the right device for earning cryptocurrency, and why it is no longer profitable to mine bitcoins on video cards.

You will also learn where to buy the most productive and powerful video card for professional mining, and get expert tips to improve the efficiency of your mining farm.

1. Mining on a video card - easy money or unjustified expenses

A good video card is not just a digital signal adapter, but also a powerful processor capable of solving complex computing problems - including calculating the hash codes for the block chain (blockchain). This makes graphics cards ideal for mining, that is, cryptocurrency production.

Question: Why the graphics processor? After all, in any computer there is a central processing unit? Isn't it logical to do calculations with it?

Answer: The CPU can also calculate blockchains, but it does so hundreds of times more slowly than the video card's processor (GPU). Not because one is better and the other worse - they simply work differently. And if you combine several video cards, the power of such a computing center rises several times over.

For those who have no idea how digital money is mined, a small primer. Mining is the main, and sometimes the only, way of producing cryptocurrency.

Since no one mints or prints this money, and they are not a material substance, but a digital code, someone must calculate this code. This is what miners do, or rather, their computers.

In addition to code calculations, mining performs several more important tasks:

  • supporting the decentralization of the system: the absence of ties to servers is the basis of the blockchain;
  • confirming transactions - without mining, operations cannot enter a new block;
  • forming new blocks of the system and entering them into a single registry shared by all computers.

I want to cool the ardor of novice miners right away: the mining process gets harder every year. For example, mining bitcoin with a video card has long been unprofitable.

Bitcoins are now mined on GPUs only by stubborn amateurs, since specialized ASIC processors have replaced video cards. These chips consume less electricity and are more efficient at the computations. They are good in every way, but cost on the order of 130-150 thousand rubles.

A powerful ASIC model: the Antminer S9

Fortunately for miners, bitcoin is not the only cryptocurrency on the planet, but one of hundreds. Other digital money - Ethereum, Zcash, Expanse, Dogecoin and so on - is still profitable to mine with video cards. The rewards are stable, and the equipment pays for itself in roughly 6-12 months.

But there is another problem - the shortage of powerful video cards. The excitement around cryptocurrency has driven up the prices of these devices, and it is not so easy to buy a new video card suitable for mining in Russia.

Novice miners have to order video adapters from online stores (including foreign ones) or buy second-hand. By the way, I do not recommend the latter: mining equipment becomes obsolete and wears out at a fantastic rate.

On Avito you can even buy entire ready-made cryptocurrency mining farms.

There are many reasons: some miners have already “played enough” in the extraction of digital money and decided to engage in more profitable operations with cryptocurrency (in particular, stock trading), others realized that they could not compete with powerful Chinese clusters operating on the basis of power plants. Still others switched from video cards to ASICs.

However, the niche still brings some profit, and if you start with the help of a video card right now, you will still have time to jump on the bandwagon of the train leaving for the future.

Another thing is that there are more and more players on this field. Moreover, the total number of digital coins does not increase from this. On the contrary, the reward becomes smaller.

Six years ago, the reward for one block on the Bitcoin network was 50 coins; now it is only 12.5 BTC. The complexity of the calculations has meanwhile increased 10 thousand times. True, the price of bitcoin itself has also multiplied many times over during this period.

2. How to mine cryptocurrency using a video card - step by step instructions

There are two mining options - solo or as part of a pool. Solo mining is difficult: you need a huge amount of hashrate (computing power) for the calculations you start to have any chance of closing a block successfully.

99% of all miners work in pools - communities that distribute the computing tasks among their members. Joint mining removes the element of chance and guarantees a stable profit.

One miner I know put it this way: I have been mining for 3 years, and in all that time I have never talked to anyone who mines alone.

Such prospectors are like the gold prospectors of the 19th century. You can search for years for your nugget (in our case, a bitcoin block) and never find it. That is, the block will never be closed, which means you will receive no reward.

Lone hunters have slightly better chances with Ether and some other crypto-coins.

Because of its peculiar hashing algorithm, ETH cannot be mined with specialized processors (they have not yet been invented) - only video cards are used for this. Ethereum and the other altcoins are what keep the numerous farmers of our day afloat.

One video card will not be enough to create a full-fledged farm: 4 cards is the "living wage" for a miner counting on a stable profit. No less important is a powerful cooling system for the video adapters. And do not lose sight of such a cost item as electricity bills.

The step-by-step instructions below will protect you from mistakes and speed up the setup.

Step 1. Choose a pool

The world's largest cryptocurrency pools are located in China, as well as in Iceland and the United States. Formally, these communities do not have a state affiliation, but Russian-language pool sites are a rarity on the Internet.

Since you will most likely be mining Ethereum on your video card, you need to choose a community that calculates this currency. Although Ethereum is a relatively young altcoin, there are many pools for mining it. The size of your income and its stability largely depend on your choice of community.

We select a pool according to the following criteria:

  • performance;
  • working hours;
  • fame among cryptocurrency miners;
  • positive feedback on independent forums;
  • convenience of withdrawing money;
  • the size of the commission;
  • the principle of accrual of profit.

The cryptocurrency market changes daily. This applies both to exchange-rate fluctuations and to the appearance of new digital money - bitcoin forks. There are global changes as well.

For example, it recently became known that Ether will in the near future move to a fundamentally different system of profit distribution. In a nutshell, miners who already hold a lot of coins will earn income on the Ethereum network, while novice miners will either close up shop or switch to other money.

But such "little things" never stopped enthusiasts. Moreover, there is a program called Profitable Pool. It automatically tracks the most profitable altcoins for mining at the current moment. There is also a search service for the pools themselves, as well as their real-time ratings.

Step 2. Install and configure the program

After registering on the pool website, you need to download a special miner program - you are not going to calculate the hashes by hand with a calculator. There are plenty of such programs. For bitcoin it is 50Miner or CGMiner, for Ether it is Ethminer.

Setting up requires care and certain skills. For example, you need to know what scripts are and be able to enter them into the command line of your computer. I advise you to check the technical points with practicing miners, since each program has its own installation and configuration nuances.

Step 3. Registering a wallet

If you don’t have a bitcoin wallet or ethereum storage yet, you need to register them. We download wallets from official sites.

Sometimes the pools themselves provide assistance in this matter, but not free of charge.

Step 4. Start mining and monitor statistics

It remains only to start the process and wait for the first receipts. Be sure to download an auxiliary program that will monitor the status of the main components of your computer - workload, overheating, etc.

Step 5. Withdraw cryptocurrency

Computers work around the clock and automatically, calculating the code. You just have to make sure that the cards or other systems do not fail. Cryptocurrency will flow into your wallet at a rate directly proportional to the amount of hashrate.

How do you convert digital currency into fiat? That question deserves a separate article. In short, the fastest way is through exchange offices. They take a percentage for their services, and your task is to find the most favorable rate with the minimum commission. A professional exchanger-comparison service will help you do this.

The best resource of this kind in Runet is a monitoring service that compares the performance of more than 300 exchange offices and finds the best quotes for the currency pairs you are interested in. Moreover, the service shows the cryptocurrency reserves at each exchanger's cash desk. The monitoring lists contain only proven and reliable exchange services.

3. What to look for when choosing a video card for mining

Choose your video card wisely. The first one you come across, or whatever is already in your computer, will also mine, but its power will be negligible even for Ether.

The main indicators are as follows: performance (power), power consumption, cooling, overclocking prospects.

1) Power

Everything is simple here: the higher the processor performance, the better for calculating the hash code. Excellent performance is provided by cards with more than 2 GB of memory. And choose devices with a 256-bit bus; a 128-bit one is not suitable for this job.

2) Energy consumption

Power, of course, is great - high hashrate and all that. But don't forget the power consumption figures. Some productive farms “eat up” so much electricity that the costs barely pay off or do not pay off at all.

3) Cooling

A standard farm consists of 4-16 cards. It produces a lot of excess heat, which is harmful to the hardware and unpleasant for the farmer himself. Living and working in a one-room apartment without air conditioning will be, to put it mildly, uncomfortable.

High-quality processor cooling is an indispensable condition for successful mining

Therefore, when choosing between two cards with the same performance, give preference to the one with the lower thermal design power (TDP). The best cooling parameters are demonstrated by Radeon cards; these devices also last longer than other cards under constant load without wearing out.

Additional coolers will not only remove excess heat from the processors, but also extend their life.

4) Ability to overclock

Overclocking is a forced increase in the performance of a video card. The ability to "overclock the card" depends on two parameters - the GPU frequency and the video memory frequency. These are what you will overclock if you want to increase computing power.

Which video cards should you get? You will need devices of the latest generation, or at least graphics accelerators released no earlier than 2-3 years ago. Miners use AMD Radeon and Nvidia GeForce GTX cards.

Take a look at the payback table for video cards (the data is current at the end of 2017):

4. Where to buy a video card for mining - an overview of the TOP-3 stores

As I said, video cards with the growing popularity of mining have become a scarce commodity. To buy the right device, you have to spend a lot of time and effort.

Our review of the best online sales points will help you.

1) TopComputer

Moscow hypermarket specializing in computer and home appliances. It has been operating on the market for more than 14 years, delivering goods from all over the world almost at producer prices. There is a prompt delivery service, free for Muscovites.

At the time of writing, it has AMD and Nvidia cards (8 GB) and other varieties suitable for mining on sale.

2) Mybitcoinshop

A specialized shop trading exclusively in mining goods. Here you will find everything for building a home farm: video cards of the required configuration, power supplies, adapters, and even ASIC miners (for the new generation of miners). There is paid delivery as well as pickup from a warehouse in Moscow.

The company has repeatedly received the unofficial title of the best shop for miners in the Russian Federation. Prompt service, friendly attitude to customers, advanced equipment are the main components of success.

3) Ship Shop America

Purchase and delivery of goods from the USA. An intermediary company for those who need truly exclusive and most advanced mining products.

Direct partner of the leading manufacturer of video cards for gaming and mining - Nvidia. The maximum waiting time for goods is 14 days.

5. How to increase the income from mining on a video card - 3 useful tips

Impatient readers who want to start mining right now and receive income from tomorrow morning will certainly ask - how much do miners earn?

Earnings depend on the equipment, the cryptocurrency rate, the pool's efficiency, the farm's capacity, the hashrate and a heap of other factors. Some manage to earn up to 70,000 rubles a month, others are content with 10 dollars a week. This is an unstable and unpredictable business.

Useful tips will help you increase your income and optimize your expenses.

Tip 1. Mine a currency that is rising in price

If you mine a currency that is rapidly growing in price, you will earn more. For example, Ether is now worth about 300 dollars, and bitcoin more than 6,000. But you need to take into account not only the current value, but also the rate of growth over the week.

Tip 2. Use the mining calculator to select the optimal equipment

The mining calculator on the pool website or on another specialized service will help you choose the best program and even a video card for mining.

Speaking about parallel computing on the GPU, we must remember what times we live in. Today everything is so accelerated that we lose track of time, not noticing how it rushes by. Everything we do is tied to high precision and speed of information processing, and in such conditions we certainly need tools to process all the information we have and turn it into results. Moreover, such tasks are needed not only by large organizations or mega-corporations: ordinary users now also solve vital problems related to high technology at home, on their personal computers. The appearance of NVIDIA CUDA was therefore not surprising but rather justified, because sooner or later a PC has to handle far more time-consuming tasks than before. Work that previously took a very long time now takes a matter of minutes, and this will affect the overall picture of the whole world!

What is GPU Computing

GPU computing is the use of the GPU to compute technical, scientific and everyday tasks. It involves using the CPU and GPU together, with a heterogeneous division of work between them: the sequential part of the program is taken on by the CPU, while the time-consuming computational tasks are left to the GPU. Thanks to this, tasks are parallelized, which speeds up information processing and reduces the time needed to complete the work; the system becomes more productive and can process more tasks simultaneously than before. However, hardware support alone is not enough to achieve such results - software support is also needed, so that an application can transfer its most time-consuming calculations to the GPU.

What is CUDA

CUDA is a technology for programming, in a simplified C-like language, algorithms that execute on the GPUs of eighth-generation and later GeForce accelerators, as well as the corresponding Quadro and Tesla cards from NVIDIA. CUDA lets you include special functions in the text of a C program; these functions are written in the simplified C language and run on the GPU. The initial version of the CUDA SDK was released on February 15, 2007. To translate code in this language, the CUDA SDK includes NVIDIA's own command-line C compiler, nvcc. The nvcc compiler is based on the open Open64 compiler and is designed to translate host code (the main, controlling code) and device code (the hardware code, in files with the .cu extension) into object files suitable for building the final program or library in any development environment, such as Microsoft Visual Studio.
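In practice the workflow looks like this (saxpy.cu is a hypothetical file containing both the host code and a __global__ device function):

$ nvcc saxpy.cu -o saxpy      # nvcc separates host and device code, compiles both and links them
$ ./saxpy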

Technology Capabilities

  1. The standard C language for parallel development of GPU applications.
  2. Ready-made numerical analysis libraries: the fast Fourier transform and a basic set of linear algebra routines.
  3. A dedicated CUDA driver for computing, with fast data transfer between GPU and CPU.
  4. The ability of the CUDA driver to interoperate with OpenGL and DirectX graphics drivers.
  5. Support for 32/64-bit Linux, 32/64-bit Windows XP and Mac OS.

Technology Benefits

  1. The CUDA Application Programming Interface (CUDA API) is based on the standard C programming language with a few limitations, which simplifies and smooths the process of learning the CUDA architecture.
  2. The 16 KB of shared memory available to the threads of a block can be used as a user-managed cache with wider bandwidth than ordinary texture fetches.
  3. More efficient transactions between CPU memory and video memory.
  4. Full hardware support for integer and bitwise operations.

An example of technology application

cRark

The hardest part of working with this program is setting it up. The program has a console interface, but thanks to the instructions that come with it, it can be mastered. Below is a brief guide to configuring the program. We will test it for performance and compare it with a similar program that does not use NVIDIA CUDA - in this case the well-known "Advanced Archive Password Recovery".

From the downloaded cRark archive we only need three files: crark.exe, crark-hp.exe and password.def. Crark.exe is a console RAR 3.0 password cracker for archives whose file names are not encrypted (i.e. when opening the archive we can see the names, but cannot unpack the archive without the password).

Crark-hp.exe is a console RAR 3.0 password cracker for fully encrypted archives (i.e. when opening the archive we see neither the names nor the files themselves, and cannot unpack the archive without the password).

Password.def is any renamed text file with very little content (for example: 1st line: ## , 2nd line: ?* - in this case the password will be attacked using all characters). Password.def is the brain of the cRark program: the file contains the rules for cracking the password (the character set that crark.exe will use in its work). More details about the options for choosing these characters can be found in the text file russian.def, downloaded from the cRark author's site.

Training

I must say right away that the program only works if your video card is based on a GPU that supports CUDA compute level 1.1. So video cards based on the G80 chip, such as the GeForce 8800 GTX, are out of the question, since they only have hardware support for CUDA 1.0. Using CUDA, the program cracks passwords only for RAR archives of version 3.0+. All the CUDA-related software must be installed, namely a CUDA-enabled video card driver.

We create a folder anywhere (for example, on the C: drive) and give it any name, for example "3.2". We put the files crark.exe, crark-hp.exe and password.def there, along with the password-protected/encrypted RAR archive.

Next, open the Windows command prompt and change to the created folder. In Windows Vista and 7, open the "Start" menu and type "cmd.exe" in the search field; in Windows XP, open the "Run" dialog from the "Start" menu and enter "cmd.exe" there. After opening the console, enter a command like cd C:\folder\ - in this case, cd C:\3.2.

Type the following two lines in a text editor (you can also save the text as a .bat file in the cRark folder) to guess the password of a password-protected RAR archive with unencrypted file names:

echo off
cmd /K crark (archive name).rar

to guess the password of a password-protected and encrypted RAR archive:

echo off
cmd /K crark-hp (archive name).rar

Copy the two lines from the text file into the console and press Enter (or run the .bat file).

Results

The decryption process is shown in the figure:

The password-guessing speed in cRark using CUDA was 1625 passwords per second. In one minute and thirty-six seconds, the 3-character password "q)$" was found. For comparison: the brute-force speed in Advanced Archive Password Recovery on my dual-core Athlon 3000+ processor is at most 50 passwords per second, so the search would have lasted about 5 hours. In other words, brute-forcing a RAR archive in cRark using a GeForce 9800 GTX+ video card is 30 times faster than on the CPU.

For those with an Intel processor and a good motherboard with a high system-bus frequency (FSB 1600 MHz), the CPU figures and the enumeration speed will be higher. And if you have a quad-core processor and a pair of GeForce 280 GTX-class video cards, password brute-forcing is accelerated several times over. Summing up the example: this task was solved with CUDA technology in just 2 minutes instead of 5 hours, which shows the high potential of the technology!

Conclusions

Having looked at CUDA parallel computing technology today, we clearly saw all its power and its huge potential for development, using a RAR password recovery program as an example. This technology will certainly find a place in the life of anyone who decides to use it, whether for scientific tasks, video processing, or economic tasks requiring fast, accurate calculation - all of which leads to an inevitable rise in productivity that cannot be ignored. Today the phrase "home supercomputer" is already entering the lexicon, and it is obvious that every home already has a tool for turning it into reality, called CUDA. Since the release of cards based on the G80 chip in 2006, a huge number of CUDA-capable NVIDIA accelerators have been released, which can make the dream of supercomputing in every home come true. By promoting CUDA technology, NVIDIA raises its credibility in the eyes of customers by providing additional capabilities in hardware that many of them have already bought. One can only hope that CUDA will continue to develop quickly and let users take full advantage of parallel computing on the GPU.

Once I happened to talk, at a computer market, with the technical director of one of the many companies selling laptops. This "specialist" tried, foaming at the mouth, to explain exactly which laptop configuration I needed. The main message of his monologue was that the time of the central processing unit (CPU) is over; all applications now actively use computation on the graphics processing unit (GPU), so a laptop's performance depends entirely on the GPU and you can safely pay no attention to the CPU. Realizing that arguing and trying to reason with this technical director was completely pointless, I did not waste any time and bought the laptop I needed in another pavilion. Still, the sheer incompetence of the seller struck me. It would have been understandable if he had been trying to deceive me as a buyer - but no, he sincerely believed what he said. Apparently, the marketers at NVIDIA and AMD do not eat their bread in vain: they really have managed to instill in some users the idea of the dominant role of the graphics processor in the modern computer.

There is no doubt that graphics processing unit (GPU) computing is becoming increasingly popular today. However, this does not diminish the role of the central processor. Moreover, if we talk about the vast majority of user applications, their performance still depends entirely on the CPU - that is, the vast majority of user applications simply do not use GPU computing.

GPU computing is mostly performed on specialized HPC systems for scientific computing, while the user applications that use GPU computing can be counted on one hand. It should also be noted right away that the term "computing on the GPU" is not entirely accurate here and can be misleading: if an application uses GPU computing, this does not mean that the central processor sits idle. Computing on the GPU does not mean shifting the load from the CPU to the GPU. As a rule, the central processor remains busy, and using the graphics processor alongside it increases performance, that is, reduces the time needed to complete the task. The GPU here acts as a kind of coprocessor for the CPU, but by no means replaces it completely.

To understand why GPU computing is not such a panacea and why it is incorrect to say that their computing capabilities are superior to those of the CPU, it is necessary to understand the difference between the central processor and the graphics processor.

Differences in GPU and CPU architectures

CPU cores are designed to execute a single stream of sequential instructions at maximum performance, while GPUs are designed to quickly execute a very large number of parallel instruction streams. This is the fundamental difference between graphics processors and central ones. The CPU is a general-purpose processor optimized for high performance on a single instruction stream that handles both integers and floating-point numbers, with access to memory for data and instructions occurring mostly at random addresses.

To improve performance, CPUs are designed to execute as many instructions as possible in parallel. For this, the processor cores include, for example, an out-of-order execution unit, which reorders instructions out of their arrival order and thereby raises the level of parallelism within a single thread. Nevertheless, this still does not allow a large number of instructions to execute in parallel, and the overhead of parallelizing instructions inside the processor core turns out to be very significant. That is why general-purpose processors have only a small number of execution units.

The GPU is designed fundamentally differently. It was originally designed to execute a huge number of parallel streams of commands. Moreover, these command streams are parallelized initially, and there is simply no overhead for parallelizing instructions in the GPU. The GPU is designed to render the image. To put it simply, at the input it takes a group of polygons, performs all the necessary operations, and outputs pixels at the output. Processing of polygons and pixels is independent, they can be processed in parallel, separately from each other. Therefore, due to the inherently parallel organization of work, the GPU uses a large number of execution units that are easy to load, in contrast to the sequential flow of instructions for the CPU.

GPUs and CPUs also differ in how they access memory. On the GPU, memory access is easily predictable: if a texture texel is read from memory, then a little later the neighboring texels will be needed too. Writes behave the same way: if a pixel is written to the framebuffer, then a few cycles later the pixel next to it will be written. Therefore, unlike the CPU, the GPU simply does not need a large cache, and textures require only a few kilobytes. The way memory is handled also differs: all modern GPUs have several memory controllers, and graphics memory itself is faster, so GPUs have noticeably greater memory bandwidth than general-purpose processors, which is very important for parallel calculations operating on huge data streams.

In general-purpose processors, most of the chip area is occupied by various command and data buffers, decoding units, hardware branch-prediction units, instruction-reordering units, and first-, second-, and third-level cache memory. All of these hardware blocks are needed to speed up the execution of a few instruction streams by parallelizing them at the level of the processor core.

The execution units themselves take up relatively little space in the universal processor.

In the GPU, on the contrary, the main area is occupied by numerous execution units, which allows it to simultaneously process several thousand command streams.

We can say that, unlike modern CPUs, GPUs are designed for parallel computations with a large number of arithmetic operations.

It is possible to use the computing power of GPUs for non-graphics tasks, but only if the problem being solved can be parallelized across the hundreds of execution units available in the GPU. In particular, GPU calculations show excellent results when the same sequence of mathematical operations is applied to a large volume of data. The best results are achieved when the ratio of arithmetic instructions to memory accesses is sufficiently high. Such work places fewer demands on execution control and does not need a large cache.

There are many examples of scientific calculations where the advantage of the GPU over the CPU in terms of computational efficiency is undeniable. So, many scientific applications on molecular modeling, gas dynamics, fluid dynamics and other things are perfectly adapted for GPU calculations.

So, if the algorithm for solving a problem can be parallelized into thousands of separate threads, then solving it on a GPU can be more efficient than solving it on a general-purpose processor alone. However, it is not so easy to simply move a task from the CPU to the GPU, if only because the CPU and GPU use different instructions: a program written for the CPU uses the x86 instruction set (or a set compatible with a specific processor architecture), while the GPU uses entirely different instruction sets that reflect its own architecture and capabilities. Modern 3D game development uses the DirectX and OpenGL APIs to let programmers work with shaders and textures. However, using DirectX and OpenGL for non-graphics computing on the GPU is far from the best option.

NVIDIA CUDA and AMD APP

That is why, when the first attempts to implement non-graphical computing on the GPU (General Purpose GPU, GPGPU) were made, the BrookGPU compiler appeared. Before its creation, developers had to access video card resources through the OpenGL or Direct3D graphics APIs, which greatly complicated programming, since it required specific knowledge of working with 3D objects (shaders, textures and so on). This was the reason for the very limited use of GPGPU in software products. BrookGPU became a kind of "translator": its streaming extensions to the C language hid the 3D API from programmers, so knowledge of 3D programming was practically no longer needed. The computing power of video cards became available to programmers as an additional coprocessor for parallel calculations. The BrookGPU compiler processed a file of C code with extensions and produced code linked against a library with DirectX or OpenGL support.

Largely thanks to the BrookGPU, NVIDIA and ATI (now AMD) turned their attention to the emerging general-purpose computing technology on GPUs and began developing their own implementations that provide direct and more transparent access to 3D accelerator compute units.

As a result, NVIDIA developed the CUDA (Compute Unified Device Architecture) parallel computing architecture. The CUDA architecture allows non-graphical computing to be implemented on NVIDIA GPUs.

The public beta version of the CUDA SDK was released in February 2007. The CUDA API is based on a simplified dialect of the C language. The CUDA SDK lets programmers implement algorithms that run on NVIDIA GPUs by including special functions in C program code. To translate code in this language, the SDK includes NVIDIA's own nvcc command-line compiler.
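
A minimal end-to-end example of what such a program looks like is sketched below (this is my own illustration, not sample code from the CUDA SDK; the file and function names are arbitrary). The __global__ function is the "special function" that nvcc compiles for the GPU, while the rest is ordinary C/C++ host code.

#include <cstdio>
#include <cuda_runtime.h>

// GPU kernel: each thread adds one pair of elements.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes), *c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);   // one thread per element
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", c[0]);                    // expected: 3.000000
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}

Assuming the file is saved as add.cu, it would be built with the NVIDIA compiler as nvcc add.cu -o add.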

CUDA is cross-platform software for operating systems such as Linux, Mac OS X and Windows.

AMD (ATI) has also developed its own version of the GPGPU technology, formerly called ATI Stream and now AMD Accelerated Parallel Processing (APP). AMD APP is based on the open industry standard OpenCL (Open Computing Language). The OpenCL standard provides parallelism at the instruction level and at the data level and is an implementation of the GPGPU technique. It is a completely open standard and is royalty-free. Note that AMD APP and NVIDIA CUDA are not compatible with each other; however, recent versions of NVIDIA CUDA also support OpenCL.

Testing GPGPU in video converters

So, we have established that CUDA technology is intended for implementing GPGPU on NVIDIA GPUs, while the APP API serves the same purpose on AMD GPUs. As already noted, non-graphical computation on the GPU makes sense only if the task being solved can be parallelized into many threads. Most user applications do not meet this criterion, but there are exceptions: for example, most modern video converters can use computation on NVIDIA and AMD GPUs.

To find out how efficiently GPU computing is used in consumer video converters, we selected three popular solutions: Xilisoft Video Converter Ultimate 7.7.2, Wondershare Video Converter Ultimate 6.0.3.2 and Movavi Video Converter 10.2.1. These converters can use NVIDIA and AMD graphics processors, and this feature can be disabled in the converter settings, which makes it possible to evaluate the efficiency of GPU use.

For video conversion, we used three different videos.

The first video was 3 minutes 35 seconds long and 1.05 GB in size. It was recorded in the MKV container format and had the following characteristics:

  • video:
    • format - MPEG4 Video (H264),
    • resolution - 1920×1080,
    • bitrate mode - Variable,
    • average video bitrate - 42.1 Mbps,
    • maximum video bitrate - 59.1 Mbps,
    • frame rate - 25 fps;
  • audio:
    • format - MPEG-1 Audio,
    • audio bitrate - 128 Kbps,
    • number of channels - 2.

The second video was 4 minutes 25 seconds long and 1.98 GB in size. It was recorded in the MPG container format and had the following characteristics:

  • video:
    • format - MPEG-PS (MPEG2 Video),
    • resolution - 1920×1080,
    • bitrate mode - Variable,
    • average video bitrate - 62.5 Mbps,
    • maximum video bitrate - 100 Mbps,
    • frame rate - 25 fps;
  • audio:
    • format - MPEG-1 Audio,
    • audio bitrate - 384 Kbps,
    • number of channels - 2.

The third video was 3 minutes 47 seconds long and 197 MB in size. It was recorded in the MOV container format and had the following characteristics:

  • video:
    • format - MPEG4 Video (H264),
    • resolution - 1920×1080,
    • bitrate mode - Variable,
    • video bitrate - 7024 Kbps,
    • frame rate - 25 fps;
  • audio:
    • format - AAC,
    • audio bitrate - 256 Kbps,
    • number of channels - 2,
    • sampling frequency - 48 kHz.

All three test videos were converted with the video converters to the MP4 container format (H.264 codec) for viewing on an iPad 2 tablet. The resolution of the output video file was 1280×720.

Note that we did not use exactly the same conversion settings in all three converters, so it would be incorrect to compare the efficiency of the video converters by conversion time alone. For example, in Xilisoft Video Converter Ultimate 7.7.2, the iPad 2 - H.264 HD Video preset was used for conversion. This preset uses the following encoding settings:

  • codec - MPEG4 (H.264);
  • resolution - 1280×720;
  • frame rate - 29.97 fps;
  • video bitrate - 5210 Kbps;
  • audio codec - AAC;
  • audio bitrate - 128 Kbps;
  • number of channels - 2;
  • sampling frequency - 48 kHz.

Wondershare Video Converter Ultimate 6.0.3.2 used the iPad 2 preset with the following additional settings:

  • codec - MPEG4 (H.264);
  • resolution - 1280×720;
  • frame rate - 30 fps;
  • video bitrate - 5000 Kbps;
  • audio codec - AAC;
  • audio bitrate - 128 Kbps;
  • number of channels - 2;
  • sampling frequency - 48 kHz.

Movavi Video Converter 10.2.1 used the iPad preset (1280×720, H.264) (*.mp4) with the following additional settings:

  • video format - H.264;
  • resolution - 1280×720;
  • frame rate - 30 fps;
  • video bitrate - 2500 Kbps;
  • audio codec - AAC;
  • audio bitrate - 128 Kbps;
  • number of channels - 2;
  • sampling frequency - 44.1 kHz.

Each source video was converted five times in each of the video converters, both using the GPU and using only the CPU. After each conversion, the computer was rebooted.

As a result, each video was converted ten times in each video converter. To automate this routine work, a special GUI utility was written that fully automates the testing process.

Test Bench Configuration

The test bench had the following configuration:

  • processor - Intel Core i7-3770K;
  • motherboard - Gigabyte GA-Z77X-UD5H;
  • motherboard chipset - Intel Z77 Express;
  • memory - DDR3-1600;
  • memory size - 8 GB (two 4 GB GEIL modules);
  • memory operation mode - dual-channel;
  • video card - NVIDIA GeForce GTX 660Ti (video driver 314.07);
  • drive - Intel SSD 520 (240 GB).

Windows 7 Ultimate (64-bit) was installed on the test bench.

Initially, we tested the processor and all other system components at their standard settings. The Intel Core i7-3770K processor ran at its nominal frequency of 3.5 GHz with Turbo Boost enabled (the maximum processor frequency in Turbo Boost mode is 3.9 GHz).

Then we repeated the tests with the processor overclocked to a fixed frequency of 4.5 GHz (without Turbo Boost). This made it possible to reveal how the conversion speed depends on the processor (CPU) frequency.

At the next stage of testing, we returned to the standard processor settings and repeated testing with other video cards:

  • NVIDIA GeForce GTX 280 (driver 314.07);
  • NVIDIA GeForce GTX 460 (driver 314.07);
  • AMD Radeon HD6850 (driver 13.1).

Thus, video conversion was carried out on four video cards of different architectures.

The most powerful of these cards, the NVIDIA GeForce GTX 660Ti, is based on the GPU of the same name, codenamed GK104 (Kepler architecture), manufactured on a 28 nm process. This GPU contains 3.54 billion transistors and has a die area of 294 mm2.

Recall that the GK104 GPU includes four graphics processing clusters (Graphics Processing Clusters, GPC). GPC clusters are independent devices within the processor and are able to work as separate devices, since they have all the necessary resources: rasterizers, geometry engines and texture modules.

Each such cluster has two SMX streaming multiprocessors, but in the GK104 processor one multiprocessor in one of the clusters is disabled, so there are seven SMX multiprocessors in total.

Each SMX streaming multiprocessor contains 192 streaming compute cores (CUDA cores), so in total the GK104 processor has 1344 CUDA cores. In addition, each SMX multiprocessor contains 16 texture units (TMUs), 32 Special Function Units (SFUs), 32 Load-Store Units (LSUs), a PolyMorph engine and more.

The GeForce GTX 460 graphics card is based on a GPU codenamed GF104 based on the Fermi architecture. This processor is manufactured using a 40-nm process technology and contains about 1.95 billion transistors.

The GF104 GPU includes two GPC graphics processing clusters. Each has four SM streaming multiprocessors, but in the GF104 processor one multiprocessor in one of the clusters is disabled, so there are only seven SMs.

Each SM streaming multiprocessor contains 48 stream compute cores (CUDA cores), so the GF104 processor has a total of 336 CUDA cores. In addition, each SM multiprocessor contains eight texture units (TMUs), eight Special Function Units (SFUs), 16 Load-Store Units (LSUs), a PolyMorph engine and more.

The GeForce GTX 280 GPU belongs to the second generation of NVIDIA's unified GPU architecture and is very different in architecture from Fermi and Kepler.

The GeForce GTX 280 GPU is made up of Texture Processing Clusters (TPCs), which are similar in purpose to, but quite different from, the GPC graphics processing clusters of Fermi and Kepler. In total, there are ten such clusters in the GeForce GTX 280 processor. Each TPC cluster includes three SMs and eight TMUs. Each multiprocessor consists of eight stream processors (SPs). Multiprocessors also contain units for sampling and filtering texture data, which are used both in graphics and in some computational tasks.

Thus, in one TPC cluster there are 24 stream processors, and in the GeForce GTX 280 GPU there are already 240 of them.

The summary characteristics of video cards based on NVIDIA GPUs used in testing are presented in the table.
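
For the NVIDIA cards, some of these characteristics can also be read programmatically through the CUDA runtime; a small sketch of my own is shown below. Note that the runtime reports the number of multiprocessors but not the CUDA cores per multiprocessor: that figure (192 per Kepler SMX, 48 per GF104 SM, 8 per GTX 280 SM) follows from the architecture.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0

    printf("name:                %s\n", prop.name);
    printf("compute capability:  %d.%d\n", prop.major, prop.minor);
    printf("multiprocessors:     %d\n", prop.multiProcessorCount);
    printf("memory clock (kHz):  %d\n", prop.memoryClockRate);
    printf("memory bus (bits):   %d\n", prop.memoryBusWidth);
    return 0;
}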

The table does not include the AMD Radeon HD6850 video card, which is quite natural, since its technical characteristics are difficult to compare directly with those of the NVIDIA graphics cards. We will therefore consider it separately.

The AMD Radeon HD6850 GPU, codenamed Barts, is manufactured using a 40nm process and contains 1.7 billion transistors.

The AMD Radeon HD6850 processor architecture is a unified architecture with an array of common processors for streaming multiple kinds of data.

The AMD Radeon HD6850 processor consists of 12 SIMD cores, each containing 16 superscalar stream processor units and four texture units. Each superscalar stream processor contains five universal stream processors. Thus, in total, there are 12 × 16 × 5 = 960 universal stream processors in the AMD Radeon HD6850 GPU.

The GPU frequency of the AMD Radeon HD6850 graphics card is 775 MHz, and the effective frequency of the GDDR5 memory is 4000 MHz. The amount of memory is 1024 MB.

Test results

So, let's turn to the test results. Let's start with the first test, which uses the NVIDIA GeForce GTX 660Ti video card and the standard operating mode of the Intel Core i7-3770K processor.

Figures 1-3 show the results of converting the three test videos with the three converters, with and without the GPU.

As can be seen from the test results, the effect of using the GPU is obvious. For Xilisoft Video Converter Ultimate 7.7.2, when using a GPU, the conversion time is reduced by 14%, 9%, and 19% for the first, second, and third videos, respectively.

For Wondershare Video Converter Ultimate 6.0.3.2, using the GPU reduces the conversion time by 10%, 13% and 23% for the first, second and third videos, respectively.

But Movavi Video Converter 10.2.1 benefits the most from the use of a GPU. For the first, second and third video, the reduction in conversion time is 64%, 81% and 41% respectively.

It is clear that the gain from using the GPU depends on both the original video and the video conversion settings, which, in fact, is demonstrated by our results.

Now let's see what the gain in conversion time will be when overclocking the Intel Core i7-3770K processor to a frequency of 4.5 GHz. If we assume that in normal mode all processor cores are loaded during conversion and operate at a frequency of 3.7 GHz in Turbo Boost mode, then an increase in frequency to 4.5 GHz corresponds to overclocking by 22%.

Figures 4-6 show the results of converting the three test videos with the processor overclocked, with and without the GPU. In this case, too, using the graphics processor reduces the conversion time.

For Xilisoft Video Converter Ultimate 7.7.2, when using a GPU, the conversion time is reduced by 15%, 9%, and 20% for the first, second, and third videos, respectively.

For Wondershare Video Converter Ultimate 6.0.3.2, using the GPU reduces the conversion time by 10%, 10% and 20% for the first, second and third videos, respectively.

For Movavi Video Converter 10.2.1, the use of a GPU can reduce the conversion time by 59%, 81% and 40% respectively.

Naturally, it is interesting to see how overclocking the processor can reduce the conversion time with and without a GPU.

Figures 7-9 compare the video conversion time in the processor's normal mode and in the overclocked mode without using the GPU. Since in this case the conversion is performed by the CPU alone, without GPU computation, it is obvious that raising the processor clock speed shortens the conversion time (increases the conversion speed). It is equally obvious that the reduction in conversion time should be roughly the same for all test videos. For Xilisoft Video Converter Ultimate 7.7.2, overclocking the processor reduces the conversion time by 9%, 11% and 9% for the first, second and third videos, respectively. For Wondershare Video Converter Ultimate 6.0.3.2, the conversion time is reduced by 9%, 9% and 10%. And for Movavi Video Converter 10.2.1, it is reduced by 13%, 12% and 12%, respectively.

Thus, when the processor is overclocked by roughly 20%, the conversion time is reduced by about 10%.

Let's compare the video conversion time using the GPU in the normal mode of the processor and in the overclocking mode (Fig. 10-12).

For Xilisoft Video Converter Ultimate 7.7.2, overclocking the processor reduces the conversion time by 10%, 10% and 9% for the first, second and third videos, respectively. For Wondershare Video Converter Ultimate 6.0.3.2, the conversion time is reduced by 9%, 6% and 5%. And for Movavi Video Converter 10.2.1, it is reduced by 0.2%, 10% and 10%, respectively.

As you can see, for the Xilisoft Video Converter Ultimate 7.7.2 and Wondershare Video Converter Ultimate 6.0.3.2 converters, the reduction in conversion time from overclocking the processor is approximately the same with and without the GPU, which is logical, since these converters do not use GPU computing very efficiently. For Movavi Video Converter 10.2.1, which does use GPU computing efficiently, overclocking the processor has little effect on the conversion time in GPU mode, which is also understandable: in this case the main load falls on the GPU.

Now let's see the test results with different video cards.

It would seem that the more powerful the video card and the more CUDA cores (or universal stream processors, for AMD video cards) its GPU has, the more efficient video conversion should be when the GPU is used. In practice, it does not work that way.

As for video cards based on NVIDIA GPUs, the situation is as follows. With Xilisoft Video Converter Ultimate 7.7.2 and Wondershare Video Converter Ultimate 6.0.3.2, the conversion time is practically independent of the video card used: for the NVIDIA GeForce GTX 660Ti, NVIDIA GeForce GTX 460 and NVIDIA GeForce GTX 280 in GPU computing mode, the conversion time is the same (Fig. 13-15).

Fig. 13. Results of comparing the conversion time of the first video on different graphics cards in GPU usage mode

Fig. 14. Results of comparing the conversion time of the second video on different graphics cards in GPU usage mode

Fig. 15. Results of comparing the conversion time of the third video on different graphics cards in GPU usage mode

This can only be explained by the fact that the GPU computation algorithm implemented in Xilisoft Video Converter Ultimate 7.7.2 and Wondershare Video Converter Ultimate 6.0.3.2 is simply inefficient and does not keep all of the graphics cores busy. This also explains why, for these converters, the difference in conversion time between GPU and non-GPU modes is small.

In Movavi Video Converter 10.2.1, the situation is somewhat different. As we remember, this converter is able to use GPU calculations very efficiently, and therefore, in the GPU mode, the conversion time depends on the type of video card used.

But with the AMD Radeon HD6850 video card, the picture is quite different. Either the video driver is buggy, or the algorithms implemented in the converters need serious work, but with GPU computing enabled the results either do not improve or actually get worse.

More specifically, the situation is as follows. For Xilisoft Video Converter Ultimate 7.7.2, using the GPU increases the conversion time of the first test video by 43% and of the second video by 66%.

Moreover, Xilisoft Video Converter Ultimate 7.7.2 is also characterized by unstable results. The spread in conversion time can reach 40%! That is why we repeated all the tests ten times and calculated the average result.

But for Wondershare Video Converter Ultimate 6.0.3.2 and Movavi Video Converter 10.2.1, using the GPU does not change the conversion time of any of the three videos at all! It is likely that Wondershare Video Converter Ultimate 6.0.3.2 and Movavi Video Converter 10.2.1 either do not actually use AMD APP technology during conversion, or the AMD video driver is simply buggy, so that AMD APP does not work.

Conclusions

Based on the testing carried out, the following important conclusions can be drawn. Modern video converters really can use GPU computing technology, which can increase the conversion speed. However, this does not mean that all calculations are completely transferred to the GPU while the CPU sits idle. As testing shows, when GPGPU technology is used the central processor remains loaded, which means that powerful multi-core CPUs remain relevant in systems used for video conversion. The exception to this rule is AMD APP technology on AMD GPUs: for example, when Xilisoft Video Converter Ultimate 7.7.2 is used with AMD APP enabled, the CPU load does indeed drop, but the conversion time does not decrease; on the contrary, it increases.

In general, if we talk about video conversion with the additional use of a graphics processor, it is advisable to use video cards with NVIDIA GPUs. As practice shows, only in this case is it possible to achieve an increase in conversion speed. And keep in mind that the real gain in conversion speed depends on many factors: the input and output video formats and, of course, the video converter itself. Xilisoft Video Converter Ultimate 7.7.2 and Wondershare Video Converter Ultimate 6.0.3.2 are poorly suited to this task, whereas Movavi Video Converter 10.2.1 is able to use NVIDIA GPUs very efficiently.

As for video cards based on AMD GPUs, they should not be used at all for video conversion tasks. In the best case, this will not give any increase in the conversion speed, and in the worst case, you can get a decrease in it.
