GPGPU using OpenCL

cayoenrique

Member
Messages
475
Update.
I know you have been waiting a long time. Rest assured I am still working, but please give me a little longer. The funny part is that after all your waiting there is no guarantee that my code will be faster. That's how OpenCL works. To get faster you consume more resources, and using more resources can make you slower. In the end the result is a balance between what you gain in speed and what you waste in resources. We will see.

Now I will post OpenCL books and resources in PDF for the brave ones. Never forget that the AUTHORITY is
Code:
https://www.khronos.org/opencl/
https://www.khronos.org/opencl/community-resources/
https://github.com/KhronosGroup/OpenCL-Guide

So here are some books for you to study while you give me some time
OpenCL_Books.zip (36.47 MB)
Code:
https://workupload.com/file/qLPUZMqcWS6
 

cayoenrique

Many contradictory KPS figures were given in the thread "Encryption Projects as SU group".
I know it all depends on the computer's characteristics, overclocking, the manufacturer, and running a single CudaBISS instance vs launching several at the same time.

But I need a reference! I decided to use dvlajkovic's RTX 4090 (16,384 cores @ 2.235 GHz). He posted this LINK as sample results. I know his KPS result looks funny because it is negative. But C0der seems to believe that the negative value in fact means
Code:
-913098240 = 0xC9933A00 = 3381869056 = 3381M

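For those who do not understand C0der's answer, here is the explanation. A 32-bit integer has 2^32 = 4294967296 possible values, so a counter that overflows the signed range wraps around and shows up as a negative number. If we add the negative value to 2^32 we get the positive one.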
Code:
4294967296 -913098240 = 3381869056

I can also corroborate that fact by using the numbers provided in the link:
Code:
41696649805824 / 12329469 / 1000 = 3381.86906555538

So mystery solved: -913098240 is in fact 3381 MKPS, or 3.381 BKPS, or 3.381 x 10^9 keys per second.
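
As a quick check of the wraparound explanation, here is a minimal Python sketch. It assumes the counter was a signed 32-bit integer, and it reads the two numbers from the link as a total key count and an elapsed time in milliseconds. Both readings are my assumption, and the variable names are only illustrative.

```python
# Reinterpret the negative value as an unsigned 32-bit number.
# Assumption: the KPS counter overflowed a signed 32-bit integer.
reported_kps = -913098240
unsigned_kps = reported_kps & 0xFFFFFFFF  # same as adding 2**32 to a negative value

print(hex(unsigned_kps))  # hexadecimal form of the wrapped value
print(unsigned_kps)       # the positive keys-per-second figure

# Cross-check with the two figures from the linked result.
# Assumption: they are total keys tested and elapsed milliseconds.
total_keys = 41696649805824
elapsed_ms = 12329469
# keys per millisecond equals thousands of keys per second;
# dividing by 1000 again gives millions of keys per second (MKPS).
mkps = total_keys / elapsed_ms / 1000
print(round(mkps, 2))
```

If both prints agree on roughly 3381 million keys per second, the negative number was only a display problem and not a wrong measurement.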

Why is this important? Because the main objective of this thread is to teach you how to use OpenCL and, at best, to match CudaBISS speed. Never to provide a tool fast enough to be a threat to CSA encryption. To do this I need to correlate what dvlajkovic's RTX 4090 (16,384 cores @ 2.235 GHz) does with other GPUs. I came up with this correlation table.

Code:
GPU                   Cores  Clock (MHz)  Device Power               KPS       KPH = KPS x 3600  Time for full range
                                                                                                      Hours      Days
RTX 4090              16384         2230      36536320  3,381,869,056.00  12,174,728,601,600.00       23.12      0.96
RTX 3090Ti            10752         1395      14999040  1,388,338,761.15   4,998,019,540,132.74       56.32      2.35
RTX 2080Ti             4352         1350       5875200    543,819,330.40   1,957,749,589,452.91      143.77      5.99
GTX 1080Ti             3584         1480       5304320    490,977,626.40   1,767,519,455,052.91      159.25      6.64
GTX 1070Ti             2432         1607       3908224    361,752,409.92   1,302,308,675,702.96      216.14      9.01
GTX 1060               1280         1506       1927680    178,429,610.37     642,346,597,323.77      438.20     18.26
GTX 1050Ti              768         1290        990720     91,702,867.48     330,130,322,927.36      852.62     35.53
AMD Radeon HD 6770M     480          725        348000     32,211,520.80     115,961,474,865.47    2,427.31    101.14

If you do not have one of those, this is what you have to do. These are the formulas:

Code:
Device Power = cores x  clock
KPS = 92.56 x Device Power
KPH =  333222 x Device Power
Hours =  844705468 / Device Power
Days = 35196061 / Device Power

Let's use my HD 6770M as an example. First compute your Device Power:
Code:
Device Power =  480 x 725 =  348000

Then my HD 6770M will correlate to:
Code:
KPS = 92.56 x 348000 = 32210880 keys per second
KPH =  333222 x 348000 = 115961256000 keys per hour
Hours =  844705468 / 348000 = 2427.31 hours
Days = 35196061 / 348000 = 101.14 days
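
The same estimate can be scripted for any card. This is only a sketch: the constants are the rounded ones from the formulas above, the function name `estimate` is made up for illustration, and scaling linearly with cores x clock is a rough correlation, not a measurement.

```python
# Constants copied from the formulas in this post (rounded by the author).
KPS_PER_UNIT = 92.56
KPH_PER_UNIT = 333222
HOURS_NUMERATOR = 844705468
DAYS_NUMERATOR = 35196061

def estimate(cores, clock_mhz):
    """Return the correlated figures for a GPU as a dict (hypothetical helper)."""
    device_power = cores * clock_mhz
    return {
        "device_power": device_power,
        "kps": KPS_PER_UNIT * device_power,
        "kph": KPH_PER_UNIT * device_power,
        "hours": HOURS_NUMERATOR / device_power,
        "days": DAYS_NUMERATOR / device_power,
    }

result = estimate(480, 725)  # AMD Radeon HD 6770M
for name, value in result.items():
    print(f"{name}: {value:,.2f}")
```

Replace 480 and 725 with the core count and clock (in MHz) of your own GPU to get your row of the table.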

WHAT DOES ALL THIS MEAN? What should be NEXT...

Well, if my AMD Radeon HD 6770M can go as fast as the KPS column says for my device, then I am close to matching CudaBISS speed. For my AMD Radeon HD 6770M the number is 32210880 keys per second, or 32 MKPS. And as I posted before, I was over that already. This means I personally do not want to post code faster than this!!! I know you are upset.

Now what I need to do is finish this OCLBiss so that, instead of providing fake keys, it can find the REAL key. And I need to find a way to SATURATE the GPU so that dvlajkovic's RTX 4090 (16,384 cores @ 2.235 GHz) does in fact reach its full potential.

Now, as a personal request: if you find the code goes faster than the KPS column for your device, PLEASE do not post the results publicly here. The Chinese manufacturers can quickly pick up on these results and kill CSA as it is… No more CSA means no more SU forum and no more hacked SAT TV. Again, what you do in private is up to you.

Update on what I have been doing.
Just to give you an idea, I was on version 21 and I am now on version 61… Do not think that all those versions are good ones. In fact ALL have been somewhat of a failure. I still do not have a multithreaded OCLBiss working. That is why I have not posted it.
 

cayoenrique

Update.
There is a lot to do. But I guess you are expecting some code, so I hope to publish some tomorrow. I am not doing it now because it is too late and I have to sleep. But tomorrow I will try to clean up the code and test it on Windows before release. See you tomorrow.
 

cayoenrique

After another night of hard work, this time under Windows, here it is:

OCLBiss_077.zip (56.26 KB)
Code:
https://workupload.com/file/jxW8CgwGSYd

It is late and there is a lot to say.

As it is, you can build it just by using the make command. For the Code::Blocks project, rename "OCLBiss.cbp_win" to "OCLBiss.cbp". On Linux use "OCLBiss.cbp_linux".

In the config file OCLBiss.cfg you can find the usual few files:
Code:
#PROGRAM_FILE:"csa_decrypt_1block_000a.cl"
#PROGRAM_FILE:"csa_decrypt_1block_001a.cl"
PROGRAM_FILE:"csa_decrypt_1block_002a.cl"
#PROGRAM_FILE:"csa_decrypt_1block_003a.cl"

New
Code:
#VECTORTEST
VECTORTEST:0 0 43   // If VECTORTEST > VECTORMAX (43) then the vector test is ignored. If smaller, the test stops at the Nth compared key

If you give VECTORTEST:0 it will work as usual and search for the REAL key. But if you give a value from 1 to 43 it will run a vector test that makes sure the OpenCL fake keys are the ones we are looking for. In other words, it makes sure the program functions as it should.

Code:
#Main Speed Adjustments
MULTITHREADSIZE:4   1 2 4                   # Number of CPU search threads to launch simultaneously
PES1ROUNDSB4PES2:16                         # Number of PES1 rounds before testing for PES2 < 16
LOCALWORKDROUPSIZE:0 64 256 64 256  0 64 128 256           # 1) Set recommended, multiple of 32 or 64, new fast GPUs can do 256
GLOBALWORKDROUPSIZE:0 1536 6144 3072 1536 1536 6144 1536 3072 1536 128 256 6144 0 1536 3072 6144 # 2) Set recommended, take note of how many CU. The first suggested value is CU x 256, then multiples of 2. In my case 6 * 256 = 1536, then multiples 1536 3072 6144
LOOPSPERKERNEL:256 256 516 1024 2048 4096             # 3) Adjust LOOPSPERKERNEL to get a cadence of about 1 second

Now the program can run from 1 to 4 threads.
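
To pick candidate values for GLOBALWORKDROUPSIZE on your own card, the rule in the config comment (first CU x 256, then multiples of 2) can be written as a tiny helper. This is a sketch only: the function name is hypothetical, and you still have to look up your compute unit count yourself (in OpenCL it is reported by clGetDeviceInfo with CL_DEVICE_MAX_COMPUTE_UNITS).

```python
def suggested_global_sizes(compute_units, local_size=256, count=3):
    """Start at CU x local_size, then keep doubling (hypothetical helper)."""
    base = compute_units * local_size
    return [base * (2 ** i) for i in range(count)]

sizes = suggested_global_sizes(6)  # 6 compute units, as in the config comment
print(sizes)

# In OpenCL 1.x the global size must be a multiple of the local size,
# so every candidate should divide evenly by the local sizes we try.
for local in (64, 128, 256):
    assert all(size % local == 0 for size in sizes)
```

These are only starting points. As with everything on a GPU, test each one and keep the value that gives the best speed.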

PES1ROUNDSB4PES2:16 means what it reads: the number of PES1 rounds before we do the final test for PES2 & PES3. PES is the encrypted Packetized Elementary Stream, here the three 16-byte TS strings used to decrypt.
Code:
PES1:3CEBDC173C2BD64F651688F258D59705
PES2:AD43A480B11CDDBE60AC847768D7A771
PES3:113D7195079EBDB25A66B08092519DE7
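
Since these strings are typed by hand for now, it is easy to lose a character. Here is a small sketch that checks each line really holds 16 bytes. It assumes the "PESn:HEX" layout shown above, and the helper name `parse_pes_line` is made up for illustration; it is not part of OCLBiss.

```python
def parse_pes_line(line):
    """Split 'PESn:HEX' into (name, bytes) and check the 16-byte length."""
    name, hex_text = line.strip().split(":", 1)
    data = bytes.fromhex(hex_text)  # raises ValueError on non-hex characters
    if len(data) != 16:
        raise ValueError(f"{name} must be 16 bytes, got {len(data)}")
    return name, data

lines = [
    "PES1:3CEBDC173C2BD64F651688F258D59705",
    "PES2:AD43A480B11CDDBE60AC847768D7A771",
    "PES3:113D7195079EBDB25A66B08092519DE7",
]
blocks = dict(parse_pes_line(line) for line in lines)
print({name: len(data) for name, data in blocks.items()})
```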

As you are going to see, many if not most variables have changed name. This is to prevent confusion. The program now has THREADS, meaning individual instances of the search running on the CPU, whereas in the past I used "threads" to name the programs running on cores inside the GPU.

Do not forget the SCROLL LOCK LED will turn on if you have one on your keyboard.


I hope you like it. Now I am going to take a rest. Maybe I will finalize the T2MI program that I started for @dvlajkovic.
 

cayoenrique

AHHH.
The program is unfinished. We need to decide what to do when we find the key: stop the program or continue, or even send an email or tweet.
We need to add auto-detection for the 3 PES, so that we do not need to write them manually.
We need to verify that the program continues at 0000000000000000 when we decide to start in the middle.

And many other things.

In fact you will find I do a lot of extra things. I was forced to add unneeded things in an attempt to make the program work in multithread mode. Now that it is running we can see what we can eliminate.
 

cayoenrique

@Me2019H

Let me start by saying that building on an OLD GPU may add some extra situations and problems, problems that may not exist on newer GPUs. I am not sure about all this. This is the very first time I have done a multi-threaded solution for OpenCL. And if you search the net you may find no samples!

So this last solution took a lot of work from me. I had no time to do extra things like adding comments.

Now the latest OCLBiss_077 code is a complex C product. It uses most of the topics that many would consider advanced programming: pointers, dynamic allocation of variables, keyboard capture, separating common sections into C files, local vs global variables, managing threads and ensuring they all close on exit, passing variables to threads, and making sure writes to the output do not collide. And it needs to compile on both Linux and Windows. In general it is a complex program.

And most important, this is the first release, so it is ALPHA and needs a lot of improvements.

For a newbie, trying to learn from OCLBiss_077 is suicidal. You need to learn the OpenCL basics from the previous versions. I will take some time to add comments soon. But right now I am in resting mode.
 

C0der

Registered
Messages
270
If the total number of GPU-threads is the same, does higher MULTITHREADSIZE increase speed?
 

cayoenrique

On a GPU nothing is written in stone. What seems to be the most logical conclusion can fail when tested. On the other hand, simple changes that you think make no meaningful difference can prove us wrong under test. And what works on one GPU may not work on another model. What I have found is that we need to test all possible variants until we find the one that gives us the most value.

Now "threads"... From now on I do not call a single core's work a thread. Why? To prevent confusion with CPU thread instances. MULTITHREADSIZE counts instances of the same program running on the CPU.

So what you call "GPU-threads" I will now call GLOBALWORKDROUPSIZE, and a single core's work I will call a Task.

There is nothing wrong with your question. But I will re-write the question to prevent confusion.
If GLOBALWORKDROUPSIZE is the same, does higher MULTITHREADSIZE increase speed?

The short answer is yes. And I explained it here: LINK
To get work done on the GPU, first you need to submit input DATA, then you compute, and finally you need to request output DATA. The submission of input and output DATA goes through your PC's PCI or PCIe bus. It is a process that takes time. Time that you lose.

Now in reality there is nothing you can do to make it faster or to avoid losing this time. What you can do is use magic!! Yes, just as magic makes us believe something can be done, we can hide this lost time!!
Let's take the swimming pool. If I bring 10 guys to help me, my wasted time does not improve. In fact it gets worse: now I need to wait in line for 10 guys to fill their buckets before I can refill mine!!!! See what I am saying? Wasted time is not removed.

But the important thing is not whether I am losing time. It is that the time the faucet supplying the water stays open gets maximized. More water means the swimming pool fills faster.
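
The faucet idea can be put into numbers with a toy model. This is only a sketch under simple assumptions: every CPU thread repeats "transfer, then compute", transfers of one thread can overlap with the compute of another, and the timings below are hypothetical. It ignores the extra overhead that too many threads bring.

```python
def gpu_busy_fraction(host_threads, transfer_s, compute_s):
    """Toy model: fraction of time the GPU spends computing.

    Each host thread repeats: transfer (GPU idle for that thread), then compute.
    Other threads can compute while one is transferring, so the busy
    fraction grows with the thread count until the GPU is saturated.
    """
    per_thread = compute_s / (transfer_s + compute_s)
    return min(1.0, host_threads * per_thread)

# Hypothetical timings, chosen only for illustration.
for n in (1, 2, 4):
    print(n, round(gpu_busy_fraction(n, transfer_s=0.3, compute_s=0.7), 2))
```

In this model the wasted transfer time per thread never shrinks. The GPU simply has another thread's work ready while one thread is busy with the bus.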

Now do not go and attempt 100 in MULTITHREADSIZE!! Just as moonbase did on his last attempt. You can kill the benefit by overworking direct memory access. Your PC also needs resources to keep track of all the work needed. You just need to have a few submissions for input and output waiting in line.

Regarding my last program: at the moment you cannot go higher than 4 in MULTITHREADSIZE. It is hard-coded. You can only do it if you modify the code so that more threads can be started. So good values are 1 to 4.
 

moonbase

VIP
Donating Member
Messages
552
...Now do not go and attempt to do a 100 in MULTITHREADSIZE!! Just as moonbase did on his last attempt...

I did not attempt to do 100 multi threads. Do not make false comments about my tests.

The last multi instance test I did with CudaBISS was for 8 instances at the same time using an MSI 4090 GPU. It worked perfectly and it was fast.
This session of 8 instances at the same time yielded a speed of 7853 million keys per second, yes it was almost 8 billion keys per second.
It processes the full biss range from 000000... to FFFFFF.... in less than 10 hours.

What speed of key search are you getting with your Open CL tool please?
 

cayoenrique

@moonbase My apologies to you if I tried to be funny and failed.

Let me say something. I have a few interested people: C0der, Me2019H, dvlajkovic and a few other good ones. But be aware that above all I admire you. You are the only one who keeps testing and looking for different alternatives. I wish everyone would follow your interest. Having said that, I hope you understand that in no way would I try to minimize your work.

I will repeat this for @all. I have ONLY admiration for moonbase's interest in finding a better solution for all of us.

Regarding KPS speed, here is where we think differently. Only a few CAS use DES, and even fewer use AES, to encrypt video. So we could say almost ALL the video we hack is based on CSA. If anyone publicly says that they are cracking CSA, those people are in fact putting the security of CSA at risk. Whatever you post here publicly will be picked up by the Chinese manufacturers or the ones who make money from these cheap receivers. In the end the providers will move away from CSA and we will lose our way of hacking sat TV.

Now the objective of this thread is for people to learn OpenCL so that they can build their own tools. But to make a few people happy I was hoping to at least match CudaBISS speed. I know you mention hours. But I use as a reference 1 day for a full key search on the latest GPU. They have not posted a speed! I guess they are either afraid of saying something out of order or the program may be a little faster. I do not know the right answer.

Now PLEASE download the source code and compile it with just make. Clearly you need to do the setup from the tutorial. But if for any reason you refuse to install MinGW, then just ask for the binaries. We are not hiding them. There is just no need if you can compile it yourself. Then you can see what speed you get yourself. I will see if I can build it on Windows and post it for you. I am a Linux user.

If you read my POST #124 you will see that I am hoping it will approach about 3.381 x 10^9. But at the moment my best operational GPU is a 480-core AMD Radeon HD 6770M.
 

cayoenrique

I do not want to be accused of providing a bad exe, or to get virus-related comments. If you are that kind of user, PLEASE do not download it.
I am forced to post this binary to keep moonbase happy. If you download it, PLEASE check it with antivirus software before running it on Windows. Best is for you to build it from sources. You have been advised.
OCLBiss_077_win_static_binary.zip (65.93 KB)
Code:
https://workupload.com/file/cttjNqqQYXW

And do not forget: READ post #124. If you get speeds faster than 3.381 x 10^9, please do not comment in public; we do not want to know. Use PM for any critical advice. THANK you.
 

moonbase

@cayoenrique

I am happy to try to test your OpenCL tool. The only issue I have is that I have no knowledge of Linux and compiling so I have no idea what to do with any files that I download or how to create and test the tool.
All my PCs run Windows 10. For me to be able to test anything, it needs to be able to run on that O/S.
 

cayoenrique

As requested by dvlajkovic, all the info in this thread was designed to be built and executed on Windows (W7, W10 & W11).
As for "no knowledge of Linux and compiling": it took me a great amount of time to create the step-by-step tutorials, most of which have pictures.
I am trying to make you happy. But you need to do your assignment too.

I guess your only option is to ask the people here by PM to get your numbers, as that seems to be all you want. But if you want to join in learning, please join in and start with the tutorials.
 

dvlajkovic

Member
Messages
498
@moonbase

Start reading from the beginning.
Everything is there in details: what we do, how to test, what option to change, etc.
Enrique did a fine job explaining the basics: now spend some quality time reading and trying it out.
Learn from fails and improve. That's easiest way to understand and make progress.
No one said it works out of a box, so dive in if you're certain that you want to learn.
As for the compiling, there are many sites that answer questions.
None of them will spare you from reading/studying/trying/failing/and ultimately winning.
This is something that money can't buy and it's totally up to you.
 