#1
How to get the most out of lolz parallelism
Originally I was planning to add this as a comment to my "Cost vs return debate" thread, but I did not want to necro my old topic, plus this is not completely relevant to it. I am not going to compare lolz with other compressors here; I am going to share my experience of how its MT works and how to get the most out of it.
You may have noticed that even with -mt1, lolz benefits from other cores by offloading certain calculations (most likely detection) to other threads. Still, users very commonly set the number of threads equal to the number of CPU cores. I ran a small test comparing MT scaling on my Intel 4690K CPU. I used version 21a7, and later also tried the latest 22c4b just to see ldmf in action and whether it is efficient speed-wise. This test is not about final file size: it was done on a small ~600 MB file and size variations were negligible regardless of settings, so I will focus only on speed and time here. The base settings were -d128 -mc8 -tt1; from there I changed only -mt, and for the last ldmf tests only -ldl.

Let's start with -mt1:
mt1.png
^ Compression on a "single" thread took 3:48 min at an average speed of ~2762k/s. I put "single" in quotes deliberately because CPU usage often spiked as high as 74% (!) and was at least 36-42% most of the time.

Let's try -mt2:
mt2.png
^ Scaling is good, though not quite linear: speed jumped to ~4448k/s (about 1.6x) and the time was 2:21 min.

This is -mt3:
mt3.png
^ Speed is even higher at ~5100k/s, and it took only 2:03 min. Scaling is already worse: we gained only about ~600k/s vs the previous ~1700k/s.

Finally, -mt4 (and you already know what to expect, don't you):
mt4.png
^ Yep, speed is actually slightly lower than with -mt3 and the time is 1 sec longer, but this is not even the full picture. I ran this test twice, and what I am showing you is the second run, where I did not even touch the computer during processing. In the first run, merely having a web browser open took the speed down to ~4400k/s, which was less than -mt3! The culprit is most likely insufficient L2/L3 cache (cache misses). In other words, running on 2 cores of a 4C CPU is the most efficient way, and even -mt1 is not as bad as I originally thought.
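To put the timings above into numbers, here is a minimal Python sketch that turns the reported wall-clock times into speedup and per-thread efficiency relative to the -mt1 run. The times come from the test above (the -mt4 time is taken as the -mt3 time plus 1 sec); the function and variable names are only illustrative.

```python
# Wall-clock times from the test above, in seconds.
# -mt4 is taken as the -mt3 time plus 1 sec, as described in the post.
times = {1: 3 * 60 + 48, 2: 2 * 60 + 21, 3: 2 * 60 + 3, 4: 2 * 60 + 4}


def scaling_table(times_by_threads):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    base = times_by_threads[1]
    table = {}
    for threads, seconds in sorted(times_by_threads.items()):
        speedup = base / seconds
        # Efficiency is speedup divided by the number of requested threads.
        table[threads] = (speedup, speedup / threads)
    return table


table = scaling_table(times)
for threads, (speedup, efficiency) in table.items():
    print(f"-mt{threads}: {speedup:.2f}x speedup, {efficiency:.0%} per-thread efficiency")
```

Keep in mind that -mt1 already uses more than one core, so the efficiency figures are relative to that baseline rather than to a truly single-threaded run.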
However, since I strongly suspect a heavy dependency on L2/L3 cache, you would be better advised to run your own tests and set lolz's -mt parameter to get the most out of YOUR OWN CPU. It wouldn't surprise me at all if, for example, AMD CPUs behaved differently, and by the same token Hyper-Threading CPUs are worth considering (and testing). Maybe on CPUs with more than 4 cores it will be necessary to use 2 cores fewer, not just 1, especially if CPU cache is a concern.

Anyway, I thought you may want to see this... oh, and let's throw some -ldmf from the most recent lolz version into the mix, but please bear in mind this test may or may not be good enough for that, as the file was too small. Anyway, -ldmf with the default -ldl8:
ldmf8.png
and -ldl5:
ldmf5.png
^ These tests were done with -mt1, so compared with the previous -mt1 run it's a difference of a few seconds. So yeah, that's it.

EDIT: on bigger files, ldmf time will extend to a couple of minutes. On ~16 GB it added around ~12-24 min. That makes it slower than srep, but in the picture of the whole compression, which took more than 1h30min (1h34min was the quickest, with -mt3; -mt1 took 3h+), it's doable. Also, ldmf doesn't do a full pre-pass of the data like srep, which would add another 6-12 min if you compressed big things like 60 GB+ on a regular HDD. If the added time of ldmf concerns you, you are really better off just using lzma; FA's 4x4 is simply unbeatable when it comes to ratio vs speed/memory (srep+xlzma:lc8 compressed those 16 GB in ~16 min with no worse than ~8% loss == ~700% speedup vs ~8% ratio). But between external lzma's (which are slower than xlzma) and lolz, I would rather use lolz. I mean, either I want max compression or max efficiency.

EDIT2: unfortunately I have bad news: ldmf really doesn't work well with multiple cores, even if mtt=0. From that 16 GB dataset (inflated "We Happy Few" game .pak's):
Quote:
- if you want to use more than 1T in lolz, srep is the only good option, and then ldmf is pointless and would only add to compression time
- make sure you disable it if you use srep with -mt > 1
- srep+lolz[MT] can still achieve very similar results to lolz[1T]+ldmf, with only slightly worse compression but a significant speedup, in my case 3x
- even with srep and -mt3 though, lolz is still 5x slower than srep+xlzma, now with only about ~6.5% better cmp ratio
- between lolz[1T]+ldmf and lolz[3T]+srep there is only ~1.75% cmp ratio difference, but 300% speed difference
- of course, there will be variation with different data/games, but my past experience was that 90+% of the time there was no more than ~2% variation

Then again, this is just for extra info. The main topic is lolz and its best threading settings + ldmf, and on that the conclusion is that it may be better to set MAX_CPU-1 or -2 depending on CPU type, and to use ldmf only if you use -mt1 AND -mtt0; otherwise go with srep (for now at least). If you want to keep the best ratio with lolz[1T]+ldmf, there is one way around it: process several games separately at once; overall that will be a more efficient use of CPU cores to cut time.
Last edited by elit; 18-02-2019 at 13:30.
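The rule of thumb from this post can be sketched as a tiny Python helper. This is only an illustration: `suggest_lolz_settings` and the dictionary keys are made-up names, and the "2 cores fewer on CPUs with more than 4 cores" branch reflects a suspicion from the post, not a measured result.

```python
def suggest_lolz_settings(physical_cores, want_best_ratio):
    """Hypothetical helper encoding the rule of thumb from this post."""
    if want_best_ratio:
        # ldmf only paid off with a single thread and mtt0 in the tests above.
        return {"mt": 1, "mtt": 0, "dedup": "ldmf"}
    # Otherwise leave 1 core free (2 on CPUs with more than 4 cores, which the
    # post only suspects and did not measure) and use srep instead of ldmf.
    spare = 2 if physical_cores > 4 else 1
    return {"mt": max(1, physical_cores - spare), "dedup": "srep"}


print(suggest_lolz_settings(4, want_best_ratio=False))
print(suggest_lolz_settings(4, want_best_ratio=True))
```

Treat the output as a starting point for your own tests, not as a replacement for them.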
#2
Had the same with mt options in my tests.
mt6 gave around 2800k/s, while mt12 gave just ~3200k/s. But I think it depends on the input you have.
__________________
Haters gonna hate
#3
KaktoR, do you happen to have a Ryzen 2600 variant (6C12T)? If so, then the fact that you can still see any benefit with max threads, even more so when half of them are just HT, is good information to know.
Btw, I don't see why input data should matter as long as it's the same for all tests. We should only care about different thread counts on the same processor (measuring effectiveness); it's not about my CPU vs yours, just to be clear (but yes, I too believe different input data affects lolz speed, from my observation). If you have a 2600 or similar, those have a huge L3 cache, although I would also try 2-3T. The 2600 also has a different design: it acts like NUMA but requires an external AMD application to switch the way the cores work. You could get very different results by tweaking -mt and CPU mode.
Last edited by elit; 16-02-2019 at 13:11.
#4
Yes, Ryzen 5 2600 6C/12T here.
I think so because on other inputs I had nearly 4000k/s with mt6 (in my last test today I had only 2800k/s, max 3000k/s. OK, admittedly the last input was about 16GB and the other one just 3GB). And of course I didn't do anything else during this time other than compress with lolz.
#5
You need to choose one data set and run all tests on it with different -mt[N], on that same data and with the same other settings of course. For your CPU the most important values would be 1-2-3-4-5-6-8-10-12, each run twice with the 2 different modes set through the AMD app. Then you could see how your CPU's scaling goes. The idea is to see which -mt[N] lands on the most effective Pareto frontier. Also, the dictionary/block size should be small enough that even with 12T the input file can be divided into that many parts.
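Once you have the timings, picking the Pareto frontier can be sketched in a few lines of Python. This is a minimal sketch under my own assumptions: a run counts as "dominated" if another run uses no more threads and takes no more time, and the example data is the 4690K timings from the first post (in seconds), not Ryzen results.

```python
def pareto_frontier(results):
    """Keep (threads, seconds) runs that no other run beats on both axes."""
    frontier = []
    for threads, seconds in results:
        # A run is dominated if a different run needs no more threads
        # and no more time.
        dominated = any(
            t <= threads and s <= seconds and (t, s) != (threads, seconds)
            for t, s in results
        )
        if not dominated:
            frontier.append((threads, seconds))
    return sorted(frontier)


# 4690K timings from the first post: -mt1..-mt4 in seconds.
runs = [(1, 228), (2, 141), (3, 123), (4, 124)]
print(pareto_frontier(runs))
```

With that data -mt4 drops off the frontier, because -mt3 is faster while using fewer threads.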
#6
RZ 1600
Code:
Creating archive: VC4.Bin.001 using rep+srep:m3f:l512+lolz:dtb1:d32:mtt1:mt10:mc1023+diskspan:4410mb:4430mb
Compressed 25 files, 64,396,399,216 => 4,624,220,180 bytes. Ratio 7.18%
Compression time: cpu 3093.48 sec/real 17804.47 sec = 17%. Speed 3.62 mB/s
All OK

Creating archive: VC4.Bin.001 using rep+srep:m3f:l512:m512+lolz:dtb1:d32:mtt1:mt6:mc1023+diskspan:4360mb:4430mb
Compressed 26 files, 64,248,717,854 => 4,571,791,380 bytes. Ratio 7.12%
Compression time: cpu 2337.45 sec/real 17213.25 sec = 14%. Speed 3.73 mB/s
All OK
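For anyone who wants to compare the two runs above side by side, here is a small Python sketch that recomputes ratio and speed from the byte counts and real times in the log. The dictionary layout and function name are made up for illustration, and note that the two runs are not strictly comparable: the file count and srep options differ slightly between them.

```python
# Numbers copied from the two log entries above.
runs = {
    "mt10": {"in_bytes": 64_396_399_216, "out_bytes": 4_624_220_180, "real_sec": 17804.47},
    "mt6": {"in_bytes": 64_248_717_854, "out_bytes": 4_571_791_380, "real_sec": 17213.25},
}


def summarize(run):
    """Return (ratio in percent, speed in decimal MB/s), both rounded to 2 places."""
    ratio_pct = 100 * run["out_bytes"] / run["in_bytes"]
    speed_mb_s = run["in_bytes"] / run["real_sec"] / 1_000_000
    return round(ratio_pct, 2), round(speed_mb_s, 2)


for name, run in runs.items():
    ratio_pct, speed_mb_s = summarize(run)
    print(f"{name}: ratio {ratio_pct}%, speed {speed_mb_s} MB/s")
```

The recomputed figures match the log, and the -mt6 run finished about 3% sooner in real time despite using fewer threads.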
#7
Yep, that's what I was talking about. Although you were still able to utilize all physical cores here. I wonder how it would be with -mt5, because perhaps HT-enabled CPUs don't need to be set to fewer than n physical cores.
BTW Simorq, you are among the most fantastic and most helpful people I know on this (and other) forum(s). Thank you for all your dedication, and personally for your last help with the cls setup.
#8
I ran some trials with lolz and got some results.
I want to ask you: in my trials I use tt12. What is your opinion on that? I get an especially good ratio with this method:
lolz:dtb1:d128m:mtt1:mt4:mc1023:tt12:fba0
I await your opinion, @elit.
#9
What are the optimal and standard settings for multithreaded lolz?
#10
Back when I tested tt1 vs tt4, the difference was within 4% or less, but speed decreased 2-3x, which for the already slow lolz is a killer. This setting is similar to -mc in lzma, which many here also love to push to unbelievable levels (1000+); it also causes a significant slowdown (and had a similarly negligible effect on compression ratio in my past tests).
Honestly, I think using such levels in both cases makes no sense. I have yet to see their benefit vs cost, and I dare anyone to prove me wrong with a specific, well-documented example. Until then, I stick with tt1 and mc32.
There is no standard, and the most optimal one is mine ;p
mtt1 mc8-32 tt1 dto0 dm00