XXHighEnd

Ultimate Audio Playback => Sound Quality OS Related => Topic started by: PeterSt on May 23, 2012, 11:58:39 am



Title: 01 | A Guide to Glitchless Playback
Post by: PeterSt on May 23, 2012, 11:58:39 am
Introduction

This is a topic about how to play without glitches, ticks or anything else via XXHighEnd. This comes with the explicit notice that XXHighEnd sets the bar infinitely higher on the technical side of things than any other software player, and that bar is inherently there and THUS puts a lot more requirements on your system - and on you yourself, for understanding.

If you don't want to understand any of it, but want some glitchless sound first, see near the bottom under "Summarized - The most lean settings".

Technical background

Because we are not joking here, just as we are not joking about our hobby (is it still only that ?), it is a good idea to have at least some knowledge of what's actually happening when we are playing a track - which in the end is a file on disk.

To always keep in mind : XXHighEnd is a so called "memory player", which means that everything is played directly from memory. This is per se not at all the same as increasing "buffers" or whatever others may call it, so don't think it is the same thing.

Read the file

So, a track (or file in computer terms) has to be read from disk. For a red book (= CD) track, this will be about 10MB (megabyte = million bytes) per minute of play time. Thus, a 6 minute track will be around 60MB, and a 70 minute complete CD around 700MB.
This assumes the file is uncompressed. Thus, the bytes have been put to disk (from ripping or download) 1:1. However, if the file is compressed, like the FLAC format, the size is close to half of the original. Thus, now our 6 minute track has a size of around 30MB. This means that a compressed file is read faster from disk. This is logical, because it is smaller. There is less to read.
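The "about 10MB per minute" figure can be checked with a bit of arithmetic from the Red Book format itself (44100 samples per second, 2 channels, 16 bits per sample); a small Python sketch:

```python
# Red Book (CD) audio parameters
SAMPLE_RATE = 44100       # samples per second
CHANNELS = 2              # stereo
BYTES_PER_SAMPLE = 2      # 16 bits

bytes_per_second = SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE  # 176,400 bytes/s
mb_per_minute = bytes_per_second * 60 / 1_000_000             # ~10.6 MB per minute

track_minutes = 6
track_mb = mb_per_minute * track_minutes                      # ~63.5 MB for 6 minutes

print(f"{mb_per_minute:.1f} MB/min, {track_mb:.1f} MB for {track_minutes} minutes")
```

So the "around 60MB" for a 6 minute track is a rounded figure; the exact rate is a bit above 10MB per minute.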

The reading of a file can be done virtually at the same time as other things the PC is doing. For example, it can do this while playback is going on. But this is under "conditions" you will learn about later in this topic. So, if these conditions aren't met, glitches during playback may occur. For now though, it is important to understand that the shorter it takes to read the file, the less chance there is that it will interrupt playback. Everything in this topic is about this principle !

Block Size

The music files are stored on disk in physical elements we call "blocks" or "clusters". This block/cluster size can be set when the disk is formatted. Now :
When we write to the disk or read from it, this goes per block. Thus, when a file of 100MB has to be read in full and the block size would be 1KB, 100,000 blocks will have to be read. Similarly, when the block size would be 512 bytes, 200,000 blocks need to be read, and when the block size is 64KB (the maximum for the NTFS file system), 64 times fewer blocks need to be read compared to the 1KB block size.
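The block counts above can be verified directly; the sketch below treats 100MB as 100,000 KB, matching the round numbers in the text:

```python
# A 100MB file, counted as 100,000 KB to match the round numbers above
file_size = 100_000 * 1024  # bytes

for block_size in (512, 1024, 64 * 1024):
    # ceiling division: a partially filled last block still costs a full read
    blocks = -(-file_size // block_size)
    print(f"{block_size:>6}-byte blocks: {blocks:,} reads")
```

This prints 200,000 reads for 512-byte blocks, 100,000 for 1KB blocks, and only 1,563 for 64KB blocks.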

It is important to recognize that our music files will always be larger than the largest block size, so from that angle we always have to choose the largest block size.
Also notice that when e.g. 1000 blocks of 512 bytes need to be read, the time needed to read 1000 blocks of 64KB will be about as long (and otherwise the difference is negligible). It is the number of blocks that determines the time, not their size.

From the above it follows that in all cases the block size should be set to the largest value possible : 64KB for the NTFS file system. When this is not explicitly set at formatting, a much smaller default will apply (typically 4KB, or even 512 bytes), and reading the music files will be 16 to 128 times slower.

Warning : Do not make the mistake of formatting all your disks with the largest block size possible. For example, the Operating System's disk contains many small files of a few bytes up to a few hundred bytes only. If *that* disk were formatted with a 64KB block size, each of these small files would occupy 64KB on disk. So, with tens of thousands of such files, everything would take enormously more space on disk.

Preconvert the file

The file format internally "played" is WAV (or "PCM" data, never mind). This means that any file format which is not WAV will be converted to WAV first. Remember, XXHighEnd is a pure memory player, and this is one of the phenomena that denotes it; nothing is done in real time during playback - everything is done in advance of playback.

So, our FLAC file needs to be turned into a WAV file. This is done during the reading of the FLAC file. Thus, the FLAC file is read in bits and pieces, and each piece is converted to WAV. At the same time the piece is written to disk again, so it will be there in the WAV format.
The conversion to WAV itself is a cpu bound process. And keep in mind : the playing of a file is a 100% cpu bound process. This means that playing doesn't use anything other than the cpu (central processing unit). This also means that if something else needs the cpu, playback must stop. The cpu can do one thing at a time only ... But, things are not *that* strict, as we will learn later.
For now it is important to understand that while the FLAC to WAV conversion is a cpu bound process, it doesn't hog the cpu 100% of the time. How ? Well, because the bits and pieces which must be read from disk interrupt that process. Thus, we read a piece (which doesn't use the cpu), convert that piece (uses the cpu), read a piece (doesn't use the cpu), and so on and on. And of course, while a piece is being read from disk, something else can use the cpu.
Notice that this "switching" is about pieces of time measured in microseconds.
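The read-a-piece / convert-a-piece interleaving can be sketched like this. The chunk size and the `decode_chunk` stand-in are invented for illustration; the real FLAC decode is of course more involved:

```python
CHUNK = 64 * 1024  # read the source in 64KB pieces (illustrative size)

def decode_chunk(data: bytes) -> bytes:
    # Stand-in for the real FLAC-to-WAV decode step (the cpu bound part);
    # here it just passes the bytes through unchanged.
    return data

def convert(src_path: str, dst_path: str) -> None:
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while piece := src.read(CHUNK):      # I/O: the cpu is free while the disk works
            dst.write(decode_chunk(piece))   # cpu bound: convert this piece
```

While a piece is being read (I/O), the cpu is available for other processes; while a piece is being decoded, the cpu is busy. That alternation is the "switching" described above.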

Using the preconverted file

After our file has been converted and put back to disk in the wanted WAV format, it needs to go into memory. So, like we talked about in the beginning, the WAV file has to be read from disk, and what was written about that applies again. Or ...
Actually it does not; our just written WAV file (from the conversion process) was written from memory, because everything goes via memory. Now, the OS (Operating System, e.g. Windows 7) still has the file in memory, assuming the memory wasn't needed for something else in the meantime. This is called a "cache" (pron. cash). And, if we need the file again (which we do), it can be read from memory. This is infinitely faster than reading the file from disk, so it's a good thing.

But is it ?
Like the description of the conversion of the FLAC file, and how this goes in bits and pieces, the same applies to a file which is read from disk without conversion; bits and pieces are read, and again during the reading (the I/O (Input/Output)) no cpu is needed. But what if the file is read from the cache ?
Cache is just internal memory, and what will happen is that the file is copied from the one piece of memory to the other. And the only thing which can do this is the cpu. So, now the cpu is needed for 100% of the time the file takes copying (which was our "reading"). Indeed, this takes enormously less time than when the file is read from disk, but now the cpu will be needed all of that time.

It is again important to understand that it works like this, with in the background your knowledge about playback also needing 100% of the cpu.

But we have more cpu's - ehh, cores of them

Yes, luckily we have. So, assuming a cpu with two individual cores (which can be seen as two (almost) independent cpu's), two processes can use "the cpu" for 100% of the time, and both processes will keep on running (and at the same speed as if they were running alone). This means, for example, that one process may be performing a FLAC conversion and all (see above), while another process may be playing the file. They won't hinder each other, and the conversion not hindering the playback is the important thing.

Is it ?
Yes, for glitchless playback it is. But what if the playback would hinder the conversion ? Aha ... then we'd have a stall in sound after all, just because the file couldn't be read/converted in time ...

And again, keep in mind these principles, because all is about this.

After we have read the file ...

When the file has been converted and read into memory for playback, the nastiest things are only beginning !
Our 6 minute track, comprising 60MB, is fully in memory now, and quite some processing (= 100% cpu usage !) is needed to make the file playable according to several settings, like the DAC it has to play on, the digital attenuation, the sample rate conversions implied (by the DAC and settings), and other technical stuff to let you hear music instead of rubbish.
This means that the file -in the meantime undergoing these again "conversions"- is copied a few times from the one memory place to the other. And, because no interruptions are inherently there (there's no I/O which may interrupt the cpu usage) this means 100% cpu usage for a (relatively) long time !
Now think of it ... Because this is not the first track we play (but a second) it means that another track is already playing, also using (near, see later) 100% cpu. This is bound to conflict with glitchless playback, unless we can dedicate one core to music playback so the in-memory processes of preparing the second track won't interrupt playback.

Playback itself

Again, since XXHighEnd plays everything from memory, playback itself is a pure in-memory process, and again no inherent interruptions can be there. Thus, playback means 100% usage of the cpu. Well, theoretically - and luckily not in practice. In practice we have to deal with the speed at which the DAC "eats" the samples from the file. And, as long as this goes slower than the time needed to provide those same samples, it implies that we regularly have to wait until the DAC has eaten the samples we provided before (which are just the bytes from the file, although now in memory and in the format and all we need to play it).

Providing the DAC with samples goes via a "buffer". A buffer is a small piece of memory which can contain a number of our samples, and while XXHighEnd fills that buffer, the driver (which is a small program dedicated to the hardware = DAC) eats the buffer and provides the read samples to the DAC. Notice that for normal redbook CD data this goes at 44100 samples per second. So, suppose our buffer could contain samples for 1 second of play time; then once per second the buffer needs a refill, or otherwise we'd have a small stall (glitch) in sound.
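How often a buffer needs a refill at the redbook rate is simple arithmetic; the buffer sizes below are illustrative, not specific XXHighEnd values:

```python
SAMPLE_RATE = 44100  # redbook samples per second

# Illustrative buffer sizes, in samples
for buffer_samples in (44100, 4096, 256):
    interval_ms = buffer_samples / SAMPLE_RATE * 1000   # play time one buffer lasts
    refills_per_s = SAMPLE_RATE / buffer_samples        # how often it must be refilled
    print(f"{buffer_samples:>6} samples: lasts {interval_ms:7.1f} ms, "
          f"{refills_per_s:7.1f} refills/second")
```

A 1 second (44100 sample) buffer needs one refill per second; a 256 sample buffer needs one roughly every 5.8 ms, over 170 times per second.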

Buffers

Generally, to our eyes two buffers exist in this process; One is the one we just talked about and which is filled by XXHighEnd, and the other is within the driver itself and which it uses to provide the DAC with data. This latter is what we call the "Device Buffer", and often its size can be set from within a driver settings panel of some sort. In XXHighEnd there's also a setting for it, but it doesn't "set" that buffer size as such. It is only needed information for XXHighEnd so it can utilize the communication with the DAC at best, and it should contain the value the actual Device Buffer is set to from within the driver settings panel.
The other side - the buffer which is filled by XXHighEnd - can -very generally spoken- be controlled by the Q1 quality parameter. Please be careful though, because the "very generally spoken" is indeed not more than that. It only really applies with certain settings, and this will be Kernel Streaming Special Mode. Only then does Q1 directly influence the output buffer of XXHighEnd (and the size in samples shows up in the X3PB log file). In the other situations you won't know, but it is always important to know the buffer is (always) there, and how it influences glitchless playback. On that matter -for each of the buffers- think like this :

The longer the buffer (again, this counts for the output buffer of XXHighEnd as well as the Device Buffer), the longer at a stretch the cpu is needed, and the more chance there is that playback stalls. But now it gets kind of complicated, because "playback" as such is nothing more than filling that output buffer. Thus, playback only stalls (glitches) when we are too late, just because the buffer ran empty (the driver ate it empty). Yes, this is true, but we must not forget that there's also the Device Buffer, and that has to be provided to the DAC by similar means. The only difference is that now the driver program is doing it. Now think ...
If we -in XXHighEnd- consume such a long stretch of cpu time in order to fill our buffer that the Device Buffer can't be emptied towards the DAC, because both need the cpu (*and* both processes are 100% cpu intensive), no samples are provided to the DAC and sound stalls again. Oh boy.
The other way around can happen just the same : the Device Buffer is being emptied towards the DAC and needs all the cpu for that, and in the meantime XXHighEnd doesn't get the necessary cpu "cycles" to fill its buffer (fully), and when the driver next reads from that buffer there are not enough samples there to keep the flow going towards the DAC.

Of course, we have two cpu cores, so some processes can happen at the same time, but by now we can count well over two processes needing 100% of the cpu, and things start to get "too much".

Without solving it all, again it is important to see the general principle : the larger a buffer, the longer the cpu will need to process it. But the shorter the buffer, the more overhead there will be to start and stop the (very small) task. And it is easy to see : when we set the buffer(s) to a very small size, the cpu usage increases drastically. And this does not mean (at all) that we will be more late with things. The contrary ...

Interrupts and task switching

While everybody tends to think "when my cpu usage is low, I must have glitchless playback" - and we have seen that cpu usage is totally unrelated, because it is merely about who (which process) is allowed to use the cpu, which really can only be done one at a time (per cpu core) - there is also another mechanism, even more important : interrupting. And notice, this does not consume cpu, so you can't see it.

When, for example, the disk is ready reading the portion of data it was asked for, it will attempt to interrupt the cpu to tell it it's ready. The data will be waiting in (again) a buffer, and when the cpu is ready with whatever it was doing, it will activate the process to read the disk buffer and deal with it further.

So, a finished process (which can be an I/O bound process, but which can also be another cpu bound process) interrupts the cpu. And since many processes are there (not only from our XXHighEnd program !), many interrupts will be there. This can also be two I/O bound processes which read from the same or different disks, and we must assume that the cpu is always too slow at dealing with the subsequent cpu bound work to prevent a queue of processes from emerging. Now, which of the waiting processes is dealt with first is determined by the priority the process has. The higher the priority, the more up front in the queue the waiting process will be. But also : the more "higher priority" processes there are, the less the "lower priority" processes will (ever) get their turn. So, here we have the perfect "opportunity" to stall sound forever, if only we "divide" the priorities wrongly.
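The priority queue behaviour can be sketched with a toy scheduler. The process names and priority numbers here are made up for illustration (lower number = higher priority); the point is only that the low priority task runs last, and under a steady stream of higher priority work it may never get its turn:

```python
import heapq

ready_queue = []  # (priority, arrival_order, name); heapq pops the smallest tuple first
arrival = 0

def submit(priority: int, name: str) -> None:
    global arrival
    heapq.heappush(ready_queue, (priority, arrival, name))
    arrival += 1

# A mix of waiting processes (illustrative names and priorities)
submit(2, "driver: feed DAC")
submit(1, "playback: fill buffer")
submit(5, "background: draw wallpaper")
submit(1, "playback: fill buffer")

ran = []
while ready_queue:
    _, _, name = heapq.heappop(ready_queue)
    ran.append(name)

print(ran)  # the priority-5 task only gets its turn when nothing higher is waiting
```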

Once a cpu core switches from one task to another, sadly it has to copy the (piece of) memory the new process wants over the memory the previous process used. This is the "second level cache" of the cpu itself, and the cpu can only work with *that* memory. So, this 2nd level cache memory is copied from normal memory, and it is *that* process which consumes many of the cpu cycles we see, while only after that does the process itself need the normal cpu cycles for its own work. Now, it is said that the larger the 2nd level cpu cache, the faster it works, but this is only true if it doesn't need task switching that often; the task switching itself is pure overhead ...

If we now think back on the choice between larger and smaller buffer sizes, and when we understand that a smaller buffer size incurs more "reactivation" of the cpu to deal with it, we have more overhead because of the more frequent task switching. Now, assuming that the cpu could deal with a 1 second buffer (1 second of playback, which may mean 1ms to deal with it) in one go, the overhead needed to switch to that task occurs once per second. But if we make the buffer 1ms (1000 times smaller), where within 1ms it could deal with the buffer before, now the overhead occurs 1000 times. Also, the larger the 2nd level cache, the larger the overhead. Again, it is the overhead of copying into the 2nd level cache which consumes "all" the cpu cycles, and it is this which creates the high cpu usage when small buffer sizes are used. It is not the actual processing causing that.
Roughly think like this : when 2MB of memory has to be copied into the 2nd level cache, this means dealing with 2 million bytes of data. But when the buffer size is set to 1ms, a buffer has to be filled with less than 200 bytes of data ... The overhead is a factor of 10000 ...
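That factor can be checked: roughly 1ms of redbook audio is only 44 samples x 2 channels x 2 bytes, while the 2MB cache refill moves 2 million bytes. A quick sketch (the 2MB cache size is the example figure used above):

```python
SAMPLE_RATE = 44100
CHANNELS = 2
BYTES_PER_SAMPLE = 2

cache_bytes = 2 * 1000 * 1000  # the 2MB second-level-cache example from the text

# Bytes of audio data in (roughly) 1ms of redbook playback
buffer_1ms_bytes = (SAMPLE_RATE // 1000) * CHANNELS * BYTES_PER_SAMPLE  # 44 * 4 = 176

overhead_factor = cache_bytes / buffer_1ms_bytes
print(f"1ms buffer holds ~{buffer_1ms_bytes} bytes; "
      f"cache copy is a factor ~{overhead_factor:,.0f} larger")
```

So "a factor 10000" is a round figure; with these numbers the exact ratio is about 11,400.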

A last thing to know (but we can hardly manipulate it to our benefit) is that the cpu always allows one process a certain amount of time only. For example, suppose that our in-memory preparation of the file takes 1 second (which will be a normal real life figure); it will have switched to other processes several times in the meantime. This is because otherwise other important tasks might wait too long, so it is better to deal with all waiting tasks (processes) at "near the same time", also knowing that the cpu can never know how long one process will take (it could be hours !). Thus, the overhead of task switching is always there, but the longer the (subsequently cpu bound) process needs the cpu, the smaller the relative overhead is. Only when we deliberately squeeze the process into smaller pieces (like we can enforce with the buffer story) do we encourage more overhead than actually needed.

The speed of your PC ...

It will be clear ... The more speedy the PC is at these several subjects, the less of a problem all of it will be. But of course it is about the other way around : when the PC is not so speedy at all, what to do to let it all work anyway. But always keep in mind :

When the connection to the disk subsystem is too slow (like 10Mb Ethernet would be), the data can't be transferred fast enough to the PC's memory, and sound will stall because of that.

When the cpu is so slow that it can't process the in memory files to what is needed for playback within the time frame anticipated upon in the program, sound will stall.

Whatever *you* do to encourage overhead, will make things worse.


All clear ?
Ok ok, more or less. But let's now hop over to the practical side of things anyway.


The influencing XXHighEnd parameters

Notice : All of the below is explained for the elementary working only, and does not describe the combination of influences. Hopefully you can reason out the combinations which really would be useful, because too many combinations exist to ever explain them all.


Split File size

Possibly the most important one, because it implies so much;
The larger the number (which denotes MB of some virtual sort of gross memory used, and of which the net used size will be much larger), the larger the file portions read from disk into memory. Notice that the net memory used for the track can never be more than 2GB (2000MB), which comes down to this : stuffing the PC with more than 3GB for XXHighEnd purposes only is useless.

  • The larger the number, the longer it takes to read from disk. However, this only counts for the file in (the eventual) .WAV form. This means : a FLAC file to be converted will always be converted in full, no matter the Split File size.
  • Similarly, the larger the number, the more pronounced a slow(er) connection will be.
  • The larger the number, the more must be processed in-memory per one subsequent go.
  • The smaller the number, the smaller the influence on SQ will be for the better.
  • The larger the number, the higher the latency will be on stuff like Volume Change (which again has to be applied on the whole piece of track in memory).


Processor Core Appointment Schemes

5 different Schemes exist, amongst which "nothing". Each has its special dedication of spreading tasks over cpu cores, under the notice that it's quite useless to explain exactly how. Originally the Schemes were made because they all sound different, but lately they are merely used to allow for ultra low latency playback (which is about Kernel Streaming (Engine#4) and the Adaptive Mode and Special Mode settings), which by itself has a greater impact on SQ than a specific Scheme by itself; now a specific Scheme just *allows* for the ultra low latency playback.
Please keep in mind that setting the Processor Core Schemes is only possible with an active license - which has never been different - while the "necessity" for them for ultra low latency playback only emerged at the beginning of 2010.

Of course, the faster the PC is for its cpu, the less the Schemes are necessary for the reason of ultra low latency playback.

When you have problems (glitches), it is best to look at the cpu graph in Task Manager (with the separate cores made visible), so that you can see what is happening under which Scheme, keeping the outline in this topic in mind of course.

Priority settings

The Thread Priority is the priority given to the playback process itself.
The Player Priority is the priority given to everything else in the PC, except for those processes which won't allow changing the priority.

It is important to know that with Attended Playback each "track load into memory", starting at possibly needed conversions, is dealt with by XXHighEnd on a track per track basis. This means that these processes are subject to the Player Priority, and while playback is going on -the Thread Priority always being higher when all is set right- the playback process may stall the load processes. This is exactly why Attended Playback always gives more trouble for continuous playback than Unattended Playback.
So, with Unattended Playback all preconversions are done in advance (or during the playing of the first track, depending on the "Start Playback during conversion" parameter), and when the next track is about to play, only the ready .WAV file needs to be read into memory.

The only process which always adopts the Thread Priority is the playback process, which is not equal to "everything Engine3.exe is doing". For example, the in-memory processing and copying of the file to prepare it for playback has a dedicated "hard coded" priority, independent of your settings. It *is* subject to your set Processor Core Appointment Scheme though !

Start Playback during Conversion

Notice that this only applies to Unattended Playback.

By itself this indeed causes Playback to commence after the first track is fully ready (converted etc.). However, a few things are important to know here :

  • Once playback has started, it will eat so many cpu cycles, that conversions will slow down significantly.
  • The conversions are NOT subject to the Processor Core Appointment Schemes, so a conversion may interrupt playback with a slower cpu PC. However, the conversions *are* subject to the Priority settings.
  • Because playback will eat most of the available cpu cycles, converting (etc.) a complete album may take "minutes". This, while it could take a few seconds only. This means that all the time SQ will be influenced by the conversion activities. IOW, it may be better to wait those few seconds longer, knowing that from then on SQ will be optimal.
  • When this is set to Yes (IOW, wait), the opposite of what intuition tells you may happen : once Playback starts it can be subject to glitches;
    This is because the heavy I/O and processing took a longer time and hence required more resources from the PC, and now when Playback starts those resources are freed explicitly. This is an OS activity though, which can only be overcome by "just waiting" another 10 seconds or so (which btw will be "minutes" on a laptop). In other words, this just can't be dealt with programmatically, and if it happens, better start Playback immediately after all, or try to find other settings which don't require so much from the PC (low latency playback being the first to investigate).

Unattended Playback

Although this is merely meant for the best SQ possible, it is also a means for much easier glitchless playback. One of the reasons has already been pointed out above (preprocessing of all tracks is done in advance), but generally speaking everything is more "lean". This is similar to the previous subject and the resources needed and needing to be freed : they just are not needed anymore. It's only the plain .WAV file which needs to go into memory, and compared to e.g. a FLAC conversion, this is maybe 1000 times more lean.

Also the fact that XXHighEnd is just not running (which it is with Attended Playback) does matter. And this matters a LOT, because it is all related to how the OS (Vista and up) deals with the screen and how it's refreshed. This by itself is related to the shutting down of services (see below), which does NOT happen with Attended Playback. And, when those services are running, changing one small area of the screen (like the cursor of the running time) costs 10000 times more cpu power compared to shut down services (this has been explicitly measured).

Stop Services

Always do this ! But notice it only works together with Unattended Playback. It stops many, many interrupts from the screen, and besides that much processing (of drawing nice shadows and stuff).

Running Time OSD

As long as this is presented as a digital counter (instead of a progress bar, which may be there in the future), better don't use it;
It was developed in the Engine#3 heydays, and wasn't audible. Well, that's what we thought. With today's ultra low latency playback means though, the maintaining of the Running Time expresses itself as small ticks. Even if the ticks aren't audible, it will influence things for the worse.

The OSD Text is harmless.

Wallpaper

Although perhaps not logical, showing the Coverart as Wallpaper is one of THE most likely causes of ticks or glitches;
Normally you won't be bothered by it, but with a too slow PC it really can be a problem.
First there's the internal (OS) processing, which is very awkward in this area; it wants a very high priority or something, which is out of our control. Also notice that this involves (writing to) the Registry.
Next there's the relatively large processing involved in forming the "pixels" for the screen and writing them to a file; although this is a low priority process, it just takes many cpu cycles, and it resides in a (time) area where other processes need the cpu (so things may start to run behind because of it).
The major culprit though is the "bang" with which the OS bombards the Wallpaper onto the screen, which can go with a tick itself (like the OSD Running Time), but which can also cause a little glitch, because filling the buffer stalls briefly when the Wallpaper is drawn at the wrong moment. Also, that moment is not under our control; we tell the OS to do it, and it will do it when it thinks the time is right.

To keep in mind : when the Volume is changed, the Wallpaper is redrawn (to reflect the newly set volume). The small tick you will almost guaranteed perceive here will, however, be from the internal processing which can't be done otherwise for Engine#4 (Kernel Streaming). If you want to check whether it's the Wallpaper tick you hear, or this unavoidable internal processing, try it with Engine#3 (WASAPI). There the volume change should bring NO ticks.

Please notice : This little subject is not here to improve SQ by itself, but is only there to give you a theoretical reason of "ticks" you can't solve otherwise. So, it is easy to try whether it is the Wallpaper causing your ticks : just shut it off and retry.

Show Mirror Back (Wallpaper)

In addition to the Wallpaper culprits just described, it is good to notice that merely showing the back of the Coverart already makes the area to process and write to the screen twice as large. But when the mirrored back is in order, things are extra heavy, because this "picture" emerges from a pixel by pixel memory copy of the original. Besides that, it retrieves the track names from the album playing, *and* it retrieves where playback momentarily is.
All very good reasons for your system to give a glitch when it is not 100% "on par".

Summarized - The most lean settings

Ok, just to have a beginning somewhere, and your system does *not* look much on par. What could you do to increase your chances of glitchless sound ?
Notice this is just a beginning, and from there on you could make things more "difficult" for your system.
This assumes Attended Playback, under the notice that it takes a proper install of AutoHotkey to let Unattended Playback work. Notice that Attended Playback is prone to "all stops" when a next track is about to play.

  • Use Engine#3, hence any sound device *not* prefixed with "KS:".
  • Set the Device Buffer size in the driver panel for the device concerned as high as possible (if you can change it).
  • Set Q1 to 14, and leave Q2/3/4/5 at 0.
  • Use Core Appointment Scheme 3 (if you use an Activated version; otherwise this is grayed out).
  • Set Player Prio to Below Normal and Thread Prio to High.
  • Switch Off all Wallpaper settings.
  • Set Copy to XX Drive by standard to Off.
  • Do not imply any Upsampling (Input and Output Sample Rate should be the same)
  • Set Normalize Volume to Off.
  • Set Decode HDCD to Off.
  • Set Split File size to 50.
  • Don't start pushing buttons when the previous push has not been processed yet (watch the yellow lights in the buttons as well as those on top of the screen).
  • Preferably use .WAV files. FLAC/AIFF is okay too, but *is* less lean.
  • Don't move your mouse over the screen during playback, especially not over areas which may show a ToolTip.

If all (now) works, try to find out what makes things "stop" working. But, if you were heading for Engine#4 in the first place, better first select a "KS:" prefixed device, and try again before changing anything else. But, use KS Normal Mode, or otherwise you'd have to dive into appropriate Device Buffer settings in relation to Q1 settings. On that matter -before hopping over to Engine#4- maybe better vary Q1 from the 14 at first, to maybe -1 and +4. You should perceive a difference here.

It may be a good idea to start TaskManager, and watch the cpu behaviour in the cpu graph. Change the Split File size and watch the differences (throughout a track). Apply 1x Arc Prediction Upsampling, and look again. Change the Core Appointment Scheme and look. Change the Device Buffer Size and look.
Each spike you see is prone to cause problems. Only that.

If you hear problems but don't see them on the cpu graph, watch the disk light and see whether it's related.

In the end, try to understand. Not that you *have* to in order to play something, but the better you understand the better your SQ *will* be.
Remember these words ...