NWN2 and Dual Cores

Diamond · November 6, 2006

Ok then context switching jumping from one core to the next shouldn't be whats causing dual-cores to run slower then single-cores (as all games would then behave that way).
You seem to be in the know of how lowlevel stuff works under windows. You don't happen to have educated guess as to what could cause such behavior?

(that is if the lower performance even is because of single/dual core thingy...)

<{POST_SNAPBACK}>

Context switching is actually switching from one thread to another on one core, and not necessarily from the same process. So context switching occurs with all CPUs, not just dual core. E.g. your running thread might be pre-empted (stopped temporarily) and, say, some other process' thread is scheduled to run for a certain time until OS decides it is time for someone else to use the CPU (that is how multi-threading and multi-tasking works).

I have no idea why context-switching might be a problem though. Taks was referring to the situation where a thread is pre-empted and scheduled on the other core after that, but I am not exactly sure why that would be a problem, unless he meant something else. All I know now is that the performance problem manifests itself on some specific configurations, and is not connected to whether the CPU is dual core or not.

Edited November 6, 2006 by Diamond

Loof · November 6, 2006

Yes I know what context switching is. Taks was talking about a scenario where you had dual cores without shared caches I think. So a switch from one core to another would either result in alot of cache misses or having to copy the cache from one to the other. At least thats what I think he was talking about.

What I asked you if you had a guess as to what could cause a single threaded process to perform noticeably slower on a dual-core then a single-core CPU.

Note: with single threaded I'm refering to a process that has only one thread where any heavy computations take place... Just so you don't missunderstand me again

Edited November 6, 2006 by Loof

Diamond · November 6, 2006

Yes I know what context switching is.

<{POST_SNAPBACK}>

Ah, OK. Sorry, I misread your reply. :thumbsup:

Taks was talking about a scenario where you had dual cores without shared caches I think. So a switch from one core to another would either result in alot of cache misses or having to copy the cache from one to the other. At least thats what I think he was talking about.

<{POST_SNAPBACK}>

This would make sense, though Taks was referring to Linux; Windows thread scheduler tries to schedule the thread on the same processor/core.

What I asked you if you had a guess as to what could cause a single threaded process to perform noticeably slower on a dual-core then a single-core CPU.

<{POST_SNAPBACK}>

No idea :lol:

Diamond · November 6, 2006

A dev post indicating that NWN2 isn't multi-threaded (it uses only one core). Some dual-core CPUs are not the performance issue.

Yes, NWN2 only uses one core. If we wanted to take complete advantage of multi core systems we would need to architect for it from the ground up or do a refactoring of some of the core engine components, which we didn't do on this project because we licensed the technology from Bioware. Typically, the second core is used for physics simulations, which we don't have in NWN2.

The good news is that the game isn't CPU bound for most systems. The largest bottleneck is on the GPU or your video card, and in most situations we are pixel bound. So no matter how fast the CPU is chugging along, it's always going to be waiting for the GPU to render the frame.

We've made it pretty clear during the project that the toolset is the only part of NWN2 that takes advantage of multi core systems, since we wrote the toolset from the ground up. However, we are looking for simple ways of incorporating support for dual core systems in the game, so users with super machines can feel happy that their computer is being used to it's fullest potential. Don't get your hopes too high though, but it is something that is being considered for the future. :D

Thanks,

-Brennecke

Ummi · November 6, 2006

Hmmm...that's funny, NWN2 doesn't run "like a dream" at high settings for me. I get pretty good FPS indoors(50-60) and as soon as I step outside my FPS drops to 20 and lower. I stated my system specs over in the First Impressions thread but I'll list them again -- Intel C2D 6400, 2gb RAM, 7900GT 512mb, Soundblaster X-Fi, WD Raptor 74gb. I can run Oblivion just fine at high settings but not NWN2.

<{POST_SNAPBACK}>

That's why i said high settings not maxed. Many modern games seem capable of graphics beyond even the best hardware so some tweaking is usually necessary. Fiddle with the real heavy stuff like lighting, reflections, shadows and draw distance and you should be fine probably >_<

Edited November 6, 2006 by Ummi

$@\NightandtheShape/@$ · November 6, 2006

I'm running on a dual core and have yet to encounter any problems 1280x1024, everything is ON.

Diamond · November 6, 2006

What's the configuration?

$@\NightandtheShape/@$ · November 6, 2006

AMD Athlon

Diamond · November 6, 2006

GeForce 7950 GX2 (1GB VRAM)

<{POST_SNAPBACK}>

Oh, no wonder it is running smooth on everything maxed out. I wonder what will my 7600GT pull off.

$@\NightandtheShape/@$ · November 6, 2006

GeForce 7950 GX2 (1GB VRAM)

<{POST_SNAPBACK}>

Oh, no wonder it is running smooth on everything maxed out. I wonder what will my 7600GT pull off.

<{POST_SNAPBACK}>

Same but a little less on the shadows, friend of mine picked up that exact card yesturday so he could play NWN2... So it'll look nice, but not AS nice.

metadigital · November 8, 2006

@Kraftan:
Yeah almost all modern games are GPU limited and not CPU limited, but there really isn't any law that says it has to be that way. If someone made a game tomorrow with very advanced AI, physics and lots of procedural content it's possible/probable that such a game would be CPU bound. In other words its a design choice...

<{POST_SNAPBACK}>

Actually, a poor CPU will hamper your gameplay in Rome: Total War. :alien:

taks · November 8, 2006

Taks was referring to the situation where a thread is pre-empted and scheduled on the other core after that, but I am not exactly sure why that would be a problem, unless he meant something else. All I know now is that the performance problem manifests itself on some specific configurations, and is not connected to whether the CPU is dual core or not.

<{POST_SNAPBACK}>

in linux land, we actually refer to core switching as a context switch as well. it is a major problem because each core has its own cache/stack. you have to flush everything from one cache/stack, and reload it on the new core's. depending upon how the cache is set up, it can take a while.

taks

taks · November 8, 2006

This would make sense, though Taks was referring to Linux; Windows thread scheduler tries to schedule the thread on the same processor/core.
<{POST_SNAPBACK}>

yes. i have no idea how winders handles it. i'm also coming from a pseudo-real-time version of linux viewpoint designed for embedded processing.

in this particular case, you can force affinity to a certain processor, which can be important as all standard housekeeping processes are targeted to CPU0 by default. i'd imagine it works similar, though not exactly the same with winders. linux also allows users to have relatively low-level control if necessary, something probably absent with mr. proprietary winders.

taks

angshuman · November 9, 2006

... you have to flush everything from one cache/stack, and reload it on the new core's...

Small irrelevant technicality - you don't really have to physically flush anything from the old cache, whatever new task starts running on that cache will slowly evict your blocks out. :aiee:

taks · November 9, 2006

i think that behavior is dependent upon the chip and memory/cache management structure.

taks

angshuman · November 9, 2006

i think that behavior is dependent upon the chip and memory/cache management structure.

True, true. I was referring to typical x86 systems.

taks · November 14, 2006

they have shared L1? or only shared L2? i've got a unique L1 per core, and one fairly large, but shared, L2. if linux pulls a core switch on me, i have to move my entire stack into a region i can use.

taks

angshuman · November 14, 2006

they have shared L1? or only shared L2? i've got a unique L1 per core, and one fairly large, but shared, L2. if linux pulls a core switch on me, i have to move my entire stack into a region i can use.

taks

Athlons and Pentiums have split L1, split L2. Core 2's have split L1, shared L2. Regardless of the organization, if the OS switches the core you're executing on, it'll make sure your stack, PC etc. is saved. But neither the OS nor the app have to worry about explicitly "moving" anything over, it's all part of the virtual memory that is addressable from any core. The coherence protocol takes care of the moving around of physical blocks between the caches as and when required.

taks · November 15, 2006

But neither the OS nor the app have to worry about explicitly "moving" anything over, it's all part of the virtual memory that is addressable from any core.

i wasn't referring to physically moving anything. i was referring to the down time while things get moved, stalls, etc. i.e. these functions are handled automatically, but they still take considerable amount of time. i'm running into that right now with a benchmark i'm running that requires threads for larger sizes (an FFT). my times are all over the map because i cannot lock down affinity.

The coherence protocol takes care of the moving around of physical blocks between the caches as and when required.

<{POST_SNAPBACK}>

oh yes, i realize that.

taks

taks · November 15, 2006

Athlons and Pentiums have split L1, split L2. Core 2's have split L1, shared L2.
<{POST_SNAPBACK}>

i prefer the latter organization for parallel. split L1 is a given necessity for speed reasons. a shared L2, however, allows cores to play together in the sandbox a little better. of course, associativity semi-reserves areas of L2 anyway...

taks

angshuman · November 15, 2006

i wasn't referring to physically moving anything. i was referring to the down time while things get moved, stalls, etc. i.e. these functions are handled automatically, but they still take considerable amount of time. i'm running into that right now with a benchmark i'm running that requires threads for larger sizes (an FFT). my times are all over the map because i cannot lock down affinity.

Oh yes, absolutely, affinity is a huge issue.

i prefer the latter organization for parallel. split L1 is a given necessity for speed reasons. a shared L2, however, allows cores to play together in the sandbox a little better. of course, associativity semi-reserves areas of L2 anyway...

Ah, but a shared L2 has hidden costs. First: The primary benefit of a shared L2 is that gives you the "illusion" of an overall larger cache (compared to split private L2's) due to more efficient utilization of space. However, you need a high-bandwidth crossbar sitting between the cores and a shared L2 cache, which takes up a significant amount of area. Therefore, the total area you can allocate for the L2 is smaller. This is a very interesting tradeoff that was once very eloquently described by a colleague of a colleague as "what's better, the illusion of a larger cache, or a larger cache?"

Second, there's the issue of cycle time. Two small L2's can be clocked faster than one large L2. Third: Applications could sometimes destructively interfere with each other, so sometimes there are fairness issues. Associativity can only go so far to prevent this, although you could fix this with simple solutions like way reservation.

Despite these issues, I guess what really tips the scales in a shared L2's favor is the fact that if you have a lot of sharing, you can prevent unnecessary data replication and coherence traffic. So, for multi-threaded applications (as opposed to a multi-programmed workload), a shared L2 probably makes more sense, which is why we are seeing real implementations moving towards this direction.

metadigital · November 15, 2006

Sure, the issue (apples for apples) would be timing an application that compares, say, moving a codestream from one core to the other (from one distinct L2 cache to the other) versus using a shared L2 cache between the two cores.

Lots of fun and oppotunities for engineers to draw pretty graphs.

taks · November 15, 2006

Ah, but a shared L2 has hidden costs. First: The primary benefit of a shared L2 is that gives you the "illusion" of an overall larger cache (compared to split private L2's) due to more efficient utilization of space. However, you need a high-bandwidth crossbar sitting between the cores and a shared L2 cache, which takes up a significant amount of area. Therefore, the total area you can allocate for the L2 is smaller. This is a very interesting tradeoff that was once very eloquently described by a colleague of a colleague as "what's better, the illusion of a larger cache, or a larger cache?"

the BCM1480 has a 2 MB shared cache across 4 cores. it is connected to the L1 via a CPU_SPEED/2, 256-bit bus. high-bandwidth is an understatement. the total core, btw, only runs at 23 W @ 1 GHz.

Second, there's the issue of cycle time. Two small L2's can be clocked faster than one large L2.

hehe, not this one!

Third: Applications could sometimes destructively interfere with each other, so sometimes there are fairness issues. Associativity can only go so far to prevent this, although you could fix this with simple solutions like way reservation.

yup.

So, for multi-threaded applications (as opposed to a multi-programmed workload), a shared L2 probably makes more sense, which is why we are seeing real implementations moving towards this direction.

<{POST_SNAPBACK}>

yup again. performing an image compression routine, for example, involves operating on rows and columns of the image independently. a quad-core system can operate on 4 rows/columns at a time with ease, and very little conflict. the speedup is nearly 4:1. using all four cores to operate on a single row/column at a time, however, probably only provides 2:1 speedup and less if the sizes are small (say, less than 4k elements).

taks

NWN2 and Dual Cores

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in