Tuesday, November 24, 2015

Mugen, Mugen, Mugen

Today I want to talk about a project I've been working on for the past few weeks. The first half of this post is about Mugen in general; the second half is about my testing of Mugen.



Mugen is a 2D fighting game. The characters and stages included with the game are very bad, but that's not the point. Mugen lets people make their own characters, stages, and motifs, and set up their own music for stages and menus. A motif is a set of files that dictates the look and feel of the menus, health bars, and so forth. Many authors have ported characters over from other fighting games, like Street Fighter or King of Fighters. I'm more of a fan of the latter myself (as you can see later from my roster of many Iori edits). This freedom lets people showcase their coding skills, artistry, and creativity. Some characters are totally made up. Some are crazy edits of characters from other franchises. Those can vary from characters who have abilities from multiple games in a franchise, to a fusion of two characters, or any other wacky combo, including brand new attacks, effects, sounds, etc. With so many different possibilities, it can be hard to keep track of the characters, so I have named each character according to the character and its author.

The downside is quality control. It takes a skilled coder with attention to detail and a ton of free time to emulate the characteristics of a character from a game. Problems range from improper coding (which compounds when the character interacts with any of the thousands of other characters that are coded differently) to unfaithful combos and damage values. Because of this, even when two skilled coders try to port a character over to Mugen as faithfully as possible, the two results will inevitably play differently, if only because of the AI.

Authors might decide to leave the Mugen scene, letting their download pages die out. "Warehousing" these characters on a separate site is one solution, and a few sites are dedicated to storing tons of Mugen characters. There are two problems here: 1. No matter what, quite a few characters slip through the cracks with no download links, and if nobody is around who is willing and able to share a copy they downloaded before all the sites went down, the character is unobtainable and lost. 2. Warehouses can contain old versions of characters. If the author comes back, or was never gone, the warehoused copies are often outdated. This can be frustrating for the author, who knows that quite a few people are using an old and possibly glitchy version of their work while the fixes and updates never reached most people.

Then we have privacy. Some characters are private, and some authors do not allow others to reuse their work. Some people are against "illegal edits" of other people's work, while others prefer a free-for-all. Some characters are hard to obtain, and the people who hold them are only willing to trade for other characters instead of sharing freely. My opinion on all of this is that Mugen should be about sharing. Nobody has a commercial interest, so what's the big deal?

There are AI patches for characters with crappy AI, but once in a while I come across somebody who did more than change the AI. Without clear documentation in English, it's hard to tell what's going on. That's another thing: to read all the Mugen sites and documentation, I have to translate everything, and some of it doesn't even display properly on my computer.

Crappy edits are a dime a dozen, even though edits take a long time to make. On the other hand, beauty is in the eye of the beholder. Some people want overpowered, super-flashy characters. Some want very accurate ports of characters from a game they like. Some like hybrids or want to see something totally new. Some want Hentai Mugen. There's something for everybody, and people should be more polite in their critiques of characters they don't like.

With a roster of 150 slots and 120+ characters in it, you can imagine that it took quite a while to assemble. There are quite a few characters I don't use for one reason or another. The character might be ridiculously overpowered, or it might be too glitchy. Or maybe I have another version of the character I prefer. Or the AI is crap; you can't have a fun viewing experience if the AI is crap. Auto-regenerating health is also a no-no except in a few cases. And finally, attacks that do damage as a percentage of enemy health are a big no-no, because the author usually wrote the attack assuming sane health values. What if I don't wanna use sane health values? ...Then you start editing the code and hope you figure out how to get the character to work the way you want.

Some spent a lovely afternoon hiking. I spent it deciphering hieroglyphics.



In the picture above, you can see what my roster really looked like two months ago. Notice that the roster, while large, is not fully filled. Also notice how the portraits for the two characters are not the same size or quality. I actually had to learn how to take a picture I found on the internet, apply transparency, and alter the character's files for the portrait on the right to show up the way it did. Anyways... I spent quite a while modifying the roster screen above. I got the motif from SxVictor, who makes very artistic motifs. They all have an anime feel to them, but I found this one to be fine for my purposes. Here is how the roster screen looks unedited:

Quite a contrast, no? And below is me editing a character's portrait in GIMP. The problem is that most pictures come with a background, be it an image or a solid color. A white background means a white patch shows up around the character when the portrait is displayed, and it stands out in an ugly way. So I had to find a way to remove the background and apply transparency.



Ever since Winmugen turned into Mugen, the game has become a lot better to play or watch. Winmugen used to crash all the time and ran in a shitty low resolution. Now 720p can be done natively with many motifs, and upscaling to 1080p or higher is also an option. I mentioned "watching" the game; that's how I typically experience it. I like to watch AI vs AI matches. (2 vs 2 matches cause utter pandemonium!) So as you can imagine, I want a large variety of characters to make things interesting. And I want the characters sorted by order of strength on the roster.

This wish is far harder to grant than you might think. It certainly was for me. The roster started out sort-of in order of strength, but there were holes. When I added a new character to the roster, I did do checks to see where I should place it, but the results were never documented, and the testing methodology was flawed.

Basically what I did was take a picture of the roster and pit one character against another. Of course, winning depends on the AI along with the abilities of the character. A would be pitted against B. I set the HP in Mugen to 1000% or so, so that each side has HP so high it can never be depleted. When A builds a 30,000 HP lead over B, I consider A to be stronger than B. If that happens while I am testing A against the rest of the roster, I bring up that picture of the roster and draw a check mark on the face of character B. I work from the bottom up. It starts with all wins, then a loss here and there, then a roughly equal amount of wins and losses, and then pretty much all losses. I assume a win cancels out a loss, and look for the highest spot I could place the character given those rules. There were ties, in which case I had to decide randomly.
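To make that placement rule concrete, here is one plausible reading of it in code. This is my own sketch, not something I actually ran back then: positions run from 0 (strongest) to n-1 (weakest), and the character claims the highest spot at which the running win/loss balance, counted from the bottom up, is still non-negative.

def highest_placement(results):
    """results[pos] is the outcome against the character at roster
    position pos (0 = strongest ... n-1 = weakest): +1 win, -1 loss, 0 tie.
    Walk from the bottom of the roster up, letting each win cancel a
    loss, and return the highest spot still claimable."""
    balance = 0
    best = len(results)  # default: below everyone
    for pos in range(len(results) - 1, -1, -1):  # bottom-up
        balance += results[pos]
        if balance >= 0:
            best = pos
    return best

# Losses near the top, mixed in the middle, wins near the bottom:
print(highest_placement([-1, -1, 1, -1, -1, 1, 1]))  # -> 1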


This is an example of a roster picture used for ranking characters in the old method.

The new method is very different. First, a separate Mugen setup was created, with custom settings to facilitate faster testing. For the first half of the testing, I used the same 30k HP lead idea, but with major differences. This time, wins and losses are charted in a giant 150 x 150 Excel spreadsheet. It surprised me that I would ever seriously need vertical text alignment in my life. 150 x 150 is a HUGE spreadsheet, too large for a 1440p screen to show, but since I only test characters in one general area of the roster at a time, I only need part of the spreadsheet on screen to do my work. The spreadsheet took many hours to make, and I colored it in many colors to avoid confusion about which row I am on. A 1 means a win for the character whose row corresponds to that cell. A -1 means a loss (and conversion can be done easily via copy, paste, transpose, multiply by -1). A 0 means a draw (too close to tell) or a fatal glitch.

One common glitch is where one character punches the other endlessly because the opponent is backed into a corner. This is due to poor coding. To allow such a pairing to fight, I have two stages for testing: a normal stage, and a never-ending stage that scrolls forever on both sides, making cornering impossible. I use the latter sparingly, since the nature of the map can affect the AI. In rare cases, a character with a move that slams the opponent into the corner will keep flying, trying to reach the edge of the map but failing forever due to the infinite scrolling.
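About that copy, paste, transpose, multiply-by-minus-one trick: it works because B vs A is, by definition, the mirrored and negated version of A vs B. Here is the same idea outside of Excel, as a toy sketch in Python (a made-up 3x3 example, not my real sheet):

import numpy as np

# Upper triangle holds the pairings actually tested: +1 means the row
# character beat the column character, -1 a loss, 0 a draw or glitch.
upper = np.array([
    [0,  1,  1],
    [0,  0, -1],
    [0,  0,  0],
])

# Mirror across the diagonal and flip the sign to fill in B vs A:
full = upper - upper.T
print(full)
# [[ 0  1  1]
#  [-1  0 -1]
#  [-1  1  0]]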

Running Mugen 20 times at once causes me to encounter far more glitches than normal usage. The obvious reason is that there are 20 instances of the game, and therefore 20x the chance of problems. Running so many at a time might also glitch things out on its own, because it is a very extreme scenario. And if one of the 20 instances bugs out or dies, it is hard to spot and fix the problem early.

Above is one type of glitch caused by poor coding.

Consider the advantages of a spreadsheet over the picture method. For the picture method, a marked-up picture was needed for each character on the roster. What happens if I add a new character and the entire roster shifts around? What happens if the ratings get sloshed around by future corrections? There is not much I can do. With a spreadsheet, the fix is much easier. A spreadsheet also lets me look at more data in less time. It is, however, more prone to misreading.

A bit after the halfway point, I accidentally tested a pairing twice and found that the results were wildly different: a 40k HP variance between the two tests. Bear in mind that up to then, every test had a sample size of one. I could not think of a way to conduct more tests automatically anyways, and doing 50 fast tests is far more of a pain in the ass, and far lengthier due to manual adjudication, than doing one really long test. Still, a 40k HP variance was not acceptable.

It was crushing, because the prospect of losing over a hundred hours of work is painful. Good things came out of that problem, however. I went back to the drawing board and remembered that AI vs AI matches typically end on a best-of-3 basis. I never see that in action because during testing, HP is set to insane amounts. But what if I lowered the HP and found a way to make it best-of-500? Then I would have automatic testing between two characters, round after round. I also did some further optimization of the settings in Mugen. Each side gets 10k HP, the first side to deplete the other's HP to 0 gets a point, and the round restarts. This many-short-games method is in direct contrast to the infinitely-long-game idea I had at the start. (If I set the HP to 1000 or below, weird things start happening, because a character's super might do more damage than either side's entire HP, meaning the round is won whether the super does 1,000 or 10,000 damage. In a longer game, that difference of course matters.)

What I found out was very interesting. One of the characters I tested had a feature called "grooves", a system that allows multiple modes for a character, altering its abilities and so forth. This was the reason I got wildly different results from one simulation of the pairing to the next. When I tested A vs A, the crazy outliers disappeared. However, I eventually realized that Player 1 somehow has an advantage over Player 2. That's right: if you test A vs A, the A that starts out on the right always gets the upper hand. This was replicated over 2000+ games. To compensate, I run each pairing as a pair of simulations, A vs B and B vs A, and add up the scores to get the final score. It's handled like chess, where white has the advantage.
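In code terms, the compensation is simple bookkeeping. A sketch (the tuple format is my own invention for illustration):

def combined_score(a_vs_b, b_vs_a):
    """Combine the two runs of a pairing so the Player 1 advantage
    cancels out, like alternating colors in a chess match.
    Each argument is (first_listed_wins, second_listed_wins, draws)."""
    a_wins = a_vs_b[0] + b_vs_a[1]  # A's wins as Player 1 plus as Player 2
    b_wins = a_vs_b[1] + b_vs_a[0]
    draws = a_vs_b[2] + b_vs_a[2]
    return a_wins, b_wins, draws

# A beats B 60-40 as Player 1 but loses 45-55 as Player 2:
print(combined_score((60, 40, 0), (55, 45, 0)))  # (105, 95, 0) -> A still ahead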

The groove problem I noted with one character was not controllable; there is no way to select the groove for that character, so I cannot test it consistently. This brings up the fact that despite all the procedures I've laid out so far, there is still room for spur-of-the-moment decisions. What do I do with such a character? Or with a pairing that is glitchy, but just barely stable enough to test? Later on, I realized that the gets-opponent-stuck-in-corner-leading-to-infinite-combos glitch won't easily show up in many-short-games testing, because the round would simply be won and the next would start. With the infinitely-long-game method, by contrast, I can easily see a character punching the other in a corner endlessly.

One idea suggested by others in TCEC chat was to use a program that looks at the length of the health bars. This doesn't work because of the huge amounts of HP, which make the health bars glitch out. Mugen wasn't designed to have both sides with a million HP, so I can cut it some slack there. But without a way to grab the health values of the two sides (whether by reading the video with software or through some setting or log), automation becomes impossible. Oh yeah, you might be wondering what TCEC chat is. TCEC is a chess engine competition, and while the Mugen testing was going on, I was in the chatbox there a lot. Naturally, I brought up this enterprise there.

I decided to continue testing with the newest method, many short games and two simulations per pairing with sides alternating, but not to redo all the work I did earlier. Because I started this method halfway through, I cannot log the win-loss info into the spreadsheet: half the spreadsheet would be 1, -1, 0, with no win-loss data available, while the other half would be 50-25, 35-36, etc., and the data wouldn't conform. This makes it hard to capture the nuances of the results. If both A and B defeat everybody else, but A quickly runs its opponents down while B barely edges them out, A is the stronger one. Yet under the current system, A and B would look equal.

I implemented the newest method as follows: person 1 vs 2, 2 vs 1, 1 vs 3, 3 vs 1, 1 vs 4, 4 vs 1, and so on. And when one pairing is done, I slide both numbers one higher: person 2 vs 3, 3 vs 2 after I am done with 1 vs 2, 2 vs 1. That way, I don't have to keep track of who is fighting whom: once a pairing is done, I test the characters one spot to the right of the ones originally tested, just by looking at the roster, with no need to reference the Excel sheet to figure out what needs testing. However, once again the shitty Windows UI makes my life tougher. Due to the number of games run, some instances of Mugen will be unresponsive some of the time, even if just for a split second. This causes Windows to move that Mugen's icon to the end of the taskbar, messing up my order. There's not much I can do about that.
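Written out, each Mugen instance effectively walks one diagonal of the crosstable. A sketch of the work list under my reading of the sliding rule (indices are 1-based roster spots):

def diagonal_pairings(n, offset):
    """Work list for one Mugen instance: start at (1, 1 + offset), run the
    pairing both ways to cancel the Player 1 advantage, then slide both
    characters one spot to the right and repeat until the roster ends."""
    i, j = 1, 1 + offset
    while j <= n:
        yield (i, j)  # i as Player 1
        yield (j, i)  # sides swapped
        i, j = i + 1, j + 1

# Instance 1 on a 5-character roster:
print(list(diagonal_pairings(5, 1)))
# [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3), (4, 5), (5, 4)]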

I work on the weekends, and on those days I am not in front of the computer micromanaging Mugen, switching pairings and charting results. With the never-ending-match idea, it was tough to use that time effectively, because once too much HP has been depleted on both sides (after both have lost ~200,000 HP or more), the health bar glitches out. With the many-short-matches concept, that's not a problem; the results just keep getting tallied for me. This also helps crack down on ties. One more difference between lots of short games and one very long game: a few characters have abilities that unlock only when their health is under a certain threshold. The never-ending format never reflects this, giving them an unfair disadvantage.

One thing both approaches have in common is that the cutoff where I declare a victor is really kinda arbitrary. A 30k HP lead is quite a lot (say each attack does 100-500 damage, to simplify) when 1000 is the default amount, to be sure. But why 30k HP? Why not 35k? The same question applies to the many-short-games approach: when do I declare a victor? I could use Ordo to analyze the Likelihood of Superiority (LOS), but that's too time consuming.
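A quicker sanity check is the standard LOS approximation from the chess engine testing world, which ignores draws and assumes the games are independent; it can be computed without Ordo at all. A sketch:

import math

def los(wins, losses):
    """Likelihood of superiority: the probability that the winner's true
    strength is higher, given only decisive games (draws ignored)."""
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

# 60-40 over a hundred decisive games is already fairly convincing:
print(round(los(60, 40), 3))  # ~0.977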

Sixteen instances of Mugen at a time used to be the limit, because I was running out of VRAM even on my 980 Ti, which sports 6 GB of it. The CPU was chugging along too slowly past that figure anyways. And finally, the Aero preview of taskbar icons maxes out at 16 on a 1440p screen; any more, and everything gets listed in a terrible way: no previews, just a lot of freezing, messy applications. I got around this limitation partly with optimizations in Mugen's settings, and partly with 7+ Taskbar Tweaker, which lets me ungroup similar programs while still showing only their icons. This shifted the bottleneck to the CPU and allowed me to run 20 instances of Mugen at a time. More cores would help greatly, since I only have 4.

Above is me working on testing. Notice the icons: not grouped, but no labels, thanks to 7+ Taskbar Tweaker. Red numbers in the chart mean a glitch occurred.
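In case anyone wonders how one even starts this many copies: Mugen has command-line parameters for setting up a match (check the cmdline docs of your build; the flags below are from memory and may differ for yours). A hypothetical batch launcher might look like this:

import subprocess

# Hypothetical launcher -- flag names (-p1, -p2, -p1.ai, -p2.ai, -s, -rounds)
# are from the Mugen 1.0 docs as I remember them; verify against your build.
MUGEN_DIR = r"C:\mugen-test"

def launch(p1, p2, stage="stage0", rounds=500):
    return subprocess.Popen(
        ["mugen.exe",
         "-p1", p1, "-p1.ai", "1",   # character names, AI switched on
         "-p2", p2, "-p2.ai", "1",
         "-s", stage,                # the normal or never-ending test stage
         "-rounds", str(rounds)],
        cwd=MUGEN_DIR)

# One pairing, run both ways:
procs = [launch("char-A", "char-B"), launch("char-B", "char-A")]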

What is 'casting a wider net'?

In chess, a 2500-rated engine might lose to engines around its own rating yet maintain that rating by consistently beating weaker engines. This is usually due to contempt, where an engine values its own pieces more than the opponent's, causing it to avoid trades of material that would lead to a more drawish game. Well, Mugen is like that, but due to the intricacies of character abilities, AI, and character coding, there are countless ways to get A > B > C > A. You can't optimize against every type of AI/coding/character-ability configuration. So what do you do?

Ideally you want a round robin, where every one of the 150 characters has faced off against every other character on the roster. But that takes a very long time, and most of the results are useless. I don't need a test to know that the guy in first place, who is smashing every single opponent he meets in the top 10 spots, will defeat the guy in last place. (Note that ratings are closer together towards the center of the list, while at the extreme high and low ends the Elo gap between adjacent characters grows as everything blows up to the extremes.) A full round robin wastes a lot of time testing pairings I already know the results of.

On the other hand, if I overestimate my ability to predict outcomes, I end up faking results, or skipping tests that would actually have produced an upset. So what I want to do is test the 30th-place guy against everybody ranked 15th-45th, then widen both directions so I'm testing against 14th-46th, and so on. I am testing each character with a wider and wider net, from weak to strong.

A guy in 30th place has a better and better chance of beating characters ranked further and further below him, but there's still a chance that, say, the 40th-place guy always beats the 30th-place guy while losing to everybody else, due to the way particular combinations of coding interact. The roster-picture method ignores this; for the rating method, it is very important.
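The net itself is easy to express; a sketch, with the window width as a parameter (the +/-15 example from above):

def net_opponents(rank, width, roster_size):
    """Opponents for the character at `rank` (1-based): everyone within
    `width` spots on either side, clipped to the roster."""
    lo, hi = max(1, rank - width), min(roster_size, rank + width)
    return [r for r in range(lo, hi + 1) if r != rank]

print(net_opponents(30, 15, 150))       # ranks 15..45, minus 30 itself
print(len(net_opponents(30, 16, 150)))  # 32 opponents after widening once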

The Excel crosstable doesn't include information about B vs A when A vs B is already noted, since the reverse is just a mirror of the data across the middle black line, multiplied by -1. Thanks to the work of Jesse Gunderson, I have a script that takes the CSV data Excel outputs and turns it into a PGN file that Ordo can read. It took a lot of work together, but we got it running, and the procedure includes running Linux in VirtualBox to execute the bash file that converts the files. If I had to convert the data manually, it would've taken far longer and been far more error-prone.

I managed to alter Jesse's script to read losses in addition to wins and draws. This is a time-saving measure. As I mentioned, A vs B with A winning means a "1" is recorded, so by simple logic, B vs A means "-1". Without that change, to fill in the empty squares in my crosstable, which would contain no new data, I'd have to transpose 150 lines in Excel.
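For flavor, here is roughly what such a conversion does, rewritten as Python (Jesse's real script is bash, and the file names and exact tags here are my guesses): every tested cell becomes one PGN "game" whose result encodes the 1/-1/0.

import csv

# Sketch: crosstable.csv has character names in the first row and column,
# and cells of 1 (row character won), -1 (lost), 0 (draw); blanks = untested.
RESULT = {"1": "1-0", "-1": "0-1", "0": "1/2-1/2"}

with open("crosstable.csv", newline="") as f, open("results.pgn", "w") as out:
    rows = list(csv.reader(f))
    names = rows[0][1:]
    for row in rows[1:]:
        white = row[0]
        for black, cell in zip(names, row[1:]):
            cell = cell.strip()
            if cell in RESULT:  # skip blanks and the mirrored half
                out.write(f'[White "{white}"]\n[Black "{black}"]\n')
                out.write(f'[Result "{RESULT[cell]}"]\n\n{RESULT[cell]}\n\n')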

I altered the script, uploaded it to Termbin via the cat command, and just changed the command I input into the console to use the new Termbin link. (Termbin is a site much like Pastebin, but pulling code directly off Pastebin can be problematic.) I'm not a coder, so this process took a long time.


VirtualBox running Xubuntu, which in turn is running Jesse's script.

A quick question to Kai Laskos, a statistics buff in TCEC chat, established BayesElo as probably the best software for calculating the characters' Elo given that the sample size is only 1. (Even in the latter half of the chart, where I used the newest method, the PGN still sees a sample size of one: if I got an 85-3 score, I input "1". Remember, the first half of the chart was already in 1, 0, -1 format. I did take pictures of the scores to document them, however. And dear god, it sucks when I take a picture while a character is performing a super that covers up the score. If I have already exited the pairing, then both A vs B and B vs A need to be redone, because including more B vs A results than A vs B would skew the results. Accidentally saving a picture over another because I didn't want to restart Paint is a bummer too.) The problem is that the output of Jesse's program can only be read by Ordo. Ordo is a great piece of software, but it was freaking out about perfect winners and losers. (Even if BayesElo didn't freak out over those, it would've given me crap results anyways.) In both programs, the white advantage needs to be set to 0 to prevent funkiness in the data.

Here is what the picture editing to combine the results of A vs B and B vs A looks like. This documents the scores so that I can build on top of them in the future for larger sample sizes, or reference them to make sure I logged the results correctly in the Excel sheet. Later on, I changed the view of that folder to Details instead of icons, which sped up looking up existing pictures.


If a character has only won games and never lost to anybody, then there's no way to really estimate its rating. The same goes for perfect losers. Are you 50 Elo above the guy in second place, or 500, or 5000? The data won't fit, and the program chugs along even when forced to ignore warnings with the -G switch. The -g switch, on the other hand, reports all the characters with problematic data. The solution is as follows. There are two types of perfect winners. The first type are winners that are not in first place; the fix is to make them face off against the guy in first place. Boom, now they've lost a game. (Casting a wider net on both sides instead of such an extreme pairing might yield better results, I reckon, at the cost of more of my time.) For the character in first place, I fake a result where he loses, and no matter the final ratings shown, that character goes in first place. Of course, having a loss where there shouldn't be one affects the rating of that character, which in turn affects the ratings of characters that have faced him, and so on. However, I don't think the distortion is excessive, and I can adjust the final rankings based on the distortions I see coming.
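The detection half of this is mechanical; a sketch (the data layout is hypothetical):

def perfect_winners(records):
    """records: {name: [(opponent, result), ...]} with result 1/-1/0.
    Return every character that has never lost or drawn -- the ones
    Ordo and BayesElo cannot place on the rating scale."""
    return [name for name, games in records.items()
            if games and all(result == 1 for _, result in games)]

# The fix from above: schedule each perfect winner (except the real #1)
# against the #1 character so it picks up a loss; give #1 itself one
# faked loss and pin it to first place by hand in the final ranking.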

Here is Ordo (right) spitting out errors, and the perfect winners/losers report on the left.

---
Some questions people asked are simply due to a lack of understanding of Mugen. For Mugen, I want a large variety of characters; AI skill levels vary across the roster. It makes no sense to trash a character simply because it's not the strongest; otherwise I would need just 1 spot in my roster, not 150. I decide which characters to keep, and some glitchy characters bring value that offsets the trouble they cause. Glitches aren't black or white: they vary from the aesthetically unpleasant to the totally game-breaking.


Asking for help is easy to do, but getting a response is very tough. I might want to ask in a chess forum, because the testing is similar to chess testing and there are some statisticians there. I could also ask in a Mugen forum, for obvious reasons. But if I ask on a chess forum, people have no idea what Mugen is and don't want to get involved. If I ask on a Mugen forum, the people there don't understand why the hell I'd ever want to do this and stay away. In short, unless there's a Mugen+Chess&Chill forum I'm not aware of, none of these places can help. I did look up match statistics on the Chess Programming Wiki, but that site is very hard to understand, and despite my best efforts at using the formulas, I did not know when and how to apply them. It took Jesse writing the sorting file, and some chatting with others in TCEC chat, to bounce ideas around.
---

Anyways, time to show the results and conclude!

After some testing to make sure the results were accurate, the output file was hosted on Mega and downloaded outside of the virtual machine. It took some modifications for Ordo to accept the data, but all in all it went pretty smoothly. I ran Ordo with the -w 0 switch to tell it to ignore the white advantage. (Ordo is a chess rating calculation program, and in chess, playing as white is an advantage that needs to be accounted for in the ratings.) The draw percentage didn't affect the ratings, so I left that alone. Since there is only one game per pairing as far as Ordo is concerned, a higher simulation count wouldn't do much. I anchored God of Wind as #1 at 0 Elo. Then I took Ordo's output and matched the Mugen roster to it.




Looking at the names, it looks like I need to fix some of them so that they line up with what's displayed in the game. Also, I will add more characters to the roster, and cast a wider and wider net for the characters I think need it. As you can see below, my Excel sheet isn't a perfect triangle, because some characters needed more testing and some needed less.



Before and after pictures of the roster.



Almost no character was left untouched. The guys at the very end of the list didn't change though, because I've yet to decide whether I want those guys on my roster or not.


Games        : 2682 (finished)

White Wins   : 2031 (75.7 %)
Black Wins   : 572 (21.3 %)
Draws        : 79 (2.9 %)
Unfinished   : 0

White Score  : 77.2 %
Black Score  : 22.8 %
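(For reference, "score" here uses the chess convention where a draw counts as half a point: the white score is (2031 + 79/2) / 2682 ≈ 77.2%.)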

You can see that "white" wins more often than "black". What that really means is that the first-listed character in each pairing wins more often than the second. Note that this is NOT the Player 1 advantage in Mugen; the results were entered into Excel after combining A vs B and B vs A to sidestep that problem. Since the first-listed character was the one placed higher in my old, semi-sorted roster, white winning more often means a higher-ranked character had about a 76% chance of beating a lower-ranked one. (Some caveats apply, but let's not over-complicate things here.)

The final step is to update the Excel spreadsheet/crosstable so its rankings match the Ordo data: row 2 should hold the character ranked #1, row 3 the #2, and so on. I thought about the problem for several hours, and there seemed to be no practical way to update the spreadsheet. Intuitively, I just wanted to get in there and start cutting and pasting to somehow make it work. The problem is that "somehow" doesn't cut it. In the end, I found a way to accomplish my goal.

I ran SCID (an old and ugly-as-balls chess GUI) and loaded the PGN file exported by Jesse's script. Then I edited the engines' ratings in SCID so that when I sort the crosstable by rating, the order matches the ratings from Ordo. The program doesn't allow ratings over 4000 or below zero, so I set God of Wind (ranked #1) to 4000 Elo, 2nd place to 3990, 3rd place to 3980, etc. The 10-Elo gap is there in case I ever have to slot a new character in between two existing ones, so I won't have to manually bump the remaining 100+ characters down one spot.
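In other words, the SCID rating is just a linear function of the Ordo rank; the sketch below shows the mapping and the headroom the gap buys.

def scid_rating(rank):
    """Fake SCID rating for an Ordo rank (1 = best), with a 10-Elo gap
    between neighbors so a new character can be slotted in between two
    existing ones without renumbering the other 100+."""
    return 4000 - 10 * (rank - 1)

print(scid_rating(1), scid_rating(2), scid_rating(150))  # 4000 3990 2510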


Before and after pictures of the Excel chart:


And that's a wrap!
I've successfully dealt with many obstacles to get my Mugen testing up and running!
