A Games of Tufte - Part I

This report concerns the first part of an exploratory data analysis based on the Games of Thrones dataset hosted on Kaggle. The aim of this work is to familiarize with the data for subsequent analysis, and using the Tufte design rules to represent the plots. During the process, personal domain knowledge (acquired from books and not the tv series) is used to motivate hypothesis and decisions. Since there aren’t motivations or questions that brought me to collect data, in order to answer to them, we let the Exploratory Data Analysis phase to generate questions for us. A sound answer to those questions would require at least another dataset, so we let to fix in mind the fact that we are simply describing the dataset at hand, without the temptation to make inferences or other types of final statements.

The entire code for this post can be found here.


Data cleaning and questions generation

After having load the required libraries and the dataset, which contains information about the main battles in the reign of Westeros during the War of the Five Kings, let’s first take an overview of the dataset at hand by checking the variables at our disposal. We can see \(38\) observations for each of the \(25\) variables. The next step will be to gain confidence with the features and the values they can take.

  'data.frame':	38 obs. of  25 variables:
   $ name              : Factor w/ 38 levels "Battle at the Mummer's Ford",..: 13 1 7 14 18 10 25 5 3 17 ...
   $ year              : int  298 298 298 298 298 298 298 299 299 299 ...
   $ battle_number     : int  1 2 3 4 5 6 7 8 9 10 ...
   $ attacker_king     : Factor w/ 5 levels "","Balon/Euron Greyjoy",..: 3 3 3 4 4 4 3 2 2 2 ...
   $ defender_king     : Factor w/ 7 levels "","Balon/Euron Greyjoy",..: 6 6 6 3 3 3 6 6 6 6 ...
   $ attacker_1        : Factor w/ 11 levels "Baratheon","Bolton",..: 10 10 10 11 11 11 10 9 9 9 ...
   $ attacker_2        : Factor w/ 8 levels "","Bolton","Frey",..: 1 1 1 1 8 8 1 1 1 1 ...
   $ attacker_3        : Factor w/ 3 levels "","Giants","Mormont": 1 1 1 1 1 1 1 1 1 1 ...
   $ attacker_4        : Factor w/ 2 levels "","Glover": 1 1 1 1 1 1 1 1 1 1 ...
   $ defender_1        : Factor w/ 13 levels "","Baratheon",..: 12 2 12 8 8 8 6 11 11 11 ...
   $ defender_2        : Factor w/ 3 levels "","Baratheon",..: 1 1 1 1 1 1 1 1 1 1 ...
   $ defender_3        : logi  NA NA NA NA NA NA ...
   $ defender_4        : logi  NA NA NA NA NA NA ...
   $ attacker_outcome  : Factor w/ 3 levels "","loss","win": 3 3 3 2 3 3 3 3 3 3 ...
   $ battle_type       : Factor w/ 5 levels "","ambush","pitched battle",..: 3 2 3 3 2 2 3 3 5 2 ...
   $ major_death       : int  1 1 0 1 1 0 0 0 0 0 ...
   $ major_capture     : int  0 0 1 1 1 0 0 0 0 0 ...
   $ attacker_size     : int  15000 NA 15000 18000 1875 6000 NA NA 1000 264 ...
   $ defender_size     : int  4000 120 10000 20000 6000 12625 NA NA NA NA ...
   $ attacker_commander: Factor w/ 32 levels "","Asha Greyjoy",..: 8 6 9 22 16 18 6 30 2 28 ...
   $ defender_commander: Factor w/ 29 levels "","Amory Lorch",..: 7 4 10 28 12 14 15 1 1 1 ...
   $ summer            : int  1 1 1 1 1 1 1 1 1 1 ...
   $ location          : Factor w/ 28 levels "","Castle Black",..: 8 13 17 9 27 17 4 12 5 23 ...
   $ region            : Factor w/ 7 levels "Beyond the Wall",..: 7 5 5 5 5 5 5 3 3 3 ...
   $ note              : Factor w/ 6 levels "","Greyjoy's troop number based on the Battle of Deepwood Motte, in which Asha had 1000 soldier on 30 longships. That comes out to"| __truncated__,..: 1 1 1 1 1 1 1 1 1 2 ...

Question, Expectation and Answers

In this section, we will follow the following logical framework:

  1. Letting the exploration generate the questions
  2. Setting an expectation for what the data will tell us
  3. Revise or confirm the expectation in the light of analysed data

Being the dataset composed most by categorical variables, we want to know first the discrete values that these variables can take, and see if any question arises.

Categorical Data

This section deals with the understanding and cleaning of categorical variables in the dataset.

Attacker King

The variable represents the attacker’s king. A slash indicates that the king changes over the course of the war. The levels of this variable are:

""  "Balon/Euron Greyjoy"  "Joffrey/Tommen Baratheon"  "Robb Stark"  "Stannis Baratheon"

The fact, in some circumstances, that there isn’t an attacking king is not an error: simply can be that there is no attacking king commanding the troops. In this case throwing away missing data can be detrimental, as they can be source of information. For example, a value of “ “ can mean “unknown” or “not applicable”, so it should be encoded in that way. Usually I search for trend in missing data to see if they miss for a reason.

Question Q1: Does the “ “ level mean something? Expectation E1: Yes, simply there’s no king. Answer A1: The “ “ stands for “NoKing”.

  • Check the battle names where there is no attacker king and possible notes on that battles:
                name                 note
23   Battle of the Burning Septry     
30   Sack of Saltpans
  • Is the missing king due to the battle type? ==> NO, usually pitched battles have a king which guides the army, and troops without a king are expected to act as bandits, preferring a razing or ambush strategy:

                     FALSE TRUE
                       1    0
    ambush            10    0
    pitched battle    13    1
    razing             1    1
    siege             11    0
  • Is the missing king due to the attacker? ==> YES, the Brave Companions and the Brotherhood don’t have a king:

                                  FALSE TRUE
    Baratheon                       6    0
    Bolton                          2    0
    Bracken                         1    0
    Brave Companions                0    1
    Brotherhood without Banners     0    1
    Darry                           1    0
    Free folk                       1    0
    Frey                            2    0
    Greyjoy                         7    0
    Lannister                       8    0
    Stark                           8    0
  • Is the previous conclusion the right one? Let’s Check for other associations. Seems that the Riverlands are land of no-one, and bandits or sellswords prefer to fight in that area!

                      FALSE TRUE
    Beyond the Wall     1    0
    The Crownlands      2    0
    The North          10    0
    The Reach           2    0
    The Riverlands     15    2
    The Stormlands      3    0
    The Westerlands     3    0

The most plausible answer is that the missing value is related to fighters which don’t have a king. We can then fill the missing value with a new one, the “NoKing” value in this case.

levels(battles$attacker_king)[match("",levels(battles$attacker_king))]="NoKing"

Question Q2: Who is the king who attacked more? Expectation E2: Joffrey/Tommen Baratheon, being Joffrey the most sadistic character. Answer A2: Joffrey/Tommen Baratheon

Defender King

This variable represents the defender’s king. The levels are:

""  "Balon/Euron Greyjoy"  "Joffrey/Tommen Baratheon"  "Mance Rayder"  "Renly Baratheon"
    "Robb Stark"  "Stannis Baratheon"

Question Q3: Does the “ “ level mean something? Expectation E3: Yes, simply there’s no king. Answer A3: No king.

From the data we can see that there are 3 battles without a defending king.

             name                    note
  23 Battle of the Burning Septry     
  25 Retaking of Harrenhal     
  30 Sack of Saltpans

Digging a little deeper shows that two of them were against the brave Companion which don’t have a king, and the remaining one is a razing, that is, an attack against an undefended position:


                         ambush  pitched battle  razing  siege
                     0      0         0             1     0
    Baratheon        0      0         0             0     0
    Blackwood        0      0         0             0     0
    Bolton           0      0         0             0     0
    Brave Companions 0      0         2             0     0
    Darry            0      0         0             0     0
    Greyjoy          0      0         0             0     0
    Lannister        0      0         0             0     0
    Mallister        0      0         0             0     0
    Night's Watch    0      0         0             0     0
    Stark            0      0         0             0     0
    Tully            0      0         0             0     0
    Tyrell           0      0         0             0     0

I feel confident to label the “ “ value to the “NoKing” one.

levels(battles$defender_king)[match("",levels(battles$defender_king))]="NoKing"

Question Q4: Whose king undergone more attacks? Expectation E4: Robb Stark, it seems that everyone wants the North. Answer A4: Robb Stark.

Observation O1: From the previous plots, it seems that both Renly Baratheon and Mance Rayder did not have the time, or the will, to perform any attack: they only defended their position.

Attackers

These variables indicates the major houses attacking.

Main attackers:

  "Baratheon"  "Bolton"  "Bracken"  "Brave Companions"  "Brotherhood without Banners"
  "Darry"  "Free folk"  "Frey"    "Greyjoy"  "Lannister"         "Stark"

Second attackers:

""  "Bolton"  "Frey"  "Greyjoy"  "Karstark"  "Lannister"  "Thenns"  "Tully"

Third attackers:

""  "Giants"  "Mormont"

Fourth attackers:

""  "Glover"

It can be seen that even in the attackers levels, except for the variable attacker_1, there are missing information represented as an empty string. Intuitively, this represent the fact that there aren’t the same number of allies for every battles. We can encode the missing values as “NonPresent”, indicating that an attacker is not present for that battle.

levels(battles$attacker_2)[match("",levels(battles$attacker_2))]="NotPresent"
levels(battles$attacker_3)[match("",levels(battles$attacker_3))]="NotPresent"
levels(battles$attacker_4)[match("",levels(battles$attacker_4))]="NotPresent"

Another observation is that the House Glover is present all the times that 4 attackers participated in a battle.

Question Q5: What are the battle with the maximum number of attackers? Expectation E5: Probably the battles to conquer the North. Answer A5: The battles were those carried on by Stannis Baratheon to free the North from the Greyjoy’s and Winterfell from Bolton’s:

            name                     attacker_king         defender_king       attacker_outcome   region
31  Retaking of Deepwood Motte   Stannis Baratheon    Balon/Euron Greyjoy           win        The North
38  Siege of Winterfell          Stannis Baratheon    Joffrey/Tommen Baratheon                 The North    

Observation O2:This last battle does not have an outcome, because in the book we only know a letter send to Jon Snow by Ramsey Bolton which tells him that Stannis died, but we are not sure of the letter trustfulness.

Question Q6: There are other cases with missing attacker_outcome? Expectation E6: Probably not, since that is the last battle of the books until now. Answer A6: No, that battle is the only one. We can set the missing value to “unknown”:

            name             attacker_king         defender_king          region
38  Siege of Winterfell   Stannis Baratheon   Joffrey/Tommen Baratheon   The North
levels(battles$attacker_outcome)[match("",levels(battles$attacker_outcome))]="unknown"

Defenders

This variable indicates the major houses defending.

Main defenders:

""   "Baratheon"  "Blackwood"  "Bolton"  "Brave Companions" "Darry"  "Greyjoy"          
     "Lannister"  "Mallister"  "Night's Watch"  "Stark"  "Tully"  "Tyrell"

Second defenders:

""  "Baratheon"  "Frey"

Third defenders:

  NULL

Fourth defenders:

  NULL

It can be seen that even in the defenders levels there are missing information represented as an empty string. As with the attackers, we can encode the missing values as “NonPresent”, indicating that a defender is not present for that battle.

Observation O3: It can be further noticed that nobody defended with more than one ally, and the defender_3 and defender_4 columns can be removed from the dataset.

levels(battles$defender_2)[match("",levels(battles$defender_2))]="NotPresent"
battles$defender_3=NULL
battles$defender_4=NULL

Question Q7: What does it mean a “ “ value on the variable defender_1? Expectation E7: A battle was fought without defenders, and probably was a razing. Answer A7: The battle was indeed a razing and there were nor attackers neither defender kings. We can set the missing value to the “NotPresent” one:

           name         attacker_king     attacker_1     defender_king   battle_type
 30  Sack of Saltpans      NoKing      Brave Companions     NoKing          razing
levels(battles$defender_1)[match("",levels(battles$defender_1))]="NotPresent"

Attacker outcomes

This variable indicates the outcome from the perspective of the attacker. Categories: win, loss, draw.

Question Q8: What are the possible outcomes? Expectation E8: From the codebook, the possible values are “draw”, “win”, “loss”.Answer A8: The values are under the expectations but no battle ended with a “draw”:

"unknown"  "loss"  "win"

Battle types

A classification of the battle’s primary type. Categories:

  • Pitched_battle: armies meet in a location and fight.
  • Ambush: a battle where stealth or subterfuge was the primary means of attack.
  • Siege: a prolonged of a forties position.
  • Razing: an attack against an undefended position
""  "ambush"  "pitched battle"  "razing"  "siege"

Question Q9: What does it mean the value “ “ on the variable battle_type? Expectation E9: Probably an unknown battle type. Answer A9: The value is not indicated because is unknown how the battle went and its outcome, being the battle the Siege of Winterfell by Stannis Baratheon:

               name         attacker_king    attacker_outcome    defender_king         battle_type
38  Siege of Winterfell   Stannis Baratheon     unknown       Joffrey/Tommen Baratheon  
levels(battles$battle_type)[match("",levels(battles$battle_type))]="unknown"

Attacker commander

Major commanders of the attackers. Commander’s names are included without honorific titles and commanders are separated by commas. Since there are many commanders, only the first are reported:

""  "Asha Greyjoy"  "Dagmer Cleftjaw"   "Daven Lannister, Ryman Fey, Jaime Lannister"
    "Euron Greyjoy, Victarion Greyjoy"  "Gregor Clegane"

Question Q10: What does it mean the value “ “ on the variable attacker_commander? Expectation E10: Probably a missing or unknown commander. Answer A10: The value is not indicated because there wasn’t a commander, being a battle led by the Brotherhood without Banners. We can set the missing value to a “NotPresent” one:

              name                attacker_king        attacker_1             defender_king    battle_type
23 Battle of the Burning Septry      NoKing      Brotherhood without Banners     NoKing      pitched battle
levels(battles$attacker_commander)[match("",levels(battles$attacker_commander))]="NotPresent"

Defender commander

Major commanders of the defenders. Commander’s names are included without honoric titles and commanders are separated by commas. Since there are many commanders, only the first one are reported:

""  "Amory Lorch"  "Asha Greyjoy"  "Beric Dondarrion"  "Bran Stark"  "Brynden Tully"

Question Q11: What does it mean the value “ “ on the variable defender_commander? Expectation E11: Probably a missing or unknown commander. Answer A11: The value is not indicated because there wasn’t a commander, or it was unknown. In the battles where there is “NoKing” as defender_king, we can assume that the a defender_commander was not present. In the rest of the battles, which most of them are led by the Greyjoy’s, probably there was a defender_commander but is not indicated, and thus is unknown:

            name                    attacker_king            defender_king           battle_type
8   Battle of Moat Cailin          Balon/Euron Greyjoy        Robb Stark pitched         battle
9   Battle of Deepwood Motte       Balon/Euron Greyjoy        Robb Stark                 siege
10  Battle of the Stony Shore      Balon/Euron Greyjoy        Robb Stark                 ambush
13  Sack of Torrhen's Square       Balon/Euron Greyjoy        Balon/Euron Greyjoy        siege
21  Siege of Darry                  Robb Stark                Joffrey/Tommen Baratheon   siege
23  Battle of the Burning Septry   NoKing                     NoKing pitched             battle
29  Fall of Moat Cailin            Joffrey/Tommen Baratheon   Balon/Euron Greyjoy        siege
30  Sack of Saltpans               NoKing                     NoKing                     razing
32  Battle of the Shield Islands   Balon/Euron Greyjoy        Joffrey/Tommen Baratheon   pitched battle
33  Invasion of Ryamsport,         Balon/Euron Greyjoy        Joffrey/Tommen Baratheon   razing
    Vinetown, and Starfish Harbor
levels(battles$defender_commander) = c(levels(battles$defender_commander), "NotPresent","unknown")
battles[battles$defender_king=="NoKing" & battles$defender_commander=="","defender_commander"]="NotPresent"
battles[battles$defender_commander=="","defender_commander"]="unknown"
battles$defender_commander = droplevels(battles$defender_commander)

Battles locations

This variable represents the battle location. Levels are:

""  "Castle Black"  "Crag"   "Darry"  "Deepwood Motte"  "Dragonstone"

Question Q12: What does it mean the value “ “ on the variable location? Expectation E12: Probably a missing or unknown location Answer A12: The location is not known:

             name                 attacker_king   defender_king
23 Battle of the Burning Septry       NoKing          NoKing
levels(battles$location)[match("",levels(battles$location))]="unknown"

Battle regions

The region where the battle takes place. Categories: Beyond the Wall, The North, The Iron Islands, The Riverlands, The Vale of Arryn, The Westerlands, The Crownlands, The Reach, The Stormlands, Dorne

Question Q13: What are the values assume med by the variable? Expectation E13: The values assumed by the variable are those described in the codebook. Answer A13: The answer meets the expectation, except for the regions “The Iron Islands”, “The Vale of Arryn” and “Dorne”, probably because no battle were fought in those regions:

"Beyond the Wall"  "The Crownlands"  "The North"  "The Reach"  "The Riverlands"  
"The Stormlands"   "The Westerlands"

Numerical Data

This section deals with the understanding and cleaning of numerical variables in the dataset.

Year

The year of the battle. We convert it to a factor variable for convenience and representation, since it assumes only \(3\) different values.

   Min.  1st Qu.  Median   Mean    3rd Qu.  Max.
  298.0   299.0    299.0   299.1   300.0   300.0

Attacker size

The size of the attacker’s force. No distinction is made between the types of soldiers such as cavalry and footmen:

   Min.  1st Qu.  Median  Mean   3rd Qu.   Max.   NA's
    20    1375     4000   9943    8250   100000    14

From the summary we can see that the distribution of the attacker army has a mean of about \(10000\) soldiers , but is very scattered with many missing numbers. Particularly impressing is maximum number of \(100000\) men.

Question Q14: Which is the battle with \(100000\) men? Expectation E14: A battle in the North with the wildlings. Answer A14: The battle was the assault of Castle Black by the wildlings and free folk, when Jon Snow loses Igritte. We can see that there is an error in the data, because we know that Stannis Baratheon was on the Night’s side, defending the Nigth’s Watch and seizing Mance Rayder. Furthermore, Stannis won the battle and Mance Rayder lost it, thus the attacker_king and defender_king variables should be swapped. The number of \(100000\) is more meaningful now if we think of it as the army of all the freefolks, as it is also reported here:

           name               attacker_king     defender_king   attacker_1    defender_1    attacker_outcome
28 Battle of Castle Black   Stannis Baratheon    Mance Rayder    Free folk   Night's Watch       loss

Question Q15: When attacking, which battle type required more men? Expectation E15: Probably the pitched battle type, since it requires more men than ambush or a siege, which require more discretion and tools (trebuchets, rams) capability respectively. Answer 15: The pitched battle has the higher median (about \(10000\) troops) and it is very skewed around this value, and only \(25%\) of values are lower than \(3000\) troops. Perhaps surprising, the median of the ambush distribution is similar to that of siege, being the latter more concentrated, indicating that there some standard number of troops to do a siege. Here we have isolated cases of ambushes with less than \(30\) men, and a siege with \(100000\) men (the Mance Rayder attack to Castle Black). We Do not consider “unknown” o “razing” battles since they have few or none observations.

Question Q16: When attacking, which king had the most numerous army? Expectation E16: We already know that Mance Rayder commanded \(100000\) men. Answer 16: Mance had the most numerous attacking army, but he attacked only one time, so it is more interesting to considered the other kings. I made the choice to exclude from the comparison also the “NoKing” category, since it has few observations. We can see from the plot that the Greyjoy’s had the smallest army, ranging from \(10\) to \(1000\) men. The Lannister’s and the Stark’s show a high median value army, but that also had great variation in its forces, mainly for Robb Stark. This can be probably due to his attitude to perform ambush attacks with few men. Stannis Baratheon forces undergo few losses, having a quite concentrated distribution.

Defender size

The size of the defender’s force. No distinction is made between the types of soldiers such as cavalry and footmen.

   Min.  1st Qu.  Median  Mean   3rd Qu.  Max.   NA's
   100    1070    6000    6428   10000   20000     19

Question Q17: When attacking, which king had the most numerous army? Expectation E17:Probably the Lannister’s or Stannis Baratheon. Answer 17: Apart from the “Noking” and “Renly Baratheon”, which have few observation, the plots show that the Lannister’s defended their position with more troops than the other, and Stannis Baratheon, despite attacking with an high number of troops, defended with very few ones.

Acknowledgements




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Visualizing Classification Results - Confusion Star and Confusion Gear
  • Human perception of probability
  • The three ways of statistical inference
  • A Games of Tufte - Part II