A Games of Tufte - Part I · Mirko Mazzoleni

# A Games of Tufte - Part I

This report covers the first part of an exploratory data analysis of the Game of Thrones dataset hosted on Kaggle. The aim is to become familiar with the data ahead of subsequent analyses, and to draw the plots following Tufte’s design rules. Along the way, personal domain knowledge (acquired from the books, not the TV series) is used to motivate hypotheses and decisions. Since no prior motivation or question led me to collect these data, we let the exploratory data analysis phase generate the questions for us. A sound answer to those questions would require at least one more dataset, so we keep in mind that we are simply describing the dataset at hand, and we resist the temptation to make inferences or other kinds of final statements.

The entire code for this post can be found here.

## Data cleaning and question generation

After loading the required libraries and the dataset, which contains information about the main battles fought in the realm of Westeros during the War of the Five Kings, let’s first get an overview of the data by checking the variables at our disposal. There are $$38$$ observations of $$25$$ variables. The next step is to get familiar with the features and the values they can take.
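The loading step itself is not shown in this post. A minimal sketch of it might look as follows, assuming the Kaggle file has been saved locally as battles.csv (a hypothetical file name) and that text columns are read as factors, which is what the overview of the variables suggests. The helper name load_battles is also an assumption.

```r
# Sketch only: "battles.csv" is an assumed local file name, not taken from the post.
load_battles <- function(path) {
  # stringsAsFactors = TRUE reproduces the pre-R-4.0 default: text columns
  # become factors, and empty cells in those columns become a "" level.
  read.csv(path, stringsAsFactors = TRUE)
}

# Usage (uncomment once the file is available):
# battles <- load_battles("battles.csv")
# str(battles)
```

Reading strings as factors is what makes the empty-string levels visible in the structure overview, which is the starting point for most of the cleaning questions.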

'data.frame':	38 obs. of  25 variables:
 $ name              : Factor w/ 38 levels "Battle at the Mummer's Ford",..: 13 1 7 14 18 10 25 5 3 17 ...
 $ year              : int  298 298 298 298 298 298 298 299 299 299 ...
 $ battle_number     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ attacker_king     : Factor w/ 5 levels "","Balon/Euron Greyjoy",..: 3 3 3 4 4 4 3 2 2 2 ...
 $ defender_king     : Factor w/ 7 levels "","Balon/Euron Greyjoy",..: 6 6 6 3 3 3 6 6 6 6 ...
 $ attacker_1        : Factor w/ 11 levels "Baratheon","Bolton",..: 10 10 10 11 11 11 10 9 9 9 ...
 $ attacker_2        : Factor w/ 8 levels "","Bolton","Frey",..: 1 1 1 1 8 8 1 1 1 1 ...
 $ attacker_3        : Factor w/ 3 levels "","Giants","Mormont": 1 1 1 1 1 1 1 1 1 1 ...
 $ attacker_4        : Factor w/ 2 levels "","Glover": 1 1 1 1 1 1 1 1 1 1 ...
 $ defender_1        : Factor w/ 13 levels "","Baratheon",..: 12 2 12 8 8 8 6 11 11 11 ...
 $ defender_2        : Factor w/ 3 levels "","Baratheon",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ defender_3        : logi  NA NA NA NA NA NA ...
 $ defender_4        : logi  NA NA NA NA NA NA ...
 $ attacker_outcome  : Factor w/ 3 levels "","loss","win": 3 3 3 2 3 3 3 3 3 3 ...
 $ battle_type       : Factor w/ 5 levels "","ambush","pitched battle",..: 3 2 3 3 2 2 3 3 5 2 ...
 $ major_death       : int  1 1 0 1 1 0 0 0 0 0 ...
 $ major_capture     : int  0 0 1 1 1 0 0 0 0 0 ...
 $ attacker_size     : int  15000 NA 15000 18000 1875 6000 NA NA 1000 264 ...
 $ defender_size     : int  4000 120 10000 20000 6000 12625 NA NA NA NA ...
 $ attacker_commander: Factor w/ 32 levels "","Asha Greyjoy",..: 8 6 9 22 16 18 6 30 2 28 ...
 $ defender_commander: Factor w/ 29 levels "","Amory Lorch",..: 7 4 10 28 12 14 15 1 1 1 ...
 $ summer            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ location          : Factor w/ 28 levels "","Castle Black",..: 8 13 17 9 27 17 4 12 5 23 ...
 $ region            : Factor w/ 7 levels "Beyond the Wall",..: 7 5 5 5 5 5 5 3 3 3 ...
battles$defender_4=NULL

Question Q7: What does a "" value of the variable defender_1 mean? Expectation E7: A battle was fought without defenders, and it was probably a razing. Answer A7: The battle was indeed a razing, and there were neither attacker nor defender kings. We can set the missing value to “NotPresent”:

   name              attacker_king  attacker_1        defender_king  battle_type
30 Sack of Saltpans  NoKing         Brave Companions  NoKing         razing

levels(battles$defender_1)[match("",levels(battles$defender_1))]="NotPresent"

#### Attacker outcomes

This variable indicates the outcome from the perspective of the attacker. Categories: win, loss, draw.

Question Q8: What are the possible outcomes? Expectation E8: From the codebook, the possible values are “draw”, “win” and “loss”. Answer A8: The values meet the expectation, but no battle ended in a “draw”:

"unknown" "loss" "win"

#### Battle types

A classification of the battle’s primary type. Categories:

- Pitched_battle: armies meet in a location and fight.
- Ambush: a battle where stealth or subterfuge was the primary means of attack.
- Siege: a prolonged attack on a fortified position.
- Razing: an attack against an undefended position.

"" "ambush" "pitched battle" "razing" "siege"

Question Q9: What does the value "" of the variable battle_type mean? Expectation E9: Probably an unknown battle type. Answer A9: The value is missing because neither the course of the battle nor its outcome is known: the battle is the Siege of Winterfell by Stannis Baratheon:

   name                 attacker_king      attacker_outcome  defender_king             battle_type
38 Siege of Winterfell  Stannis Baratheon  unknown           Joffrey/Tommen Baratheon

levels(battles$battle_type)[match("",levels(battles$battle_type))]="unknown"

#### Attacker commander

Major commanders of the attackers. Commanders’ names are included without honorific titles, and commanders are separated by commas.
Since there are many commanders, only the first few are reported:

"" "Asha Greyjoy" "Dagmer Cleftjaw" "Daven Lannister, Ryman Fey, Jaime Lannister" "Euron Greyjoy, Victarion Greyjoy" "Gregor Clegane"

Question Q10: What does the value "" of the variable attacker_commander mean? Expectation E10: Probably a missing or unknown commander. Answer A10: The value is missing because there was no commander: the battle was led by the Brotherhood without Banners. We can set the missing value to “NotPresent”:

   name                          attacker_king  attacker_1                   defender_king  battle_type
23 Battle of the Burning Septry  NoKing         Brotherhood without Banners  NoKing         pitched battle

levels(battles$attacker_commander)[match("",levels(battles$attacker_commander))]="NotPresent"

#### Defender commander

Major commanders of the defenders. Commanders’ names are included without honorific titles, and commanders are separated by commas. Since there are many commanders, only the first few are reported:

"" "Amory Lorch" "Asha Greyjoy" "Beric Dondarrion" "Bran Stark" "Brynden Tully"

Question Q11: What does the value "" of the variable defender_commander mean? Expectation E11: Probably a missing or unknown commander. Answer A11: The value is missing either because there was no commander or because the commander is unknown. In the battles where defender_king is “NoKing”, we can assume that a defender_commander was not present.
In the remaining battles, most of which are led by the Greyjoys, there probably was a defender_commander, but it is not reported and is thus unknown:

   name                                                  attacker_king             defender_king             battle_type
8  Battle of Moat Cailin                                 Balon/Euron Greyjoy       Robb Stark                pitched battle
9  Battle of Deepwood Motte                              Balon/Euron Greyjoy       Robb Stark                siege
10 Battle of the Stony Shore                             Balon/Euron Greyjoy       Robb Stark                ambush
13 Sack of Torrhen's Square                              Balon/Euron Greyjoy       Balon/Euron Greyjoy       siege
21 Siege of Darry                                        Robb Stark                Joffrey/Tommen Baratheon  siege
23 Battle of the Burning Septry                          NoKing                    NoKing                    pitched battle
29 Fall of Moat Cailin                                   Joffrey/Tommen Baratheon  Balon/Euron Greyjoy       siege
30 Sack of Saltpans                                      NoKing                    NoKing                    razing
32 Battle of the Shield Islands                          Balon/Euron Greyjoy       Joffrey/Tommen Baratheon  pitched battle
33 Invasion of Ryamsport, Vinetown, and Starfish Harbor  Balon/Euron Greyjoy       Joffrey/Tommen Baratheon  razing

levels(battles$defender_commander) = c(levels(battles$defender_commander), "NotPresent","unknown")
battles[battles$defender_king=="NoKing" & battles$defender_commander=="","defender_commander"]="NotPresent"
battles[battles$defender_commander=="","defender_commander"]="unknown"
battles$defender_commander = droplevels(battles$defender_commander)
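
The recoding idiom used throughout this section renames a factor level in place instead of touching individual rows. A small sketch on a toy factor may make this clearer. The values are illustrative rather than the real column, and relabel_empty is a hypothetical helper name.

```r
# Hypothetical helper: rename the empty-string level of a factor.
relabel_empty <- function(f, label) {
  idx <- match("", levels(f))          # position of the "" level, NA if absent
  if (!is.na(idx)) levels(f)[idx] <- label
  f
}

# Toy factor, not the real column
commander <- factor(c("", "Asha Greyjoy", "", "Gregor Clegane"))
commander <- relabel_empty(commander, "NotPresent")
table(commander)
```

Because only the level is renamed, every row carrying "" changes at once. Assigning a new string directly to individual factor cells would instead produce NA with a warning unless the level is added first, which is why the defender_commander code extends the levels before assigning.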


#### Battles locations

This variable represents the battle location. Levels are:

""  "Castle Black"  "Crag"   "Darry"  "Deepwood Motte"  "Dragonstone"


Question Q12: What does the value "" of the variable location mean? Expectation E12: Probably a missing or unknown location. Answer A12: The location is not known:

             name                 attacker_king   defender_king
23 Battle of the Burning Septry       NoKing          NoKing

levels(battles$location)[match("",levels(battles$location))]="unknown"
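
Since the same question comes up for each categorical variable, it can help to list in one pass which factor columns still contain an empty level. The following is only a sketch: the toy data frame stands in for battles, and cols_with_empty_level is a hypothetical helper name.

```r
# Hypothetical helper: names of the factor columns that have a "" level.
cols_with_empty_level <- function(df) {
  has_empty <- vapply(
    df,
    function(col) is.factor(col) && "" %in% levels(col),
    logical(1)
  )
  names(df)[has_empty]
}

# Toy data frame standing in for battles
toy <- data.frame(
  location = factor(c("", "Darry", "Crag")),
  region   = factor(c("The Riverlands", "The Riverlands", "The Westerlands")),
  year     = c(299L, 299L, 299L)
)
cols_with_empty_level(toy)
```

Run on the real data frame after the cleaning steps, an empty result would confirm that no "" level has been overlooked.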


#### Battle regions

The region where the battle takes place. Categories: Beyond the Wall, The North, The Iron Islands, The Riverlands, The Vale of Arryn, The Westerlands, The Crownlands, The Reach, The Stormlands, Dorne

Question Q13: What values does the variable assume? Expectation E13: The values assumed by the variable are those described in the codebook. Answer A13: The answer meets the expectation, except for the regions “The Iron Islands”, “The Vale of Arryn” and “Dorne”, probably because no battles were fought in those regions:

"Beyond the Wall"  "The Crownlands"  "The North"  "The Reach"  "The Riverlands"
"The Stormlands"   "The Westerlands"

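The comparison between the codebook and the observed levels can be done with a set difference. In this sketch both vectors are typed in by hand from the lists above. With the real data, observed_regions would presumably be levels(battles$region).

```r
# Categories listed in the codebook
codebook_regions <- c(
  "Beyond the Wall", "The North", "The Iron Islands", "The Riverlands",
  "The Vale of Arryn", "The Westerlands", "The Crownlands", "The Reach",
  "The Stormlands", "Dorne"
)

# Levels actually present in the data (copied from the output above)
observed_regions <- c(
  "Beyond the Wall", "The Crownlands", "The North", "The Reach",
  "The Riverlands", "The Stormlands", "The Westerlands"
)

# Regions in the codebook that never occur in the data
missing_regions <- setdiff(codebook_regions, observed_regions)

# Values in the data that the codebook does not mention (should be empty)
unexpected_regions <- setdiff(observed_regions, codebook_regions)

missing_regions
unexpected_regions
```

Checking both directions matters: the first difference lists regions without battles, while the second would reveal misspelled or undocumented values.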

### Numerical Data

This section deals with the understanding and cleaning of numerical variables in the dataset.

#### Year

The year of the battle. We convert it to a factor variable for convenience and representation, since it assumes only $$3$$ different values.

   Min.  1st Qu.  Median   Mean    3rd Qu.  Max.
298.0   299.0    299.0   299.1   300.0   300.0
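
A minimal sketch of the conversion, using a short illustrative vector rather than the full column:

```r
# Illustrative values only, not the full year column
year <- c(298L, 298L, 299L, 299L, 300L)

year_f <- factor(year)
levels(year_f)
table(year_f)

# Going back to numbers has to pass through the labels, because
# as.integer() on a factor returns the internal codes (1, 2, 3).
year_codes  <- as.integer(year_f)
year_values <- as.integer(as.character(year_f))
```

Keeping this round trip in mind avoids silently working with the codes 1 to 3 instead of the years when a numeric summary is needed again.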


#### Attacker size

The size of the attacker’s force. No distinction is made between the types of soldiers such as cavalry and footmen:

   Min.  1st Qu.  Median  Mean   3rd Qu.   Max.   NA's
20    1375     4000   9943    8250   100000    14


From the summary we can see that the attacker army size has a mean of about $$10000$$ soldiers, but it is widely scattered and has many missing values. Particularly impressive is the maximum of $$100000$$ men.
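
Missing values need explicit handling when such statistics are computed by hand. The sketch below uses only the first ten values of attacker_size, as printed in the structure overview at the top of the post, so its results differ from the full-data summary.

```r
# First ten values of attacker_size, as printed by str()
attacker_size <- c(15000, NA, 15000, 18000, 1875, 6000, NA, NA, 1000, 264)

n_missing <- sum(is.na(attacker_size))

# Without na.rm = TRUE a single NA makes the result NA
mean_with_na <- mean(attacker_size)

mean_size   <- mean(attacker_size, na.rm = TRUE)
median_size <- median(attacker_size, na.rm = TRUE)
```

A mean well above the median, as in the full summary, is a first hint that a few very large armies dominate the distribution.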

Question Q14: Which battle involved $$100000$$ men? Expectation E14: A battle in the North with the wildlings. Answer A14: The battle is the assault on Castle Black by the wildlings (the free folk), during which Jon Snow loses Ygritte. There is an error in the data: we know that Stannis Baratheon sided with the Night’s Watch, defending it and capturing Mance Rayder. Furthermore, Stannis won the battle and Mance Rayder lost it, so the attacker_king and defender_king variables should be swapped. The figure of $$100000$$ makes more sense once we read it as the army of all the free folk, as is also reported here:

           name               attacker_king     defender_king   attacker_1    defender_1    attacker_outcome
28 Battle of Castle Black   Stannis Baratheon    Mance Rayder    Free folk   Night's Watch       loss
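
The correction itself is not shown in the post. One way to swap the two columns for that row is sketched here on a toy data frame. The helper name swap_kings and the choice to rebuild both factors are assumptions. Going through character vectors matters because the two columns are factors with different level sets, so a direct assignment of “Mance Rayder” into attacker_king would yield NA.

```r
# Toy stand-in for the relevant rows and columns of battles
toy <- data.frame(
  name          = c("Battle of Castle Black", "Siege of Winterfell"),
  attacker_king = factor(c("Stannis Baratheon", "Stannis Baratheon")),
  defender_king = factor(c("Mance Rayder", "Joffrey/Tommen Baratheon")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: swap attacker_king and defender_king on selected rows.
swap_kings <- function(df, rows) {
  att <- as.character(df$attacker_king)
  def <- as.character(df$defender_king)
  tmp <- att[rows]
  att[rows] <- def[rows]
  def[rows] <- tmp
  # Rebuild the factors so that the level sets match the new values
  df$attacker_king <- factor(att)
  df$defender_king <- factor(def)
  df
}

fixed <- swap_kings(toy, toy$name == "Battle of Castle Black")
fixed
```

Only the kings are swapped: attacker_1, defender_1 and attacker_outcome already describe the free folk attacking and losing.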


Question Q15: When attacking, which battle type required the most men? Expectation E15: Probably the pitched battle, since it requires more men than an ambush or a siege, which call for discretion and for equipment (trebuchets, rams) respectively. Answer A15: The pitched battle has the highest median (about $$10000$$ troops); its distribution is strongly skewed, and only $$25\%$$ of the values are below $$3000$$ troops. Perhaps surprisingly, the median of the ambush distribution is similar to that of the siege distribution, although the latter is more concentrated, suggesting that there is a fairly standard number of troops for a siege. There are isolated cases of ambushes with fewer than $$30$$ men, and a siege with $$100000$$ men (Mance Rayder’s attack on Castle Black). We do not consider “unknown” or “razing” battles, since they have few or no observations.
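
The group medians behind such a comparison can be computed with tapply. This sketch uses only the first ten battles, decoded from the structure overview at the top of the post, so the numbers differ from the full-data medians discussed above. The name first_ten is a hypothetical one.

```r
# First ten rows of battle_type and attacker_size, decoded from str()
first_ten <- data.frame(
  battle_type = factor(c(
    "pitched battle", "ambush", "pitched battle", "pitched battle", "ambush",
    "ambush", "pitched battle", "pitched battle", "siege", "ambush"
  )),
  attacker_size = c(15000, NA, 15000, 18000, 1875, 6000, NA, NA, 1000, 264)
)

# Median attacker size per battle type, ignoring missing sizes
median_by_type <- tapply(
  first_ten$attacker_size,
  first_ten$battle_type,
  median,
  na.rm = TRUE
)
median_by_type
```

The corresponding box plot could be drawn in base R with boxplot(attacker_size ~ battle_type, data = first_ten), although the plots in this post presumably rely on a different plotting setup.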

Question Q16: When attacking, which king had the largest army? Expectation E16: We already know that Mance Rayder commanded $$100000$$ men. Answer A16: Mance had the largest attacking army, but he attacked only once, so it is more interesting to consider the other kings. I also chose to exclude the “NoKing” category from the comparison, since it has few observations. The plot shows that the Greyjoys had the smallest armies, ranging from $$10$$ to $$1000$$ men. The Lannisters and the Starks show a high median army size, but also great variation in their forces, especially Robb Stark. This is probably due to his tendency to carry out ambushes with few men. Stannis Baratheon’s forces suffer few losses and have a fairly concentrated distribution.
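
Excluding sparsely observed kings before comparing distributions can be sketched as follows. The data are the first ten battles (kings decoded from the structure overview, assuming the usual alphabetical level order) plus the Castle Black row with Mance Rayder as attacker. The helper keep_frequent and the threshold of two observations are assumptions, not the author’s rule.

```r
# Hypothetical helper: keep only groups with at least min_n observations.
keep_frequent <- function(df, col, min_n) {
  counts <- table(df[[col]])
  keep   <- names(counts)[counts >= min_n]
  out    <- df[df[[col]] %in% keep, , drop = FALSE]
  out[[col]] <- droplevels(out[[col]])   # remove the now-empty levels
  out
}

toy <- data.frame(
  attacker_king = factor(c(
    rep("Joffrey/Tommen Baratheon", 3), rep("Robb Stark", 3),
    "Joffrey/Tommen Baratheon", rep("Balon/Euron Greyjoy", 3),
    "Mance Rayder"
  )),
  attacker_size = c(15000, NA, 15000, 18000, 1875, 6000, NA, NA, 1000, 264, 100000)
)

kept <- keep_frequent(toy, "attacker_king", min_n = 2)
table(kept$attacker_king)
```

Dropping the unused levels keeps empty groups from appearing in later tables or plots.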

#### Defender size

The size of the defender’s force. No distinction is made between the types of soldiers such as cavalry and footmen.

   Min.  1st Qu.  Median  Mean   3rd Qu.  Max.   NA's
100    1070    6000    6428   10000   20000     19


Question Q17: When defending, which king had the largest army? Expectation E17: Probably the Lannisters or Stannis Baratheon. Answer A17: Apart from “NoKing” and “Renly Baratheon”, which have few observations, the plots show that the Lannisters defended their positions with more troops than the others, while Stannis Baratheon, despite attacking with a high number of troops, defended with very few.