Pokémon Analysis

Andreas Nikolaidis

February 2022


In this project, I use Python to analayze stats on all Pokemon in Generations 1 - 8, and calculate some interesting statistics based on a number of factors.

We can use this data to answer questions such as:

  • Does a Pokemon's Type determine it's stats like: HP, Attack, Defense, etc.?
  • What is the most important stat for predicting other stats? i.e. which stats have a high correlation?

In the following sections, I will walk through my process of extracting and analyzing the information using in pandas DataFrames, creating some visualizations and perform modeling using scikit-learn.

Exploratory Analysis

Start by importing all the necessary packages into Python:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import warnings

import matplotlib.pyplot as plt 
import seaborn as sns 
import plotly.express as px
import plotly.graph_objects as go

%matplotlib inline

# Import for Linear Regression
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

Read Data File:

df = pd.read_excel("pokemon.xlsx")

Create a separate dataframe including just the necessary stats:

df_stats = df[["Name","HP","Attack","Defense","SP_Attack","SP_Defense","Speed"]]

Although each stat is important in it's own right, the total value of all stats is what determines the category of a pokemon, therefore let's concatenate a column into the df that sums up the total values:

df['total'] = df.HP + df.Attack + df.Defense + df.SP_Attack + df.SP_Defense + df.Speed

Now let's view the range of total stats by each generation:

#palette: https://seaborn.pydata.org/tutorial/color_palettes.html?highlight=color
plt.figure(figsize=(13,10), dpi=80)
sns.violinplot(x='Gen', y='total', data=df, scale='width', inner='quartile', palette='Set2') 
plt.title('Violin Plot of Total Stats by Generation', fontsize=22)

In the above violinplot we can see that each generation has quite a different range of total stats with Gens IV, VII, & VIII having the longest range, while Gen V had a relatively tight range of stats. All Generations from IV onwards had higher medians than the first 3 generations.

Looking at individual stats, Speed is one of (if not THE) most important stat in competitive play, so let's examine which generations had the best overall speed stats.

plt.figure(figsize=(13,10), dpi=80)
sns.violinplot(x='Gen', y='Speed', data=df, scale='width', inner='quartile', palette='Set2')

plt.title('Violin Plot of Total Stats by Generation', fontsize=22)

Here we can clearly see Generation VIII has some of the fastest pokemon ever seen in games. Let's create a function to return the top 10 fastest pokemon in Gen VIII and their respective speed stat values:

def top_n(df, category, n):
    return (df.loc[df['Gen'] == 'VIII'].sort_values(category, ascending=False)[['Name','Gen',category]].head(n))
print('Top 10 Pokemon Speed')
top_n(df, 'Speed', 10)

Those are definitely some fast pokemon!

Let's now see if we can get any indication of whether a particular pokemon's type has an advantage over others in total stats.

types_color_dict = {
    'grass':'#8ED752', 'fire':'#F95643', 'water':'#53AFFE', 'bug':"#C3D221", 'normal':"#BBBDAF", \
    'poison': "#AD5CA2", 'electric':"#F8E64E", 'ground':"#F0CA42", 'fairy':"#F9AEFE", \
    'fighting':"#A35449", 'psychic':"#FB61B4", 'rock':"#CDBD72", 'ghost':"#7673DA", \
    'ice':"#66EBFF", 'dragon':"#8B76FF", 'dark':"#1A1A1A", 'steel':"#C3C1D7", 'flying':"#75A4F9" }

plt.figure(figsize=(15,12), dpi=80)
sns.violinplot(x='Primary', y='total', data=df, scale='width', inner='quartile', palette=types_color_dict)

plt.title('Violin Plot of Total Stats by Type', fontsize=20)

The dragon type definitely has quite a high upper interquartile range compared to other types. Meanwhile water & fairy types seem to have quite a large variance in total stats.

Let's see what the most common type of pokemon is:

types_color_dict = {
    'grass':'#8ED752', 'fire':'#F95643', 'water':'#53AFFE', 'bug':"#C3D221", 'normal':"#BBBDAF", \
    'poison': "#AD5CA2", 'electric':"#F8E64E", 'ground':"#F0CA42", 'fairy':"#F9AEFE", \
    'fighting':"#A35449", 'psychic':"#FB61B4", 'rock':"#CDBD72", 'ghost':"#7673DA", \
    'ice':"#66EBFF", 'dragon':"#8B76FF", 'dark':"#1A1A1A", 'steel':"#C3C1D7", 'flying':"#75A4F9" }

Type1 = pd.value_counts(df['Primary'])
dims = (11.7,8.27) #A4 dimensions
fig, ax=plt.subplots(figsize=dims)
BarT = sns.barplot(x=Type1.index, y=Type1, data=df, palette=types_color_dict, ax=ax)
BarT.set_xticklabels(BarT.get_xticklabels(), rotation= 90, fontsize=12)
BarT.set(ylabel = 'Freq')
BarT.set_title('Distribution of Primary Pokemon Types')
FigBar = BarT.get_figure()

We can see that the water and normal type pokemon are the most frequently appearing 'primary' types in the game.

Let's see how many pokemon are mono types vs dual-types so we can get a better sense of whether primary is sufficient.

labels = ['Mono type pokemon', 'Dual type pokemon']
sizes = [monotype, dualtype]
colors = ['lightskyblue', 'lightcoral']

patches, texts, _ = plt.pie(sizes, colors=colors, autopct='%1.1f%%', startangle=90, explode=(0,0.1))
plt.legend(patches, labels, loc="best")
plt.title('Dual-Type Ratio', fontsize=12)

Looks like there's actually more dual types than mono-types!

Aside from types, there are also 5 categories of pokemon: Regular, Pseudo-Legendary, Sub-Legendary, Legendary and Mythical. (There are of course also pre-evolutions, final evolutions, mega-evolutions etc.. but for the purposes of this analysis we will just bundle those together under 'regular' along with Pseudo-Legendary which are regular pokemon that have generally higher stats of 600 total.
As for Sub Legendaries, Legendaries and Mythical - these pokemon typically exhibit 2 types of traits:

  1. Rarity: There is usually only 1 of those pokemon available in every game (some may not even be obtainable in certain games)
  2. Stats: These pokemon generally have much higher stats than the average 'regular' pokemon.

Let's create a diverging bar to determine the rate at which legendary pokemon appear in each generation:

#Sub-Legendary, Legendary or Mythical:
df.loc[df["is_sllm"]==False,"sllmid"] = 0
df.loc[df["is_sllm"]==True,"sllmid"] = 1

# calculate proportion of SL, L, M #
sllm_ratio = df.groupby("Gen").mean()["sllmid"]
df_plot = pd.DataFrame(columns={"Gen","Rate","colors"})
x = sllm_ratio.values
df_plot["Gen"] = sllm_ratio.index
df_plot['Rate'] = (x - x.mean())/x.std()
df_plot['colors'] = ['red' if x < 0 else 'green' for x in df_plot['Rate']]
df_plot.sort_values('Rate', inplace=True)

plt.figure(figsize=(14, 10))
    y=df_plot.index, xmin=0, xmax=df_plot.Rate,

plt.gca().set(xlabel='Rate', ylabel='Gen')
plt.yticks(df_plot.index, df_plot.Gen, fontsize=12)
plt.title('Diverging Bars of SubL, Legendary & Mythical Rate', fontdict={'size':20})
sub, legend,myth

Seems like Gen 7's Alola region has a huge volume of these 'legendaries & mythical' pokemon, which after digging further into it makes perfect sense given the introduction of a plethora of legendaries called 'ultra beasts' which were only ever introduced in that generation.

Correlations & Descriptive Statistics

Let's move to explore some correlations between stats.

Base_stats = ['Primary','Secondary','Classification','%Male','%Female',

df_BS = df[Base_stats]

heatmap = sns.heatmap(df_BS.corr(), vmin=-1,vmax=1, annot=True, cmap='Blues')

heatmap.set_title('Correlation Base Stats Heatmap', fontdict={'fontsize':15}, pad=12)
p1 = sns.jointplot(x="SP_Attack",y="SP_Defense",data=df,kind="hex",color="lightgreen")
p1.fig.suptitle("Hex Plot of Special Attack and Special Defense - Some Correlation")
p2 = sns.jointplot(x="Defense",y="SP_Defense",data=df,kind="hex",color="lightblue")
p2.fig.suptitle("Hex Plot of Defense and Special Defense - Some Correlation")
p3 = sns.jointplot(x="SP_Attack",y="Speed",data=df,kind="hex",color="pink")
p3.fig.suptitle("Hex Plot of Special Attack and Speed - Some Correlation")
p4 = sns.jointplot(x="Attack",y="SP_Attack",data=df,kind="hex",color="orange")
p4.fig.suptitle("Hex Plot of Attack and Special Attack - Some Correlation")
p5 = sns.jointplot(x="Attack",y="Defense",data=df,kind="hex",color="purple")
p5.fig.suptitle("Hex Plot of Attack and Defense - Some Correlation")
from pandas import plotting
type1 = list(set(list(df['Primary'])))
cmap = plt.get_cmap('viridis')
colors = [cmap((type1.index(c) + 1) / (len(type1) + 2)) for c in df['Primary'].tolist()]
plotting.scatter_matrix(df.iloc[:, 13:18], figsize=(15, 15), color=colors, alpha=0.7) 
import numpy as np
pd.DataFrame(np.corrcoef(df.iloc[:, 13:18].T.values.tolist()), 
             columns=df.iloc[:, 13:18].columns, index=df.iloc[:, 13:18].columns)
corrplot values
labels = ["Defense", "Attack"]
dims = (11.7, 8.27) #a4
fig, ax = plt.subplots(figsize=dims)
Defhist = sns.distplot(df['Defense'],color='g', hist=True, ax=ax)
Atthist = sns.distplot(df['Attack'],color='r', hist=True, ax=ax)
Atthist.set(title='Distribution of Defense & Attack')
plt.legend(labels, loc="best")
FigHist = Atthist.get_figure()
fig, ax = plt.subplots(2, 3, figsize=(14, 8), sharey=True)

spines = ["top","right","left"]
for i, col in enumerate(["HP", "Attack", "Defense", "SP_Attack", "SP_Defense", "Speed"]):
    sns.kdeplot(x=col, data=df, label=col, ax=ax[i//3][i%3],
                fill=True, color='lightblue', linewidth=2
    ax[i//3][i%3].set_xlim(-5, 250)
    for s in spines:


Looking at the summary statistics, we can see that the assumption about the variance and skewness of both plots was correct. The ‘std’ metric of the Attack is less than Defense, meaning that Defense statistics are more spread. Similarly, the Sp.Atk ‘std’ is larger than that of the Sp.Def. Skewness is determined by the positions of the median (50%) and the mean. Since in all instances (Attack, Defense, Sp.Attack and Sp.Defense) the mean is greater than the median, it is emphasised that the distribution is right-skewed (positively skewed).

Principal Component Analysis (PCA)

Let's analyze 800+ Pokemon as principal components and plot them in a two-dimensional plane using the first and second principal components.
Principal component analysis (PCA) is a type of multivariate analysis method that is often used as a dimensionality reduction method.

In this data, the characteristics of 800+ Pokemon are represented by 6 types of "observed variables" (x1, x2, x3, x4, x5, x6).
These 6 variables are used as explanatory variables.
On the other hand, the synthetic variable synthesized by PCA is called "principal component score" and is given by a linear combination as shown in the following equation:


In principal component analysis, the larger the eigenvalue (= variance of the principal component score), the more important the principal component score is.
PCA is also sometimes regarded as a type of "unsupervised machine learning" and reveals the structure of the data itself. So let's start by importing PCA from Scikit-learn

from sklearn.decomposition import PCA
pca = PCA()
pca.fit(df.iloc[:, 13:18])
feature = pca.transform(df.iloc[:, 13:18])
plt.figure(figsize=(15, 15))
plt.scatter(feature[:, 0], feature[:, 1], alpha=0.8)
import matplotlib.ticker as ticker
import numpy as np
plt.plot([0] + list( np.cumsum(pca.explained_variance_ratio_)), "-o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative contribution ratio")

Let's see if we can determine what makes a 'legendary' pokemon

pca = PCA()
pca.fit(df.iloc[:, 13:18])
feature = pca.transform(df.iloc[:, 13:18])
plt.figure(figsize=(15, 15))
for binary in [True, False]:
    plt.scatter(feature[df['is_sllm'] == binary, 0], feature[df['is_sllm'] == binary, 1], alpha=0.8, label=binary)
plt.legend(loc = 'best')

Nice! Although it's not 'exact' we can clearly see that when the first principal component (PC1) reaches 50, we start to see a significantly higher concentration of legendary pokemon! Now, let's illustrate how much PC1 actually contributes to the explanatory variable (parameter) with a loading plot.


Assuming that the first principal component (PC1) is actually a strong indicator of whether or not a pokemon is classified as legendary, sub-legendary or mythical, it seems like Special Attack is the best indicator out of all stats (follow by Physical Attack)

In the second principal component (PC2), Defense and Speed contribute to the opposite: Positive & Negative.

"Factor Analysis" is a method that is similar to principal component analysis.

In PCA, we synthesized the "principal component" yPC1 which is a linear combination of the weight matrix (eigenvector) a for the explanatory variables. Here, define as many principal components as there are explanatory variables.

yPC1 = a1,1 x1 + a1,2 x2 + a1,3 x3 + a1,4 x4 + a1,5 + ...

In factor analysis, based on the idea that the explanatory variable (observed variable) x is synthesized from a latent variable called "factor", the factor score f, the weight matrix (factor load) w, and the unique factor e are specified. (There is no idea of ​​a unique factor in principal component analysis).

x1 = w1,1 f1 + w1,2 f2 + e1

x2 = w2,1 f1 + w2,2 f2 + e2

x3 = w3,1 f1 + w3,2 f2 + e3

x4 = w4,1 f1 + w4,2 f2 + e4

x5 = w5,1 f1 + w5,2 f2 + e5

x6 = w6,1 f1 + w6,2 f2 + e6

The factor score f is a latent variable unique to each individual (sample). The linear sum of the factor score and the factor load (w1,1 f1 + w1,2 f2, etc.) is called the "common factor" and can be observed as an "observed variable" by adding it to the "unique factor" e unique to the observed variable. It's a way of thinking. The number of factors is usually smaller than the explanatory variables and must be decided in advance.

(However, terms such as common factors and factors are very confusing because it seems that different people have different definitions as far as I can see)

from sklearn.decomposition import FactorAnalysis
fa = FactorAnalysis(n_components=2, max_iter=500)
factors = fa.fit_transform(df.iloc[:, 13:18])
plt.figure(figsize=(12, 12))
for binary in [True, False]:
    plt.scatter(factors[df['is_sllm'] == binary, 0], factors[df['is_sllm'] == binary, 1], alpha=0.8, label=binary)
plt.xlabel("Factor 1")
plt.ylabel("Factor 2")
plt.legend(loc = 'best')

In this instance, the determining factor of a 'legendary' is whether or not the sum of factor 1 and factor 2 exceeds a certain level, but it seems that it is slightly biased toward the larger factor 2. So which parameters do factor 2 and factor 1 allude to?

plt.figure(figsize=(8, 8))
for x, y, name in zip(fa.components_[0], fa.components_[1], df.columns[13:18]):
    plt.text(x, y, name)
plt.scatter(fa.components_[0], fa.components_[1])
plt.xlabel("Factor 1")
plt.ylabel("Factor 2")

Factor 1 highest value = "Defense"
Factor 2 highest value = "Special Attack"

Let's create some charts!

Firstly I created a dendrogram (dendro = greek word for tree :)) for all pokemon (Image file is way too large to display clearly)

dfs = df.iloc[:, 13:18].apply(lambda x: (x-x.mean())/x.std(), axis=0)
from scipy.cluster.hierarchy import linkage, dendrogram
result1 = linkage(dfs, 
                  metric = 'euclidean', 
                  method = 'average')
plt.figure(figsize=(15, 150))
dendrogram(result1, orientation='right', labels=list(df['Name']), color_threshold=2)
plt.title("Dedrogram of Pokemon")
def get_cluster_by_number(result, number):
    output_clusters = []
    x_result, y_result = result.shape
    n_clusters = x_result + 1
    cluster_id = x_result + 1
    father_of = {}
    x1 = []
    y1 = []
    x2 = []
    y2 = []
    for i in range(len(result) - 1):
        n1 = int(result[i][0])
        n2 = int(result[i][1])
        val = result[i][2]
        n_clusters -= 1
        if n_clusters >= number:
            father_of[n1] = cluster_id
            father_of[n2] = cluster_id

        cluster_id += 1

    cluster_dict = {}
    for n in range(x_result + 1):
        if n not in father_of:

        n2 = n
        m = False
        while n2 in father_of:
            m = father_of[n2]
            #print [n2, m]
            n2 = m

        if m not in cluster_dict:

    output_clusters += cluster_dict.values()

    output_cluster_id = 0
    output_cluster_ids = [0] * (x_result + 1)
    for cluster in sorted(output_clusters):
        for i in cluster:
            output_cluster_ids[i] = output_cluster_id
        output_cluster_id += 1

    return output_cluster_ids
clusterIDs = get_cluster_by_number(result1, 50)
plt.hist(clusterIDs, bins=50)

Here we've created a histogram of clusters of pokemon that exhibit similar traits with each other. Here we've created 50 bins so there will be 50 different clusters of pokemon. That's quite a large number of charts to display so I'll just display several so you get the idea.


Some pokemon exhibit lots of traits similar to each other while others (like Regieleki) stand out.

Cross Validation & Regression Analysis

Since we saw earlier that Special Attack is a huge contributing factor to determining whether a pokemon is classified as 'legendary', let's use the rest of the stats to see if we can predict Special Attack!

X = df.iloc[:, 13:18]
y = df['total']
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, y)

print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination= ", regr.score(X, y))
df.columns[[12, 13, 14, 16, 17]]
X = df.iloc[:, [12, 13, 14, 16, 17]]
y = df['SP_Attack']

Cross Validation

In machine learning, in order to evaluate performance, known data is divided into training and test data. Training (learning) is performed using training data to build a prediction model, and performance evaluation is performed based on how accurately the test data that was not used to build the prediction model can be predicted. Such an evaluation method is called "cross-validation".

Training data (60% of all data)
X_train: Explanatory variables for training data
y_train: Objective variable for training data
Test data (40% of all data)
X_test: Explanatory variable for test data
y_test: Objective variable for test data
We aim to learn the relationship between X_train and y_train and predict y_test from X_test. If the training data seems to show good performance, but the test data not used for training has poor performance, the model is said to be "overfitted".

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination(train)= ", regr.score(X_train, y_train))
print("Coefficient of Determination(test)= ", regr.score(X_test, y_test))
Regression Coefficient=  [ 0.15598049  0.09796333 -0.11115187  0.47986509  0.32513351]
Intercept=  5.4684249031776915
Coefficient of Determination(train)=  0.39594357153305826
Coefficient of Determination(test)=  0.38127048972638855

The above values change with each calculation because the division into training data and test data is random.
If you want to find a regression equation, you can do as above, but by standardizing the explanatory variables and objective variables and then regressing, you can find the "standard regression coefficient", which is an index of "importance of variables".

Xs = X.apply(lambda x: (x-x.mean())/x.std(), axis=0)
ys = list(pd.DataFrame(y).apply(lambda x: (x-x.mean())/x.std()).values.reshape(len(y),))
from sklearn import linear_model
regr = linear_model.LinearRegression()

regr.fit(Xs, ys)

print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination= ", regr.score(Xs, ys))
Regression Coefficient=  [ 0.152545    0.11255532 -0.09718819  0.40725508  0.28208903]
Intercept=  1.1730486200365748e-16
Coefficient of Determination=  0.3958130072204933
pd.DataFrame(regr.coef_, index=list(df.columns[[12, 13, 14, 16, 17]])).sort_values(0, ascending=False).style.bar(subset=[0])
sp attack prediction

It seems that Special Defense & Speed are very important in predicting "Special Attack"


Regression analysis, such as multiple regression analysis, uses numerical data as an explanatory variable and predicts numerical data as an objective variable. On the other hand, quantification type I predicts using non-numeric categorical data as an explanatory variable and numerical data as an objective variable. When the explanatory variables are a mixture of numerical data and categorical data, they are called extended quantification type I.

We saw that Special Attack is definitely a strong predictor for determining whether a pokemon is legendary or not - and we also saw that Special Defense & Speed are also important indicators of Special Attack Value.

Overall this was a way of exploring different pokemon traits and taking into account multiple factors. There's plenty more we can look into such as 'strengths', 'weaknesses' etc.. I hope you all enjoyed this, and thanks for reading all the way through!

GitHub - atnikola/pokemon-analysis at pythonawesome.com
Using Python to derive insights on particular Pokemon, Types, Generations, and Stats - GitHub - atnikola/pokemon-analysis at pythonawesome.com