Creating statistical neighbours comparator benchmarking models with Python

November 23, 2022 10 minute read

Composition of different conchs on beige table

Contents

This article will explore how to get started creating a statistical neighbours model to benchmark, compare and find similar observations within a dataset. This might be comparing the sales of a store, to only other stores that are statistically similar in terms of size, budget and staffing or comparing school attendance performance for a given area to only other areas of similar size, pupil numbers and other characteristics.

The main problem of comparator models is how to define what is considered statistically 'similar'. We will explore two approaches to solving this problem.

All of the data used in this article is not real data. It has been adapted and modified based upon real data sources for learning purposes.

Filtering approach

In this dummy dataset school_data.xlsx I adapted from two good open data sources Explore education statistics and Get Information about Schools there are around 1,800 schools but we only want to compare a school's attendance levels to it's top ten most statistically similar in terms of pupil size, alongside FSM and SEN characteristics.

School	Attendance%	Pupils	FSM	SEN	Phase	LocationID
SCHOOL-0001	98.2	63	5	6	PHASE-1	855
SCHOOL-0002	81	1229	257	72	PHASE-2	873
SCHOOL-0003	94.8	250	10	16	PHASE-1	891
SCHOOL-0004	94.5	653	78	89	PHASE-1	856
SCHOOL-0005	93.9	463	83	45	PHASE-1	866
SCHOOL-0006	94.2	918	156	131	PHASE-2	865
SCHOOL-0007	0	81	25	18	PHASE-2	888
SCHOOL-0008	91.4	195	83	29	PHASE-1	888
SCHOOL-0009	96.5	223	89	63	PHASE-1	888
SCHOOL-0010	92.5	719	253	130	PHASE-2	209
...	...	...	...	...	...	...

For each school, we will apply a series of filters to find it's top ten comparators in terms of both pupil size and characteristics like FSM and SEN.

"attendance_comparators.py"

"""
A model to identify school comparator's based on their size and 
characteristics in order to compare attendance performance.

Assumptions:
    
- Schools will only be compared to schools of the same phase type.
- Results will be the top ten statistically closest schools.
- The comparators will be based on attendance %.

Functionality:
    
- Ability to compare against schools of a similar size.
- Ability to compare against schools with similar characteristics
"""

import os
import time
import pandas as pd


def get_data() -> pd.DataFrame():
    """
    Reads the Excel dataset into a Pandas DataFrame and adds new features such 
    as %FSM and %SEN.
    """
    df = pd.read_excel("school_data.xlsx")
    df["Attendance%"] = df["Attendance%"] * 100
    df["%FSM"] = (df["FSM"] / df["Pupils"]) * 100
    df["%SEN"] = (df["SEN"] / df["Pupils"]) * 100

    return df


def generate_all_comparators(output_all_to_csv: bool = False) -> None:
    """
    Generates the top 10 comparators for every school in the dataset, for each
    of the 2 comparator groups (size, characteristics).
    
    Optionally saves the result to CSV files where the folder name is the name 
    of the school where output_all_to_csv is set to True.
    """
    df = get_data()
    df_length = len(df)
    comparator_mappings = []

    for index, row in df.iterrows():
        school_name = row["SchoolName"]

        similar_sized_comparators = find_similar_sized_comparators(
            school_name=school_name,
            df=df
        )

        similar_characteristics_comparators = find_similar_characteristics_comparators(
            school_name=school_name,
            df=df
        )

        add_comparators_to_mappings(
            comparators=similar_sized_comparators,
            mappings=comparator_mappings,
            school_name=school_name,
            grouping="Size"
        )

        add_comparators_to_mappings(
            comparators=similar_characteristics_comparators,
            mappings=comparator_mappings,
            school_name=school_name,
            grouping="Characteristics"
        )

        if output_all_to_csv:
            if not os.path.exists("output"):
                os.mkdir("output")

            school_name = school_name.replace("/", "")
            directory = f"output/{school_name}"

            if not os.path.exists(directory):
                os.mkdir(directory)

            similar_sized_comparators.drop(
                columns=["Unnamed: 0"], inplace=True)

            similar_characteristics_comparators.drop(
                columns=["Unnamed: 0"], inplace=True)

            similar_sized_comparators.to_csv(
                directory + "/similar_sized_comparators.csv",
                index=False
            )
            similar_characteristics_comparators.to_csv(
                directory + "/similar_characteristics_comparators.csv",
                index=False
            )

        print(f"{index + 1} of {df_length} done.")

    return pd.DataFrame.from_records(comparator_mappings)


def add_comparators_to_mappings(comparators,
                                mappings,
                                school_name,
                                grouping) -> None:
    """
    Builds the final output by adding all of the comparators from
    the size and characteristics DataFrames to the mapping list
    in JSON / dictionary format:
    
    [
     {
      "School": "A", 
      "Comparator": "B", 
      "Grouping": "Size"
     },
     {
      "School": "B", 
      "Comparator": "C", 
      "Grouping": "Characteristics"
     },
    ]
        
    Which results in the final output:
        
    School   Comparator  Grouping
    A        B           Size
    A        D           Size
    B        D           Characteristics
    
    Avoids adding the target school_name as it's own comparator.
    """
    for index, row in comparators.iterrows():
        comparator_school_name = row["SchoolName"]
        if comparator_school_name != school_name:
            mappings.append({
                "School": school_name,
                "Comparator": comparator_school_name,
                "Grouping": grouping
            })


def find_similar_sized_comparators(school_name: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    Finds schools of a similar size and returns as comparators.
    
    This comparator is calculated by the total number of pupils in each school,
    per organisation type. The groupings for each organisation type will be 
    calculated based on the highest and lowest pupil count for schools in that 
    category i.e. within a given % threshold
    """
    school = df[df["SchoolName"] == school_name]
    school_size = school["Pupils"].values[0]
    school_type = school["Phase"].values[0]
    schools_with_same_type = df[df["Phase"] == school_type]

    upper_size_threshold = school_size * 1.25
    lower_size_threshold = school_size * 0.75

    schools_of_similar_size = schools_with_same_type[
        (schools_with_same_type["Pupils"] >= lower_size_threshold) &
        (schools_with_same_type["Pupils"] <= upper_size_threshold)
    ].copy(deep=True)

    schools_of_similar_size["Size difference"] = (abs(
        schools_of_similar_size["Pupils"] -
        school_size
    ))

    schools_of_similar_size = schools_of_similar_size.nsmallest(
        11, "Size difference")

    schools_of_similar_size["Rank"] = (
        schools_of_similar_size["Attendance%"].rank(
            ascending=False
        )
    )

    return schools_of_similar_size


def find_similar_characteristics_comparators(school_name: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    Finds schools with similar %FSM and %SEN characteristics and returns as comparators.
    """
    school = df[df["SchoolName"] == school_name]
    school_type = school["Phase"].values[0]
    school_fsm_percentage = school["%FSM"].values[0]
    school_sen_percentage = school["%SEN"].values[0]
    schools_with_same_type = df[df["Phase"] == school_type]

    upper_fsm_threshold = school_fsm_percentage * 1.1
    lower_fsm_threshold = school_fsm_percentage * 0.9

    upper_sen_threshold = school_sen_percentage * 1.1
    lower_fsm_threshold = school_sen_percentage * 0.9

    schools_with_similar_characteristics = schools_with_same_type[
        (schools_with_same_type["%FSM"] >= lower_fsm_threshold) &
        (schools_with_same_type["%FSM"] <= upper_fsm_threshold) &
        (schools_with_same_type["%SEN"] >= lower_fsm_threshold) &
        (schools_with_same_type["%SEN"] <= upper_sen_threshold)
    ].copy(deep=True)

    schools_with_similar_characteristics["Characteristics difference"] = (
        abs(schools_with_similar_characteristics["%FSM"] - school_fsm_percentage) +
        abs(schools_with_similar_characteristics["%SEN"] -
            school_sen_percentage)
    )

    schools_with_similar_characteristics = schools_with_similar_characteristics.nsmallest(
        11,
        "Characteristics difference"
    )

    schools_with_similar_characteristics["Rank"] = (
        schools_with_similar_characteristics["Attendance%"].rank(
            ascending=False
        )
    )

    return schools_with_similar_characteristics


if __name__ == "__main__":
    start = time.time()

    output = generate_all_comparators(
        output_all_to_csv=True
    )

    output.to_csv("output/comparator-mappings.csv", index=False)

    end = time.time()

    print(f"Model finished in {round(end - start, 2)} seconds.")

If the output_all_to_csv flag is set to True then for each school a folder will be created in the output directory for it, containing all of it's comparators for both size and pupil characteristics. An example of one of these outputs for 'SCHOOL-005' can be seen in the image below.

Example of single comparator results output

Example of a single comparator result output showing top ten statistical neighbours

We can see within similar_characteristics_comparators.csv the %FSM and %SEN are within the upper and lower thresholds and within similar_size_comparators.csv Pupils are within the upper and lower thresholds. This shows the model is accurately filtering and ranking only those observations that fit inside these parameters.

Within the output directory, there is also the full list of comparators in the comparator-mappings.csv file.

All comparator mappings

The full list of all comparator mappings

If we also had columns for 'Easting' and 'Northing' for these schools, we could also add another filter to find the top ten geospatially closest schools.

"attendance_comparators.py"

from scipy.spatial import distance

def find_similar_location_comparators(school_name: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    Finds schools which are geospatially closest and returns as comparators.
    """ 
    school = df[df["School"] == school_name]
    school_location_id = school["LocationID"].values[0]
    school_type = school["Phase"].values[0]
    schools_with_same_type = df[df["Phase"] == school_type]
    school_easting = school["Easting"].values[0]
    school_northing = school["Northing"].values[0]

    location_data_available = (
        (school_easting != 0) & 
        (school_northing != 0)
    )
    
    if location_data_available:
        geo_comparators = schools_with_same_type \
            .copy(deep=True) \
            .reset_index()
        
        distances = []

        for _, row in geo_comparators.iterrows():
            a = (school_easting, school_northing)
            b = (row["Easting"], row["Northing"])
            distances.append(
                distance.euclidean(a, b)
            )

        geo_comparators["distance"] = pd.Series(distances)
        geo_comparators = geo_comparators[
            geo_comparators["Phase"] == school_type
        ]
        geo_comparators = geo_comparators.sort_values(
            by="distance", 
            ascending=True
        )
        geo_comparators = geo_comparators.head(11)
        
        return geo_comparators

    schools_in_same_area = schools_with_same_type[
        (schools_with_same_type["LocationID"] == school_location_id)
    ].copy(deep=True)
        
    if len(schools_in_same_area) <= 11:
        return schools_in_same_area
    
    
    sample =  schools_in_same_area.sample(n=10)
    sample = sample.append(school)
    
    return sample

Scoring approach

In the next example, our dummy dataset la_data.csv (adapted from a dataset taken from the ONS) is at Local Authority (area) level.

Local Authority District code (2019)	Local Authority District name (2019)	Profile	Rural-urban classification	Deprivation gap (percentage points)	Deprivation gap %	Deprivation gap ranking	Moran's I	Moran's I ranking	Income deprivation rate	Income deprivation rate ranking	Income deprivation rate quintile	% of households with 3 or more children	School pupils	School attendance %	Schools total spending £	School spend per pupil £	School Free School Meal %
E07000223	Adur	n-shape	Urban with City and Town	21.70%	21.70	233	0.17	234	10.80%	158	3	10	37437	76	307104	8.20	28.70
E07000026	Allerdale	Flat	Mainly Rural (rural including hub towns >=80%)	36.60%	36.60	95	0.29	157	12.10%	130	3	16	40461	69	869572	21.49	43.60
E07000032	Amber Valley	n-shape	Urban with Minor Conurbation	32.90%	32.90	121	0.29	157	10.90%	153	3	6	22981	44	652505	28.39	39.90
E07000224	Arun	n-shape	Urban with City and Town	28.70%	28.70	164	0.31	139	10.40%	171	3	25	34449	64	437529	12.70	35.70
E07000170	Ashfield	More income deprived	Urban with City and Town	36.00%	36.00	98	0.15	246	15.20%	72	2	11	9366	50	770050	82.22	43.00
E07000105	Ashford	n-shape	Urban with Significant Rural (rural including hub towns 26-49%)	29.10%	29.10	160	0.34	116	11.00%	150	3	26	38834	71	613225	15.79	36.10
E07000004	Aylesbury Vale	Less income deprived	Largely Rural (rural including hub towns 50-79%)	19.60%	19.60	264	0.47	55	6.70%	272	5	22	38433	56	609848	15.87	26.60
E07000200	Babergh	Less income deprived	Mainly Rural (rural including hub towns >=80%)	16.90%	16.90	280	0.17	234	8.00%	232	4	21	48694	53	146570	3.01	23.90
E09000002	Barking and Dagenham	More income deprived	Urban with Major Conurbation	25.40%	25.40	195	0.27	175	19.40%	20	1	21	36548	89	326135	8.92	32.40
E09000003	Barnet	n-shape	Urban with Major Conurbation	31.90%	31.90	132	0.36	105	11.10%	148	3	15	48851	33	448473	9.18	38.90

We want to compare a Local Authority area to only other statistically similar areas, but not just on one factor, but many (or all) numeric factors available and score them in terms of 'closeness'. This will find the top ten closest neighbours for comparisons and benchmarking.

statistical_neighbours.py

import pandas as pd


def find_statistical_neighbours_for(local_authority_district_code: str) -> pd.DataFrame:
    df = pd.read_csv(
        filepath_or_buffer="la_data.csv",
        encoding="cp1252"
    )
    
    df["Comparator score"] = 0
    df["Comparator variables"] = ""
    
    target_la = df.loc[
        (df["Local Authority District code (2019)"] == local_authority_district_code)
    ]
    
    comparison_variables = {
       "Deprivation gap %": 1, 
       "Deprivation gap ranking": 1, 
       "Moran's I ranking": 1, 
       "Income deprivation %": 1,
       "Income deprivation rate ranking": 1, 
       "% of households with 3 or more children ": 1, 
       "School pupils": 2, 
       "School Free School Meal %": 2
    }
    
    # compare the comparator variables for each LA against the target LA and score them
    for index, row in df.iterrows():
        is_target_la = (
            row["Local Authority District code (2019)"] == local_authority_district_code
        )
        
        if is_target_la:
            continue
            
        for variable in comparison_variables:
            if variables_are_statistically_similar(target_la[variable].values[0], row[variable]):  
                df.loc[index, 'Comparator score'] = (
                    df.loc[index, 'Comparator score'] + comparison_variables[variable]
                )

                df.loc[index, 'Comparator variables'] = (
                    df.loc[index, 'Comparator variables'] + variable + ", "
                )
        
                
    return(df.nlargest(10, "Comparator score").append(target_la))


def variables_are_statistically_similar(target: float, comparator: float) -> bool:
    upper_bound = target * 1.10
    lower_bound = target * 0.90
    
    comparator_is_within_range = (
        comparator > lower_bound and comparator < upper_bound
    )
    
    return comparator_is_within_range


def print_attendance_comparisons(df: pd.DataFrame) -> None:
    target_la = df.iloc[-1]
    df = df[: -1]
    
    la_name = target_la["Local Authority District name (2019)"]
    la_school_attendance_percentage = target_la["School attendance %"]
    
    average_comparator_attendance_percentage = df["School attendance %"].mean()
    
    print("The average school attendance percentage of your comparator LAs was ", end="")
    print(f"{average_comparator_attendance_percentage}%", end="\n")
    print(f"School attendance in {la_name} was {la_school_attendance_percentage}%", end="\n")
    
    attendance_percentage_difference = (
        la_school_attendance_percentage - average_comparator_attendance_percentage
    )
    attendance_percentage_difference = round(abs(attendance_percentage_difference), 2)

    if la_school_attendance_percentage < average_comparator_attendance_percentage:
        print(
            f"This is {attendance_percentage_difference} " 
            f"percentage points lower than your comparator LAs"
        )
    else:
        print(
            f"This is {attendance_percentage_difference} " 
            f"percentage points higher than your comparator LAs"
        )
    

def print_spending_comparisons(df: pd.DataFrame) -> None:
    target_la = df.iloc[-1]
    df = df[: -1]
    
    la_name = target_la["Local Authority District name (2019)"]
    la_school_spending = target_la["Schools total spending £"]
    
    average_comparator_spending = df["Schools total spending £"].mean()
    
    print("", end="\n\n")
    
    print("The average school spending of your comparator LAs was ", end="")
    print(f"£{average_comparator_spending}", end="\n")
    print(f"School spending in {la_name} was £{la_school_spending}", end="\n")
    
    spending_difference = (
        la_school_spending - average_comparator_spending
    )
    spending_difference = round(abs(spending_difference), 2)
    
    if la_school_spending < average_comparator_spending:
        print(f"This is £{spending_difference} lower than your comparator LAs")
    else:
        print(f"This is £{spending_difference} higher than your comparator LAs")
        
    
if __name__ == "__main__":
    comparators = find_statistical_neighbours_for("E07000150")
    
    print_attendance_comparisons(comparators)
    print_spending_comparisons(comparators)
    
    html_file = open("index.html", "w")
    html_file.write(comparators.to_html())
    html_file.close()

TERMINAL

Python

user@ShedloadOfCode:~$ statistical_neighbours.py

The scoring model works by first assigning weights in the dictionary comparison_variables. Then later will check each of these to see if the variables_are_statistically_similar() against the target Local Authority, and if so, increment the score by the weight for each.

The scoring model then first prints some summary information to the console such as comparisons between the target Local Authority's average attendance and average spending against their comparator Local Authorities. It then outputs the comparators for the target Local Authority to a HTML file 'output.html' to see which has the highest score.

Scoring model comparators output

The output could be made to look a little nicer with some styling via CSS, but it clearly shows that across all of the comparison variables which are the 'closest' and even has a column 'Comparator variables' to show which variables were the ones driving those scores.

The target Local Authority (in this example Corby) is at the bottom of the table to refer back to. Go ahead and try plugging in different Local Authority District Codes to the find_statistical_neighbours_for(local_authority_district_code: str) function to see how it performs!

What we learned

We have covered using both filtering and scoring approaches to solving statistical neighbour problems. You can now apply these models to other problems in different domains. It is a very useful ability to only compare to other observations that are statistically similar - it makes the comparison analysis more tailored and as a result the conclusions and decisions are more relevant.

Much better to compare and benchmark observations against those with similar characteristics, else you may end up making decisions that don't really apply to the school, local authority, store, or anything else the observation may be!

I did use mostly an iterative approach whilst putting these solutions together, like looping over DataFrame rows for example. If you can think of more efficient ways to solve these statistical neighbour problems for larger datasets or have any other comparator techniques you would like to share, please post a comment in the comment section below!

As always, if you liked this article please check out other articles on the site.

Building an AutoTrader scraper with Python to search for multiple makes and models

Six tips for producing and assuring high quality analytical code