Don’t worry, nothing deep here, just how we can use our soon-to-be relational database to help you sanitise the data you’ve just loaded into it – Go back to Part 13 if this a) your first time here and b) of some interest.
So we left last time with you successfully adding your hard-won FM Squad data into your brand-spanking new SQL Database. Huzzah. Now lets use the power of both SQL/Relational Database Logic and all our combined consciousnesses to Transform (Its that ETL thing again) this data into something we can use and re-use to analyse everything we can about our team.
Side Note – I do this for a living, so Nerdyness can really pay 🙂
Lets us look at our data. I did the hard work last time by giving you the new column names for the extract. Go into SQL Server Management Studio (Hitherto known as SSMS), to your new database, go to your table (All these selections are via the + signs) then press the + against Columns at the top. You will see the full list of columns.
Note the first word before the first underscore: training, stats, disp, ctrct, attrib, player
I see these as the logical groupings of the data at hand.
- Player – the player level data – name, height, weight, date of birth – things that don’t change or change slowly ( https://en.wikipedia.org/wiki/Slowly_changing_dimension )
- Attributes – Our bread and butter, Acceleration, Pace, Passing, all that good stuff.
- Stats – Will become more important after more games – Minutes played, Dribbles, Passes, Goals. Note here we haven’t taken any ratio data (Passes per 90 minutes, Passes Completed % etc.) as they can all be computed from the other data we have (Passes & Minutes played/90, Completed Passes/Attempted Passes) – This makes the table as efficient as it can be, and is one of the tenets of relational database logical design ( https://en.wikipedia.org/wiki/Database_normalization#Normal_forms )
- Contract – So Contract type, start and end dates, fees, amounts and bonuses. All very important to running a streamlined operation from a financial perspective. One downside here is I can’t find number of unused substitue appearences in the data anywhere – its seems to be the main missing piece that I would like to see in FM20 for fullness.
- Training – focuses, sharpness, traits – not too much data but worth having (All data is worth having, watch The Great Hack ( https://www.youtube.com/watch?v=iX8GxLP1FHo ) if you think otherwise………
- Injuries – Injury risks, descriptions if current etc.
So what we have now are 6 logical groups for which we can make 6 tables. But what could relate all these tables?
We could use what is known as the Natural Key ( https://en.wikipedia.org/wiki/Natural_key ) such as name and date of birth, but that might not be unique, (Especially with lots of players in the database with DoB of 1/1/xx) meaning you could join your data incorrectly. Luckily we have the Unique ID of the player in the dataset. We also could add the date in game (1st Sep 2018 in our case currently) as we will be making multiple loads of data. The Current Team is also useful (But doesn’t work if you are interested in players away on loan as this will be different). Lastly there is the concept of a batch ( https://en.wikipedia.org/wiki/Batch_processing ) – we can create a table called dbo.batch_t, that on every load adds a row to it with a Primary Key ( https://en.wikipedia.org/wiki/Primary_key ) , an integer that increments by 1 on every run. In that way you can join the 6 tables above and know you have uniqueness. This also makes it a damn sight easier to compare data across different runs.
When we created our stage table to dump the extract data into Visual Studio made the Table Create script for us – We can also do the same thing in SSMS. This is the Script for making the Player table.
squad_player_id – this is the Primary Key as mentioned – This makes every row in this table unique – An integer starting at 1 incrementing up by 1 on every row added to it (IDENTITY (1,1)) – NOT NULL means that there must always be data in this column/attribute ( https://en.wikipedia.org/wiki/Null_(SQL) ) – This can be an interesting concept to the uninitiated,
batch_id – As discussed this will be the Primary Key on the to-be-made batch_t table and will separate loads – Each row for this load in the dbo.squad_player_t table will have a batch_id of 1, next load of the data from 1st October will be 2 and so on
game_dt – as discussed the in-game date – We are only interested in the date so the data-type ( https://www.w3schools.com/sql/sql_datatypes.asp ) of date is fine – If we wanted the time as well this would need to be datetime
player_unique_id – the Unique Key of the data, as this could start with a zero its classed as a text string, so we are leaving this as NVARCHAR(255) N means Unicode, as discussed on earlier posts this allows non-latin characters to be stored as it allows slightly more memory per character, VARCHAR means text that can grow or shrink depending on how long the data is (CHAR(x) would always be x characters long, but if these characters are used it would be padded with blank space – this is used where a field is updated a lot – it is more efficient in the way the data is stored on disk, but is outside of this piece and is causing me to waffle more than I normally would…..
Other interesting datatypes on the above are tinyint (allows any number between 0 and 255, good for Age, not 100% for height in centimeters though……..
We also use decimal(20,5) – Means 20 digits long in total, 5 significant figures after the decimal point – Though this can be tightened up – for player value this could be 11,2 (11 long in total, 2 decimal places, allows up to 999,999,999.99 – I don’t think Neymar will get quite that far, though there are some crazy release fees now).
Note also the end of the column/attribute names – these denote their data type usage. _id is exactly that – and identity field – a surrogate key that can be used to connect tables up or create uniqueness in a table ( https://en.wikipedia.org/wiki/Surrogate_key ).
_dsc means description, so a text description, e.g. player_personality_description.
_ind means indicator – usually a Y/N but can be 1/0 to mean the same thing
There are a few others that should be pretty self explanatory
We have a good few _id fields in the above. Ones involving player position information are the ones we will be looking at now.
Go back into SSMS, and run this query:
select [ player_unique_ID ], [ player_nm ], [ player_position_cd ],[ player_secondary_position_cd ] FROM [dbo].[stg_all_data_t]
When you specify the table column names (You can either drag them into the query from the column list for the table or Intellisense (a code helper inbuilt into SSMS) can suggest columns based on typing a few letters) it means you only bring them back in the query result
In FM you have the player position code (all the positions the player can play – corresponds to the lime green, dark green and orange dots) and the secondary position code (any positions that aren’t the players best position) – as such you can work out their primary position from the difference between the 2.
Lets take Murray McCulloch in row 18 above as an example. His Position Code is D/WB(R), M(RC). His Secondary Position Code is D/WB/M(R). The missing code is M(C), meaning that is his most natural position.
Lime Green denotes the most natural position above. Where the Secondary Position is a – this means there is no secondary position – Take Chris Barras (Line 3) as a case in point
He has no secondary position – AM(RLC) is his position, and he is natural at all of them. We don’t however see M(C) anywhere in his data, though we do have the Assistant Manager’s view of his best position in the Player_AM_best_pos_cd column (AM(L))
Positions in my mind are Atomic – Indivisible – You can play a different role and duty but D(C) isn’t divided into Right-Central Defender and Left-Central Defender in the game. As such I can streamline the way we look at positions and the way we have them in the database.
There are 17 positions code that only relate to a single position (if we ignore sidedness). They are:
AM (C) – Attacking Midfielder
AM (L) – Wide Attacking Midfielder
AM (R) – Wide Attacking Midfielder
AM (RL) – Wide Attacking Midfielder
D (C) – Central Defender
D (L) – Fullback
D (R) – Fullback
D (RL) – Fullback
DM – Defensive Midfielder
GK – Goalkeeper
M (C) – Midfielder
M (L) – Wide Midfielder
M (R) – Wide Midfielder
M (RL) – Wide Midfielder
ST (C) – Striker
WB (L) – Wide Defensive Midfielder (Wingback)
WB (R) – Wide Defensive Midfielder (Wingback)
Everything else means more than 1 position – as such they not atomic, and would need more than 1 row in a table to describe them. AM(RLC) for instance means that a player can play as either a Central or Wide Attacking Midfielder. I have compiled 77 different position codes so far (There will be more to add) relating to 161 different playing positions. I’ve added them to a table called dbo.position_name_t
So AM(RLC) is in the table twice – Lines 9 & 13 – Once as the Attacking Midfield position, Once as a Wide Attacking Midfield position. Some of the more complicated codes can be in there many times. I currently have 2 codes with 4 positions, M (RLC), AM (RL), ST (C) & D/WB (R), M (RC). In this way you can at least look at the positions individually, basically making the system as simple as you can, making analysis down the line easier again.
Back to the title of this post – Now you have all the codes in a table with a unique id (pos_nm_id in this case) we can use this id when sanitising the data rather than the full code all them time. We have created a reference table (dbo.position_name_t) that is used to store the full position codes, and we can use the pos_nm_id in what will be the dbo.squad_player_t table – the table with Player-specific information we will load from the initial dbo.stg_all_data table.
We can also do the same with roles and duties (They are a bit more simple – There are 109 position-role-duty combinations if you ignore sided-ness) – a selection of these are below:
I’ve added these to a table called [dbo].[position_role_duty_t] that you see above.
We also have somewhere else we can sanitise things – Attributes
When these come out of FM they are just a number, but in reality a single attribute has a type (Technical, Mental, Physical, Goalkeeping) and either are an Outfield attribute, Goalkeeping attribute or Both, In this way we can make an attribute table, a selection of these are below.
So we have positions, roles, duties and attributes sanitised. Each position/role_duty has a set of attributes that are either key to the role or supplemental. We can create a table that stores these relationships so we can use them with the attribute numbers themselves to show how a player is suited to a position/role/duty. I called this dbo.pol_role_duty_attributes_t. The Create script for this is below.
So we have the table primary key id (pos_role_duty_attrib_id), 2 Foreign Keys ( https://en.wikipedia.org/wiki/Foreign_key ) relating to the Primary Keys on the attributes (attrib_id) and position role_duty (pos_role_duty_id) and an importance Indicator (K for Key, P for Partial) that we can load using the info in FM (You can see what is Key and what is not.
So the greenish background against the attributes is a Key attribute for the Role/Duty (So in this instance Winger Support), bluish are partial/supplemental. So Key are:
Crossing, Technique, Off The Ball, Acceleration, Pace
Dribbling, First Touch, Passing, Work Rate, Agility, Stamina
As such Chris has a score of 42 for Key attributes against 5 attributes (8.4/20 average), 53 for Supplemental attributes out of 6 attributes (8.83/20) and 95 for the 11 attributes across both Key and Supplemental (8.63/20). In this way using the attribute/position_role_duty joining table we can automate working out a players scores against every position…………….
So we have a database with a number of tables – We now need to try and Load (the L in ETL) this data into the end tables (dbo.squad_x naming convention). The current state of play of my Database (We’ve not been through all the tables) is below:
You can see the links between the tables as the lines. A link to the Create Database script is here:
You can open this in SSMS, run it and it will create this database for you with one click of Execute/Press of F5, plus load all the data as of this point………..
Next time we will look at loading the 6 dbo.squad_x tables via a stored procedure ( https://en.wikipedia.org/wiki/Stored_procedure ) – If you’ve gotten this far good on you!!!!!!