CAUTION! This will be pretty dry fodder, but if you have any interest in databases, data and how to get the latter into the former this might be useful to you.
Extract, Transform & Load – You may see ETL in a lot of job descriptions in the data space, but at the end of the day it is just get some data, transform it into a format that you can work with & load it into a target, in this case SQL Server Database.
Our priority one is setting up a database. By now you will hopefully have downloaded SQL Server 2017/2019/20xx and are in a position to create a database – If you haven’t go back to the last post. This bit is pretty simple. Log in to the default server you’ve created on your machine, then left click on Databases and Click New Database.
Just give the Database a logical name – the one I’m using is called FM_Soccermetrics – and click OK – You will now be the owner of a pristine, but blank, database
So we have a database to put our data into, but how will we do that. That’s where our Visual Studio and SQL Server Data Tools installations will come in handy. Open Visual Studio (in my case 2017) and click the Create new project link in the middle(ish) of the screen.
In the window that opens you should have Business Intelligence as a clickable on the left hand Menu – click that and select the Integration Services Project (If you see SSIS anywhere that’s what this is) at the top – give it a logical name and click OK. If you don’t see Business Intelligence as an option on the left hand menu (or Analysis Services or Reporting Services) you need to make sure SSDT (SQL Server Data Tools) is installed for VS 2017/whatever version of Visual Studio you are using.
You should now have a blank project open, with an SSIS Package just called Package.dtsx – Rename that using right click – Rename – to something more logical for you (I’ve renamed mine PPFM_JM for PearceyPlays Football Manager – Journeyman). Next step is understanding the interface. the SSIS Toolbox should be open on the left hand side, and everything you need for this task is in there. First up click on data Flow Task and drag it into your main blank window. This is the task where you can move the extracted data we created in the last post. First up we will look at moving the csv data I created.
Click on the Data Flow Task – This will give you another blank window. First up you need to make a connection for the csv file we created. to do that right click in the connection managers window 20% up the screen and click New Flat File Connection.
Give it a descriptive name, Browse where you saved your csv file (You will need to change your file type from text files to csv files when you browse), make sure you set a Text Qualifier (Just under Format – Delimited) to a Double Quote (Stops issues parsing the data out when you have commas in the data itself), and then click Columns on the left hand side.
All the columns and data from your csv file you produced from the squad view should be in the preview window (Note I have 23 players in my First Team Squad, so there are 24 rows including the header row). Next Click Advanced under Columns.
Select all the columns in the column name window by clicking one one and pressing Ctrl + A (they turn blue as per the screenshot) – Change Data Type to Unicode String (This will allow for any non-latin text characters to come through with no issues) and change the width to 255 from 50 (50 characters is not long enough for some of the fields, I use 255 as that is the maximum you can get out of Excel, so I always use that as a default, you can use whatever you think is a big enough number for the data). Then Click OK – Your csv connection is now set up.
You will see your csv connection in your Connection Managers. Next you need to create a connection to your SQL Server Instance. Right Click again in the Connection Managers Window and click on New OLEDB Connection ( https://en.wikipedia.org/wiki/OLE_DB ) – Then click New. You will need to get your SQL Server Name you have created your new blank Database on, and you will need to paste it into the Server Name selector.
Select your Database, test the connection (Should be successfully, if not follow the error message) and click OK – Your OLEDB Connection (Connecting to your SQL Database) should now be in your Configure OLE DB Connection Manager Window. Select it, Click OK and it should now be in your main Connection Managers Window ready to use in the Data Flow Task. You can rename it (It defaults to Server.Database) if you wish by right clicking on it – rename.
You should be back in the Data Flow task blank window. We firstly need the Flat File Source (All text files like this are considered to be ‘Flat’ Files – https://en.wikipedia.org/wiki/Flat-file_database ) from the Other Sources Menu at the bottom of the SSIS Toolbox on the left of the screen. Click and drag that into you data flow task blank window, then click in the task itself (It can be good to rename these generic tasks so you can see what is happening in the task – I’m calling this csv Source). The Flat File Source Editor will come up.
Make sure Retain Null Values is ticked, then click columns on the left. You will get an error.
The problem with the headers coming from the file (So from Football Manager itself) is that there is the same header twice in the extract, and SSIS doesn’t like that as you can’t have the same header name twice in the same database table (Where this data will reside at some point) – As such I take you back to the original csv file I created in Excel. I have created unique headers for all the columns (All 291 of them) with better naming conventions so it will make our lives easier when the data is in the database. These headers are in the below code block – You can take these and replace the original headers in Notepad by Copy and Paste.
Once you replaced the headers in the original file in Notepad there are a few little bits to do – go back to your csv Connection in the Connection Managers Window – remove it, and redo the flat file connection setup we went through earlier. Done that? Go back to the Data Flow Source task, click in it and now click columns – as your column headers are now unique you get no errors:
This is now ready for the next step – click OK
We now need to drag a OLE DB Destination task onto the pane – If you don’t see Other Destinations in the SSIS Toolbox scroll down until you do, click on the OLE DB Destination and drag it over – rename it (I usually rename this to the table I intend to load – we haven’t got one yet but I’ll call it dbo_stg_all_data – dbo means Database Owner and is a schema ( https://en.wikipedia.org/wiki/Database_schema ) and is used for permissions to certain tables normally – stg_ means Stage – its usual to call a table that is taking extract data in without any transforms a Stage table, but _raw is also used. Take the Blue line from the bottom of the csv Source task and pull it until it joins the new destination task you have – this is the data flow from the source to the destination. Click in the Destination task.
We don’t have a table yet, so click New next to the blank Name of the Table window. A Table Creation script with all the columns from the source is created.
Change the underscore between the dbo and stg in the CREATE TABLE top line to a full stop/period – Table names in SQL are in <server>.<database>.<schema>.<table>, but we know what server and database this is on from the OLE DB Connection so we only need the Schema (dbo) and the name – also add an _t to the end of the table name – this denotes it as a table and is best practice. Click OK, then click Mappings in the left hand window.
All the columns should be mapped from the Input (csv source) to the Destination (database table we just made) automatically as they all have the same names. We are good to go. Click OK. neither of the two boxes should have a Red X in them (If you do you need move your cursor over them to see what is wrong and debug it – outside the scope of this already very long post). Click in Control Flow at the top left of this pane and you should just see the Data Flow task in your window now. Right Click in that and Click Execute Task. Your computer should now be loading the data from the file to the database. the background should get some white horizontal lines on it and the Data Flow Task should have a Yellow timer on it, quickly changing to a green tick. This means you have been successful 🙂
Click in the Data Flow task and you should see the 2 windows, source and Destination again, both with green ticks plus the number of rows of data moved between both (In my case 23 rows of data, the header isn’t counted). To exit out of this execute state in Visual Studio go to the top of the screen – Debug – Stop Debugging. Time to go and have a look at the Database.
Press the + sign against the Database you’ve created earlier, then the + against tables, and lo-and-behold your table you just created in Visual Studio should be there. We need to check the data in the table. Right Click on your database Name and select new Query, then type this in:
SELECT * FROM [dbo].[dbo.stg_all_data_t]
SELECT is the way to look at data in SQL – * means everything – so this means Show me everything that is in the table I’ve just created and loaded for the first time. Press F5 or click on Execute near the top of the screen and you should see the data in teh Results window in the bottom half of the screen.
SUCCESS!!!!!!! We have our precious player data in the database ready for all the analysis we can think of. However this data isn’t structured. Next time I will take you through how we can structure this data (The Transform in ETL) to make our lives a lot easier down the road. Well done for getting to this point in the post, its to your credit 🙂