Introduction
I was exploring the Covid-19 data provided by Johns Hopkins University and, as usual, I stumbled over several data issues. Therefore I thought I could share some scripts and ideas in a post.
The downloaded files cover the timeframe between the 22nd of January and the 10th of June 2020 and reflect the number of confirmed cases, deaths and recoveries per day, country and state. The data seem to be updated on a daily basis. Unfortunately, the data are spread over several files (one for each indicator), which makes their consumption more difficult than expected, though the challenges are minor. For the beginning I downloaded the following files from the above link:
time_series_covid19_recovered_global_narrow.csv
time_series_covid19_confirmed_global_narrow.csv
time_series_covid19_deaths_global_narrow.csv
Before attempting anything with the files, it's recommended to look over them to check whether column names are provided and properly named, how the columns and rows are delimited, and whether anything else stands out during a first review. For example, I needed to delete the second line from each file.
Data Loading
When starting this kind of project, it's useful to check first the project's feasibility in terms of whether the data are usable. Therefore, I usually import the data first via the 'Import Data' wizard and decide later whether it makes sense to build a project for this. Right-click on the database into which you'd like to import the data, then from 'Tasks' choose 'Import Data' to start the wizard:
As data source we consider the first data file, therefore in the 'Choose a Data Source' step browse for the file:
The 'Locale' might appear differently for you, so leave the default value; however, make sure that the remaining values match the above screenshot. Look over the 'Columns' section to check whether the formatting was applied correctly. The preview offers a first overview of the data:
It's useful to review the default data types defined by the wizard for each column within the 'Advanced' section. For example, the first two fields could need more than the default of 50 characters. The length or the data type can be modified as needed:
In theory one can attempt to get all the column definitions right from the beginning, though for a first attempt this is less important. In addition, without knowing the data's definition, there's always the possibility for something to go wrong. Personally, I prefer loading the data as text and doing the needed conversions later, if necessary.
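For orientation, the table created by the wizard when everything is imported as text might look roughly like below. The column names match the ones used in the later scripts, while the lengths are only assumptions and can be adjusted in the 'Advanced' section:

-- possible table definition when all columns are imported as text (the lengths are assumptions)
CREATE TABLE [dbo].[time_series_covid19_recovered_global_narrow](
  [Province State] nvarchar(100) NULL
, [Country Region] nvarchar(100) NULL
, [Lat] nvarchar(50) NULL
, [Long] nvarchar(50) NULL
, [Date] nvarchar(50) NULL
, [Value] nvarchar(50) NULL
, [ISO 3166-1 Alpha 3-Codes] nvarchar(50) NULL
, [Region Code] nvarchar(50) NULL
, [Sub-region Code] nvarchar(50) NULL
, [Intermediate Region Code] nvarchar(50) NULL
)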
In the next step one can define the 'Destination', the database into which the data will be loaded:
After defining the source and destination, one needs to define the mapping between the two. One can go with the table definition provided by the wizard or modify the table's name directly in the wizard:
By clicking on 'Edit Mappings' one can review the mappings. As it's a one-to-one import, one can in theory skip this step. However, if there are already data in the table, one can delete the rows from the destination table or append the new rows directly to it. Here the first radio button is selected, as the table will be created as well:
With this being done, one can run the package as it is - just click 'Next':
If everything is ok, each step of the package appears in green and, at the end, the number of imported records is shown.
It may look like many steps, though once one gets used to the wizard, the data are loaded in less than 5 minutes.
Data Discovery
Once the data are loaded, it's time for data discovery - looking at the structure of the data and trying to understand their meaning. Before further using the data it's important to identify the attributes which uniquely identify a record within the dataset - in this case the combination of State, Country and Date. At least the combination should be unique; however, the second query below returned duplicates for the country 'Korea', respectively for 'Sint Eustatius and Saba', which seems to be a region within the 'Netherlands'.
-- looking at the data
SELECT top 1000 *
FROM [dbo].[time_series_covid19_recovered_global_narrow]

-- checking for duplicates
SELECT [Province State]
, [Country Region]
, Date
, count(*) NoRecords
FROM [dbo].[time_series_covid19_recovered_global_narrow]
GROUP BY [Province State]
, [Country Region]
, Date
HAVING count(*)>1

-- checking for distinct values
SELECT [Country Region]
, count(*) NoRecords
, SUM(TRY_CAST(Value as int)) NoCases
FROM [dbo].[time_series_covid19_recovered_global_narrow]
GROUP BY [Country Region]
HAVING count(*)>1
ORDER BY [Country Region]
The two duplicates are caused by the fact that a comma was used within a country's, respectively a province's name, while the comma also serves as column delimiter. (Therefore it's better to use a sign like "|" as delimiter, as the chances are small for it to appear within the values. Another solution would be to quote the alphanumeric values.) Fortunately, the problem can be easily fixed with an UPDATE which uses the dbo.CutLeft, respectively dbo.CutRight functions defined in a previous post. The functions need to be created within the same database before running the scripts.
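In case the referred post is not at hand, a minimal sketch of the two functions could look like below. It assumes that CutLeft returns the text left of the first occurrence of the delimiter, CutRight the text right of it, and that the third parameter controls whether the delimiter itself is kept - the original definitions from the previous post remain the reference:

-- minimal sketch of the two helper functions (assumptions as described above)
CREATE FUNCTION dbo.CutLeft(
  @string nvarchar(max)
, @delimiter nvarchar(10)
, @keepDelimiter bit)
RETURNS nvarchar(max)
AS
BEGIN
	-- text left of the first occurrence of the delimiter
	RETURN CASE
		WHEN CharIndex(@delimiter, @string) > 0
			THEN Left(@string, CharIndex(@delimiter, @string) - CASE WHEN @keepDelimiter = 1 THEN 0 ELSE 1 END)
		ELSE @string
	END
END
GO

CREATE FUNCTION dbo.CutRight(
  @string nvarchar(max)
, @delimiter nvarchar(10)
, @keepDelimiter bit)
RETURNS nvarchar(max)
AS
BEGIN
	-- text right of the first occurrence of the delimiter
	RETURN CASE
		WHEN CharIndex(@delimiter, @string) > 0
			THEN Right(@string, Len(@string) - CharIndex(@delimiter, @string) + CASE WHEN @keepDelimiter = 1 THEN 1 ELSE 0 END)
		ELSE @string
	END
END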
-- review the issue
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%int Eustatius and Saba%'

-- correct the data
UPDATE [dbo].[time_series_covid19_recovered_global_narrow]
SET [Province State] = [Province State] + ', ' + [Country Region]
, [Country Region] = [Lat]
, [Lat] = [Long]
, [Long] = [Date]
, [Date] = [Value]
, [Value] = [ISO 3166-1 Alpha 3-Codes]
, [ISO 3166-1 Alpha 3-Codes] = [Region Code]
, [Region Code] = [Sub-region Code]
, [Sub-region Code] = dbo.CutLeft([Intermediate Region Code], ',', 0)
, [Intermediate Region Code] = dbo.CutRight([Intermediate Region Code], ',', 0)
WHERE [Country Region] LIKE '%Sint Eustatius and Saba%'

-- review data after correction
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%int Eustatius and Saba%'
A similar solution is used for the second problem:
-- review the issue
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%Korea%'

-- correct the data
UPDATE [dbo].[time_series_covid19_recovered_global_narrow]
SET [Province State] = ''
, [Country Region] = Replace([Country Region] + ', ' + [Lat], '"', '')
, [Lat] = [Long]
, [Long] = [Date]
, [Date] = [Value]
, [Value] = [ISO 3166-1 Alpha 3-Codes]
, [ISO 3166-1 Alpha 3-Codes] = [Region Code]
, [Region Code] = [Sub-region Code]
, [Sub-region Code] = dbo.CutLeft([Intermediate Region Code], ',', 0)
, [Intermediate Region Code] = Replace(dbo.CutRight([Intermediate Region Code], ',', 0), '"', '')
WHERE [Country Region] LIKE '%Korea%'

-- review data after correction
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%Korea%'
With this the data from the first file are ready to use. The data from the other two files can be loaded following the same steps as above. The tables not only have a similar structure, they also have the same issues. One can just replace the table names in the above scripts to correct them, as shown below.
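For example, the duplicates check for the confirmed cases needs only the table name replaced:

-- checking for duplicates in the confirmed cases (same script, only the table name changed)
SELECT [Province State]
, [Country Region]
, Date
, count(*) NoRecords
FROM [dbo].[time_series_covid19_confirmed_global_narrow]
GROUP BY [Province State]
, [Country Region]
, Date
HAVING count(*)>1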
Putting All Together
As the analysis needs the data from all three tables, we could create a query that joins them together and encapsulate it within a view. The volume of data is negligible, and even without an index the query performs acceptably. However, as soon as the volume of data increases, it's useful to have only one table for consumption. Independently of the approach considered, the query is similar. As we made sure that the key is unique across all the data, we can write the query as follows:
-- combining the data together
SELECT [Province State]
, [Country Region]
, TRY_CAST([Lat] as decimal(10,4)) [Lat]
, TRY_CAST([Long] as decimal(10,4)) [Long]
, [Date]
, Sum(Confirmed) Confirmed
, Sum(Death) Death
, Sum(Recovered) Recovered
, [ISO 3166-1 Alpha 3-Codes]
, [Region Code]
, [Sub-region Code]
, [Intermediate Region Code]
INTO [dbo].[time_series_covid19_global_narrow]
FROM (
	SELECT [Province State]
	, [Country Region]
	, [Lat]
	, [Long]
	, [Date]
	, TRY_CAST([Value] as int) Confirmed
	, 0 Death
	, 0 Recovered
	, [ISO 3166-1 Alpha 3-Codes]
	, [Region Code]
	, [Sub-region Code]
	, [Intermediate Region Code]
	FROM [dbo].[time_series_covid19_confirmed_global_narrow]
	UNION ALL
	SELECT [Province State]
	, [Country Region]
	, [Lat]
	, [Long]
	, [Date]
	, 0 Confirmed
	, TRY_CAST([Value] as int) Death
	, 0 Recovered
	, [ISO 3166-1 Alpha 3-Codes]
	, [Region Code]
	, [Sub-region Code]
	, [Intermediate Region Code]
	FROM [dbo].[time_series_covid19_deaths_global_narrow]
	UNION ALL
	SELECT [Province State]
	, [Country Region]
	, [Lat]
	, [Long]
	, [Date]
	, 0 Confirmed
	, 0 Death
	, TRY_CAST([Value] as int) Recovered
	, [ISO 3166-1 Alpha 3-Codes]
	, [Region Code]
	, [Sub-region Code]
	, [Intermediate Region Code]
	FROM [dbo].[time_series_covid19_recovered_global_narrow]
) DAT
GROUP BY [Province State]
, [Country Region]
, TRY_CAST([Lat] as decimal(10,4))
, TRY_CAST([Long] as decimal(10,4))
, [Date]
, [ISO 3166-1 Alpha 3-Codes]
, [Region Code]
, [Sub-region Code]
, [Intermediate Region Code]

-- reviewing the data
SELECT *
FROM [dbo].[time_series_covid19_global_narrow]

-- checking for duplicates
SELECT [Province State]
, [Country Region]
, Date
, count(*) NoRecords
FROM [dbo].[time_series_covid19_global_narrow]
GROUP BY [Province State]
, [Country Region]
, Date
HAVING count(*)>1
If everything went smoothly, the last query returns no records. As the latitude and longitude were provided with different precisions across the files, the values needed to be cast and cut after 4 decimals. Before using the data a few more adjustments are needed. As the values are cumulative, adding up to the previous date, it's useful to calculate the increase between two consecutive days. This can be done via the LAG window function:
-- preparing the data for analysis in a view
CREATE VIEW dbo.v_time_series_covid19
AS
SELECT *
, LAG(Confirmed,1,0) OVER (PARTITION BY [Province State], [Country Region] ORDER BY Date) PrevConfirmed
, LAG(Death,1,0) OVER (PARTITION BY [Province State], [Country Region] ORDER BY Date) PrevDeath
, LAG(Recovered,1,0) OVER (PARTITION BY [Province State], [Country Region] ORDER BY Date) PrevRecovered
, Confirmed - LAG(Confirmed,1,0) OVER (PARTITION BY [Province State], [Country Region] ORDER BY Date) IncreaseConfirmed
, Death - LAG(Death,1,0) OVER (PARTITION BY [Province State], [Country Region] ORDER BY Date) IncreaseDeath
, Recovered - LAG(Recovered,1,0) OVER (PARTITION BY [Province State], [Country Region] ORDER BY Date) IncreaseRecovered
FROM [dbo].[time_series_covid19_global_narrow]
With this the data are ready for consumption:
-- sample query
SELECT *
FROM dbo.v_time_series_covid19
WHERE [Country Region] LIKE '%China%'
  AND [Province State] LIKE '%Hubei%'
ORDER BY DATE
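In the same way one can aggregate the values at country level on top of the view, for a country of choice (here for example 'Germany'):

-- daily values aggregated at country level
SELECT [Country Region]
, [Date]
, SUM(Confirmed) Confirmed
, SUM(Death) Death
, SUM(Recovered) Recovered
, SUM(IncreaseConfirmed) IncreaseConfirmed
, SUM(IncreaseDeath) IncreaseDeath
, SUM(IncreaseRecovered) IncreaseRecovered
FROM dbo.v_time_series_covid19
WHERE [Country Region] = 'Germany'
GROUP BY [Country Region]
, [Date]
ORDER BY [Date]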
Of course, one can increase the value of this dataset by pulling further information, like the size of the population, the average density, or any other factors that could have an impact on the propagation of the disease.
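A rough sketch of such an extension, assuming a hypothetical dbo.CountryPopulation table that provides the number of inhabitants per ISO 3166-1 alpha-3 code, could look as follows:

-- hypothetical extension: relating the confirmed cases to the size of the population
-- (dbo.CountryPopulation is an assumed table with the ISO code and the number of inhabitants)
SELECT DAT.[Country Region]
, DAT.[Date]
, SUM(DAT.Confirmed) Confirmed
, Max(CPL.Population) Population
, Cast(SUM(DAT.Confirmed) * 100000.0 / NullIf(Max(CPL.Population), 0) as decimal(18,2)) ConfirmedPer100k
FROM dbo.v_time_series_covid19 DAT
     JOIN dbo.CountryPopulation CPL
       ON DAT.[ISO 3166-1 Alpha 3-Codes] = CPL.[ISO 3166-1 Alpha 3-Codes]
GROUP BY DAT.[Country Region]
, DAT.[Date]
ORDER BY DAT.[Country Region]
, DAT.[Date]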
Instead of loading the data via the wizard, one can create an SSIS project; however, some of the corrections still need to be done manually, unless one includes the corrections in the package's logic as well.
Happy coding!