Channel: SQL Server Data Warehousing forum

Some problems related to dimensional modeling in a Data Warehouse.


I have asked these questions on Stack Overflow, but no one there has answered me, so I am bringing them here in the hope that someone can answer them.

I am developing a BI system for our company from scratch, and I am currently designing the data warehouse. I am completely new to this, so there are many things I don't really understand and I would appreciate some more insight.

My problems are:

1) In our source system there are tables called "Booking" and "BookingAccess". The Booking table holds the data of a booking, such as check-in and check-out times, booking date, booking number, and the gross amount of that booking.

BookingAccess, on the other hand, holds foreign keys related to the booking, such as bookerID, customerID, processID, hotelID, paymentproviderID, and the current status of that booking. Booking and BookingAccess have a 1:1 relationship.

Our source system is about checking the validity of those bookings; the bookings themselves are not ours. We receive the booking information from other parties, who outsource this validation process to us. The gross amount is just a piece of information about the booking that we need to validate; it is not part of our own business. The status held in the BookingAccess table is the current status of that booking in our system, which can be "Processing" or "Finished".

From what I have read of Ralph Kimball, in this situation "Booking" would be the dimension table and BookingAccess should be the fact. I feel that BookingAccess is somewhat of an accumulating snapshot table, in which I should track the time when a booking enters "Processing" and when it becomes "Finished".

Have I got this right, or should I take a different approach?
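For illustration only, here is a minimal sketch of what such an accumulating snapshot fact might look like, with one milestone column per status. All names and data types are hypothetical, not taken from the source system:

CREATE TABLE FactBookingValidation (
    BookingKey          INT NOT NULL,   -- surrogate key to the Booking dimension
    BookerKey           INT NOT NULL,
    CustomerKey         INT NOT NULL,
    HotelKey            INT NOT NULL,
    PaymentProviderKey  INT NOT NULL,
    GrossAmount         DECIMAL(18,2) NULL,
    ProcessingStartDate DATE NULL,      -- milestone: when the booking entered "Processing"
    FinishedDate        DATE NULL,      -- milestone: when the booking became "Finished"; NULL until then
    CurrentStatus       VARCHAR(20) NOT NULL
);

Each booking gets one row that is updated as it moves from "Processing" to "Finished", which is the defining behaviour of an accumulating snapshot.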

2) In the "Booking" table there is also a foreign key called "ImportID". This key links to a table called "Import". The "Import" table holds history records of the files that were imported into our system (these files contain the bookings that are written to the "Booking" table), including attributes such as file name, import date, total bookings imported, and so on.

From my point of view, this is clearly a fact table.

But the problem is that the "Import" table and the "Booking" table have a one-to-many relationship (one ImportID in the "Import" table can have one, two, or more records with the same ImportID in the "Booking" table). This goes against the idea of fact tables, which insists that the relationship between fact and dimension must be many-to-one, with the fact always on the many side.

So what approach should I use in this case? I'm thinking of using bridge tables to solve this problem, but I don't know whether that is good practice, as there are a lot of records in the "Import" table, so I would have to create a big bridge table just to cover all of this.
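For what it's worth, a bridge table in this situation would simply be a mapping table between the two surrogate keys; a minimal, purely illustrative sketch (hypothetical names):

CREATE TABLE BridgeImportBooking (
    ImportKey  INT NOT NULL,  -- surrogate key to DimImport
    BookingKey INT NOT NULL,  -- surrogate key to the Booking dimension
    CONSTRAINT PK_BridgeImportBooking PRIMARY KEY (ImportKey, BookingKey)
);

Whether this is worth maintaining for a strictly one-to-many relationship is exactly the question raised above.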

3) Should I split a source-system table that contains a mix of relationships and descriptive information into a fact table containing only the relationships and a dimension table containing only the descriptive information? (For example, a table called "Customer" in the source system contains things like customer name, customer address, a customertype id, a customer parentID, and so on.)

I am asking this because if I use BI tools to analyze things (for example, counting the number of customers with customertypeid = 1), it feels somewhat weird if there are no fact tables involved.

Or should I treat it as a plain dimension table and use a snowflake schema? That would lead to a mix of star schema and snowflake schema in our data warehouse. Is this normal? I have read some official sources (most likely Oracle) stating that one should avoid using and mixing in the snowflake schema as much as possible, but other sources such as Microsoft say this is very normal. Even the AdventureWorks Data Warehouse sample database uses this kind of approach.

Or should I de-normalize every relation into that "Customer" table? I don't think this is a good approach either, as it would give the Customer dimension a lot of columns and make it very hard to track the history of every row in the "DIM_Customer" table. For example, if any change occurs in any relation of the "Customer" table, the whole "DIM_Customer" table would need to be updated.
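For illustration only, a minimal sketch of what a denormalized, Type 2 (historized) "DIM_Customer" could look like; the column names are hypothetical, and history is kept by adding new rows with effective dates rather than updating rows in place:

CREATE TABLE DIM_Customer (
    CustomerKey        INT IDENTITY(1,1) NOT NULL,  -- surrogate key
    CustomerID         INT NOT NULL,                -- business key from the source system
    CustomerName       NVARCHAR(200) NULL,
    CustomerAddress    NVARCHAR(400) NULL,
    CustomerTypeName   NVARCHAR(100) NULL,          -- denormalized from the customertype relation
    ParentCustomerName NVARCHAR(200) NULL,          -- denormalized from the parent customer relation
    RowEffectiveDate   DATE NOT NULL,
    RowExpiryDate      DATE NULL,                   -- NULL for the current version of the row
    IsCurrent          BIT NOT NULL
);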

_________

I hope someone can answer my questions.


Dimension attribute with one-to-many relationship


I have a request for an attribute in a dimension that has a one-to-many relationship with the lower level of the dimension.

Here is the case:

  • Dimension: Employee
  • Attribute: Immatriculation

One employee can have multiple immatriculation codes. Each immatriculation has a number, a start_date, and an expiration_date.

How can I model this case?
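A minimal sketch of one common option, purely for illustration: keep Immatriculation as its own small dimension and connect it to the Employee dimension through a bridge table (all names hypothetical):

CREATE TABLE DimImmatriculation (
    ImmatriculationKey    INT IDENTITY(1,1) NOT NULL,
    ImmatriculationNumber VARCHAR(50) NOT NULL,
    StartDate             DATE NULL,
    ExpirationDate        DATE NULL
);

CREATE TABLE BridgeEmployeeImmatriculation (
    EmployeeKey        INT NOT NULL,  -- surrogate key to DimEmployee
    ImmatriculationKey INT NOT NULL   -- surrogate key to DimImmatriculation
);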

SharePoint log DB file is huge: >300 GB


My SQL Server is running out of space. The main culprits seem to be two database log files:

Sharepoint_config_log: 340 GB of 480 GB

Sicon_log: 125 GB of 300 GB

My backup system is set to truncate the logs every day, but the log files never shrink and just keep growing. I have tried shrinking the files and releasing unused space to no avail.

Any ideas how I can rectify this? I don't want to increase the drive size.
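For what it's worth, a minimal diagnostic sketch; the database names are inferred from the log file names and may need adjusting. Checking log_reuse_wait_desc shows why SQL Server cannot truncate the log (for example LOG_BACKUP when the database is in full recovery and no transaction log backups are taken) before any shrink is attempted:

-- Why is the log space not being reused?
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases
WHERE name IN ('Sharepoint_config', 'Sicon');

-- Once the wait reason is resolved (e.g. a transaction log backup has run),
-- the log file can be shrunk to a sensible size:
USE Sharepoint_config;
DBCC SHRINKFILE ('Sharepoint_config_log', 1024);  -- target size in MB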

Issues with row level security


I have set up row-level security using a function and the code below:

CREATE FUNCTION security.fn_securitypredicate(@TenantId AS sysname)
    RETURNS TABLE
WITH SCHEMABINDING
AS
    RETURN SELECT 1 AS fn_securitypredicate_result
    WHERE @TenantId IN (SELECT tenant_id FROM [dw].[DimTenant]
                        WHERE analysis_user_name = SYSTEM_USER
                           OR SYSTEM_USER = 'Testlogin'
                           OR SYSTEM_USER = 'Test2login'
                           OR Is_Member('Reader') = 1);

CREATE SECURITY POLICY AdviserFilter  
ADD FILTER PREDICATE security.fn_securitypredicate(tenant_id)   
ON [dw].[DimAdviser]  
WITH (STATE = ON);

For some reason whenever I try to update any records in the table I get the following error:

Msg 100083, Level 16, State 1, Line 41
Table cannot be used as target of an update or delete operation that involves metadata or security built-in functions. Modify the statement and re-run it.

From my understanding, users on the database should still be able to update or delete rows in the table unless I add a block predicate. Am I missing something?

I'm using Azure SQL Data Warehouse.




SQL Server table partitioning in data warehousing


Hi guys, 

I am trying to implement table partitioning on one of my fact tables, which has over 200M records, and I am stuck.

I have created 13 filegroups: one for archiving and twelve for each month of the current year. I have loaded some sample data, and data that is not for this year is not showing up in any of the partitions.

What am I doing wrong? Should historical data be mapped to the default (archive) partition?

How will I make sure that the 12 monthly partitions are archived and cleared on the first of every year?
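A minimal sketch of a monthly partition function and scheme for one year, with hypothetical filegroup names; with RANGE RIGHT, any row earlier than the first boundary falls into the leftmost partition, which can serve as the archive:

CREATE PARTITION FUNCTION pfMonthly (DATE)
AS RANGE RIGHT FOR VALUES
('2019-01-01','2019-02-01','2019-03-01','2019-04-01','2019-05-01','2019-06-01',
 '2019-07-01','2019-08-01','2019-09-01','2019-10-01','2019-11-01','2019-12-01');

CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly
TO (fgArchive, fgJan, fgFeb, fgMar, fgApr, fgMay, fgJun,
    fgJul, fgAug, fgSep, fgOct, fgNov, fgDec);

At the start of each new year, the usual pattern is to MERGE or SWITCH the twelve monthly partitions into the archive partition and create new boundaries for the new year from a scheduled job.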

Kimball Modelling Approach- SSAS


 Hi

We are trying to design data marts for two specific areas, and we want to understand tabular model performance if we increase columns but reduce rows, or increase rows but reduce columns, and the impact on the star schema.

If we increase rows but reduce columns, will having more dimensions not impact performance? And if the relationships are one-to-many, is that fine, or are we saying that by increasing dimensions we are increasing cardinality?

Please share your views on best practices and the recommended approach.


Kimball


I am really confused here. Are the data marts used in Kimball's approach in 3NF form or in dimensional modelling form?

In Kimball's approach, data is extracted from the sources and loaded into individual data marts first, so are those in 3NF or in dimensional format?

Please help.

Fetching data into the staging database if data in any dependent table is modified


Hi,

After the data warehouse load completes successfully, I want to delete data from the staging tables.

Below is my scenario.
I have two dependent tables, Employee and Department, each with 'Created Date' and 'Modified Date' columns. The first time, we fetched all data from the source into the staging database through the ETL initial load.
When all staging data has been loaded into the data warehouse successfully for a particular date through the job, we then truncate the staging tables (based on the max date), which deletes all records from the staging database.

Now, in my scenario, after 20 days the department of the employee named 'xxx' changed to 'yyy'. When I run the staging job (to delete old records), it deletes the data from Employee and Department. After that, the incremental load for staging runs: it fetches data from the Department table because that data was modified (based on the 'Modified Date' column), but for the Employee table we get no data because there was no modification (i.e., the Employee table's Modified Date is NULL) and the old data was already deleted.

My view query is something like this :

SELECT E.Empid, E.EmpName, D.DepartmentName
FROM tbl_Emp E
LEFT JOIN tbl_Dept D ON E.Empid = D.Empid

So, for the dimension table, while fetching data from the staging view I don't get proper data: for an employee whose department was changed, that information is available in the Department table, but the Employee table has no data. When I apply the join in the view using EmployeeId, it gives NULL, because Department has data but Employee has no row for that EmployeeId.

I want to fetch all data from the Employee table for employees whose department was updated in the Department table.

I don't have any clue how to solve this issue.
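For illustration only, one possible approach (table and column names follow the post; the Source and Staging schema names are assumptions): when the incremental load pulls changed Department rows into staging, the matching Employee rows could be re-fetched from the source on the basis of those keys, even though the Employee rows themselves were not modified:

-- Re-stage Employee rows for every employee whose Department row changed
INSERT INTO Staging.tbl_Emp (Empid, EmpName)
SELECT E.Empid, E.EmpName
FROM Source.tbl_Emp AS E
WHERE E.Empid IN (SELECT D.Empid FROM Staging.tbl_Dept AS D);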

Thanks,



How to perform incremental load in staging database


Hi,

How do I perform an incremental load into the staging database to fetch new and updated data?

Thanks,

Issue when performing an incremental load on parent-child tables to load data into data warehouse tables


Hi,

I have Emp and EmpDetail tables. First, my initial load package is executed and the data is loaded into the staging database (the source and staging table structures are the same). Then, with the help of a view, the data is loaded into the data warehouse table DimEmpDetail. My view query is:

SELECT E.EmpId, E.Name, ED.Topic, ED.MentorId
FROM EmpDetail AS ED
LEFT JOIN Emp AS E ON E.EmpId = ED.EmpId

Once the data has been successfully loaded into the data warehouse table DimEmpDetail, I truncate the staging tables.

The second time the package runs, the incremental load package is executed. The EmpDetail table data was modified, so I get data from that table the second time, but I get no data from the Emp table because nothing was modified there. So in my view query I don't get EmpId and Name (i.e., I get NULL in those columns), and my MentorId will not be updated in the data warehouse table DimEmpDetail because EmpId is NULL.

How can I solve this issue? Does anyone have an idea?

Thanks,

Dhara

How to change 12-hour time format to 24-hour time format in SQL Server


Hi ,

Could anybody explain to me how to change a 12-hour time format to a 24-hour time format in SQL Server? After that, I want to get the difference between those two date fields in hours.

Basically, in my DB the values appear in 12-hour format, and when subtracting the two dates I don't get the proper value.

Can anyone help me here?
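A minimal sketch of one way to approach this, assuming the values are stored as strings in 12-hour (AM/PM) format; casting them to DATETIME makes the display format irrelevant, CONVERT with style 108 shows them in 24-hour form, and DATEDIFF returns the difference in hours. Table and column names are hypothetical:

SELECT
    CONVERT(VARCHAR(8), CAST(StartTime12h AS DATETIME), 108) AS StartTime24h,  -- hh:mm:ss, 24-hour
    CONVERT(VARCHAR(8), CAST(EndTime12h   AS DATETIME), 108) AS EndTime24h,
    DATEDIFF(HOUR, CAST(StartTime12h AS DATETIME), CAST(EndTime12h AS DATETIME)) AS DiffInHours
FROM dbo.MyTable;  -- hypothetical table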

Regards,

MDX Query to calculate Average rather than Sum and hide it

I am looking to calculate the average instead of the sum, but I do not want to display it.

Best way to deal with recording events and non-events in a fact table


Hello, I am trying to think of the best way to deal with the following.

I am attempting to design a fact table that looks at a daily feed from a safety system and records whether there has been an injury.

Looking at the data, there have only been 13 of these events in the last 5 years.

Is it worth creating a FACT table for so few events or should I just use the staging table (exact copy of the source) to report from in the Warehouse?

They will want to see that there were no events on certain days, so I assume that (whether using a FACT table or not) I would just do an outer join against the date dimension to show 0 events for the majority of days and capture the days when there is an injury.
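A minimal sketch of that outer-join idea, assuming a hypothetical DimDate table and a FactInjury table with one row per injury event:

SELECT
    d.DateKey,
    d.CalendarDate,
    COUNT(f.InjuryKey) AS InjuryCount   -- 0 on days with no recorded injury
FROM DimDate AS d
LEFT JOIN FactInjury AS f
    ON f.DateKey = d.DateKey
GROUP BY d.DateKey, d.CalendarDate;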

Is it always best to use a FACT table even if there are very few facts in it?

I hope the above makes sense

Thanks, 

Phil

Many to Many relationship between two Dimensions and Fact


I have a request for an attribute in a dimension that has a many-to-many relationship with another dimension.

Here is the case:

  • Dimension 1: Employee
  • Dimension 2: Nationality

I have 2 source tables: Employee and Nationality.

One employee can have multiple nationality codes. Each Nationality has an ID (code) and a label. A nationality can belong to many employees.

In my data model, the Fact is linked to the Employee dimension, and it records every change made to certain fields of the Employee (and of other dimensions).

Is it possible to use the bridge table method between the Employee dimension and the Nationality dimension, and link only the Employee dimension to my fact? If yes, how should I implement this method?
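For illustration only, a minimal sketch of such a bridge between the two dimensions, with hypothetical key and column names; the fact keeps referencing only the Employee dimension, and nationalities are reached through the bridge:

CREATE TABLE DimNationality (
    NationalityKey   INT IDENTITY(1,1) NOT NULL,
    NationalityCode  VARCHAR(10) NOT NULL,
    NationalityLabel NVARCHAR(100) NOT NULL
);

CREATE TABLE BridgeEmployeeNationality (
    EmployeeKey    INT NOT NULL,  -- surrogate key to DimEmployee
    NationalityKey INT NOT NULL   -- surrogate key to DimNationality
);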

Bridge tables VS Star Schema Data Model


As is well known, a bridge table is a powerful way to handle dimensions that take on multiple values when associated with the grain of a fact table's measurement event. Put simply, a bridge table enables you to resolve many-to-many relationships between dimensions. The goal is to create a data model that performs well and is simple to query. In many cases a star schema is a must, because otherwise we might face performance issues. I have the following dimensions:

  • Employee
  • Function
  • Role

An employee can have many current functions: one primary and the others secondary. An employee can also have many current roles.

My Fact will record any updates to the function and the role of any employee. My model is like below: [image: data model diagram]

My Fact_Employee will contain, for a specific employee, different current records like the ones below:

FactID EmployeeID FunctionID RoleID BIStartDate BIEndDate CurrentRecord
1      544        390        56     20/09/2018  NULL      1
2      544        390        11     03/10/2018  NULL      1
3      544        67         56     ...         ....      1
4      544        67         11     ...         ....      1

For EmployeeID = 544, his current functions are FunctionID = 390 and 67 and his current roles are RoleID = 56 and 11, and I linked BIStartDate and BIEndDate to the date dimension (twice) as shown in my model above.

My questions: In my case, is it recommended to use bridge tables between DimEmployee and DimFunction/DimRole? If I am willing to keep working with the star schema model, is my model presentation correct/optimised? And how can I distinguish the start/end dates for a function from those for a role in my fact? As it stands, I think I am producing a Cartesian product, and my fact will contain a very large amount of data.


Deleting duplicate records from Azure SQL DW doesn't work


I'm trying to delete some duplicate records on Azure SQL DW.

I tried the following queries, but nothing works / they are not supported on the Azure DW platform.

1) A CTE with ROW_NUMBER, then a DELETE against the CTE:

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY /* duplicate key columns */ ORDER BY /* any column */) AS row_num
    FROM tablename
)
DELETE FROM cte WHERE row_num > 1;

This throws an error that a "delete statement cannot follow a cte".

2) A DELETE statement using a subquery:

DELETE TableAlias
FROM (SELECT /* subquery */) AS TableAlias
WHERE row_num > 1;

This throws the error "A FROM clause is not supported".

So we can't use these forms of the DELETE statement on Azure DW?

The alternate solution I followed (sketched below):

  1. Created a CTAS table, loading all data from the original table along with a ROW_NUMBER/rank value.
  2. Truncated the original table.
  3. Loaded back from the CTAS table only the rows with rank = 1.
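For illustration, a minimal sketch of that CTAS-based de-duplication, assuming a hypothetical table dbo.MyTable with duplicate key column KeyCol; the PARTITION BY / ORDER BY columns and the distribution option would need to match the real table:

-- 1. CTAS a copy of the table with a row number per duplicate group
CREATE TABLE dbo.MyTable_dedup
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY KeyCol ORDER BY KeyCol) AS rn
FROM dbo.MyTable;

-- 2. Empty the original table
TRUNCATE TABLE dbo.MyTable;

-- 3. Reload only one row per duplicate group (list the original columns, not rn)
INSERT INTO dbo.MyTable
SELECT KeyCol, Col1, Col2   -- hypothetical column list
FROM dbo.MyTable_dedup
WHERE rn = 1;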


Disclaimer: The contents I write here are my personal views, not the views of my employer or anyone else.


Detailed and aggregated measures in the same fact table


Hi,

I have the following structure and need to manage it to build a good data warehouse:

In the database I have a table for cars and a table for kilometer readings, so for car A I can have zero, one, or many readings in a month.

The project has to provide a detailed measure called "KM Index" (granularity: day) and a calculated measure for the kilometers traveled MONTHLY (granularity: month) by the car (the difference between the first and last readings in the month, using real readings, or a prorated reading if no real reading exists for the month).

How can I have these detailed and calculated measures, with their different granularities, in the same fact table (if possible), or should I create two fact tables?

NB: We cannot sum up the KM Index (kilometer reading):

Example for car A:

05 January: Km Index = 1000

31 January: Km Index = 1250

The total km traveled is not the sum 1000 + 1250; it is the difference, 1250 - 1000 = 250.

I am working on a solution in which the calculated measure for km traveled is linked to the last day of the month, so that we can keep the same granularity in the same fact table with the two measures, KM Index and KM Traveled, and for aggregation we take MAX(KM Index) and SUM(KM Traveled). Is that possible?
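A minimal sketch of the monthly calculation, purely for illustration and assuming the KM Index only ever increases; table and column names are hypothetical. It takes the difference between the highest and lowest reading of each car in each month:

SELECT
    CarID,
    DATEFROMPARTS(YEAR(ReadingDate), MONTH(ReadingDate), 1) AS ReadingMonth,
    MAX(KmIndex) - MIN(KmIndex) AS KmTraveledInMonth
FROM dbo.KilometerReading
GROUP BY CarID, DATEFROMPARTS(YEAR(ReadingDate), MONTH(ReadingDate), 1);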

Thanks

Efficiently converting fact measures to multiple units of measurement


We have a data warehouse with multiple fact tables (Finance, Supply Chain, etc.). The fact tables contain a single "raw" measure from our ERP. At the moment, we are using a SQLCLR UDF (to allow parallelism) to convert all those measurements on the fly rather than storing them in the tables.

This is mainly done to allow code reuse, as the conversion method is the same for all raw measures. The UDF is called through a view that contains the fact table joined to a product dimension to retrieve the fields needed for the calculation. This helps when any of the product attributes change or the conversion rule changes (it has happened before and I'm sure it will again) and we would otherwise have to run a massive update. It also helps if we ever end up needing yet another measure in a different UOM.

We've recently looked into upgrading our data warehouse from SQL 2014 to SQL 2017, and we're trying to leverage the new clustered columnstore index (CCI). It has been great, except that the functions now seem to revert the query to single-threaded execution, which erases any performance gain from using a CCI.

My question is: should we store all measure conversions in the fact table (which will soon use a clustered columnstore), can we force the UDF back to a parallel plan, or should we move to table-valued functions, where both code reuse and parallelism could be kept at the cost of an extra CROSS APPLY in our view?

Here's an example of what we're currently doing:

SELECT
    dbo.MyConversion(P1, P2, P3, QtyShippedRaw) AS QtyShippedPounds,
    dbo.MyConversion(P1, P2, P3, QtyShippedRaw) AS QtyShippedCans,
    dbo.MyConversion(P1, P2, P3, QtyOrderedRaw) AS QtyOrderedPounds,
    dbo.MyConversion(P1, P2, P3, QtyOrderedRaw) AS QtyOrderedxxx,
    dbo.xxxx
FROM MyFactTable
INNER JOIN MyProductDimension ...  -- retrieves P1, P2, P3
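For illustration only, a minimal sketch of the inline table-valued function option mentioned above; an inline TVF is expanded into the calling query's plan, so unlike a scalar UDF it does not by itself block a parallel plan. The conversion logic here is a placeholder, not the real rule:

CREATE FUNCTION dbo.MyConversionTVF
    (@P1 DECIMAL(18,6), @P2 DECIMAL(18,6), @P3 DECIMAL(18,6), @RawQty DECIMAL(18,6))
RETURNS TABLE
AS
RETURN
    SELECT @RawQty * @P1       AS QtyPounds,   -- placeholder conversion
           @RawQty * @P2 / @P3 AS QtyCans;     -- placeholder conversion

-- Used from the view via CROSS APPLY, e.g.:
-- SELECT s.QtyPounds AS QtyShippedPounds, s.QtyCans AS QtyShippedCans
-- FROM MyFactTable f
-- INNER JOIN MyProductDimension p ON ...
-- CROSS APPLY dbo.MyConversionTVF(p.P1, p.P2, p.P3, f.QtyShippedRaw) AS s;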

Modifying table with large number of records

I have a table with about 400 million rows. I need to drop a hash column, add some new columns with a default of 0 and set them to NOT NULL, and then re-add the hash column. It's taking hours to do this. Much of the time is spent recalculating the row hash, I believe. Are there any techniques I could use to speed up the process? The DB is using the simple recovery model.
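For illustration only, one common alternative is to build a new table with the desired shape in a single pass and then swap names, rather than altering 400 million rows in place; under the simple recovery model, a SELECT ... INTO can often be minimally logged. All table, column, and hash details below are hypothetical placeholders:

-- Build the new shape in one pass instead of altering in place
SELECT
    ExistingCol1,
    ExistingCol2,
    CAST(0 AS INT) AS NewCol1,   -- new column, default value 0
    CAST(0 AS INT) AS NewCol2,
    HASHBYTES('SHA2_256',
        CONCAT(ExistingCol1, '|', ExistingCol2, '|', 0, '|', 0)) AS RowHash  -- placeholder hash expression
INTO dbo.MyBigTable_new
FROM dbo.MyBigTable;

-- Recreate indexes/constraints on dbo.MyBigTable_new, then swap the names
EXEC sp_rename 'dbo.MyBigTable', 'MyBigTable_old';
EXEC sp_rename 'dbo.MyBigTable_new', 'MyBigTable';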

John Schroeder



special mode optimized for star schema performance


Hi, we run 2017 Standard but are open to alternatives. I heard once about a mode of SQL Server (I think on-premises) that is optimized for star schema performance. Now that I'm looking online for it, I cannot find the documentation.

The reason we might be interested is kind of cool: we don't want to carry expertise in technologies like the tabular model, but we would like to be positioned for such technologies if possible. Design-wise, we are currently torn between a flattened version of our facts and dimensions and a star schema.

We need great performance because tabular is out, at least temporarily. I could be wrong, but I've never been impressed by columnstore performance, despite the fact that it supposedly uses the same engine as tabular. Then again, I've never really tried it as a nonclustered index without dense columns, and my opinion may be clouded because I watched its performance with joins and in non-aggregation scenarios as well.

If it really exists, what is that star-schema technology called? And is it really great performance-wise?

      
