I recently learned a trick with SQL Data Generator that I wasn’t aware of. I think it’s a good way to mask some of the production data that you might not want developers to have.
Let’s start with a production table. In my case, I’ve created a Sandbox_Prod database with a table in it for employees. I’ve added a few records that contain some sensitive information.
CREATE TABLE Employees
(
  empid INT IDENTITY(1,1)
, EmployeeName VARCHAR(250)
, EmployeeEmail VARCHAR(250)
, active TINYINT
, salary MONEY
, pwd VARBINARY(MAX)
);
GO
INSERT INTO Employees
VALUES
  ( 'sjones', 'firstname.lastname@example.org', 1, 10000, ENCRYPTBYPASSPHRASE('The User Sample', 'MyS%83ongPa44#word'))
, ( 'awarren', 'email@example.com', 1, 20000, ENCRYPTBYPASSPHRASE('The User Sample', 'Ano$2therS%83ongPa44#word'))
, ( 'rsmith', 'firstname.lastname@example.org', 1, 5000, ENCRYPTBYPASSPHRASE('The User Sample', 'Kedj93m@@83ongPa44#word'));
GO
I’ve got a second database, called Sandbox, that simulates my development environment. I’ve got the same table, but without any data in it.
What I want to do is move some of the production data to my development area, but not all of it. Some of the production data needs to be masked.
SQL Data Generator Sources
I can use SQL Data Generator from Redgate to do this, by using a data source that actually exists. In this case, I’ll create a new project and point it at my Sandbox database. I’ve deselected all of the tables except my Employees table.
When I pick my Employees table, I see the familiar generation screen on the right. Most of you will, like me, notice the number of rows and the option to delete data.
However, there’s another option. I can select the “Use existing data source” radio button instead. When I do this, I have a few choices for data. I can use an existing table or a CSV file. Both of those can be good choices, especially if I have sets of data I want to load into the table. Either one can help me to build known, specific data sets for development (or testing).
In my case, I’ll choose an existing table. When I do this, I click the “Browse” button and I get a connection dialog for SQL Server. I pick my instance and the production database.
I click “Next” and then get the chance to select the table to use. In this case, I’ll pick the Employees table.
When I return to the main SDG screen, I see the table listed as the source, but my preview shows the actual production data. This is because I’ve mapped the production table as a source, and it will be used as it currently exists.
That’s not what I want. I want to mask the email address and the salary. However, now I can change things like I might do for any random data generation.
Let’s first click in the EmployeeEmail column. When I do that, I see the column with its source set to the existing column in the production table.
However, the drop-down gives me lots of choices, including an Internet email generator.
If I select that, my preview changes, now showing production data for all columns other than the email.
I can repeat this for the salary (and password to be safe). When I do that, I’ll see random data for those columns and production data for others.
I can repeat this for all tables in my project, mapping through data that isn’t sensitive and masking data that is. It’s a tedious process, but it’s a one-time process for specific data. Once this is done, every restore can have the project run and the data masked. If production DBAs do this refresh, then developers never see sensitive information.
It seems that there’s no shortage of re-branding attempts being made in all industries and by all types of people. I still remember when most of us were called computer programmers instead of developers. Not many people writing C# or Java code would want to be called “programmers” today.
One of the latest fads is the call for more data scientists to work on big data, another equally poorly defined term. However, it seems that the definition of what a data scientist is has been so ill-defined that almost anyone who can write a query using aggregates might define themselves as a data scientist.
That’s a good thing if you are looking for a job. Many of you might find opportunities (and raises) if you convince a hiring manager that you are a data scientist. However, I’d be wary of living on just the new brand without growing your skills. If your company comes to expect more, especially with regards to advanced statistical analysis, you might find yourself in a bind.
I ran across a piece that looks at the skills that a data scientist might actually need. I don’t know how many managers might understand the difference between simple discrete rules engines and more subtle, complex, multi-variable, adaptive algorithms, but there can be a big difference in how well the system actually performs for your company.
No matter what you choose for your career, I’d certainly encourage you to continue to learn more about how to work with data, whether you want to learn more about statistics, pick up R, or improve your visualization skills. Keep learning. Keep your brain active and work to improve before you find yourself without a job and in need of training. Every little bit you learn helps, and the practice of continuous improvement builds a habit that will serve you well over time.
The Voice of the DBA Podcast
Last week we received the results of the PASS Board of Directors elections for 2015. Jen Stirrup and Tim Ford were re-elected to new terms. They’ve served for the past few years as members of the BOD. Ryan Adams was elected for the first time, and I supported his candidacy, so I’m glad he won.
Argenis Fernandez didn’t get elected, but I still applaud his decision to run. He was one of only four out of the thousands of PASS members that did so. However, in reading Andy Warren’s notes, perhaps Argenis will end up being on the board anyway.
If that’s the case, I’m glad. I think Argenis brings a fresh, new, different view, which is needed.
In any case, I’ll publicly thank them here, and in person when I see them, as their volunteerism should be appreciated and acknowledged.
There is a bit of a rant from John Welch on testing your data manipulation that I like. I think some of the impact is lost because the end shows an advertisement for a product that helps here, but the points made are good. We all want to test, we think it’s hard, we don’t have time, and our businesses live with the issues from limited testing.
Unlike John, I’m not a fan of Test-Driven Development. Usually this is because I’m not always 100% sure of the results I want or have been given. I’ve often been given a request to do X, and as I get involved, I find that the requirements might be incomplete, or even wrong, and they’ll change. As a result, I like to write a little code, get some idea of what I want to return or change, and then write a test that verifies what I’ve done is correct.
It’s a subtle difference, and maybe I’m doing TDD in the wrong order, but I like to get code, test it, then think about potential issues (which I might find as I write code) and write a few tests for the things that I’ve missed.
However, I do believe we need to test our code. We all do test our code, even if it’s with a few before/after queries. What I don’t get is why we don’t just mock up a quick test that we can run in an automated fashion. It’s not much more work, and then we can more easily re-run the test later to ensure any refactoring or optimizations we make continue to work as expected.
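As a sketch of what I mean, even a before/after query can become a repeatable test if you capture the expectation in code. The table name and expected count here are made up for illustration, not from a real system:

```sql
-- A quick, re-runnable check: run the query under test and raise an error
-- if the result isn't what we expect. Save this script and run it after
-- every refactoring or optimization.
DECLARE @actual INT;

SELECT @actual = COUNT(*)
FROM dbo.Orders                  -- hypothetical table under test
WHERE OrderDate >= '2015-01-01';

IF @actual <> 42                 -- the known, expected result
    RAISERROR('Test failed: expected 42 rows, got %d', 16, 1, @actual);
ELSE
    PRINT 'Test passed';
```

The point isn’t the tooling; it’s that the expected result lives in the script instead of in your head, so anyone can re-run it later.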
I’ve been looking to work on my programming skills a bit and try some new languages. I’d like to grow my career in a few different ways, as well as investigate where some of the new languages and platforms might be useful for data analysis. As I’ve talked to some developers and been looking around, someone recommended Exercism.io.
Exercism.io is a site that allows you to practice coding exercises and get feedback from others. When you visit the site, it has an interesting look and invites you to log in through GitHub. If you’re a developer and not a member of GitHub, you’re probably making some sort of mistake.
Once you log in, you really need to download the command line client and execute it. Once you do, you can configure it to connect and download exercises. Each of these is placed in a folder, as shown:
For each of these languages, you get an exercise that you need to complete in that language. I’ve been playing with Python, and I had a first exercise of hello-world. In the python folder were my exercises (you can see I’ve moved on).
The hello-world folder had a README and a test file (it’s since gotten my program in there). The README has instructions, and the test file is a set of tests that can be executed to check your program.
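To give you the flavor, the test file is just a standard unittest-style script. This is my paraphrase of the idea, not Exercism’s actual file, and I’ve included a sample `hello()` function (which is the part you’d normally write yourself):

```python
import unittest

# Your job in the exercise is to write this function so the tests pass.
def hello():
    return "Hello, World!"

# The exercise ships a test file along these lines; you run it with
# something like: python -m unittest hello_world_test.py
class HelloWorldTest(unittest.TestCase):
    def test_hello(self):
        self.assertEqual(hello(), "Hello, World!")
```

Once the tests pass, you’re done with the exercise and ready to submit.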
When you pass the tests, you submit your solution from the command line client. The submissions appear on the website and people can comment on your code. I am in no way commenting on anyone else’s Python code at this point, but I did get a comment on my leap year calculation.
I looked over the comment and then changed some code. I had to futz with the command line to get this to resubmit, but as you can see, I ended up getting a second iteration in there. No comments on that one, but we’ll see.
It’s an interesting idea to share code and get comments. I think mostly this is a way to formally practice some exercises and get comments from experienced users, but the volume means that potentially you won’t get comments on your solutions. I know many of the SQL users may or may not comment on solutions.
I think this is interesting, and I’m tempted to try to do something like this for SQLServerCentral. The hard part is putting together enough questions that others can practice in an organized fashion.
Last week I did something that many people have already done. However this was my first experience, and I really enjoyed it. We’ll see if I get to do it again.
I went to SQL Saturday #441 – Denver as an attendee. I hadn’t submitted to speak, but I spent half my day over at the event, watching some sessions, networking, and enjoying some time with my fellow data professionals.
I’ve never done that before. I’ve attended many SQL Saturdays, 49 by my count, but always as a speaker. I’ll sit in sessions, but I always have a responsibility and my laptop.
Saturday was a bit more relaxing for me. I could attend sessions and focus more, without worrying about keeping my talk in my head, fussing over setup, or practicing anything during the day. I could give all my time to someone else’s presentation, which was nice.
I don’t know how many more SQL Saturdays I’ll get to attend like this. Certainly the ones in the Denver area are events I like to support by volunteering to present on a topic. Denver has grown quite a bit and we have many speakers from out of town that submit, so I may avoid submitting here in the future and continue to attend, or volunteer in some other fashion.
It was a fun experience, and I’m glad I got to go. It was worth my time, as I learned a few things that I am looking forward to trying out over the next few weeks.
I was chatting in one of our DLM webinars and someone asked a question about logging data. I mentioned a few things, but this individual followed up and wanted more information. Specifically, how do you build a logging table?
It’s an open-ended question, but it’s perfect for a poll. How do you construct a table for capturing lots of information that you may or may not query later, and which you might only query in a very ad hoc fashion when something has gone wrong? Usually that’s the only time I look through logs: when I’m investigating some type of issue. With that in mind, this week the question is:
How do you design a logging table?
I’m looking for two things here: the DDL and the rationale for the design. What is good about this design? How does this particular structure (with indexing) allow me to find answers to the questions I might need to ask?
I realize this is very open ended, but the concept of logging is also open-ended, used different ways for different purposes. If you have problems with logging structures, let us know what design techniques to avoid as well.
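To seed the discussion, here’s one possible starting point. The columns and index choices are just my assumptions for a generic application log, not a recommendation:

```sql
CREATE TABLE dbo.AppLog
(
  LogID    BIGINT IDENTITY(1,1) NOT NULL
, LogDate  DATETIME2(3) NOT NULL
      CONSTRAINT DF_AppLog_LogDate DEFAULT SYSDATETIME()
, Severity TINYINT NOT NULL       -- e.g. 1 = info, 2 = warning, 3 = error
, Source   NVARCHAR(128) NOT NULL -- procedure or module that logged the row
, Msg      NVARCHAR(2000) NOT NULL
, CONSTRAINT PK_AppLog PRIMARY KEY CLUSTERED (LogID)
);

-- Clustering on the identity keeps the heavy insert workload cheap; a
-- nonclustered index on LogDate supports the ad hoc "what went wrong
-- yesterday" queries without slowing inserts down much.
CREATE NONCLUSTERED INDEX IX_AppLog_LogDate
    ON dbo.AppLog (LogDate) INCLUDE (Severity, Source);
```

Your answers may look very different, and that’s exactly the point of the poll.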
Another simple post from me that hopefully serves as an example for people trying to get started blogging as part of #SQLNewBloggers.
I ran across a question on Facebook, of all places, the other day. Someone had asked a friend how to return a value from a procedure and assign it to a variable. My friend answered, but in the discussion, I noticed the poster was looking to return @@IDENTITY to the calling procedure as the value of the row that was just inserted.
Don’t do that. At least not without understanding the potential issues.
It’s been years since I’ve seen @@IDENTITY in use, and for a number of years before that, this was an easy “weed out” question in interviews.
If you look at the documentation for @@IDENTITY, it notes that SCOPE_IDENTITY() and @@IDENTITY both return the last identity value inserted in the table, but @@IDENTITY is not limited in scope to the current session. This means that when concurrent inserts occur, you could receive the identity value of another session. Depending on how you use this value, that may or may not be an issue.
How does this work? Let’s create a simple table with an identity. I also create a logging table and a trigger that will add a message to my logging table when I add a row to the first table.
CREATE TABLE newtable
(
  id INT IDENTITY(1,1)
, mychar VARCHAR(20)
);
GO
CREATE TABLE Logger
(
  logid INT IDENTITY(56,1)
, logdate DATETIME
, msg VARCHAR(2000)
);
GO
CREATE TRIGGER newtable_logger ON dbo.newtable FOR INSERT
AS
INSERT INTO Logger VALUES (GETDATE(), 'New value inserted into newtable.');
RETURN;
GO
If I run this, what do I expect to be returned? I’d expect a 1, since this is the first row inserted into newtable.
INSERT INTO dbo.newtable ( mychar )
VALUES ( 'First row' );

SELECT @@IDENTITY;
However, I get a 56 in my result set for @@IDENTITY.
The reason is that the last identity value was 56, from the logging table. The order of operations is:
- insert value into newtable
- @@identity set to 1
- trigger fires
- insert into logger
- @@identity set to 56
That’s often not what we want when capturing an identity value. What’s worse, this behavior can exist, but not manifest itself until someone changes a trigger later.
If I change this to SCOPE_IDENTITY(), I get a different result.
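For completeness, the modified batch is the same insert with the function swapped:

```sql
INSERT INTO dbo.newtable ( mychar )
VALUES ( 'Second row' );

-- Returns the identity from newtable, even though the trigger
-- inserted into Logger afterward.
SELECT SCOPE_IDENTITY();
```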
This is because the SCOPE_IDENTITY() function takes the scope into account and doesn’t get reset by the trigger code.
This took some time to write, mostly because I had to set up the demo, test things, and then get the explanation straight in my head. It took me 15-20 minutes, including lookup time in BOL, but if you are new to writing, this might take a bit longer. You’d also want someone to review your explanation, since this can be tricky to explain.
I was at Microsoft in Redmond recently and heard an interesting comment from a SQL Server developer. I was debating the data masking feature with a friend, and we were torn on the value of this for various situations we’d each encountered in the past. There are some restrictions, and it doesn’t seem that data masking is really offering a lot of security.
The Microsoft developer, however, noted that this isn’t really a high security feature. It’s a developer feature. The design of data masking is to prevent that same code from being rewritten over and over by application developers. The use case is really to help with systems that might read some data, like those that print off part of an account number, ID number, credit card number, etc.
If you read up on the restrictions, this makes sense. If you are just trying to make development more convenient, the feature makes sense. I hadn’t thought about that use case, but the more I consider this, the more I’m sure that data masking does remove a bunch of code that developers might be re-implementing themselves, perhaps with highly variable levels of quality. It also removes the chance that application developers will accidentally pull sensitive data to a client and (poorly) implement mask replacement there.
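As a sketch of what the feature takes off a developer’s plate, here’s the general shape of the syntax as I understand it from the documentation; the table and columns are invented for illustration, and details may change between releases:

```sql
-- Dynamic Data Masking: the mask is declared once on the column, so every
-- application reads the same masked form instead of re-implementing it.
CREATE TABLE dbo.Accounts
(
  AccountID     INT IDENTITY(1,1)
, AccountNumber VARCHAR(20)
      MASKED WITH (FUNCTION = 'partial(0,"XXXX-XXXX-",4)')  -- show last 4 only
, Email         VARCHAR(250)
      MASKED WITH (FUNCTION = 'email()')                     -- aXXX@XXXX.com style
);
```

Users without the UNMASK permission see the masked values; the data itself is stored unchanged, which is exactly why this is a convenience feature rather than a security boundary.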
I think this feature is being mis-marketed a bit, really to increase sales to executives and management. I’m sure there isn’t anything we can do about that, but I’d love to see technical documents and information about this for developers and DBAs. Give us a more realistic use case and give us better guidance. I think if we got that for many features, there might be more positive responses and greater interest from technical professionals in the changes to the SQL Server platform.
There was a piece in Dark Reading about a security researcher being targeted by a hacking group. While this incident is more political, it does raise some questions about how hackers might target our systems in the future. It just might be through personalized attacks against anyone that has privileged access.
In this case, hackers learned who researchers were and made repeated attempts to personally craft phishing attacks against specific people. However, since many hackers communicate with each other and could easily turn from political goals to economic ones, I’d be concerned about how this might affect data professionals in the future.
We know that social engineering works. While many of our customers and clients do have access to large amounts of data and are perhaps easier targets, I would still expect to see attacks against a data professional who manages lots of data, especially one with access to high-profile data, or to multiple companies as a consultant.
Targeted attacks against individuals could be a concern for many of us. We are usually more conscious of phishing and social engineering, but we’re not invulnerable. Many of us need to practice good security habits, being careful how we access privileged information, and perhaps even finding ways to do so only through containers or virtual machines that may protect us against some of the malware that could slip past us.
Security is a pain. It’s annoying, it can slow us down, and it is hard to adhere to best practices consistently and constantly. However, we need to be careful and vigilant against the regular stream of attacks that will likely continue for the foreseeable future.