Creating Synthetic Test Data
When testing mobile devices, there may be times when you opt to create fake test data, and there are a variety of reasons for doing so. For example, your test data may require Personally Identifiable Information (PII). This is becoming more common with applications that require users to validate their identity. When sharing copies of the data structures associated with the application, the examiner creating the test data may want to protect their own data or that of others. This is where tools that replicate your data in an anonymized form come in handy.
Another reason to create fake test data is that the source data comes from a real case. For example, one challenge of writing parsers for warrant return data is that the data comes from real cases and holds PII. Using this data for testing or sharing may violate privacy laws and policies. To pass the data to a third party who will create a parser or validate an existing one, it is best to create fake data in the same schema in which the real data is stored. The same applies if you discover a new data structure in real evidence and want to replace its contents with synthetic data for a presentation or blog post discussing your findings. Synthetic data also helps when the real data set includes content that is inappropriate for the systems or places where it will be displayed.
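As a rough illustration of "same schema, fake contents", here is a minimal Python sketch using only the standard library. The column names, the sample rows, and the fake_value mapping are all hypothetical and chosen for illustration; they are not taken from any real warrant return format. The idea is simply to keep the header and row count while replacing every cell.

```python
import csv
import io
import random

# Hypothetical sample standing in for a sensitive export. The column names
# and values are made up for illustration (reserved test domain and IPs).
REAL_CSV = """account_id,first_name,email,ip_address
1001,Alice,alice@realmail.test,203.0.113.7
1002,Bob,bob@realmail.test,198.51.100.23
"""

FAKE_NAMES = ["Avery", "Jordan", "Riley", "Morgan", "Casey"]


def fake_value(column, rng):
    """Return a synthetic value for a column name (hypothetical mapping)."""
    if column == "account_id":
        return str(rng.randint(100000, 999999))
    if column == "first_name":
        return rng.choice(FAKE_NAMES)
    if column == "email":
        return f"user{rng.randint(1, 9999)}@example.com"
    if column == "ip_address":
        # 192.0.2.0/24 is reserved for documentation (RFC 5737).
        return f"192.0.2.{rng.randint(1, 254)}"
    raise ValueError(f"no generator defined for column {column!r}")


def synthesize(csv_text, seed=0):
    """Keep the header and row count, replace every cell with fake data."""
    rng = random.Random(seed)
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for _ in reader:
        writer.writerow({col: fake_value(col, rng) for col in reader.fieldnames})
    return out.getvalue()


synthetic_csv = synthesize(REAL_CSV)
print(synthetic_csv)
```

Raising an error for an unmapped column is a deliberate choice in this sketch: it is safer for the script to stop than to silently copy a real value into the output.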
Luckily, there are a couple of tools that can be used for this purpose, including Mockaroo and generatedata. Each of these allows you to create datasets that are realistic but free of PII. There are websites that will generate data about one fictitious user at a time, which are fantastic for creating profiles for test devices. The advantage of Mockaroo and generatedata is that you can create a large volume of data for a variety of fields and formats simultaneously. This makes these tools well suited to increasing the volume of a current test set in order to run a tool against more rigorous data. The generated data sets can also provide additional variation that may not exist in a user-generated data set (e.g. character set, string length) and help remove some of the bias from the dataset that was initially generated. It is important to note that these tools are only useful when the data structures are already understood and you know what data to expect and in what format. But if you know the schema, these generators can quickly create large randomized datasets with the fields you are interested in.
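To make the point about character set and string length concrete, here is a small Python sketch that produces strings across several character pools and lengths. The pools and lengths are assumptions picked for illustration; they do not come from either tool, and you would adjust them to whatever your parser needs to handle.

```python
import random
import string

# Hypothetical character pools; extend with whatever your parser must handle.
CHAR_POOLS = {
    "ascii": string.ascii_letters + string.digits,
    "punctuation": string.punctuation + " ",
    "accented": "áéíóúñüçßø",
    "cjk": "日本語中文한국어",
    "emoji": "😀🔥📱🔒",
}

# Hypothetical lengths, including the empty string and a long value.
LENGTHS = [0, 1, 16, 255, 1024]


def edge_case_strings(seed=0):
    """Yield (pool_name, length, value) for every pool and length combination."""
    rng = random.Random(seed)
    for pool_name, pool in CHAR_POOLS.items():
        for length in LENGTHS:
            value = "".join(rng.choice(pool) for _ in range(length))
            yield pool_name, length, value


cases = list(edge_case_strings())
for pool_name, length, value in cases[:5]:
    # Show only the first 20 characters of each value.
    print(pool_name, length, repr(value[:20]))
```

A fixed seed keeps the output repeatable, which helps when you need to reproduce a parser failure against the same input.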
Mockaroo
Mockaroo allows you to either start from existing data in CSV format or begin by drafting a new schema. To upload an existing dataset as the basis for a random dataset, first name your data set and upload your CSV file as shown in Figure 1. This can be useful if you have a test data export that needs more rows and randomization. NOTE: Do NOT load real case data into the platform.
Whether you are using an uploaded data set or creating fresh data, the next step is to go to Schemas on the top bar and click the Create A Schema button. This brings you to a page where you can create or alter fields and choose a data type for each. There are a variety of data types to choose from, including fun ones like “Catch Phrase”, which strings together buzzwords. Some of the choices can be seen in Figure 2.
In this instance we are going to generate data that includes First Name, Last Name, Favorite Movie, Catch Phrase, Email, Phone Number, and IP Address. You can see that I was able to select the desired format for the phone number. I like using some of the longer strings because they integrate characters I may not have thought of, as seen in the preview of the dataset in Figure 3. Also note that you can set the number of rows before you download your data.
Generatedata
Generatedata is similar to Mockaroo in that it allows you to create and preview data. What I like about this site is that you can create a variety of data formats, including JSON, XML, and SQL, in addition to CSV. The site lets you either use a Quick Start generator, where you click a few buttons to pick data and format, or go straight to the generator. The Quick Start menu is shown in Figure 4. Selected buttons turn green.
Once you have either made your selections or skipped the Quick Start, you will be brought to a page showing your selections, where you can edit the data and options while seeing your output in the window below, as shown in Figure 5. This feature is great for seeing what the output in your designated file type will look like as you change options. The next step is to select the GENERATE button and choose the total number of rows. The free version does limit you to 100 rows.
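To give a sense of how the same synthetic rows look in different output formats, here is a short Python sketch that renders two rows as CSV, JSON, and SQL. This is not how generatedata works internally; the rows, the field names, and the table name "users" are all hypothetical.

```python
import csv
import io
import json

# Hypothetical synthetic rows; field names are for illustration only.
ROWS = [
    {"first_name": "Avery", "email": "avery@example.com", "ip_address": "192.0.2.10"},
    {"first_name": "O'Neil", "email": "oneil@example.com", "ip_address": "192.0.2.11"},
]


def to_csv(rows):
    """Render rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def to_json(rows):
    """Render rows as a JSON array of objects."""
    return json.dumps(rows, indent=2)


def to_sql(rows, table="users"):
    """Build INSERT statements; single quotes are doubled per standard SQL."""
    columns = ", ".join(rows[0])
    statements = []
    for row in rows:
        values = ", ".join("'" + str(v).replace("'", "''") + "'" for v in row.values())
        statements.append(f"INSERT INTO {table} ({columns}) VALUES ({values});")
    return "\n".join(statements)


print(to_csv(ROWS))
print(to_json(ROWS))
print(to_sql(ROWS))
```

The name O'Neil is included on purpose: an embedded quote is the kind of value that tends to break a parser or an import script, so it is worth having in a test set. The string-built SQL here is only meant for producing a test fixture file; when inserting into a live database, parameterized queries are the safer route.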
Conclusion
These dataset generation tools can be useful for creating larger datasets for testing and parser development. Each tool has pros and cons. For example, Mockaroo allows larger data set generation in the free version and lets you upload an existing dataset to extend. Some of the great features of generatedata are the choice of output data formats and the ability to see a preview as you change selections. Please remember not to load real data that may include case information or PII into web-based tools like the ones described in this post.
I hope this helps you create test data sets in the future and saves you time. If you have any questions, please reach out to me on Twitter @B1N2H3X.