What’s Inside Tokopedia 15m Breach Data? — An Analysis

Rafi Ramzy
7 min read · May 13, 2020


Source: images.glints.com and underthebreach post

By now, most of us have heard the news of the data breach that hit Tokopedia, one of Indonesia's largest e-commerce companies, at the beginning of May 2020. The Twitter account @underthebreach posted that the hack occurred in March and affected the personal information of 15 million users.

The data was exposed somewhere on the internet; anyone who obtains it can quickly access the information of Tokopedia users.

Data Information

This incident resulted in 15 million rows of data being posted to a popular hacking forum. The data included over 12 million unique email addresses, along with names, genders, birth dates, and password hashes. I wondered whether there was something interesting in the data: not how much information it provided, but what we could learn from it.

Of course, searching through and analyzing millions of records manually would take many hours of work. There are many ways to handle the data, but which is the fastest and most efficient?

Storing The Raw Data

Luckily, there is a tool called Elasticsearch. Since we will be working with a large dataset, Elasticsearch is a great option: it is a fast and simple way (some people describe it as a search engine) to store and analyze data.

Elasticsearch is commonly used as part of the ELK Stack (Elasticsearch, Logstash, and Kibana). We use Elasticsearch to manage the large dataset, then visualize and analyze it with Kibana.

The data I was given is a txt file with millions of rows inside. Rows are separated by newlines, and values within a row are separated by tabs. Automating the storing process with a Python script makes processing the data fast and easy: the script reads each row, splits it into fields, stores them in an object, then sends the objects as documents to an Elasticsearch index.

Here is the code that I used to automate the storing process:

At first, I tried the traditional approach of sending each row to the Elasticsearch index one by one with the es module. It took about a day per million rows (perhaps 14 days or more for all 15 million; yes, I was just experimenting).

While importing the data into ELK, I found that the process can be sped up considerably by buffering rows into a list first, sending the batch, and then clearing the list. After several hours of tweaking the code, I settled on a very fast configuration: buffer 10 thousand values into a list, then send them with 8 workers and a chunk size of 500 using parallel_bulk.

Parameters like timeout, max_retries, and retry_on_timeout also changed the whole game. Now it takes only 1.5 hours to store the data (down from roughly 15 days to 1.5 hours, although this also depends on the server's specs).
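As an illustration of this setup, here is a minimal sketch of such an ingest script, not the exact code used in the analysis. The field names, index name, and Elasticsearch URL are assumptions made for the example; parallel_bulk pulls from a generator, which plays the same role as the row buffer described above:

```python
# Hypothetical column layout; the real dump's field order may differ.
FIELDS = ("user_id", "name", "email", "gender", "birth_date",
          "phone", "uniq_char", "user_pwd", "user_pwd_1")

def rows_to_actions(path, index="tokopedia-leak"):
    """Yield one Elasticsearch bulk action per tab-separated line of the dump."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            values = line.rstrip("\n").split("\t")
            yield {"_index": index, "_source": dict(zip(FIELDS, values))}

def ingest(path):
    # Imported here so the parsing helper above is usable even without
    # the elasticsearch package installed.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch(
        "http://localhost:9200",
        timeout=60, max_retries=5, retry_on_timeout=True,  # the game-changing parameters
    )
    # 8 worker threads, 500 actions per bulk request, as described above.
    for ok, info in parallel_bulk(es, rows_to_actions(path),
                                  thread_count=8, chunk_size=500):
        if not ok:
            print("failed:", info)
```

Streaming from a generator keeps memory flat no matter how large the file is, while parallel_bulk handles the batching and threading internally.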

About 14,989,997 records were stored in the Elasticsearch index, with each user stored as a single document. Each document has several variables:

Unfortunately, only a few of the variables mentioned above actually contain information.

Visualizing The Data

With Elasticsearch done, it is time to let Kibana work its magic on the data. Millions of records easily turn into tables and charts that are readable and analyzable.

The following charts and tables were made with Kibana visualizations from the data stored in the Elasticsearch index. Some charts are limited to displaying only part of the data because of its sheer volume; the purpose is simply to show what information can be extracted.

a. Count User Email By Domain Name — Pie Chart

This pie chart shows the percentage and count of the email domains that users chose for their Tokopedia accounts.
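The same breakdown can be reproduced outside Kibana with a few lines of Python. This is a sketch using made-up sample addresses, not the breach data itself:

```python
from collections import Counter

def domain_counts(emails):
    """Count email domains case-insensitively, skipping malformed rows."""
    counts = Counter()
    for email in emails:
        local, sep, domain = email.lower().rpartition("@")
        if sep and domain:  # rows without an '@' are ignored
            counts[domain] += 1
    return counts

sample = ["a@gmail.com", "b@yahoo.com", "c@GMAIL.com", "not-an-email"]
print(domain_counts(sample).most_common())
# [('gmail.com', 2), ('yahoo.com', 1)]
```

Counter.most_common() gives exactly the ranked domain list that the pie chart and the data tables below are built from.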

b. Count User Email By Domain Name — Data Table

c. Grouping User Email By Domain Name — Data Table

Plot twist: we found the personal data of several high-profile Instagram accounts inside (thanks to my friend for pointing this out).

d. Count User Phone Number By Operators (Indonesia) — Data Table

I am not publishing information related to users' phone numbers, because some of these numbers are registered on WhatsApp (verified while we were doing reconnaissance). So we only show the numbers grouped by operator.
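Grouping by operator boils down to mapping each number's leading digits to a carrier. A minimal sketch, using a small illustrative subset of Indonesian mobile prefixes (the real mapping is much larger):

```python
# Illustrative subset only; each Indonesian operator owns many more prefixes.
OPERATOR_PREFIXES = {
    "0811": "Telkomsel", "0812": "Telkomsel", "0813": "Telkomsel",
    "0815": "Indosat",   "0816": "Indosat",
    "0817": "XL",        "0818": "XL",
    "0896": "Tri",
    "0881": "Smartfren",
}

def operator_of(phone):
    """Normalize a +62/62-prefixed number to 0-prefixed form, then look up its prefix."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    if digits.startswith("62"):
        digits = "0" + digits[2:]
    return OPERATOR_PREFIXES.get(digits[:4], "Unknown")

print(operator_of("+62 812-3456-7890"))  # Telkomsel
```

Feeding every number through operator_of and counting the results yields the operator data table without ever exposing an individual phone number.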

e. Count User Age Range By Year Of Birth

There are many fake users, or simply Tokopedia test accounts, so the dates of birth may contain fake information too.

f. Count User Year Of Birth

As mentioned above, here is another example of fake users or Tokopedia test accounts (year of birth: 0001).

g. ID Mail Domain

I decided to count occurrences of the .id domain to clarify how many affected users use Indonesian mail servers (or, again, more fake mail).

The results are quite high, which convinces us that this is the biggest Indonesian data leak ever (not to mention the 91 million records the attackers are already selling on the dark web).

h. Indonesian Company

Some users also registered with their company email addresses. At first, we did not know why. After some analysis, we concluded that these users might have registered on Tokopedia to buy supplies for their companies, since that makes it easy to generate invoices.

But considering the number of accounts registered with company email addresses, I am not sure they were all created for company purposes.

As you can see, many users registered their Tokopedia accounts with company email addresses (banks, telecommunication firms, even state-owned companies).

i. Government Mail Domain

Of all the findings, the most interesting is that some users registered with their government email addresses. For now, we can at least assume that an attacker could obtain the personal contact email of any government employee listed in the breached data.

But does this make government users more vulnerable to attack? Well, for now, an attacker may gain:

a. The leaked email addresses of government users

b. The personal phone numbers registered and connected to those email addresses

c. Insight into which government departments have low security awareness, since their staff register official email addresses on e-commerce sites

j. Password Hash Occurrence

Based on the forum and some news reports, Tokopedia uses SHA2-384 to hash user passwords. A SHA2-384 digest is 384 bits long, which is 96 hexadecimal characters, and this matches the length of the hashes of Tokopedia users' passwords (in the user_pwd_1 variable).
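This length claim is easy to verify with Python's standard library (the password below is just a placeholder):

```python
import hashlib

# A SHA2-384 digest is 384 bits; rendered as hex, that is 384 / 4 = 96
# characters, matching the hashes seen in the user_pwd_1 column.
sha384_hex = hashlib.sha384(b"placeholder-password").hexdigest()
print(len(sha384_hex))  # 96

# For comparison, an MD5 digest is 128 bits = 32 hex characters.
md5_hex = hashlib.md5(b"placeholder-password").hexdigest()
print(len(md5_hex))     # 32
```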

I analyzed the records that share a password hash, and the results show the same hash appearing across different real users, not just Tokopedia test accounts or fake users. As an example, I took one hash with multiple occurrences and searched for it across the documents: seven users share that same password hash. If the hashing used a unique salt per user, identical passwords would still produce different hashes, so this suggests the salt, if there is one, is not unique per user.

raidforums.com

According to one user on raidforums.com, there are two variables containing hashed passwords, named user_pwd and user_pwd_1. The information I found on the forums suggests that user_pwd contains an MD5 hash that may reveal the user's password, while user_pwd_1 contains a SHA-384 hash that may be salted with characters taken from the uniq_char column. But I am not sure about this.
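One way to probe this, without claiming to know Tokopedia's actual scheme, is to hash a known test password under a few candidate salting schemes and see which, if any, reproduces a given hash. Everything here (the password, the salt, and the schemes themselves) is hypothetical:

```python
import hashlib

def candidate_hashes(password, salt):
    """Hash a password under a few guessed schemes; none is confirmed as Tokopedia's."""
    pw, s = password.encode(), salt.encode()
    return {
        "md5(pw)":           hashlib.md5(pw).hexdigest(),
        "sha384(pw)":        hashlib.sha384(pw).hexdigest(),
        "sha384(pw + salt)": hashlib.sha384(pw + s).hexdigest(),
        "sha384(salt + pw)": hashlib.sha384(s + pw).hexdigest(),
    }

def matching_schemes(leaked_hash, password, salt):
    """Return the guessed schemes that reproduce the given hash."""
    return [name for name, h in candidate_hashes(password, salt).items()
            if h == leaked_hash.lower()]
```

Run against a hash you control (say, from a test account whose password and uniq_char you know), this would reveal which scheme is in use; against the leak itself, it only works if a plaintext is already known.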

MD5 Hashed Password
SHA2–384 With a dash of salt

Conclusion

Preventing and detecting data leaks requires constant effort and investment from organizations. In this story, we demonstrated how a data breach could impact your organization.

While the security impact should be clear by now, one more note: when security is at stake, it is not a matter of hours but of minutes, even seconds, before an anonymous attacker tries to attack and hack your company.

Always take preventive measures to keep your data and your company's data safe, especially if you work at a bank or in government. Try not to use company accounts for personal purposes.

In this case, no one has revealed who was behind the Tokopedia data leak, but when building an application, a company must understand how important customer data is, and never send it unencrypted over the internet.
