An exploratory analysis of the risk of being offended on the Internet

Questionnaire data are used to identify the socio-demographic and risk-awareness characteristics of users offended on the Internet. The data comprise a representative sample of 3,000 individuals and contain information on employment, education, age, the frequency of Internet usage and the security measures taken by the users. By means of a cluster analysis within the sub-sample of offended users, we identify a female group in which employment and education are high, a male cluster with similar characteristics, a group of urban users with low security awareness and a group of young users. Regressions show that the frequency of Internet use increases the risk of being offended on the Internet, while communicating only with people known in real life reduces it.


Introduction
The growing number of providers and users of Internet services and social communication networks raises security issues and has resulted in the emergence of cyber-crime research (see, e.g., Hartel et al., 2011). This article uses questionnaire data and applies a cluster analysis to identify users who were subject to some form of offense on the Internet and social media, together with their characteristics. In a second step, we analyze how these characteristics are related to the likelihood of being offended on the Internet. In particular, we investigate how different safety measures taken by users affect their protection against cyber-crime. In parallel to the emergence of cyber-crime science, the criminal aspects (such as theft of data, hacking, fraud, etc.) have become of particular interest to governmental institutions such as the Ministry of the Interior and police authorities (see, e.g., the study of Kirchner et al., 2015, commissioned by the Austrian Federal Ministry of the Interior). To implement policies that aim to improve cyber-security and to reduce crime (see, e.g., Becker, 1968; Freeman, 1999; Hartel et al., 2011; Dimkov, 2012), knowledge about the actual number of crimes committed, the socio-demographic structure of the users offended, the factors (variables) raising the probability of an offense and the cost of Internet crime becomes important (for a cost-benefit analysis for hackers and the cost of cyber-crime see, e.g., Kshetri, 2010; Anderson et al., 2013; Cook et al., 2014).
In addition, governmental as well as non-governmental institutions provide guidelines on how to use information technology responsibly. Such commandments are provided, e.g., by CPSR (2015), E.C. (2016) or OeIAT (2016). For these institutions, knowledge of the socio-demographic structure of the users offended and of the user characteristics connected to offenses can help to provide target-group-specific information, with the goal of increasing risk-awareness and reducing the risk of being offended.
Regarding the effectiveness of security awareness, Bullée et al. (2015) showed in experiments that measures to increase security awareness have a statistically significant effect.
Let us relate this article to the recent literature: An overview of recent developments and results in cyber-crime science is provided, e.g., in Hartel et al. (2011) and Dimkov (2012). Regarding academic publications in the field of cyber-crime research, Hartel et al. (2011) searched intensively through the literature in various academic disciplines and concluded that "In spite of our efforts we have failed to find documented scientific studies of how Information Security effectively prevents cyber-crime." Looking for causes of this gap, the authors claim that problems in information security are hardly ever reported to the police, for several reasons. For example, a problem in the information security system can, but need not, result in a crime, or firms try to solve problems internally.
Cyber-bullying was investigated in the empirical study of Hinduja and Patchin (2008). The authors used an on-line survey tool to collect data from 6,800 users between December 2004 and January 2005. After restricting the sample to users not older than 17 years and cleaning the data, the authors ended up with data from 1,378 users. The response variables constructed by the authors are two victimization variables ("general/serious cyber-bullying victimization") and two offending variables ("general/serious cyber-bullying offending"). Regarding serious cyber-bullying victimization, the authors observe (by applying logistic regression) that the time spent at the computer, school problems and being a bullying victim in real life are positively related to victimization. Other variables such as gender, age, black/white and peer effects turned out to be insignificant. Due to the different age structure of the users, relating the study of Hinduja and Patchin (2008) to the results obtained in this article is difficult.
Information security awareness of Internet users was analyzed in Tsohou et al. (2008) as well as Talib et al. (2010). While Tsohou et al. (2008) provide an overview of information security awareness, the study of Talib et al. (2010) is based on survey data containing 333 observations. The authors argue that, compared to private use, clearer legislation and regulation about IT security exist at an individual's workplace. Because of this, the authors claim that learning about Internet security mainly takes place at an individual's workplace. Positive spill-over effects on security awareness at home are then observed. Moreover, information on "the perception of security in e-commerce B2C (business to customer) and C2C (customer to customer) websites" is provided by Halaweh and Fidler (2008), who followed a qualitative approach by interviewing fifteen customers and twelve organizations' managers and their IT staff. Kirchner et al. (2015) analyzed which crime-relevant phenomena and activities occur in social media, how widespread they are, and which methods were applied to attack users. By using questionnaire data containing information from 3,000 individuals, the study shows that Facebook (used by 62% of the respondents), WhatsApp (50%) and YouTube (46%) are the social media used most frequently in the age group of 14 to 49 years. Regarding police-relevant issues, Kirchner et al. (2015) observed that defective software/malware, hacking, fake accounts, cyber-mobbing (see also Schneider et al., 2013, and the literature cited there), phishing, cyber-bullying (see also Hinduja and Patchin, 2008), cyber-stalking, profile copying, sexting (see also Lee et al., 2013), and happy slapping are the most frequent ways in which users were offended (the order of these terms corresponds to their frequency of occurrence).
This article uses the questionnaire data collected by Kirchner et al. (2015) and identifies groups of offended Internet users. In particular, Section 2 describes the data. To obtain information on the security awareness and the socio-demographic characteristics of the users offended, Section 3 first presents results obtained by means of a cluster analysis. In a second step, logit and probit regressions are performed to investigate the impact of user characteristics on the risk of being offended on the Internet. Section 4 concludes.

2 Data
A first step in investigating the risk of being offended on the Internet is to look at the number of notifications and complaints collected by police authorities. For example, the Austrian Ministry of the Interior collects the number of notifications on a yearly basis (for Austria, see, e.g., BM.I, 2015, "Austrian Security Report"). This report shows the following: For 2014 a decline in the area of Internet crime is reported (-10.8% compared to 2013), while over the last decade an increase from 1,794 notified offenses in 2005 to 8,966 notified offenses in 2014 is observed. After the significant rise in the last decade and the decrease in 2014, the number of criminal offenses is below 10,000, which corresponds to approximately 0.1% of the total Austrian population. The notified offenses are found mainly in the area of cyber-crime in a broader sense and, in particular, in the field of Internet fraud.
During the same period, the number of complaints also increased enormously, from 1,151 in 2005 to 7,667 in 2013. In parallel to the number of notifications, the complaints with respect to Internet fraud fell by 13.5% in 2014. However, the value of 6,635 complaints in 2014 is only marginally higher than the value in 2012, when 6,598 complaints were observed. In addition, police authorities are also concerned about a large dark field in the area of cyber-crime, and point out that new criminal phenomena are emerging (see Bundeskriminalamt, 2015).
To obtain more detailed information, this article uses data from the study of Kirchner et al. (2015), where data on socio-demographic factors as well as on offenses on the Internet were collected for a target group of $N = 3,000$ representative users aged between 14 and 49 years (more details on the data collection process are provided in Appendix B). Table 1 presents some descriptive statistics obtained from this questionnaire data. For the sample of $N = 3,000$, the number of people personally confronted with cyber-crime is $O = 470$. Comparing the rate $O/N \approx 16\%$ to the notification rate of approximately 0.13%, based on the data provided in BM.I (2015), strongly supports the arguments provided, e.g., in Appendix A of Hartel et al. (2011), who claimed that the number of offenses is above the number of offenses notified to the police. The difference observed between the male and the female population turned out to be small (and statistically insignificant at the 5% significance level).

Table 1 (data: Kirchner et al., 2015). Sample size $N = 3,000$. The table presents the number of men and women who were already confronted with cyber-crime. O stands for personally confronted with cyber-crime, while N.A. stands for no answer.
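The comparison between the survey-based rate and the notification rate quoted above is simple arithmetic. The following minimal sketch reproduces it from the figures given in the text; the variable names are illustrative, and the notification rate is entered as the rounded value of approximately 0.13%, so the resulting ratio is only indicative.

```python
# Figures quoted in the text; the notification rate is the rounded value
# of approximately 0.13 %, so the ratio below is only indicative.
offended = 470              # O: respondents personally confronted with cyber-crime
sample_size = 3000          # N: size of the survey sample
notification_rate = 0.0013  # approximate share of notified offenses (BM.I, 2015)

# Share of offended respondents in the sample, O/N.
survey_rate = offended / sample_size

# How many times larger the survey-based rate is than the notification rate.
ratio = survey_rate / notification_rate

print(f"survey rate O/N: {survey_rate:.4f}")
print(f"survey rate relative to notification rate: {ratio:.0f}x")
```

Under these inputs the survey-based rate exceeds the notification rate by a factor on the order of one hundred.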
Next, the data collected by Kirchner et al. (2015) are used to construct $k' = 21$ variables. In more formal terms, the data comprise the following variables:
$y_n$: The binary variable Attacked, where 0 implies that the corresponding individual was not personally offended on the Internet or social media, while the variable is 1 if the user was offended personally.
Hence, $O = \sum_{n=1}^{N} y_n$.
$x_{n2}$: The variable Frequency, measuring the frequency of Internet and social network usage. This variable is an integer ranging from 0 to 2. The value 0 stands for no current use of social networks, 1 stands for occasional use and 2 for frequent use.
$x_{n3}$: The binary variable Gender, where 0 stands for male and 1 for female.
$x_{n4}$: The integer variable Age, measured in years.
$x_{n5}$: The variable Inhabitants approximates the number of inhabitants of the city where the individual currently lives. Here, the following categories are used: 1 stands for < 10,000 inhabitants, 2 stands for at least 10,000 and < 50,000 inhabitants, 3 stands for at least 50,000 and < 100,000 inhabitants, 4 stands for at least 100,000 and < 250,000 inhabitants, while 5 stands for ≥ 250,000 inhabitants.
$x_{n6}$: The integer variable Employment denotes the current employment status, where 0 stands for unemployment, 1 for part-time employment and 2 for full employment. On leave, retirement, apprenticeship, civil or military service and pupils are treated as missing values.
$x_{n7}$: The variable Human Capital (Education), measuring the highest level of education obtained by individual $n$. This variable is equal to 1 if no school was completed, to 2 if the highest degree is from a secondary modern school ("Pflichtschulabschluss" in the Austrian school system), to 3 if an apprenticeship or a school without general qualification for university entrance ("Berufsbildende mittlere Schule" or "Allgemeinbildende höhere Schule ohne Matura" in the Austrian school system) was completed, to 4 if a grammar school or an equivalent degree ("Berufsbildende höhere Schule" (e.g., HAK, HLW, HTL) in the Austrian school system) was completed, while 5 stands for some university degree (or (almost) equivalent degrees like "Abiturientenlehrgang, Kollege, Pädagogische Akademie" in the Austrian education system).
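The coding of the categorical variables described above can be sketched as follows. This is a minimal illustration of the stated coding rules, not the procedure used to prepare the original data; the function names and the string labels for the employment status are hypothetical.

```python
from typing import Optional


def inhabitants_category(population: int) -> int:
    """Map a city's population to the Inhabitants code 1..5 described in the text."""
    if population < 0:
        raise ValueError("population must be non-negative")
    # Exclusive upper bounds of categories 1 to 4; category 5 is open-ended.
    upper_bounds = [10_000, 50_000, 100_000, 250_000]
    for code, upper in enumerate(upper_bounds, start=1):
        if population < upper:
            return code
    return 5


# Hypothetical status labels; only these three states receive a code.
EMPLOYMENT_CODES = {"unemployed": 0, "part_time": 1, "full_time": 2}


def employment_code(status: str) -> Optional[int]:
    """Return the Employment code, or None (missing value) for any other status,
    e.g. on leave, retirement, apprenticeship, civil or military service, pupils."""
    return EMPLOYMENT_CODES.get(status)


print(inhabitants_category(45_000), employment_code("retired"))
```

Returning `None` for statuses outside the three coded states mirrors the text's treatment of these cases as missing values.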

Results
This section investigates the questions: (i) 'What groups of persons show an insufficient problem-consciousness concerning cyber-crime and are thus at particular risk?' and (ii) 'What variables increase/decrease the risk of being offended on the Internet?'. Regarding the first question we perform a cluster analysis, while the second question is investigated by means of regressions.
Hence, the first goal of this exploratory analysis is to group (cluster) the data described in Section 2 such that the individuals in the same cluster are more similar to each other than to the individuals in the other clusters. To perform the cluster analysis in a more parsimonious setting and to prevent similarities in $x_{nS,1}, \ldots, x_{nS,14}$ from dominating the clustering results, only the security variable $x_{nS,1}$ is selected from $x_{nS,1}, \ldots, x_{nS,14}$ when performing the cluster analysis. Hence, the observations used to perform the cluster analysis are $x_n = (y_n, x_{n2}, \ldots, x_{n7}, x_{nS,1})^\top \in \mathbb{R}^k$, where $k = 8$, for $n = 1, \ldots, N = 3,000$. The data used to perform the cluster analysis are abbreviated by $X \in \mathbb{R}^{N \times k}$, collecting the observations $x_n$, $n = 1, \ldots, N$.
Additionally, a distance function measuring the dissimilarity between the observations $x_n$ and $x_m$ has to be chosen. In this section we apply $l_1$-distances (= sums of absolute differences or Manhattan distances; see equation (3) in Appendix C). To measure dissimilarities between clusters, the "unweighted pair-group average method" is used (see equation (5) in Appendix C).
In this article we apply agglomerative hierarchical clustering techniques, which start with $N$ clusters (i.e. each observation $n$ is a cluster) and then merge groups based on the distance between them.
This merging procedure is continued until one cluster (containing all elements of $X$) remains. In particular, the agglomerative hierarchical clustering algorithm agnes, described in Kaufman and Rousseeuw (1990, Chapter 5) and implemented in the software package R by Maechler et al. (2015), is applied. For more details see Appendix C.
By applying this clustering technique to our data $X$, we observe a high agglomerative coefficient of $AC = 0.98$ (see equation (6) in Appendix C), which measures the quality of the clustering method applied to the data. Based on the dendrogram (see Figure 1 in Appendix C) and with the goal of obtaining a parsimonious description of the data, we decided to present the result where the data $X$ are clustered into twelve groups.
This decision is based on the observation that for the branches at the top of the clustering tree larger differences are observed, while for a larger number of clusters the differences in the variables of interest for this study become small. Table 2 presents results when $I = 12$ groups are considered. Columns 2 to 13 present the group-specific mean values and the group-specific standard deviations within the corresponding cluster $C_i$. The last column presents the sample means and the sample standard deviations for each variable, obtained from $N = 3,000$ observations. The last row presents the number of individuals assigned to cluster $C_i$, $i = 1, \ldots, I = 12$. Note that the mean value of the variable Attacked corresponds to the share of individuals offended on the net, i.e. $O/N = 470/3000 = 0.1567$. In the following we focus on individuals who have been offended on the Internet or social media. From Table 2 we observe that all individuals offended on the Internet are contained in the clusters $C_6$, $C_7$, $C_9$, $C_{10}$ and $C_{11}$. These clusters contain only offended users (note that the within-group sample standard deviations of the variable Attacked are zero). Adding up the sizes of these clusters yields 470.
Regarding the socio-demographic factors as well as risk-awareness we observe the following: Class $C_6$ contains almost only women (the group-specific mean of the gender variable is 0.923), with a mean age of around 34 years and a within-group standard deviation of the variable Age of 9.282, i.e. the age structure of this class approximately corresponds to the age structure of the full sample. In addition, the members of $C_6$ live in smaller cities and have, on average, a high level of education as well as employment. The class-specific mean of the variable Security $x_{nS,1}$ is close to the mean of the full sample (see last column).
The majority in class $C_7$ is male. The group-specific means of the variables Age, Inhabitants and $x_{nS,1}$ are close to the values in cluster $C_6$. The group-specific means of the variables Employment and Human Capital are slightly smaller than the values in group $C_6$.
Class $C_9$ contains users who have been offended and live in larger cities, in particular, mainly in Vienna.
For cluster $C_9$, with the exception of the size of the city, most group-specific means are almost the same as the means of the full sample (given the standard deviations of these variables); however, for the users in cluster $C_9$ the security awareness measured by the variable $x_{nS,1}$ is very low.
Class $C_{10}$ contains young users who were offended. Last but not least, class $C_{11}$ contains only four group members. For these users we get the contradictory result that they were offended ($y_n = 1$) although they did not use the Internet ($x_{n2} = 0$). In addition, the security awareness of these persons is high ($x_{nS,1} = 1$). This contradictory result can be explained either by mis-reporting (e.g. some interviewees hardly using the Internet reported that they currently do not use the Internet) or by these users having changed their behavior after they were offended.
After having identified classes of offended users and their characteristics, we investigate the second question, on variables increasing or decreasing the risk of being offended on the Internet. Given our data set we analyze how the variable Attacked, i.e. $y_n$, is affected by the variables Frequency, Gender, Age, Inhabitants, Employment, Human Capital and the Security/Incertitude variables $x_{nS,j}$. For example, this allows us to investigate whether and how the probability of being offended on the Internet is affected by gender, age, security awareness, etc.
To investigate these questions we have to account for the fact that $y_n$ is a binary variable. In formal terms, we consider the events $\{y_n = 1\}$ and $\{y_n = 0\}$. Logit and probit regressions (see, e.g., Greene, 1997; Cameron and Trivedi, 2005) are applied to obtain estimates of how the conditional probability $P(y_n = 1 \mid \tilde{x}_n)$ depends on the explanatory variables $\tilde{x}_n := (1, x_{n2}, x_{n3}, \ldots, x_{n7}, x_{nS,1}, \ldots, x_{nS,14})^\top$. By means of the 1 as the first coordinate of $\tilde{x}_n$, we include an intercept term. In addition, we abstract from feedback effects from $\tilde{x}_n$ on $y_n$ (in more technical terms, we assume that the regressors $\tilde{x}_n$ are exogenous; see, e.g., Davidson and MacKinnon, 1993, p. 624-627).
With probit and logit models

$$P(y_n = 1 \mid \tilde{x}_n) = F(\tilde{x}_n^\top \beta).$$

The regression parameter $\beta_i$ describes the impact of $\tilde{x}_{ni}$, i.e. the $i$th coordinate of $\tilde{x}_n$, on the conditional probability $P(y_n = 1 \mid \tilde{x}_n)$ (equal to the conditional expectation $E(y_n \mid \tilde{x}_n)$), for $i = 0, 2, \ldots, k' = 21$, while $F(\cdot)$ is called link function. For the logit model the link function is provided by the logistic function, i.e.

$$F(\tilde{x}_n^\top \beta) = \frac{\exp(\tilde{x}_n^\top \beta)}{1 + \exp(\tilde{x}_n^\top \beta)},$$

while for the probit model

$$F(\tilde{x}_n^\top \beta) = \Phi(\tilde{x}_n^\top \beta),$$

where $\Phi(\cdot)$ abbreviates the distribution function of the standard normal distribution. In this article parameter estimates, denoted by $\hat{\beta}$, of the parameter vector $\beta$ are obtained by means of maximum likelihood estimation (by using the glm function contained in the R package AER). To investigate the question how $\tilde{x}_{ni}$ affects $P(y_n = 1 \mid \tilde{x}_n)$, the marginal effects

$$\frac{\partial P(y_n = 1 \mid \tilde{x}_n)}{\partial \tilde{x}_{ni}} = F'(\tilde{x}_n^\top \beta)\, \beta_i \qquad (2)$$

can be obtained, for $i = 0, 2, 3, \ldots, k' = 21$ (see, e.g., Greene, 1997; Cameron and Trivedi, 2005). In contrast to the linear regression model, the marginal effects described in (2) depend on the value of $\tilde{x}_n$ where (2) is evaluated. In the following analysis, the term $ME_i$ abbreviates the marginal effect evaluated at $E(\tilde{x}_n)$, i.e. $ME_i := F'(E(\tilde{x}_n)^\top \beta)\, \beta_i$. We obtain an estimate of the marginal effect, $\widehat{ME}_i$, by replacing $\beta$ and $E(\tilde{x}_n)$ by their finite sample analogs $\hat{\beta}$ and $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} \tilde{x}_n$.

In contrast to the assumption of exogenous regressors, some users might have decided to 'install safety software', to 'read terms and conditions carefully at every registration', etc. after they had been offended and before they were interviewed (in which case regressor endogeneity arises). If there are serious concerns that the persons interviewed behaved in this way, instrumental variable estimation should be performed, although finding good instruments for the given regression is a difficult problem.
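The logit and probit link functions and the marginal effect evaluated at the sample mean of the regressors can be sketched in a few lines. The code below is a minimal illustration using only the Python standard library; it is not the estimation code used in this article (which relies on R), the function names are hypothetical, and the coefficient vector and regressor means in the usage line are made-up numbers chosen purely for illustration.

```python
import math


def logistic_cdf(s: float) -> float:
    """Logit link: exp(s) / (1 + exp(s)), written to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def logistic_pdf(s: float) -> float:
    """Derivative of the logistic function: F(s) * (1 - F(s))."""
    f = logistic_cdf(s)
    return f * (1.0 - f)


def normal_cdf(s: float) -> float:
    """Probit link: standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(s / math.sqrt(2.0)))


def normal_pdf(s: float) -> float:
    """Derivative of the standard normal distribution function."""
    return math.exp(-0.5 * s * s) / math.sqrt(2.0 * math.pi)


def marginal_effects(beta, x_bar, density):
    """Return F'(x_bar' beta) * beta_i for every coordinate i.

    beta    : estimated coefficients (first entry is the intercept)
    x_bar   : sample means of the regressors (first entry is 1)
    density : derivative F' of the chosen link (logistic_pdf or normal_pdf)
    """
    index = sum(b * x for b, x in zip(beta, x_bar))
    scale = density(index)
    return [scale * b for b in beta]


# Made-up illustration: intercept -2.0, one regressor with coefficient 0.8
# and sample mean 1.5. These are not estimates from the article.
print(marginal_effects([-2.0, 0.8], [1.0, 1.5], logistic_pdf))
print(marginal_effects([-2.0, 0.8], [1.0, 1.5], normal_pdf))
```

Because the scaling factor $F'(\cdot)$ is the same for all coordinates, the ratio of two marginal effects equals the ratio of the corresponding coefficients within a given model.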
Although we can neither verify nor rule out that some interviewees acted in this way, we already observed inconsistent answers for a small group of interviewees in Table 2 (where we argued that this can be due to mis-reporting or to a change in behavior after an offense). To avoid possible problems arising from data points with inconsistent responses, we excluded those 48 observations $\tilde{x}_n$ where an interviewee $n$ reported $y_n = 1$ and $x_{n2} = 0$ (regressions where these observations are still included are provided in Appendix E).
Tables 3 and 4 provide the regression results. Looking at the p-values, we observe that the regression intercept and the variable Frequency are highly statistically significant in both models. The higher the variable Frequency, the larger the risk of an offense on the Internet. By means of the marginal effect we observe that a rise in the variable Frequency by an infinitesimal unit increases the probability of being offended by approximately 11% times this infinitesimal unit. At a significance level of 5% the variables Employment and Human Capital are statistically insignificant, while at the 10% significance level they are (almost) significant. In more detail, for the variable Employment the p-values for the logit and the probit model are approximately 11% and 14%, while for the education variable Human Capital the p-values are 10.4% and 9.6%, respectively. Since higher Employment reduces the risk of being offended at a significance level close to 10%, the regressions provide weak support for the learning arguments provided by Talib et al. (2010). Interestingly, higher education, measured by the variable Human Capital, raises the probability of being offended at a significance level close to 10%. The impacts of the variables Age, Gender, and Inhabitants are statistically insignificant (when applying significance levels ≤ 10%). Finally, we investigate the impacts arising from the various Security variables $x_{nS,j}$. For both the logit and the probit model, the variable $x_{nS,6}$, 'only communicate with persons known in real life', is significant at the 5% significance level. The other $x_{nS,j}$ are not statistically significant at significance levels ≤ 10%.
offended on the Internet. The cluster analysis suggests that offended users can be partitioned into four groups. The first is a mainly female group, with group members living in small cities. In this cluster employment and the level of education are high. The second group is mainly male, also living in smaller cities. For this group employment and education are also high, but slightly lower than in the female group. The third group of offended users lives mainly in large cities, with socio-demographic characteristics close to the values observed in the total sample of 3,000 individuals. However, this group exhibits the lowest awareness with respect to Internet security. The fourth group contains young users.
Second, after having identified these groups, we analyze whether the characteristics of the users, such as age and gender, as well as various protection methods applied by the users increase or decrease the risk of being offended on the Internet. By means of probit and logit regressions and applying a 5% significance level, we observe that the frequency of using the net raises the conditional probability of being offended, while communicating only with people known in real life diminishes it. Variables like age and gender turned out to be statistically insignificant.

Table 2: Results obtained from the cluster analysis. Data set $X$, $N = 3,000$ observations, $k = 8$ variables, $I = 12$ clusters and $l_1$-distances. For each variable the first row presents the group-specific sample means in the corresponding cluster $C_i$, $i = 1, \ldots, 12$, while the second row presents the group-specific sample standard deviations. The last column presents the mean values and the sample standard deviations for the corresponding variables, obtained from all observations $n = 1, \ldots, N$. The last row presents the number of individuals assigned to cluster $C_i$.

A Cyber-Crime and Cyber-Crime Research
Considering the historical development, "cyber-crime emerged from hacking. Fraud schemes in relation with Social Engineering and other criminal activities were gradually added and connected to the technical and craft skills of the early hackers" (see Kochheim, 2016). While information security research is engaged in the development of software to increase IT security, cyber-crime research is connected to criminology and other social sciences with the goal of preventing cyber-crime (see, e.g., Hartel et al., 2011). Hartel et al. (2011, Section 2) define crime science as applying scientific methods to prevent and to detect disorder, particularly crime. Then, referring to Newman (2009), the authors define cyber-crime as "behaviour in which computers or networks are a tool, a target, or a place of criminal activity." For guidelines on performing information and communication technology research see, e.g., Bailey et al. (2012).
In addition, cyber-crime can be divided into "cyber-crime in a narrower sense", where offenses are committed by using the technologies of the Internet (e.g., illegal access to a computer system), and "cyber-crime in a broader sense" (see, e.g., Bundeskriminalamt, 2015, p. 17), where the Internet is used as a communication medium for criminal activity (e.g., fraud, child pornography and the initiation of sexual contacts with minors). In this article we refer to the broader definition of cyber-crime.

B Further Information about the Data
The study of Kirchner et al. (2015) is based on two surveys: The first sample comprises data from the Austrian population aged between 14 and 49 years. The second sample considers parents (both or one parent) of children aged 10 to 13 years. In order to create the basis for the surveys and focus groups, 8 interviews with experts of the IT division of the Austrian Ministry of the Interior (BM.I) as well as police attorneys were conducted. During the expert interviews the problems of using social media and future challenges were discussed. The results of the expert interviews were used to design the questionnaires.
To obtain these data, Computer Assisted Telephone Interviews were performed using a standardized questionnaire. The data finally consist of 3,000 Austrians aged 14 to 49 years and 500 parents of children aged 10 to 13 years. According to the requirements of the study, the characteristics of gender, age and place of residence (federal state) were considered as representative criteria. To obtain these data, in total, about 50,000 people were contacted in order to achieve the desired 3,500 interviews. This corresponds to a response rate of around 7%. For about 37% of the calls, no one picked up; for about 18% the number from the phone book was invalid. Approximately 22% refused to participate in the survey and approximately 4% broke off the interview during the conversation.

Variable                  Estimate   Std. error   z-value   p-value        ME
...                        -0.1682       0.1639   -1.0260    0.3047   -0.0266
Incertitude x_{nS,12}      -0.0541       0.6816   -0.0790    0.9367   -0.0086
Security x_{nS,13}         -0.8227       0.7442   -1.1050    0.2689   -0.1302
Incertitude x_{nS,14}      -1.0265       0.7633   -1.3450    0.1787   -0.1625

Table 3: Results obtained from the logit regression. $\tilde{N} = N - 48 = 2,952$ observations, 1,708 observations used by R due to missing values. $y_n$, i.e. 'personally offended', is the dependent variable, while $x_{n2}, \ldots, x_{nS,14}$ are the explanatory variables. The second column provides the maximum likelihood estimates $\hat{\beta}_i$, $i = 0, 2, \ldots, k' = 21$, while the third, the fourth and the fifth columns provide standard errors, z-values and p-values for the corresponding parameter estimates. A p-value of 0.000 denotes a p-value smaller than 0.0001. The last column shows estimates of the marginal effects $ME_i$.

Variable                  Estimate   Std. error   z-value   p-value        ME
...                        -0.1000       0.0926   -1.0800    0.2801   -0.0281
Incertitude x_{nS,12}       0.0236       0.3903    0.0600    0.9519    0.0066
Security x_{nS,13}         -0.4284       0.3663   -1.1700    0.2422   -0.1203
Incertitude x_{nS,14}      -0.5229       0.3735   -1.4000    0.1615   -0.1468

Table 4: Results obtained from the probit regression. $\tilde{N} = N - 48 = 2,952$ observations, 1,708 observations used by R due to missing values. $y_n$, i.e. 'personally offended', is the dependent variable, while $x_{n2}, \ldots, x_{nS,14}$ are the explanatory variables. The second column provides the maximum likelihood estimates $\hat{\beta}_i$, $i = 0, 2, \ldots, k' = 21$, while the third, the fourth and the fifth columns provide standard errors, z-values and p-values for the corresponding parameter estimates. A p-value of 0.000 denotes a p-value smaller than 0.0001. The last column shows estimates of the marginal effects $ME_i$.
The $N = 3,000$ survey was held in the period from July 9, 2014 to October 12, 2014. Some summary statistics are provided in Table 5. With the goal of obtaining information on young users, in addition to the $N = 3,000$ sample used in this article, Kirchner et al. (2015) interviewed 500 parent(s) from December 11, 2014 until May 1, 2015. In those cases where the parents had more than one child in this age group, they were asked at the beginning of the interview how many children in this age group they have, and a random selection determined which of their children they should refer to.
For the sample of $N = 3,000$ interviews we observe the following: Let $\zeta$ stand for some attribute of the population measured in percentage terms. Then, given some point estimate $\hat{\zeta}$ based on the sample $X$ of size $N = 3,000$, the 95% confidence interval (based on the normal approximation following from the asymptotic analysis) is $[\hat{\zeta} - 1.8\%, \hat{\zeta} + 1.8\%]$. In addition, by comparing the percentages observed for the population (third column in Table 5) to their sample analogs (fifth column in Table 5), we observe that all percentages observed for the population are contained in the interval "value observed in the sample ± standard error". We therefore consider the survey samples as representative. That is, the distribution of the characteristics of gender, age and place of residence in the sample corresponds to that in the population.

Table 5: The second column presents the total number of individuals aged between 14 and 49 years in Austria in the year 2014. The third column presents the percentages of the corresponding subgroups of the population. The fourth column shows the number of individuals contained in the corresponding subgroup in the sample of $N = 3,000$ individuals. The last column presents the corresponding percentages.

Correlation with $x_{nS,1}, \ldots, x_{nS,14}$:
...         -0.001 -0.003  0.029 -0.024 -0.006 -0.023 -0.033  0.011 -0.003  0.025 -0.017  1.000 -0.010 -0.011
x_{nS,13}   -0.070 -0.034 -0.041 -0.083 -0.036 -0.065 -0.054 -0.026 -0.048 -0.070 -0.020 -0.010  1.000 -0.018
x_{nS,14}   -0.204 -0.121 -0.184 -0.235 -0.144 -0.232 -0.226 -0.101 -0.158 -0.168 -0.076 -0.011 -0.018  1.000

Table 6: Descriptive statistics of the variables $x_{nS,j}$, $N = 3,000 - 681 = 2,319$ observations. Mean abbreviates the sample mean, SD the sample standard deviation and Correlation the Pearson correlation.
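The half-width of ±1.8% quoted for the 95% confidence interval can be reproduced with the normal approximation for a proportion. The sketch below assumes that the quoted value refers to the most conservative case of a share of 50%; this assumption is not stated in the text, and the function name is illustrative.

```python
import math


def ci_half_width(share: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a share.

    share : estimated proportion (between 0 and 1)
    n     : sample size
    z     : standard normal quantile (1.96 for a 95 % interval)
    """
    return z * math.sqrt(share * (1.0 - share) / n)


# Assumption: the +/- 1.8 % in the text is the worst case, share = 0.5.
worst_case = ci_half_width(0.5, 3000)

# For comparison: the share of offended users, O/N = 470/3000.
offended = ci_half_width(470 / 3000, 3000)

print(f"worst case half-width: {100 * worst_case:.2f} %")
print(f"half-width for O/N:    {100 * offended:.2f} %")
```

For shares further away from 50% the interval is narrower, so ±1.8% acts as an upper bound for any percentage estimated from the $N = 3,000$ sample under this approximation.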

C Agglomerative Hierarchical Clustering
By means of a cluster analysis we try to find groups within a data set (the following section is mainly based on Kaufman and Rousseeuw, 1990, Chapters 2, 3 and 5). The data consist of $N$ observations $x_n \in \mathbb{R}^k$, $n = 1, \ldots, N$, where $k$ is the dimension of the column vector $x_n$. $X = \{x_1, \ldots, x_N\}$ stands for the data set, which can also be written in terms of the matrix $X = (x_1, \ldots, x_N)^\top \in \mathbb{R}^{N \times k}$. In particular, our data set consists of $N = 3,000$ individuals who filled in the questionnaire, while $k = 8$ is the number of attributes taken from the questionnaire. $x_{n1}$ and $x_{nS}$ in this section correspond to $y_n$ and $x_{nS,1}$ in the main text.
Since the data are measured on different scales, the standardized observations
\[ z_{ni} := \frac{x_{ni} - \hat{\mu}_i}{sd_i} \]
are often used when a cluster analysis is performed. Here $\hat{\mu}_i := \frac{1}{N} \sum_{n=1}^{N} x_{ni}$ stands for the sample mean and $sd_i$ for the sample standard deviation of attribute $i$, where $i = 1,\dots,k$. We also follow this approach and standardize the observations.

To measure the degree of dissimilarity between the observations $z_n$ and $z_m$, a distance function $d(\cdot,\cdot)$ has to be chosen (if the data are not standardized, replace $z_n$ and $z_m$ by $x_n$ and $x_m$). In the following we work with $l_1$-distances (= Manhattan distances in R),
\[ d_1(z_n, z_m) := \sum_{i=1}^{k} |z_{ni} - z_{mi}|, \tag{3} \]
as well as with Euclidean distances,
\[ d_2(z_n, z_m) := \sqrt{\sum_{i=1}^{k} (z_{ni} - z_{mi})^2}. \tag{4} \]

After having defined distances between observations $z_n$ and $z_m$, we want to obtain distances between some clusters $C_i$ and $C_j$. A cluster $C_i$ is a subset of $X$, where $C_1,\dots,C_I$ partition the set $X$. That is, $C_i \neq \emptyset$, $C_i \cap C_j = \emptyset$ for all $i, j = 1,\dots,I$ where $i \neq j$, and $\bigcup_{i=1}^{I} C_i = X$. $I$ stands for the number of clusters considered. Equipped with the definition of $C_i$, we define the distance between $C_i$ and $C_j$ as follows:
\[ d_v(C_i, C_j) := \frac{1}{|C_i|\,|C_j|} \sum_{x_n \in C_i} \sum_{x_m \in C_j} d_v(z_n, z_m), \tag{5} \]
where $v \in \{1, 2\}$. $|C_i|$ and $|C_j|$ stand for the number of elements of the sets $C_i$ and $C_j$. The literature calls the distance defined in (5) the "unweighted pair-group average method".
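The standardization, the two point distances and the unweighted pair-group average between two clusters can be illustrated with a minimal Python sketch. This is not the R code used in the study. The function names and the toy data are hypothetical, and the sample standard deviation is assumed to use the $N-1$ denominator, which the text does not specify.

```python
import math
import statistics


def standardize(rows):
    """Column-wise standardization: (x - sample mean) / sample standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]  # assumption: N-1 denominator
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def dist(a, b, v):
    """v = 1: l1 (Manhattan) distance; v = 2: Euclidean distance."""
    if v == 1:
        return sum(abs(p - q) for p, q in zip(a, b))
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def cluster_dist(ci, cj, v):
    """Unweighted pair-group average between two clusters (lists of points)."""
    total = sum(dist(a, b, v) for a in ci for b in cj)
    return total / (len(ci) * len(cj))


# Toy data (hypothetical, not the survey data): 4 observations, k = 2 attributes.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
Z = standardize(X)

# Average distance between the clusters {z_1, z_2} and {z_3, z_4}.
d_l1 = cluster_dist(Z[:2], Z[2:], v=1)
d_eu = cluster_dist(Z[:2], Z[2:], v=2)
```

Because both toy attributes are proportional, they coincide after standardization, which shows why the scale of measurement no longer drives the distances.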
As already stated in the main text, the agglomerative hierarchical clustering technique agnes described in Kaufman and Rousseeuw (1990, Chapter 5) is applied in our study. Agglomerative hierarchical clustering techniques start with $N$ clusters, that is $C_n = \{x_n\}$ for $n = 1,\dots,N$, and then merge the groups according to the value of the distance function. This procedure is continued until we end up with a single cluster $C_1 = X$. Differences between the various agglomerative hierarchical clustering methods are mainly due to differences in the distance measures.
In more detail, in this study we proceed as follows: Let $I_\ell$ stand for the number of clusters in step $\ell$, $C_{i,\ell}$, where $i = 1,\dots,I_\ell$, for the clusters obtained in step $\ell$, $d_v(C_{i,\ell}, C_{j,\ell})$ for the corresponding distances between $C_{i,\ell}$ and $C_{j,\ell}$, and $d_{v,[\ell,1]}(C_{q,\ell}, C_{w,\ell})$ for the smallest distance between $C_{i,\ell}$ and $C_{j,\ell}$, where $i, j = 1,\dots,I_\ell$ and $i \neq j$, in step $\ell$. Let the pair with the smallest distance have the indexes $q$ and $w$, where $q, w \in \{1,\dots,I_\ell\}$. Table 7 demonstrates how an agglomerative hierarchical clustering algorithm starts with $N$ clusters, where $C_{n,\ell=0} = \{x_n\}$, and ends up with one cluster $C_{1,\ell=N-1} = X$ in the final step. To obtain the distances between the clusters, the data are standardized. Then Euclidean and $l_1$-distances are applied in (5).

Table 7: Step 0, Step 1, ..., Step $\ell$, ..., Step $N-1$. Step 1: take the distances $d_v(C_{i,0}, C_{j,0})$, where $i, j \in \{1,\dots,I_0\}$. The cluster distances follow from (5), based on the distances between the standardized observations from (3) and (4).

As described in Kaufman and Rousseeuw (1990, page 205) the "dissimilarity between merging clusters" is monotone. In more formal terms,
\[ d_{v,[0,1]}(C_{q,0}, C_{w,0}) \le d_{v,[1,1]}(C_{q,1}, C_{w,1}) \le \dots \le d_{v,[N-2,1]}(C_{q,N-2}, C_{w,N-2}). \]
By collecting these dissimilarities we obtain the monotone increasing sequence of "levels"
\[ l_0 \le l_1 \le \dots \le l_{N-2}, \qquad l_\ell := d_{v,[\ell,1]}(C_{q,\ell}, C_{w,\ell}). \]
By considering the step $\ell = h$, where observation $x_n$ is merged for the first time with some $C_w$, we observe the dissimilarity $d_{v,[h-1,1]}(x_n, C_{w,h-1}) = g_n$ at this merger. By calculating $g_n / l_{N-2}$ we obtain a number in the interval $[0, 1]$. $g_n / l_{N-2}$ is often called "width of the banner $n$", since the fractions $g_n / l_{N-2}$ can be presented in terms of a banner plot. Kaufman and Rousseeuw (1990, page 211) interpret it as follows: "... it gives an idea of the amount of structure that has been found by the algorithm. Indeed, when the data possess a clear cluster structure, the between-cluster dissimilarities and hence the highest level ($l_{N-2}$ in our notation) will become much larger than the within-cluster dissimilarities, and as a consequence the black lines become longer ($1 - g_n / l_{N-2}$ becomes larger in our notation)." The mean of the quantities $1 - g_n / l_{N-2}$ over all observations is called the agglomerative coefficient (AC). The higher the AC, the better the explanatory power of the cluster analysis.
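The merging procedure and the agglomerative coefficient described above can be sketched in a few lines of Python. This is a naive illustration and not the agnes implementation. It assumes that the agglomerative coefficient is the mean of $1 - g_n / l_{N-2}$, in line with the quoted passage. All function names and the toy data are hypothetical.

```python
def agnes_average(points, dist):
    """Naive agglomerative clustering with unweighted pair-group averages.

    Returns the merge levels l_0 <= ... <= l_{N-2} and, for every
    observation n, the level g_n of its first merger.
    """
    clusters = [[n] for n in range(len(points))]  # step 0: N singletons
    levels = []
    first_merge = [None] * len(points)

    def d(ci, cj):
        # average of all pairwise distances between the two clusters
        total = sum(dist(points[a], points[b]) for a in ci for b in cj)
        return total / (len(ci) * len(cj))

    while len(clusters) > 1:
        # pair (q, w) with the smallest average distance in this step
        level, q, w = min(
            (d(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        levels.append(level)
        for side in (clusters[q], clusters[w]):
            if len(side) == 1:  # a single observation is merged for the first time
                first_merge[side[0]] = level
        merged = clusters[q] + clusters[w]
        clusters = [c for i, c in enumerate(clusters) if i not in (q, w)]
        clusters.append(merged)
    return levels, first_merge


def agglomerative_coefficient(levels, first_merge):
    """Mean of 1 - g_n / l_{N-2} over all observations (assumed definition)."""
    top = levels[-1]
    return sum(1 - g / top for g in first_merge) / len(first_merge)


def l1(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


# Toy data (hypothetical): two well-separated pairs on a line.
pts = [[0.0], [1.0], [10.0], [11.0]]
levels, g = agnes_average(pts, l1)
ac = agglomerative_coefficient(levels, g)
```

In this toy example the two pairs merge at a low level and the final merger happens at a much higher level, so the coefficient is close to one. The double loop over all cluster pairs is chosen for readability, not speed.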
The dendrogram (clustering tree) is a graphical representation of the results obtained by a hierarchical clustering technique. On the horizontal axis the observation indices $n = 1,\dots,N$ are arranged such that the branches of the tree do not intersect. On the vertical axis we observe the levels $l_\ell$. The corresponding branches describe the leaves of the tree to be merged. The "height" of a branch represents the difference, in terms of levels, between the corresponding groups $C_q$ and $C_w$ to be merged. The dendrograms for the data set $X$ are provided in Figures 1 and 2, for $l_1$ and Euclidean distances, respectively. E.g., in Figure 1 we observe the final transition from two groups to one group at a height of 12. At a height close to 8 we already observe twelve groups, etc.
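Reading the number of groups off a dendrogram at a given height amounts to counting how many mergers have a level at or below that height. The following Python sketch illustrates this under the assumption that the merge levels are available as a non-decreasing list. The function name and the levels are hypothetical and are not taken from Figures 1 or 2.

```python
from bisect import bisect_right


def clusters_at_height(levels, h):
    """Number of groups left when the tree is cut at height h.

    levels: non-decreasing merge levels, one per merger (N - 1 in total).
    """
    n_obs = len(levels) + 1
    merges_done = bisect_right(levels, h)  # mergers with level <= h
    return n_obs - merges_done


# Hypothetical levels for N = 6 observations; the last merger happens at 12.
levels = [1.0, 1.0, 2.5, 4.0, 12.0]
groups_low = clusters_at_height(levels, 3.0)
groups_before_last = clusters_at_height(levels, 11.9)
groups_top = clusters_at_height(levels, 12.0)
```

Cutting just below the highest level always leaves two groups, which corresponds to the final transition from two groups to one.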
The agglomerative hierarchical clustering algorithm agnes is implemented in the R package cluster (Maechler et al., 2015). In particular, our estimates are obtained by means of the R command agnes(X, diss = FALSE, metric = "manhattan", stand = TRUE, method = "average") for $l_1$-distances. For Euclidean distances set metric = "euclidean". stand = TRUE means that the data are standardized, and method = "average" implies that unweighted pair-group averages are used. For more details see Maechler et al. (2015) and the literature cited in this manual.

D Further Clustering Results
This section provides further clustering results with different numbers of clusters $I$, as well as clustering results with Euclidean distances (4). NA denotes a statistic which is not available. In the following tables this occurs for the sample standard deviation when the number of group members is one.
With $I = 4$ classes all $O = 470$ persons subject to an attack are contained in the class $C_3$ with $l_1$-distances (3), while with Euclidean distances the classes $C_2$ and $C_4$ contain offended users (see Tables 8 and 9).

With $I = 8$ clusters we observe that the class $C_3$ in Table 8 splits up into the classes $C_4$, $C_6$, $C_7$ and $C_8$ in Table 10. For Euclidean distances the class $C_2$ in Table 9 splits up into the classes $C_3$, $C_5$ and $C_7$ in Table 11, while the class $C_4$ with $I = 4$ is now labeled $C_8$.

With $I = 12$ classes, the group $C_4$ in Table 10 splits up into $C_6$ and $C_7$, while the former classes $C_6$, $C_7$ and $C_8$ are labeled $C_9$, $C_{10}$ and $C_{11}$ in Table 2. For Euclidean distances we observe that the classes $C_5$ and $C_8$ remain the same; the new labels with $I = 12$ are $C_7$ and $C_{11}$. The class $C_3$ splits up into the classes $C_4$ and $C_{10}$, while $C_7$ splits up into the classes $C_9$ and $C_{12}$ in Table 12.

With $I = 16$ groups, the classes $C_7$, $C_{10}$ and $C_{11}$ remain the same, with the new labels $C_7$, $C_{13}$ and $C_{14}$. The former class $C_6$ splits up into $C_6$ and $C_8$, while $C_9$ splits up into $C_{11}$ and $C_{12}$ in Table 13. For Euclidean distances only the labeling is changed for the classes $C_7$, $C_9$, $C_{10}$, $C_{11}$ and $C_{12}$, i.e. these classes are $C_8$, $C_{13}$, $C_{14}$, $C_{15}$ and $C_{16}$ in Table 14. Finally, class $C_4$ in Table 12 splits up into the classes $C_5$ and $C_{10}$ in Table 14.

Notes to Tables 8 to 14: For each variable presented in the first column, the first row presents the group-specific sample means in the corresponding cluster $C_i$, $i = 1,\dots,I$, while the second row presents the group-specific sample standard deviations. The last column presents the mean values and the sample standard deviations for the corresponding variables, obtained from all observations $n = 1,\dots,N$. The last row presents the number of individuals assigned to cluster $C_i$:

Table 8 ($I = 4$, $l_1$-distances): 2287, 212, 470, 31; total 3000.
Table 9 ($I = 4$, Euclidean distances): 2485, 464, 45, 6; total 3000.
Table 11 ($I = 8$, Euclidean distances): 1191, 1232, 377, 45, 55, 62, 32, 6; total 3000.
Table 12 ($I = 12$, Euclidean distances): $|C_{12}| = 1$; total 3000.
Table 13 ($I = 16$, $l_1$-distances): 581, 79, 162, 719, 894, 47, 203, 147, 31, 50, 41, 1, 27, 4, 7, 7; total 3000.
Table 14 ($I = 16$, Euclidean distances): 786, 430, 589, 405, 262, 16, 29, 55, 213, 96, 38, 24, 31, 19, 6, 1; total 3000.

Figure 1 :
Dendrogram for Internet Security data. This figure plots the dendrogram for the Internet security data $X$, with $N = 3{,}000$ and $k = 8$. $l_1$-distances are applied here. Heights on the vertical axis, individuals arranged according to the tree structure on the horizontal axis. Agglomerative Coefficient AC = 0.98.

Figure 2 :
Dendrogram for Internet Security data. This figure plots the dendrogram for the Internet security data $X$, with $N = 3{,}000$ and $k = 8$. Euclidean distances are applied here. Heights on the vertical axis, individuals arranged according to the tree structure on the horizontal axis. Agglomerative Coefficient AC = 0.94.

Table 1 :
Descriptive statistics. Number of male and female participants in the study of Sample means and standard deviations for the variables $y_n$ and $x_{ni}$, $i = 1,\dots,7$, are provided in the last column of Table 2, while the sample means and standard deviations as well as correlation coefficients of $x_{nS,j}$, $j = 1,\dots,14$, are provided in Table 6 in Appendix B. If no answer is provided or if the answer "don't know" is chosen for some variable by individual $n$, we obtain a missing value. For $y_n$, $x_{n2}$ and $x_{n3}$ no missing values are observed. For the variables age, inhabitants and human capital two, thirty and eighteen missing values are observed, respectively. For the variable Employment where on leave, retirement,

Table 2 :
Results obtained from the Cluster Analysis.

Table 3 :
Results obtained from the Logit Regression.

Table 4 :
Results obtained from the Probit Regression.

Table 15 :
Results obtained from the Logit Regression.