THE POTENTIAL OF BIG DATA: THE POSSIBILITIES OF SOCIOLOGICAL ANALYSIS

Abstract

The article presents the results of studying the relationship between search queries (based on Google Trends) and sociological indices of consumer sentiment (Levada Center).

The model is built on data covering 10 years (2012 to 2022). Search query series on relevant topics are aggregated step by step, a factor space is formed, and a multiple regression model is built. In the regression equation, the consumer sentiment index (IPN) is the dependent variable, and the factors that summarize the search queries are the independent variables.

The modeling results and the strengths and weaknesses of using Big Data to solve sociological problems are discussed.

Full Text

The phenomenon of "big data" emerged in the late 1990s and early 2000s and was initially defined by a 3V model: volume, velocity and variety. The model has since evolved with the new possibilities of digital reality. It first grew into a 4V model with the addition of value, that is, the extraction of valuable information from a data set intended for analytics. Big data is now defined by a 5V model, which adds veracity, implying not only data management but also respect for the right to privacy [1].

The birth of big data has led to fundamental changes in the practice of analysis and raised important questions for the social sciences. These changes are associated with a shift in the relationship between the methods of data collection and analysis. Firstly, assessing the reliability of relationships requires methods designed for large arrays rather than samples, and such methods are still lacking in widespread practice. Secondly, preparing data for analysis requires new personnel with programming skills, supported by systems that operate automatically and help specialists find and eliminate errors. Thirdly, there is the problem of data anonymity: each observation must remain uniquely identifiable while access to personal information is made impossible [2].

Digital footprints left by people have led to an exponential increase in the number of data sources (in addition to traditional surveys and official reports) available for social and economic analysis. There are countless reasons for creating new data; however, the ways in which they are obtained have important ethical and legal consequences. For example, personal data about purchases cannot be used for the same purposes as data from a person's profile on Twitter. The use of data is limited by how it is created.

Researchers also point to negative traits of Big Data from an ethical point of view. E. Uprichard points to its vulgarity and "violating" nature [3], and D. Lupton adds such characteristics as perversity, provocation, intrusion into private life, and so on [4].

And yet, Big Data is positioned as a model for obtaining, storing and processing information about society that claims primacy over public opinion polls. The emergence of big data in the social sciences has become the frontier beyond which all traditional methods of obtaining and processing information about public opinion have come to be known as small data. The differences between small data and Big Data were analyzed by R. Kitchin and G. McArdle using the examples of research data, administrative statistics and "big data" [5]. Having analyzed a number of studies, we arrived at a broad definition: big data is a volume of data whose most important parameters are speed and accuracy and from which information and knowledge can be obtained only with special analytical methods and techniques. This definition is based on that of A. De Mauro [6], which was compiled from an analysis of the abstracts of scientific papers, and is adjusted to take into account the amendments of R. Kitchin [5].

Speed is considered a key attribute of big data. Big data is being created continuously. For example, data can be created while a user is browsing websites.

Exhaustivity means that big data tends to cover the entire population (n = all) within a system rather than a sample. For example, Twitter captures all tweets made by all accounts, not a sample of tweets.

A number of processing and analysis methods are used to extract information. They can be traditional (relevance, content analysis) or innovative (natural language processing, neural networks, etc.).

The social sciences did not immediately take up the new possibilities of Big Data: the first scientific articles appeared only in 2009. By that time the era of big data had already been proclaimed by the mass media, and to this day articles in popular journals significantly outnumber those in scientific journals. The article "Computational Social Science", published in Science in 2009, can be regarded as the manifesto of the "new science" [7]. Its authors, among them David Lazer, Alex Pentland and Lada Adamic, have repeatedly spoken at conferences, head centers and institutes, and publish their results in the prestigious journals Science and Nature. Today many Russian researchers are working on the problem of big data. V.V. Volkov, in the article "Problems and prospects of research based on Big Data (on the example of sociology of law)", summarizes the capabilities of analytical platforms for collecting, processing and storing big data, describes their parameters and gives examples of the difficulties and features of working with them [2]; N.V. Korytnikova, in the article "Online Big Data as a source of analytical information in online research", describes the capabilities of analytical platforms for collecting, processing and storing Big Data and presents a system of indicators used for sociological analysis [8]; K. Guba, in the article "Big Data in Sociology: New data, new Sociology?", answers the question of what changes the new data have brought to sociology [9].

Commercial experience with Big Data and cases of its application to political issues have prompted scientists to look for points of contact with the classical methods of studying society [10]. At the moment, successful practices of such interaction can be traced in three key areas:

  • application of Big Data for the study of classical areas of interest in social sciences;
  • supplementing the results of Big Data analysis with traditional sociological methods (small data);
  • application of Big Data mechanisms to data collected by traditional sociological methods.

The idea of combining the data obtained by sociology into larger samples is quite obvious. A good example of how scale affects the depth of analysis is provided by the large cross-cultural projects World Values Survey and European Values Survey. It should be remembered that they share a single research methodology, including the interpretation and operationalization of key concepts (as far as the translation of the questionnaire allows) [11].

Combining small data into large arrays that at least partially meet the criteria of Big Data is driven by the desire to bring previously collected data back into scientific circulation and to discover new correlations that cannot be detected in each individual array.

Today, urban planners and sociologists increasingly use big data to analyze the daily practices of city dwellers, taking into account the features of the urban environment and urban mobility. For example, big data can be used in the rhythmanalysis of urban space. Rhythmanalysis as a research tool was first proposed by Henri Lefebvre in his work "Rhythmanalysis" (1992), in which the urban environment is considered as a unity of rhythm, space and time [12]. On the one hand, the way city institutions work regulates the daily life of citizens to some extent and establishes certain norms, perceptions and understandings of "social time"; on the other hand, it adapts to the rhythm of citizens' lives. Citizens synchronize with the surrounding rhythm or create their own rhythm (including a physical one). For example, the rhythm of life of world capitals can be analyzed by studying, mapping and visualizing the location check-ins that users regularly make on social networks through Foursquare's geolocation capabilities [13]. Foursquare data make it possible to discuss differences in the styles and ways of life of different cities, the heterogeneity of the urban space itself, and the rise and fall in the intensity of processes in its various parts; they can become the starting point of urban research at the junction of several sciences.

However, big data is becoming a source of new risks, including the following problems.

Firstly, there are problems with confidentiality and the protection of private information, which limit the understanding of the data that interested parties collect and study. These problems arise because the government has the right to collect information without the user's consent and without signing a private agreement.

Secondly, problems arise at different stages of working with big data. At the stages of data collection and analysis, ignorance of the technological principles of big data, the choice of inappropriate methods or their unskilled use may lead to erroneous results. Difficulties may also arise at the stage of interpretation, where excessive trust in big data technologies can lead to incorrect management decisions.

Thirdly, there are problems that require large investments in the technology sector, and there are examples when these investments do not give the expected results.

According to Gary King, once new data and methods become available to the social sciences, three things allow them to reach a new level: innovative statistical methods, new computer science and original theories from individual fields of knowledge [14]. Together they help to overcome the shortcomings of earlier data: artificial conditions of collection, a retrospective nature and static information.

Online data provide information about people's behavior in real time, automatically recording who is interacting, where and with whom; at the same time, the influence of the researcher on the production of the data is minimized, because the data exist regardless of whether anyone analyzes them [15].

As we have seen, sociology is gradually coming to grips with the possibilities of Big Data as a research tool.

For my part, I want to present the results of my own experience of using Big Data to solve sociological problems.

Using a particular example, the correlation between the frequency of search queries on a certain topic (Google Trends) and the consumer sentiment index (Levada Center), I want to check whether there is a synchronous relationship between these indicators and what form it takes.

In other words, in substantive terms, can we show that search queries reflect or relate to consumer sentiment as measured in the classical sociological way?

Big data here consists of the frequency of keyword searches in Google, which are available on the basis of the Google Trends service. The procedure for selecting keywords includes the following steps:

  • construction of a conceptual scheme of the phenomenon under study based on the procedural part of the Levada Center's monitoring of consumer sentiment;
  • identification of key research aspects;
  • identification of search queries within aspects;
  • selection of queries that have dynamics over 10 years.

At the first stage, the methodology for constructing the Levada Center's consumer sentiment index (hereinafter IPN) was analyzed (Fig. 1.1). The index is calculated from respondents' answers to five questions. For each question, an individual index is built as the difference between the percentages of positive and negative answers, and 100 is added to avoid negative index values. The cumulative IPN is calculated as the arithmetic mean of the individual indices. The IPN ranges from 0 to 200, and values below 100 mean that negative assessments predominate in society.
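The index arithmetic described above can be sketched in a few lines. This is only an illustration of the stated formula; the function names and the answer shares are hypothetical and are not Levada Center data.

```python
def individual_index(positive_share: float, negative_share: float) -> float:
    """Individual index: % of positive answers minus % of negative answers, plus 100."""
    return positive_share - negative_share + 100.0


def composite_index(answers: list[tuple[float, float]]) -> float:
    """Cumulative index: arithmetic mean of the individual indices."""
    indices = [individual_index(pos, neg) for pos, neg in answers]
    return sum(indices) / len(indices)


# Hypothetical (positive %, negative %) pairs for the five questions.
example_answers = [(30, 40), (25, 35), (20, 50), (35, 30), (40, 45)]
ipn = composite_index(example_answers)
print(ipn)  # 90.0, i.e. below 100: negative assessments predominate
```

The two extreme cases, all answers positive or all negative, give 200 and 0, which is where the stated range of the index comes from.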

Figure 1.1 – Dynamics of the consumer sentiment index for 2012-2022.

At the second stage, four aspects of consumer sentiment were identified from the questions put to respondents to build the IPN: material conditions (questions 1, 2), the economic situation of the country (questions 3, 4), purchases (question 5) and willingness to make a purchase (question 5) (Fig. 1.1).

A set of explanatory indicators (keywords) for the analyzed aspects was selected in the Google Trends search environment on the basis of several international questionnaires: that of the OECD (used to build the Better Life Index) and the American Time Use Survey (which studies how the population uses its time), together with keywords that are potentially related to consumer sentiment (Fig. 1.2).

Figure 1.2. – Conceptual diagram of the consumer sentiment model.

At the same time, keywords taken from the questionnaires of international surveys may be far from people's daily lives if they do not reflect the real conditions that show up directly in individual search queries. For this reason, keywords such as "what to keep money in", "advanced training", "communal apartment", "fitness subscription", "pyaterochka delivery", "where to relax on the weekend", "availability nearby", etc. were added. This stage of keyword selection has a notable disadvantage: it involves a high degree of subjectivity.

As a result, among the many search queries characterizing one or another aspect of consumer sentiment, only those showing dynamics (visible changes) over the past 10 years were selected. The original database contained 411 search queries. The data set consists of monthly data covering the period 2012-2022 (10 years).

The Google Trends value for a search query is not an absolute count of searches for that word. Rather, it is the share of all searches in a given period that include the keyword, normalized so that the largest value in the period equals 100. For this reason, the values of the series at any given date cannot be compared between search queries, because each series is normalized to its own maximum. To solve this problem, a standardized Z-score is used:

$$z = \frac{x - \bar{x}}{\sigma_x},$$

where $\bar{x}$ is the average value of the random variable $x$ and $\sigma_x$ is its standard deviation.
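The standardization step can be illustrated with a minimal sketch. It assumes the population standard deviation (the text does not say whether the population or the sample formula was used), and the two series are invented for the example.

```python
from statistics import mean, pstdev


def z_scores(series):
    """Standardize a series: subtract its mean, divide by its standard deviation."""
    m = mean(series)
    s = pstdev(series)  # assumption: population standard deviation
    if s == 0:
        raise ValueError("a constant series cannot be standardized")
    return [(x - m) / s for x in series]


# Two hypothetical query series on very different scales.
query_a = [20, 40, 60, 80, 100]
query_b = [2, 4, 6, 8, 10]

z_a = z_scores(query_a)
z_b = z_scores(query_b)
print([round(z, 3) for z in z_a])
print([round(z, 3) for z in z_b])
```

Both series have the same shape, so after standardization they coincide, which is what makes queries with different popularity levels comparable.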

Google Trends data may contain sharp jumps in the popularity of a search query. This complicates estimation, because the model risks losing relevance during construction. To eliminate this problem, a moving average is applied. The order of the moving average is the number of consecutive values averaged at each step. In this study, the data were smoothed with a three-period moving average, where the period is one month.
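A three-period moving average can be sketched as follows. The sketch assumes a trailing window (the current month and the two preceding ones); the text does not specify whether a trailing or a centered window was used, and the series is invented.

```python
def moving_average(series, window=3):
    """Trailing moving average; the first window - 1 points have no value and are dropped."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and the series length")
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Hypothetical monthly series with a single sharp jump.
monthly = [10, 10, 100, 10, 10]
smoothed = moving_average(monthly, window=3)
print(smoothed)  # [40.0, 40.0, 40.0]
```

The jump from 10 to 100 is spread over three months, at the cost of losing the first two observations of the series.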

In addition, some search queries have zero search volume for long stretches. Periods with many zeros cause problems similar to sharp jumps in popularity, so queries with a large number of zeros over the period were excluded. After applying the moving average and deleting such queries, 290 queries were included in the analysis.

The categories of consumer sentiment are constructed by grouping search queries, which should fit a logical scheme and carry a similar semantic load. For example, words such as "economic crisis" and "concerts" cannot be combined into one category, so for simplicity the words were first divided into general categories such as "Labor market and job search", "Culture and recreation", etc.
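A filter of this kind might look like the sketch below. The cut-off `max_zero_share=0.5` is purely hypothetical: the text does not state how many zeros counted as "a large number". The query names and values are also invented.

```python
def zero_share(series):
    """Fraction of periods in which the query has zero search volume."""
    return sum(1 for value in series if value == 0) / len(series)


def drop_sparse_queries(queries, max_zero_share=0.5):
    """Keep only queries whose share of zero periods does not exceed the cut-off."""
    return {
        name: series
        for name, series in queries.items()
        if zero_share(series) <= max_zero_share  # hypothetical threshold
    }


queries = {
    "steady query": [40, 55, 60, 48, 52, 70],
    "sparse query": [0, 0, 0, 100, 0, 0],
}
kept = drop_sparse_queries(queries)
print(sorted(kept))  # ['steady query']
```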

The resulting categories of search queries are statistically substantiated by means of factor analysis, the main purpose of which is to combine search queries into categories that characterize one or another aspect of consumer sentiment on the basis of factor loadings.

Factor analysis was used as a method of detecting relationships between the values of variables by studying the structure of covariance and correlation matrices. Factors were extracted with the principal component method and rotated with the VARIMAX method.

Factor analysis is needed to create composite categories of search words, which significantly reduces the number of potential explanatory variables.

The factor analysis was applied to words that had been normalized with a Z-score and from which shock fluctuations had been removed. During the factor analysis, search queries were excluded if their factor loading was negative or less than 0.3.
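The extraction and rotation steps can be sketched with NumPy. This is not the software or the data used in the study: the input is synthetic, the number of factors is fixed at two for brevity, and the helper names are hypothetical. The rotation follows the commonly used SVD-based VARIMAX iteration.

```python
import numpy as np


def pca_loadings(data, n_factors):
    """Principal-component loadings and explained variance share (correlation matrix)."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_factors]    # indices of the largest ones
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    explained = eigvals[order].sum() / corr.shape[0]
    return loadings, explained


def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal VARIMAX rotation of a loading matrix."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion_old = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        criterion_new = s.sum()
        if criterion_old != 0 and criterion_new / criterion_old < 1 + tol:
            break
        criterion_old = criterion_new
    rotated = loadings @ rotation
    # The sign of a factor is arbitrary; orient each factor so its loadings sum to >= 0.
    signs = np.where(rotated.sum(axis=0) < 0, -1.0, 1.0)
    return rotated * signs


# Synthetic stand-in for smoothed, z-scored query series (NOT the study data):
# 120 months, 4 queries driven by one latent trend and 3 by another.
rng = np.random.default_rng(0)
latent = rng.normal(size=(120, 2))
noise = 0.3 * rng.normal(size=(120, 7))
data = np.column_stack([latent[:, [0, 0, 0, 0]], latent[:, [1, 1, 1]]]) + noise

loadings, explained = pca_loadings(data, n_factors=2)
rotated = varimax(loadings)

# Exclusion rule from the text: drop a query if its highest loading is below 0.3.
keep = rotated.max(axis=1) >= 0.3
print(np.round(rotated, 2))
print(round(float(explained), 3), int(keep.sum()))
```

Because the rotation is orthogonal, it redistributes the loadings between factors without changing how much of each query's variance the factors explain jointly.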

In the process of constructing a mathematical model of consumer sentiment, we obtained four successive factor models in order to minimize the number of search queries with insignificant loadings. When checking the first model, we found that individual words did not meet the specified criteria, and 18 search queries were excluded from the analysis. In the second model, 1 search query was excluded. In the fourth model, the factors contained no insignificant queries, so we took it as the working model, considering it the most relevant. After this selection, 200 words remained. Further analysis used the factor model with the best solution in terms of content, which comprises 9 factors that together explain 62.8% of the variance. Thus, we can conclude that the quality of the constructed model is satisfactory.

Many search queries are dropped because they do not fit into any category. For example, "withdraw money", "rent an apartment" and "where to go" are not associated with any factor. It is important to note that grouping two words into one category does not mean that they have the same meaning, only that their query volumes follow a common trend over the given period.

The factor loadings of the words that passed the test were grouped into categories reflecting nine aspects of life (Table 1.1). The word lists make it possible to assess the content of each category of consumer sentiment visually. For example, "Factor 5" differs from "Factor 2", although they have similar characteristics: the search queries in "Factor 5" cover goods that are not essential but are directly related to people's lives and reflect the state of the economy, whereas the queries in "Factor 2" concern goods that require larger financial outlays, are available to a limited circle of people and characterize a certain position in society.

Table 1.1. – Distribution of consumer sentiment search queries within factors

| Factor | Search queries |
|---|---|
| 1 | promotions, apartment, sushi, massage, xiaomi, rolls, pizza, burgers, car, bread, phone, cheese, eco, order, vacancies, discounts, buy pills, etc. |
| 2 | poco, dyson, investments, promo codes, skillbox, smart home, haier, diamonds, salary indexing, airsales, oneplus, rolls-royce, payments, airpods, haval, bork, auchan delivery, etc. |
| 3 | cucumbers, refrigerator, tomatoes, zucchini, motorcycle, jet ski, ATV, bicycle, camp site, plane tickets, boat, train tickets, repairs, children's camp, kombucha, sugar, bmw, etc. |
| 4 | new year, corporate, lg, buy samsung, carnival, fur coat, buy samsung, TV, house, curling iron, where to buy, braun, compare prices, ski, salad recipes, buy online, buy iphone, etc. |
| 5 | reviews, kia, renault, jeep, laptop, online menu, income tax, swift, candy |
| 6 | tutor, benefits, lada, vacation, conferences, buy diploma, household appliances, discounts on order |
| 7 | iphone, apples, soups, mushrooms, discounts for students, your business, cottage, excursions, currency exchange rate |
| 8 | sportswear, take on credit, lamborghini, mercedes-benz, porsche, liposuction, earth, cadillac, furniture, outdoor activities, how to buy gold |
| 9 | billboard, sanctions |

A key difficulty at the interpretation stage of factor analysis is the identification and interpretation of the main factors. When selecting components, we encountered certain difficulties, since there is no unambiguous criterion for selecting factors, and some subjectivity in interpreting the results is therefore inevitable. Unfortunately, even the factor model that is best in terms of content still requires improvement. The quality of the factors can be improved by:

  • studying the criteria by which everyday queries are formed;
  • changing the method of extracting factors (for example, using principal axis factoring).

At the final stage of the study, we built a multiple regression model to assess the relationship between the phenomena under study: the categories being constructed and the IPN.

Regression analysis is a statistical method used to investigate the relationship between a dependent variable and one or more independent variables.

Constructing a multiple regression model allows us to derive an equation for the relationship between the obtained factors and the survey construct.

The dependent variable is the consumer sentiment index, and the independent variables are the categories of Google search queries obtained as factors through mathematical modeling. The results of the multiple regression are shown in Table 1.2.

Table 1.2. – Regression coefficients between the studied phenomena of the consumer sentiment model and the Levada Center's IPN

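| Factor | IPN (regression coefficient) | Significance |
|---|---|---|
| Factor 1 | -0.876 | 0.043 |
| Factor 2 | -0.214 | 0.661 |
| Factor 3 | -0.196 | 0.626 |
| Factor 4 | -4.857 | 0.000 |
| Factor 5 | -0.150 | 0.767 |
| Factor 6 | 2.922 | 0.000 |
| Factor 7 | -0.613 | 0.148 |
| Factor 8 | -0.192 | 0.648 |
| Factor 9 | -0.953 | 0.023 |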

In our case, R-squared is 0.667. This means that 66.7% of the variation of the dependent variable is explained by the variation of the independent variables. The fact that 66.7% of the changes in the IPN are determined by the dynamics of the factors indicates the quality of the tested model.

Another important indicator to take into account when summarizing the results is the F-statistic. It is used to assess the significance of the coefficient of determination, that is, whether the independent variables taken together explain the dependent variable better than chance. It is calculated as the ratio of the explained sum of squares per independent variable to the unexplained (residual) sum of squares per degree of freedom.

The significance level of the F-test indicates the reliability of the results obtained. In our case it is less than 0.05, from which we can conclude that the model as a whole is statistically significant.
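For readers who want to see how the coefficients, R-squared and the F-statistic fit together, here is a minimal ordinary least squares sketch with NumPy. The data are synthetic: the 120 observations, the nine factor scores and the "true" coefficients are all hypothetical and do not reproduce the study's results.

```python
import numpy as np


def fit_ols(factors, target):
    """Multiple linear regression with an intercept; returns coefficients, R^2 and F."""
    n_obs, n_factors = factors.shape
    design = np.column_stack([np.ones(n_obs), factors])    # add the intercept column
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_res = float(residuals @ residuals)                  # unexplained sum of squares
    ss_tot = float(((target - target.mean()) ** 2).sum())  # total sum of squares
    r_squared = 1.0 - ss_res / ss_tot
    # Explained sum of squares per regressor over residual sum of squares
    # per residual degree of freedom.
    f_stat = ((ss_tot - ss_res) / n_factors) / (ss_res / (n_obs - n_factors - 1))
    return coef, r_squared, f_stat


# Synthetic example (NOT the study data): 120 months, 9 factor scores.
rng = np.random.default_rng(1)
factors = rng.normal(size=(120, 9))
true_coef = np.array([-1.0, 0.0, 0.0, -5.0, 0.0, 3.0, 0.0, 0.0, -1.0])  # hypothetical
ipn = 100.0 + factors @ true_coef + rng.normal(scale=2.0, size=120)

coef, r_squared, f_stat = fit_ols(factors, ipn)
print(np.round(coef, 2))
print(round(r_squared, 3), round(f_stat, 1))
```

The F-statistic is then compared with the F distribution with `n_factors` and `n_obs - n_factors - 1` degrees of freedom to obtain the significance level; that step is omitted here to keep the sketch free of extra dependencies.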

The comparison of the constructed mathematical model of consumer sentiment and the IPN of the Levada Center is based on the identification of synchronous changes. The presence of such changes confirms the presence of a relationship between the studied phenomena.

Ideally, our factors would fit into a substantively accurate regression model in which they all jointly and synchronously affect the consumer sentiment index. In fact, the picture is less clear than we assumed. Let us turn to the results of the multiple regression.

Having analyzed the relationship between the IPN and the constructed factors with the multiple linear regression equation, we can say that only some of the factors are significant within this model. Factors 1, 4, 6 and 9 have significance values below 0.05, so their values change together with the dynamics of the IPN (Table 1.2).

The regression gives contradictory results: for most factors we see an inverse relationship. In an ideal model, search queries should be associated with consumer activity, so a positive relationship should be observed. In our case, however, we observe the opposite picture. This may be because the factors do not always fit the content of the concept and do not always show a significant relationship, within regression modeling, with an index constructed by survey methods.

Perhaps this is because the survey index (IPN) is not a direct quantitative expression of search activity; in addition, we cannot say that the index values passed the smoothing procedure. At the same time, search queries have their own vulnerability: each query is indexed relative to itself. It is a closed system in which queries cannot be combined directly. For example, 100 points for one query may correspond to 1,000 searches, while 100 points for another correspond to 10,000 searches. Thus, Google Trends series for different queries are measured on different scales, and matching them requires separate tools to align their quantitative values.

To refine the regression model, let us trace how the IPN correlates with each of the factors (Table 1.3) and evaluate the strength of the relationships. As we can see, the IPN is most closely related to factors 4 and 6. Only for factor 6 is the correlation coefficient positive (in the other cases the relationship is inverse). We can conclude that growth in this factor is associated with favorable dynamics of the IPN.

Table 1.3. – Pearson correlation coefficients between Factors 1-9 and the IPN
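The scaling problem can be made concrete with a simplified sketch. It assumes that each series is simply rescaled so that its own maximum equals 100; the real Google Trends index is built from search shares rather than raw counts, and the raw counts below are invented.

```python
def trends_scale(raw_counts):
    """Rescale a series so that its own maximum becomes 100 (simplified assumption)."""
    peak = max(raw_counts)
    if peak <= 0:
        raise ValueError("the series needs at least one positive value")
    return [round(100 * count / peak) for count in raw_counts]


# Hypothetical raw search counts: the second query is ten times more popular.
query_a = [250, 500, 1000]
query_b = [2500, 5000, 10000]

print(trends_scale(query_a))  # [25, 50, 100]
print(trends_scale(query_b))  # [25, 50, 100]
```

After rescaling, the two series are indistinguishable, so the index values alone say nothing about which query is searched more often.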

| Factor | Pearson's r | Significance |
|---|---|---|
| Factor 1 | -0.075 | 0.445 |
| Factor 2 | -0.003 | 0.974 |
| Factor 3 | -0.044 | 0.653 |
| Factor 4 | -0.746 | 0.000 |
| Factor 5 | -0.084 | 0.389 |
| Factor 6 | 0.207 | 0.033 |
| Factor 7 | -0.165 | 0.090 |
| Factor 8 | -0.014 | 0.883 |
| Factor 9 | -0.172 | 0.077 |

For the model to provide useful information that can be applied in the cases being compared, it is necessary to know the strength of the correlations, that is, to understand which indicators are related to the result more strongly and which more weakly. In our case, only two factors (4 and 6) are significantly correlated with the IPN.
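The Pearson coefficient itself is straightforward to compute. The sketch below uses only the standard library; the two short series are invented to show an inverse relationship and are not taken from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two series of equal length."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length of at least 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical factor scores and index values moving in opposite directions.
factor_scores = [1, 2, 3, 4, 5]
index_values = [5, 3, 4, 1, 2]

r = pearson_r(factor_scores, index_values)
print(round(r, 3))  # -0.8
```

A negative coefficient corresponds to the inverse relationships reported for most factors, while the sign alone says nothing about significance, which also depends on the number of observations.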

Thus, the results of the mathematical model satisfy us only partially. The constructed model cannot yet unambiguously replace survey tools.

Big Data in the social sciences is only beginning to develop as an alternative to the classical methods. Today, thanks to a wide range of techniques for data extraction and analysis, the model can be improved at different stages of its construction.

Big data provide more detailed statistical estimates of various phenomena and processes in society, which is an important argument in developing the concept of consumer sentiment as one of the key categories of social and economic science.

Google indices are an interesting additional tool. The determinants of the Levada Center consumer sentiment index that we found, namely the categories of Google search queries, can be used to define key directions of economic policy and to ensure a significantly higher level of not only material but also social benefits, which will improve the quality of life.

About the authors

Ekaterina Nikitskaya

Samara National Research University named S.P. Korolev

Author for correspondence.
Email: mymail-kat@mail.ru
Russian Federation

References

  1. Bello-Orgaz, G. Social big data: recent achievements and new challenges / G. Bello-Orgaz, J.J. Jung, D. Camacho // Information Fusion. – 2016. – 28. – pp. 45-59.
  2. Volkov, V.V. Problems and prospects of research based on Big Data (on the example of sociology of law) / V.V. Volkov, D.A. Skugarevsky, K.D. Titaev // Sociological Research. – 2016. – 1. – pp. 48-58.
  3. Uprichard, E. Big data, little questions? / E. Uprichard // Discover Society. – 2013. – 1. – P. 1-6.
  4. Lupton, D. The thirteen Ps of big data. This Sociological Life, 2015. [Electronic resource]. URL: https://simplysociology.wordpress.com/2015/05/11/the-thirteen-ps-of-big-data/ (accessed: 05/25/2022).
  5. Maltseva, A.V. Problems of representativeness when working with "big data" / A.V. Maltseva // Social practices and management: the problem field of sociology: materials of the Siberian Sociological Forum with international participation. – 2017. – pp. 141-145.
  6. De Mauro, A. What is big data? A consensual definition and a review of key research topics / A. De Mauro, M. Greco, M. Grimaldi // Conference: 4th International Conference on Integrated Information. – 2014.
  7. Lazer, D. Computational Social Science / D. Lazer, A. Pentland, L. Adamic, S. Aral, A-L. Barabasi, D. Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann, T. Jebara, G. King, M. Macy, D. Roy, M. Van Alstyne // Science. – 2009. – 5915. – P. 721-723.
  8. Korytnikova, N. V. Online Big Data as a source of analytical information in online research / N. V. Korytnikova // Sociological research. – 2015. – 8. – pp. 14-24.
  9. Guba, K. Big data in sociology: new data, new sociology? / K. Guba // Sociological Review. – 2018. – Vol. 17. No. 1. – pp. 213-236.
  10. Mann, R. Five minutes with Prabhakar Raghavan: Big data and social science at Google. Impact of Social Sciences, London School of Economics and Political Science [Electronic resource]. 2012. URL: http://eprints.lse.ac.uk/52128/ (accessed: 05/25/2022).
  11. Kitchin, R. The Data Revolution. Big Data, Open Data, Data Infrastructures & Their Consequences / R. Kitchin. – Los Angeles, London, Singapore, Washington DC: SAGE, 2014. – p. 240.
  12. Prokofieva, A.V. On some possibilities of using big data in urban sociology / A.V. Prokofieva, M.D. Romanova // Actual problems of human potential development in modern society. – 2017. – pp. 1-4.
  13. Mapping the pulse of NYC, Tokyo, Istanbul, & London [Electronic resource]. 2017 URL: https://vimeo.com/144409527 (Accessed: 05/25/2022)
  14. King, G. Restructuring the Social Sciences : Reflections from Harvard’s Institute for Quantitative Social Science / G. King // Political Science & Politics. – 2013. – 1. – P. 165-172.
  15. Golder, S. A. Digital Footprints: Opportunities and Challenges for Online Social Research / S. A. Golder, M. W. Macy // Annual Review of Sociology. – 2014. – 40. – P. 129-152.


Copyright (c) 2023 Proceedings of young scientists and specialists of the Samara University

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
