Introduction

High-quality research rests on high-quality datasets. “Big” digital behavioral data consists of traces of behavior left by users of, or harnessed by, digital technology. It is often created for economic purposes and increasingly allows large-scale and high-resolution network analyses of the behavior and performance of persons or aggregated identities in whole fields (Lazer and Radford 2017). A field or network domain consists of a story set (domain) and the persons collectively enacting it (network). Put simply, a field is a set of persons thematically concerned with a set of things (White 2008). Field delineation is the computational task of collecting, or retrieving from a database, the building blocks of fields (Zitt 2015). Depending on how a field is represented by data, those blocks can be as diverse as publications or tweets. But there is a conceptual problem: the goal of field delineation is to draw a boundary that does not exist in reality. Socio-cultural systems have no clear-cut boundaries but are highly overlapping (Palla et al. 2007) due to their constructed and fractal nature (Abbott 2001; Fuchs 2001). This is known as the boundary problem in sociology. In that sense, to delineate a field is to draw an impossible boundary.

Bibliographic data is not only a very early example of unobtrusive behavioral data, as publications are not produced for the purpose of statistical analysis. It is also a form of multiplex data (Padgett and Powell 2012) that allows joint analyses of network (co-authorship) and domain (citation and word usage). This makes it sociologically very appealing. Bibliographic databases like the Web of Science and Scopus provide classification systems to aid publication retrieval. Since these systems classify journals, not publications, along coarse disciplinary lines, they are of limited help when it comes to delineating interdisciplinary fields that span these journal classes. Article-level classification systems can improve the fine-grained publication retrieval of interdisciplinary fields (Glänzel and Schubert 2003; Neuhaus and Daniel 2009; Waltman and van Eck 2012; Sjögårde and Ahlgren 2018). However, they may not be available other than to institutes with privileged data access (Waltman and van Eck 2012) or may be of limited trustworthiness due to their black-box nature (Sinha et al. 2015). Furthermore, even if an article-level classification system is available and trustworthy, publication retrieval still involves manual checking and refinement steps (Milanez et al. 2016).

I propose a sociologically enhanced information retrieval method for field delineation with three parameters that is tailored to retrieving substructured fields, does not rely on an existing classification system, is rooted in sociological theory, and can be applied in non-scientometric settings. For example, it is meant to be capable of retrieving publications from a bibliographic database as well as tweets from a Twitter corpus. My method is based on the bibliometrically enhanced information retrieval method of Zitt and Bassecoulard (2006), according to which a field is delineated by starting with a precise seed set of publications, then identifying its core cited references, and finally retrieving publications that cite this core. This citing/cited/citing logic is a good starting point because it resonates with the mechanism by which complex socio-cultural systems operate via feedback (White 2008; Padgett and Powell 2012). On the way to a general field delineation method, the citation-based retrieval method is generalized to include word usage (Zitt 2015), and subfield delineation is introduced to deal with field heterogeneity (Mogoutov and Kahane 2007). The solution to the boundary problem is to classify a sample set of transactions (e.g., publications or tweets), decide how many false positives or false negatives one is willing to accept in retrieval, and specify the respective fuzziness of the boundary.

This procedure is demonstrated in a delineation of the Social Network Science (SNS) field using the Web of Science database. This field is defined as the network domain that studies socio-cultural systems in a relational way—a multidisciplinary science of social networks, not a sociological network science. As such it roughly combines the classical Social Network Analysis (SNA) field (Freeman 2004) and the subfield of Network Science (Barabási 2016) that studies socio-cultural systems. SNS is a particularly interesting case because it is an evolving field that has seen many twists and turns (Hummon and Carley 1993; Freeman 2004; Garfield 2004; Shibata et al. 2007; Leydesdorff et al. 2008; Lazer et al. 2009; Brandes and Pich 2011; Freeman 2011; Lancichinetti and Fortunato 2012; Batagelj and Cerinšek 2013; Hidalgo 2016; Maltseva and Batagelj 2019). So far, the only bibliographic dataset of the whole field is the SN17 dataset retrieved from the Web of Science by Maltseva and Batagelj (2019). It is based on the SN5 dataset (Batagelj and Cerinšek 2013) retrieved from the Web of Science in 2007. SN5 contains publications that use the search term SOCIAL NETWORK* in either title, abstract, or keywords, or have been published in the journal Social Networks, plus the “most frequently cited works” of those publications. SN17 is an extension of SN5 to the year 2018 using the same search term but adding new complete network-related journals (Maltseva and Batagelj 2019).

Fig. 1 Co-selection graph cores of Social Network Science. From left to right: authors co-authoring, references co-cited in, and words co-used in publications. Colors indicate subfields and unveil how the latter overlap in the co-selection of facts. See description of the final dataset for how these graphs are constructed. (Color figure online)

The goal of the delineation task is to create a high-quality dataset that has undergone manual oversight. It should exclude publications that talk of “social networks” metaphorically, have disambiguated author names, contain the most important citations made in publications’ reference lists (not just to items in the database), include multi-token linguistic concepts (n-grams), and allow historical analysis, i.e., capture the field from its predecessors on. The SN17 dataset does not meet these criteria. Its boundary is too fuzzy because it includes publications that use the networks term metaphorically. Therefore, I have delineated SNS anew. The resulting dataset consists of 25,760 bibliographic records retrieved from the Web of Science, ranging from 1916 to 2012. There are 45,580 distinct authors, 574,036 cited references, and 23,026 linguistic concepts. Except for citations, the data is made available to the community under a Creative Commons license (Lietz 2019) and can be explored online in a virtual Jupyter Notebook without the need to install or master a programming language (Lietz 2020). Figure 1 gives an impression of the networks that can be constructed from this dataset.

This paper is a revised chapter of my dissertation (Lietz 2016). In the next section, the sociological model of fields is introduced. Then I describe the field delineation procedure in detail before applying it to delineating the SNS field. A discussion and conclusion are offered in the last section. Most of the mathematical formalism is placed in a “Technical Appendix”. Supplementary Information is given for data processing and publication classification.

Sociological field model

The field delineation procedure is supposed to generate data that resembles the operations of persons in the network domain to be delineated. Therefore, it is necessarily rooted in a behavioral model. In sociology, the field concept refers to a structure of positions that are equipped with different sorts of social capital (Bourdieu and Wacquant 1992). This concept is compatible with the concept of network domain in Relational Sociology (Schmitt 2019). Throughout the paper, I use these terms interchangeably. There is never a social network without a culture giving meaning to connectivity, and, vice versa, there is never a culture without it being practiced in social relations. The concept of network domain captures this duality of connectivity (network) and culture (domain) (White 2008). For this reason, I refer to “socio-cultural systems” as opposed to the more common “social systems” term.

Fig. 2 Unified field and data model. The feedback mechanism of field reproduction is depicted on the left. Transactions are the constituents of fields and facts are the components of the emergent patterns that influence transactions in downward causation. This field model maps to a bipartite graph model of selections, shown in the middle. Selection matrices G can be projected into fact-coupled transaction matrices H (which are used for partitioning seeds and fields) and into fact co-selection matrices I (as shown in Fig. 1), both depicted on the right. For the citation practice, transactions are publications, facts are references, selections are citations, H is the reference-coupled “bibliographic coupling” publication matrix, and I is the reference co-citation matrix which represents the conceptual pattern of the field. The graph plots visualize a toy example where normalization is used throughout. Consult the “Technical Appendix” for details of the formalism.

Network domains reproduce themselves through self-organization. Transactions are their building blocks (Emirbayer 1997). In the “network” dimension, these are social relations. In the “domain” dimension, facts are selected. Durkheim (1982 [1895]) conceptualized a fact as a thing that emerges from collective action and influences individual behavior. Selection expresses this duality: persons both actively choose to make reference to (“select”) facts and, at the same time, are influenced by them. Put into the relational perspective of complex socio-cultural systems, a field operates by persons making selections in transactions, from which meaning structures emerge that feed back onto future transactions (Breiger 1974; Fuhse 2009; Padgett and Powell 2012; Page 2015). The feedback loop of field reproduction is depicted in the left part of Fig. 2. While emergence is non-causal, “downward causation” conceptualizes the causal part of the feedback dynamic (Flack 2017). Meaning structures are any kind of observable pattern, like fact co-selection structures or fact size distributions. Their function is to signal which facts belong to the core of the network domain. The core harbors the agreed-upon concepts and institutions of a network domain (Fuchs 2001). Facts can be distinguished according to their capability of agency, i.e., their ability to actively engage in social action (Emirbayer and Mische 1998). If facts are capable of agency, e.g., persons, groups, or organizations, the corresponding meaning structure is social. Meaning structures built of symbols, words, ideas, etc. are cultural networks (McLean 2017). Finally, network domains involve multiple practices or types of agency (Swidler 1986).

Sociologically, bibliographic data is particularly interesting because it contains data on three practices: one social and two cultural. Authorship is the social practice of communicating research results in scholarly publications—in my terminology: authors are selected in publications. The other practices are cultural because the facts are not capable of agency. Citation is the practice of making reference to concept symbols, i.e., references are cited in transactions; word usage is the practice of language, i.e., words are selected in transactions.

Core concepts in scientometrics are easily incorporated into this field model. For example, the duality of connectivity and culture is mirrored in the idea that research communities are not just social groups but “thought collectives” (Fleck 1979 [1935]) who “share similar research interests” (Zuccala 2006, p. 155). A publication is a transaction made by authors in which references and word concepts are selected (cited and used). A cited publication is a fact since, being a concept symbol, it influences the citing publication (Small 1978). Co-citation (Small 1973) and co-word (Callon et al. 1986) networks are examples of cultural meaning structures for the practices of citation and word usage, respectively. The size distributions of Lotka (1926), Bradford (1985 [1934]), Zipf (2012 [1949]), and Price (1976) are descriptions of such meaning structures, signaling who are the core scholars, journals, linguistic concepts, and citeable references, respectively, in a field.

Field delineation procedure

The procedure proposed here is based on the bibliometrically enhanced publication retrieval procedure of Zitt and Bassecoulard (2006). This method can be mapped to the field model just described. It is based on the citing/cited/citing logic that publications which are known to belong to a field of interest cite a set of core references which must also be cited by other field publications. One starts on the citing side of transactions: a field is delineated by retrieving, from the set of all publications S in a database, a seed set A, using expert-defined lexical queries that are very precise. Then one moves to the cited side of meaning structures: from A, the set B of cited references is identified, in which each reference is cited y times; to obtain a generic and specific core of cited references, B is reduced to C by requiring that references in B receive \(y\ge Y\) citations from the seed publications; Y is a genericness parameter; next, C is reduced to the “cited core” D by requiring that references in C receive a fraction \(u=y/y'\ge U\) of their citations from the seed A; \(y'\) is the number of citations a reference receives in the whole database S; U is a specificity parameter. Finally, one goes back to the citing side: the field E is the set of publications that each cite at least \(x\ge X\) references in the cited core D; X is a relevance parameter. Throughout the paper, I refer to this procedure as the original method.
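
To make this citing/cited/citing logic concrete, the following minimal Python sketch implements the original method's set operations. The representation of publications as sets of cited-reference identifiers and all names are illustrative assumptions, not part of the original method's specification.

```python
from collections import Counter

def delineate_field(seed, database, Y, U, X):
    """Sketch of the citing/cited/citing logic of Zitt and Bassecoulard (2006).

    `seed` and `database` map publication ids to sets of cited-reference ids
    (the seed is assumed to be contained in the database); Y, U, and X are the
    genericness, specificity, and relevance parameters.
    """
    # B: references cited by the seed A, with their seed citation counts y
    y = Counter(ref for refs in seed.values() for ref in refs)
    # y': citation counts of those references in the whole database S
    y_total = Counter(ref for refs in database.values() for ref in refs if ref in y)
    # C: generic references cited at least Y times from the seed
    C = {ref for ref, count in y.items() if count >= Y}
    # D: the cited core -- references receiving a fraction u = y/y' >= U
    #    of their citations from the seed
    D = {ref for ref in C if y[ref] / y_total[ref] >= U}
    # E: the field -- publications citing at least X core references
    return {pub for pub, refs in database.items() if len(refs & D) >= X}
```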

Fig. 3 Field delineation procedure. First, a set of candidate transactions (boundary set) \(\varOmega\) and an initial iteration of the field (seed set) A are created, and A is partitioned into subfields \(A_{\mathrm{s}}\). Second, the overlapping subcores \(B_{\mathrm{s}}\) selected by \(A_{\mathrm{s}}\) are defined using threshold parameters \(\varPsi\) for fact genericness and \(\varUpsilon\) for fact specificity. Third, the field E is assembled by retrieving those transactions from \(\varOmega\) that select the subcores \(B_{\mathrm{s}}\) and fulfil a stated requirement (minimum subcore recall or precision), defined by the assembly parameter \(\varPi\). Fourth, the final (extended) field set Z is created by adding to E its cores \(\varGamma _{\mathrm{s}}\) if transactions and facts are of the same entity type. Consult Table 1 for details on the notation used in the procedure.

Table 1 Notation of field delineation procedure

I generalize this procedure to be able to delineate any type of field defined as the feedback process of transactions selecting facts. The reasons to also modify the original method are twofold. First, it is unfair in the case of field heterogeneity. For a subfield that is large or has a very skewed citation distribution, \(Y=10\) citations may not be much, but for a subfield that is small or has a less skewed size distribution, it may be a lot. Even for subfields with similar size and skewness, thresholding on a particular Y would be unfair if reference list lengths vary. To mitigate this problem, I introduce a clustering sub-procedure and perform field delineation on the subfield level. Second, having access to a whole database of all transactions S is the exception rather than the rule, if not impossible (e.g., in the case of commercial databases like the Web of Science or Scopus). Often, database access is restricted or download limits are imposed. My method does not require access to a full database. Instead, the field is built from a restricted set of candidate transactions. As a consequence of this modification, expert knowledge is not needed in the first delineation step of creating the seed but in a later step, and the risk of expert bias is minimized. The modified field delineation procedure is sketched in Fig. 3. The notation used throughout this paper is summed up in Table 1.

Creating boundary and seed sets

The first step is to create two sets of transactions. The boundary set \(\varOmega\) contains the transactions that are candidates for belonging to the field (being inside the boundary). It should be devoid of transactions that are completely off topic because, in the next step, a sample will be coded as inside/outside the boundary, and this classification task should decide upon nuances, not obviousness. The seed set A is the first iteration of the field. It should be as precise as possible as it is used to create candidate lists of core facts. But it need not be as precise as the seed in the original method because expert knowledge is involved in the classification just mentioned. \(\varOmega\) is a superset of A, i.e., the seed is fully contained in the boundary set.

To account for field heterogeneity—the existence of differently sized subfields or of varying selection practices—two actions are taken. First, the weights of the selections made in a transaction are normalized to sum to unity (Batagelj and Cerinšek 2013). To handle data, a unified field and data model is introduced which maps the field model of transactions and facts to a bipartite graph model of selections (Fig. 2). In a nutshell, bipartite selection graphs consist of two types of vertices with only inter-type connections. The first type of vertices are transactions; the second type of vertices are facts; an edge is created if a fact is selected in a transaction. For each practice, one normalized selection graph is constructed, as sketched below. Details are laid out in the “Technical Appendix”.
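
The following sketch illustrates this construction, assuming selections arrive as (transaction, fact) index pairs; the exact normalization is defined in the “Technical Appendix”, so the plain row normalization shown here is a simplification.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def selection_matrix(selections, n_transactions, n_facts):
    """Bipartite selection matrix G: G[i, j] = 1 if fact j is selected in
    transaction i. `selections` is an iterable of (transaction, fact) pairs."""
    rows, cols = zip(*selections)
    return csr_matrix((np.ones(len(rows)), (rows, cols)),
                      shape=(n_transactions, n_facts))

def normalize(G):
    """G^N: the weights of the selections made in each transaction sum to unity."""
    row_sums = np.asarray(G.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0  # leave empty transactions untouched
    return diags(1.0 / row_sums) @ G
```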

Second, delineation is made on the subfield level, i.e., subseeds \(A_{\mathrm{s}}\) are identified. This action is inspired by Mogoutov and Kahane (2007). The goal is to create clusters of transactions based on similar selection profiles (Doreian et al. 2004). There are many ways to construct such similarities (Eck and Waltman 2009). Here, a purely graph-theoretic approach is used that has very natural interpretations. It results in analytical transaction graphs whose edge weights represent transaction similarities in the [0, 1] interval (cf. “Technical Appendix”). Given this graph, the seed is partitioned (single membership) using community detection (Fortunato 2010); a sketch follows below. Refined computational methods should proceed by detecting dynamic communities.
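
A minimal sketch of this projection and partitioning step, reusing the normalized matrix from the previous sketch; note that the paper's similarity normalization (“Technical Appendix”) is here replaced by the plain product \(G^{\mathrm {N}}G^{{\mathrm {N}}\top }\), and all names are illustrative.

```python
import networkx as nx  # nx.community.louvain_communities requires networkx >= 3.0

def partition_transactions(G_N, random_seed=0):
    """Project the normalized selection matrix into a fact-coupled transaction
    graph and partition it (single membership) with Louvain community detection."""
    H = (G_N @ G_N.T).tocoo()  # transaction-transaction similarity weights
    graph = nx.Graph()
    graph.add_nodes_from(range(G_N.shape[0]))
    for i, j, w in zip(H.row, H.col, H.data):
        if i < j:  # skip self-loops, add each undirected edge once
            graph.add_edge(i, j, weight=w)
    communities = nx.community.louvain_communities(
        graph, weight="weight", seed=random_seed)
    return {node: c for c, members in enumerate(communities) for node in members}
```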

In the case of scientometrics, when publications are coupled through the citation practice, this is the “bibliographic coupling” network (Kessler 1963). Other transaction graphs are possible, e.g., author-coupled and word-coupled publication graphs. The way selections are normalized was first proposed by Leydesdorff and Opthof (2010) for counting citations.

Defining the subcores

The second step is to identify a subcore \(B_{\mathrm{s}}\) for each subseed \(A_{\mathrm{s}}\). Subcores must be both generic and specific. In the original method, genericness is ensured through requiring core facts to each have at least a certain number of selections from the seed. If one chose the same absolute threshold for all subfields, then small subfields and those with a less skewed size distribution would be punished. To ensure that all subfield cores are equally generic, my method takes advantage of the situation that few facts are selected by, or retrieve, many transactions.

In short, facts \(f_{j,s}\) are ranked such that a fact’s rank increases when it is highly selected in a subseed s but decreases when it is highly selected in the whole seed (the \(tf*idf\) principle). The genericness \(\psi _{j,s}\) of a fact j in subseed s is then the cumulative sum of selection fractions \(K^{\mathrm {N}}\). Finally, facts are thresholded against a genericness parameter \(\varPsi\), the first parameter, such that \(\psi _{j,s}\le \varPsi\). For example, when \(\varPsi =0.1\), the highest-ranked facts that accumulate no more than the top ten percent of all selections are chosen to constitute a subcore. The exact method is laid out in the “Technical Appendix”.
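
A sketch of this thresholding step for one subseed; the \(tf*idf\)-style score and the plain counts used below are illustrative stand-ins for the exact ranking and the selection fractions \(K^{\mathrm {N}}\) defined in the “Technical Appendix”.

```python
import numpy as np

def generic_subcore(counts_s, counts_seed, Psi):
    """Return the fact indices of the generic candidate subcore of subseed s.

    counts_s / counts_seed: per-fact selection counts in the subseed and in the
    whole seed (all facts are assumed to be selected at least once in the seed).
    """
    score = counts_s * np.log(counts_seed.sum() / counts_seed)  # tf*idf-like rank
    order = np.argsort(-score)                    # highest-ranked facts first
    fractions = counts_s[order] / counts_s.sum()  # each fact's share of selections
    psi = np.cumsum(fractions)                    # genericness psi per rank
    return order[psi <= Psi]                      # facts with psi <= Psi
```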

To ensure specificity, informed manual work must be involved in the delineation procedure at some point. Zitt and Bassecoulard (2006) have proposed to define those facts as belonging to the core that receive at least a certain fraction of their selections from a seed that is highly precise. This requires expert knowledge in the first step of defining the seed. This knowledge can result in a lexical query that retrieves no, or hardly any, false positives, or in the identification of curated collections, e.g., conference proceedings or tweet collections, where all transactions are on topic.

Here, I propose an alternative approach: to attribute a specificity to facts, a sample \(\varOmega '\) of the boundary set \(\varOmega\) is coded along an inside/outside dichotomy. This approach changes the kind of expert work from defining a transaction set to defining a codebook on how to classify transactions. Having this codebook, the actual classification task can be outsourced—maybe even to crowd workers if they are well trained and paid. The specificity of fact \(f_{j,s}\) is then \(\upsilon _{j,s}=|\varOmega '_{\mathrm {in}}|_{j,s}/|\varOmega '|_{j,s}\). Here, \(|\varOmega '_{\mathrm {in}}|_{j,s}\) is the size of the subset of transactions in the sample, retrieved by \(f_{j,s}\), that are ruled to belong to the field, and \(|\varOmega '|_{j,s}\) is the size of the subset of transactions in the sample retrieved by \(f_{j,s}\). Finally, facts are thresholded against a specificity parameter \(\varUpsilon\), the second parameter, such that \(\upsilon _{j,s}\ge \varUpsilon\). For example, when \(\varUpsilon =0.5\), then a fact is chosen to co-constitute a subcore if at least half of the transactions it retrieves are relevant for the field (ruled inside the boundary).
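
As an illustration, a minimal sketch of the specificity computation; the data structures and names are assumptions for demonstration purposes.

```python
def specificity(retrieved_sample, coding):
    """Specificity of one fact: the fraction of sample transactions retrieved by
    the fact that were coded as inside the boundary. `retrieved_sample` is the
    set of sample transaction ids selecting the fact; `coding` maps sample ids
    to True (inside) or False (outside)."""
    hits = retrieved_sample & coding.keys()
    return sum(coding[t] for t in hits) / len(hits) if hits else 0.0

# a fact co-constitutes a subcore if specificity(...) >= Upsilon, e.g., 0.5
```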

Genericness ensures that retrieval is efficient. The larger this parameter is set, the more a subcore consists of large numbers of less selected facts. Specificity ensures that retrieval is accurate. The larger this parameter is set, the more a subcore consists of facts that only retrieve relevant transactions as judged by the coding of the sample \(\varOmega '\). Subfield retrieval is evaluated using recall and precision. \(\mathrm {Recall}=|\varOmega '_{\mathrm {in}}|_{D_{\mathrm{s}}}/|\varOmega '_{\mathrm {in}}|\) is the fraction of relevant transactions in the sample \(\varOmega '\) that are retrieved by the subcore \(D_{\mathrm{s}}\). \(\mathrm {Precision}=|\varOmega '_{\mathrm {in}}|_{D_{\mathrm{s}}}/|\varOmega '|_{D_{\mathrm{s}}}\) is the fraction of transactions retrieved by \(D_{\mathrm{s}}\) from the sample \(\varOmega '\) that are relevant. The evaluation metrics and retrieval parameters are naturally related. Recall is a transaction-side measure and is strongly influenced by the fact-side genericness parameter; precision is a transaction-side measure and is strongly influenced by the fact-side specificity parameter.
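
The corresponding evaluation, again as a sketch over the coded sample; names are illustrative.

```python
def recall_precision(retrieved, coding):
    """Recall and precision of a (sub)core evaluated on the coded sample.
    `retrieved` is the set of sample transactions the core retrieves; `coding`
    maps sample transaction ids to True (inside) or False (outside)."""
    relevant = {t for t, inside in coding.items() if inside}
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision
```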

Assembling the field

The third step is to create the second iteration of the field by retrieving those transaction sets \(E_{\mathrm{s}}\) that select the subcores defined in the previous step and creating the set union E. One can use a different retrieval parameter setting for each subfield. Then the task is to decide on a setting by trading recall off against precision—how many false positives is one willing to accept for the benefit of reducing false negatives? If the goal is to delineate a field through one set of facts chosen by one parameter setting, not one set and setting for each subfield, then the problem arises that a particular parameter setting can entail varying recall and precision for different subcores. This is because a fact that belongs to the core of one subfield can belong to the periphery of another subfield. Then, define an assembly parameter \(\varPi\), the third parameter: a minimum recall or precision that applies to all subfields alike. From this minimum value, a universal parameter setting can be deduced that maximizes overall precision or recall.
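
A sketch of how such a universal setting could be deduced, here for the minimum-precision variant of \(\varPi\) used later in the paper; the `evaluate` callback and the use of average subfield recall as a stand-in for overall recall are assumptions.

```python
import itertools

def deduce_setting(psi_grid, upsilon_grid, subfields, evaluate, Pi):
    """Choose one (Psi, Upsilon) setting for all subfields: among all settings
    where every subfield reaches at least precision Pi, maximize recall.
    `evaluate(subfield, Psi, Upsilon)` must return (recall, precision)."""
    best, best_recall = None, -1.0
    for Psi, Upsilon in itertools.product(psi_grid, upsilon_grid):
        scores = [evaluate(s, Psi, Upsilon) for s in subfields]
        if min(p for _, p in scores) < Pi:
            continue  # minimum precision Pi not met by some subfield
        recall = sum(r for r, _ in scores) / len(scores)
        if recall > best_recall:
            best, best_recall = (Psi, Upsilon), recall
    return best
```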

Extending the field

The fourth and final step is not part of the original method and only makes sense if transactions and facts are of the same entity type. For example, publications and cited references are of the same type but publications and used words are not; tweets and retweeted tweets are of the same type but tweets and used hashtags are not. The step consists of partitioning E into subfields \(E'_{\mathrm{s}}\) using community detection as in the first step, defining the subcores \(\varGamma _{\mathrm{s}}\) selected by \(E'_{\mathrm{s}}\), and adding to E the facts in \(\varGamma _{\mathrm{s}}\), thresholding on the value of \(\varPsi\) identified in the previous step. Call this third and final iteration of the field the extended field Z. In the scientometric case, this step amounts to adding to the field its most cited references because those carry important meanings even though they may not be directly related to the topic. This is often the case for methodological contributions.

Delineating Social Network Science

As stated in the introduction, the goal is to delineate SNS as a multidisciplinary science of social networks that roughly combines the classical SNA field and the subfield of Network Science that studies socio-cultural systems. Data was queried from the Web of Science. I chose this database because its records are historical (they go back to 1900), they are systematically collected via journals whose impact factors are not low (Garfield 1979), and because a lot of effort is put into upholding a high data quality. The Microsoft Academic Graph is also historical and automatically collects many more records (Sinha et al. 2015), but for that reason its data quality is also lower. Queries were made in 2013 via the online interface at www.webofknowledge.com. Unfortunately, records can only be downloaded in batches of 500. This complicates field delineation enormously and has caused me to delineate SNS on the subfield level but not dynamically.

Creating boundary and seed sets

In this first step, the boundary set \(\varOmega\), from which publications representing SNS are “recruited”, and the seed set A, the first iteration of the field that selects the subcores later used for assembling the field, are created. On the one hand, candidate publications should not be required to use the word SOCIAL NETWORK* in title, abstract, or author keywords (throughout the paper, “words” are meant to include sequences of n tokens or n-grams) because a contribution to SNS may well use a different word (e.g., “social relation”). On the other hand, not all publications using the SOCIAL NETWORK* 2-gram should automatically be inside the boundary. For example, “social network” is also used metaphorically, in which case I do not consider the respective publication to be inside SNS. But all candidates for E, the second iteration of the field created in the third step, should use the words SOCIAL and NETWORK*. These thoughts define the two initial sets. The boundary set \(\varOmega\) contains 44,308 publications using the words SOCIAL and NETWORK*. The seed A is a subset of \(\varOmega\) and contains 23,568 publications using SOCIAL NETWORK*. Note that the seed is not very precise. Publication years in \(\varOmega\) range from 1953 to 2014.

This data was then processed. Each publication and reference was transformed into a key such that a cited reference can be matched to a citing publication. Granovetter’s (1973) paper, e.g., has the matchkey GRANOVET_1973_A_1360. All titles, abstracts, and author keywords were preprocessed and stemmed. All words used by at least one author in the seed as a keyword represent the vocabulary. A vocabulary word is selected by a publication if it is used in either the title, abstract, or author keywords. For details of data processing see the Supplementary Information (Section 1).
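
For illustration, a hypothetical sketch of such a matchkey function; the exact recipe is documented in the Supplementary Information (Section 1), so the field order and truncation below are assumptions inferred from the example (for books, the key would end in the first title word instead of a page number, a variant not implemented here).

```python
import re

def matchkey(author, year, source, page):
    """Hypothetical matchkey: truncated surname, year, initial of the source
    title, and beginning page, e.g.
    matchkey("Granovetter", 1973, "American Journal of Sociology", 1360)
    -> 'GRANOVET_1973_A_1360'."""
    surname = re.sub(r"[^A-Z']", "", author.upper())[:8]
    return "_".join([surname, str(year), source.upper()[0], str(page)])
```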

Based on the description of Scott (2012) and other analyses of SNS (Hummon and Carley 1993; Freeman 2004; Shibata et al. 2007; Lazer et al. 2009; Brandes and Pich 2011; Freeman 2011; Hidalgo 2016; Maltseva and Batagelj 2019), the field is expected to have a social-psychological path with a strong graph-theoretical focus, a diverging ethnographical lineage, a structuralist narrative following the breakthroughs of White et al. in the 70s, a development driven by physics starting around 2000, and a recent surge of research on animal social networks. These paths belong to different scientific disciplines with different styles of practice. Therefore, SNS is not delineated as if it had one core but by accounting for the heterogeneity of subfields with possibly different sizes and publication characteristics.

To partition the seed, I created a selection graph each for the three practices of authorship, citation, and word usage. For the latter, I did not distinguish whether a word is used in title, abstract, or author keywords. KeyWords Plus generated automatically from reviewing reference titles (Garfield and Sher 1993) are not used in this study. The three selection graphs were then projected into fact-coupled transaction graphs following the method depicted in Fig. 2 and described in the “Technical Appendix”. These graphs and their combinations were then clustered using Louvain community detection (Blondel et al. 2008).

Table 2 Seed clustering statistics for different coupling methods

Table 2 shows that graphs from author coupling are two orders of magnitude sparser than those from reference coupling and three orders sparser than those from word coupling. As a consequence, they also differ largely in how many publications belong to the largest connected component (LCC). For author coupling, only \(41.1\%\) of all publications are at least indirectly similar via shared authors. The three types of facts also have a different power of distinction. Modularity Q quantifies the extent to which edges are internalized to clusters (Newman 2006), i.e., how permeable subseed boundaries are. A modularity of \(Q_{\mathrm {ref}}=0.46\) for reference coupling means that cited references are less distinctive than authors (\(Q_{\mathrm {aut}}=0.96\)) but more distinctive than words (\(Q_{\mathrm {wrd}}=0.14\)). Rows for hybrid coupling indicate that, once words are part of the coupling mix, Q is low, i.e., subseeds are largely overlapping. This reflects the fact that words obtain their meaning in co-usage, that language can be used flexibly, and that it is less precise in delineating fields than citation (Glänzel and Thijs 2011; Zitt 2015).

Reproducibility is another issue. Louvain community detection has a stochastic element (Lancichinetti and Fortunato 2012). Intuitively, the more boundaries overlap, the more publications will be assigned to a partition based on chance. The Adjusted Rand Score quantifies how similar two partitions are (Fortunato 2010). I arrive at means and standard deviations by comparing the solutions of ten runs. It turns out—counter-intuitively—that clustering word-coupled graphs is the most, and clustering author-coupled graphs the least, reproducible. This is because there is less randomness in partitioning lexical graphs, as similarity scores (edge weights) have a much wider spectrum.
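
Such a reproducibility score can be computed as the mean pairwise Adjusted Rand Score over repeated clustering runs; a sketch, assuming the graph from the earlier partitioning sketch.

```python
import networkx as nx
from sklearn.metrics import adjusted_rand_score

def clustering_reproducibility(graph, runs=10):
    """Mean pairwise Adjusted Rand Score over repeated Louvain runs."""
    nodes = list(graph)
    labelings = []
    for r in range(runs):
        communities = nx.community.louvain_communities(
            graph, weight="weight", seed=r)
        label = {n: c for c, members in enumerate(communities) for n in members}
        labelings.append([label[n] for n in nodes])
    scores = [adjusted_rand_score(a, b)
              for i, a in enumerate(labelings) for b in labelings[i + 1:]]
    return sum(scores) / len(scores)
```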

To sum up, even though the component communities of word-coupled publication graphs are the most strongly overlapping, their partitions are the most reproducible. But once hybrid coupling is used, including references, authors, or both, reproducibility drops. It is clear that all further results are contingent on the choice of facts for coupling publications. At this point, I decided to exclude author coupling from the following considerations because it decreases reproducibility. But there is also a substantive argument: the cultural and the social operate on different time scales. Words, references, and their co-selections are much more institutionalized than authors and team compositions (Padgett and Powell 2012). From this perspective, not coupling publications via authors means not allowing social currents to have an impact on subsequent results and aiming at a more culture-dependent analysis.

Table 3 Description of subseeds from different coupling methods

In Table 3, the communities or subseeds from reference, word (lexical), and reference/word (hybrid) coupling are described via rankings like top subject categories and facts for the corresponding selection graphs. From the ten clustering runs, the one with the largest modularity is used. Partitions are robust in that they—with one exception—describe five non-trivial communities. By interpreting subseed descriptions, I label these Social Psychology and Epidemiology (SPE), Economic Sociology (ES), Social Network Analysis (SNA), Network Science (NS), and Computational Social Science (CSS). Partitioning for different fact coupling also results in the same temporal ordering. SPE is the oldest community and CSS is the newest. The choice of coupling has an effect on subseed composition. SPE is much larger when delineated lexically or the hybrid way; ES is smaller. Reference coupling results in two subseeds for SNA.

Since no gold standard exists, there is no objective criterion to evaluate the partitions. I chose hybrid coupling because hybrid methods balance the advantages and disadvantages of citation-based and lexical approaches (Braam et al. 1991; Glänzel and Thijs 2011; Zitt 2015). Having excluded author coupling, this prevents either references or words from determining future results.

Defining the subcores

In this second step, the subcores \(B_{\mathrm{s}}\) are defined from which the field is later assembled. To obtain the subcores, the genericness and specificity of each fact (reference and word) is determined for each subseed. Fact genericness \(\psi _{j,s}\) can directly be computed from the normalized selection matrices of the subseeds (cf. “Technical Appendix”). To obtain fact specificity, I took a sample \(\varOmega '\) of 1000 publications from the boundary set, 499 of which are in the seed, and manually decided whether they should be inside or outside SNS (relevant or not). The process can be retraced by studying the Supplementary Information (Section 2), which gives 15 examples for each class. Here, it becomes clear why the boundary set should not contain publications that are largely off topic. If it did, it would be obvious whether they belong to the field or not, but such a classification would be of limited use. Since the goal was to define a science of social networks, not a sociological network science, I ruled publications inside or relevant when they are truly relational, and outside or irrelevant when the NETWORK concept is used metaphorically or in non-social contexts. Publications about engineering networked social systems were ruled inside when they are not purely about issues of implementation. Given this classification, specificity \(\upsilon _{j,s}\) is the fraction of publications retrieved by a fact that are relevant (as judged via the sample \(\varOmega '\)).

Fig. 4 Recall and precision of publication retrieval. Recall and precision of the subcores and the total core depend on the efficiency (genericness parameter) and accuracy (specificity parameter) of the delineation procedure. For the example of Social Psychology (SP), the subcore with a genericness of up to 10% and a specificity of at least 50% recalls 89% of the relevant publications at a precision of 43%. For lexical retrieval and the same parameters, recall is 43% at a precision of 29%. Values for the total are obtained by treating the whole field like a subfield.

Figure 4 depicts the efficiency and accuracy of the retrieval procedure and how the genericness and specificity parameters influence its recall and precision, for the unpartitioned seed and broken down by the five subseeds. I had to introduce an upper limit for genericness of 0.2 to reduce manual labour in using the Web of Science online interface. The plots reveal two things. First, recall is higher for citation-based retrieval, parameter settings being equal. Reference cores are more generic or, put differently, at similar genericness, lexical retrieval is associated with a lower recall because language use is relatively imprecise. Second, idiosyncrasies of subfields point at differences of ideational closure or cultural coherence (Fuchs 2001, p. 55). Social Network Analysis has a very generic small core: few references and words suffice to retrieve a large fraction of relevant publications. Accordingly, the word SOCIAL_NETWORK_ANALYSIS is used by 1105 or \(39\%\) of the publications, and \(38\%\) cite the reference WASSERMA_1994_SOCIAL (cf. Table 3c). Social Psychology and Epidemiology is at the other extreme. Lexical retrieval is either good regarding recall or regarding precision, but not both. Only \(8\%\) of the publications use the top-ranked SOCIAL_SUPPORT and only \(5\%\) cite the top-ranked BERKMAN_1979_A_186 (Table 3a). These idiosyncrasies highlight the need for a subfield-specific delineation procedure.

Assembling the field

In this third step, the second iteration E of the field is assembled from the subcores \(B_{\mathrm{s}}\). When delineating SNS using one parameter setting, the problem that a particular setting entails varying recall and precision of the subcores shows as follows. For citation-based retrieval, setting \(\varUpsilon =0.5\) (\(\varPsi =0.1\)) results in an overall precision of 0.76 but, for Social Psychology, of only 0.43. This is a direct effect of the boundary problem, and the proposed solution is to decide on a justifiable fuzziness of the boundary. In that case, the researcher must decide how the assembly parameter \(\varPi\) should determine a universal setting of \(\varPsi\) and \(\varUpsilon\). Demanding that \(\varPi\) guarantee a minimum recall for all subfields entails an increase of false positives (an increase of retrieved publications that are irrelevant) as the assembly parameter increases. As a result, the boundary will be more fuzzy. Demanding that \(\varPi\) guarantee a minimum precision for all subfields entails an increase of false negatives (an increase of relevant publications that are not retrieved) as the assembly parameter increases. As a result, the boundary will be less fuzzy. As I want the boundary to contain few irrelevant publications, I prioritize accuracy and condition the genericness and specificity parameters on a minimum precision \(\varPi\) that should be achieved for all subfields. This parameter reflects the confidence that the boundary separates irrelevant from relevant publications. Given \(\varPi\), those parameters \(\varPsi\) and \(\varUpsilon\) are deduced which maximize overall recall.

Fig. 5 Effects of boundary confidence choice. (Left) The fraction of publications retrieved from the boundary set \(\varOmega\) decreases with increasing minimum precision \(\varPi\), but less so for citation-based retrieval. The black trend represents the set union resulting from hybrid retrieval. (Right) Opposing trends of overall recall and precision according to hybrid retrieval of the classified sample \(\varOmega '\).

Table 4 Parameters for field assembly

Figure 5 reports effects of a given confidence in the boundary. The left plot reports the fraction of publications retrieved from the boundary set \(\varOmega\). Different upper bounds are visible. For citation-based retrieval, no more than \(65\%\) of 44,308 papers can be retrieved as \(\varPsi \le 0.2\) sets a technical limit. Lexical cores can retrieve \(93\%\) at low minimum precision but the fraction quickly drops when subfield boundaries are required to be less fuzzy. The advantage of hybrid retrieval, the set union from the two approaches, is that it balances the high specificity/low genericness of citation-based retrieval and the low specificity/high genericness of lexical retrieval. When the boundary from hybrid retrieval is required to be perfectly precise (\(\varPi =1\)), then the field will consist of about 23,000 publications. But when \(80\%\) of the publications are allowed to be irrelevant (\(\varPi =0.2\)), then the field will be about 35,000 publications strong. The right plot of Fig. 5 reports the efficiency and accuracy of hybrid retrieval conditional on \(\varPi\). Recall is at a satisfactorily high level for the whole range of minimum precision. At this point, I set \(\varPi =0.8\) because a lower value would increase the overall fraction of false positives to over \(10\%\). Table 4 lists the parameters and the average subfield recall and precision that can be achieved for a given minimum precision. It reveals that the parameters corresponding to \(\varPi =0.8\) are \(\varPsi =0.2\) and \(\varUpsilon =0.9\) and that the confidence that the subfield boundaries separate irrelevant from relevant publications is \(97\%\) on average regarding references to concept symbols and \(98\%\) regarding language use.

Table 5 Number of facts used for retrieval

Table 5 states that the subfields contribute different numbers of facts to the overall retrieval core and that a genericness of \(20\%\) translates to different absolute selection thresholds. While Computational Social Science’s core references are cited at least four times, an absolute threshold for Network Science would have to be eight. This demonstrates that the original method of not distinguishing subfields is only applicable to network domains whose subfields neither differ in size nor in selection practices. Using the deduced parameter setting, the five subcores \(B_{\mathrm{s}}\) (per practice) are defined and the subfields \(E_{\mathrm{s}}\) which select these subcores are retrieved. The set union of the \(E_{\mathrm{s}}\) is the second iteration E of SNS and consists of 24,748 publications. This is slightly larger than the seed, from which the delineation procedure has so far removed irrelevant publications and to which it has added relevant ones from the boundary set.

Extending the field

Table 6 Core description and sourcing

In this fourth step, the second iteration E of the field is extended by adding to it its most cited references. This step is supposed to reconstruct more complete citation paths. The procedure calls for partitioning E into subfields \(E'_{\mathrm{s}}\). Louvain community detection in the network of publications (coupled through references and words) again results in five communities, as expected (modularity \(Q=0.12\)). The subfields \(E'_{\mathrm{s}}\) they represent can be meaningfully mapped to the subseeds \(A_{\mathrm{s}}\) via the most used word (Table 6). Clustering consensus from ten runs is \(99\%\), compared to \(92\%\) for the seed (cf. Table 2). Cited cores for the field are also somewhat smaller than for the seed (cf. Table 5). Both results are evidence that the delineation procedure has produced more compact subfields and less fuzzy subfield boundaries.

Next, the subcores \(\varGamma _{\mathrm{s}}\) are identified, now only using the genericness parameter \(\varPsi =0.2\) (the same value as for defining the subcores \(B_{\mathrm{s}}\)). Unlike the subcores \(B_{\mathrm{s}}\), the subcores \(\varGamma _{\mathrm{s}}\) need not be specific since scholarly fields are always open to some extent—they also cite core references outside their own boundaries. By counting the fraction of cited references in subcores \(\varGamma _{\mathrm{s}}\) that are also contained in the field E and in the subfield \(E'_{\mathrm{s}}\), I determine the extent to which the field and its subfields are closed. Books are not covered by the Web of Science products I had access to. Consequently, founding books like Moreno’s Who Shall Survive? (1934) can only show up as cited references, not as citing publications. Social Network Analysis is most self-contained with respect to the whole field: \(38\%\) of its core references are themselves publications in SNS, mirroring its role as the methodological powerhouse of the field. With respect to subfields, Social Psychology and Epidemiology (SPE) and Network Science (NS) are most closed. Computational Social Science (CSS), the youngest subfield, cites its own publications the least. As expected for a subfield rooted in computer science, it cites a large fraction of conference proceedings articles (book chapters).

Table 6 further reports that only \(38\%\) of CSS’s 2768 core references could be identified in the database—could be sourced—and added to the field. Its article sourcing rate is smallest, too. NS has the largest sourcing rate overall (\(70\%\)) and for chapters (\(22\%\)). \(93\%\) of SPE’s cited articles were successfully added to the field. In total, 4965 core references were added to the field. In the resulting set, the extended field Z, I removed some publications or references to prevent meaningless results, artifacts, or the failure of algorithms.

Fig. 6 Entity relationship diagram of the final dataset. Tables with primary keys (PK) contain entities (e.g., publications) and their attributes (e.g., the time it was published). Tables that only contain foreign keys (FK) are relational tables that can be directly used for network construction.

The final dataset Z, the third iteration of the field, consists of 25,760 publications (journal and conference proceedings articles). Following the disambiguation of author names (Supplementary Information, Section 1.1), 45,580 author identities remain that relate to publications in 68,227 authorships. 574,036 distinct references are selected in 1,125,321 citations (180,861 to publications in SNS). Following the removal of general science language (Supplementary Information, Section 1.3), 23,026 words (occurring in titles, abstracts, or as author keywords) are used in 201,608 selections. These entities and relationships are displayed in Fig. 6. The dataset is made publicly available (Lietz 2019) and can be explored online (Lietz 2020).

Description of the final dataset

Fig. 7 Field growth and statistics. Curvature of the number of publications over time (compared to the broken line representing purely exponential growth) signals slight superexponential growth. The other curves depict the average number of facts selected in a publication per year. Words subsume words in titles and abstracts as well as author keywords.

Table 7 Subfields in Social Network Science

The earliest publication in SNS is Hanifan’s “The rural school community center” from 1916 (HANIFAN_1916_A_130) because it is often cited as one of the first occurrences of the SOCIAL_CAPITAL concept. From then on, the field grows continuously with a slight tendency for superexponential growth, as can be seen in the top plot of Fig. 7. It also shows that subfields came to exist at different points in time and exhibit phases of accelerating and decelerating growth. To obtain the final subfields, the hybrid publication graph representing Z is once again clustered using Louvain community detection. Table 7 is a description of the five subfields. Labels still match those of the seed very well (cf. Table 3). The consensus of detecting these communities is 0.91, i.e., extending the field has reduced the consensus from field clustering; boundaries are fuzzier again. Modularity is low (\(Q=0.13\)) because of very high density (\(D=0.57\)).

Some assignments of publications to subfields are counterintuitive. For example, Heider’s article on balance theory (HEIDER_1946_J_107) is not in Social Psychology but in Economic Sociology, together with Cartwright and Harary’s graph theoretical generalization (CARTWRIG_1956_P_277) as well as foundational works of the Harvard school, like Granovetter’s “The strength of weak ties” (GRANOVET_1973_A_1360) and White et al.’s article on blockmodeling (WHITE_1976_A_730). This makes sense because these papers belong to the sociometry tradition initiated by Moreno.

The importance of fractional selection counting in the construction of publication similarity scores is once more demonstrated by the average number of references per publication, which is a characteristic score for each subfield, depicted in Fig. 7. The fact that an average paper in Economic Sociology cites almost twice as many references in 2010 as an average paper in Computational Social Science means that a citation in the latter subfield is twice as valuable. Normalized citation counts \(k^{\mathrm {N}}\) account for such differences but are still affected by the size of the respective subfield. Publication fractions K account for size differences but not for different citation practices. Only citation fractions \(K^{\mathrm {N}}\) are comparable across subfields (Table 7). The reference WASSERMA_1994_SOCIAL and the word SOCIAL_NETWORK_ANALYSIS are about ten times more common in Social Network Analysis than the top reference O’REILLY_2005_WHAT in Computational Social Science or the top word COMPLEX_NETWORK in Network Science.

The average number of words per publication exhibits a marked jump in 1990 because, in that year, the database producers started including abstracts and author keywords in the Web of Science database. The average number of authors per publication has been increasing steadily since the 1970s, marking the decade when the field started becoming a “big science” (Price 1986) in which knowledge production in teams is increasingly important (Wuchty et al. 2007). There are differences, however. Economic Sociology is much less a team science than Social Psychology & Epidemiology.

Fig. 8 Size distributions of Social Network Science. Probability density functions for authorship and citation are plausibly fit by a pure power law \(p(k)\sim k^{-\alpha }\) using the maximum likelihood method (Clauset et al. 2009; Gillespie 2015). Fits are plausible if \(p>0.1\), and the scores are \(p_{\mathrm {aut}}=0.30\) and \(p_{\mathrm {ref}}=0.39\), respectively. Colored points result from logarithmic binning and show that extreme values also fall on the straight lines. The best power-law fit to the word usage distribution (\({\hat{x}}_{\mathrm {min}}=10\pm 14\), \({\hat{\alpha }}=2.0\pm 0.1\)) is not plausible (\(p_{\mathrm {wrd}}=0.00\)). The plots show the actual fit parameters but legends give \(95\%\) confidence intervals from bootstrapping.

Figure 8 unveils that SNS is well described by power law size distributions for authorship (Lotka’s Law, Lotka 1926) and citation (Price 1976). There is also evidence for Zipf’s Law (Zipf 2012 [1949]). Even though the word usage distribution is not plausibly fit by a pure power law, all subfields except Economic Sociology are plausibly fitted by Zipf exponents \({\hat{\alpha }}_\mathrm {wrd}\approx 2\) (Lietz 2016, Table 3.8). Finally, Fig. 1 displays the cores of the field’s three practices, all created from the same genericness threshold. These graphs are filtered counterparts of the normalized fact co-selection matrices \(I^{\mathrm {N}}_\mathrm {aut}\), \(I^{\mathrm {N}}_\mathrm {ref}\), and \(I^{\mathrm {N}}_\mathrm {wrd}\). As shown in Fig. 2, communities of vertices in fact-coupled transaction matrices translate to communities of edges (Ahn et al. 2010) in fact co-selection matrices. Hence, edge colors indicate how facts are overlappingly co-selected in different subfields. The differences in network size mostly result from differences in concentration indicated by the decrease of exponents from \({\hat{\alpha }}_\mathrm {aut}\approx 3\) to \({\hat{\alpha }}_\mathrm {wrd}\approx 2\). This is mirrored in the observation that the weighted fraction \(K^{\mathrm {N}}\) of similarly ranked facts is always (much) larger for words than for references (cf. Table 7).
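
For readers who want to reproduce such tests, the following sketch estimates a discrete power-law exponent and a Clauset-style bootstrap p-value, assuming Alstott et al.'s Python `powerlaw` package (the paper's values come from the R poweRlaw package of Gillespie 2015, so numerical details may differ).

```python
import numpy as np
import powerlaw  # Alstott et al.'s Python implementation of Clauset et al. (2009)

def powerlaw_p_value(sizes, n_boot=100, seed=0):
    """Fit a discrete pure power law p(k) ~ k^(-alpha) by maximum likelihood and
    estimate its goodness-of-fit p-value by semi-parametric bootstrap; fits with
    p > 0.1 count as plausible. A sketch of the standard recipe."""
    rng = np.random.default_rng(seed)
    sizes = np.asarray(sizes)
    fit = powerlaw.Fit(sizes, discrete=True)
    alpha, xmin, ks = fit.power_law.alpha, fit.power_law.xmin, fit.power_law.KS()
    body = sizes[sizes < xmin]       # empirical values below the cutoff
    p_tail = (sizes >= xmin).mean()  # probability of drawing a tail value
    exceed = 0
    for _ in range(n_boot):
        # synthetic data: body resampled from the data, tail from the fitted law
        n_tail = rng.binomial(len(sizes), p_tail)
        model = powerlaw.Power_Law(xmin=xmin, parameters=[alpha], discrete=True)
        parts = [model.generate_random(n_tail)]
        if n_tail < len(sizes):
            parts.append(rng.choice(body, size=len(sizes) - n_tail))
        boot = powerlaw.Fit(np.concatenate(parts), discrete=True)
        exceed += boot.power_law.KS() > ks
    return alpha, xmin, exceed / n_boot
```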

Discussion and conclusion

All data is produced for a purpose, and I have presented a procedure to retrieve a research dataset representing a socio-cultural field from a corpus not necessarily created for a research purpose. The method is a development of the field delineation procedure of Zitt and Bassecoulard (2006), which departs from expert knowledge but minimizes the associated risk of expert bias by bibliometrically enhancing retrieval via a citing/cited/citing logic. By (a) mapping this logic to the mechanism of how fields reproduce themselves through positive feedback, (b) modifying it to account for field heterogeneity, and (c) generalizing the routine to be able to delineate any type of field, I have proposed a sociologically enhanced information retrieval method. The reliance on a reproductive mechanism (Padgett and Powell 2012) effectively mitigates the risk associated with hidden assumptions because “the field writes the query.” The risk is further reduced by modifying the way expert knowledge is used. Whereas, in the original method, expert knowledge is used to define a precise seed set of transactions, in the method proposed here, it is used to decide whether transactions in a candidate set should be inside or outside the field boundary. As my method requires inductive, not deductive, reasoning, experts are enabled to learn, to identify their priors, and to transcend them (Arthur 1994).

One may ask if this gain is worth a complicated procedure with three parameters, a non-deterministic sub-procedure (clustering), and some manual classification. In the case presented here, why not simply use the SN17 dataset (Maltseva and Batagelj 2019) described in the introduction? The answer lies in the sociological boundary problem: any delineation is a construction. The SN17 dataset and the one discussed here serve different research purposes. If one accepts having many publications in the corpus that use “social networks” metaphorically, then SN17 is a good choice. Parameter-free methods as applied to delineate SN17 tend to be black boxes. But since all delineations are constructions, the steps made in field delineation are already the first steps of field analysis. This should be evident in this paper. Therefore, if one wants to have control over the boundary, my method may be an option. Then, the third parameter \(\varPi\)—here, chosen to be the minimum precision—serves as a goodness-of-boundary measure. That said, the SN17 dataset can also be retrieved with the method described here. First, the seed and boundary sets are created as the same set using the same SOCIAL NETWORK* search term, and the boundary sample is coded as described. Second, the genericness parameter \(\varPsi\) is set to 1 to retrieve publications via all facts, and the specificity parameter \(\varUpsilon\) is set to 0 to also use all facts for retrieval. \(\varPi\) is then not a retrieval parameter anymore but a characterization of boundary fuzziness.

Still, evaluations are necessary. Since the mechanistic approach delineates fields in an organic way, certain statistical properties of dynamic systems are expected and can be used for a soft kind of evaluation, namely exponential growth and power law size distributions (Price 1986). While exponential growth is the hallmark of complex adaptive innovation systems, power laws are expected signatures because, as a functional pattern, they point at an optimization process that results in fractal structures (West 2017). Both signatures are found. The field grows slightly superexponentially, indicating that it is innovating successfully. There are also no gaps which could indicate that publications of a particular period have been missed. The field obeys Lotka’s Law and exhibits a power law distribution for the citation practice. Language use statistics deserve a closer look. For the whole field, the size distribution for word usage is not plausibly fit by a power law, but for four of the five subfields, Zipf’s Law holds. This leads to three conjectures. First, the field has not yet self-organized to a scale-free pattern. Second, the way natural language was processed introduced a bias. Third, delineation on the subfield level does not necessarily create a coherent whole. While the first conjecture resembles a finding, the last two may be limitations of the method and deserve future attention.

Clear limitations exist. First, subcores were defined disregarding time, i.e., they are most affected by recent years with many publications. This recency effect can be avoided by using dynamic community detection. Not identifying subcores over time was the price of using the Web of Science and retrieving data via the online interface. To ease research, database producers could consider calls for data access where users are granted improved data access under defined terms of use. Second, abstracts and author keywords are not available for years before 1990. Since I used all author keywords as the vocabulary, which is then extracted from titles and abstracts, I rely on the assumption that all relevant keywords are used at least once in, or after, 1990. I think this is a fair assumption. Third, I also provided the expert knowledge when ruling candidate publications inside or outside SNS, i.e., there is no reliability check. For the reader to retrace my decisions, I present selected cases in the Supplementary Information (Section 2).

The dataset has high face validity. Subfield descriptions are robust throughout the delineation procedure (Tables 3, 7). The analysis of the dataset—not described here but in my dissertation (Lietz 2016)—reproduces results known from the previous literature, namely, roots in social psychology and graph theory, a structuralist narrative starting in the 70s, and a turn at the end of the century driven by physics (Freeman 2004; Scott 2012; Maltseva and Batagelj 2019). But the study also uncovers new insights into field dynamics, particularly regarding the paradigm shifting effects that arise when an incommensurable research style forcefully and massively enters a field. In the case of SNS, the mainstream was lastingly altered and old knowledge got more or less lost (Lietz 2016).

I conclude that the boundary constructed for SNS is a fair delineation of the field for the purpose of studying its historical evolution. The main contribution of this paper is a sociologically enhanced information retrieval method that integrates a field model, a retrieval model, and a data model. There is indeed a benefit in importing more social science into information science (Leydesdorff and Van Den Besselaar 1997; Cronin 2008). The fact that a reproductive mechanism is at the heart of the procedure makes it principally applicable to other settings. For example, in a social media monitoring context it is typically difficult to foresee which semantic selectors (e.g., hashtags) will be used in a monitoring phase. It is much easier to define which users (e.g., politicians) are relevant (Stier et al. 2018). Future delineations of a social media monitoring corpus may be improved by starting with a user-based seed set, monitoring the emergent pattern of potential selectors, and adding/removing selectors if necessary.