Abstract

Direct demand modeling is a useful tool to estimate the demand of urban rail transit stations and to determine factors that significantly influence such demand. The construction of a direct demand model involves determination of the catchment area. Although there have been many methods to determine the catchment area, the choice of those methods is very arbitrary. Different methods will lead to different results and their effects on the results are still not clear. This paper intends to investigate this issue by focusing on three aspects related to the catchment area: size of the catchment area, processing methods of the overlapping areas, and whether to apply the distance decay function on the catchment area. Five catchment areas are defined by drawing buffers around each station with radius distance ranging from 300 to 1500 meters with the interval of 300 meters. Three methods to process the overlapping areas are tested, which are the naïve method, Thiessen polygon, and equal division. The effect of distance decay is considered by applying lower weight to the outer catchment area. Data from five cities in the United States are analyzed. Built environment characteristics within the catchment area are extracted as explanatory variables. Annual average weekday ridership of each station is used as the response variable. To further analyze the effect of regression models on the results, three commonly used models, including the linear regression, log-linear regression, and negative binomial regression models, are applied to examine which type of catchment area yields the highest goodness-of-fit. We find that the ideal buffer sizes vary among cities, and different buffer sizes do not have a great impact on the model’s goodness-of-fit and prediction accuracy. When the catchment areas are heavily overlapping, dividing the overlapping area by the number of times of overlapping can improve model results. The application of distance decay function could barely improve the model results. The goodness-of-fit of the three models is comparable, though the log-linear regression model has the highest prediction accuracy. This study could provide useful references for researchers and planners on how to select catchment areas when constructing direct demand models for urban rail transit stations.

1. Introduction

Transit-oriented development (TOD) plays a pivotal role in urban planning. TOD refers to a planning and design method that maximizes the use of public transportation for both residential and commercial areas. For example, a customized urban center with a radius of 400–800 meters can be built to integrate commerce, education, culture, employment, and residence facilities adjacent to the public transportation stations. In TOD planning, an important component is to establish an accurate and reliable direct ridership model to facilitate transportation operators to formulate urban rail transit operation strategies and assist urban planners to design more efficient and convenient urban and transportation plans.

In direct ridership models, land use characteristics within the catchment area are indispensable. A crucial question in building a ridership model is the choice of an appropriate size for catchment area. Some previous studies used 800 meters as the radius of the circular buffer to predict the ridership at the transit station level [1, 2]. Others set up direct ridership models with 500 meters as the buffer radius. In addition, many introduce the concept of a half-mile walking distance as the radius of the catchment area [3, 4]. Cervero [5] studied the commuting options of people living within 0.5 and 3 miles of the San Francisco Bay Area with the results showing that people who work near transit stations are more likely to live within 0.5 miles of urban rail transit stations and use public transit to commute. He also studied the factors affecting travel demand based on a total of 261 light rail stations in the United States and Canada [6]. He suggested that both population and employment densities within 0.5 miles of the station positively correlate with daily transit ridership. Pan et al. [7] created buffer areas of 500 meters, 1000 meters, and 2000 meters to model Shanghai subway ridership and found similar results. There is a wealth of studies on direct ridership modeling which use different radius distances for the catchment area, and it appears that the choice criteria of buffers have not yet reached a consensus.

The task of determining an appropriate catchment area is not limited to choosing its radius but also includes how the catchment area is applied. Many studies directly used the circular catchment area to obtain values of influencing variables within the catchment area after determining the radius [7–9]. Because the circular catchment area does nothing with the overlap area, we regard it as the first method. The most common and straightforward approach to generate the circular area is by using the ArcGIS software. However, in cities with dense stations, there will be excessive overlaps between the catchment areas, with some areas being repeatedly counted, which may impose a negative impact on the results. Therefore, some studies employed Thiessen polygons to tackle the overlapping areas by assigning the points closest to a station to a polygon around that station [1, 10–12]. In our study, the Thiessen polygon is the second method to deal with the overlapping catchment area. Yet, the Thiessen polygon method has its limitations. For example, it does not perform well for densely urbanized areas where the short distances among stations may generate clusters of small polygons, which lead to potential calculation inaccuracy of variable values. To overcome the overlapping issue, we propose to improve the circular buffer area method by dividing the overlapping area by the number of overlapping buffers. We therefore take the approach of dividing the overlap area equally as the third method to process the coverage overlap area. Besides the treatment of the overlapping area, some scholars used the distance decay method to represent the fact that the attraction of an urban rail transit station decreases as the distance to the station increases [13, 14]. In the distance decay approach, the weights of variables change with the distance. Several weighting schemes for distance decay have been proposed, and we compare two of them.

Once the catchment area has been determined and properly represented, the next task is to establish the direct ridership models. Regarding the regression models in transportation demand research, most studies in recent years used ordinary multiple linear regression techniques [2, 4, 15]. In this line of studies, there are also some attempts to apply logarithmic transformation on the dependent variable to form a log-linear regression model [16]. Last, some papers adopt the use of negative binomial regression models [9, 17, 18].

In light of these gaps in the literature, our paper establishes direct ridership models for five cities in the United States, aiming to answer the following four questions. (1) What is an appropriate size of the catchment area? (2) Among the three methods to process the overlap of catchment areas, i.e., the naïve implementation (ordinary circle), combining the circle and Thiessen polygon into a new catchment area, and dividing the overlap of the circular buffer areas equally among adjacent circles, which method performs the best? (3) Which weighting method is better when using the distance decay approach? (4) Which model performs better among linear regression, log-linear regression, and negative binomial regression techniques? To avoid drawing biased conclusions from one city, we included five different cities in the U.S. to obtain more generalizable findings.

The remainder of this paper is organized as follows. Section 2 reviews the existing literature in the following aspects: influencing factors of transit ridership, buffer radius selection, coverage overlapping area treatment methods, and model selection. Section 3 describes the study area, data sources, and methodology, including the treatment of the overlapping area of a catchment area and two simplified weighting methods. Section 4 describes the regression models. Section 5 presents the model results and their interpretation. Section 6 discusses the findings, and Section 7 concludes the study.

2. Literature Review

Understanding the influencing factors of transit ridership has been a recent research focus. In general, the influencing factors can be divided into the following three categories: socioeconomic, land use, and traffic attributes. Former studies show that socioeconomic characteristics such as population and employment are positively correlated with transit ridership [4, 17, 19–21]. Land use characteristics include but are not limited to the 3Ds (land use density, design, and diversity) and mixed land use levels [22–26]. Scholars found that land use density and diversity have a positive impact on ridership [21]. Traffic attributes and station characteristics, such as bus routes, road density, accessibility, and station types, may also have significant influence on transit demand [2, 27–30]. Some authors found that road density is positively related to ridership; they also noted that transfer stations and terminal stations increased ridership [2, 8].

In the studies of land use and transportation demand, land use variables are usually measured on the basis of arbitrarily defined areas, and various methods for defining catchment areas lead to distinct results. Ruiz-Pérez and Seguí-Pons [31] compared the effect of four different geographical spatial units (neighborhood, census section, cadastral block, and 400 × 400 m mesh) on bus service level analysis. Results showed that different zones led to significant differences in the service level, and combining zonings of different sizes simultaneously was recommended. After analyzing the data of 1,449 rail transit stations in 21 cities in the United States, Guerra et al. [15] studied 6 buffer bands with increments of 0.25 mile from 0.25 mile to 1.5 miles and concluded that different catchment areas have little influence on a model’s predictive power. Jun et al. [21] studied the land use characteristics of the Seoul metropolitan area and the impact of land use characteristics on station-level ridership with different buffer bands for the buffer area of 300 meters, 300–600 meters, and 600–900 meters. Their results show that the impact of population density and mixed land use on ridership is only significant at the 300 meters and 300–600 meters buffer band levels, recommending that the pedestrian catchment area of a compact city like Seoul should be defined using a radius of 600 meters. Mitra and Buliung [32] established 4 buffer areas at different scales (250 meters, 400 meters, 800 meters, and 1000 meters) around children’s homes and schools, measured the effects of the built environment characteristics at the 4 scales, and found that the goodness-of-fits of the four models do not differ much, though the model fits slightly worse as the distance increases. In addition, the magnitudes of the effects of the individual built environment characteristics are inconsistent across buffers of different scales. Relying on survey data, Li et al. [9] identified seven buffer zones of 50 meters, 100 meters, 200 meters, 400 meters, 800 meters, 1200 meters, and 1600 meters and established a regression model of residents’ travel behavior. As can be seen from these studies, no consensus has been reached with respect to the size of the catchment area, and the effect of the size of the catchment area on the modeling results is still not clear.

The handling of the buffers can significantly affect the results and should be treated with caution. When processing the buffer zone, most articles adopt a naïve circular buffer method, which means that the issue of overlapping areas of station coverage is simply ignored [8, 9, 19, 21, 28, 33–35]. Still, there are some other methods. For example, Li et al. [22] used Thiessen polygons to deal with the overlapping issue with the 800 meters circular buffer. When Sun et al. [12] divided the multilevel water catchment area, the radius of the pedestrian and traffic area was determined by the residents’ travel survey, and the division of the potential catchment area was determined by the Thiessen polygon. Gutiérrez et al. [13] generated Thiessen polygons to specifically divide service areas based on multiple circular bands in order to consider competition between stations. The cropped area used by Guerra et al. [15] is similar to the buffer area generated by the Thiessen polygon. Kuby et al. [4] used a grid-based connection on-off network method to improve the standard buffer delimitation method of ArcGIS and define a more accurate service area. Considering that the Euclidean buffers may overlap, Corazza and Favaretto [36] used network buffers, which were determined based on polygons covering all the edges that are within a 400-meter area of the stop. The equal division method adopted in this article is not covered by the previous studies, and its merits and limitations have not been compared to those of other buffer delimitation methods.

Distance to the station is another important consideration in the model. Some studies found that the probability of passengers choosing a station is related to the distance to the station. Untermann found that most people are willing to walk up to 500 feet (152.4 meters), 40% are willing to walk 1,000 feet (304.8 meters), but only 10% are willing to walk up to 1 mile [37]. This uncertainty of distance inspires innovations in methodology such as the distance decay method. Gutiérrez et al. [13] combined the distance decay function with the multiple regression analysis to establish a rapid response passenger prediction model using different distance thresholds to improve the model results (3.4% on the 800 m threshold). Yet, the distance decay method is not without its limitations. The studies considering distance decay only infer conclusions based on their own results without comparing their results with other weighting scales or verifying their results using data of other cities.

Model improvement techniques on this topic have been developed and applied over the recent years. In most previous studies, linear regression is frequently used [2, 4, 7, 11, 12, 15, 28, 38]. However, skewed distribution of the dependent variables can be an issue for the linear regression approach. To solve this, logarithmic transformation of dependent variables can be applied [16]. Concerning overdispersed data, some chose negative binomial regression to reduce standard errors [9, 17, 18]. For example, Thompson et al. [17] used a negative binomial regression model to study factors affecting transit ridership in Florida and found that some variables such as population, total household income, and total employment can explain rise in bus ridership.

In summary, although the existing studies have analyzed the impact of the built environment on ridership from many aspects, the effects of the size of the catchment area, processing methods of the overlapping area, whether to consider distance decay and model selection on the results are still not clear. This article tries to study those effects by analyzing the urban rail transit ridership data from five American cities.

3. Methods

3.1. Study Area

We chose five American urban rail transit systems for our analysis (i.e., New York, Chicago, Philadelphia, Boston, and San Francisco Bay Area) as these urban rail transit systems are well developed and serve a large metropolitan population. The spatial analysis unit is census block groups (CBG). Among these cities, New York has the largest urban rail transit system with a total of 36 lines and 472 stations. Its urban rail transit system also has the longest history among the five systems. The rail transit systems of Chicago, Philadelphia, and Boston have similar number of stations, more than 100. The Bay Area Rapid Transit system has 50 stations. The urban rail transit systems of New York and Chicago are shown in Figure 1 as examples.

The buffer areas defined in this paper are circular buffer areas with radii of 300 meters, 600 meters, 900 meters, 1200 meters, and 1500 meters. On this basis, bands of 300–600 meters, 600–900 meters, 900–1200 meters, and 1200–1500 meters were developed as inputs for the distance decay function. The distribution of the urban rail transit stations in the five cities is different: for example, the urban rail transit stations in San Francisco Bay Area are relatively scattered, and the overlapping buffer area is small. In contrast, urban rail transit stations in New York are denser and therefore there is much higher overlapping buffer area (Figure 2).
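As a minimal sketch of how a location could be assigned to one of the five buffer bands, the snippet below maps a straight-line distance to a band index. The function names, the projected metre coordinates, and the convention that a band includes its outer boundary are assumptions made for illustration; the paper itself builds the buffers in ArcGIS.

```python
import math

# Outer radii (in metres) of the five circular buffers used in this paper.
BAND_EDGES_M = [300, 600, 900, 1200, 1500]


def distance_m(p, q):
    """Euclidean distance between two (x, y) points in projected metre coordinates."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def band_index(dist_m):
    """Return 0 for within 300 m, 1 for 300-600 m, ..., 4 for 1200-1500 m.

    Returns None when the point lies outside the largest buffer.
    Assumption: each band includes its outer boundary.
    """
    for i, outer in enumerate(BAND_EDGES_M):
        if dist_m <= outer:
            return i
    return None


# Hypothetical station and sample points, for illustration only.
station = (0.0, 0.0)
points = [(100.0, 0.0), (0.0, 450.0), (900.0, 0.0), (1000.0, 1000.0), (2000.0, 0.0)]
bands = [band_index(distance_m(station, p)) for p in points]
print(bands)
```

Summing a variable over all points that share a band index gives the per-band value that a distance decay weighting would then scale.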

3.2. Transit Ridership Data

The dependent variable of our study is the average weekday ridership of the five urban rail transit systems in 2010. The year 2010 is selected because the values of the built environment variables for 2010 are highly accurate. We obtained ridership data from the New York City Transit Authority (MTA), the Chicago Transit Authority (CTA), the Massachusetts Bay Transportation Authority (MBTA), the Southeastern Pennsylvania Transportation Authority (SEPTA), the Port Authority Transit Corporation (PATCO), and the Bay Area Rapid Transit (BART).

To facilitate model selection, we plot histograms of ridership and logarithmic transformed ridership for each city, as shown in Figure 3. It can be seen that New York has the highest ridership. The ridership of all cities is skewed to the right. After performing the logarithmic transformation on the ridership, the distributions are closer to the normal distribution. From Figure 3, we noticed that the distribution of ridership in Philadelphia is more uneven: the ridership of most stations is below 5,000 passengers, with only a few stations exceeding this value.

3.3. Independent Variables

Based on the literature review, we selected 18 built environment variables as our independent variables. These variables include socioeconomic variables, built environment variables, and station characteristic variables. Table 1 presents the description of these independent variables. The socioeconomic variables are population, employment, proportion of households with one car or less, proportion of low-income family, and so on. Such variables are drawn from the Smart Location Database (SLD) dataset. The built environment variables include the density of the road network, the distance from the station to the Central Business District (CBD), the number of bus stations within the service area of the station, and the number of bus lines within a 200 meters buffer around the station. The station attribute variables include the number of subway lines available at the station, and two dummy variables that indicate whether the station is a transfer station or a terminal station, respectively.

3.4. Processing Method of Overlapping Area of Station Service Coverage

When we draw circular buffers around stations, there could be overlaps between the station buffers if the stations are close to each other (Figure 4). We compare several methods to deal with the overlapping problem here.

3.4.1. Naïve Method

This method ignores this issue and the overlapping areas will be counted multiple times when calculating values of the variables for each buffer. For example, if three buffers intersect in some area, this overlapping area will be counted into all of these three buffers, which means it will be calculated 3 times. As a result, some variables such as population, employment, housing, density of residents, and number of bus stops are repeatedly counted.

3.4.2. Thiessen Polygon

Thiessen polygons are also called Voronoi diagrams or Voronoi polygons. It is one of the basic methods to analyze neighborhood in proximity. Thiessen polygons are used to describe the areas of influence of sample points. For any point in a Thiessen polygon, its distance to the sample point in the polygon is less than its distance to any other sample point. An example of Thiessen polygons based on some stations in Chicago is shown in Figure 5(a) and Figure 5(b).

The specific steps are as follows. First, create circular buffer areas and Thiessen polygons around stations, respectively. Second, use the intersection tool in ArcGIS to intersect the Thiessen polygons with the circular buffer. As we can see in Figure 6(a), it is the boundaries of the Thiessen polygons that cut the overlapping areas of the circular buffers. Finally, we use the fusion function in the data management tool in ArcGIS to form the new catchment area (Figure 6(b)). The advantage of using Thiessen polygons here is that it can handle overlapping areas between buffers to avoid overlapping areas from being double counted.
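The intersect-and-merge procedure can be approximated on point data as follows. This is only a sketch under stated assumptions: the study area is represented by hypothetical cells, each reduced to a centroid with one attribute value, and the function name `thiessen_clipped_totals` is introduced here for illustration. A cell is credited to its nearest station (the Thiessen rule) and only if it also lies within the buffer radius of that station.

```python
import math


def thiessen_clipped_totals(stations, cells, radius_m):
    """Sum cell values per station using Thiessen polygons clipped by a circular buffer.

    stations: dict mapping station name -> (x, y) in metres
    cells:    list of (x, y, value) tuples, e.g. population at a cell centroid
    """
    totals = {name: 0.0 for name in stations}
    for x, y, value in cells:
        # Thiessen rule: the cell belongs to the nearest station only.
        nearest = min(
            stations,
            key=lambda s: math.hypot(x - stations[s][0], y - stations[s][1]),
        )
        sx, sy = stations[nearest]
        # Buffer rule: keep the cell only if it is inside that station's circle.
        if math.hypot(x - sx, y - sy) <= radius_m:
            totals[nearest] += value
    return totals


# Hypothetical data: two stations 1000 m apart and five cells on the same axis.
stations = {"A": (0.0, 0.0), "B": (1000.0, 0.0)}
cells = [
    (100.0, 0.0, 50.0),
    (450.0, 0.0, 30.0),
    (600.0, 0.0, 20.0),
    (1800.0, 0.0, 10.0),
    (3000.0, 0.0, 99.0),  # outside every buffer, so it is never counted
]
totals = thiessen_clipped_totals(stations, cells, radius_m=900.0)
print(totals)
```

Because each cell is credited to at most one station, no value is counted twice, which is the property the Thiessen polygon method is used for here.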

3.4.3. Equal Division

The equal division method divides the overlapping area by the number of overlapping buffers and applies the division result as a weight to calculate the value of variables. For example, if three buffer areas overlap, the values of variables such as population, employment, and the number of bus stops within the overlapping area are divided by three and assigned to each buffer area. We use Python to implement the aforementioned procedures.
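The paper does not list its Python implementation, so the following is a minimal sketch of the equal division idea on point data, not the authors' code. It assumes the study area is represented by hypothetical cells with a centroid and one attribute value; the function names are introduced here for illustration. The naïve method is included for comparison.

```python
import math


def covering_stations(stations, x, y, radius_m):
    """Names of all stations whose circular buffer contains the point (x, y)."""
    return [
        name
        for name, (sx, sy) in stations.items()
        if math.hypot(x - sx, y - sy) <= radius_m
    ]


def naive_totals(stations, cells, radius_m):
    """Naive method: a cell is counted in full for every buffer that covers it."""
    totals = {name: 0.0 for name in stations}
    for x, y, value in cells:
        for name in covering_stations(stations, x, y, radius_m):
            totals[name] += value
    return totals


def equal_division_totals(stations, cells, radius_m):
    """Equal division: a cell's value is split evenly among the buffers covering it."""
    totals = {name: 0.0 for name in stations}
    for x, y, value in cells:
        covering = covering_stations(stations, x, y, radius_m)
        for name in covering:
            totals[name] += value / len(covering)
    return totals


# Hypothetical data: two stations 1000 m apart with 900 m buffers.
stations = {"A": (0.0, 0.0), "B": (1000.0, 0.0)}
cells = [
    (50.0, 0.0, 50.0),    # covered by A only
    (450.0, 0.0, 30.0),   # covered by A and B
    (600.0, 0.0, 20.0),   # covered by A and B
    (1800.0, 0.0, 10.0),  # covered by B only
]
naive = naive_totals(stations, cells, radius_m=900.0)
equal = equal_division_totals(stations, cells, radius_m=900.0)
print(naive, equal)
```

In this toy example the naïve totals add up to 160 although the covered cells only hold 110, whereas the equal division totals add up to exactly 110.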

3.5. The Distance Decay Function on Buffer Bands

The theoretical basis of the distance decay function is Tobler’s First Law of Geography, which indicates that sample points that are closer have a greater impact on the results than points that are far away. Previous studies found that when the walking distance of potential users increases, the public transport usage decreases [39]. The effect of distance is converted into a weight in the mathematical model. By applying weights to the explanatory variables within different buffer bands, a distance decay weighted regression model is established.

The buffer bands used in this article are within 300 meters, 300–600 meters, 600–900 meters, 900–1200 meters, and 1200–1500 meters. The weights of these buffer bands are determined based on the distance.

Gutiérrez et al. [13] used a linear distance decay function to perform a weighted regression. We adopt that linear distance decay function and apply the weight of 0.5, 0.4, 0.3, 0.2, 0.1 to the buffer bands of within 300 meters, 300–600 meters, 600–900 meters, 900–1200 meters, and 1200–1500 meters, respectively.

Manout et al. [39] calibrated the distance decay function using the data of the Paris family travel survey. We also adopt this nonlinear distance decay function and apply 0.8, 0.3, 0.1, 0, and 0 to the buffer bands of within 300 meters, 300–600 meters, 600–900 meters, 900–1200 meters, and 1200–1500 meters, respectively. The two weighting methods are shown in Table 2.
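The two weighting schemes can be illustrated with a short sketch. It assumes that the weighted value of a variable is the weighted sum of its per-band values, which is one plausible reading of the procedure; the function name and the population figures are hypothetical.

```python
# Weights for the bands within 300 m, 300-600 m, 600-900 m, 900-1200 m, 1200-1500 m.
LINEAR_WEIGHTS = [0.5, 0.4, 0.3, 0.2, 0.1]      # linear distance decay
NONLINEAR_WEIGHTS = [0.8, 0.3, 0.1, 0.0, 0.0]   # nonlinear distance decay


def weighted_value(band_values, weights):
    """Distance-decay-weighted value of one variable.

    Assumption: the weighted value is the sum over bands of weight * band value.
    """
    if len(band_values) != len(weights):
        raise ValueError("band_values and weights must have the same length")
    return sum(w * v for w, v in zip(weights, band_values))


# Hypothetical population counted in each of the five bands around one station.
population_by_band = [1000, 2000, 3000, 4000, 5000]
linear = weighted_value(population_by_band, LINEAR_WEIGHTS)
nonlinear = weighted_value(population_by_band, NONLINEAR_WEIGHTS)
print(round(linear, 6), round(nonlinear, 6))
```

With the nonlinear weights, anything beyond 900 meters contributes nothing, so that scheme effectively shrinks the catchment area to the three inner bands.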

4. Model Description

4.1. Multiple Linear Regression

Multiple linear regression is a commonly used regression model [40, 41]. In this article, we employ multiple linear regression to evaluate the impact of multiple independent variables on station-level ridership. The parameters of the linear regression model are determined by minimizing the sum of squared errors. The function of multiple linear regression is as follows:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \tag{1}$$

where $y$ represents the dependent variable in the model, which is the average annual weekday ridership of an urban rail transit station; $x_1, \ldots, x_k$ are the independent variables; $\beta_0, \beta_1, \ldots, \beta_k$ are the regression coefficients; and $\varepsilon$ is the error term. For example, $\beta_1$ represents the average change of the dependent variable for each additional unit of $x_1$ while keeping other variables constant.
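The least-squares criterion mentioned above can be written out explicitly. The indexed notation is assumed here for illustration: $y_i$ is the ridership of station $i$, $x_{ij}$ is its $j$-th independent variable, $n$ is the number of stations, and $k$ is the number of independent variables.

```latex
\hat{\boldsymbol{\beta}}
  = \arg\min_{\beta_0, \ldots, \beta_k}
    \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \right)^2
```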

4.2. Log-Linear Regression

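Figure 3 shows that the ridership of the five cities is skewed to the right. As a result, the natural logarithmic transformation of the dependent variable is performed and the transformed variable follows the normal distribution approximately. The function of the log-linear regression is as follows:

$$\ln(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \tag{2}$$

with $y$, $x_i$, and $\beta_i$ denoting the ridership, the independent variables, and the regression coefficients, respectively.

The interpretation is similar to that of the multiple linear regression. The only difference is that, in the log-linear regression model, each additional unit of $x_i$ multiplies the dependent variable by $e^{\beta_i}$ while keeping other variables constant.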
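The multiplicative reading of a log-linear coefficient can be checked numerically. In the sketch below, the function names and the coefficient value 0.05 are hypothetical and are not taken from the paper's results.

```python
import math


def multiplicative_effect(beta):
    """Factor by which the dependent variable changes per unit increase of a regressor."""
    return math.exp(beta)


def percent_change(beta):
    """The same effect expressed as a percentage change of the dependent variable."""
    return (math.exp(beta) - 1.0) * 100.0


# Hypothetical coefficient, for illustration only.
beta_example = 0.05
print(multiplicative_effect(beta_example), percent_change(beta_example))
```

For small coefficients the percentage change is close to 100 times the coefficient, which is why a coefficient of 0.05 is often read as roughly a 5% increase per unit.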

4.3. Negative Binomial Regression

The dependent variable in this study, ridership, is a count variable. The negative binomial regression model is commonly used to analyze count variables. In addition, it can be observed from the histogram that the ridership is skewed to the right, which is consistent with overdispersion, i.e., a variance greater than the mean, the situation that the negative binomial regression is designed to accommodate. The function of the negative binomial regression model is shown below:

$$\ln(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k, \tag{3}$$

where $\mu$ is the expected ridership of a station, $\beta_0, \beta_1, \ldots, \beta_k$ are the regression coefficients, and $x_1, \ldots, x_k$ are the independent variables.
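For reference, the mean-variance relationship usually associated with negative binomial regression is written out below. The source does not state which parameterization was used; the common NB2 form with dispersion parameter $\alpha$ is assumed here, with $\mu_i$ denoting the expected ridership of station $i$.

```latex
\operatorname{E}[y_i] = \mu_i, \qquad
\operatorname{Var}[y_i] = \mu_i + \alpha \mu_i^{2}, \qquad \alpha \ge 0
```

Under this assumption, $\alpha = 0$ recovers the Poisson case, in which the variance equals the mean, and any $\alpha > 0$ yields a variance larger than the mean.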

5. Results

By applying the methods and models described in the previous sections, we obtained the final results. We used three indicators to evaluate the goodness-of-fit of the models: adjusted R2, mean absolute percentage error (MAPE), and Akaike information criterion (AIC).

Adjusted R2 is the coefficient of determination adjusted for degrees of freedom. In the model results, a higher adjusted R2 indicates better goodness-of-fit. The function of adjusted R2 is shown below:

$$R^2_{\text{adj}} = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)}. \tag{4}$$

In (4), SSR is the sum of squares of the errors, SST is the total sum of squares, which is the sum of squares of the differences between the observed values and their mean, $n$ is the number of observations, and $k$ is the number of independent variables.
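A minimal sketch of the adjusted R2 computation is given below, assuming the standard definition with $n$ observations and $k$ independent variables. The function name and the small data set are hypothetical.

```python
def adjusted_r2(observed, predicted, n_predictors):
    """Adjusted R2 = 1 - (SSR / (n - k - 1)) / (SST / (n - 1))."""
    n = len(observed)
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    mean_obs = sum(observed) / n
    # SSR: sum of squared errors between observed and predicted values.
    ssr = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    # SST: sum of squared deviations of the observations from their mean.
    sst = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - (ssr / (n - n_predictors - 1)) / (sst / (n - 1))


# Hypothetical observations and fitted values from a one-predictor model.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
adj = adjusted_r2(observed, predicted, n_predictors=1)
print(round(adj, 4))
```

Here SSR is 0.10 and SST is 10, so the unadjusted R2 is 0.99 and the adjustment for one predictor lowers it slightly.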

The formula to calculate the MAPE is as follows:

$$\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|, \tag{5}$$

where $\hat{y}_i$ is the predicted value of the dependent variable, $y_i$ is the observed value, and $n$ is the number of observations. In the model results, a smaller MAPE indicates better model prediction. Its value can be calculated directly with the forecast package in R. For log-linear regression, we need to convert the log-transformed dependent variable back to the original scale to calculate the true MAPE.
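The MAPE and the back-transformation needed for the log-linear model can be sketched in plain Python as follows. This is an independent illustration, not the R forecast package mentioned above. The function names and data are hypothetical, and a simple exponential back-transform is assumed, since the paper does not state whether a retransformation bias correction was applied.

```python
import math


def mape(observed, predicted):
    """Mean absolute percentage error, in percent. Observed values must be non-zero."""
    n = len(observed)
    return 100.0 * sum(abs((y - y_hat) / y) for y, y_hat in zip(observed, predicted)) / n


def mape_from_log_predictions(observed, predicted_log):
    """MAPE for a log-linear model: exponentiate predictions before comparing.

    Assumption: a plain exp() back-transform without bias correction.
    """
    return mape(observed, [math.exp(v) for v in predicted_log])


# Hypothetical station ridership and predictions, for illustration only.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
predicted_log = [math.log(v) for v in predicted]

error_linear_scale = mape(observed, predicted)
error_from_logs = mape_from_log_predictions(observed, predicted_log)
print(round(error_linear_scale, 4), round(error_from_logs, 4))
```

Computing the error on the logarithmic scale instead would understate it, which is why the comparison across the three models is made on the original ridership scale.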

AIC is another measure to evaluate the goodness-of-fit of statistical models. A smaller AIC value indicates better goodness-of-fit. The AIC value is calculated by the following function:

$$\text{AIC} = 2k - 2\ln(L), \tag{6}$$

where $k$ is the number of variables in the model and $L$ is the maximized value of the likelihood function.
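As a small worked example of the AIC definition, the sketch below compares two hypothetical fitted models. The function name, log-likelihood values, and parameter counts are invented for illustration and do not come from the paper's tables.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L), with ln(L) the maximized log-likelihood."""
    return 2.0 * n_params - 2.0 * log_likelihood


# Hypothetical fits of the same data with two different catchment definitions.
aic_model_a = aic(log_likelihood=-1200.5, n_params=10)
aic_model_b = aic(log_likelihood=-1198.0, n_params=14)
better = "A" if aic_model_a < aic_model_b else "B"
print(aic_model_a, aic_model_b, better)
```

Model B fits slightly better in terms of likelihood, but the penalty for its four extra parameters outweighs that gain, so model A has the smaller AIC.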

Due to the limitation of space, we only illustrate the results of the regression models for the Thiessen polygon with 900 meters radius circle buffer of New York city here, as in Table 3. All the three regression models fit the data well.

It can be observed from Table 3 that population and employment are significant variables that are positively correlated with ridership. In addition, the number of bus lines and the number of urban rail transit lines, whether it is a transfer station or it is a terminal station, are all positively correlated with ridership in the three models. The distance between the station and the CBD is negatively correlated with ridership in the three models, which shows that as the distance increases, people’s intention to choose the urban rail transit decreases.

5.1. Comparison of Five Catchment Areas

To explore the most suitable buffer size, we compare the model results of using five different catchment areas.

The adjusted R2 of stepwise linear regression and the AIC of negative binomial regression for different buffer sizes are shown in Figure 7 and Figure 8, respectively. It can be seen that the goodness-of-fits of models using different buffer sizes are very similar. From Figure 7, we can see that for Chicago and Boston, the buffer of 900 meters performs the best. For New York and Philadelphia, the buffer of 600 meters performs the best. For the Bay Area, the buffer of 1500 meters is the most suitable. Therefore, the most suitable buffer size is different for different cities. We also notice that the difference in goodness-of-fit among the various buffer sizes is small. As a result, researchers could use the most convenient buffer size when estimating station-level ridership, which is consistent with the result of Guerra et al.’s study [15].

5.2. Comparison of Three Processing Methods of Overlapping Area

We take buffer sizes of 300 meters, 900 meters, and 1500 meters as examples to show the ratio of overlapping area to total buffer area (Table 4). It can be seen that, as the buffer size increases, the overlapping ratio increases substantially, especially for Chicago, New York, and Boston.

The goodness-of-fits of the linear regression model with the three buffer processing methods are shown in Figure 9. In Figure 9, the naïve method that does not deal with the overlapping area is represented as type 1, the Thiessen polygon method is represented as type 2, and the equal division method is represented as type 3.

From Figure 9, we can see from the adjusted R2 of Chicago, New York, and Boston that the equal division method has better goodness-of-fit than the Thiessen polygon method and the naïve method. From Table 4, we can see that the stations in these three cities are densely distributed and the overlap ratio is high. Due to the scattered distribution of stations in the San Francisco Bay Area, the overlap ratio is low. When the buffer radius is 900 meters, the results of the equal division method and the Thiessen polygon method do not outperform that of the naïve method. It implies that for cities with densely distributed urban rail transit stations, the equal division method and Thiessen polygon method can generate better results.

5.3. Comparison of Weighting Methods Based on Two Distance Decay Functions

Figures 10 and 11 show the adjusted R2 of weighting methods based on the linear distance decay function (0.5, 0.4, 0.3, 0.2, and 0.1) and the nonlinear distance decay function (0.8, 0.3, 0.1, 0, and 0) using linear regression and negative binomial regression, respectively. The line in each figure is the average adjusted R2 of the ordinary circular buffer without weighting, which we call the naïve method.

The trends in Figure 10 and Figure 11 are basically the same. In general, the results of the first weighting method are better than those of the second weighting method in most cases, although the difference is marginal. There is little difference between the results of the first weighting method and the ordinary circular buffer, which shows that the distance decay function could barely improve the model result.

5.4. Comparison of Three Models

The average values of adjusted R2 for different buffer sizes are shown in Figure 12. From the figure, we find that the model results of New York and the San Francisco Bay Area have a similar pattern: the negative binomial regression model has the best performance while the linear regression model is the worst. However, the differences in goodness-of-fit of the three models are not significant. The model results of Chicago and Boston have a similar pattern: linear regression has the best performance while negative binomial regression is the worst. Again, the differences in goodness-of-fit of the three models are not significant. For Philadelphia, however, the log-linear model has much worse performance than the other two models. This could be because, in Philadelphia, the ridership of some stations is extremely high (above 20,000) while that of other stations is usually low (below 3,000), which can be observed from Figure 3.

The average values of MAPE for different buffer sizes are shown in Figure 13. We find that the MAPE of log-linear regression is smaller than that of the other two regression models. Therefore, the log-linear regression model could substantially improve the prediction accuracy.

In addition, the detailed results of different models and methods of the five cities are shown in Tables 5–15. These tables present the same pattern and information.

6. Discussion

This article discusses a series of issues concerning the treatment of the catchment area when studying the impact of the built environment on urban rail transit.

First of all, we studied the effect of buffer size on the station level demand modeling results. The results show that, overall, the optimal buffer size varies across the five cities. The impact of buffer size on the goodness-of-fit of the model is trivial, which is consistent with the conclusions of previous studies [15, 32]. This contrasts with the study of Jun et al. [21] that recommends 600 meters as the radius of the pedestrian catchment area for a compact city like Seoul. We find that for a compact city, such as New York, the size of the buffer still does not have a significant impact on the model results.

Regarding the processing method for the overlapping buffer area, we find that, for cities with densely distributed urban rail transit stations, the equal division method and the Thiessen polygon method generate better results than the naïve method. To our knowledge, no previous study has applied the equal division method, but Thiessen polygons have been used by some researchers. For example, Li et al. [22] and Sun et al. [12] used Thiessen polygons to deal with the overlapping area of circular buffers. In the future, the equal division method could also be used to deal with the overlapping buffer area. For the distance decay weighting methods, the weighting method based on a linear distance decay function is better than that based on a nonlinear distance decay function, although the difference is marginal. The results obtained by applying weights to the buffer bands are similar to those obtained without weighting, which contradicts the conclusion of Gutiérrez et al. [13].
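The difference between the naïve and equal division treatments of overlapping buffers can be sketched in Python as follows. This is a simplified illustration rather than the GIS procedure used in the study: it assumes that a built environment variable is available on small cells represented by their center points, and it divides the value of each cell by the number of station buffers that cover it. All names, coordinates, and values are hypothetical.

```python
from math import hypot


def covering_stations(cell_xy, stations, radius):
    """Return the names of stations whose circular buffer contains the cell center."""
    cx, cy = cell_xy
    return [name for name, (sx, sy) in stations.items()
            if hypot(cx - sx, cy - sy) <= radius]


def naive_totals(stations, cells, radius):
    """Naive method: a cell in an overlap is counted in full for every covering station."""
    totals = {name: 0.0 for name in stations}
    for x, y, value in cells:
        for name in covering_stations((x, y), stations, radius):
            totals[name] += value
    return totals


def equal_division_totals(stations, cells, radius):
    """Equal division: a cell covered by k buffers contributes value / k to each station."""
    totals = {name: 0.0 for name in stations}
    for x, y, value in cells:
        covering = covering_stations((x, y), stations, radius)
        for name in covering:
            totals[name] += value / len(covering)
    return totals


# Two hypothetical stations 800 m apart with 600 m buffers (coordinates in meters).
stations = {"A": (0.0, 0.0), "B": (800.0, 0.0)}
# Each cell: (x, y, value), e.g. population assigned to the cell.
cells = [(0.0, 0.0, 100.0), (400.0, 0.0, 60.0), (800.0, 0.0, 40.0)]

print(naive_totals(stations, cells, radius=600.0))
print(equal_division_totals(stations, cells, radius=600.0))
```

Under the naïve method the middle cell is counted twice, so the totals across stations exceed the actual total of the cells; under equal division the station totals sum to the total of all covered cells.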

Regarding the regression models, when the comparison is based on R², the three models have similar performance. Only for Philadelphia does the log-linear model have a much lower R², which may be because ridership in Philadelphia has much higher variation. When the comparison is based on MAPE, the log-linear model has the best performance for all five cities. This indicates that the log-linear model has higher predictive power than the other two models, and that different measures can lead to different conclusions. This also contrasts with the results of Wang et al. [42].

7. Conclusions

This study evaluates the effects of different buffer sizes, treatments of the overlapping buffer area, and regression models on direct ridership modeling results. To compare the performance of all the models and methods, we conducted extensive experiments using data from five major cities in the U.S. First, we compared the model results for different buffer sizes (300 meters, 600 meters, 900 meters, 1200 meters, and 1500 meters). We find that different buffer sizes do not have a great impact on the models’ goodness-of-fit and prediction accuracy. Secondly, we compared the model results of three methods for dealing with overlapping catchment areas: the naïve method, the Thiessen polygon method, and the equal division method. The results show that, for cities with densely distributed stations and a high buffer overlapping ratio, the equal division method is better than the Thiessen polygon method, and both outperform the naïve method. However, for cities with more scattered stations and a low buffer overlapping ratio, the three methods have comparable performance. Thirdly, we performed weighted regression by applying weights to the variables in the circular buffer bands of 0–300 meters, 300–600 meters, 600–900 meters, 900–1200 meters, and 1200–1500 meters using two simplified weighting methods. The results show that the weighting method of 0.5, 0.4, 0.3, 0.2, and 0.1 produces better results than the method of 0.8, 0.3, 0.1, 0, and 0. However, the weighted regression does not improve the goodness-of-fit much. Finally, three regression models (linear regression, log-linear regression, and negative binomial regression) were constructed and their results were compared. Based on the adjusted R² and MAPE values of the models, we find that the goodness-of-fit of the three models is satisfactory for all cities except Philadelphia. The differences in values are within 20%, and the log-linear regression model yields the highest prediction accuracy.
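The two simplified weighting schemes summarized above can be illustrated with a short Python sketch. The weights are those reported in the text; the assumption that a weighted variable is formed as the sum of band values multiplied by their band weights, as well as the function name and the sample band values, are illustrative and not taken from the study.

```python
# Weights for the bands 0-300, 300-600, 600-900, 900-1200, and 1200-1500 meters.
LINEAR_WEIGHTS = (0.5, 0.4, 0.3, 0.2, 0.1)     # linear decay scheme from the text
NONLINEAR_WEIGHTS = (0.8, 0.3, 0.1, 0.0, 0.0)  # nonlinear decay scheme from the text


def weighted_band_value(band_values, weights):
    """Combine one variable measured in each buffer band into a single weighted value.

    Assumption: the weighted variable is the sum of (band value * band weight).
    """
    if len(band_values) != len(weights):
        raise ValueError("one weight is required per buffer band")
    return sum(w * v for w, v in zip(weights, band_values))


# Hypothetical population counts in the five bands around one station.
population_by_band = [1000, 2000, 3000, 4000, 5000]

print(weighted_band_value(population_by_band, LINEAR_WEIGHTS))
print(weighted_band_value(population_by_band, NONLINEAR_WEIGHTS))
```

The nonlinear scheme concentrates almost all of the weight within 600 meters and ignores everything beyond 900 meters, while the linear scheme still gives some weight to the outermost band.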

This study has some limitations. First, the buffer sizes range from 300 meters to 1500 meters, following previous studies [13, 15, 21, 32]. We used an interval of 300 meters, which we believe is small enough to study the optimal buffer size, especially considering the inaccuracy caused by extracting data from the CBG. However, a smaller interval, such as the 100 meters used by Gutiérrez et al. [13], could generate more detailed results and could be explored in the future. Secondly, only the linear relationship between ridership and built environment variables is considered. In the future, we will consider nonlinear relationships between the independent variables and the dependent variable and use machine learning tools to build nonlinear models.

Data Availability

The data used to support the findings of the study are from the New York City Transit Authority (MTA), the Chicago Transit Authority (CTA), the Massachusetts Bay Transportation Authority (MBTA), the Southeastern Pennsylvania Transportation Authority (SEPTA), the Port Authority Transit Corporation (PATCO), and the Bay Area Rapid Transit (BART).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (grant nos. 71704145 and 51774241), the China Postdoctoral Science Foundation, and the Sichuan Youth Science and Technology Innovation Research Team Project (grant nos. 2019JDTD0002 and 2020JDTD0027).