Finding top-firms in the data - Data and Top-firms set construction

3.5 Data and Top-firms set construction

3.5.1 Finding top-firms in the data

As discussed in Section 3.1, being able to filter out firms using the optimal allocation rule f is essential to estimate it. Were this not possible our estimates would not capture a unique underlying allocation rule used to assign workforce. This advocates for a careful selection of top-firms.

Based on the relationships in (3.4) and (3.5), a possible approach is to select firms with higher level of productivity. In fact, we know that using the allocation f maximizes it. However, this is a naive and likely poor solution as it is. It is well known that (e.g. Syverson,2011) productivity levels varies strongly across non homogeneous firms (e.g. different industries, different size etc.).

Moreover, there are different drivers for this productivity dispersion. If these are not taken into account and if they have much larger impact on productivity than workforce allocation, we are going to to capture these in our split and not the allocation rule f . Thus, a better approach is to first isolate the residual productivity that can be reasonably attributed to workforce allocation and only then select firms showing higher levels of productivity.

Controlling for other factors (the O variables in (3.4)) allows to split firms in (more or less) homogeneous bins and makes the selection of top-firms within bins more reliable. Nonetheless, there is a trade-off. On the one hand, adding more controls allow to better filter out residual productivity depending on allocation rule. On the other hand, adding too much of them will kill all the residual productivity dispersion, also that arising from the allocation rule. For example, in the extreme case of one firm-one bin, all the firms would be top-firms, and all the allocation rules would be optimal allocation rules.

For this reason, controlling for observables should be supported by economic theory. For example, it known that different industries may behave very differently in term of productivity one from another (Bernard and Jones, 1995or Syverson,2011), so it may reasonable to consider industries when splitting firms. Controlling for firm’s age, on the contrary, might be a more arguable choice: there seems to be contrasting evidence on the relationship between the age and productivity, at least for firms older than 10 years (see Kok, Brouwer, and Fris, 2005). Adding this control could result in adding just more noise to the selection of top-firms.

We now describe the procedure we use to split the data in top-firms and non top-firms. We also refer to the first set the learning set, because on this set we are going to estimate ˆf . The second set is also referred to as control set, and is the set of firms where we are going to evaluate f to compute the JAQ measures (see Remarkˆ 3.2.3). These will be used to assess the extent by which these firms deviate from the optimal rule f .

The learning set is constructed according to Algorithm5, which is briefly described as follows.

According to some criteria, we split the full data in bins and from these we extract a part of top-firms. Further refinement criteria are applied and the learning and control set are defined.

The criteria used to split the data in different bins are defined by control variables.

We now give an example of how the data are split according to Algorithm 5. Consider the

Algorithm 5: Create top-firms set

Input: X full firms data; B1, . . . , BM binning criteria; t thresholding criterion; s splitting variable; R further refinement criterion.

Output: learning set, control set.

for m P 1, . . . , M do

XmÐ tx P X : x satisfies Bmu

top_mÐ tx P Xm : spxq ě tpXmqu. I.e. define top the elements of the subset for which the variable s is above a threshold, computed on the subset itself.

top Ð tptop_mqmu. I.e. join the top elements.

learning set Ð tx P top : x satisfies Ru control set Ð Xzlearning set

following splitting criteria:

1. Size intervals: r50, 100s; r101, 250s; r251, 1000s; r1001, 6000s.

2. Year: 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010.

3. Industry: 58 categories of industries (see Table3.6).

A bin B_m will be defined as one of the possible combinations of the above criteria (e.g. observa-tions in year 2005, firms with size in r50, 100s and within industry 6). We split on value added per employee, and this defines variable s. For each bin, we compute the top productivity decile;

this defines the thresholding criterion t. Then, we collect for each bin, firms with s greater or equal than the (within bin) top productivity decile. Finally, we refine the set by keeping as top-firms only those that for at least 6 years were classified as top-firms in this way. This defines the refinement criterion R. Note that top-firms set is made by firm, and not by firm in particular years. Thus, if a firm is in this set, it is in this set for the whole data time window, even if for some of the year considered its productivity level was not above the threshold. The idea is that allocation rule is not changing with the years, while productivity may undergo some shocks.

Table3.5:SummariesonEmployeesCharacteristics.TotalObservations:28710464 (a)ObservationsonEmployeesperYear year2001200220032004200520062007200820092010 obs.2816199285744928182142834160282842929005562964534295365728366052900661 (b)Age min.16 25%33 50%43 75%53 max.99 mean.42.6 std.dev.12.47 (c)LaborMarketEx- perience min.0 25%8 50%19 75%29 max.83 mean.19.42 std.dev.13.07 (d)Tenure min.0 25%1 50%4 75%10 max.20 mean.5.80 std.dev.5.55 (e)Daysinunemploy- ment min.0 25%0 50%0 75%228 max.5181 mean.192.07 std.dev.365.28

(f)Educationlevel(Obs.) Basic3466087 Highschool13679968 Vocational3811652 University7668471 N/A84286 (g)Female/Malecompo- sition Female15592450 Male13118014

(h)Immigrantcomposition Immigrant4054687 Notimmigrant24655777 (i)Livesinbirth country Yes17321074 No11389390

(j)Graduatedduring recession Yes5311539 No23398925 (k)NarrowEducationlevel N.differentCategories347 Obs.SmallestCategory4 MedianObs.inCategories22670 Obs.BiggestCategory3663270 Mean.Obs.inCategories82739.09 Std.dev.Obs.inCategories245780.28

(l)Municipality(Obs.) N.differentCategories291 Obs.SmallestCategory5868 MedianObs.inCategories43725 Obs.BiggestCategory2569869 Mean.Obs.inCategories98661.39 Std.dev.Obs.inCategories198399.91

(m)N.industries workedin(since 1990) 029312 120627635 26807573 31086231 4142798 515612 61228 769 86 (n)N.firmsworkedat(since1990) 179381721113133 27507355124735 35655882131703 4364822414584 5205673915178 610482911663 74956791722 8215264185 989409191 1035025

Table3.6:SummariesonFirmsCharacteristics.TotalObservations:71432

(a)ObservationsonFirmsperYear

year2001200220032004200520062007200820092010obs.6884693667886891692871527476760072497528

(b)Productivity:valueaddedperemployee(MSEK2017)

min.´17.2825%0.4650%0.6075%0.81max.245.10mean.0.76std.dev.1.70 (c)FirmSize(inEmploy-ees)

min.5025%6550%9975%210max.52151mean.401.93std.dev.1592.15 (d)Sales(MSEK2017)

min.025%77.8450%166.4275%397.12max.120641.4mean.664.92std.dev.2954.57 (e)Totalassets(MSEK2017)

min.025%12.1950%61.7475%203.03max.362723.5mean.789.71std.dev.6959.61 (f)Firmagefrom1990

min.025%950%1375%16max.20mean.12.01std.dev.5.29

(g)Firmindustry

N.differentCategories58Obs.SmallestCategory7MedianObs.inCategories688Obs.BiggestCategory7322Mean.Obs.inCategories1231.59Std.dev.Obs.inCategories1511.39 (h)FamilyFirm

Yes10870No60562 (i)Stateormu-nicipalityowned

Yes9866No61566 (j)ListedFirm

Yes870No70562

(k)Broadindustry(Obs.)

Notharmonizedorother466Agriculture,hunting,forestry,andfishing429Mining141Manufacturing18846Utilities889Construction3753Wholesaleandretail11109Hotelsandrestaurants2070Transport,storage,andcommunications4761Financialintermediation1886Realestate,renting,andbusiness12486Publicadministrationanddefence1496Education3285Healthandsocialwork5577Otherserviceactivities4238

Table 3.7: Occupation Variable (SSYK). Right columns of panels (b), (c), (d) report values in number of employees observed in the data.

(a) Occupation Categories. Major groups, number of subdivisions and skill level.

Code Major groups N. Sub-major

groups

N. minor groups

N. Unit groups

Skill level (ISCO)

1 Legislators, senior officials and managers 3 6 29 N/A

2 Professionals 4 21 67 4th

3 Technicians and associate professionals 4 19 72 3rd

4 Clerks 2 8 17 2nd

5 Service workers and shop sales workers 2 7 27 2nd

6 Skilled agricultural and fishery workers 1 5 11 2nd

7 Craft and related trades workers 4 16 58 2nd

8 Plant and machine operators and assemblers 3 20 59 2nd

9 Elementary occupations 3 10 14 1st

0 Armed forces 1 1 1 N/A

Totals 9 27 113 355

(b) Composition Major groups (Obs.).

Not available 1317155

Managers 1279702

Professionals 5760078

Technicians and associate professionals 5580636

Clerks 2474938

Service, shop, and market sales workers 6105558

Skilled agricultural and fishery worker 110989

Craft and trades related workers 1787794

Plant and machine operators and assemblers 2597792

Elementary occupations 1695822

Total 28710464

N. different Categories 113

Obs. Smallest Category 114

Median Obs. in Categories 115047

Obs. Biggest Category 4549131

Mean. Obs. in Categories 243451.19

Std. dev. Obs. in Categories 469362.06

N/A 1200480

Total 28710464

(d) Unit groups composition.

N. different Categories 355

Obs. Smallest Category 7

Median Obs. in Categories 19863

Obs. Biggest Category 1526392

Mean. Obs. in Categories 67435.17

Std. dev. Obs. in Categories 152118.08

N/A 4770978

Total 28710464

Nel documento Machine Learning methods and applications in Economics (pagine 143-148)