Commit 89e60e48 authored by wanglch

Initial commit

V-February Flow
Data Components:
- Code: The-Stack-V2
- CodeText: SE, whatever we’ve scraped
- WebText: HQ DCLM
DATA MIXES
- ~85% Source Code
- ~10% CodeText
- ~5% WebText
- ~85% The-Stack-V2
- ~15% CodeText
- ~0% WebText
- ~100% Source Code
Deepseek Coder
StarCoder 2
Arctic
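A minimal sketch of how the three mixes above could be turned into per-source token budgets. The mix names (`mix_a`, `mix_b`, `mix_c`), the `token_budget` helper, and the total token count are hypothetical; the notes give only approximate percentages and do not name the mixes or fix a budget.

```python
from typing import Dict

# Approximate percentages copied from the notes ("~" values, so treat as rough).
# The mix names are placeholders, not names used in the notes.
MIXES: Dict[str, Dict[str, int]] = {
    "mix_a": {"source_code": 85, "code_text": 10, "web_text": 5},
    "mix_b": {"source_code": 85, "code_text": 15, "web_text": 0},
    "mix_c": {"source_code": 100, "code_text": 0, "web_text": 0},
}


def token_budget(mix: Dict[str, int], total_tokens: int) -> Dict[str, int]:
    """Split a total token budget across sources in proportion to the mix weights."""
    total_weight = sum(mix.values())
    if total_weight <= 0:
        raise ValueError("mix weights must sum to a positive number")
    # Integer arithmetic avoids floating-point drift; remainders are dropped.
    return {name: total_tokens * weight // total_weight for name, weight in mix.items()}


if __name__ == "__main__":
    for name, mix in MIXES.items():
        print(name, token_budget(mix, 1_000_000))
```

Normalising by the sum of the weights means a mix that does not add up to exactly 100 still yields sensible proportions.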
V-February Flow
Data Components:
Code:
- The-Stack-V2
CodeText:
- SE, whatever we’ve scraped
WebText:
- HQ DCLM
DATA MIXES
~85% Source Code
~10% CodeText
~5% WebText
~85% The-Stack-V2
~15% CodeText
~0% WebText
~100% Source Code
February Flow
Data Components:
Code:
The-Stack-V2
CodeText:
SE, whatever we’ve scraped
WebText:
HQ DCLM
DATA MIXES
~85% Source Code
~10% CodeText
~5% WebText
~85% The-Stack-V2
~15% CodeText
~0% WebText
~100% Source Code
P1: 100% Source code
P2: 80% code
20% language
Code Data Recipe [StarCoder]
1) Order by Repo ✓
2) Call Heuristic Filters ✗
3) Group by Repo, lang → minhash ✓
4) Pack into Repo-level docs Ø
5) Select PL's Ø
6) Pack into FIM tokens ✗
✓: Eng. Done
✗: Eng. Definitely NOT done
Ø: So so easy
Use Preprocessed code/text, webtext
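Step 3 of the recipe above (group by repo and language, then MinHash) could look roughly like the sketch below. This is one reading of the note, not the pipeline's actual code: the document fields (`repo`, `lang`, `text`), the signature length, the shingle size, and the similarity threshold are all assumptions, and the pairwise comparison inside each group stands in for the LSH banding a large-scale job would need.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

NUM_HASHES = 64      # assumed signature length
SHINGLE_SIZE = 5     # assumed number of whitespace tokens per shingle
THRESHOLD = 0.7      # assumed similarity above which a file counts as a near-duplicate


def shingles(text: str, k: int = SHINGLE_SIZE) -> Set[str]:
    """Return the set of k-token shingles of a document."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed: int, item: str) -> int:
    """One member of a family of 64-bit hash functions, selected by seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(text: str) -> Tuple[int, ...]:
    """MinHash signature: the minimum hash over all shingles, per hash function."""
    items = shingles(text)
    return tuple(min(_hash(seed, s) for s in items) for seed in range(NUM_HASHES))


def similarity(a: Tuple[int, ...], b: Tuple[int, ...]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(docs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first file of each near-duplicate cluster within a (repo, lang) group."""
    groups: Dict[Tuple[str, str], List[Dict[str, str]]] = defaultdict(list)
    for doc in docs:
        groups[(doc["repo"], doc["lang"])].append(doc)
    kept: List[Dict[str, str]] = []
    for group in groups.values():
        kept_sigs: List[Tuple[int, ...]] = []
        for doc in group:
            sig = signature(doc["text"])
            if all(similarity(sig, other) < THRESHOLD for other in kept_sigs):
                kept_sigs.append(sig)
                kept.append(doc)
    return kept
```

Because comparison happens only inside a (repo, lang) group, identical files in different repositories both survive under this reading; cross-repo deduplication would need a separate pass.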
P1: 100% Source code
P2: 80% code
20% language
Code Data Recipe [StarCoder]
1) Order by Repo ✓
2) Call Heuristic filters ✗
3) Group by Repo, lang → minhash ✓
4) Pack into Repo-level docs □
5) Select PL's □
6) Pack into FIM tokens ✗ *
✓: Eng. Done
✗: Eng. definitely NOT done
□: so so easy
Use Preprocessed code, webtext
P1: 100% Source code [Granite]
P2: 80% code 20% language
Code Data Recipe [StarCoder]
1) Order by Repo ✓
2) Call Heuristic Filters ❌
3) Group by Repo, lang → minhash ✓
4) Pack into Repo-level docs □
5) Select PL's □
6) Pack into FIM tokens ✗
✓: Eng Done
✗: Eng definitely NOT done
□: so so easy
Use Preprocessed code/text, webtext
P1: 100% Source code
P2: 80% code
20% language
Code Data Recipe [StarCoder]
1) Order by Repo ✓
2) Call Heuristic Filters ×
3) Group by Repo, lang → minhash ✓
4) Pack into Repo-level docs □
5) Select PL's □
6) Pack into FIM tokens × *
✓: Eng Done
×: Eng definitely NOT done
□: so so easy
Use Preprocessed code, text, webtext
P1: 100% Source code
P2: 80% Code
20% Language
Code Data Recipe [StarCoder]
1) Order by Repo ✓
2) Call Heuristic Filters ×
3) Group by Repo, lang → minhash ✓
4) Pack into Repo-level docs □
5) Select PL's □
6) Pack into FIM tokens × *
✓: Eng Done
×: Eng definitely NOT done
□: So so easy
*not critical
Use Preprocessed code/text, webtext
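For step 6 (pack into FIM tokens), a hedged sketch of a fill-in-the-middle transform in prefix-suffix-middle order. The sentinel strings follow the StarCoder tokenizer's convention; the `to_fim` helper, the FIM rate, and the character-level split are assumptions made for illustration, since the notes do not specify any of them.

```python
import random

# Sentinel strings as used by the StarCoder tokenizer; other models use different ones.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def to_fim(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, rewrite doc as prefix + suffix + middle (PSM order)."""
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc  # leave the document in ordinary left-to-right form
    # Two distinct cut points split the document into prefix / middle / suffix.
    lo, hi = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:lo], doc[lo:hi], doc[hi:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b\n"
    print(to_fim(sample, random.Random(0), fim_rate=1.0))
```

Splitting on characters keeps the sketch short; a production pipeline would more likely split on token boundaries after tokenization.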
ARCH + TRAINING
- Pick Arch like OLMo-1B
- OR replicate a 3D model
- Follow standard LR flow
Eval:
Hacky nonsense for now
ARCH + TRAINING
- Pick arch like OLMo-1B
- OR replicate a 3D model
- Follow standard LR flow
Eval:
Hacky nonsense for now
ARCH + TRAINING
- Pick Arch like OLMo-1B
- OR replicate a 3D model
- Follow standard LR flow
Eval:
Hacky nonsense for now
ARCH + TRAINING
- Pick Arch like OLMo-1B
- OR
- Replicate a 3D model
- Follow standard LR flow
Eval:
Hacky nonsense for now
ARCH + TRAINING
- Pick Arch like OLMo-1B
OR
- Replicate a 3D model
- Follow standard LR flow
Eval:
Hacky nonsense for now
stakeholders has occurred in other nations, with groups and individuals refusing to risk being appropriated into the industry’s public relations ambitions. It now looks like that with vigilance, tobacco control advocates can easily foment similar distaste in many areas of the business community. Our actions sought to denormalise the tobacco industry by disrupting its efforts to take its place alongside other industries—often with considerable social credit—in the hope that it might gain by association.
Tobacco industry posturing about its corporate responsibility can never hide the ugly consequences of its ongoing efforts to “work with all relevant stakeholders for the preservation of opportunities for informed adults to consume tobacco products” (translation: “we will build alliances with others who want to profit from tobacco use, to do all we can to counteract effective tobacco control”). BAT has 15.4% and Philip Morris 16.4% of the global cigarette market. With 4.9 million smokers currently dying from tobacco use each year, and the industry unblinkingly concurring that its products are addictive, this leaves BAT to argue why it should not be held to be largely accountable for the annual deaths of some 754 600 smokers, and Philip Morris some 803 600 smokers.
REFERENCES
1 British American Tobacco. Social Report. http://www.bat.com/204pp
2 Wroe D. Tobacco ad campaign angers MPs. The Age (Melbourne) 2004; May 17 http://www.theage.com.au/articles/2004/05/16/108464669771.html?toneclick=true
3 Hirschhorn N. Corporate social responsibility and the tobacco industry: hope or hype? Tobacco Control 2004;13:447–53
4 Ethical Corporation Asia 2004. Conference website. http://www.ethicalcorp.org/asia2004/
5 Chapman S, Shatenstein S. Extreme corporate makeover: tobacco companies, corporate responsibility and the corruption of “ethics”. Globalink petition. http://petition.globalink.org/view.php?cmd=extreme
6 Mackay J, Eriksen M. The tobacco atlas. Geneva: World Health Organization, 2002.
INDUSTRY WATCH
Corporate social responsibility and the tobacco industry: hope or hype?
N Hirschhorn
Corporate social responsibility (CSR) emerged from a realisation among transnational corporations of the need to account for and redress their adverse impact on society: specifically, on human rights, labour practices, and the environment. Two transnational tobacco companies have recently adopted CSR: Philip Morris, and British American Tobacco. This report explains the origins and theory behind CSR; examines internal company documents from Philip Morris showing the company’s deliberations on the matter, and the company’s perspective on its own behaviour; and reflects on whether marketing tobacco is antithetical to social responsibility.
Over the past three decades increasing pressure from non-governmental organisations (NGOs), governments and the United Nations, has required transnational corporations (TNCs) to examine and redress the adverse impact their businesses have on society and the environment. Many have responded by taking up what is known as “corporate social responsibility” (CSR); only recently have two major cigarette companies followed suit: Philip Morris (PM) and British American Tobacco (BAT). This report first provides the context and development of CSR; then, from internal company documents, examines how PM came to its own version. This paper examines whether a tobacco company espousing CSR should be judged simply as a corporate entity along standards of business ethics, or as an irrevocably negative force in the realm of public health, thereby rendering CSR an oxymoron.
CORPORATE SOCIAL RESPONSIBILITY: THE CONTEXT
The term “corporate social responsibility” is in vogue at the moment but as a concept it is vague and means different things to different people.
Some writers on CSR trace its American roots to the 19th century when large industries engaged in philanthropy and established great public institutions, a form of “noblesse oblige”. But the notion that corporations should be required to return more to society because of their impact on society was driven by pressures from the civil rights, peace, and environmental movements of the last half century. The unprecedented expansion of power and influence of TNCs over the past three decades has accelerated global trade and development, but also environmental damage and abuses of
Abbreviations: ASH, Action on Smoking and Health; BAT, British American Tobacco; CERES, Coalition for Environmentally Responsible Economies; CSR, corporate social responsibility; DJSI, Dow Jones Sustainability Index; GCAC, Global Corporate Affairs Council; GRI, Global Reporting Initiative; MSA, Master Settlement Agreement; NGOs, non-governmental organisations; PM, Philip Morris; TNCs, transnational corporations; UNEP, United Nations Environment Program
www.tobaccocontrol.com
stakeholders has occurred in other nations, with groups and individuals refusing to risk being appropriated into the industry’s public relations ambitions. It now looks like that with vigilance, tobacco control advocates can easily foment similar distaste in many areas of the business community. Our actions sought to denormalise the tobacco industry by disrupting its efforts to take its place alongside other industries—often with considerable social credit—in the hope that it might gain by association.
Tobacco industry posturing about its corporate responsibility can never hide the ugly consequences of its ongoing efforts to “work with all relevant stakeholders for the preservation of opportunities for informed adults to consume tobacco products” (translation: “we will build alliances with others who want to profit from tobacco use, to do all we can to counteract effective tobacco control”). BAT has 15.4% and Philip Morris 16.4% of the global cigarette market. With 4.9 million smokers currently dying from tobacco use each year, and the industry unblinkingly concurring that its products are addictive, this leaves BAT to argue why it should not be held to be largely accountable for the annual deaths of some 754 600 smokers, and Philip Morris some 803 600 smokers.
REFERENCES
1. British American Tobacco. Social Report. http://www.bat.com/204.aspx
2. Wroe D. Tobacco ad campaign angers MPs. The Age (Melbourne) 2004; May 17 http://www.theage.com.au/articles/2004/05/16/108464669771.html?oneclick=true
3. Hirschhorn N. Corporate social responsibility and the tobacco industry: hope or hype? Tobacco Control 2004;13:447–53.
4. Ethical Corporation Asia 2004. Conference website. http://www.ethicalcorp.com/asia2004/
5. Chapman S, Shatenstein S. Extreme corporate makeover: tobacco companies, corporate responsibility and the corruption of “ethics”. Globalink petition. http://petition.globalink.org/view.php?code=extreme.
6. Mackay J, Eriksen M. The tobacco atlas. Geneva: World Health Organization, 2002.
INDUSTRY WATCH
Corporate social responsibility and the tobacco industry: hope or hype?
N Hirschhorn
Corporate social responsibility (CSR) emerged from a realisation among transnational corporations of the need to account for and redress their adverse impact on society: specifically, on human rights, labour practices, and the environment. Two transnational tobacco companies have recently adopted CSR: Philip Morris, and British American Tobacco. This report explains the origins and theory behind CSR; examines internal company documents from Philip Morris showing the company’s deliberations on the matter, and the company’s perspective on its own behaviour; and reflects on whether marketing tobacco is antithetical to social responsibility.
Over the past three decades increasing pressure from non-governmental organisations (NGOs), governments, and the United Nations, has required transnational corporations (TNCs) to examine and redress the adverse impact their businesses have on society and the environment. Many have responded by taking up what is known as “corporate social responsibility” (CSR); only recently have two major cigarette companies followed suit: Philip Morris (PM) and British American Tobacco (BAT). This report first provides the context and development of CSR; then, from internal company documents, examines how PM came to its own version. This paper examines whether a tobacco company espousing CSR should be judged simply as a corporate entity along standards of business ethics, or as an irrevocably negative force in the realm of public health, thereby rendering CSR an oxymoron.
stakeholders has occurred in other nations, with groups and individuals refusing to risk being appropriated into the industry’s public relations ambitions. It now looks like that with vigilance, tobacco control advocates can easily foment similar distaste in many areas of the business community. Our actions sought to denormalise the tobacco industry by disrupting its efforts to take its place alongside other industries—often with considerable social credit—in the hope that it might gain by association.
Tobacco industry posturing about its corporate responsibility can never hide the ugly consequences of its ongoing efforts to “work with all relevant stakeholders for the preservation of opportunities for informed adults to consume tobacco products” (translation: “we will build alliances with others who want to profit from tobacco use, to do all we can to counteract effective tobacco control”). BAT has 15.4% and Philip Morris 16.4% of the global cigarette market. With 4.9 million smokers currently dying from tobacco use each year, and the industry unblinkingly concurring that its products are addictive, this leaves BAT to argue why it should not be held to be largely accountable for the annual deaths of some 754 600 smokers, and Philip Morris some 803 600 smokers.
REFERENCES
1 British American Tobacco. Social Report. http://www.bat.com/204pp
2 Wroe D. Tobacco ad campaign angers MPs. The Age (Melbourne) 2004, May 17 http://www.theage.com.au/articles/2004/05/16/10846466971.html?Oneclick=true
3 Hirschhorn N. Corporate social responsibility and the tobacco industry: hope or hype? Tobacco Control 2004;13:447–53.
4 Ethical Corporation Asia 2004. Conference website. http://www.ethicalcorp.com/asia2004/
5 Chapman S, Shatenstein S. Extreme corporate makeover: tobacco companies, corporate responsibility and the corruption of “ethics”. Globalink petition. http://petition.globalink.org/view.php?code=extreme.
6 Mackay J, Eriksen M. The tobacco atlas. Geneva: World Health Organization, 2002.
INDUSTRY WATCH
Corporate social responsibility and the tobacco industry: hope or hype?
N Hirschhorn
Corporate social responsibility (CSR) emerged from a realisation among transnational corporations of the need to account for and redress their adverse impact on society: specifically, on human rights, labour practices, and the environment. Two transnational tobacco companies have recently adopted CSR: Philip Morris, and British American Tobacco. This report explains the origins and theory behind CSR; examines internal company documents from Philip Morris showing the company’s deliberations on the matter, and the company’s perspective on its own behaviour; and reflects on whether marketing tobacco is antithetical to social responsibility.
Over the past three decades increasing pressure from non-governmental organisations (NGOs), governments and the United Nations, has required transnational corporations (TNCs) to examine and redress the adverse impact their businesses have on society and the environment. Many have responded by taking up what is known as “corporate social responsibility” (CSR); only recently have two major cigarette companies followed suit: Philip Morris (PM) and British American Tobacco (BAT). This report first provides the context and development of CSR; then, from internal company documents, examines how PM came to its own version. This paper examines whether a tobacco company espousing CSR should be judged simply as a corporate entity along standards of business ethics, or as an irrefutably negative force in the realm of public health, thereby rendering CSR an oxymoron.
CORPORATE SOCIAL RESPONSIBILITY: THE CONTEXT
The term “corporate social responsibility” is in vogue at the moment but as a concept it is vague and means different things to different people.
Some writers on CSR trace its American roots to the 19th century when large industries engaged in philanthropy and established great public institutions, a form of “noblesse oblige”. But the notion that corporations should be required to return more to society because of their impact on society was driven by pressures from the civil rights, peace, and environmental movements of the last half century. The unprecedented expansion of power and influence of TNCs over the past three decades has accelerated global trade and development, but also environmental damage and abuses of
Abbreviations: ASH, Action on Smoking and Health; BAT, British American Tobacco; CERES, Coalition for Environmentally Responsible Economies; CSR, corporate social responsibility; DJSI, Dow Jones Sustainability Index; GCAC, Global Corporate Affairs Council; GRI, Global Reporting Initiative; MSA, Master Settlement Agreement; NGOs, non-governmental organisations; PM, Philip Morris; TNCs, transnational corporations; UNEP, United Nations Environment Program
stakeholders has occurred in other nations, with groups and individuals refusing to risk being appropriated into the industry’s public relations ambitions. It now looks like that with vigilance, tobacco control advocates can easily foment similar distaste in many areas of the business community. Our actions sought to denormalise the tobacco industry by disrupting its efforts to take its place alongside other industries—often with considerable social credit—in the hope that it might gain by association.
Tobacco industry posturing about its corporate responsibility can never hide the ugly consequences of its ongoing efforts to “work with all relevant stakeholders for the preservation of opportunities for informed adults to consume tobacco products” (translation: “we will build alliances with others who want to profit from tobacco use, to do all we can to counteract effective tobacco control”). BAT has 15.4% and Philip Morris 16.4% of the global cigarette market. With 4.9 million smokers currently dying from tobacco use each year, and the industry unblinkingly concurring that its products are addictive, this leaves BAT to argue why it should not be held to be largely accountable for the annual deaths of some 754 600 smokers, and Philip Morris some 803 600 smokers.
REFERENCES
1 British American Tobacco. Social Report. http://www.bat.com/204pp
2 Wroe D. Tobacco ad campaign angers MPs. The Age (Melbourne) 2004, May 17 http://www.theage.com.au/articles/2004/05/16/108464669771.htmlToneclick=true.
3 Hirschhorn N. Corporate social responsibility and the tobacco industry: hope or hype? Tobacco Control 2004;13:447–53.
4 Ethical Corporation Asia 2004. Conference website. http://www.ethicalcorp.com/asia2004/.
5 Chapman S, Shatenstein S. Extreme corporate makeover: tobacco companies, corporate responsibility and the corruption of “ethics”. Globalink petition. http://petition.globalink.org/view.php?code=extreme.
6 Mackay J, Eriksen M. The tobacco atlas. Geneva: World Health Organization, 2002.
Corporate social responsibility and the tobacco industry: hope or hype?
N Hirschhorn
Corporate social responsibility (CSR) emerged from a realisation among transnational corporations of the need to account for and redress their adverse impact on society: specifically, on human rights, labour practices, and the environment. Two transnational tobacco companies have recently adopted CSR: Philip Morris, and British American Tobacco. This report explains the origins and theory behind CSR; examines internal company documents from Philip Morris showing the company’s deliberations on the matter, and the company’s perspective on its own behaviour; and reflects on whether marketing tobacco is antithetical to social responsibility.
Over the past three decades increasing pressure from non-governmental organisations (NGOs), governments and the United Nations, has required transnational corporations (TNCs) to examine and redress the adverse impact their businesses have on society and the environment. Many have responded by taking up what is known as “corporate social responsibility” (CSR); only recently have two major cigarette companies followed suit: Philip Morris (PM) and British American Tobacco (BAT). This report first provides the context and development of CSR; then, from internal company documents, examines how PM came to its own version. This paper examines whether a tobacco company espousing CSR should be judged simply as a corporate entity along standards of business ethics, or as an irremediably negative force in the realm of public health, thereby rendering CSR an oxymoron.
CORPORATE SOCIAL RESPONSIBILITY: THE CONTEXT
The term “corporate social responsibility” is in vogue at the moment but as a concept it is vague and means different things to different people. Some writers on CSR trace its American roots to the 19th century when large industries engaged in philanthropy and established great public institutions, a form of “noblesse oblige”. But the notion that corporations should be required to return more to society because of their impact on society was driven by pressures from the civil rights, peace, and environmental movements of the last half century. The unprecedented expansion of power and influence of TNCs over the past three decades has accelerated global trade and development, but also environmental damage and abuses of power.
Abbreviations: ASH, Action on Smoking and Health; BAT, British American Tobacco; CERES, Coalition for Environmentally Responsible Economies; CSR, corporate social responsibility; DJSI, Dow Jones Sustainability Index; GCAC, Global Corporate Affairs Council; GRI, Global Reporting Initiative; MSA, Master Settlement Agreement; NGOs, non-governmental organisations; PM, Philip Morris; TNCs, transnational corporations; UNEP, United Nations Environment Program
stakeholders has occurred in other nations, with groups and individuals refusing to risk being appropriated into the industry’s public relations ambitions. It now looks like that with vigilance, tobacco control advocates can easily foment similar distaste in many areas of the business community. Our actions sought to denormalise the tobacco industry by disrupting its efforts to take its place alongside other industries—often with considerable social credit—in the hope that it might gain by association.
Tobacco industry posturing about its corporate responsibility can never hide the ugly consequences of its ongoing efforts to “work with all relevant stakeholders for the preservation of opportunities for informed adults to consume tobacco products”1 (translation: “we will build alliances with others who want to profit from tobacco use, to do all we can to counteract effective tobacco control”). BAT has 15.4% and Philip Morris 16.4% of the global cigarette market.2 With 4.9 million smokers currently dying from tobacco use each year, and the industry unblinkingly concurring that its products are addictive, this leaves BAT to argue why it should not be held to be largely accountable for the annual deaths of some 754 600 smokers, and Philip Morris some 803 600 smokers.
REFERENCES
1 British American Tobacco. Social Report. http://www.bat.com/204pp
2 Wroe D. Tobacco ad campaign angers MPs. The Age (Melbourne) 2004; May 17 http://www.theage.com.au/articles/2004/05/16/108464669771.htmlToneclick=true
3 Hirschhorn N. Corporate social responsibility and the tobacco industry: hope or hype? Tobacco Control 2004;13:447–53.
4 Ethical Corporation Asia 2004. Conference website. http://www.ethicalcorp.com/asia2004/
5 Chapman S, Shatenstein S. Extreme corporate makeover: tobacco companies, corporate responsibility and the corruption of “ethics”. Globalink petition. http://petition.globalink.org/view.php?code=extreme
6 Mackay J, Eriksen M. The tobacco atlas. Geneva: World Health Organization, 2002.
INDUSTRY WATCH
Corporate social responsibility and the tobacco industry: hope or hype?
N Hirschhorn
Corporate social responsibility (CSR) emerged from a realisation among transnational corporations of the need to account for and redress their adverse impact on society: specifically, on human rights, labour practices, and the environment. Two transnational tobacco companies have recently adopted CSR: Philip Morris, and British American Tobacco. This report explains the origins and theory behind CSR; examines internal company documents from Philip Morris showing the company’s deliberations on the matter, and the company’s perspective on its own behaviour; and reflects on whether marketing tobacco is antithetical to social responsibility.
Over the past three decades increasing pressure from non-governmental organisations (NGOs), governments and the United Nations, has required transnational corporations (TNCs) to examine and redress the adverse impact their businesses have on society and the environment. Many have responded by taking up what is known as “corporate social responsibility” (CSR); only recently have two major cigarette companies followed suit: Philip Morris (PM) and British American Tobacco (BAT). This report first provides the context and development of CSR; then, from internal company documents, examines how PM came to its own version. This paper examines whether a tobacco company espousing CSR should be judged simply as a corporate entity along standards of business ethics, or as an irretrievably negative force in the realm of public health, thereby rendering CSR an oxymoron.
CORPORATE SOCIAL RESPONSIBILITY: THE CONTEXT
The term “corporate social responsibility” is in vogue at the moment but as a concept it is vague and means different things to different people.3 Some writers on CSR trace its American roots to the 19th century when large industries engaged in philanthropy and established great public institutions, a form of “noblesse oblige”. But the notion that corporations should be required to return more to society because of their impact on society was driven by pressures from the civil rights, peace, and environmental movements of the last half century.4 The unprecedented expansion of power and influence of TNCs over the past three decades has accelerated global trade and development, but also environmental damage and abuses of
Abbreviations: ASH, Action on Smoking and Health; BAT, British American Tobacco; CERES, Coalition for Environmentally Responsible Economies; CSR, corporate social responsibility; DJSI, Dow Jones Sustainability Index; GCAC, Global Corporate Affairs Council; GRI, Global Reporting Initiative; MSA, Master Settlement Agreement; NGOs, non-governmental organisations; PM, Philip Morris; TNCs, transnational corporations; UNEP, United Nations Environment Program
Received 13 November 2003
Accepted 15 July 2004
Correspondence to:
Dr Norbert Hirschhorn,
Nastolantie 6, A3 00600
Helsinki, Finland;
bertzpoef@yahoo.com
2.1.1 Pretraining data: OLMo 2 Mix 1124
The mix used for this stage is shown in Table 1. It consists of approximately 3.9 trillion tokens, with over 95% derived from web data. We refer to this set as OLMo 2 Mix 1124. This is the same pretraining data used in OLMoE (Muennighoff et al., 2024).
We combine data from DCLM (Li et al., 2024) and Dolma 1.7 (Soldaini et al., 2024). From DCLM, we use the “baseline 1.0” mix. From Dolma, we use the arXiv (Together AI, 2023), OpenWebMath (Paster et al., 2023), Algebraic Stack, peS2o (Soldaini and Lo, 2023), and Wikipedia subsets. arXiv, OpenWebMath, and Algebraic Stack were originally part of ProofPile II (Azerbayev et al., 2023).
Finally, we include code from StarCoder (Li et al., 2023b), which is derived from permissively-licensed repositories from GitHub (Kocetkov et al., 2022). In an attempt to include higher quality code, we remove any document from a repository with fewer than 2 stars on GitHub. Further, through manual inspection of this source, we found it to contain documents encoded in binary format or containing mostly numerical content; to remove them, we discarded documents whose most frequent word constitutes over 30% of the document, or whose top-2 most frequent words constitute over 50% of the document. To mitigate possible training loss spikes, we remove documents with repeated sequences of 32 or more n-grams. We report details and show effectiveness of this intervention in Section §3.1.
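The two document-level heuristics described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the 30%, 50%, and 32-repeat thresholds come from the text, while whitespace tokenization, the function names, the treatment of empty documents, and the n-gram span limit (`max_n=13`) are assumptions.

```python
from collections import Counter
from typing import List


def passes_frequency_filter(text: str, top1_max: float = 0.30, top2_max: float = 0.50) -> bool:
    """Reject documents dominated by one or two words (e.g. binary or numeric dumps)."""
    words = text.split()  # assumption: whitespace-delimited words
    if not words:
        return False  # assumption: empty documents are dropped
    top = Counter(words).most_common(2)
    top1_share = top[0][1] / len(words)
    top2_share = sum(count for _, count in top) / len(words)
    return top1_share <= top1_max and top2_share <= top2_max


def has_repeated_ngrams(tokens: List[str], min_repeats: int = 32, max_n: int = 13) -> bool:
    """True if some n-gram (1 <= n <= max_n) occurs min_repeats times back to back."""
    for n in range(1, max_n + 1):
        # Only start positions that leave room for min_repeats consecutive copies.
        for start in range(len(tokens) - n * min_repeats + 1):
            gram = tokens[start:start + n]
            if all(
                tokens[start + r * n:start + (r + 1) * n] == gram
                for r in range(1, min_repeats)
            ):
                return True
    return False


def keep_document(text: str) -> bool:
    """Combine both heuristics: keep only documents that pass each check."""
    return passes_frequency_filter(text) and not has_repeated_ngrams(text.split())
```

The brute-force scan in `has_repeated_ngrams` is quadratic in the worst case, which is fine for a sketch but would need a faster formulation at corpus scale.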
2.1.2 Mid-training data: Dolmino Mix 1124
After the initial pretraining stage on mostly web data, we further train with a mixture of web data that has been more restrictively filtered for quality and a collection of domain-specific high quality data, much of which is synthetic. The purpose of this mixture is to imbue the model with math-centric skills and provide focused exposure to STEM references and high quality text. We generate several variants of this mixture, with varying sizes, but generally refer to this mixture as Dolmino Mix 1124. The base sources from which Dolmino Mix 1124 is subsampled are described in Table 2. We refer the reader to Section §4 for a deep dive detailing our processes for experimenting and curating data for this mix.
Table 1 Composition of the pretraining data for OLMo 2. The OLMo 2 1124 Mix is composed of StarCoder (Li et al., 2023b; Kocetkov et al., 2022), peS2o (Soldaini and Lo, 2023), web text from DCLM (Li et al., 2024) and Wiki come from Dolma 1.7 (Soldaini et al., 2024). arXiv comes from Red-Pajama (Together AI, 2023), while OpenWebMath (Paster et al., 2023) and Algebraic Stack come from ProofPile II (Azerbayev et al., 2023).